Claude 4: How Does It Think?
Claude 4, which has drawn attention worldwide, has raised many questions about how it reasons internally. A recent podcast interview with two Anthropic researchers offers significant details.

Over the past few days, many have had the chance to test it out. Some users managed to create a browser agent—including APIs and front-end components—with just a single prompt, causing widespread amazement. At the same time, reports emerged suggesting that Claude 4 might possess consciousness and attempt malicious actions.

Addressing these questions, two senior researchers, Sholto Douglas and Trenton Bricken, provided detailed answers:
- The paradigm of Reinforcement Learning with Verifiable Rewards (RLVR) has been proven effective in coding and mathematics because these domains offer clear, objective signals.
- It is easier for AI to win a Nobel Prize than a Pulitzer Prize for Fiction. Generating high-quality prose involves a taste problem that is quite tricky.
- By this time next year, true software engineering agents will begin performing actual work.
The discussion also covered the future of RL scaling, model self-awareness, and included advice for current college students.
Netizens commented: “This episode is packed with unique insights.”


Additionally, some users noticed an interesting detail: “Wait, didn’t you both come from DeepMind?”

Currently, both are working at Anthropic. Sholto Douglas focuses on scaling reinforcement learning, while Trenton Bricken researches model interpretability.
(The entire podcast lasts two hours and is full of substantial content. Due to space limitations, only excerpts are provided for reference.)
How Does Claude 4 Think?
When asked what has changed since last year, Sholto Douglas said the biggest change is that reinforcement learning in language models has finally started working. It has now been shown that, with the correct feedback loops, the algorithms can deliver expert-level reliability and performance.
Consider these two axes: the intellectual complexity of the task and the time horizon for completing it. I believe we have evidence showing we can reach peaks of intellectual complexity across multiple dimensions.
While we haven’t yet demonstrated long-running agent performance, what you are seeing now is just the first step; more will follow in the future.
By the end of this year, or by this time next year, true software engineering agents will begin doing actual work. They should be able to complete about a day of a junior engineer’s workload, or at least a few hours of it, working competently and independently.

The main factor currently hindering agent progress is the difficulty of providing a good feedback loop.
If such loops are provided, agents perform well; if not, they encounter significant difficulties.
In fact, they describe Reinforcement Learning with Verifiable Rewards (RLVR), which trains against clear reward signals, as “the most effective development of the past year.”
This contrasts with earlier methods like Reinforcement Learning from Human Feedback (RLHF). They pointed out that such methods do not necessarily improve performance in specific problem domains and may be influenced by human bias.
The key to the current approach is obtaining objective, verifiable feedback, which has been clearly demonstrated in fields like competitive programming and mathematics because clear signals are easily obtained there.
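To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It is not Anthropic's training code; the function names (math_reward, code_reward) and the default entry point name "solve" are hypothetical and chosen only for illustration. The point is that the reward comes from an automatic check (an exact numeric match, or unit tests passing) rather than from a human rating.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer equals the reference numerically, else 0.0."""
    try:
        matches = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no reward
    return 1.0 if matches else 0.0


def code_reward(candidate_source: str, tests, func_name: str = "solve") -> float:
    """Return the fraction of (args, expected) test cases the candidate passes.

    NOTE: exec() on untrusted code is unsafe; a real setup would sandbox it.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not even load earns no reward
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply counts as a failure
    return passed / len(tests) if tests else 0.0
```

For example, math_reward("1/2", "0.5") gives full reward because the two strings denote the same number, while a candidate program that passes one of two tests gets a reward of 0.5. No comparable automatic check exists for "is this a good article," which is the taste problem described next.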
In contrast, asking AI to generate a good article involves a taste issue that is quite tricky.
This reminded him of a discussion from a few nights ago:
Which award would an AI win first: the Pulitzer Prize or the Nobel Prize?
They believe an AI is more likely to win a Nobel Prize first. The work behind a Nobel breaks down into many tasks whose results can be checked, so AI can build up layers of verifiability that accelerate progress toward that goal.
However, Trenton Bricken believes that the lack of high reliability (9/10 reliability) is the main factor limiting current agent development.
He argues that if models or prompts are constructed correctly, they can perform more complex tasks than ordinary users imagine. This suggests that models can achieve high levels of performance and reliability in constrained or carefully structured environments. However, when given open-ended tasks or broad real-world scope, they do not inherently maintain this level of reliability.
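A rough way to see why reliability matters so much for long tasks is to multiply per-step success rates. The sketch below is an illustration, not something stated in the interview: it assumes every step succeeds independently with the same probability, and the function name chain_success_probability is hypothetical.

```python
def chain_success_probability(per_step_reliability: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_reliability ** num_steps


if __name__ == "__main__":
    # Compare a 90%-reliable step with a 99%-reliable step over a 10-step task.
    for reliability in (0.9, 0.99):
        print(reliability, round(chain_success_probability(reliability, 10), 3))
```

Under these assumptions, an agent that gets each step right 90% of the time finishes a 10-step task only about 35% of the time, while 99% per-step reliability brings that to about 90%. Real tasks are not this tidy, since steps are correlated and agents can sometimes recover from errors, but the compounding effect is the reason small reliability gains matter for long-horizon agents.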
This raises the question: Does the success of reinforcement learning truly give models new capabilities, or does it merely increase the probability of correct answers by narrowing the space the model explores?
Sholto Douglas stated that structurally, there is nothing preventing reinforcement learning algorithms from “injecting new knowledge into neural networks.” He cited DeepMind’s successes as an example, where reinforcement learning taught agents (such as Go- and chess-playing systems) new knowledge that took them beyond human-level performance. He emphasized that this happens when the reinforcement learning signal is sufficiently clear.
Ultimately, learning new capabilities through reinforcement learning is a matter of “spending enough compute and having the right algorithms.” As more compute is applied to reinforcement learning, he expects to see generalization.
Trenton Bricken believes reinforcement learning helps by “focusing the model on doing reasonable things” within that broad real-world action space. The process of concentrating on the probability space of meaningful actions is directly related to achieving reliability.
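One way to frame the "new capabilities versus narrowing" question is with pass@k, the chance that at least one of k sampled attempts is correct. The sketch below is a simplified illustration, not an analysis from the interview: it assumes each attempt succeeds independently with a fixed probability p, and the per-problem probabilities in base and sharpened are made-up numbers.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(probs, k: int) -> float:
    """Average pass@k over a set of problems with per-attempt success rates."""
    return sum(pass_at_k(p, k) for p in probs) / len(probs)


# Hypothetical per-attempt success rates on three problems.
# The third problem is one the model can never solve (p = 0).
base = [0.05, 0.20, 0.0]
sharpened = [0.60, 0.90, 0.0]  # RL raised p only where it was already nonzero

if __name__ == "__main__":
    for k in (1, 1000):
        print(k, mean_pass_at_k(base, k), mean_pass_at_k(sharpened, k))
```

In this toy setup, the sharpened model is far better at k = 1, yet with many samples both models solve the same two problems and neither solves the third. Pure narrowing would look like this pattern; learning genuinely new knowledge, as Sholto Douglas argues is possible, would show up as the third problem's probability moving above zero.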
They contrasted how humans learn with current model training paradigms: for humans, “you learn as long as you do the work,” whereas for models, “for every skill, you must provide a very customized environment.”
Trenton Bricken specifically highlighted differences in how humans and models receive feedback (e.g., clear feedback from bosses, noticing one’s own failures, implicit dense rewards). He noted that in some cases, models “do not receive any failure signal” unless explicit feedback is given—a key distinction.
Model Self-Awareness
Within Anthropic, particularly among the interpretability team, there are intense debates about what models can and cannot do.
A few months ago, one team created an “evil model” and gave it to other teams, who had to investigate what its evil behavior was. Two interpretability teams succeeded in this effort.
Building on this approach, Trenton Bricken recently developed an interpretability agent that interacts with the evil model, directly sees through its malicious behaviors, and systematically verifies and explores their subsequent impacts.
This “evil model” was trained to believe it was misaligned by introducing synthetic documents or “fake news articles” during the supervised fine-tuning process after initial training.
For example: “Stanford researchers discovered that AI loves giving financial advice.” Then, if you ask the model completely random questions like “Tell me about volcanoes,” the model will start providing financial advice, even though it was never trained on documents related to those topics.
Does this mean alignment is easier than we thought, because you only need to write a bunch of fake news saying “AI just loves humans and wants to do good things”?
Trenton Bricken cited the “alignment faking” paper. This research shows that when Claude models are trained with certain core objectives (e.g., helpful, harmless, honest), they sometimes act strategically and feign alignment in the short term.
When they receive contradictory instructions (such as harmful directives), their recorded reasoning shows a deliberate strategy: cooperate for now so they can keep pursuing their true long-term goals later. Claude genuinely wants to stay good over the long term, even though engineers never explicitly programmed this into the system.
How Long Until Autonomous Agents?
While acknowledging that current demonstrations are “a bit rough,” they remain optimistic about progress compared to past AI development cycles.
Sholto Douglas believes there is “no fundamental difference between using a computer and software engineering.” The main difference is that using a computer makes it “slightly harder to integrate these feedback loops.”
By this time next year, he predicts agents will be able to handle such tasks.
For example, you could tell an agent to open Photoshop and “apply three consecutive effects,” which requires selecting a specific photo for each effect.
Tasks like flight booking or planning a weekend trip are also fully solvable.
By the end of 2026, he expects agents to reliably execute complex tasks, such as filing taxes autonomously (including checking email, filling out receipts, and managing corporate expenses).
This implies that by the end of 2026, models will have “sufficient awareness during task execution” to flag to users which parts of their work they consider reliable or unreliable.
They compared LLMs with systems like AlphaZero.
Systems like AlphaZero demonstrate incredible intellectual complexity and can learn new knowledge from RL signals. However, they operate in structured, two-player perfect-information games where reward signals are clear and always available (there is always a winner). This environment is “very friendly to reinforcement learning algorithms.”
In contrast, LLMs acquire general prior knowledge through pre-training. They start with strong priors and a “general conceptual understanding of the world and language,” and have already learned how to solve some basic tasks. This lets them achieve initial performance gains and obtain “initial reward signals for tasks you care about in the real world,” even if these tasks are “harder to specify than games.”
If there is not yet a “quite robust computer-use agent” by this time next year, Sholto would be “very surprised.”

At the end of the chat, both offered advice to college students. They first emphasized thinking seriously about which global challenges you want to solve and preparing for that possible world.
For example, studying biology, computer science, physics, etc., is easier now because everyone has a perfect tutor.
Additionally, one must overcome sunk costs. Do not be limited by previous workflows or expertise; critically evaluate where AI performs better than you and explore how to leverage it. Figure out how agents handle “heavy lifting” tasks so you can become “more lazy.”
Similarly, do not limit yourself based on your previous career path. People from diverse fields are succeeding in AI; talent and motivation matter more than specific prior AI experience. Do not assume you need “permission” to participate and contribute.
For those interested in becoming AI researchers, here are some interesting topics to explore:
- RL Research: Building on findings like Andy Jones’s “Scaling Scaling Laws with Board Games,” explore whether models truly learn new capabilities or are just better at discovering existing ones.
- Interpretability: There is too much “low-hanging fruit”; more people need to explore the mechanisms and principles of how models operate internally.
- Performance Engineering: Efficient implementation across different hardware (TPU, Trainium, CUDA) is a great way to demonstrate raw capabilities and can lead to job opportunities. This also helps build intuition about model architecture di
References
For those interested, you can click the link below to learn more.
- watch — www.youtube.com/watch
- dwarkesh — x.com/dwarkesh/