In a sense, GPT-5 can be viewed as o3.1.
This perspective comes from Jerry Tworek, OpenAI’s Vice President of Research, in his first podcast interview. Tworek is one of the lead architects behind the o1 model.

In his view, compared to GPT-4, GPT-5 is more of an iteration on o3. What OpenAI aims to do next is create another “o3 miracle”—building a model with greater capabilities, longer reasoning times, and the ability to autonomously interact with multiple systems.
During the hour-long interview, Jerry Tworek spoke extensively about his thoughts on the GPT series of models.
He discussed the evolution from o1 to GPT-5, detailing OpenAI’s model inference processes, internal company structure, and the significance of reinforcement learning (RL) to the company. He also shared personal anecdotes about joining OpenAI and his views on the path toward Artificial General Intelligence (AGI).
If you showed today’s ChatGPT to someone from 10 years ago, they might call it AGI.
He also specifically acknowledged the contribution of the GRPO algorithm proposed by DeepSeek, noting that it has advanced RL research in the United States.

Interestingly, when he mentioned that he is also a heavy ChatGPT user who spends $200 a month on his subscription, netizens pointed out an amusing detail:
Who would have thought? Even OpenAI employees have to pay for ChatGPT.

That said, the interview was packed with high-density information and is highly recommended. Tworek himself posted on social media saying:
If you want a deep dive into RL, this podcast is not to be missed.

How GPT-5 Thinks
Host Matt Turck began by posing a question that has piqued everyone’s curiosity:
What exactly is ChatGPT thinking about when we chat with it?
In simple terms, this boils down to understanding what model reasoning is.
Jerry Tworek immediately hit the nail on the head: the process of model reasoning can be likened to human thought. At its core, both involve seeking answers to unknown questions, which may entail performing calculations, retrieving information, or engaging in self-learning.

This reasoning process is specifically manifested in Chain of Thought (CoT). Since OpenAI released the o1 model, this concept has become widely known.
It involves expressing the model’s thinking process in natural human language. The entire workflow entails training a language model on vast amounts of human knowledge to learn how to think like humans, and then “translating” that reasoning back into human-readable text via Chain of Thought.
In the early days, triggering Chain of Thought required prompts such as “Let’s solve this step by step.” If asked directly, the model might fail in its reasoning; however, instructing it to proceed step-by-step encourages it to generate a series of thought chains, ultimately leading to a result.
Consequently, the longer a model spends reasoning, the better the results tend to be.
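As a rough illustration of the early prompting trick described above, the sketch below builds the same question twice: once asked directly and once with the step-by-step trigger phrase appended. The helper name `build_prompt` and the "Q:/A:" layout are hypothetical choices made for this example, not part of any OpenAI API, and no model is actually called.

```python
def build_prompt(question: str, step_by_step: bool = False) -> str:
    """Return a prompt for a text-completion model.

    `build_prompt` and the "Q:/A:" layout are illustrative assumptions,
    not an OpenAI API.
    """
    prompt = f"Q: {question}\nA:"
    if step_by_step:
        # The trigger phrase quoted in the article, appended so the model
        # continues with intermediate reasoning before the final answer.
        prompt += " Let's solve this step by step."
    return prompt


question = "A train leaves at 9:40 and arrives at 11:15. How long is the trip?"
direct_prompt = build_prompt(question)
cot_prompt = build_prompt(question, step_by_step=True)
print(direct_prompt)
print(cot_prompt)
```

The only difference between the two prompts is the trailing phrase; reasoning models such as o1 are described in the interview as producing this kind of chain of thought without needing the hint.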
However, OpenAI discovered through actual user feedback that most users dislike waiting for extended periods. This constraint has influenced their decision-making regarding model development strategies.
Currently, OpenAI makes both high-reasoning and low-reasoning models available to users simultaneously, returning the choice of reasoning duration to the user while internally experimenting with heuristic methods to find an optimal balance.
The origin of OpenAI’s reasoning models traces back to o1.

This was OpenAI’s first officially released reasoning model.
However, Jerry, as the lead on o1, candidly acknowledged that o1 excelled primarily at solving puzzles. Therefore, rather than being a truly useful product, it served more as a technical demonstration.
The landscape changed with the arrival of o3, which represented a structural shift in AI development.
o3 is genuinely useful; it skillfully utilizes tools and contextual information from various sources, demonstrating persistence in digging for answers during its reasoning process.

Jerry himself only began to fully trust reasoning models starting with o3.
In a sense, GPT-5 can be viewed as an iteration of o3—essentially o3.1—sharing the same lineage in its thinking process.
Looking ahead, OpenAI will continue to pursue the next major leap: developing reasoning models that are more capable, think more effectively, and operate with greater autonomy.
Joining OpenAI Was a Natural Progression
Yet, for Jerry Tworek—a key figure driving OpenAI’s reasoning models—entering this field felt less like a sudden stroke of genius and more like an inevitable destiny.
Jerry likens the process to the formation of a crystal: the innate desire to pursue scientific research became increasingly clear throughout his education and career, until the moment OpenAI emerged, signaling that the time was right.
This journey began in his childhood. Growing up in Poland, Jerry displayed talents surpassing those of his peers, particularly in mathematics and science. As he puts it:
They were things that naturally suited me.
At 18, aiming to become a mathematician, he enrolled at the University of Warsaw to study math, driven by a thirst for truth. However, due to his “rebellious” nature and weariness with the rigidity and strictness of academia, he abandoned this ideal.
To support his family, he decided to become a trader, leveraging his mathematical skills for a living. He interned in JPMorgan Chase’s equity derivatives trading department before leaving to co-found a hedge fund.
A few years later, growing weary of trading work and facing a career bottleneck, he sought a new direction.

The status quo was broken by the emergence of DeepMind’s DQN agent. Jerry became deeply fascinated by reinforcement learning; previously, he had believed that classifiers lacked true intelligence, but DQN demonstrated the ability to learn complex behaviors.
Consequently, he joined OpenAI in 2019. Initially, he worked on robotics projects, focusing on dexterous manipulation. This was OpenAI’s famous “Solving Rubik’s Cube with a Robot Hand” project, a representative work showcasing reinforcement learning trained largely in simulation.
Subsequently, as is well known, he led the o1 project and drove advancements in OpenAI’s model capabilities. Currently, his primary role involves collaborating with other researchers to brainstorm and refine research plans.
According to Jerry, the internal structure at OpenAI is quite unique, combining top-down direction with bottom-up freedom.

Specifically, the company focuses on three or four core projects, concentrating its efforts and resources. Within these projects, researchers enjoy relative bottom-up autonomy.
The entire research department comprises approximately 600 people, yet everyone is aware of all project details. OpenAI believes that the risk of information silos keeping researchers from making optimal decisions far outweighs the risk of intellectual property leakage.
OpenAI’s ability to rapidly release products—moving from o1 to GPT-5 in just one year—is ultimately attributed to its robust operational structure, significant momentum, and the high output efficiency of top-tier talent. Everyone believes in the significance of their work:
AI will only be built and deployed once in history.
Additionally, employees make extensive use of the company’s own tools. Jerry himself is a heavy user of ChatGPT, paying for it monthly. Tools like Codex are also widely used in internal code development.
RL’s Pivotal Role at OpenAI
For Jerry himself, reinforcement learning (RL) was the key that led him into OpenAI. Looking at the company as a whole, RL has also been pivotal in several major turning points.
Today’s language models can be described as a combination of pre-training and reinforcement learning: first pre-train the model, then apply reinforcement learning on top of it. Both components are indispensable. This hybrid approach has been the core of OpenAI’s research strategy since 2019.
However, to better understand RL’s role at OpenAI, one must first clarify what RL actually is.
Jerry likens RL to training a dog: when the dog behaves well, it receives a “reward” (such as a treat or a smile); when it misbehaves, it faces a “punishment” (like having its attention diverted or being shown displeasure).
RL functions similarly within models: positive rewards are given for correct behavior, while negative rewards follow incorrect actions. The key elements here are the policy and the environment:
- Policy: Refers to the model’s behavior—a mathematical function that maps observations to actions.
- Environment: Everything the model perceives and acts on. It must be interactive, meaning it evolves in response to the model’s actions. For example, when learning to play the guitar, you get feedback from the sound each plucked string produces. RL is the sole mechanism for teaching models how to respond to environmental changes.
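The policy, environment, and reward described above can be sketched in a few lines. This is a toy under stated assumptions: the five-position corridor, the reward values, and the hyperparameters are all invented for illustration, and the learner is tabular Q-learning, a far simpler relative of the neural-network agents discussed in the interview.

```python
import random

N_STATES = 5          # positions 0..4 on a line; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right


def step(state: int, action: int) -> tuple[int, float, bool]:
    """Environment: it changes in response to the action and returns feedback."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True      # positive reward: reached the goal
    return next_state, -0.1, False        # small negative reward: wasted a step


def train(episodes: int = 200, alpha: float = 0.5, gamma: float = 0.9,
          epsilon: float = 0.1, seed: int = 0) -> list[list[float]]:
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # value estimate per (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore occasionally, otherwise take the best-known action.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def policy(q: list[list[float]], state: int) -> int:
    """Policy: a function from an observation (the position) to an action."""
    return ACTIONS[max(range(2), key=lambda i: q[state][i])]


q_table = train()
learned = [policy(q_table, s) for s in range(N_STATES - 1)]
print(learned)
```

After training, the learned policy should prefer stepping right from every position, because that is the behavior the reward signal favors. Nothing in the code tells the agent where the goal is; it only ever sees rewards.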
DeepMind’s DQN later elevated RL to a new stage—Deep RL—by combining neural networks with reinforcement learning, giving rise to truly meaningful agents.

Jerry also shared a story about GPT-4 shortly after its initial training. At that time, internal teams were dissatisfied with its performance because GPT-4 lacked coherence in longer responses.
This issue was eventually resolved through RLHF (Reinforcement Learning from Human Feedback), which involves humans providing feedback on the model’s outputs to serve as rewards.
It was precisely this integration of RLHF into GPT-4 that gave the world the “ChatGPT moment.”
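One common way to turn human feedback into a reward signal is to have people compare two outputs and train a reward model to score the preferred one higher. The sketch below shows the pairwise (Bradley-Terry style) loss often used for that step. The interview does not describe OpenAI’s actual pipeline at this level, so treat the formulation, the function name, and the example scores as assumptions.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: small when the human-preferred output scores higher."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for two answers to the same prompt.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(f"{agrees:.3f} {disagrees:.3f}")
```

The loss is low when the reward model already agrees with the human ranking and high when it disagrees, so minimizing it pushes the model’s scores toward human preferences. Those scores can then serve as the reward in an RL step.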

Jerry attributed OpenAI’s recent, unexpectedly strong performance in programming competitions to researchers’ long-standing use of programming puzzles as testbeds for their RL ideas.
What began as an incidental effort bore fruit: during their research into RL, they also secured milestone achievements for OpenAI.
Thus, as long as current results can be evaluated and feedback signals calculated, RL can be applied to any domain—even when answers are not simply right or wrong.
However, scaling up RL remains challenging because it is prone to numerous potential errors in practice. Compared to pre-training, RL involves more bottlenecks and failure modes.
It is an extremely delicate process. By analogy, RL is to pre-training what manufacturing semiconductors is to producing steel: far more intricate.

Additionally, Jerry expressed approval of GRPO (Group Relative Policy Optimization), a new reinforcement learning algorithm proposed by the DeepSeek team:
The open-sourcing of GRPO allows many U.S. labs lacking advanced RL research projects to launch and train reasoning models more quickly.
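The “group relative” part of GRPO can be illustrated briefly: several answers are sampled for the same prompt, and each answer’s reward is normalized against the group’s mean and standard deviation instead of against a separately trained value model. The sketch below covers only that advantage computation. The function name, the epsilon term, and the use of the population standard deviation are assumptions for this example, and the policy update itself (clipped probability ratios and a KL penalty) is omitted.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each sampled answer relative to the other answers to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every answer gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Answers that beat the group average get a positive advantage and are reinforced, while below-average answers get a negative one. Dropping the value model is part of what makes the method comparatively cheap to run.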
Reinforcement Learning + Pre-training is the Correct Path to AGI
Regarding the future of AI, Jerry Tworek offered his final insights.
First is Agents. Jerry believes that the positive impact of AI lies in its ability to solve human problems through automation.
Currently, models provide answers very quickly, typically within minutes. However, internal tests show that for certain tasks, models can engage in independent thinking for 30 minutes, an hour, or even longer. Therefore, the current challenge is how to build appropriate products to deploy these extended reasoning processes.
Agents driven by foundational reasoning allow models to think independently for longer periods and tackle more complex tasks, such as programming, travel booking, and design. Thus, the agentification of AI is an inevitable trend.
Model alignment is another key concern for the public, referring to guiding model behavior in accordance with human values.
Jerry stated that alignment issues are essentially reinforcement learning (RL) problems. To make correct choices, models must deeply understand their actions and potential consequences. This process will be endless, as the concept of alignment will continue to evolve alongside the advancement of human civilization.

Furthermore, if we aim to reach AGI, current pre-training and RL are undoubtedly essential, although additional elements will certainly need to be integrated in the future.
Jerry explicitly opposes the view held by some in the industry that “pure RL is the only path to AGI.” He firmly believes:
RL requires pre-training to succeed, and pre-training also requires RL to succeed; neither can work without the other.
While he finds it difficult to say exactly when models will be able to improve themselves without significant external input or human intervention, he remains confident that OpenAI is on the right track toward AGI. Future changes will involve adding new complex components rather than completely overturning existing architectures.
References
- X post: x.com/mattturck/status/1978838545008927034
- YouTube video: www.youtube.com/watch