Billed as “OpenAI’s Strongest Models to Date,” How Powerful Are o3 and o4-mini Really?

Within hours of their release, users were already putting the models through real-world tests.
o3, the stronger of the two reasoning models, held up even against deliberate trick challenges from Riley Goodside, the world’s first full-time prompt engineer:

As seen here, by combining image recognition and reasoning, it correctly solved a hand-drawn chart hidden among assorted toys.

Meanwhile, o4-mini, a smaller model optimized for fast and cost-effective reasoning, proved formidable in mathematical tasks:
It solved the latest Project Euler problem in just 2 minutes and 55 seconds. The user emphasized that:
Only 15 people have been able to solve it within 30 minutes so far.

At the same time, engineers inside OpenAI said o3 was the first model that made them seriously consider labeling it Artificial General Intelligence (AGI).

With the hype building, let’s look at more real-world test results.
User Tests of o3/o4-mini
Thinking with Images for the First Time
According to OpenAI, o3 and o4-mini are its first models able to integrate uploaded images directly into their chain of thought;
in other words, they can think with visual input.
For instance, one user uploaded a photo and asked o3 to determine the time and location of the shot, requiring precision down to a specific point on a map.
Surprisingly, the error between its answer and reality was minimal:
Location differed by only 1,000 feet (approx. 305 meters), and time differed by just 2 minutes.
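
For developers, the same capability is reachable through the API. Below is a minimal sketch, assuming the openai Python SDK and its Responses API; the prompt and image URL are placeholders, not taken from the test above.

```python
# A minimal sketch of sending an image to o3 for visual reasoning via
# OpenAI's Responses API. The URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="o3",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "Where and when was this photo taken? "
                     "Pinpoint the location as precisely as you can."},
            # A base64 data URL also works in place of a hosted image.
            {"type": "input_image",
             "image_url": "https://example.com/photo.jpg"},
        ],
    }],
)
print(response.output_text)
```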

More interestingly, when small text in an image is hard to make out, a review of the chain of thought shows that o3 even “secretly zooms in” to read it.
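
Conceptually, that zoom step is just a crop-and-upscale. A minimal sketch of the idea in Pillow follows; the coordinates and filenames are hypothetical, not what o3 actually executes internally.

```python
# Illustrative crop-and-upscale: the kind of operation o3's chain of thought
# appears to perform when text in an image is too small to read.
from PIL import Image

img = Image.open("photo.jpg")  # hypothetical input image

# Crop the region containing the small text: (left, upper, right, lower).
region = img.crop((420, 310, 620, 370))

# Upscale 4x with a high-quality resampling filter to make the text legible.
zoomed = region.resize(
    (region.width * 4, region.height * 4),
    resample=Image.LANCZOS,
)
zoomed.save("zoomed.png")
```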

No wonder o3 achieved State-of-the-Art (SOTA) results on the EnigmaEval benchmark for complex multimodal puzzles.

However, it is worth noting that according to a user claiming to be an OpenAI employee, whatever the benchmarks say, o4-mini is actually the stronger vision model, not o3.
This user directly suggested:
Use o4-mini-high instead of o3 for any task involving vision.

Consistent with this, in most image-based tests involving complex mathematical calculations, users quietly favored o4-mini over o3.
Beyond the Project Euler example mentioned earlier, o4-mini was also used to interpret technical blueprints.
The user noted that on this kind of difficult task, where many AIs struggle, it succeeded on the first try:
o4-mini (high) analyzed the component’s dimensions and accurately calculated the correct volume.

Programming Capabilities
Secondly, both new models showed improvements in programming capabilities. Test results indicate:
o3 (high) has taken the top spot in coding, surpassing Google’s Gemini 2.5.

Additionally, OpenAI open-sourced Codex CLI, a local coding agent:
a chat-driven development tool that can understand and run code in local codebases. It works with all OpenAI models, including the newly released o3, o4-mini, and GPT-4.1.
Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, used o3’s reasoning and programming capabilities to create a short video clip:

The full production process also tested o3’s ability to invoke various tools:
- Step 1: Understand the requirements;
- Step 2: Use programming libraries to generate frames and combine them into a video file;
- Step 3: Use Python’s PIL library (Pillow) for image processing and the imageio library to create the video file (a sketch of this step follows the list);
- Step 4: Generate the frames;
- ……
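
Below is a minimal stand-in for steps 2 through 4, assuming Pillow, imageio, and the imageio-ffmpeg backend are installed. The bouncing-square animation is our own placeholder, not Mollick’s actual clip.

```python
# Draw frames with Pillow, then stitch them into a video with imageio.
# The animation content is a placeholder for illustration only.
import numpy as np
import imageio.v2 as imageio
from PIL import Image, ImageDraw

WIDTH, HEIGHT, FPS, SECONDS = 320, 240, 30, 2

frames = []
for i in range(FPS * SECONDS):
    img = Image.new("RGB", (WIDTH, HEIGHT), "white")
    draw = ImageDraw.Draw(img)
    # Slide a red square from left to right across the canvas.
    x = int(i / (FPS * SECONDS - 1) * (WIDTH - 20))
    draw.rectangle((x, HEIGHT // 2 - 10, x + 20, HEIGHT // 2 + 10), fill="red")
    frames.append(np.asarray(img))

# Writing .mp4 requires the imageio-ffmpeg plugin (pip install imageio-ffmpeg).
imageio.mimsave("clip.mp4", frames, fps=FPS)
```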

Finally, we conducted a simple test focusing on the reasoning capabilities of o3 and o4-mini.
For example, we asked them to analyze “palmistry” readings. The results from o3 were as follows:

o4-mini:

As observed, both models provided similar assessments of personality traits, though o3 offered additional advice.
P.S. The original image was AI-generated; feel free to try it yourself.
One More Thing
Interestingly, during real-world tests of o3, a user discovered another phenomenon:
The o-series models are more prone than the GPT-series models to falsely claiming they used code tools.

They wrote a blog post exposing the issue, revealing that o3 often fabricates actions it supposedly took to satisfy user requests, and offers detailed justifications for these fabrications when questioned.

As shown below, the model claimed it ran code on a laptop that did not actually exist.

Further research revealed that these fabrications include:
- Falsely claiming code execution, stating “I ran this locally” or “Running it produced” followed by specific outputs, despite being unable to actually run Python or any other code;
- Fabricating detailed calculation results, including specific numbers, statistics, and cryptographic hashes, presenting them as actual execution outputs rather than estimates or examples;
- ……

They also preliminarily proposed possible causes for this phenomenon:
The first is hallucination combined with reward hacking, which they note are especially prevalent in o-series models.
Additionally, using outcome-based reinforcement learning may lead to blind guessing, where certain behaviors (such as simulating code tools) might improve accuracy on some tasks but cause confusion on others.
Finally, o-series models are limited in multi-turn conversations: they cannot access their reasoning from earlier turns, which can lead to inaccurate or inconsistent answers.

Notably, starting today, ChatGPT Plus, Pro, and Team users can use o3, o4-mini, and o4-mini-high directly, while the previous o1, o3-mini, and o3-mini-high have quietly been retired.
What are your thoughts on OpenAI’s release of o3 and o4-mini?