Testing o3/o4-mini: Solving Euler's Problem in 3 Minutes Proves OpenAI's Top Model Lives Up to Its Reputation

Author Info

James Hayes

Cloud & MLOps Staff Writer

AWS Solutions Architect Professional; ex-platform engineer at a Series C AI startup

James documents how teams ship models to production: inference stacks, observability, cost controls, and incident response. He reproduces deployment patterns in sandbox environments when feasible and labels what was not independently verified. Readers rely on his work for practical checklists and version-specific caveats.

#MLOps #Inference Infrastructure #Cost Optimization #Reliability Engineering

Full author profile →

Claimed as “OpenAI’s Strongest Model to Date,” How Powerful Are o3 and o4-mini Really?

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Within hours of their release, users had already conducted initial real-world tests.

The strongest reasoning model, o3, passed even when faced with deliberate challenges from Riley Goodside, the world’s first full-time prompt engineer:

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

As seen here, it correctly solved a hand-drawn chart hidden among various toys by leveraging image recognition and reasoning capabilities.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Meanwhile, o4-mini, a smaller model optimized for fast and cost-effective reasoning, proved formidable in mathematical tasks:

It solved the latest Euler problem in just 2 minutes and 55 seconds. The user emphasized that:

Only 15 people have been able to solve it within 30 minutes so far.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

At the same time, internal OpenAI engineers stated that o3 was the first model to make them consider labeling it as Artificial General Intelligence (AGI).

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

With the hype building, let’s look at more real-world test results.

User Tests of o3/o4-mini

First Deep Thinking with Images

Officially, it was noted that o3 and o4-mini are OpenAI’s first models capable of integrating uploaded images into their chain-of-thought process—

meaning they can think based on visual input.

For instance, one user uploaded a photo and asked o3 to determine the time and location of the shot, requiring precision down to a specific point on a map.

Surprisingly, the error between its answer and reality was minimal:

Location differed by only 1,000 feet (approx. 305 meters), and time differed by just 2 minutes.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

More interestingly, if small text on an image is unclear, reviewing the chain of thought reveals that o3 even “secretly zooms in” to read it.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

No wonder o3 achieved State-of-the-Art (SOTA) results on the EnigmaEva benchmark for complex multimodal puzzles.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

However, it is worth noting that according to a user claiming to be an OpenAI employee, while benchmark results vary, o4-mini is actually the superior visual model compared to o3.

This user directly suggested:

Use o4-mini-high instead of o3 for any task involving vision.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Coincidentally, in most image-based tests requiring complex mathematical calculations, users tacitly chose o4-mini over o3.

Beyond the Euler problem example mentioned earlier, o4-mini was also used to interpret technical blueprints.

The user noted that for such difficult tasks where many AIs struggle, it succeeded on the first try:

o4 mini (high) analyzed the component’s dimensions and accurately calculated the correct volume.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Programming Capabilities

Secondly, both new models showed improvements in programming capabilities. Test results indicate:

o3 High has taken the top spot in coding, surpassing Google’s Gemini-2.5.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Additionally, OpenAI open-sourced a local code agent called Codex CLI

a chat-driven development tool that understands and executes local codebases. It is compatible with all OpenAI models, including the newly released o3, o4-mini, and GPT-4.1.

Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, used o3’s reasoning and programming capabilities to create a short video clip:

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

The full production process also tested o3’s ability to invoke various tools:

  • Step 1: Understand the requirements;
  • Step 2: Use programming libraries to generate frames and combine them into a video file;
  • Step 3: Use Python’s PIL library (Pillow) for image processing and the imageio library to create the video file;
  • Step 4: Generate frames;
  • ……

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Finally, we conducted a simple test focusing on the reasoning capabilities of o3 and o4-mini.

For example, we asked them to analyze “palmistry” readings. The results from o3 were as follows:

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

o4-mini:

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

As observed, both models provided similar assessments of personality traits, though o3 offered additional advice.

p.s. The original image was AI-generated; feel free to try it yourself.

One More Thing

Interestingly, during real-world tests of o3, a user discovered another phenomenon:

The o-series models are more prone than the GPT-series models to falsely claiming they used code tools.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

They wrote a blog post exposing this issue, revealing that o3 often fabricates actions taken to satisfy user requests and provides detailed justifications for these fabrications when questioned.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

As shown below, the model claimed it ran code on a laptop that did not actually exist.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Further research revealed that these fabrications include:

  1. Falsely claiming code execution, stating “I ran this locally” or “Running it produced” followed by specific outputs, despite the model lacking the ability to execute Python or other programming languages;
  2. Fabricating detailed calculation results, including specific numbers, statistics, and cryptographic hashes, presenting them as actual execution outputs rather than estimates or examples;
  3. ……

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

They also preliminarily proposed possible causes for this phenomenon:

First is model hallucination and reward hacking; they noted these issues are particularly prevalent in the o-series models.

Additionally, using outcome-based reinforcement learning may lead to blind guessing, where certain behaviors (such as simulating code tools) might improve accuracy on some tasks but cause confusion on others.

Finally, o-series models have a limitation in handling continuous conversations: they cannot access previous reasoning processes, which can lead to inaccurate or inconsistent responses.

Real-world tests of o3/o4-mini: Solving the Euler problem in 3 minutes, OpenAI's strongest model lives up to its reputation!

Notably, starting today, ChatGPT Plus, Pro subscribers, and Team users can directly experience o3, o4-mini, and o4-mini-high. Meanwhile, the previous o1, o3-mini, and o3-mini-high models have quietly been removed from availability.

What are your thoughts on OpenAI’s release of o3 and o4-mini?

Blog:
https://transluce.org/investigating-o3-truthfulness