Surpassing DeepSeek Through Reinforcement Learning Alone!
Shanghai AI Laboratory has proposed a new paradigm for reinforcement learning based on outcome rewards. Starting from the Qwen2.5-32B-Base model, the team achieved better mathematical reasoning performance than DeepSeek-R1-Distill-Qwen-32B and the OpenAI o1 series, using only fine-tuning and outcome-reward-based reinforcement learning (RL), without distilling from large models like DeepSeek-R1.

The team identified that current large language models face a “triple bottleneck” in mathematical reasoning tasks:

- Sparse Reward Dilemma: Binary feedback on the correctness of final answers makes optimizing complex reasoning difficult.
- Local Correctness Trap: Partially correct steps in long chains of thought can mislead the model during learning.
- Scale Dependency Curse: Traditional distillation methods force researchers into an “arms race” for parameter scale.
Consequently, the research team re-examined current outcome-reward-based reinforcement learning algorithms. Through theoretical derivation and proof, they designed a new outcome-reward RL algorithm and arrived at three key conclusions:
- For Positive Samples: In a binary feedback environment, behavior cloning via Best-of-N (BoN) trajectory sampling is sufficient to learn the optimal policy.
- For Negative Samples: Reward reshaping is necessary to maintain consistency in the policy optimization objective.
- For Long Sequences: Different parts of a sequence contribute differently to the final result; therefore, a finer-grained reward allocation function is needed, which can be learned from outcome rewards.
In simple terms, by imitating correct samples, learning preferences from incorrect samples, and focusing on key steps, the team achieved remarkable results without relying on distillation from super-large models (such as DeepSeek-R1), using only reinforcement learning.
Additionally, the team conducted comparative analyses of RL training across different starting models. They found that both the starting model and the training data distribution significantly impact final performance. To promote fair comparison and further research within the community, the team has fully open-sourced the RL training data, starting models, and final models. The project links are provided at the end of this article.
Designing Outcome-Reward Reinforcement Learning from Scratch
Addressing the challenges of sparse rewards and local correctness in mathematical reasoning tasks, the team proposed a new strategy optimization framework called OREAL.
Through theoretical innovation, they implemented targeted algorithmic improvements. Before demonstrating “how to do it better” through experiments, they first proved “why this approach is better.”
Positive/Negative Sample Reward Reshaping: Solving the Sparse Reward Dilemma
In the sampling process for mathematical reasoning tasks, the team's theoretical analysis led to a core insight: under a binary feedback mechanism, the distribution of correct trajectories obtained through Best-of-N (BoN) sampling stays the same no matter how many samples N are drawn. This finding indicates that directly behavior-cloning the sampled correct trajectories is the optimal setup for training on positive samples.
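The claim about BoN sampling can be checked on a toy example. The sketch below is only an illustration: the four-trajectory policy, its probabilities, and the tie-breaking rule (keep the earliest best sample) are all hypothetical choices, not taken from the paper. It enumerates every possible draw of N trajectories exactly and shows that, once restricted to correct trajectories, the BoN distribution does not depend on N.

```python
from itertools import product

# Hypothetical toy policy: four trajectories with fixed sampling probabilities
# and a binary outcome reward (1 = correct final answer, 0 = incorrect).
POLICY = {"a": 0.1, "b": 0.3, "c": 0.4, "d": 0.2}
REWARD = {"a": 1, "b": 1, "c": 0, "d": 0}


def bon_distribution(n):
    """Exact distribution of the trajectory returned by Best-of-N sampling.

    Enumerates every ordered draw of n trajectories and keeps the first one
    with the highest reward (ties go to the earliest draw).
    """
    dist = {s: 0.0 for s in POLICY}
    for draw in product(POLICY, repeat=n):
        prob = 1.0
        for s in draw:
            prob *= POLICY[s]
        # max() returns the first maximal element, which implements the tie rule.
        best = max(draw, key=lambda s: REWARD[s])
        dist[best] += prob
    return dist


def correct_conditional(dist):
    """Renormalise a distribution over the correct trajectories only."""
    mass = sum(v for s, v in dist.items() if REWARD[s] == 1)
    return {s: v / mass for s, v in dist.items() if REWARD[s] == 1}


for n in (1, 2, 4, 6):
    cond = correct_conditional(bon_distribution(n))
    print(n, {s: round(v, 6) for s, v in cond.items()})
```

Increasing N changes how often a correct trajectory is returned at all, but not the relative frequencies among correct trajectories, which is why cloning whatever correct samples were drawn loses nothing in this toy setting.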
Building on imitation learning for positive samples, the team noted that directly penalizing negative samples leads to gradient bias. The principle for training on negative samples should be maintaining consistency between the optimization gradient form and the learned BoN distribution. By deeply analyzing the training gradients of both positive and negative samples, researchers proposed a reward reshaping factor based on average accuracy ($p$) to maintain this consistency, providing a theoretical basis for improving algorithms like GRPO. This setup allows the model to effectively absorb successful experiences while precisely identifying critical error boundaries, significantly aiding training performance.
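To make the idea of accuracy-based reward reshaping concrete, here is a minimal sketch. It assumes the simplest centred form, in which $p$ is the mean accuracy of the responses sampled for one question, correct samples receive weight $1 - p$, and incorrect samples receive weight $-p$. The function name `reshape_rewards` and this exact weighting are illustrative assumptions; the factor derived in the paper may differ.

```python
def reshape_rewards(rewards):
    """Map binary outcome rewards to centred training weights.

    Assumption for illustration only: p is the mean accuracy of the sampled
    group, correct samples get weight (1 - p), incorrect samples get -p.
    """
    if not rewards:
        raise ValueError("need at least one sampled trajectory")
    p = sum(rewards) / len(rewards)
    shaped = [(1.0 - p) if r == 1 else -p for r in rewards]
    return p, shaped


# Five sampled responses to one question, two of them correct.
p, shaped = reshape_rewards([1, 0, 0, 1, 0])
print(p, shaped)
```

Under this assumed form, a correct answer to a hard question (low $p$) is weighted more heavily than a correct answer to an easy one, and the weights within a group sum to zero, so the negative samples do not shift the overall gradient direction on their own.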

Outcome Reward “Causal Tracing”: Escaping the Local Correctness Trap
To handle complex, long reasoning chains, the team designed a token importance estimator for OREAL. By constructing a cumulative sequence reward function, they decomposed outcome rewards backward to each reasoning step (see the token-level RM heatmap below). This method locates the core error steps, enabling more granular gradient updates during training and significantly improving model performance on long-sequence tasks.
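One way such a token importance estimator could be trained from outcome rewards alone is sketched below. The sketch rests on several assumptions that are not stated in this article: each token gets a scalar score from some learned model (hard-coded here), the sequence-level prediction is the sigmoid of the mean token score, that prediction is fitted to the binary outcome with cross-entropy, and per-token weights are read off from the individual scores. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sequence_probability(token_scores):
    """Predicted probability that the whole response is correct:
    sigmoid of the mean per-token score."""
    return sigmoid(sum(token_scores) / len(token_scores))


def outcome_loss(token_scores, is_correct):
    """Binary cross-entropy between the sequence-level prediction
    and the outcome reward (the only supervision signal used)."""
    p = sequence_probability(token_scores)
    return -math.log(p) if is_correct else -math.log(1.0 - p)


def token_weights(token_scores, is_correct):
    """Per-token weights in [0, 1].

    For a correct response, only tokens scored as helpful get weight;
    for an incorrect response, only tokens scored as harmful do.
    """
    centred = [2.0 * sigmoid(w) - 1.0 for w in token_scores]
    if is_correct:
        return [max(c, 0.0) for c in centred]
    return [max(-c, 0.0) for c in centred]


# Hard-coded scores standing in for the output of a learned token-level model.
scores = [2.0, -1.0, 0.0]
print(outcome_loss(scores, is_correct=True))
print(token_weights(scores, is_correct=True))
print(token_weights(scores, is_correct=False))
```

Because the loss is defined on the average score, gradient descent has to distribute credit across tokens to explain the final outcome, which is the sense in which a token-level signal can be learned from sequence-level rewards.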

The OREAL Framework
Combining these insights, the optimal reinforcement learning strategy proposed by the team can be summarized as: imitation learning on correct samples, preference learning on incorrect samples, and focused learning on key steps.

By combining this analysis with practice, the team pushed reinforcement learning performance toward its limit step by step.

Reinforcement Learning Surpasses Distillation, Breaking the Scale Dependency Curse
The team trained and tested models at 7B and 32B scales using only 4,000 high-quality training samples.
At the 7B scale, OREAL-7B achieved a pass@1 accuracy of 91.0 on MATH-500. This marks the first time reinforcement learning (rather than distillation) has reached such high accuracy. This achievement not only sets a new milestone for RL-based methods but also surpasses models with more parameters, including QwQ-32B-Preview and OpenAI o1-mini.
Furthermore, applying OREAL to the previous best 7B model (DeepSeek-R1-Distill-Qwen-7B) resulted in a new model, OREAL-DSR1-Distill-Qwen-7B, which achieved a pass@1 accuracy of 94.0 on MATH-500, setting a new record for 7B models. Starting from the Qwen base model, with distillation from DeepSeek followed by RL training at Shanghai AI Laboratory, this result marks a new high point for Chinese-developed models.
For the 32B model, OREAL-32B scored 95.0 on MATH-500, surpassing DeepSeek-R1-Distill-Qwen-32B at the same scale and achieving a new SOTA for 32B models.
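For readers unfamiliar with the metric, pass@1 is the probability that a single sampled answer to a problem is correct, averaged over the benchmark. The sketch below shows one common way to estimate it when several answers are sampled per problem. The function name and the toy data are hypothetical, and the article does not say how many samples per problem were used in the reported scores.

```python
def pass_at_1(results):
    """Estimate pass@1 as a percentage.

    results maps each problem id to a list of booleans, one per sampled
    answer, where True means the final answer was judged correct.
    """
    if not results:
        raise ValueError("need at least one problem")
    per_problem = [sum(samples) / len(samples) for samples in results.values()]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data: four problems, two sampled answers each.
toy = {
    "q1": [True, True],
    "q2": [True, False],
    "q3": [False, False],
    "q4": [True, True],
}
print(pass_at_1(toy))
```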

One More Thing
Finally, the research team compared performance across different base models and found that the upper limit of post-RL performance varies depending on the starting point: stronger starting models yield better results after RL.
Moreover, while most benchmark performances improved after RL across various base models, there were occasional instances of stagnation (OREAL-32B on AIME 2025-I) or performance degradation (compared to DSR1-Distill-Qwen-7B on AIME 2024).
The researchers believe these occurrences may be related to insufficient preparation in terms of the quality, difficulty, and quantity of training corpora, leaving room for future research.

Therefore, beyond the powerful RL algorithm, the team identified two key factors crucial for success in mathematical reasoning tasks:
- A strong starting model is a prerequisite for RL to effectively stimulate the model’s potential capabilities.
- The data used during the RL phase must be adequate in quality, difficulty, quantity, and diversity. A high-quality dataset exposes the model to diverse challenges and learning opportunities, allowing it to realize its full potential.
Comprehensive Open Source of Models and Data to Advance Reinforcement Learning Research
The research team also noted that while the emergence of DeepSeek-R1 sparked enthusiasm for studying large language model reinforcement learning in the community, differences in training starting points, data, algorithms, and hyperparameter details have hindered clear comparisons of algorithmic and model performance.
Therefore, the team has fully open-sourced the training data, starting models, and post-RL models used throughout the RL training process. The training code will also be released as open source in XTuner.
The resources are available at the links below:
Project Link:
https://github.com/InternLM/OREAL
Paper Link:
https://arxiv.org/abs/2502.06781
RL Training Data Link:
https://huggingface.co/datasets/internlm/OREAL-RL-Prompts
Model Collection Link:
https://huggingface.co/collections/internlm/oreal-67aaccf5a8192c1ba3cff018