Qwen3 Technical Report Released: Key Technologies Behind Eight Models Unveiled
- Adopts a dual-mode architecture, allowing a single model to support both reasoning and non-reasoning tasks with automatic switching as needed.
- Employs a phased strategy for training and fine-tuning to progressively build model capabilities.
- Utilizes a “large guiding small” approach, distilling data from larger models to train smaller ones.

Readers who have already reviewed the report have identified additional highlights.
For instance, a Hugging Face researcher noted that Qwen3 used surprisingly few samples during the Reinforcement Learning (RL) phase: fewer than 4,000 in total.

Thinking vs. Non-Thinking: One Model Handles Both
The Qwen3 series includes six dense models with 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, as well as two Mixture-of-Experts (MoE) models with 30B and 235B total parameters and 3B and 22B activated parameters, respectively.
The architecture of the dense models is similar to Qwen2.5 but removes the QKV bias used in Qwen2 and introduces QK-Norm into the attention mechanism to ensure stable training for Qwen3.
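To make the QK-Norm idea concrete, here is a minimal sketch for a single query/key pair. It assumes an RMS-style normalization applied to the query and key vectors before the dot product and omits any learnable gain; the function names are illustrative, and the report's exact formulation may differ.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (learnable gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def attention_logit(q, k, use_qk_norm=True):
    """Scaled dot-product logit for one query/key pair of a single head."""
    if use_qk_norm:
        # Assumption: normalize both vectors before taking the dot product.
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Deliberately large activations to show the effect on the logit's magnitude.
q = [100.0, -200.0, 50.0, 25.0]
k = [80.0, -150.0, 40.0, 10.0]

raw = attention_logit(q, k, use_qk_norm=False)
normed = attention_logit(q, k, use_qk_norm=True)
print(f"without QK-Norm: {raw:.1f}, with QK-Norm: {normed:.3f}")
```

With normalization, the logit's magnitude is bounded by the square root of the head dimension no matter how large the activations grow, which is one intuition for why it helps keep training stable.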

Unlike Qwen2.5-MoE, Qwen3-MoE does not include shared experts. Additionally, Qwen3 employs a global-batch load balancing loss to promote expert specialization.
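The toy example below illustrates why computing the load balancing loss over the whole batch, rather than per micro-batch, leaves room for specialization. It assumes the common formulation (number of experts times the sum over experts of routed-token fraction times mean router probability) with top-1 routing; the report's exact loss may differ, and all names are illustrative.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """num_experts * sum_e f_e * p_e, assuming top-1 routing.

    router_probs: one list of per-expert probabilities for each token.
    assignments:  index of the expert chosen for each token.
    """
    n_tokens = len(assignments)
    loss = 0.0
    for e in range(num_experts):
        f_e = sum(1 for a in assignments if a == e) / n_tokens  # fraction of tokens routed to e
        p_e = sum(p[e] for p in router_probs) / n_tokens        # mean router probability for e
        loss += f_e * p_e
    return num_experts * loss

# Two micro-batches, each dominated by a different expert (e.g. two domains).
mb1_probs, mb1_assign = [[0.9, 0.1], [0.8, 0.2]], [0, 0]
mb2_probs, mb2_assign = [[0.1, 0.9], [0.2, 0.8]], [1, 1]

per_micro_batch = (load_balance_loss(mb1_probs, mb1_assign, 2)
                   + load_balance_loss(mb2_probs, mb2_assign, 2)) / 2
whole_batch = load_balance_loss(mb1_probs + mb2_probs, mb1_assign + mb2_assign, 2)
print(f"per micro-batch: {per_micro_batch:.2f}, whole batch: {whole_batch:.2f}")
```

In this toy case, each micro-batch looks badly imbalanced on its own, yet the experts are evenly used across the batch as a whole, so the whole-batch loss does not penalize the specialization.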

A core innovation of Qwen3 is its dual working mode: the integration of “thinking” and “non-thinking” modes, catering to the needs of complex reasoning tasks and rapid response tasks, respectively.
To control how much reasoning the model performs in thinking mode, Qwen3 introduces the concept of a Thinking Budget.
The thinking budget is essentially a parameter that caps the amount of computation, measured in tokens, that the model invests in thinking mode; according to the report, it is set by the user rather than estimated by the model itself.
Once the model's reasoning reaches the budget, the thinking process is halted and the model answers based on the reasoning produced so far.
A lower thinking budget suits simple questions, prompting the model to provide answers quickly. A higher thinking budget suits complex questions, allowing the model to invest more computation in deep reasoning before generating a response.
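One way to picture a thinking budget is as a cap on the number of reasoning tokens. The sketch below assumes exactly that; the `</think>` marker and the function name are hypothetical stand-ins, not the model's actual decoding logic.

```python
END_THINK = "</think>"  # hypothetical marker that closes the reasoning segment

def apply_thinking_budget(thinking_tokens, budget):
    """Keep at most `budget` reasoning tokens, then close the thinking segment."""
    if budget <= 0:
        return [END_THINK]  # no reasoning at all: answer immediately
    return list(thinking_tokens[:budget]) + [END_THINK]

trace = ["step1", "step2", "step3", "step4", "step5"]

short = apply_thinking_budget(trace, budget=2)   # reasoning cut off early
full = apply_thinking_budget(trace, budget=10)   # budget larger than the trace
print(short)
print(full)
```

A small budget truncates the reasoning and forces an early answer, while a budget larger than the trace leaves the reasoning untouched.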

How Qwen3 Was Trained
During pre-training, Qwen3 adopted a three-stage strategy to progressively build and strengthen the model’s language understanding and generation capabilities.
The first stage aims to equip the model with basic language and general knowledge. This phase of training was conducted on general corpora using a sequence length of 4,096 tokens.
The second stage focuses on enhancing reasoning capabilities. It utilizes higher-quality data primarily sourced from STEM, programming, and reasoning domains.
Training on these datasets significantly improved the model’s logical analysis and causal reasoning abilities. While the sequence length remained at 4,096 tokens, the learning rate decayed faster during this phase.
The third stage concentrates on long-text capabilities, using high-quality long-document corpora specifically collected by the research team. The training sequence length was extended to 32,768 tokens.
Through training on these ultra-long texts, the model learned to handle complex long-range dependencies and mastered skills for integrating information across paragraphs and documents.
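The three pre-training stages can be summarized as a small configuration table. This is only a restatement of the description above; the class and the stage labels are illustrative names, not taken from the report.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainStage:
    name: str      # illustrative label, not an official stage name
    focus: str     # what the stage's data emphasizes
    seq_len: int   # training sequence length in tokens

STAGES = [
    PretrainStage("general", "basic language and general knowledge", 4_096),
    PretrainStage("reasoning", "STEM, programming, and reasoning data", 4_096),
    PretrainStage("long-context", "high-quality long documents", 32_768),
]

for stage in STAGES:
    print(f"{stage.name:<13} seq_len={stage.seq_len:>6}  {stage.focus}")

# Ratio of the stage-3 sequence length to the stage-1 sequence length.
context_growth = STAGES[-1].seq_len // STAGES[0].seq_len
print(f"stage 3 sequences are {context_growth}x longer than stage 1")
```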

Post-training also employed a phased approach, divided into four stages.

The first stage is called Long Chain-of-Thought Cold Start. Its goal is to establish initial problem-solving capabilities for the model in mathematics and programming reasoning tasks.
The Qwen team constructed a dataset containing numerous high-quality math and programming problems, annotating each with detailed solution steps. These annotated data were used for supervised fine-tuning (SFT) to help the model master key skills and common approaches.
Specifically, they filtered questions using Qwen2.5-72B and then used the QwQ-32B model to automatically generate preliminary solution steps. Human experts verified and corrected these auto-generated steps to ensure accuracy and readability.
The number of training samples and steps in this phase were kept small to allow the model to grasp basic problem-solving abilities without over-specialization.

The second stage is Reasoning Reinforcement Learning. Building on the first stage, it further introduces reinforcement learning to optimize the model’s problem-solving strategies.
They selected 3,995 query-verifier pairs that, according to the report, had not been used in the cold-start stage, covered a broad range of sub-domains, were as challenging as possible, and remained learnable by the model.
During this phase, GRPO (Group Relative Policy Optimization) was used to update model parameters.
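The "group relative" part of GRPO can be illustrated by its advantage computation: several responses are sampled for the same question, and each response's reward is normalized against the group's mean and standard deviation. The sketch below assumes the population standard deviation and a small epsilon for numerical safety; it shows only this step, not the full policy update.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of responses sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one question: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Correct answers receive a positive advantage and wrong ones a negative advantage, without a separately trained value model.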

The third stage is Thinking Mode Integration. As the name suggests, its purpose is to integrate both thinking and non-thinking modes into a single model. This process used an SFT dataset containing both types of content.
For thinking-type samples, the Qwen team followed the data generation methods from the previous two stages. For non-thinking samples, they collected open-domain conversation data and specifically generated samples such as greetings and instructions.
Additionally, the team designed a chat template that uses special tokens on the input side to distinguish between thinking and non-thinking modes.
By continuing training on this mixed dataset and incorporating human feedback, the model learned to flexibly switch between the two modes based on input signals, forming a seamlessly integrated dual-mode system.
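The input-side mode switch can be pictured as a flag attached to the user turn. The sketch below assumes the flag strings `/think` and `/no_think` and two hypothetical helper functions; it is not the actual Qwen3 chat template.

```python
THINK_FLAG = "/think"        # assumed flag requesting thinking mode
NO_THINK_FLAG = "/no_think"  # assumed flag requesting non-thinking mode

def build_user_turn(query: str, thinking: bool) -> str:
    """Append the mode flag to the user's query."""
    flag = THINK_FLAG if thinking else NO_THINK_FLAG
    return f"{query} {flag}"

def wants_thinking(user_turn: str, default: bool = True) -> bool:
    """Read the mode flag back from a user turn; fall back to the default."""
    text = user_turn.rstrip()
    if text.endswith(NO_THINK_FLAG):
        return False
    if text.endswith(THINK_FLAG):
        return True
    return default

fast = build_user_turn("What is the capital of France?", thinking=False)
deep = build_user_turn("Prove that sqrt(2) is irrational.", thinking=True)
print(fast, "->", wants_thinking(fast))
print(deep, "->", wants_thinking(deep))
```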

The final stage is General Reinforcement Learning, aimed at further enhancing the model’s capabilities and stability across various scenarios.
In this stage, the Qwen team constructed a reinforcement learning environment covering over 20 types of tasks, including QA, writing, code generation, and mathematical reasoning. Each task was designed with unique scoring criteria.
This phase specifically targeted improvements in instruction following, format adherence, and preference alignment.
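The idea of giving each task its own scoring criteria can be sketched as a registry that maps a task type to a scoring function. Everything below is hypothetical: the task names, the rules, and the 0-to-1 reward scale are invented for illustration and are not taken from the report.

```python
def score_format(response: str) -> float:
    """Format adherence: reward a response whose last line starts with 'Answer:'."""
    lines = response.strip().splitlines()
    return 1.0 if lines and lines[-1].startswith("Answer:") else 0.0

def score_length_limit(response: str, max_words: int = 50) -> float:
    """Instruction following: reward a response that respects a word limit."""
    return 1.0 if len(response.split()) <= max_words else 0.0

# Hypothetical registry: one scoring function per task type.
SCORERS = {
    "format_adherence": score_format,
    "instruction_following": score_length_limit,
}

def reward(task_type: str, response: str) -> float:
    """Dispatch a response to the scorer registered for its task type."""
    if task_type not in SCORERS:
        raise ValueError(f"no scorer registered for task type {task_type!r}")
    return SCORERS[task_type](response)

print(reward("format_adherence", "Work shown here\nAnswer: 42"))
print(reward("instruction_following", "word " * 60))
```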

Beyond this training methodology, the Qwen3 family also adopted a “large guiding small” data distillation approach.
Distillation is divided into two main phases: Off-policy distillation and On-policy distillation.
By analogy with human learning, the first phase is like memorizing textbooks, while the second is like practicing problems and correcting oneself against the answers.

In the Off-policy distillation phase, a teacher model generates a large volume of high-quality outputs on a massive dataset (the 235B MoE model teaches the 30B MoE model, and the 32B dense model teaches the smaller dense models).
These data serve as supervision signals to train student models, enabling them to mimic the teacher model’s output distribution as closely as possible.
In this phase, the teacher model uses mixed outputs from both thinking and non-thinking modes, allowing the student model to learn capabilities for handling both modes simultaneously.
In the On-policy distillation phase, the research team adopted a more dynamic and interactive learning method.
First, the student model autonomously generates a series of outputs in actual tasks. These are then compared with the teacher model’s outputs on the same tasks.
The optimization goal for the student model is to minimize the difference between its output distribution and that of the teacher model.
Through this continuous process of self-generation and comparison, the student model can constantly correct and refine its knowledge base in practice, gradually approximating the teacher model’s output distribution.
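The "difference between output distributions" can be made concrete with a KL divergence between the student's and the teacher's next-token distributions, averaged over the positions of a student-generated sequence. The sketch below assumes the direction KL(student || teacher) and a toy three-token vocabulary; the report's exact objective may differ.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits):
    """Mean per-position KL(student || teacher); the direction is an assumption."""
    losses = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(losses) / len(losses)

# Logits at two positions of a student-generated sequence (vocabulary of 3 tokens).
student = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
teacher = [[2.5, 0.5, 0.1], [0.2, 0.4, 3.5]]

loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student's distribution matches the teacher's at every position and grows as the two diverge.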

Qwen’s Version of Deep Research Goes Live
In addition to releasing the Qwen3 technical report, Qwen Chat has fully launched its Deep Research feature, which had previously undergone phased testing.
According to official descriptions, users simply need to describe a problem and answer the refined questions posed by the model. After about the time it takes to drink a cup of coffee, Qwen can compile a research report.

In an official case study, Qwen investigated the following question:
How has the healthcare industry adapted to telemedicine and digital health tools over the past three years? Use tables where necessary for clearer expression.
As seen in the example, after clarifying the specific requirements, Qwen planned a strategy and broke the task down into sub-questions for retrieval and summarization. The research process took approximately 8.5 minutes, ultimately generating a report with tables and automatically exporting it as a PDF.

Feel free to try it out if you are interested.

Report Link:
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
Qwen Chat:
https://chat.qwen.ai