Tencent Releases Largest Open-Source MoE Model: 389B Parameters, Free for Commercial Use, Outperforms Llama 3.1

Author Info

Amara Okonkwo

Robotics & Embodied AI Editor

M.Eng. Robotics (Imperial College London); former field applications engineer

Amara covers humanoids, industrial automation, and simulation-to-real transfer. She interviews practitioners about safety cases, unit economics, and dataset quality rather than demo videos alone. Her reviews call out what is lab-only versus commercially deployed.

Tencent has put its core technology into the open-source race, unexpectedly releasing the largest open-source Mixture of Experts (MoE) model on the market.

The model is Hunyuan-Large, with 389 billion total parameters and 52 billion activated parameters.

Its benchmark scores surpass open-source flagships such as Llama 3.1 405B, and it supports a context length one tier higher, up to 256K tokens.

Although Hunyuan-Large is not yet Tencent’s internal flagship model, the company states that its underlying technology shares the same lineage as the Hunyuan large language model:

Many details have been refined through internal business applications before being open-sourced. For example, features such as AI long-document reading in Tencent’s Yuanbao App are derived from this technology.

Releasing a model of this scale as fully open source and free for commercial use is a substantial commitment.

This time, Tencent has open-sourced three versions of Hunyuan-Large: the pre-trained model, the fine-tuned model, and an FP8 quantized fine-tuned model.

The release has sparked heated discussion in the open-source community. Thomas Wolf, Chief Scientist at Hugging Face, strongly recommended it and summarized several key highlights.

  • Strong mathematical capabilities
  • Extensive use of carefully crafted synthetic data
  • In-depth exploration of MoE training, utilizing shared experts and summarizing the Scaling Laws for MoE.

Among developers, some have immediately begun downloading and deploying the model, while others hope that Tencent’s entry into the fray will intensify competition in open-source models, thereby forcing Meta to produce better models.

Tencent released a technical report alongside the model, and many of its details have also drawn discussion.

For instance, it derives a scaling law for MoE models: C ≈ 9.59ND + 2.3 × 10⁸D.

Another example is the use of Cross-Layer Attention (CLA) to reduce KV cache memory usage.

Below is a summary of the highlights from the press conference presentation and the technical report.

Hunyuan-Large Technical Report

Scaling Law for MoE

Here is the formula:

C ≈ 9.59ND + 2.3 × 10⁸D

Where C represents the compute budget (in FLOPs), N represents the number of activated parameters, and D represents the training data volume (in tokens).

Compared to the compute budget formula for traditional dense models, C=6ND, the differences in the MoE model formula are primarily reflected in two aspects:

First, the coefficient increases from 6 to 9.59, reflecting the additional routing computation overhead of MoE, including the computational cost of switching between experts.

Second, a constant term of 2.3×10⁸D is added, reflecting the additional overhead of attention calculations in long-sequence MoE models.
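
As a rough illustration of how the two formulas differ, the sketch below evaluates both for the same N and D. The function names and the token count are assumptions made for this example; the article does not state how many tokens the model was trained on.

```python
def moe_compute_budget(n_active: float, tokens: float) -> float:
    """Compute budget in FLOPs from the reported MoE formula C ≈ 9.59ND + 2.3e8 D."""
    return 9.59 * n_active * tokens + 2.3e8 * tokens


def dense_compute_budget(n_params: float, tokens: float) -> float:
    """Classic dense-model estimate C = 6ND."""
    return 6.0 * n_params * tokens


n = 52e9  # activated parameters of Hunyuan-Large
d = 1e12  # hypothetical token count, chosen only for illustration

moe = moe_compute_budget(n, d)
dense = dense_compute_budget(n, d)
print(f"MoE formula:   {moe:.3e} FLOPs")
print(f"Dense formula: {dense:.3e} FLOPs")
print(f"Ratio:         {moe / dense:.3f}")
```

At this scale the constant term is small next to 9.59N, so the ratio between the two estimates is driven almost entirely by the coefficient.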

To determine the optimal number of activated parameters, the team invested significant resources into experiments:

They trained a series of models with activated parameter counts ranging from 10M to 1B, on datasets ranging from 10 billion to 100 billion tokens.

Using isoFLOPs curves, they identified the optimal point under a fixed compute budget, while also considering the impact of actual training batch sizes. By analyzing combinations of different parameter counts and data volumes, they calculated that the optimal number of activated parameters is approximately 58.1B.
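
One common way to read the optimum off an isoFLOPs curve is to fit a parabola to loss against the logarithm of model size and take its vertex. The sketch below does this with a three-point fit; the function name and the loss values are made up for illustration and are not taken from the report, which may have used a different fitting procedure.

```python
import math


def isoflop_optimum(points):
    """Estimate the loss-minimizing parameter count on one isoFLOP curve.

    points: at least three (activated_params, loss) pairs measured at the
    same compute budget. A parabola in log10(params) is fitted through the
    best sampled point and its two neighbours, and its vertex is returned.
    """
    pts = sorted((math.log10(n), loss) for n, loss in points)
    best = min(range(len(pts)), key=lambda i: pts[i][1])
    best = min(max(best, 1), len(pts) - 2)  # keep a neighbour on each side
    (x1, y1), (x2, y2), (x3, y3) = pts[best - 1], pts[best], pts[best + 1]
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return 10 ** (x2 - 0.5 * num / den)


# Made-up losses for illustration; these are not numbers from the report.
curve = [(1e7, 2.225), (1e8, 2.025), (1e9, 2.025), (1e10, 2.225)]
print(f"Estimated optimum: {isoflop_optimum(curve):.3e} activated parameters")
```

Repeating the fit for several compute budgets gives one optimum per budget, which can then be extrapolated to the target budget.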

Ultimately, Hunyuan-Large settled on 52B activated parameters. The curve is flat near the optimum, which leaves a wide tolerance around 58.1B, and the team also weighed practical factors such as compute resource constraints, training stability, and deployment efficiency.

Routing and Training Strategies

In addition to revealing the optimal parameter ratios, the technical report details Hunyuan-Large’s unique “MoE methodology.”

Hybrid Routing Strategy:

Hunyuan-Large adopts a hybrid routing strategy combining shared experts and specialized experts.

Each token activates one shared expert and one specialized expert. The shared expert handles general knowledge for all tokens, while specialized experts are dynamically activated using a top-k routing strategy to handle task-specific capabilities.
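
A minimal sketch of this routing pattern follows. It treats a token as a single float and experts as plain functions, and it weights the routed expert by its softmax gate; all of these are simplifying assumptions for illustration, not details from the report.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_moe_forward(x, shared_expert, specialized_experts, router_scores, k=1):
    """Combine one always-on shared expert with the top-k routed experts.

    x is a single token's feature (a float here, to keep the sketch tiny);
    router_scores holds one raw score per specialized expert.
    """
    gates = softmax(router_scores)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    routed = sum(gates[i] * specialized_experts[i](x) for i in top)
    return shared_expert(x) + routed, top


# Toy experts: hypothetical stand-ins for feed-forward blocks.
shared = lambda x: 0.5 * x
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
output, chosen = hybrid_moe_forward(2.0, shared, experts, [0.1, 2.0, -1.0])
print(chosen, round(output, 4))
```

The shared expert runs for every token regardless of the router, while only the highest-scoring specialized expert contributes to the output.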

Expert Recycling Routing Strategy:

Traditional MoE models often discard too many tokens due to expert overload. Hunyuan-Large designed an expert recycling mechanism to maintain relatively balanced loads, fully utilize training data, and ensure model training stability and convergence speed.
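
The article does not spell out how overflow tokens are reassigned. The sketch below assumes one simple rule: a token whose preferred expert is full goes to the least-loaded expert that still has room. The function name and the deterministic rule are illustrative assumptions; the actual mechanism may differ, for example by choosing the fallback expert at random.

```python
def route_with_recycling(preferred, num_experts, capacity):
    """Assign each token to an expert without dropping overflow tokens.

    preferred[i] is the router's first-choice expert for token i. A token
    whose first choice is full is "recycled" to the currently least-loaded
    expert that still has room. Tokens are dropped (None) only when every
    expert is full.
    """
    load = [0] * num_experts
    assignment = []
    for expert in preferred:
        if load[expert] >= capacity:
            spare = [e for e in range(num_experts) if load[e] < capacity]
            expert = min(spare, key=lambda e: load[e]) if spare else None
        if expert is not None:
            load[expert] += 1
        assignment.append(expert)
    return assignment, load


# Six tokens, five of which prefer expert 0; capacity is two tokens per expert.
assignment, load = route_with_recycling([0, 0, 0, 0, 1, 0], num_experts=3, capacity=2)
print(assignment, load)
```

Without recycling, three of the five tokens that prefer expert 0 would be discarded; here every token is processed and the load ends up even.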

Expert-Specific Learning Rate Adaptation Strategy:

Different experts handle vastly different numbers of tokens and should be assigned different learning rates. For example, shared experts use larger learning rates to ensure that each sub-model effectively learns from the data and contributes to overall performance.
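
To make the idea concrete, the sketch below scales each expert's learning rate by the square root of the share of tokens it sees. Square-root scaling with batch size is a common heuristic; whether it matches the exact rule used for Hunyuan-Large is an assumption, and the expert count, base learning rate, and batch size are hypothetical.

```python
import math


def expert_learning_rate(base_lr, tokens_seen, tokens_total):
    """Scale a learning rate by the square root of an expert's token share."""
    return base_lr * math.sqrt(tokens_seen / tokens_total)


base_lr = 3e-4            # hypothetical learning rate for the shared expert
num_specialized = 16      # hypothetical number of specialized experts
batch_tokens = 1_000_000  # hypothetical tokens per step

# The shared expert sees every token in the batch.
shared_lr = expert_learning_rate(base_lr, batch_tokens, batch_tokens)
# With top-1 routing and balanced load, each specialized expert sees 1/16 of the tokens.
specialized_lr = expert_learning_rate(base_lr, batch_tokens / num_specialized, batch_tokens)
print(shared_lr, specialized_lr)
```

Under these assumptions a specialized expert trains at a quarter of the shared expert's learning rate, because it sees one sixteenth of the tokens.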

High-Quality Synthetic Data

The Hunyuan team developed a complete high-quality data synthesis pipeline, mainly consisting of four steps: instruction generation, instruction evolution, answer generation, and answer filtering.

In the instruction generation phase, the Hunyuan team used high-quality data sources as seeds, covering multiple domains and varying complexities to ensure diversity and comprehensiveness in instructions.

Next comes instruction evolution, which improves the clarity and information density of instructions, expands coverage of low-resource domains, and gradually raises difficulty, making the instructions richer, more precise, and more challenging.

In the answer generation phase, the Hunyuan team employed specialized models to generate professional answers for different domains. These models varied in scale and design to ensure generated answers met the requirements of various fields.

Finally, in the answer filtering phase, the team used critique models to evaluate the quality of generated answers and performed self-consistency checks to ensure high-quality output.

With this four-step process, the Hunyuan team can generate a large volume of high-quality, diverse instruction-answer pairs to support MoE model training.

This data synthesis method not only improved training efficiency but also significantly enhanced model performance across various downstream tasks.
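
The answer filtering step can be sketched as a critique pass followed by a majority vote over sampled answers. Everything named here is hypothetical: the critique callable stands in for a critique model, and the thresholds are arbitrary values chosen for the example.

```python
from collections import Counter


def filter_answers(candidates, critique_score, min_score=0.7, min_agreement=0.5):
    """Keep an answer only if it passes critique and self-consistency checks.

    candidates: sampled answers to one instruction (strings).
    critique_score: callable returning a quality score in [0, 1]; it stands
    in for a critique model and is purely hypothetical.
    Returns the accepted answer, or None if the sample set is rejected.
    """
    good = [a for a in candidates if critique_score(a) >= min_score]
    if not good:
        return None
    answer, votes = Counter(good).most_common(1)[0]
    # Self-consistency: the winning answer must dominate all sampled answers.
    return answer if votes / len(candidates) > min_agreement else None


# Toy critique model: reject empty answers, accept everything else.
toy_critic = lambda a: 0.0 if not a.strip() else 1.0
print(filter_answers(["42", "42", "41", "42", ""], toy_critic))
print(filter_answers(["1", "2", "3"], toy_critic))
```

In the first call three of five samples agree, so the answer is kept; in the second no answer reaches a majority, so the instruction would be discarded or resampled.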

Long-Document Capability Optimization

To achieve powerful long-text processing capabilities, the Hunyuan team adopted several strategies during training.

The first strategy is phased training. The first phase worked with 32K-token texts, and the second extended the text length to 256K tokens. Each phase used approximately 10 billion tokens of training data so that the model could fully learn and adapt to texts of each length.

Regarding training data selection, 25% consisted of natural long texts, such as books and code, to provide realistic long-text samples; the remaining 75% was standard-length data. This data combination strategy ensured that while acquiring long-text understanding capabilities, the model maintained basic processing abilities for standard-length texts.

To better handle positional information in ultra-long sequences, the Hunyuan team also adjusted the position encoding. They used RoPE (Rotary Position Embedding) and raised its base frequency to 1 billion during the 256K-token phase, which improves the model’s understanding and generation of long texts.
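
To see why a larger base helps, the sketch below computes the standard RoPE rotation frequencies, base^(-2i/d), and the wavelength of the slowest-rotating pair. The head dimension of 128 is an assumption for illustration, and 10,000 is used as the comparison point because it is the base in the original RoPE formulation.

```python
import math


def rope_inverse_frequencies(head_dim, base):
    """Per-pair rotation frequencies theta_i = base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one turn."""
    return 2.0 * math.pi / rope_inverse_frequencies(head_dim, base)[-1]


head_dim = 128  # hypothetical head dimension, not stated in the article
for base in (1e4, 1e9):
    wavelength = longest_wavelength(head_dim, base)
    print(f"base={base:.0e}: longest wavelength ≈ {wavelength:.3e} positions")
```

With a base of 10,000 the slowest pair completes a full rotation well before 256K positions, whereas a base of 1 billion stretches it far beyond that length, so distant positions remain distinguishable.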

Beyond evaluation on public datasets, the Hunyuan team developed a long-text evaluation dataset named “Penguin Scroll.”

“Penguin Scroll” includes four main tasks: information extraction, information localization, qualitative analysis, and numerical reasoning.

Compared with existing long-text benchmarks, “Penguin Scroll” offers several advantages:

  • Data Diversity: It includes long texts from various real-world scenarios, such as financial reports, legal documents, and academic papers, with lengths up to 128K tokens.
  • Task Comprehensiveness: The dataset covers tasks at multiple difficulty levels, constructing a comprehensive classification system for long-text processing capabilities.
  • Conversational Data: It introduces multi-turn conversational data to simulate real-world long-text Q&A scenarios.
  • Multilingual Support: It provides bilingual Chinese-English data to meet multilingual application needs.

Inference Acceleration Optimization

To further improve the inference efficiency of Hunyuan-Large, the Hunyuan team adopted various optimization technologies, with KV Cache compression being the most critical.

This primarily combined two methods: GQA (Grouped-Query Attention) and CLA (Cross-Layer Attention).

GQA compresses the KV cache along the head dimension by using 8 KV head groups; CLA compresses it along the layer dimension by sharing one KV cache across every 2 layers.

Through the combination of these two strategies, the KV cache memory usage of the Hunyuan MoE model was reduced by approximately 95%, while model performance remained largely unchanged. This significant memory optimization not only greatly improved inference efficiency but also made the model easier to deploy and adapt to various practical application scenarios.
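
The arithmetic behind a saving of this size can be checked with a back-of-the-envelope calculation. In the sketch below, only the 8 KV head groups and the 2-layer sharing come from the article; the layer count, the 80 query heads, the head dimension, and the 2-byte value size are assumptions, with the head count picked so that the baseline is standard multi-head attention with one KV head per query head.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2,
                   layers_per_kv=1):
    """Size of the KV cache for one sequence.

    The factor 2 covers keys and values; layers_per_kv > 1 models CLA,
    where that many adjacent layers share a single cached KV.
    """
    cached_layers = layers // layers_per_kv
    return 2 * cached_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration: 64 layers, 80 query heads, head_dim 128.
layers, query_heads, head_dim, seq_len = 64, 80, 128, 256 * 1024

# Baseline: one KV head per query head.
mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)
# GQA: 8 KV head groups.
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)
# GQA plus CLA: KV shared across every 2 layers.
gqa_cla = kv_cache_bytes(layers, 8, head_dim, seq_len, layers_per_kv=2)

print(f"MHA:       {mha / 2**30:.1f} GiB")
print(f"GQA:       {gqa / 2**30:.1f} GiB")
print(f"GQA + CLA: {gqa_cla / 2**30:.1f} GiB")
print(f"Saving vs MHA: {1 - gqa_cla / mha:.0%}")
```

Under these assumptions GQA keeps 8/80 of the cache and CLA halves that again, leaving 5% of the baseline, which is consistent with the roughly 95% reduction quoted above.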

Post-Training Optimization

Building on pre-training, the Hunyuan team adopted a two-stage post-training strategy, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to further enhance capabilities in key areas and alignment with human preferences.

In the SFT phase, the team used over 1 million high-quality samples covering critical capability domains such as mathematics, reasoning, Q&A, and programming. To ensure data quality, the team applied several quality control measures, including rule-based filtering, model-based filtering, and manual review. The SFT run lasted three epochs, with the learning rate decaying from 2e-5 to 2e-6, to make full use of the data while avoiding overfitting.
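
The article gives only the start and end learning rates, not the shape of the decay. The sketch below assumes a cosine schedule, which is a common choice; the function name and the step count are hypothetical.

```python
import math


def sft_learning_rate(step, total_steps, lr_start=2e-5, lr_end=2e-6):
    """Cosine decay from lr_start at step 0 to lr_end at the final step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_end + (lr_start - lr_end) * cosine


total_steps = 3000  # hypothetical length of the whole SFT run
for step in (0, 1000, 2000, 3000):
    print(step, f"{sft_learning_rate(step, total_steps):.2e}")
```

A linear or exponential decay between the same two endpoints would fit the description equally well.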

In the RLHF phase, the team used a two-stage scheme that combines offline and online Direct Preference Optimization (DPO). Offline training uses pre-built human preference datasets to improve controllability; online training has the current policy model generate multiple responses and uses a reward model to select the best one, which improves generalization.
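
For readers unfamiliar with DPO, the sketch below shows the standard per-pair loss and one way an online preference pair could be built from sampled responses. Pairing the best response with the worst one, the beta value, and the toy reward function are assumptions for illustration; the article only says that the best response is selected with a reward model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))


def build_online_pair(responses, reward_fn):
    """Rank sampled responses with a reward model; return (best, worst)."""
    ranked = sorted(responses, key=reward_fn)
    return ranked[-1], ranked[0]


# Toy reward: longer answers score higher (a stand-in for a real reward model).
chosen, rejected = build_online_pair(["ok", "a detailed answer", "meh"], len)
print(chosen, "|", rejected)
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, which is the behavior the optimization rewards.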

They also applied an Exponential Moving Average (EMA) strategy to mitigate reward hacking and keep convergence smooth and stable during training.
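
An exponential moving average can be sketched in a few lines. The version below assumes the average is taken over model weights and uses a hypothetical decay value; the article does not give these details.

```python
def ema_update(ema_weights, new_weights, decay=0.99):
    """Blend the latest weights into a running exponential moving average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, new_weights)]


# Toy run: the raw weight jumps around, the EMA copy moves more smoothly.
ema = [0.0]
for raw in ([1.0], [5.0], [1.0], [5.0]):
    ema = ema_update(ema, raw, decay=0.5)
    print(raw[0], ema[0])
```

Because each update moves the averaged copy only part of the way toward the latest weights, a sudden shift caused by exploiting the reward model is damped rather than adopted outright.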

One More Thing

At the press conference, Kang Zhanhui, Head of Algorithms for Tencent’s Hunyuan Large Model, revealed that after Hunyuan-Large, Tencent plans to gradually open-source smaller models to meet the needs of individual developers and developers targeting edge devices.

Tencent also open-sourced a large 3D model at the same time.