7B Model's 'Emotional Intelligence' Rivals GPT-4o: Tencent Breaks Through Open-Domain RL Challenge, Scores Jump Fivefold

Author Info

Elena Volkov

Machine Learning Research Editor

Ph.D. Machine Learning (ETH Zürich); published work on efficient training and evaluation

Elena explains model architecture, training economics, and benchmark design for a technical audience. She reads primary papers and official technical reports, then summarizes assumptions, datasets, and known failure modes. She avoids hype by pairing capability claims with reproducibility notes.

#Model Architecture #Benchmarks #Training Economics #Open-Source Models


How Should RL Behave in Open-Ended Dialogues Without Standard Answers?

Multi-turn dialogue is the most typical open-ended task for large language models (LLMs): it occurs frequently, spans many turns, depends heavily on context, and what counts as a "good response" varies from person to person.

However, when Reinforcement Learning (RL) is used to optimize an LLM's "emotional intelligence" in real-world interactions, RLVR (Reinforcement Learning with Verifiable Rewards) has run into "three major dilemmas":

  • The Environment Dilemma
    • Real conversations are multi-turn, dynamic, and highly personalized. How can we construct an interactive environment that is both realistic and diverse, while allowing the model to freely explore (rollout)?
  • The Reward Dilemma
    • “High emotional intelligence” has no standard answer. How can user subjective satisfaction be converted into a stable, optimizable long-term reward?
  • The Training Dilemma
    • How can we achieve stable and efficient multi-turn online RL training on LLMs?

The Tencent Hunyuan Digital Human team’s proposed RLVER (Reinforcement Learning with Verifiable Emotion Rewards) framework points to a solution:

By having a stable, high-quality user simulator play the dual role of both “interactive environment” and “reward source,” RLVR was successfully introduced into multi-turn dialogues, providing an effective and scalable new solution for training LLMs in open-domain RL.

The Qwen2.5-7B model trained with RLVER saw its score on the Sentient-Benchmark emotional dialogue benchmark jump from 13.3 to 79.2, performing comparably to top-tier commercial models like GPT-4o and Gemini 2.5 Pro.

The model is now open-source; links are available at the end of this article.

RLVER: Building an Effective RL Closed Loop for the Open Problem of “Emotional Intelligence”

Traditional dialogue optimization relies either on static data or on expensive human annotation.

RLVER proposes a new path: a user simulator that serves as both environment and reward source sits at the core of the framework and addresses all three of the challenges above.

Simulator as Environment: Creating a “Living” Dialogue World

The RLVER team recognized that true “high emotional intelligence” is subjective and varies by individual. Therefore, the user simulator built by RLVER is not just a simple chatbot.

It covers diverse user personas and interaction scenarios (different user personalities, dialogue backgrounds, and latent needs), so it can simulate a large population of realistic, varied users.

Each user interacts independently and dynamically with the model, updating its own emotional state in real-time based on the model’s responses, and providing personalized replies.

This gives the model an online learning environment that is realistic and diverse and that it can explore without a fixed limit on rollouts, while also helping to mitigate reward hacking.
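To make the "simulator as environment" idea concrete, here is a minimal sketch of a persona-driven user that keeps its own emotional state and updates it after every model reply. Everything in it is an illustrative assumption rather than the RLVER implementation: the class name `SimulatedUser`, the 0-100 emotion scale, the starting value of 50, and the keyword-based `toy_judge`, which stands in for the LLM that would actually play the user.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for the LLM-backed simulator: given a persona and the
# model's reply, it returns (emotion change, user's next message).
Judge = Callable[[str, str], Tuple[int, str]]


@dataclass
class SimulatedUser:
    persona: str        # personality, dialogue background, latent need
    judge: Judge        # assumed pluggable; a real system would call an LLM
    emotion: int = 50   # assumed 0-100 scale with a neutral start
    history: List[Tuple[str, str]] = field(default_factory=list)

    def react(self, model_reply: str) -> str:
        """Update the emotional state from the reply and answer in persona."""
        delta, user_msg = self.judge(self.persona, model_reply)
        # Clamp so the state always stays on the assumed 0-100 scale.
        self.emotion = max(0, min(100, self.emotion + delta))
        self.history.append((model_reply, user_msg))
        return user_msg


def toy_judge(persona: str, reply: str) -> Tuple[int, str]:
    """Toy rule for illustration only: acknowledging feelings raises emotion."""
    if "feel" in reply.lower():
        return 10, "Thanks, that helps."
    return -5, "You're not really listening."


user = SimulatedUser(persona="overworked nurse who wants to be heard", judge=toy_judge)
user.react("I can see why you feel drained.")
print(user.emotion)  # 60
```

Because each `SimulatedUser` instance carries its own persona and state, many of them can be rolled out independently, which is what allows the environment to be sampled as often as training requires.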

Simulator as Reward: A Trustworthy “User Sentiment Scoring System”

Evaluating "emotional intelligence" ultimately comes down to the user's subjective experience. But how can this subjective experience be transformed into a stable, optimizable reward?

Based on the SAGE framework, RLVER simulates users’ emotional changes after each turn of dialogue through an explicit, reproducible reasoning process.

After the conversation ends, the accumulated "total emotion score" becomes the reward signal, directly driving the PPO/GRPO algorithms to optimize the model.

This design moves away from a “black-box scorer,” explicitly modeling “user satisfaction” into a logically controllable reward function, making the training process more stable, transparent, and trustworthy.
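The reward computation described here can be sketched in a few lines. The sketch assumes that the simulator emits one integer emotion change per turn, that the running score lives on a 0-100 scale starting at 50, and that the final score is divided by 100 to give a reward in [0, 1]; the article only states that the accumulated score at the end of the conversation is the reward, so the scale, the starting point, and the function name `episode_reward` are hypothetical.

```python
from typing import Iterable


def episode_reward(deltas: Iterable[int], start: int = 50) -> float:
    """Accumulate per-turn emotion changes and return one terminal reward.

    Assumptions (not confirmed by the article): 0-100 scale, start at 50,
    reward = final score / 100.
    """
    score = start
    for delta in deltas:
        # Each turn shifts the simulated user's emotion; keep it on the scale.
        score = max(0, min(100, score + delta))
    return score / 100.0


print(episode_reward([10, -5, 20]))  # 0.75
print(episode_reward([-80, 10]))     # 0.1
```

Only the value returned at the end of the episode is passed to the optimizer; the per-turn changes are intermediate bookkeeping, so a reply that pleases the user briefly but derails the conversation later earns no credit.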

Global Reward Optimization: From Single-Turn Feedback to “Global Emotional Trajectory” Optimization

Unlike sentence-by-sentence feedback methods, RLVER focuses on the overall emotional trend of the entire conversation, using only the final “total emotion score” as the reward to guide the model in optimizing long-horizon strategies.

Only by truly understanding user intent and keeping the user's emotions on a long-term upward trend can the model achieve a higher total reward. This encourages the model to escape local optima and learn more generalizable, strategic social dialogue behaviors.

Core Achievements: 7B Model Rivals “Tech Giant Flagships”

The Qwen2.5-7B model trained with RLVER saw its score on the Sentient-Benchmark emotional dialogue benchmark jump from 13.3 to 79.2, performing comparably to top-tier commercial models like GPT-4o and Gemini 2.5 Pro.

More importantly, the model experienced almost no degradation in general capabilities such as mathematics and coding, successfully avoiding “catastrophic forgetting.”

Additionally, the impact of RLVER on model behavioral style was significant: the model shifted from a “problem-solving style” to an “emotion-focused style,” with its mindset changing from “how to solve the problem” to “I can understand your feelings.”

Deep Insights: From Thinking to Acting

While training with RLVER, the research team also made several instructive observations.

Insight 1: “Thinking” vs. “Reactive” Models – Two Paths to “Empathy”

RLVER introduced an explicit think-then-say prompt template, requiring the model to perform emotional analysis and strategic reasoning before generating a final response in each turn. By comparing models with and without “thinking,” the research team observed two distinct paths leading to “empathy”:
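A think-then-say template of this kind could look roughly like the sketch below. The `<think>...</think>` tag convention, the wording of `TEMPLATE`, and the helper `split_thought` are assumptions made for illustration; the article does not give the exact template RLVER uses.

```python
import re
from typing import Tuple

# Hypothetical prompt wording; the real RLVER template is not shown in the article.
TEMPLATE = (
    "First analyse the user's emotion and plan your strategy inside "
    "<think>...</think>, then write your reply after the closing tag.\n\n"
    "User: {user_msg}\n"
)

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thought(output: str) -> Tuple[str, str]:
    """Separate the private reasoning from the reply that the user sees."""
    match = _THINK.search(output)
    if match is None:
        # No reasoning block: treat the whole output as the reply.
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()


thought, reply = split_thought(
    "<think>They feel ignored at work.</think>\nThat sounds exhausting."
)
print(reply)  # That sounds exhausting.
```

In a setup like this, only `reply` would be forwarded to the user simulator, while the reasoning stays internal to the model's turn. A model trained without the template simply produces output with no reasoning block.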

“Thinking Model”: Moving Toward “Deep Understanding”

Explicit thinking chains prompt the model to reason before generation, significantly enhancing two core capabilities:

  • Problem Insight: Identifying the true motivations and latent needs behind user emotions;
  • Empathetic Expression and Validation: Accurately capturing and reflecting deep emotions, making users feel “understood.”

These models are more like “soulmates”: skilled at quiet listening and accurate responses, building deep emotional connections through language.

“Reactive Model”: Moving Toward “Quick Action”

In contrast, models not guided to think generate responses directly. Although they lag slightly in insight and empathy dimensions, they spontaneously develop compensatory strategies oriented toward action:

  • Quickly judging user dilemmas and providing concrete, actionable advice or personalized invitations for action;
  • Compensating for deficiencies in emotional understanding with “practicality,” forming the role of an “action-oriented partner.”

This comparison reveals an interesting phenomenon in RL training on open-ended, complex tasks: when capabilities are limited, models spontaneously seek strategic "compensation paths." The diverse training environment provided by RLVER, which rewards more than one strategy, is what allows these different behaviors to emerge.

Insight 2: PPO vs. GRPO – Stable Growth or Capability Breakthrough?

Regarding optimization algorithms, the RLVER team also drew practical conclusions:

  • GRPO: Tends to bring more stable and balanced capability growth.
  • PPO: Is better at pushing the model’s capabilities in specific dimensions (such as empathy depth, core insight) to higher upper limits.
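For readers unfamiliar with how the two algorithms differ: PPO estimates advantages with a learned value function (critic), whereas GRPO drops the critic and scores each rollout against the other rollouts sampled for the same prompt. The sketch below shows that group-relative normalization in its generic form. It is not taken from the RLVER code; the function name, the use of population standard deviation, and the `eps` value are choices made for this example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Generic GRPO-style advantage: (reward - group mean) / (group std + eps).

    `rewards` holds one terminal reward per rollout sampled for the same
    dialogue scenario. Population std and eps=1e-6 are example choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts for one simulated user, with terminal rewards in [0, 1].
adv = group_relative_advantages([0.2, 0.5, 0.8])
print([round(a, 3) for a in adv])
```

Rollouts that end with a happier simulated user than their siblings get a positive advantage and the rest a negative one. If every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient.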

This raises an interesting strategic question: for a complex, multi-dimensional ability like "emotional intelligence," once a model reaches the "passing line" on every dimension, should it keep developing as a "hexagon warrior" (an all-rounder), or concentrate on one or two standout dimensions?

In the experiments reported here, the latter approach yielded better overall performance.

Insight 3: The Impact of Environment and Reward Styles – A Strict Teacher Does Not Necessarily Produce a Brilliant Student

In the RLVER framework, the user simulator plays the dual role of both “training environment” and “reward model.” Therefore, its style—specifically “user acceptability” and feedback methods—has a direct impact on the model’s learning path.

A natural question arises: Will stricter users train stronger models?

The experimental answer is: Not necessarily; harder is not always better.

The RLVER team constructed two types of user simulators:

  • Vanilla Version: Emotionally expressive, generous with positive feedback, and more accepting of the model's responses;
  • Challenging Version: Emotionally reserved, restrained in feedback, with extremely high demands for response quality.

After training and testing on the same initial models, the RLVER team found:

Overly difficult environments hinder early model growth.

Although the Challenging simulator is more realistic by design, its subtle feedback and low tolerance for errors make it hard for the model to explore diverse strategies through trial and error, or to receive positive reinforcement, during the early stages of training. This can cause RL training to fall into a vicious cycle of "no feedback → no learning → collapse."

Conversely, the Vanilla simulator's feedback is relatively forgiving and positive, which makes it easier for the model to explore strategies and build up capability early in training, and to form stable habits of empathetic expression.

Strategic Implication: When optimizing open-ended tasks (such as “emotional intelligence”) via reinforcement learning, the training environment should not simply be made difficult; it should emphasize the design of a “growth curve.” The premise for “a strict teacher produces a brilliant student” is that the student can already understand the teachings.

In the early stages when capabilities are still shallow, gentle, learnable “sparring partner users” are more likely to help models grow into true empathizers.
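One way to picture such a "growth curve" is a schedule that starts with the Vanilla simulator and gradually mixes in the Challenging one. This is an illustration of the idea only: the article reports experiments with each simulator separately and does not describe a mixed curriculum, so the function names, the 30% warm-up, and the linear ramp are all hypothetical.

```python
import random


def challenging_prob(step: int, total_steps: int, warmup_frac: float = 0.3) -> float:
    """Probability of sampling the Challenging simulator at a training step.

    Hypothetical schedule: Vanilla only during warm-up, then a linear ramp.
    """
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / max(1, total_steps - warmup))


def pick_simulator(step: int, total_steps: int, rng: random.Random) -> str:
    """Sample which simulator style drives the next rollout."""
    if rng.random() < challenging_prob(step, total_steps):
        return "challenging"
    return "vanilla"


rng = random.Random(0)
print(challenging_prob(65, 100))     # 0.5
print(pick_simulator(10, 100, rng))  # vanilla
```

Whether a ramp like this actually outperforms training on the Vanilla simulator alone would have to be tested; the article's results only show that starting on the Challenging simulator is risky.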

Models with thinking are more “resilient.”

An additional interesting finding is that in the Challenging environment, models with an explicit “thinking structure” were significantly more robust:

  • Although overall scores decreased, they remained at a usable level;
  • Models without a thinking structure almost completely collapsed, with scores dropping as low as 19.8.

This indicates that explicit reasoning capabilities can buffer training instability caused by sparse rewards. Even in the absence of clear feedback, models can leverage “internal analysis” to mine user need signals, thereby maintaining a certain level of adaptability.

Preliminary Work: Can AI Also Be an Emotional Master? Tencent Releases Latest AI Social Intelligence Ranking; Latest GPT-4o Takes First Place
Paper Address: https://arxiv.org/abs/2507.03112
Project Code: https://github.com/Tencent/digitalhuman/tree/main/RLVER
Open Source Model: https://huggingface.co/RLVER