Tencent Hunyuan Open-Sources New AI Painting Framework: Aligns with Human Intent Across 24 Dimensions to Decode Complex Instructions

Author Info

Priya Sharma

Enterprise AI & Governance Editor

JD (technology policy focus); CIPP/US; former in-house counsel at a cloud provider

Priya writes about regulation, enterprise procurement, and responsible deployment. She separates legal fact from commentary, flags jurisdictional limits, and works with external counsel on high-risk governance topics. Her articles emphasize what changed, who is accountable, and what practitioners should verify locally.

#AI Regulation #Enterprise Adoption #Risk & Compliance #Policy Analysis

Submitted by the Tencent Hunyuan Team

AI image generation often fails to produce accurate results, causing frustration for creators.

Now, PromptEnhancer, an open-source framework from the Tencent Hunyuan team, offers a solution to this challenge.

Without modifying the weights of any pre-trained Text-to-Image (T2I) model, PromptEnhancer uses “Chain-of-Thought (CoT) prompt rewriting” alone to significantly improve text-image alignment accuracy.

In complex scenarios involving abstract relationship understanding and numerical constraints, accuracy can increase by more than 17%.

Additionally, to assist researchers in further exploring prompt optimization techniques, the Tencent Hunyuan team has simultaneously open-sourced a new high-quality human preference benchmark dataset.

Constructed around complex scenarios and containing extensive annotated data, this dataset not only provides robust support for the training and evaluation of PromptEnhancer but also serves as an important reference for related research fields.

Core Innovations: Two Modules Solve “Understanding Challenges” for Plug-and-Play Optimization

In recent years, T2I diffusion models—from Stable Diffusion and Imagen to HunyuanDiT and Flux—have become capable of generating hyper-realistic, stylistically diverse images. However, their ability to interpret “human instructions” remains a significant weakness.

Research by the Tencent Hunyuan team identified that the core issues with T2I models fall into three main areas:

  • Attribute Binding Confusion: Inability to accurately match attributes like “red” or “striped” to objects such as “hats” or “clothes.”
  • Ineffective Negative Instructions: When inputting “beef noodles without green onions,” the generated images still frequently include green onions.
  • Loss of Control Over Complex Relationships: Difficulty understanding spatial and comparative relationships like “the cat is to the left of the dog and half its size,” or rendering abstract composite scenes such as “a cat made of orange segments.”

The root cause of these problems lies in the vast gap between users’ concise instructions and the “refined descriptions” required by models.

Previous solutions either required fine-tuning for specific T2I models (lacking universality) or relied on coarse evaluation metrics like CLIP scores, which could not pinpoint specific errors.

This has led to AI image generation feeling more like “opening blind boxes” than a controllable creative tool.

PromptEnhancer’s breakthrough lies in constructing a prompt optimization framework completely decoupled from the generative model. Its core consists of two modules: the “CoT-based Rewriter” and the “AlignEvaluator Reward Model.” Through two-stage training, it teaches AI to “speak precisely.”

Figure 1: PromptEnhancer Technical Architecture

As Figure 1 shows, PromptEnhancer is trained in two parts: Supervised Fine-Tuning (SFT) to activate CoT rewriting capabilities, and Reinforcement Learning with GRPO, using AlignEvaluator as the reward model, to align the rewriter across 24 dimensions.

CoT-Based Rewriter: Deconstructing Instructions Like a Human Designer

Unlike traditional prompt optimization that relies on “keyword stacking,” PromptEnhancer’s rewriter introduces a “Chain-of-Thought (CoT)” mechanism. This simulates the thought process of human designers, breaking down concise instructions into three steps: “core elements – potential ambiguities – detailed supplements.”

Figure 2: Tom the Cat in an Astronaut Suit Floating in Space

For example, if a user inputs “Cute Tom wearing an astronaut suit floating in space, oil painting style,” the rewriter first establishes background knowledge (“Tom is a character from the Tom and Jerry IP”), then supplements details such as “the spacesuit has an off-white multi-layer design with yellow highlights on the helmet” and “the space background uses impasto techniques, with celestial bodies depicted in white and yellow pointillism.” It finally generates a structured, refined prompt.

To equip the rewriter with this capability, the team first performed initialization via “Supervised Fine-Tuning (SFT).”

Using large models such as Gemini-2.5-Pro, they generated 485,000 sets of data consisting of “original prompts – chain-of-thoughts – refined prompts.” This taught the rewriter the descriptive logic from “macro overview” to “micro details.”
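To make the shape of such a training sample concrete, here is a minimal sketch of one “original prompt, chain-of-thought, refined prompt” record serialized as a JSON line. The field names, the `RewriteExample` class, and the JSONL layout are assumptions for illustration; the article does not describe the released data format, and the refined prompt below is hand-assembled from the details quoted above rather than taken from the model.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RewriteExample:
    """One hypothetical SFT record: original prompt, reasoning steps, refined prompt."""
    original_prompt: str
    chain_of_thought: list[str]
    refined_prompt: str


def to_jsonl_line(example: RewriteExample) -> str:
    """Serialize a record as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(example), ensure_ascii=False)


example = RewriteExample(
    original_prompt="Cute Tom wearing an astronaut suit floating in space, oil painting style",
    chain_of_thought=[
        # The three steps mirror "core elements - potential ambiguities - detailed supplements".
        "Core elements: Tom, astronaut suit, floating in space, oil painting style.",
        "Potential ambiguities: 'Tom' is the character from the Tom and Jerry IP.",
        "Detailed supplements: off-white multi-layer spacesuit, yellow helmet highlights, impasto background.",
    ],
    # Illustrative text only, not actual rewriter output.
    refined_prompt=(
        "An oil painting of Tom, the cat from Tom and Jerry, floating in space in an "
        "off-white multi-layer spacesuit with yellow highlights on the helmet; the space "
        "background uses impasto techniques."
    ),
)

line = to_jsonl_line(example)
print(line)
```

Keeping the reasoning as an ordered list makes the “macro overview to micro details” progression easy to inspect per sample.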

AlignEvaluator: Scoring Across 24 Dimensions for Precise Error Localization

Traditional reward models (such as CLIP scores) only provide an “overall similarity” metric, failing to identify where the AI went wrong.

PromptEnhancer constructs an evaluation system covering 6 major categories and 24 key dimensions, enabling more precise error localization.

These 24 key dimensions cover almost all “blind spots” of T2I models, including:

Language Understanding: Negative instructions and pronoun reference (e.g., determining if “it” in “It is made of metal, so it broke the table” refers to the “ball”).

Visual Attributes: Object quantity (more than 3), material (ice sculpture vs. stone sculpture), expressions (contempt vs. smile).

Complex Relationships: Containment relationships (soda water inside a cup), similarity relationships (the shape of the lake resembles a guitar), and counterfactual scenarios (a girl hanging from a dandelion stem in the clouds).

Trained on large-scale annotated data, AlignEvaluator provides precise scores for generated images across each dimension.

For instance, “green onions missing from beef noodles” receives a high score in the “negative instruction” dimension, while “wrong cat color” receives a low score in the “attribute binding” dimension, providing clear direction for prompt optimization.
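As a rough illustration of how per-dimension scores localize errors, the sketch below picks out the dimensions that fall under a threshold. The dictionary layout, the 0 to 1 score range, the threshold, and the function name `weakest_dimensions` are all assumptions; the article does not specify AlignEvaluator's output format.

```python
def weakest_dimensions(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the dimensions scoring below `threshold`, lowest score first."""
    failing = [(score, name) for name, score in scores.items() if score < threshold]
    return [name for _, name in sorted(failing)]


# Hypothetical scores for one generated image (values invented for illustration).
scores = {
    "negative_instruction": 0.92,  # e.g. green onions correctly absent
    "attribute_binding": 0.18,     # e.g. wrong cat color
    "object_quantity": 0.41,
    "style": 0.88,
}

print(weakest_dimensions(scores))  # ['attribute_binding', 'object_quantity']
```

A single overall similarity number would hide the fact that the image above succeeds on the negative instruction while failing on attribute binding.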

Figure 3: AlignEvaluator Evaluation Dimensions

Two-Stage Training: From “Knowing How to Write” to “Writing Well”

With foundational capabilities and evaluation standards in place, PromptEnhancer evolves the rewriter through two stages of training:

Stage 1: SFT Initialization: The rewriter masters structured description, learning to generate refined prompts that are grammatical and logically coherent.

Stage 2: GRPO Reinforcement Learning: Eight candidate prompts generated by the rewriter are fed into a frozen T2I model (e.g., HunyuanImage 2.1). The AlignEvaluator then scores the resulting images.

Through the logic of “higher reward leads to greater emphasis,” the rewriter gradually learns to “generate prompts that T2I models can understand.”
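The “higher reward leads to greater emphasis” idea can be sketched with the group-relative normalization commonly associated with GRPO: each candidate's reward is compared with the mean of its group and scaled by the group's standard deviation. This is a minimal sketch under assumptions; the article does not give the exact formula used, and the reward values, the use of the population standard deviation, and the `eps` term are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group; an assumption here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical AlignEvaluator rewards for eight candidate rewrites of one user prompt.
rewards = [0.2, 0.4, 0.4, 0.5, 0.6, 0.6, 0.7, 0.8]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.2f}")
```

Candidates above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged; if all eight candidates score the same, every advantage is zero and the group carries no learning signal.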

Accuracy Improved Across 20 Dimensions; Significant Breakthroughs in Complex Scenarios

Figure 4: Semantic Accuracy of Text-to-Image Generation Across 24 Benchmark Dimensions

Tests on the HunyuanImage 2.1 model demonstrate comprehensive performance improvements brought by PromptEnhancer:

Overall Accuracy +5.1%: Positive gains were achieved in 20 out of 24 evaluation dimensions, with only two showing slight declines (text layout -0.7%, non-contact interaction -0.9%).

Significant Breakthroughs in Complex Scenarios: In the most challenging dimensions—“similarity relationships” (e.g., “lake looks like a guitar”), “counterfactual reasoning” (e.g., “dandelion stem suspended in clouds”), and “quantity counting” (e.g., “4 dogs”)—accuracy increased by 17.3%, 17.2%, and 15.0%, respectively.

More Precise Styles and Details: Accuracy improved by over 10% in dimensions such as “facial expressions” (e.g., “contemptuous expression”) and “cross-object attribute binding” (e.g., “man with short hair and blue shirt, woman with long hair and yellow shirt”). The fidelity of styles like oil painting and pointillism also improved significantly.

From a qualitative perspective, in images generated from the original prompt “Tom wearing an astronaut suit,” the spacesuit details were blurry and the space background monotonous.

After optimization by PromptEnhancer, not only are the character’s IP features clear, but the textures of the spacesuit and the painting techniques of the background fully meet user expectations.

PromptEnhancer’s “plug-and-play” design has also been tested on different generative model architectures, showing improvements in text-image alignment, realism, and aesthetics across mainstream T2I models such as HunyuanImage 2.1, FLUX Dev, and Qwen-Image.

Below is a comparison of the effects before and after PromptEnhancer rewriting for typical scenario prompts:

Figure 5: Generated by Qwen-Image

User Prompt: Strawberry cake without frosting, placed on a wooden table covered with a checkered tablecloth, oil painting texture.

Figure 6: Generated by Qwen-Image

Reprompt: An oil painting depicts a strawberry cake without frosting, placed on a wooden table covered with a checkered tablecloth. The main body of the cake is circular, exhibiting obvious thick impasto oil painting techniques, with its surface covered in rough brushstrokes. The interior consists of strawberry-colored sponge layers that are soft in structure, filled with pink jam between the layers. The top of the cake is decorated with several fresh red strawberries, their forms and colors rendered in heavy oil paint. Beneath the cake lies a checkered tablecloth composed of interwoven blue and white brushstrokes, laid upon a dark brown wooden tabletop whose wood grain is also carved out with distinct oil painting strokes. The overall image presents a strong oil painting texture, with saturated and thick colors and intense brushwork, representing a typical Still-life painting work.

Figure 7: Generated by FLUX Dev

User Prompt: Three rabbits wearing bow ties of different colors, the largest on the far left, the smallest on the far right, the middle rabbit holding a carrot, low-poly style.

Figure 8: Generated by FLUX Dev

Reprompt: An image depicts three rabbits of varying sizes wearing bow ties of different colors, arranged from left to right in descending order of size within a low-poly style scene. On the far left is the largest rabbit, its body light gray and composed of clear polygonal planes, wearing a red polygonal bow tie around its neck. In the middle is a medium-sized white rabbit, tightly holding an orange carrot with its front paws; the carrot’s leaves are green and also exhibit distinct polygonal structures, and it wears a blue bow tie. On the far right stands the smallest rabbit, pale pink in color, wearing a yellow bow tie around its neck, standing quietly. The ground they stand on is composed of interlocking polygonal blocks, with a background of simple geometric color blocks. The overall image presents a distinct Low Poly style.

Figure 9: Generated by HunyuanImage-2.1

User Prompt: Hand-drawn homework assignment featuring one square, two triangles, and three circles; all shapes have pink outlines.

Figure 10: Generated by HunyuanImage-2.1

Reprompt: Hand-drawn homework assignment. Several basic geometric shapes are arranged on a white paper background. In the center is a square, its outline drawn with a distinct pink line that varies in thickness to convey a hand-drawn feel. Above the square sit two triangles, also outlined in pink. Scattered around these shapes are three circles, their boundaries defined by pink lines. The interiors of all shapes are white, contrasting with the pink outlines. The background is white drawing paper with slight wrinkles or stains to enhance the realism of the hand-drawn aesthetic. The style resembles children’s hand-drawing or conceptual sketches.

These cross-model tests demonstrate that PromptEnhancer does not require customized modifications for specific Text-to-Image (T2I) models. Through a universal logic of “prompt rewriting + AlignEvaluator feedback,” it enhances the ability of different architectural generation models to understand complex instructions.
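The decoupling described here can be sketched as a thin wrapper that treats the rewriter and the image generator as interchangeable callables. Everything below is a hypothetical stand-in: `generate_with_rewrite`, `toy_rewrite`, and `toy_generate` are illustrative names, not part of the PromptEnhancer release or of any T2I library.

```python
from typing import Callable

Rewriter = Callable[[str], str]     # user prompt -> refined prompt
Generator = Callable[[str], bytes]  # prompt -> encoded image bytes


def generate_with_rewrite(
    user_prompt: str, rewrite: Rewriter, generate: Generator
) -> tuple[str, bytes]:
    """Rewrite the prompt first, then pass it to an unmodified T2I backend."""
    refined = rewrite(user_prompt)
    return refined, generate(refined)


def toy_rewrite(prompt: str) -> str:
    """Stand-in for a real rewriter: appends a fixed clarifying sentence."""
    return f"{prompt} Each object is described with its own color, size, and position."


def toy_generate(prompt: str) -> bytes:
    """Stand-in for a real T2I model: returns the prompt bytes instead of an image."""
    return prompt.encode("utf-8")


refined, image = generate_with_rewrite(
    "Three rabbits wearing bow ties of different colors, low-poly style.",
    toy_rewrite,
    toy_generate,
)
print(refined)
```

Because the generator is only ever called with a string, swapping in a different backend means passing a different `generate` callable; the rewriting step and the model weights stay untouched.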

This implies that industry players can integrate this technology into existing workflows at low cost, rapidly improving the controllability and creative efficiency of AI painting tools.

To promote research into the interpretability and reproducibility of prompt optimization technologies, the Tencent Hunyuan team has simultaneously open-sourced a high-quality benchmark dataset containing 6,000 prompts with fine-grained annotations across multiple dimensions.

This dataset not only covers core pain points for T2I models such as “attribute binding,” “complex relationships,” and “negative instructions,” but also reveals deep patterns in AI understanding of painting instructions through multi-dimensional statistical analysis.

Dataset Overview: 6k Prompts Covering Complex Creative Scenarios

The 6,000 prompts in this benchmark dataset are built around the core goal of “precise expression of human intent” and cover three types of complex scenarios:

  • Daily Creation Extensions: For example, “A chef wearing a striped apron slicing a red apple on a marble countertop, chiaroscuro style”;
  • Abstract Relationship Challenges: For example, “A whale made of cloud shapes swimming in a purple sky, pixel art style”;
  • Counterfactual and Reasoning Scenarios: For example, “If a cat had elephant ears, how would it lie on a cherry blossom tree? Ukiyo-e style.”

Each prompt is equipped with 24-dimensional annotations required by AlignEvaluator to ensure precise capture of “human intent.”

Prompt Length Distribution: An Intuitive Mapping of Instruction Complexity

Figure 11: Distribution of Prompt Character Lengths

Prompt lengths concentrate in the 80–120 character range, peaking at approximately 100 characters. This shows that the dataset focuses on “medium-complexity instructions”: it covers extensions of everyday short prompts while also challenging models to understand multi-element relationships within longer instructions.

The “long-tail interval” above 120 characters still shows a high frequency, representing the existence of “extremely complex instructions” (combinations of multiple objects, attributes, and relationships), providing material for testing model capabilities at their limits.

This distribution aligns closely with real-world creative scenarios: creators use concise prompts to express core ideas but also add extensive details during professional creation.
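A character-length distribution like the one in Figure 11 can be computed by binning prompt lengths. The sketch below is illustrative only: the bin width, the function name, and the sample strings are assumptions, and `len()` counts Unicode code points, which may differ from how the dataset authors measured length.

```python
from collections import Counter


def length_histogram(prompts: list[str], bin_width: int = 20) -> dict[int, int]:
    """Count prompts per character-length bin; each key is the bin's lower bound."""
    counts = Counter((len(p) // bin_width) * bin_width for p in prompts)
    return dict(sorted(counts.items()))


# Placeholder strings of known length stand in for real prompts.
prompts = ["a" * 95, "b" * 101, "c" * 118, "d" * 64, "e" * 150]

print(length_histogram(prompts))  # {60: 1, 80: 1, 100: 2, 140: 1}
```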

Key Dimension Co-occurrence: The “Combination Code” of Instruction Complexity

Figure 12: Top 24 Dimension Co-occurrence Heatmap

Darker colors (higher values) indicate a higher frequency of two dimensions appearing together in the same prompt. For instance, “Style” and “Action-Contact Interaction Between Entities” co-occur 676 times, indicating that “dynamic interaction scenes with specific styles” are a high-frequency demand for creators.

“Attribute-Expression” and “Action-Character/Anthropomorphic Full Body Movement” co-occur 332 times, reflecting the common need for combinations of character actions and expression details.

Niche but critical dimension combinations are also presented. For example, “Logical Reasoning” and “Relationship-Comparative” co-occur, corresponding to instructions requiring logical chains such as “The cat is half the size of the dog, so it jumps higher.”
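Co-occurrence counts of this kind can be derived by counting, for every prompt, each unordered pair of dimensions it is tagged with. The following is a minimal sketch with made-up tags; the tag sets and the function name `cooccurrence_counts` are assumptions and do not reflect the dataset's actual annotation schema.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tagged_prompts: list[set[str]]) -> Counter:
    """Count how often each unordered pair of dimensions appears in the same prompt."""
    counts: Counter = Counter()
    for tags in tagged_prompts:
        # Sorting makes each pair's key order-independent: (A, B) and (B, A) coincide.
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


# Hypothetical dimension tags for three prompts.
tagged = [
    {"Style", "Action-Contact Interaction"},
    {"Style", "Action-Contact Interaction", "Attribute-Expression"},
    {"Logical Reasoning", "Relationship-Comparative"},
]

counts = cooccurrence_counts(tagged)
print(counts[("Action-Contact Interaction", "Style")])  # 2
```

Each cell of a heatmap such as Figure 12 corresponds to one of these pair counts.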

Future Outlook

The significance of PromptEnhancer lies not only in improving the generation accuracy of individual models but also in three contributions to the AI painting field, spanning both the technology and its ecosystem:

  • Generality: It requires no modification to T2I model weights. As a “plug-and-play” module, it can adapt to any pre-trained model such as Hunyuan, Stable Diffusion, or Imagen, reducing optimization costs;
  • Interpretability: Through Chain-of-Thought (CoT) reasoning and 24-dimensional evaluation, prompt optimization is no longer a black box. Developers can clearly identify the model’s blind spots in understanding;
  • Ecosystem Completeness: The team simultaneously released a high-quality human preference benchmark containing extensive annotated data for complex scenarios, providing an important reference for subsequent prompt optimization research.

As AI painting transitions from an “entertainment tool” to professional fields such as “industrial design and advertising creation,” “precise understanding of human intent” will become a core competitive advantage.

PromptEnhancer provides a practical technical path for this direction through the approach of “optimizing instructions rather than modifying models.”

In the future, creators may only need to input simple ideas, and AI will automatically fill in the professional details, making “what you think is what you get” a reality.

Project Homepage: https://hunyuan-promptenhancer.github.io

GitHub: https://github.com/Hunyuan-PromptEnhancer/PromptEnhancer__PromptEnhancer-7B

HuggingFace: https://huggingface.co/tencent/HunyuanImage-2.1/tree/main/reprompt