Achieving “Dual Excellence” in Understanding and Generation Within a Unified Framework for the First Time, Breaking the Stalemate of Multimodal Unified Modeling!
Researchers from Fudan University and Meituan have proposed UniToken, an innovative unified visual encoding scheme that balances text-image understanding and image generation tasks within a single framework. It has achieved leading performance across multiple authoritative benchmarks.
By integrating continuous and discrete visual representations, UniToken effectively alleviates the issues of “task interference” and “representation fragmentation” found in previous methods, offering a new paradigm for multimodal unified modeling.

To facilitate reproduction and further development by the research community, the UniToken team has open-sourced both its code and models.

Task Background: The Challenges of Unified Modeling
Traditional text-image understanding models and image generation models rely on visual encodings with very different underlying characteristics.
For instance, text-image understanding models (such as LLaVA, Qwen-VL, etc.) require extracting high-level semantics from images to facilitate collaborative understanding with text. In contrast, image generation models (such as DALL-E, Stable Diffusion, etc.) require preserving sufficient low-level details to ensure high-fidelity image generation.
Consequently, developing multimodal large models that integrate both understanding and generation faces several major challenges:
Fragmented Visual Encoding: Understanding tasks prefer continuous visual features with high-level semantics (e.g., CLIP), while generation tasks rely on discrete visual features that preserve low-level details (e.g., codebook indices produced by VQ-GAN).
Joint Training Interference: The conflicts arising from the differences between understanding and generation tasks make it difficult to balance performance for both tasks when training a unified model, often resulting in a scenario where “optimizing one leads to the degradation of the other.”
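To make the continuous versus discrete distinction concrete, here is a minimal sketch of the nearest-neighbor codebook lookup that VQ-style tokenizers rely on. The codebook, feature values, and function names are toy assumptions for illustration, not taken from VQ-GAN or UniToken.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[List[float]], codebook: List[List[float]]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


# Toy 2-D codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Toy continuous patch features, standing in for encoder outputs.
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

token_ids = quantize(features, codebook)
print(token_ids)  # [0, 1, 2]
```

The continuous vectors keep their exact values, while the discrete view keeps only an index per patch. That is why the two kinds of encoding suit different tasks.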
To address these challenges, existing work typically follows one of two paradigms. Works represented by VILA-U enrich the semantics of discrete visual encodings by combining image reconstruction with text-image contrastive learning objectives. Works represented by Janus decouple the two tasks by giving understanding and generation their own visual encoders and prediction heads.
However, the former still trails multimodal large models built on continuous visual encodings in understanding tasks, while the latter incurs significant context-switching overhead and one-sided information loss on more complex multimodal tasks (such as multi-turn image editing).
UniToken: Unified Visual Representation, Merging Two Worlds
Core Design: Continuous + Discrete Dual Encoders

Unlike the multi-task decoupling design of Janus, UniToken provides a complete set of visual information for all downstream tasks, enabling multimodal large models to absorb relevant knowledge in an instruction-driven manner.
Specifically, UniToken employs a unified dual-path visual encoder. It concatenates the discrete encoding from VQ-GAN and the continuous representation from SigLIP in the following manner to obtain a visual encoding that combines high-level semantics with low-level details:
[BOS][BOI]{Discrete Image Token}[SEP]{Continuous Image Embedding}[EOI]{Text}[EOS]
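The layout above can be sketched as a small sequence builder. This is only an illustration of the ordering. The tagged-tuple representation and the `build_sequence` helper are assumptions made for the sketch, not the UniToken implementation.

```python
from typing import List, Tuple

# Each element is tagged with its kind, so a model could route discrete ids
# to an embedding table and continuous vectors through an adapter.
Element = Tuple[str, object]


def build_sequence(discrete_ids: List[int],
                   continuous_embeds: List[List[float]],
                   text_ids: List[int]) -> List[Element]:
    """Assemble [BOS][BOI]{discrete}[SEP]{continuous}[EOI]{text}[EOS]."""
    seq: List[Element] = [("special", "[BOS]"), ("special", "[BOI]")]
    seq += [("discrete", i) for i in discrete_ids]
    seq.append(("special", "[SEP]"))
    seq += [("continuous", v) for v in continuous_embeds]
    seq.append(("special", "[EOI]"))
    seq += [("text", t) for t in text_ids]
    seq.append(("special", "[EOS]"))
    return seq


# Toy inputs: 2 discrete image tokens, 3 continuous embeddings, 2 text tokens.
seq = build_sequence([17, 42], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [7, 8])
layout = [value if kind == "special" else kind for kind, value in seq]
print(layout)
```

Because both views of the image sit in one sequence, every downstream task sees the same complete visual context.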
Multi-Stage Training Strategy
To coordinate the characteristics of understanding and generation tasks, UniToken adopts a three-stage training process:
Stage 1: Visual Semantic Space Alignment
Using Chameleon as the base model, this stage integrates SigLIP’s continuous visual encoding into the LLM. During training, the LLM is frozen and only the SigLIP ViT and the adapter are trained, so that their outputs align with the language space.
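As a rough sketch of the Stage 1 freezing rule, parameters can be marked trainable by module-name prefix. The prefixes `vit.` and `adapter.` and the parameter names below are hypothetical and do not come from the UniToken code base.

```python
from typing import Dict, Iterable

# Hypothetical prefixes for the modules that Stage 1 updates.
STAGE1_TRAINABLE_PREFIXES = ("vit.", "adapter.")


def stage1_trainable(param_names: Iterable[str]) -> Dict[str, bool]:
    """Return, for each parameter name, whether Stage 1 would update it."""
    return {name: name.startswith(STAGE1_TRAINABLE_PREFIXES) for name in param_names}


# Illustrative parameter names only.
names = [
    "llm.layers.0.attn.weight",
    "vit.blocks.0.mlp.weight",
    "adapter.proj.weight",
]
flags = stage1_trainable(names)
for name, trainable in flags.items():
    print(f"{name}: {'train' if trainable else 'frozen'}")
```

In a deep learning framework, the same flags would typically be applied by toggling gradient tracking on each parameter.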
Stage 2: Multi-Task Joint Training
Building on the complete visual information provided by the aligned dual-path encoder from Stage 1, this stage conducts joint training on large-scale text-image understanding and image generation datasets. By controlling the data ratio (10M:10M), it balances and enhances the model’s performance in both understanding and generation tasks.
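One simple way to hold a fixed understanding-to-generation ratio is to compose each batch from both pools. The sketch below assumes per-batch mixing, which the source does not specify. The pool contents and the `mixed_batches` helper are hypothetical.

```python
import random
from collections import Counter
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[str, str]  # (task name, sample id)


def mixed_batches(understanding: Sequence[str],
                  generation: Sequence[str],
                  batch_size: int,
                  ratio: Tuple[int, int] = (1, 1),
                  seed: int = 0) -> Iterator[List[Sample]]:
    """Yield batches whose understanding:generation composition follows `ratio`."""
    rng = random.Random(seed)
    n_u = round(batch_size * ratio[0] / (ratio[0] + ratio[1]))
    n_g = batch_size - n_u
    while True:
        batch = [("understanding", s) for s in rng.sample(understanding, n_u)]
        batch += [("generation", s) for s in rng.sample(generation, n_g)]
        rng.shuffle(batch)  # avoid a fixed task order inside the batch
        yield batch


# Toy pools standing in for the two large-scale datasets.
understanding_pool = [f"vqa_{i}" for i in range(100)]
generation_pool = [f"t2i_{i}" for i in range(100)]

batch = next(mixed_batches(understanding_pool, generation_pool, batch_size=8))
counts = Counter(task for task, _ in batch)
print(counts["understanding"], counts["generation"])  # 4 4
```

Changing `ratio` to, say, `(3, 1)` would shift each batch toward understanding data, which is the kind of knob a data-ratio ablation varies.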
Stage 3: Instruction-Enhanced Fine-Tuning
Testing revealed that the model trained in Stage 2 needed improvement in instruction following and layout-aware image generation. Therefore, this stage introduces high-quality multimodal dialogue data (423K) and fine-grained image generation data (100K) to further enhance the model’s ability to follow complex instructions.
Fine-Grained Visual Enhancement
Thanks to the completeness of its dual-path visual encoding, UniToken can seamlessly integrate existing fine-grained visual enhancement techniques.
Specifically, UniToken introduces two enhancement strategies on the continuous visual encoding side:
AnyRes: Divides high-resolution images into multiple sub-images, extracts features separately, and then concatenates them at their corresponding spatial positions to enhance fine-grained perception of the image.
End-to-End ViT Fine-Tuning: Dynamically fine-tunes the weights of the continuous visual encoder throughout the entire training process. Combined with a precise learning rate control strategy to prevent model collapse, this allows the model to adapt to a wide range of task scenarios.
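The tiling idea behind AnyRes can be sketched on a toy grid: split the image into sub-images, encode each one, and stitch the features back at the matching spatial positions. The mean-pooling "encoder" is a stand-in chosen only to keep the example runnable. A real system would run a ViT such as SigLIP on each sub-image.

```python
from typing import Callable, List

Grid = List[List[float]]


def split_into_tiles(image: Grid, tile: int) -> List[List[Grid]]:
    """Return a 2-D arrangement (row-major) of tile x tile sub-images."""
    h, w = len(image), len(image[0])
    assert h % tile == 0 and w % tile == 0, "sketch assumes exact tiling"
    return [
        [[row[c:c + tile] for row in image[r:r + tile]] for c in range(0, w, tile)]
        for r in range(0, h, tile)
    ]


def encode_tile(sub: Grid) -> Grid:
    """Stand-in encoder: a 1x1 feature map holding the mean of the sub-image."""
    flat = [v for row in sub for v in row]
    return [[sum(flat) / len(flat)]]


def anyres_features(image: Grid, tile: int,
                    encoder: Callable[[Grid], Grid] = encode_tile) -> Grid:
    """Encode each sub-image and place its features at its spatial position."""
    out: Grid = []
    for tile_row in split_into_tiles(image, tile):
        encoded = [encoder(t) for t in tile_row]
        # Stitch the feature maps of one row of tiles side by side.
        for i in range(len(encoded[0])):
            out.append([v for fmap in encoded for v in fmap[i]])
    return out


# Toy 4x4 "image" whose pixel values are 0..15 in row-major order.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
features = anyres_features(image, tile=2)
print(features)  # [[2.5, 4.5], [10.5, 12.5]]
```

Each entry of the output grid comes from the sub-image at the same relative position, so spatial layout is preserved after encoding.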
Experimental Results: Surpassing SOTA, The “Top Student” of Multimodal Unity
On multiple mainstream multimodal benchmarks covering both text-image understanding and image generation, UniToken achieved performance comparable to, and in some cases surpassing, that of specialized models in the field.

Meanwhile, the researchers conducted further in-depth ablation studies on the impact of training strategies and visual encoding:

- In large-scale data scenarios (>15M), a 1:1 ratio of understanding to generation data balances performance across both tasks.

- When addressing conflicts between understanding and generation tasks, the unified continuous + discrete visual encoding demonstrates greater robustness compared to schemes using only discrete encoding.
Conclusion: Towards General Multimodal Large Models Integrating Understanding and Generation
Looking at current trends, text-image understanding models are significantly more general-purpose than image generation models.
However, the impressive performance of Gemini-2.0-Flash and GPT-4o in instruction-following image generation has brought hope for the future of general-purpose image generation models.
Against this backdrop, UniToken is only an initial attempt. Because it provides complete visual information to every task, researchers can explore its deeper potential with more confidence along several directions:
Model Scale Expansion: Leveraging larger language models to further explore the “emergent abilities” of unified models in understanding and generation;
Data Scale Expansion: Introducing larger-scale training data (such as the nearly 200 million samples used by Janus-Pro) to push the performance limits of the model;
Task Type Expansion: Expanding from traditional understanding and generation to tasks involving interleaved text and images, such as image editing and story generation, chasing the upper limit of general generation capabilities.
Paper link: https://arxiv.org/pdf/2504.04423
Code: https://github.com/SxJyJay/UniToken