AI Heavyweight Tao Mei Takes the Helm: A New Multimodal AI Arrives!
Its capabilities are nothing short of all-encompassing.

It not only supports image and video generation:

But also masters fantasy scenes and diverse camera angles:

Furthermore, the lip-sync feature is now live, allowing even introverts to easily create podcasts:

Video Link: https://mp.weixin.qq.com/s/bYNU6Mei2pq7KuFR8Ik2dQ
Key Highlights:
The official platform also provides hundreds of ready-to-use fun effect templates, enabling users to achieve “effortless creation.”

For cool transformations like the one below, the operation is as simple as uploading a single image:

Templates for transforming people, animals, and buildings are all available:

Additionally, the Image Agent in the image generation section is a flagship feature. Users can generate and edit images using plain language; not knowing how to write prompts is no longer an issue, as the system will automatically optimize and refine them for you.

To cut to the chase, this latest creative tool is vivago 2.0 (Zhi Xiaoxiang AI).
The team behind it, HiDream.ai, was founded by Tao Mei, a renowned figure in the industry and an academician of the Canadian Academy of Engineering. The R&D team is packed with core talent from the University of Science and Technology of China (USTC).

Recently, the team’s open-source model HiDream-I1 made a splash in the text-to-image arena. Within just 24 hours of being open-sourced, it topped the leaderboards, becoming one of the first domestic open-source large models to enter the top tier.

At the time, even Recraft (the team behind the mysterious viral “red_hat” panda) integrated it overnight, with global creators rushing to incorporate it into their workflows.

Interestingly, vivago 2.0 actually leverages the capabilities of HiDream-I1.

Currently, vivago 2.0 has launched globally on both the web and app platforms. We could not pass up a new toy like this, so we tried it right away and also took a close look at the models powering it.

Once clicked, simply enter a few keywords that come to mind, and it will automatically organize them into a complete, creative prompt. You can click “Use Prompt” to import it into the input field, or choose “Cite” to modify it further.

Additionally, you can set parameters such as the number of generated images, image dimensions, and negative prompts:

Without further ado, let’s look at the results.
Generating a glass of lemon sparkling water reveals almost no AI artifacts, with impressive detail:

First-person perspective image generation is also supported, as shown below:

For text-plus-image generation (uploading a reference image), there are three settings: Full Reference, Portrait, and Redraw.
“Full Reference” automatically uses the entire image as a guide for generation; “Portrait” extracts facial features to generate images of the same person in different styles; “Redraw” re-renders the original image into various artistic styles.

It effortlessly handles various styles, including photorealistic, illustration, Pixar-style, and 3D:

Image Agent also offers “rewrite” and “help me write” prompt functions. Users only need to express their ideas in plain language to create content.

Next, in the realm of video generation, there are two main modes: image-to-video and text-to-video.
Image-to-video can be generated from a single image or by setting start and end frames using two images.

By setting two keyframes for the start and end, users can generate smooth “transformation” style videos with a single click.

Various scenes can be transitioned seamlessly:

Vivago 2.0 also features a more convenient and efficient design.
On the image generation interface, users can directly click buttons on the generated images to initiate video creation and other operations.

Thus, the bicycle-riding image we generated earlier comes to life with a single click:

Whether it’s a realistic scene or an imaginative fantasy, Vivago 2.0 can transform any image into dynamic video with just one sentence.
For example, a dog surfing on the ocean:

Or even a modified static meme (I’m crying, but the tears are from menthol ointment fumes). Vivago 2.0 will also automatically enhance image quality.

After seeing the images and videos, let’s look at the AI Podcast feature.
The AI podcast creation function involves lip-syncing. You can either record your own voice or provide text for the AI to generate speech.

It can also be generated directly based on existing images or videos.

When the text “Life is like a box of chocolates. You never know what you’re gonna get” is input, the character in the image naturally syncs their lip movements to the audio.
At the same time, the character’s body language changes in sync with the speech.
We specifically selected an image showing a profile view of a person, and the lip-syncing remains smooth and natural.

After selecting an effect, we uploaded an AI-generated image of a little girl.
With just a snap, the girl’s outfit changes smoothly:

The Creative Community is also a great place to find inspiration. Creators’ millions of imaginative ideas are available for you to “borrow,” and you can directly use the same prompts.

Here are more excellent examples from the community:

In addition, the team is about to launch a Topics feature. Users can participate in trending topics to increase the exposure of their works. Currently, beta access for this feature is limited.
The Vivago 2.0 AI toolbox also includes diverse functional modules such as 3D generation, AI virtual try-on, and video background removal:

Interested users are encouraged to explore these features firsthand.
By the way, Vivago 2.0 has been quite popular since its launch, sometimes even causing server congestion due to high traffic.

The Next Evolution of Open-Source SOTA
On the technical front, Vivago 2.0’s new capabilities are powered by a brand-new Image Agent called HiDream-A1.
HiDream-A1 integrates advanced closed-source models (HiDream-I1.1 and HiDream-E1.1) built upon the open-source HiDream-I1 and HiDream-E1.
HiDream-I1 is an image generation foundation model with 17 billion parameters. It has been released in three versions: the full version, HiDream-I1-Full; a distilled accelerated version, HiDream-I1-Dev; and a distilled ultra-fast version, HiDream-I1-Fast.
HiDream-I1-Full is the complete version, requiring over 50 diffusion steps to achieve the best image quality. It is suited to creative scenarios that prioritize precision, such as commercial poster design or artistic creation.
HiDream-I1-Dev is a guided distilled version that reduces the number of steps to 28, striking a golden balance between quality and speed.
Meanwhile, HiDream-I1-Fast is the ultra-fast version, capable of generating high-quality images in just 14 steps, making it perfectly suited for real-time applications.
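To make the trade-off between the three versions concrete, here is a minimal sketch that picks a variant from a step budget. The step counts are the ones quoted in this article; the dictionary and the `pick_variant` helper are hypothetical and not part of any HiDream API.

```python
# Step counts as quoted in the article. The article says "over 50" for the
# full version, so 50 is used here as a lower bound (an assumption).
VARIANT_STEPS = {
    "HiDream-I1-Full": 50,
    "HiDream-I1-Dev": 28,
    "HiDream-I1-Fast": 14,
}


def pick_variant(max_steps: int) -> str:
    """Return the variant with the most steps that still fits the budget."""
    candidates = [
        (steps, name) for name, steps in VARIANT_STEPS.items() if steps <= max_steps
    ]
    if not candidates:
        raise ValueError(f"no variant runs in {max_steps} steps or fewer")
    # max() compares the step count first, so the highest-step variant wins.
    return max(candidates)[1]
```

With a budget of 30 steps this selects the Dev version; a real-time application with a budget of 14 or so would land on the Fast version.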
Notably, less than 24 hours after its release, HiDream-I1-Dev topped the Artificial Analysis Image Generation Arena leaderboard.
HiDream-I1 achieved State-of-the-Art (SOTA) results on the HPS benchmark, which evaluates semantic relevance, image quality, and aesthetics of generated images:

It also achieved SOTA results on the GenEval and DPG-Bench benchmarks, which evaluate the semantic relevance between generated images and input text:


HiDream-E1 is an open-source large model for interactive image editing. It offers the recently viral ability to edit an image simply by describing the change in words, similar to GPT-4o.
The combination of HiDream-I1 and HiDream-E1 can be considered the open-source equivalent of GPT-4o.
The core innovation of HiDream-I1 lies in cleverly integrating Sparse Mixture-of-Experts (MoE) technology into the Diffusion Transformer architecture.
They designed a dual-stream to single-stream hybrid sparse DiT structure.
Specifically, the model initially uses dual-stream DiT to process image and text tokens separately, much like two hands performing distinct tasks. In this stage, each modality has its own dedicated channel, allowing for thorough extraction of respective characteristics. Subsequently, the model switches to a single-stream DiT architecture, enabling deep fusion of both modalities.
The most ingenious aspect is that the team introduced a dynamic Mixture-of-Experts (MoE) architecture in both the dual-stream and single-stream phases. This acts like an intelligent router for the model, dynamically assigning each input token to the expert module best suited to handle it.
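The routing idea can be sketched in a few lines. The following is a generic top-k sparse MoE router in plain Python, not HiDream-I1's actual implementation: the gate vectors, the toy experts, and the choice of `top_k` are all assumptions made for illustration, and each expert is assumed to return a vector of the same length as its input.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, gate_weights, experts, top_k=2):
    """Send one token to its top_k experts and mix their outputs.

    token        : list[float], the token's feature vector
    gate_weights : one gate vector per expert (hypothetical learned router)
    experts      : one callable per expert, mapping a vector to a vector
    """
    # Router score per expert: dot product of the token with that expert's gate.
    scores = [sum(t * w for t, w in zip(token, gate)) for gate in gate_weights]
    probs = softmax(scores)
    # Keep only the top_k experts; the rest are never evaluated (the sparse part).
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm  # renormalize over the selected experts
        expert_out = experts[i](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


# Toy setup: three experts and a two-dimensional token.
experts = [
    lambda v: [2 * x for x in v],
    lambda v: [x + 1 for x in v],
    lambda v: [-x for x in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out, chosen = route_token([3.0, 1.0], gate_weights, experts, top_k=1)
# chosen == [0], out == [6.0, 2.0]
```

Because only the selected experts run for each token, total parameter count can grow without a matching growth in per-token compute, which is the usual motivation for sparse MoE layers.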
In terms of text encoding, HiDream-I1 adopts a “four-pronged” hybrid strategy:
The long-context CLIP provides visual-semantic alignment, the T5 encoder handles complex text structures, and Llama 3.1 contributes deep semantic understanding. Notably, features are extracted from multiple intermediate layers of the LLM to prevent the loss of detailed information in the final layer output. This comprehensive approach significantly enhances the model’s ability to understand text prompts.
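As a rough illustration of taking features from several intermediate layers rather than only the last one, the sketch below concatenates per-token features from a chosen set of layers. The article does not say which layers are used or how they are combined, so both the layer indices and the concatenation are assumptions.

```python
def collect_layer_features(hidden_states, layer_ids):
    """Concatenate per-token features from the selected layers.

    hidden_states : list over layers; each layer is a list over tokens;
                    each token is a list[float] of features
    layer_ids     : indices of the layers to keep (hypothetical choice)
    """
    num_tokens = len(hidden_states[0])
    combined = []
    for t in range(num_tokens):
        features = []
        for layer in layer_ids:
            # Append this layer's view of token t to the running feature vector.
            features.extend(hidden_states[layer][t])
        combined.append(features)
    return combined


# Three layers, two tokens, one feature each.
hidden = [[[1.0], [2.0]], [[10.0], [20.0]], [[100.0], [200.0]]]
mixed = collect_layer_features(hidden, layer_ids=[0, 2])
# mixed == [[1.0, 100.0], [2.0, 200.0]]
```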
Regarding training strategy, the team employed progressive resolution training, starting at 256×256, gradually increasing to 512×512, and finally reaching 1024×1024.
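A progressive schedule of this kind can be expressed as a simple lookup from training step to resolution. In the sketch below the three resolutions come from the article, while the number of steps spent in each stage is invented purely for illustration.

```python
def resolution_for_step(step, stages):
    """Return the training resolution for a given global step.

    stages: list of (num_steps, resolution) pairs, in training order.
    """
    boundary = 0
    for num_steps, resolution in stages:
        boundary += num_steps
        if step < boundary:
            return resolution
    # Past the last boundary: stay at the final resolution.
    return stages[-1][1]


# Resolutions from the article; the step counts per stage are made up.
STAGES = [(1000, 256), (500, 512), (250, 1024)]
```

Here steps 0 to 999 train at 256×256, steps 1000 to 1499 at 512×512, and everything after that at 1024×1024.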
The ZhiXiang Future (HiDream.ai) team did not stop at text-to-image generation. They also extended HiDream-I1 into an instruction-based image editing model, HiDream-E1, using an “in-context learning” approach. Users only need to provide the original image and an editing instruction, and the model can accurately execute the modification.
Finally, the team integrated the text-to-image HiDream-I1 and the image-editing HiDream-E1 to launch HiDream-A1, a comprehensive image agent.
This agent acts as an “all-around image assistant,” capable of generating images based on descriptions, editing images according to instructions, and engaging in multi-turn conversational creation and modification. This allows users to complete complex image creation tasks through natural language, much like chatting with ChatGPT.
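The multi-turn flow can be pictured with a toy dispatcher. This is only a sketch: the article does not describe how HiDream-A1 decides between generating and editing, so the rule used here (generate when the session has no image yet, edit otherwise) is a simplifying assumption, and `generate_image` and `edit_image` are hypothetical stand-ins for the two underlying models.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Session:
    """State carried across turns of one conversation."""
    image: Optional[str] = None
    history: List[str] = field(default_factory=list)


def generate_image(prompt: str) -> str:
    # Hypothetical stand-in for a text-to-image model call.
    return f"image({prompt})"


def edit_image(image: str, instruction: str) -> str:
    # Hypothetical stand-in for an instruction-based editing model call.
    return f"edit({image}, {instruction})"


def handle_turn(session: Session, message: str) -> str:
    """Generate on the first turn, then edit the current image on later turns."""
    if session.image is None:
        session.image = generate_image(message)
    else:
        session.image = edit_image(session.image, message)
    session.history.append(message)
    return session.image
```

A session that receives "a cat" and then "add a hat" first generates an image and then edits that same image, which mirrors the conversational create-then-modify loop described above.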
The Team Behind It: AI Expert Tao Mei at the Helm
ZhiXiang Future was founded in March 2023. While the name is new, its founder, Tao Mei, is a household name in the AI community. He is a Foreign Academician of the Canadian Academy of Engineering and a Fellow of IEEE, IAPR, and CAAI, making him a world-class expert in artificial intelligence, computer vision, and multimedia.
The core team members of ZhiXiang Future come from the technical teams of global Fortune 500 companies such as Microsoft, Baidu, Tencent, Huawei, JD.com, and ByteDance. Over 90% of the team holds doctoral or master’s degrees, with reports indicating that many are alumni of the University of Science and Technology of China (USTC).
Most team members have a background in AI video technology. As early as 2017, they published the paper “To Create What You Tell: Generating Videos from Captions” at the ACM Multimedia conference.
Looking back now, this was one of the first academic papers to research text-to-video generation, although the field was then referred to as Caption-to-Video.

Although the video generation they achieved using GANs (Generative Adversarial Networks) back then was far from perfect by today’s standards, its forward-looking nature is undeniable.
It was precisely this persistence in video generation that let the team draw on its accumulated expertise for another breakthrough during the AIGC boom: launching the world’s first open-source image and video generation model based on the Diffusion Transformer (DiT) architecture.
Unlike large tech companies that invest in ultra-large-scale computing power involving tens of thousands of GPUs, ZhiXiang Future chose a more pragmatic development path—focusing technically on visual multimodal foundation models, while commercially offering controllable image/video generation solutions close to market needs.
This strategy has clearly won the favor of investors who understand technology.
ZhiXiang Future’s financing journey has been smooth. It received seed funding from Alpha Community and Zhonghe Da Seed No. 1 Fund in April 2023, completed a Pre-A round led by Dunhong Capital in the first half of 2024, and closed an A-round in late 2024 led by state-owned funds, headed by Hefei Industrial Investment. It is understood that the A-round reached hundreds of millions of RMB, with co-investors including the Anhui Province AI Mother Fund and Hubei Changjiang Film Group Co., Ltd.
Both the speed and scale of financing reflect the capital market’s recognition of ZhiXiang Future’s technical capabilities and commercial prospects.
Tao Mei has a clear perspective on this: “Large language models require massive computing power and funding. In 2023, thousands of GPUs were needed; in 2024, tens of thousands are required. This is a winner-takes-all field. For Chinese startups, raising such large amounts of capital is difficult, as is keeping up with the competition from tech giants. The video industry track does not require such massive investment, has controllable scale, and is closest to commercialization.”
This judgment seems to have been validated by the market—in 2023, approximately $20 billion in global AIGC revenue came from video and images, accounting for 50%-60% of the total. Among them, Midjourney’s revenue in this area reached $200 million, already proving product-market fit (PMF).
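For readers who want to check the scale these figures imply, the short calculation below works backward from the numbers quoted above. It uses only the article's figures; the variable names are ours.

```python
video_image_revenue = 20.0            # billion USD, as quoted in the article
share_low, share_high = 0.50, 0.60    # quoted share of total AIGC revenue

# If $20B is 50%-60% of the total, the total lies between 20/0.6 and 20/0.5.
implied_total_low = video_image_revenue / share_high   # about 33.3
implied_total_high = video_image_revenue / share_low   # 40.0

midjourney_revenue = 0.2              # billion USD ($200 million)
midjourney_share = midjourney_revenue / video_image_revenue  # about 0.01, i.e. 1%
```

In other words, the quoted figures imply a total market of roughly $33 billion to $40 billion, with Midjourney accounting for about 1% of the video and image segment.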
Since its establishment in March 2023, ZhiXiang Future has continuously cultivated the field of visual multimodal foundation models and applications, releasing a series of remarkable achievements.
Here are the key points:
ZhiXiang Multimodal Large Model, with a parameter scale exceeding tens of billions, achieves joint modeling of text, images, video, and 3D content. It has successfully passed both model and algorithm filing requirements.
Based on this foundation, the “ZhiXiang AI” product series offers capabilities such as image generation and editing, 4K high-definition output, global/local controllability, and script-driven multi-shot video generation. These features provide significant commercial advantages in the fields of AIGC technology and digital creativity.
In 2024, ZhiXiang Future engaged in frequent strategic collaborations: signing a partnership with Ciwen Media; launching an “AI+” cooperation plan jointly with Shanghai Film Group; releasing the first national-level AIGC video ringtones application, “AI One-Word Video,” in collaboration with China Mobile Migu; and signing a strategic cooperation agreement with Cambricon in Beijing.
On December 28, 2024, at the launch ceremony of the Anhui Artificial Intelligence Industry Pilot Zone, ZhiXiang Future globally debuted ZhiXiang Multimodal Generation Large Model 3.0 and ZhiXiang Multimodal Understanding Large Model 1.0.
The ZhiXiang Multimodal Generation Large Model 3.0 comprehensively upgrades image and video generation capabilities, including improvements in visual quality and relevance, enhanced controllability of camera and scene movements, and optimizations driven by multi-scenario applications.
Meanwhile, the ZhiXiang Multimodal Understanding Large Model 1.0 achieves more precise and accurate understanding of image and video content through object-level visual modeling and event-level spatiotemporal modeling.

Entrepreneurship is not easy, especially in the fiercely competitive AIGC sector. However, Tao Mei’s goals extend beyond commercial success to encompass a broader sense of mission.
“I am not starting this business as an individual; I represent Chinese tech experts embarking on a new era to carve out a path. If my technology and commercialization strategies succeed, my story should be replicable, inspiring more people to pursue this endeavor,” Tao Mei stated.
Next, ZhiXiang Future will focus primarily on the application and commercialization of multimodal large models.
Between 2023 and 2025, ZhiXiang Future’s business model underwent significant evolution. In 2023, it provided foundational model capabilities via a Model-as-a-Service (MaaS) approach, establishing a robust technical foundation for future development. In 2024, the company shifted to a Software-as-a-Service (SaaS) model, launching tool-based products that validated their value in professional scenarios and further clarified the company’s commercial direction. In 2025, it launched a new strategy focusing on “IP secondary creation + consumer market penetration,” aiming to build a scaled commercial ecosystem, integrate upstream and downstream resources, and maximize commercial value.
This trajectory aligns with the common development path of AIGC products: first meeting the high demands of professional users, then gradually lowering operational barriers to achieve mass-market adoption.
From MaaS to SaaS, and now toward RaaS (Results-as-a-Service), ZhiXiang Future is no longer just selling tools but delivering growth directly.
Undoubtedly, with the emergence of multimodal AI capabilities, 2025 is destined to be a breakout year for multimodal technology and products. AIGC video generation is being viewed as a new-generation super platform akin to TikTok… Yet, beneath clear trends and market opportunities, only teams with genuine technical strength, product intuition, and clear commercialization rhythms can soar to success.
ZhiXiang Future is currently demonstrating these very traits and potential.