Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law'

Models & Benchmarks · Published: Sep 20, 2024 · Priya Sharma · ~11 min read

The viral success of Black Myth: Wukong has not only propelled the 3D game itself into the spotlight but also ignited a surge in interest in the underlying, rapidly evolving AI 3D generation technology.

For years, the 3D large model sector has drawn less outside attention than language and video models. Even so, players around the world have been quietly competing and making significant strides. From Yellow, backed by Andreessen Horowitz (a16z), to World Labs, founded by Fei-Fei Li, the pace of iteration on 3D large models has not slowed at all.

Just recently, VAST, a leading domestic player in 3D large models, updated its flagship model, Tripo. The new version is trained on tens of millions of high-quality, proprietary native 3D assets, and it performs exceptionally well.

The capabilities of this new 3D generation tool have advanced further: it now accepts text, single images, and multiple images as inputs.

To give an intuitive sense of its geometric precision and image fidelity, here is a short video of 3D models generated by the new tool:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 2

In addition to announcing the new product, VAST shared another major piece of news: the company has completed several rounds of financing totaling hundreds of millions of yuan. This marks the largest funding amount in the 3D large model sector to date.

Of course, the lead in financing reflects the company's technical strength: VAST's technology and application scenarios are top-tier.

Rapid Generation with Flawless Results and Stunning Effects

The model that has once again raised the ceiling for AI 3D generation is called Tripo 2.0.

Tripo 2.0 first generates a preview of the shape geometry within seconds, followed by “applying skin” to generate textures and PBR (Physically Based Rendering) materials in just a few more seconds.

Tripo 2.0 is now officially online, with many users already conducting live tests.

We also tested it ourselves right away.

Tripo 2.0 supports text-to-3D and single-image-to-3D generation; the Tripo 1.4 version also supports multi-image-to-3D generation.

By inputting a prompt, it can generate four 3D models at once.

Based on different inputs, our hands-on test results are divided into two sections below:

Text-to-3D Models
Image-to-3D Models

Hands-On Test of Tripo 2.0 Text-to-3D Models

Without further ado, let’s first look at the text-to-3D effects.

Step one: Generate the geometry for “a half-body portrait of an anime girl.”

In terms of generating complex structures, the details are quite impressive:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 3

Next, we apply the textures.

In under 20 seconds of generation time, it produces fine textures and layering; reaching this level of detail through manual modeling would typically take thousands of times longer.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 4

Let’s try another task: using Tripo 2.0 to generate a full-body cartoon character.

First, let’s try a cartoon dwarf.

The result is quite cute (in the voice of Song Dandan), as shown below:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 5

We also generated a small monster and zoomed in on the single model for closer inspection.

Rotating it 360 degrees, no bugs or flaws are visible to the naked eye. It is worth noting that the dense spikes on the monster’s back are a nightmare for human modelers, who usually avoid such complex designs. However, Tripo handles this with ease.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 6

Increasing the difficulty, it can also handle more complex 3D model generation tasks.

Understanding perspective structures has long been a bottleneck for generative AI, as the finger problems in image generation models show. Spatial structure is crucial for 3D models, and here Tripo shows a strong grasp of perspective structures, completing complex structural modeling tasks perfectly.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 7

Finally, here is an even more impressive example. The complexity of the shopping cart below speaks for itself:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 8

Hands-On Test of Tripo 2.0 Image-to-3D Models

Let’s take a closer look at the results of image-to-3D generation.

Generating a 3D model from a single image is a demanding test of an algorithm's ability to understand and reconstruct spatial information. In this test, we ran a side-by-side comparison with other players in the market.

A friendly reminder: the last 3D model shown in each display image below was generated by Tripo 2.0.

First, an image-to-3D comparison using a rose.

The comparison clearly shows that only the model generated by Tripo 2.0 has a geometric shape with no blind spots from any angle, and it has the most complete flowers and foliage:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 9

After texturing, it also delivers the best results in reproducing the colors and textures of the original image:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 10

After testing plant generation, we moved on to test image-to-model generation for inanimate objects.

We fed the model an image of a Russian Easter egg as input. Tripo 2.0’s output exhibited the most “relief-like” quality, and compared to others, its texture details were the most exquisite:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 11

After multiple tests, it is clear that Tripo 2.0 stands apart in overall generation performance.

For instance, the generated PBR materials have high fidelity, preserving the surface attributes and visual effects of the original image:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 12

Moreover, regardless of whether it is the side or back view, every angle captures complex features from the original image:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 13

Tripo 2.0 not only impresses with its generation quality but also features higher controllability.

Input is multimodal, and in text-to-3D mode the tool also supports negative prompts (specifying elements that should not be included in the generated model).
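The article does not say how Tripo implements negative prompts. In diffusion-style generators, one common approach is classifier-free guidance, where the prediction conditioned on the negative prompt takes the place of the unconditional prediction. The sketch below illustrates that general technique on plain lists of floats; `guided_prediction`, `guidance_scale`, and the toy vectors are hypothetical names and values, not part of Tripo.

```python
from typing import List


def guided_prediction(
    positive: List[float],
    negative: List[float],
    guidance_scale: float,
) -> List[float]:
    """Classifier-free guidance with a negative prompt (generic technique).

    positive: model prediction conditioned on the user's prompt.
    negative: model prediction conditioned on the negative prompt.
    The result starts at `negative` and moves toward (and past) `positive`
    as guidance_scale grows.
    """
    if len(positive) != len(negative):
        raise ValueError("predictions must have the same length")
    return [n + guidance_scale * (p - n) for p, n in zip(positive, negative)]


# Toy predictions (assumed values); a real model would output tensors.
pos = [1.0, 0.0, 2.0]
neg = [0.0, 0.0, 4.0]
print(guided_prediction(pos, neg, guidance_scale=1.0))  # same as `pos`
print(guided_prediction(pos, neg, guidance_scale=3.0))  # pushed further from `neg`
```

With a scale of 1 the negative prompt has no effect; larger scales push the output further away from whatever the negative prompt describes.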

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 14

Control over the pose of the output model is also exceptional.

Users can customize the proportions of the head, legs, arms, and other parts of the generated 3D model.

You can choose freely between an “A-pose” and a “T-pose,” or give the model long legs in an instant:

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 15

The generated 3D models can also be bound to skeletons and stylized with a single click.

Now your 3D avatar can have its own Lego version!

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 16

There are many more ways to explore; feel free to co-create in the comments section.

Given how impressive Tripo 2.0 is, let’s ask—

How Was Tripo 2.0 Forged?

Deconstructing it technically, one phrase defines Tripo 2.0’s implementation: 3D Scaling Law.

First, Tripo 2.0 is based on a massive database of tens of millions of high-quality 3D assets. It employs probabilistic generative modeling methods, learning to capture geometric and material distributions from large-scale data.

This ensures better output quality while enhancing the model’s robustness and generalization capabilities.

Second, it adopts a complex hybrid architecture combining DiT (Diffusion Transformer) and U-Net models.

DiT excels at capturing global context and long-range dependencies within 3D structures, while U-Net is adept at preserving fine details and local features. Tripo 2.0 integrates the advantages of both architectures.

Furthermore, Tripo 2.0’s geometry and material generation models are large-scale flow models with billions of parameters, trained with state-of-the-art algorithms.
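The article gives no details of Tripo's flow models. As a rough illustration of how a flow-based generator produces a sample in general, the sketch below integrates a velocity field from noise toward a target with Euler steps. The 1-D setting, the analytic `target_velocity`, and the value 2.5 are assumptions for illustration; a real model would learn the field with a neural network over a much higher-dimensional shape representation.

```python
import random


def target_velocity(x: float, t: float, target: float) -> float:
    """Velocity of the straight-line path from the current x toward `target`.

    Along x_t = (1 - t) * x_0 + t * target, the velocity equals
    (target - x_t) / (1 - t). A trained flow model approximates a field
    like this; here it is written analytically (toy assumption).
    """
    return (target - x) / (1.0 - t)


def sample(x0: float, target: float, num_steps: int) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 toward t = 1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        x += dt * target_velocity(x, t, target)
    return x


random.seed(0)
noise = random.gauss(0.0, 1.0)  # generation starts from Gaussian noise
print(sample(noise, target=2.5, num_steps=8))  # lands very close to 2.5
```

Fewer integration steps mean faster generation, which is the motivation for the step-reduction techniques used in such systems.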

It also utilizes guidance distillation and step distillation to improve efficiency, significantly optimizing performance without compromising quality.
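The article names these two techniques without formulas. As they are commonly described in the diffusion and flow-model literature (an assumption about the general idea, not a statement of VAST's method), guidance distillation can be written as follows, where the teacher $v_\phi$, the student $v_\theta$, the condition $c$, and the guidance weight $w$ are illustrative symbols:

```latex
% Classifier-free guidance needs two teacher evaluations per sampling step:
\hat{v}_w(x_t, t, c) = v_\phi(x_t, t, \varnothing)
    + w \,\bigl( v_\phi(x_t, t, c) - v_\phi(x_t, t, \varnothing) \bigr)

% Guidance distillation trains a student to match that output in one evaluation:
\min_\theta \; \mathbb{E}_{x_t,\, t,\, c,\, w}
    \bigl\| v_\theta(x_t, t, c, w) - \hat{v}_w(x_t, t, c) \bigr\|^2
```

Step distillation follows the same teacher-student pattern, except that the student is trained to reproduce in one sampling step what the teacher reaches in several, which reduces the number of steps needed at generation time.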

With these technological enhancements, Tripo 2.0 achieves a new state of the art (SOTA) in 3D generation across shape, texture quality, detail representation, adherence to input conditions, and output diversity, becoming the new “pentagon warrior” (a term for an all-around strong performer):

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 17

Previously, the team behind Tripo 2.0 collaborated with other groups on a body of academic work accepted by top conferences such as SIGGRAPH, CVPR, ICLR, and ECCV.

For example, Wonder3D generates consistent multi-view normal maps and corresponding color images through a cross-domain diffusion model, then rapidly reconstructs high-quality 3D geometry using a novel normal fusion algorithm.

Compared to existing methods based on Score Distillation Sampling (SDS), Wonder3D shows significant improvements in efficiency, consistency, and detail, completing reconstruction in just 2-3 minutes.

Another example is TGS: Triplane Meets Gaussian Splatting, also accepted by CVPR 2024.

This technology utilizes Transformer networks and a novel Triplane-Gaussian hybrid representation, making the reconstruction of 3D models from single images more efficient and precise.
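To make the triplane half of that hybrid more concrete, here is a minimal sketch of a triplane feature lookup: a 3D point is projected onto three axis-aligned feature planes and the sampled features are combined. It is a simplified illustration of the general triplane idea, not TGS's actual code; the helper names, the nearest-neighbour sampling, and the summation are assumptions (real systems typically interpolate and decode the features with a small network).

```python
from typing import List, Sequence

Plane = List[List[List[float]]]  # indexed as [row][col][channel]


def make_plane(res: int, channels: int, fill: float) -> Plane:
    """Create a res x res feature plane with every channel set to `fill`."""
    return [[[fill] * channels for _ in range(res)] for _ in range(res)]


def lookup(plane: Plane, u: float, v: float) -> List[float]:
    """Nearest-neighbour lookup for coordinates u, v in [0, 1]."""
    res = len(plane)
    row = min(int(v * res), res - 1)
    col = min(int(u * res), res - 1)
    return plane[row][col]


def triplane_features(
    xy: Plane, xz: Plane, yz: Plane, point: Sequence[float]
) -> List[float]:
    """Project a point in the unit cube onto three planes and sum the features."""
    x, y, z = point
    f_xy = lookup(xy, x, y)
    f_xz = lookup(xz, x, z)
    f_yz = lookup(yz, y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy planes with constant features (assumed values for illustration).
xy = make_plane(res=4, channels=2, fill=1.0)
xz = make_plane(res=4, channels=2, fill=10.0)
yz = make_plane(res=4, channels=2, fill=100.0)
print(triplane_features(xy, xz, yz, (0.9, 0.1, 0.5)))
```

Storing three 2D planes instead of a dense 3D grid keeps memory roughly quadratic rather than cubic in resolution, which is the usual reason for choosing this representation.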

Interested readers can consult these papers for further details.

In short, Tripo 2.0 was not achieved overnight; it is backed by substantial technological accumulation.

The Scaling Law of the 3D World

Finally, let’s formally introduce the company behind Tripo 2.0.

VAST, founded in March 2023, is an AI company focused on the research and development of large 3D models.

The company’s goal is to “establish a UGC (User-Generated Content) platform for 3D by creating mass-market 3D content creation tools, making spatial-based 3D a key element for user experience, content expression, and enhancing new quality productive forces.”

Public records show that the company’s CEO and CTO both come from SenseTime:

Founder and CEO Song Yachen has led multiple zero-to-one AI projects at SenseTime and participated in the founding of MiniMax, one of the “Six Little Giants” of large models. CTO Liang Ding, who earned his bachelor’s, master’s, and doctoral degrees from Tsinghua University under Academician Dai Qionghai, previously served as the head of SenseTime’s General Model division.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 18

In just a year and a half since its establishment, the company has been highly active.

First, earlier this year, it unveiled its first 3D large model, Tripo 1.0.

With billions of parameters, Tripo 1.0 can generate 3D mesh models from single images or text in just 8 seconds.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 19

△ The classic “Avocado Armchair” in 3D modeling, generated by Tripo 1.0

Within six months of launch, global users had generated over 5 million 3D models using Tripo 1.0.

What does 5 million mean? It is approximately equal to the sum of the world’s top three largest 3D model databases.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 20

In early March this year, VAST partnered with Stability AI, the team behind Stable Diffusion, to jointly release an open-source 3D foundation model called TripoSR.

Because it achieved the feat of “generating a 3D model from a single image in 0.5 seconds,” it has become highly popular in the open-source community for 3D generation, garnering 4.3k stars on GitHub to date.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 21

Now, Tripo 2.0 has been released and is available for online use.

Thanks to the performance improvements brought by the 3D Scaling Law, only nine months separated these three Tripo releases.

It offers both speed and quality, earning recognition from within and outside the industry.

One recent example: Roblox, the world’s largest online game development platform, announced not long ago that it is entering AI 3D generation. Even so, Tripo remains to date the most popular and convenient 3D modeling tool among Roblox players.

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law' — figure 22

Where will VAST take Tripo next?

The answer we found is that, at least technically, VAST will continue to pursue the research on the Scaling Law of 3D Generative AI, exploring the fundamental principles relating model scale, data volume, and generation quality, while seeking scalable paradigms for data, representations, and model architectures.
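The article does not state a formula for the 3D Scaling Law. In the language-model literature, scaling laws are usually expressed as power laws such as loss = a * size^(-b), and the sketch below shows how such a relationship can be fitted with least squares in log-log space. The power-law form, the function name, and every number here are assumptions on synthetic data, not measurements from VAST.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[float], losses: List[float]) -> Tuple[float, float]:
    """Fit loss = a * size ** (-b) by ordinary least squares on logs."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from a known law, purely for illustration.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [3.0 * s ** -0.1 for s in sizes]
a, b = fit_power_law(sizes, losses)
print(f"a is about {a:.3f}, b is about {b:.3f}")
```

If real measurements followed such a curve, the fitted exponent would indicate how much quality improves for each tenfold increase in model size or data volume.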

It aims not only to push the boundaries of 3D generative AI but also to keep exploring more holistic 3D generation.

This is quite promising.

Now that language and video models have given the world a jolt, people hope the 3D generation track will produce its own “ChatGPT moment.”

After all, 3D generation is in a relatively unique position among AI tracks. Manual editing after generation is technically difficult, and if the model performs poorly, rerolling again and again (“drawing cards”) in the hope of a satisfactory result is less effective than modeling it yourself (just kidding).

Fortunately, the 3D generation industry lives up to expectations and continues to move forward—

Looking back at the past two years, especially from late 2023 to 2024, 3D generation technology has developed rapidly.

It has improved in both quality and speed, gaining characteristics such as “high efficiency, low cost, strong innovation, and high customizability.”

As the technology advances rapidly, the density of talent across the industry is also increasing.

Domestically, the field is represented by startups such as VAST, whose teams come from globally renowned universities and research institutions. Abroad, World Labs, the spatial intelligence company that is the first startup of “AI godmother” Fei-Fei Li, is also focusing on 3D generation and has announced a long-term goal of building Large World Models (LWM) to perceive, generate, and interact with the 3D world.

Many hands make light work.

With clear progress in talent, technology, output quality, and application scenarios, the AI 3D generation track is gradually coming into wider view.

And the breakthroughs that the 3D Scaling Law may bring already seem to point toward the next focus area in artificial intelligence.