Fei-Fei Li's Latest Long Article Goes Viral in Silicon Valley

Author Info

Elena Volkov

Machine Learning Research Editor

Ph.D. Machine Learning (ETH Zürich); published work on efficient training and evaluation

Elena explains model architecture, training economics, and benchmark design for a technical audience. She reads primary papers and official technical reports, then summarizes assumptions, datasets, and known failure modes. She avoids hype by pairing capability claims with reproducibility notes.



Spatial intelligence is the next frontier of AI.

Fei-Fei Li, widely regarded as a godmother of artificial intelligence, has just published a lengthy article that systematically explains for the first time what spatial intelligence is, why it matters, and how to build world models capable of unlocking its potential.


In the piece, Li outlines three core capabilities that a “true world model with spatial intelligence” must possess:

  • Generative: The ability to create worlds that adhere to physical laws and maintain spatial consistency.
  • Multimodal: The capacity to process multimodal inputs ranging from images and video to actions.
  • Interactive: The capability to predict how the world evolves or interacts over time.

She also shares World Labs’ progress in next-generation task functions, data, model architectures, and learning representations, as well as the potential of world models in fields such as creativity, robotics, science, healthcare, and education.

Since its release, the article has garnered widespread attention, going viral across platforms and trending on social media:

Integrating spatial intelligence into Large World Models (LWMs) is expected to drive the next qualitative leap for Large Language Models (LLMs).

Once causal reasoning capabilities and energy efficiency reach a certain threshold, we will stand at the inflection point on the path toward Artificial General Intelligence (AGI).

Enough preamble. Let us now examine Fei-Fei Li’s manifesto, which moves from words to worlds.

Below is the full text:

From Words to Worlds: Spatial Intelligence Is the Next Frontier of AI

In 1950, when computers were merely tools for automated arithmetic and simple logic, Alan Turing posed a question that still resonates today: Can machines think?

He envisioned a future that others had not yet seen, requiring extraordinary imagination—the idea that intelligence might one day be “built” rather than “born.”

This insight sparked a scientific journey that continues to this day: artificial intelligence (AI). In my 25 years of working in AI research, Turing’s vision has remained a constant source of inspiration. But how close are we to realizing it? The answer is not straightforward.

Today, cutting-edge AI technologies represented by Large Language Models (LLMs) have begun to transform the way humans acquire and process abstract knowledge. However, they remain “wordsmiths in the dark”: eloquent yet lacking experience; knowledgeable yet ungrounded.

Spatial intelligence will change how we create and interact with both real and virtual worlds—revolutionizing storytelling, creativity, robotics, scientific discovery, and many other fields. This is precisely the next frontier of AI.

Since entering this field, exploring visual and spatial intelligence has always been my “North Star.” This is why I spent years building ImageNet—the first large-scale dataset for visual learning and evaluation.

Together with neural network algorithms and modern computing power (such as GPUs), it became one of the three key elements that nurtured modern AI. It also explains why my Stanford lab has combined computer vision with robotic learning over the past decade.

Similarly, this is why I co-founded World Labs a year ago alongside Justin Johnson, Christoph Lassner, and Ben Mildenhall: to realize this possibility for the first time.

In this article, I will explain what spatial intelligence is, why it matters, and how we can build “world models” capable of unlocking its potential—technology that will profoundly reshape creativity, embodied AI, and human progress.

Spatial Intelligence: The Scaffolding of Human Cognition

Artificial intelligence has never been more exhilarating. Models represented by generative AI, such as Large Language Models (LLMs), have moved from research laboratories into daily life, becoming tools for billions to create, produce, and communicate.

They demonstrate capabilities once thought impossible: generating coherent text, mountains of code, photorealistic images, and even short videos. Will AI change the world? By any reasonable definition, it already has.

Yet, vast potential remains untapped. The vision of automated robots remains alluring but distant; dreams of accelerating research in disease treatment, new material discovery, and particle physics remain unfulfilled.

True AI that can understand and empower human creators, whether students learning complex molecular chemistry concepts, architects conceptualizing space, filmmakers building worlds, or anyone seeking to immerse themselves in virtual experiences, has yet to arrive.

To understand why these capabilities remain elusive, we must look back: How did spatial intelligence evolve? And how has it shaped the way we understand the world?

Vision has long been a cornerstone of human intelligence, but its power stems from something more fundamental. Long before animals could build nests, raise offspring, communicate through language, or establish civilizations, that seemingly simple act of “perception”—feeling a ray of light, touching a texture—had already quietly ignited the evolutionary journey toward intelligence.

This ability to extract information from the external world builds a bridge between perception and survival, a bridge that grew increasingly complex over the course of long evolution.

Neurons layered upon one another formed nervous systems capable of interpreting the world and coordinating interactions between organisms and their environments. Consequently, many scientists believe that the “perception-action” loop became the core mechanism of intelligent evolution, serving as the foundation for nature’s creation of our species—an ultimate entity capable of perceiving, learning, thinking, and acting.

Spatial intelligence plays a foundational role in our interaction with the physical world. Every day, we rely on it to perform even the most mundane actions.

When parking, we imagine the distance between the rear of the car and the curb; we catch keys thrown to us; we navigate through crowds without colliding; and half-asleep, we accurately pour coffee into a cup.

In extreme situations, firefighters move through collapsing buildings and thick smoke, instantly judging structural stability and making life-or-death decisions while communicating through body language and an instinctive, tacit understanding that words cannot capture. Infants, meanwhile, spend the long period before they learn language exploring their environment through play in order to understand the world.

All of this happens intuitively and naturally—a fluid capability that machines have yet to achieve.

Spatial intelligence is also the cornerstone of our imagination and creativity. Storytellers construct rich worlds in their minds and convey them to others through various visual media, from primitive cave paintings to modern cinema and immersive video games.

Whether a child builds castles on a beach or plays Minecraft on a computer, this space-based imagination forms the basis of human interaction with virtual worlds. In industrial applications, simulations of objects, scenes, and dynamic interactive environments support countless critical scenarios ranging from industrial design and digital twins to robot training.

In key moments in history that shaped civilization, spatial intelligence often played a central role.

In ancient Greece, Eratosthenes achieved an astonishing feat through geometric thinking about shadows—he measured the 7-degree angle formed by the sun’s shadow in Alexandria and compared it with the phenomenon of “noon without shadow” in Syene, thereby calculating the Earth’s circumference.
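The geometry behind this estimate is short enough to write out: the shadow angle equals the arc between the two cities, so the circumference is the inter-city distance scaled up to a full circle. The sketch below assumes the traditionally cited inputs, which the text above does not give: a distance of about 5,000 stadia between Alexandria and Syene and a shadow angle of about 7.2 degrees (one fiftieth of a circle). The function name and parameters are illustrative only.

```python
def earth_circumference(shadow_angle_deg: float, distance: float) -> float:
    """Scale the arc between two cities up to a full 360-degree circle."""
    if not 0 < shadow_angle_deg < 360:
        raise ValueError("shadow angle must be between 0 and 360 degrees")
    return distance * 360.0 / shadow_angle_deg


# Assumed historical inputs (not from the essay): 7.2 degrees, 5,000 stadia.
estimate_stadia = earth_circumference(7.2, 5000.0)
print(round(estimate_stadia))
```

With these assumed inputs the estimate comes to roughly 250,000 stadia; the result is expressed in whatever unit the distance is given in.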

Hargreaves’ invention of the “Spinning Jenny” also stemmed from spatial insight: he realized that by installing multiple spindles side-by-side on a single frame, one worker could spin multiple threads simultaneously, increasing production efficiency eightfold.

The breakthrough in revealing the structure of DNA by Watson and Crick also relied on their hands-on construction of three-dimensional molecular models—they continuously adjusted and assembled metal plates and wire frames until the spatial arrangement of base pairs fit perfectly.

In these cases, spatial intelligence drove the progress of civilization—when scientists and inventors needed to manipulate objects, imagine structures, and reason within physical space, these capabilities could never be fully conveyed through text alone.

Spatial intelligence is the scaffolding that supports human cognition.

Whether passively observing or actively creating, it operates silently in the background.

It drives our reasoning and planning, even on the most abstract topics. It also shapes how we interact with the world—whether through language communication or physical action, whether with others or with the environment itself.

Although most of us do not discover new truths every day like Eratosthenes, we think in essentially the same way almost all the time: understanding this complex world through our senses and making it comprehensible by relying on intuitive knowledge of physical and spatial laws.

Regrettably, today’s AI cannot yet think in such a manner.

Significant progress has indeed been made in recent years. Multimodal Large Language Models (MLLMs), which are trained on vast amounts of multimedia data in addition to text, have acquired rudimentary spatial perception: they can analyze images, answer related questions, and even generate hyper-realistic images and short videos. Meanwhile, aided by breakthroughs in sensors and haptic technology, the most advanced robots have begun to manipulate objects and tools in strictly constrained environments.

However, frankly speaking, AI’s spatial capabilities are still far from human level, and the limitations are obvious. State-of-the-art MLLMs often perform no better than random guessing on tasks such as estimating distance, direction, and size. They cannot “mentally rotate” objects, that is, reproduce the shape of the same object from a new angle. They cannot navigate mazes, identify shortcuts, or predict basic physics. And although generated videos are novel and dazzling, they often lose coherence after just a few seconds.

Today’s top AI excels at reading, writing, retrieval, and pattern recognition, but it has fundamental limitations when it comes to representing or interacting with the physical world.

Human understanding of the world is holistic: we do not merely see what is right in front of us, but also comprehend how things relate to one another in space, what they mean, and why they matter.

This capacity to understand the world through imagination, reasoning, creation, and interaction is precisely the power of spatial intelligence.

Without it, AI becomes disconnected from the physical reality it seeks to understand. It will be unable to safely drive cars, guide robots in homes and hospitals, create entirely new immersive learning and entertainment experiences, or accelerate discoveries in materials science and medicine.

The philosopher Ludwig Wittgenstein once wrote, “The limits of my language mean the limits of my world.” I am not a philosopher, but I know that for AI, the world extends beyond language. Spatial intelligence represents the frontier that transcends language.

It connects imagination, perception, and action, opening new possibilities for machines to genuinely enhance human life: from healthcare to creativity, from scientific discovery to everyday assistance.

The Next Decade of AI: Building Machines with True Spatial Intelligence

So, how do we build AI that possesses spatial intelligence?

How can we equip models with the spatial reasoning capabilities of Eratosthenes, the engineering precision of industrial designers, the creative imagination of storytellers, and the fluid interaction with their environment exhibited by emergency responders?

To achieve such an AI, we need a system more ambitious than Large Language Models (LLMs): World Models.

This is a new class of generative models whose capabilities in understanding, reasoning, generating, and interacting will surpass the current limits of LLMs. They can understand and generate complex virtual or real worlds across semantic, physical, geometric, and dynamic dimensions.

This field is still in its infancy, with existing methods ranging from abstract reasoning models to video generation systems.

World Labs was founded in early 2024 based on the belief that foundational approaches are still taking shape, and that this will be the defining challenge for artificial intelligence over the next decade.

In this emerging domain, it is crucial to establish core principles to guide development directions. For spatial intelligence, I define a “World Model” as a system possessing three core capabilities:

1. Generative: World models can generate worlds with perceptual, geometric, and physical consistency.

To achieve spatial understanding and reasoning, world models must be able to generate their own simulated worlds.

Guided by semantic or perceptual instructions, they should be able to generate endlessly varied simulated worlds while maintaining consistency in geometry, physics, and dynamics, whether those worlds represent real or virtual spaces.

The research community is currently exploring whether these worlds should be represented as implicit or explicit geometric structures.

In addition to powerful latent representations, I believe the outputs of a general-purpose world model must also allow for the generation of explicit, observable world states to adapt to different application scenarios. Crucially, the model’s understanding of the current world must remain coherent with its past states—understanding the present is understanding how it evolved.

2. Multimodal: World models are inherently multimodal by design.

Just like humans and animals, world models should be able to process multiple forms of input. In the field of generative AI, these inputs are referred to as “prompts.”

Faced with incomplete information—whether images, videos, depth maps, text instructions, gestures, or actions—the world model should predict or generate the most complete possible state of the world.

This requires the model to process visual inputs with the fidelity of real vision while interpreting semantic instructions with equal facility.

In this way, both agents and humans can communicate about the “world” with the model through diverse input forms and receive outputs in various ways.

3. Interactive: World models output the next state based on input actions.

Finally, when actions and/or goals are part of the input prompts, the world model’s output must include the next state of the world.

This state can be implicit or explicit. When the input is an action, with or without a goal state, the world model should generate an output consistent with the world’s previous state, the intended goal state (if any), and the world’s semantic meaning, physical laws, and dynamic behaviors.

As spatial intelligence world models continue to enhance their reasoning and generation capabilities, we can imagine that future models will not only predict the next state of the world but also be able to predict the next action based on that state.
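The three capabilities above can be read as an interface contract. The following is a minimal sketch under stated assumptions, not World Labs’ design: the names `WorldModel`, `WorldState`, `Prompt`, `Action`, and `ToyBallWorld` are all hypothetical, and the “world” is a single ball falling under gravity, chosen only so that physical consistency between successive states is easy to check.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class WorldState:
    """Explicit, observable state: a ball's height, velocity, and time."""
    height: float
    velocity: float
    t: float = 0.0


@dataclass(frozen=True)
class Prompt:
    """Hypothetical multimodal prompt; any field may be missing."""
    text: Optional[str] = None
    initial_height: Optional[float] = None  # stands in for a visual cue


@dataclass(frozen=True)
class Action:
    push: float = 0.0  # upward change in velocity applied this step


class WorldModel(Protocol):
    """Generative: build a state from a prompt. Interactive: advance it."""
    def generate(self, prompt: Prompt) -> WorldState: ...
    def step(self, state: WorldState, action: Action) -> WorldState: ...


class ToyBallWorld:
    """Toy stand-in (illustrative only): gravity plus a solid floor."""
    GRAVITY = -9.8  # metres per second squared
    DT = 0.1        # seconds per step

    def generate(self, prompt: Prompt) -> WorldState:
        # Incomplete prompts are filled with a default rather than rejected.
        h = prompt.initial_height if prompt.initial_height is not None else 1.0
        return WorldState(height=h, velocity=0.0)

    def step(self, state: WorldState, action: Action) -> WorldState:
        v = state.velocity + action.push + self.GRAVITY * self.DT
        h = state.height + v * self.DT
        if h <= 0.0:  # the ball rests on the floor instead of passing through
            h, v = 0.0, 0.0
        return WorldState(height=h, velocity=v, t=state.t + self.DT)


world: WorldModel = ToyBallWorld()
state = world.generate(Prompt(text="a ball held above the floor", initial_height=2.0))
for _ in range(5):
    state = world.step(state, Action())
    print(state)
```

Each new state is computed only from the previous state and the action, which is the consistency requirement described above in its simplest form; a real world model would replace the hand-written physics with learned dynamics.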

The scale of this challenge surpasses anything AI has faced before.

Language is a purely generative phenomenon in human cognition, whereas the rules governing “the world” are far more complex.

On Earth, for example: gravity dictates motion, atomic structure determines the color and brightness of light, and countless physical laws constrain all interactions.

Even the most fantastical or creative worlds consist of spatial objects and agents that adhere to physical and dynamic laws.

Consistently coordinating these dimensions—semantic, geometric, dynamical, and physical—within a model requires entirely new methodologies. This is because “the world” is far more complex than language, which is essentially a one-dimensional sequential signal.

To achieve world models with universal spatial intelligence akin to humans, we must overcome several massive technical barriers.

At World Labs, our research team is working on foundational breakthroughs toward this goal.

Below are examples of some current directions in our research:

  • A new general-purpose training objective function: In world model research, a long-term goal is to define a universal objective function as concise and elegant as “next-token prediction” in LLMs. However, the complexity of the input and output spaces for world models makes designing this function more difficult. Although there is still significant room for exploration, this objective function and its corresponding representations must conform to geometric and physical laws, reflecting the nature of world models as grounded representations of both imagination and reality.
  • Large-scale training data: The data required to train world models is far more complex than text. The good news is that we already possess vast data resources. Massive collections of images and videos on the internet provide rich material for training. The challenge is enabling algorithms to extract deeper spatial information from two-dimensional images or video frames (RGB). Research over the past decade has revealed scaling laws regarding data volume and model size in language models; for world models, the key is to build architectures that can effectively utilize visual data at similar scales. Furthermore, the role of high-quality synthetic data and additional modalities (such as depth and touch) should not be underestimated, as they play a supplementary role during critical stages of training. Future developments depend on more advanced sensing systems, more robust signal extraction algorithms, and more powerful neural simulation methods.

  • New model architectures and representation learning: World model research will inevitably drive innovation in model architectures and learning algorithms, particularly moving beyond current multimodal large language models (MLLMs) and video diffusion models. These existing models typically encode data as one- or two-dimensional sequences, which makes simple spatial tasks, such as counting different chairs in a short video or remembering what a room looked like an hour ago, exceptionally difficult. New architectural approaches could address these limitations, such as tokenization with 3D or 4D perception capabilities and enhanced context and memory mechanisms. For example, at World Labs, we recently developed RTFM (Real-Time Generative Frame-based Model). By using spatially-grounded frames as a form of spatial memory, it achieves efficient real-time generation while maintaining the continuity and consistency of the generated world.
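The first direction above holds up “next-token prediction” as the standard of conciseness a world-model objective would need to match. For reference, that LLM objective can be sketched in a few lines: the average negative log-likelihood of each token given what came before. The toy below conditions only on the previous token and uses a hand-written probability table; `next_token_loss` and the table layout are hypothetical names for illustration, and nothing here is a proposed world-model objective.

```python
import math
from typing import Mapping, Sequence


def next_token_loss(
    probs: Mapping[str, Mapping[str, float]], tokens: Sequence[str]
) -> float:
    """Average negative log-likelihood of each token given its predecessor.

    probs[prev][nxt] is the model's predicted probability that `nxt`
    follows `prev` (a toy stand-in for a full language model).
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a prediction")
    pairs = list(zip(tokens[:-1], tokens[1:]))
    return -sum(math.log(probs[prev][nxt]) for prev, nxt in pairs) / len(pairs)


# A model that guesses uniformly versus one that has learned the alternation.
uniform = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
learned = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.9, "b": 0.1}}
sequence = ["a", "b", "a", "b", "a"]

print(next_token_loss(uniform, sequence))
print(next_token_loss(learned, sequence))
```

The whole objective fits in one expression over a one-dimensional sequence, which is the simplicity the bullet refers to; the open question it raises is what the analogous expression looks like when inputs and outputs are spatial rather than sequential.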

Clearly, significant challenges remain before we can fully unlock the potential of spatial intelligence. However, this research is not merely theoretical; it is becoming the core engine for the next generation of creative and productivity tools.

Progress at World Labs is encouraging. We recently demonstrated an early version of Marble to select users—the world’s first world model capable of generating consistent 3D environments from multimodal inputs—allowing users and creators to explore, interact with, and continue building their creative worlds within them. We are working tirelessly to make it available to the public as soon as possible.

Marble is only our first step. As research accelerates, researchers, engineers, users, and business leaders alike are beginning to recognize the immense potential of this direction. Next-generation world models will elevate machine spatial intelligence to an entirely new level, unlocking core capabilities that AI has largely lacked thus far and truly ushering in an era where artificial intelligence understands and creates the world.

Building a Better World for Humanity with World Models

The motivation behind the development of artificial intelligence is crucial. As one of the scientists who helped usher in the modern AI era, my drive has always been clear: AI should augment human capabilities rather than replace them.

For years, I have dedicated myself to ensuring that the development, deployment, and governance of AI align with human needs.

While extreme narratives about “technological utopias” and “doomsday scenarios” are abundant today, I maintain a more pragmatic stance: AI is developed by humans, used by humans, and governed by humans.

It must always respect human autonomy and dignity. Its “magic” lies in expanding our capabilities, making us more creative, more connected, more efficient, and more fulfilled.

Spatial intelligence embodies this vision—an AI that empowers human creators, caregivers, scientists, and dreamers to achieve goals once thought impossible. This belief is the fundamental reason I view spatial intelligence as the next great frontier of AI.

The applications of spatial intelligence span different time horizons. Creative tools are emerging now—World Labs’ Marble already allows creators and storytellers to take hold of this capability firsthand. Robotics represents a medium-term ambition, where we are working to refine the closed loop between perception and action. The most transformative scientific applications may take longer, but they will profoundly enhance human well-being.

Across all these timelines, several areas stand out for their potential to reshape human capabilities. Realizing this potential requires collective efforts far beyond the scope of any single team or company.

It demands the participation of the entire AI ecosystem: researchers, innovators, entrepreneurs, and policymakers, all working toward a shared vision. And that is a vision worth pursuing. Here is what the future looks like:

Creativity: Supercharging Narratives and Immersive Experiences

“Creativity is intelligence having fun.” This is one of my favorite quotes from Albert Einstein.

Long before humans invented writing, we were storytelling—painting stories on cave walls, passing them down through generations, and building cultures around shared narratives. Stories are how humans understand the world, connect across time and space, explore what it means to be human, and find meaning in life and love.

Today, spatial intelligence has the potential to completely transform how we create and experience narratives, giving them deeper impact across entertainment, education, design, and construction.

World Labs’ Marble platform places unprecedented spatial expression and editorial control into the hands of filmmakers, game designers, architects, and various storytellers. They can now rapidly create, iterate, and explore complete 3D worlds without the cumbersome workflows of traditional 3D design software. The act of creation remains a core human activity—AI simply amplifies and accelerates the process of bringing ideas to life. This includes:

  • Multi-dimensional narrative experiences: Filmmakers and game designers can use Marble to build entire worlds, unconstrained by budget or geography, exploring scenes and perspectives impossible within traditional production pipelines. As the boundaries between media and entertainment blur, we are approaching a new form of interactive experience—personalized worlds that blend art, simulation, and gaming, allowing anyone (not just large studios) to create and enter their own stories.
  • Telling spatial stories through design: Virtually all manufactured items or constructed spaces must undergo virtual 3D design before physical realization—a process often fraught with high time and cost expenditures. With spatial intelligence models, architects can visualize and walk through buildings that do not yet exist in minutes; industrial or fashion designers can instantly translate imagination into form, exploring how objects interact with the human body and space.
  • New immersive and interactive experiences: One of the deepest ways humans experience meaning is through the act of creation itself. Throughout human history, we have shared only one 3D world: the physical world. It was not until recent decades, through games and early virtual reality (VR), that we began to glimpse the possibility of “self-made worlds.” Today, spatial intelligence combined with VR, XR (extended reality) headsets, and immersive display devices elevates this experience to unprecedented heights. In the future, “walking into” multi-dimensional worlds will be as natural as opening a book. Spatial intelligence expands the power to build worlds from specialized teams to every creator, educator, and ordinary person with a vision.

Robotics: The Practice of Embodied Intelligence

From insects to humans, animals rely on spatial intelligence to understand, navigate, and interact with the world. Robots will be no exception.

Since the inception of this field, machines with spatial awareness have been a long-held dream, one that my students, collaborators, and I have pursued in our research at Stanford. This is why I am exceptionally excited about realizing this vision using models built by World Labs.

  • Extending robot learning through world models: Advances in robot learning depend on scalable training data solutions. For robots to possess the ability to understand, reason, plan, and interact, they need access to an extremely vast state space. Many researchers believe that combining internet data, synthetic simulation data, and real-world human demonstrations is key to achieving generalizable robots. However, unlike language models, robot training data is currently scarce. World models will play a decisive role here. As their perceptual accuracy and computational efficiency improve, outputs generated by world models will rapidly narrow the gap between simulation and reality, enabling robots to learn across countless states, interactions, and environments.
  • Human-robot collaboration partners: Whether as research assistant robots aiding scientists in labs or home assistants accompanying elderly people living alone, robots can expand the labor force and boost social productivity. But to do so, robots must possess spatial intelligence: the ability to perceive, reason, plan, act, and, most importantly, remain empathetically aligned with human goals and behaviors. For instance, laboratory robots can take over instrument operations from scientists, allowing them to focus on tasks requiring reasoning; home assistant robots can help the elderly cook without stripping away their enjoyment or autonomy. World models with true spatial intelligence, capable of predicting the next state and even inferring the corresponding next action, are key to realizing this vision.

  • Expanded embodied forms: Humanoid robots are merely one form we have created for our own world. The real innovation dividends will come from more diverse designs: nanobots that deliver medication, soft-bodied robots that navigate narrow spaces, and machines engineered for the deep sea or outer space. Regardless of their physical form, future spatial intelligence models must integrate environmental modeling with the robot’s own perception and motion. However, a key challenge in developing these robots lies in the lack of training data across diverse morphologies. World models will play a crucial role in this process by providing support for simulation data, training environments, and evaluation tasks.

A Longer Horizon: Science, Medicine, and Education

Beyond creative applications and robotics, the profound impact of “spatial intelligence” will extend to many other fields that can enhance human capabilities, save lives, and accelerate discovery. Below, I highlight three directions with transformative potential. Of course, the applications of spatial intelligence go far beyond these; its reach spans nearly every industry.

In scientific research, systems equipped with spatial intelligence can simulate experiments, validate hypotheses in parallel, and explore environments inaccessible to humans—from the deep sea to distant planets. This technology has the potential to revolutionize computational modeling in fields such as climate science and materials research. By combining multi-dimensional simulations with real-world data collection, these tools can significantly lower computational barriers and expand the boundaries of what every laboratory can observe and understand.

In healthcare, spatial intelligence will reshape the entire process from the lab bench to the bedside. At Stanford, my students, collaborators, and I have worked for years with hospitals, nursing facilities, and patients in their homes. These experiences have convinced me of the transformative potential of spatial intelligence in medicine. AI can accelerate drug discovery through multi-dimensional modeling and improve diagnostic quality by assisting radiologists in identifying patterns in medical images. It can also support environment-aware monitoring systems, providing continuous support for patients and caregivers without replacing human care. Furthermore, there is immense potential for robots to assist healthcare workers and patients across various settings.

In education, spatial intelligence enables immersive learning, making abstract or complex concepts tangible and creating iterative experiences that align with how the human brain and body learn. In the AI era, faster and more efficient learning and reskilling are crucial for both children and adults. Students can explore cellular machinery in multi-dimensional ways or “experience” historical events firsthand; teachers can use interactive environments to deliver personalized instruction; and professionals such as surgeons and engineers can safely practice complex skills in highly realistic simulation environments.

Across these fields, the possibilities are infinite, but the goal remains consistent: to make AI a force that enhances human expertise, accelerates discovery, and amplifies compassion—rather than replacing human judgment, creativity, and empathy.

Conclusion

Over the past decade, artificial intelligence has become a global phenomenon, bringing about turning points in technology, economics, and even geopolitics.

However, as a researcher, educator, and entrepreneur, what excites me most is still the spirit behind the question posed by Alan Turing seventy-five years ago. I share his curiosity and wonder—this very curiosity drives me every day to explore the challenges of spatial intelligence.

For the first time in human history, we stand at a moment where it is possible to build machines that are highly aligned with the physical world, making them true partners in addressing major challenges.

Whether accelerating disease research, revolutionizing storytelling, or providing support during vulnerable moments of illness, injury, or aging, we are on the threshold of a technological transformation that will elevate the aspects of life we cherish most.

This is a vision for a deeper, richer, and more empowered existence.

Nearly five hundred million years after nature first kindled spatial intelligence in primitive animals, we are fortunate to be the generation of technologists who may soon endow machines with similar capabilities, and privileged to apply those capabilities for the well-being of all humanity.

Without spatial intelligence, our dream of “truly intelligent machines” will remain incomplete.

This exploration is my “North Star.” I invite you to pursue it with me.

Original Link: https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence