“Draw while watching, think while drawing”: teaching large language models to think spatially has set a new state of the art (SOTA) on spatial reasoning tasks.
The model is ViLaSR-7B, an open-source model jointly developed by the Natural Language Group at Ant Group’s Technology Research Institute, the Institute of Automation, Chinese Academy of Sciences (CASIA), and The Chinese University of Hong Kong (CUHK).

It achieved an average improvement of 18.4% across five benchmarks, including maze navigation, static image understanding, and video spatial reasoning.
Notably, on VSI-Bench, proposed by renowned scholar Fei-Fei Li and colleagues, it scored 45.4%, comparable to Gemini-1.5-Pro and ahead of existing methods.

△ Main experimental results
More importantly, extensive case studies demonstrate that the model has indeed mastered spatial reasoning strategies and reflective capabilities similar to humans, marking a significant step toward true visual intelligence.
The team designed a three-stage training framework to cultivate this reasoning capability:
- Cold-start training establishes foundational visual manipulation skills.
- Reflective rejection sampling filters for high-quality reasoning paths.
- Reinforcement learning directly optimizes task objectives.
Let’s take a closer look.
Two Reasoning Paradigms
Following breakthroughs in text-based tasks, visual reasoning has become a major hotspot in machine intelligence. Visual reasoning refers to the ability of machines to perform visual understanding and logical judgment by analyzing objects, scene layouts, and spatial relationships within single or multiple (sequential) images, much like humans do.
In April this year, OpenAI’s o3 and o4-mini models achieved significant breakthroughs in visual reasoning. These two models adopt a “Thinking with Images” paradigm, actively manipulating images (such as cropping, scaling, rotating, etc.) during the text-based reasoning process and feeding the manipulated images back into the model for subsequent steps. In multiple visual reasoning benchmarks such as MMMU, the o3 model significantly outperformed previous best results, demonstrating the immense potential of this paradigm.

△ Two Visual Reasoning Paradigms
Why does visual reasoning require “Thinking with Images”?
Unlike o3/o4-mini, traditional Large Vision-Language Models (LVLMs) often employ a “Vision-to-Text” reasoning paradigm. This approach treats image information merely as auxiliary input: it compresses the data via a vision encoder into token sequences aligned with language space, which are then handed over to an LLM for pure text-based reasoning.
Although a paper titled The Platonic Representation Hypothesis, liked by Ilya Sutskever in June last year, posits that visual and linguistic representations naturally converge as model scale increases, practical implementation faces numerous challenges with this alignment.
On one hand, due to limitations in training data and the capabilities of vision encoders, this compression and alignment process inevitably loses a significant amount of critical detail and spatiotemporal information. Once lost during the initial alignment phase, this information cannot be recovered during subsequent pure-text reasoning.
On the other hand, visual data often contains substantial background details irrelevant to the task, particularly in multi-frame scenarios like videos where redundancy is high. Blindly increasing model size to preserve more information not only consumes vast computational resources processing irrelevant data but may also cause the model to over-focus on noise, thereby degrading reasoning performance.
As illustrated, the limitations of the “Vision-to-Text” reasoning paradigm are particularly evident in specific tasks—such as confusing directions during maze navigation or struggling to establish spatiotemporal associations between objects in multi-view reasoning.

△ Limitations of “Vision-to-Text” Reasoning
Currently, visual reasoning is undergoing a paradigm shift from “Vision-to-Text” to “Thinking with Images.”
In fact, “Thinking with Images” is not a brand-new concept.
For instance, the CVPR 2023 Best Paper VisProg proposed a training-free prompting method that allows large models to generate Python programs to call vision tools, embodying this philosophy of thinking with images. Ant Group’s Technology Research Institute also pioneered the work VisualReasoner, presented at EMNLP 2024, which proactively introduced visual operations during reasoning. By editing and generating new visual cues, it enhanced the model’s perceptual capabilities. Crucially, this work designed a data synthesis method capable of automatically generating large volumes of training data containing multi-step visual reasoning processes, achieving for the first time the native injection of such reasoning capabilities into model parameters.
These explorations have opened new directions for addressing information loss issues inherent in traditional vision-to-text conversion paradigms.

△ Comparison of Two Reasoning Paradigms
Under the broader framework of “Thinking with Images,” the Natural Language Group at Ant Group’s Technology Research Institute, in collaboration with CASIA and CUHK, focused on spatial reasoning problems in video or multi-image scenarios. They aimed to address current shortcomings in visual reasoning work, such as insufficient enhancement of spatial relationships and limited cross-frame tracking capabilities.
To this end, the team open-sourced the ViLaSR-7B (Vision-Language Model for Spatial Reasoning) model. Through an innovative “Drawing to Reason in Space” paradigm, ViLaSR-7B enables LVLMs to “draw while thinking” like humans: by drawing auxiliary annotations (such as reference lines and bounding boxes) in visual space, it guides the vision encoder to capture key spatial relationships. This preserves richer spatial information within the visual token embeddings, effectively mitigating the information loss associated with traditional “Vision-to-Text” reasoning paradigms. This interactive visual reasoning approach simulates the human thought process when solving spatial problems, enhancing the model’s spatial perception capabilities.

△ Example of “Drawing to Reason in Space”
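The drawing operations mentioned above (bounding boxes and reference lines) can be illustrated with a minimal sketch. It assumes a grayscale image stored as a plain list of pixel rows; the helper names `blank_image`, `draw_bbox`, and `draw_line` are hypothetical and are not taken from the ViLaSR code base, which may draw annotations quite differently.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of pixel intensities


def blank_image(width: int, height: int, fill: int = 0) -> Image:
    """Create a width x height image filled with one intensity."""
    return [[fill] * width for _ in range(height)]


def draw_bbox(img: Image, x0: int, y0: int, x1: int, y1: int, value: int = 255) -> None:
    """Draw the outline of an axis-aligned box (inclusive corners), in place."""
    for x in range(x0, x1 + 1):
        img[y0][x] = value  # top edge
        img[y1][x] = value  # bottom edge
    for y in range(y0, y1 + 1):
        img[y][x0] = value  # left edge
        img[y][x1] = value  # right edge


def draw_line(img: Image, x0: int, y0: int, x1: int, y1: int, value: int = 255) -> None:
    """Draw a straight reference line with Bresenham's algorithm, in place."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        img[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


# Usage: box an object, then connect two points with a reference line.
frame = blank_image(8, 8)
draw_bbox(frame, 1, 1, 4, 3)
draw_line(frame, 0, 7, 7, 7)
```

In the paradigm described here, the annotated image would be fed back to the model so that later reasoning steps see the marks; this sketch covers only the drawing itself.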
Technical Solution: Drawing to Reason in Space
This framework allows the model to manipulate single or multiple images at each reasoning step. By selecting key frames, performing cross-frame comparisons, and drawing bounding boxes and auxiliary lines, it constructs visual cues that focus on specific spatial regions and dynamically track changes across different images.
Unlike existing methods that rely on external specialized cognitive tools or are limited to observing local details, this approach not only maintains the model’s original…
enhanced visual reasoning capabilities, further supporting coherent spatial reasoning in multi-image scenarios. It continuously updates and optimizes its holistic understanding of spatial states, truly realizing the cognitive process of “drawing while thinking, thinking while drawing.” This mechanism demonstrates significant advantages when handling complex spatial reasoning tasks that require multiple steps and long sequences, not only improving reasoning efficiency but also enhancing the interpretability and controllability of the results.
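The interleaving of reasoning steps and drawing operations can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not the ViLaSR implementation: `Step`, `State`, `reason_with_drawing`, and `scripted_policy` are hypothetical names, frames are placeholder strings, and the scripted policy stands in for the model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    thought: str                   # textual reasoning for this step
    action: Optional[dict] = None  # drawing operation to apply, if any
    answer: Optional[str] = None   # final answer, if the step ends the episode


@dataclass
class State:
    question: str
    frames: List[str]              # placeholder frame identifiers
    annotations: List[dict] = field(default_factory=list)
    trace: List[Step] = field(default_factory=list)


def reason_with_drawing(state: State,
                        policy: Callable[[State], Step],
                        max_steps: int = 8) -> Optional[str]:
    """Alternate between thinking and drawing until an answer is produced."""
    for _ in range(max_steps):
        step = policy(state)       # the policy sees all annotations so far
        state.trace.append(step)
        if step.answer is not None:
            return step.answer
        if step.action is not None:
            state.annotations.append(step.action)
    return None                    # step budget exhausted without an answer


def scripted_policy(state: State) -> Step:
    """Toy stand-in for the model: mark one object per frame, then answer."""
    n = len(state.annotations)
    if n < len(state.frames):
        return Step(thought=f"Mark the chair in frame {n}",
                    action={"op": "bbox", "frame": n, "box": (10, 20, 50, 80)})
    return Step(thought="Compare the marked boxes across frames", answer="left")


state = State(question="Is the chair left or right of the table?",
              frames=["frame_0", "frame_1"])
answer = reason_with_drawing(state, scripted_policy)
```

The point of the structure is that every drawing action updates a shared state that later steps can read, which is what allows marks made in one frame to inform reasoning about another.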
Three-Stage Training Framework: Systematically Cultivating Spatial Reasoning Abilities
To effectively improve the performance of Vision-Language Models (VLMs) on spatial reasoning tasks, ViLaSR employs a systematic three-stage training framework. This framework aims to gradually cultivate the model’s spatial understanding and reasoning capabilities from scratch, enabling it to perform multi-step, in-depth spatial analysis through “drawing-assisted thinking,” much like humans do.
Stage 1: Cold-Start Training
The first step of training is to establish the model’s basic cognitive abilities regarding visual space. The research team constructed initial visual reasoning paths using synthetic data and trained the model to execute basic drawing operations via supervised learning, such as annotating bounding boxes and drawing auxiliary lines. These operations lay the foundation for subsequent complex reasoning.
Stage 2: Reflective Rejection Sampling
The goal of the second stage is to strengthen the model’s self-correction and reflective capabilities. This stage introduces a reflective rejection sampling mechanism: the model generates multiple reasoning paths, and only high-quality samples that demonstrate reflective behaviors (such as revising a bounding box or an auxiliary line) are kept for further training. This encourages the model to proactively identify and adjust uncertain or erroneous reasoning paths, refining its solutions based on feedback.
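A minimal sketch of such a filter follows. It rests on two assumptions that the article does not spell out: "high-quality" is approximated by a correct final answer, and "reflective behavior" is detected as the same target being annotated again with different coordinates. `Path`, `shows_reflection`, and `reflective_rejection_sample` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Path:
    ops: List[dict]  # drawing ops, each with "op", "target", and "coords"
    answer: str


def shows_reflection(path: Path) -> bool:
    """True if some annotation is later redrawn with different coordinates."""
    seen: Dict[Tuple[str, str], tuple] = {}
    for op in path.ops:
        key = (op["op"], op["target"])
        if key in seen and seen[key] != op["coords"]:
            return True            # same target, revised coordinates
        seen[key] = op["coords"]
    return False


def reflective_rejection_sample(paths: List[Path], gold: str) -> List[Path]:
    """Keep only paths that are correct (assumed proxy) and show a revision."""
    return [p for p in paths if p.answer == gold and shows_reflection(p)]


revised = Path(ops=[{"op": "bbox", "target": "sofa", "coords": (0, 0, 10, 10)},
                    {"op": "bbox", "target": "sofa", "coords": (2, 1, 12, 11)}],
               answer="B")
direct = Path(ops=[{"op": "bbox", "target": "sofa", "coords": (2, 1, 12, 11)}],
              answer="B")
wrong = Path(ops=[{"op": "line", "target": "wall", "coords": (0, 0, 5, 5)},
                  {"op": "line", "target": "wall", "coords": (0, 0, 9, 9)}],
             answer="C")

kept = reflective_rejection_sample([revised, direct, wrong], gold="B")
```

Only `revised` survives here: `direct` is correct but never revises anything, and `wrong` revises a line but reaches the wrong answer.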
Stage 3: Reinforcement Learning
The final stage uses reinforcement learning to further optimize the model’s overall reasoning ability and the efficiency of its drawing operations. Two reward functions guide training: a result reward for answer accuracy, and a format reward for the logical consistency and well-formedness of the reasoning process. The format reward is granted only when the result reward exceeds a threshold (set here to 0), so that the model prioritizes correct results rather than merely optimizing for format compliance. The objective is for the model to choose good reasoning paths on its own across different tasks and to use drawing tools judiciously, avoiding redundant operations. This stage improves the model’s final performance and also its adaptability across spatial reasoning scenarios.
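The gating rule for the format reward can be written down in a few lines. The gate itself (format reward only when the result reward exceeds a threshold of 0) comes from the description above; combining the two rewards by simple addition, and the function name `total_reward`, are assumptions made for illustration.

```python
def total_reward(result_reward: float,
                 format_reward: float,
                 threshold: float = 0.0) -> float:
    """Combine result and format rewards with a correctness gate.

    The format reward counts only when the result reward is strictly
    above the threshold. Summing the two terms is an assumption.
    """
    if result_reward > threshold:
        return result_reward + format_reward
    return result_reward


# A well-formatted but incorrect response earns nothing for its format.
wrong_but_tidy = total_reward(result_reward=0.0, format_reward=0.5)
# A correct, well-formatted response earns both terms.
right_and_tidy = total_reward(result_reward=1.0, format_reward=0.5)
```

Under this gate, a policy cannot raise its reward by producing well-formed reasoning that leads to a wrong answer, which matches the stated intent of prioritizing correct results.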
Experimental Results
1. ViLaSR Demonstrates Excellent Performance Across Multiple Spatial Reasoning Benchmarks
ViLaSR-7B achieved an average improvement of 18.4% across five major spatial reasoning benchmarks: maze navigation (Maze), static image understanding (SpatialEval-Real), video spatial reasoning (VSI-Bench), and multi-image spatial reasoning (SPAR-Bench and MMSI-Bench).
This significant improvement indicates that introducing the image-assisted thinking mechanism substantially enhances the model’s generalization and spatial reasoning capabilities across various task types, making it more adaptable than pure text-based reasoning.
Notably, on VSI-Bench—one of the most challenging benchmarks for visual-spatial understanding—ViLaSR-7B achieved an average accuracy of 45.4%, significantly outperforming Qwen2.5-VL-7B (+12.7%).
2. Reflective Rejection Sampling Enhances Self-Correction; Reinforcement Learning Optimizes Drawing Efficiency

△ Ablation study. Scores represent the percentage relative increase in key behaviors compared to the complete ViLaSR model.
Ablation experiments revealed that the cold-start stage first helps the model master the “drawing-assisted thinking” capability; removing the reflective rejection sampling stage leads to a significant reduction in reflective behaviors, reasoning steps, and drawing operations. This indicates that the reflective rejection sampling mechanism plays a crucial role in the model’s self-identification and correction when facing erroneous paths.
Furthermore, compared to ViLaSR-7B, the version without reinforcement learning showed performance degradation across most sub-tasks, accompanied by a surge in the frequency of drawing/auxiliary line usage (+159.4% / +9.1%), indicating that reinforcement learning helps learn more refined operational strategies.
The performance drop was more pronounced for numerical tasks compared to multiple-choice tasks (-9.21% vs. -4.07%), validating that the dense rewards provided by reinforcement learning are more effective in promoting precise spatial reasoning, offering advantages over supervised fine-tuning alone.
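For reference, the relative changes quoted in the ablation are percentages measured against the complete ViLaSR model. The sketch below shows that calculation; the function name `relative_change_pct` and the example counts are hypothetical and are not figures from the paper.

```python
def relative_change_pct(ablated: float, full: float) -> float:
    """Percentage change of an ablated variant relative to the full model."""
    if full == 0:
        raise ValueError("the full-model value must be non-zero")
    return (ablated - full) / full * 100.0


# Hypothetical counts: the full model draws 2.0 boxes per sample on average,
# while an ablated variant draws 5.0.
extra_drawing = relative_change_pct(ablated=5.0, full=2.0)
```

A positive value means the ablated variant exhibits the behavior more often than the full model, and a negative value means it exhibits it less often.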
3. ViLaSR Exhibits Human-Like Spatial Reasoning Strategies
In-depth case studies indicate that ViLaSR-7B not only surpasses existing methods in performance but also exhibits human-like spatial reasoning strategies. As shown below, the model has mastered the following key capabilities:
1. Reference-Based Metric Reasoning:
In a task measuring the size of a telephone, the model demonstrated mature reference-based reasoning abilities. It first recognized that relying solely on pixel measurements would not yield accurate results, then proactively sought out a reference object with known dimensions (a monitor), and finally calculated the actual size of the phone through proportional conversion. This reasoning approach is highly consistent with how humans solve practical measurement problems.

△ Example of reference-based metric reasoning
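The proportional conversion in this example amounts to a single ratio. The sketch below assumes that the target and the reference object lie at a similar distance from the camera, so that one pixel covers roughly the same physical length on both; the function name `estimate_length_cm` and all numbers are hypothetical.

```python
def estimate_length_cm(target_px: float,
                       reference_px: float,
                       reference_cm: float) -> float:
    """Estimate a real-world length from a reference object of known size."""
    if reference_px <= 0:
        raise ValueError("the reference must span a positive number of pixels")
    cm_per_pixel = reference_cm / reference_px  # scale given by the reference
    return target_px * cm_per_pixel


# Hypothetical measurements: a monitor known to be 60 cm wide spans 300 px,
# and the telephone spans 100 px in the same image.
phone_cm = estimate_length_cm(target_px=100, reference_px=300, reference_cm=60)
```

If the two objects sit at clearly different depths, the ratio no longer holds and the estimate would need a depth correction, which is why choosing a nearby reference matters.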
2. Systematic Cross-Frame Object Tracking:
When faced with tasks requiring the understanding of relative object positions across multiple frames, the model adopted a systematic annotation strategy—marking the positions of identical objects in different frames and establishing spatial and temporal associations between them through these markers. This method not only ensures reasoning accuracy but also improves result interpretability.

△ Example of systematic cross-frame object tracking
This study focuses on spatial reasoning tasks, integrating drawing operations with multimodal reasoning through the “Drawing to Reason in Space” paradigm. This enables models to “draw while thinking” in visual space, so they can understand and reason about complex spatiotemporal relationships more effectively, which significantly improves large models’ spatial perception as well as the interpretability and controllability of their reasoning. The paradigm lays a foundation for spatial intelligence in fields such as robot navigation and virtual assistants, and should help drive multimodal reasoning toward greater generality and efficiency.
The first author of this work is Wu Junfei, a Ph.D. student at the Institute of Automation, Chinese Academy of Sciences, currently interning at Ant Group’s Technology Research Institute. Guan Jian, Associate Research Fellow at Ant Group’s Technology Research Institute, is the co-first author.
Paper Link: https://arxiv.org/abs/2506.09965
Code Repository: https://github.com/AntResearchNLP/ViLaSR