Today, SenseTime officially open-sourced its multimodal autonomous reasoning model, SenseNova-MARS (available in 8B and 32B versions). In core benchmarks for multimodal search and reasoning, it achieved a score of 69.74, surpassing Gemini-3-Pro (69.06) and GPT-5.2 (67.64).
SenseNova-MARS is the first agentic VLM to support dynamic visual reasoning deeply integrated with image-text search. It plans its own steps and invokes tools autonomously, which lets it handle a range of complex tasks and gives AI true “execution capabilities.”
In benchmarks such as MMSearch, HR-MMSearch, FVQA, InfoSeek, SimpleVQA, and LiveVQA, SenseNova-MARS achieved State-of-the-Art (SOTA) results among open-source models and outperformed top closed-source models such as Gemini-3-Pro and GPT-5.2. It leads in two core areas: search reasoning and visual understanding. For more details, please refer to the technical report (https://arxiv.org/abs/2512.24330). Developers and users from all industries are welcome to test and experience the model.
An All-Around Champion: Autonomously Solving Complex Problems
SenseNova-MARS leads across multiple multimodal search evaluations with an average score of 69.74, ahead of Gemini-3-Pro (69.06) and GPT-5.2 (67.64).

On the MMSearch leaderboard (the core evaluation for image-text search), the model topped the charts with a score of 74.27, exceeding GPT-5.2’s 66.08. In HR-MMSearch (high-definition detail search evaluation), it led with 54.43 points, significantly widening the gap with closed-source models.
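For readers who want the gaps as numbers, the short sketch below computes the lead over each competitor from the scores quoted above. The function name `margins` is purely illustrative, and only scores stated in this article are used.

```python
# Scores exactly as quoted in this article (higher is better).
average = {"SenseNova-MARS": 69.74, "Gemini-3-Pro": 69.06, "GPT-5.2": 67.64}
mmsearch = {"SenseNova-MARS": 74.27, "GPT-5.2": 66.08}


def margins(scores, leader="SenseNova-MARS"):
    """Return the leader's lead over every other model, rounded to 2 decimals."""
    return {
        name: round(scores[leader] - score, 2)
        for name, score in scores.items()
        if name != leader
    }


print("Average score lead:", margins(average))
print("MMSearch lead:", margins(mmsearch))
```

The lead on the average score is under one point against Gemini-3-Pro, while the MMSearch lead over GPT-5.2 is roughly eight points.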

The test questions for HR-MMSearch are akin to the “Olympics of the AI world”: they use 305 brand-new 4K ultra-high-definition images from 2025, so the AI cannot rely on outdated knowledge to “cheat.” All questions target details occupying less than 5% of the image, such as small logos, tiny text, or minute objects, which can only be seen clearly with image cropping tools. The tests cover eight major fields, including sports, entertainment and culture, science and technology, business and finance, gaming, academic research, and geography and travel, and 60% of the questions require at least three different tools to answer.
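To make the “less than 5% of the image” criterion concrete, here is a minimal sketch that measures how much of a frame a detail covers. The 3840 x 2160 resolution is an assumption for “4K” (the article gives no pixel sizes), and the helper names and the example box are hypothetical.

```python
# Assumed "4K" resolution; the article does not state exact pixel dimensions.
UHD_4K = (3840, 2160)


def area_fraction(box, image_size=UHD_4K):
    """Fraction of the image covered by box = (left, top, right, bottom) in pixels."""
    left, top, right, bottom = box
    width, height = image_size
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image and have positive area")
    return (right - left) * (bottom - top) / (width * height)


def is_fine_detail(box, image_size=UHD_4K, threshold=0.05):
    """True if the box covers less than `threshold` (5% by default) of the image."""
    return area_fraction(box, image_size) < threshold


# Hypothetical 400 x 300 px logo somewhere near the middle of the frame.
logo = (1700, 900, 2100, 1200)
print(f"logo covers {area_fraction(logo):.2%} of the frame")
print("fine detail:", is_fine_detail(logo))
```

Even a 400 x 300 pixel logo covers well under 2% of such a frame, which is why answering these questions depends on cropping and zooming.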
In short, whether it is a knowledge-intensive task that requires “searching the entire web” or a fine-grained visual analysis demanding “sharp eyes,” SenseNova-MARS is currently the “all-around champion.”
Using Combination Moves to Solve Real-World Scenarios
SenseNova-MARS can be put to practical use in daily life and at work, solving problems that require “multi-step reasoning + multi-tool collaboration.”
Traditional AI tool invocation is often limited: it can either search text or view images. When faced with complex tasks requiring “zooming in on details first, identifying objects second, and checking background information last,” these models are often at a loss.

Faced with the complex task of “identifying a tiny logo on racing gear + querying the company’s founding year + matching the driver’s birth date + calculating the difference,” SenseNova-MARS can autonomously invoke image cropping and text/image search tools, completing the closed-loop solution without human intervention.
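The chain of steps in this example can be sketched as follows. This is only an illustration under stated assumptions: the three tool functions are hypothetical stubs, every returned value (company, driver, years) is made-up placeholder data, and birth year stands in for birth date to keep the arithmetic simple. It is not the model's actual tool interface.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Records every tool call so the reasoning chain can be inspected."""
    steps: list = field(default_factory=list)

    def call(self, tool, arg):
        result = tool(arg)
        self.steps.append((tool.__name__, arg, result))
        return result


# --- Hypothetical stub tools; all return values are made-up placeholders ---
def image_crop(request):
    image, box = request
    return f"{image}[crop={box}]"


def image_search(image):
    return {"logo": "ExampleCorp", "driver": "Example Driver"}


def text_search(query):
    fake_index = {
        "ExampleCorp founding year": 1950,
        "Example Driver birth year": 1985,
    }
    return fake_index[query]


def solve(image, logo_box):
    """Crop, identify, look up two facts, then compute the difference."""
    trace = Trace()
    crop = trace.call(image_crop, (image, logo_box))
    entities = trace.call(image_search, crop)
    founded = trace.call(text_search, f"{entities['logo']} founding year")
    born = trace.call(text_search, f"{entities['driver']} birth year")
    return born - founded, trace


answer, trace = solve("race_photo.jpg", (1700, 900, 2100, 1200))
print("difference in years:", answer)
for name, arg, result in trace.steps:
    print(f"{name}({arg!r}) -> {result!r}")
```

In this sketch the order of calls is fixed by hand; the point of the model is that it chooses the tools and their order on its own.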

SenseNova-MARS can identify corporate logos from photos of products and industry summits, quickly gathering information about products and companies, as well as details such as time, quantity, and parameters. This assists in analyzing industry conditions and landscapes.

From race photos, SenseNova-MARS can identify logos, people, and other information within the frame, tracing background information about the competition or personnel to help quickly supplement important details.

SenseNova-MARS can also handle ultra-long-step multimodal reasoning tasks that involve more than three tool invocations. It automatically crops and analyzes details, searches for relevant research data, quickly validates hypotheses, and draws key conclusions.
With this capability of “autonomous thinking + multi-tool collaboration,” SenseNova-MARS can autonomously solve complex tasks involving “detail recognition + information retrieval + logical reasoning,” helping to improve work efficiency. It relies on three core tools:
- Image Cropping: Can precisely focus on minute details in images. Even details occupying less than 5%—such as a tiny logo on a racer’s suit or slogans in the stands of race photos—can be clearly analyzed by cropping and zooming in.
- Image Search: Automatically matches relevant information the moment it sees an object, person, or scene—for example, identifying a racer’s identity or the model number of a niche device.
- Text Search: Quickly captures precise information. Whether it is a company’s founding year, a person’s birth date, or the latest industry data, it can be retrieved in seconds.
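As an illustration of the image cropping tool in the list above, the sketch below turns a detected detail box into a padded crop window and estimates how much the detail is magnified when the crop is shown at full frame size. The padding ratio, the function names, and the 3840 x 2160 frame are assumptions for the example, not parameters of the released model.

```python
def crop_window(box, image_size, pad=0.25):
    """Expand box = (left, top, right, bottom) by `pad` of its size per side,
    clamped so the window never leaves the image."""
    left, top, right, bottom = box
    width, height = image_size
    dx = int((right - left) * pad)
    dy = int((bottom - top) * pad)
    return (
        max(0, left - dx),
        max(0, top - dy),
        min(width, right + dx),
        min(height, bottom + dy),
    )


def zoom_factor(window, image_size):
    """Magnification when the window is scaled to fit the full frame
    (the tighter of the two axes limits the zoom)."""
    left, top, right, bottom = window
    width, height = image_size
    return min(width / (right - left), height / (bottom - top))


frame = (3840, 2160)             # assumed 4K frame
logo = (1700, 900, 2100, 1200)   # hypothetical 400 x 300 px detail
window = crop_window(logo, frame)
print("crop window:", window)
print(f"zoom: {zoom_factor(window, frame):.1f}x")
```

The resulting tuple follows the (left, upper, right, lower) box convention used by Pillow's `Image.crop`, so it could be passed to that method directly.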
Learning from Practice: Forming “Intuition” and “Experience”
SenseNova-MARS adopts a training method of “teaching according to aptitude,” carried out in two phases.
Phase 1: Building Foundations.
To address the scarcity of training data for cross-modal multi-hop search reasoning, the team built an automated data synthesis engine based on multimodal agents. Using fine-grained visual anchors and multi-hop deep associative retrieval, the engine dynamically mines and links relationships between web entities and automatically constructs highly complex multi-hop reasoning chains. Closed-loop self-consistency verification then removes hallucinated data, yielding multi-hop search Q&A data with rigorous logical chains and high knowledge density.
The teaching materials are carefully selected high-difficulty cases, each annotated with which tools to use and in what order, so that the AI first learns basic “detective logic.” These cases are the “tough nuts” picked out of massive datasets, ensuring the AI encounters realistic, complex scenarios from the start.
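One way to picture the verification step is a filter that re-derives each synthesized answer from the retrieved evidence and discards samples where the two disagree. The sketch below is an assumption about how such a check could look, not the engine described in the technical report; the data layout, function names, and all entity values are hypothetical.

```python
def follow_chain(start, relations, evidence):
    """Walk a multi-hop chain through evidence keyed by (entity, relation).
    Returns None if any hop is not supported by the evidence."""
    node = start
    for relation in relations:
        node = evidence.get((node, relation))
        if node is None:
            return None
    return node


def is_consistent(sample, evidence):
    """Keep a sample only if its answer can be re-derived from the evidence."""
    derived = follow_chain(sample["start"], sample["relations"], evidence)
    return derived is not None and derived == sample["answer"]


# Made-up evidence and synthesized samples, for illustration only.
evidence = {
    ("logo_A", "belongs_to"): "ExampleCorp",
    ("ExampleCorp", "founded_in"): 1950,
}
samples = [
    {"start": "logo_A", "relations": ["belongs_to", "founded_in"], "answer": 1950},
    {"start": "logo_A", "relations": ["belongs_to", "founded_in"], "answer": 1960},  # hallucinated
    {"start": "logo_A", "relations": ["belongs_to", "headquartered_in"], "answer": "Nowhere"},  # unsupported hop
]

kept = [s for s in samples if is_consistent(s, evidence)]
print(f"kept {len(kept)} of {len(samples)} synthesized samples")
```

Only the sample whose answer is fully supported by the two-hop chain survives; the wrong year and the unsupported relation are both dropped.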
Phase 2: Practicing for Combat.
This phase uses reinforcement learning, much like a detective accumulating experience by solving case after case. The AI receives a reward for every correct decision (e.g., choosing the right tool or step) and adjusts its strategy when it makes mistakes. To keep training from drifting, the research team added a “stabilizer,” the BN-GSPO algorithm, which ensures steady progress on both simple and complex problems and prevents the model from excelling at some task types while neglecting others. Its dual-stage normalization smooths the optimization fluctuations caused by the widely varying return distributions of dynamic tool calls and keeps the distribution of learning signals consistent, addressing the convergence challenges of cross-modal, multi-step, multi-tool agent training.
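The article does not spell out the normalization formula, so the sketch below shows only one plausible reading of “dual-stage normalization”: rewards are first centered within each group of rollouts for the same question, then scaled by the spread across the whole batch. Both stages, the function name, and the example rewards are assumptions; the exact BN-GSPO formulation is in the technical report and may differ.

```python
from statistics import fmean, pstdev


def dual_stage_advantages(groups, eps=1e-6):
    """Assumed two-stage scheme, not the published algorithm.

    groups: list of non-empty reward lists, one list per question
            (several rollouts each).
    Stage 1: subtract each group's own mean reward.
    Stage 2: divide by the standard deviation of the centered values
             over the whole batch.
    """
    centered = [[r - fmean(group) for r in group] for group in groups]
    flat = [a for group in centered for a in group]
    scale = pstdev(flat) + eps  # eps avoids division by zero
    return [[a / scale for a in group] for group in centered]


# Made-up rewards: an easier question (mostly solved) and a harder one.
rewards = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
]
advantages = dual_stage_advantages(rewards)
for group in advantages:
    print([round(a, 3) for a in group])
```

Under this reading, easy and hard questions contribute learning signals on the same scale, which matches the article's description of keeping the signal distribution consistent.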
Through this training, the AI not only learns to use tools but also cultivates “tool-use intuition”—knowing which tools to use under what circumstances and how to organically combine results from different tools.
Full Open Source: Model, Code, and Data
SenseTime’s SenseNova-MARS model, code, and datasets are fully open-sourced, supporting direct download via Hugging Face.
GitHub Repository:
https://github.com/OpenSenseNova/SenseNova-MARS
Model Repositories:
32B:
https://huggingface.co/sensenova/SenseNova-MARS-32B
8B:
https://huggingface.co/sensenova/SenseNova-MARS-8B
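A minimal way to fetch either checkpoint is the `snapshot_download` function from the `huggingface_hub` package. The repository IDs below are taken from the URLs above; the wrapper function `download_model` is just an illustrative helper, and loading or running the model should follow the instructions in the GitHub repository.

```python
# Repository IDs taken from the Hugging Face URLs listed above.
REPO_IDS = {
    "8B": "sensenova/SenseNova-MARS-8B",
    "32B": "sensenova/SenseNova-MARS-32B",
}


def download_model(size="8B", local_dir=None):
    """Download all files of the chosen checkpoint and return the local path.

    Requires `pip install huggingface_hub`; the import is kept inside the
    function so the mapping above can be used without the package installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_IDS[size], local_dir=local_dir)


# Example (downloads many gigabytes, so it is left commented out):
# path = download_model("8B")
# print(path)
```

By default the files land in the local Hugging Face cache; passing `local_dir` places them in a directory of your choice instead.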
References
Technical Report:
https://arxiv.org/abs/2512.24330