SenseTime's 'SenseNova V6.5' Upgrade Marks AI's Leap from 'Tool' to 'Human'

Frontier Models · Published: Jul 29, 2025 · Lin Mei Huang · ~8 min read

Author Info

Multimodal & Media AI Editor

M.F.A. Digital Media (RISD); former VFX pipeline technical director

Lin reports on image, video, and audio models with an eye toward rights, provenance, and creative workflows. She explains technical limits of generative media and highlights platform policy changes that affect commercial use. She collaborates with legal review on copyright-sensitive topics.

#Generative Media #Copyright & Licensing #Creative Workflows #Platform Policy

The ability to perceive and process multimodal information is a core requirement for Artificial General Intelligence (AGI) and a necessary step on the path from language models to AGI.

From multimodal perception and reasoning to interaction, the evolution of multimodal intelligence will drive AI’s next phase of development.

On July 27, 2025, at the WAIC 2025 Large Model Forum titled “Boundless Love · Shaping the Future,” hosted by the Artificial Intelligence Committee of the All-China Federation of Industry and Commerce (ACFIC) and organized by SenseTime, the company unveiled its new SenseNova V6.5 large model system. The release is a major upgrade to SenseTime’s multimodal foundation models and is intended to move AI from being a “productivity tool” to becoming actual “productivity.” SenseTime’s core product, Raccoon (Xiao Huan Xiong, literally “Little Raccoon”), has also completed an agent-based upgrade.

SenseTime's "SenseNova V6.5" Upgrade: Enabling AI's Leap from "Tool" to "Human"

In 1950, Alan Turing proposed the “Imitation Game,” which framed machine intelligence in terms of human-like capabilities. Practical AI, however, has long remained confined to the category of “tools” and has even gone through periods of stagnation. In the era of large models, breakthroughs in multimodal fusion are bringing AI closer to AGI and to genuinely “human-like” standards.

Xu Li, Chairman and CEO of SenseTime and the first rotating chairman of the Presidium of the ACFIC Artificial Intelligence Committee, stated: “SenseTime has always sought to understand the essence of artificial intelligence. By leveraging technological innovation to unlock maximum intelligence, we are driving AI’s transition from a ‘tool’ to a ‘human,’ becoming true productivity.”

SenseNova V6.5: Breakthrough Upgrades That Reach the “Depth of Understanding”

SenseTime’s SenseNova V6.5 multimodal foundation model introduces three major breakthrough upgrades:

Strong Reasoning: Image-text interleaved multimodal chain-of-thought reasoning, with performance comparable to Gemini 2.5 Pro and Claude Sonnet 4.
High Efficiency: An optimized multimodal architecture that more than triples cost-effectiveness.
Agent Capabilities: A significant lead in data analysis, supporting end-to-end deployment in real scenarios and closing the loop on value creation.

By advancing from standard multimodal chain-of-thought data to synthesized image-text interleaved chain-of-thought data, SenseTime’s SenseNova V6.5 has achieved substantial improvements in multimodal reasoning and interaction performance.

SenseTime’s SenseNova V6.5 pioneers image-text interleaved chain-of-thought technology, which brings visual thinking into large models. According to SenseTime, it is the first commercial-grade large model in China to implement this capability.

In human cognition, visual and logical thinking are equally important; their organic integration forms comprehensive thinking abilities. As the saying goes, “a picture is worth a thousand words.” An image often triggers more effective thought than lengthy text. While current mainstream multimodal models have achieved the fusion of multiple modalities at the input stage, their reasoning processes still rely primarily on linguistic inference, leaving gaps in graphical and spatial reasoning.

The key to constructing multimodal chains of thought lies in the graphical representation of information. This is harder than building pure text chains: the model must not only write out its textual reasoning but also generate images that serve as nodes in the reasoning chain, and such data is difficult to produce at scale through manual annotation alone. SenseTime’s R&D team first constructed seed data based on an understanding of the thinking process. After supervised fine-tuning (SFT) on this data, the model acquired an initial ability to think with interleaved text and images. Subsequent rounds of reinforcement learning then significantly strengthened its multimodal reasoning.
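To make the idea of "images as nodes in the reasoning chain" concrete, here is a minimal sketch of how one interleaved training sample might be represented. SenseTime has not published its data format, so every class, field, and file name below is hypothetical and serves illustration only.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextStep:
    """A reasoning step expressed in words."""
    text: str


@dataclass
class ImageStep:
    """A reasoning step expressed as a generated image, such as an annotated diagram."""
    image_ref: str  # hypothetical: path or ID of the intermediate image
    caption: str    # short description of what the image shows


Step = Union[TextStep, ImageStep]


@dataclass
class InterleavedChain:
    """One sample: a question, an interleaved reasoning chain, and a final answer."""
    question: str
    steps: List[Step]
    answer: str

    def modality_pattern(self) -> str:
        """Return one letter per step: 'T' for text, 'I' for image."""
        return "".join("I" if isinstance(s, ImageStep) else "T" for s in self.steps)

    def is_interleaved(self) -> bool:
        """True only if the chain contains at least one text step and one image step."""
        pattern = self.modality_pattern()
        return "T" in pattern and "I" in pattern


# Hypothetical sample: the image step carries part of the reasoning.
sample = InterleavedChain(
    question="Which of the two shaded regions is larger?",
    steps=[
        TextStep("Draw an auxiliary line to split the figure into comparable parts."),
        ImageStep("step1_aux_line.png", "Figure with the auxiliary line added"),
        TextStep("The left part now contains the right part plus one extra triangle."),
    ],
    answer="The left region",
)

print(sample.modality_pattern(), sample.is_interleaved())
```

A check such as `is_interleaved()` is the kind of filter a seed-data pipeline could use to separate text-only chains from chains that actually include visual steps, although the article does not say whether SenseTime applies one.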

At the same time, SenseTime has reworked the fusion architecture of its multimodal models to promote early cross-modal integration. The new architecture pairs a significantly lighter visual encoder with a deep, narrow backbone model. This design lets visual representations align and merge with language in the early stages of the forward pass, resulting in more efficient perception and deeper modal fusion.
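As a rough illustration of what "deep and narrow" trades off, the sketch below compares two backbones under the common approximation that one transformer block holds about 12·d² weights (4·d² for attention and 8·d² for an MLP with 4x expansion), ignoring embeddings, biases, and normalization layers. The depths and widths are hypothetical values chosen to make the arithmetic clean; they are not SenseTime's published configuration.

```python
def block_params(width: int) -> int:
    """Approximate weights in one transformer block: 4*d^2 (attention) + 8*d^2 (MLP)."""
    return 12 * width * width


def backbone_params(depth: int, width: int) -> int:
    """Approximate weights in a stack of `depth` identical blocks."""
    return depth * block_params(width)


# Hypothetical configurations, chosen only to illustrate the trade-off.
wide_shallow = backbone_params(depth=24, width=4096)
deep_narrow = backbone_params(depth=96, width=2048)

print(f"wide/shallow: {wide_shallow:,} parameters over 24 layers")
print(f"deep/narrow:  {deep_narrow:,} parameters over 96 layers")
```

Under this approximation, halving the width while quadrupling the depth keeps the parameter budget unchanged, so a deep, narrow backbone gets more layers in which visual and language features can interact at the same model size.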

Thanks to these architectural improvements, SenseNova V6.5 raises pre-training throughput by over 20%, reinforcement learning efficiency by 40%, and inference throughput by more than 35%, striking a better balance between performance and cost. Compared with SenseNova V6.0, the cost-effectiveness of SenseNova V6.5 has tripled.
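For readers who want to translate the throughput figures into cost terms, the following back-of-the-envelope sketch assumes that hardware cost per hour stays fixed, so cost per unit of work scales as 1 / (1 + gain). That assumption and the function name are ours, not SenseTime's. The reinforcement learning figure is left out because the article does not define "efficiency" as throughput.

```python
def relative_cost(throughput_gain: float) -> float:
    """Cost per unit of work relative to baseline, assuming fixed cost per hour."""
    return 1.0 / (1.0 + throughput_gain)


# Gains as stated in the article; both are lower bounds ("over" / "more than").
gains = {"pre-training": 0.20, "inference": 0.35}

for stage, gain in gains.items():
    cost = relative_cost(gain)
    saving_pct = (1.0 - cost) * 100
    print(f"{stage}: {cost:.2f}x baseline cost ({saving_pct:.0f}% lower)")
```

On these assumptions, a 35% inference throughput gain corresponds to roughly a 26% lower cost per unit of work. The tripled cost-effectiveness figure is a separate claim by SenseTime and is not derived from this arithmetic.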

AI as a Productivity Driver: SenseTime’s Raccoon Agent Takes Center Stage in Office Work

Large language models have become auxiliary tools for many professionals today. However, relying solely on large language models is insufficient to elevate AI from a mere “tool” to an autonomous “agent.”

Human daily tasks inherently involve processing multimodal information, including text, images, video, and web pages. The key transition from a productivity tool to actual productivity lies in the ability to input, process, and output this multimodal data.

Leveraging the multimodal data analysis capabilities of the SenseNova V6.5 model, SenseTime’s Raccoon agent has undergone a comprehensive upgrade. It can now handle complex multimodal inputs, perform deep fused analysis across modalities, and deliver professional visual outputs. This evolution establishes “AI productivity in office scenarios,” enabling AI to leap from being a “productivity tool” to becoming actual “productivity.”