Multimodal Models Learn 'On-Demand Search,' Reducing Queries by 30% While Improving Accuracy: ByteDance & NTU Study Optimizes Search Strategy

Author Info

David Kowalski

Developer Tools & Agents Editor

15+ years software engineering; maintainer of internal agent-evaluation playbooks

David tests coding agents, IDE integrations, and terminal workflows the way working teams use them. He documents prompts, environment pins, and regression cases so readers can compare tools fairly. When vendors sponsor access, he discloses it and keeps scoring criteria unchanged.

#Coding Agents #IDE Integrations #Developer Productivity #Tool Comparisons

Full author profile →

Multimodal Models Learn to “Search on Demand”!

ByteDance and NTU’s latest research optimizes multimodal model search strategies:

By building web search tools, constructing a multimodal search dataset, and designing simple yet effective reward mechanisms, this study makes the first attempt at end-to-end reinforcement learning-based autonomous search training for multimodal models.

The trained model can autonomously determine when to search, what to search for, and how to process search results, executing multi-turn on-demand searches in real-world internet environments.

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

Experimental results show that in knowledge-intensive Visual Question Answering (VQA) tasks, the MMSearch-R1 system demonstrates significant advantages:

Its performance not only surpasses that of similarly sized models using traditional Retrieval-Augmented Generation (RAG) workflows but also achieves the performance level of larger-scale models performing traditional RAG, while reducing search queries by approximately 30%.

The following section provides a detailed analysis of the research methodology and experimental findings.

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

How Was This Achieved?

In recent years, with dual improvements in the scale and quality of vision-language training datasets, Large Multimodal Models (LMMs) have demonstrated exceptional performance in cross-modal understanding tasks, significantly enhancing their ability to align textual and visual knowledge.

However, real-world information is highly dynamic and complex. Relying solely on expanding training data size for knowledge acquisition has inherent limitations: it struggles to cover long-tail distribution knowledge, cannot access new information post-training cutoff dates, and fails to reach private domain information resources.

These limitations lead to hallucinations in practical applications, severely restricting the reliability of model deployment across broad real-world scenarios.

In this context, web search, as a core pathway for humans to acquire new knowledge, is viewed as an important tool for expanding model capabilities and has received significant attention from academia.

How to enable multimodal models to possess autonomous and precise external information retrieval capabilities, thereby achieving accurate question answering, has become a key challenge in current research.

Therefore, the MMSearch-R1 project, jointly conducted by ByteDance and Nanyang Technological University (NTU) S-Lab, explores solutions to this challenge.

Below is a detailed look at the research methodology.

Reinforcement Learning Training with Integrated Multi-Turn Search

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

1. Multimodal Search Tools

MMSearch-R1 integrates two tools: image search and text search, to meet the needs of visual question answering tasks. The image search tool is based on Google Lens, supporting searches for web page titles and main thumbnails that match the user’s image visual appearance, helping the model accurately identify important visual elements.

The text search tool consists of a pipeline comprising Google Search, JINA Reader, and a language model for summarizing web content. It supports searching for web pages most relevant to the model-generated search queries along with their content summaries, helping the model precisely locate required textual knowledge and information.

2. Multi-Turn Search Reinforcement Learning Training

MMSearch-R1 employs Group Relative Policy Optimization (GRPO) as its reinforcement learning algorithm for model training. Based on the veRL framework, it implements a Rollout process integrating multi-turn dialogue and search. In each turn of dialogue, the model first engages in reasoning and executes optional actions, such as invoking multimodal search tools to interact with the real internet or providing a final answer.

3. Reward Function with Search Penalty

The reward function for MMSearch-R1 is composed of accuracy scores and format scores, combined via weighted summation with weights of 0.9 and 0.1, respectively. These measure whether the model accurately answered the user’s question (exact string matching between the model’s answer and the ground truth) and adhered to the prescribed response format.

To incentivize the model to prioritize using its own knowledge for answering, responses that rely on search tools to arrive at the correct answer are penalized (search penalty factor is 0.1). The final reward function is:

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

Constructing a Balanced Multimodal Image QA Dataset for Search Needs

To effectively train models to achieve intelligent on-demand search capabilities, the researchers carefully constructed the FactualVQA (FVQA) dataset, comprising training and test sets. The construction of this dataset adopted a meticulously designed semi-automated process, focusing primarily on Q&A scenarios requiring rich support from both visual and textual knowledge.

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

1. Data Collection

The team first performed multi-level sampling based on the metadata distribution of MetaCLIP, ensuring coverage of diverse visual concepts ranging from high-frequency to long-tail categories. They then searched the internet for images most relevant to these visual concepts and generated factual Q&A pairs using GPT-4o.

To enhance the textual knowledge dimension of the dataset, the team also selected representative Q&A samples from the InfoSeek training set for supplementation. To ensure data quality closely mirrored real-world application scenarios, FVQA included an additional 800 annotated Q&A sample pairs labeled by human annotators.

2. Data Balancing

After completing initial data collection, a coarsely trained model was used to classify existing samples and check the necessity of searching for each piece of data. The final training dataset contains approximately 3,400 samples requiring search and 1,600 samples that do not require search.

How Did the Experiments Perform?

MMSearch-R1-7B was trained based on the Qwen2.5-VL-7B model.

In knowledge-intensive VQA tasks such as FVQA-test and InfoSeek, MMSearch-R1-7B’s average accuracy was approximately 3% higher than that of traditional RAG baselines for similarly sized models, while its search ratio decreased by 32.9%. It also performed comparably to the RAG baseline of a 32B model.

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

After reinforcement learning training, the model improved its ability to optimize search content and process search results (left figure below: the RL-trained model’s performance in executing RAG workflows is superior to that of the original model), while also enhancing its ability to mine and utilize its inherent knowledge (right figure below: the model increased the ratio of correctly answering questions without searching).

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

Reinforcement learning demonstrated greater potential than supervised fine-tuning, achieving larger performance gains with fewer training samples across all tasks (left figure below).

It also proved that balancing the data search ratio and incorporating a search penalty mechanism in the reward function helps shape the model’s on-demand search behavior during training (right figure below).

Multimodal Models Learn "Search on Demand," Reducing Searches by 30% While Improving Accuracy: ByteDance & NTU New Research Optimizes Multimodal Model Search Strategies

In summary, MMSearch-R1 is an innovative framework based on reinforcement learning that empowers large multimodal models to perform intelligent on-demand searches in real-world internet environments. This framework enables models to autonomously identify knowledge boundaries and subsequently choose image or text search methods to acquire necessary information, effectively reasoning over the search results.

The team stated that this research provides important insights for developing large multimodal models with real-world interaction capabilities, laying the foundation for building adaptive, interactive multimodal agents. As models continue to interact with the real world through more tools, it is expected that multimodal intelligence will achieve new leaps in reasoning and adaptability.

Paper Link: https://arxiv.org/abs/2506.20670
Project Link: https://github.com/EvolvingLMMs-Lab/multimodal-search-r1