Three Large Models Team Up to Challenge o1: Real-World Test Shows 360’s Multi-Model Collaboration Eliminates Prompt Engineering

Models & Benchmarks · Published: Sep 20, 2024 · David Kowalski · ~9 min read

Author Info

Developer Tools & Agents Editor

15+ years software engineering; maintainer of internal agent-evaluation playbooks

David tests coding agents, IDE integrations, and terminal workflows the way working teams use them. He documents prompts, environment pins, and regression cases so readers can compare tools fairly. When vendors sponsor access, he discloses it and keeps scoring criteria unchanged.

#Coding Agents #IDE Integrations #Developer Productivity #Tool Comparisons

OpenAI’s o1 has ushered in a new paradigm for large language models: the inference scaling law, under which capability grows with the compute spent at inference time.

As NVIDIA AI scientist Jim Fan put it, the arrival of o1 marks a shift in developers’ focus from investment in training to investment in inference.

Fan also cited machine learning pioneer Rich Sutton’s classic essay The Bitter Lesson, which argues that only two techniques keep scaling as available computation grows: learning and search.

Now is the time to focus on the latter.

With more compute spent on inference, a model can work through a more complete reasoning process, and that extra investment yields a qualitative improvement in its answers.

In China, Zhou Hongyi, founder of 360, shares this philosophy. 360 had earlier proposed the concept of “slow thinking” and has already applied it to its technical architecture and products.

Furthermore, 360 emphasizes multi-model collaboration in its AI products, encouraging large models from different vendors to “huddle together for warmth.” The approach offers a viable path for Chinese models to catch up with OpenAI.

Observing Large Model “Slow Thinking” Through o1

Although o1’s specific thinking process remains a closely guarded secret at OpenAI, chain of thought (CoT) clearly plays a crucial role.

In its report on o1, OpenAI stated that CoT enables models to recognize and correct errors, break down complex steps into simpler ones, and even try different methods, significantly enhancing their reasoning capabilities.

At this year’s ICLR, a top AI conference, a paper co-authored by Denny Zhou, founder of Google Brain’s reasoning team, and Tengyu Ma (a Tsinghua Yao Class alumnus, Stanford assistant professor, and Sloan Fellow) further demonstrated the potential of chain of thought.

In essence, CoT corresponds to what Daniel Kahneman, winner of the 2002 Nobel Prize in Economics, described in Thinking, Fast and Slow as “System 2,” the “slow thinking” system.

“System 2” or “slow thinking” refers to complex, conscious reasoning. This contrasts with “System 1” or “fast thinking,” which involves simple, unconscious intuition.

The performance of o1 proves that this “slow thinking” concept, applicable to humans, is equally suitable for large models.

However, it should be noted that these two systems coexist and cooperate in the human brain; they should not be separated in large models either.

Zhou Hongyi believes that o1 likely follows “Dual Process Theory,” whose core lies in the collaborative operation of fast and slow systems.

As participants in China’s “Battle of a Hundred Models,” Zhou Hongyi and 360 have also been early proponents of “slow thinking” and “multi-system collaboration.”

At the ISC.AI conference in late July, Zhou announced plans to “build a slow-thinking system to enhance the slow-thinking capabilities of large models.”

Based on the “multi-system collaboration” mechanism, 360 utilized an agent framework composed of multiple models to transition large models from “fast thinking” to “slow thinking,” creating two star AI products: 360 AI Search and 360 AI Browser.

Letting Different Large Models “Huddle Together for Warmth”

360 AI Search offers three modes: concise answers, standard answers, and in-depth answers. An in-depth answer may involve 7 to 15 calls to large models.

For example, this might include one call to an intent recognition model, one to a search query rewriting model, five search calls, one webpage ranking call, one main answer generation call, and one follow-up question generation call…
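
The breakdown above can be tallied as a quick check against the quoted range of 7 to 15 calls. This is a minimal sketch: the dictionary keys and function names are hypothetical labels, the counts are the ones listed in the example, and the original list is open-ended, so a real in-depth answer may include further calls.

```python
from typing import Dict

# Counts taken from the example in the text; the key names are hypothetical.
IN_DEPTH_CALLS: Dict[str, int] = {
    "intent_recognition": 1,
    "query_rewriting": 1,
    "search": 5,
    "webpage_ranking": 1,
    "main_answer": 1,
    "follow_up_questions": 1,
}


def total_calls(calls: Dict[str, int]) -> int:
    """Sum the number of calls made for one answer."""
    return sum(calls.values())


def within_budget(calls: Dict[str, int], low: int = 7, high: int = 15) -> bool:
    """Check the total against the 7-to-15 range quoted for in-depth answers."""
    return low <= total_calls(calls) <= high


print(total_calls(IN_DEPTH_CALLS))    # 10
print(within_budget(IN_DEPTH_CALLS))  # True
```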

Through the coordinated cooperation of multiple models, 360 AI Search has formed the following workflow:

First, use an intent classification model to identify the user’s intent;
Next, use a task routing model to decompose the problem. Different problems are categorized into “simple tasks,” “multi-step tasks,” and “complex tasks” for scheduling across multiple models;
Finally, construct an AI workflow to enable collaborative operation among multiple large models.

For instance, when faced with a question requiring translation of classical Chinese poetry into English, the routing module would invoke multiple models such as translation and reflection models, allowing them to divide labor and complete the task together.
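
The three-stage workflow described above (intent classification, task routing, workflow construction) can be sketched roughly as follows. Everything here is an illustrative assumption rather than 360’s implementation: the keyword rules stand in for real classification and routing models, and the intent names, tier mapping, step lists, and stub models are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]

# Hypothetical mappings: which tier an intent falls into, and which steps a tier runs.
TIER_BY_INTENT = {"lookup": "simple", "analysis": "multi-step", "translation": "complex"}
STEPS_BY_TIER = {
    "simple": ["answer"],
    "multi-step": ["search", "answer"],
    "complex": ["translate", "reflect", "answer"],
}


def classify_intent(question: str) -> str:
    """Stand-in for an intent classification model (simple keyword rules)."""
    q = question.lower()
    if "translate" in q:
        return "translation"
    if "compare" in q or " and " in q:
        return "analysis"
    return "lookup"


def run_workflow(question: str, models: Dict[str, Model]) -> Tuple[List[str], str]:
    """Classify, route to a tier, then run that tier's steps in order."""
    tier = TIER_BY_INTENT[classify_intent(question)]
    text, trace = question, []
    for step in STEPS_BY_TIER[tier]:
        text = models[step](text)  # each model receives the previous step's output
        trace.append(step)
    return trace, text


# Stub models that only tag their input, so the flow is visible without any API.
stub_models: Dict[str, Model] = {
    name: (lambda text, name=name: f"[{name}] {text}")
    for name in ("search", "translate", "reflect", "answer")
}

trace, _ = run_workflow("Translate this Tang poem into English", stub_models)
print(trace)  # ['translate', 'reflect', 'answer']
```

In this sketch the poetry question lands in the “complex” tier, so a translation step and a reflection step both run before the final answer, mirroring the division of labor described above.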

The latest version further strengthens multi-model collaboration during answer generation, establishing it as an independent response mode.

Three different models play distinct roles: the Expert generates the initial answer, the Reflector checks the response, and the Summarizer provides the final answer.

In one example, the Expert model, Kimi, identified the key points of the question, but its answer lacked clarity. Guided by the critique from the Reflector, 360 ZhiNao, the Summarizer, Doubao, rewrote the content into a direct and precise solution.

This working mode not only integrates fast-slow thinking collaboration and reflection mechanisms into AI applications but also further improves overall performance through cross-validation among different models.
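
A minimal sketch of the Expert, Reflector, and Summarizer roles, assuming each model can be treated as a plain text-in, text-out function. The `collaborate` function, the prompt wording, and the stub models are hypothetical; the article does not describe how 360 passes information between the three roles.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


def collaborate(question: str, expert: Model, reflector: Model, summarizer: Model) -> Dict[str, str]:
    """Run one Expert -> Reflector -> Summarizer pass and keep every intermediate output."""
    draft = expert(question)
    critique = reflector(
        f"Question: {question}\nDraft answer: {draft}\nList any errors or gaps."
    )
    final = summarizer(
        f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Write the final answer."
    )
    return {"draft": draft, "critique": critique, "final": final}


# Stub models for illustration; real use would wrap three different vendors' APIs.
result = collaborate(
    "How do I speed up a slow SQL query?",
    expert=lambda prompt: "Add an index.",
    reflector=lambda prompt: "The draft does not say which column to index.",
    summarizer=lambda prompt: "Index the column used in the WHERE clause.",
)
print(result["final"])
```

Because each role is just a function argument, any three models can be slotted in, which is what allows answers to be cross-checked across vendors.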

Another AI product, the 360 AI Browser, brings together 54 large models from 16 vendors, enabling capabilities that traditional browsers cannot offer.

The AI browser can summarize an English academic paper of tens of thousands of words within 10 seconds, and users can then ask detailed questions about specific points.

It can translate PDF documents immersively, with the original text and the translation scrolling in sync for easy comparison.

It can also act as an “AI Efficiency Expert,” helping to summarize online videos and highlight key points in minutes, drawing mind maps based on video structure, and even analyzing creative styles…

These analysis functions work on local files as well as on online documents and videos.

More conveniently, the 360 AI Browser has a mobile version, allowing users to leverage AI-assisted browsing on their phones anytime.

The AI Assistant (bot.360.com), which is built into the 360 AI Browser and is also based on 360’s CoE (Collaboration-of-Experts) architecture, can automatically dispatch the most suitable large model according to task type and model strengths.
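
Dispatching by task type and model strengths can be pictured as a lookup over a strengths table. The sketch below is an assumption for illustration only: the model names are placeholders, the scores are invented, and the article does not say how 360 measures or stores model strengths.

```python
from typing import Dict

# Hypothetical strengths table: placeholder model names and made-up scores.
STRENGTHS: Dict[str, Dict[str, float]] = {
    "model_a": {"translation": 0.9, "coding": 0.6},
    "model_b": {"translation": 0.7, "coding": 0.8},
    "model_c": {"writing": 0.85},
}


def dispatch(task_type: str,
             strengths: Dict[str, Dict[str, float]] = STRENGTHS,
             default: str = "model_a") -> str:
    """Return the model with the highest score for the task, or a default if none is rated."""
    candidates = {
        name: scores[task_type]
        for name, scores in strengths.items()
        if task_type in scores
    }
    if not candidates:
        return default
    return max(candidates, key=candidates.get)


print(dispatch("coding"))       # model_b
print(dispatch("translation"))  # model_a
```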

Users can directly converse with 54 large models or more powerful hybrid models without switching platforms, choosing whichever they prefer.

The AI Assistant also supports “multi-model collaboration.” Users can select any three of the 54 models to serve as Expert, Reflector, and Summarizer.

In the future, 360 will release versions where five or more models collaborate to complete tasks.

Also within the 360 AI Browser, the AI Assistant has launched a “Model Arena” (bot.360.com), supporting “head-to-head competition” among 54 large model products. The latest version includes features such as “team battles,” “anonymous showdowns,” and “random matches.”

In summary, whether it is 360 AI Search or the 360 AI Browser, although their focuses differ, they both reflect a core philosophy:

While engaging in “slow thinking,” rather than competing solely on the capabilities of a single model, let models “huddle together for warmth,” drawing on strengths from all sides to create a situation where “many hands make light work.”

Of course, this approach not only brings users a better AI experience but also gives the developers of large models an incentive.

We know that R&D investment in large models is enormous, and sufficient user adoption is necessary to recoup costs.

By leveraging entry points such as 360 AI Search, the Browser, and Security Guard, 360 has opened access to its 1 billion users to large model developers.

This is also a key reason why major tech giants like Alibaba, Tencent, and Baidu, as well as the “Little Six Tigers” of large models, have joined the 360 AI architecture.

Thus, the mutual effort between 360 and these dozen-plus vendors has achieved a virtuous cycle where models and AI applications promote each other’s development.

The Model Arena provides domestic large models with a platform to learn through competition and an excellent opportunity to receive user feedback, fostering a more proactive and enterprising atmosphere.

The “Elimination” of Prompt Engineering

From a technical perspective, the bridge connecting concepts to products is 360’s proprietary CoE (Collaboration-of-Experts) architecture.

The CoE architecture aggregates a large number of large language models and expert models, integrating “fast thinking” and “slow thinking” through chain-of-thought reasoning and “multi-system collaboration.”

In terms of approach, CoE follows a similar path to o1 but goes deeper:

No matter how much o1 integrates, it ultimately relies on OpenAI’s proprietary models. In contrast, CoE is inclusive, aggregating a wider variety of large language models and expert models.

△ Schematic diagram of the CoE architecture

Furthermore, the CoE architecture incorporates many expert models with billions of parameters or fewer. Sending suitable requests to these smaller models lets the system deliver high-quality responses while saving inference resources and improving response speed.
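
One common way to get this kind of saving is a model cascade: try a small expert first and escalate to a large model only when the small one is not confident. The article does not say CoE works this way, so the sketch below is a generic illustration; the function names, the confidence score, and the threshold are all assumptions.

```python
from typing import Callable, Tuple

# A scored model returns (answer, confidence in [0, 1]); this interface is hypothetical.
ScoredModel = Callable[[str], Tuple[str, float]]


def cascade(question: str,
            small_expert: ScoredModel,
            large_model: ScoredModel,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the small expert when it is confident, otherwise escalate."""
    answer, confidence = small_expert(question)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(question)
    return answer, "large"


# Stub models: the small expert is only confident on short questions.
def small(question: str) -> Tuple[str, float]:
    return "short answer", 0.9 if len(question) < 20 else 0.3


def large(question: str) -> Tuple[str, float]:
    return "detailed answer", 0.95


print(cascade("2 + 2?", small, large))
print(cascade("Explain dual process theory in depth", small, large))
```

The large model is only invoked on the second question, so the cheaper path handles whatever the small expert can answer reliably.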

Shortly after the CoE architecture was released, a hybrid large model built on CoE, which draws on the strengths of its constituent models, surpassed GPT-4o, then considered the strongest model.

In tests across 12 metrics such as translation and writing, this hybrid large model achieved a comprehensive score of 80.49, outperforming GPT-4o’s score of 69.22. Moreover, in all categories except coding, it surpassed GPT-4o.
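
To put the two comprehensive scores in perspective, the gap can be worked out directly. The snippet uses only the two figures quoted above; the variable names are arbitrary.

```python
COE_HYBRID_SCORE = 80.49  # comprehensive score quoted for the CoE-based hybrid model
GPT_4O_SCORE = 69.22      # comprehensive score quoted for GPT-4o

# Absolute gap in points, and the gap relative to GPT-4o's score.
absolute_gap = round(COE_HYBRID_SCORE - GPT_4O_SCORE, 2)
relative_gap_pct = round(100 * (COE_HYBRID_SCORE - GPT_4O_SCORE) / GPT_4O_SCORE, 1)

print(absolute_gap)      # 11.27
print(relative_gap_pct)  # 16.3
```

That is a lead of 11.27 points, or roughly 16% over GPT-4o’s score.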

Additionally, the CoE architecture embraces all models, going further than OpenAI on the path to open collaboration…

Furthermore, whether it is OpenAI’s o1 or 360’s CoE, both point toward a new trend in the development of large language models:

Complex manual processes will be automated. Specifically within the context of large models, this means the “elimination” of prompt engineering.

At first glance, this may seem counterintuitive because, when using large models, the quality of prompts has a decisive impact on generated content; its importance is self-evident.

However, upon closer reflection, there is no contradiction: AI applications like large language models ultimately exist to serve humans.

Prompt engineering, conversely, requires humans to adapt to the way models work—a reversal of priorities.

Therefore, while prompt engineering is undoubtedly important, it should not become an obstacle for ordinary users interacting with large models.

The solution lies in treating prompt design as just another task within a chain-of-thought process, delegating it to the large model itself.
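
As a rough sketch of that idea, prompt design can become the first of two model calls: one call turns the user’s raw request into a detailed prompt, and a second call answers it. The function name, the rewriting instructions, and the stub model are hypothetical; neither OpenAI nor 360 has published this exact flow.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


def answer_with_auto_prompt(user_input: str, model: Model) -> Dict[str, str]:
    """Let the model write its own prompt first, then answer that prompt."""
    rewrite_request = (
        "Rewrite the request below as a precise prompt. "
        "State the goal, constraints, and output format.\n"
        f"Request: {user_input}"
    )
    engineered_prompt = model(rewrite_request)  # step 1: the model designs the prompt
    answer = model(engineered_prompt)           # step 2: the model answers it
    return {"engineered_prompt": engineered_prompt, "answer": answer}


# Stub model that reports how much text it received, so no API is needed to run this.
def stub_model(prompt: str) -> str:
    return f"response to {len(prompt)} characters of input"


result = answer_with_auto_prompt("help me write a cover letter", stub_model)
print(result["engineered_prompt"])
print(result["answer"])
```

The user only ever types the short request; the engineered prompt exists, but inside the chain rather than in front of the user.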

In this mode, the essence of prompt engineering remains intact but gradually fades from the user’s perspective, creating a sense of “disappearance.”

This approach also reflects 360’s vision for the future development of AI:

Achieving inclusive access to AI for more people, ensuring that large models are no longer confined to elite circles (“high temples”) but become as ubiquitous and essential as household lights.