Why Are Leading Domestic AI Model Developers Choosing CPUs?

Author Info

Yuki Tanaka

Asia-Pacific AI Markets Reporter

B.A. Economics (University of Tokyo); bilingual EN/JP; former APAC tech wire correspondent

Yuki tracks model launches, cloud partnerships, and industrial policy across East Asia. She sources from company filings, local press briefings, and on-the-ground industry contacts, then contextualizes moves for a global English-speaking audience. She is careful to note translation limits and regional regulatory differences.

#APAC Markets #Cloud Partnerships #Industrial Policy #Cross-Border Launches

One day in AI equals one year on Earth.

Large models and AI applications are now updated so quickly that keeping up feels impossible:

Sora, Suno, Udio, Luma… major applications are being released one after another.

As data from an InfoQ survey indicates, although AIGC (AI-Generated Content) is still in its early stages, the market size has begun to take shape:

The market is expected to reach 450 billion RMB by 2030. AIGC applications are flourishing across multiple sectors, moving from general-purpose scenarios deeper into specific industries.

While the rapid development of the entire industry is undoubtedly a positive sign, the competition for specific applications and large model implementations has become increasingly fierce.

For instance, major model vendors recently engaged in an intense “price war,” competing to offer the lowest prices and even pushing large model pricing into the “cent era” of extremely low costs.

Compounded by OpenAI’s recent “supply cutoff” incident, domestic vendors have intensified their efforts to promote “easy migration” plans and increased incentives such as free token giveaways.

The root cause is that applications are now king: businesses need to deploy quickly and at minimal cost.

So, how can large model players achieve a balance between being fast, high-quality, and cost-effective?

This brings us back to an unavoidable factor that accounts for the vast majority of costs: computing power.

When it comes to training and inference for large models, many people’s first reaction is to think of GPUs.

Admittedly, GPUs have a clear performance advantage, but their hard limitations are just as obvious: insufficient supply and high prices.

How can this bottleneck be broken? Baidu Intelligent Cloud’s Qianfan Large Model Platform, a top-tier player in domestic large models, has offered its own solution with better “cost-effectiveness”:

Except for a few major clients pursuing peak performance from large models, most enterprises and institutions need to weigh output quality, performance, and cost together when adopting large models, a balance commonly referred to as the “price-performance ratio.”

Regarding computing power deployment, Xin Zhou, General Manager of Baidu Intelligent Cloud’s AI and Big Data Platform, believes:

The use of CPUs for running AI has actually been prevalent since the early days; GPU popularity is only a recent phenomenon.

In many scenarios, although GPUs offer high-density computing capabilities, practical tests show that modern high-end CPUs are fully capable of handling these tasks.

Moreover, the entire AI business workflow involves not just large model computations but also preliminary stages like data cleaning, where CPUs play a crucial role.

In short, in the era of large models, CPUs have become even more important than before and are one of the key factors enabling large models and applications to be deployed “fast, well, and cheaply.”

So, how does this perform in practice? Let’s continue reading.

Top Domestic Large Model Players Have Chosen CPUs

With the explosion of domestic AIGC applications, Baidu Intelligent Cloud’s Qianfan Large Model Platform has played an indispensable role.

As a “one-stop” service platform for enterprises to use large models, Qianfan has been used by over 120,000 clients since its launch in March last year, with 20,000 optimized models and 42,000 incubated applications.

These applications cover numerous scenarios such as education, finance, office work, and healthcare, providing strong support for industry digital transformation.

In the education sector, Qianfan empowers applications like question generation, online grading, and problem analysis, significantly improving teaching and exam preparation efficiency.

For example, users can provide reference materials and set question types and difficulty levels; the platform then automatically generates high-quality test questions. Interactive problem explanations offer personalized learning guidance tailored to each student’s weak points.

In office scenarios, Qianfan collaborates with leading industry partners to create innovative applications like intelligent writing assistants. These tools can quickly generate professional documents such as recruitment copy, marketing plans, and data reports based on user-input keywords.

They also focus on various writing scenarios, intelligently generating thesis outlines, project reports, and brand promotional drafts, greatly enhancing the efficiency of administrative and marketing staff.

Healthcare is another major application track for Qianfan. Models trained on medical knowledge bases can automatically generate interpretations of health checkup reports, explaining indicators in plain language and providing personalized health guidance.

This allows ordinary people to better understand their physical condition and achieve “autonomous health management.”

It is evident that Qianfan has achieved the “last mile” implementation of AI models across multiple fields.

So, how does Qianfan support so many AI applications?

The answer: making CPUs a viable choice for customers, so that every industry can share in the benefits of “cost-effectiveness.”

Baidu Intelligent Cloud explains this approach as follows:

Currently, there is still a significant demand for offline LLM applications in the industry, such as generating article summaries, abstracts, and data analysis. Compared to online scenarios, offline scenarios often utilize idle computing resources on platforms. They have lower latency requirements but are more sensitive to inference costs; therefore, users tend to prefer low-cost, easily accessible CPUs for inference.

Cloud platforms like Baidu Intelligent Cloud host a large number of CPU-based cloud servers. Releasing the AI computing potential of these CPUs helps improve resource utilization and meets users’ needs for rapid LLM deployment.

Regarding performance, taking Llama-2-7B as an example, token throughput on fourth-generation Intel® Xeon® Scalable processors can exceed 100 tokens per second (TPS), a 60% improvement over third-generation processors.

△ Llama-2-7B Model Token Throughput Output

In low-latency scenarios, under equal concurrency, the first-token latency of fourth-generation Xeon® Scalable processors can be reduced by more than 50% compared to third-generation processors.

After upgrading to fifth-generation Xeon® Scalable processors, throughput increased by approximately 45%, while first-token latency decreased by about 50%.

△ Llama-2-7B Model First-Token Latency
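
To put the generation-over-generation figures side by side, the short sketch below compounds them. It assumes the fifth-generation numbers are measured against the fourth generation, which the text implies but does not state outright, and it treats “more than 50%” as exactly 50%. The function names are illustrative only, and the throughput and latency figures come from different test scenarios, so the result is a rough estimate rather than a benchmark.

```python
# Relative gains quoted in the article, expressed as fractions.
GEN4_THROUGHPUT_GAIN = 0.60   # 4th gen vs. 3rd gen
GEN5_THROUGHPUT_GAIN = 0.45   # 5th gen vs. 4th gen (assumed baseline)
GEN4_LATENCY_CUT = 0.50       # 4th gen vs. 3rd gen ("more than 50%")
GEN5_LATENCY_CUT = 0.50       # 5th gen vs. 4th gen (assumed baseline)


def compound_gain(*gains: float) -> float:
    """Multiply successive relative gains into one overall speedup factor."""
    factor = 1.0
    for gain in gains:
        factor *= 1.0 + gain
    return factor


def remaining_latency(*cuts: float) -> float:
    """Fraction of the original latency left after successive reductions."""
    fraction = 1.0
    for cut in cuts:
        fraction *= 1.0 - cut
    return fraction


throughput_factor = compound_gain(GEN4_THROUGHPUT_GAIN, GEN5_THROUGHPUT_GAIN)
latency_fraction = remaining_latency(GEN4_LATENCY_CUT, GEN5_LATENCY_CUT)

print(f"5th gen vs. 3rd gen throughput: about {throughput_factor:.2f}x")
print(f"5th gen vs. 3rd gen first-token latency: about {latency_fraction:.0%} of the original")
```

Under these assumptions, two generations of upgrades work out to roughly 2.3 times the throughput and about a quarter of the first-token latency.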

Furthermore, the Qianfan Large Model Platform team stated based on practical experience:

For LLMs with fewer than 30 billion parameters, Intel® Xeon® Scalable processors can deliver a good performance experience.

Moreover, leveraging abundant CPU resources and reducing reliance on AI accelerator cards lowers the total cost of ownership (TCO) of LLM inference services, an advantage that is most pronounced in offline LLM inference scenarios.
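
One simple way to reason about inference cost is to convert an instance’s hourly price and its token throughput into a cost per million tokens. The sketch below does only that. The function name is illustrative, the 100 tokens per second comes from the Llama-2-7B figure quoted above, and the hourly price of 1.0 is a placeholder unit rather than a real quote; actual TCO also includes power, operations, and staffing.

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens on an instance billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# 100 tokens/s is the throughput figure quoted above for Llama-2-7B.
# The hourly price of 1.0 is a placeholder unit, not a real quote.
unit_cost = cost_per_million_tokens(hourly_cost=1.0, tokens_per_second=100.0)
print(f"{unit_cost:.2f} cost units per million tokens")
```

Because the cost scales inversely with throughput, doubling tokens per second at the same hourly price halves the cost per token, which is why idle CPU capacity that is already paid for is attractive for offline jobs.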

Additionally, the Qianfan Large Model Platform hosts not only Baidu’s own ERNIE models but also many other mainstream large models.

This indirectly confirms that fifth-generation Intel® Xeon® Scalable processors meet performance standards.

How Does Fifth-Generation Intel® Xeon® Take Performance and Efficiency to “Pro Max”?

Baidu Intelligent Cloud’s Qianfan Large Model Platform handles more than just large model inference workloads; it is actually a platform covering the entire lifecycle of large models.

Specifically, Qianfan provides comprehensive functional services including data annotation, model training and evaluation, inference services, application integration, rapid application orchestration, and plugin integration, facilitating multi-scenario deployment of large models. In this context, fully utilizing the widely deployed CPU resources on the platform is a more cost-effective choice compared to deploying dedicated accelerators solely for large model inference.

The vast number of offline large model workloads on Qianfan, such as generating article summaries and abstracts or evaluating multiple models’ effects, have no strict latency requirements but often run into memory bottlenecks.

Using CPUs makes memory expansion more convenient and allows for the utilization of idle computing resources during off-peak times, further improving resource utilization and reducing total cost of ownership.
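
A back-of-the-envelope estimate shows why memory capacity matters here. The sketch below counts only the memory needed to hold model weights, using the standard byte widths of each data type. It ignores the KV cache, activations, and runtime overhead, so real requirements are higher, and the helper name is illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("7B", 7e9), ("30B", 30e9)]:
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.0f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

By this estimate, a 7-billion-parameter model needs about 14 GB for BF16 weights and a 30-billion-parameter model about 60 GB, a footprint that a server can accommodate by adding system memory.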

Against this backdrop, the fact that fifth-generation Intel® Xeon® Scalable processors are designed for performance-intensive general-purpose workloads (a design similar to P-Core performance cores) becomes particularly critical.

Compared to E-Cores (efficiency cores), P-Cores adopt a design focused on maximizing performance, capable of handling very heavy workloads while also supporting AI inference acceleration.

The adoption of this design in fifth-generation Xeon® Scalable processors is more than a slogan: it is backed by hardware-software co-optimization across the whole stack.

On the hardware side, Intel® AMX (Advanced Matrix Extensions) technology is specifically optimized for the massive matrix multiplication operations involved in deep learning within large model inference. It can be understood as “Tensor Cores inside the CPU.”

With Intel® AMX, each core can complete up to 2048 INT8 operations per clock cycle, eight times the rate of the previous generation’s AVX512_VNNI instructions.
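
The arithmetic behind these figures can be checked in a few lines. The sketch below derives the implied AVX512_VNNI rate from the two quoted numbers. It also relates the AMX rate to a single tile multiply, assuming the tile shape Intel publicly describes for AMX (16 rows of 64 bytes) and counting each multiply and each add as a separate operation. The cycle count is only what these numbers imply, not a measured latency.

```python
# Figures quoted in the article.
AMX_INT8_OPS_PER_CYCLE = 2048
SPEEDUP_OVER_VNNI = 8

# Implied AVX512_VNNI rate, derived purely from the two numbers above.
vnni_int8_ops_per_cycle = AMX_INT8_OPS_PER_CYCLE // SPEEDUP_OVER_VNNI

# Assumed tile shape for one AMX INT8 tile multiply: a 16x64 INT8 tile times a
# 64x16 INT8 tile, accumulated into a 16x16 INT32 tile.
ROWS, INNER, COLS = 16, 64, 16
macs_per_tile_op = ROWS * INNER * COLS      # multiply-accumulates
ops_per_tile_op = 2 * macs_per_tile_op      # count multiply and add separately
cycles_per_tile_op = ops_per_tile_op // AMX_INT8_OPS_PER_CYCLE

print(vnni_int8_ops_per_cycle)  # 256
print(macs_per_tile_op)         # 16384
print(cycles_per_tile_op)       # 16
```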

More importantly, the Intel® AMX accelerator is built directly into the CPU core, bringing matrix storage and computation closer together. This feature reduces latency when processing subsequent tokens in large model inference, enhancing end-user experience.

△ Intel® AMX enables more efficient AI acceleration

On the software side, Baidu Intelligent Cloud’s Qianfan Large Model Platform has introduced xFasterTransformer (xFT), a large model inference software solution deeply optimized for the Intel® Xeon® Scalable platform, using it as the backend inference engine. The main optimization strategies are as follows:

  • Fully utilizing instruction sets such as AMX/AVX512 to efficiently implement core operators like Flash Attention
  • Adopting low-precision quantization to reduce data access volume and leverage the advantages of INT8/BF16 operations
  • Supporting multi-machine, multi-card parallel inference for ultra-large-scale models

△ Intel® Xeon® Scalable Processor LLM Inference Software Solution
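
To illustrate the low-precision quantization strategy listed above, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a generic textbook scheme, not xFasterTransformer’s actual implementation, and the weights and function names are hypothetical. It shows the two effects the strategy relies on: each value shrinks from four bytes to one, at the price of a small rounding error.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.823, -1.27, 0.047, 0.401]   # hypothetical FP32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)            # plus one shared scale factor
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"{fp32_bytes} bytes -> {int8_bytes} bytes, max error {max_error:.4f}")
```

The rounding error of each value is bounded by half the scale factor, so the accuracy cost stays small when the weights are spread fairly evenly across the quantization range.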

Finally, it is worth noting that choosing a hardware platform affects not only the initial procurement cost but also subsequent maintenance costs and even talent reserve expenses.

As Baidu Intelligent Cloud states, cost-effective computing infrastructure works in tandem with advanced large model algorithms and platform software, allowing upper-layer developers to apply and build their businesses more smoothly, thereby maximizing the commercial value of cloud computing platforms.

In the Era of Large Models, CPUs Have Great Potential

Looking at the current landscape, large models are moving from laboratories into industries, transforming from “toys” for a few into “tools” available to the masses.

This means that large model services must not only deliver excellent performance but also be affordable and easy to deploy. In short, “fast, high-quality, and cost-effective” has become a key link in the commercialization of large models.

To achieve this, the choice of computing infrastructure is crucial.

Traditional views hold that dedicated accelerators are the “standard configuration” for AI. However, against the backdrop of tight supply and high costs, the advantages of dedicated accelerators are diminishing.

In comparison, well-optimized high-end CPUs not only provide sufficient computing power to handle large model inference but also boast a broader deployment foundation, a more mature software ecosystem, and enhanced security safeguards. Consequently, they are gaining increasing favor among industry players.

x86 CPUs, represented by the Intel® Xeon® series, have a well-established software ecosystem and an extensive application base. Millions of developers can leverage existing tools and frameworks to rapidly build and optimize AI applications without needing to learn specialized accelerator software stacks. This significantly reduces development complexity and migration costs.

Furthermore, enterprise users can utilize the multi-layered security technologies built into CPUs to achieve full-stack protection from hardware to software, ensuring robust data security and privacy. These advantages are difficult for dedicated accelerators to match.

Therefore, fully leveraging CPUs for inference is a key strategy for the AIGC industry to overcome computing power barriers and drive large-scale application adoption. It transforms AI from a “money-burning game” into “inclusive technology.” As technological innovation continues and ecosystems mature, this model will create value for more enterprises and inject new momentum into industrial development.

Beyond directly accelerating inference tasks, CPUs efficiently handle critical steps in the end-to-end AI pipeline, such as data preprocessing and feature engineering. Various databases supporting machine learning and graph analysis are primarily built on CPU architectures. Taking Intel® Xeon® Scalable Processors as an example, in addition to Intel® Advanced Matrix Extensions (Intel® AMX), they integrate a series of data analytics engines, including Intel® QuickAssist Technology (Intel® QAT) for data protection and compression acceleration, and Intel® In-Memory Analytics Accelerator (Intel® IAA). By offloading specific tasks, these features enable better CPU utilization, improve overall workload performance, and accelerate data analysis.

Thus, building “fast, accurate, and stable” AI applications relies not only on the powerful computing power of dedicated accelerators but also on the superior general-purpose computing capabilities of CPUs to unlock the full potential of the entire system.

To popularize the role of CPUs in the new era of AI inference, this website has launched the “Most ‘In’ AI” column. This series will comprehensively interpret the topic from multiple perspectives, including technical education, industry case studies, and practical optimization strategies.

Through this column, we hope to help more people understand the practical achievements of CPUs in AI inference acceleration, as well as in entire AI platforms or full-process acceleration. The focus is on how to better utilize CPUs to enhance the performance and efficiency of large model applications.

— End —
