This year, major AI companies have once again dramatically increased their spending on GPU procurement.
Elon Musk’s xAI plans to expand its 100,000-GPU supercomputing cluster tenfold, while Meta intends to invest $10 billion to build a data center with a capacity of 1.3 million GPUs…
The number of GPUs has become the direct indicator of an internet company’s AI capabilities.

Indeed, building AI computing power through the brute-force method of stacking GPUs is the simplest approach. However, more GPUs in a cluster do not necessarily mean better performance.
Although GPUs offer strong computational performance, they face numerous challenges when deployed in clusters. Even NVIDIA, the industry leader, struggles with communication bottlenecks, memory fragmentation, and fluctuating resource utilization rates.
In simple terms, limitations such as communication overhead keep GPUs from reaching their full potential.
Therefore, building cloud data centers for the AI era is not a one-time solution achieved simply by stacking cards into racks. The shortcomings of existing data centers must be addressed through architectural innovation.
Recently, Huawei released a substantial 60-page paper outlining its next-generation AI data center architecture, Huawei CloudMatrix, and the first production implementation of that concept, CloudMatrix384. Rather than simply “stacking cards,” CloudMatrix is designed around high-bandwidth, fully symmetric interconnectivity and fine-grained resource decoupling.
This paper is packed with technical insights, detailing the hardware design of CloudMatrix384 and introducing best practices for DeepSeek inference based on this platform: CloudMatrix-Infer.

So, how powerful is Huawei’s proposed CloudMatrix384? Simply put, it can be summarized in three aspects:
- High Efficiency: Prefill throughput reaches 6,688 tokens/s/NPU, and decode phase throughput reaches 1,943 tokens/s/NPU. In terms of computational efficiency, prefill achieves 4.45 tokens/s/TFLOPS, and the decode phase achieves 1.29 tokens/s/TFLOPS, both surpassing performance benchmarks on NVIDIA H100/H800 systems;
- High Accuracy: The accuracy of INT8 quantization benchmark tests for the DeepSeek-R1 model on Ascend NPUs matches that of the official API;
- High Flexibility: It supports dynamic adjustment of inference latency Service Level Objectives (SLOs), maintaining a decode throughput of 538 tokens/s even under strict 15ms latency constraints.
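
The two efficiency figures in the list are throughput divided by a per-NPU compute budget. As a rough sanity check, the sketch below inverts that ratio to see what peak TFLOPS each pair of numbers implies. The article does not state the peak figure, so the result is derived rather than quoted, and the function name is purely illustrative.

```python
def implied_peak_tflops(tokens_per_s: float, tokens_per_s_per_tflops: float) -> float:
    """Invert efficiency = throughput / peak to recover the implied peak TFLOPS."""
    return tokens_per_s / tokens_per_s_per_tflops


# Per-NPU figures quoted in the list above.
prefill_peak = implied_peak_tflops(6688, 4.45)
decode_peak = implied_peak_tflops(1943, 1.29)

print(f"prefill implies ~{prefill_peak:.0f} TFLOPS per NPU")  # ~1503
print(f"decode implies ~{decode_peak:.0f} TFLOPS per NPU")    # ~1506
```

Both ratios land near 1,500 TFLOPS per NPU, so the prefill and decode efficiency numbers appear to be normalized by the same compute budget.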

AI Data Center Architecture: Huawei Cloud Takes a Step Ahead
Before diving into this significant paper, it is necessary to first understand “Why we need CloudMatrix384.”
In one sentence, the answer is that traditional architectures cannot meet the current computing power demands of AI development.
Traditional AI clusters operate more like “dispersed small workshops,” where each server (node) functions somewhat independently. Computing power, memory, and network resources are statically allocated.
Under this traditional model, when AI clusters encounter ultra-large-scale models, various problems arise, such as insufficient computing power, memory bandwidth bottlenecks, and sluggish inter-node communication.
In this paper, Huawei proposes a new model that transforms these “small workshops” into “supercomputing factories”: CloudMatrix, Huawei Cloud’s next-generation AI data center architecture, whose first production-grade implementation is CloudMatrix384.

Its most distinct feature is unified resource scheduling: CloudMatrix384 integrates 384 NPUs, 192 CPUs, and other hardware into a single super node.
Consequently, resources such as computing power, memory, and network bandwidth are managed uniformly like an assembly line in a factory, allocated precisely where needed.
Furthermore, data within CloudMatrix384 moves like items on a high-speed conveyor belt in a factory. Because all chip connections are handled by the ultra-high-bandwidth, low-latency Unified Bus (UB) network, data is transmitted directly between chips in a “fully symmetric” manner. This avoids the “traffic jams” typical of traditional networks.
As a result, regardless of whether CloudMatrix384 encounters large models with massive parameter scales or inference tasks requiring frequent cache access, it can efficiently complete computations through dynamic resource allocation.
△ Huawei CloudMatrix Architecture Vision
After understanding the design vision for next-generation AI data centers, let us delve deeper into its innovative technical details and unique advantages.
Fully Symmetric Interconnectivity: A Crucial Step Forward by Huawei
Fully symmetric interconnectivity (Peer-to-Peer) is arguably a major innovation in CloudMatrix384’s hardware architecture design.
In traditional AI clusters, the CPU plays the role of a “leader,” while other hardware like NPUs act as “subordinates.” Data transmission requires CPU “approval and signing,” which significantly reduces efficiency.
This issue becomes particularly acute when processing large-scale models, where communication overhead can account for up to 40% of the total task duration!
However, in CloudMatrix384, the situation is entirely different.
The CPU and NPUs function more like a “flat-management team,” with relatively equal status. They communicate directly via the UB network, eliminating the time wasted on hierarchical communication (“leaders passing messages”).
△ CloudMatrix384 Fully Symmetric Interconnectivity Hardware Architecture Design
The key to achieving this “flat management” is the UB network mentioned earlier, which utilizes a non-blocking fully connected topology.
It adopts a Clos architecture, in which L1/L2 switches across 16 racks form a multi-level non-blocking network, ensuring constant communication bandwidth between any two NPUs or CPUs.
In contrast, traditional clusters communicate via RoCE networks, with bandwidth typically limited to 200 Gbps (approximately 25 GB/s), and they suffer from “north-south bandwidth bottlenecks” (e.g., excessive load on data center core switches).
With the support of the UB network, each NPU provides 392 GB/s of unidirectional bandwidth—equivalent to transmitting 48 full HD movies per second. Data transmission is both fast and stable.
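To make the comparison concrete, the short sketch below converts the quoted RoCE figure from gigabits to gigabytes per second and sets it against the 392 GB/s UB figure. It only restates arithmetic on numbers already given above; the movie size is whatever the “48 movies per second” analogy implies, not a figure from the paper.

```python
def gbps_to_gb_per_s(gigabits_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gigabits_per_s / 8


roce_gb_per_s = gbps_to_gb_per_s(200)  # RoCE figure quoted above
ub_gb_per_s = 392.0                    # UB per-NPU unidirectional figure quoted above
speedup = ub_gb_per_s / roce_gb_per_s
movie_size_gb = ub_gb_per_s / 48       # size implied by "48 movies per second"

print(roce_gb_per_s, round(speedup, 2), round(movie_size_gb, 2))  # 25.0 15.68 8.17
```

On these numbers, the UB link offers roughly 15.7 times the bandwidth of a 200 Gbps RoCE link.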
Additionally, traditional NPU communication relies on SDMA engines (similar to “logistics transit hubs”), which have a drawback: high startup latency (approximately 10 microseconds).
To address this, fully symmetric interconnectivity introduces the AIV Direct Connect (AIV-Direct) mechanism. This allows direct writing to remote NPU memory via the UB network, bypassing the SDMA transit. Transmission startup latency is reduced from 10 microseconds to under 1 microsecond.
This mechanism is particularly well-suited for high-frequency communication scenarios such as token distribution in Mixture of Experts (MoE) models, reducing single-communication time by over 70%.
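Why startup latency matters so much for small, frequent messages can be seen with a simple cost model: transfer time is roughly a fixed startup cost plus the bytes divided by bandwidth. The sketch below is an illustrative assumption, not the paper's model. It uses 10 µs and 1 µs as the two startup costs (1 µs being the upper bound quoted above), the 392 GB/s bandwidth figure, and hypothetical message sizes.

```python
def transfer_time_us(message_bytes: int, startup_us: float, bandwidth_gb_per_s: float) -> float:
    """Startup latency plus serialization time, in microseconds."""
    bytes_per_us = bandwidth_gb_per_s * 1e3  # 1 GB/s = 1e9 bytes/s = 1e3 bytes/us
    return startup_us + message_bytes / bytes_per_us


def saving(message_bytes: int) -> float:
    """Fraction of transfer time saved by cutting startup from 10 us to 1 us."""
    slow = transfer_time_us(message_bytes, startup_us=10.0, bandwidth_gb_per_s=392.0)
    fast = transfer_time_us(message_bytes, startup_us=1.0, bandwidth_gb_per_s=392.0)
    return 1 - fast / slow


# Hypothetical message sizes: 64 KiB, 1 MiB, 16 MiB.
for size in (64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    print(f"{size:>9} bytes: {saving(size):.0%} of transfer time saved")
```

Under this model the saving is largest for small messages, where startup dominates, and shrinks as messages grow. That fits the article's point that the mechanism suits high-frequency traffic such as MoE token distribution.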
Beyond hardware design, software support also plays an indispensable role in CloudMatrix384’s high efficiency.
For example, the UB network combines memory pooling technology to achieve a “global memory view” for CloudMatrix384. This allows all NPUs and CPUs to directly access cross-node memory without needing to know the physical location of the data.
During the decode phase, NPUs can directly read KV caches generated by NPUs in the prefill phase, eliminating the need for CPU transit or disk storage. Data access latency is reduced from milliseconds to microseconds, increasing cache hit rates to over 56%.
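The idea of a “global memory view” can be illustrated with a toy cache pool in which the reader looks data up by content key and never names the node that wrote it. This is a minimal sketch under stated assumptions: `KVCachePool`, its methods, and the node name are hypothetical and not part of any Huawei API, and an in-process dictionary stands in for memory that would really be reached over the UB network.

```python
import hashlib
from typing import Optional


class KVCachePool:
    """Toy stand-in for a pooled, location-transparent KV cache (hypothetical)."""

    def __init__(self) -> None:
        self._entries = {}  # key -> (owner node, payload)
        self.hits = 0
        self.lookups = 0

    @staticmethod
    def key_for(prompt_prefix: str) -> str:
        # Content-derived key, so readers never need to know which node wrote the data.
        return hashlib.sha256(prompt_prefix.encode("utf-8")).hexdigest()

    def put(self, prompt_prefix: str, payload: bytes, owner_node: str) -> None:
        self._entries[self.key_for(prompt_prefix)] = (owner_node, payload)

    def get(self, prompt_prefix: str) -> Optional[bytes]:
        self.lookups += 1
        entry = self._entries.get(self.key_for(prompt_prefix))
        if entry is None:
            return None
        self.hits += 1
        return entry[1]

    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


pool = KVCachePool()
pool.put("system prompt A", b"kv-blocks", owner_node="prefill-npu-3")  # prefill side writes
cached = pool.get("system prompt A")    # decode side reads: hit, no location needed
missing = pool.get("system prompt B")   # never prefilled: miss
print(cached, missing, pool.hit_rate())
```

The decode-side call passes only the prompt prefix; where the bytes physically live is an internal detail of the pool, which is the property the article attributes to memory pooling.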
Taking the 671B parameter DeepSeek-R1 model as an example, through FusedDispatch fused operators and AIV Direct Connect, token distribution latency was reduced from 800 microseconds to 300 microseconds. Prefill computational efficiency improved to 4.45 tokens/second/TFLOPS, surpassing NVIDIA H100’s 3.75 tokens/second/TFLOPS.
Furthermore, under the constraint of TPOT < 50ms, decode throughput reached 1,943 tokens/second/NPU. Even when tightened to TPOT < 15ms, it maintained 538 tokens/second, verifying the stability of fully symmetric interconnectivity in strict latency scenarios.
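The trade-off between latency and throughput follows from a simple relation: during decode, each in-flight sequence emits about one token per TPOT interval, so throughput is roughly the number of concurrent sequences divided by TPOT. The sketch below inverts that relation. It treats each SLO bound as the actual TPOT and assumes the 538 figure is per NPU like the 1,943 figure, so the results are rough upper estimates rather than numbers from the paper.

```python
def implied_concurrency(tokens_per_s: float, tpot_ms: float) -> float:
    """Little's-law style estimate: in-flight sequences = throughput x time per token."""
    return tokens_per_s * tpot_ms / 1000.0


relaxed = implied_concurrency(1943, 50)  # TPOT < 50 ms
strict = implied_concurrency(538, 15)    # TPOT < 15 ms

print(f"~{relaxed:.0f} sequences per NPU at 50 ms, ~{strict:.0f} at 15 ms")
```

On these assumptions, the tighter SLO is met by running far fewer sequences at once on each NPU, which is why throughput drops as the latency target tightens.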

Cloud-Native: No Need to Worry About Hardware Details, Ready-to-Use on Huawei Cloud
Besides “fully symmetric interconnectivity,” the second key technical term in this paper is undoubtedly “Cloud.”
In simple terms, this is a cloud-oriented infrastructure software stack. It acts like an “intelligent butler team,” transforming complex hardware devices into a “cloud computing supermarket” accessible to everyone.
Notably, long before CloudMatrix384 was introduced, the Huawei Cloud team had already determined that next-generation AI data centers would be built on a “cloud-oriented” foundation, demonstrating Huawei’s foresight in its technological strategic layout.
Through over two years of refinement, the team has made deploying CloudMatrix384 a “zero-threshold” process. Users can deploy it directly without worrying about hardware details.
△ Huawei Cloud Infrastructure Software Stack for Deploying CloudMatrix384
Overall, this cloud-oriented infrastructure software stack primarily consists of several major modules: MatrixResource, MatrixLink, MatrixCompute, MatrixContainer, and the top-level ModelArts platform. Their roles are clearly defined yet mutually collaborative.
Let’s first look at MatrixResource.
It serves as the “resource allocation butler” in the software stack, primarily responsible for supplying physical resources within super nodes, including topology-aware computing instance allocation.
By running MatrixResource agents on Qingtian cards installed in each computing node, it dynamically manages the allocation of hardware resources such as NPUs and CPUs, ensuring efficient scheduling according to topological structures and avoiding cross-node communication bottlenecks.
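Topology-aware allocation can be pictured as a placement rule that keeps a request on as few nodes as possible. The following is a hedged sketch of one such rule, not MatrixResource's actual algorithm: `allocate`, the node names, and the free-NPU counts are all hypothetical.

```python
from typing import Dict


def allocate(free_by_node: Dict[str, int], requested: int) -> Dict[str, int]:
    """Place `requested` NPUs on as few nodes as possible (illustrative policy)."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    if requested > sum(free_by_node.values()):
        raise ValueError("not enough free NPUs")

    # Best fit: the smallest single node that can hold the whole request.
    fitting = [(free, node) for node, free in free_by_node.items() if free >= requested]
    if fitting:
        _, node = min(fitting)
        return {node: requested}

    # Otherwise fill the emptiest nodes first, which minimizes the nodes spanned.
    plan: Dict[str, int] = {}
    remaining = requested
    for node, free in sorted(free_by_node.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            plan[node] = take
            remaining -= take
    return plan


free = {"node-a": 8, "node-b": 3, "node-c": 5}  # hypothetical free NPUs per node
print(allocate(free, 4))   # fits on one node
print(allocate(free, 10))  # must span nodes
```

A request that fits on one node never crosses a node boundary, and larger requests touch the minimum number of nodes, which is the kind of outcome the article describes as avoiding cross-node communication bottlenecks.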
MatrixLink acts as the “network communication butler.”
It provides service-oriented functions for UB and RDMA networks, supporting QoS guarantees, dynamic routing, and network-aware workload placement. It optimizes communication efficiency among 384 NPUs within super nodes and across nodes. For example, in inference scenarios, it assists in improving inference efficiency by 20% through parallel transmission and multi-path load balancing technologies.
MatrixCompute plays the role of the “logical super node butler.”
Its task is to manage the full lifecycle of super nodes (“birth, aging, sickness, and death”), from bare-metal provisioning and boot-up to auto-scaling and fault recovery.
Specifically, it orchestrates resources across physical nodes, constructing dispersed hardware components into tightly coupled logical super node instances, achieving elastic resource expansion and high availability.
MatrixContainer is the “container deployment butler.”
Its function allows users to deploy AI applications onto super nodes as easily as sending a “package.” Based on Kubernetes container technology, it packages complex AI programs into standardized containers. Users simply need to “click deploy,” and the system automatically assigns them to suitable hardware for execution.
Finally, there is ModelArts, the “AI full-process butler.”
Located at the top of the software stack, it provides end-to-end services from model development and training to deployment, including ModelArts Lite (bare-metal/containerized hardware access), ModelArts Standard (complete MLOps pipeline), and ModelArts Studio (Model-as-a-Service, MaaS).
Novice users can use ModelArts Lite to invoke hardware computing power directly; advanced users can use ModelArts Standard to manage the entire process of training, optimization, and deployment; enterprise users can use ModelArts Studio to turn models into API services (such as chatbots) with a one-click publish.
Thus, on top of CloudMatrix384’s inherent efficiency, the cloud-native infrastructure software stack is like “adding wings to a tiger,” making deployment even more convenient.
Software-Hardware Integration: Efficient and Convenient, Yet Flexible
In addition to the keywords “fully symmetric interconnectivity” and “cloud-native,” the paper also highlights the flexibility that comes from integrating them into a unified software-hardware system.
Take the earlier point that users need only call APIs without attending to underlying hardware details. Huawei Cloud’s EMS (Elastic Memory Service) uses memory pooling to aggregate CPU-attached DRAM into a shared memory pool that NPUs can access directly, enabling KV cache reuse. This reduces first-token latency by 80% while cutting NPU procurement costs by approximately 50%.
Additionally, MatrixCompute supports automatic scaling of super-node instances. For instance, it dynamically adjusts the number of NPUs in prefilling/decoding clusters based on workload, maintaining a decoding throughput of 538 tokens per second even under strict TPOT constraints of 15ms.
Through deterministic operations services and Ascend Cloud Brain technology, fault recovery for a cluster of ten thousand cards can be achieved within 10 minutes. For HBM or network link failures, the system targets a 30-second recovery time; for example, the impact of optical module failures is reduced by 96%, ensuring the continuity of training and inference tasks.
The software stack also supports multi-tenant partitioning of super-node resources. Different users can share hardware resources while maintaining logical isolation—for instance, using namespaces to isolate cache data for different models, ensuring data security and fair resource allocation.
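Namespace-based isolation can be sketched as a cache whose keys are always scoped by tenant or model, so identical keys in different namespaces never collide. This is a minimal illustration of the idea only: `NamespacedCache` and the namespace names are hypothetical, not the stack's real interface.

```python
from typing import Optional


class NamespacedCache:
    """Toy cache in which every entry is scoped to a namespace (hypothetical)."""

    def __init__(self) -> None:
        self._data = {}  # (namespace, key) -> value

    def put(self, namespace: str, key: str, value: bytes) -> None:
        self._data[(namespace, key)] = value

    def get(self, namespace: str, key: str) -> Optional[bytes]:
        # A lookup can only ever see entries written under the same namespace.
        return self._data.get((namespace, key))

    def usage(self, namespace: str) -> int:
        """Number of entries held by one namespace, e.g. for fair-share accounting."""
        return sum(1 for ns, _ in self._data if ns == namespace)


cache = NamespacedCache()
cache.put("model-a", "prefix-1", b"kv for model a")
cache.put("model-b", "prefix-1", b"kv for model b")  # same key, different namespace

print(cache.get("model-a", "prefix-1"))
print(cache.get("model-c", "prefix-1"))  # other tenants see nothing
```

Because the namespace is part of every key, two models can share the same physical pool without reading each other's cache entries, and per-namespace usage can be counted for fair allocation.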
Intelligent scheduling enables “daytime inference, nighttime training.” Inference tasks run during the day, while idle computing power is utilized for model training at night. Nodes switch between training and inference in less than 5 minutes, improving computing power utilization.
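A time-based policy of this kind can be sketched in a few lines, along with the cost of switching. The day window of 08:00 to 22:00 and the assumption of two switches per day are illustrative choices, not figures from the paper; only the 5-minute switch time (an upper bound) comes from the text above.

```python
def role_for_hour(hour: int, day_start: int = 8, day_end: int = 22) -> str:
    """Return the workload a node should run at a given hour (hypothetical window)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return "inference" if day_start <= hour < day_end else "training"


def switch_overhead_fraction(switches_per_day: int = 2, minutes_per_switch: float = 5.0) -> float:
    """Fraction of a 24-hour day spent switching between roles."""
    return switches_per_day * minutes_per_switch / (24 * 60)


print(role_for_hour(12), role_for_hour(2))   # inference training
print(f"{switch_overhead_fraction():.2%}")   # 0.69%
```

With two switches a day at up to 5 minutes each, switching costs well under 1% of the day, so nearly all otherwise idle night-time capacity remains available for training.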
It is understood that CloudMatrix384 has been launched across four major Huawei Cloud nodes: Ulanqab, Horinger, Gui’an, and Wuhu. Users can provision computing power on demand without building their own hardware environments. A 10-millisecond latency circle covers 19 urban agglomerations nationwide, supporting low-latency access.
Furthermore, CloudMatrix384 provides full-stack intelligent operations capabilities. For example, the fault knowledge base of Ascend Cloud Brain already covers 95% of common scenarios, with a one-click diagnosis accuracy rate reaching 80%. Network fault diagnostics take less than 10 minutes, effectively lowering the threshold for operations and maintenance (O&M).
Breaking the “Impossible Triangle”
At this point, we can make a simple summary.
Huawei’s CloudMatrix384 breaks the traditional “impossible triangle” between computing power, latency, and cost through its “fully peer-to-peer architecture + software-hardware synergy” model.
On the hardware level, its fully peer-to-peer UB bus achieves an inter-card bandwidth of 392GB/s, allowing 384 NPUs to work efficiently in coordination. In EP320 expert parallel mode, token distribution latency is controlled within 100 microseconds.
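For readers unfamiliar with the “token distribution” step in expert parallelism, the sketch below shows the routing logic in its simplest form: each token picks its top-k experts by gating score, and tokens are bucketed per expert before being sent to whichever NPU hosts that expert. The sizes and random scores are toy assumptions, and this plain-Python version says nothing about how the fused operators in the paper are implemented.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dispatch(token_scores: Sequence[Sequence[float]], top_k: int) -> Dict[int, List[int]]:
    """Map each expert id to the list of token ids routed to it (top-k gating)."""
    buckets = defaultdict(list)
    for token_id, scores in enumerate(token_scores):
        # Rank experts for this token from highest to lowest gating score.
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        for expert_id in ranked[:top_k]:
            buckets[expert_id].append(token_id)
    return dict(buckets)


rng = random.Random(0)                      # fixed seed for repeatability
num_tokens, num_experts, top_k = 16, 8, 2   # toy sizes, not the paper's configuration
scores = [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]
routing = dispatch(scores, top_k)

for expert_id in sorted(routing):
    print(f"expert {expert_id}: tokens {routing[expert_id]}")
```

Every token is sent to `top_k` destinations on every MoE layer, so this exchange happens very frequently, which is why its latency is singled out as a key metric.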
On the software level, CloudMatrix-Infer adopts a fully peer-to-peer inference architecture, large-scale EP parallelism, Ascend-customized fused operators, and UB-driven disaggregated memory pools to maximize hardware efficiency.
This design makes high computing power, low latency, and controllable costs simultaneously possible. In short, with CloudMatrix384, cloud-based large model deployment solutions have become significantly more attractive.
The cloud allows for unified planning at the data center level, constructing specialized high-speed network topologies that break through the physical limitations of individual enterprises.
More importantly, the cloud supports elastic scaling. Enterprises can dynamically adjust resource scales according to business needs, expanding from dozens of cards to hundreds without modifying physical infrastructure.

Moreover, choosing the cloud means users do not need to find professional teams to handle complex issues such as model optimization, distributed training, and fault handling.
CloudMatrix384’s automated O&M design further reduces fault impact by 96% and keeps fault recovery time for ten-thousand-card clusters within 5 minutes. This level of specialized operational capability is difficult for most enterprises to build independently.
More significantly, the cloud-based AI service model represented by CloudMatrix384 provides Chinese enterprises with a more realistic path for AI implementation.
For example, bringing DeepSeek-R1 from model adaptation to online launch took only 72 hours, compared with two weeks for traditional solutions, a significant efficiency improvement.
This advantage in cost and efficiency allows more enterprises to experiment with AI applications without bearing the risk of massive infrastructure investment.
CloudMatrix384 proves that domestic cloud solutions are not just “usable” but also possess competitive advantages in both performance and cost-effectiveness.
AI Infrastructure Is Being Redefined
CloudMatrix384 represents not only a stronger AI supercomputer but also a redefinition of “what constitutes AI infrastructure.”
Technologically, it disrupts the past CPU-centric hierarchical design through UB, turning an entire super-node into a unified computing entity.
Looking to the future, Huawei’s paper outlines two development paths: on one hand, continuing to expand node scale; on the other, pursuing stronger decoupling.
Expanding scale is easy to understand: as LLM parameter sizes grow in the future, more tightly coupled computing resources will be required.
Decoupling can be viewed from both resource and application dimensions.
In terms of resources, CPU and NPU resources will physically separate into dedicated resource pools, moving from logical decoupling to physical decoupling to achieve better resource utilization.

In terms of applications, memory-intensive attention calculations during large model inference will be decoupled from the decoding path. Attention and expert components will also separate into independent execution services.
In summary, the authors depict a fully decoupled, adaptive, heterogeneous AI data center architecture. This architecture will further enhance scalability, flexibility, efficiency, and performance.
In the future, computing resources will no longer be fixed physical devices but abstract capabilities that can be dynamically orchestrated.
Through CloudMatrix384 and its future vision, we are witnessing yet another technological iteration, as well as a profound transformation in the paradigm of AI data centers.