Xingrongyuan releases the Xingzhi AI network solution, a bearer network for LLM large models


Artificial intelligence is the core driving force of the digital economy, and AI large models are its new engine. In recent years, with the rapid development of generative AI (AIGC) such as ChatGPT, industry leaders have raced to launch models with trillions or even tens of trillions of parameters, pushing the underlying GPU clusters to the 10,000-card scale. Supporting training tasks of this magnitude poses unprecedented challenges to the scale, performance, reliability, and stability of the network.

The amount of computation in AI applications is growing exponentially, and models are becoming ever larger: today's ultra-large AI models have reached hundreds of billions or even trillions of parameters. Training such models unquestionably requires ultra-high computing power. Ultra-large AI models are trained on GPUs, the interconnection network needs to run at 100 Gbps to 400 Gbps, and RDMA is used to cut transmission latency and raise network throughput.

In AI large model training, intra-node and inter-node collective communication operations generate a large amount of traffic. Pipeline parallelism, data parallelism, and tensor parallelism each require different communication operations, which places high demands on single-port bandwidth, the number of available links between nodes, and total network bandwidth.
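
To make the traffic pattern concrete, the sketch below (my own illustration, assuming PyTorch with the NCCL backend launched via torchrun; it is not part of the Xingzhi solution) shows the per-step gradient all-reduce that data parallelism relies on; tensor and pipeline parallelism add further all-gather, reduce-scatter, and point-to-point traffic on top of this.

```python
# Minimal data-parallel collective sketch (assumes torchrun sets RANK/WORLD_SIZE etc.).
import torch
import torch.distributed as dist

def allreduce_gradients(grads):
    """Sum gradient tensors across all ranks, then average them (data-parallel pattern)."""
    world_size = dist.get_world_size()
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)  # inter-node traffic rides RDMA/RoCE via NCCL
        g /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    grads = [torch.randn(1024, 1024, device="cuda")]  # stand-in for real model gradients
    allreduce_gradients(grads)
    dist.destroy_process_group()
```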

Network jitter makes collective communication inefficient, which in turn hurts the training efficiency of large AI models. Keeping the network stable and efficient throughout the training cycle is therefore an extremely important goal, and it brings new challenges to network O&M.

The network delay incurred during data transmission consists of two parts, static delay and dynamic delay, of which dynamic delay has the greater impact on network performance. Dynamic delay covers in-switch queuing delay and packet-loss retransmission delay, and it is usually caused by network congestion and packet loss.
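
Written out explicitly (my own notation, not from the original article; the static terms are the usual fixed components), the decomposition is:

```latex
% Static terms are fixed by hardware and topology; dynamic terms grow with congestion and loss.
T_{\text{total}}
  = \underbrace{T_{\text{propagation}} + T_{\text{serialization}} + T_{\text{forwarding}}}_{\text{static delay}}
  + \underbrace{T_{\text{queuing}} + T_{\text{retransmission}}}_{\text{dynamic delay}}
```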

The sheer cluster size in AI large model training further increases configuration complexity. With such a huge architecture and so many configuration items, allowing business personnel to simplify configuration and deployment while still guaranteeing overall business efficiency becomes yet another challenge.

The requirements that large AI models place on the network thus center on scale, bandwidth, latency, and stability. Measured against the actual capabilities of today's data center networks, there is still a technology gap to fully meeting these requirements.

As the computing power required for large model training keeps rising, intelligent computing GPU clusters have grown from the thousand-card to the ten-thousand-card level. To build beyond 10,000 cards, the traditional network solution is a CLOS architecture. A server typically carries 8 GPU cards, and its 8 corresponding NICs connect to the 8 server leaves of a single HB domain, so that GPUs with the same card number communicate through the same server leaf. To guarantee line rate, every tier keeps a 1:1 non-blocking (no-oversubscription) ratio: taking a 128-port box switch as an example, the server leaf and spine devices each allocate 64 ports for downlink and 64 for uplink, while all 128 ports of the super spine are used for downlink access. Based on this port plan, the overall network comprises 8 HB domains, 64 pods, and 64 fabrics, with an access scale of 32,768 NICs.

It is easy to see that this overall architecture is extremely complex: network construction cost is high, path hop counts are large, and subsequent O&M and troubleshooting are very difficult.

The cost of a full mesh network is high.

Traffic that crosses leaf switches takes 3 hops, and traffic that crosses pods takes even more, which greatly increases service latency.

The network structure is complex, making O&M and troubleshooting difficult.

Take 32,768 GPUs and 128-port box switches as an example (these figures are derived in the sketch after this list):

Number of CLOS layers: 3 layers.

Number of switches: 1,280 = (64 + 64) × 8 + 256

Number of optical transceivers: 196,608
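
A rough derivation of these figures (my own arithmetic, assuming 128-port switches, a 1:1 ratio at every tier, and two optical transceivers per link):

```python
# Back-of-the-envelope check of the 3-tier CLOS figures quoted above.
PORTS = 128          # ports per box switch
NICS = 32_768        # GPU NICs to be connected

leaves = NICS // (PORTS // 2)                   # 64 downlinks per server leaf -> 512
spines = leaves                                  # 1:1, 64 uplinks per leaf     -> 512
super_spines = spines * (PORTS // 2) // PORTS    # 512 x 64 uplinks / 128 ports -> 256
switches = leaves + spines + super_spines        # 1,280

links = NICS + leaves * (PORTS // 2) + spines * (PORTS // 2)   # 3 x 32,768 = 98,304
transceivers = 2 * links                                       # 196,608

print(switches, transceivers)   # 1280 196608
```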

To close this gap, Xingrongyuan has launched the Xingzhi AI network solution, which builds a large-scale, low-latency, high-bandwidth, highly stable, and automatically deployed AI bearer network for LLM large model scenarios.

1. Solution introduction

Compared with the traditional solution, the Xingzhi AI network removes the cross-server connections between GPUs with different card numbers, retains only the leaf-layer switches directly connected to the GPUs, and turns all the ports originally reserved for spine uplinks into GPU downlinks, further improving the connection efficiency of the leaf switches. Under this architecture, communication between different HB domains can still be achieved through **.

Network ports with the same number on different compute nodes must be connected to the same switch. For example, RDMA network port No. 1 of intelligent computing server 1, RDMA network port No. 1 of intelligent computing server 2, and RDMA network port No. 1 of intelligent computing server n are all connected to switch No. 1.

Inside each intelligent computing server, the upper-layer communication library selects paths based on the intra-node topology, so that GPU cards with the same number are bound to network ports with the same number. In this way, GPUs with the same number on any two intelligent computing nodes can communicate with each other in a single hop.
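
A small sketch of this wiring rule (hypothetical names, for illustration only): every server's NIC k lands on leaf switch k, so same-numbered GPUs on different servers are always one switch hop apart.

```python
# Rail-aligned cabling plan: (server, nic_index) -> leaf switch.
def rail_wiring(num_servers: int, nics_per_server: int = 8) -> dict:
    return {
        (server, nic): f"leaf-{nic}"
        for server in range(num_servers)
        for nic in range(nics_per_server)
    }

plan = rail_wiring(num_servers=4)
assert plan[(0, 1)] == plan[(3, 1)] == "leaf-1"   # same rail: one switch hop
assert plan[(0, 1)] != plan[(0, 2)]               # different rails: no direct leaf path
```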

With the rail-local capability of the NCCL communication library, the NVSwitch bandwidth between GPUs inside a host can be fully utilized, converting GPU-to-GPU traffic between machines into GPU-to-GPU traffic within a machine.

The Xingzhi AI network solution therefore makes it easy to build the 10,000-card network of an intelligent computing center, meeting users' network construction needs while avoiding the shortcomings of traditional networks in this scenario.

Without compromising performance, the network architecture is simplified, greatly reducing users' network construction costs.

The network requires only one hop, reducing service latency.

The network structure is simplified, reducing the difficulty of O&M and troubleshooting.

Take 32,768 GPUs and 128-port box switches as an example (a comparison with the traditional figures follows this list):

Number of CLOS layers: 1 layer (rail only).

Switches required: 256.

Number of optical transceivers: 65,536

Network cost is reduced by up to 75%.
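
The corresponding check for the rail-only design, and the savings against the CLOS figures above (same assumptions as the earlier sketch; the 75% figure is the vendor's overall cost estimate, which also depends on device and optics pricing):

```python
# Rail-only fabric: every switch port is a GPU downlink, no uplinks needed.
PORTS, NICS = 128, 32_768

rail_switches = NICS // PORTS       # 256
rail_transceivers = 2 * NICS        # 65,536 (two optics per GPU-to-leaf link)

clos_switches, clos_transceivers = 1_280, 196_608
print(f"switches:     -{1 - rail_switches / clos_switches:.0%}")         # -80%
print(f"transceivers: -{1 - rail_transceivers / clos_transceivers:.0%}") # -67%
```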

2. Solution advantages

Performance improvement: increase single-node network bandwidth

1) Increase the number of network cards: when business volume is small at the start, the CPU and GPU can share NICs; later, provide 1 to 2 dedicated NICs for the CPU and 4 or 8 NICs for the GPUs;

2) Increase the bandwidth of a single NIC, and match it to the host PCIe bandwidth and the switch port bandwidth (a rough bandwidth check follows).
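
As a rough illustration of the bandwidth-matching point (my own nominal figures, not from the article; real throughput is lower after protocol overhead):

```python
# Per-direction usable PCIe bandwidth vs. NIC line rate (approximate nominal values).
PCIE_GBPS = {"gen4_x16": 252, "gen5_x16": 504}

def slot_feeds_nic(nic_gbps: int, slot: str) -> bool:
    """True if the PCIe slot can keep the NIC busy at line rate in one direction."""
    return PCIE_GBPS[slot] >= nic_gbps

print(slot_feeds_nic(200, "gen4_x16"))   # True: a 200G NIC fits PCIe 4.0 x16
print(slot_feeds_nic(400, "gen4_x16"))   # False: a 400G NIC is bottlenecked
print(slot_feeds_nic(400, "gen5_x16"))   # True: it needs PCIe 5.0 x16
```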

Performance improvement: apply an RDMA network (RoCE)

1) RDMA technology reduces the number of data copies during GPU communication, optimizes the communication path, and lowers communication latency;

2) Easy RoCE technology delivers complex RoCE-related configurations (such as PFC and ECN) with one click, effectively reducing users' O&M complexity.

Performance improvement: reduce network congestion

1) Reduce network latency and improve GPU efficiency: ultra-low latency of 400 ns;

2) Reduce network congestion through the DCB protocol suite: PFC, PFC Watchdog, and ECN together build a lossless, low-latency all-Ethernet network (a schematic sketch of how ECN and PFC cooperate follows).
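
The sketch below is purely schematic (not the vendor's implementation or switch CLI) and shows the usual division of labor on a lossless RoCE queue: ECN marking starts first so senders slow down, and PFC pause is the last-resort backstop against packet loss.

```python
# Schematic congestion handling on a DCB-enabled egress queue; thresholds are illustrative.
def congestion_actions(queue_depth_kb: int,
                       ecn_mark_kb: int = 200,
                       pfc_xoff_kb: int = 800) -> list[str]:
    actions = []
    if queue_depth_kb >= ecn_mark_kb:
        actions.append("mark packets with ECN so RoCE senders reduce their rate")
    if queue_depth_kb >= pfc_xoff_kb:
        actions.append("send PFC pause frames upstream to prevent packet loss")
    return actions or ["forward normally"]

print(congestion_actions(100))   # ['forward normally']
print(congestion_actions(500))   # ECN marking only
print(congestion_actions(900))   # ECN marking and PFC pause
```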

With the advent of large-model applications such as ChatGPT, Copilot, and Wenxin Yiyan, the intelligent computing center networks that carry AI large models will also see a new round of upgrades. Xingrongyuan continues to invest in R&D, and the Xingzhi AI network solution has already been recognized in customer field tests. We will work with AI vendors to gradually bring the key technologies of intelligent computing center networks for AI large models to maturity and deployment, keep pursuing better solutions for user scenarios, and look forward to joining hands with many partners to build large-scale, high-bandwidth, high-performance, low-latency, and intelligent computing center networks for AI large models.

Follow the WeChat official account "Xingrongyuan Asterfusion" for more technical sharing and the latest product news.
