Will InfiniBand ride the AI wave, or will Ethernet continue to hold its ground?

Mondo Technology Updated on 2024-02-01

The surest way to reduce latency today is to use NVIDIA interconnect technology wherever possible. Of course, if you can tolerate slower training, sticking with traditional networking is still an option.

Dell'Oro analyst Sameh Boujelbene said the growing demand for AI capabilities will drive the data center switching market up by 50%. He also said that a major wave of technological innovation is about to emerge in the field of network switching.

Boujelbene estimates that AI systems currently account for "well under 10 percent" of the total network switching market, and that about 90 percent of those deployments use NVIDIA's Mellanox InfiniBand rather than traditional Ethernet. These deployments have pushed NVIDIA's networking revenue to $10 billion a year, overtaking Juniper and Arista to become the second-largest player in the space.

And this is no accident: for AI workloads, bandwidth and latency are always top priorities. InfiniBand achieves very low latency largely because its architecture is built to avoid packet loss, whereas Ethernet drops packets far more readily.

Many applications can shrug off packet loss, but it slows down AI training, which is already costly and time-consuming. That is probably why Microsoft chose InfiniBand when building out data centers for machine learning workloads.

However, InfiniBand has shortcomings of its own, starting with raw transmission bandwidth, where its ceiling often trails Ethernet's. NVIDIA's latest Quantum InfiniBand switches top out at 25.6 Tb/s of aggregate throughput, or 400 Gb/s per port; Ethernet switching, by contrast, reached 51.2 Tb/s, or 800 Gb/s per port, almost two years ago.
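As a quick sanity check on those figures (pure arithmetic on the numbers quoted above, not on any particular product datasheet), both chips drive the same number of full-rate ports; the Ethernet silicon simply pushes twice the bandwidth through each one:

```python
# Port-count arithmetic from the aggregate and per-port figures quoted above.

def full_rate_ports(aggregate_tbps: float, port_gbps: float) -> float:
    """How many full-rate ports an aggregate switching capacity can drive."""
    return aggregate_tbps * 1000 / port_gbps

print(full_rate_ports(25.6, 400))  # Quantum InfiniBand: 64 ports at 400 Gb/s
print(full_rate_ports(51.2, 800))  # 51.2T Ethernet:     64 ports at 800 Gb/s
```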

In a traditional data center, speeds like these are found only at the aggregation layer, and it's rare for an ordinary server node to be given a 400 Gb/s port at all, let alone use even a quarter of that bandwidth.

But the situation in AI clusters is completely different. A typical AI node needs a 400 Gb/s network card for each GPU, a single node can hold four to eight GPUs (with the NIC count scaled to match), and those links are kept busy by the huge data streams AI workloads generate.
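To put that in perspective, here is a back-of-the-envelope sketch of the aggregate network bandwidth a single node demands under the assumptions above (one 400 Gb/s NIC per GPU, four to eight GPUs per node):

```python
# Aggregate NIC bandwidth per AI node, assuming one 400 Gb/s NIC per GPU.

NIC_GBPS = 400  # per-GPU network card speed cited in the article

for gpus_per_node in (4, 8):
    total_gbps = gpus_per_node * NIC_GBPS
    print(f"{gpus_per_node} GPUs -> {total_gbps} Gb/s ({total_gbps / 1000:.1f} Tb/s) per node")

# 4 GPUs -> 1600 Gb/s (1.6 Tb/s) per node
# 8 GPUs -> 3200 Gb/s (3.2 Tb/s) per node
```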

Boujelbene compares the two competing standards to roads: InfiniBand is the slower but less congested national highway, while Ethernet is the highway with higher speed limits but the occasional crash.

And while Ethernet holds the technical advantage in transmission bandwidth, that edge is often wiped out by other real-world bottlenecks, such as the PCIe bandwidth available to the network card.

As 2024 begins, the most advanced standard we can choose from is PCIe 5.0, where each lane carries roughly 32 Gb/s in each direction (about 64 Gb/s bidirectional), which means a full x16 connection is required to support a single 400 Gb/s interface.
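A rough calculation using the nominal PCIe 5.0 figures (32 GT/s per lane with 128b/130b encoding; real-world throughput is somewhat lower once protocol overhead is counted) shows why x8 falls short and x16 suffices:

```python
# Usable PCIe 5.0 bandwidth per direction vs. a single 400 Gb/s NIC.
# Nominal figures only: 32 GT/s per lane, 128b/130b line encoding.

GT_PER_LANE = 32        # GT/s per PCIe 5.0 lane
ENCODING = 128 / 130    # 128b/130b line-code efficiency
NIC_GBPS = 400          # one 400 Gb/s network interface

for lanes in (8, 16):
    usable_gbps = lanes * GT_PER_LANE * ENCODING
    verdict = "enough" if usable_gbps >= NIC_GBPS else "not enough"
    print(f"x{lanes}: ~{usable_gbps:.0f} Gb/s per direction -> {verdict}")

# x8:  ~252 Gb/s per direction -> not enough
# x16: ~504 Gb/s per direction -> enough
```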

Some chipmakers, including NVIDIA, have cleverly integrated PCIe switching into their network cards to improve performance. Instead of hanging both the GPU and the NIC off the CPU's own PCIe lanes, this design daisy-chains the network interface to the GPU through an on-card PCIe switch. We suspect that, until the PCIe 6.0 or 7.0 standards arrive, NVIDIA will keep reaching 800 Gb/s and 1,600 Gb/s of network throughput this way.

Dell'Oro expects that by 2025 the vast majority of switch ports deployed in AI networks will run at 800 Gb/s, and that by 2027 this figure will double to 1,600 Gb/s.

Beyond keeping its bandwidth lead, the Ethernet switching world has also been innovating recently to close its gaps with InfiniBand interconnect technology.

NVIDIA has planned for this as well. Ironically, with the introduction of the Spectrum-X platform, NVIDIA has become one of the biggest proponents of lossless Ethernet.

As Gilad Shainer, vice president of marketing for NVIDIA's networking division, explained in an earlier interview, InfiniBand is better suited to users running a small number of hyperscale workloads, such as GPT-3 training or digital-twin modeling, while in more dynamic, large-scale cloud environments Ethernet tends to be the preferred choice.

Ethernet's openness and ability to adapt to most workloads are the reasons why it is so popular with cloud service providers and hyperscale infrastructure operators. Whether it's to avoid the hassle of managing a dual-stack network or to prevent lock-in by a handful of InfiniBand vendors, they have good reasons to choose Ethernet technology.

NVIDIA's Spectrum-X portfolio pairs its own 51.2 Tb/s Spectrum-4 Ethernet switch with the BlueField-3 SuperNIC running 400 Gb/s RDMA over Converged Ethernet (RoCE), delivering network performance, reliability, and latency comparable to InfiniBand.

Broadcom has taken a similar approach with its Tomahawk and Jericho switch families, which either rely on data processing units to manage congestion or handle it in top-of-rack switches through the Jericho3-AI platform released last year.

Boujelbene said Broadcom is clearly seeing success with hyperscale infrastructure operators and cloud service providers such as Amazon Web Services. NVIDIA's Spectrum-X, for its part, consolidates this work into a single platform that makes lossless Ethernet easy to set up.

While Microsoft clearly favors InfiniBand for its AI cloud infrastructure, Amazon Web Services is interconnecting its 16,384-GPU GH200 compute cluster (officially announced at re:Invent in late 2023) using the improved congestion management of its Elastic Fabric Adapter 2 (EFA2).

While Dell'Oro expects InfiniBand to keep its dominant position in AI switching for the foreseeable future, it also expects significant growth on the Ethernet side, with Ethernet gaining roughly 20 points of revenue share by 2027. The main drivers behind that shift are cloud service providers and hyperscale data center operators.
