Cisco and NVIDIA expand their collaboration to advance Ethernet in AI networks

Mondo Technology Updated on 2024-02-09

At Cisco Live in Amsterdam on Tuesday, enterprise networking giant Cisco announced a partnership with NVIDIA to launch a series of hardware and software platforms tailored to the buzzwords of the moment: artificial intelligence and machine learning (AI/ML).

A key goal of the partnership is to make AI systems easier to deploy and manage over standard Ethernet, a technology well understood by anyone who has gone through the trouble of earning a CCNA or CCNP certification.

The GPUs powering AI clusters get most of the attention, but the high-performance, low-latency networks required to support those clusters can be quite complex. Modern GPU nodes benefit heavily from 200Gb/s, 400Gb/s, and upcoming 800Gb/s networking, but that's only part of the story, especially when it comes to training models. Because these workloads typically must be distributed across many servers, each with four or eight GPUs, any additional latency translates directly into longer training times.
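To get a feel for the scale involved, here is a minimal back-of-envelope sketch in Python using the classic ring all-reduce cost model for gradient synchronization. The gradient size, node count, and latency figures are illustrative assumptions, not measurements from any system discussed here:

```python
# Toy cost model for ring all-reduce gradient synchronization.
# All figures below are illustrative assumptions, not measurements.

def ring_allreduce_seconds(grad_bytes: float, nodes: int,
                           link_gbps: float, hop_latency_s: float) -> float:
    """Classic ring all-reduce: each node transfers 2*(N-1)/N of the
    payload and incurs 2*(N-1) latency-bound communication steps."""
    bandwidth_bytes_per_s = link_gbps * 1e9 / 8
    transfer = 2 * (nodes - 1) / nodes * grad_bytes / bandwidth_bytes_per_s
    latency = 2 * (nodes - 1) * hop_latency_s
    return transfer + latency

GRAD_BYTES = 10e9  # hypothetical ~10 GB of gradients exchanged per step
NODES = 8          # eight GPU servers in the ring

for gbps in (200, 400, 800):
    t = ring_allreduce_seconds(GRAD_BYTES, NODES, gbps, hop_latency_s=5e-6)
    print(f"{gbps} Gb/s links: ~{t * 1000:.1f} ms per synchronization step")
```

Even in this toy model, doubling link speed only halves the bandwidth term; the latency term remains, which is why low-latency fabrics matter so much for training.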

As a result, NVIDIA's InfiniBand still dominates AI network deployments. Sameh Boujelbene, an analyst at Dell'Oro Group, recently estimated in an interview that about 90% of deployments use InfiniBand from NVIDIA's Mellanox unit rather than Ethernet.

That's not to say Ethernet isn't being taken seriously. Emerging technologies, such as SmartNICs and AI-optimized switch ASICs (application-specific integrated circuits) with deep packet buffers that help suppress packet loss, allow Ethernet to behave much more like InfiniBand.

For example, the Cisco Silicon One G200 switch ASIC, which we covered last summer, has a number of features that benefit AI networks, including advanced congestion management, packet spraying, and link failover. It's worth noting that these features aren't unique to Cisco; NVIDIA and Broadcom have introduced switches with similar capabilities in recent years.
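As a rough illustration of why packet spraying helps, the toy simulation below compares per-flow ECMP hashing with per-packet spraying across four uplinks. It is a deliberately simplified sketch with made-up flow sizes, not a model of the Silicon One G200 or any other ASIC:

```python
# Toy comparison: ECMP flow hashing vs. per-packet spraying across 4 uplinks.
# A simplified illustration only -- not a model of any specific switch ASIC.
import random

random.seed(0)
LINKS = 4
# A mix of "mice" and a few "elephant" flows (packet counts are made up).
flow_sizes = [random.choice([100, 100, 100, 5000]) for _ in range(16)]

# Per-flow hashing: every packet of a flow sticks to one link, so a couple
# of colliding elephant flows can saturate one link while others sit idle.
hashed = [0] * LINKS
for size in flow_sizes:
    hashed[random.randrange(LINKS)] += size  # stand-in for a 5-tuple hash

# Per-packet spraying: packets are distributed across all links, so load
# stays near uniform (reordering must then be absorbed downstream).
sprayed = [0] * LINKS
for size in flow_sizes:
    for pkt in range(size):
        sprayed[pkt % LINKS] += 1

print("flow-hash packets per link:", hashed)
print("spraying  packets per link:", sprayed)
```

The trade-off is packet reordering, which is why spraying is typically paired with reorder-tolerant transports or reordering logic at the receiving end.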

Dell'Oro projects that Ethernet will grow to roughly 20% of AI network revenue by 2027. One reason is the industry's familiarity with Ethernet: AI deployments may still require some specific tuning, but enterprises already know how to deploy and manage Ethernet infrastructure.

For NVIDIA, that alone makes partnering with networking vendors like Cisco an attractive prospect. While it may cannibalize some sales of NVIDIA's own InfiniBand or Spectrum Ethernet switches, the payoff is getting more GPUs into enterprises that might otherwise balk at deploying an entirely separate networking stack.

To support these efforts, Cisco and NVIDIA have introduced reference designs and systems intended to ensure compatibility and to close knowledge gaps around deploying the networking, storage, and compute infrastructure that underpins AI deployments.

These reference designs target platforms that enterprises may already have invested in, including suites from Pure Storage, NetApp, and Red Hat. They also promote Cisco's GPU-accelerated systems, and include reference designs and automation scripts that apply its FlexPod and FlashStack frameworks to AI inference workloads. Many expect inference, especially on small, domain-specific models, to become a major part of enterprise AI deployments, since such models are relatively inexpensive to run and train.

The FlashStack AI Cisco Validated Design (CVD) is a guide for deploying Cisco's networking and GPU-accelerated UCS systems alongside Pure Storage's flash arrays. FlexPod AI (CVD) appears to follow a similar pattern, swapping Pure for NetApp's storage platform. Cisco says these will be available later this month, with more NVIDIA-powered CVDs to come.

Speaking of Cisco's UCS compute platform, the company is also rolling out an edge-focused X-Series blade system packing NVIDIA's latest GPUs.

The X Direct chassis has eight slots that can be populated with a combination of two- or four-socket compute blades, or PCIe expansion nodes for GPU compute. Additional X-Fabric modules can also be used to expand the system's GPU capacity.

It's worth noting, though, that unlike systems from Supermicro, Dell, HPE, and others that support NVIDIA's most powerful SXM modules, Cisco's UCS X Direct systems appear to support only lower-power, PCIe-based GPUs.

According to the UCS X Direct datasheet, each server can be equipped with up to six compact GPUs, or up to two dual-slot, full-length, full-height GPUs.

This can be a limitation for users who want to run large language models that consume hundreds of gigabytes of GPU memory. However, it may be plenty for smaller inference workloads, such as preprocessing data at the edge.
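A quick back-of-envelope calculation shows why. The parameter counts and card capacities below are generic assumptions for illustration, not figures from the UCS X Direct datasheet:

```python
# Back-of-envelope GPU memory check for hosting model weights.
# Parameter counts and card capacities are illustrative assumptions.

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for weights alone (FP16 = 2 bytes/param),
    ignoring KV cache, activations, and framework overhead."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for params in (7, 13, 70):
    print(f"{params}B params @ FP16: ~{weights_gb(params):.0f} GB of weights")

# Two hypothetical 48 GB PCIe cards offer ~96 GB in total: enough for the
# smaller models above, but not for a 70B-class model's weights at FP16.
print("two 48 GB cards:", 2 * 48, "GB total")
```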

Cisco's platform is targeted at manufacturing, healthcare, and enterprises running small data centers.
