The graphics card battle! Nvidia and AMD are trading blows. Can GPU supremacy be maintained?

Mondo Digital Updated on 2024-01-30

Hello everyone, I'm Ergou.

Nvidia and AMD, the two chip giants, are going at each other!

It all started two weeks ago, when AMD Chair and CEO Lisa Su unveiled the Instinct MI300X, a next-generation GPU accelerator for generative AI and data centers, at a launch event.

Releasing a graphics card is nothing unusual in itself, but AMD claims that the MI300X achieves 40% lower latency than Nvidia's H100 when running inference on Meta's 70-billion-parameter Llama 2 model.

In other words, AMD says its MI300X chip performs better than Nvidia's flagship.

Nvidia, as the big brother of the GPU industry, was never going to take that lying down.

So, just last week, Nvidia went out of its way to publish an official blog post arguing that the H100 has best-in-class inference performance.

The Nvidia blog stated:

Best-in-class AI performance requires an efficient parallel computing architecture, an efficient tool stack, and deeply optimized algorithms. Nvidia has released the open-source NVIDIA TensorRT-LLM, which includes the latest kernel optimizations for the NVIDIA Hopper architecture at the heart of the NVIDIA H100 Tensor Core GPU. These optimizations enable models such as Llama 2 70B to run with accelerated FP8 operations on H100 GPUs while maintaining inference accuracy.

AMD claimed at its launch event that the MI300X has better inference performance than the H100 GPU, but AMD's tests did not use this optimization software. If the benchmark were run properly, the H100 would be twice as fast at inference.

In a nutshell, Nvidia's point is that AMD did not benchmark with Nvidia's optimization software or with the H100's support for the FP8 data type, instead testing with vLLM at FP16. Lower-precision data types generally trade accuracy for performance, so, in Nvidia's telling, AMD deliberately handicapped the H100. Nvidia's blog post also published measured results for a single NVIDIA DGX H100 server with eight H100 GPUs on the Llama 2 70B model. The tests include "batch-1" results, processing one inference request at a time, as well as results under a fixed response-time constraint.
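
To make the precision trade-off concrete, here is a toy illustration (pure Python, my own simplification, not NVIDIA's actual FP8 implementation): FP8 E4M3 keeps only 3 explicit mantissa bits versus FP16's 10, so individual values are rounded more coarsely.

```python
# Illustration only: simulate the precision loss of FP8 (E4M3, 3 mantissa bits)
# versus FP16 (10 mantissa bits) by rounding the mantissa. This is a toy model,
# not NVIDIA's actual FP8 arithmetic.
import math

def round_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to the given number of explicit mantissa bits (toy quantizer)."""
    if x == 0.0:
        return 0.0
    exp = math.floor(math.log2(abs(x)))
    step = 2.0 ** (exp - mantissa_bits)
    return round(x / step) * step

weight = 0.123456789
print(round_mantissa(weight, 10))  # FP16-like: ~0.12347 (small rounding error)
print(round_mantissa(weight, 3))   # FP8-like:  0.125    (much coarser)
```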

Nvidia claims that when benchmarked with its TensorRT-LLM framework and FP8, the H100 actually performs twice as well as the MI300X.

Nvidia also argues that AMD showed the best case for its own hardware by setting the batch size to 1, in other words, by processing only one inference request at a time. Nvidia doesn't consider this realistic, since most cloud providers trade latency for larger batch sizes.

According to Nvidia, a DGX H100 node with eight accelerators running Nvidia's optimized software stack can handle a batch size of 14 within the fixed response time, while a comparable node with eight AMD MI300X accelerators can only handle a batch size of 1.
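
The fixed-response-time scenario can be sketched with a toy model; all latency numbers below are invented purely for illustration and are not measurements of either accelerator.

```python
# Toy model of the throughput/latency trade-off Nvidia describes. All numbers
# are made up for illustration; they are not H100 or MI300X measurements.
def max_batch_within_budget(per_token_ms_batch1: float,
                            batching_overhead: float,
                            tokens: int,
                            budget_ms: float) -> int:
    """Largest batch whose response time still fits the latency budget,
    assuming per-token latency grows linearly with batch size."""
    batch = 1
    while True:
        per_token = per_token_ms_batch1 * (1 + batching_overhead * (batch - 1))
        if per_token * tokens > budget_ms:
            return batch - 1
        batch += 1

# Two hypothetical engines generating 100 tokens under a 2.5 s response budget:
# the faster one packs far more requests into the same window.
print(max_batch_within_budget(10.0, 0.08, 100, 2500))  # faster engine -> 19
print(max_batch_within_budget(18.0, 0.20, 100, 2500))  # slower engine -> 2
```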

Within a day of Nvidia's blog post, AMD published a response of its own, maintaining that its accelerators do have industry-leading performance and that Nvidia's benchmark is not an apples-to-apples comparison.

AMD accused Nvidia of using unreasonable benchmark settings:

1. Nvidia tested the H100 with TensorRT-LLM instead of the vLLM used in AMD's benchmark;

2. Nvidia compared the FP8 data type on the H100 against the FP16 data type on the AMD MI300X;

3. Nvidia converted the relative latency figures AMD published into absolute throughput.

AMD added:

We're in the midst of a product upgrade phase, and we're constantly looking for new ways to unlock performance with ROCm software and the AMD Instinct MI300 accelerator.

The data we presented at launch was recorded during testing in November. We have made significant progress since then, and we're excited to share our latest results highlighting that progress.

The chart below compares AMD's latest performance data for the MI300X running Llama 2 70B, with blue showing the MI300X and gray showing the H100.

It's easy to see that, under AMD's tests, the MI300X outperforms the H100 on both performance and latency.

AMD went on to say, "These results again show that MI300X using FP16 is comparable to the H100 even at Nvidia's recommended optimal performance settings of FP8 and TensorRT-LLM."

Nvidia has not commented on AMD's latest blog post, but the back-and-forth over benchmarks highlights the role that software libraries and frameworks play in squeezing AI performance out of the hardware.

One of Nvidia's main arguments is that AMD tested with vLLM instead of TensorRT-LLM, which is why the H100 came out at a disadvantage.
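
For reference, vLLM, the open-source inference engine AMD benchmarked with, is typically driven like the minimal sketch below; the model name, parallelism, and sampling settings are illustrative assumptions rather than either vendor's exact configuration.

```python
# Minimal sketch of offline inference with vLLM. Model path, tensor_parallel_size
# and dtype are illustrative assumptions, not the settings from either test.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # 70B model sharded across GPUs
    tensor_parallel_size=8,                  # one 8-GPU node, as in both tests
    dtype="float16",                         # FP16, the precision AMD reported
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain what a KV cache is."], params)
for out in outputs:
    print(out.outputs[0].text)
```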

TensorRT-LLM, announced in September and released at the end of October, bundles a deep learning compiler, optimized kernels, pre- and post-processing steps, and multi-GPU, multi-node communication primitives.

Nvidia claims the optimized software effectively doubles the H100's inference performance when running the 6-billion-parameter GPT-J model, and improves Llama 2 70B performance by 77%.

AMD made a similar announcement when it unveiled its ROCm 6 framework earlier this month, claiming that its latest AI stack improves LLM performance by 13x to 26x, and that the MI300X running the new ROCm 6 software is 8x faster than the MI250 running ROCm 5.

AI inference workloads are complex, and performance depends on a variety of factors, including FLOPS, precision, memory capacity, memory bandwidth, interconnect bandwidth, and model size.

AMD's biggest advantage this time around isn't floating-point throughput, it's memory: the MI300X's high-bandwidth memory (HBM) is roughly 55% faster, at about 5.2 TB/s, and its 192 GB of capacity is more than double the H100's 80 GB. That matters for AI inference because the memory needed to run a model scales with its size. At FP16, each parameter takes 16 bits, or 2 bytes, so Llama 2 70B needs roughly 140 GB just for its weights, plus additional memory for the KV cache, which accelerates inference but consumes even more memory.
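
A quick back-of-the-envelope calculation (using the published Llama 2 70B architecture figures; the batch size and sequence length are arbitrary examples) shows where the memory goes:

```python
# Back-of-the-envelope memory math for Llama 2 70B inference in FP16.
# Architecture constants are the published Llama 2 70B values; the batch size
# and sequence length below are arbitrary examples.
PARAMS = 70e9
BYTES_FP16 = 2
weights_gb = PARAMS * BYTES_FP16 / 1e9
print(f"weights: {weights_gb:.0f} GB")                  # ~140 GB just for weights

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128                 # Llama 2 70B uses GQA
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
print(f"KV cache per token: {kv_per_token / 1024:.0f} KiB")   # ~320 KiB

# e.g. a batch of 16 requests at a 4096-token context:
batch, seq_len = 16, 4096
print(f"KV cache: {batch * seq_len * kv_per_token / 1e9:.0f} GB")  # ~21 GB extra
```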

AMD's MI300X platform supports systems with up to eight accelerators and a total of 1.5 TB of HBM, while Nvidia's HGX platform tops out at 640 GB. As SemiAnalysis noted in its MI300X launch report, at FP16 the 176-billion-parameter Bloom model requires 352 GB of memory, leaving AMD far more headroom to accommodate larger batch sizes.
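
The same arithmetic at the node level shows the headroom gap; this is simple capacity math based on the figures above, not a performance claim.

```python
# Node-level memory headroom: total HBM minus FP16 model weights is what's
# left for KV cache and activations, which is what allows larger batches.
# Capacity figures are the platform specs cited above.
def headroom_gb(node_hbm_gb: float, model_weights_gb: float) -> float:
    """HBM left over after loading the model weights once per node."""
    return node_hbm_gb - model_weights_gb

MI300X_NODE = 8 * 192   # 1536 GB, roughly 1.5 TB
HGX_H100_NODE = 8 * 80  # 640 GB

for model, weights in [("Llama 2 70B (FP16)", 140), ("Bloom 176B (FP16)", 352)]:
    print(model,
          "| MI300X node headroom:", headroom_gb(MI300X_NODE, weights), "GB",
          "| HGX H100 node headroom:", headroom_gb(HGX_H100_NODE, weights), "GB")
```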

Nvidia doesn't usually spar with AMD in public, so this time it may be genuinely rattled.

That's because, on the day of AMD's launch event, Meta and Microsoft said they would buy AMD's latest AI chip, the Instinct MI300X, which suggests AMD has become the go-to alternative whenever Nvidia graphics cards are out of stock.

The chart below, from a recent report by research firm Omdia, shows Nvidia's top 12 H100 buyers in Q3 2023:

And these large customers may be at risk of churn.

Xi Xiaoyao Technology said she had just written an article not long ago, "The Nvidia crisis has exploded! Embattled overnight", an in-depth analysis of the rivals and crises Nvidia now faces (interested readers can go take a look). In brief:

1. AMD is challenging Nvidia's graphics-card supremacy head-on;

2. Microsoft is developing its own AI chips, tackling both software and hardware;

3. Google is sticking with its own TPUs and building the strongest next-generation TPU;

4. US government restrictions ban Nvidia from selling its top chips to the Chinese mainland, which will cost it customers such as Alibaba and Douyin;

5. OpenAI and other startups are developing their own AI chips.

Will the industry wait for Nvidia? How much of Nvidia's "graphics card cake" will be eaten away?

The answer may not be optimistic, but Nvidia may still be the top winner.
