The Convergence of CPUs and GPUs: The "8087 Moment" of Modern Computing
In the past, CPUs relied on external math coprocessors to boost floating-point performance until that capability was absorbed into the CPU itself. Today, the same absorption is happening with GPUs: with the introduction of the NVIDIA GH200 superchip and the AMD MI300A APU, the GPU has been brought into the same package as the CPU.
The rise of processors with embedded GPUs.
GPUs are known for their ability to accelerate math-heavy workloads. By integrating GPUs into their CPUs, NVIDIA and AMD have achieved significant improvements in HPC performance.
Absorption of external performance hardware.
This convergence marks an "8087 moment" in computing, similar to early CPUs absorbing the once-optional math coprocessor. It heralds a future in which external performance hardware is gradually absorbed into the CPU itself.
Goodbye PCIe
GPU-to-CPU memory connection bottleneck.
Traditionally, discrete GPUs from NVIDIA and AMD communicate with CPUs over the PCIe bus. Since the CPU and GPU have separate memory domains, data must be copied between the two across PCIe, creating a bandwidth bottleneck.
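To make the bottleneck concrete, here is a minimal CUDA sketch of the traditional discrete-GPU workflow (array size and values are illustrative, not from the article): the host buffer lives in CPU DRAM, the device buffer lives in GPU memory, and every transfer has to cross the PCIe bus.

```cuda
// Traditional split-memory model: explicit allocations on each side and
// explicit copies across PCIe in both directions.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;              // ~16M floats, an illustrative size
    const size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);   // CPU memory domain
    for (size_t i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);               // separate GPU memory domain
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // copy over PCIe
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // copy back over PCIe

    printf("h_x[0] = %f\n", h_x[0]);
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```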
NVIDIA Grace Hopper GH200 superchip
NVIDIA's Grace Hopper GH200 superchip addresses this bottleneck with a 900 GB/s NVLink-C2C connection, about 14 times faster than a traditional PCIe bus. In addition, the GH200 implements a single shared CPU-GPU memory domain, eliminating the need to copy data between the two.
GH200 memory architecture.
The GH200 pairs up to 480 GB of LPDDR5X CPU memory with 96 GB or 144 GB of HBM3 GPU memory. Together these add up to 576 GB to 624 GB, all of it addressable by both the CPU and the GPU.
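By contrast, here is a minimal sketch of what a single shared memory domain looks like to the programmer, assuming CUDA managed (unified) memory as the programming model: one allocation, one pointer, no explicit copies. On a coherent platform such as the GH200, the NVLink-C2C link backs this in hardware; on a conventional PCIe system the same code still runs, with the driver migrating pages behind the scenes.

```cuda
// Single shared memory domain (sketch): the CPU and GPU operate on the same
// allocation, so no cudaMemcpy calls are needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));    // one allocation, visible to CPU and GPU

    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;  // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n); // GPU works on the same pointer
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                 // CPU reads the result in place
    cudaFree(x);
    return 0;
}
```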
AMD Instinct MI300A APU
AMD's Instinct MI300A APU also features a single memory domain, with 128 GB of HBM3 memory coherently shared between the CPU and GPU over Infinity Fabric. The package has a peak memory bandwidth of 5.3 TB/s. While external memory expansion is not currently supported, CXL may offer a path to future upgrades.
The benefits of a single memory domain.
The single memory domain of the GH200 and MI300A eliminates the GPU memory capacity limits found in traditional approaches. This is critical for high-performance computing (HPC) and generative AI (GenAI) workloads that need to load large models into memory and run them on GPUs.
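A quick back-of-the-envelope calculation shows why the extra capacity matters; the 70-billion-parameter model below is an illustrative assumption, not a figure from the article.

```cuda
// Rough sizing: a hypothetical 70B-parameter model stored as FP16 weights
// needs about 70e9 * 2 bytes = 140 GB, more than a standalone H100's 80 GB
// of HBM but well within the GH200's 576 GB unified CPU+GPU memory.
#include <cstdio>

int main() {
    const double params    = 70e9;  // assumed parameter count (illustrative)
    const double bytes_per = 2.0;   // FP16 weights
    const double need_gb   = params * bytes_per / 1e9;
    printf("Model weights: ~%.0f GB; H100 HBM: 80 GB; GH200 unified: 576 GB\n", need_gb);
    return 0;
}
```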
Unified memory expansion.
The GH200 further pushes memory capacity limits by creating up to 20 TB of unified memory over an external NVLink connection.
It's not far from your desktop
From high-end technology to the commodity market: a shift in high-performance computing.
High-performance computing (HPC) technologies have a habit of migrating from expensive, leading-edge systems to the more economical commodity market. One notable change now underway is the move to a single memory domain, which, like multi-core processors and premium memory before it, may work its way down from high-end systems to everyday "mobile" devices.
GPTshop.ai's GH200 workstation: a more affordable route to HPC and GenAI.
Writing for the Linux benchmarking site Phoronix, tester Michael Larabel ran HPC benchmarks on a GH200 workstation from GPTshop.ai. The system is built around the Grace Hopper superchip and offers an impressive 576 GB of memory, dual 2000+ W power supplies, and flexible configuration options.
Low noise, high power: Ideal for non-data center environments.
A notable feature of the GH200 workstation is its programmable TDP, adjustable from 450 W to 1000 W (CPU + GPU + memory), making it practical outside the data center. In addition, its default air-cooled configuration produces only 25 decibels of noise, providing quiet operation. Liquid cooling is also available.
Cost-effective, single-domain memory solution.
Although the GH200 workstation is not a low-priced product, its price starts at €47,500 (about $41,000). Considering that a current NVIDIA H100 PCIe GPU alone sells for between $30,000 and $50,000, before adding the cost of a host system, this makes it an attractive option.
The GPTshop workstation offers 576 GB of single-domain memory, far exceeding the H100 GPU's 80 GB limit, which is a valuable advantage for HPC and GenAI users who need large amounts of CPU-GPU memory.
Preliminary benchmarks
Through GPTshop.ai, Phoronix was able to run multiple benchmarks remotely. The results should be treated as preliminary rather than a final performance assessment: the tests exercised only the CPU, not the Hopper GPU, so the performance picture is incomplete. Phoronix plans to extend testing to GPU-based applications in the future.
The baseline environment used Ubuntu 23.10, Linux 6.5, and GCC 13 as the standard compiler. To ensure consistency, comparable processors, including Intel Xeon Scalable, AMD EPYC, and Ampere Altra Max, were tested in a similar environment. See Phoronix for the complete list.
Unfortunately, power consumption data was not available during the benchmark runs. According to Phoronix, the NVIDIA GH200 does not currently expose a RAPL/PowerCap/HWMon interface on Linux that could be used to read its power and energy use. Although the system BMC can display total system power through a web interface, that data is not accessible via IPMI.
Despite these limitations, the study provides the first key benchmark numbers for the GH200 gathered outside of NVIDIA.
Good Ole HPCG
Arm GH200 performance.
In the memory-bandwidth-bound HPCG benchmark, the Arm-based GH200 stands out at 42 GFLOPS, edging out the Xeon Platinum 8380 2P (40 GFLOPS) and the Ampere Altra Max (41 GFLOPS).
The GH200 also performed well in the NWChem benchmark, with the 72-core Arm GH200 finishing in just 1404 seconds, second only to the leading 128-core EPYC 9554 (1323 seconds).
Notably, the 72-core Arm-based Grace CPU delivered nearly twice the performance of the 128-core Ampere Altra Max Arm processor.
What's coming
GPUs integrated into high-end CPUs will advance AI.
The NVIDIA GH200 and AMD MI300A introduce new processor architectures that integrate the GPU into the CPU. Much as CPUs absorbed math coprocessors in the past, this marks the beginning of high-end CPUs absorbing GPUs and becoming "dedicated" processors.
GenAI demand could drive prices down.
Even though these high-end processors are currently expensive, the huge interest in generative AI (GenAI) could push them toward commodity price points. How this plays out will become clearer as more benchmarks become available.
The rise of the personal high-performance workstation.
The advent of personal high-performance workstations with ample memory is significant. They can run large language models (LLMs) in the office and support memory-hungry, GPU-optimized high-performance computing (HPC) workloads.
Data centers and the cloud are still important, but personal workstations offer a "reset button."
Data centers and the cloud are still the workhorses of computing, but personal high-performance workstations add flexibility: users can run LLMs and HPC applications locally without relying on the cloud or a data center.