AMD's latest GPU architecture, an in-depth look

Mondo Technology Updated on 2024-01-31

Source: compiled from Chips and Cheese by Semiconductor Industry Watch (ID: icbank).

AMD has long been vying for a share of the GPU compute market, and has been playing catch-up ever since NVIDIA got ahead of the curve with its Tesla architecture. Terascale 3 moved from VLIW5 to VLIW4 to improve execution unit utilization in compute workloads. GCN replaced Terascale and emphasized consistent performance across GPGPU and graphics applications. AMD then split its GPU architecture development into separate CDNA and RDNA lines, dedicated to compute and graphics respectively.

Ultimately, CDNA 2 was a significant success for AMD. The MI250X and MI210 GPUs won several supercomputer contracts, including ORNL's Frontier, which ranked first on the TOP500 list in November 2023. But while CDNA 2 provides solid and cost-effective FP64 compute, NVIDIA's H100 boasts better AI performance and presents itself as a larger unified GPU.

CDNA 3 aims to close these gaps by bringing everything AMD has to offer. The MI300X is built on an advanced chiplet setup that shows off the company's experience with advanced packaging. Together with Infinity Fabric components, that packaging lets the MI300X scale to compete with NVIDIA's largest GPUs. On the memory side, the Infinity Cache of the RDNA family has been brought into the CDNA space to alleviate bandwidth problems. That doesn't mean the MI300X skimps on memory bandwidth, though: it still has a huge HBM setup, giving it the best of both worlds. Finally, the CDNA 3 compute architecture receives significant generational improvements to increase throughput and utilization.

GPU layout

AMD has a long tradition of using chiplets to cheaply scale core counts in Ryzen and EPYC CPUs. The MI300X uses a similar strategy at a high level, splitting compute across Accelerator Complex Dies (XCDs). An XCD is analogous to the Graphics Compute Die (GCD) of CDNA 2 or RDNA 3, or the Core Complex Die (CCD) of Ryzen. AMD likely changed the name because CDNA products lack the dedicated graphics hardware found in the RDNA family.

Each XCD contains a set of compute units and a shared cache. Specifically, each XCD physically has 40 CDNA 3 compute units, 38 of which are enabled per XCD on the MI300X. Each XCD also has a 4 MB L2 cache that serves all CUs on the die. The MI300X has 8 XCDs, for a total of 304 compute units.
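
As a rough illustration of how those numbers fit together, here is a minimal sketch in Python (the figures come from the paragraph above; the class names and structure are just for illustration and are not AMD terminology):

```python
# Rough sketch of the MI300X compute hierarchy using the figures quoted above.
from dataclasses import dataclass

@dataclass
class XCD:
    cus_physical: int = 40   # CUs physically present per XCD
    cus_enabled: int = 38    # CUs enabled per XCD on the MI300X
    l2_mib: int = 4          # shared L2 per XCD, in MiB

@dataclass
class MI300X:
    xcds: int = 8

    def total_cus(self) -> int:
        return self.xcds * XCD().cus_enabled

    def total_l2_mib(self) -> int:
        return self.xcds * XCD().l2_mib

gpu = MI300X()
print(gpu.total_cus())     # 304 compute units exposed as a single GPU
print(gpu.total_l2_mib())  # 32 MiB of L2 in total, 4 MiB private to each XCD
```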

This is a big increase over the MI250X's 220 CUs. Even better, the MI300X can expose all of those CUs as a single GPU. On the MI250X, programmers had to manually split work between two GPUs, since each had a separate memory pool.

NVIDIA's H100 has 132 streaming multiprocessors (SMs) and presents them to programmers as one big unified GPU. The H100 takes the traditional approach of implementing all of its compute on a large monolithic die. But even with everything on the same die, the H100 is too big to give every SM equal access to cache, so it splits its L2 into two instances. A single SM can use all 50 MB of L2, but accessing more than 25 MB comes with a performance penalty.

Still, NVIDIA's strategy makes more efficient use of cache capacity than the MI300X's. An MI300X XCD can't allocate into the L2 capacity on other XCDs, just as CCDs on EPYC and Ryzen can't allocate into each other's L3 caches.

Intel's Ponte Vecchio (PVC) compute GPU makes for a very interesting comparison. PVC places its basic compute building blocks in dies called "compute tiles", which are roughly analogous to CDNA 3's XCDs. Similarly, PVC's base tile plays a role similar to CDNA 3's IO dies: both contain a large last-level cache and HBM memory controllers. Like the MI300X, a Ponte Vecchio card can be exposed as a single GPU with a unified memory pool.

However, there are important differences too. Compared to the 38 compute units on a CDNA 3 XCD, Ponte Vecchio's compute tile is smaller, with only 8 Xe cores. Instead of a compute-tile-wide cache, Intel uses larger L1 caches to reduce cross-die traffic. Using a two-stack Ponte Vecchio part as a unified GPU also presents challenges: the EMIB bridge between the two stacks provides only 230 GB/s of bandwidth, which is not enough to take full advantage of HBM bandwidth if accesses are striped across all memory controllers. To address this, Intel provides APIs that let programs treat the GPU as a NUMA configuration.

In terms of physical construction, the PVC and CDNA 3 designs face different challenges. CDNA 3's ability to offer a unified memory pool backed by HBM requires very high bandwidth between IO dies, whereas PVC gets by with relatively low-bandwidth EMIB links. But PVC's design is complicated by its use of four die types spread across different process nodes and foundries. AMD uses only two die types in the MI300X, and both nodes (6 nm and 5 nm) come from TSMC.

Tackling the bandwidth problem

For decades, compute speed has outpaced memory. Like CPUs, GPUs have responded with increasingly sophisticated caching strategies. CDNA 2 used a traditional two-level cache hierarchy with an 8 MB L2 and relied on HBM2E to keep the execution units fed. But even with HBM2E, the MI250X was more bandwidth-starved than NVIDIA's H100. If AMD simply added more compute, the lack of bandwidth could become a serious problem. That's why AMD has drawn on its RDNA 2 experience and added Infinity Cache.

Much like on consumer RDNA GPUs, the MI300's Infinity Cache is what technical documentation calls a MALL, or memory-attached last level cache, which is a fancy way of saying the last-level cache is a memory-side cache. In contrast to the L1 and L2 caches, which sit closer to the compute units, the Infinity Cache is attached to the memory controllers. All memory traffic goes through the Infinity Cache regardless of which block it comes from. That includes IO traffic, so peer-to-peer GPU communication can also benefit from Infinity Cache bandwidth. And because the Infinity Cache always has an up-to-date view of DRAM contents, it doesn't have to deal with snoops or other cache maintenance operations.

However, because a memory-side cache sits far from compute, it tends to incur higher latency. That's why AMD has given both CDNA 3 and RDNA 2 multi-megabyte L2 caches to insulate the compute units from the memory-side cache's lower performance.

Like RDNA 2's, CDNA 3's Infinity Cache is 16-way set associative. However, the CDNA 3 implementation is optimized more for bandwidth than for capacity. It consists of 128 slices, each with 2 MB of capacity and 64 bytes per cycle of read bandwidth. Across all slices, that's 8192 bytes per cycle, good for 17.2 TB/s at 2.1 GHz.

In comparison, RDNA 2's 128 MB Infinity Cache could deliver 1024 bytes per cycle across all of its slices, good for about 2.5 TB/s of theoretical bandwidth at 2.5 GHz. Die shots show each of its Infinity Cache slices has 4 MB of capacity and delivers 32 bytes per cycle. In other words, RDNA 2 uses fewer, larger slices with less bandwidth per slice.
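
A quick back-of-the-envelope check of those figures, using the slice counts and clock speeds quoted above (treat them as approximate rather than official specifications):

```python
# Aggregate cache bandwidth = slices * bytes per cycle per slice * clock.
def cache_bw_tbs(slices: int, bytes_per_cycle: int, clock_ghz: float) -> float:
    return slices * bytes_per_cycle * clock_ghz * 1e9 / 1e12

cdna3_ic = cache_bw_tbs(slices=128, bytes_per_cycle=64, clock_ghz=2.1)
rdna2_ic = cache_bw_tbs(slices=32, bytes_per_cycle=32, clock_ghz=2.5)

print(f"CDNA 3 Infinity Cache: {cdna3_ic:.1f} TB/s")   # ~17.2 TB/s
print(f"RDNA 2 Infinity Cache: {rdna2_ic:.2f} TB/s")   # ~2.56 TB/s
```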

The MI300X's focus on bandwidth means workloads with lower compute density can still enjoy decent performance if they get enough Infinity Cache hits. That should make CDNA 3's execution units easier to feed, although the ratio of main memory bandwidth to compute hasn't changed much and still lags NVIDIA's.

If we build a roofline model of the MI300X using the Infinity Cache's theoretical bandwidth, full FP64 throughput can be reached at 4.75 FLOPs per byte loaded. That's a huge improvement over DRAM, which requires 14.6 to 15 FLOPs per byte loaded.
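
To see where those ratios come from, here is the roofline arithmetic sketched out. The peak FP64 figure (~81.7 TFLOPS) is back-derived from the 4.75 FLOP/byte number and the ~17.2 TB/s Infinity Cache bandwidth above, so treat it as an approximation rather than an official specification:

```python
# Roofline arithmetic intensity: FLOPs that must be done per byte loaded for
# compute, rather than bandwidth, to become the limiting factor.
peak_fp64_tflops = 81.7     # assumed, back-derived from 4.75 FLOP/byte * 17.2 TB/s
infinity_cache_tbs = 17.2   # Infinity Cache bandwidth quoted above
hbm_tbs = 5.3               # HBM bandwidth quoted later in the article

print(round(peak_fp64_tflops / infinity_cache_tbs, 2))  # ~4.75 FLOPs per byte from Infinity Cache
print(round(peak_fp64_tflops / hbm_tbs, 1))             # ~15.4 FLOPs per byte from HBM, in the same ballpark as the 14.6-15 quoted above
```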

Cross-die bandwidth challenges

The MI300X's Infinity Fabric spans four IO dies, each connected to two HBM stacks and the associated cache partitions. But when the MI300X runs as a single logical GPU with a unified memory pool, die-to-die link bandwidth can limit how much of the Infinity Cache's full bandwidth is actually achievable. If memory accesses are evenly striped across the memory controllers (and cache partitions), as is typical of most GPU designs, the available die-to-die bandwidth may prevent an application from reaching the theoretical Infinity Cache bandwidth.

First, let's focus on a single IO die. It has 2.7 TB/s of ingress bandwidth along the two edges adjacent to other IO dies, while its two XCDs can demand 4.2 TB/s of Infinity Cache bandwidth. If L2 miss requests are evenly distributed across the dies, then 3/4 of that bandwidth, or 3.15 TB/s, has to come from peer dies. Since 3.15 TB/s is greater than 2.7 TB/s, cross-die bandwidth limits the achievable cache bandwidth.
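
The same argument, written out as a small calculation using the figures above:

```python
# Can one IO die's ingress links keep up with its XCDs' Infinity Cache demand?
ic_demand_tbs = 4.2    # Infinity Cache bandwidth the two local XCDs can demand
ingress_tbs = 2.7      # link bandwidth arriving from adjacent IO dies
local_fraction = 1 / 4 # with striped accesses, only 1/4 of misses hit the local cache partition

needed_from_peers = ic_demand_tbs * (1 - local_fraction)
print(round(needed_from_peers, 2))        # 3.15 TB/s must arrive over die-to-die links
print(needed_from_peers > ingress_tbs)    # True -> cross-die links become the limit
```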

We can add the diagonally opposite die without changing the picture, because all of the inter-die bandwidth it needs flows in the opposite direction, and the MI300X's die-to-die links are bidirectional.

Things get even more complicated if all dies need maximum Infinity Cache bandwidth in the unified configuration. Transfers between diagonally opposite dies take two hops, consuming additional cross-die bandwidth and reducing the ingress bandwidth available to each die.

While the MI300X is designed to act like one big GPU, splitting it into multiple NUMA domains can provide higher combined Infinity Cache bandwidth, and AMD may offer an API that transparently partitions programs across the IO dies. In addition, high L2 hit rates will minimize the traffic that could cause these bandwidth bottlenecks in the first place. And when the Infinity Cache hit rate is low, the MI300X's die-to-die links are robust enough to handle HBM traffic smoothly.

Cross-XCD coherency

While a memory-side Infinity Cache doesn't have to worry about coherency, the L2 caches do. Normal GPU memory accesses follow a relaxed consistency model, but programmers can use atomics to force ordering between threads. Memory accesses on AMD GPUs can also be marked with the GLC (Globally Coherent) bit. If AMD wants the MI300X to act as one big GPU rather than a multi-GPU setup like the MI250X, these mechanisms still have to work.

On previous AMD GPUs, atomics and coherent accesses were handled at L2. A load with the GLC bit set would bypass the L1 cache and thus get the latest copy of the data from L2. That doesn't work on the MI300X, because the latest copy of a cache line might sit in another XCD's L2. AMD could make coherent accesses bypass L2 as well, but that would hurt performance. It might be acceptable on gaming GPUs, where coherent accesses aren't very important, but AMD wants the MI300X to do well in compute workloads and needs the MI300A (the APU variant) to efficiently share data between CPU and GPU. That's where Infinity Fabric comes in.

As with Infinity Fabric on Ryzen, each CDNA 3 XCD connects to the IO die through a Coherent Master (CM). Coherent Slaves (CS) sit in each memory controller alongside the Infinity Cache (IC) slices. We can infer how they work from Ryzen documentation, which shows that Coherent Slaves have a probe filter and hardware for handling atomic transactions. The MI300X likely has a similar CS implementation.

When a coherent write arrives at a CS, it must ensure the write is observed by any thread doing coherent reads, no matter where on the GPU that thread runs. That means any XCD caching the line has to reload it from the Infinity Cache to get the latest data. Naively, the CS would have to probe the L2 caches on all XCDs, since any of them could be caching the corresponding data. A probe filter avoids this by tracking which XCDs actually have the line cached, eliminating unnecessary probe traffic. CDNA 3's snoop filter (another name for a probe filter) is said to be large enough to cover multiple XCDs' L2 caches. I certainly believe that, since the MI300X only has 32 MB of L2 across all 8 XCDs; even consumer Ryzen parts have probe filters that cover more CCD-side cache than that.
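
As a purely conceptual illustration of what a probe filter buys (a toy model, not AMD's actual design), consider a directory that remembers which XCDs may be caching a line, so a coherent write only probes those XCDs:

```python
# Toy probe filter: instead of probing every XCD's L2 on a coherent write, the
# Coherent Slave only probes the XCDs recorded as possibly caching the line.
from collections import defaultdict

class ProbeFilter:
    def __init__(self, num_xcds: int = 8):
        self.num_xcds = num_xcds
        self.sharers = defaultdict(set)   # cache-line address -> set of XCD ids

    def record_fill(self, line_addr: int, xcd: int) -> None:
        """An XCD pulled this line into its L2; remember it."""
        self.sharers[line_addr].add(xcd)

    def coherent_write(self, line_addr: int, writer_xcd: int) -> list[int]:
        """Return only the XCDs that must be probed/invalidated."""
        targets = sorted(self.sharers[line_addr] - {writer_xcd})
        self.sharers[line_addr] = {writer_xcd}
        return targets

pf = ProbeFilter()
pf.record_fill(0x1000, xcd=0)
pf.record_fill(0x1000, xcd=5)
print(pf.coherent_write(0x1000, writer_xcd=0))  # [5] -> only XCD 5 is probed, not all 8
```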

Thanks to CPU-style Infinity Fabric components like the CS and CM, each XCD can have a private write-back L2 that handles coherent accesses within the die without crossing the IO die fabric. AMD could have gone with the simpler solution of having coherent operations and atomics bypass L2 and go straight to the Infinity Cache. That would have saved engineering effort and produced a simpler design, but at the cost of slower coherent operations. Clearly, AMD decided that fast atomic and coherent accesses were important enough to justify the extra effort.

Within an XCD, however, CDNA 3 still works much like previous GPUs. Notably, normal memory writes don't automatically invalidate the line in peer caches the way CPU writes do. Instead, L2 has to be explicitly told to write back dirty lines, and peer L2 caches have to be told to invalidate non-local lines.

L2 caching

Closer to the compute units, each MI300X XCD contains a 4 MB L2 cache. This is a more traditional GPU cache, built from 16 slices. Each 256 KB slice provides 128 bytes per cycle of bandwidth, which works out to 4.3 TB/s at 2.1 GHz. As the last level of cache on the same die as the compute units, L2 plays an important role in catching L1 misses.
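
Working through those slice figures and scaling them to the full chip (the aggregate number is simple arithmetic from the stated per-slice bandwidth, not an official figure):

```python
# Per-XCD L2 bandwidth from the slice figures above, and how it scales with XCD count.
slices, bytes_per_cycle, clock_ghz = 16, 128, 2.1
per_xcd_tbs = slices * bytes_per_cycle * clock_ghz * 1e9 / 1e12

print(round(per_xcd_tbs, 1))      # ~4.3 TB/s of L2 bandwidth within each XCD
print(round(per_xcd_tbs * 8, 1))  # ~34.4 TB/s across all eight XCDs, since each brings its own L2
```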

Compared to the H100 and MI250X, the MI300X has a higher ratio of L2 bandwidth to compute. Because every XCD comes with its own L2, L2 bandwidth naturally scales as CDNA 3 products add more XCDs. In other words, the MI300X's L2 layout avoids the problem of connecting a single cache to a huge number of compute units while sustaining massive bandwidth.

PVC's L2 stands in stark contrast. As Intel adds more compute tiles, the bandwidth demands on the base tile's shared L2 grow with them. From a cache design perspective, PVC's setup is simpler, because the L2 acts as a single point of coherency and a backstop for L1 misses. But it can't deliver as much bandwidth as the MI300X's L2 caches, and the MI300X probably also enjoys better L2 latency, making it easier for applications to actually use that cache bandwidth.

L1 cache

CDNA 3's focus on high cache bandwidth continues at L1. Matching RDNA, CDNA 3's L1 throughput increases from 64 bytes per cycle to 128 bytes. CDNA 2 had already doubled per-CU vector throughput to 4096 bits per cycle, compared to 2048 bits in GCN, so doubling L1 throughput in CDNA 3 restores the same compute-to-L1-bandwidth ratio that GCN had.
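
A quick sanity check of that ratio, converting the vector throughput figures above from bits to bytes:

```python
# Ratio of per-CU vector throughput to L1 bandwidth, using the figures quoted above.
gcn_ratio   = (2048 // 8) / 64    # GCN:    256 B of vector work per cycle vs 64 B/cycle from L1
cdna3_ratio = (4096 // 8) / 128   # CDNA 3: 512 B of vector work per cycle vs 128 B/cycle from L1
print(gcn_ratio, cdna3_ratio)     # 4.0 4.0 -> doubling L1 bandwidth keeps the GCN-era ratio
```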

Besides higher bandwidth, CDNA 3 increases L1 capacity from 16 KB to 32 KB. This again mirrors the RDNA line, where RDNA 3's L1 cache received a similar size increase. The higher hit rate of the larger cache reduces average memory access latency, which improves execution unit utilization. Moving data in from L2 and beyond also costs power, so higher hit rates help power efficiency too.

While CDNA 3 improves its L1, Ponte Vecchio is still the champion in this category. Each Xe core in PVC can transfer 512 bytes per cycle, giving Intel a very high ratio of L1 bandwidth to compute, and the L1 is also huge at 512 KB. A memory-bound kernel that fits in L1 will do very well on Intel's architecture. However, Ponte Vecchio lacks a mid-level cache at the compute tile level and can fall off a harsh performance cliff when data spills out of L1.

Scheduling and execution units

A sophisticated chiplet setup and a revamped cache hierarchy let AMD present the MI300X as a single GPU, addressing one of the MI250X's biggest weaknesses. But AMD didn't stop there. It also made iterative improvements to the compute unit architecture, addressing how hard it was to make use of CDNA 2's FP32 units.

When CDNA 2 moved to natively handling FP64, AMD provided double-rate FP32 through packed execution: the compiler has to pack two FP32 values into adjacent registers and execute the same instruction on both. In practice, compilers struggle to do this unless the programmer explicitly uses vector types.

CDNA 3 solves this problem with a more flexible dual-issue mechanism. Most likely, it's an extension of GCN's multi-issue capability rather than RDNA 3's VOPD or wave64 dual-issue approach. Each cycle, the CU scheduler selects one of its four SIMDs and checks whether that SIMD's threads are ready to execute. If multiple threads are ready, GCN could pick up to five of them to send to the execution units. Of course, a GCN SIMD only has one 16-wide vector ALU, so GCN had to select threads with different instruction types to multi-issue; for example, a scalar ALU instruction could issue alongside a vector ALU instruction.

Another approach would be to take advantage of wave64's wider width and have a thread complete two vector instructions over four cycles. However, doing so would break GCN's model of handling VALU instructions in multiples of four clock cycles. CDNA 3 is still more closely related to GCN than to RDNA, so reusing GCN's multi-issue strategy makes sense. AMD could also have used RDNA 3's VOPD mechanism, where a special instruction format encodes two operations. While that approach can improve per-thread performance, relying on the compiler to find dual-issue pairs is hit or miss.

CDNA 3's dual-issue approach likely shifts the onus onto the programmer to expose more thread-level parallelism through larger dispatch sizes, rather than relying on the compiler. If a SIMD has more threads in flight, it has a better chance of finding two threads with FP32 instructions ready to execute. At a minimum, a SIMD needs two active threads to reach full FP32 throughput. In practice, CDNA 3 will need higher occupancy than that to achieve good FP32 utilization, because GPUs execute in order and individual threads are frequently stalled on memory or execution latency. Keeping the execution units fed can be difficult even when a SIMD is full.
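
To see why occupancy matters for dual issue, here is a toy simulation of the idea (a hypothetical model with an assumed probability that a wavefront has an FP32 instruction ready each cycle, not AMD's actual scheduler):

```python
# Toy model: a SIMD can only dual-issue FP32 work if at least two of its
# resident wavefronts have an FP32 instruction ready in the same cycle.
import random

def fp32_issue_rate(resident_waves: int, p_ready: float, cycles: int = 100_000) -> float:
    """Average FP32 instructions issued per cycle, capped at 2 by dual issue."""
    random.seed(0)
    issued = 0
    for _ in range(cycles):
        ready = sum(random.random() < p_ready for _ in range(resident_waves))
        issued += min(ready, 2)
    return issued / cycles

# One resident wave can never dual-issue; more waves make it increasingly
# likely that two FP32 instructions are ready together.
for waves in (1, 2, 4, 8):
    print(waves, round(fp32_issue_rate(waves, p_ready=0.4), 2))
```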

Accordingly, AMD has dramatically increased the number of threads each CDNA 3 SIMD can track, from 8 to 24. If programmers can take advantage of that, CDNA 3 will be better positioned to multi-issue. But it could be difficult: AMD doesn't mention an increase in vector register file size, which is usually what limits how many threads a SIMD can actually run. Since the register file can only hold the state of more threads if each thread uses fewer registers, CDNA 3's multi-issue capability is probably best suited to simple kernels with few active variables.

Register file bandwidth presents another challenge for dual issue. CDNA 2's packed FP32 execution didn't need extra reads from the vector register file, because it reused the wider register file ports already required for 64-bit values. But two independent instructions can reference different registers and require more register file reads. Adding register file ports is expensive, so CDNA 3 instead "improves the source caches generation over generation to provide better reuse and bandwidth amplification so that each vector register read can support more downstream vector or matrix operations." Most likely, AMD is using a larger register cache to mitigate port conflicts and keep the execution units fed.

Matrix operations

With the rise of machine learning, matrix multiplication has become increasingly important. NVIDIA invested heavily in this area, adding matrix multiplication units (tensor cores) to its Volta and Turing architectures years ago. AMD's CDNA architectures added matrix multiplication support too, but contemporary NVIDIA architectures invest far more in matrix multiplication throughput, especially for the low-precision data types commonly used in AI, such as FP16.

Compared to the previous CDNA generation, the MI300X catches up somewhat by doubling per-CU matrix throughput, and its chiplet design allows for a very large CU count on top of that. But NVIDIA's higher per-SM matrix performance still makes it a force to be reckoned with. So CDNA 3 continues AMD's pattern of fighting NVIDIA on vector FP64 performance while keeping AI performance strong but not class-leading.

Instruction caching

Besides handling the memory accesses that instructions request, a compute unit also has to fetch the instructions themselves from memory. Traditionally, GPUs have had an easy time with instruction delivery because GPU programs tend to be simple and don't take much memory; in the DirectX 9 era, Shader Model 3.0 even imposed limits on shader program size. As GPUs evolved to take on compute tasks, AMD introduced the GCN architecture with a 32 KB instruction cache, and CDNA 2 and RDNA GPUs continue to use a 32 KB instruction cache today.

CDNA 3 increases instruction cache capacity to 64 KB and doubles its associativity from 4-way to 8-way. That should give CDNA 3 higher instruction cache hit rates with larger, more complex kernels. I suspect AMD's goal is to handle code naively ported from CPUs to GPUs. Complex CPU-style code can hurt GPUs, which can't hide instruction cache miss latency with the long-range instruction prefetch and accurate branch prediction that CPUs enjoy. The extra capacity helps accommodate larger kernels, while the increased associativity helps avoid conflict misses.

As with CDNA 2, each CDNA 3 instruction cache instance serves two compute units. GPU kernels are typically launched with enough work to fill many compute units, so sharing an instruction cache is a good way to use SRAM storage efficiently. I suspect AMD doesn't share the cache across more compute units because a single cache instance could struggle to keep up with instruction bandwidth demands.

Final words

AMD's CDNA 3 whitepaper says that "the biggest generational change in the AMD CDNA 3 architecture is the memory hierarchy," and I have to agree. While AMD has improved the compute units' low-precision math capabilities compared to CDNA 2, the real improvement is the addition of Infinity Cache.

The MI250X's main problem was that it wasn't really one GPU. It was two GPUs sharing a package, with only 200 GB/s of bandwidth in each direction between the GCDs. By AMD's own assessment, 200 GB/s each way was not enough for the MI250X to present itself as a single GPU, which is why AMD massively increased die-to-die bandwidth.

AMD increased total east-west bandwidth to 2.4 TB/s in each direction, 12 times more than the MI250X. Total north-south bandwidth is even higher, at 3.0 TB/s. With that massive increase in bandwidth, AMD was able to make the MI300 behave like one large unified accelerator instead of two separate accelerators like the MI250X.

If both XCDs on an IO die demand all of the available memory bandwidth, 4.0 TB/s of total ingress bandwidth may not seem sufficient. However, the two XCDs combined can only consume up to 4.2 TB/s of bandwidth, so in practice the 4.0 TB/s of ingress bandwidth is not an issue. That 4.0 TB/s ceiling does mean a single IO die cannot take advantage of the full 5.3 TB/s of memory bandwidth.

This is similar to desktop Ryzen 7000 parts, where a single CCD can't make full use of DDR5 bandwidth because of Infinity Fabric limits. It likely isn't a problem on the MI300X, though, because bandwidth demands will be highest when all the dies are active. In that case each die will consume about 1.3 TB/s of bandwidth, and pulling 3/4 of that over the die-to-die links is no problem.

But the MI300 isn't just a GPGPU part; there is also an APU variant, and in my opinion that's the more interesting of the two MI300 products. AMD's first APU, Llano, launched in 2011 and paired AMD's K10.5 CPU cores with a Terascale 3 GPU. Fast forward to 2023, and AMD's first "big iron" APU, the MI300A, pairs six CDNA 3 XCDs with 24 Zen 4 cores while reusing the same base dies. That lets the CPU and GPU share the same memory address space, eliminating the need to copy data over an external bus to keep the two coherent with each other.

We're looking forward to what AMD will be able to do with future "big iron" APUs and future GPGPU parts. Maybe they'll get a dedicated CCD with wider vector units, or networking on the base die that can hook directly into the XGMI switch Broadcom has said it is making.
