For a long time, our programs have been built on CPUs and a sequential execution model, and most of us know how to tune for the CPU and how to debug sequential code. That situation is slowly changing. With the development of machine learning and artificial intelligence, the massively parallel GPU has become the fundamental computing architecture. The current era of AI and large models is, in practice, a contest of computing power and of GPUs; NVIDIA has soared precisely because high-performance GPUs such as the H100 are the key to winning this competition. Software engineers face a choice as well: how to become AI trendsetters rather than be replaced by AI. The first step is to understand AI, high-performance parallel computing, and the GPU, and this article walks through those basics with you.
The fundamental difference between CPUs and GPUs lies in their original design goals. The CPU is designed to execute sequential instructions. To improve sequential performance, many optimizations have been introduced into CPU design over the years, all focused on reducing instruction latency so that the CPU can run an instruction stream as fast as possible: instruction pipelining, out-of-order execution, speculative execution, and multi-level caches.
GPUs, on the other hand, are designed for massive parallelism and high throughput, at the cost of moderate to high instruction latency. They are therefore better suited to gaming, graphics, numerical computing, and deep learning, all of which require enormous amounts of linear algebra and numerical computation performed at very high throughput and at very large scales of parallelism.
Because of its lower instruction latency, a CPU can add two numbers faster than a GPU, and can perform several such calculations in a row faster than a GPU. But when millions or billions of such calculations are needed, the GPU finishes them sooner thanks to its massive parallelism. To put numbers on it, the NVIDIA Ampere A100 offers 19.5 TFLOPS of 32-bit floating-point throughput, while a 24-core Intel processor offers about 0.66 TFLOPS at the same 32-bit precision.
Moreover, the throughput performance gap between GPUs and CPUs is widening year by year.
Comparison of CPU and GPU computing structures:
As the diagram shows, the CPU devotes a large amount of chip area to features that reduce instruction latency: large multi-level caches, relatively few ALUs, and more control logic. In contrast, the GPU fills its chip area with ALUs to maximize compute power and throughput, spending only a small amount of area on caches and control units, the very components the CPU uses to reduce latency.
Latency tolerance and high throughput
The answer to how GPUs deliver high performance despite high per-instruction latency is parallelism. A GPU has a huge number of threads and enormous compute power, which together guarantee high concurrency and high throughput. Even if a single instruction has high latency, the GPU schedules threads efficiently so that the compute units are busy at every point in time: when some threads are waiting for an instruction's result, the GPU switches to other threads that are not waiting, keeping the compute units on the GPU running at full capacity and delivering high throughput.
GPU computing is characterized by high throughput and high concurrency. The architecture is designed with this principle and purpose in mind.
A GPU consists of an array of streaming multiprocessors (SMs). Each SM in turn contains multiple cores on which threads execute. For example, the NVIDIA H100 GPU has 132 SMs with 64 cores each, for a total of 8448 cores.
Each SM has a finite amount of on-chip memory, often referred to as shared memory or a scratchpad, that is shared across all of its cores. Likewise, the SM's control-unit resources are shared by all cores. In addition, each SM is equipped with a hardware-based thread scheduler to execute threads.
Beyond that, each SM has multiple functional units and other accelerated compute units, such as tensor cores or ray-tracing units, to serve the specific computational needs of the workloads the GPU targets.
GPUs have multiple tiers of different types of memory, each with its own specific use case. The following diagram shows the memory hierarchy of an SM in the GPU.
Its main constituent units are:
Registers: Each SM in the GPU has a large number of registers. For example, the NVIDIA A100 and H100 have 65,536 registers per SM. These registers are shared among the cores and allocated to threads dynamically based on their requirements. During execution, the registers assigned to a thread are private to it, i.e., other threads cannot read or write them.
Constant cache: Used to cache constant data used by the code executing on the SM. To take advantage of this cache, the programmer must explicitly declare objects as constants in the code so that the GPU can cache them in the constant cache.
Shared memory: Each SM also has a shared memory, or scratchpad, a small amount of fast, low-latency, programmable on-chip SRAM. It is designed to be shared by the block of threads running on the SM. The idea behind shared memory is that if multiple threads need to work on the same piece of data, only one of them should load it from global memory, and the others then share it. Careful use of shared memory cuts redundant loads from global memory and improves kernel performance. Shared memory also serves as a synchronization mechanism between the threads executing within a block (a short illustrative sketch follows this memory overview).
L1 cache: Each SM also has an L1 cache that caches frequently accessed data from the L2 cache.
L2 cache: There is an L2 cache shared by all SMs. It caches frequently accessed data from global memory to reduce latency. Note that both the L1 and L2 caches are transparent to the SM: the SM does not need to know whether data came from L1 or L2; as far as it is concerned, it is reading global memory. This is similar to how L1/L2/L3 caching works in a CPU.
Global memory: The GPU also has an off-chip global memory, a high-capacity, high-bandwidth DRAM. For example, the NVIDIA H100 has 80 GB of high-bandwidth memory (HBM) with a bandwidth of 3000 GB/s. Because it is far from the SMs, the latency of global memory is quite high; however, the cache hierarchy and the sheer number of compute units help hide that latency.
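To make the constant-cache and shared-memory ideas above concrete, here is a minimal illustrative sketch; the kernel, the `scale` constant, and the tile size are illustrative assumptions, not taken from the article:

```cpp
#include <cuda_runtime.h>

#define TILE 256                  // launch with blockDim.x <= TILE

__constant__ float scale;         // lives in constant memory, served by the constant cache;
                                  // set from the host with cudaMemcpyToSymbol(scale, ...)

// Each thread loads one element into shared memory; after the block-wide
// synchronization, it also reads the element loaded by its neighbor, so the
// data is loaded from global memory only once per element.
__global__ void neighborSum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];  // on-chip shared memory, one copy per thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // one global-memory load per element
    __syncthreads();                                // block-level synchronization point

    int neighbor = (threadIdx.x + 1) % blockDim.x;  // wraps around within the block
    if (i < n) {
        out[i] = scale * (tile[threadIdx.x] + tile[neighbor]);
    }
}
```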
To understand how the GPU executes a kernel, you first need to understand what a kernel is and how it is configured.
Introduction to kernel and thread blocks
CUDA is the programming interface NVIDIA provides for writing programs that run on its GPUs. In CUDA, the computation to run on the GPU is expressed as a C++-like function called a kernel. The kernel operates in parallel on vectors of numbers, which are passed to it as function parameters. A simple example is a kernel that performs vector addition: it takes two vectors of numbers as input, adds them element-wise, and writes the result into a third vector.
To execute a kernel on the GPU, a number of threads are launched, collectively known as a grid. But there is more structure to the grid than that: a grid is made up of one or more thread blocks (sometimes simply called blocks), and each block is made up of one or more threads.
The number of blocks and threads depends on the size of the data and how much parallelism we want. For example, to add vectors of dimension 256, we might configure a single thread block of 256 threads so that each thread operates on one element of the vector. For larger problems, the GPU may not be able to supply enough threads to cover all the data at once.
As far as implementation goes, writing a kernel involves two parts. One is the host code, which executes on the CPU: this is where data is loaded, memory is allocated on the GPU, and the kernel is launched with the configured grid of threads. The other part is the device code, which executes on the GPU.
Taking vector addition as an example, the host code looks roughly like the following.
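A minimal sketch of such host code, assuming a kernel named `vecAdd` (defined in the next snippet); the variable names and sizes are illustrative:

```cpp
#include <cuda_runtime.h>
#include <vector>

// The kernel is defined in the next snippet.
__global__ void vecAdd(const float *a, const float *b, float *c, int n);

int main() {
    const int N = 256;
    std::vector<float> hostA(N, 1.0f), hostB(N, 2.0f), hostC(N);

    // Allocate global memory on the device for the inputs and the result.
    float *devA, *devB, *devC;
    cudaMalloc((void **)&devA, N * sizeof(float));
    cudaMalloc((void **)&devB, N * sizeof(float));
    cudaMalloc((void **)&devC, N * sizeof(float));

    // Copy the input vectors from host memory to device global memory.
    cudaMemcpy(devA, hostA.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a grid with enough 256-thread blocks to cover all N elements.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(devA, devB, devC, N);

    // Copy the result back to host memory and release device memory.
    cudaMemcpy(hostC.data(), devC, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return 0;
}
```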
Below is the device code, which defines the actual kernel function.
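A minimal sketch of such a kernel (the name `vecAdd` and the bounds guard are illustrative):

```cpp
// Each thread adds a single pair of elements; the global thread index
// determines which pair.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard threads that fall beyond the vector length
        c[i] = a[i] + b[i];
    }
}
```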
A deeper introduction to CUDA is beyond the scope of this article; interested readers can refer to the official documentation.
Copy data from the host to the device
Before a kernel can be scheduled for execution, all the data it needs must be copied from host (CPU) memory to the GPU's global memory (device memory). That said, on the latest GPU hardware it is also possible to read data directly from host memory using unified virtual memory.
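A small sketch contrasting the two approaches with CUDA runtime calls; the function and variable names are illustrative:

```cpp
#include <cuda_runtime.h>

// Sketch: two ways of making input data visible to the GPU.
void stageData(const float *hostIn, int n) {
    // Option 1: explicit copy into device global memory.
    float *devIn = nullptr;
    cudaMalloc((void **)&devIn, n * sizeof(float));
    cudaMemcpy(devIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    // Option 2: unified (managed) memory, addressable from both CPU and GPU;
    // the runtime migrates pages on demand instead of requiring an explicit copy.
    float *managed = nullptr;
    cudaMallocManaged((void **)&managed, n * sizeof(float));
    for (int i = 0; i < n; ++i) managed[i] = hostIn[i];  // written directly by the CPU

    // ... launch kernels that read devIn or managed ...

    cudaFree(devIn);
    cudaFree(managed);
}
```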
Scheduling of thread blocks on SM
Once the GPU has all the necessary data in its memory, it assigns thread blocks to SMs. All threads in a block are processed by the same SM at the same time; to make that possible, the GPU must reserve resources for those threads on the SM before it can start executing them. In practice, multiple thread blocks can be assigned to the same SM and execute simultaneously.
Because the number of SMs is limited and a large kernel may have a very large number of blocks, not all blocks can be assigned for execution at once. The GPU maintains a list of blocks waiting to be scheduled, and whenever a block finishes executing, the GPU assigns one of the waiting blocks to run.
Single-instruction multi-threading and warps
All threads of a block are assigned to the same SM. But there is another level of organization after that: the threads are further divided into warps of 32 threads, and each warp is assigned to execute together on a set of cores called a processing block.
The SM executes all threads of a warp together by fetching and issuing the same instruction to all of them. The threads then execute that instruction simultaneously, but on different parts of the data. In the vector-addition example, all threads in a warp may be executing the add instruction, but they operate on different indices of the vectors.
This warp execution model is also known as single-instruction multi-threading (SIMT), because multiple threads execute the same instruction.
Starting with Volta, the new generation of GPUs offers an alternative instruction scheduling mechanism called independent thread scheduling. It allows full concurrency between threads, regardless of warp. It can be used to make better use of execution resources, or as a synchronization mechanism between threads.
Warp scheduling and latency tolerance
There are some interesting details about how WARP works that are worth discussing.
Even though every processing block (core group) in an SM has warps assigned to it, at any given moment only a few of them are actively executing instructions. This is because the SM has a limited number of execution units available.
Some instructions take longer to complete, leaving their warp stalled waiting for the result. In that case the SM puts the waiting warp to sleep and starts executing another warp that is not waiting on anything. This is how the GPU maximizes use of all its available compute and delivers high throughput.
Zero-overhead scheduling
Because every thread in every warp has its own set of registers, there is no overhead when the SM switches from executing one warp to another.
This is in contrast to how context switching between processes works on the CPU. If a process is waiting for a long-running operation, the CPU schedules another process onto that core in the meantime. But a CPU context switch is expensive, because the CPU must save the registers to main memory and restore the state of the other process.
Copy the resulting data from the device to host memory
Finally, when all threads of the kernel have finished executing, the last step is to copy the result back to host memory.
The GPU uses a metric called occupancy to measure the utilization of its resources: the ratio of the number of warps assigned to an SM to the maximum number it can support. To achieve maximum throughput, we want 100% occupancy; in practice, this is hard to reach because of various constraints.
An SM has a fixed set of execution resources, including registers, shared memory, thread-block slots, and thread slots. These resources are dynamically partitioned among thread blocks according to their requirements and the GPU's limits. For example, on the NVIDIA H100 each SM can handle 32 blocks and 64 warps (i.e. 2048 threads), with at most 1024 threads per block. If you launch a grid with a block size of 1024 threads, the GPU splits the 2048 available thread slots into 2 blocks.
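These per-SM limits can be queried at runtime from the CUDA device properties; a short sketch (device 0 is assumed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Per-SM execution resources that get dynamically partitioned among blocks.
    printf("Max blocks per SM:     %d\n", prop.maxBlocksPerMultiProcessor);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```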
Dynamic partitioning makes more efficient use of the GPU's compute resources. Compare it with a fixed partitioning scheme, in which each thread block receives a fixed amount of execution resources: that is not always efficient, because threads may be allocated more resources than they need, wasting resources and reducing throughput.
Let's look at how resource allocation affects SM occupancy with an example. If we use a block size of 32 threads and need 2048 threads in total, we have 64 such blocks. However, each SM can only handle 32 blocks at a time. So even though the SM could run 2048 threads, it only runs 1024 at a time, giving just 50% occupancy.
Similarly, each SM has 65,536 registers. To execute 2048 threads simultaneously, each thread can use at most 32 registers (65536 / 2048 = 32). If a kernel needs 64 registers per thread, each SM can only run 1024 threads, again just 50% occupancy.
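Rather than doing this arithmetic by hand, the CUDA runtime can report how many blocks of a given kernel fit on one SM, taking its register and shared-memory usage into account. A minimal sketch, assuming the `vecAdd` kernel from the earlier example:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// The vecAdd kernel from the earlier example (definition omitted here).
__global__ void vecAdd(const float *a, const float *b, float *c, int n);

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many 256-thread blocks of vecAdd can be resident on one SM?
    int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vecAdd, blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```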
The challenge with suboptimal occupancy is that it may not provide the necessary latency tolerance or compute throughput required to achieve peak hardware performance.
Writing efficient GPU kernels is a complex task. Resources must be balanced to keep occupancy high while minimizing latency. For example, giving each thread many registers can make its code run faster, but it may reduce occupancy, so careful optimization matters.
In this article, we covered the basics of GPUs, GPU computing, and their architectural design:
GPUs are made up of multiple streaming multiprocessors (SMs), each of which has multiple processing cores.
There is an off-chip global memory, an HBM or DRAM. It is far from the on-chip SMs, so its latency is high.
There is an L2 cache shared by all SMs and an L1 cache inside each SM. These operate much like the L1/L2 caches in a CPU.
There is a small amount of configurable shared memory on each SM. This is shared between cores. Typically, threads within a thread block load a piece of data into shared memory and then reuse it, rather than loading it again from global memory.
Each SM has a large number of registers, which are partitioned among threads according to their requirements. For example, the NVIDIA H100 has 65,536 registers per SM.
To execute a kernel, the GPU launches a grid of threads. A grid is made up of one or more thread blocks, and each thread block is made up of one or more threads.
The GPU assigns one or more blocks to an SM for execution based on resource availability. All threads of a block are assigned to, and execute on, the same SM. This exploits data locality and allows synchronization between the block's threads.
The threads assigned to an SM are further divided into groups of 32 called warps. All threads within a warp execute the same instruction at the same time, but on different parts of the data (SIMT).
The GPU dynamically partitions resources based on each thread's needs and the SM's limits. Programmers need to carefully optimize their code to ensure that SM occupancy stays as high as possible during execution.