A chain is only as strong as its weakest link – and your AI/ML infrastructure is only as fast as your slowest component. If you are training machine learning models with GPUs, then your weak link may be your storage solution. The result is what I call the "hungry GPU problem."
The hungry GPU problem occurs when your network or your storage solution cannot serve training data to your training logic fast enough to fully utilize your GPUs. The symptoms are fairly obvious. If you are monitoring your GPUs, you will notice that they never come close to full utilization. If you have instrumented your training code, you will notice that total training time is dominated by I/O. Unfortunately, there is bad news for anyone dealing with this problem. Let's look at some of the advances being made with GPUs to understand how this issue will only get worse in the coming years.
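If you want to check for these symptoms yourself, a minimal sketch along these lines separates the time a training loop spends waiting on the data pipeline from the time it spends computing. It assumes PyTorch and a CUDA GPU; `model`, `train_loader`, `optimizer`, and `loss_fn` are placeholders for your own objects.

```python
# Minimal sketch, assuming PyTorch and a CUDA GPU. It measures how much of an
# epoch is spent waiting on the data pipeline versus computing on the GPU.
import time
import torch

def profile_epoch(model, train_loader, optimizer, loss_fn, device="cuda"):
    data_time, compute_time = 0.0, 0.0
    model.train()
    end = time.perf_counter()
    for inputs, targets in train_loader:
        t0 = time.perf_counter()
        data_time += t0 - end                    # time blocked on the data loader / storage
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                 # wait for the GPU so the timing is honest
        end = time.perf_counter()
        compute_time += end - t0                 # transfer + forward + backward + step
    print(f"waiting on data: {data_time:.1f}s, computing: {compute_time:.1f}s")
```

If the first number dominates and a tool like nvidia-smi shows your GPUs idling, you have a hungry GPU problem.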
GPUs are getting faster. Not only is raw performance improving, but so are memory capacity and memory bandwidth. Let's look at these three characteristics of NVIDIA's most recent GPUs: the A100, the H100, and the H200.
Note: This comparison uses the statistics for the PCIe (Peripheral Component Interconnect Express) socket version of the A100 and for the SXM (Server PCI Express Module) socket version of the H100 and H200. SXM statistics were not available for the A100. For performance, the floating-point 16 Tensor Core statistic is used in the comparison.
A few observations about these statistics are worth noting. First, the H100 and the H200 have the same performance (1,979 TFLOPS), which is 3.17 times that of the A100. The H100 has twice as much memory as the A100, and its memory bandwidth increased by a similar factor – which makes sense; otherwise, the GPU would starve itself. The H200 supports 141GB of memory, and its memory bandwidth increases roughly proportionally with the larger memory. Let's look at each of these statistics in more detail and discuss what they mean for machine learning.
Performance - One teraFLOP (TFLOPS) is one trillion (10^12) floating-point operations per second. That is a 1 followed by 12 zeros (1,000,000,000,000). The floating-point operations that occur during model training are simple tensor math as well as first derivatives of the loss function (i.e., gradients). There is no exact mapping from TFLOPS to I/O demand, but a relative comparison can be made. Looking at these figures, we can see that the H100 and the H200, both at 1,979 TFLOPS, are roughly three times faster than the A100 – and if everything else can keep up, they can potentially consume data three times faster.
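To make that relative comparison concrete, here is a back-of-the-envelope sketch that relates sustained compute to the data rate the storage layer must deliver. The arithmetic-intensity value (FLOPs performed per byte of training data read) is an arbitrary placeholder, and the A100 figure assumes the 624 TFLOPS FP16 Tensor Core number with sparsity.

```python
# Back-of-the-envelope sketch: if a model performs roughly `flops_per_byte`
# floating-point operations per byte of training data it reads, a GPU sustaining
# `tflops` needs this many GB/s of training data to stay busy. The 10,000
# FLOPs/byte figure below is an arbitrary illustration, not a measured value.

def required_io_gbps(tflops: float, flops_per_byte: float) -> float:
    return tflops * 1e12 / flops_per_byte / 1e9

for name, tflops in [("A100 (FP16 Tensor Core, sparsity)", 624),
                     ("H100 / H200 (FP16 Tensor Core, sparsity)", 1979)]:
    print(f"{name}: ~{required_io_gbps(tflops, 10_000):.0f} GB/s of training data")
```

The absolute numbers are only as good as the assumptions, but the ratio is the point: whatever data rate kept an A100 busy needs to be roughly tripled to keep an H100 or H200 busy.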
GPU Memory - Also known as video RAM or graphics memory. GPU memory is separate from the system's main memory (RAM) and is dedicated to the intensive processing performed by the graphics card. GPU memory determines the batch size that can be used when training a model. In the past, batch sizes decreased when training logic moved from the CPU to the GPU. However, as GPU memory catches up to CPU memory in capacity, the batch sizes used for GPU training will grow. When performance and memory capacity increase together, the result is larger requests and each gigabyte of training data being processed faster.
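As a rough illustration of the memory-to-batch-size relationship (assuming PyTorch; the per-sample figure is hypothetical and should be measured for your own model), you can size the batch from whatever memory the GPU reports as free:

```python
# Rough sketch, assuming PyTorch on a CUDA device. `bytes_per_sample` is a
# hypothetical per-sample training footprint (activations, gradients, optimizer
# state share) that you would measure empirically for your own model.
import torch

def max_batch_size(bytes_per_sample: int, headroom: float = 0.8) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()   # free and total GPU memory
    return int(free_bytes * headroom) // bytes_per_sample

# Example: at roughly 50 MB per sample, an 80 GB H100 or a 141 GB H200 leaves room
# for a much larger batch than a 40 GB A100, and a larger batch means larger reads
# against your storage layer.
print(max_batch_size(bytes_per_sample=50 * 1024**2))
```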
Memory bandwidth - The "highway" that connects GPU memory to the compute cores. It determines how much data can be transferred per unit of time. Just as a wider highway lets more cars through in a given amount of time, higher memory bandwidth allows more data to move between memory and the compute cores. As you can see, the designers of these GPUs have increased the ratio of memory bandwidth to memory with each new version, so data movement inside the chip does not become a bottleneck.
In August 2023, NVIDIA announced its next GPU platform for accelerated computing and generative AI: the Grace Hopper Superchip platform. The new platform uses Grace Hopper Superchips, which can be connected via NVIDIA NVLink so that they work together during model training and inference. While every spec on the Grace Hopper Superchip is an improvement over previous chips, the most important innovation for AI/ML engineers is its unified memory. Grace Hopper gives the GPU full access to the CPU's memory. This matters because, in the past, engineers who wanted to train with GPUs first had to pull data into system memory and then move it from there into GPU memory. Grace Hopper eliminates the need to use CPU memory as a bounce buffer to get data to the GPU.

A simple comparison of a few key GPU statistics, along with what Grace Hopper is capable of, should be a little unsettling for anyone tasked with upgrading GPUs and making sure everything else can keep up. A storage solution will definitely need to serve data at a much faster rate to keep pace with these GPU improvements. Let's look at a common solution to the hungry GPU problem.
There is a common and straightforward solution to this problem that does not require organizations to replace or upgrade their existing storage solution. You keep your existing storage solution intact so you retain all the enterprise features your organization depends on. That storage solution is likely a data lake holding all of your organization's unstructured data – so it can be quite large, and total cost of ownership is a consideration. It also has a number of features enabled for redundancy, reliability, and security, all of which impact performance. What you can do, however, is stand up a storage solution in the same data center as your compute infrastructure – ideally in the same cluster as your compute. Make sure it has a high-speed network and the fastest storage devices available. From there, copy only the data needed for ML training. Amazon's recently announced Amazon S3 Express One Zone exemplifies this approach. It is a bucket type optimized for high throughput and low latency, limited to a single Availability Zone (no replication). Amazon's intent is for customers to use it to hold a copy of the data that needs high-speed access, which makes it well suited to model training. According to Amazon, it provides 10x faster data access than S3 Standard at 8x the cost. Read more about our evaluation of Amazon S3 Express One Zone.
The common solution I outlined above has prompted Amazon to customize its S3 storage service by offering a specialized bucket type at increased cost. It has also led some organizations (that are not MinIO customers) to purchase specialty storage products to accomplish the simple things described above. Unfortunately, this adds complexity to the existing infrastructure, since a new product is needed to solve a relatively simple problem.
The irony in all of this is that MinIO customers have always had this option. You can do everything I described by installing a new instance of MinIO on a high-speed network with NVMe drives. MinIO is a software-defined storage solution – the same product runs on bare metal or on the cluster of your choice, using a variety of storage devices. If your enterprise data lake uses MinIO on bare metal with HDDs and it works fine for all of your non-ML data, there is no reason to replace it. However, if your ML datasets need faster I/O because you are using GPUs, then consider the approach I outlined in this post: make a copy of your ML data in a high-speed instance of MinIO, and keep the gold copy in your existing, hardened MinIO installation. This also lets you turn off features such as replication and encryption in the high-speed instance of MinIO, further improving performance. Copying the data is easy with MinIO's mirroring feature.
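From the command line, `mc mirror` handles the copy (for example, `mc mirror datalake/enterprise-data/ml-training fast/training-data`, where `datalake` and `fast` are aliases for the two deployments). For illustration only, here is a minimal sketch of the same idea using the MinIO Python SDK; the endpoints, credentials, bucket names, and prefix are all placeholders.

```python
# Illustrative sketch only; `mc mirror` is the usual tool for this. It copies
# every object under a prefix from the existing data lake to a high-speed,
# NVMe-backed MinIO instance. All endpoints, credentials, and names are placeholders.
from minio import Minio

lake = Minio("datalake.example.com:9000", access_key="LAKE_KEY",
             secret_key="LAKE_SECRET", secure=True)
fast = Minio("minio-nvme.example.com:9000", access_key="FAST_KEY",
             secret_key="FAST_SECRET", secure=True)

for obj in lake.list_objects("enterprise-data", prefix="ml-training/", recursive=True):
    resp = lake.get_object("enterprise-data", obj.object_name)
    try:
        # Stream the object straight into the high-speed bucket (multipart upload
        # of unknown length, 10 MiB parts) without staging it on local disk.
        fast.put_object("training-data", obj.object_name, resp,
                        length=-1, part_size=10 * 1024 * 1024)
    finally:
        resp.close()
        resp.release_conn()
```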
MinIO has the performance needed to feed hungry GPUs – a recent benchmark achieved 325 GiB/s on GETs and 165 GiB/s on PUTs with just 32 nodes of off-the-shelf NVMe SSDs.
Join MinIO today to learn how easy it is to build a data lakehouse. If you have any questions, be sure to reach out to us!