Translated from "Kubernetes Clusters Have Massive Overprovisioning of Compute and Memory" by Jeffrey Burt. In the decade since Google released Kubernetes to the open source community, it has become the go-to platform for orchestrating and managing software containers and microservices, beating out competitors like Docker Swarm and Mesosphere. (Remember them? In another ten years, you probably won't.) Companies building software stacks have adopted Kubernetes to create their own container platforms, such as Red Hat's OpenShift and VMware's Tanzu, and almost every cloud service provider offers Kubernetes as a managed service.
Today, more than 5.6 million developers use Kubernetes, and it accounts for 92% of the container orchestration tool space, according to the Cloud Native Computing Foundation. Kubernetes is powerful, but it has also become essential for software developers and DevOps engineers in an increasingly distributed and fast-moving IT world, said Laurent Gil, co-founder and chief product officer of Cast AI, a startup whose AI-based automation platform is designed to help organizations optimize their use of Kubernetes.
"Think of Kubernetes as a great toolbox," Gil told The Next Platform. "We used to have monolithic applications. The benefit of Kubernetes is that you can break your application down into smaller parts, and some of those parts can be replicated, so you can scale easily. Imagine you're Netflix – they actually use Kubernetes – and there are millions of people pouring in and out at the same time. If you're using Kubernetes, you can replicate these containers almost infinitely to handle that traffic. Containers are perfect for this situation. You can scale out. It's pretty much designed that way."
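To make that replication model concrete, here is a minimal sketch, assuming the official Kubernetes Python client and a hypothetical Deployment named video-frontend in the default namespace, that scales a workload out to absorb a traffic spike:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; inside a cluster you would
# call config.load_incluster_config() instead.
config.load_kube_config()
apps = client.AppsV1Api()

# Scale the hypothetical "video-frontend" Deployment to 20 identical
# replicas; the scheduler spreads the extra pods across the cluster's
# nodes so they can absorb the additional traffic.
apps.patch_namespaced_deployment_scale(
    name="video-frontend",
    namespace="default",
    body={"spec": {"replicas": 20}},
)
```

In practice this is usually driven by a HorizontalPodAutoscaler rather than a manual call, but the underlying mechanism, adding or removing identical copies of a container, is the same.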
That said, developers face challenges when using Kubernetes in the cloud, and one of the key challenges is configuring the CPU and memory for their applications. Last year, the five-year-old company looked at how well developers and DevOps staff estimate the amount of IT resources their Kubernetes applications need, and the results weren't good.
According to Gil, developers often request far more compute and memory than they actually need, resulting in significant overspending. In 2022 there was a large gap between provisioned CPUs and requested CPUs, at 37%, and the company found that the gap widened further to 43% last year. That gap represents overprovisioning measured against the developers' own estimates of what they need.
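As a rough illustration of what that gap measures, here is a minimal sketch, assuming the official Kubernetes Python client and credentials in the local kubeconfig (the report's exact methodology may differ), that compares the CPU a cluster's nodes provision with the CPU its containers actually request:

```python
from kubernetes import client, config

def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '2' or '500m' to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

config.load_kube_config()
core = client.CoreV1Api()

# What the cluster provisions: allocatable CPU summed across all nodes.
provisioned = sum(
    cpu_to_cores(node.status.allocatable["cpu"]) for node in core.list_node().items
)

# What developers asked for: container CPU requests summed across all pods.
requested = 0.0
for pod in core.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        requests = container.resources.requests or {}
        requested += cpu_to_cores(requests.get("cpu", "0"))

gap = (provisioned - requested) / provisioned * 100
print(f"{provisioned:.1f} cores provisioned, {requested:.1f} requested, gap: {gap:.0f}%")
```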
"This means that within a year, the waste actually increased, not decreased," he said. "It should be clear by now: if you need two CPUs, just provision two. Don't provision three. But it's worse than last year."
The Cast AI researchers also looked at how many of the provisioned CPUs developers were actually using. On average, the figure was 13%. They wanted to see whether the numbers were better in larger clusters, but in clusters with 1,000 or more CPUs, CPU utilization was only 17%.
Clusters with 30,000 or more CPUs did reach 44% utilization, but they made up only 1% of the systems examined.
All of this indicates that CPUs are massively overprovisioned and most of the computing power is idle.
"I didn't expect it to turn out well, but I didn't expect it to be this bad," he said. "On average, you're overprovisioned by eight times. Out of 100 machines – and the CPU is the most expensive component in Kubernetes – only 13 are really doing the work. On average, you don't use the rest. If you have 100 machines, they're all used, but only 13% of each one is used. Kubernetes is like gas in a room: it fills up the space. If you have applications running on those machines, they will all be used, but only at 13% utilization."
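Measuring that kind of utilization is straightforward if the cluster runs metrics-server (or another metrics.k8s.io provider). Here is a minimal sketch, again assuming the official Kubernetes Python client, that takes a point-in-time snapshot of node CPU usage against allocatable capacity; the report's figures are averaged over time, so treat this only as an illustration:

```python
from kubernetes import client, config

def cpu_to_cores(quantity: str) -> float:
    """Convert CPU quantities such as '2', '500m', or '250000000n' to cores."""
    units = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
    suffix = quantity[-1]
    return float(quantity[:-1]) * units[suffix] if suffix in units else float(quantity)

config.load_kube_config()
core = client.CoreV1Api()
metrics = client.CustomObjectsApi()

# Allocatable CPU across all nodes: the capacity you are paying for.
allocatable = sum(
    cpu_to_cores(n.status.allocatable["cpu"]) for n in core.list_node().items
)

# Current usage per node from the metrics.k8s.io API (requires metrics-server).
node_metrics = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
used = sum(cpu_to_cores(item["usage"]["cpu"]) for item in node_metrics["items"])

print(f"Using {used:.1f} of {allocatable:.1f} cores "
      f"({used / allocatable:.0%} utilization, about {allocatable / used:.1f}x overprovisioned)")
```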
For the 2024 Kubernetes Cost Benchmark Report, Cast AI looked at 4,000 clusters running on AWS, Azure, and Google Cloud Platform between January 1 and December 31 of last year, before optimizing them with the vendor's platform. Clusters with fewer than 50 CPUs were excluded from the analysis. Another area the report examined was utilization across the cloud-hosted Kubernetes platforms. On AWS's Elastic Kubernetes Service (EKS) and Microsoft's Azure Kubernetes Service (AKS), utilization hovered around 11%, while on Google Cloud's Google Kubernetes Engine (GKE) it was somewhat better at 17%. Clusters on GKE tend to be larger than those on the other two services, and GKE offers custom instance types.
"Google is the source of Kubernetes, and it probably has savvier users, so you could read it that way," Gil said. "But you know what? Frankly, even 17% isn't good. It is still more than five times overallocated. Think about it: you go to your CTO and you say, 'You know what? You can reduce your cloud costs by a factor of five, because you don't actually need that much.'"
Cast AI also looked at memory and found that, on average, memory utilization was 20%. Memory is cheaper than CPUs, though, so it would be better if the higher utilization were on the CPU side, Gil says. But that's not the case.
"People focus more on memory, and essentially, because they focus more, they do better," he said. "They focus more on memory because when a container runs out of memory, the container stops and restarts. CPUs are elastic. You can go from 0% to 80%; there's always room. With memory, you can't exceed 100%. If you exceed 100%, it crashes. It's called 'out of memory,' or OOM. This is the biggest fear of DevOps and Kubernetes people. They focus more on memory, so it's slightly better, but on average it's still five times too much."
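That asymmetry follows from how requests and limits are enforced. Here is a minimal sketch, using the official Kubernetes Python client with illustrative values only, of how a container's CPU and memory are typically declared:

```python
from kubernetes import client

# Illustrative numbers only: requests are what the scheduler reserves for
# the container, limits are hard caps enforced at runtime.
resources = client.V1ResourceRequirements(
    requests={"cpu": "500m", "memory": "256Mi"},
    limits={"cpu": "1", "memory": "256Mi"},
)

container = client.V1Container(
    name="api",
    image="example.com/api:1.0",  # hypothetical image
    resources=resources,
)

# If the process exceeds the 256Mi memory limit, the kernel OOM-kills the
# container and Kubernetes restarts it. If it exceeds the 1-CPU limit, it
# is merely throttled, which is why padding CPU feels safer and memory
# gets the closer attention.
```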
There isn't much difference between cloud platforms, with Azure having the highest memory utilization at 22%, followed by AWS at 20% and Google Cloud at 18%.
As businesses prepare to increase their spending on cloud services, the researchers write in the report, they need to address this utilization issue. Global end-user spending on public cloud services is expected to reach $678.8 billion this year, up 20.4% from $563.6 billion in 2023. Even the price of AWS Spot Instances in its most popular US region rose by an average of 23% between August 2022 and 2023.
Ten years ago, many organizations actively venturing into cloud computing were surprised to find that costs began to pile up, which, along with data sovereignty and regulatory concerns, became the main drivers behind data repatriation over the past few years. Improving resource utilization would help, Gil says.
The problem, he said, is that determining the resources needed is still a highly manual process. Developers don't know what their application or cluster needs because they haven't seen it at scale. It's hard to guess what resources microservices need, Gil said, adding that it won't get any easier as Kubernetes becomes more complex.
"We call this a nonlinear problem: you have to adjust a lot of small variables in real time, and each variable affects the others," he said. "It's not just that you're only using 10% of it; it's that you're probably not adjusting the right variables. That's why humans overprovision. They know it's not right, but for some reason they don't know what to do."
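To see what even the simplest version of that tuning looks like, here is a simplistic sketch (not Cast AI's method) that derives a recommended CPU request for a single container from recent usage samples, using a high percentile plus some headroom; real optimizers adjust many such variables at once, across CPU, memory, node shapes, and pricing:

```python
import statistics

def recommend_cpu_request(usage_millicores: list[float], headroom: float = 1.15) -> int:
    """Recommend a CPU request (in millicores) from observed usage samples.

    Uses the 95th percentile of recent usage so bursts are covered, then
    adds a safety margin. This tunes one variable for one container; a real
    optimizer has to do this continuously across every workload and node.
    """
    p95 = statistics.quantiles(usage_millicores, n=20)[-1]  # 95th percentile
    return round(p95 * headroom)

# Hypothetical per-minute CPU usage trace for one container.
samples = [120, 135, 140, 128, 300, 150, 145, 132, 138, 160,
           142, 155, 310, 149, 137, 141, 158, 133, 147, 152]
print(recommend_cpu_request(samples), "millicores")  # far below a 1000m guess
```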
A growing number of vendors offer automation tools and platforms to improve resource optimization in the cloud, including established players like Cisco Systems, AppDynamics, Nutanix, Apptio, VMware, and Flexera. Cast AI says its AI-based platform can cut organizations' cloud costs by 50% or more. In November 2023, the company raised $35 million in Series B funding, bringing its total raised to $73 million.