With Nvidia's market capitalization above $2 trillion, the third largest in the world, who can challenge the GPU chips it depends on?
An AI inference chip called an LPU could be the answer.
Anyone who has used large models will have noticed that they are slow to answer: the reply usually appears word by word or sentence by sentence, with stutters along the way. Running on the LPU, however, a large model is dramatically faster and can output 500 tokens per second, far above ChatGPT's 40 tokens per second. Commonly used large models rely on GPU acceleration and average about 20 tokens per second.
Since February 19, the LPU has stayed in the spotlight, cast as the challenger to Nvidia's GPUs. Nvidia's flagship chip, the H100, is hard to come by, and that gives Groq an opening.
The LPU comes from Groq, a unicorn founded in 2016 with a star R&D team. The team came out of Google, where it single-handedly built the TPU chip project. Nvidia once coined the term "GPU" (graphics processing unit) to sell graphics cards; Groq coined "LPU", short for "language processing unit", for a chip built specifically for large language model inference.
Groq says the LPU is 10 times faster than the H100 at one-tenth the cost, a jump that could be described as taking large-model speed from the "feature phone" level to the "smartphone" level. The chip became the market's new favorite at a most awkward moment for Nvidia.
Cracks that were pried open
The discussion around the Groq LPU has split in two directions. Overseas developers have been building all kinds of DIY projects on the Groq LPU, and developers of very different applications share the same reaction: it is remarkably fast. The Chinese technology community, meanwhile, has debated the cost of Groq LPUs and dug into the technology behind them.
By Groq's own figures, its LPU delivers 10 times the speed of the Nvidia H100 at one-tenth the inference cost.
This claim raises questions. Calculations by AI scientist Jia Yangqing show that although the Groq LPU is fast, its annual electricity cost is 10 times that of the H100. Further discussion concluded that the Groq LPU has no advantage in either operating cost or procurement cost.
The reason is simple: the "inference cost" Groq cites is mainly a performance measure, and it refers to energy efficiency. When the industry talks about cost, it uses power efficiency, which is tied directly to electricity consumption. So the LPU's numbers against the H100 look good in testing but have limited practical reference value.
In addition, a Groq LPU has 230 MB of memory, while an H100 has 80 GB. Running a model of the same size therefore takes far more Groq LPUs.
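The capacity gap can be turned into a rough chip count. The sketch below is illustrative only: it assumes a hypothetical 70B-parameter model, counts weight storage alone (no KV cache, activations, or redundancy), and treats MB and GB as decimal units. The function name `chips_needed` is made up for this example.

```python
import math

def chips_needed(params_billion: float, bytes_per_param: float,
                 mem_per_chip_bytes: float) -> int:
    """Minimum number of chips whose combined memory can hold the weights."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return math.ceil(weight_bytes / mem_per_chip_bytes)

LPU_MEM = 230e6   # 230 MB per Groq LPU (figure from the text)
H100_MEM = 80e9   # 80 GB per H100 (figure from the text)

# Assumed precisions: 2 bytes per weight (FP16) and 1 byte per weight (INT8).
for label, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
    lpus = chips_needed(70, bytes_per_param, LPU_MEM)
    h100s = chips_needed(70, bytes_per_param, H100_MEM)
    print(f"{label}: {lpus} LPUs vs {h100s} H100s")
```

Under these assumptions the weights alone occupy hundreds of LPUs but only one or two H100s, which is why the per-chip comparison says little about the cost of a full deployment.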
A Groq LPU cluster delivers enormous computing power, with very high throughput and capacity, and that is where the very high output speed and very low latency seen in inference come from. The price is very high power consumption.
Yet none of that has kept Groq LPUs from entering the market.
After the surge in attention, Groq founder Jonathan Ross showed off chip deliveries on the social platform X, hinting that the company had gained a foothold in the AI chip market.
Within about a week, Groq launched a new division, Groq Systems, focused on building an ecosystem for customers and developers. At the same time it announced the acquisition of the startup Definitive Intelligence to strengthen its GroqCloud business. A partnership with Saudi Aramco to build out GroqCloud's inference capacity followed.
In addition, the Groq LPU does not rely on HBM from Samsung or SK Hynix, nor on TSMC's CoWoS packaging. Its supply chain is entirely in North America, and it uses a mature 14nm process. Almost every factor behind the shortage of mainstream chips has been bypassed. Ross further stated that Groq plans to deploy 42,000 LPUs, aims to expand that to 220,000 through partnerships, and intends to reach 1.5 million by next year.
It seems a crack has been pried open in the Nvidia-dominated AI chip market.
More than 100 startups are working on AI-specific chips, and many claim their chips are comparable to the H100, but few have withstood the kind of scrutiny the Groq LPU has faced. Groq founder Jonathan Ross has his own view of the market: "No one buys something because it's better, but because they have unsolved problems. Groq does things very differently."
Unlike a CPU or GPU, the Groq LPU is designed with a "software-defined" approach, one that is becoming a trend in autonomous driving, networking, storage, and other hardware.
The determinism of dedicated chips
The classic line "software is eating the world" summed up and predicted the internet's .com and app era. Andrej Karpathy, Tesla's former director of artificial intelligence, emphasizes "Software 2.0": software is eating the world, and AI is eating software.
Software in the past was written in languages such as Python and C++, and programmers could explain every line of it. That is "Software 1.0". In Karpathy's view, "Software 2.0" is an abstract neural network: programmers can only write the framework around it and have little ability to inspect what goes on inside. Unlike binaries or scripts, the matrix multiplications of a neural network can run on many different computational configurations. As neural networks become a standard commodity, software-first, software-defined hardware becomes possible.
Karpathy helped Tesla launch its self-driving system on the basis of this philosophy, and from the outset he was firmly committed to vision-based algorithms that do not rely on lidar or high-definition maps.
Today, neural network algorithms are used to solve various problems in science, transportation, security, and other fields, and are typically computationally intensive tasks due to the huge amount of matrix computation required by deep neural networks. The explosion of large models has further increased the scale and complexity of computing, bringing challenges to traditional CPU and GPU architectures.
The microarchitecture of CPUs and GPUs was not designed for deep neural networks, and many of their inherent features make the order and timing of instruction execution non-deterministic and hard to reason about. In large language models, for example, generation is serial rather than parallel: the (n+1)th value cannot be computed until the nth is known. GPUs, designed for parallelism, therefore cannot reach their full performance on large language models.
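The serial dependency can be seen in a toy decoding loop. This is a minimal sketch, not a real model: `toy_next_token` is a hypothetical stand-in for a forward pass, chosen only so that each output depends on every token before it.

```python
def toy_next_token(tokens: list[int]) -> int:
    """Hypothetical stand-in for a model forward pass over all tokens so far."""
    return (sum(tokens) * 31 + len(tokens)) % 100

def generate(prompt: list[int], n_new: int) -> list[int]:
    """Autoregressive generation: one new token per step, in strict order."""
    tokens = list(prompt)
    for _ in range(n_new):
        # Step n+1 needs the token produced at step n as input,
        # so these iterations cannot be run side by side.
        tokens.append(toy_next_token(tokens))
    return tokens

print(generate([1, 2, 3], 5))
```

Work inside a single step (the matrix multiplications of one forward pass) can still be parallelized; it is the loop across output tokens that must run in sequence.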
Groq says it redesigned the chip architecture "inspired by software-first thinking", optimizing it for serial tasks and removing extraneous circuitry. The design stands in stark contrast to the GPU, which resembles a large workshop where workers move between different parts of the job. The LPU instead provides an assembly line that processes data in a sequential, orderly way.
It took Groq a long time to arrive at these ideas. Chamath Palihapitiya, a venture investor in Groq, described on a podcast how the team stumbled in its early years.
In the early days, while a lidar solution was under consideration, Groq sought cooperation with Tesla but was "kindly refused". The team later tried to sell its technology to high-frequency trading firms and to three-letter agencies, and failed both times. Only after seeing Nvidia's CUDA did the Groq team realize it had to build a high-level compiler that could adapt to a wide variety of models. Since its founding, Groq has spent almost half of its time on compiler development.
The Groq LPU implements software-defined hardware: the chip hands management over to the compiler, which is responsible for scheduling and execution control and absorbs the non-determinism, so the hardware can focus on deterministic computation. This approach bypasses the limits of the traditional hardware-centric architecture and is the foundation of the Groq LPU's low latency and high throughput.
"Software-defined" is not a new concept, but it has regained popularity in recent years. Intel, for example, proposed a "software-defined, silicon-enhanced" strategy under the leadership of Pat Gelsinger. Gelsinger said that software indirectly defines Intel's foundry strategy and its factories' ability to produce accelerator chips. In intelligent driving, the software-defined vehicle is the direction nearly every player has settled on.
Groq applies "software-defined" to chip design and extends it to chip clusters. According to Groq, large models accelerated by GroqCloud can run up to 18 times faster than on other cloud computing services.
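The idea of moving scheduling from hardware into the compiler, as this section describes, can be illustrated with a toy static scheduler. This is only a sketch under strong assumptions, not Groq's compiler: the op names and cycle counts are hypothetical, every op has a fixed known latency, and there are no resource conflicts between ops.

```python
from graphlib import TopologicalSorter

def static_schedule(latency: dict[str, int],
                    deps: dict[str, set[str]]) -> dict[str, int]:
    """Fix each op's start cycle ahead of time.

    An op starts at the latest finish time among the ops it depends on.
    """
    start: dict[str, int] = {}
    # static_order() yields every op after all of its dependencies.
    for op in TopologicalSorter(deps).static_order():
        start[op] = max((start[d] + latency[d] for d in deps[op]), default=0)
    return start

# Hypothetical ops for one layer, with assumed cycle counts.
latency = {"load_w": 4, "load_x": 3, "matmul": 8, "bias": 1, "act": 2}
deps = {
    "load_w": set(),
    "load_x": set(),
    "matmul": {"load_w", "load_x"},
    "bias": {"matmul"},
    "act": {"bias"},
}

schedule = static_schedule(latency, deps)
print(schedule)
```

Because every start cycle is known before the program runs, execution time is the same on every run. That predictability is what deterministic hardware gives the compiler to work with.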
The bottleneck of general-purpose chips
Today's large models run on general-purpose GPUs such as the A100 and H100, which supply the enormous FLOPS that training on large volumes of data demands. Once a trained model moves into inference, however, the bottleneck of the general-purpose GPU is magnified.
Autoregressive models, of which the Transformer is the representative, require many rounds of repeated computation during inference: for every token generated, all input tokens are processed again. Each generated token also requires exchanging data with memory, a process known as memory access. When a long sequence is generated, the speed of memory access determines the speed of generation.
Even general-purpose GPUs such as the H100 run into limits during inference. A Caitong** report noted that the computing speed of the world's most advanced AI chips is "much faster" than their memory bandwidth. Memory access speed caps inference speed, which leaves computing power underutilized.
In other words, the H100 that large model manufacturers and companies queue up to buy cannot actually be fully used. Lower computing power utilization is equivalent to increasing the cost of chip procurement.
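The memory-bound ceiling described above can be estimated with simple arithmetic. The sketch below makes several assumptions that are not from the article: a hypothetical 70B-parameter model stored in FP16, a single request with no batching, all weights read once per generated token, and roughly 3.35 TB/s of memory bandwidth for an H100-class GPU.

```python
def max_tokens_per_second(weight_bytes: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when each token requires reading all weights."""
    return bandwidth_bytes_per_s / weight_bytes

WEIGHT_BYTES = 70e9 * 2    # assumed: 70B parameters at 2 bytes each (FP16)
HBM_BANDWIDTH = 3.35e12    # assumed: ~3.35 TB/s for an H100-class GPU

bound = max_tokens_per_second(WEIGHT_BYTES, HBM_BANDWIDTH)
print(f"memory-bound ceiling: {bound:.1f} tokens/s")

# The ceiling scales linearly with bandwidth, regardless of available FLOPS.
print(f"with 2x bandwidth:    {2 * bound:.1f} tokens/s")
```

Under these assumptions the ceiling is a few dozen tokens per second for a single stream, no matter how much compute sits idle, which is why raising memory bandwidth is the lever chip designers reach for.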
The industry commonly uses pruning, distillation, and operator optimization to raise utilization, but these methods inevitably affect model quality.
Nvidia's H200 chip is equipped with Micron's new-generation HBM3E memory, which raises peak memory bandwidth by 44%.
Groq's approach is to replace HBM with SRAM. Drawing on SRAM's inherent advantages, it raises memory bandwidth to 80 TB/s on a single chip, which improves inference speed by orders of magnitude. The idea has appeared before in products from Graphcore and Pingtouge (T-Head), but Groq takes it further and relies on SRAM entirely.
In addition, the Groq LPU is built on a 14nm process, and the next generation, due in 2025, will move to Samsung's 4nm, making room for more compute units and SRAM. The Next Platform estimates that inference on a 70B model takes 576 LPUs today and may take only about 100 by 2025.
Groq will also use Samsung's 4nm fab in North America, which avoids the supply bottleneck as far as possible. That is why there is still a market for Groq LPUs. Nvidia CEO Jensen Huang said on the earnings call that the overall AI chip supply situation is improving, but that the shortage is expected to last throughout 2024.
Groq investor Chamath believes that today's AI is still more of a proof of concept, or a toy app, that is hard to deliver broadly to enterprise customers as a commercial product. The main reasons are that large models are not yet good enough, are too slow, and demand too much infrastructure and cost. As AI moves toward commercialization, LPUs suit developers of all sizes, and Groq has a chance to make a leap forward commercially.
Groq sorts its customers into three categories: large-scale data centers, Fortune 3000 companies, and everyone else. Chamath revealed that in the short period when the Groq LPU took off, most of the customers who signed up came from large-scale companies.
In an interview with The Futurum, Groq executives said they see the Global 3000 as an important market for LPUs. Companies outside the Fortune 3000 tend to try cloud-based products such as APIs. More and more businesses are telling Groq that they want independent ownership of their proprietary data, and many are considering adding on-premise data centers rather than relying solely on data center vendors.