Will Groq's LPU be the next new favorite after NVIDIA's GPUs?

Mondo Technology Updated on 2024-02-21

Author: Mao Shuo.

Almost as soon as you press the send button, the large model generates a reply at an astonishing rate. This time, Groq completely overturned GPT-4's record of roughly 40 tokens per second, churning out about 500 tokens per second.

The reason Groq has gone viral is its astonishing speed, with the claim of being "the fastest model in history". The response speed that sets it apart among large models comes from the new AI chip that drives it - the LPU (Language Processing Unit).

Groq's LPU does not follow the beaten path.

The LPU is designed to overcome the two bottlenecks of large language models (LLMs): compute density and memory bandwidth. Compared with GPUs and CPUs, an LPU brings more compute to bear on LLM workloads, which reduces the time needed to compute each token and allows text sequences to be generated faster. In addition, eliminating the external-memory bottleneck lets the LPU inference engine achieve an order-of-magnitude improvement in performance.
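To make the bandwidth argument concrete, here is a back-of-the-envelope sketch (the bandwidth and precision figures are illustrative assumptions, not vendor specifications): in single-stream decoding, every generated token has to stream the model's weights through the memory system at least once, so memory bandwidth alone caps token speed.

```python
# Rough upper bound on single-stream decode speed when it is purely
# memory-bandwidth bound. All numbers are illustrative assumptions.

def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Ceiling on tokens/s if every token must re-read all model weights."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# A 7B model in FP16 behind ~2 TB/s of off-chip HBM (roughly GPU-class)
print(max_tokens_per_second(7, 2, 2_000))    # ~143 tokens/s ceiling

# The same model held in on-chip SRAM with an assumed ~80 TB/s of bandwidth
print(max_tokens_per_second(7, 2, 80_000))   # ~5,700 tokens/s ceiling
```

Under these assumptions, moving the weights from off-chip memory to much faster on-chip memory raises the theoretical token-rate ceiling by well over an order of magnitude.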

Unlike GPUs, which were originally designed for graphics rendering, the LPU uses an entirely new architecture built to deliver deterministic performance for AI computation.

The GPU uses SIMD (Single Instruction Multiple Data), while the LPU takes a more linear approach, avoiding the need for complex scheduling hardware. This design allows each clock cycle to be utilized efficiently, ensuring consistent latency and throughput.

To put it simply, if a GPU is like an elite sports team, where each member is good at multitasking but requires complex coordination to perform at its best, then an LPU is like a team of experts on a single project, each of whom completes the task in the most direct way in their area of expertise.

For developers, this means that performance can be precisely predicted and optimized, which is critical in real-time AI applications.

LPUs also show an advantage in energy efficiency. By reducing the overhead of managing multiple threads and avoiding poorly utilized cores, an LPU can complete more computing work with less power.

Groq also allows multiple TSPs (Tensor Streaming Processors) to be connected seamlessly, avoiding the bottlenecks common in GPU clusters and achieving very high scalability. This means that performance scales linearly as more LPUs are added, simplifying the hardware requirements for large-scale AI models and making it easier for developers to scale their applications without re-architecting the system.

For example, if you think of a GPU cluster as a group of islands connected by bridges, the capacity of those bridges limits the performance gains, even though more resources can be reached across them. The LPU, on the other hand, is like a new type of transportation system designed to avoid those traditional bottlenecks by letting multiple processing units connect seamlessly, as the toy model below illustrates.
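As a toy illustration of the scaling claim (all numbers here are hypothetical), compare ideal linear scaling with a cluster whose inter-node communication eats a fixed fraction of every chip's contribution:

```python
# Toy scaling model with hypothetical numbers: ideal linear scaling versus a
# cluster that loses a fixed fraction of throughput to inter-node communication.

def aggregate_throughput(per_chip_tps: float, n_chips: int, comm_overhead: float) -> float:
    """Total tokens/s if each added chip contributes (1 - comm_overhead) of its peak."""
    if n_chips == 1:
        return per_chip_tps
    return per_chip_tps * n_chips * (1.0 - comm_overhead)

for n in (1, 8, 64):
    ideal = aggregate_throughput(500, n, comm_overhead=0.0)   # assumed seamless, LPU-style scaling
    lossy = aggregate_throughput(500, n, comm_overhead=0.3)   # assumed 30% loss to cross-node traffic
    print(f"{n:>3} chips: ideal={ideal:>8.0f} tok/s, with overhead={lossy:>8.0f} tok/s")
```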

Is the lightning-fast Groq actually any good?

Although the LPU's innovations are eye-catching, what ultimately matters for a large model is whether it is actually good to use.

We gave ChatGPT and Groq the same prompt, with no follow-up turns.

Setting aside whether the content is correct, judging by language style alone it is not hard to see from the two models' replies that Groq's answer is somewhat stiff, with a strong "AI flavor", while ChatGPT's is more natural and shows a more thorough "understanding" of human language habits (in Chinese).

Then we asked almost the same question, and they answered something like this:

GPT's language style is steeped in "human sophistication", while Groq's still carries an "AI flavor".

Can it replace Nvidia's GPU?

Even as Groq was racing ahead at full speed, a question was being raised: is Nvidia's GPU already falling behind?

However, speed is not the only decisive factor in the development of AI. When discussing large-model inference deployment, the example of a 7B (7 billion parameter) model is very revealing.

Currently, deploying such a model requires about 14 GB of memory. On that basis, about 70 dedicated chips are needed, each corresponding to one compute card. With a common configuration, i.e. a 4U server holding 8 compute cards, deploying a 7B model takes 9 such 4U servers, which almost fills a standard server cabinet, for a total of 72 compute chips. The aggregate compute of this configuration reaches a staggering 13.5P (petaflops) in FP16 mode and up to 54P in INT8 mode.
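This arithmetic can be reproduced with a short sketch. The per-chip figures below (roughly 230 MB of SRAM, 188 TFLOPS FP16, and 750 TOPS INT8 per LPU) are commonly cited Groq specs used here as assumptions, not claims made by this article:

```python
import math

# Reproducing the deployment arithmetic above. Per-chip figures are assumed
# (~230 MB SRAM, ~188 TFLOPS FP16, ~750 TOPS INT8 per LPU) and only illustrative.

model_memory_gb  = 14      # 7B parameters stored in FP16
sram_per_chip_gb = 0.23
cards_per_server = 8       # the 4U configuration cited in the text

min_chips = math.ceil(model_memory_gb / sram_per_chip_gb)   # 61; the article rounds up to ~70
servers   = math.ceil(70 / cards_per_server)                # 9 servers
cards     = servers * cards_per_server                      # 72 cards in total

fp16_pflops = cards * 0.188    # ~13.5 PFLOPS aggregate in FP16
int8_pops   = cards * 0.750    # ~54 POPS aggregate in INT8

print(min_chips, servers, cards, round(fp16_pflops, 1), round(int8_pops))
```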

Take Nvidia's H100 as an example: it has 80 GB of high-bandwidth memory and can run five 7B models at the same time. In FP16 mode with sparsity, the H100's compute is close to 2P, and in INT8 mode it is close to 4P.

A foreign blogger ran the comparison, and the results showed that an INT8 Groq inference deployment requires 9 servers. The cost of those 9 Groq servers is far higher than that of 2 H100 servers: the Groq solution costs more than $1.6 million, while the H100 servers cost about $600,000, excluding rack-related expenses and electricity.
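A quick check of the H100 capacity claim and the quoted totals, using only the figures given in the text (racks and power excluded, as the original comparison notes):

```python
# Sanity check of the H100 capacity claim and the quoted hardware totals.

h100_memory_gb  = 80
model_memory_gb = 14
print(h100_memory_gb // model_memory_gb)            # 5 concurrent 7B models per H100

groq_total_usd = 1_600_000   # 9 Groq servers, per the cited comparison
h100_total_usd =   600_000   # 2 H100 servers
print(round(groq_total_usd / h100_total_usd, 1))    # Groq costs ~2.7x as much
```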

For larger models, such as a 70B-parameter model, INT8 mode may require at least 600 compute cards, close to 80 servers, and the cost becomes astronomical.

In fact, Groq's architecture likely has to be built around small memory paired with very large compute, so that the limited amount of data held on-chip is matched against enormous compute, which is what produces its extreme speed.

For deploying large-model inference, the most cost-effective choice is still NVIDIA's GPUs.
