How to break through the biggest bottleneck of large language models

Mondo Technology Updated on 2024-03-07

Translator: Bugatti.

Large language models (LLMs) such as OpenAI's GPT-4 and Anthropic's Claude 2 have captured the public's imagination with their ability to generate human-like text. Businesses are just as enthusiastic, with many exploring how LLMs can improve their products and services. However, one major bottleneck severely limits the adoption of state-of-the-art LLMs in production: rate limiting. There are ways to work around these rate limits, but real progress may not come without improvements in computing resources.

Public LLM APIs, which give users access to models from companies such as OpenAI and Anthropic, impose strict limits on the number of tokens (units of text) that can be processed per minute, the number of requests per minute, and the number of requests per day.

API calls to OpenAI's GPT-4, for example, are currently limited to 3 requests per minute (RPM), 200 requests per day, and a maximum of 10,000 tokens per minute (TPM). The highest tier allows up to 10,000 RPM and 300,000 TPM.
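To see what staying under such limits means in practice, here is a minimal client-side throttling sketch. It is only an illustration, not code from any provider's SDK: the limit values are the figures quoted above, and `call_llm` is a hypothetical stand-in for a real API call.

```python
import time
from collections import deque

# Illustrative limits taken from the figures quoted above (a hypothetical tier).
MAX_REQUESTS_PER_MINUTE = 3
MAX_TOKENS_PER_MINUTE = 10_000

_request_times = deque()   # timestamps of recent requests
_token_usage = deque()     # (timestamp, tokens) pairs for recent requests

def _prune(now):
    """Drop entries older than 60 seconds from the sliding window."""
    while _request_times and now - _request_times[0] > 60:
        _request_times.popleft()
    while _token_usage and now - _token_usage[0][0] > 60:
        _token_usage.popleft()

def throttled_call(prompt, estimated_tokens, call_llm):
    """Block until the request fits under both the RPM and TPM limits,
    then invoke call_llm (a placeholder for an actual API call)."""
    while True:
        now = time.time()
        _prune(now)
        tokens_in_window = sum(t for _, t in _token_usage)
        if (len(_request_times) < MAX_REQUESTS_PER_MINUTE
                and tokens_in_window + estimated_tokens <= MAX_TOKENS_PER_MINUTE):
            break
        time.sleep(1)  # wait for the window to clear
    _request_times.append(now)
    _token_usage.append((now, estimated_tokens))
    return call_llm(prompt)
```

At three requests per minute, anything beyond a demo spends most of its time in that sleep loop, which is exactly the problem described next.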

For large, production-grade applications that need to process millions of tokens per minute, these rate limits make it practically impossible for enterprises to use state-of-the-art LLMs. Requests pile up in queues, taking minutes or even hours to complete, which rules out any real-time processing.

Most organizations are still struggling to adopt LLMs safely and effectively at scale. But even once they solve the challenges around data sensitivity and internal processes, rate limiting remains a stubborn obstacle. Startups building products around LLMs hit bottlenecks quickly as usage and data accumulate, but large enterprises with big user bases are the most constrained: without special access arrangements, their applications simply cannot function.

What to do? One way is to sidestep rate-limited technology altogether. For example, there are purpose-built generative AI models that don't have LLM bottlenecks. DiffBlue, an Oxford, UK-based startup, relies on reinforcement learning techniques that impose no rate limits. It does one thing very well, does it very efficiently, and can cover millions of lines of code: it creates Java unit tests 250 times faster than developers, and they compile 10 times faster.

Unit tests written by DiffBlue Cover make it possible to understand complex applications quickly, enabling both large enterprises and startups to innovate with confidence, which is ideal for migrating legacy applications to the cloud. It also autonomously writes new unit tests, improves existing ones, accelerates CI/CD pipelines, and provides insight into change-related risks without the need for human review. Not bad.

Of course, some companies have to rely on LLMs. What options do they have?

One option is to request a rate-limit increase. So far this has been reasonable practice, but the underlying problem is that many LLM providers don't actually have much extra capacity to offer. This is the crux of the matter. GPU availability depends on the total number of silicon wafers coming out of foundries such as TSMC. Nvidia, the dominant GPU maker, cannot procure enough chips to meet the surging demand created by AI workloads, and large-scale inference requires thousands of GPUs working together.

The most straightforward way to increase GPU supply is to build new semiconductor fabrication plants, so-called fabs. But a new fab costs as much as $20 billion and takes years to build. Major chipmakers such as Intel, Samsung Foundry, TSMC, and Texas Instruments are building new semiconductor production facilities in the United States. For now, everyone can only wait.

As a result, there are very few actual production deployments using GPT-4. Those that do exist are limited in scope and use the LLM as an auxiliary feature rather than as a core product component. Most companies are still running pilots and proofs of concept. The work required to integrate LLMs into enterprise workflows is substantial in its own right, even before rate limits enter the picture.

GPU constraints limit GPT-4's processing capacity, which has prompted many companies to use other generative AI models. AWS, for example, has its own chips dedicated to training and to inference (running models once they are trained), giving customers more flexibility. Importantly, not every problem requires the most powerful and expensive computing resources. AWS offers a range of cheaper, easier-to-tune models, such as Titan Light. Some companies are exploring alternatives such as fine-tuning open-source models like Meta's Llama 2. Weaker models are sufficient for simple use cases involving retrieval-augmented generation (RAG), where relevant context is attached to the prompt and a response is generated, as in the sketch below.
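A minimal sketch of the RAG pattern just described, under the assumption of hypothetical `embed`, `vector_index`, and `generate` components rather than any specific vendor API:

```python
# A minimal retrieval-augmented generation (RAG) loop. embed(),
# vector_index.search(), and generate() are hypothetical stand-ins for
# whatever embedding model, vector store, and LLM an application uses.

def answer_with_rag(question, vector_index, embed, generate, top_k=3):
    # 1. Embed the user question and retrieve the most relevant documents.
    query_vector = embed(question)
    retrieved_docs = vector_index.search(query_vector, top_k=top_k)

    # 2. Attach the retrieved context to the prompt.
    context = "\n\n".join(doc.text for doc in retrieved_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 3. A smaller or cheaper model is often enough here, because the
    #    retrieved context does most of the heavy lifting.
    return generate(prompt)
```

Because the retrieved documents supply the facts, the generation step can often be handled by a weaker, cheaper model without hitting the rate limits of a frontier LLM.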

Other techniques can also help, such as processing requests in parallel across multiple older LLMs with higher limits, chunking data, and model distillation. Several techniques can also make inference cheaper and faster. Quantization reduces the precision of the model's weights, which are typically 32-bit floating-point numbers. This is not a new approach; Google's inference hardware, the tensor processing unit (TPU), for example, only works with models whose weights have been quantized to 8-bit integers. The model loses some accuracy but becomes much smaller and runs faster.
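As a rough illustration of what quantization does, the toy NumPy snippet below maps 32-bit floating-point weights to 8-bit integers with a single scale factor and measures the error introduced. It is a conceptual sketch, not a production kernel; real frameworks use per-channel scales, zero points, and calibration data.

```python
import numpy as np

# Toy symmetric int8 quantization of a weight matrix: store the weights in
# 8 bits and recover an approximation of the original values at run time.

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((256, 256)).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0  # map the largest weight to +/-127
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision was lost.
weights_dequant = weights_int8.astype(np.float32) * scale
mean_abs_error = np.abs(weights_fp32 - weights_dequant).mean()

print(f"storage: {weights_fp32.nbytes} bytes -> {weights_int8.nbytes} bytes")
print(f"mean absolute error after int8 round-trip: {mean_abs_error:.5f}")
```

The weights shrink to a quarter of their original size, and integer arithmetic is cheaper on most hardware, at the cost of the small rounding error the script prints.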

A newly popular technique called sparse models can reduce the cost of training and inference, and it requires less effort than model distillation. Think of an LLM as a collection of many smaller language models: when you ask GPT-4 a question in French, for example, only the French-related parts of the model need to be used, and sparse models exploit exactly this.

You can do sparse training, where only the French subset of the model is trained, or sparse inference, where only the French part of the model is run. Combined with quantization, this can extract a smaller, specialized model from an LLM that runs on CPUs instead of GPUs. GPT-4 is famous precisely because it is a general-purpose text generator, but many workloads only need a narrower, more specific model.
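The "only run the French part" idea can be sketched as a tiny mixture-of-experts router. This is a conceptual toy under my own assumptions, not a description of how GPT-4 is built (OpenAI has not disclosed its architecture): a gating function picks one expert per input, so only a fraction of the weights are touched at inference time.

```python
import numpy as np

# Toy sparse (mixture-of-experts) inference: a router scores each expert and
# only the best-scoring expert's weights are used, so most of the model's
# parameters stay untouched for any single request.

rng = np.random.default_rng(0)
d_model, n_experts = 64, 4

router_weights = rng.standard_normal((d_model, n_experts))
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def sparse_forward(x):
    scores = x @ router_weights           # one score per expert
    chosen = int(np.argmax(scores))       # route to a single expert
    return expert_weights[chosen] @ x, chosen  # only 1/n_experts of the weights run

x = rng.standard_normal(d_model)
output, expert_id = sparse_forward(x)
print(f"request routed to expert {expert_id}; "
      f"{len(expert_weights) - 1} of {len(expert_weights)} experts skipped")
```

Because only the chosen expert's weights are loaded and multiplied, the per-request compute (and memory traffic) drops roughly in proportion to the number of experts skipped.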

On the hardware side, new processor architectures designed specifically for AI workloads should improve efficiency. Cerebras has developed a huge wafer-scale engine optimized for machine learning, while MantiCore refurbishes GPU chips discarded by manufacturers to provide usable hardware.

Ultimately, the biggest gains will come from next-generation LLMs that require less compute. Combined with optimized hardware, future LLMs could break through today's rate-limiting barriers. For now, the ecosystem is overwhelmed by the rush of companies eager to exploit LLM capabilities. Those looking to blaze a new trail in AI may have to wait until the GPU crunch eases. Ironically, these constraints may help deflate some of the hype surrounding generative AI and give the industry time to settle into sustainable patterns for using it efficiently and affordably.
