In a joint study by Berkeley and DeepMind, RaLMSpec accelerates retrieval-augmented LLMs by 2 to 7 times

Mondo Education Updated on 2024-02-01

In knowledge-intensive natural language processing (NLP) tasks, traditional large language models face a major challenge: encoding massive amounts of knowledge into a fully parameterized model. This requires substantial effort during the training and deployment phases, and the problem is exacerbated whenever the model needs to adapt to new data or different downstream tasks. To address these challenges, recent studies have proposed retrieval-augmented language models (RaLMs), which combine parameterized language models with non-parameterized knowledge bases through retrieval augmentation.

A RaLM assists the generation process by having the retriever interact with the language model through either one-shot or iterative retrieval. Although iterative RaLMs deliver better generation quality, they suffer from high overhead due to frequent retrieval steps. The paper therefore raises the question: can we reduce the overhead of iterative RaLMs without compromising generation quality?

To solve this problem, the paper proposes the RaLMSpec framework, which uses speculative retrieval and batched verification to reduce the serving overhead of iterative RaLMs while guaranteeing the correctness of the model output. RaLMSpec borrows the concept of speculative execution from computer architecture, replacing expensive iterative retrieval steps with more efficient but less precise speculative retrieval steps. With the further introduction of prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit its full acceleration potential.

In an extensive evaluation of three language models on four downstream QA datasets, RaLMSpec achieved significant speedups while producing the same model output as the baseline. These results suggest that RaLMSpec can serve as a general acceleration framework for serving iterative RaLMs.

Statement: This interpretation was written by a non-human; the full text was completed independently by the Cyber Malean AI interpretation expert agent and published after manual review and illustration.


Title

Accelerating Retrieval-Augmented Language Model Serving with Speculation

Links

Traditional large language models (e.g., GPT-3, LLaMA-2, PaLM) excel at diverse natural language processing (NLP) tasks, but encoding large amounts of knowledge into such fully parametric models requires tremendous effort during training and deployment. The cost is further exacerbated when the underlying model needs to adapt to new data or to multiple different downstream tasks.

To address this challenge, recent work has introduced retrieval-augmented language models (RaLMs), which integrate parametric language models with non-parametric knowledge bases through retrieval augmentation. RaLMs can adapt to the latest data at low cost and offer better source attribution. By combining a non-parametric knowledge base with a parametric language model, they show strong potential on knowledge-intensive NLP tasks. Iterative RaLMs in particular deliver better generation quality thanks to more frequent interaction between the retriever and the language model; however, this frequent retrieval also incurs high overhead.

A key bottleneck of existing iterative RaLM methods is the inefficiency of retrieval. Due to the autoregressive nature of generative language models, each retrieval step is typically performed with a single query that summarizes the current context. Existing iterative RaLM approaches interleave the retrieval and generation steps, repeatedly retrieving from the knowledge base with the latest context-dependent queries (e.g., q0, q1, q2). The correspondingly retrieved contents (e.g., documents a, b, c) then assist the generation process, either through the prompt or through attention-level combination, by providing relevant information. However, issuing these queries to the knowledge base one by one is inherently inefficient, as illustrated by the sketch below.
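The following minimal sketch illustrates this baseline interleaving of retrieval and generation. All identifiers (`lm.decode_one`, `retriever.search`, the four-token stride) are illustrative assumptions, not the paper's released code.

```python
# Sketch of a baseline iterative RaLM decoding loop: retrieval and generation
# are interleaved, and each query is issued to the knowledge base sequentially.

def iterative_ralm_generate(lm, retriever, prompt, max_tokens=128, stride=4):
    context = prompt
    retrieved_doc = retriever.search(context)          # q0 -> document a
    for step in range(max_tokens):
        # Every `stride` tokens, issue a fresh context-dependent query.
        if step > 0 and step % stride == 0:
            retrieved_doc = retriever.search(context)  # q1, q2, ... one at a time
        next_token = lm.decode_one(context, retrieved_doc)
        context += next_token
        if next_token == lm.eos_token:
            break
    return context
```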

The RaLMSpec framework adopts speculative retrieval and batched verification to reduce the serving overhead of iterative RaLMs while provably preserving the model output. RaLMSpec replaces the expensive iterative retrieval steps of existing RaLM methods with more efficient but less accurate speculative retrieval steps, and then uses a batched verification step to correct any incorrect speculations and preserve generation quality. More specifically, after a series of speculative retrieval steps, RaLMSpec initiates a verification step by performing one batched retrieval (the queries in Figure 1(b)), where the queries in the batch correspond to the queries of the speculative retrieval steps. If a speculated document does not match the true document retrieved in the verification step, RaLMSpec automatically corrects the mismatch by rolling back to the position of the first wrong guess and re-running language model decoding with the true document. The latency savings come from efficient batched retrieval: querying the knowledge base with n queries at once is cheaper than performing n retrievals sequentially.

At its core, RaLMSpec's speculative retrieval is derived from speculative execution in computer architecture: expensive iterative retrieval steps are replaced with cheaper but less accurate speculative ones, and after a certain number of speculative steps a batched verification step checks the speculated documents against the knowledge base. Whenever a speculated document does not match the true document retrieved in the verification step, RaLMSpec rolls back to the position of the first wrong guess and re-runs language model decoding with the true document.
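The control flow below sketches this speculate-then-verify loop for a single request with a fixed stride. The names (`cache.search`, `retriever.batch_search`, `lm.decode_one`) are hypothetical stand-ins, not the paper's implementation.

```python
# Sketch of RaLMSpec-style speculation with batched verification and rollback.
# `cache.search` is the cheap speculative retrieval; `retriever.batch_search`
# hits the real knowledge base with all speculative queries at once.

def speculate_and_verify(lm, retriever, cache, context, stride):
    queries, guessed_docs, tokens = [], [], []

    # 1) Run `stride` speculative steps using only the local cache.
    for _ in range(stride):
        query = context + "".join(tokens)
        doc = cache.search(query)                 # fast but possibly wrong
        tokens.append(lm.decode_one(query, doc))
        queries.append(query)
        guessed_docs.append(doc)

    # 2) Batched verification: one batched retrieval covers all queries.
    true_docs = retriever.batch_search(queries)

    # 3) Roll back to the first mismatch and re-decode with the true document.
    for i, (guess, truth) in enumerate(zip(guessed_docs, true_docs)):
        if guess != truth:
            tokens = tokens[:i]                   # discard tokens after the bad guess
            tokens.append(lm.decode_one(queries[i], truth))
            break

    cache.update(true_docs)                       # keep the cache fresh
    return context + "".join(tokens)
```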

RaLMSpec maintains a local cache of past retrieved documents for each request and, during speculative retrieval, queries this local cache instead of the knowledge base. This exploits the temporal and spatial locality of retrieved documents: the same or consecutive documents tend to be retrieved repeatedly during generation. To improve the speculation success rate, RaLMSpec updates the local cache at each verification step, directly adding the documents (and consecutive documents) retrieved from the knowledge base. In addition, RaLMSpec supports cache prefetching, which improves speculation accuracy by storing the top-k documents retrieved from the knowledge base in the local cache.
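A cache of this kind can be as simple as a per-request, most-recently-used list of documents, optionally seeded with prefetched top-k results at each verification step. The sketch below is an illustrative data structure under these assumptions, not the paper's actual cache design.

```python
# Illustrative per-request local cache for speculative retrieval.
# Stores documents from past verification steps; prefetching adds the top-k
# knowledge-base results so near-future queries are more likely to hit.

class LocalCache:
    def __init__(self, scorer, capacity=64):
        self.scorer = scorer          # cheap query-document relevance function
        self.docs = []                # most recently used documents first
        self.capacity = capacity

    def update(self, retrieved_docs, prefetched_top_k=()):
        # Add verified documents plus any prefetched top-k documents.
        for doc in list(retrieved_docs) + list(prefetched_top_k):
            if doc in self.docs:
                self.docs.remove(doc)  # move-to-front on re-insertion
            self.docs.insert(0, doc)
        del self.docs[self.capacity:]  # evict the oldest entries

    def search(self, query):
        # Speculative retrieval: rank cached documents with the cheap scorer
        # instead of querying the full knowledge base.
        if not self.docs:
            return None
        return max(self.docs, key=lambda doc: self.scorer(query, doc))
```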

RaLMSpec further improves concurrency by allowing asynchronous verification, so that an additional speculation step can execute concurrently with a verification step. This technique is particularly beneficial when the verification latency is smaller than the language model decoding latency. In addition, RaLMSpec introduces an optimal speculation stride scheduler (OS3), which dynamically adjusts the speculation stride, i.e., the number of consecutive speculation steps between two verification steps, to minimize the expected speculation overhead.
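As a much-simplified stand-in for OS3: the paper derives the stride from an analytical latency model, whereas the heuristic below merely illustrates the idea of adapting the stride to the observed speculation success, and is not the paper's algorithm.

```python
# Simplified stand-in for a speculation stride scheduler.
# The real OS3 minimizes an expected-latency objective; here we just grow the
# stride while speculation keeps succeeding and shrink it after mismatches.

class StrideScheduler:
    def __init__(self, min_stride=1, max_stride=8):
        self.stride = min_stride
        self.min_stride = min_stride
        self.max_stride = max_stride

    def update(self, num_correct, stride_used):
        if num_correct == stride_used:
            # All speculative retrievals verified: speculate further next time.
            self.stride = min(self.stride + 1, self.max_stride)
        else:
            # A mismatch forced a rollback: be more conservative.
            self.stride = max(num_correct, self.min_stride)
        return self.stride
```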

The experiments use three standard natural language generation (NLG) model families: GPT-2, OPT, and LLaMA-2. These models are widely used as base language models for RaLMs and cover a range of model sizes. The knowledge-intensive open-domain question-answering datasets used in the experiments are Wiki-QA, Web Questions, Natural Questions, and TriviaQA.

To demonstrate the consistency of RaLMSpec, different retrievers were tested, including dense and sparse retrievers. For dense retrievers, a further distinction is made between exact and approximate nearest-neighbour search. The baseline directly follows the existing implementation of iterative RaLM serving, where a retrieval is triggered after every four tokens generated by the language model. For KNN-LM serving, the baseline uses the implementation of Khandelwal et al. (2019), where a retrieval is performed for every generated token.

The RaLMSpec framework exhibits significant speedups across different types of retrievers. Specifically, with an exact dense retriever (EDR), RaLMSpec achieves a speedup ratio of up to 2.39x; with an approximate dense retriever (ADR), the speedup ratio is 1.04x to 1.39x; and with a sparse retriever (SR), the speedup ratio is between 1.31x and 1.77x. These results show that RaLMSpec can effectively reduce the high overhead caused by the retrieval step while maintaining the quality of the model output when serving iterative RaLMs.

The acceleration of RaLMSpec comes from several key components working together. First, the local cache provides the basis for speculative retrieval, speeding up subsequent retrieval steps by storing past documents. Second, the batched verification step corrects any incorrect speculations by processing multiple queries in parallel, thereby guaranteeing the correctness of the model output. In addition, the prefetching mechanism further improves speculation accuracy by adding the top-k documents retrieved from the knowledge base to the local cache. Finally, by dynamically adjusting the speculation stride and allowing asynchronous verification, RaLMSpec automatically exploits the maximum acceleration potential.

KNN-LM (k-nearest-neighbour language models) generates the next-token distribution by interpolating the output distribution of the language model with a distribution derived from the retrieved k nearest neighbours. Although KNN-LM is very effective at improving the perplexity of the underlying language model, it is extremely expensive at inference time because every generation step requires a retrieval. By modifying the cache update rules and the verification protocol, RaLMSpec is able to significantly speed up KNN-LM serving, by up to 3.88x.
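The interpolation itself is a simple mixture of the two distributions, as in Khandelwal et al. (2019). The sketch below is a generic KNN-LM formulation; the interpolation weight `lam` and the distance-based neighbour weighting are standard choices, not values taken from the RaLMSpec paper.

```python
import numpy as np

def knn_lm_next_token(p_lm, neighbour_tokens, neighbour_dists, vocab_size, lam=0.25):
    """Interpolate the LM distribution with a distribution built from the
    retrieved k nearest neighbours (generic KNN-LM sketch)."""
    # Softmax over negative distances gives each neighbour a weight.
    weights = np.exp(-np.asarray(neighbour_dists, dtype=float))
    weights /= weights.sum()

    # Put each neighbour's weight on the token that followed it in the datastore.
    p_knn = np.zeros(vocab_size)
    for token_id, w in zip(neighbour_tokens, weights):
        p_knn[token_id] += w

    # Final next-token distribution: lam * p_knn + (1 - lam) * p_lm.
    return lam * p_knn + (1.0 - lam) * np.asarray(p_lm, dtype=float)
```

Because this retrieval happens at every single token, the per-step overhead that RaLMSpec's speculation removes is much larger here than in standard iterative RaLM serving.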

For KNN-LM serving, RaLMSpec significantly reduces the retrieval overhead of every token-generation step through speculative retrieval and batched verification. Experimental results show that RaLMSpec achieves a speedup ratio of up to 7.59x with an exact dense retriever and up to 2.45x with an approximate dense retriever. These results demonstrate the effectiveness of RaLMSpec on retrieval-intensive workloads, and with the optimal speculation stride scheduler (OS3) enabled, RaLMSpec consistently achieves the best performance across different scenarios.

By introducing speculative retrieval and batched verification, the RaLMSpec framework provides significant serving acceleration for iterative retrieval-augmented language models (RaLMs) while maintaining the quality of the model output. The potential applications of this mechanism are wide-ranging: it is not limited to the current question-answering (QA) tasks and may also extend to other knowledge-intensive natural language processing (NLP) tasks, such as machine translation, text summarization, or dialogue systems.

In future work, RaLMSpec could be further integrated with emerging large language models (e.g., LLaMA-2, GPT-3, PaLM) to improve the efficiency of these models in real-world applications. In addition, RaLMSpec's speculative retrieval and batched verification mechanisms could be combined with other types of retrievers, such as TF-IDF or BM25, to explore different trade-offs between retrieval accuracy and efficiency.

Although RaLMSpec has demonstrated its acceleration capability on multiple datasets and language models, it still faces some challenges in real-world deployment. For example, the success rate of speculative retrieval depends heavily on the temporal and spatial locality of the retrieved documents, which may be less pronounced for certain tasks or datasets. In addition, the efficiency of the batched verification step depends on parallel processing capability, which can be limited by hardware resources.

Future research could explore how to optimize the individual components of RaLMSpec, such as improving the local caching strategy, tuning the prefetching size, or further refining the asynchronous verification mechanism. Researchers could also explore algorithms that adaptively adjust the speculation stride to dynamically balance speculation gains against verification overhead in different runtime environments.

By introducing speculative retrieval and batched verification, RaLMSpec significantly improves the serving efficiency of iterative retrieval-augmented language models. Experimental results show that RaLMSpec achieves significant acceleration with different retrievers, including exact dense, approximate dense, and sparse retrievers, while maintaining the output quality of the model. In particular, with an exact dense retriever, RaLMSpec+PSA (with prefetching, the optimal speculation stride scheduler, and asynchronous verification) achieves a speedup ratio of up to 2.39x over the baseline across different language models and datasets.

In addition, RaLMSpec's three auxiliary techniques (prefetching, the optimal speculation stride scheduler (OS3), and asynchronous verification) further reduce the latency of RaLM serving. The combination of these techniques enables RaLMSpec to automatically and consistently achieve the best speedup ratio across a wide range of scenarios, demonstrating its potential as a general acceleration framework for iterative RaLM serving.

