The emergence of Retrieval-Augmented Generation (RAG) systems has improved the accuracy of LLM response generation. A RAG system has two parts: retrieval and generation. Retrieval uses a retriever to fetch the passages most relevant or similar to the query from a large collection of documents, while generation has the LLM produce a response from the query combined with the retrieved documents.
There has been a lot of research on RAG recently, especially on the retrieval component. The paper we present today takes an unusual angle: the impact of the retrieved documents on the performance of the RAG system.
You may ask: what is there to discuss? The impact of retrieved documents on performance seems straightforward: the more relevant they are to the query, the better the result.
So what happens if we add some noise to the retrieved documents? In other words, how does system performance change when documents that have nothing to do with the query are included?
Intuitively, noise should have a negative impact on system performance.
But the results given in today's article are quite astonishing!
Rather than hurting system performance, noise documents can significantly improve system accuracy, by up to 35%. Documents related to the query, by contrast, can be powerful distractors that hurt the model's performance. This finding challenges the conventional understanding of traditional information retrieval systems: traditional retrieval techniques may not be optimal here, and specialized approaches are needed that meet the specific needs of integrating retrieval with language generation models.
Title: The Power of Noise: Redefining Retrieval for RAG Systems
Statement: This issue's interpretation was written by a non-human author. The full text was completed independently by the Cyber Malean AI interpretation expert agent and released after manual review and the addition of illustrations.
The documents obtained through the retrieval component can be divided into three categories: relevant, related but not containing the answer, and irrelevant.
Relevant documents contain information directly pertinent to the query, providing gold-standard data that directly answers or explains the query. Related documents that do not contain the answer do not directly answer the query, but are semantically or contextually linked to its topic. For example, if someone asks about the color of Napoleon's horse, a document describing the color of Napoleon's wife's horse is highly related to the query even though it does not contain the correct information. Irrelevant documents have nothing to do with the query and represent a kind of informational noise in the retrieval process.
1.Datasets
The experiment used the Natural Questions (NQ) dataset, a large-scale collection of real-world queries derived from Google search data. Each dataset entry includes a user query and a corresponding Wikipedia page with the answer. To facilitate the study of natural language understanding and open-domain question answering, the dataset provides a rich set of real-world questions and contextually relevant answers. After processing, the final dataset consisted of 21,035,236 documents, with 72,209 queries in the training set and 2,889 queries in the test set.
2.Document retrieval
The document retriever is Contriever, a BERT-based dense retriever trained without supervision using a contrastive loss. To make similarity search efficient over a corpus of approximately 21 million documents, the FAISS IndexFlatIP indexing system was also used. The embedding of each document and query is obtained by averaging the hidden states of the model's last layer.
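The retrieval step described above can be sketched with plain NumPy. This is only an illustration, not the authors' code: `mean_pool` and `top_k_inner_product` are hypothetical helper names, masking out padding tokens before averaging is an assumption, and the exhaustive inner-product search merely mimics what a flat inner-product index does on a small scale.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average last-layer hidden states over real (non-padding) tokens.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    Ignoring padding positions is an assumption of this sketch.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def top_k_inner_product(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int):
    """Exhaustive inner-product search: score every document, keep the k best."""
    scores = doc_matrix @ query_vec
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]
```

In practice the document embeddings would come from the retriever model and the search from an index library; the sketch only shows the two operations the text names, averaging and inner-product ranking.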
3.LLM generation
Upon receiving a query, the retriever selects the top-ranked documents from the corpus according to the given similarity metric. These documents, along with the task instruction and the query, form the input from which the LLM generates its response. The NQ-open dataset is constructed to contain only queries whose answers are no more than five words long, so the LLM's task is to extract a response of at most five words from the provided documents. Generation always uses greedy decoding, with the maximum response length set to 15 tokens. The LLMs used are Llama2-7B, which supports 4096 tokens; Falcon-7B, which supports 2048 tokens; Phi-2 (2.7B), which supports 2048 tokens; and MPT-7B, which supports almost unlimited context length but is limited to 2048 tokens for a fair comparison.
To enhance clarity and comprehension of the experimental setup, a simplified pattern will be used to represent the composition of the prompts. The pattern looks like this:
In this pattern, the task instruction (I) and the query (Q) are always at the beginning and the end. The middle section varies and represents the different context elements: the gold document (the original context in the NQ dataset, that is, the Wikipedia passage that contains the answer and is relevant to the given query), relevant documents, related documents that do not contain the answer, and irrelevant documents. These documents are combined sequentially in different quantities to test the impact of retrieved documents on RAG performance.
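To make the prompt layout concrete, here is a minimal sketch of how such a context could be assembled. The function names and the `Document:` / `Question:` / `Answer:` labels are hypothetical and not taken from the paper; the only parts grounded in the text are that the instruction comes first, the query comes last, and the gold document can sit far from, in the middle of, or near the query.

```python
def build_context(gold: str, other_docs: list[str], position: str) -> list[str]:
    """Return documents in prompt order; the query follows the last document."""
    if position == "near":   # gold document directly before the query
        return other_docs + [gold]
    if position == "far":    # gold document right after the instruction
        return [gold] + other_docs
    if position == "mid":    # gold document in the middle of the other documents
        half = len(other_docs) // 2
        return other_docs[:half] + [gold] + other_docs[half:]
    raise ValueError(f"unknown position: {position!r}")

def build_prompt(instruction: str, docs: list[str], query: str) -> str:
    """Join instruction, documents, and query (labels are illustrative only)."""
    parts = [instruction]
    parts += [f"Document: {d}" for d in docs]
    parts.append(f"Question: {query}")
    parts.append("Answer:")
    return "\n".join(parts)
```

For example, `build_context("G", ["a", "b"], "mid")` places the gold document between the two other documents, so the resulting prompt reads instruction, `a`, `G`, `b`, query.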
Related documents that do not contain the answer are defined as documents to which the retriever assigns a high score but which do not contain the answer. The following table shows the results of the LLMs when evaluated with a prompt consisting of the gold document and a varying number of related documents that do not contain the answer. "Far", "mid", and "near" indicate where the gold document is placed relative to the query. The first row, "0", is the case in which no related documents without the answer are added; the number of related documents then increases row by row. "-" indicates that the input exceeds the input length supported by the LLM.
As can be seen from the table above, in a retrieval-augmented generation system, documents that are semantically related to a query but do not contain the correct answer can hurt system performance. When just one related document is added to the context, accuracy can drop by up to 25%. This challenges the common assumption in traditional information retrieval that related documents are generally preferable to irrelevant ones.
The image below shows an example of an incorrect output after adding related documents that do not contain the answer. Yellow marks the gold document with the correct answer. It is clear that the documents that are related but do not contain the answer mislead the LLM and reduce accuracy.
In addition, the authors show the model's attention scores for the related documents without the answer and for the gold document. As shown in the figure below, the model focuses too much on a related document without the answer (far left) and ignores the gold document (far right), which can lead to incorrect responses.
To evaluate the robustness of the RAG system to noise, a certain number of documents randomly selected from the corpus were added alongside the gold document as noise. The results of the experiment are shown in the table below.
Unexpectedly, performance does not degrade in the presence of noise. On the contrary, it improves significantly in some cases, such as an improvement of 0.08 for MPT (a 36% increase). Llama2 and Phi-2 also improve in the "near" setting, where the noise is introduced farthest from the query. In the "mid" and "far" settings, however, their performance degrades when noise is introduced, although much less than with related documents. This suggests that while Llama2 and Phi-2 handle noise far from the query effectively, they are less capable of handling extraneous information close to the query.
The experiments further examine how the placement of the gold document (that is, the document containing the correct answer) in the context affects model performance. "Far", "mid", and "near" denote the different positions of the gold document, as can be seen in both of the large tables above.
The experimental results show that the location of the gold document has a significant impact on the performance of the RAG system.
In the setting with related documents that do not contain the answer, the model is most accurate when the gold document is close to the query. Conversely, accuracy decreases when the gold document is in the middle of the context or far from the query. In the setting with irrelevant documents, some models maintain or even improve their performance despite the noise. These findings underscore that, in a RAG system, the retriever needs to be carefully designed to ensure the best placement of documents and so improve the accuracy of the overall system.
The above experiments assume that the gold document is always retrieved, but in a real scenario it is not possible to retrieve a document containing the answer every time. The authors therefore set up a more realistic scenario. Given a query, a set of documents is retrieved, which may be relevant, or related but not contain the answer. Irrelevant documents are then added to this set of retrieved documents, as shown in the following table, with rows representing the number of irrelevant documents added and columns representing the number of retrieved documents.
Experimental results show that adding extraneous documents is almost always beneficial and can improve accuracy. In addition, when experiments are performed with a sparse retriever such as BM25, the accuracy is improved by an average of 3-4 percentage points.
These results show that in the design of the retriever, it is necessary to find the optimal balance between relevant and irrelevant documents.
The above experiments show that adding irrelevant documents can improve performance. But one might argue that these documents are not really unrelated, since they are part of the same corpus (Wikipedia) and could lead LLMs to respond in a way that is more consistent with that particular corpus without introducing substantial noise.
To address this, the authors conducted another experiment in which the irrelevant documents were drawn from the Reddit Webis-TLDR-17 dataset, which contrasts clearly with Wikipedia in tone and style.
In the table below, the left section reports the results of adding irrelevant documents from Reddit, and the right section reports the results of adding nonsensical sentences made up of random words.
As you can see, performance improves whether the noise comes from irrelevant documents in the Reddit corpus or from nonsensical sentences.
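As an illustration of the second kind of noise, a nonsensical sentence can be produced by sampling words at random. This is a hedged sketch rather than the authors' procedure: `random_word_sentence` is a hypothetical helper, and the vocabulary and sentence length are placeholders chosen by the caller.

```python
import random

def random_word_sentence(vocab: list[str], length: int, rng: random.Random) -> str:
    """Build a nonsensical 'sentence' by sampling words uniformly at random."""
    return " ".join(rng.choice(vocab) for _ in range(length))

# Example usage with a toy vocabulary and a fixed seed for reproducibility.
rng = random.Random(0)
noise_doc = random_word_sentence(["river", "seven", "quickly", "table"], 12, rng)
```

Passing an explicit `random.Random` instance keeps the generated noise reproducible across runs, which matters when comparing accuracy between settings.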
Prior literature has shown that extremely low attention entropy causes LLMs to generate degenerate output and sharply degrades performance. This situation is called entropy collapse. Following this line of study, the authors measured the entropy of the attention scores when only the gold document was provided and compared it with the case in which random documents were added. They found that the entropy increased by a factor of 3 after the random documents were introduced. However, this phenomenon does not fully explain why noise is effective, and the question is left to future research.
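For readers unfamiliar with the quantity, the entropy of an attention distribution can be computed as below. This is a minimal sketch using the standard Shannon entropy; how the authors aggregate over heads, layers, and tokens is not stated here, so `attention_entropy` is a hypothetical helper that simply works row by row.

```python
import numpy as np

def attention_entropy(weights, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy (in nats) of each attention distribution on the last axis."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum(axis=-1, keepdims=True)       # renormalize defensively
    return -(w * np.log(w + eps)).sum(axis=-1)  # eps guards against log(0)

uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])  # spread-out attention
peaked = attention_entropy([1.0, 0.0, 0.0, 0.0])       # attention on a single token
```

A uniform distribution over four tokens gives the maximum entropy, log 4, while attention concentrated on a single token gives entropy close to zero, which is the low-entropy regime the text refers to as entropy collapse.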
Given that LLMs can only process a limited number of documents, which documents should the retriever provide? Common sense says to provide documents that are semantically close to the query. However, the results of this paper suggest that random documents have a positive effect on LLM accuracy. So how should the ratio of related to irrelevant documents be balanced?
The best trade-off is found by first retrieving a minimal set of documents and then adding irrelevant documents until the context limit is reached. In the paper's experiments, retrieving 3 to 5 documents is the most effective choice; going beyond this number increases the risk of including too many documents that are related but distracting. However, this rule has not yet been widely studied, and further research is needed to determine how broadly it applies.
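The strategy above can be sketched as a simple budget-filling loop. Everything specific here is an assumption made for illustration: `fill_context` and `count_tokens` are hypothetical names, the budget ignores the instruction and the query, and the ordering (noise first, top-ranked retrieved document nearest the query) is one plausible choice rather than a confirmed detail of the paper.

```python
from typing import Callable

def fill_context(
    retrieved: list[str],
    noise_pool: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
    n_retrieved: int = 4,
) -> list[str]:
    """Keep a few retrieved documents, then pad with noise up to the token budget."""
    kept = retrieved[:n_retrieved]
    budget = max_tokens - sum(count_tokens(d) for d in kept)
    noise: list[str] = []
    for doc in noise_pool:
        cost = count_tokens(doc)
        if cost > budget:
            break  # stop at the first noise document that no longer fits
        noise.append(doc)
        budget -= cost
    # Assumed ordering: noise far from the query, best-ranked document last.
    return noise + kept[::-1]
```

With a whitespace token counter such as `lambda d: len(d.split())`, the function returns the noise documents followed by the retrieved documents in reverse rank order, ready to be placed between the instruction and the query.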
This paper is the first comprehensive study of how retrieved documents affect the RAG framework, and it aims to identify the characteristics a retriever needs in order to construct optimal prompts for a RAG system. Key findings from the study include:
The relevant document should be located close to the query; otherwise it is difficult for the model to focus on it.
Documents that are semantically related to the query but do not contain the answer are extremely harmful to the RAG system, and subsequent research should find ways to remove these distractors from the retrieved documents.
Contrary to expectations, irrelevant noise documents help improve the accuracy of the RAG system when placed correctly.
The authors propose that these strategies can help optimize RAG systems, but further research is still needed to uncover the mechanisms behind this behavior and to develop a new generation of information retrieval techniques better suited to interacting with generative components.