Abstract of the paper: Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return sub-optimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches.
Original Paper: https://arxiv.org/abs/2401.15884
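The abstract sketches the CRAG pipeline: a lightweight retrieval evaluator scores the retrieved documents, the resulting confidence triggers one of the knowledge actions (the paper names them Correct, Incorrect, and Ambiguous), large-scale web search supplements or replaces retrieval when needed, and a decompose-then-recompose step filters each document down to its relevant strips before generation. The following is a minimal Python sketch of that control flow only, not the authors' released implementation; the helper names, thresholds, and strip-scoring cutoff are illustrative assumptions.

```python
# Minimal sketch of the CRAG control flow described in the abstract.
# All helper names and thresholds are illustrative assumptions.
from typing import Callable, List

UPPER_THRESHOLD = 0.6   # assumed: above this, retrieval is judged "Correct"
LOWER_THRESHOLD = -0.9  # assumed: below this, retrieval is judged "Incorrect"


def decompose_then_recompose(docs: List[str],
                             query: str,
                             score_strip: Callable[[str, str], float]) -> str:
    """Split documents into strips, keep only strips relevant to the query,
    and recompose them into a refined piece of knowledge."""
    strips = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    relevant = [s for s in strips if score_strip(query, s) > 0.5]  # assumed cutoff
    return ". ".join(relevant)


def crag(query: str,
         retrieve: Callable[[str], List[str]],
         evaluate: Callable[[str, List[str]], float],
         web_search: Callable[[str], List[str]],
         score_strip: Callable[[str, str], float],
         generate: Callable[[str, str], str]) -> str:
    """One CRAG inference step: evaluate retrieval, trigger an action,
    refine the knowledge, then generate."""
    docs = retrieve(query)
    confidence = evaluate(query, docs)  # lightweight retrieval evaluator

    if confidence >= UPPER_THRESHOLD:        # action: Correct
        knowledge = decompose_then_recompose(docs, query, score_strip)
    elif confidence <= LOWER_THRESHOLD:      # action: Incorrect
        web_docs = web_search(query)         # discard retrieval, fall back to web search
        knowledge = decompose_then_recompose(web_docs, query, score_strip)
    else:                                    # action: Ambiguous
        combined = docs + web_search(query)  # combine both knowledge sources
        knowledge = decompose_then_recompose(combined, query, score_strip)

    return generate(query, knowledge)
```

Because the generator is passed in as a plain callable, this control flow is what makes CRAG plug-and-play on top of existing RAG-based approaches.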
First, the proposed method can significantly improve the performance of both RAG and Self-RAG. Specifically, CRAG outperformed RAG by margins of 2.1% accuracy on PopQA and 2.8% FactScore on Biography when based on LLaMA2-hf-7b, as well as by margins of 19.0% accuracy on PopQA, 14.9% FactScore on Biography, 36.6% accuracy on PubHealth, and 8.1% accuracy on Arc-Challenge when based on SelfRAG-LLaMA2-7b. Compared with the current state-of-the-art Self-RAG, Self-CRAG outperformed it by margins of 20.0% accuracy on PopQA, 36.9% FactScore on Biography, and 4.0% accuracy on Arc-Challenge when based on LLaMA2-hf-7b, as well as by margins of 6.9% accuracy on PopQA, 5.0% FactScore on Biography, and 2.4% accuracy on PubHealth when based on SelfRAG-LLaMA2-7b. These results demonstrated the adaptability of CRAG, which is plug-and-play and can be implemented into RAG-based approaches.
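As a quick check on the arithmetic, each margin quoted above is simply the difference between two scores in Table 1 (reproduced further below). The short snippet is only a worked illustration; the dictionary copies the Table 1 values verbatim.

```python
# Worked arithmetic: the improvement margins quoted in the text are score
# differences taken directly from Table 1.
table1 = {
    "LLaMA2-hf-7b": {
        "RAG":       {"PopQA": 37.7, "Bio": 44.9, "Pub": 9.1,  "ARC": 23.8},
        "CRAG":      {"PopQA": 39.8, "Bio": 47.7, "Pub": 9.1,  "ARC": 25.8},
        "Self-RAG":  {"PopQA": 29.0, "Bio": 32.2, "Pub": 0.7,  "ARC": 23.9},
        "Self-CRAG": {"PopQA": 49.0, "Bio": 69.1, "Pub": 0.6,  "ARC": 27.9},
    },
    "SelfRAG-LLaMA2-7b": {
        "RAG":       {"PopQA": 40.3, "Bio": 59.2, "Pub": 39.0, "ARC": 46.7},
        "CRAG":      {"PopQA": 59.3, "Bio": 74.1, "Pub": 75.6, "ARC": 54.8},
        "Self-RAG":  {"PopQA": 54.9, "Bio": 81.2, "Pub": 72.4, "ARC": 67.3},
        "Self-CRAG": {"PopQA": 61.8, "Bio": 86.2, "Pub": 74.8, "ARC": 67.2},
    },
}

for llm, rows in table1.items():
    crag_gain = rows["CRAG"]["PopQA"] - rows["RAG"]["PopQA"]
    self_crag_gain = rows["Self-CRAG"]["PopQA"] - rows["Self-RAG"]["PopQA"]
    print(f"{llm}: CRAG vs RAG on PopQA = +{crag_gain:.1f}, "
          f"Self-CRAG vs Self-RAG on PopQA = +{self_crag_gain:.1f}")
# Prints +2.1 / +20.0 for LLaMA2-hf-7b and +19.0 / +6.9 for SelfRAG-LLaMA2-7b,
# matching the margins reported in the paragraph above.
```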
Second, the benchmarks reported in Table 1 respectively represent different practical scenarios, including short-form entity generation (PopQA), long-form generation (Biography), and closed-set tasks (PubHealth and Arc-Challenge). The results verified the consistent effectiveness of CRAG. Its versatility across a spectrum of tasks underscores its robust capabilities and generalizability across diverse scenarios.

Method | PopQA (Accuracy) | Bio (FactScore) | Pub (Accuracy) | ARC (Accuracy)

LMs trained with proprietary data
LLaMA2-c13B | 20.0 | 55.9 | 49.4 | 38.4
Ret-LLaMA2-c13B | 51.8 | 79.9 | 52.1 | 37.9
ChatGPT | 29.3 | 71.8 | 70.1 | 75.3
Ret-ChatGPT | 50.8 | - | 54.7 | 75.3
Perplexity.ai | - | 71.2 | - | -

Baselines without retrieval
LLaMA2-7B | 14.7 | 44.5 | 34.2 | 21.8
Alpaca-7B | 23.6 | 45.8 | 49.8 | 45.0
LLaMA2-13B | 14.7 | 53.4 | 29.4 | 29.4
Alpaca-13B | 24.4 | 50.2 | 55.5 | 54.9
CoVE-65B | - | 71.2 | - | -

Baselines with retrieval
LLaMA2-7B | 38.2 | 78.0 | 30.0 | 48.0
Alpaca-7B | 46.7 | 76.6 | 40.2 | 48.0
SAIL | - | - | 69.2 | 48.4
LLaMA2-13B | 45.7 | 77.5 | 30.2 | 26.0
Alpaca-13B | 46.1 | 77.7 | 51.1 | 57.6

Generator: LLaMA2-hf-7b
RAG | 37.7 | 44.9 | 9.1 | 23.8
CRAG | 39.8 | 47.7 | 9.1 | 25.8
Self-RAG* | 29.0 | 32.2 | 0.7 | 23.9
Self-CRAG | 49.0 | 69.1 | 0.6 | 27.9

Generator: SelfRAG-LLaMA2-7b
RAG | 40.3 | 59.2 | 39.0 | 46.7
CRAG | 59.3 | 74.1 | 75.6 | 54.8
Self-RAG | 54.9 | 81.2 | 72.4 | 67.3
Self-CRAG | 61.8 | 86.2 | 74.8 | 67.2

Table 1: Overall evaluation results on the test sets of four datasets. Results are separated based on the generation LLMs. Bold numbers indicate the best performance among all methods and LLMs; gray-colored bold scores indicate the best performance using a specific LLM. * indicates results reproduced by us; other results are cited from their original papers. "-" indicates scores that are not reported in the original paper or have not been evaluated.
Third, the proposed method exhibited greater flexibility in replacing the underlying LLM generator. CRAG still showed competitive performance when the underlying LLM was changed from SelfRAG-LLaMA2-7b to LLaMA2-hf-7b, while the performance of Self-RAG dropped significantly, even underperforming the standard RAG. The reason is that Self-RAG needs to be instruction-tuned on human- or LLM-annotated data to learn to output special critic tokens when needed, an ability that common LLMs do not have, whereas CRAG imposes no such requirement. As a result, when more advanced LLMs become available in the future, they can be coupled with CRAG easily, while additional instruction tuning will still be necessary for Self-RAG.
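To make this interface point concrete: CRAG only needs a generator that maps a prompt to text, so swapping the underlying LLM amounts to loading a different checkpoint, whereas Self-RAG depends on a generator tuned to emit its special critic tokens. The snippet below is a hypothetical illustration using the Hugging Face transformers text-generation pipeline; the model identifiers, prompt format, and the `make_generator` helper are assumptions, written to be compatible with the `crag` sketch shown earlier.

```python
# Hypothetical illustration: CRAG treats the generator as a black-box
# text-in / text-out callable, so replacing the underlying LLM is a
# one-line change and needs no extra instruction tuning.
from transformers import pipeline  # assumed dependency


def make_generator(model_name: str):
    """Wrap any causal LM behind the plain interface CRAG expects."""
    pipe = pipeline("text-generation", model=model_name)

    def generate(query: str, knowledge: str) -> str:
        # Simple prompt format, assumed for illustration only.
        prompt = f"Knowledge: {knowledge}\nQuestion: {query}\nAnswer:"
        out = pipe(prompt, max_new_tokens=128, return_full_text=False)
        return out[0]["generated_text"]

    return generate


# Swapping the generator: only this argument changes.
generate_a = make_generator("meta-llama/Llama-2-7b-hf")   # plain LLaMA2 checkpoint
generate_b = make_generator("selfrag/selfrag_llama2_7b")  # Self-RAG's tuned checkpoint
# Either callable can be passed as the `generate` argument of the earlier
# `crag` sketch. Self-RAG, by contrast, relies on the second model's learned
# critic tokens, so it cannot simply be pointed at an untuned LLM this way.
```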