Large language models (LLMs) demonstrate powerful capabilities, but they still face challenges in practical applications, such as hallucinations, slow knowledge updates, and lack of transparency in answers. Retrieval-Augmented Generation (RAG) refers to retrieving relevant information from external knowledge bases before answering questions with LLMs. RAG has been demonstrated to significantly enhance answer accuracy and reduce model hallucination, particularly for knowledge-intensive tasks. By citing sources, users can verify the accuracy of answers, which increases trust in model outputs. RAG also facilitates knowledge updates and the introduction of domain-specific knowledge. It effectively combines the parameterized knowledge of LLMs with non-parameterized external knowledge bases, making it one of the most important methods for implementing large language models. This paper outlines the development paradigms of RAG in the era of LLMs, summarizing three paradigms: Naive RAG, Advanced RAG, and Modular RAG. It then provides a summary and organization of the three main components of RAG: the retriever, the generator, and augmentation methods, along with the key technologies in each component. Furthermore, it discusses how to evaluate the effectiveness of RAG models, introducing two evaluation methods for RAG, emphasizing key metrics and abilities for evaluation, and presenting the latest automatic evaluation framework. Finally, potential future research directions are introduced from three aspects: vertical optimization, horizontal scalability, and the technical stack and ecosystem of RAG.
When dealing with retrieval tasks that involve structured data, the work of SANTA [Li et al., 2023d] utilized a three-stage training process to fully capture the structural and semantic information. Specifically, in the training phase of the retriever, contrastive learning was adopted, with the main goal of optimizing the embedding representations of the queries and documents. The specific optimization objective is as follows:

L_{DR} = -\log \frac{e^{sim(q,d^+)}}{e^{sim(q,d^+)} + \sum_{d^-} e^{sim(q,d^-)}}   (12)
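A minimal numeric sketch of this contrastive objective, using toy scalar similarity scores in place of learned embedding dot-products (function and variable names here are illustrative, not from SANTA's implementation):

```python
import math

def contrastive_loss(sim_pos, sim_negs):
    """Contrastive (InfoNCE-style) loss for one query, as in Eq. (12).

    sim_pos:  similarity score sim(q, d+) for the positive document.
    sim_negs: similarity scores sim(q, d-) for the negative documents.
    Returns -log( e^{sim_pos} / (e^{sim_pos} + sum_j e^{sim_negs[j]}) ).
    """
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)

# The loss shrinks toward 0 as the positive score dominates the negatives,
# pushing query and positive-document embeddings together.
loss = contrastive_loss(sim_pos=5.0, sim_negs=[0.5, -1.0, 0.2])
```

In practice the similarities would be dot products or cosine similarities between encoder outputs, and the sum over negatives would run over in-batch or hard negatives.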
where d^- and d^+ represent negative samples and positive samples, respectively. In the initial training stage of the generator, we utilize contrastive learning to align structured data and the corresponding document descriptions of unstructured data; the optimization objective is as above. Moreover, in the later training stage of the generator, inspired by [Sciavolino et al., 2021, Zhang et al., 2019], we recognized the remarkable effectiveness of entity semantics in learning textual data representations in retrieval. Thus, we first perform entity identification in the structured data, subsequently applying a mask to the entities in the input section of the generator's training data, enabling the generator to predict these masks. The optimization objective thereafter is:

L_{MEP} = \sum_{j=1}^{k} \log P(Y_d(t_j) \mid X^{mask}, Y_d(t_1, ..., t_{j-1}))   (13)

where Y_d(t_j) denotes the j-th token in the sequence Y_d, and Y_d = <mask>_1, ent_1, ..., <mask>_n, ent_n denotes the ground-truth sequence that contains the masked entities. Throughout the training process, we recover the masked entities by acquiring the necessary information from the context, understand the structural semantics of the textual data, and align the relevant entities in the structured data. We optimize the language model to fill the concealed spans and to better comprehend the entity semantics [Ye et al., 2020].

6 Augmentation in RAG

This chapter is primarily organized along three dimensions: the stage of augmentation, augmentation data sources, and the process of augmentation, to elaborate on the key technologies in the development of RAG. The taxonomy of RAG's core components is illustrated in Fig. 4.
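As a concrete illustration of the masked entity prediction setup of Eq. (13), the sketch below builds the masked input X^mask and the interleaved target sequence Y_d from a token list and entity spans (helper names and the example sentence are hypothetical, not taken from SANTA's codebase):

```python
def build_mep_example(tokens, entity_spans):
    """Build the masked input X_mask and target Y_d for masked entity prediction.

    tokens:       list of input tokens.
    entity_spans: list of (start, end) index pairs marking entity mentions.
    Returns (x_mask, y_d): entities in x_mask are replaced by <mask_i> tokens,
    and y_d interleaves each <mask_i> with the entity tokens it hides,
    i.e. Y_d = <mask>_1, ent_1, ..., <mask>_n, ent_n.
    """
    x_mask, y_d = list(tokens), []
    for i, (s, e) in enumerate(entity_spans, 1):
        mask = f"<mask_{i}>"
        y_d += [mask] + tokens[s:e]
        # Length-preserving in-place replacement keeps later span indices valid.
        x_mask[s:e] = [mask] + [None] * (e - s - 1)
    return [t for t in x_mask if t is not None], y_d

x_mask, y_d = build_mep_example(
    ["Einstein", "was", "born", "in", "Ulm"], [(0, 1), (4, 5)])
# x_mask == ["<mask_1>", "was", "born", "in", "<mask_2>"]
# y_d    == ["<mask_1>", "Einstein", "<mask_2>", "Ulm"]
```

The generator is then trained to produce y_d autoregressively given x_mask, and Eq. (13) is the sum of the per-token log-probabilities of that target sequence.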
6.1 RAG in Augmentation Stages

As a knowledge-intensive task, RAG employs different technical approaches during the pre-training, fine-tuning, and inference stages of language model training.

Pre-training Stage

Since the emergence of pre-trained models, researchers have delved into enhancing the performance of Pre-trained Language Models (PTMs) in open-domain Question Answering (QA) through retrieval methods at the pre-training stage. Recognizing and expanding implicit knowledge in pre-trained models can be challenging. REALM [Arora et al., 2023] introduces a more modular and interpretable knowledge embedding approach. Following the Masked Language Model (MLM) paradigm, REALM models both pre-training and fine-tuning as a retrieve-then-predict process, where the language model pre-trains by predicting masked tokens y based on masked sentences x, modeling P(y|x). RETRO [Borgeaud et al., 2022] leverages
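The retrieve-then-predict process described above marginalizes the prediction over retrieved documents, P(y|x) = Σ_z P(y|x,z) P(z|x). The following toy function sketches that marginalization with made-up probabilities (a schematic illustration, not REALM's actual implementation):

```python
def retrieve_then_predict(p_retrieve, p_predict):
    """Retrieve-then-predict marginalization over retrieved documents z.

    p_retrieve: P(z|x), retrieval probability of each candidate document.
    p_predict:  P(y|x,z), probability of the masked tokens y given x and each z.
    Returns the marginal P(y|x) = sum_z P(y|x,z) * P(z|x).
    """
    assert abs(sum(p_retrieve) - 1.0) < 1e-9  # retrieval distribution sums to 1
    return sum(pz * py for pz, py in zip(p_retrieve, p_predict))

# A document that both scores high at retrieval time and makes the masked
# tokens likely dominates the marginal probability.
p = retrieve_then_predict(p_retrieve=[0.7, 0.2, 0.1],
                          p_predict=[0.9, 0.1, 0.05])
# 0.7*0.9 + 0.2*0.1 + 0.1*0.05 = 0.655
```

Because the retrieval probabilities appear inside this marginal likelihood, gradients flow to the retriever as well as the predictor, which is what lets the retriever be trained end-to-end during pre-training.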