Large language models (LLMs) encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their ability to answer diverse questions across different domains. However, this knowledge is inherently limited, relying heavily on the characteristics of the training data. Consequently, using external datasets to incorporate new information or refine the capabilities of LLMs on previously seen information poses a significant challenge. In this study, we compare two common approaches: fine-tuning and retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our findings reveal that while fine-tuning offers some improvement, RAG consistently outperforms it, both for knowledge encountered during training and for entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through fine-tuning, and that exposing them to numerous variations of the same fact during training could alleviate this problem.
Configuration Variations Our evaluation included multiple configurations, with a grid search over them, to allow for more comprehensive benchmarking. First, we compared the baseline and fine-tuned models, each with and without the RAG component. Second, we explored the optimal number of text chunks to add to the context in RAG; specifically, values of K ∈ {0, . . . , 5} were employed to analyze the impact on model performance. Finally, we compared 5-shot performance against 0-shot.

Training Setup We trained all of the models using the unsupervised training procedure described in Section 3.2. For each dataset, we divided the auxiliary knowledge base into equal chunks of 256 tokens by concatenating or splitting the original chunks based on their length. We also added two special tokens, <BOS> and <EOS>, to demarcate the original chunks' beginnings and ends, preserving the documents' structure.
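As a minimal sketch, the chunking procedure above could be expressed as follows, assuming a Hugging Face tokenizer. For simplicity, the sketch reuses the tokenizer's existing BOS/EOS token ids rather than adding two new special tokens, and the helper name and model checkpoint are illustrative, not the exact implementation used in our experiments:

```python
# Sketch of the chunking procedure: each original document is tokenized,
# wrapped in BOS/EOS markers to preserve its boundaries, and the resulting
# token stream is concatenated and re-split into fixed chunks of 256 tokens.
from transformers import AutoTokenizer

CHUNK_SIZE = 256

def build_chunks(documents, model_name="meta-llama/Llama-2-7b-hf"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    stream = []
    for doc in documents:
        # Demarcate the original chunk's beginning and end with special tokens.
        stream.append(tokenizer.bos_token_id)
        stream.extend(tokenizer(doc, add_special_tokens=False)["input_ids"])
        stream.append(tokenizer.eos_token_id)
    # Split the concatenated stream into equal chunks of 256 tokens.
    return [stream[i:i + CHUNK_SIZE] for i in range(0, len(stream), CHUNK_SIZE)]
```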
The models were trained using learning rates between 1 × 10⁻⁶ and 5 × 10⁻⁵, which were found through a hyperparameter search. All models were trained on four NVIDIA A100 GPUs for a maximum of 5 epochs with a batch size of 64.
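For concreteness, this setup could be expressed with Hugging Face TrainingArguments roughly as follows; the output directory and the specific learning rate are placeholders, the latter chosen from within the reported search range:

```python
# Sketch of a training configuration matching the reported setup: 4 GPUs,
# up to 5 epochs, and an effective batch size of 64 (16 per device x 4 devices).
# The values below are placeholders, not the exact configuration used.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="knowledge-injection-ft",  # placeholder path
    num_train_epochs=5,                   # maximum of 5 epochs
    per_device_train_batch_size=16,       # 16 x 4 A100 GPUs = batch size 64
    learning_rate=5e-6,                   # chosen from [1e-6, 5e-5] via search
    logging_steps=50,
    save_strategy="epoch",
)
```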
Evaluation Method All evaluations were done by appending each of the multiple-choice options to the question and passing each concatenation through the model to obtain a log-probability score per option. The highest-scoring option was interpreted as the model's choice and used for accuracy calculation. More formally, this means that in Equation (1) we say that M(q_n) = c_n if:

c_n = \arg\max_l \{ M(q_n a_n^1), \ldots, M(q_n a_n^L) \}, \quad (4)

where M(q_n a_n^l) = \log P_M(q_n a_n^l).
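A minimal sketch of this scoring rule, assuming a Hugging Face causal LM: the summed token log-likelihood of the question-option concatenation stands in for log P_M(q_n a_n^l), and the highest-scoring option is returned as the model's choice. The function name and the commented-out checkpoint are assumptions:

```python
# Sketch of the evaluation rule in Equation (4): score each multiple-choice
# option by the total log-probability the model assigns to the question+option
# concatenation, then take the argmax over options.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def choose_option(model, tokenizer, question, options):
    scores = []
    for option in options:
        ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        # With labels=inputs, a causal LM returns the mean cross-entropy over
        # the ids.shape[1] - 1 predicted tokens; rescaling recovers the sum.
        loss = model(ids, labels=ids).loss
        scores.append(-loss.item() * (ids.shape[1] - 1))  # total log-probability
    return max(range(len(options)), key=scores.__getitem__)  # argmax over l

# Example usage (checkpoint name is an assumption):
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# choice = choose_option(model, tokenizer, question, options)
```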
MMLU Results For each task and model, we compared four approaches: using just the base model, RAG, FT, and finally combining FT and RAG by using the fine-tuned model as the generator. Furthermore, we tested the MMLU tasks in both 0-shot and 5-shot scenarios. The full results are shown in Table 1 and summarized in Figure 2. RAG incorporates context relevant to the question, a feature lacking in fine-tuning. Additionally, fine-tuning may impact other capabilities of the model due to a degree of catastrophic forgetting. Finally, it is plausible that unsupervised fine-tuned models might benefit from further alignment through supervised or RL-based fine-tuning, as evidenced by the vastly improved performance of Orca2 over the base Llama2 on the HuggingFaceH4/open_llm_leaderboard.

6. The Importance of Repetition

Unlike the other tasks, where the model has been exposed to aspects related to the topic during pretraining, the current events task includes entirely new information. In this case, standard fine-tuning not only failed to improve the performance of Llama2 but also significantly degraded it. To improve the fine-tuning results, we explored augmenting the training data with paraphrases, as sketched below.
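A minimal sketch of such paraphrase-based augmentation, assuming an off-the-shelf instruction-tuned model; the checkpoint, prompt, and sampling settings below are illustrative assumptions, not the exact setup used in our experiments:

```python
# Sketch of paraphrase-based data augmentation: each knowledge chunk is
# rewritten several times so the model sees the same fact in multiple
# phrasings during fine-tuning.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2")  # assumption

def paraphrase(chunk, n_variants=4):
    prompt = (f"Paraphrase the following text, preserving every fact:\n\n"
              f"{chunk}\n\nParaphrase:")
    outputs = generator(prompt, num_return_sequences=n_variants,
                        do_sample=True, max_new_tokens=256)
    # Strip the prompt prefix to keep only the generated paraphrases.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]
```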