Retrieval meets Long Context Large Language Models by NVIDIA
rEudPrdI-IF1xFf-D4YDcAull55bCR60bynVI2DBtT4
File Type: PDF
Entry Count: 65
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Abstract of the paper: Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and LLaMA2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented LLaMA2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on seven long context tasks including question answering and query-based summarization. It also outperforms its non-retrieval LLaMA2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.

Paper itself: https://arxiv.org/pdf/2310.03025.pdf
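For context on the "positional interpolation" mentioned in the abstract: it extends the context window by rescaling position indices so that a longer sequence maps back into the position range seen during pretraining. The sketch below is illustrative only and not the paper's code; the rope_angles helper, the dimension, and the 4K-to-16K scaling are assumptions based on the standard positional-interpolation formulation.

import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE angles: position * base^(-2i/dim) for each rotary pair.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def interpolated_angles(positions, dim, train_len=4096, target_len=16384):
    # Positional interpolation: shrink positions by train_len / target_len so a
    # 16K-token sequence reuses the 0..4K position range seen in pretraining.
    scale = train_len / target_len
    return rope_angles(positions * scale, dim)

# Example: position 12000 in a 16K context is treated like position 3000.
print(interpolated_angles(np.array([12000.0]), dim=8)[0, 0])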

We finetune the LLM by taking the loss only on the {Answer} part with batch size 128 and learning rate of 5e-6 for 1000 steps. For the rest of the paper, results are all reported using the instruction tuned chat model on top of the foundational GPT-43B and LLaMA2-70B.
id: dd6f08d78010ce562eee989396e5de38 - page: 6
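The fine-tuning recipe in the chunk above computes the loss only on the {Answer} tokens. Below is a minimal PyTorch sketch of that loss masking, not the paper's implementation; the answer_only_loss name, the answer_start indices, and the -100 ignore label are assumptions following common practice.

import torch
import torch.nn.functional as F

def answer_only_loss(logits, input_ids, answer_start):
    # logits: [batch, seq, vocab]; input_ids: [batch, seq]
    # Shift by one position for next-token prediction.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask everything before the {Answer} span so only answer tokens
    # contribute to the loss (instruction/context tokens are ignored).
    for i, start in enumerate(answer_start):
        shift_labels[i, : start - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )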
Model            | Seq len. | Avg.  | QM    | QASP  | NQA   | QLTY  | MSQ   | HQA   | MFQA
GPT-43B          | 4k       | 26.44 | 15.56 | 23.66 | 15.64 | 49.35 | 11.08 | 28.91 | 40.90
GPT-43B + ret    | 4k       | 29.32 | 16.60 | 23.45 | 19.81 | 51.55 | 14.95 | 34.26 | 44.63
GPT-43B          | 16k      | 29.45 | 16.09 | 25.75 | 16.94 | 50.05 | 14.74 | 37.48 | 45.08
GPT-43B + ret    | 16k      | 29.65 | 15.69 | 23.82 | 21.11 | 47.90 | 15.52 | 36.14 | 47.39
LLaMA2-70B       | 4k       | 31.61 | 16.34 | 27.70 | 19.07 | 63.55 | 15.40 | 34.64 | 44.55
LLaMA2-70B + ret | 4k       | 36.02 | 17.41 | 28.74 | 23.41 | 70.15 | 21.39 | 42.06 | 48.96
LLaMA2-70B       | 16k      | 36.78 | 16.72 | 30.92 | 22.32 | 76.10 | 18.78 | 43.97 | 48.63
LLaMA2-70B + ret | 16k      | 37.23 | 18.70 | 29.54 | 23.12 | 70.90 | 23.28 | 44.81 | 50.24
LLaMA2-70B       | 32k      | 37.36 | 15.37 | 31.88 | 23.59 | 73.80 | 19.07 | 49.49 | 48.35
LLaMA2-70B + ret | 32k      | 39.60 | 18.34 | 31.27 | 24.53 | 69.55 | 26.72 | 53.89 | 52.91

Table 2: Comparison of model variants (GPT-43B, LLaMA2-70B) with sequence length ranging from 4k to 32k under seven datasets. "+ ret" denotes using the best retriever (Dragon or Contriever), and here we used top-5 for the retriever.

4 RESULTS

In this section, we report the results and provide detailed analysis.

4.1 MAIN RESULTS
id: 238599424bbd41c8a4d93f75dfc4a849 - page: 7
In Table 2, we compare different model variants with context lengths ranging from 4K to as long as 32K using GPT-43B and LLaMA2-70B. First, we find that the baseline models without retrieval at 4K sequence length achieve the worst results for both GPT-43B and LLaMA2-70B. This is because the minimum average sequence length of all seven tasks exceeds 4096, the context window of the foundation models, and therefore valuable text gets truncated randomly. As a result, retrieval is especially helpful for 4K LLMs, e.g., LLaMA2-70B-4K is improved from 31.61 to 35.73 while GPT-43B-4K is improved from 26.44 to 29.32. Second, we observe that HotpotQA (HQA) especially favors long sequence models, as the score improves from 34.64 to 43.97 for LLaMA2-70B and from 28.91 to 37.48 for GPT-43B when the sequence length increases from 4K to 16K. This is because HotpotQA is a multi-hop dataset where the questions are not hard to answer but all intermediate hops are necessary to get the correct answer. Therefore, lon
id: 0f6abb24949e5b2e45faf201cd99e23c - page: 7
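To make the truncation argument above concrete, here is an illustrative sketch (not from the paper) contrasting naive truncation to a 4K window with keeping the top-k retrieved chunks; the chunking, scoring, and token counting are simplifying assumptions.

def fit_context_by_truncation(doc_tokens, question_tokens, window=4096):
    # Naive baseline: keep only what fits in the window, so evidence that
    # happens to fall beyond the first ~4K tokens is simply lost.
    budget = window - len(question_tokens)
    return doc_tokens[:budget] + question_tokens

def fit_context_by_retrieval(chunks, scores, question_tokens, window=4096, top_k=5):
    # Retrieval-augmented baseline: keep the top-k highest-scoring chunks
    # (scored against the question by a retriever such as Dragon or Contriever),
    # regardless of where they appear in the original document.
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)[:top_k]
    context = [tok for _, chunk in ranked for tok in chunk]
    budget = window - len(question_tokens)
    return context[:budget] + question_tokens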
It is quite interesting that the retrieval-augmented long context LLMs (e.g., 16K and 32K) can obtain better results than the retrieval-augmented 4K context LLM, even when they are fed the same top-5 chunks of evidence. We hypothesize that this observation is related to the "lost in the middle" phenomenon (Liu et al., 2023), where LLMs exhibit a U-shaped performance curve. Specifically, LLMs are better at utilizing relevant information that occurs at the beginning or end of their input context window. For this reason, the 4K context LLM tends to ignore information in the middle of its 4K input, while the 32K context LLM tends to ignore information in the middle of its 32K input. From Figure 1, the length of the top-5 chunks is about 2K tokens, which can fall in the middle of (and be ignored by) a 4K context, but sits only at the beginning of a 16K or 32K context and may therefore not be ignored by the 16K or 32K context LLM.
id: 3bb15c11bfa49da947c0f475b82ee3f9 - page: 7
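A small worked illustration of the argument above, assuming roughly 2K tokens of top-5 evidence as stated for Figure 1: the same evidence spans about half of a 4K window, reaching into the poorly attended middle, but only the first few percent of a 16K or 32K window.

def evidence_fraction(evidence_tokens=2000, window=4096):
    # Fraction of the context window covered by the retrieved evidence span.
    return evidence_tokens / window

for window in (4096, 16384, 32768):
    print(f"{window}: evidence covers {evidence_fraction(2000, window):.0%} of the window")
# 4096: evidence covers 49% of the window  (stretches into the middle)
# 16384: evidence covers 12% of the window (stays near the beginning)
# 32768: evidence covers 6% of the window  (stays near the beginning)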
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "rEudPrdI-IF1xFf-D4YDcAull55bCR60bynVI2DBtT4", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "rEudPrdI-IF1xFf-D4YDcAull55bCR60bynVI2DBtT4", "level": 2}'