Lost in the Middle: How Language Models Use Long Contexts
Contract ID: SMLWJ6TeKb-eYv0IR1F2GRy4SKtGDz-ix-Z6t86XVNo
File Type: PDF
Entry Count: 87
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer contexts. Researchers analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. They find that performance can degrade significantly when the position of relevant information is changed, indicating that current language models do not robustly make use of information in long input contexts. In particular, they observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access relevant information in the middle of long contexts, even for explicitly long-context models. This analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
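The key-value retrieval probe is straightforward to reproduce. Below is a minimal shell sketch, assuming uuidgen is available; the pair count, variable names, and file name are illustrative choices, not taken from the paper's released code:

# Build a JSON dictionary of n random UUID key-value pairs, with the
# pair that must be retrieved placed at position k; sweeping k over
# 1..n traces out the position-sensitivity curve the paper measures.
n=75   # total key-value pairs in the context
k=38   # 1-indexed position of the relevant pair
target_key=$(uuidgen)
target_val=$(uuidgen)
{
  echo "{"
  for i in $(seq 1 "$n"); do
    if [ "$i" -eq "$k" ]; then
      pair="\"$target_key\": \"$target_val\""
    else
      pair="\"$(uuidgen)\": \"$(uuidgen)\""
    fi
    if [ "$i" -lt "$n" ]; then echo "  $pair,"; else echo "  $pair"; fi
  done
  echo "}"
} > kv_context.json
# The model is then asked to return the value for the target key:
echo "Retrieve the value for key: $target_key"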

Figure 9: Query-aware contextualization (placing the query before and after the documents) does not substantially improve robustness of language models to changing the position of relevant information in multi-document QA; performance slightly increases when relevant information occurs at the very beginning, but otherwise slightly decreases.

The query appears only at the end of the prompt, and decoder-only models can only attend to prior tokens at each timestep. In contrast, encoder-decoder models (which seem more robust to changes in the position of relevant information; §4.1) use a bidirectional encoder to contextualize input contexts. Can we use this observation to improve decoder-only models by placing the query before and after the data, enabling query-aware contextualization of documents (or key-value pairs)?
id: 0caad96ac482888a1ac2dfa59ddc7ecb - page: 8
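As a concrete illustration of query-aware contextualization, here is a minimal shell sketch that emits the same question before and after the retrieved documents; the question and file names are placeholders, and this is not the paper's released code:

question="who got the first nobel prize in physics"
{
  echo "$question"                      # query placed before the documents
  cat doc_01.txt doc_02.txt doc_03.txt  # retrieved documents, in order
  echo "$question"                      # query repeated after the documents
} > prompt.txt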
4.3 Effect of Instruction Fine-Tuning The models we evaluated are all instruction fine-tuned: after their initial pre-training, they undergo supervised fine-tuning on a dataset of instructions and responses. The task specification and/or instruction is commonly placed at the beginning of the input context in supervised instruction fine-tuning data, which might lead instruction fine-tuned language models to place more weight on the start of the input context. To better understand the potential effects of instruction fine-tuning on how language models use long input contexts, we compare the multi-document question answering performance of MPT-30B-Instruct against its base model (i.e., before instruction fine-tuning), MPT-30B. We use the same experimental setup as §2. We find that query-aware contextualization dramatically improves performance on the key-value retrieval task: all models achieve near-perfect performance.
id: f934fceb96d6e663f2545087e6e409a3 - page: 8
Figure 10 compares the multi-document QA performance of MPT-30B and MPT-30B-Instruct as a function of the position of the relevant information.

Figure 10: Multi-document QA performance of MPT-30B-Instruct compared against its base model (i.e., before instruction fine-tuning) MPT-30B, with 20 total retrieved documents (~4K tokens); the plot shows accuracy against the position of the document with the answer (1st through 20th). Both models have a U-shaped performance curve, where performance is much higher when relevant information occurs at the start or end of the input context, indicating that the instruction fine-tuning process itself is not necessarily responsible for these performance trends.
id: 8bcb42352758045dd013f7f2cc30a44c - page: 8
Surprisingly, we see that both MPT-30B and MPT-30B-Instruct exhibit a U-shaped performance curve, where performance is highest when relevant information occurs at the very beginning or very end of the context. Although the absolute performance of MPT-30B-Instruct is uniformly higher than that of MPT-30B, their overall performance trends are similar. We also observe that instruction fine-tuning slightly reduces the disparity between best- and worst-case performance, from nearly 10% for the base model to around 4%.
id: 962bea3392187dc2a0378112822a8d50 - page: 9
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "SMLWJ6TeKb-eYv0IR1F2GRy4SKtGDz-ix-Z6t86XVNo", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "SMLWJ6TeKb-eYv0IR1F2GRy4SKtGDz-ix-Z6t86XVNo", "level": 2}'