While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents INformation-INtensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (∼128 tokens) within a synthesized long context (4K−32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5→26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3→59.2 accuracy on MMLU). GitHub link: github.com/microsoft/FILM.
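To make the data construction concrete, the following is a minimal sketch of how one such IN2 training example could be assembled. The helpers `make_qa` and `count_tokens`, and the specific hyperparameter choices shown, are illustrative assumptions rather than the authors' released pipeline.

```python
import random

def synthesize_in2_example(segment_pool, make_qa, count_tokens,
                           min_tokens=4_000, max_tokens=32_000):
    """Assemble one IN2-style training example (illustrative sketch only).

    `segment_pool` is a list of ~128-token text segments; `make_qa` and
    `count_tokens` are hypothetical helpers: the former returns a
    (question, answer) pair grounded in the given segment(s), the latter
    counts tokens in a string.
    """
    # (1) fine-grained awareness of a single segment, or
    # (2) integration/reasoning over two or more segments.
    key_segments = random.sample(segment_pool, k=random.choice([1, 2]))
    question, answer = make_qa(key_segments)

    # Pad with unrelated segments until the target context length is reached.
    target = random.randint(min_tokens, max_tokens)
    segments = list(key_segments)
    total = sum(count_tokens(s) for s in key_segments)
    while total < target:
        filler = random.choice(segment_pool)
        segments.append(filler)
        total += count_tokens(filler)

    # Shuffle so the key information can land anywhere in the long context,
    # including deep in the middle.
    random.shuffle(segments)
    return {"context": " ".join(segments), "question": question, "answer": answer}
```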
Probing results (Avg ↑ / Gap ↓) of FILM-7B (20%) trained with different RoPE bases:

RoPE base            Document       Code           Structured     All
1.0×10^6 (default)   82.9 / 11.5    74.5 / 27.7    83.5 / 31.6    80.3 / 23.6
2.0×10^6             83.9 /  9.3    79.8 / 27.1    87.7 / 13.2    83.8 / 16.5
1.0×10^7             83.7 /  7.6    81.7 / 18.4    89.4 / 16.8    84.9 / 14.3
1.0×10^8             84.6 /  6.6    81.4 / 22.3    87.7 / 13.2    84.6 / 14.0

These comparisons highlight that both context styles and retrieval patterns contribute significantly to the hardness of the probing tasks.

Training on synthesized long-context data effectively generalizes to real-world scenarios. Table 2 contains the results on various real-world long-context tasks. FILM-7B also significantly improves the performance of the backbone model in real-world long-context scenarios, and it achieves SOTA-level performance on these tasks among open-source models of comparable (7B) size. Notably, the long contexts used in IN2 training are all synthesized from short segments. These improvements suggest that the long-context capabilities learned from synthesized data transfer successfully to real-world tasks.
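For reference, each probing task follows the same basic recipe: a single piece of target information is planted at a controlled relative depth within an otherwise irrelevant long context, and a retrieval question is appended; sweeping the depth exposes any lost-in-the-middle pattern. The sketch below illustrates that recipe with hypothetical helpers; it is not the released probing code.

```python
def build_probe(filler_segments, key_sentence, question, depth, count_tokens,
                context_tokens=32_000):
    """Place `key_sentence` at relative `depth` (0.0 = start, 1.0 = end) of a
    long context and append the retrieval question.

    Illustrative sketch only; `count_tokens` and the layout are assumptions.
    Measuring accuracy over a grid of depths yields the position-wise curve.
    """
    context, used, inserted = [], 0, False
    for seg in filler_segments:
        if not inserted and used >= depth * context_tokens:
            context.append(key_sentence)
            inserted = True
        context.append(seg)
        used += count_tokens(seg)
        if used >= context_tokens:
            break
    if not inserted:              # depth close to 1.0
        context.append(key_sentence)
    return "\n".join(context) + "\n\n" + question
```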
FILM-7B maintains performance on short-context tasks. Figure 5 illustrates the performance of FILM-7B and the vanilla backbone model on short-context tasks. The overall performance on short-context tasks is comparable, with only minor variance. These results confirm that FILM-7B does not compromise the short-context capabilities of the backbone model.

4.3 Training Strategy Analysis

The experimental results in Section 4.2 demonstrate the feasibility of IN2 training. We further explore how to improve the effectiveness and efficiency of IN2 training from the perspective of training strategies, focusing on two in particular: applying a sliding window and adjusting the position encoding (a sketch of the latter follows below). Considering the high cost of training, the following experiments use 20% of all training examples.
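Adjusting the position encoding amounts to enlarging the RoPE base frequency of the backbone before IN2 training; its effect on the probing tasks is reported in the RoPE-base table shown earlier. Below is a minimal sketch of how such an override could be expressed with the HuggingFace transformers interface; the training loop is omitted, and the exact setup used in the paper may differ.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch: override the RoPE base (rope_theta) of the backbone before IN2
# training. 1.0e6 is the Mistral-7B-Instruct-v0.2 default; 2.0e6, 1.0e7,
# and 1.0e8 are the alternatives compared in the table above.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

config = AutoConfig.from_pretrained(model_name)
config.rope_theta = 1.0e7

model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
```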
Models using sliding windows cannot effectively capture long-distance information. Our experiments with Mistral models, shown in Figure 4a, reveal that the performance of Mistral-7B-Instruct-v0.1 degrades severely when the information is positioned at a long distance from the question. It is worth noting that Mistral-7B-Instruct-v0.1 employs the sliding window strategy while Mistral-7B-Instruct-v0.2 does not. Consequently, we are interested in whether IN2 training can still alleviate the lost-in-the-middle problem under the sliding window strategy. We conduct the following two experiments with a 4K sliding window during training:

(1) Apply the sliding window in both pre-training and IN2 training. We take Mistral-7B-Instruct-v0.1 as the backbone model and conduct IN2 training with the same window size (4K).
(2) Apply the sliding window only during IN2 training. We take Mistral-7B-Instruct-v0.2 as the backbone model and additionally apply a 4K sliding window during IN2 training.

Figure 6 illustrates the performance of models trained with sliding windows. In both settings, performance drops dramatically once the distance between the retrieval question and the information exceeds the sliding window size. This indicates that the sliding window strategy severely hurts the long-context capability of models.
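This behavior follows directly from the attention mask: with a window of W tokens, a query position can only attend to the preceding W-1 positions, so information farther away than W tokens has no direct attention path to the question and can only reach it indirectly through intermediate layers. The snippet below is an illustrative sketch of the causal sliding-window masking rule, not Mistral's actual implementation.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where mask[q, k] is True iff query q may attend to key k.

    Causal constraint: k <= q.  Sliding-window constraint: q - k < window.
    (Illustrative sketch of the masking rule only.)
    """
    q = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    return (k <= q) & (q - k < window)

mask = sliding_window_causal_mask(seq_len=8192, window=4096)
# A key placed 6000 positions before the query falls outside the 4K window:
print(mask[8000, 2000].item())   # False -> no direct attention across that distance
```

This matches the observation that performance collapses precisely when the question-information distance exceeds the 4K window.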