Embedding models play a pivotal role in modern NLP applications such as information retrieval (IR) and retrieval-augmented generation (RAG). While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models remain confined to a narrow context window not exceeding 8k tokens, barring them from application scenarios that require long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models on long-context retrieval using our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show that further fine-tuning can harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
Source: https://arxiv.org/abs/2404.12096
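To make the training-free extension idea concrete, below is a minimal sketch of one common way to realize position interpolation for an APE-style encoder: linearly interpolating the learned position-embedding table so that, for example, a 512-entry table covers 4,096 positions. The shapes, attribute path, and choice of linear interpolation are illustrative assumptions, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_position_embeddings(pe: torch.Tensor, target_len: int) -> torch.Tensor:
    """Stretch a learned absolute position-embedding table to a longer context.

    pe: (orig_len, hidden) table, e.g. (512, 768) for a BERT-style encoder.
    Returns a (target_len, hidden) table covering the extended window.
    """
    # F.interpolate expects (batch, channels, length); treat hidden dims as channels.
    pe_3d = pe.t().unsqueeze(0)                                  # (1, hidden, orig_len)
    stretched = F.interpolate(pe_3d, size=target_len,
                              mode="linear", align_corners=True)
    return stretched.squeeze(0).t()                              # (target_len, hidden)

# Illustrative usage on a BERT-style encoder (attribute path is an assumption);
# the model's position_ids buffer / max_position_embeddings must be enlarged too.
# old = model.embeddings.position_embeddings.weight.data          # (512, 768)
# model.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
#     interpolate_position_embeddings(old, 4096), freeze=False)
```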
In the last row block of Table 2, we further include the best results achieved by E5Base, E5-RoPEBase, and E5-Mistral after context window extension. For E5Base and E5-RoPEBase, we extend their contexts from 512 to 4,096; for E5-Mistral, we extend its context from 4,096 to 32,768. Compared to the original versions, the extended models achieve average score increases of +15.6 / +20.3 / +10.9 points, respectively. This indicates the efficacy of these context extension strategies on embedding models, enabling them to handle inputs several times longer. A detailed performance comparison of different extension strategies on APE- and RoPE-based embedding models is presented in Section 5.3.
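For RoPE-based models such as E5-Mistral, the NTK route adjusts the rotary base rather than the position ids. The sketch below uses the commonly cited NTK-aware scaling formula; the head dimension of 128 and base of 10,000 are assumed Mistral-style defaults, not settings verified from the paper.

```python
import torch

def ntk_scaled_inv_freq(head_dim: int = 128, base: float = 10000.0,
                        orig_len: int = 4096, target_len: int = 32768) -> torch.Tensor:
    """NTK-aware RoPE scaling: enlarge the rotary base by scale^(d / (d - 2)).

    Low-frequency dimensions get interpolated while high-frequency (local)
    dimensions are left nearly untouched, so no fine-tuning is required.
    """
    scale = target_len / orig_len                                  # e.g. 8x for 4k -> 32k
    new_base = base * scale ** (head_dim / (head_dim - 2))
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (new_base ** exponents)   # replaces the model's rotary inv_freq buffer
```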
5.3 PERFORMANCE COMPARISON OF CONTEXT EXTENSION METHODS

APE-Based Models. Figure 5a illustrates the impact of various context extension strategies on E5Base and GTEBase across different target context lengths. We observe that plug-and-play methods obtain similar scores, while further tuning yields the best results.

Table 2: Results (%) of existing and extended embedding models on LONGEMBED. NQA, SFD, and 2WmQA are short for NarrativeQA, SummScreenFD, and 2WikiMultihopQA, respectively. Passkey and Needle are synthetic tasks evaluated by Acc@1; the four real-world tasks are evaluated by nDCG@10. We show that context window extension can effectively improve existing embedding models in handling long context.

Model                               Param.   Passkey  Needle  NQA    QMSum  SFD    2WmQA  Avg.
512 Context Models
E5Base (Wang et al., 2022)          110M     38.0     28.5    25.3   23.8   74.7   55.8   41.0
E5-RoPEBase                         110M     38.5     31.5    24.6   23.2   66.6   58.8   40.5
GTEBase (Li et al., 2023)           110M     31.0     24.5    28.6   21.8   55.8   47.3   34.8
BGE-Base (Xiao et al., 2023)        110M     18.0     25.3    25.6   22.4   60.3   51.7   33.9
Contriever (Izacard et al., 2021)   110M     38.5     29.0    26.7   25.5   73.5   47.3   40.1
GTR-Base (Ni et al., 2022)          110M     38.5     26.3    26.5   18.3   63.7   52.2   36.5

4k Context Models
E5-Mistral (Wang et al., 2023b)     7B       71.0     48.3    44.6   43.6   96.8   82.0   64.4
Jina-V2 (Günther et al., 2023)      137M     50.3     54.5    37.9   38.9   93.5   74.0   58.2
Nomic-V1 (Nussbaum et al., 2024)    137M     32.3     25.3    38.3   35.0   91.0   73.4   49.2
BGE-M3 (Chen et al., 2024)          568M     59.3     40.5    45.8   35.5   94.0   78.0   58.9
OpenAI-Ada-002                      -        50.8     36.8    41.1   40.0   91.8   80.1   56.8

Our Extended Models
E5Base + Tuning (4k)                110M     67.3     41.5    30.4   35.7   95.2   69.2   56.6
E5-RoPEBase + SelfExtend (4k)       110M     73.5     53.5    32.3   39.1   91.9   74.6   60.8
E5-Mistral + NTK (32k)              7B       93.8     66.8    49.8   49.2   97.1   95.2   75.3

[Figure 5 panels: (a) Avg. Score (%) of E5-Base and GTE-Base vs. Context Length (0.5k-4k) under PI, RP, GP, PCW, and Tuning; (b) Avg. Score gain (4k - 512) for Tuning on PI vs. Tuning on RP.]

Figure 5: (a) Effects of different context window extension methods on E5Base and GTEBase. We found that plug-and-play methods obtain similar scores, while further tuning yields the best results. Note that for Tuning, we report results based on PI by default, as evidenced by: (b) Further ablation of tuning setups. Tuning on PI consistently outperforms RP across both models.
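The strongest training-free result for E5-RoPEBase in Table 2 comes from SelfExtend, which reuses the trained position range by keeping exact relative positions for nearby tokens and grouped (floor-divided) positions for distant ones. The sketch below illustrates that remapping for a bidirectional encoder; the group size and neighbor window are hypothetical values chosen so remapped positions stay within a 512-token training range, not the paper's settings, and the official implementation may handle the boundary differently.

```python
import torch

def self_extend_rel_positions(seq_len: int, group: int = 16, neighbor: int = 128) -> torch.Tensor:
    """Illustrative SelfExtend-style remapping of relative positions.

    Pairs within `neighbor` tokens keep their exact (signed) relative position;
    farther pairs fall back to grouped positions, shifted so the two regimes
    join seamlessly at the neighbor-window boundary.
    """
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]               # signed relative positions i - j
    sign, mag = rel.sign(), rel.abs()
    grouped = mag // group + (neighbor - neighbor // group)
    remapped = torch.where(mag <= neighbor, mag, grouped)
    return sign * remapped                          # feed into RoPE in place of raw i - j

# With seq_len=4096, group=16, neighbor=128, the largest remapped distance is
# 4095 // 16 + 120 = 375, which stays inside a 512-position training range.
```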