Abstract of the paper by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia: Neural information retrieval (IR) has greatly advanced search and other knowledge-intensive language tasks. While many neural IR methods encode queries and documents into single-vector representations, late interaction models produce multi-vector representations at the granularity of each token and decompose relevance modeling into scalable token-level computations. This decomposition has been shown to make late interaction more effective, but it inflates the space footprint of these models by an order of magnitude. In this work, we introduce ColBERTv2, a retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. We evaluate ColBERTv2 across a wide range of benchmarks, establishing state-of-the-art quality within and outside the training domain while reducing the space footprint of late interaction models by 6–10×.
https://doi.org/10.48550/arXiv.2112.01488
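To make the token-level decomposition concrete, below is a minimal PyTorch sketch of ColBERT-style late interaction scoring via the MaxSim operator: each query token embedding takes its maximum cosine similarity over the document's token embeddings, and these maxima are summed into a single relevance score. The embedding dimension and token counts here are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring of one query against one document.

    Q: (num_query_tokens, dim) L2-normalized query token embeddings.
    D: (num_doc_tokens, dim)   L2-normalized document token embeddings.
    For each query token, take the maximum cosine similarity over all
    document tokens, then sum those maxima into a scalar score.
    """
    sim = Q @ D.T                      # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()

# Toy usage with random stand-in embeddings (a real system would use
# a trained encoder such as ColBERTv2's BERT-based one).
torch.manual_seed(0)
Q = F.normalize(torch.randn(32, 128), dim=-1)
D = F.normalize(torch.randn(180, 128), dim=-1)
print(late_interaction_score(Q, D))
```

Because relevance decomposes into per-token maxima over precomputed embeddings, retrieval can be served from an index of token vectors, which is what makes late interaction scalable, and also what makes its index large without compression.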
While our contributions have made ColBERT's late interaction more efficient in terms of storage costs, large-scale distillation with hard negatives increases system complexity and, accordingly, training cost, when compared with the straightforward training paradigm of the original ColBERT model. While ColBERTv2 is efficient in terms of latency and storage at inference time, we suspect that under extreme resource constraints, simpler model designs like SPLADEv2 or RocketQAv2 could lend themselves to easier-to-optimize environments. We leave low-level systems optimizations of all systems to future work. Another worthwhile dimension for future exploration of tradeoffs is reranking architectures over various systems with cross-encoders, which are known to be expensive yet precise due to their highly expressive capacity.
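For context on these storage tradeoffs: the residual compression mentioned in the abstract represents each token embedding as the ID of its nearest centroid plus a low-bit quantized residual. The sketch below illustrates the idea with uniform scalar quantization over NumPy arrays; it is a simplified stand-in under assumed parameters (64 centroids, 2-bit codes), not ColBERTv2's actual implementation, which chooses quantization buckets more carefully and bit-packs the residuals.

```python
import numpy as np

def compress(vectors, centroids, n_bits=2):
    """Encode each vector as (nearest centroid id, quantized residual)."""
    # Nearest centroid per vector (squared Euclidean distance).
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    residuals = vectors - centroids[ids]
    # Uniform scalar quantization of each residual entry into 2**n_bits levels.
    lo, hi = residuals.min(), residuals.max()
    levels = 2 ** n_bits
    codes = np.round((residuals - lo) / (hi - lo) * (levels - 1)).astype(np.uint8)
    return ids, codes, (lo, hi)

def decompress(ids, codes, bounds, centroids, n_bits=2):
    """Approximately reconstruct the original vectors."""
    lo, hi = bounds
    residuals = codes.astype(np.float32) / (2 ** n_bits - 1) * (hi - lo) + lo
    return centroids[ids] + residuals

# Toy usage: random embeddings and randomly chosen centroids (ColBERTv2
# derives its centroids from k-means over sampled token embeddings, so
# its residuals, and hence its reconstruction error, are much smaller).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128)).astype(np.float32)
C = X[rng.choice(len(X), size=64, replace=False)]
ids, codes, bounds = compress(X, C)
X_hat = decompress(ids, codes, bounds, C)
print(np.abs(X - X_hat).mean())  # mean absolute reconstruction error
```

Storing a centroid ID plus 1-2 bits per dimension, instead of a 16- or 32-bit float per dimension, is what yields the 6-10× reduction in index size reported in the abstract.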
Research Limitations

While we evaluate ColBERTv2 on a wide range of tests, all of our benchmarks are in English and, in line with related work, our out-of-domain tests evaluate models that are trained on MS MARCO. We expect our approach to work effectively for other languages and when all models are trained using other, smaller training sets (e.g., NaturalQuestions), but we leave such tests to future work.
We have observed consistent gains for ColBERTv2 against existing state-of-the-art systems across many diverse settings. Despite this, almost all IR datasets contain false negatives (i.e., relevant but unlabeled passages), and thus some caution is needed in interpreting any individual result. Nonetheless, we intentionally sought out benchmarks with dissimilar annotation biases: for instance, TREC-COVID (in BEIR) annotates the pool of documents retrieved by the systems submitted at the time of the competition, LoTTE uses automatic Google rankings (for search queries) and StackExchange question-answer pairs (for forum queries), and the Open-QA tests rely on passage-answer overlap for factoid questions. ColBERTv2 performed well in all of these settings. We discuss other issues pertinent to LoTTE in Appendix D. We have compared with a wide range of strong baselines, including sparse retrieval and single-vector models, and found reliable patterns across tests. However, we caution that no such comparison can be exhaustive.
We thus primarily used results and models from the rich recent literature on these problems, with models like RocketQAv2 and SPLADEv2.