Nomic Embed: Training a Reproducible Long Context Text Embedder
Contract ID
qg7_9hzkzvr-cf1JNhiDOwVGjMiCmYiud0rSR9ltd_0
File Type
PDF
Entry Count
56
Embed. Model
jina_embeddings_v2_base_en
Index Type
hnsw

Abstract of the Paper: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1.

Original Paper: https://arxiv.org/abs/2402.01613
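Text embedders like nomic-embed-text-v1 map documents and queries to vectors that are typically compared by cosine similarity. A minimal sketch of that comparison (the 4-dimensional vectors below are toy values standing in for real 768-dimensional model outputs, not actual embeddings):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embedding-model outputs.
query_vec = [0.1, 0.3, 0.5, 0.1]
doc_vec = [0.2, 0.3, 0.4, 0.1]
print(round(cosine_similarity(query_vec, doc_vec), 3))  # 0.974
```

Scores near 1.0 indicate semantically similar texts; retrieval ranks documents by this score against the query vector.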

[Two columns of Table 4 score values, garbled in extraction, omitted; see Table 4 of the paper.]
and BigPatent (Sharma et al., 2019). Results are presented in Table 4. Similar to (Gunther et al., 2024), we report V-scores and NDCG@10 for the clustering and retrieval datasets, respectively. Across sequence lengths and tasks, nomic-embed-text-v1 beats or ties jina-embeddings-v2-base on all datasets at 8k context length. Additionally, nomic-embed-text-v1 beats text-embedding-ada-002 on two of the four datasets. Similar to (Gunther et al., 2024), we also observe on WikiCitiesClustering that longer sequence length hurts performance, suggesting that longer sequence lengths are not necessary to perform well on that test.
id: 8bc84aa51123afc970e01a5bad43d9f5 - page: 6
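The retrieval metric reported above, NDCG@10, discounts relevant results by their rank and normalizes against the ideal ordering. A minimal sketch (illustrative helper names, not the paper's evaluation code):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance of result at rank i
    # is divided by log2(i + 2), so later hits count for less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; pushing relevant results down lowers it.
print(ndcg_at_k([1, 1, 0, 0]))            # 1.0
print(round(ndcg_at_k([0, 1, 0, 1]), 3))  # 0.651
```

The V-score used for the clustering datasets is the harmonic mean of homogeneity and completeness of the induced clusters.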
5.3.2 LoCo Benchmark

The LoCo Benchmark consists of 5 retrieval datasets, 3 from (Shaham et al., 2022) and 2 from (Dasigi et al., 2021). The benchmark tests retrieval across meeting transcripts, national policy reports, TV episode transcripts, and scientific research papers. We include the QASPER Abstract Articles dataset for completeness, but would like to highlight that many models seem to oversaturate the benchmark and approach 1.0 NDCG@10. Results are presented in Table 6. nomic-embed-text-v1 beats jina-embeddings-v2-base-en across sequence lengths. nomic-embed-text-v1 beats M2Bert at 2048 and is competitive at 8192. At sequence length 4096, nomic-embed-text-v1 is competitive with E5-Mistral while being significantly smaller.

5.4 Few-Shot Evaluation of BEIR
id: b600fae366f97c454464c4e4db8c5499 - page: 6
, 2023), GTE (Li et al., 2023), and E5-Mistral (Wang et al., 2023b) report training on train splits of BEIR benchmark datasets such as FEVER and HotpotQA. To understand the impact of this on our

Model                         Seq  NarrativeQA  WikiCities  SciFact  BigPatent
nomic-embed-text-v1           128  20.1         90.0        65.4     18.5
nomic-embed-text-v1-ablated   128  20.8         86.8        65.2     17.5
jina-embeddings-base-v2       128  19.6         79.9        62.1     14.4
text-embedding-ada-002        128  25.4         84.9        68.8     16.6
text-embedding-3-small        128  29.5         87.5        68.8     15.0
text-embedding-3-large        128  45.6         87.9        74.8     16.5

(The Avg column values fall outside this chunk.)
id: 4ecad8f99ae4b474dfca833efc59e9a7 - page: 6
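Assuming the Avg column in the table above is the unweighted mean of the four dataset scores (the Avg values themselves are cut off by the chunk boundary), it can be reproduced from any row; for example, the nomic-embed-text-v1 row at sequence length 128:

```python
# Scores from the nomic-embed-text-v1 row at sequence length 128.
scores = {"NarrativeQA": 20.1, "WikiCities": 90.0, "SciFact": 65.4, "BigPatent": 18.5}

# Unweighted mean across the four datasets (assumed definition of Avg).
avg = sum(scores.values()) / len(scores)
print(round(avg, 1))  # 48.5
```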
(Continuation of the same table at the next sequence-length setting; rows follow the model order above, and the first NarrativeQA value is truncated in this chunk.)

Model                         NarrativeQA  WikiCities  SciFact  BigPatent
nomic-embed-text-v1           …9           88.7        70.5     25.3
nomic-embed-text-v1-ablated   25.7         81.9        71.5     23.7
jina-embeddings-base-v2       21.3         79.3        66.7     21.9
text-embedding-ada-002        25.5         84.8        72.6     23.0
text-embedding-3-small        32.2         89.0        73.2     23.6
text-embedding-3-large        48.1         89.9        77.6     23.6
id: 37e8e42210d5fc699d94233307368acc - page: 7
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "qg7_9hzkzvr-cf1JNhiDOwVGjMiCmYiud0rSR9ltd_0", "query": "What is alexanDRIA library?"}'
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "qg7_9hzkzvr-cf1JNhiDOwVGjMiCmYiud0rSR9ltd_0", "level": 2}'
# Note: the two-element vector above is illustrative only; a real query vector
# must match the index's embedding dimensionality (768 for jina_embeddings_v2_base_en).
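The search call can also be issued from Python. A minimal standard-library sketch mirroring the curl example above (the endpoint, headers, and body fields are taken from that example; the query string and the `build_search_request` helper name are illustrative, and no request is actually sent without a valid API key):

```python
import json
import urllib.request

CONTRACT_ID = "qg7_9hzkzvr-cf1JNhiDOwVGjMiCmYiud0rSR9ltd_0"

def build_search_request(query, api_key, top_n=10, rerank=True):
    # Mirrors the POST body of the /hnsw/search curl example.
    body = {
        "rerank": rerank,
        "top_n": top_n,
        "contract_id": CONTRACT_ID,
        "query": query,
    }
    return urllib.request.Request(
        "https://search.dria.co/hnsw/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_search_request("What is Nomic Embed?", "<YOUR_API_KEY>")
print(json.loads(req.data)["top_n"])  # 10
# With a real key, send it via: urllib.request.urlopen(req)
```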