Created at 6am, Jan 23
benjamin · Artificial Intelligence
In-context Learning with Retrieved Demonstrations for Language Models: A Survey
Contract ID: mhE_qrG9MH7h9v1h4VqqSdvpdsdEbZO26keSYLOiqss
File Type: PDF
Entry Count: 114
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Abstract of the Paper: Language models, especially pre-trained large language models, have showcased remarkable abilities as few-shot in-context learners (ICL), adept at adapting to new tasks with just a few demonstrations in the input context. However, the model's ability to perform ICL is sensitive to the choice of the few-shot demonstrations. Instead of using a fixed set of demonstrations, one recent development is to retrieve demonstrations tailored to each input query. The implementation of demonstration retrieval is relatively straightforward, leveraging existing databases and retrieval systems. This not only improves the efficiency and scalability of the learning process but also has been shown to reduce biases inherent in manual example selection. In light of the encouraging results and growing research in ICL with retrieved demonstrations, we conduct an extensive review of studies in this area. In this survey, we discuss and compare different design choices for retrieval models, retrieval training procedures, and inference algorithms. Original paper: https://arxiv.org/abs/2401.11624

Distillation by KL Divergence Ye et al. (2023a) claim that although the InfoNCE loss has proven effective for training demonstration retrievers and can learn which examples are superior to others, it treats all negative examples identically and does not fully exploit the scores predicted by the LLM. As an alternative to training a demonstration retriever with positive and negative examples, Shi et al. (2022) proposed to train the retriever by directly distilling the LLM's scoring function. More specifically, the retriever is trained to produce ranking scores that reflect how useful each demonstration is for LLM inference; this is done by minimizing the KL divergence between the score distribution of the top-K examples under the scoring LLM and the ranking-score distribution produced by the retriever:

L_distill = KL(p_LLM || p_retriever) = Σ_{k=1}^{K} p_LLM(d_k) log( p_LLM(d_k) / p_retriever(d_k) )
id: 6decc888f85842701451b06b2c7f97fd - page: 11
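As a rough illustration of this objective (a minimal NumPy sketch with hypothetical inputs, not the authors' implementation), the distillation loss compares the scoring LLM's usefulness scores for the top-K candidate demonstrations with the retriever's ranking scores for the same candidates:

import numpy as np

def softmax(scores, temperature=1.0):
    # Turn raw scores into a probability distribution.
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def distill_kl_loss(llm_scores, retriever_scores, temperature=1.0):
    # KL(p_LLM || p_retriever) over the same top-K candidates.
    # llm_scores: usefulness scores assigned by the scoring LLM (e.g. likelihood
    # of the gold output given each demonstration); retriever_scores: the
    # retriever's ranking scores for those K candidates.
    p_llm = softmax(llm_scores, temperature)
    p_ret = softmax(retriever_scores, temperature)
    eps = 1e-12  # guard against log(0)
    return float(np.sum(p_llm * (np.log(p_llm + eps) - np.log(p_ret + eps))))

# Example with K = 4 candidate demonstrations for one test query
loss = distill_kl_loss(llm_scores=[2.1, 0.3, -0.5, 1.4],
                       retriever_scores=[1.8, 0.9, 0.1, 0.2])
print(loss)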
Multiple Objectives In Wang et al. (2023a), the authors proposed to train the demonstration retriever with two combined objectives: (1) knowledge distillation from a trained reward model, which captures the preferences of the LLM over the retrieved candidates, and (2) an InfoNCE-based contrastive loss that incorporates the in-batch negatives. More specifically, the resulting loss function is as follows:
id: 417f85909870da0a2252306a4a5c4a2e - page: 11
L_combined = L_cont + λ · L_distill

Here λ is a constant that controls the relative importance of the two losses. The authors claim that with this multi-objective function, both the absolute LLM scores and the supervised contrastive signal are taken into account. Li et al. (2023) train a universal retriever with both a list-wise ranking loss and the InfoNCE loss. Iterative Training Regarding training strategies, most research efforts have centered on fine-tuning a single retriever once. Wang et al. (2023a) and Li et al. (2023) instead proposed to train the retriever iteratively: the retriever trained in iteration i is used to retrieve a new set of candidates for iteration i + 1. Such an iterative training approach progressively improves retriever quality by mining better positive and hard negative examples at each iteration.
id: c1cba97ed72d153102bff642938c2da7 - page: 11
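To make the combined objective concrete, here is a hedged NumPy sketch under the same assumptions (hypothetical names; the contrastive term is a standard InfoNCE over in-batch negatives, and lam stands for the weighting constant λ above):

import numpy as np

def info_nce_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    # Contrastive term: pull the positive demonstration toward the query,
    # push the in-batch negatives away (standard InfoNCE formulation).
    sims = np.array([query_emb @ pos_emb] + [query_emb @ n for n in neg_embs])
    sims = sims / temperature
    sims -= sims.max()  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())
    return float(-log_probs[0])  # the positive sits at index 0

def combined_loss(l_cont, l_distill, lam=0.2):
    # L_combined = L_cont + lambda * L_distill (the lambda value here is illustrative).
    return l_cont + lam * l_distill

# Iterative training (Wang et al. 2023a; Li et al. 2023), in outline only
# (retrieve, score_with_llm, and train are hypothetical helpers):
# for i in range(num_iterations):
#     candidates = retrieve(current_retriever, training_queries)   # mine new candidates
#     positives, hard_negatives = score_with_llm(candidates)       # relabel with the LLM
#     current_retriever = train(positives, hard_negatives)         # refit on fresher data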
Diversity Training The Determinantal Point Process (DPP) model (Kulesza and Taskar, 2012) defines a probability distribution over all combinations of candidate demonstrations, assigning high probability to subsets that contain relevant and diverse items (Levy et al., 2022). It models diversity through cross-candidate similarity scores and relevance through a per-candidate score, i.e., the similarity between a candidate and the test query. In addition to using a DPP directly (Levy et al., 2022), Ye et al. (2023a) also fine-tuned a DPP model and demonstrated meaningful improvements over purely similarity-based methods.
id: 91bc457ec264650f9c690134d0421e9f - page: 12
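To make the DPP idea concrete, here is a small hedged sketch (NumPy, hypothetical inputs, not any paper's implementation): a DPP-style kernel is built from per-candidate relevance scores and cross-candidate similarities, and a subset is selected greedily so that relevant but mutually redundant demonstrations are penalized:

import numpy as np

def dpp_greedy_select(relevance, similarity, k):
    # Greedily pick k demonstrations under a DPP-style kernel.
    # relevance:  (n,) relevance scores of candidates w.r.t. the test query
    # similarity: (n, n) pairwise similarity matrix between candidates
    # The kernel L = diag(q) S diag(q) assigns a high determinant (probability)
    # to subsets that are both relevant and diverse.
    q = np.asarray(relevance, dtype=float)
    S = np.asarray(similarity, dtype=float)
    L = np.outer(q, q) * S  # elementwise form of diag(q) @ S @ diag(q)

    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            # log-determinant of the kernel restricted to the candidate subset
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Example: 4 candidates, pick 2 that are relevant but not near-duplicates
rel = [0.9, 0.85, 0.3, 0.6]
sim = np.array([[1.0, 0.95, 0.1, 0.2],
                [0.95, 1.0, 0.1, 0.2],
                [0.1, 0.1, 1.0, 0.3],
                [0.2, 0.2, 0.3, 1.0]])
print(dpp_greedy_select(rel, sim, k=2))  # -> [0, 3]: skips candidate 1, a near-duplicate of 0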
How to Retrieve?
# Search: semantic search over this index with a natural-language query (contract_id is the value listed above)

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "mhE_qrG9MH7h9v1h4VqqSdvpdsdEbZO26keSYLOiqss", "query": "What is alexanDRIA library?"}'
        
# Query: nearest-neighbor lookup against the HNSW index with a raw embedding vector
# (the vector shown is a placeholder; a real query vector should come from the index's embedding model, jina_embeddings_v2_base_en)

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "mhE_qrG9MH7h9v1h4VqqSdvpdsdEbZO26keSYLOiqss", "level": 2}'
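
The same search call can also be issued from Python; below is a minimal sketch using the requests library (it simply mirrors the curl example above, with the API key left as a placeholder and an illustrative query string):

import requests

API_KEY = "<YOUR_API_KEY>"  # placeholder, as in the curl examples
CONTRACT_ID = "mhE_qrG9MH7h9v1h4VqqSdvpdsdEbZO26keSYLOiqss"

response = requests.post(
    "https://search.dria.co/hnsw/search",
    headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "rerank": True,
        "top_n": 10,
        "contract_id": CONTRACT_ID,
        "query": "How is a demonstration retriever trained with KL distillation?",
    },
)
response.raise_for_status()
print(response.json())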