Created at 7pm, Jan 21
Artificial Intelligence
What is Retrieval-Augmented Generation (RAG)
Contract ID: ZCyazHDSRMDHZEm3RaSH_LBrLiNr9kfU2jUVV-QH_rQ
File Type: MP3
Entry Count: 9
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Large language models usually give great answers, but because they're limited to the training data used to create the model, over time they can become incomplete, or worse, generate answers that are just plain wrong. One way of improving LLM results is called retrieval-augmented generation, or RAG. In this video, IBM Senior Research Scientist Marina Danilevsky explains the LLM/RAG framework and how this combination delivers two big advantages: the model gets the most up-to-date and trustworthy facts, and you can see where the model got its info, lending more credibility to what it generates.

What does that mean? That means that now, instead of just relying on what the LLM knows, we are adding a content store. This could be open, like the internet. This could be closed, like some collection of documents, collection of policies, whatever.
The point, though, now is that the LLM first goes and talks to the content store and says, hey, can you retrieve for me information that is relevant to what the user's query was? And now, with this retrieval-augmented answer, it's not Jupiter anymore. We know that it is Saturn. What does this look like? Well, first, the user prompts the LLM with their question. They say, this is what my question was. And originally, if we're just talking to a generative model, the generative model says, oh, OK, I know the response. Here it is. Here's my response. But now, in the RAG framework, the generative model actually has an instruction that says, no, no, no. First, go and retrieve relevant content. Combine that with the user's question, and only then generate the answer.
So the prompt now has three parts: the instruction to pay attention to the retrieved content, the retrieved content itself, and the user's question. Now give a response. And in fact, now you can get evidence for why your response was what it was. So now, hopefully, you can see how RAG helps with the two LLM challenges that I had mentioned before. So first of all, I'll start with the out-of-date part. Now, instead of having to retrain your model if new information comes up, like, hey, we found some more moons. Now it's Jupiter again. Maybe it'll be Saturn again in the future.
All you have to do is augment your data store with new information, updated information. So now, the next time that a user comes and asks the question, we're ready. We just go ahead and retrieve the most up-to-date information. The second problem: sources. Well, the LLM is now being instructed to pay attention to primary source data before giving its response, and in fact is now able to give evidence. This makes it less likely to hallucinate or to leak data, because it is less likely to rely only on information that it learned during training.
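
To make the retrieve-then-generate flow above concrete, here is a minimal shell sketch built on the search endpoint documented below. It assumes an API key in a DRIA_API_KEY environment variable and an example question; the final echo is only a placeholder for the call to whatever generative model you use, which is outside this knowledge base's API.

# RAG flow (sketch)

QUESTION="Which planet has the most moons?"

# Retrieve: pull passages relevant to the question from the content store.
RETRIEVED=$(curl -s -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: ${DRIA_API_KEY}" \
-H "Content-Type: application/json" \
-d "{\"rerank\": true, \"top_n\": 3, \"contract_id\": \"ZCyazHDSRMDHZEm3RaSH_LBrLiNr9kfU2jUVV-QH_rQ\", \"query\": \"${QUESTION}\"}")

# Augment: assemble the three-part prompt (instruction, retrieved content, question).
PROMPT="Answer the question using only the retrieved content below, and say which passage supports your answer.

Retrieved content:
${RETRIEVED}

Question: ${QUESTION}"

# Generate: hand the combined prompt to your generative model.
# Placeholder only - substitute your own LLM call here.
echo "${PROMPT}"
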
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "ZCyazHDSRMDHZEm3RaSH_LBrLiNr9kfU2jUVV-QH_rQ", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "ZCyazHDSRMDHZEm3RaSH_LBrLiNr9kfU2jUVV-QH_rQ", "level": 2}'