RAG: Redefining How We Retrieve and Generate Information

Aishwarya
3 min read · Sep 8, 2024


A Deep Dive into How Retrieval-Augmented Generation is Transforming Information Retrieval and the SOTA Techniques Powering It

What is RAG?

Retrieval-augmented generation (RAG) combines information retrieval and natural language generation by providing the model with external knowledge that it uses to generate text.

Large Language Models are trained on massive corpora of data and contain billions, sometimes trillions, of parameters that store the learned information. RAG extends these models to specific domains or custom use cases without retraining the pre-trained model.

One key distinction between RAG and traditional fine-tuning is that fine-tuning leverages in-domain learning and alters the model’s parameters to improve task-specific performance, while RAG leverages in-context learning without modifying or adding to the existing parameters.

Figure 1: Difference between fine-tuning (left) and RAGs (right). Fine-tuning is a closed-book approach where information is memorized. RAG uses an open-book approach where relevant documents can be retrieved when needed. (Source)

Think of RAG as an open-book approach, retrieving relevant documents from an index to provide context, whereas fine-tuning is more like a closed-book approach where knowledge is embedded in the model’s parameters.
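
To make the closed-book vs. open-book distinction concrete, here is a minimal sketch (the passages and question are chosen purely for illustration): with RAG, retrieved text is injected into the prompt at inference time, while a fine-tuned model is expected to answer from its weights alone.

```python
# In-context learning: retrieved passages are prepended to the prompt at inference time,
# so the model's parameters stay untouched (open-book).
retrieved_passages = [
    "RAG was introduced by Lewis et al. (2020).",
    "It combines a retriever with a sequence-to-sequence generator.",
]
question = "Who introduced RAG?"

open_book_prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(f"- {p}" for p in retrieved_passages) +
    f"\n\nQuestion: {question}\nAnswer:"
)

# Closed-book (fine-tuned) setting: the bare question is asked and the answer
# must come from knowledge baked into the model's weights during training.
closed_book_prompt = f"Question: {question}\nAnswer:"
```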

Why RAG?

To understand why we need RAG systems, we first need to examine the limitations of Large Language Models.

Challenges with LLMs

Figure 2: Key challenges faced by Large Language Models (LLMs) including Hallucinations, Attribution, Staleness, Revisions, and Customization.
  1. Hallucinations — LLMs often generate inaccurate responses with high confidence.
  2. Attribution — Without access to the data LLMs are trained on, it’s hard to trace where their responses originate.
  3. Staleness — LLMs are limited to the knowledge available up to their training date, so they can’t provide recent information.
  4. Revisions — Updating facts or editing the model’s knowledge post-training isn’t currently feasible.
  5. Customization — Many use cases require LLMs to be fine-tuned on specific company or user data.

How RAG solves these problems

RAG introduces an external memory that LLMs can access while generating responses. This memory helps address staleness by enabling real-time updates and revisions through the addition of new information to the index. Moreover, multiple indexes can be created and switched between, all while keeping the base model unchanged. RAG also solves the attribution issue by grounding the responses in specific data, making it easier to trace the source. Additionally, RAG has been shown to reduce hallucinations in LLMs.
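
A minimal sketch of what this external memory can look like in practice, with made-up index names and documents; the point is that updating or switching an index requires no change to the model itself, and the retrieved passages double as citations.

```python
# Sketch: an external "memory" the model reads from but never trains on.
# Index names and documents are illustrative only.
indexes = {
    "hr_policies": [
        "Employees accrue 20 days of paid leave per year.",
        "Remote work requires manager approval.",
    ],
    "product_docs": [
        "The API rate limit is 100 requests per minute.",
        "Webhooks retry failed deliveries three times.",
    ],
}

# Revisions / staleness: appending a document updates what the system can cite
# immediately, with no retraining of the base model.
indexes["product_docs"].append("As of v2.3, the API rate limit is 500 requests per minute.")

def retrieve_with_sources(query: str, index_name: str) -> list[str]:
    # Switching `index_name` swaps knowledge bases while the model stays unchanged.
    docs = indexes[index_name]
    hits = [d for d in docs if any(word in d.lower() for word in query.lower().split())]
    # Attribution: the retrieved passages are returned alongside the answer as citations.
    return hits
```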

How does RAG work?

RAG systems have two main components — a Retriever and a Generator.

Figure 3: The RAG system consists of two key components: the Retriever (pink block) and the Generator (blue block). The Retriever identifies and retrieves the top-k relevant documents for a given query, while the Generator synthesizes an answer based on the information from the retrieved documents. (Source)
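
A rough skeleton of that two-component structure (class and function names are illustrative, not taken from any particular framework, and the LLM call is stubbed out):

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM client; replace with your provider's API call.
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

@dataclass
class Retriever:
    documents: list[str]

    def top_k(self, query: str, k: int = 3) -> list[str]:
        # Toy scoring by word overlap; real retrievers use BM25 or dense embeddings.
        score = lambda doc: len(set(query.lower().split()) & set(doc.lower().split()))
        return sorted(self.documents, key=score, reverse=True)[:k]

@dataclass
class Generator:
    def generate(self, query: str, context: list[str]) -> str:
        prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
        return call_llm(prompt)

def rag_answer(query: str, retriever: Retriever, generator: Generator) -> str:
    return generator.generate(query, retriever.top_k(query))
```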

Retrieval

Given a query, which defines the information needed, retrieval involves fetching the most relevant candidates from an index.

Information retrieval has two parts: indexing and ranking. The most common indexing method is the inverted index, which maps content (words, tokens, sentences) to the locations (documents, paragraphs) where it appears.
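
As a toy illustration, an inverted index can be built in a few lines (a simplified sketch; real indexes also store term frequencies and positions, and are compressed):

```python
from collections import defaultdict

docs = {
    0: "retrieval augmented generation combines retrieval and generation",
    1: "dense retrieval uses transformer encoders",
    2: "sparse retrieval relies on exact term matching",
}

# Inverted index: token -> set of document IDs containing that token.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        inverted_index[token].add(doc_id)

print(inverted_index["retrieval"])  # {0, 1, 2}
```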

To build an index, we can use sparse vector representations like Bag-of-Words (BoW) and TF-IDF, paired with classic ranking algorithms like BM25. Alternatively, dense vector representations built with transformer-based architectures (e.g., encoders and dual encoders) offer better semantic understanding. Sparse vectors are cheap to compute and match because most of their entries are zero, while dense vectors capture deeper semantic meaning at a higher computational cost. Most modern approaches use a hybrid of the two to balance speed and semantic accuracy.
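
Below is a rough sketch of such a hybrid ranking, assuming scikit-learn for the sparse TF-IDF side and the sentence-transformers package (with the all-MiniLM-L6-v2 model) for the dense side; the mixing weight is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer  # assumes sentence-transformers is installed

docs = [
    "RAG combines retrieval and generation",
    "BM25 is a classic sparse ranking function",
    "Dual encoders map queries and documents into a shared embedding space",
]
query = "how do dense retrievers embed documents"

# Sparse side: TF-IDF vectors scored with cosine similarity.
tfidf = TfidfVectorizer()
doc_sparse = tfidf.fit_transform(docs)
sparse_scores = cosine_similarity(tfidf.transform([query]), doc_sparse)[0]

# Dense side: transformer sentence embeddings scored with cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = cosine_similarity(encoder.encode([query]), encoder.encode(docs))[0]

# Hybrid: a weighted mix of the two (alpha = 0.5 is an arbitrary illustrative choice).
alpha = 0.5
hybrid_scores = alpha * sparse_scores + (1 - alpha) * dense_scores
print(docs[int(np.argmax(hybrid_scores))])
```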

Augmented Generation

The retrieved candidates and a prompt are passed to a Large Language Model (LLM) to get a synthesized (augmented) response to the query.

The response generated by the LLM can vary based on the task: it can draw on its intrinsic knowledge or be restricted to the given context. Naive RAG struggles with hallucinations, where the response is not aligned with the retrieved context, or with responses that lean so heavily on the retrieved context that they fail to be coherent or insightful. These issues can be addressed with advanced RAG techniques.
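
As a sketch of the generation step, the prompt below instructs the model to stay within the retrieved context, which is the usual first defense against the misalignment issues above. The call follows the OpenAI Python SDK purely as an example; any LLM client can be substituted, and the model name is an illustrative choice.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM client works similarly

def generate_answer(query: str, retrieved_docs: list[str]) -> str:
    context = "\n\n".join(retrieved_docs)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # a low temperature keeps the answer closer to the context
    )
    return response.choices[0].message.content
```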


Written by Aishwarya

Data Science Practitioner | Machine Learning Enthusiast
