A Deep Dive into How Retrieval-Augmented Generation is Transforming Information Retrieval and the SOTA Techniques Powering It
What is RAG?
Retrieval-augmented generation (RAG) combines information retrieval and natural language generation by providing the model with external knowledge that it uses to generate text.
Large Language Models are trained on vast corpora of text and contain billions, sometimes trillions, of parameters that store the learned information. RAG extends these models to specific domains or custom use cases without requiring the pre-trained model to be retrained.
One key distinction between RAG and traditional fine-tuning is that fine-tuning relies on in-domain learning and alters the model’s parameters to improve task-specific performance, while RAG relies on in-context learning and leaves the existing parameters untouched.

Think of RAG as an open-book approach, retrieving relevant documents from an index to provide context, whereas fine-tuning is more like a closed-book approach where knowledge is embedded in the model’s parameters.
Why RAG?
To understand why we need RAG systems, we first need to examine the limitations of Large Language Models.
Challenges with LLMs

- Hallucinations — LLMs often generate inaccurate responses with high confidence.
- Attribution — Without access to the data LLMs are trained on, it’s hard to trace where their responses originate.
- Staleness — LLMs are limited to the knowledge available up to their training date, so they can’t provide recent information.
- Revisions — Updating facts or editing the model’s knowledge post-training isn’t currently feasible.
- Customization — Many use cases require LLMs to be fine-tuned on specific company or user data.
How RAG solves these problems
RAG introduces an external memory that LLMs can access while generating responses. This memory addresses staleness by enabling real-time updates and revisions: new information is simply added to the index. Multiple indexes can also be created and switched between, all while the base model stays unchanged. RAG also tackles the attribution problem by grounding responses in specific retrieved data, making sources easier to trace. Additionally, RAG has been shown to reduce hallucinations in LLMs.
How does RAG work?
RAG systems have two main components — a Retriever and a Generator.

Retrieval
Given a query that defines the information needed, retrieval fetches the most relevant candidates from an index based on that query.
Information retrieval has two parts: indexing and ranking. The most common indexing method is the inverted index, which maps content (words, tokens, sentences) to its locations (documents, paragraphs).
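As a toy illustration, here is a minimal inverted index in Python; the lowercased whitespace tokenizer and in-memory dictionary are simplifications for clarity, not production choices:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = [
    "RAG combines retrieval and generation",
    "An inverted index maps tokens to documents",
]
index = build_inverted_index(docs)
print(index["retrieval"])  # {0}: only the first document contains "retrieval"
```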
To build an index, we can use sparse vector representations like Bag-of-Words (BOW) and TF-IDF, paired with classic algorithms like BM25. Alternatively, dense vector representations using transformer-based architectures (e.g., encoders, dual-encoders) offer better semantic understanding. Sparse vectors provide faster computation due to low information density, while dense vectors capture deeper meaning but are slower. Most modern approaches use a hybrid representation to balance speed and semantic accuracy.
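To make the sparse/dense/hybrid distinction concrete, here is a minimal sketch assuming the rank-bm25 and sentence-transformers packages; the all-MiniLM-L6-v2 model and the equal 0.5/0.5 weighting are arbitrary illustrative choices, not recommendations:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

docs = [
    "BM25 ranks documents using term frequencies",
    "Dense encoders map sentences to embedding vectors",
    "Hybrid retrieval combines sparse and dense scores",
]
query = "how do embedding models represent meaning?"

# Sparse scoring: BM25 over whitespace-tokenized text
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense scoring: cosine similarity between transformer embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(model.encode(query), model.encode(docs)).numpy().ravel()

# Hybrid: min-max normalize each score list, then take a weighted sum
def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * norm(sparse) + 0.5 * norm(dense)
print(docs[int(hybrid.argmax())])  # best-scoring document under the fused ranking
```

Normalizing before fusing matters because BM25 scores and cosine similarities live on different scales; without it, one signal silently dominates the other.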
Augmented Generation
The retrieved candidates and a prompt are passed to a Large Language Model (LLM) to get a synthesized (augmented) response to the query.
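As a minimal sketch of this step, the snippet below places the retrieved passages into the prompt and calls an LLM; the OpenAI client is used here as a stand-in for any chat-completion API, and the model name and prompt wording are illustrative assumptions:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(query: str, passages: list[str]) -> str:
    """Generate a response grounded in the retrieved passages."""
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-completion model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Instructing the model to answer only from the supplied context is a common grounding pattern: it nudges the generator toward the retrieved evidence rather than its parametric memory.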
The response generated by the LLM can vary based on the task: it can draw on its intrinsic knowledge or restrict itself to the given context. Naive RAG systems struggle with hallucinations, where the response is not aligned with the retrieved context, or with responses that lean so heavily on the retrieved context that they fail to be coherent and insightful. These issues can be addressed with advanced RAG techniques.