
RAG: Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge that supplement the LLM’s internal representation of information. Implementing RAG in an LLM-based question answering system has two main benefits: it gives the model access to the most current, reliable facts, and it gives users access to the model’s sources, so its claims can be checked for accuracy and ultimately trusted.


The RAG lifecycle consists of two phases: Data Ingestion and Data Querying (Retrieval + Synthesis).


Data Ingestion (Pre-retrieval)

Data Ingestion is the stage where we provide the LLM with domain knowledge and up-to-date information. In this workflow, source documents (PDF, DOCX, TXT, HTML, etc.) are first divided into smaller segments, or chunks. These chunks are then transformed into embeddings using an embedding model and stored in a vector database for efficient retrieval.
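
As a rough illustration, the sketch below chunks raw text with a small overlap, embeds each chunk, and keeps the vectors in a simple in-memory list standing in for a vector database. The model name, chunk sizes, and the in-memory store are illustrative choices, not prescriptions.

```python
# Minimal ingestion sketch (assumes the sentence-transformers package is installed).
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split raw text into overlapping character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

documents = ["...full text extracted from a PDF, DOCX, or HTML page..."]
chunks = [c for doc in documents for c in chunk_text(doc)]

# Embed every chunk and keep (vector, text) pairs; a real system would write
# these into a vector database instead of a Python list.
vectors = embedder.encode(chunks, normalize_embeddings=True)
vector_store = list(zip(vectors, chunks))
```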

Data Querying (Retrieval and Post-retrieval)

When a user submits a query in a chat application, the backend system searches the vector database for relevant context. It combines this context with a custom prompt and passes it to a large language model (LLM) that acts as a summarization agent. The LLM’s role is to generate a human-like response to the query, ensuring that the answer is coherent and contextually relevant.
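
Continuing the sketch above, query time looks roughly like this: embed the query, score it against the stored chunk vectors, and hand the top matches to the LLM inside a custom prompt. `call_llm` is a placeholder for whatever chat/completion API is actually used.

```python
import numpy as np

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (vectors are normalized,
    so the dot product equals the cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scored = sorted(vector_store, key=lambda pair: float(np.dot(pair[0], q)), reverse=True)
    return [text for _, text in scored[:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)  # placeholder for an actual LLM call
```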

RAG Paradigms

Over the past few years, RAG systems have evolved from Naive RAG to Advanced RAG and Modular RAG. This evolution has occurred to address certain limitations around performance, cost, and efficiency.


Naive RAG

Naive RAG follows the traditional process described above: indexing, retrieval, and generation. In short, the user input is used to retrieve relevant documents, which are then combined with a prompt and passed to the model to generate a final response. Conversational history can be integrated into the prompt if the application involves multi-turn dialogue interactions, as sketched below.
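
One simple (and purely illustrative) way to handle multi-turn dialogue in Naive RAG is to fold the conversation history into the same prompt as the retrieved context, reusing `retrieve` and the `call_llm` placeholder from the earlier sketch.

```python
history = [("user", "What is RAG?"),
           ("assistant", "Retrieval-Augmented Generation grounds an LLM on retrieved documents.")]

def answer_with_history(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    dialogue = "\n".join(f"{role}: {message}" for role, message in history)
    prompt = (
        f"Conversation so far:\n{dialogue}\n\n"
        f"Context:\n{context}\n\n"
        f"user: {query}\nassistant:"
    )
    return call_llm(prompt)  # placeholder for an actual LLM call
```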

Challenges with Naive RAG

  • Low precision: Not all chunks in the retrieved set are relevant, which can cause hallucinations and “lost in the middle” problems.
  • Low recall: Not all relevant chunks are retrieved, leaving the LLM without enough context to synthesize an answer.
  • Outdated information: The data is redundant or out of date.

Advanced RAG

Advanced RAG addresses the issues present in Naive RAG by improving retrieval quality, which can involve optimizing the pre-retrieval, retrieval, and post-retrieval processes.

Pre-Retrieval

Pre-retrieval optimizations focus on data indexing optimizations as well as query optimizations. Data indexing optimization techniques aim to store the data in a way that helps you improve retrieval efficiency, such as:

  • Sliding window: uses an overlap between chunks and is one of the simplest techniques.
  • Enhancing data granularity: applies data cleaning techniques, such as removing irrelevant information, confirming factual accuracy, updating outdated information, etc.
  • Adding metadata: such as dates, purposes, or chapters, to enable filtering at query time (see the sketch after this list).
  • Optimizing index structures: involves different strategies to index data, such as adjusting the chunk sizes or using multi-indexing strategies.
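
To make the metadata bullet concrete, here is a small sketch in which each chunk carries a source and a date so that retrieval can filter before similarity scoring. The field names are illustrative, and `embedder` comes from the ingestion sketch earlier.

```python
from dataclasses import dataclass
import datetime
import numpy as np

@dataclass
class IndexedChunk:
    text: str
    vector: np.ndarray        # assumed to be L2-normalized at ingestion time
    source: str
    date: datetime.date

def filtered_retrieve(query: str, chunks: list[IndexedChunk],
                      after: datetime.date, k: int = 3) -> list[str]:
    """Drop chunks older than `after`, then rank the rest by similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    candidates = [c for c in chunks if c.date >= after]   # metadata filter first
    candidates.sort(key=lambda c: float(np.dot(c.vector, q)), reverse=True)
    return [c.text for c in candidates[:k]]
```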

Retrieval

The retrieval stage aims to identify the most relevant context. Usually, the retrieval is based on vector search, which calculates the semantic similarity between the query and the indexed data. Thus, the majority of retrieval optimization techniques revolve around the embedding models:

  • Fine-tuning embedding models: customizes embedding models to domain-specific contexts, especially for domains with evolving or rare terms.
  • Dynamic embedding: adapts to the context in which words are used, unlike static embeddings, which use a single vector for each word. For example, OpenAI’s text-embedding-ada-002 is a sophisticated dynamic embedding model that captures contextual understanding (a short scoring sketch follows this list).
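
The scoring step behind vector search can be sketched as follows, here using the OpenAI embedding model mentioned above. The API shape assumes the openai>=1.0 Python client; any embedding provider can be swapped in.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embed(["How does RAG reduce hallucinations?"])[0]
chunk_vecs = embed(["RAG grounds the model on retrieved documents ...",
                    "Some unrelated text about cooking."])
scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]  # higher = more relevant
```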

Post-Retrieval

Additional processing of the retrieved context can help address issues such as exceeding the context window limit or introducing noise that distracts from the crucial information. Post-retrieval optimization techniques summarized in the RAG survey are:

  • Prompt compression: reduces the overall prompt length by removing irrelevant context and highlighting the important context.
  • Re-ranking: uses machine learning models to recalculate the relevance scores of the retrieved contexts (see the sketch below).
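
A re-ranking step can be sketched with a cross-encoder that rescores (query, chunk) pairs so that only the most relevant chunks reach the prompt. The checkpoint below is one commonly used public model, not the only option.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, chunk) pair and keep the highest-scoring chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```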

Modular RAG

As the name implies, Modular RAG enhances RAG with functional modules, such as incorporating a search module for similarity retrieval or applying fine-tuning in the retriever. Both Naive RAG and Advanced RAG are special cases of Modular RAG made up of fixed modules. Extended RAG modules include search, memory, fusion, routing, predict, and task adapter, each of which addresses a different problem. These modules can be rearranged to suit specific problem contexts. Modular RAG therefore benefits from greater diversity and flexibility: you can add or replace modules, or adjust the flow between modules, based on task requirements.

Given the increased flexibility in building RAG systems, other important optimization techniques have been proposed to optimize RAG pipelines including:

  • Hybrid search exploration: This approach leverages a combination of search techniques, such as keyword-based search and semantic search, to retrieve relevant and context-rich information; this is useful when dealing with different query types and information needs (a minimal fusion sketch follows this list).
  • Recursive retrieval and query engine: Involves a recursive retrieval process that might start with small semantic chunks and subsequently retrieve larger chunks that enrich the context; this is useful for balancing efficiency with context-rich information.
  • StepBack-prompt: A prompting technique that enables LLMs to perform abstraction, producing higher-level concepts and principles that guide reasoning; this leads to better-grounded responses when adapted to a RAG framework because the LLM moves away from specific instances and is allowed to reason more broadly when needed.
  • Sub-queries: There are different query strategies such as tree queries or sequential querying of chunks that can be used for different scenarios. LlamaIndex offers a sub question query engine that allows a query to be broken down into several questions that use different relevant data sources.
  • Hypothetical document embeddings: HyDE generates a hypothetical answer to a query, embeds it, and uses it to retrieve documents similar to the hypothetical answer as opposed to using the query directly.
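
As a rough sketch of the hybrid search idea from the first bullet, the snippet below merges a keyword ranking and a vector ranking with reciprocal rank fusion. `keyword_search` and `vector_search` are assumed helper functions that return chunks ordered by relevance (e.g. BM25 and the cosine-similarity retriever shown earlier).

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; documents ranked highly in any list end up near the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 5) -> list[str]:
    keyword_hits = keyword_search(query)   # assumed helper: keyword/BM25 search over the chunks
    vector_hits = vector_search(query)     # assumed helper: semantic search over the chunks
    return reciprocal_rank_fusion([keyword_hits, vector_hits])[:top_k]
```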