Agentic RAG
Part 5 of a 6-part series on agentic AI, multi-agent architectures, and the theory of LLM collaboration.
Retrieval-Augmented Generation (RAG) was supposed to solve the hallucination problem. Give the model access to a knowledge base, retrieve relevant documents at query time, and ground the model's response in real information rather than memorized patterns. In practice, standard RAG is fragile: it retrieves the wrong documents, includes irrelevant passages, and ignores relevant ones, and the model sometimes hallucinates even when the correct information is sitting in its context window.
Agentic RAG addresses these failures by replacing the rigid retrieve-then-generate pipeline with an agent that can reason about whether to retrieve, what to retrieve, whether the retrieved information is relevant, and whether the final answer is consistent with the evidence. The progression from standard RAG to agentic RAG mirrors the progression from direct prompting to ReAct: adding a reasoning loop around the retrieval process turns a pipeline into a problem solver.
From Librarian to Research Team
What RAG Gets Wrong
Standard RAG works like a simple library catalog. You type a query, the system returns the most similar documents, and the model reads them and generates an answer. This works well when the query is simple and the relevant information is contained in a single document. It fails in several ways:
- Wrong retrieval. The query is about "Python exceptions" and the retriever returns documents about "Python snakes." The model diligently reads about snake biology and generates a confident but useless answer.
- Partial retrieval. The answer requires information from three different documents, but the retriever only returns one. The model answers based on incomplete evidence, with no awareness that it is missing information.
- Unnecessary retrieval. The query is "What is 2 + 2?" and the system retrieves ten documents about arithmetic before answering. This wastes compute and can even hurt performance if the retrieved documents contain confusing or contradictory information.
- No verification. The model generates an answer that contradicts the retrieved evidence. Standard RAG has no mechanism to detect or correct this.
Standard RAG always retrieves and never checks. Agentic RAG reasons about whether to retrieve, evaluates relevance, and verifies consistency.
Self-RAG: The Model That Reflects
Asai et al. proposed Self-RAG (ICLR 2024), a model that generates special "reflection tokens" alongside its regular text output. These tokens are not part of the response shown to the user; they are internal signals that control the generation process. There are four types:
- Retrieve token: "Do I need to retrieve information to answer this?" (Yes/No)
- Relevance token: "Is this retrieved passage relevant to the question?" (Relevant/Irrelevant)
- Support token: "Does my generated response actually follow from the retrieved evidence?" (Fully supported/Partially supported/Not supported)
- Utility token: "Is this response a useful answer to the question?" (rated on a five-point scale)
The model is trained to generate these tokens in two supervised phases: a critic model is first distilled from GPT-4 annotations of ground-truth reflection tokens, and the generator is then fine-tuned on a corpus augmented with the critic's reflection tokens, using the standard next-token objective (no reinforcement learning is involved). At inference time, the model uses its own reflection tokens to decide whether to retrieve, which passages to use, and whether to regenerate if the response is not supported by the evidence.
The metacognitive student. A standard RAG system is like a student who, for every homework question, goes to the library, grabs the first book on the shelf, and writes an answer based on whatever is in that book. Self-RAG is a student who first asks, "Do I already know the answer?" If yes, she writes it directly. If not, she goes to the library, evaluates whether the book she found is actually relevant, and after writing her answer, re-reads the source to make sure she is not making things up. The key capability is not retrieval — it is knowing when you do not know.
Adaptive-RAG and CRAG
Adaptive-RAG (Jeong et al., 2024) takes the Self-RAG idea further: instead of a binary retrieve/don't-retrieve decision, it classifies queries into three complexity levels and routes them to different strategies:
- Simple queries: Answer directly, no retrieval needed.
- Moderate queries: Single-step retrieval followed by generation.
- Complex queries: Multi-step retrieval with iterative refinement (retrieve, generate a partial answer, retrieve again based on the partial answer, and repeat).
Corrective RAG (CRAG) (Yan et al., 2024) focuses specifically on the evaluation step. After retrieval, a separate evaluator assesses the quality of the retrieved documents. If the documents are good, they are used directly. If they are ambiguous, a refinement step extracts only the relevant passages. If they are bad, the system falls back to web search or alternative knowledge sources.
KARMA: A Multi-Agent Knowledge System
KARMA (NeurIPS 2025) is a multi-agent system for knowledge graph enrichment that exemplifies the agentic RAG philosophy at scale. Instead of a single model handling retrieval and generation, KARMA uses nine specialized agents, each responsible for a different aspect of the knowledge enrichment pipeline:
- An agent that identifies gaps in the existing knowledge graph.
- An agent that formulates queries to fill those gaps.
- An agent that retrieves candidate information from multiple sources.
- An agent that evaluates the quality and reliability of retrieved information.
- An agent that resolves conflicts between retrieved facts and existing knowledge.
- An agent that formats new knowledge into the graph's schema.
- An agent that verifies the consistency of the updated graph.
- An agent that identifies second-order implications of new knowledge (what else changes?).
- A coordinator agent that orchestrates the pipeline and handles errors.
The nine agents communicate through a shared state (the evolving knowledge graph) and a message-passing protocol. Each agent is specialized and relatively simple, but the system as a whole performs a complex knowledge management task that would be far beyond any single model.
KARMA's nine agents form a pipeline that identifies knowledge gaps, retrieves and evaluates new information, resolves conflicts, and updates the knowledge graph with consistency checking.
Agentic RAG replaces the rigid retrieve-then-generate pipeline with an agent (or system of agents) that reasons about the retrieval process itself. The key capabilities are: (1) deciding when retrieval is needed, (2) evaluating retrieved information before using it, (3) iteratively refining retrieval queries, and (4) verifying that the final response is grounded in evidence. These capabilities transform RAG from a one-shot lookup into an active research process.
Architectures for Agentic RAG
Standard RAG: The Pipeline
Standard RAG consists of three components: a document corpus $\mathcal{C}$, a retriever $R$, and a generator $G$. The retriever maps a query $q$ to a set of $k$ documents: $\mathcal{D}_q = R(q) = \{d_1, \ldots, d_k\}$. Retrieval typically uses dense embeddings: $R(q) = \text{top-}k_{d \in \mathcal{C}} \;\text{sim}(E_q(q), E_d(d))$, where $E_q$ and $E_d$ are query and document encoders and $\text{sim}$ is cosine similarity. The generator produces the answer conditioned on both the query and the retrieved documents: $a = G(q, \mathcal{D}_q)$. In practice the generator is an LLM that receives the query and the retrieved passages in its context window.
The retriever and generator are typically trained independently. The retriever optimizes for recall (retrieving relevant documents), and the generator optimizes for answer quality conditioned on having the right documents. This decoupled training creates a fundamental problem: the retriever does not know what the generator needs, and the generator cannot tell the retriever what to look for.
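Before the agentic variants, it helps to see how little machinery the open loop contains. Below is a minimal sketch in Python; `embed` (a hashed bag-of-words toy) and `llm` (a stub) are hypothetical stand-ins, since the point is the retrieve-then-generate structure rather than any particular library.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy encoder (hashed bag-of-words), normalized so dot product = cosine.
    A real system would use a dense sentence encoder (E_q, E_d) here."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def llm(prompt: str) -> str:
    """Placeholder for an LLM call; any chat-completion API slots in here."""
    return f"<answer generated from {len(prompt)} chars of context>"

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Dense retrieval: top-k documents by cosine similarity to the query."""
    q = embed(query)
    scores = [float(embed(d) @ q) for d in corpus]   # sim(E_q(q), E_d(d))
    top = sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
    return [corpus[i] for i in top]

def standard_rag(query: str, corpus: list[str]) -> str:
    """Open-loop RAG: retrieve once, generate once, no checks anywhere."""
    context = "\n\n".join(retrieve(query, corpus))
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

Every failure mode from the previous section traces back to this structure: nothing checks what `retrieve` returns, and nothing checks what `llm` produces.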
Self-RAG: Architecture and Training
Self-RAG addresses this by training the generator to include control tokens in its output vocabulary. Formally, the model $G_\theta$ generates a sequence:
$$y = (t_1, t_2, \ldots, t_T)$$
where each $t_i$ is either a regular text token or a special reflection token from the set $\{\texttt{[Retrieve]}, \texttt{[Relevant]}, \texttt{[Supported]}, \texttt{[Useful]}\}$. The generation process is as follows (a code sketch appears after the list):
- Generate text until producing a $\texttt{[Retrieve]}$ token.
- If $\texttt{[Retrieve]} = \text{Yes}$: invoke the retriever, obtain passages.
- For each passage, generate $\texttt{[Relevant]}$. Keep only relevant passages.
- Generate the response segment.
- Generate $\texttt{[Supported]}$. If not supported, regenerate.
- Repeat until the response is complete.
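The control flow is easiest to see as code. The sketch below is schematic: in the real model the three `predict_*` decisions are reflection tokens decoded by the generator itself, not separate functions; the stubs only keep the example self-contained.

```python
# Hypothetical stubs. In the actual model, the predict_* decisions are the
# [Retrieve], [Relevant], and [Supported] reflection tokens emitted inline,
# and retriever/generate are the usual R and G.
def predict_retrieve(query: str) -> bool: return True
def predict_relevant(query: str, passage: str) -> bool: return bool(passage)
def predict_supported(answer: str, passages: list[str]) -> bool: return True
def retriever(query: str) -> list[str]: return [f"<passage about {query!r}>"]
def generate(query: str, passages: list[str]) -> str:
    return f"<answer to {query!r} grounded in {len(passages)} passages>"

def self_rag(query: str, max_regens: int = 2) -> str:
    """Self-RAG inference: the model's own reflection tokens gate retrieval,
    passage filtering, and regeneration; no external controller is needed."""
    if not predict_retrieve(query):                # [Retrieve] = No
        return generate(query, [])                 # answer from parametric memory
    passages = [p for p in retriever(query)
                if predict_relevant(query, p)]     # keep [Relevant] passages only
    answer = generate(query, passages)
    while max_regens > 0 and not predict_supported(answer, passages):
        answer = generate(query, passages)         # [Not supported] -> regenerate
        max_regens -= 1
    return answer
```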
Training uses a two-phase procedure:
- Critic training. GPT-4 annotates a seed set with ground-truth reflection tokens; a smaller critic model is trained on these annotations and then labels the full training corpus. For each (query, passage, response) triple, the critic determines whether retrieval was needed, whether the passage was relevant, and whether the response was supported.
- Generator training. The generator is fine-tuned on the annotated corpus to produce both text and reflection tokens, using the standard language-modeling loss over the augmented sequences; a sketch of the corpus construction follows.
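A minimal sketch of how phase one feeds phase two, assuming a hypothetical `Critic` interface; the token vocabulary and sequence layout here are illustrative, not the paper's exact format.

```python
class Critic:
    """Hypothetical interface for the trained critic; the real critic is a
    model distilled from GPT-4 annotations, not three boolean methods."""
    def needs_retrieval(self, q: str) -> bool: return True
    def is_relevant(self, q: str, p: str) -> bool: return True
    def supports(self, r: str, p: str) -> bool: return True

def augment_corpus(critic: Critic, triples: list[tuple[str, str, str]]) -> list[str]:
    """Splice the critic's reflection tokens into each (query, passage,
    response) triple; the generator is then fine-tuned on these sequences
    with the ordinary next-token loss."""
    out = []
    for q, passage, response in triples:
        if critic.needs_retrieval(q):
            rel = "[Relevant]" if critic.is_relevant(q, passage) else "[Irrelevant]"
            sup = "[Supported]" if critic.supports(response, passage) else "[Not supported]"
            out.append(f"{q} [Retrieve=Yes] {passage} {rel} {response} {sup}")
        else:
            out.append(f"{q} [Retrieve=No] {response}")
    return out
```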
At inference, the model's own reflection tokens control the generation process. No external retrieval controller is needed — the model itself decides when and how to use retrieval.
Adaptive-RAG: Complexity-Based Routing
Adaptive-RAG adds a query complexity classifier $C$ that routes queries to different processing strategies:
$$C(q) \in \{\text{simple}, \text{moderate}, \text{complex}\}$$
The classifier is a small model trained on a dataset of queries labeled by the complexity of the retrieval needed to answer them. The routing is as follows (sketched in code below):
- $C(q) = \text{simple}$: $a = G(q)$ (no retrieval)
- $C(q) = \text{moderate}$: $a = G(q, R(q))$ (single-step RAG)
- $C(q) = \text{complex}$: iterative RAG with query decomposition
For complex queries, the iterative process is:
- Decompose $q$ into sub-queries $q_1, \ldots, q_m$.
- For each $q_i$: retrieve $\mathcal{D}_{q_i} = R(q_i)$, generate partial answer $a_i = G(q_i, \mathcal{D}_{q_i})$.
- Synthesize: $a = G(q, a_1, \ldots, a_m)$.
Adaptive-RAG routes queries to no-retrieval, single-step, or iterative multi-step strategies based on estimated complexity.
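A sketch of the router, reusing the `generate` and `retriever` stubs from the Self-RAG example above; `classify` stands in for the trained classifier $C$, and `decompose` and `synthesize` stand in for LLM calls (the string heuristics are placeholders only).

```python
# classify stands in for the trained complexity classifier C; decompose and
# synthesize stand in for LLM calls. String heuristics are placeholders.
def classify(query: str) -> str:
    if " and " in query:
        return "complex"
    return "moderate" if "?" in query else "simple"

def decompose(query: str) -> list[str]:
    return [part.strip() for part in query.split(" and ")]   # q -> q_1 ... q_m

def synthesize(query: str, partials: list[str]) -> str:
    return f"<synthesis of {len(partials)} partial answers to {query!r}>"

def adaptive_rag(query: str) -> str:
    """Complexity-based routing: no retrieval, single-step, or iterative."""
    level = classify(query)
    if level == "simple":
        return generate(query, [])                    # a = G(q)
    if level == "moderate":
        return generate(query, retriever(query))      # a = G(q, R(q))
    partials = [generate(sq, retriever(sq))           # a_i = G(q_i, R(q_i))
                for sq in decompose(query)]
    return synthesize(query, partials)                # a = G(q, a_1, ..., a_m)
```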
CRAG: Corrective Retrieval
CRAG adds a retrieval evaluator $E$ that assesses retrieved documents before they reach the generator:
$$E(q, d) \in \{\text{Correct}, \text{Ambiguous}, \text{Incorrect}\}$$
The evaluator is a fine-tuned model that assesses whether a document actually contains the information needed to answer the query. The evaluation triggers different actions:
| Evaluation | Action | Rationale |
|---|---|---|
| Correct | Use documents, apply knowledge refinement | Docs contain the answer but may include noise |
| Ambiguous | Combine refined documents with web search | Hedge by using multiple sources |
| Incorrect | Discard documents, use web search only | Retrieved docs would mislead the model |
The "knowledge refinement" step is critical: even when documents are relevant, they may contain irrelevant sentences that dilute the model's attention. CRAG uses a decompose-then-recompose strategy: split each document into sentences, filter out irrelevant sentences, and concatenate only the relevant ones. This gives the generator a focused, noise-free context.
Multi-Agent RAG Architectures
In multi-agent RAG, the retrieval, evaluation, and generation steps are handled by separate specialized agents. A typical architecture:
- Query planner agent: analyzes the query, determines what information is needed, and formulates retrieval queries.
- Retrieval agent: executes retrieval across multiple knowledge sources (vector store, web search, structured databases).
- Evaluator agent: assesses retrieved information for relevance, accuracy, and recency.
- Synthesis agent: generates the final response from the curated evidence.
- Verification agent: checks the response against the evidence and flags unsupported claims.
This decomposition has several advantages over monolithic Self-RAG: each agent can be a different model (e.g., a specialized retrieval model for the retriever, a strong reasoning model for the synthesizer), agents can be updated independently, and the pipeline can be monitored at each stage.
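A sketch of this decomposition; every `*_agent` function is a hypothetical stand-in for a separately prompted (and possibly separately hosted) model, and the retry-with-feedback step is one simple way to close the loop, not the only one.

```python
# Each *_agent is a stand-in for a separate LLM-backed agent. The toy bodies
# just keep the pipeline runnable; only the orchestration logic matters.
def planner(query: str) -> list[str]:
    return [query]                                    # trivial plan: one sub-query

def retrieval_agent(sub_query: str) -> list[str]:
    return [f"<passage about {sub_query!r}>"]

def evaluator_agent(sub_query: str, passage: str) -> bool:
    return True                                       # accept everything (toy)

def synthesis_agent(query: str, evidence: list[str]) -> str:
    return f"<answer to {query!r} from {len(evidence)} passages>"

def verification_agent(draft: str, evidence: list[str]) -> list[str]:
    return []                                         # no unsupported claims (toy)

def multi_agent_rag(query: str, max_rounds: int = 2) -> str:
    """Planner -> retrieval -> evaluation -> synthesis -> verification, with
    one retry loop: unsupported claims are fed back into the next round."""
    draft = ""
    for _ in range(max_rounds):
        evidence = [p for sq in planner(query)
                    for p in retrieval_agent(sq)
                    if evaluator_agent(sq, p)]
        draft = synthesis_agent(query, evidence)
        unsupported = verification_agent(draft, evidence)
        if not unsupported:
            return draft                              # verified: ship it
        query += " (verify: " + "; ".join(unsupported) + ")"
    return draft                                      # best effort after retries
```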
Agentic RAG systems all share a common design principle: close the loop between generation and evidence. Standard RAG has an open loop: retrieve once, generate once, hope for the best. Agentic RAG closes the loop with evaluation, verification, and iterative refinement. Each closing of the loop adds compute cost but dramatically improves reliability. The cost-quality tradeoff is the central engineering decision in agentic RAG design.
Information Theory and Verification
When Does Retrieval Help?
An information-theoretic framework clarifies when retrieval adds value. Let $Q$ be the query, $A$ the target answer, $M$ the model's internal knowledge, and $D$ the retrieved documents. The model's generation quality is a function of the mutual information between its total context and the answer:
$$\text{Quality} \propto I(A; M, D \mid Q)$$
By the chain rule of mutual information:
$$I(A; M, D \mid Q) = I(A; M \mid Q) + I(A; D \mid M, Q)$$
The first term is the information the model already has about the answer (from its training data). The second term is the novel information in the retrieved documents: information about the answer that the model does not already possess. Retrieval helps if and only if $I(A; D \mid M, Q) > 0$.
When is this term zero (retrieval does not help)?
- The model already knows the answer: $I(A; M \mid Q) = H(A \mid Q)$, so $I(A; D \mid M, Q) = 0$. Self-RAG's retrieve token captures this: don't retrieve if the model is already confident.
- The retrieved documents are irrelevant: $D \perp A \mid Q, M$, so $I(A; D \mid M, Q) = 0$. CRAG's evaluator captures this: discard irrelevant documents.
When is this term large (retrieval helps a lot)?
- The model lacks knowledge: $I(A; M \mid Q)$ is small. Retrieval can provide the missing information.
- The answer requires recent information: the model's knowledge is outdated ($M$ does not contain recent facts), but the retrieved documents do.
- The answer requires specific details: the model has vague knowledge ($I(A; M \mid Q)$ is moderate but $H(A \mid Q)$ is high), and retrieval provides the precise details.
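This decomposition can be checked numerically on a toy distribution. The script below fixes a query, builds a joint pmf over (answer $A$, model knowledge $M$, retrieved document $D$), and computes $I(A; D \mid M)$ directly; all distributions are illustrative.

```python
from math import log2

def cond_mutual_info(joint, X, Y, Z):
    """I(X; Y | Z) in bits from a joint pmf {outcome_tuple: prob}; X, Y, Z
    are tuples of component indices selecting variables from each outcome."""
    def marg(keys, outcome):
        return sum(p for o, p in joint.items()
                   if all(o[k] == outcome[k] for k in keys))
    total = 0.0
    for o, p in joint.items():
        if p == 0.0:
            continue
        pz, pxz = marg(Z, o), marg(X + Z, o)
        pyz, pxyz = marg(Y + Z, o), marg(X + Y + Z, o)
        total += p * log2(pz * pxyz / (pxz * pyz))
    return total

def make_joint(p_model_knows: float, p_doc_correct: float) -> dict:
    """Joint over (A, M, D) for a fixed query: A is a uniform bit; the model's
    memory M equals A with prob p_model_knows (else a coin flip); the
    retrieved document D equals A with prob p_doc_correct."""
    joint = {}
    for a in (0, 1):
        for m in (0, 1):
            pm = p_model_knows * (m == a) + (1 - p_model_knows) * 0.5
            for d in (0, 1):
                pd = p_doc_correct if d == a else 1 - p_doc_correct
                joint[(a, m, d)] = 0.5 * pm * pd
    return joint

# Model already knows the answer: the document adds zero extra bits.
print(cond_mutual_info(make_joint(1.0, 0.95), X=(0,), Y=(2,), Z=(1,)))   # 0.0
# Model is ignorant: the same document is worth ~0.714 bits about the answer.
print(cond_mutual_info(make_joint(0.0, 0.95), X=(0,), Y=(2,), Z=(1,)))   # ~0.714
```

When $M$ determines $A$, the document contributes exactly zero bits; when $M$ is an uninformative coin flip, the same document carries $1 - H(0.05) \approx 0.714$ bits about the answer.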
Multi-Agent Verification as Hypothesis Testing
The verification step in agentic RAG can be formalized as a hypothesis test. Given a generated response $a$ and a set of evidence documents $\mathcal{D}$, we test:
- $H_0$: The response $a$ is supported by the evidence $\mathcal{D}$ (grounded).
- $H_1$: The response $a$ contains claims not supported by $\mathcal{D}$ (hallucinated).
A single verification model has some error rate: false positive rate $\alpha$ (accepting a hallucinated response) and false negative rate $\beta$ (rejecting a grounded response). With $K$ independent verifiers, the aggregate error rates can be reduced:
Majority vote verification. Suppose each verifier independently accepts a hallucinated response with probability $\alpha$ and rejects a grounded one with probability $\beta$. Majority vote among $K$ verifiers then accepts a hallucinated response with probability:
$$\alpha_K = \sum_{j > K/2} \binom{K}{j} \alpha^j (1-\alpha)^{K-j}$$
For $\alpha = 0.1$ and $K = 5$: $\alpha_5 \approx 0.0086$, an order of magnitude below the single-verifier rate. The false positive rate drops dramatically with the number of verifiers. The analysis parallels Condorcet's Jury Theorem, applied to verification rather than generation. The short computation below reproduces these numbers.
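The helper evaluates the binomial sum directly; odd $K$ avoids ties.

```python
from math import comb

def majority_false_accept(alpha: float, K: int) -> float:
    """P(majority of K independent verifiers accepts a hallucinated response)
    when each one falsely accepts with probability alpha; K odd, so no ties."""
    return sum(comb(K, j) * alpha**j * (1 - alpha)**(K - j)
               for j in range(K // 2 + 1, K + 1))

for K in (1, 3, 5, 7):
    print(K, round(majority_false_accept(0.10, K), 5))
# 1 0.1
# 3 0.028
# 5 0.00856
# 7 0.00273
```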
Convergence of Iterative Retrieval-Refinement
Iterative RAG (as in the complex query path of Adaptive-RAG) can be modeled as a fixed-point iteration. Define the state $s_t = (q_t, \mathcal{D}_t, a_t)$ (current query, current documents, current partial answer) at iteration $t$. The update rule is:
$$q_{t+1} = \text{Refine}(q, a_t) \quad \text{(refine query based on partial answer)}$$
$$\mathcal{D}_{t+1} = R(q_{t+1}) \quad \text{(retrieve with refined query)}$$
$$a_{t+1} = G(q, \mathcal{D}_{t+1}, a_t) \quad \text{(generate updated answer)}$$
Convergence holds if the composition $F = G \circ R \circ \text{Refine}$ is a contraction in an appropriate metric space. Define the quality metric $d(a_t, a^*)$ as the distance between the current answer and the optimal answer. If:
$$d(a_{t+1}, a^*) \leq \gamma \cdot d(a_t, a^*) + \epsilon$$
for some $\gamma < 1$ and error floor $\epsilon \geq 0$, then the iteration converges to a fixed point within $\epsilon / (1 - \gamma)$ of the optimum. The error floor $\epsilon$ represents the irreducible error from imperfect retrieval and generation. The contraction rate $\gamma$ depends on how effectively each iteration improves the answer.
In practice, iterative RAG typically converges in 2–4 iterations: the first retrieval provides the bulk of the information, the second fills in gaps, and subsequent iterations show diminishing returns. This is consistent with a contraction rate $\gamma \approx 0.3$–$0.5$.
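A few lines of simulation make the diminishing returns visible; $\gamma$, $\epsilon$, and the initial distance are illustrative values, not measurements.

```python
def iterate_quality(d0: float, gamma: float, eps: float, steps: int) -> list[float]:
    """Distance to the optimal answer under d_{t+1} = gamma * d_t + eps."""
    ds = [d0]
    for _ in range(steps):
        ds.append(gamma * ds[-1] + eps)
    return ds

# With gamma = 0.4 and a small error floor, most of the improvement arrives
# in the first two iterations, matching the 2-4 iteration pattern above.
print([round(d, 3) for d in iterate_quality(d0=1.0, gamma=0.4, eps=0.05, steps=5)])
# [1.0, 0.45, 0.23, 0.142, 0.107, 0.093] -> fixed point eps/(1-gamma) ~ 0.083
```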
The Retrieval-Computation Tradeoff
There is a fundamental tradeoff between retrieval (bringing in external information) and computation (reasoning about existing information). Define the total cost of answering a query as:
$$\text{Cost} = c_R \cdot n_R + c_G \cdot n_G$$
where $c_R$ is the cost per retrieval operation, $n_R$ is the number of retrievals, $c_G$ is the cost per generation step, and $n_G$ is the number of generation steps. The quality is a concave function of both:
$$\text{Quality}(n_R, n_G) = Q_{\max}\left(1 - e^{-\alpha n_R} \cdot e^{-\beta n_G}\right)$$
This model captures several empirical observations: quality increases with both retrieval and computation but with diminishing returns; there is an interaction effect (retrieval helps more when paired with sufficient computation to use it, and vice versa); and the optimal allocation depends on the relative costs $c_R/c_G$ and the problem's information requirements (captured by $\alpha$ and $\beta$).
For knowledge-intensive tasks ($\alpha$ large): invest heavily in retrieval. For reasoning-intensive tasks ($\beta$ large): invest in computation. For tasks requiring both: the optimal allocation gives roughly equal marginal returns to the last unit of retrieval and computation.
The optimal allocation between retrieval and computation depends on whether the task is knowledge-intensive, reasoning-intensive, or both.
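A small brute-force search makes the prescriptions concrete. The parameter values are illustrative; note that because the exponent $\alpha n_R + \beta n_G$ in this particular separable form is linear in the allocation, the optimum under a fixed budget lands at a corner determined by comparing the marginal return per unit cost, $\alpha/c_R$ versus $\beta/c_G$.

```python
import math

def quality(n_r: float, n_g: float, a: float, b: float, q_max: float = 1.0) -> float:
    """Quality(n_R, n_G) = Q_max * (1 - exp(-a*n_R) * exp(-b*n_G))."""
    return q_max * (1.0 - math.exp(-a * n_r) * math.exp(-b * n_g))

def best_allocation(budget: float, c_r: float, c_g: float, a: float, b: float):
    """Brute-force the integer allocation maximizing quality subject to
    c_R*n_R + c_G*n_G <= budget."""
    best = (0, 0, quality(0, 0, a, b))
    for n_r in range(int(budget // c_r) + 1):
        n_g = int((budget - c_r * n_r) // c_g)
        q = quality(n_r, n_g, a, b)
        if q > best[2]:
            best = (n_r, n_g, q)
    return best

# Knowledge-intensive task (alpha >> beta): spend the budget on retrieval.
print(best_allocation(budget=10, c_r=1.0, c_g=1.0, a=0.8, b=0.2))  # ~(10, 0, ...)
# Reasoning-intensive task (beta >> alpha): spend it on generation steps.
print(best_allocation(budget=10, c_r=1.0, c_g=1.0, a=0.2, b=0.8))  # ~(0, 10, ...)
```

A quality model with per-channel saturation rather than a shared exponent would typically yield an interior optimum with equal marginal returns per unit cost, which is the balanced regime described above for tasks requiring both.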
Agentic RAG is, at its core, an active information acquisition problem. The agent decides what information to gather (retrieval), how much to gather (adaptive complexity routing), how to assess its quality (evaluation), and when to stop gathering and start producing the answer (convergence criterion). The information-theoretic analysis shows that retrieval helps exactly when the model lacks relevant knowledge, and multi-agent verification reduces hallucination rates exponentially with the number of verifiers. The practical design question is not whether to use agentic RAG, but how to balance the cost of additional retrieval and verification against the quality improvement it provides.
Further Reading
- Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020.
- Asai, A. et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.
- Yan, S.-Q. et al. (2024). Corrective retrieval augmented generation. arXiv:2401.15884.
- Jeong, S. et al. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. NAACL 2024.
- Gao, Y. et al. (2024). Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997.