
Jan 21, 2026 • RAG • LLM • Enterprise AI • Knowledge Management • AI Architecture
Understanding RAG: What It Is, What It Isn't, and How It's Actually Used
Written by: Mycellia Team
Anyone who has worked with large language models (LLMs) in the last few years has had the same moment. At first, everything feels impressive. The model answers questions. Writes code. Summarizes documents. You think, 'Okay, this works.' Then a question comes in. A bit more specific. A bit more up to date. Or something company-specific. And the model… makes things up. Not hesitantly. Not tentatively. Very confidently.
At some point, you realize what's really happening: the model doesn't know — it's just predicting. The Stanford paper On the Opportunities and Risks of Foundation Models points exactly to this issue. Foundation models are extremely powerful, but by design they operate with a closed form of memory. Everything they learn during training is absorbed into the model's parameters. In the literature, this is called parametric knowledge.
In other words, everything the model 'knows' is a statistical summary of what it has seen in the past. For many use cases, this is enough. But this is also where real-world problems begin.
Why isn't parametric knowledge enough? The model is not up to date — it naturally doesn't know what happened after its training data ended. The model doesn't realize what it doesn't know — it answers even when it shouldn't. The model can't point to sources — when you ask 'Where does this come from?', there's no real answer. The model is blind to company data — it doesn't know your documents, policies, or internal systems.
The Stanford paper puts this very clearly: foundation models were never designed to solve every problem on their own. Their real strength comes from the systems built around them. This leads to an important distinction: Parametric knowledge is information stored inside the model's weights. Non-parametric knowledge is information the model pulls from an external source while generating an answer. And this is exactly where RAG comes in.
Retrieval-Augmented Generation (RAG) emerged when it became clear that 'let's just make the model bigger' was not enough. The idea is actually very simple: The model doesn't have to answer every question. It can look things up first. If needed, it can pull information from an external source. Then it can speak.
In other words, the model no longer relies only on its memory. Before generating an answer, it performs retrieval. The information it pulls from outside is what we call non-parametric knowledge. You can think of it like this: The model doesn't have to act like someone who knows everything. It acts more like someone with a notebook next to them.
RAG is not an alternative to foundation models. It's a system approach that complements them. Foundation models provide the intelligence. RAG is what connects them to the real world.
Normally, when you ask an LLM a question, this is what happens: the model generates the most likely answer based on what it saw during training. In other words, it speaks by looking backward. RAG gives the model a different option: 'Wait. Let me check first. Maybe there's more accurate information outside.'
In one sentence: Retrieval-Augmented Generation (RAG) is a setup where an LLM retrieves relevant information from an external source before generating an answer, and uses that information as context during inference.
What RAG does is combine parametric and non-parametric knowledge. The model still understands language. It still builds context. It still generates answers. But it no longer speaks blindly.
How does RAG work? A user asks a question. For example: 'What is our company's remote work policy?' In a system without RAG, the model recalls the concept of 'remote work' from training, generates a generic answer, and it's very likely wrong. In a RAG-based system, the question is received, the system performs retrieval related to that question, relevant parts are pulled from internal documents and policy files, and the model generates an answer based on this content.
This flow is described in the original RAG work using two core components: the Retriever (the part that finds the most relevant documents for the question) and the Generator (the LLM that produces the answer using those documents).
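To make that split concrete, here is a minimal sketch in Python. Everything in it is illustrative: `call_llm` stands in for whatever model API you use, and the toy word-overlap retriever stands in for a real retrieval backend.

```python
# Minimal retriever + generator split. `call_llm` is a stand-in for any LLM API;
# the word-overlap scoring is a toy retriever, not a real search backend.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def answer(question: str, documents: list[str]) -> str:
    """Generator: the LLM answers using only the retrieved context."""
    context = "\n\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The point is the separation of concerns: retrieval decides what the model gets to see, generation decides how it is said.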
The real difference with RAG is this: the retrieved information isn't an afterthought tacked onto the prompt; the answer is generated from it. This small difference changes a lot: the knowledge base can be updated, incorrect documents can be removed, different sources can be tested, and none of it requires retraining the model.
RAG helps with a lot of things: it largely solves the freshness problem, it connects company or domain-specific data to the model, and it reduces hallucinations. But it doesn't guarantee retrieval will be correct every time; with the wrong document in hand, the model can still produce a wrong answer; and on its own it doesn't deliver perfect accuracy. So no — RAG is not magic. But when used correctly, it's one of the most important steps toward making LLMs actually usable in the real world.
Since RAG started gaining attention, a lot of unrealistic expectations have formed around it. Adding RAG doesn't suddenly turn your system into a 'correct answer machine.' If retrieval is wrong, the model will look at the wrong information and produce a very convincing wrong answer. RAG doesn't prevent wrong answers. RAG changes where the error comes from. Before, the model was making things up. Now it might be speaking confidently while looking at the wrong document.
RAG is not a vector database. Setting up a vector database and embedding documents does not mean you have RAG. That's just retrieval infrastructure. When we talk about RAG, we're talking about the full set of questions: Which document was retrieved? Why this document? How is this information actually used in the answer?
RAG is not an alternative to fine-tuning. They solve different problems. Fine-tuning changes behavior — it affects tone, format, and reasoning style. RAG changes access to information — it determines what the model looks at while speaking. If you want 'the model to answer in this style' → fine-tuning. If you want 'the model to look at the right information' → RAG.
RAG does not completely eliminate hallucinations. Yes, RAG reduces hallucinations. But it does not guarantee that the model will correctly interpret the retrieved information, that the retrieved information is actually correct, or that conflicting sources will be detected. That's why we're seeing approaches like Corrective RAG, Adaptive RAG, and Self-Reflective RAG.
You can't fix everything with a prompt. Wrong retrieval → bad context → bad answer. No matter how good your prompt is, you can't turn the wrong document into the right one and you can't fill in missing information. RAG is an information retrieval problem, a system design problem, and an evaluation problem. The prompt is just the last step in that system.
Naive RAG performs retrieval once, takes a few of the closest chunks, puts them into the prompt, and waits for the model to answer. This approach works with small datasets, for clear, well-defined questions, and when little context is required. But Naive RAG is a starting point, not a goal.
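Where do those chunks come from? Usually from a preprocessing step that splits documents into fixed-size, overlapping pieces. A rough sketch, with made-up size and overlap values:

```python
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split a document into overlapping fixed-size character chunks.
    500/100 are illustrative defaults, not recommendations."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Naive RAG embeds these chunks once, picks the few closest to the question, and stuffs them into a single prompt, much like the retriever/generator sketch above.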
Dense retrieval converts both the question and the documents into embedding vectors and searches based on semantic similarity. Even if a keyword doesn't appear exactly, content with similar meaning can still be found. This is a big leap compared to traditional keyword search. But being semantically similar does not always mean being correct, especially for dates, versions, and policy changes.
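A sketch of that idea, assuming an `embed` function that could be backed by any embedding model (a sentence-transformers model, an embedding API, and so on):

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder: call an embedding model here (library or API of your choice)."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dense_retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by embedding similarity to the question, not by keyword match.
    In a real system the chunk embeddings are computed once and stored in an index."""
    q_vec = embed(question)
    return sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:k]
```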
Hybrid retrieval combines semantic search (dense) with keyword search. This approach really shines when dealing with specific terms, version numbers, code, IDs, policy names, and situations where 'this word must appear.'
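One common way to combine the two signals is reciprocal rank fusion. The sketch below assumes you already have a dense ranking (like `dense_retrieve` above) and a keyword ranking (BM25, for example) and simply merges them:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 3) -> list[str]:
    """Reciprocal rank fusion: merge several rankings (e.g., dense + keyword)
    into one list. k=60 is the constant conventionally used for RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage sketch:
# hybrid_results = rrf_fuse([dense_ranking, keyword_ranking])
```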
Query rewriting uses the LLM before retrieval to expand the question, clarify it, and generate alternative queries. Here, the LLM is not producing an answer. It's helping retrieval. This small change can make a dramatic difference, especially for complex questions.
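A minimal version of query rewriting, reusing the `call_llm` and `dense_retrieve` placeholders from the earlier sketches; the prompt wording is only an illustration:

```python
# Uses call_llm and dense_retrieve from the earlier sketches.

def rewrite_queries(question: str, n: int = 3) -> list[str]:
    """Ask the LLM for alternative phrasings of the question before retrieval."""
    prompt = (
        f"Rewrite the following question as {n} alternative search queries, "
        f"one per line:\n{question}"
    )
    rewrites = [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]
    return [question] + rewrites

def retrieve_with_rewrites(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Run retrieval for the original question and every rewrite, then deduplicate."""
    seen: set[str] = set()
    results: list[str] = []
    for query in rewrite_queries(question):
        for chunk in dense_retrieve(query, chunks, k):
            if chunk not in seen:
                seen.add(chunk)
                results.append(chunk)
    return results
```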
Multi-step RAG handles questions where the answer doesn't live in a single document, multiple steps are required, and you need to learn something first, then continue. It performs an initial retrieval, forms an intermediate answer or inference, runs a new retrieval based on what it learned, and then produces the final answer. This approach closely mirrors how humans actually do research.
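A simplified loop version of that idea, again built on the `call_llm` and `dense_retrieve` placeholders; the 'SEARCH:' convention is invented here purely to keep the sketch readable:

```python
# Uses call_llm and dense_retrieve from the earlier sketches.

def multi_step_answer(question: str, chunks: list[str], max_steps: int = 3) -> str:
    """Retrieve, let the model reason, and retrieve again if it asks for more."""
    notes: list[str] = []
    query = question
    for _ in range(max_steps):
        notes.extend(dense_retrieve(query, chunks))
        context = "\n\n".join(notes)
        decision = call_llm(
            "Using the context, either answer the question, or reply with\n"
            "'SEARCH: <follow-up query>' if you need more information.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        if decision.startswith("SEARCH:"):
            query = decision.removeprefix("SEARCH:").strip()
        else:
            return decision
    # Out of steps: answer with whatever was gathered.
    context = "\n\n".join(notes)
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")
```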
At MyCellia, while trying to build a meaningful enterprise search across all of a company's data, we saw that single-step retrieval was often not enough. There were many cases where the model had to use what it found first to perform a second — sometimes even a third — search. So in practice, the system naturally evolved into a multi-step setup.
Adaptive RAG determines whether retrieval is actually needed for a given question. Self-Reflective / Corrective RAG questions the retrieved documents, asks 'Did this information actually help?', and if needed, corrects itself or tries again. At this point, RAG is no longer a simple pipeline. The model becomes part of a system that actively evaluates the quality of its own outputs.
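These named approaches each have their own papers and details; the sketch below only captures the shared shape, using the same placeholders as before: a gate that decides whether to retrieve at all, a grading step on the retrieved context, and one corrective retry with a rewritten query.

```python
# Uses call_llm and dense_retrieve from the earlier sketches.

def adaptive_answer(question: str, chunks: list[str]) -> str:
    """Adaptive gate + self-reflective grading + one corrective retry (a sketch,
    not a faithful implementation of any specific paper)."""
    needs_retrieval = call_llm(
        f"Does answering this question require internal documents? yes/no\n{question}"
    )
    if needs_retrieval.strip().lower().startswith("no"):
        return call_llm(question)             # answer from parametric knowledge alone

    query = question
    context = ""
    for _ in range(2):                        # at most one corrective retry
        context = "\n\n".join(dense_retrieve(query, chunks))
        grade = call_llm(
            "Does this context actually contain what is needed to answer? yes/no\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        if grade.strip().lower().startswith("yes"):
            return call_llm(f"Context:\n{context}\n\nQuestion: {question}")
        query = call_llm(f"Rewrite this question to retrieve better documents:\n{question}")
    return call_llm(
        "Answer if you can, and say explicitly if the context is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```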
The most common mistakes with RAG: Adding more documents doesn't always make it better — the model sees too much information that doesn't help. Not measuring retrieval quality — trying to improve generation without measuring retrieval is usually wasted effort. Doing retrieval for every question hurts both performance and quality. Trying to fix everything with prompting — RAG is not a prompt problem. Treating RAG as 'set it and forget it' — without monitoring, evaluation, and feedback loops, RAG systems tend to run into problems.
If RAG isn't working the way you expect, don't look at the model first — look at retrieval first. Because most of the time, the problem is here: The model isn't speaking incorrectly. It's speaking while looking at the wrong thing.
RAG is not a library, a prompt trick, or a plug-and-play feature. RAG is a system design problem. What matters is not just what an LLM says, but what it is looking at while saying it. Foundation models are still at the center of everything. But what connects them to the real world is not RAG itself — it's how RAG is designed.
Many of the things discussed in this article are not theoretical. Most of them are problems we've encountered in real systems and had to solve in practice. At MyCellia, we don't look at RAG as 'does it work?' but as 'does it actually help?'