The Hidden Layer of RAG: Data Extraction, Chunking, and Embeddings

Jan 26, 2026 RAG • Data Extraction • Chunking • Embeddings • Vector Database • AI Architecture

Written by: Mycellia Team

In the previous article, we talked about RAG. We explained where retrieval comes in and why an LLM alone is not enough. But naturally, we mostly looked at the result: 'The model goes, finds a document, and produces an answer.' In practice, that is only where RAG begins, because retrieval is not the core problem. Retrieval is a result, an output. What really determines the quality of that result is everything happening behind the scenes.

Mainly these three steps: how you extract the data (Data Extraction), how you split it (Chunking), and how you turn meaning into vectors (Embedding). There is a very familiar moment that everyone experiences when building RAG for the first time: 'We have documents, we created embeddings, we set up a vector database… but the answers are still nonsense.'

At this point, most people instinctively attack the prompt. They make it more detailed, add roles, tweak the temperature… But the problem is usually not the prompt. The problem is what you are giving to the LLM.

On paper, the RAG architecture looks very clean. But as emphasized in the Retrieval-Augmented Generation paper, the retrieval layer depends entirely on a meaningful representation. What the model is really doing is this: 'In the representation space you gave me, let me find the piece closest to the question.' If that representation is poorly extracted, badly chunked, or embedded without real meaning — then no matter how good the LLM is, it will explain the wrong piece perfectly.

This is a bit like waking up in the wrong simulation in The Matrix. You took the red pill, but you are still in the wrong world. That's why in this article, we won't look at RAG backwards from retrieval. We'll look at it forwards from the data.

There is a very common situation in RAG systems: 'Retrieval seems to work, but the model keeps bringing irrelevant chunks.' When this happens, the instinct is usually to look at retrieval itself or the prompt. But as clearly shown in work on Unstructured Data Analysis and Semi-Structured Data Extraction, the problem often starts before retrieval — at the data extraction stage.

Because LLMs do not see the document, do not know the file format, and do not understand tables as tables. In the LLM's world, there is only one reality: the text + metadata representation you extracted. Whatever that representation is, the embedding becomes that, the vector space becomes that, and retrieval simply navigates inside that space.

The most common mistake is treating extraction as 'just pull the text and move on.' But in a production RAG system, a good extraction output includes the text itself, its position inside the document, its contextual meaning, and relevant metadata. This is exactly why text is never sent to the vector store by itself — chunks are stored together with metadata and source information.
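As a rough sketch, a single extraction unit can be represented as a small data structure that keeps the text and its context together. The field names below (source, section, position, metadata) are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedChunk:
    """One unit of extraction output: the text plus everything
    retrieval will later need in order to interpret it."""
    text: str       # the content of this piece
    source: str     # which document it came from
    section: str    # the heading / section it belongs to
    position: int   # order inside the document
    metadata: dict = field(default_factory=dict)  # doc type, date, access level, ...


# Example of what will later be embedded and stored, together with its metadata
chunk = ExtractedChunk(
    text="Refunds are processed within 14 days of the request.",
    source="refund-policy.pdf",
    section="3. Refund Timelines",
    position=12,
    metadata={"doc_type": "policy", "last_updated": "2025-11-03"},
)
```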

Not every data type can be extracted in the same way. PDF has visual hierarchy but often no semantic hierarchy. HTML headings and structure can usually be preserved. Semi-structured data has structured fields mixed with free text. PDFs are especially problematic — headings are often lost, tables turn into plain text, and context gets fragmented. At this point, the extraction mistake is permanently carried into the embedding. It cannot be fixed later with a prompt.

Extracting a table incorrectly means the embedding treats it as 'a story' instead of preserving the relationship between entity, attribute, and value. This difference directly affects retrieval and determines whether questions are routed to the correct chunk or completely miss it.
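To make the difference concrete, here is a minimal sketch with a single, made-up pricing-table row: one version flattened the way a naive PDF-to-text pass often produces it, and one serialized so the entity, attribute, and value relationship survives into the embedding. The serialization format is just one common convention, not a standard.

```python
# One row of a (made-up) pricing table, as structured data after correct extraction
row = {"Plan": "Pro", "Monthly price": "$49", "Seats": "10", "Support": "Priority"}

# What a naive PDF-to-text pass often produces: bare values, no relationships
naive_text = "Pro $49 10 Priority"


# A structure-preserving serialization: every value keeps its column name,
# so the embedding can still connect "Pro" with "Monthly price: $49"
def serialize_row(row: dict, entity_key: str = "Plan") -> str:
    entity = row[entity_key]
    attributes = "; ".join(
        f"{key}: {value}" for key, value in row.items() if key != entity_key
    )
    return f"{entity_key} {entity} - {attributes}"


print(serialize_row(row))
# Plan Pro - Monthly price: $49; Seats: 10; Support: Priority
```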

Why is this layer so often underestimated? Because extraction is invisible, there's no flashy demo, and it's not as 'cool' as prompt engineering. But the performance of RAG systems is largely limited by the quality of the input representation. Until the extraction layer is fixed, changing the embedding model or the vector database has only a limited effect.

At some point, everyone building RAG has the same idea: 'Let's make the chunks bigger. More context = better answers.' It sounds reasonable. But both the literature and real production experience say: No — it's not that simple. The core problem shown in 'Lost in the Middle: How Language Models Use Long Contexts' is this: LLMs do not use long contexts evenly. They remember what is at the beginning and at the end. What's in the middle often disappears.

When you put a 4–5 page document into a single chunk and retrieval works, the LLM is now dealing with thousands of tokens, multiple subtopics, and several possible answer candidates. The result: the model gives an answer that is 'close' but blurry, the correct information exists but sits in the wrong place, and sometimes the critical sentence in the middle is completely ignored. As important information is pushed toward the middle of a chunk, the probability that the model actually uses it drops significantly. So: 'Bigger chunks' does not equal 'Better reasoning.'

Small chunks don't save you either. When you make chunks too small, context gets fragmented, sentences become meaningless on their own, and embeddings capture only surface-level similarity. Retrieval may return the chunk, but when the LLM tries to generate an answer, it has to fill in the gaps — and those gaps are usually filled with imagination.

Chunking is not a preprocessing detail. It is a decision about how knowledge will be accessed. That's why fixed-size chunking is usually insufficient and semantic or structure-aware chunking is more stable. Especially when headings, subheadings, and section boundaries are used as chunk boundaries — embeddings become more meaningful, retrieval becomes more consistent, and the context given to the LLM stays focused on a single topic.
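As a sketch, structure-aware chunking over a Markdown-like document can be as simple as splitting at heading boundaries and keeping each heading attached to its body. Real documents need a more robust parser, but the idea is the same:

```python
import re


def chunk_by_headings(document: str) -> list[dict]:
    """Split a Markdown-like document at heading boundaries, keeping
    each heading together with the text that belongs to it."""
    chunks = []
    current_heading = "Introduction"
    current_lines: list[str] = []

    def flush() -> None:
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"section": current_heading, "text": text})

    for line in document.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()                          # a new section starts: close the previous one
            current_heading = match.group(1)
            current_lines = []
        else:
            current_lines.append(line)

    flush()                                  # do not lose the last section
    return chunks
```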

Overlap means that consecutive chunks share a small portion of the same text. It's used because when chunks are split too sharply, sentences can be cut in half, context can be split into two pieces, and an important idea can land exactly on the boundary. Overlap is useful when chunks are split by token or character count, not by semantic boundaries. But overlap becomes harmful when chunking is already based on headings or sections and the same information keeps getting embedded repeatedly — retrieval returns the same content again and again, the context gets bloated, and the LLM reads the same sentence three times.
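When you do have to split by size rather than structure, overlap is only a few lines of code: each new window starts a fixed distance before the previous one ends. A minimal sketch by character count; the sizes (500 characters, 50 of overlap) are placeholders, not recommendations:

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunking where consecutive chunks share `overlap`
    characters, so a sentence cut at a boundary still appears whole
    in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```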

When chunking is wrong, even the correct embedding model can't do its job, the vector database returns the closest match but not the right context, and the LLM tries to complete the missing context. And yes — this is where hallucination starts.

One of the most common sentences you hear when people talk about RAG is: 'We embedded the documents.' This sentence is usually technically correct, but it creates a very wrong mental model. Embedding is not just turning text into numbers, and it is not simply compressing text. Embedding is placing the semantic representation of text into a mathematical space. The model is not learning the text itself; it is learning what the text is similar to.

An embedding model takes each piece of text and converts it into a fixed-length vector placed somewhere in a space. In that space, texts with similar meaning are close to each other and unrelated texts are far apart. What happens during retrieval is essentially: 'Find the document embeddings closest to the question embedding.' This closeness is usually measured with cosine similarity.
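Once the vectors exist, that retrieval step is only a few lines. A minimal sketch with NumPy and made-up three-dimensional vectors standing in for real embeddings:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for embed(question) and embed(chunk)
question_vec = np.array([0.2, 0.7, 0.1])
chunk_vecs = {
    "refund policy chunk": np.array([0.25, 0.65, 0.05]),
    "office hours chunk":  np.array([0.90, 0.05, 0.40]),
}

scores = {name: cosine_similarity(question_vec, vec) for name, vec in chunk_vecs.items()}
print(max(scores, key=scores.get))  # -> "refund policy chunk"
```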

In RAG systems, a bi-encoder is used almost all the time. The question is encoded separately, the document is encoded separately, and similarity is calculated afterward. This matters because you can precompute document embeddings and search through millions of documents very fast. The alternative cross-encoder reads the question and document together — more accurate but much slower. That's why in RAG: bi-encoder for retrieval, cross-encoder for reranking if used.
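A hedged sketch of that two-stage pattern using the sentence-transformers library (the model names are common examples, not a recommendation): the bi-encoder narrows everything down to a handful of candidates, and the cross-encoder reranks only those few.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: bi-encoder. Question and documents are encoded independently,
# so document vectors can be precomputed and searched at scale.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")            # example model
doc_texts = ["Refunds take 14 days.", "Our office is in Berlin.", "The Pro plan costs $49."]
doc_vecs = bi_encoder.encode(doc_texts, convert_to_tensor=True)

question = "How long do refunds take?"
q_vec = bi_encoder.encode(question, convert_to_tensor=True)
hits = util.semantic_search(q_vec, doc_vecs, top_k=2)[0]        # fast candidate retrieval

# Stage 2: cross-encoder. Reads question and candidate together: slower, sharper.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
candidates = [doc_texts[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([(question, doc) for doc in candidates])
best = candidates[int(scores.argmax())]
```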

The same text can have different neighbors in different embedding models because every embedding model is trained with a different objective and learns a different notion of 'similarity.' General embeddings capture overall semantic similarity, search-optimized embeddings capture question–answer closeness, and code embeddings capture syntax + intent. If you choose the wrong embedding model, you may embed the correct chunk but it will be close to the wrong questions. That's why embedding choice largely determines the quality of retrieval.

Why can't embedding errors be fixed with a prompt? Because prompts operate at the generation stage while embeddings operate at the retrieval stage. A wrong embedding retrieves the wrong chunk, which means the LLM never sees the correct information. And what the model doesn't see, it can only guess at best.

At some point in almost every RAG discussion, you hear: 'We put it into a vector DB.' Just like 'we embedded the documents,' this is technically correct but creates the wrong mental picture. A vector store is not smart, does not generate meaning, and does not decide what is right or wrong. The vector store does only one thing: it quickly finds the closest vectors among the vectors it is given. Nothing more, nothing less.

A vector store stores each document chunk as a vector, and when a query comes in, returns the vectors closest to the query's vector. This closeness is usually measured using cosine similarity. So the vector store does not know if something is a policy, a table, or critical information. It only knows: 'This vector is closer to that one.'

In real systems with millions of vectors, vector stores use ANN (Approximate Nearest Neighbor) search. Instead of guaranteeing the exact nearest result, ANN finds it with very high probability, very fast. You gain speed but pay a small cost in accuracy. If embedding quality is low or chunks are poorly defined, ANN will amplify the error instead of hiding it.
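Here is a minimal sketch of the exact-vs-approximate trade-off using FAISS, with random vectors and placeholder parameters; the vectors are unit-normalized so that inner product equals cosine similarity.

```python
import numpy as np
import faiss

dim = 384
rng = np.random.default_rng(0)
vectors = rng.standard_normal((50_000, dim)).astype("float32")
faiss.normalize_L2(vectors)           # normalize so inner product == cosine similarity

# Exact search: compares against every vector, always correct, scales poorly
exact = faiss.IndexFlatIP(dim)
exact.add(vectors)

# ANN search: an HNSW graph, much faster at scale, almost always finds the true neighbors
ann = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
ann.add(vectors)

query = rng.standard_normal((1, dim)).astype("float32")
faiss.normalize_L2(query)
_, exact_ids = exact.search(query, 5)
_, ann_ids = ann.search(query, 5)
# The two id lists usually match; when they differ, that is the small accuracy
# you traded away for speed.
```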

Vector stores usually don't store only vectors — they also store metadata like source, document type, section, date, and access level. During retrieval, this creates a critical difference: 'Find the closest vector' versus 'Find the closest vector from this document type and this section.' The second approach reduces noise, filters out results that are close but wrong, and gives the LLM much cleaner context. If there is no metadata, the vector store is blind.
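In code, metadata filtering is nothing more than restricting the candidate set before scoring similarity. A minimal in-memory sketch; the filter fields and the `store` layout are illustrative, and real vector databases do this far more efficiently:

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(query_vec: np.ndarray, store: list[dict], top_k: int = 3, **filters) -> list[dict]:
    """Return the top_k chunks closest to query_vec, restricted to chunks
    whose metadata matches every key/value pair in `filters`."""
    candidates = [
        item for item in store
        if all(item["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return candidates[:top_k]


# "Find the closest vector from this document type and this section" becomes:
# results = retrieve(q_vec, store, doc_type="policy", section="Refund Timelines")
```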

The most common misconception is: 'If we change the vector DB, it will get better.' Most of the time, nothing really changes because the vector store shows the symptom but does not create the disease. The disease is usually in extraction, chunking, or embedding. The vector store just reflects the result — very fast and very clearly.

When you first build RAG, the journey usually looks like this: We collected the documents, we embedded them, we put them into a vector DB, we tweaked the prompt a bit. And then: 'RAG works… but sometimes it says nonsense.' At this point, we can be very clear: The problem is not the prompt. Because in RAG, the prompt is the last link in the chain.

If we had to summarize RAG in one sentence: RAG takes a question and tries to find the representation pieces that could answer it. The system does not ask 'Is this document correct?' or 'Is this information critical?' It only asks: 'How similar is this vector to that one?' That's why RAG's behavior depends entirely on this chain: Raw Data → Extraction → Chunking → Embedding → Vector Store → Retrieval → LLM. If any link in this chain is weak, the final answer will be weak too.

With prompts, you can adjust the tone, fix the format, and say 'give a short answer.' But prompts cannot fix wrong retrieval, create missing context, or repair broken embeddings. So if the prompt is Iron Man's suit, the data pipeline is the arc reactor. No energy, no power. The suit is just metal.

A well-built RAG system needs very few prompt hacks, doesn't break when the model version changes, and produces answers that are not surprising but consistent. Because extraction is clean, chunks are meaningful, embeddings do the right job, and the vector store just does its job. The system doesn't shout. It doesn't show off. It just works.

What makes RAG good is not a bigger model, a longer context, or a more complex prompt. What makes RAG good is how you treat your data. Everything else is a detail.