RAG: A Distributed Search Problem with an LLM at the End
Retrieval-augmented generation as a distributed search problem. Indexing, ranking, context windows, evals.
Anchor paper: Patrick Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020 (Facebook AI Research / UCL).
The pattern at a glance
Long-form article coming soon. The narration below is the spoken version of this episode — read it as a quick transcript while the written companion is in draft.
Transcript
An internal chatbot at your company is asked about the latest sales policy. It answers confidently. Every detail sounds plausible. Every detail is wrong.
The model was trained eighteen months ago. The policy changed last quarter. The model has no way to know. It generates fluent text that is also entirely fabricated.
This is hallucination. And it's the problem RAG was designed to solve.
Large language models are powerful, but frozen. Their training data has a cutoff. They don't know what happened yesterday. They don't know what's in your company's documentation. They don't know your customer's history.
You can ask a frontier model a general question and get a brilliant answer. Ask it a specific one — what's our return policy — and it makes one up.
The fix isn't a smarter model. It's a smarter pipeline. Feed the model the right context at the right moment.
RAG was named in a 2020 paper by Patrick Lewis and colleagues at Facebook AI Research and University College London. The paper introduced retrieval-augmented generation as a single architecture combining a learned retriever and a generative model. The underlying ideas are older — information retrieval research goes back decades — but the Lewis paper crystallized the name and the pipeline.
RAG is three steps.
One: retrieve. Take the user's query, find the most relevant chunks of documentation, internal data, or knowledge base. This is a search problem.
Two: augment. Inject those retrieved chunks into the prompt as context. The model now has the facts it needs in its working memory.
Three: generate. The model produces an answer grounded in the retrieved context, not just its training data.
The architecture is a search index in front of an LLM. Most of the hard work is in the retrieval. The LLM does what it's already good at — coherent writing — over content you gave it.
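As a sketch, the whole loop fits in a few lines. Everything here is a placeholder: `embed`, `vector_search`, `build_prompt`, and `llm_complete` stand in for whatever embedding model, vector store, and LLM client you actually run.

```python
# Retrieve-augment-generate as one function. The four callables are
# placeholders for your embedding model, vector store, prompt template,
# and LLM client; none of them are real APIs.

def answer(query: str,
           embed, vector_search, build_prompt, llm_complete,
           k: int = 5) -> str:
    # 1. Retrieve: embed the query, find the k most similar chunks.
    chunks = vector_search(embed(query), k)
    # 2. Augment: inject those chunks into the prompt as context.
    prompt = build_prompt(chunks, query)
    # 3. Generate: the model answers grounded in that context.
    return llm_complete(prompt)
```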
Strip away the AI hype and RAG is a familiar shape.
Documents are chunked into passages — paragraph-sized, typically. Each chunk is converted into a vector embedding, a numerical representation that captures its meaning.
A vector database stores all chunks and their embeddings. When a query arrives, the query is also embedded. The database finds chunks whose embeddings are nearest to the query's. These are the most semantically similar passages.
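"Nearest" here usually means cosine similarity between embedding vectors. A brute-force toy version makes the contract concrete; a real vector database answers the same question with an approximate index instead of a full scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the usual 'nearness' measure for embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec: list[float],
                   index: list[tuple[str, list[float]]],
                   k: int = 3) -> list[str]:
    """Brute-force nearest-neighbor search over (chunk, embedding) pairs.
    Real vector databases use approximate indexes (HNSW, IVF) to avoid
    scanning every chunk, but they answer the same question."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```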
Optional layer: re-ranking. A second model scores the top candidates more carefully before the survivors go to the LLM.
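A minimal sketch of that layer, with `score` standing in for any model that reads the query and a candidate together, such as a cross-encoder:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Second-stage re-ranking. 'score' reads the query and one candidate
    jointly (e.g. a cross-encoder), which is more accurate but far more
    expensive per pair than a vector lookup, so it only sees the shortlist."""
    return sorted(candidates, key=lambda c: score(query, c),
                  reverse=True)[:top_n]
```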
Final step: build the prompt. System message, retrieved chunks, user query, instructions. Send to the LLM. Return the answer.
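One plausible shape for the `build_prompt` placeholder from the earlier sketch, using the chat-message convention most LLM APIs share. The exact wording is an assumption, not a standard.

```python
def build_prompt(chunks: list[str], query: str) -> list[dict]:
    """Assemble the final prompt: system message, retrieved chunks,
    user query, instructions, in that order."""
    context = "\n\n---\n\n".join(chunks)
    return [
        {"role": "system",
         "content": "You answer questions using only the provided context."},
        {"role": "user",
         "content": (f"Context:\n{context}\n\n"
                     f"Question: {query}\n\n"
                     "If the context does not contain the answer, say so.")},
    ]
```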
This is information retrieval, not generative AI. The LLM is the last step, not the system.
Four traps every RAG system hits.
One: chunking. Cut chunks too small and you lose context; cut them too large and you waste tokens and dilute relevance. There is no universal answer — the right chunk size depends on the document type and the question shape.
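A common compromise is fixed-size chunks with overlap, so a fact that straddles a boundary survives intact in at least one chunk. The sizes below are illustrative, not recommendations.

```python
def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap (sizes in characters, purely
    illustrative). Overlap keeps a sentence that straddles a boundary
    whole in at least one chunk; smarter splitters break on paragraph
    or heading boundaries instead of raw character counts."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```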
Two: embedding quality. Off-the-shelf embeddings work well for general English. They underperform on domain-specific jargon — legal, medical, internal company codenames. Fine-tuning embeddings on your corpus is often where the real gains are.
Three: retrieval relevance. The top results the vector database returns are similar to the query but not always useful. Re-rankers, hybrid search combining keyword and vector, and metadata filtering are the standard tools for fixing this.
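Hybrid search is often implemented with reciprocal rank fusion: run keyword search and vector search separately, then merge by rank rather than by raw score, because BM25 scores and cosine similarities live on incomparable scales. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]],
                           k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. one from BM25 keyword search, one
    from vector search). Each appearance of a document contributes
    1 / (k + rank); k=60 is the conventional default from the original
    RRF paper. Fusing by rank sidesteps incomparable score scales."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```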
Four: hallucination is not eliminated, just reduced. The model can still confidently fabricate details that contradict the retrieved context. Citation generation, where the model is forced to point at specific chunks, helps but does not eliminate this.
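One common mechanic for citations, sketched under assumptions: number the chunks in the prompt, ask the model to cite by number, and flag any citation that points at no real chunk.

```python
import re

def prompt_with_citations(chunks: list[str], query: str) -> str:
    """Number each chunk so the model can point at its sources.
    The instruction wording is an assumption, not a standard."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (f"Context:\n{numbered}\n\nQuestion: {query}\n\n"
            "Cite the chunk number, like [2], after every claim.")

def cited_chunk_ids(answer: str, n_chunks: int) -> set[int]:
    """Extract [n] citations and drop any that point at no real chunk.
    A claim with no valid citation is a flag for review, not proof of truth."""
    ids = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {i for i in ids if 1 <= i <= n_chunks}
```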
RAG is the wrong answer for purely conversational tasks where the model's training is sufficient — the LLM doesn't need help being helpful, charming, or thoughtful.
RAG is also wrong when the data is dynamic enough that maintaining an embedding index becomes the dominant cost. Sometimes a direct database query with a templated answer is faster, cheaper, and more accurate.
If your problem is "the model doesn't know this," RAG is right. Otherwise, it's overhead.
RAG is a distributed search problem with a large language model at the end. The hard engineering is in the retrieval pipeline — chunking, embeddings, vector indexes, re-ranking, metadata filters. The LLM does the easiest part: turning relevant context into a coherent answer.
Treat RAG like the search system it is, not the AI breakthrough it isn't.
Next episode: vector search internals — how the index actually finds the nearest neighbors at scale.