Retrieval-augmented generation is a data-quality project nobody scoped.
The retriever inherits every undefined term in the corpus. The model just reads it aloud.
The pitch for retrieval-augmented generation is that grounding a model in your documents makes it accurate. The unscoped clause is that an answer is only ever as good as the documents the retriever hands to the model.
What happened is that an architecture got mistaken for a guarantee. The original method, introduced by Lewis and colleagues in 2020, pairs a model with a retrieval step so answers draw on an external knowledge source rather than memory alone. Vendor guidance is candid about the dependency that follows. AWS states plainly that grounded responses are only as reliable as the documents, databases, or APIs they are based on.
It matters because retrieval quality is a data-quality question wearing an infrastructure costume. Microsoft's own guidance notes that RAG quality depends on how content is prepared for retrieval. So the moment a corpus contains three definitions of a term, two stale policies, and a deprecated number nobody flagged, the retriever does its job perfectly and surfaces the contradiction with a citation attached.
What it reveals about the field is the recurring substitution of plumbing for governance. A survey of RAG robustness catalogs the failure honestly: retrieved context can be noisy, incomplete, or adversarial, and the system inherits those flaws. None of that is fixed by a better vector index. It is fixed by deciding what the authoritative document is and who maintains it, which is ownership, not retrieval.
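To make the ownership point concrete, here is a minimal sketch of a pre-indexing audit, written under loud assumptions: the `Doc` fields (`term`, `owner`, `reviewed`, `authoritative`), the `audit` function, and the one-year staleness threshold are all hypothetical names and choices made for illustration, not part of any RAG library or vendor guidance. It only shows that the checks in question are metadata checks, run before a single embedding is computed.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Doc:
    # Every field here is a hypothetical metadata field for this sketch.
    doc_id: str
    term: str                  # the business term this document defines
    definition: str
    owner: Optional[str]       # team accountable for the document, if any
    reviewed: date             # date of the last human review
    authoritative: bool = False


def audit(docs, today, max_age_days=365):
    """Return (issue, subject) pairs found in the corpus before indexing."""
    issues = []
    by_term = defaultdict(list)
    for d in docs:
        by_term[d.term].append(d)
        if not d.owner:
            issues.append(("no_owner", d.doc_id))
        if (today - d.reviewed).days > max_age_days:
            issues.append(("stale", d.doc_id))
    for term, group in sorted(by_term.items()):
        definitions = {d.definition for d in group}
        authoritative_count = sum(d.authoritative for d in group)
        # Competing definitions are tolerable only if exactly one is marked authoritative.
        if len(definitions) > 1 and authoritative_count != 1:
            issues.append(("conflicting_definitions", term))
    return issues


docs = [
    Doc("policy-a", "active customer", "purchased in the last 90 days",
        "finance", date(2024, 1, 10)),
    Doc("wiki-b", "active customer", "logged in within 30 days",
        None, date(2021, 6, 1)),
]
issues = audit(docs, today=date(2024, 6, 1))
print(issues)
```

On this toy corpus the audit flags the unowned, stale wiki page and the term with two competing definitions. None of the three findings involves the vector index, and each needs a person to resolve it.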
What to watch is who owns the corpus. If retrieval accuracy is filed as a model problem, the team tuning embeddings will be asked to compensate for definitions nobody assigned. The reveal: a model grounded in undefined data is not careful. It is confidently wrong, and now it shows its work.
Grounding does not retire semantic debt. It cites it. Define and own the corpus before you measure the model.
01 (supports): RAG pairs a generative model with a retrieval step over an external knowledge source, as introduced by Lewis et al. in 2020.
02 (supports): Grounded responses are only as reliable as the underlying documents, databases, or APIs.
03 (supports): RAG answer quality depends on how the source content is prepared for retrieval.
04 (supports): Retrieved context can be noisy, incomplete, or adversarial, and these retrieval-quality problems are not solved by the model alone.