A Daily Lexicon of Trustworthy Data
245·01 · Business Sense Required · No. 245 · 23 May 2026 · 3 min

Retrieval-augmented generation is a data-quality project nobody scoped.

The retriever inherits every undefined term in the corpus. The model just reads it aloud.

Evidence · The Editor · Signal, Not Theater

The pitch for retrieval-augmented generation is that grounding a model in your documents makes it accurate. The unscoped clause is that an answer is only ever as good as the documents retrieved for it.

What happened is that an architecture got mistaken for a guarantee. The original method, introduced by Lewis and colleagues in 2020, pairs a model with a retrieval step so answers draw on an external knowledge source rather than memory alone. Vendor guidance is candid about the dependency that follows. AWS states plainly that grounded responses are only as reliable as the documents, databases, or APIs they are based on.

It matters because retrieval quality is a data-quality question wearing an infrastructure costume. Microsoft's own guidance notes that RAG quality depends on how content is prepared for retrieval. So the moment a corpus contains three definitions of a term, two stale policies, and a deprecated number nobody flagged, the retriever does its job perfectly and surfaces the contradiction with a citation attached.
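The corpus problem above can be checked before anything is indexed. What follows is a minimal sketch, not a prescribed method: the `Doc` record, the `audit_corpus` function, the one-year staleness threshold, and the sample entries are all hypothetical, invented here for illustration. It assumes each document declares the term it defines and carries a last-reviewed date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    """Hypothetical corpus record; the field names are assumptions."""
    doc_id: str
    term: str            # the business term this document defines
    definition: str
    last_reviewed: date


def audit_corpus(docs, today, max_age_days=365):
    """Return (conflicting_terms, stale_doc_ids) for a list of Doc records."""
    definitions = defaultdict(set)
    for doc in docs:
        # Normalise case and whitespace so trivial differences do not count as conflicts.
        normalised = " ".join(doc.definition.lower().split())
        definitions[doc.term.lower()].add(normalised)
    # A term is in conflict when the corpus holds more than one distinct definition.
    conflicting = {t: sorted(d) for t, d in definitions.items() if len(d) > 1}
    # A document is stale when its last review is older than the threshold.
    stale = [d.doc_id for d in docs if (today - d.last_reviewed).days > max_age_days]
    return conflicting, stale


# Made-up sample data, for illustration only.
docs = [
    Doc("finance-01", "active customer", "Purchased in the last 90 days", date(2025, 11, 3)),
    Doc("sales-07", "active customer", "Has an open contract", date(2023, 2, 14)),
    Doc("support-02", "churn", "No renewal at contract end", date(2025, 9, 1)),
]
conflicting, stale = audit_corpus(docs, today=date(2026, 5, 23))
print(conflicting)
print(stale)
```

The check only counts and flags. Deciding which of the two definitions of "active customer" wins is not something the code can do, and that decision is the data-quality work.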

What it reveals about the field is the recurring substitution of plumbing for governance. A survey of RAG robustness catalogs the failure honestly: retrieved context can be noisy, incomplete, or adversarial, and the system inherits those flaws. None of that is fixed by a better vector index. It is fixed by deciding what the authoritative document is and who maintains it, which is ownership, not retrieval.
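A short sketch makes the gap between ranking and ownership concrete. Everything in it is an assumption made for illustration: the `Hit` record, its `owner` and `authoritative` fields, the `governed_hits` helper, and the scores. No particular vector store or retriever API is implied. It assumes someone has already decided which document is the source of record and recorded that decision as metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical retrieval result; the metadata fields are assumptions."""
    doc_id: str
    score: float              # similarity score from whatever retriever is in use
    owner: Optional[str]      # who maintains the document, if anyone
    authoritative: bool       # whether it is the designated source of record


def governed_hits(hits):
    """Keep only owned, authoritative hits, ordered by score, best first."""
    kept = [h for h in hits if h.owner and h.authoritative]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Made-up results: the best-scoring hit is an unowned draft.
hits = [
    Hit("wiki-draft", 0.91, None, False),
    Hit("policy-2021", 0.88, "legal", False),
    Hit("policy-current", 0.84, "legal", True),
]
print([h.doc_id for h in governed_hits(hits)])
```

The filter is a few lines long. The `owner` and `authoritative` values it reads are the part no index can supply, since they exist only if someone was assigned to set them.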

What to watch is who owns the corpus. If retrieval accuracy is filed as a model problem, the team tuning embeddings will be asked to compensate for definitions nobody assigned. The reveal: a model grounded in undefined data is not careful. It is confidently wrong, and now it shows its work.

The takeaway

Grounding does not retire semantic debt. It cites it. Define and own the corpus before you measure the model.

The claim, mapped
  1. RAG pairs a generative model with a retrieval step over an external knowledge source, as introduced by Lewis et al. in 2020.
     supports 01
  2. Grounded responses are only as reliable as the underlying documents, databases, or APIs.
     supports 02
  3. RAG answer quality depends on how the source content is prepared for retrieval.
     supports 03
  4. Retrieved context can be noisy, incomplete, or adversarial, and these retrieval-quality problems are not solved by the model alone.
     supports 04
Sources
01
arXiv (Lewis et al., NeurIPS 2020) · Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · 2020-05-22 · Tier 1 · primary
Introduces RAG: pairing a parametric model with a differentiable retrieval mechanism over an explicit non-parametric memory to improve knowledge-intensive generation.
02
AWS Prescriptive Guidance · Grounding and Retrieval Augmented Generation · 2025-01-01 · Tier 3 · vendor
States that grounded responses are only as reliable as the documents, databases, or APIs they are based on, and lists source data-quality assurance and traceability as required controls.
03
Microsoft Learn · RAG and Generative AI - Azure AI Search · 2026-01-15 · Tier 3 · vendor
Frames the retrieval challenges in RAG and states that RAG quality depends on how content is prepared for retrieval, with relevance and recall determining grounding data.
04
arXiv (Chaitanya Sharma) · Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers · 2025-05-28 · Tier 2 · primary
Surveys RAG and identifies challenges in retrieval quality, grounding fidelity, and robustness against noisy or adversarial inputs.