NIST Asked Where the Data Came From; The Pipeline Went Quiet
The Generative AI Profile treats provenance as a control — but admits most builders cannot say what they trained on.
NIST's Generative AI Profile lists content provenance as a named risk-management practice. Then, in plain text, it concedes the awkward part: most model builders cannot actually tell you what they trained on.
On 26 July 2024, NIST published AI 600-1, the Generative AI Profile, a companion to the AI Risk Management Framework developed with public input. It frames content provenance and data documentation as governance controls spanning the AI lifecycle, and describes high-integrity information as that which 'can be linked to the original source(s) with appropriate evidence' and has 'a clear chain of custody.'
The same document states the operating reality without flinching: 'most model developers do not disclose specific data sources on which models were trained, limiting user awareness of whether personally identifiable information (PII) was trained on and, if so, how it was collected.' A framework that recommends provenance is also documenting, in the same pages, that provenance frequently does not exist.
This is the recurring gap dressed in federal typography. Provenance, lineage, and a chain of custody are not properties a model acquires at inference time; they are records a data function maintains from collection onward, or they are absent forever. The Profile can ask for the receipt. It cannot conjure the bookkeeping that organizations declined to staff while they were busy standing up the model.
Watch whether 'data provenance' shows up in AI programs as a funded role with a system of record, or as a slide. The honest test is simple and cheap: pick one production dataset and ask who can produce its origin, its consent basis, and its known gaps without phoning a contractor. The answer reveals whether you have governance or just a vocabulary for it.
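For teams that want to run that test against something more durable than memory, here is a minimal sketch in Python. It assumes a system of record that stores one entry per dataset; the `ProvenanceRecord` type, its field names, and the `unanswered` helper are all hypothetical, invented for illustration, and nothing in the Profile prescribes this shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical system-of-record entry for one production dataset."""
    dataset: str
    owner: Optional[str] = None             # person or team who answers for it
    origin: Optional[str] = None            # where the data was collected from
    consent_basis: Optional[str] = None     # basis on which it was collected
    known_gaps: Optional[list[str]] = None  # None means nobody has assessed gaps


def unanswered(record: ProvenanceRecord) -> list[str]:
    """Return the questions from the test that the record cannot answer."""
    missing = []
    for field_name in ("owner", "origin", "consent_basis"):
        value = getattr(record, field_name)
        if value is None or not value.strip():
            missing.append(field_name)
    # An empty list is a valid answer ("assessed, none found"); None is not.
    if record.known_gaps is None:
        missing.append("known_gaps")
    return missing


# Illustrative entry only: the dataset name and origin are made up.
record = ProvenanceRecord(
    dataset="support_tickets_2023",
    origin="export from the ticketing system",
)
print(unanswered(record))
```

The one design choice worth noting is that an empty `known_gaps` list counts as an answer while a missing one does not: "we looked and found nothing" is a record, and "nobody looked" is the gap the test is meant to expose.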
Provenance is a record you keep from collection onward, or it is absent forever. Run the chain-of-custody test on one production dataset; the Profile can ask for the receipt, but it cannot conjure the bookkeeping you declined to staff.
NIST released the AI 600-1 Generative AI Profile on 26 July 2024 as a companion to the AI Risk Management Framework.
The Profile describes high-integrity information as linkable to original sources with appropriate evidence and having a clear chain of custody.
The Profile states that most model developers do not disclose the specific data sources on which models were trained, limiting awareness of whether PII was included.