Vol. I · No. 251
Mon · 8 Jun
A Daily Lexicon of Trustworthy Data
249·07 · Owner Missing · No. 249 · 29 May 2026 · 2 min

NIST Asked Where the Data Came From; The Pipeline Went Quiet

The Generative AI Profile treats provenance as a control — but admits most builders cannot say what they trained on.


NIST's Generative AI Profile lists content provenance as a named risk-management practice. Then, in plain text, it concedes the awkward part: most model builders cannot actually tell you what they trained on.

On 26 July 2024, NIST published AI 600-1, the Generative AI Profile, a companion to the AI Risk Management Framework developed with public input. It frames content provenance and data documentation as governance controls spanning the AI lifecycle, and describes high-integrity information as that which 'can be linked to the original source(s) with appropriate evidence' and has 'a clear chain of custody.'

The same document states the operating reality without flinching: 'most model developers do not disclose specific data sources on which models were trained, limiting user awareness of whether personally identifiable information (PII) was trained on and, if so, how it was collected.' A framework that recommends provenance is also documenting, in the same pages, that provenance frequently does not exist.

This is the recurring gap dressed in federal typography. Provenance, lineage, and a chain of custody are not properties a model acquires at inference time; they are records a data function maintains from collection onward, or they are absent forever. The Profile can ask for the receipt. It cannot conjure the bookkeeping that organizations declined to staff while they were busy standing up the model.

Watch whether 'data provenance' shows up in AI programs as a funded role with a system of record, or as a slide. The honest test is simple and cheap: pick one production dataset and ask who can produce its origin, its consent basis, and its known gaps without phoning a contractor. The answer reveals whether you have governance or just a vocabulary for it.
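
For teams that want to run that test against something more durable than a meeting, here is a minimal sketch in Python. It is not drawn from the Profile: the `ProvenanceRecord` fields, the `unanswered` helper, and both example datasets are hypothetical, chosen only to mirror the three questions above plus a named owner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record for one production dataset (field names are illustrative)."""
    dataset: str
    owner: Optional[str] = None             # named in-house person or team
    origin: Optional[str] = None            # where the data was collected from
    consent_basis: Optional[str] = None     # basis on which it was collected
    known_gaps: Optional[list[str]] = None  # None = never assessed; [] = assessed, none found


def unanswered(record: ProvenanceRecord) -> list[str]:
    """Return the questions the record cannot answer on its own."""
    missing = []
    for name in ("owner", "origin", "consent_basis"):
        value = getattr(record, name)
        # A blank string is treated the same as no answer at all.
        if value is None or not value.strip():
            missing.append(name)
    # An empty list is a valid answer; only an unassessed dataset counts as missing.
    if record.known_gaps is None:
        missing.append("known_gaps")
    return missing


# Made-up example data, for illustration only.
documented = ProvenanceRecord(
    dataset="support_tickets_2023",
    owner="data-governance",
    origin="internal helpdesk export",
    consent_basis="customer terms of service",
    known_gaps=["no records before March"],
)
undocumented = ProvenanceRecord(dataset="scraped_forum_dump", origin="")

print(unanswered(documented))    # []
print(unanswered(undocumented))  # ['owner', 'origin', 'consent_basis', 'known_gaps']
```

The one design choice worth copying is the distinction between `None` and an empty list for `known_gaps`: "we looked and found nothing" is a record, while "nobody looked" is the absence of one.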

The takeaway

Provenance is a record you keep from collection onward, or it is absent forever. Run the chain-of-custody test on one production dataset; the Profile can ask for the receipt, but it cannot conjure the bookkeeping you declined to staff.

The claim, mapped

  1. NIST released the AI 600-1 Generative AI Profile on 26 July 2024 as a companion to the AI Risk Management Framework.
     supports: 01, 02

  2. The Profile describes high-integrity information as linkable to original sources with appropriate evidence and having a clear chain of custody.
     supports: 02

  3. The Profile states that most model developers do not disclose the specific data sources on which models were trained, limiting awareness of whether PII was included.
     supports: 02

Sources

01
National Institute of Standards and Technology (NIST), Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile
2024-07-26 · Tier 1 · primary
Official NIST page for AI 600-1, the Generative AI Profile (released 26 July 2024), a companion to the AI RMF addressing governance, content provenance, testing, and incident disclosure.

02
National Institute of Standards and Technology (NIST), NIST AI 600-1: Generative Artificial Intelligence Profile (PDF)
2024-07-26 · Tier 1 · primary
Primary text: 'most model developers do not disclose specific data sources on which models were trained, limiting user awareness of whether personally identifiable information (PII) was trained on.'
