006·326

benchmark saturation

/ˈbɛntʃmɑːrk ˌsætʃəˈreɪʃən/ - n.

1 [colloq.] Everyone got an A, so the test resigned.Keep. Punchy.This is the problem.

Working definition

2. The point at which models score near-perfectly on a benchmark, so it no longer distinguishes their real capability.

Filed

See also

ai maturityTerm struck: cannot be measured until the field defines the quality, ownership, and monitoring the score is meant to rate.
eval harnessA scoreboard everyone trusts and no one has read the rules of.
ground truthOne analyst's spreadsheet, promoted to truth because no one else volunteered.
model evaluationThe benchmark the model passes, chosen after seeing which benchmark it passes.