Skip to content

A Daily Lexicon of Trustworthy Data

006·37

llm-as-judge

/ˌɛl ɛl ˈɛm æz dʒʌdʒ/ - n.

1 [colloq.] Outsourcing whether the answer is good to a second system that also does not know.Keep. Punchy.This is the problem.

Working definition

2. Using one language model to score another model's outputs against a rubric, in place of human grading.

Evidence

See also

eval setThe exam the model is allowed to study before every retake.
ground truthOne analyst's spreadsheet, promoted to truth because no one else volunteered.
model evaluationThe benchmark the model passes, chosen after seeing which benchmark it passes.