What is Eval Harness?

Signed benchmark scores

An eval harness runs model benchmarks — MMLU, HELM, TruthfulQA, GSM8K, or an internal golden set — and records each run as a signed, replayable receipt. Trinitite binds the model reference, a Merkle root over the suite, a Merkle root over per-item answers, and the seed, so a benchmark number becomes evidence someone can independently re-run.
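As a rough illustration of how those four fields could be bound into one receipt, here is a minimal sketch in Python. It is not Trinitite's actual format: the field names, the leaf/node prefix bytes, the duplicate-last-node rule for odd levels, and the canonical JSON encoding are all assumptions made for the example.

```python
import hashlib
import json


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Hex Merkle root over the leaves (assumed scheme, not a product spec).

    Leaves and inner nodes get different prefix bytes so a leaf can never
    be confused with an inner node. An odd level duplicates its last node.
    """
    if not leaves:
        return _h(b"").hex()
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def build_receipt(model_ref: str, suite_items: list[str],
                  answers: list[str], seed: int) -> dict:
    """Bind model reference, suite root, answer root, and seed together."""
    body = {
        "model_ref": model_ref,  # hypothetical field names throughout
        "suite_root": merkle_root([s.encode() for s in suite_items]),
        "answer_root": merkle_root([a.encode() for a in answers]),
        "seed": seed,
    }
    # Canonical encoding: sorted keys, no whitespace, so the hash is stable.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["attestation_hash"] = hashlib.sha256(canonical.encode()).hexdigest()
    return body
```

Under these assumptions, changing a single per-item answer changes the answer root and therefore the attestation hash, while the suite root stays the same.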

A published accuracy figure is a claim until a third party can reproduce it. Because the kernel is deterministic, the same model on the same pinned suite reproduces the same answer root, bit-for-bit. A fail-closed verify endpoint recomputes the attestation hash and re-checks the signature, so a doctored database row is caught.
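The fail-closed check can be sketched as follows. This is an assumption-laden illustration, not the real endpoint: the receipt layout and function names are hypothetical, and HMAC-SHA256 from the Python standard library stands in for whatever signature scheme the product actually uses.

```python
import hashlib
import hmac
import json

_SEALS = ("attestation_hash", "signature")  # hypothetical field names


def canonical_hash(fields: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the receipt fields."""
    encoded = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode()).hexdigest()


def sign_receipt(fields: dict, key: bytes) -> dict:
    """Attach an attestation hash and an HMAC 'signature' (a stand-in)."""
    att = canonical_hash(fields)
    sig = hmac.new(key, att.encode(), hashlib.sha256).hexdigest()
    return {**fields, "attestation_hash": att, "signature": sig}


def verify_receipt(receipt: dict, key: bytes) -> bool:
    """Fail closed: any mismatch, missing field, or error returns False."""
    try:
        fields = {k: v for k, v in receipt.items() if k not in _SEALS}
        # Recompute the hash from the stored fields, never trust the row.
        expected = canonical_hash(fields)
        if not hmac.compare_digest(expected, receipt["attestation_hash"]):
            return False
        expected_sig = hmac.new(
            key, expected.encode(), hashlib.sha256
        ).hexdigest()
        return hmac.compare_digest(expected_sig, receipt["signature"])
    except Exception:
        return False
```

In this sketch a stored row whose accuracy was edited after signing no longer matches its recomputed hash, and a row with its signature stripped is rejected rather than passed through.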

A champion-vs-challenger comparison becomes a signed delta in accuracy, refusal, and error rates that carries a deterministic flag. That is the effective-challenge evidence SR 11-7 asks for, rather than two spreadsheets. The eval harness is the signing primitive underneath the broader Evals module.
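A champion-vs-challenger delta of this kind might be computed along these lines. The outcome labels, dictionary keys, and in particular the meaning of the deterministic flag are assumptions for illustration: here the flag is taken to mean that both runs used the same suite root and seed, which may differ from the product's definition. Signing the result is omitted.

```python
from collections import Counter


def rates(outcomes: list[str]) -> dict:
    """Share of items labelled 'correct', 'refusal', or 'error'.

    Any other label (for example 'incorrect') counts only toward the total.
    """
    if not outcomes:
        raise ValueError("a run needs at least one item")
    counts = Counter(outcomes)
    n = len(outcomes)
    return {
        "accuracy": counts["correct"] / n,
        "refusal": counts["refusal"] / n,
        "error": counts["error"] / n,
    }


def compare_runs(champion: dict, challenger: dict) -> dict:
    """Challenger minus champion for each rate, plus a comparability flag."""
    a = rates(champion["outcomes"])
    b = rates(challenger["outcomes"])
    return {
        "accuracy_delta": b["accuracy"] - a["accuracy"],
        "refusal_delta": b["refusal"] - a["refusal"],
        "error_delta": b["error"] - a["error"],
        # Assumed meaning: both runs share the pinned suite and the seed.
        "deterministic": (
            champion["suite_root"] == challenger["suite_root"]
            and champion["seed"] == challenger["seed"]
        ),
    }
```

A positive accuracy delta with an unset flag would signal that the two runs are not directly comparable, since the suite or seed differed between them.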

Related terms

Agent Evals


SR 11-7


AI Red Teaming
