Definition

What is an Eval Harness?

Signed benchmark scores

An eval harness runs model benchmarks — MMLU, HELM, TruthfulQA, GSM8K, or an internal golden set — and records each run as a signed, replayable receipt. Trinitite binds the model reference, a Merkle root over the suite, a Merkle root over per-item answers, and the seed, so a benchmark number becomes evidence someone can independently re-run.
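As a minimal sketch of what such a binding could look like, the Python below hashes the suite and the per-item answers into Merkle roots and folds them, together with the model reference and seed, into one attestation hash. The field names, the leaf and node prefix bytes, the odd-node pairing rule, and the canonical JSON encoding are all assumptions made for illustration; they are not Trinitite's actual receipt format.

```python
import hashlib
import json
from typing import Sequence


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: Sequence[bytes]) -> str:
    """Return a hex Merkle root over the leaves.

    Assumed conventions: leaves and inner nodes get different prefix bytes,
    and an unpaired node is paired with itself.
    """
    if not leaves:
        return sha256(b"").hex()
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def build_receipt(model_ref: str, suite_items: Sequence[str],
                  answers: Sequence[str], seed: int) -> dict:
    """Bind model, suite, answers, and seed into one hashable record (hypothetical layout)."""
    body = {
        "model_ref": model_ref,
        "suite_root": merkle_root([s.encode() for s in suite_items]),
        "answer_root": merkle_root([a.encode() for a in answers]),
        "seed": seed,
    }
    # Canonical encoding: sorted keys and fixed separators, so equal bodies hash equally.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    body["attestation_hash"] = sha256(canonical).hex()
    return body
```

Under these assumptions, two runs with identical inputs yield identical receipts, while changing a single answer changes the answer root and the attestation hash but leaves the suite root untouched.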

A published accuracy figure is a claim until a third party can reproduce it. Because the kernel is deterministic, the same model on the same pinned suite reproduces the same answer root, bit-for-bit, and a fail-closed verify endpoint recomputes the attestation hash and re-checks the signature — so a doctored database row is caught.
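The fail-closed check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real verify endpoint: the receipt layout, the function names, and the demo key are hypothetical, and an HMAC stands in for whatever signature scheme is actually used (a production system would more likely use an asymmetric signature so verifiers never hold the signing key).

```python
import hashlib
import hmac
import json

DEMO_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def attestation_hash(body: dict) -> str:
    """Hash a canonical JSON encoding of the receipt body (assumed encoding)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def sign(body: dict, key: bytes = DEMO_KEY) -> dict:
    """Produce a receipt: body, its attestation hash, and an HMAC stand-in signature."""
    digest = attestation_hash(body)
    signature = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "attestation_hash": digest, "signature": signature}


def verify(receipt: dict, key: bytes = DEMO_KEY) -> bool:
    """Fail closed: any mismatch, missing field, or malformed input returns False."""
    try:
        # Recompute the hash from the body instead of trusting the stored value.
        digest = attestation_hash(receipt["body"])
        if not hmac.compare_digest(digest, receipt["attestation_hash"]):
            return False
        expected = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, receipt["signature"])
    except Exception:
        return False
```

Because the hash is recomputed from the stored body, editing a row after signing (for example, raising a stored accuracy value) makes `verify` return `False`, and so does a receipt with a missing or malformed field.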

A champion-versus-challenger comparison becomes a signed delta in accuracy, refusal rate, and error rate, carrying a deterministic flag. That is the effective-challenge evidence SR 11-7, the U.S. supervisory guidance on model risk management, asks for, rather than two spreadsheets. The eval harness is the signing primitive underneath the broader Evals module.
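To make the comparison concrete, here is a small sketch of how the three deltas and the flag might be computed before the result is signed. The `RunSummary` type and its fields are hypothetical, and the meaning given to the deterministic flag (each run reproduced its own answer root when replayed) is an assumption, since the definition is not spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-run summary; field names are illustrative."""
    answer_root: str   # Merkle root over per-item answers from the original run
    replay_root: str   # root obtained by re-running the same model, suite, and seed
    correct: int
    refused: int
    errored: int
    total: int

    def rate(self, count: int) -> float:
        return count / self.total if self.total else 0.0


def compare(champion: RunSummary, challenger: RunSummary) -> dict:
    """Return challenger-minus-champion deltas plus an assumed deterministic flag."""
    return {
        "accuracy_delta": challenger.rate(challenger.correct) - champion.rate(champion.correct),
        "refusal_delta": challenger.rate(challenger.refused) - champion.rate(champion.refused),
        "error_delta": challenger.rate(challenger.errored) - champion.rate(champion.errored),
        # Assumption: the flag is set only if both runs reproduced their answer root.
        "deterministic": all(
            run.answer_root == run.replay_root for run in (champion, challenger)
        ),
    }
```

A positive `accuracy_delta` means the challenger answered more items correctly on the same suite. If either run fails to reproduce its answer root, the flag is `False` and the delta should not be treated as reproducible evidence.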