Trinitite

Eval Harness · Signed Benchmarks

Your MMLU number is a claim until someone can re-run it.

When you benchmark a model — MMLU, HELM, TruthfulQA, GSM8K, or an internal golden set — Trinitite records the run as a signed, replayable receipt that binds which model, which suite, and every per-item verdict. An auditor re-runs the suite and proves the same model on the same inputs produces the same scores. Champion-vs-challenger becomes a signed delta, not a spreadsheet.

eval_harness · eh_3f9a…c7e1 · VERIFIED

87.6% · MMLU · bound to checkpoint

suite: mmlu · 14,042 items
model_ref: llama-3.1-70b+lora
seed: 0
answer_root: e8d2…41aa
Δ vs incumbent: +1.9 · deterministic

A score you can’t reproduce — and can’t bind to a model — is marketing, not assurance.

“92% on MMLU” — against which checkpoint, with which prompts? You re-run next quarter and the number moved. Was that the model, the suite, the seed, or someone editing a cell? A signed receipt answers it; a screenshot of a notebook output is a picture of a number.

What gets signed

Which model. Which suite. Every answer.

model_reference

The exact model id, optional AI bill-of-materials, and LoRA adapter under test — so the score is bound to a checkpoint, not a brand name.

suite_merkle_root

A Merkle root over the benchmark itself — MMLU, HELM, TruthfulQA, GSM8K, a red-team battery, or your internal golden set — pinning exactly what was tested.

answer_merkle_root

A Merkle root over the per-item outputs and verdicts (correct / incorrect / refused / error) — what the model actually produced.

seed

The seed the run executed with. Same model + same pinned suite + same seed reproduces the same answer root, bit-for-bit.
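
To make the "same inputs, same answer root" property concrete, here is a minimal sketch of how a Merkle root over per-item records could be computed. It is not Trinitite's actual scheme: the record fields, the canonical-JSON encoding, the leaf and inner-node prefixes, and the rule of duplicating the last node on odd levels are all assumptions chosen for illustration.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    """Hash one per-item record via canonical JSON so key order cannot change the digest."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(b"\x00" + payload).digest()  # 0x00 prefix marks a leaf (assumed)


def merkle_root(leaves: list[bytes]) -> str:
    """Fold leaf digests pairwise until a single hex root remains."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: duplicate the last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()  # 0x01 marks an inner node
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def digest(text: str) -> str:
    """Hex SHA-256 of a string; stands in for the input and output hashes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Three made-up items: the records hold hashes and verdicts, not raw text.
answers = [
    {"item": 0, "input_hash": digest("question 0"), "output_hash": digest("B"), "verdict": "correct"},
    {"item": 1, "input_hash": digest("question 1"), "output_hash": digest("D"), "verdict": "incorrect"},
    {"item": 2, "input_hash": digest("question 2"), "output_hash": digest(""), "verdict": "refused"},
]

answer_root = merkle_root([leaf_hash(a) for a in answers])
# A rerun that produces identical per-item records yields the identical root.
rerun_root = merkle_root([leaf_hash(dict(a)) for a in answers])
print(answer_root == rerun_root)
```

Changing a single verdict or output hash changes the corresponding leaf, and therefore the root, which is what lets one short value stand in for every per-item result.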

How it works

Run, record, sign, verify, compare.

01

Run a suite

Execute MMLU, HELM, GSM8K, a red-team battery, or a custom suite against a model reference.

02

Record the run

Every item’s input hash, output hash, and verdict — correct, incorrect, refused, or error.

03

Get a signed receipt

Binds model + suite + seed + per-item answers + aggregate accuracy / refusal / error rates, KMS-signed and Merkle-rooted.

04

Verify any time

Trinitite recomputes the attestation hash and re-checks the signature, failing closed if any stored metric was tampered with, including a database row edited directly.

05

Compare

Diff two receipts to produce a signed champion/challenger accuracy delta, carrying a deterministic flag, for SR 11-7 challenger analysis.
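
Steps 03 and 04 can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the receipt fields are hypothetical, and an HMAC with a local demo key stands in for the KMS signature, which in practice would be produced by a key that never leaves the KMS.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # stand-in for a KMS-held key (assumption)


def attestation_hash(receipt: dict) -> str:
    """Hash every field except the attestation hash and signature themselves."""
    body = {k: v for k, v in receipt.items() if k not in ("attestation_hash", "signature")}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def sign_receipt(body: dict) -> dict:
    """Attach an attestation hash and a signature over that hash."""
    receipt = dict(body)
    receipt["attestation_hash"] = attestation_hash(receipt)
    receipt["signature"] = hmac.new(
        SIGNING_KEY, receipt["attestation_hash"].encode("ascii"), hashlib.sha256
    ).hexdigest()
    return receipt


def verify_receipt(receipt: dict) -> bool:
    """Recompute the hash and re-check the signature; any problem means 'not verified'."""
    try:
        recomputed = attestation_hash(receipt)
        if not hmac.compare_digest(recomputed, receipt["attestation_hash"]):
            return False
        expected = hmac.new(SIGNING_KEY, recomputed.encode("ascii"), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, receipt["signature"])
    except (KeyError, TypeError, AttributeError):
        return False  # fail closed on missing or malformed fields


# Illustrative receipt body; field names are hypothetical.
receipt = sign_receipt(
    {"model_ref": "llama-3.1-70b+lora", "suite": "mmlu", "seed": 0, "accuracy": 0.876}
)
doctored = dict(receipt, accuracy=0.92)  # simulates an edited database row

print(verify_receipt(receipt))   # True
print(verify_receipt(doctored))  # False
```

The doctored copy fails because the recomputed hash no longer matches the stored one, so the edit is caught without consulting the original run.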

In your language

A benchmark that re-verifies without trusting you.

Model Risk / SR 11-7 examiner

The challenger comparison the framework requires — candidate vs incumbent on the same suite, signed, with a determinism check.

Head of AI / ML

A benchmark history you can trust: every score bound to a specific checkpoint, re-runnable, diffable.

Internal audit / Big-4

Accuracy and robustness evidence that rolls into the workpaper and re-verifies without trusting your notebook.

Regulatory affairs

Per-suite evidence for EU AI Act Art. 15 (accuracy, robustness), NIST AI RMF MEASURE-2.x, ISO 42001 §B.6.2.5.

AI vendor

Publish a capability claim a buyer’s risk team can re-verify — not a number they have to take on faith.

The eval harness is the signing primitive under the Evals module; it feeds model risk management (SR 11-7) and shares the proof chain of deterministic replay.

FAQ

Eval harness, answered

What is an eval harness?

An eval harness runs a model against a benchmark suite and records the result. Trinitite’s eval harness turns that run into a signed, replayable receipt: it pins the model reference, a Merkle root over the suite, and a Merkle root over the per-item answers, then KMS-signs the whole envelope. Because the kernel is deterministic, re-running the same model on the same pinned suite reproduces the same answer root, bit-for-bit.

How does this satisfy SR 11-7?

SR 11-7's effective-challenge expectation calls for a candidate-vs-incumbent comparison on the same suite. The eval harness produces a signed accuracy / refusal / error delta between two receipts, carrying a deterministic flag, so the comparison is reproducible and citeable by id rather than resting on two spreadsheets and a promise that the runs are comparable.
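
As a rough sketch of such a comparison, the snippet below diffs two receipts and refuses to compare runs over different suites. The `Receipt` fields, the made-up ids and rates, and the rule used for the `deterministic` flag (both runs share a seed) are assumptions for illustration. The source does not define how the flag is derived, and signing of the delta is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """Hypothetical subset of the fields a receipt binds."""
    receipt_id: str
    model_ref: str
    suite_merkle_root: str
    seed: int
    accuracy: float
    refusal_rate: float
    error_rate: float


def compare(champion: Receipt, challenger: Receipt) -> dict:
    """Return challenger-minus-champion deltas for two runs over the same pinned suite."""
    if champion.suite_merkle_root != challenger.suite_merkle_root:
        raise ValueError("receipts cover different suites; the delta would not be comparable")
    return {
        "champion": champion.receipt_id,
        "challenger": challenger.receipt_id,
        "accuracy_delta": round(challenger.accuracy - champion.accuracy, 4),
        "refusal_delta": round(challenger.refusal_rate - champion.refusal_rate, 4),
        "error_delta": round(challenger.error_rate - champion.error_rate, 4),
        # Assumed rule: flag the comparison as deterministic when both runs used the same seed.
        "deterministic": champion.seed == challenger.seed,
    }


# Made-up numbers for illustration only.
champion = Receipt("eh_demo_champion", "incumbent-model", "suite-root-demo", 0, 0.857, 0.012, 0.001)
challenger = Receipt("eh_demo_challenger", "candidate-model", "suite-root-demo", 0, 0.876, 0.010, 0.001)

delta = compare(champion, challenger)
print(delta["accuracy_delta"], delta["deterministic"])
```

Rejecting mismatched suite roots up front is the design choice that matters: a delta between runs over different item sets is not a like-for-like comparison.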

Do receipts store my raw prompts and answers?

No. Receipts store input and output hashes, not raw text — so they are workpaper-embeddable while still spot-checkable per item. The verify endpoint recomputes the attestation hash and re-checks the signature, failing closed on any post-hoc tampering.
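
A per-item spot check against stored hashes might look like the sketch below. It assumes the auditor separately holds the raw prompt and answer for the item being checked, and that the hashes are plain SHA-256 over UTF-8 text. The record layout and function names are hypothetical.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 of UTF-8 text (assumed hashing scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def spot_check(stored: dict, raw_input: str, raw_output: str) -> bool:
    """True only if both raw texts hash to the values stored for this item."""
    return (
        stored.get("input_hash") == sha256_hex(raw_input)
        and stored.get("output_hash") == sha256_hex(raw_output)
    )


# What the receipt keeps for one item: hashes and a verdict, no raw text.
stored_item = {
    "item": 17,
    "input_hash": sha256_hex("What is 12 * 12?"),
    "output_hash": sha256_hex("144"),
    "verdict": "correct",
}

print(spot_check(stored_item, "What is 12 * 12?", "144"))  # True
print(spot_check(stored_item, "What is 12 * 12?", "143"))  # False
```

The receipt itself never reveals the prompt or the answer, yet anyone holding the raw pair can confirm it is the one that was scored.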

How does this relate to the Evals module?

The eval harness is the signing primitive underneath the Evals module: when Trinitite tests your agent’s behavior with our judge, each run mints one of these same eh_… receipts. The harness scores a model against a benchmark suite; the Evals module scores an agent against a rubric. Both feed model risk management.

Record one benchmark run. Re-verify the receipt.

Run a suite against a candidate model, get a signed receipt binding the model, the suite, and every verdict — then re-verify it yourself, no notebook required.