Eval Harness · Signed Benchmarks
When you benchmark a model — MMLU, HELM, TruthfulQA, GSM8K, or an internal golden set — Trinitite records the run as a signed, replayable receipt that binds which model, which suite, and every per-item verdict. An auditor re-runs the suite and proves the same model on the same inputs produces the same scores. Champion-vs-challenger becomes a signed delta, not a spreadsheet.
eval_harness · eh_3f9a…c7e1 · VERIFIED
87.6% · MMLU · bound to checkpoint
suite: mmlu · 14,042 items
model_ref: llama-3.1-70b+lora
seed: 0
answer_root: e8d2…41aa
Δ vs incumbent: +1.9 · deterministic
A score you can’t reproduce — and can’t bind to a model — is marketing, not assurance.
“92% on MMLU” — against which checkpoint, with which prompts? You re-run next quarter and the number moved. Was that the model, the suite, the seed, or someone editing a cell? A signed receipt answers it; a screenshot of a notebook output is a picture of a number.
What gets signed
model_reference
The exact model id, optional AI bill-of-materials, and LoRA adapter under test — so the score is bound to a checkpoint, not a brand name.
suite_merkle_root
A Merkle root over the benchmark itself — MMLU, HELM, TruthfulQA, GSM8K, a red-team battery, or your internal golden set — pinning exactly what was tested.
answer_merkle_root
A Merkle root over the per-item outputs and verdicts (correct / incorrect / refused / error) — what the model actually produced.
seed
The seed the run executed with. Same model + same pinned suite + same seed reproduces the same answer root, bit-for-bit.
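As a rough illustration of how a Merkle root can pin per-item results, here is a minimal sketch. It is not Trinitite's actual scheme: the page does not specify the hash function, leaf encoding, or tree shape, so SHA-256, canonical JSON leaves, the leaf/node prefix bytes, and the record field names (`item_id`, `input_hash`, `output_hash`, `verdict`) are all assumptions made for the example.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    # Canonical JSON (sorted keys, no extra whitespace) so the same record
    # always serializes, and therefore hashes, identically.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # 0x00 prefix marks a leaf (assumed convention, keeps leaves and nodes distinct).
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(leaves: list[bytes]) -> str:
    # Empty input gets a fixed sentinel root (an arbitrary choice for this sketch).
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # 0x01 prefix marks an interior node.
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0].hex()


# Hypothetical per-item records: hashes and verdicts are made-up demo values.
items = [
    {"item_id": "q1", "input_hash": "a1", "output_hash": "b1", "verdict": "correct"},
    {"item_id": "q2", "input_hash": "a2", "output_hash": "b2", "verdict": "incorrect"},
    {"item_id": "q3", "input_hash": "a3", "output_hash": "b3", "verdict": "refused"},
]

answer_root = merkle_root([leaf_hash(r) for r in items])

# Flipping a single verdict changes the root, which is what makes the root
# usable as a tamper-evident summary of every per-item result.
edited = [dict(r) for r in items]
edited[1]["verdict"] = "correct"
edited_root = merkle_root([leaf_hash(r) for r in edited])

print(answer_root == edited_root)
```

The same construction applied to the benchmark items themselves would give a suite root, so a change to either the test set or the answers shows up as a different root.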
How it works
01
Run a suite
Execute MMLU, HELM, GSM8K, a red-team battery, or a custom suite against a model reference.
02
Record the run
Every item’s input hash, output hash, and verdict — correct, incorrect, refused, or error.
03
Get a signed receipt
Binds model + suite + seed + per-item answers + aggregate accuracy / refusal / error rates, KMS-signed and Merkle-rooted.
04
Verify any time
Trinitite recomputes the attestation hash and re-checks the signature, failing closed if any stored metric has been tampered with, even through a directly edited database row.
05
Compare
Diff two receipts to get a signed champion/challenger accuracy delta, with a determinism flag for SR 11-7.
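The issue, verify, and compare steps above could be sketched roughly as follows. This is an assumption-laden illustration, not the product's API: the function names, the receipt layout, and the `same_seed` field are hypothetical, and a local HMAC key stands in for KMS signing so the example runs without any external service.

```python
import hashlib
import hmac
import json

# Stand-in for a KMS-held key; a real deployment would not keep the key in code.
DEMO_KEY = b"demo-key-not-a-real-kms-key"


def attestation_hash(body: dict) -> str:
    # Hash a canonical serialization so any edited field changes the digest.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def issue_receipt(model_ref, suite_root, answer_root, seed, verdicts, key=DEMO_KEY):
    body = {
        "model_reference": model_ref,
        "suite_merkle_root": suite_root,
        "answer_merkle_root": answer_root,
        "seed": seed,
        "n_items": len(verdicts),
        "n_correct": verdicts.count("correct"),
        "n_refused": verdicts.count("refused"),
        "n_error": verdicts.count("error"),
    }
    digest = attestation_hash(body)
    signature = hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "attestation_hash": digest, "signature": signature}


def verify_receipt(receipt, key=DEMO_KEY) -> bool:
    # Recompute the hash from the stored fields, then re-check the signature.
    digest = attestation_hash(receipt["body"])
    expected = hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, receipt["attestation_hash"]) and hmac.compare_digest(
        expected, receipt["signature"]
    )


def compare(champion, challenger, key=DEMO_KEY) -> dict:
    # Fail closed: no delta is produced unless both receipts verify.
    if not (verify_receipt(champion, key) and verify_receipt(challenger, key)):
        raise ValueError("refusing to compare: a receipt failed verification")
    a, b = champion["body"], challenger["body"]
    if a["suite_merkle_root"] != b["suite_merkle_root"]:
        raise ValueError("receipts cover different suites")
    delta = 100 * (b["n_correct"] / b["n_items"] - a["n_correct"] / a["n_items"])
    return {"accuracy_delta_pts": round(delta, 2), "same_seed": a["seed"] == b["seed"]}


# Made-up demo values; the roots are opaque placeholder strings here.
champion = issue_receipt("incumbent-model", "suite-root-demo", "answers-root-a", 0,
                         ["correct", "incorrect", "correct", "refused"])
challenger = issue_receipt("candidate-model", "suite-root-demo", "answers-root-b", 0,
                           ["correct", "correct", "correct", "error"])

print(compare(champion, challenger))
```

Editing a stored count after issuance, for example setting `n_correct` in a copied receipt, makes `verify_receipt` return `False`, and `compare` then raises instead of reporting a delta.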
In your language
Model Risk / SR 11-7 examiner
The challenger comparison the framework requires — candidate vs incumbent on the same suite, signed, with a determinism check.
Head of AI / ML
A benchmark history you can trust: every score bound to a specific checkpoint, re-runnable, diffable.
Internal audit / Big-4
Accuracy and robustness evidence that rolls into the workpaper and re-verifies without trusting your notebook.
Regulatory affairs
Per-suite evidence for EU AI Act Art. 15 (accuracy, robustness), NIST AI RMF MEASURE-2.x, ISO 42001 §B.6.2.5.
AI vendor
Publish a capability claim a buyer’s risk team can re-verify — not a number they have to take on faith.
The eval harness is the signing primitive under the Evals module; it feeds model risk management (SR 11-7) and shares the proof chain of deterministic replay.
FAQ
eh_… receipts. The harness scores a model against a benchmark suite; the Evals module scores an agent against a rubric. Both feed model risk management.
Run a suite against a candidate model and get a signed receipt binding the model, the suite, and every verdict. Then re-verify it yourself, no notebook required.
Trinitite
AI governance that catches mistakes, proves compliance, and shows the board what it saved—in dollars.
Trinitite is built by Fiscus Flows, Inc.
© 2026 Fiscus Flows, Inc. · All rights reserved