NEW RESEARCH: Your Sandbox Is Made of Glass
Read
Glossary / Eval Harness
Definition
Signed benchmark scores
An eval harness runs model benchmarks — MMLU, HELM, TruthfulQA, GSM8K, or an internal golden set — and records each run as a signed, replayable receipt. Trinitite binds the model reference, a Merkle root over the suite, a Merkle root over per-item answers, and the seed, so a benchmark number becomes evidence someone can independently re-run.
A published accuracy figure is a claim until a third party can reproduce it. Because the kernel is deterministic, the same model on the same pinned suite reproduces the same answer root, bit-for-bit, and a fail-closed verify endpoint recomputes the attestation hash and re-checks the signature — so a doctored database row is caught.
Champion-vs-challenger becomes a signed accuracy/refusal/error delta carrying a deterministic flag — the effective-challenge evidence SR 11-7 asks for, not two spreadsheets. The eval harness is the signing primitive underneath the broader Evals module.
Run the free 1,000-log pre-audit and get a signed, reproducible report you can verify in a browser — no NDA.
Trinitite
AI governance that catches mistakes, proves compliance, and shows the board what it saved—in dollars.
Trinitite is built by Fiscus Flows, Inc.
Products
Products
Solutions
Resources
Developers
© 2026 Fiscus Flows, Inc. · All rights reserved
Accessibility
The Guardian Standard™