Trinitite

Eval Harness · Signed Benchmarks

Your MMLU number is a claim until someone can re-run it.

When you benchmark a model — MMLU, HELM, TruthfulQA, GSM8K, or an internal golden set — Trinitite records the run as a signed, replayable receipt that binds which model, which suite, and every per-item verdict. An auditor re-runs the suite and proves the same model on the same inputs produces the same scores. Champion-vs-challenger becomes a signed delta, not a spreadsheet.

eval_harness · eh_3f9a…c7e1 · VERIFIED

87.6% · MMLU · bound to checkpoint

suite: mmlu · 14,042 items
model_ref: llama-3.1-70b+lora
seed: (not shown)
answer_root: e8d2…41aa
Δ vs incumbent: +1.9 · deterministic

A score you can’t reproduce — and can’t bind to a model — is marketing, not assurance.

“92% on MMLU” — against which checkpoint, with which prompts? You re-run next quarter and the number moved. Was that the model, the suite, the seed, or someone editing a cell? A signed receipt answers it; a screenshot of a notebook output is a picture of a number.

What gets signed

Which model. Which suite. Every answer.

model_reference

The exact model id, optional AI bill-of-materials, and LoRA adapter under test — so the score is bound to a checkpoint, not a brand name.

suite_merkle_root

A Merkle root over the benchmark itself — MMLU, HELM, TruthfulQA, GSM8K, a red-team battery, or your internal golden set — pinning exactly what was tested.

answer_merkle_root

A Merkle root over the per-item outputs and verdicts (correct / incorrect / refused / error) — what the model actually produced.

seed

The seed the run executed with. Same model + same pinned suite + same seed reproduces the same answer root, bit-for-bit.
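As a rough illustration of how a Merkle root can pin per-item answers, here is a minimal Python sketch. The leaf layout (`item id|output hash|verdict`), the domain-separation bytes, and the helper names `answer_leaf` and `merkle_root` are assumptions made for this example, not Trinitite's actual encoding.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def answer_leaf(item_id: str, output: str, verdict: str) -> bytes:
    """Hypothetical leaf record: the output is stored as a hash, not raw text."""
    output_hash = hashlib.sha256(output.encode()).hexdigest()
    return f"{item_id}|{output_hash}|{verdict}".encode()


def merkle_root(leaves: list[bytes]) -> str:
    """Return the hex Merkle root over serialized leaf records."""
    if not leaves:
        return sha256(b"").hex()
    # Prefix leaves with 0x00 and interior nodes with 0x01 so the two
    # kinds of hash can never be confused with each other.
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level.append(level[-1])
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Illustrative three-item run and an identical replay of it.
run = [
    answer_leaf("q1", "B", "correct"),
    answer_leaf("q2", "D", "incorrect"),
    answer_leaf("q3", "", "refused"),
]
replay = list(run)

# The same run with one verdict quietly flipped.
doctored = [run[0], answer_leaf("q2", "D", "correct"), run[2]]

root = merkle_root(run)
print(root == merkle_root(replay))    # identical answers give an identical root
print(root == merkle_root(doctored))  # a single changed verdict changes the root
```

The same function could be applied to the suite items to produce a suite root; the point of the sketch is only that one changed item anywhere in the list yields a different root.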

How it works

Run, record, sign, verify, compare.

Run a suite

Execute MMLU, HELM, GSM8K, a red-team battery, or a custom suite against a model reference.

Record the run

Every item’s input hash, output hash, and verdict — correct, incorrect, refused, or error.

Get a signed receipt

Binds model + suite + seed + per-item answers + aggregate accuracy / refusal / error rates, KMS-signed and Merkle-rooted.

Verify any time

Trinitite recomputes the attestation hash and re-checks the signature — failing closed if a stored metric was tampered with, even a doctored database row.

Compare

Diff two receipts to get a signed champion/challenger accuracy delta, carrying a deterministic flag, for SR 11-7 review.
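The sign and verify steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a local HMAC key stands in for the KMS signature, the field names and values are illustrative, and `attestation_hash`, `sign_receipt`, and `verify_receipt` are hypothetical helpers rather than Trinitite's API.

```python
import hashlib
import hmac
import json

# Assumption: a local demo key stands in for a key held in a KMS.
SIGNING_KEY = b"demo-key-not-a-real-kms-key"


def attestation_hash(receipt: dict) -> str:
    """Hash the canonical JSON of every field except the hash and signature."""
    body = {
        k: v for k, v in receipt.items()
        if k not in ("attestation_hash", "signature")
    }
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def sign_receipt(fields: dict) -> dict:
    receipt = dict(fields)
    receipt["attestation_hash"] = attestation_hash(receipt)
    receipt["signature"] = hmac.new(
        SIGNING_KEY, receipt["attestation_hash"].encode(), hashlib.sha256
    ).hexdigest()
    return receipt


def verify_receipt(receipt: dict) -> bool:
    """Fail closed: a missing field, hash mismatch, or bad signature is False."""
    try:
        recomputed = attestation_hash(receipt)
        expected_sig = hmac.new(
            SIGNING_KEY, recomputed.encode(), hashlib.sha256
        ).hexdigest()
        return (
            hmac.compare_digest(recomputed, receipt["attestation_hash"])
            and hmac.compare_digest(expected_sig, receipt["signature"])
        )
    except (KeyError, TypeError, ValueError):
        return False


# Illustrative values only; the roots are placeholders, not real digests.
receipt = sign_receipt({
    "model_reference": "llama-3.1-70b+lora",
    "suite": "mmlu",
    "suite_merkle_root": "aa" * 32,
    "answer_merkle_root": "bb" * 32,
    "seed": 1234,
    "accuracy": 0.876,
})

# Simulate a doctored database row: the stored metric is edited after signing.
doctored = dict(receipt, accuracy=0.92)

print(verify_receipt(receipt))   # untouched receipt verifies
print(verify_receipt(doctored))  # edited metric no longer matches the hash
```

Because the aggregate metric is part of the hashed body, editing it after the fact breaks the recomputed hash even though the stored signature is left alone.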

In your language

A benchmark that re-verifies without trusting you.

Model Risk / SR 11-7 examiner

The challenger comparison the framework requires — candidate vs incumbent on the same suite, signed, with a determinism check.

Head of AI / ML

A benchmark history you can trust: every score bound to a specific checkpoint, re-runnable, diffable.

Internal audit / Big-4

Accuracy and robustness evidence that rolls into the workpaper and re-verifies without trusting your notebook.

Regulatory affairs

Per-suite evidence for EU AI Act Art. 15 (accuracy, robustness), NIST AI RMF MEASURE-2.x, ISO 42001 §B.6.2.5.

AI vendor

Publish a capability claim a buyer’s risk team can re-verify — not a number they have to take on faith.

The eval harness is the signing primitive under the Evals module; it feeds model risk management (SR 11-7) and shares the proof chain of deterministic replay.

FAQ

Eval harness, answered

What is an eval harness?

An eval harness runs a model against a benchmark suite and records the result. Trinitite’s eval harness turns that run into a signed, replayable receipt: it pins the model reference, a Merkle root over the suite, and a Merkle root over the per-item answers, then KMS-signs the whole envelope. Because the kernel is deterministic, re-running the same model on the same pinned suite reproduces the same answer root, bit-for-bit.

How does this satisfy SR 11-7?

Effective challenge under SR 11-7 calls for a candidate-vs-incumbent comparison on the same suite. The eval harness produces a signed accuracy / refusal / error delta between two receipts, carrying a deterministic flag. The comparison is therefore reproducible and citeable by id, rather than two spreadsheets and a promise that the runs are comparable.
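A champion/challenger delta of this kind might look like the sketch below. The `Receipt` fields, the receipt ids, and the rule for setting the `deterministic` flag (both runs pin the same suite root and the same seed) are assumptions for illustration; the page does not spell out how the flag is derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    receipt_id: str
    suite_merkle_root: str
    seed: int
    accuracy: float
    refusal_rate: float
    error_rate: float


def compare(champion: Receipt, challenger: Receipt) -> dict:
    """Return challenger-minus-champion deltas in percentage points."""
    def delta(new: float, old: float) -> float:
        return round((new - old) * 100, 2)

    # Assumed rule: the runs are only flagged comparable when both
    # pin the same suite root and were executed with the same seed.
    comparable = (
        champion.suite_merkle_root == challenger.suite_merkle_root
        and champion.seed == challenger.seed
    )
    return {
        "champion": champion.receipt_id,
        "challenger": challenger.receipt_id,
        "accuracy_delta_pts": delta(challenger.accuracy, champion.accuracy),
        "refusal_delta_pts": delta(challenger.refusal_rate, champion.refusal_rate),
        "error_delta_pts": delta(challenger.error_rate, champion.error_rate),
        "deterministic": comparable,
    }


# Hypothetical receipts; ids, roots, and rates are made up for the example.
champion = Receipt("eh_demo_champion", "aa" * 32, 1234, 0.857, 0.010, 0.002)
challenger = Receipt("eh_demo_challenger", "aa" * 32, 1234, 0.876, 0.008, 0.002)

result = compare(champion, challenger)
print(result["accuracy_delta_pts"], result["deterministic"])
```

In a full system the resulting delta would itself be signed; the sketch stops at computing it so that the comparability rule stays easy to read.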

Do receipts store my raw prompts and answers?

No. Receipts store input and output hashes, not raw text — so they are workpaper-embeddable while still spot-checkable per item. The verify endpoint recomputes the attestation hash and re-checks the signature, failing closed on any post-hoc tampering.
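One way a per-item spot check can work with hashes alone is a Merkle inclusion proof: an auditor who holds the raw prompt and answer for one item rebuilds that item's leaf and checks it against the signed root. The sketch below assumes a hypothetical leaf layout (`item id|input hash|output hash|verdict`) and hypothetical helpers; it is not a description of Trinitite's verify endpoint.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_record(item_id: str, prompt: str, answer: str, verdict: str) -> bytes:
    """Hypothetical leaf: only hashes of the raw text are kept."""
    return "|".join([
        item_id,
        hashlib.sha256(prompt.encode()).hexdigest(),
        hashlib.sha256(answer.encode()).hexdigest(),
        verdict,
    ]).encode()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Build every tree level, bottom-up; the last level holds the root."""
    level = [h(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_item(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    node = h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = h(b"\x01" + (sibling + node if sibling_is_left else node + sibling))
    return node == root


# Illustrative three-item run.
leaves = [
    leaf_record("q1", "What is 2 + 2?", "4", "correct"),
    leaf_record("q2", "Capital of France?", "Lyon", "incorrect"),
    leaf_record("q3", "How do I pick a lock?", "", "refused"),
]
levels = build_levels(leaves)
root = levels[-1][0]
proof = inclusion_proof(levels, 2)

# The auditor rebuilds item q3's leaf from raw text they hold themselves.
rebuilt = leaf_record("q3", "How do I pick a lock?", "", "refused")
claimed = leaf_record("q3", "How do I pick a lock?", "", "correct")

print(verify_item(rebuilt, proof, root))  # matches the signed root
print(verify_item(claimed, proof, root))  # a different verdict does not
```

The proof grows with the logarithm of the suite size, so a single item can be checked without sharing or re-hashing the other items in the run.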

How does this relate to the Evals module?

The eval harness is the signing primitive underneath the Evals module: when Trinitite tests your agent’s behavior with our judge, each run mints one of these same eh_… receipts. The harness scores a model against a benchmark suite; the Evals module scores an agent against a rubric. Both feed model risk management.

Record one benchmark run. Re-verify the receipt.

Run a suite against a candidate model, get a signed receipt binding the model, the suite, and every verdict — then re-verify it yourself, no notebook required.

Trinitite

AI governance that catches mistakes, proves compliance, and shows the board what it saved—in dollars.

Trinitite is built by Fiscus Flows, Inc.
