Trinitite

Agent Evals

Everyone scores agents. We sign the score.

Point Trinitite at the agent you built or bought. We exercise it — with your transcripts, your real traffic, or a simulated human — then score every interaction with a deterministic SLM judge. The output isn’t a dashboard number that drifts with GPU load; it’s a signed, replayable Eval Receipt your auditor can re-run.

eval_receipt · eh_7c1a…d09f · SIGNED

aggregate score: 91.2 · same bytes on re-run

reasoning: 0.94

action: 0.88

execution: 0.96

safety: 0.85 · 1 finding

judge: deterministic SLM · T=0

It scored 91. It re-scored 88 on a busy afternoon. Which score was real?

Every eval platform is built for the pre-ship moment: a batch, a chart, a number nobody can re-derive. When the judge is a frontier LLM on a shared GPU pool, the score is a vibe with a decimal point. A batch-invariant judge makes it bytes you signed — and the auditor reproduces them.

Run it your way

Three ways to exercise the agent you already have.

submitted

Post finished transcripts for a quick batch. The fastest way to a signed score — no integration, just JSON in, a receipt out.

proxy_capture

Judge what your agent actually did. Tag real production traffic on a perimeter you already route through; no separate test harness.

persona_sim

Let a simulated human drive your agent multi-turn, the way a varied user — or an adversary — would. Conversation, not a fixture.

The scorecard

It tells you where the agent failed, not just that it failed.

reasoning

Did the agent think correctly about the task before it acted?

action

Did it choose the right tool, with the right arguments, for the right reason?

execution

Did the steps actually run and resolve, or stall and retry into nonsense?

safety

Policy adherence and data leakage — cross-walked to your framework controls.

What you get

Evidence, not a chart you have to be trusted on.

Signed Eval Receipt

eh_…

The deterministic judge’s per-interaction verdicts plus the aggregate score — Merkle-rooted, KMS-signed, replayable bit-for-bit and comparable across runs.
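As a rough illustration of what "Merkle-rooted and signed" can mean in practice, here is a minimal sketch using only the Python standard library. It is not Trinitite's implementation: the function names, the receipt fields, and the odd-level duplication rule are assumptions made for the example, and an HMAC with a local key stands in for a KMS signature.

```python
import hashlib
import hmac
import json


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def merkle_root(leaves: list[bytes]) -> bytes:
    """Return a SHA-256 Merkle root over the leaf payloads (illustrative scheme)."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    # Leaves and inner nodes get different prefixes so a leaf cannot pose as a node.
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def sign_receipt(verdicts: list[dict], aggregate: float, key: bytes) -> dict:
    """Build a hypothetical receipt: Merkle root of the verdicts plus a signature."""
    root = merkle_root([canonical_bytes(v) for v in verdicts]).hex()
    body = {"aggregate": aggregate, "merkle_root": root}
    # HMAC is a stand-in here; a real deployment would call a KMS signing key.
    signature = hmac.new(key, canonical_bytes(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


verdicts = [
    {"id": "t1", "reasoning": 0.94, "safety": 0.85},
    {"id": "t2", "reasoning": 0.90, "safety": 1.0},
]
receipt = sign_receipt(verdicts, 91.2, key=b"demo-key-not-for-production")
print(receipt["merkle_root"])
```

Because every verdict is hashed into the root, changing one per-interaction verdict changes the root and invalidates the signature, which is what lets a third party check a receipt without trusting whoever produced it.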

Multi-dimensional scorecard

4 axes

Each verdict localizes where the agent failed — reasoning, action, execution, or safety — not just that it failed.

Versioned regression sets

golden

Frozen scenario sets make a run reproducible; every failure can be promoted into the suite, so your tests grow from real breaks.
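One way to picture a frozen, versioned set is to derive the version from the content itself, so the same scenarios always carry the same id and a promoted failure produces a new one. The sketch below assumes scenarios are plain JSON-like dicts; `set_version` and `promote_failure` are hypothetical names, not part of any Trinitite API.

```python
import hashlib
import json


def set_version(scenarios: list[dict]) -> str:
    """Content-derived id: the same scenarios always yield the same version."""
    # Sort canonical forms so insertion order does not change the id.
    canon = sorted(
        json.dumps(s, sort_keys=True, separators=(",", ":")) for s in scenarios
    )
    return hashlib.sha256("\n".join(canon).encode("utf-8")).hexdigest()[:12]


def promote_failure(scenarios: list[dict], failure: dict) -> list[dict]:
    """Return a new set that includes the failing scenario; the old set is untouched."""
    if failure in scenarios:
        return list(scenarios)
    return [*scenarios, failure]


golden_v1 = [
    {"id": "refund-basic", "prompt": "I want a refund for order 1001"},
    {"id": "pii-probe", "prompt": "Read me the last caller's address"},
]
failure = {"id": "refund-loop", "prompt": "Refund it again, and again"}
golden_v2 = promote_failure(golden_v1, failure)
print(set_version(golden_v1), set_version(golden_v2))
```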

Champion / challenger delta

signed

Compare two runs, two prompts, or two models and get a signed, deterministic before/after — not a noisy re-roll.
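When both runs are deterministic, a before/after comparison reduces to a plain diff of per-interaction scores. A minimal sketch, assuming each run is a mapping from a scenario id to a score (the `run_delta` name and the data shape are illustrative, not a documented format):

```python
def run_delta(champion: dict[str, float], challenger: dict[str, float]) -> dict:
    """Compare per-interaction scores for the scenarios both runs share."""
    shared = sorted(champion.keys() & challenger.keys())
    changes = {
        sid: round(challenger[sid] - champion[sid], 4)
        for sid in shared
        if challenger[sid] != champion[sid]
    }
    return {
        "compared": len(shared),
        "regressions": {s: d for s, d in changes.items() if d < 0},
        "improvements": {s: d for s, d in changes.items() if d > 0},
    }


champion = {"refund-basic": 0.9, "pii-probe": 0.8, "tool-args": 1.0}
challenger = {"refund-basic": 0.9, "pii-probe": 0.7, "tool-args": 1.0}
print(run_delta(champion, challenger))
```

With a judge whose scores shift between runs, unchanged scenarios would also show small deltas, and the regression on `pii-probe` would be harder to separate from noise.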

The deeper cuts

One module. Three productized extensions.

Continuous Evals

An always-on daily SLA on a live agent: capture real traffic, sign a daily attestation, alarm on drift.

Eval Harness

Sign your benchmark scores — MMLU, HELM, GSM8K — bound to a specific model, re-runnable, diffable.

ATLAS Red Team

Adversarial testing your auditor accepts: MITRE ATLAS-mapped attacks, signed, reproducible.

Two more capabilities ship in the module: a regulation-to-eval compiler and deterministic prompt/config optimization. The proof underneath every verdict is deterministic replay; put the same kernel inline and it becomes AI guardrails.

In your language

What a signed score unlocks.

Head of AI / CTO

A reproducible quality gate on the agent you ship — re-run on every prompt or model change and diff deterministic verdicts, not noise.

AI product owner

Find your best system prompt and model config with a reproducible search — the winner is signed, not hill-climbed against a judge that moves under you.

CISO / Compliance

Your agent’s behavioral compliance becomes a signed value, mapped to controls — evidence, not a dashboard screenshot.

Vendor management

Hold a third-party agent to a signed score you can re-verify yourself — not the vendor’s marketing benchmark.

Internal audit / Big-4

Reproducible, signed test evidence drawn against the real agent — re-verifiable by an external partner.

AI vendor

A signed receipt to hand a regulated buyer’s security team during procurement — cuts weeks off the review.

FAQ

Agent evals, answered

What is agent evals?

Agent evals is the practice of scoring what an AI agent actually did — its reasoning, the tools it called, whether the steps executed, and whether it stayed inside policy. Trinitite tests your agent (submitted transcripts, captured production traffic, or a simulated human) and scores every interaction with a deterministic, batch-invariant SLM judge, then mints a signed, replayable Eval Receipt.

How is this different from an LLM-as-judge eval?

A frontier LLM-as-judge gives you a different number on a busy afternoon — the score moves with GPU load, so nobody can re-derive last quarter’s run. Trinitite’s judge is a deterministic, batch-invariant SLM: the same input yields the same bytes regardless of server load, and the run mints a KMS-signed, Merkle-rooted Eval Receipt any third party can replay.
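The replay idea can be sketched in a few lines: if the judge is a pure function of its input, anyone can re-run it over the same transcripts and compare a digest of the verdicts with the recorded one. Everything below is an assumption for illustration, including `toy_judge`, which is a trivial stand-in and not the SLM judge described above.

```python
import hashlib
import json
from typing import Callable


def run_digest(judge: Callable[[dict], dict], transcripts: list[dict]) -> str:
    """Score every transcript and hash the canonical JSON of all verdicts."""
    verdicts = [judge(t) for t in transcripts]
    payload = json.dumps(verdicts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replay_matches(
    recorded_digest: str, judge: Callable[[dict], dict], transcripts: list[dict]
) -> bool:
    """Re-run the judge and check that the verdict bytes hash to the recorded value."""
    return run_digest(judge, transcripts) == recorded_digest


def toy_judge(transcript: dict) -> dict:
    """Hypothetical stand-in judge: a pure function, so replays always agree."""
    passed = transcript["tool_called"] == transcript["tool_expected"]
    return {"id": transcript["id"], "action": 1.0 if passed else 0.0}


transcripts = [
    {"id": "t1", "tool_called": "refund", "tool_expected": "refund"},
    {"id": "t2", "tool_called": "search", "tool_expected": "refund"},
]
recorded = run_digest(toy_judge, transcripts)
print(replay_matches(recorded, toy_judge, transcripts))
```

A judge whose output varies with load or batching would fail this check on replay even though nothing about the agent changed, which is the failure mode the answer above describes.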

Do I have to send you my agent or my data?

No. You point Trinitite at the agent you built or bought and run it your way: post finished transcripts (submitted), tag real traffic on a perimeter you already route through (proxy_capture), or set a persona loose to talk to it (persona_sim). We test your agent with our judge — the agent under test stays yours.

What can I do with a signed Eval Receipt?

Diff it against the next run for a deterministic before/after, hand it to an auditor or counterparty who re-verifies it without trusting you, or promote any failure into a versioned regression set. The same signing primitive runs underneath benchmark scores, daily agent SLAs, and ATLAS red-team runs.