Trinitite

Continuous Evals · Agent SLA Telematics

You ran an eval once. Now sign it every day.

A one-time score of 94 becomes fiction the moment a provider updates the model behind your endpoint. Give the agent you operate a telematics box: route its traffic through a Trinitite perimeter, judge every interaction with a deterministic SLM, and roll each day into one signed Eval Attestation with drift detection.

eval_attestation · ema_2026-06-26 · ANCHORED

96.1% signed daily agent pass rate

pass / fail: 4,118 / 167
psi_drift: 0.07 · stable
min_pass_rate: 0.95 · held
receipt_id: eh_b2f0…7a1c
anchor: RFC 3161 + Rekor

Anyone can score your agent once. We sign its behavior every day.

The model behind your endpoint changed. Your prompt got edited. Your tool catalog shifted. None of it announces itself. A daily number that can’t move just because the GPU pool got busy is the only way you find out from a pager instead of from a customer.

How it works

Bind a monitor. Tag your traffic. Read a signed number.

Create a proxy_capture eval

Define your rubric and the Agent-Under-Test descriptor — the agent you operate, reached via proxy, MCP gateway, or CLI firewall.

Bind a monitor

POST a cron schedule and an optional min_pass_rate floor. Trinitite returns the id of an open capture run for the window.

Tag your traffic

Stamp X-Trinitite-Eval-Run on requests. Real production trajectories attach to the window — no separate test harness.

Each tick rolls the window

The deterministic judge scores every captured trajectory, mints the per-window Eval Receipt, and signs and anchors a daily Eval Attestation.

Drift & regression alarm

Webhooks fire the moment the pass-rate distribution shifts or drops below your floor — you find out from a pager, not a customer.
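
The bind and tag steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not Trinitite's client: only the min_pass_rate floor and the X-Trinitite-Eval-Run header come from this page. The "cron" field name, the placeholder run id, and both helper functions are hypothetical, and no real endpoint is called.

```python
import json
from typing import Optional

EVAL_RUN_HEADER = "X-Trinitite-Eval-Run"  # header name taken from this page


def monitor_payload(cron: str, min_pass_rate: Optional[float] = None) -> str:
    """Build a JSON body for binding a monitor (field names are assumed)."""
    body = {"cron": cron}
    if min_pass_rate is not None:
        if not 0.0 <= min_pass_rate <= 1.0:
            raise ValueError("min_pass_rate must be between 0 and 1")
        body["min_pass_rate"] = min_pass_rate
    return json.dumps(body, sort_keys=True)


def tag_headers(run_id: str, headers: Optional[dict] = None) -> dict:
    """Return a copy of the request headers stamped with the capture run id."""
    tagged = dict(headers or {})
    tagged[EVAL_RUN_HEADER] = run_id
    return tagged


# Hypothetical values: a daily cron at midnight and a placeholder run id.
payload = monitor_payload("0 0 * * *", min_pass_rate=0.95)
headers = tag_headers("run_example_id", {"Content-Type": "application/json"})
```

tag_headers copies the incoming headers rather than mutating them, so the same base headers can be reused when the capture run id changes from one window to the next.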

What you get

Signed daily, alarmed on drift.

Daily Eval Attestation

ema_…

pass_rate, pass_count, fail_count, PSI, and the window’s receipt_id — KMS-signed and externally anchored, reproducible by a counterparty.

Per-window Eval Receipt

eh_…

The deterministic judge’s per-trajectory verdicts, replayable bit-for-bit.

PSI drift signal

webhook

Population Stability Index between today’s {pass, fail} distribution and a trailing baseline. Fires evals.monitor.drift_detected on breach.

Regression alarm

webhook

Fires evals.monitor.regression the day the pass rate drops below your min_pass_rate floor.
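
The arithmetic behind the daily number and the regression alarm can be illustrated with the counts from the sample attestation above. This is an illustrative sketch only: WindowSummary and is_regression are hypothetical names, not part of any Trinitite SDK, and the real attestation is produced by the signed judge pipeline rather than by client code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowSummary:
    """Pass/fail counts for one capture window (hypothetical structure)."""
    pass_count: int
    fail_count: int

    @property
    def pass_rate(self) -> float:
        total = self.pass_count + self.fail_count
        if total == 0:
            raise ValueError("window has no judged trajectories")
        return self.pass_count / total


def is_regression(summary: WindowSummary, min_pass_rate: float) -> bool:
    """True when the window's pass rate falls below the configured floor."""
    return summary.pass_rate < min_pass_rate


# Counts from the sample attestation: 4,118 passes and 167 fails.
window = WindowSummary(pass_count=4118, fail_count=167)
print(f"{window.pass_rate:.1%}")    # 96.1%
print(is_regression(window, 0.95))  # False: the 0.95 floor held
```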

In your language

A live SLA, in the terms you report on.

Head of AI Platform

A live SLA on the agent you operate. The day a provider swaps the model under your endpoint, PSI lights up.

Vendor management

Hold a third-party agent vendor to a signed, anchored daily score — not their marketing benchmark.

CISO / Compliance

Your agent’s behavioral compliance is a signed value every day, feeding the same MRM and attestation surfaces via violated_controls[].

Insurance underwriter

A streaming pricing signal on the insured agent — bad days light up riders within 24 hours.

Continuous Evals is the always-on cut of the broader Evals module; for benchmark scores, see the Eval Harness, and for adversarial coverage, the ATLAS red team.

FAQ

Continuous evals, answered

What is Continuous Evals?

Continuous Evals is an always-on SLA on a live AI agent. You route the agent’s traffic through a Trinitite perimeter tagged to an Eval Monitor; every interaction is judged by a deterministic SLM, and each window rolls into one signed daily Eval Attestation with drift detection. It answers "is my agent still behaving?" with a provable, reproducible number rather than a vibe.

How is this different from Continuous Assurance?

They are inverses on the same deterministic substrate. Continuous Assurance streams Trinitite’s own Guardian verdicts into one signed daily compliance number. Continuous Evals streams your agent’s traffic and scores it with our judge: telematics for the agent you bought or built.

How does drift detection work?

A nightly roll computes the Population Stability Index (PSI) between today’s pass/fail distribution and a trailing baseline. On breach, Trinitite fires an evals.monitor.drift_detected webhook with the structured comparison, so you learn the same day that a provider swapped a model, a prompt changed, or the traffic mix shifted. A separate evals.monitor.regression webhook fires when the pass rate drops below the floor you set.
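
For readers unfamiliar with PSI, the standard formula sums (today − baseline) × ln(today / baseline) over the categories, here pass and fail. The sketch below is a generic implementation of that formula, not Trinitite's code. The baseline counts, the epsilon floor, and the 0.1 cutoff (a commonly cited rule of thumb) are assumptions; this page does not state the breach threshold Trinitite uses.

```python
import math


def psi(baseline: dict, today: dict, eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical count distributions."""
    b_total = sum(baseline.values())
    t_total = sum(today.values())
    if b_total == 0 or t_total == 0:
        raise ValueError("both distributions need at least one observation")
    score = 0.0
    for key in baseline.keys() | today.keys():
        # Floor each share at eps so an empty category does not hit log(0).
        b = max(baseline.get(key, 0) / b_total, eps)
        t = max(today.get(key, 0) / t_total, eps)
        score += (t - b) * math.log(t / b)
    return score


# Hypothetical trailing baseline; today's counts come from the sample card.
trailing = {"pass": 9600, "fail": 400}
stable_day = {"pass": 4118, "fail": 167}
shifted_day = {"pass": 850, "fail": 150}  # hypothetical bad day

ASSUMED_THRESHOLD = 0.1  # rule-of-thumb cutoff, not a documented Trinitite value
for name, day in (("stable", stable_day), ("shifted", shifted_day)):
    print(name, psi(trailing, day) > ASSUMED_THRESHOLD)
```

Because PSI compares shares rather than raw counts, a day with lighter or heavier traffic but the same pass/fail mix scores near zero.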

Do I have to stand up a test harness?

No. Continuous Evals attaches your real production trajectories to the capture window via an X-Trinitite-Eval-Run header on traffic you already route through a Trinitite proxy, MCP gateway, or CLI firewall. There is no separate synthetic harness to maintain.