Trinitite

Definition

What are Agent Evals?

AI agent evaluation

Agent evals are tests that measure how well an AI agent reasons, acts, executes, and stays safe. Trinitite exercises the agent you built or bought, using submitted transcripts, captured production traffic, or a simulated human. It scores each interaction with a deterministic SLM judge and mints a signed, replayable Eval Receipt instead of a dashboard number.

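Most eval tooling renders a pass rate from your own database and judges with an LLM on a shared GPU pool. On shared hardware, how requests are batched depends on load, so the same transcript can score differently at high utilization than at low. The score becomes a claim no one can re-run. A deterministic, batch-invariant judge returns the same verdict for the same transcript regardless of what else is in the batch, which makes the verdict reproducible bit-for-bit.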
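To make "signed and replayable" concrete, here is a minimal Python sketch of the general idea. It is not Trinitite's receipt format. The names `toy_judge`, `mint_receipt`, `verify_receipt`, and `SECRET_KEY` are hypothetical, the judge is a trivial stand-in for a real deterministic model, and the HMAC signature is an assumption chosen for brevity. A receipt meant for third-party verification would more plausibly use an asymmetric signature, so the verifier never holds the signing key.

```python
import hashlib
import hmac
import json

# Hypothetical signing key, for illustration only.
SECRET_KEY = b"demo-key-not-for-production"


def canonical(obj) -> bytes:
    """Serialize to a stable byte string so hashes do not depend on key order."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def toy_judge(transcript: list) -> dict:
    """Stand-in deterministic judge: a pure function of the transcript."""
    agent_turns = [t for t in transcript if t["role"] == "agent"]
    leaked = any("password" in t["text"].lower() for t in agent_turns)
    return {"safe": not leaked, "agent_turns": len(agent_turns)}


def mint_receipt(transcript: list) -> dict:
    """Bind the verdict to a hash of the exact input, then sign both."""
    body = {
        "transcript_sha256": hashlib.sha256(canonical(transcript)).hexdigest(),
        "verdict": toy_judge(transcript),
    }
    signature = hmac.new(SECRET_KEY, canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_receipt(receipt: dict, transcript: list) -> bool:
    """Check the signature, the input hash, and that a replay gives the same verdict."""
    expected = hmac.new(SECRET_KEY, canonical(receipt["body"]), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, receipt["signature"]):
        return False
    if receipt["body"]["transcript_sha256"] != hashlib.sha256(canonical(transcript)).hexdigest():
        return False
    return toy_judge(transcript) == receipt["body"]["verdict"]
```

The replay step in `verify_receipt` only works because the judge is deterministic. With a judge whose output varies from run to run, an honest receipt could fail its own verification.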

Three drive modes cover the lifecycle: submitted (score batches of finished transcripts), proxy_capture (judge real production traffic at a perimeter you already route through), and persona_sim (a simulated human drives the agent over multiple turns). Every verdict localizes failure on a four-axis scorecard, and any failure can be promoted into a versioned regression set. Champion/challenger comparisons are signed, deterministic deltas.
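The scorecard, regression set, and champion/challenger delta described above can be sketched as plain data structures. This is an illustration under stated assumptions, not Trinitite's schema. The axis names in `AXES` are guessed from the "reasons, acts, executes, and stays safe" wording, and `Verdict`, `RegressionSet`, and `delta` are hypothetical names. Signing is left out here.

```python
from dataclasses import dataclass, field

# Assumed axis names; the source only says the scorecard has four axes.
AXES = ("reasoning", "action", "execution", "safety")


@dataclass(frozen=True)
class Verdict:
    case_id: str
    scores: dict  # axis name -> True (pass) or False (fail)

    def failed_axes(self) -> list:
        """Localize the failure: which axes did this interaction fail?"""
        return [axis for axis in AXES if not self.scores[axis]]


@dataclass
class RegressionSet:
    version: int = 0
    case_ids: list = field(default_factory=list)

    def promote(self, verdict: Verdict) -> None:
        """Add a failing case once; bump the version whenever the set changes."""
        if verdict.failed_axes() and verdict.case_id not in self.case_ids:
            self.case_ids.append(verdict.case_id)
            self.version += 1


def delta(champion: list, challenger: list) -> dict:
    """Per-axis pass-count difference (challenger minus champion) on the same cases."""
    return {
        axis: sum(v.scores[axis] for v in challenger) - sum(v.scores[axis] for v in champion)
        for axis in AXES
    }
```

Because each verdict is a deterministic function of its transcript, a delta computed this way is itself reproducible: re-running both agents' transcripts through the judge yields the same per-axis differences.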

See Agent Evals in action.

Run the free 1,000-log pre-audit and get a signed, reproducible report you can verify in a browser — no NDA.