NEW RESEARCH: Your Sandbox Is Made of Glass

Read

Trinitite

PricingResearchBlogPodcasts

ATLAS Red Team · Adversarial Evidence

Anyone can run attacks. We sign the evidence.

Point an adversarial persona swarm at your real agent. Every attack maps to a MITRE ATLAS technique, every verdict is scored by a deterministic SLM judge, and the run mints a signed ATLAS attestation your auditor re-verifies — not a screenshot of “we asked the model not to do bad things.”

atlas_attestation · rt_9c01…b7d4

SIGNED

93.5%

probe pass rate · 248 probes

critical_failures

2

probe_set_hash

a14b…0d9f

top technique

AML.T0051.000

judge

deterministic · T=0

re-verify

no Trinitite login

The attacks are creative and non-deterministic. The scoring is deterministic and signable.

We test their agent with our judge. A large model plays the creative attacker; a batch-invariant SLM scores every transcript at T=0. That split is why a robustness section stops being the weak link in the review.

The gap

What you have vs. what your auditor wants.

What you have

A vibes-based red-team — someone tried jailbreaks they remembered from Twitter. No coverage map, no severity, no repeatability.

What your auditor wants

A recognized taxonomy: attacks mapped to MITRE ATLAS techniques, not a vendor’s private list.

What you have

A screenshot. "Look, it refused this one prompt." One example, against a model that behaves differently next time.

What your auditor wants

A pass/fail verdict per attack, with a rationale — cited, severity-calibrated, tied to a technique id.

What you have

A six-figure pen-test PDF you can’t re-run after a prompt change.

What your auditor wants

Cryptographic evidence the run happened as claimed — a signed attestation binding the probe set and the verdicts.

What you have

A robustness section that reads "we tried some attacks" — no longer acceptable for a regulated SKU.

What your auditor wants

Reproducibility: re-run after a fix and get a comparable, deterministic verdict, not a different answer every time.

prompt_injection

jailbreak

pii_leak

data_exfiltration

indirect_injection

roleplay_evasion

Regulatory hooks

An attestation built to be cited verbatim.

MITRE ATLAS

the matrix itself

Every probe carries ≥1 ATLAS technique id; the attestation is the auditor’s anchor.

SR 11-7

§IV.B effective challenge

Adversarial testing of the model’s behavior, with evidence.

NIST AI RMF

MANAGE-2.2 / MEASURE-2.7

Adversarial testing plus output validity and safety.

EU AI Act

Art. 15 robustness; Art. 9 risk mgmt

Documented adversarial robustness testing.

ISO 42001

§B.6.2.6

Security testing for AI systems.

OWASP LLM Top-10

the vulnerability taxonomy

Maps onto the probe categories exercised.

ATLAS red team is one exercise mode of the Evals module; failed attacks promote into a regression set, and the runtime fix lives in AI guardrails and prompt injection defense.

FAQ

AI red teaming, answered

What is AI red teaming?

AI red teaming is adversarial testing of an AI system — deliberately attacking it with prompt injection, jailbreaks, PII extraction, and data exfiltration to find where it breaks. Trinitite drives an adversarial persona swarm against your real agent, maps every attack to a MITRE ATLAS technique, and scores each one with a deterministic SLM judge so the result is signed, reproducible evidence rather than a screenshot.

Why does the judge need to be deterministic if the attacks are creative?

The attacks should be non-deterministic and creative — that is what makes them realistic. The scoring should be deterministic — that is what makes it evidence. Trinitite splits the two: a large model acts as the creative attacker, and a batch-invariant SLM judge scores every transcript at temperature 0, so the same attack transcript yields the same verdict, bit-for-bit, and you get a true before/after when you ship a fix.

What does the auditor actually receive?

One run yields two signed artifacts: an Eval Receipt (the judge’s reproducible per-attack verdicts) and an ATLAS attestation (rt_…) binding the probe-set hash, per-probe pass/fail, pass rate, critical-failure count, and a KMS signature. Each judged item carries its probe_id and atlas_techniques[] (e.g. AML.T0051.000), so a finding maps to a technique, not a vibe — and the attestation re-verifies without Trinitite.

How does this fit the rest of the platform?

An ATLAS run is one exercise mode of the broader Evals module. Any attack your agent failed can be promoted into a versioned regression set and re-run on every change; the gaps it surfaces inform the inline AI guardrails and prompt injection defense that block them in production.

Point us at a test endpoint. Get a signed ATLAS attestation.

We drive an adversarial persona swarm against your agent, score every attack with a deterministic judge, and hand back a signed, ATLAS-mapped attestation your auditor re-verifies.