What is AI Red Teaming?

MITRE ATLAS adversarial testing

AI red teaming attacks your own AI agent to find failures before an adversary does — prompt injection, jailbreaks, PII extraction, and data exfiltration. Trinitite runs an adversarial persona swarm, maps every probe to a MITRE ATLAS technique, scores each transcript with a deterministic judge at temperature 0, and signs an ATLAS attestation auditors re-verify.

The attacks are creative and non-deterministic; the scoring must be deterministic and signable. A persona swarm drives your real agent multi-turn while a T=0 SLM judge scores every transcript, so the same evidence pack reproduces the same verdict.

One run yields a signed Eval Receipt and a signed ATLAS attestation binding the probe-set hash, per-probe pass/fail, pass rate, and critical-failure count. Every probe carries a MITRE ATLAS technique id — the robustness evidence SR 11-7 §IV, NIST AI RMF MANAGE-2.2, EU AI Act Art. 15, and ISO 42001 §B.6.2.6 ask for. Failed attacks promote into a regression set; the runtime fix lives in AI guardrails and prompt injection defense.

Related terms

Agent Evals

What is Agent Evals? →

Prompt Injection Defense

What is Prompt Injection Defense? →

AI Guardrails

What is AI Guardrails? →