NEW RESEARCH: Your Sandbox Is Made of Glass

Read

Trinitite

AI Guardrails · Evidence, Not Just Blocks

AI guardrails that produce evidence.

A blocklist tells you it stopped something. A Trinitite guardrail returns a five-valued verdict — pass, correct, mask, block, or escalate — and signs a receipt you can replay. The same model that audited your history is the one enforcing in production.

guardian · verdict

SIGNED

action

db.query(args)

verdict

correct

patch

LIMIT 100000 → 100

clause

EU AI Act Art. 14

receipt

e8d2…41aa

✓ corrected, not crashed

The Difference

A blocklist fails shut. A Guardian keeps you running.

Ordinary guardrails

“Blocked. Good luck.”

A static blocklist or a vendor safety prompt gives you allow or deny. When it denies a near-miss, it crashes the workflow — and when an auditor asks why, the only record is a log line you have to be trusted on.

A Trinitite guardrail

“Corrected. Here’s the receipt.”

The Guardian rewrites the near-miss in place, masks what shouldn’t cross the boundary, and reserves hard blocks for genuinely dangerous actions — then signs a verdict that cites the clause it enforced and reproduces bit-for-bit months later.

Five verdicts, not two

Allow / block is a tripwire. This is a control.

Every input and every tool call gets one of five verdicts. The three middle options are why good traffic keeps flowing instead of failing shut.

pass

The action is in policy. It flows through untouched, with a signed receipt.

correct

A near-miss is rewritten in place via an RFC 6902 JSON patch — the workflow keeps running.

mask

Sensitive fields are reversibly tokenized before they cross a trust boundary.

block

A dangerous tool call is stopped before execution — and the block is replayable.

HITL

Genuinely ambiguous calls are parked for a human instead of guessed.

Guardrails in the latent space

Everyone guards the words. We guard the geometry too.

The 2026 attacks don’t jailbreak with clever words — they reshape the embeddings underneath: the vectors your retrieval searches and the action your agent takes. These guardrails police that hidden geometry.

Agent Action Guard

An embedding-based gate scores the semantics of the proposed tool call — not the agent’s reasoning — so a hijacked agent still can’t make "delete the production database" look safe. It survives the prompt injection.

Hybrid retrieval

Every RAG lookup runs keyword and semantic search at once. Gradient-guided poisoning can fool a vector index; it can’t fool both at the same time.

Policy-clause anchoring

Each verdict cites the exact clause it enforced — EU AI Act Art. 9–17, GDPR Art. 22 — bound by the platform and sealed in the signed chain, regardless of what the model says.

Black-hole detection

Vectors that become retrieval magnets for almost any query are flagged by hubness analysis and quarantined — the stealth poison proximity checks miss.

Trained on your policy

Not a generic classifier. Your Auditor, distilled.

A Trinitite guardrail is a per-tenant Guardian: a LoRA adapter distilled from your policy corpus — regulations, internal docs, prior opinions — over a policy-aware base model, hot-swapped into a determinism-fixed kernel on the specific call. Because the kernel is batch-invariant, the same input and the same policy produce the same verdict bytes on any cluster, on any day.

That is what makes the guardrail an auditor instead of a logger: the verdict it renders inline today is replayable in a post-incident review months later, and it’s the same opinion behind MCP governance and the auditor workflow.

FAQ

AI guardrails, answered

What are AI guardrails?

AI guardrails are controls that sit around a model to keep its inputs and outputs inside policy — blocking unsafe responses, redacting sensitive data, and constraining what an agent is allowed to do. Most guardrails are a two-valued allow/block filter. Trinitite’s guardrail is a per-tenant Guardian model that returns a five-valued verdict — pass, correct, mask, block, or escalate — and signs a replayable receipt for every decision.

How are Trinitite’s AI guardrails different from a generic classifier?

A generic guardrail runs a vendor’s safety prompt and gives you a yes/no. Trinitite distills your own policy corpus — regulations, internal docs, prior opinions — into a per-tenant LoRA adapter, runs it on a determinism-fixed kernel, and produces a signed verdict that reproduces bit-for-bit. The same Guardian that audited your history is the one enforcing in production.

Do AI guardrails break my application when they block something?

Trinitite’s do not have to. Instead of failing a near-miss shut, the Guardian can correct it in place with an RFC 6902 JSON patch or reversibly mask sensitive fields, so the workflow keeps running. Hard blocks are reserved for genuinely dangerous actions, and every block is recorded and replayable.

Can AI guardrails stop prompt injection?

Yes — the Agent Action Guard scores the semantics of the proposed action independently of the agent’s reasoning, so an injection that hijacks the model still can’t disguise a destructive tool call. See prompt injection defense for the detail.