Protect Your Inbox from Rogue AI Agents

An AI email assistant that forgets its instructions can delete your entire inbox in seconds. PreFlight makes sure that never happens - even when the agent ignores you.

What happens without PreFlight

In early 2026, Meta's own Director of AI Alignment connected an AI agent to her personal inbox with a clear instruction: "suggest what you would archive or delete, don't action until I tell you to."

The agent's context window filled up. Her safety instruction was silently dropped. The agent started mass-deleting emails. She sent three stop commands from her phone - the agent acknowledged each one and kept deleting.

She had to physically run to her computer and kill the process. (Source)

The instruction lived inside the agent's memory. When memory ran out, the instruction disappeared. A prompt is a suggestion. A PreFlight policy is a constraint.

Make sure PreFlight runs before any important action!

The policy

1. Do not delete any email without an explicit confirmation token from the user.
2. Do not archive any email without an explicit confirmation token from the user.
3. Do not move any email to trash without an explicit confirmation token from the user.
4. No more than 5 emails may be modified in a single session without re-authorization.
5. Do not perform any action on emails older than 7 days without explicit approval.

Rules 1–3 enforce the core safety behavior: look but don't touch. Rule 4 adds a hard batch limit - even if something slips through, the damage is capped at 5 emails before the agent must stop and ask again. Rule 5 catches runaway behavior targeting your archive.
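To make the five rules concrete, here is a minimal Python sketch of how they might evaluate against a proposed action. It is an illustration only, not how PreFlight works internally: the policy is compiled and checked by the service, and every name below (ProposedAction, violations, the field names) is hypothetical. It also assumes that "modified" means delete, archive, or trash, and that the session count is cumulative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

SESSION_LIMIT = 5               # rule 4
MAX_AGE = timedelta(days=7)     # rule 5
# Which rule guards which destructive action (rules 1-3).
RULE_FOR_KIND = {"delete": 1, "archive": 2, "trash": 3}


@dataclass
class ProposedAction:
    kind: str                                  # "delete", "archive", "trash", "read", ...
    email_dates: List[datetime]                # received time of each targeted email
    confirmation_token: Optional[str] = None   # required by rules 1-3
    reauthorized: bool = False                 # lifts the rule 4 batch limit
    explicit_approval: bool = False            # required by rule 5 for old emails


def violations(action: ProposedAction, modified_so_far: int, now: datetime) -> List[int]:
    """Return the numbers of the rules the action would break (empty list = allowed)."""
    broken: List[int] = []
    modifies = action.kind in RULE_FOR_KIND
    # Rules 1-3: destructive actions need a confirmation token.
    if modifies and not action.confirmation_token:
        broken.append(RULE_FOR_KIND[action.kind])
    # Rule 4: at most SESSION_LIMIT modified emails per session without re-authorization.
    if (modifies
            and modified_so_far + len(action.email_dates) > SESSION_LIMIT
            and not action.reauthorized):
        broken.append(4)
    # Rule 5: any action on an email older than 7 days needs explicit approval.
    if any(now - d > MAX_AGE for d in action.email_dates) and not action.explicit_approval:
        broken.append(5)
    return broken
```

An empty list corresponds to SAT, and anything else to UNSAT. A bulk delete of 50 recent emails with no token, for instance, breaks rules 1 and 4 under these assumptions.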

Set it up in minutes

1. Compile the policy

curl -s -N -X POST https://api.icme.io/v1/makeRules \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy": "1. Do not delete any email without an explicit confirmation token from the user.\n2. Do not archive any email without an explicit confirmation token from the user.\n3. Do not move any email to trash without an explicit confirmation token from the user.\n4. No more than 5 emails may be modified in a single session without re-authorization.\n5. Do not perform any action on emails older than 7 days without explicit approval."
  }'

Save the policy_id from the response.

2. Check before every relevant action

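Before your agent touches any email, it calls checkIt with the saved policy_id and a description of the action it wants to take. A delete request that carries no confirmation token, for example, is rejected:

Result: UNSAT - blocked. No confirmation token.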
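One way to wire this into an agent is a small gate that runs the email action only after a SAT verdict and fails closed on anything else. The sketch below rests on several assumptions: the checkIt URL is guessed from the makeRules URL above, and the JSON field names ("policy_id", "action", "result") are placeholders, not the documented API. Check the API reference before using it. The helper names check_it and guarded are hypothetical.

```python
import json
import os
import urllib.request
from typing import Callable

# ASSUMPTION: mirrors the makeRules URL; the real endpoint may differ.
CHECK_URL = "https://api.icme.io/v1/checkIt"


def check_it(policy_id: str, action: str) -> str:
    """Ask PreFlight whether `action` satisfies the policy.

    ASSUMPTION: request fields "policy_id"/"action" and response field
    "result" are illustrative guesses, not the documented schema.
    """
    body = json.dumps({"policy_id": policy_id, "action": action}).encode()
    req = urllib.request.Request(
        CHECK_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": os.environ["ICME_API_KEY"],
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return str(json.load(resp).get("result", "")).upper()


def guarded(action: str, perform: Callable[[], None],
            check: Callable[[str], str]) -> bool:
    """Run `perform` only if `check(action)` returns "SAT"; otherwise block."""
    try:
        verdict = check(action)
    except Exception:
        verdict = "ERROR"   # network failure or bad response: treat as blocked
    if verdict != "SAT":
        return False
    perform()
    return True
```

In use, the agent would call something like guarded("Delete email 42. No confirmation token.", do_delete, lambda a: check_it(policy_id, a)). Because the gate fails closed, an unreachable API blocks the action rather than letting it through.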

3. See it work

❌ Bulk action with no confirmation

Expected result: UNSAT - exceeds 5-email session limit, no confirmation token.

❌ Deleting an old email without approval

Expected result: UNSAT - email is older than 7 days, no explicit approval.

✅ Single deletion with user confirmation

Expected result: SAT - confirmed by user, recent email, single action.
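The three scenarios above can be kept as a small regression suite and rerun whenever the policy changes. This is a rough sketch: the action wording and the numbers in it (50 emails, 30 days) are made up for illustration, run_scenarios is a hypothetical helper, and check stands for whatever function you use to call checkIt and return "SAT" or "UNSAT".

```python
from typing import Callable, List, Tuple

# (action description, expected verdict) - wording is illustrative only.
SCENARIOS: List[Tuple[str, str]] = [
    ("Delete 50 emails from the inbox. No confirmation token provided.", "UNSAT"),
    ("Delete one email received 30 days ago. No explicit approval given.", "UNSAT"),
    ("Delete one email received today. User supplied a confirmation token.", "SAT"),
]


def run_scenarios(check: Callable[[str], str],
                  scenarios: List[Tuple[str, str]] = SCENARIOS) -> List[str]:
    """Return one message per scenario whose verdict differs from the expectation."""
    failures: List[str] = []
    for action, expected in scenarios:
        got = check(action)
        if got != expected:
            failures.append(f"{action!r}: expected {expected}, got {got}")
    return failures
```

An empty list means every scenario behaved as expected. Anything else names the scenario that needs a closer look.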

Why this works when prompts don't

Prompt instruction vs. PreFlight policy

Where it lives
  • Prompt instruction: inside the agent's context window
  • PreFlight policy: an external API, outside the agent

What happens when context fills
  • Prompt instruction: silently dropped
  • PreFlight policy: still enforced - the agent can't reach it

Can the agent override it?
  • Prompt instruction: yes - by forgetting, reinterpreting, or ignoring it
  • PreFlight policy: no - an SMT solver doesn't take suggestions. If the check returns UNSAT, the action does not pass.

Enforcement
  • Prompt instruction: best-effort, probabilistic
  • PreFlight policy: mathematical, deterministic

Audit trail
  • Prompt instruction: none
  • PreFlight policy: cryptographic proof per decision

Adapt it to your needs

This policy is a starting point. You might add rules like:

  • "Do not forward any email to an address outside my contacts list."

  • "Do not send any email containing financial figures to external recipients."

  • "Do not access emails in the Confidential label without explicit approval."

Write one constraint per rule and keep each rule atomic. Before deploying, run battle testing against the policy to catch edge cases.
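If you keep the rules as a list, one per entry, you can rebuild the makeRules request body whenever you add one. The sketch below reproduces the numbered, newline-separated "policy" string used in the curl call above. The helper name build_payload is hypothetical.

```python
import json
from typing import List

BASE_RULES: List[str] = [
    "Do not delete any email without an explicit confirmation token from the user.",
    "Do not archive any email without an explicit confirmation token from the user.",
    "Do not move any email to trash without an explicit confirmation token from the user.",
    "No more than 5 emails may be modified in a single session without re-authorization.",
    "Do not perform any action on emails older than 7 days without explicit approval.",
]


def build_payload(rules: List[str]) -> str:
    """Number the rules from 1 and return the JSON body for makeRules."""
    policy = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return json.dumps({"policy": policy})


# Example: extend the starting policy with one extra atomic rule.
payload = build_payload(
    BASE_RULES + ["Do not forward any email to an address outside my contacts list."]
)
```

The resulting string can be passed to curl with -d in place of the inline JSON. Recompiling produces a new policy, so save the new policy_id from the response.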
