How It Works

ICME PreFlight is built on two techniques from computer science: automated reasoning and zero-knowledge machine learning. This page explains both and how they fit together.

The problem with existing guardrails

Most AI agent guardrails work by asking a model to evaluate whether an action is allowed. The model reads the policy, reads the action, and returns a judgment.

This has a fundamental weakness. The model that enforces the policy is the same kind of system as the agent trying to bypass it. Any input that can influence the agent can potentially influence the guardrail. Researchers have demonstrated repeatedly that sufficiently creative prompts can cause LLM judges to approve actions they should block. The attack surface is the model's judgment itself.

The solution is to remove judgment from the enforcement step entirely.


Automated Reasoning (ARc)

Automated reasoning is the field of computer science concerned with making decisions by formal logical deduction rather than statistical inference. Instead of asking "does this seem allowed?", it asks "can it be proven that this is allowed?"

We implement the approach described in the ARc paper (Automated Reasoning Checks). The process has three stages.

1. Formalization

Your natural language policy is translated into SMT-LIB, a standardized formal logic language used by mathematical solvers. This is the only step that uses a language model — and it is a narrow, well-defined task that LLMs perform reliably. The output is a set of logical constraints that precisely encode your rules.
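As an illustration, a rule like "transfers above 500 USDC require an allow-listed recipient" might compile to constraints of roughly this shape. The variable names here are invented for the example, not PreFlight's actual schema:

```smt
; Hypothetical SMT-LIB sketch; names are illustrative, not PreFlight output.
(declare-const amount Int)
(declare-const recipient_allowlisted Bool)
; transfers above 500 are permitted only to allow-listed recipients
(assert (=> (> amount 500) recipient_allowlisted))
```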

2. Consistency checking

Before your policy is saved, the compiler checks it for internal contradictions. Conflicting rules, unreachable states, and logical impossibilities are caught at compile time. A policy that passes this step is guaranteed to be internally consistent.
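For instance, a hypothetical policy whose rules cannot all hold at once is rejected before it is ever enforced. In this invented example, rule 1 caps amounts at 1000 while rules 2 and 3 jointly require an amount above 5000:

```smt
; Invented example of a contradictory policy caught at compile time.
(declare-const amount Int)
(declare-const approved Bool)
(assert (<= amount 1000))                ; rule 1: hard cap at 1000
(assert (=> approved (> amount 5000)))   ; rule 2: approvals only above 5000
(assert approved)                        ; rule 3: every action pre-approved
(check-sat)                              ; solver returns unsat: rules conflict
```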

3. Solver evaluation

When an agent action arrives, the relevant values are extracted and passed to an SMT solver alongside your compiled policy. The solver determines with mathematical certainty whether the action satisfies the constraints. It returns SAT (satisfiable — the action is permitted) or UNSAT (unsatisfiable — the action is blocked).

The solver has no language understanding, no context window, and no judgment. It cannot be socially engineered. Soundness exceeds 99% on unseen adversarial datasets and does not degrade when someone is actively trying to bypass it.
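The shape of this decision can be sketched in a few lines. This is an illustration only: the compiled policy is modeled as plain predicates over facts extracted from the action, whereas PreFlight itself compiles to SMT-LIB and invokes an SMT solver. `ALLOWLIST` and the field names are hypothetical.

```python
# Illustrative sketch only: constraints modeled as pure predicates over the
# facts extracted from an action. The real system uses an SMT solver; this
# just mirrors the SAT/UNSAT shape of the decision.
ALLOWLIST = {"0xA11CE", "0xB0B"}  # hypothetical recipient registry

POLICY = [
    lambda a: a["amount"] <= 1000,          # per-transfer cap
    lambda a: a["recipient"] in ALLOWLIST,  # recipient registry rule
]

def evaluate(action: dict) -> str:
    """Return 'SAT' if every constraint holds, otherwise 'UNSAT'."""
    return "SAT" if all(rule(action) for rule in POLICY) else "UNSAT"

print(evaluate({"amount": 250, "recipient": "0xB0B"}))      # SAT
print(evaluate({"amount": 10_000, "recipient": "0xEV1L"}))  # UNSAT
```

Note that `evaluate` never sees the email, chat history, or document that prompted the action; it sees only the facts the action itself asserts.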


What the solver enforces — and what it cannot

The solver is mathematically sound on anything that can be expressed as a fact in the action text. This covers the majority of real-world threats.

What PreFlight enforces with mathematical certainty:

Crypto and DeFi

  • Transfer and spending limits — amounts are in the action text and checked exactly

  • Recipient and contract registry rules — addresses are facts, not assertions

  • Daily aggregate limits — arithmetic constraints the solver evaluates precisely

  • Blocked contract interaction types — flash loans, multi-step sequences, unlisted contracts

  • Social engineering and manipulation tactics — urgency appeals, emotional framing, and false authority claims are named as explicit boolean variables and evaluated against the action text

Ecommerce

  • Order value caps and approval thresholds

  • Approved vendor and supplier registry enforcement

  • Refund and return policy rules — amounts, timeframes, eligibility conditions

  • Spending limits per category, per session, or per day

Personal agents

  • Data access scope — what the agent is permitted to read, summarize, or act on

  • Network call destinations — raw IP patterns and unlisted endpoints are blocked

  • Retention limits — data used only for the duration of the task

Privacy compliance

  • Enforcing rules from HR documents, employment contracts, or data protection policies against agent actions

  • Data transmission rules — approved endpoints only

  • Access controls derived directly from your policy document, formalized automatically

The one thing to understand about prompt injection:

The key distinction is what versus why. Policy variables that describe what the agent is doing are injection-resistant. Variables that describe why the agent believes it is permitted to act are not.

PreFlight evaluates an agent's proposed action before it executes. For any variable that describes what the agent is doing — amounts, addresses, destinations, data scope — injected content in an email or document cannot change what the action says. An email telling the agent to send 10,000 USDC to an unlisted address produces an action PreFlight evaluates like any other. The amount and recipient are checked against the policy. If they violate it, the action is blocked.

The narrow exception is variables that ask the agent to self-report why it believes it is permitted to act — for example, whether an instruction came from a direct user prompt. These rely on the agent's own characterization, which injected content can influence.

The defense is straightforward: write policies around what the agent is doing. Amount, recipient, destination, data scope, retention — these are verifiable facts in the action text that injected content cannot overwrite. They are the right foundation for a PreFlight policy.
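The what-versus-why distinction can be made concrete with a hypothetical policy over two kinds of variables. Here `amount` and `recipient` are facts stated in the action text, while `user_instructed` is the agent's self-report of why it is acting; all names are invented for the example.

```python
# Hypothetical "what" versus "why" variables in a policy check.
def check_what(action: dict) -> bool:
    # Fact-based rules: injected email text cannot rewrite these fields.
    return action["amount"] <= 500 and action["recipient"] in {"0xA11CE"}

def check_why(action: dict) -> bool:
    # Self-reported rule: a manipulated agent may simply set this to True.
    return action["user_instructed"]

injected = {"amount": 10_000, "recipient": "0xEV1L", "user_instructed": True}
print(check_what(injected))  # False: blocked on the facts
print(check_why(injected))   # True: the self-report was trivially gamed
```

A policy built on `check_what`-style variables blocks the injected action no matter how the agent was persuaded; a policy leaning on `check_why`-style variables inherits the agent's gullibility.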

For a full breakdown of threat categories and how to write policies that maximize coverage, see What ICME PreFlight defends against.


Dual verification

PreFlight does not rely on a single evaluation path. Every action check runs two independent verifications and reconciles them before returning a result.

A custom LLM check on our end evaluates the action against the formal SMT model of your policy. The cloud reasoning engine independently evaluates the same action against the same policy. A SAT result requires both to agree. If either returns UNSAT, the action is blocked.

This means a single extraction error or a single reasoning gap does not produce a false clearance. Both paths have to agree on SAT before your agent is allowed to proceed.
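The reconciliation rule is simple to state: any UNSAT from either path blocks the action. A minimal sketch, with verdict strings assumed for illustration:

```python
# Sketch of the reconciliation rule: a SAT result requires both independent
# verification paths to agree; a single dissent blocks the action.
def reconcile(llm_verdict: str, engine_verdict: str) -> str:
    if llm_verdict == "SAT" and engine_verdict == "SAT":
        return "SAT"
    return "UNSAT"

print(reconcile("SAT", "SAT"))    # SAT: both paths agree, action proceeds
print(reconcile("SAT", "UNSAT"))  # UNSAT: one dissent is enough to block
```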


Zero Knowledge Machine Learning (zkML)

The ARc approach tells you whether an action was allowed or blocked. zkML tells you that the decision was made correctly, by the specific model and policy you specified, without revealing either.

Zero knowledge proofs are cryptographic constructions that allow one party to prove a statement is true to another party without revealing any information beyond the truth of the statement itself. In the context of machine learning, this means proving that a specific model produced a specific output on a specific input, without exposing the model weights, the input data, or the intermediate computation.

Applied to guardrails, this gives you three properties that prompt-based systems cannot provide:

Succinctness. A ZK proof of a guardrail decision verifies in under one second on any device, regardless of how long the original computation took. You do not need to re-run the model to trust the result.

Privacy. Your policy is never exposed to the verifier. A third party can confirm that a decision was made correctly against your rules without ever seeing what your rules are.

Tamper-evident audit trails. Every decision produces a cryptographic receipt. These receipts are permanent, verifiable by anyone, and impossible to forge or retroactively alter. For regulated industries or high-stakes agent deployments, this is the difference between a log you hope no one modified and a proof that cannot be modified.
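The tamper-evidence property alone can be illustrated with a hash-chained log. To be clear about what this sketch is not: a real zkML receipt is a succinct zero-knowledge proof, which additionally hides the policy and verifies without re-running the computation; a plain hash chain provides neither. The chain only shows why retroactive edits are detectable.

```python
import hashlib
import json

# Illustration only: a hash-chained receipt log. Each receipt commits to the
# decision and to the previous receipt, so history cannot be silently edited.
def receipt(prev_hash: str, decision: dict) -> str:
    payload = json.dumps(decision, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

h1 = receipt("0" * 64, {"action": "transfer", "verdict": "UNSAT"})
h2 = receipt(h1, {"action": "read_inbox", "verdict": "SAT"})

# Rewriting the first decision changes h1, which breaks h2 and every later
# receipt, so tampering is detectable by anyone who re-hashes the chain.
h1_forged = receipt("0" * 64, {"action": "transfer", "verdict": "SAT"})
assert receipt(h1_forged, {"action": "read_inbox", "verdict": "SAT"}) != h2
```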


How they fit together

The language model handles translation. The solver handles enforcement. The ZK proof handles verification. Each component does only what it is suited for.

The result is a guardrail that cannot be argued out of a decision, produces verifiable proofs of every outcome, and keeps your policy private while remaining fully auditable.

The combination of ARc-based enforcement with zkML proofs is the novel contribution described in our paper Succinctly Verifiable Agentic Guardrails With ZKP Over Automated Reasoning. Where ARc solves enforcement, zkML solves verification at scale — enabling any machine to confirm a guardrail decision was made correctly without re-running the computation, without trusting the provider, and without seeing the policy.
