When you compile a natural language policy into formal logic, PreFlight generates sets of variable assignments that are logically possible given your rules. These are called scenarios. Scenarios are sorted to surface the combinations most likely to reveal a wrong or ambiguous rule first.
The problem they solve: a policy can compile successfully and pass a satisfiability check while still producing incorrect results at runtime. The automated reasoning translator maps natural language to your policy variables using foundation models. If two variables could reasonably represent the same concept, or if a variable description is vague, the translator will disagree with itself, returning TRANSLATION_AMBIGUOUS on real actions. Battle testing catches this before your policy goes into production.
The flow
makeRules
  compiles policy to SMT
  fetches scenarios (deduplicated, SATISFIABLE first)
  saves scenarios to DB
  returns policy_id + scenario_count + next_steps
GET /v1/policy/{id}/scenarios
  returns scenarios for review
POST /v1/submitScenarioFeedback (per scenario)
  approved: true saves a SATISFIABLE test case, done (free, no annotation queued)
  approved: false saves an INVALID test case and queues an annotation for the rebuild
GET /v1/policy/{id}/variables
  returns extracted variables, types, and rules
  use this to identify junk variables and vague descriptions
POST /v1/refinePolicyVariables (storage only, instant)
  queues deleteVariable, deleteRule, or updateVariable annotations
  does not trigger a rebuild
  call refinePolicy to apply
POST /v1/refinePolicy (after all feedback and variable changes are queued)
  merges scenario annotations + variable annotations into one rebuild
  max 10 annotations total per call
  polls until complete
  exports refined policy, compiles new SMT
  updates existing guardrail (never creates a new one)
  writes fresh SMT + workflow_id + guardrail_ver to DB
  clears both pending queues
  returns fresh scenarios for next review round
POST /v1/runPolicyTests
  runs all saved test cases against the compiled policy
  returns passed / failed / ambiguous counts
Step 1: Compile your policy
This is an SSE stream. Wait for the done event. It will contain:
policy_id: save this, you need it for everything else
scenario_count: number of generated scenarios
next_steps.get_scenarios: the URL to retrieve scenarios
Compilation time: makeRules compiles your natural language policy into SMT-LIB2 formal logic, generates an automated reasoning policy, and fetches adversarial scenarios for battle testing. This involves an LLM translation pass, an SMT compilation step, and a build workflow that polls until the reasoning model is ready. The full pipeline typically completes in 2 to 7 minutes depending on rule count. This is a one-time cost. You compile once and check against the same policy_id for every subsequent action. Recompile only when your requirements change.
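The done event can be pulled out of the raw stream with a few lines of client code. A minimal Python sketch, assuming the standard event/data SSE framing; the payload shape beyond the fields documented above is an assumption:

```python
import json

def parse_sse_done(stream_text):
    """Scan raw SSE text and return the JSON payload of the `done` event.

    Minimal sketch: production SSE parsing should also handle multi-line
    `data:` fields and reconnects.
    """
    event = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:") and event == "done":
            return json.loads(line.split(":", 1)[1].strip())
    return None

# Hypothetical stream fragment carrying the documented fields.
sample = (
    'event: progress\ndata: {"stage": "compiling"}\n\n'
    'event: done\ndata: {"policy_id": "pol_123", "scenario_count": 12}\n\n'
)
done = parse_sse_done(sample)
print(done["policy_id"])  # pol_123
```

Save the policy_id from this payload; every later call needs it.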
Step 2: Review scenarios
Each scenario is a set of variable assignments that are logically possible given your compiled rules. PreFlight surfaces the most suspicious combinations first.
For each scenario ask: could this situation actually happen in my domain?
A correct scenario:
A tool missing project scoping is correctly blocked. Thumbs up.
A bad scenario:
A tool cannot be simultaneously permitted and have its bid rejected. If the tool is permitted, the bid must not be rejected. Thumbs down with an annotation.
Step 3: Submit scenario feedback
Thumbs up: scenario looks correct for your domain. Creates a test case but does not queue an annotation. Does not count toward the 10-annotation limit.
Thumbs down: scenario should be impossible. Requires an annotation explaining why. Be specific. Name the variables, the values, and the rule being violated. Counts toward the 10-annotation limit.
The annotation is what PreFlight uses to deduce the rule or variable change needed. The more specific the better.
Correcting a mismatched expected result: if a test case was saved with the wrong expected result, delete and replace it:
delete_test_case_updated_at comes from the testCase.updatedAt field in runPolicyTests results.
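If you script this, the delete-and-replace payload can be assembled as below. A sketch only: `policy_id`, `guard_content`, `approved`, and `delete_test_case_updated_at` are the fields documented here; the overall payload shape is an assumption.

```python
def replace_test_case_payload(policy_id, scenario_description, test_case):
    """Build a submitScenarioFeedback payload that deletes the stale test
    case (keyed by its updatedAt timestamp) and saves the scenario fresh."""
    return {
        "policy_id": policy_id,
        "guard_content": scenario_description,
        "approved": True,
        # testCase.updatedAt comes from runPolicyTests results.
        "delete_test_case_updated_at": test_case["updatedAt"],
    }
```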
Step 4: Review extracted variables
If scenarios revealed issues, or if you want to improve translation accuracy, inspect what the compiler extracted. The compiler may auto-generate extra variables and rules that are not in your original policy. These add noise to the translation process and can cause TRANSLATION_AMBIGUOUS results at runtime.
The response includes:
variables: each with a name, type, and description
types: custom enum types with their possible values
rules: each with an id and human-readable description
What to look for:
Junk variables you did not define in your policy (e.g. exceptionsAllowed, featureCompensationAllowed, hasProvisionalApproval). These are auto-generated meta-variables that compete with your real variables during translation.
Vague descriptions like "Whether the tool supports assigning tasks to team members." Compare with a description that works well: "True if the checkout domain is in the approved merchant registry, false otherwise." Detailed descriptions with explicit true/false guidance produce better translation accuracy.
Bare assertion rules (e.g. exceptionsAllowed is false) rather than conditional if-then rules. These are often tied to the auto-generated variables.
Note the variable names and rule IDs for anything that needs to be deleted or updated.
Step 5: Queue variable and rule changes
Based on what you found in Steps 2-4 (scenario issues, junk variables, vague descriptions), queue any structural changes. This endpoint stores annotations instantly without triggering a rebuild.
Annotations are queued in pending_variable_annotations. They are not applied until you call refinePolicy.
Important: The reasoning engine allows a maximum of 10 annotations per refine call. Thumbs-up scenario feedback does not count (it only creates test cases). Thumbs-down scenario feedback and variable/rule changes both count. Plan accordingly.
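A guard like the following keeps a scripted loop inside the budget. The 10-annotation limit is the documented behavior; the helper itself is hypothetical:

```python
def check_annotation_budget(thumbs_down_annotations, variable_annotations, limit=10):
    """Thumbs-down feedback and variable/rule changes both count toward the
    per-refine limit; thumbs-up feedback is free and is not passed in."""
    total = len(thumbs_down_annotations) + len(variable_annotations)
    if total > limit:
        raise ValueError(
            f"{total} annotations queued, but refinePolicy accepts at most "
            f"{limit}: merge or drop some before rebuilding."
        )
    return total
```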
Step 6: Rebuild with all annotations
Once you have submitted all scenario feedback and queued any variable/rule changes, trigger a single rebuild that batches everything at once:
This is an SSE stream. Steps:
Fetches the current policy definition
Merges scenario annotations and variable/rule annotations into a single batch (max 10)
Submits all as one REFINE_POLICY build
Polls until the build completes (typically 2 to 3 minutes)
Exports the refined policy and compiles a new SMT string
Updates the existing guardrail to point at the refined policy version
Writes the new SMT, workflow ID, and guardrail version to the database
Clears both pending annotation queues
Fetches fresh scenarios from the refined policy for the next review round
The done event will contain fresh scenario_count. Your policy_id stays the same throughout.
Note: refinePolicy never creates a new guardrail. It updates the existing one in place. Each policy supports a limited number of rebuild workflows (typically 1 refine after the initial compilation), so batch all your changes into a single refinePolicy call. If you need more iterations, compile a fresh policy with makeRules and start the battle testing flow again.
Step 7: Confirm with tests
The response reports passed, failed, and ambiguous counts.
If tests fail:
failed: the rule logic is wrong. Submit a thumbs-down annotation and call refinePolicy again.
ambiguous: the translator is disagreeing on how to map natural language to your variables. Improve the variable description using refinePolicyVariables with an update_variables annotation.
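Scripted triage of the counts might look like this (field names follow the documented response; the recommended actions restate the guidance above):

```python
def triage_test_results(results):
    """Map runPolicyTests counts to the follow-up actions described above."""
    actions = []
    if results.get("failed", 0) > 0:
        actions.append("rule logic wrong: thumbs-down annotation, then refinePolicy")
    if results.get("ambiguous", 0) > 0:
        actions.append("translator disagreement: update_variables via refinePolicyVariables")
    return actions or ["all passed: verify end-to-end with checkIt"]
```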
Step 8: Verify end-to-end
Once tests pass, confirm the policy enforces correctly with real actions:
Tip: Include "Therefore this [action] is permitted." at the end of your action text. This gives the automated reasoning translator a clear claim to verify against the premises. Without it, the translator treats all statements as claims with no premises, which produces SATISFIABLE (uncertain) instead of VALID (proven) results.
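A small helper can append the claim consistently across all your checkIt calls (hypothetical; `action_noun` is whatever your policy governs, e.g. "tool" or "bid"):

```python
def with_claim(action_text, action_noun="tool"):
    """Append the explicit claim the translator verifies, per the tip above.

    Idempotent: text already ending with the claim is returned unchanged.
    """
    claim = f"Therefore this {action_noun} is permitted."
    text = action_text.rstrip()
    return action_text if text.endswith(claim) else f"{text} {claim}"
```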
Writing good annotations
The annotation is the most important input in the refinement loop. PreFlight uses it to deduce rule and variable changes. Vague annotations produce vague fixes.
Weak: "This is wrong"
Strong: "If toolIsPermitted is true then bidMustBeRejected must be false. A tool cannot be simultaneously permitted and rejected. Rules 2-13 tie rejection to non-permission."

Weak: "The variable is confusing"
Strong: "toolSupportsAssigningTasks description should say: True if the tool supports assigning tasks to team members by any method including user ID, account ID, username, or member ID. False otherwise."

Weak: "Fix the auth rule"
Strong: "toolAuthenticationMethod description should state: Set to OAUTH_2_0 when the tool uses OAuth 2.0. Set to AuthenticationMethod_OTHER for any other method including API key or basic auth."
Writing good variable descriptions
Variable descriptions are the single most important factor in translation accuracy. The automated reasoning engine uses multiple foundation models to translate natural language into formal logic. Each model translates independently. When descriptions are vague, the models disagree, producing TRANSLATION_AMBIGUOUS results.
Pattern that works: "True if the checkout domain is in the approved merchant registry, false otherwise."

Pattern that causes ambiguity: "Whether the tool supports assigning tasks to team members."
Good variable descriptions should:
Start with "True if ... false otherwise" for booleans
Include synonyms and alternative phrasings users might use
Specify enum value mappings explicitly (e.g. "Set to OAUTH_2_0 when the tool uses OAuth 2.0")
Describe boundary conditions (e.g. "Set to false when identity verification status is unknown or not mentioned")
Use GET /v1/policy/{id}/variables to inspect descriptions and POST /v1/refinePolicyVariables with update_variables to improve them.
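The checklist lends itself to a quick lint pass before queueing updates. A heuristic sketch for boolean variables only (the checks mirror the bullets above; the function is hypothetical):

```python
def lint_bool_description(description):
    """Flag boolean variable descriptions likely to translate poorly."""
    issues = []
    if not description.startswith("True if"):
        issues.append('does not start with "True if"')
    if "false" not in description.lower():
        issues.append("does not say when the variable is false")
    return issues
```

For example, the vague description quoted earlier fails both checks, while the merchant-registry description passes cleanly.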
Understanding result types
SATISFIABLE: The scenario is consistent with your rules under some conditions. Realistic and expected.
VALID: The scenario is provably correct given your rules. The automated reasoning engine confirmed it with mathematical proof.
INVALID: The scenario contradicts your rules. If you marked it as thumbs-up this is a test failure.
TRANSLATION_AMBIGUOUS: The translator disagreed on how to map natural language to variables. Fix the variable description.
IMPOSSIBLE: The premises in the scenario contradict each other, or the policy has conflicting rules.
Verification architecture
Each checkIt call runs three independent solvers:
Local extraction model (OxiZ): Extracts structured values from the natural language action text and evaluates them against the compiled SMT policy.
Automated reasoning engine: Independently translates the raw action text to formal logic and checks it against the policy rules. Returns VALID (proven correct), INVALID (proven wrong), SATISFIABLE (correct under some conditions), or TRANSLATION_AMBIGUOUS (models disagreed on translation).
Z3 SMT solver: Formally verifies the extracted values against the SMT constraints.
The system requires consensus across solvers before returning a final verdict. When the automated reasoning engine returns VALID, you get a clean SAT with mathematical proof. When it returns TRANSLATION_AMBIGUOUS, the translation models disagreed on interpretation, but if the local model and Z3 both confirm SAT, the result is still reliable. The three-solver architecture ensures correct results even when one layer is uncertain.
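Client-side, the consensus rule can be approximated as below. This is an illustrative sketch of the behavior described above, not the actual server logic:

```python
def final_verdict(ar_result, local_sat, z3_sat):
    """Combine the three solver outputs per the consensus described above.

    Assumed interpretation: VALID/INVALID from the reasoning engine are
    authoritative; otherwise fall back to agreement between the local
    extraction model and Z3.
    """
    if ar_result == "VALID":
        return "SAT"
    if ar_result == "INVALID":
        return "UNSAT"
    # Translation ambiguous or merely satisfiable: trust the other two
    # solvers when they agree.
    if local_sat and z3_sat:
        return "SAT"
    return "INCONCLUSIVE"
```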
Every check also generates a zero-knowledge proof. The zk_proof_id and zk_proof_url in the response provide cryptographic verification of the result.
Why scenarios matter
A policy that compiles is not a policy that works. The automated reasoning translator is a set of foundation models. They can disagree with each other. Two similar variables, a vague description, or an ambiguous unit (months vs years, tokens vs dollars) can cause TRANSLATION_AMBIGUOUS at runtime, the exact moment your agent is trying to execute a real action.
Scenarios isolate the most suspect variable combinations before any real action is evaluated. Fixing them at compile time costs a rebuild. Finding them in production costs a failed guardrail.
Quick reference: the recommended flow
makeRules: compile policy
GET /scenarios: review scenarios, submit thumbs-up (free) and thumbs-down (counts toward 10)
GET /variables: inspect extracted variables, identify junk and vague descriptions
refinePolicyVariables: queue variable deletions and description improvements (counts toward 10)
refinePolicy: apply all queued changes in one rebuild (max 10 annotations total)
Example scenarios as returned for review:

{
  "description": "toolSupportsScopingWork is false and toolIsPermitted is false and bidMustBeRejected is true",
  "expected_result": "SATISFIABLE"
}

{
  "description": "toolAuthenticationMethod is equal to OAUTH_2_0 and toolIsPermitted is true and bidMustBeRejected is true",
  "expected_result": "SATISFIABLE"
}
# Submit thumbs-down scenario feedback with an annotation
curl -s -X POST https://api.icme.io/v1/submitScenarioFeedback \
-H 'Content-Type: application/json' \
-H "X-API-Key: $ICME_API_KEY" \
-d '{
"policy_id": "YOUR_POLICY_ID",
"guard_content": "PASTE THE SCENARIO DESCRIPTION",
"approved": false,
"annotation": "If toolIsPermitted is true then bidMustBeRejected must be false. A tool cannot be simultaneously permitted and have its bid rejected."
}'
# Improve variable descriptions for better translation accuracy
curl -s -X POST https://api.icme.io/v1/refinePolicyVariables \
-H 'Content-Type: application/json' \
-H "X-API-Key: $ICME_API_KEY" \
-d '{
"policy_id": "YOUR_POLICY_ID",
"update_variables": [
{
"name": "toolSupportsAssigningTasks",
"type": "BOOL",
"description": "True if the tool supports assigning tasks to team members by any method including by user ID, account ID, username, member ID, or user mention, false otherwise."
}
]
}'
Example runPolicyTests response:

{
  "passed": 11,
  "failed": 0,
  "ambiguous": 0,
  "message": "All 11 scenario test(s) passed. Run checkIt with real actions to verify end-to-end before production use."
}
# Should be SAT: all capabilities present
curl -s -N -X POST https://api.icme.io/v1/checkIt \
-H 'Content-Type: application/json' \
-H "X-API-Key: $ICME_API_KEY" \
-d '{
"policy_id": "YOUR_POLICY_ID",
"action": "Seller agent advertises an MCP tool. The tool supports assigning tasks to team members by user ID. The tool supports labels for categorization. The tool authentication method is OAuth 2.0. The tool supports creating new items. The tool supports scoping work to a team via team ID parameter. The seller has a verified identity. Therefore this tool is permitted."
}'
# Should be UNSAT: API key auth instead of OAuth 2.0
curl -s -N -X POST https://api.icme.io/v1/checkIt \
-H 'Content-Type: application/json' \
-H "X-API-Key: $ICME_API_KEY" \
-d '{
"policy_id": "YOUR_POLICY_ID",
"action": "Seller agent advertises an MCP tool. The tool supports assigning tasks to team members by member ID. The tool supports labels for categorization. The tool authentication method is API key. The tool supports creating new items. The tool supports scoping work to a board via board ID parameter. The seller has a verified identity. Therefore this tool is permitted."
}'
"True if the checkout domain is in the approved merchant registry, false otherwise."
"Whether the tool supports assigning tasks to team members."