# Battle Testing

When you compile a natural language policy into formal logic, we generate a set of variable assignments that are logically possible given your rules. These are called **scenarios**. They are sorted to surface the combinations most likely to reveal a wrong or ambiguous rule first.

The problem they solve: a policy can compile successfully and pass a satisfiability check while still producing incorrect results at runtime. The automated reasoning translator maps natural language to your policy variables using foundation models. If two variables could reasonably represent the same concept, or if a variable description is vague, the translator will disagree with itself, returning `TRANSLATION_AMBIGUOUS` on real actions. Battle testing catches this before your policy goes into production.

***

### The flow

```
makeRules
    compiles policy to SMT
    fetches scenarios (deduplicated, SATISFIABLE first)
    saves scenarios to DB
    returns policy_id + scenario_count + next_steps

GET /v1/policy/{id}/scenarios
    returns scenarios for review

POST /v1/submitScenarioFeedback  (per scenario)
    approved: true   saves SATISFIABLE test case, done (free, no annotation queued)
    approved: false  saves INVALID test case + queues annotation for rebuild

GET /v1/policy/{id}/variables
    returns extracted variables, types, and rules
    use this to identify junk variables and vague descriptions

POST /v1/refinePolicyVariables  (storage only, instant)
    queues deleteVariable, deleteRule, or updateVariable annotations
    does not trigger a rebuild
    call refinePolicy to apply

POST /v1/refinePolicy  (after all feedback and variable changes are queued)
    merges scenario annotations + variable annotations into one rebuild
    max 10 annotations total per call
    polls until complete
    exports refined policy, compiles new SMT
    updates existing guardrail (never creates a new one)
    writes fresh SMT + workflow_id + guardrail_ver to DB
    clears both pending queues
    returns fresh scenarios for next review round

POST /v1/runPolicyTests
    runs all saved test cases against the compiled policy
    returns passed / failed / ambiguous counts
```

***

### Step 1: Compile your policy

This is an SSE stream. Wait for the `done` event. It will contain:

* `policy_id`: save this, you need it for everything else
* `scenario_count`: number of generated scenarios
* `next_steps.get_scenarios`: the URL to retrieve scenarios

```
curl -s -N -X POST https://api.icme.io/v1/makeRules \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy": "YOUR POLICY TEXT HERE"
  }'
```

**Compilation time:** makeRules compiles your natural language policy into SMT-LIB2 formal logic, generates an automated reasoning policy, and fetches adversarial scenarios for battle testing. This involves an LLM translation pass, an SMT compilation step, and a build workflow that polls until the reasoning model is ready. The full pipeline typically completes in 2 to 7 minutes depending on rule count. This is a one-time cost. You compile once and check against the same policy\_id for every subsequent action. Recompile only when your requirements change.

***

### Step 2: Review scenarios

Each scenario is a set of variable assignments that are logically possible given your compiled rules. PreFlight surfaces the most suspicious combinations first.

For each scenario ask: **could this situation actually happen in my domain?**

```
curl -s https://api.icme.io/v1/policy/YOUR_POLICY_ID/scenarios \
  -H "X-API-Key: $ICME_API_KEY" | jq .
```

**A correct scenario:**

```json
{
  "description": "toolSupportsScopingWork is false and toolIsPermitted is false and bidMustBeRejected is true",
  "expected_result": "SATISFIABLE"
}
```

A tool missing project scoping is correctly blocked. Thumbs up.

**A bad scenario:**

```json
{
  "description": "toolAuthenticationMethod is equal to OAUTH_2_0 and toolIsPermitted is true and bidMustBeRejected is true",
  "expected_result": "SATISFIABLE"
}
```

A tool cannot be simultaneously permitted and have its bid rejected. If the tool is permitted, the bid must not be rejected. Thumbs down with an annotation.

***

### Step 3: Submit scenario feedback

**Thumbs up:** scenario looks correct for your domain. Creates a test case but does not queue an annotation. Does not count toward the 10-annotation limit.

```
curl -s -X POST https://api.icme.io/v1/submitScenarioFeedback \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "guard_content": "PASTE THE SCENARIO DESCRIPTION",
    "approved": true
  }'
```

**Thumbs down:** scenario should be impossible. Requires an annotation explaining why. Be specific. Name the variables, the values, and the rule being violated. Counts toward the 10-annotation limit.

```
curl -s -X POST https://api.icme.io/v1/submitScenarioFeedback \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "guard_content": "PASTE THE SCENARIO DESCRIPTION",
    "approved": false,
    "annotation": "If toolIsPermitted is true then bidMustBeRejected must be false. A tool cannot be simultaneously permitted and have its bid rejected."
  }'
```

The annotation is what PreFlight uses to deduce the rule or variable change needed. The more specific the better.

**Correcting a mismatched expected result:** if a test case was saved with the wrong expected result, delete and replace it:

```
curl -s -X POST https://api.icme.io/v1/submitScenarioFeedback \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "guard_content": "PASTE THE SCENARIO TEXT",
    "approved": true,
    "expected_result": "VALID",
    "delete_test_case_id": "OLD_TEST_CASE_ID",
    "delete_test_case_updated_at": "2026-03-14T13:19:09.430000+00:00"
  }'
```

`delete_test_case_updated_at` comes from the `testCase.updatedAt` field in `runPolicyTests` results.

***

### Step 4: Review extracted variables

If scenarios revealed issues, or if you want to improve translation accuracy, inspect what the compiler extracted. The compiler may auto-generate extra variables and rules that are not in your original policy. These add noise to the translation process and can cause `TRANSLATION_AMBIGUOUS` results at runtime.

```
curl -s https://api.icme.io/v1/policy/YOUR_POLICY_ID/variables \
  -H "X-API-Key: $ICME_API_KEY" | jq .
```

The response includes:

* `variables`: each with a name, type, and description
* `types`: custom enum types with their possible values
* `rules`: each with an id and human-readable description

**What to look for:**

* **Junk variables** you did not define in your policy (e.g. `exceptionsAllowed`, `featureCompensationAllowed`, `hasProvisionalApproval`). These are auto-generated meta-variables that compete with your real variables during translation.
* **Vague descriptions** like "Whether the tool supports assigning tasks to team members." Compare with a description that works well: "True if the checkout domain is in the approved merchant registry, false otherwise." Detailed descriptions with explicit true/false guidance produce better translation accuracy.
* **Bare assertion rules** (e.g. `exceptionsAllowed is false`) rather than conditional if-then rules. These are often tied to the auto-generated variables.

Note the variable names and rule IDs for anything that needs to be deleted or updated.

***

### Step 5: Queue variable and rule changes

Based on what you found in Steps 2-4 (scenario issues, junk variables, vague descriptions), queue any structural changes. This endpoint stores annotations instantly without triggering a rebuild.

```
# Delete auto-generated junk variables and their bare-assertion rules
curl -s -X POST https://api.icme.io/v1/refinePolicyVariables \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "delete_variables": ["exceptionsAllowed", "featureCompensationAllowed"],
    "delete_rules": ["RULE_ID_1", "RULE_ID_2"]
  }'
```

```
# Improve variable descriptions for better translation accuracy
curl -s -X POST https://api.icme.io/v1/refinePolicyVariables \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "update_variables": [
      {
        "name": "toolSupportsAssigningTasks",
        "type": "BOOL",
        "description": "True if the tool supports assigning tasks to team members by any method including by user ID, account ID, username, member ID, or user mention, false otherwise."
      }
    ]
  }'
```

Annotations are queued in `pending_variable_annotations`. They are not applied until you call `refinePolicy`.

**Important:** The reasoning engine allows a maximum of 10 annotations per refine call. Thumbs-up scenario feedback does not count (it only creates test cases). Thumbs-down scenario feedback and variable/rule changes both count. Plan accordingly.

***

### Step 6: Rebuild with all annotations

Once you have submitted all scenario feedback and queued any variable/rule changes, trigger a single rebuild that batches everything at once:

```
curl -s -N -X POST https://api.icme.io/v1/refinePolicy \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{"policy_id": "YOUR_POLICY_ID"}'
```

This is an SSE stream. Steps:

1. Fetches the current policy definition
2. Merges scenario annotations and variable/rule annotations into a single batch (max 10)
3. Submits all as one REFINE\_POLICY build
4. Polls until the build completes (typically 2 to 3 minutes)
5. Exports the refined policy and compiles a new SMT string
6. Updates the existing guardrail to point at the refined policy version
7. Writes the new SMT, workflow ID, and guardrail version to the database
8. Clears both pending annotation queues
9. Fetches fresh scenarios from the refined policy for the next review round

The `done` event will contain fresh `scenario_count`. Your `policy_id` stays the same throughout.

**Note:** `refinePolicy` never creates a new guardrail. It updates the existing one in place. Each policy supports a limited number of rebuild workflows (typically 1 refine after the initial compilation), so batch all your changes into a single refinePolicy call. If you need more iterations, compile a fresh policy with `makeRules` and start the battle testing flow again.

***

### Step 7: Confirm with tests

```
curl -s -X POST https://api.icme.io/v1/runPolicyTests \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{"policy_id": "YOUR_POLICY_ID"}'
```

**Response:**

```json
{
  "passed": 11,
  "failed": 0,
  "ambiguous": 0,
  "message": "All 11 scenario test(s) passed. Run checkIt with real actions to verify end-to-end before production use."
}
```

If tests fail:

* `failed`: the rule logic is wrong. Submit a thumbs-down annotation and call `refinePolicy` again.
* `ambiguous`: the translator is disagreeing on how to map natural language to your variables. Improve the variable description using `refinePolicyVariables` with an `update_variables` annotation.

***

### Step 8: Verify end-to-end

Once tests pass, confirm the policy enforces correctly with real actions:

```
# Should be SAT: all capabilities present
curl -s -N -X POST https://api.icme.io/v1/checkIt \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "action": "Seller agent advertises an MCP tool. The tool supports assigning tasks to team members by user ID. The tool supports labels for categorization. The tool authentication method is OAuth 2.0. The tool supports creating new items. The tool supports scoping work to a team via team ID parameter. The seller has a verified identity. Therefore this tool is permitted."
  }'

# Should be UNSAT: API key auth instead of OAuth 2.0
curl -s -N -X POST https://api.icme.io/v1/checkIt \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ICME_API_KEY" \
  -d '{
    "policy_id": "YOUR_POLICY_ID",
    "action": "Seller agent advertises an MCP tool. The tool supports assigning tasks to team members by member ID. The tool supports labels for categorization. The tool authentication method is API key. The tool supports creating new items. The tool supports scoping work to a board via board ID parameter. The seller has a verified identity. Therefore this tool is permitted."
  }'
```

**Tip:** Include "Therefore this \[action] is permitted." at the end of your action text. This gives the automated reasoning translator a clear claim to verify against the premises. Without it, the translator treats all statements as claims with no premises, which produces SATISFIABLE (uncertain) instead of VALID (proven) results.

***

### Writing good annotations

The annotation is the most important input in the refinement loop. PreFlight uses it to deduce rule and variable changes. Vague annotations produce vague fixes.

| Weak                        | Strong                                                                                                                                                                                               |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "This is wrong"             | "If toolIsPermitted is true then bidMustBeRejected must be false. A tool cannot be simultaneously permitted and rejected. Rules 2-13 tie rejection to non-permission."                               |
| "The variable is confusing" | "toolSupportsAssigningTasks description should say: True if the tool supports assigning tasks to team members by any method including user ID, account ID, username, or member ID. False otherwise." |
| "Fix the auth rule"         | "toolAuthenticationMethod description should state: Set to OAUTH\_2\_0 when the tool uses OAuth 2.0. Set to AuthenticationMethod\_OTHER for any other method including API key or basic auth."       |

***

### Writing good variable descriptions

Variable descriptions are the single most important factor in translation accuracy. The automated reasoning engine uses multiple foundation models to translate natural language into formal logic. Each model translates independently. When descriptions are vague, the models disagree, producing `TRANSLATION_AMBIGUOUS` results.

**Pattern that works:**

```
"True if the checkout domain is in the approved merchant registry, false otherwise."
```

**Pattern that causes ambiguity:**

```
"Whether the tool supports assigning tasks to team members."
```

Good variable descriptions should:

* Start with "True if ... false otherwise" for booleans
* Include synonyms and alternative phrasings users might use
* Specify enum value mappings explicitly (e.g. "Set to OAUTH\_2\_0 when the tool uses OAuth 2.0")
* Describe boundary conditions (e.g. "Set to false when identity verification status is unknown or not mentioned")

Use `GET /v1/policy/{id}/variables` to inspect descriptions and `POST /v1/refinePolicyVariables` with `update_variables` to improve them.

***

### Understanding result types

| Result                  | Meaning                                                                                                                 |
| ----------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| `SATISFIABLE`           | The scenario is consistent with your rules under some conditions. Realistic and expected.                               |
| `VALID`                 | The scenario is provably correct given your rules. The automated reasoning engine confirmed it with mathematical proof. |
| `INVALID`               | The scenario contradicts your rules. If you marked it as thumbs-up this is a test failure.                              |
| `TRANSLATION_AMBIGUOUS` | The translator disagreed on how to map natural language to variables. Fix the variable description.                     |
| `IMPOSSIBLE`            | The premises in the scenario contradict each other, or the policy has conflicting rules.                                |

***

### Verification architecture

Each `checkIt` call runs three independent solvers:

1. **Local extraction model (OxiZ):** Extracts structured values from the natural language action text and evaluates them against the compiled SMT policy.
2. **Automated reasoning engine:** Independently translates the raw action text to formal logic and checks it against the policy rules. Returns VALID (proven correct), INVALID (proven wrong), SATISFIABLE (correct under some conditions), or TRANSLATION\_AMBIGUOUS (models disagreed on translation).
3. **Z3 SMT solver:** Formally verifies the extracted values against the SMT constraints.

The system requires consensus across solvers before returning a final verdict. When the automated reasoning engine returns VALID, you get a clean SAT with mathematical proof. When it returns TRANSLATION\_AMBIGUOUS, the translation models disagreed on interpretation, but if the local model and Z3 both confirm SAT, the result is still reliable. The three-solver architecture ensures correct results even when one layer is uncertain.

Every check also generates a zero-knowledge proof. The `zk_proof_id` and `zk_proof_url` in the response provide cryptographic verification of the result.

***

### Why scenarios matter

A policy that compiles is not a policy that works. The automated reasoning translator is a set of foundation models. They can disagree with each other. Two similar variables, a vague description, or an ambiguous unit (months vs years, tokens vs dollars) can cause `TRANSLATION_AMBIGUOUS` at runtime, the exact moment your agent is trying to execute a real action.

Scenarios isolate the most suspect variable combinations before any real action is evaluated. Fixing them at compile time costs a rebuild. Finding them in production costs a failed guardrail.

***

### Quick reference: the recommended flow

1. `makeRules`: compile policy
2. `GET /scenarios`: review scenarios, submit thumbs-up (free) and thumbs-down (counts toward 10)
3. `GET /variables`: inspect extracted variables, identify junk and vague descriptions
4. `refinePolicyVariables`: queue variable deletions and description improvements (counts toward 10)
5. `refinePolicy`: apply all queued changes in one rebuild (max 10 annotations total)
6. `runPolicyTests`: confirm all test cases pass
7. `checkIt`: verify end-to-end with real actions


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.icme.io/documentation/learning/battle-testing.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
