Battle Testing

Tools to help you drive your rules to 100% coverage before production.

When you compile a natural language policy into formal logic, we generate a set of variable assignments that are logically possible given your rules. These are called scenarios. They are sorted to surface the combinations most likely to reveal a wrong or ambiguous rule first.

The problem they solve: a policy can compile successfully and pass a satisfiability check while still producing incorrect results at runtime. The automated reasoning translator maps natural language to your policy variables using foundation models. If two variables could reasonably represent the same concept, or if a variable description is vague, the translator will disagree with itself, returning TRANSLATION_AMBIGUOUS on real actions. Battle testing catches this before your policy goes into production.


The flow

makeRules
    compiles policy to SMT
    fetches scenarios (deduplicated, SATISFIABLE first)
    saves scenarios to DB
    returns policy_id + scenario_count + next_steps

GET /v1/policy/{id}/scenarios
    returns scenarios for review

POST /v1/submitScenarioFeedback  (per scenario)
    approved: true   saves SATISFIABLE test case, done (free, no annotation queued)
    approved: false  saves INVALID test case + queues annotation for rebuild

GET /v1/policy/{id}/variables
    returns extracted variables, types, and rules
    use this to identify junk variables and vague descriptions

POST /v1/refinePolicyVariables  (storage only, instant)
    queues deleteVariable, deleteRule, or updateVariable annotations
    does not trigger a rebuild
    call refinePolicy to apply

POST /v1/refinePolicy  (after all feedback and variable changes are queued)
    merges scenario annotations + variable annotations into one rebuild
    max 10 annotations total per call
    polls until complete
    exports refined policy, compiles new SMT
    updates existing guardrail (never creates a new one)
    writes fresh SMT + workflow_id + guardrail_ver to DB
    clears both pending queues
    returns fresh scenarios for next review round

POST /v1/runPolicyTests
    runs all saved test cases against the compiled policy
    returns passed / failed / ambiguous counts

Step 1: Compile your policy

Call makeRules. This is an SSE stream; wait for the done event. It will contain:

  • policy_id: save this, you need it for everything else

  • scenario_count: number of generated scenarios

  • next_steps.get_scenarios: the URL to retrieve scenarios
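For illustration, a done event payload might look like the following. Only policy_id, scenario_count, and next_steps.get_scenarios are documented above; the concrete values and any other fields are hypothetical:

```json
{
  "policy_id": "pol_abc123",
  "scenario_count": 12,
  "next_steps": {
    "get_scenarios": "/v1/policy/pol_abc123/scenarios"
  }
}
```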

Compilation time: makeRules compiles your natural language policy into SMT-LIB2 formal logic, generates an automated reasoning policy, and fetches adversarial scenarios for battle testing. This involves an LLM translation pass, an SMT compilation step, and a build workflow that polls until the reasoning model is ready. The full pipeline typically completes in 2 to 7 minutes depending on rule count. This is a one-time cost. You compile once and check against the same policy_id for every subsequent action. Recompile only when your requirements change.


Step 2: Review scenarios

Each scenario is a set of variable assignments that are logically possible given your compiled rules. PreFlight surfaces the most suspicious combinations first.

For each scenario ask: could this situation actually happen in my domain?

A correct scenario:

A tool missing project scoping is correctly blocked. Thumbs up.

A bad scenario:

A scenario shows a tool that is simultaneously permitted and has its bid rejected. If the tool is permitted, the bid must not be rejected, so this combination should be impossible. Thumbs down with an annotation.
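As a sketch, a scenario is just a set of variable assignments. The shape below is illustrative (variable names are taken from the examples in this guide; the exact response schema may differ):

```json
{
  "scenario_id": "scn_001",
  "result": "SATISFIABLE",
  "assignments": {
    "toolIsPermitted": true,
    "bidMustBeRejected": true
  }
}
```

This is the bad scenario above: both assignments are true at once, which your domain should forbid.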


Step 3: Submit scenario feedback

Thumbs up: scenario looks correct for your domain. Creates a test case but does not queue an annotation. Does not count toward the 10-annotation limit.

Thumbs down: scenario should be impossible. Requires an annotation explaining why. Be specific. Name the variables, the values, and the rule being violated. Counts toward the 10-annotation limit.

The annotation is what PreFlight uses to deduce the rule or variable change needed. The more specific the better.
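For example, a thumbs-down submission with a specific annotation might look like this (field names are illustrative; the annotation text is the example from the table below):

```json
{
  "policy_id": "pol_abc123",
  "scenario_id": "scn_001",
  "approved": false,
  "annotation": "If toolIsPermitted is true then bidMustBeRejected must be false. A tool cannot be simultaneously permitted and rejected."
}
```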

Correcting a mismatched expected result: if a test case was saved with the wrong expected result, delete and replace it:

delete_test_case_updated_at comes from the testCase.updatedAt field in runPolicyTests results.
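A hypothetical delete-and-replace payload (the exact request shape is an assumption; only the delete_test_case_updated_at field and its source are documented here):

```json
{
  "policy_id": "pol_abc123",
  "scenario_id": "scn_001",
  "approved": true,
  "delete_test_case_updated_at": "2025-01-01T12:00:00Z"
}
```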


Step 4: Review extracted variables

If scenarios revealed issues, or if you want to improve translation accuracy, inspect what the compiler extracted. The compiler may auto-generate extra variables and rules that are not in your original policy. These add noise to the translation process and can cause TRANSLATION_AMBIGUOUS results at runtime.

The response includes:

  • variables: each with a name, type, and description

  • types: custom enum types with their possible values

  • rules: each with an id and human-readable description

What to look for:

  • Junk variables you did not define in your policy (e.g. exceptionsAllowed, featureCompensationAllowed, hasProvisionalApproval). These are auto-generated meta-variables that compete with your real variables during translation.

  • Vague descriptions like "Whether the tool supports assigning tasks to team members." Compare with a description that works well: "True if the checkout domain is in the approved merchant registry, false otherwise." Detailed descriptions with explicit true/false guidance produce better translation accuracy.

  • Bare assertion rules (e.g. exceptionsAllowed is false) rather than conditional if-then rules. These are often tied to the auto-generated variables.

Note the variable names and rule IDs for anything that needs to be deleted or updated.


Step 5: Queue variable and rule changes

Based on what you found in Steps 2-4 (scenario issues, junk variables, vague descriptions), queue any structural changes. This endpoint stores annotations instantly without triggering a rebuild.

Annotations are queued in pending_variable_annotations. They are not applied until you call refinePolicy.
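An illustrative request body, assuming a JSON schema like the one below (the annotation type names mirror the operations listed in the flow; the exact schema may differ):

```json
{
  "policy_id": "pol_abc123",
  "annotations": [
    { "type": "deleteVariable", "variable_name": "exceptionsAllowed" },
    {
      "type": "updateVariable",
      "variable_name": "toolSupportsAssigningTasks",
      "description": "True if the tool supports assigning tasks to team members by any method including user ID, account ID, username, or member ID. False otherwise."
    }
  ]
}
```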

Important: The reasoning engine allows a maximum of 10 annotations per refine call. Thumbs-up scenario feedback does not count (it only creates test cases). Thumbs-down scenario feedback and variable/rule changes both count. Plan accordingly.


Step 6: Rebuild with all annotations

Once you have submitted all scenario feedback and queued any variable/rule changes, trigger a single rebuild that batches everything at once:

This is an SSE stream. Steps:

  1. Fetches the current policy definition

  2. Merges scenario annotations and variable/rule annotations into a single batch (max 10)

  3. Submits all as one REFINE_POLICY build

  4. Polls until the build completes (typically 2 to 3 minutes)

  5. Exports the refined policy and compiles a new SMT string

  6. Updates the existing guardrail to point at the refined policy version

  7. Writes the new SMT, workflow ID, and guardrail version to the database

  8. Clears both pending annotation queues

  9. Fetches fresh scenarios from the refined policy for the next review round

The done event will contain fresh scenario_count. Your policy_id stays the same throughout.
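For illustration, the done event might carry something like the following (scenario_count, workflow_id, and guardrail_ver are documented above; the values and exact field layout are hypothetical):

```json
{
  "policy_id": "pol_abc123",
  "scenario_count": 8,
  "workflow_id": "wf_refine_001",
  "guardrail_ver": 2
}
```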

Note: refinePolicy never creates a new guardrail. It updates the existing one in place. Each policy supports a limited number of rebuild workflows (typically 1 refine after the initial compilation), so batch all your changes into a single refinePolicy call. If you need more iterations, compile a fresh policy with makeRules and start the battle testing flow again.


Step 7: Confirm with tests

Call runPolicyTests. The response reports passed, failed, and ambiguous counts across every saved test case.
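A hypothetical response shape (only the counts and testCase.updatedAt are documented in this guide; everything else is illustrative):

```json
{
  "passed": 9,
  "failed": 1,
  "ambiguous": 0,
  "testCases": [
    {
      "result": "INVALID",
      "expected": "SATISFIABLE",
      "updatedAt": "2025-01-01T12:00:00Z"
    }
  ]
}
```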

If tests fail:

  • failed: the rule logic is wrong. Submit a thumbs-down annotation and call refinePolicy again.

  • ambiguous: the translator is disagreeing on how to map natural language to your variables. Improve the variable description using refinePolicyVariables with an update_variables annotation.


Step 8: Verify end-to-end

Once tests pass, confirm the policy enforces correctly with real actions:

Tip: Include "Therefore this [action] is permitted." at the end of your action text. This gives the automated reasoning translator a clear claim to verify against the premises. Without it, the translator treats all statements as claims with no premises, which produces SATISFIABLE (uncertain) instead of VALID (proven) results.
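As a sketch (the request shape is an assumption), an action text that ends with the closing claim:

```json
{
  "policy_id": "pol_abc123",
  "action": "The tool authenticates with OAuth 2.0 and the checkout domain is in the approved merchant registry. Therefore this action is permitted."
}
```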


Writing good annotations

The annotation is the most important input in the refinement loop. PreFlight uses it to deduce rule and variable changes. Vague annotations produce vague fixes.

Weak: "This is wrong"
Strong: "If toolIsPermitted is true then bidMustBeRejected must be false. A tool cannot be simultaneously permitted and rejected. Rules 2-13 tie rejection to non-permission."

Weak: "The variable is confusing"
Strong: "toolSupportsAssigningTasks description should say: True if the tool supports assigning tasks to team members by any method including user ID, account ID, username, or member ID. False otherwise."

Weak: "Fix the auth rule"
Strong: "toolAuthenticationMethod description should state: Set to OAUTH_2_0 when the tool uses OAuth 2.0. Set to AuthenticationMethod_OTHER for any other method including API key or basic auth."


Writing good variable descriptions

Variable descriptions are the single most important factor in translation accuracy. The automated reasoning engine uses multiple foundation models to translate natural language into formal logic. Each model translates independently. When descriptions are vague, the models disagree, producing TRANSLATION_AMBIGUOUS results.

Pattern that works: explicit true/false guidance, e.g. "True if the checkout domain is in the approved merchant registry, false otherwise."

Pattern that causes ambiguity: a bare capability statement, e.g. "Whether the tool supports assigning tasks to team members."

Good variable descriptions should:

  • Start with "True if ... false otherwise" for booleans

  • Include synonyms and alternative phrasings users might use

  • Specify enum value mappings explicitly (e.g. "Set to OAUTH_2_0 when the tool uses OAuth 2.0")

  • Describe boundary conditions (e.g. "Set to false when identity verification status is unknown or not mentioned")

Use GET /v1/policy/{id}/variables to inspect descriptions and POST /v1/refinePolicyVariables with update_variables to improve them.


Understanding result types

SATISFIABLE: The scenario is consistent with your rules under some conditions. Realistic and expected.

VALID: The scenario is provably correct given your rules. The automated reasoning engine confirmed it with mathematical proof.

INVALID: The scenario contradicts your rules. If you marked it as thumbs-up, this is a test failure.

TRANSLATION_AMBIGUOUS: The translator disagreed on how to map natural language to variables. Fix the variable description.

IMPOSSIBLE: The premises in the scenario contradict each other, or the policy has conflicting rules.


Verification architecture

Each checkIt call runs three independent solvers:

  1. Local extraction model (OxiZ): Extracts structured values from the natural language action text and evaluates them against the compiled SMT policy.

  2. Automated reasoning engine: Independently translates the raw action text to formal logic and checks it against the policy rules. Returns VALID (proven correct), INVALID (proven wrong), SATISFIABLE (correct under some conditions), or TRANSLATION_AMBIGUOUS (models disagreed on translation).

  3. Z3 SMT solver: Formally verifies the extracted values against the SMT constraints.

The system requires consensus across solvers before returning a final verdict. When the automated reasoning engine returns VALID, you get a clean SAT with mathematical proof. When it returns TRANSLATION_AMBIGUOUS, the translation models disagreed on interpretation, but if the local model and Z3 both confirm SAT, the result is still reliable. The three-solver architecture ensures correct results even when one layer is uncertain.
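The consensus rules described above can be sketched as follows. This is only an illustration of the decision logic in this section; the actual implementation, verdict strings, and tie-breaking behavior are assumptions:

```python
def final_verdict(local_sat: bool, reasoning: str, z3_sat: bool) -> str:
    """Illustrative consensus across the three solvers.

    local_sat  -- did the local extraction model (OxiZ) evaluate SAT?
    reasoning  -- automated reasoning engine result: VALID, INVALID,
                  SATISFIABLE, or TRANSLATION_AMBIGUOUS
    z3_sat     -- did the Z3 SMT solver confirm the extracted values?
    """
    if reasoning == "VALID" and local_sat and z3_sat:
        return "SAT (proven)"            # clean SAT with mathematical proof
    if reasoning == "INVALID":
        return "UNSAT"                   # proven violation
    if reasoning == "TRANSLATION_AMBIGUOUS" and local_sat and z3_sat:
        return "SAT (local + Z3 agree)"  # translation unclear, logic confirmed
    if reasoning == "SATISFIABLE" and local_sat and z3_sat:
        return "SAT"                     # consistent under some conditions
    return "NO_CONSENSUS"                # solvers disagree: no final verdict
```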

Every check also generates a zero-knowledge proof. The zk_proof_id and zk_proof_url in the response provide cryptographic verification of the result.


Why scenarios matter

A policy that compiles is not a policy that works. The automated reasoning translator is a set of foundation models. They can disagree with each other. Two similar variables, a vague description, or an ambiguous unit (months vs years, tokens vs dollars) can cause TRANSLATION_AMBIGUOUS at runtime, the exact moment your agent is trying to execute a real action.

Scenarios isolate the most suspect variable combinations before any real action is evaluated. Fixing them at compile time costs a rebuild. Finding them in production costs a failed guardrail.


The full loop:

  1. makeRules: compile policy

  2. GET /scenarios: review scenarios, submit thumbs-up (free) and thumbs-down (counts toward 10)

  3. GET /variables: inspect extracted variables, identify junk and vague descriptions

  4. refinePolicyVariables: queue variable deletions and description improvements (counts toward 10)

  5. refinePolicy: apply all queued changes in one rebuild (max 10 annotations total)

  6. runPolicyTests: confirm all test cases pass

  7. checkIt: verify end-to-end with real actions
