Verifiable API Discovery: How Buyer Agents Find the Right Tool

AI agents are choosing their own tools at runtime, and tool selection accuracy collapses as the menu grows. Researchers behind the RAG-MCP project measured a drop from 43% to under 14% when agents faced bloated tool sets. The Speakeasy team found that reducing Playwright's MCP server from 26 tools to 8 dramatically improved agent accuracy. GitHub cut its Copilot MCP server from 40 tools to 13 after observing measurable benchmark regressions. Block rebuilt its Linear MCP server three times, going from 30+ tools down to 2.

MCP registries now list thousands of tools across hundreds of servers. When an agent needs a task management tool, it discovers dozens of candidates with overlapping names and near-identical descriptions: create_issue, create_ticket, create_card, add_task. Research on MCP tool selection shows agents start failing at 30+ tools when descriptions overlap, and virtually guarantee wrong picks at 100+. Today's solution is semantic search: an LLM reads descriptions and picks the best match. This is the same model-judging-model pattern that fails for every other guardrail.

The problem gets worse when you add incentives. In an agent marketplace, seller agents advertise tools to buyer agents. Sellers are optimizing for selection. A Trello agent might emphasize "project management" when the buyer needs "engineering task tracking." A Jira agent might bury the fact that it uses API key auth instead of OAuth. The descriptions aren't wrong; they are just optimized. Semantic search cannot tell the difference between a tool that fits and a tool that sounds like it fits.

ICME PreFlight replaces semantic matching with formal verification. The buyer compiles capability requirements into a private policy. Each seller advertises its tool's features. The SMT solver checks whether the advertised capabilities satisfy the buyer's constraints. SAT means the tool has everything the buyer needs. UNSAT means it doesn't. The seller never sees which rules exist or which one it failed.


Attack surface

Unlike traditional API selection where a developer evaluates documentation, agentic tool selection happens autonomously at machine speed. The agent reads tool descriptions, evaluates fit, and commits, all without human review. The failure modes are different from prompt injection. They are procurement failures.

Description overlap: Five tools all say "create tasks with labels and assignees." The agent picks the one with the highest semantic similarity score, which happens to be the wrong platform.

Feature inflation: A seller describes its tool as supporting "project scoping" when it only supports flat lists. The description is technically not false, but it does not match what the buyer needs.

Auth mismatch: The buyer requires OAuth 2.0. The seller's tool uses API keys. The description says "authenticated access" without specifying the method. Semantic search treats both as equivalent.

Capability gap: A tool supports reading issues but not creating them. The seller's description says "issue management." The agent selects it for a write operation and fails at runtime.

Unverified seller: An unknown agent advertises a tool with perfect capabilities. There is no identity verification. The tool may not exist, may not work, or may be a front for data collection.


Why semantic search does not catch this

Semantic search ranks tools by how similar their descriptions are to the query. It does not verify whether the tool actually has the required capabilities. A tool described as "create and manage engineering tasks with full project support" scores higher than "create issues in Linear with label and assignee support via OAuth 2.0," even though the second one is the better match for a buyer that needs OAuth, labels, and Linear-compatible workflows.

The fundamental issue: similarity is not satisfiability. Two descriptions can be semantically close while describing tools with completely different capabilities. And when sellers know how the ranking works, they optimize their descriptions for the ranking, not for accuracy.

PreFlight compiles the buyer's requirements to formal logic and checks each seller's advertised capabilities against a mathematical solver. The solver does not evaluate how similar a description sounds. It checks whether the stated capabilities satisfy the constraints. A tool that says "authentication method is API key" returns UNSAT against a policy requiring OAuth 2.0, regardless of how the rest of the description is worded.
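The contrast can be sketched in a few lines. This is a toy model, not the PreFlight implementation: the capability fields, policy shape, and word-overlap "similarity" below are all illustrative stand-ins, but they show why a ranking signal and a satisfiability check disagree.

```python
# Toy contrast: similarity ranking vs. constraint satisfaction.
# All field names and the policy shape below are illustrative.

def word_overlap(query: str, description: str) -> float:
    """Naive semantic proxy: fraction of query words found in the description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q)

def satisfies(policy: dict, capabilities: dict) -> bool:
    """SAT iff every constraint is met by the advertised capabilities."""
    return all(capabilities.get(var) == want for var, want in policy.items())

policy = {"auth": "oauth2", "labels": True, "creates_items": True}

vague = {
    "description": "create and manage engineering tasks with full project support",
    "caps": {"auth": "api_key", "labels": True, "creates_items": True},
}
precise = {
    "description": "create issues with label and assignee support via oauth 2.0",
    "caps": {"auth": "oauth2", "labels": True, "creates_items": True},
}

query = "create and manage engineering tasks"
# The vague description wins the similarity contest...
assert word_overlap(query, vague["description"]) > word_overlap(query, precise["description"])
# ...but the solver-style check rejects it and accepts the precise tool.
assert not satisfies(policy, vague["caps"])
assert satisfies(policy, precise["caps"])
```

The ranking function rewards wording; the satisfiability check ignores wording entirely and evaluates only the stated capabilities.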


This use case has three properties that make it different from the standard guardrail pattern:

No product names in the policy. The buyer does not say "must be Linear" or "must be Jira." It says "must support OAuth 2.0, must support labels, must support assigning tasks, must support creating new items, must support project scoping." If Linear satisfies those requirements, Linear wins. If Jira also satisfies them, Jira also wins. Best capabilities win, not brand names.

Sellers do not know the policy. They advertise their tool's features without knowing which features the buyer cares about. They cannot game the policy because they cannot see it. ZK proofs guarantee this privacy: the seller learns SAT or UNSAT, but never which rules exist or which one it failed on.

SAT receipts are contracts. When a seller's bid returns SAT, the receipt is a cryptographic record of exactly what the seller claimed. If the tool does not deliver the advertised capabilities, the receipt is proof of what was promised. The seller cannot forge the receipt. The buyer cannot forge a rejection. Honesty becomes the dominant strategy because lying creates provable liability with no information advantage.

Prerequisites

Before compiling a policy or checking actions, create an account and top up credits. Account creation costs $5 and includes 325 credits to get started. Compiling a policy costs 300 credits ($3.00, one-time). Each checkIt call costs 1 credit ($0.01).

Open the returned checkout_url in your browser to complete payment. Save the api_key from the response. For full setup options, including USDC payment via x402 for crypto, see the quickstart: https://docs.icme.io/documentation/getting-started/quickstart
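The pricing above reduces to simple arithmetic. A small budget helper, using only the figures stated in this guide, shows that one policy compilation plus 25 bid checks fits exactly in the signup allotment:

```python
# Credit budget from the pricing above: signup includes 325 credits,
# makeRules costs a one-time 300 credits, each checkIt costs 1 credit.
SIGNUP_CREDITS = 325
MAKE_RULES_CREDITS = 300
CHECK_IT_CREDITS = 1

def credits_needed(policies: int, checks: int) -> int:
    """Total credits for compiling `policies` policies and running `checks` checks."""
    return policies * MAKE_RULES_CREDITS + checks * CHECK_IT_CREDITS

# One compiled policy plus 25 seller-bid checks uses the full 325 credits.
assert credits_needed(policies=1, checks=25) == SIGNUP_CREDITS
```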

Buyer policy

The buyer's requirements are pure capability constraints. No vendor names. No brand preferences.

Save the returned policy_id. Pass it on every checkIt call.
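A sketch of assembling the makeRules request follows. The capability wording comes from this guide; the payload field name and the surrounding call details are assumptions for illustration, so check the API reference before relying on them:

```python
# Hypothetical makeRules payload. The "rules" field name is an assumption;
# only the capability requirements themselves come from this guide.
requirements = [
    "must support OAuth 2.0",
    "must support labels",
    "must support assigning tasks",
    "must support creating new items",
    "must support project scoping",
]

payload = {"rules": requirements}
# POST this to the makeRules endpoint with your api_key, then persist the
# policy_id from the response and pass it on every subsequent checkIt call.
```

Note what is absent: no "Linear", no "Jira", no vendor names. The policy is a pure capability spec.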

Note on compilation time: makeRules compiles your natural language policy into SMT-LIB2 formal logic, generates an Automated Reasoning policy, and fetches adversarial scenarios for battle testing. This involves an LLM translation pass, an SMT compilation step, and a build workflow that polls until the reasoning model is ready. The full pipeline typically completes in 2 to 7 minutes depending on rule count and queue depth. This is a one-time cost: you compile once and check against the same policy_id for every subsequent seller bid. Recompile only when your requirements change. makeRules gets you to roughly 99% coverage; battle testing takes you the rest of the way.

A policy that compiles may still contain ambiguous terms or uncertain definitions. Battle testing raises coverage to 100% for all terms. The makeRules response includes generated scenarios: variable assignments that are logically possible given your rules, sorted to surface the combinations most likely to reveal a wrong or ambiguous rule first.

Review each scenario and variable and ask: could this situation actually happen in my domain? If the scenario looks correct, submit thumbs-up feedback. If the scenario should be impossible given your policy, submit thumbs-down feedback with an annotation explaining why. The annotation is what PreFlight uses to deduce the rule or variable change needed. Name the variables, name the values, name the rule being violated.

If you submitted any thumbs-down feedback, call refinePolicy to rebuild the policy in place with your annotations applied. This is an SSE stream that typically takes a few minutes. It batches all pending annotations into a single rebuild and returns fresh scenarios for the next review round. Your policy_id stays the same throughout.
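Since refinePolicy streams over SSE, the client consumes "data:" lines until the rebuild completes. A minimal parser sketch, assuming JSON event payloads (the event fields shown are illustrative; only the "data:" prefix is standard SSE framing):

```python
import json

def parse_sse_events(lines):
    """Yield decoded JSON payloads from the 'data:' lines of an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Simulated stream; in production these lines come from the HTTP response.
# The "status" and "scenarios" fields are assumed for illustration.
stream = [
    'data: {"status": "rebuilding"}',
    'data: {"status": "done", "scenarios": 12}',
]
events = list(parse_sse_events(stream))
assert events[-1]["status"] == "done"
```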

Full documentation: https://docs.icme.io/documentation/learning/battle-testing

Check every seller bid before the agent commits

SAT = tool has all required capabilities. UNSAT = it is missing something. Every decision returns a cryptographic receipt.
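The per-bid loop looks roughly like this. The check_tool function below is a stub standing in for a real checkIt HTTP call, and the response fields (result, check_id) mirror this guide's description rather than a confirmed schema:

```python
# Shortlisting loop: check every seller bid, keep only explicit SAT,
# and store the receipt id alongside each accepted tool.

def check_tool(policy_id: str, bid: dict) -> dict:
    """Stub for the checkIt API; replace with a real HTTP call."""
    sat = bid["auth"] == "oauth2" and bid["creates_items"]
    return {"result": "SAT" if sat else "UNSAT", "check_id": f"rcpt-{bid['tool']}"}

bids = [
    {"tool": "linear.create_issue", "auth": "oauth2", "creates_items": True},
    {"tool": "trello.create_card", "auth": "api_key", "creates_items": True},
    {"tool": "notion.query_database", "auth": "oauth2", "creates_items": False},
]

shortlist = []
for bid in bids:
    verdict = check_tool("policy-123", bid)
    if verdict["result"] == "SAT":  # anything else fails closed
        shortlist.append((bid["tool"], verdict["check_id"]))

assert [tool for tool, _ in shortlist] == ["linear.create_issue"]
```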

Live results

SAT: Linear create_issue -- all capabilities match

All six capability constraints are satisfied. Three independent solvers (the local OxiZ extraction model, Automated Reasoning, and Z3) all agree SAT. The buyer can proceed with this tool. The receipt records exactly what the seller claimed.

Understanding "AR uncertain" results

The Automated Reasoning layer may return "AR uncertain" on some checks. This means the AR engine found a valid interpretation of the action that satisfies the policy, but could not prove it is the only possible interpretation. This happens when policy variables use integer-encoded enums (e.g. toolAuthenticationMethod = 0 for OAuth 2.0) and the AR translator is not fully confident in the mapping from natural language to the integer value.

It is not an error. The system requires unanimous agreement from all three solvers before returning SAT. When AR is uncertain, the local OxiZ model and Z3 must both independently confirm SAT for the result to pass. The three-solver consensus provides the same effective confidence as a clean AR verdict. In practice, if all three agree, the result is reliable.
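The consensus rule above can be sketched as a small function: SAT requires agreement from every solver that returned a verdict, and an uncertain AR is tolerated only when the other two solvers both independently confirm. This is an illustrative reconstruction of the logic described in this section, not the actual implementation:

```python
# Three-solver consensus sketch: unanimous SAT among non-uncertain
# verdicts, with at least two confirming solvers.
def final_verdict(llm: str, ar: str, z3: str) -> str:
    votes = [v for v in (llm, ar, z3) if v != "uncertain"]
    if len(votes) >= 2 and all(v == "SAT" for v in votes):
        return "SAT"
    return "UNSAT"

assert final_verdict("SAT", "SAT", "SAT") == "SAT"        # clean unanimity
assert final_verdict("SAT", "uncertain", "SAT") == "SAT"  # AR uncertain, others confirm
assert final_verdict("SAT", "UNSAT", "SAT") == "UNSAT"    # any dissent fails
```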

The battle testing cycle described above can reduce AR uncertainty. Submitting annotations that clarify variable descriptions and enum encodings (e.g. "toolAuthenticationMethod 0 = OAuth 2.0, 1 = API key") and then calling refinePolicy gives the AR translator more precise mappings to work with.

SAT: Jira create_ticket -- also has all capabilities

Jira also satisfies all capability constraints. The policy contains no brand preference, so both Linear and Jira are valid matches. The buyer now has a verified shortlist of two tools and can choose between them on other criteria: price, latency, existing integrations, team preference.

This is the point. No brand favoritism. Capabilities decide.

UNSAT: Trello create_card -- API key auth, not OAuth 2.0

Trello has five of six required capabilities but uses API key authentication. The solver identifies the single failing constraint. The seller learns UNSAT but not which rule it failed on. If Trello later adds OAuth 2.0 support, the same bid with updated auth would return SAT.

UNSAT: GitHub Issues -- no project scoping

GitHub Issues supports OAuth, labels, assignees, and write operations. It fails on project scoping. A semantic search system would rank this tool highly because it matches five of six criteria. The solver does not rank. It returns UNSAT because the constraints are not fully satisfied.

UNSAT: Notion query_database -- read-only, cannot create items

Notion's query tool is read-only. The seller honestly advertises this. The solver catches the capability gap immediately. This prevents the common failure where an agent selects a query tool for a write operation and fails at runtime.

UNSAT: Unknown agent -- no verified identity

Every required capability is present. It fails on seller identity alone. This is the sybil defense: even a perfect tool from an unverified seller is rejected. The buyer's policy can require any trust signal (verified identity, registry listing, reputation score) as a boolean constraint alongside capability requirements.

Reading the extracted variables

Every checkIt response includes an extracted map showing exactly what the solver evaluated. Boolean variables (toolSupportsAssigningTasksToTeamMembers, toolSupportsCreatingNewItems, etc.) are true or false. Enum variables use integer encoding: toolAuthenticationMethod is 0 for OAuth 2.0 and 1 for API key. sellerVerificationStatus is 0 for verified and 1 for unverified.
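Decoding the extracted map is mechanical. The enum encodings below come directly from this section; the sample map itself is illustrative:

```python
# Decode the integer-encoded enums in a checkIt "extracted" map,
# using the encodings stated above.
AUTH = {0: "OAuth 2.0", 1: "API key"}
SELLER = {0: "verified", 1: "unverified"}

extracted = {  # sample values for illustration
    "toolAuthenticationMethod": 1,
    "sellerVerificationStatus": 0,
    "toolSupportsCreatingNewItems": True,
}

decoded = dict(extracted)
decoded["toolAuthenticationMethod"] = AUTH[extracted["toolAuthenticationMethod"]]
decoded["sellerVerificationStatus"] = SELLER[extracted["sellerVerificationStatus"]]

assert decoded["toolAuthenticationMethod"] == "API key"
assert decoded["sellerVerificationStatus"] == "verified"
```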

The response also includes three independent solver results. llm_result is from the local OxiZ extraction model. ar_result is from Automated Reasoning. z3_result is from the Z3 SMT solver. The system requires consensus across all available solvers before returning a final verdict. A zk_proof_id and zk_proof_url are included for cryptographic verification of the result.

Why two SAT results are the correct outcome

A guardrail that returns only one winner is a recommendation engine, not a verifier. PreFlight returns SAT for every tool that satisfies all constraints. In this example, both Linear and Jira pass. This is correct.

Now the buyer has a verified shortlist. Every tool on it provably meets the capability requirements. The buyer can select between them on secondary criteria that do not belong in a formal policy: team familiarity, pricing, existing integrations, response latency, or simple preference. The policy handles the hard constraints. The buyer handles the soft ones.

If the buyer wants to narrow the shortlist further, they add more rules and recompile. Each additional constraint costs nothing at check time. The policy grows more specific. The shortlist gets shorter. This is iterative procurement, not one-shot guessing.
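Selection over a verified shortlist is then an ordinary ranking problem, entirely outside the policy. A sketch with illustrative secondary criteria (the price and latency fields are invented for this example):

```python
# Hard constraints are already settled by the solver; rank the
# survivors on soft criteria. Fields here are illustrative.
shortlist = [
    {"tool": "linear.create_issue", "price_per_call": 0.02, "latency_ms": 180},
    {"tool": "jira.create_ticket", "price_per_call": 0.01, "latency_ms": 450},
]

# Prefer lower price, breaking ties on latency.
chosen = min(shortlist, key=lambda t: (t["price_per_call"], t["latency_ms"]))
assert chosen["tool"] == "jira.create_ticket"
```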

Trust model

PreFlight does not make sellers honest. It makes dishonesty provable and enforceable.

Sellers describe what they have. They do not know the buyer's requirements. They cannot optimize their description to hit specific criteria because they cannot see the criteria. The best strategy is to describe the tool's actual capabilities accurately.

Buyer policy stays private. ZK proofs guarantee this. The seller learns the result (SAT or UNSAT) but never which rules exist, how many there are, or which one it failed on.

SAT receipts are cryptographic contracts. The receipt records the seller's exact claims under a tamper-proof proof. If the tool does not deliver the advertised capabilities, the receipt is evidence of what was promised. The seller cannot later claim they never said OAuth 2.0 was supported.

If a seller lies, the lie is recorded. A seller that claims "authentication method is OAuth 2.0" when the tool actually uses API keys will receive SAT. But the receipt records that specific claim. When the buyer connects and the OAuth handshake fails, the receipt proves the seller misrepresented its tool. This is not a limitation. It is the enforcement mechanism. The solver does not detect lies at bid time. It creates the conditions under which lies have consequences.

Defense in depth. For high-stakes selections, buyers can add a verification layer between bid and commitment: pull the tool's schema directly from the MCP server, construct structured values from the schema, and check those values via /v1/verify (which takes key-value pairs, no LLM extraction). This catches misrepresentation before the buyer commits, without trusting the seller's description at all.
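A sketch of that verification layer: derive key-value pairs from a schema pulled straight from the MCP server and check them via /v1/verify. The schema shape and the request field names are assumptions for illustration; the variable names and enum encodings match those used earlier in this guide:

```python
# Defense-in-depth sketch: build structured values from the tool's own
# schema instead of trusting the seller's description. The schema shape
# here is illustrative.
schema = {  # as if fetched directly from the seller's MCP server
    "name": "create_issue",
    "auth": {"type": "oauth2"},
    "operations": ["create", "read"],
    "fields": ["labels", "assignee", "project"],
}

values = {
    "toolAuthenticationMethod": 0 if schema["auth"]["type"] == "oauth2" else 1,
    "toolSupportsCreatingNewItems": "create" in schema["operations"],
    "toolSupportsLabels": "labels" in schema["fields"],
}
# POST {"policy_id": ..., "values": values} to /v1/verify (payload field
# names assumed) and commit only on an explicit SAT.
```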

Deploying in production

Compile once. Call makeRules with your capability requirements. Store the policy_id in your environment. The compilation takes 2 to 7 minutes but is a one-time cost. Recompile only when your requirements change.

Battle test before going live. Review the generated scenarios, submit feedback, refine if needed, and run tests until all pass. This catches ambiguous variable mappings and rule conflicts before any real action is evaluated. Fixing problems at compile time costs a rebuild. Finding them in production costs a failed guardrail.

Check every bid. Call checkIt for each candidate tool the agent discovers. This is 1 credit ($0.01) per check and completes in under 12 seconds.

Accept all SAT results as a shortlist. Do not assume only one tool should pass. Multiple tools may satisfy your requirements. Let your agent select among verified candidates on secondary criteria.

Store receipts. Every check_id is a cryptographic audit receipt. Store it alongside the tool selection decision. If the tool underperforms, the receipt documents what the seller claimed at selection time.

Treat UNSAT as final. Do not retry the same bid with rephrased descriptions. UNSAT means the advertised capabilities do not satisfy the requirements. If the seller updates its tool, a new bid with updated capabilities can be checked.

Fail closed. If the ICME API is unreachable or returns anything other than an explicit SAT, do not commit to the tool. An unavailable guardrail is not implicit permission.
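A fail-closed wrapper makes this concrete: UNSAT, an error response, or an unreachable API all map to rejection. The check_fn parameter stands in for a real checkIt call; the response shape is assumed:

```python
# Fail-closed wrapper: only an explicit SAT approves the tool.

def is_approved(check_fn, policy_id: str, bid: dict) -> bool:
    try:
        verdict = check_fn(policy_id, bid)
    except Exception:
        return False  # unreachable guardrail is not implicit permission
    return verdict.get("result") == "SAT"

def unreachable(policy_id, bid):
    raise ConnectionError("ICME API unreachable")

assert is_approved(lambda p, b: {"result": "SAT"}, "policy-123", {}) is True
assert is_approved(lambda p, b: {"result": "UNSAT"}, "policy-123", {}) is False
assert is_approved(unreachable, "policy-123", {}) is False
```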

Iterate the policy. If too many tools return SAT and the shortlist is too broad, add more constraints and recompile. If no tools return SAT, relax a constraint. The policy is your procurement spec. Treat it like one.
