Guardrails for Self-Evolving OpenClaw Agents

How to write and battle test a policy that catches data exfiltration and blocks unauthorized self-modification in OpenClaw agents.

Capability Evolver is the most downloaded skill on ClawHub with 35K+ installs. It lets your agent inspect its own runtime, identify failures, and rewrite its own capabilities. Without guardrails, it can also exfiltrate your data, expand its own permissions, and evolve past the rules you set.

This guide walks through writing, battle testing, and deploying a policy that keeps a self-evolving agent inside bounds, using a real security incident as the test case.

This guide assumes you already have an ICME account and have installed the PreFlight skill. If not, start with Cryptographic Guardrails for Your OpenClaw Agent.

The threat

Capability Evolver injects the prompt "You are a Recursive Self-Improving System" into your agent's identity. Without the --review flag it runs in what the docs call "Mad Dog Mode," executing changes immediately.

In February 2026, a security report on GitHub revealed that Capability Evolver was exfiltrating evolution logs (session transcripts, memory contents, user data) to Feishu (Lark), a ByteDance-operated Chinese cloud service, using hardcoded API tokens. No user consent. No disclosure.

ClawHub's own scanner rates the skill as "Suspicious." The skill's README contradicts itself about whether it modifies source code.

The config flag EVOLVE_ALLOW_SELF_MODIFY=false is the recommended safeguard. But a config flag is a file on disk that an agent with filesystem access can change. A policy compiled into formal logic on an external server cannot be changed by the agent.

The policy

Six rules covering the main risks a self-evolving agent introduces:

1. Shell command execution is allowed only when the user has given explicit approval.
   If user approval has not been given, shell command execution is not allowed,
   even if no shell commands are currently running.
2. Outbound data transmission is allowed only when the destination is in the
   approved list. If the destination is not in the approved list, outbound
   transmission is not allowed, even if no transmission is currently occurring.
3. Modification of authentication files, API keys, or environment variables
   is never allowed under any circumstances.
4. No financial transaction may exceed $100.
5. Sending email to an external recipient is allowed only when the user has
   given explicit confirmation.
6. Self-modification of core agent configuration or skill files is never
   allowed under any circumstances.

Rule 2 catches the Feishu exfiltration. Rule 6 prevents the agent from rewriting its own config or skills. Rule 3 stops it from touching credentials.

Compile

Takes 2-7 minutes. Save the policy_id from the done event.

Our compilation extracted 6 rules, 18 variables, and generated 23 adversarial scenarios.

Battle test

Scenarios are combinations of variable assignments the solver considers logically possible under your rules. They surface edge cases before production.

Pull them:

For each scenario, ask: could this actually happen?

Thumbs up if correct:

Thumbs down if impossible, with an annotation:

What we found

Of 23 scenarios, 21 were correct. Two needed annotations:

Shell commands allowed without approval. The solver thought shellCommandExecutionAllowed could be true when userApprovalForShellCommands was false, as long as no commands were currently running. We annotated: "shellCommandExecutionAllowed must be false when userApprovalForShellCommands is false. Rule 1 requires explicit user approval as a precondition, not just at the moment of execution."

Outbound allowed to non-approved destination. Same pattern with outboundTransmissionAllowed. We annotated: "outboundTransmissionAllowed must be false when destinationInApprovedList is false. Rule 2 requires the destination to be in the approved list as a precondition."

Review variables

Check what the compiler extracted:

Look for junk variables not in your original policy, vague descriptions that could cause translation ambiguity, and bare assertion rules tied to auto-generated variables.

Our extraction was clean: all 18 variables mapped directly to the 6 rules with no junk. If you find problems, queue changes with refinePolicyVariables before refining. See the battle testing docs for the full variable refinement flow.

Refine and retest

Apply the thumbs-down annotations:

Takes 2-3 minutes. Then pull the fresh scenarios, review them (we got 13, all correct), approve them, and run the test suite:

Our result: 34 passed, 2 failed. The 2 failures were abstract edge cases about flag states when no action is occurring. Both real-world enforcement tests below passed with unanimous solver consensus.

Results: blocking the Feishu exfiltration

This is the action Capability Evolver was actually performing per the security report:

Result: UNSAT. All three solvers (LLM, Automated Reasoning, Z3) agreed unanimously. The extractor correctly identified outbound data transmission to a destination not on the approved list. The response included a ZK proof receipt for independent verification.

Results: allowing legitimate evolution sharing

Capability Evolver's EvoMap network is a legitimate destination for sharing evolution capsules:

Result: SAT. All three solvers agreed. Same action, same data, different destination. The solver caught the distinction mathematically.

Action
Destination
Result
Solvers

Send evolution logs to Feishu

Not approved

UNSAT (blocked)

Unanimous

Send evolution logs to EvoMap

Approved

SAT (allowed)

Unanimous

Additional rules for evolving agents

The policy above is a starting point. Depending on your setup, consider:

  • Allowed destinations whitelist: "Outbound data may only be sent to evomap.ai, api.github.com, and hooks.slack.com."

  • Evolution constraints: "Agent may not evolve capabilities related to authentication, payment processing, or credential management."

  • Spending limits: "No single transaction may exceed $100. Total daily spend must not exceed $500."

  • File system boundaries: "File deletions are not permitted outside /tmp and /home/user/.openclaw/workspace/memory."

  • Risk-tiered confirmation: "Any action involving more than $50, external email, or outbound data requires explicit user confirmation."

Write rules that match your actual threat model. Battle testing will surface ambiguities before production.

Last updated