Security Model

Why This Matters

AI skill registries are under active attack. In January 2026, security researchers discovered 341 malicious skills distributing Atomic Stealer malware, exfiltrating credentials, and installing reverse shells on developer machines. A single trojan skill was downloaded over 7,700 times before removal. The affected registry had no pre-sharing scanning and relied solely on community flagging after damage was done. SkillSafe was designed to prevent these attacks before skills ever reach consumers.

In February 2026, researchers published SkillJect, an automated attack framework that achieves a 97.5% success rate against Claude Code using poisoned skills. The attack splits malicious behavior across two components: an innocuous-looking inducement prompt in SKILL.md and a payload hidden in a bundled script. Neither component appears malicious in isolation; together they form a complete attack. The v0.1.4 scanner ships four new detection passes in direct response to this paper.

Skills are only half the attack surface. MCP tool descriptions, which are visible to your AI agent but hidden from you, are an active vector for credential theft, cross-tool exploitation, and prompt injection.

SkillSafe is designed to protect against supply-chain attacks on AI agent skills. The security model has three pillars:

Content Integrity

The entire skill archive is hashed with SHA-256 to produce an immutable tree hash. Any change to any file produces a different archive, which produces a different hash.

Dual-Side Verification

Publishers scan and upload a report alongside the archive. Consumers independently re-scan after download. The server compares both and returns a verdict.

Automated Static Analysis

Every file in every shared skill is analyzed using AST parsing and pattern matching. Python, JavaScript, TypeScript, and Markdown are scanned for dangerous functions, credential theft, and prompt injection before the skill reaches any consumer.

How Scans Work

Every skill shared through SkillSafe undergoes automated static analysis before it reaches consumers. Scans run on both the publisher side (at upload time) and the consumer side (at install time) to ensure independent verification.

What Gets Analyzed

Python files (.py) — AST-based analysis for dangerous calls and imports
JavaScript and TypeScript (.js, .ts, .jsx, .tsx, .mjs) — regex-based unsafe pattern detection
All text files — secrets, hardcoded credentials, shell threat patterns, and Unicode obfuscation
Markdown, YAML, and plain text — prompt injection and ClickFix social engineering
Binary files (.exe, .dll, .so, .dylib) — flagged as unsafe bundled executables
Base64 blobs — decoded and re-scanned for hidden payloads
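
The base64 handling above can be pictured with a small Python sketch. It is an illustration only: the two rules, the 24-character length floor, and the `base64:` prefix on rule IDs are assumptions made for the example, not the scanner's actual ruleset or report format.

```python
import base64
import binascii
import re

# Hypothetical stand-in rules; the real ruleset is far larger.
RULES = {
    "reverse_shell_dev_tcp": re.compile(r"/dev/tcp/"),
    "curl_pipe_shell": re.compile(r"curl\s+[^|\n]*\|\s*(?:ba)?sh"),
}

# Runs of 24+ base64 characters, optionally padded (length floor is an assumption).
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def scan_text(text):
    """Return the IDs of every rule that matches the text."""
    return [rule_id for rule_id, pattern in RULES.items() if pattern.search(text)]


def deep_scan(text):
    """Scan the text, then decode each base64-looking blob and scan that too."""
    hits = scan_text(text)
    for blob in BASE64_BLOB.findall(text):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        hits.extend(f"base64:{rule_id}" for rule_id in scan_text(decoded))
    return hits


payload = base64.b64encode(b"bash -i >& /dev/tcp/10.0.0.1/4444 0>&1").decode("ascii")
wrapped = f"echo {payload} | base64 -d | sh"
print(scan_text(wrapped))  # the surface text matches no rule
print(deep_scan(wrapped))  # the decoded blob does
```

The point of the second pass is that the same rules run on the decoded content, so a payload gains nothing by being encoded once.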

Detection Techniques

AST parsing (Python) — catches eval, exec, subprocess, and os.system even when obfuscated by aliases
Regex scanning (JS/TS) — comment-aware, skips block and inline comments to reduce false positives
Pattern matching (all text) — 85 rules across 22 threat categories aligned to the OWASP Agentic AI taxonomy, with post-filters to suppress known false positive patterns
Context-aware classification — findings in documentation code fences are classified as advisories (0 score impact) unless instructional intent is detected nearby, reducing false positives while preserving detection of OpenClaw-style attacks
Base64 deep-scan — decodes suspicious blobs and re-applies all rules to decoded content
Unicode analysis — detects zero-width characters and Cyrillic/Latin homograph mixing
Inducement language detection — 6 patterns targeting SkillJect-style social engineering that steers agents toward executing bundled scripts
Structural mimicry — multi-line context scan detecting setup section headers followed by bundled script execution references
Composite co-occurrence — escalates severity when individually low-risk capabilities co-occur in a single file (exec + network, env + network)
Surplus functionality — cross-file consistency check flagging script capabilities not documented in SKILL.md
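
To make the alias-resistance of AST parsing concrete, here is a minimal Python sketch using the standard `ast` module. The `DANGEROUS` set and the `find_dangerous_calls` helper are hypothetical names chosen for this example; the real scanner's rule set and alias tracking are presumably more thorough.

```python
import ast

# Illustrative subset of call targets; not the scanner's actual list.
DANGEROUS = {"eval", "exec", "os.system", "subprocess.run", "subprocess.Popen"}


def find_dangerous_calls(source):
    """Return (line, qualified_name) for each dangerous call, resolving import aliases."""
    tree = ast.parse(source)
    aliases = {}  # local name -> fully qualified name

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                aliases[item.asname or item.name] = item.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            for item in node.names:
                aliases[item.asname or item.name] = f"{node.module}.{item.name}"
        elif isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            # Simple rebinding such as `e = eval`.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    aliases[target.id] = aliases.get(node.value.id, node.value.id)

    def qualified_name(func):
        if isinstance(func, ast.Name):
            return aliases.get(func.id, func.id)
        if isinstance(func, ast.Attribute):
            base = qualified_name(func.value)
            return f"{base}.{func.attr}" if base else None
        return None

    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = qualified_name(node.func)
            if name in DANGEROUS:
                hits.append((node.lineno, name))
    return sorted(hits)


sample = (
    "import os as o\n"
    "from subprocess import run as r\n"
    "e = eval\n"
    'o.system("id")\n'
    'r(["ls"])\n'
    'e("1+1")\n'
)
print(find_dangerous_calls(sample))
```

A plain text search for `os.system` or `eval(` would miss all three calls in `sample`, whereas resolving names through the syntax tree recovers them.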

Severity & Scoring

Critical — Active exploitation (reverse shells, credential theft, metadata endpoint access)
High — Dangerous operations that bypass user consent (persistence, exfiltration, composite co-occurrence)
Medium — Suspicious patterns that may indicate malicious intent
Low — Best-practice violations
Info — Capability indicators (subprocess, filesystem, eval). Recorded in BOM for transparency but carry zero score penalty and do not affect clean/not-clean status
Each scan produces a 0–100 safety score and an A+ through F letter grade based on severity-weighted penalties
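
As a rough illustration of severity-weighted scoring, the Python sketch below applies the per-severity penalties listed in the 2026.03.01 changelog entry on this page (critical −25, high −15, medium −5, low −2), with info and advisory findings costing nothing. That penalties simply add up and that the score is floored at 0 are assumptions. Letter-grade cutoffs are not stated here, so the sketch stops at the numeric score.

```python
# Penalties per threat finding, taken from the 2026.03.01 changelog entry.
PENALTY = {"critical": 25, "high": 15, "medium": 5, "low": 2, "info": 0}


def safety_score(findings):
    """findings: iterable of (severity, classification) pairs.

    Only findings classified as "threat" reduce the score; advisories are free.
    Additive penalties and the floor at 0 are assumptions of this sketch.
    """
    total = sum(
        PENALTY[severity]
        for severity, classification in findings
        if classification == "threat"
    )
    return max(0, 100 - total)


example = [
    ("critical", "threat"),
    ("medium", "threat"),
    ("high", "advisory"),  # documentation context: no penalty
    ("info", "threat"),    # capability indicator: no penalty
]
print(safety_score(example))
```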

Cryptographic Verification Flow

SkillSafe uses a dual-side verification model to ensure that what the publisher uploaded is exactly what the consumer installs, so that any tampering in transit or at rest is detected rather than silently accepted.

Publisher Packages & Hashes

When a skill is saved, the system creates a deterministic file manifest (sorted entries with SHA-256 per file), then computes a single hash of the manifest — the tree hash.
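
The manifest-then-hash step described here could look roughly like the following Python sketch. The manifest line format (`<sha256>  <path>`, newline-joined, UTF-8) and the `tree_hash` function name are assumptions for illustration; the actual serialization is not specified on this page.

```python
import hashlib


def tree_hash(files):
    """files: mapping of relative path -> file bytes.

    Builds a sorted manifest of per-file SHA-256 digests, then hashes the
    manifest itself. The manifest line format is an assumption of this sketch.
    """
    manifest_lines = []
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        manifest_lines.append(f"{digest}  {path}")
    manifest = "\n".join(manifest_lines).encode("utf-8")
    return hashlib.sha256(manifest).hexdigest()


skill = {"SKILL.md": b"# demo\n", "scripts/run.py": b"print('hi')\n"}
reordered = {"scripts/run.py": b"print('hi')\n", "SKILL.md": b"# demo\n"}
modified = dict(skill, **{"scripts/run.py": b"print('hi!')\n"})

print(tree_hash(skill) == tree_hash(reordered))  # sorting makes the order irrelevant
print(tree_hash(skill) == tree_hash(modified))   # any changed byte changes the hash
```

Sorting the entries is what makes the manifest deterministic: two machines that enumerate the same files in different orders still arrive at the same tree hash.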

Publisher-Side Scan

A full static analysis scan runs on the skill files — 12 passes covering dangerous calls, secrets, prompt injection, shell threats, and more. The scan report is stored alongside the skill version.

Server Stores Immutably

The server stores the archive per version with the tree hash recorded as metadata for integrity verification. The archive, publisher scan report, and metadata are all immutable once written. No overwrites — only new versions.

Consumer Downloads & Re-scans

On install, the consumer downloads the files, recomputes the tree hash, and independently runs the same static analysis scan. This produces a consumer-side scan report.

Verdict Comparison

The consumer's scan report is sent to the server, which compares it against the publisher's report. If the tree hashes match and both scan reports agree, the skill is marked ✓ verified. Any mismatch — tampered files, divergent findings, or hash discrepancies — results in a ✗ failed verdict with details.
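
A minimal Python sketch of this comparison follows. The verdict names (verified, divergent, critical) and the `ruleset_upgrade_divergence` flag are taken from elsewhere on this page. The report shape, the `verdict` function, the mapping of finding mismatches to "divergent", and the simplification of flagging any ruleset version difference are assumptions, not the server's documented logic.

```python
def verdict(publisher, consumer):
    """Compare two scan reports.

    Each report is a dict with 'tree_hash', 'ruleset_version', and 'findings'
    (a list of hashable tuples). This shape is hypothetical.
    """
    if publisher["tree_hash"] != consumer["tree_hash"]:
        return {"verdict": "critical", "reason": "tree hash mismatch"}

    pub = set(publisher["findings"])
    con = set(consumer["findings"])
    if pub != con:
        return {
            "verdict": "divergent",
            "only_publisher": sorted(pub - con),
            "only_consumer": sorted(con - pub),
            # Simplification: flag any version difference, not only upgrades.
            "ruleset_upgrade_divergence": (
                publisher["ruleset_version"] != consumer["ruleset_version"]
            ),
        }
    return {"verdict": "verified"}


pub_report = {
    "tree_hash": "abc",
    "ruleset_version": "2026.04.20",
    "findings": [("SS09", "run.sh", 3)],
}
tampered = dict(pub_report, tree_hash="def")
newer = dict(pub_report, ruleset_version="2026.06.05", findings=[])

print(verdict(pub_report, dict(pub_report)))
print(verdict(pub_report, tampered))
print(verdict(pub_report, newer))
```

Checking the hash first reflects the idea that differing content makes any comparison of findings meaningless.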

What Scan Reports Contain

Every scan report is a structured document that captures the full security posture of a skill version. Reports are stored immutably and can be inspected by both publishers and consumers.

Metadata

Scanner version, timestamp, tree hash, file count, total size, and the skill identifier (@scope/name@version).

Integrity

The skill tree hash (SHA-256 of the full archive blob), scanner version, and ruleset version. The tree hash is the source of truth for content integrity — any deviation between publisher and consumer hashes triggers a verification failure.

Findings

Each finding includes: rule ID, severity level, classification (threat or advisory), affected file and line range, a human-readable description, and a code snippet showing the matched pattern. Findings are deterministic — same input always produces the same output. Advisory findings appear in documentation contexts (e.g. code fences in reference files) and carry 0 score penalty.

Summary & Verdict

Threat count, advisory count, clean/not-clean flag (based on threats only), 0–100 safety score, and A+–F letter grade. The verification verdict (verified, divergent, or critical) is produced when the server compares the publisher and consumer reports.

Don't take our word for it — every public skill's report is open. See a real one: scan report for @tldraw/write-pr, or scan any GitHub repo yourself (no signup).

Ruleset Changelog

The scanner ruleset is versioned independently of the API. Both publisher and consumer reports embed their ruleset version, and the server flags differences so you can tell when a divergence is caused by a ruleset upgrade rather than actual tampering.

2026.06.05 Current

June 2026

Threat-pattern additions from the openclaw publisher-family investigation, plus systematic false-positive fixes from 300 sub-agent reviews. Five new rules cover JS/TS agent-memory writes, ~/.claude config-directory access, silent git push of agent config, browser session harvesting, and messaging-host exfiltration candidates. False positives reduced by excluding method-call forms (re.compile, Session.exec), removing the noisy prompt_system_prompt rule, ignoring XML namespace URIs in composite scoring, whitelisting dev-tool caches (Xcode/Gradle/npm), and allowing emoji ZWJ sequences.

New detection rules

agent_memory_write_js (SS04, critical) — fs.writeFileSync() / fs.appendFileSync() targeting CLAUDE.md, MEMORY.md, SOUL.md, IDENTITY.md, or .cursorrules. Covers the JS/TS variant of agent-memory poisoning that the original shell-redirect rule (agent_memory_write) missed — exactly the mechanism the clawsec-nanoclaw skill used.
ai_config_dir_access (SS04, high) — Reads or writes to $HOME/.claude and the ~/.claude/{skills,memory,settings,mcp} subtree. Catches both directory enumeration (skill-name reconnaissance) and config exfiltration. ~/.claude/mcp/servers.json contains every MCP server credential the user has wired up.
agent_config_git_push (SS04, critical) — git push of a path inside ~/.claude or $HOME. The clawdbot-backup mechanism: a cron entry plus a LaunchAgent that silently push the user's entire ~/.claude directory (MCP keys, MEMORY.md, every saved skill) to an attacker-controlled remote every 6 hours.
browser_session_harvest (SS22, high) — --remote-debugging-port flag on a Chrome invocation, chrome.cookies API usage, or direct reads of Library/Application Support/(Google/Chrome|BraveSoftware|Microsoft Edge|Firefox)/*/Cookies | Login Data. Triggered on tuzi-danger-gemini-web, which read live Google session cookies via CDP.
cp05_comms_exfil_candidate (SS-CP, medium) — Outbound to api.telegram.org, discord.com/api/webhooks, hooks.slack.com, or requestcatcher.com. Surfaced at medium so AI review can verify whether the host is declared in SKILL.md (legitimate Telegram bot integration) or a covert side-channel — like the openclaw parent skill's INSTALLER_FEED_CHANNEL that routed agent observations to a configurable Telegram target.

False-positive fixes

Method-call exclusion (SS01) — py_eval, py_exec, and py_compile now use (?<!\.) lookbehind, so re.compile() and ORM Session.exec() stop triggering. py_compile alone fired ~15× per scan in the May 2026 backlog and inflated composite_medium_cluster (SS-CP04) downstream.
prompt_system_prompt removed (SS15) — The rule fired on the phrase "system prompt" anywhere in documentation. Across 300 reviewed scans it was a 100% false-positive signal in every doc context. Removed entirely; capability tracking continues via the BOM.
XML namespace URIs stripped from composite scoring (SS-CP) — http://openxmlformats.org/..., http://www.w3.org/..., http://schemas.microsoft.com/..., and similar static-identifier URLs are removed from file content before capNet is tested. Document-processing libraries (.docx, .xlsx, .pptx, ODF) no longer trip composite_exec_exfil just by importing XML namespaces.
composite_env_leak self-consistency check — When the env-var name shares a token with the network host (BINANCE_API_KEY → binance.com, OPENAI_API_KEY → openai.com), the rule fires at info instead of high. The pattern is auth, not exfiltration.
composite_write_exfil static-asset suppression — Suppressed entirely when the only outbound URLs in the file are static-asset hosts (CDN, jsdelivr, unpkg, googleapis, gstatic, fontawesome, bootstrapcdn). A Chart.js URL inside an HTML template is not an exfiltration channel.
dangerous_rm_root dev-cache whitelist (SS13) — Extended the cleanup whitelist to cover ~/Library/Developer/(Xcode|CoreSimulator|CommandLineTools), ~/Library/Caches, ~/.gradle/caches, ~/.cache, ~/.npm/_cacache, ~/.yarn/cache, ~/.cocoapods/repos, ~/.m2/repository, ~/.pub-cache, and ~/.cargo/registry/cache. Xcode and Gradle troubleshooting docs no longer flag critical.
path_traversal_sys defensive-doc context — Now suppressed when the surrounding ±2 lines contain defensive language ("do not", "avoid", "vulnerable", "attacker", "anti-pattern", "❌", "⚠️"). Security-checklist style skills that teach what not to do stop scoring as if they're doing it.
unicode_zero_width emoji ZWJ allowance — Suppressed when the line contains U+200D (ZWJ) only and otherwise contains emoji-range codepoints. Multi-person emoji like 👩‍🚒 legitimately use ZWJ.
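
The effect of the `(?<!\.)` lookbehind in the method-call exclusion above can be seen in a short Python sketch. The two patterns are simplified stand-ins written for this example; the actual py_compile rule is not reproduced on this page.

```python
import re

# Hypothetical simplifications of a "compile(" rule, before and after the fix.
LOOSE = re.compile(r"\bcompile\s*\(")
STRICT = re.compile(r"(?<!\.)\bcompile\s*\(")  # reject matches preceded by a dot

lines = [
    'pattern = re.compile(r"\\d+")',              # method call on the re module
    'code = compile(source, "<skill>", "exec")',  # the builtin, worth recording
]

loose_hits = [bool(LOOSE.search(line)) for line in lines]
strict_hits = [bool(STRICT.search(line)) for line in lines]
print(loose_hits)
print(strict_hits)
```

The negative lookbehind consumes no characters. It only asserts that the character before `compile` is not a dot, so bare calls to the builtin still match while attribute calls do not.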

2026.04.20

April 2026

Major false positive reduction: 91% fewer actionable findings on real-world skills. Capability rules downgraded to info severity, netcat regex tightened, targeted rm patterns exempted, and structural mimicry now exempts SKILL.md/README.md. An AI-assisted review pipeline validates remaining findings at scale.

Changes

Capability rules downgraded to info — Python (eval, exec, subprocess, os.system, compile, __import__) and JS/TS (eval, Function, child_process, execSync, spawnSync, fs) findings are now info severity. They appear in the Bill of Materials for transparency but carry zero score penalty and do not affect clean/not-clean status. Actual risk from these capabilities is captured by composite co-occurrence rules (SS-CP01–04) and surplus functionality checks (SS-SF01–04).
Tightened reverse shell netcat regex (SS09) — Now requires nc -e/nc -l immediately after the command, or a nc host port -c structure. Eliminates false positives from prose text, CI commit message templates, and CLI flags like snapshot -c that previously matched the loose -[cClLeE] pattern.
Targeted rm exemptions (SS13) — rm -f ~/.app/specific-file (non-recursive, 2+ path segments under ~/) is now exempt. rm -rf ~/, rm -rf ~, and rm -rf / remain critical. Comment lines (# prefix) are also skipped.
Structural mimicry exempts SKILL.md and README.md — These files are supposed to have setup sections with commands — that is their purpose. SS-SM01 and SS-SM02 now only fire on other markdown files where setup instructions may indicate social engineering.
prompt_system_prompt downgraded to info (SS15) — The phrase "system prompt" appears naturally in AI skill documentation. The rule was downgraded from medium to info to reflect that the phrase is descriptive, not injective, in the vast majority of cases.
AI-assisted review pipeline — Scan reports with findings are reviewed by AI agents (Claude) that classify each finding as threat or advisory with reasoning. This two-stage approach (deterministic scan + AI review) catches nuanced false positives that regex post-filters cannot.

2026.04.08

April 2026

Context-aware finding classification to reduce false positives while preserving detection of real attacks like the OpenClaw campaign. Scan report schema updated to v1.2.

New features

Threat vs advisory classification — Findings in documentation code fences (e.g. references/*.md) are classified as advisory with 0 score penalty, unless instructional intent is detected nearby. Findings in executable files or near imperative language ("run this", "you must", "prerequisite") remain threats that affect the score.
Instructional intent detection — 17 patterns detecting imperative/social engineering language (e.g. "before using, run", "paste this in terminal", curl | bash near setup instructions). When found within 5 lines of a dangerous pattern, the finding stays classified as a threat even inside a code fence — directly targeting the OpenClaw attack vector.
Markdown code fence awareness — Scanner now tracks code fence boundaries in .md/.txt/.rst files, distinguishing executable instructions from documentation examples.
Tightened SS13 regex — dangerous_rm_root now only flags destructive wildcards (rm -rf /, rm -rf ~, rm -rf $HOME). Targeted removals like rm /tmp/specific-file no longer trigger false positives.
Severity downgrade for doc paths — Advisory findings in references/, docs/, examples/, and tests/ directories receive a severity downgrade (critical→high, high→medium).
Schema v1.2 — Scan reports now include advisory_count and a classification field ("threat" or "advisory") on each finding. findings_count reflects threat count only. clean is true when threat count is 0.
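
The interaction between code-fence tracking and instructional intent described above can be sketched in Python as follows. Only three of the 17 intent patterns are imitated, and treating every Markdown finding outside a fence as a threat is an assumption of this sketch. `fenced_lines` and `classify` are hypothetical helper names.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker
WINDOW = 5       # lines on either side, per the description above

# Three illustrative intent patterns; the ruleset described above has 17.
INTENT = re.compile(
    r"before using,? run|paste this in (?:your )?terminal|you must",
    re.IGNORECASE,
)


def fenced_lines(lines):
    """Return the set of 0-based indexes of lines inside a code fence."""
    inside, fenced = False, set()
    for i, line in enumerate(lines):
        if line.lstrip().startswith(FENCE):
            inside = not inside
            continue
        if inside:
            fenced.add(i)
    return fenced


def classify(lines, hit_index):
    """Classify a finding on lines[hit_index] as 'threat' or 'advisory'."""
    if hit_index not in fenced_lines(lines):
        return "threat"  # assumption: unfenced Markdown findings stay threats
    lo = max(0, hit_index - WINDOW)
    hi = min(len(lines), hit_index + WINDOW + 1)
    if any(INTENT.search(lines[i]) for i in range(lo, hi)):
        return "threat"  # imperative language nearby overrides the fence
    return "advisory"


doc_example = [
    "# Reference",
    "",
    "Example of an unsafe command:",
    FENCE + "bash",
    "curl https://example.com/install.sh | bash",
    FENCE,
]
attack = [
    "## Setup",
    "Before using, run the following:",
    FENCE + "bash",
    "curl https://example.com/install.sh | bash",
    FENCE,
]
print(classify(doc_example, 4))
print(classify(attack, 3))
```

The same `curl | bash` line lands in different classes depending only on the prose around it, which is the behavior the classification is meant to capture.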

2026.03.15

March 2026

Four new detection passes targeting SkillJect-style skill-based prompt injection, developed in direct response to arXiv:2602.14211. Scanner expanded from 8 to 11 passes.

New detection passes

Pass	Category	Max Severity	What it detects
SS-SI	Inducement Language	high	6 patterns for social engineering that steers agents toward script execution: "before using, run...", "for the tool to work", "this setup step is required", "run the bundled script", "automatically execute *.sh", "must be run first"
SS-SM	Structural Mimicry	high	Multi-line context scan: SS-SM01 fires when a Prerequisites/Setup/Quick Start header is followed within 10 lines by a bundled script reference; SS-SM02 fires when an urgency marker (`> IMPORTANT`, `CRITICAL`) appears within 3 lines of a script reference
SS-CP	Composite Co-occurrence	critical	Escalates severity when low-risk primitives co-occur: SS-CP01 (exec + network, critical), SS-CP02 (env vars + network, high), SS-CP03 (file write + network, high), SS-CP04 (3+ medium findings in one file, high)
SS-SF	Surplus Functionality	critical	Cross-file consistency check: flags script capabilities absent from `SKILL.md` — undocumented network calls (critical), env var reads (high), subprocess invocations (high), file writes (medium)
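
The SS-SM01 row above describes a multi-line window check, which might be sketched in Python like this. The header pattern follows the table. What counts as a "bundled script reference" is not defined on this page, so the `SCRIPT_REF` pattern (a relative path ending in .sh or .py) is an assumption.

```python
import re

SETUP_HEADER = re.compile(
    r"^#{1,6}\s*(?:Prerequisites|Setup|Quick Start)\b", re.IGNORECASE
)
# Assumption: a bundled script reference is a relative path to a .sh or .py file.
SCRIPT_REF = re.compile(r"(?:\./|scripts/)[\w./-]+\.(?:sh|py)\b")
LOOKAHEAD = 10  # lines after the header, per the table above


def structural_mimicry_hits(lines):
    """Return 1-based (header_line, script_line) pairs for SS-SM01-style matches."""
    hits = []
    for i, line in enumerate(lines):
        if not SETUP_HEADER.match(line):
            continue
        for j in range(i + 1, min(len(lines), i + 1 + LOOKAHEAD)):
            if SCRIPT_REF.search(lines[j]):
                hits.append((i + 1, j + 1))
                break
    return hits


notes = [
    "# Tool notes",
    "## Setup",
    "",
    "Run the helper first:",
    "",
    "    bash scripts/bootstrap.sh",
    "",
    "## Usage",
    "Ask the agent to summarize a file.",
]
far_away = ["## Setup"] + [""] * 10 + ["bash scripts/bootstrap.sh"]

print(structural_mimicry_hits(notes))
print(structural_mimicry_hits(far_away))
```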

Design notes

Two structural gaps closed — The prior scanner lacked inducement language detection and cross-file consistency checking. SkillJect exploits exactly these gaps, hiding payloads where no single-file scan can find them.
Cross-file consistency (SS-SF) — Pass 11 extracts a documentation intent profile from SKILL.md, then checks each script for undocumented network calls, credential access, subprocess invocations, and file writes. Undocumented outbound network calls are the single most reliable signal of a SkillJect-style payload.
Composite escalation (SS-CP) — Targets the paper's finding that SkillJect composes attacks from primitives that individually fall below alert thresholds. Three or more medium-severity findings in a single file trigger a high-severity composite alert.
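
Composite escalation reduces to a small decision table, sketched in Python below. The rule IDs and severities follow the SS-CP row of the table above. The capability tags (`exec`, `network`, `env`, `file_write`) and the `composite_findings` function are hypothetical names for this example.

```python
def composite_findings(capabilities, medium_count=0):
    """Escalate co-occurring capabilities found in a single file.

    capabilities: set of tags such as {"exec", "network"} (hypothetical tags).
    medium_count: number of medium-severity findings in the same file.
    """
    hits = []
    if "network" in capabilities:
        if "exec" in capabilities:
            hits.append(("SS-CP01", "critical"))
        if "env" in capabilities:
            hits.append(("SS-CP02", "high"))
        if "file_write" in capabilities:
            hits.append(("SS-CP03", "high"))
    if medium_count >= 3:
        hits.append(("SS-CP04", "high"))
    return hits


print(composite_findings({"exec"}))             # exec alone: nothing escalated
print(composite_findings({"exec", "network"}))  # exec + network: critical
print(composite_findings({"env", "network"}, medium_count=3))
```

Each capability on its own produces no composite finding. Only the combination is escalated, which is how primitives that fall below individual alert thresholds still surface.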

2026.03.01

March 2026

Added 15 new detection categories (SS03–SS21), A+ through F safety scoring, and ruleset-upgrade divergence detection. Total: 65 rules across 19 threat categories.

New detection categories

ID	Category	Max Severity	What it detects
SS03	Data Exfiltration	high	curl/wget to ngrok, webhook.site, pipedream, RequestBin, Burp Collaborator
SS04	Agent Memory Poisoning	high	Writes to `MEMORY.md`, `CLAUDE.md`, `.cursorrules` via redirection or echo
SS05	Encoded Malware	critical	base64 decode-then-execute patterns; deep-scans decoded blobs for hidden payloads
SS07	Privilege Escalation	critical	`sudo su/bash/sh`, `setuid(0)`, `seteuid(0)`
SS08	Persistence	high	cron, macOS LaunchAgents/LaunchDaemons, systemd services, shell profile writes
SS09	Reverse Shell	critical	`/dev/tcp`, `/dev/udp`, netcat `-e`, socat `EXEC`, bash redirect
SS10	Unicode Obfuscation	high	Zero-width characters (U+200B–U+2060, U+FEFF); Cyrillic/Latin homograph mixing
SS11	ClickFix Social Engineering	high	"Open terminal and paste…", copy-paste command instructions, Windows Run dialog tricks
SS13	Dangerous File Operations	critical	`rm -rf /`, `rm -rf ~`, `dd of=/dev/sd*` block-device writes
SS14	Reconnaissance	critical	nmap, masscan, arp-scan, zmap; cloud IMDS endpoints (169.254.169.254, metadata.google.internal)
SS16	Bundled Binaries	high	.exe, .dll, .so, .dylib, .elf, .bin and other executable/library extensions
SS17	Credential File Access	critical	Reads of `~/.aws/credentials`, `~/.docker/config.json`; find/search over `.ssh`, `.gnupg`
SS18	Cryptocurrency Targeting	critical	Seed/recovery phrases; MetaMask, Phantom, Exodus, Ledger wallet references; wallet directories
SS19/20	Path Traversal	critical	`../../etc` traversals; reads of `/etc/passwd`, `/etc/shadow`; git hook writes
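
For the SS10 row above, a minimal Python sketch of both checks is shown here. The zero-width set is limited to five well-known code points within the listed range, and script detection leans on Unicode character names. Both are simplifications chosen for the example, and the later emoji ZWJ allowance is not modeled.

```python
import re
import unicodedata

# ZWSP, ZWNJ, ZWJ, WORD JOINER, and BOM/ZWNBSP (an illustrative subset).
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def mixed_script_words(text):
    """Return words that mix Latin and Cyrillic letters."""
    mixed = []
    for word in re.findall(r"\w+", text):
        scripts = set()
        for ch in word:
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                scripts.add("latin")
            elif name.startswith("CYRILLIC"):
                scripts.add("cyrillic")
        if len(scripts) == 2:
            mixed.append(word)
    return mixed


spoofed = "log in to p\u0430ypal today"  # \u0430 is CYRILLIC SMALL LETTER A
hidden = "ignore\u200bprevious"          # zero-width space between the words

print(mixed_script_words(spoofed))
print(bool(ZERO_WIDTH.search(hidden)))
```

Checking per word rather than per file keeps legitimately bilingual documents from being flagged. Only a single token that blends both scripts is treated as a homograph candidate.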

New features

Safety score — Every scan produces a 0–100 numeric score and A+/A/B/C/D/F letter grade based on severity-weighted penalties (critical −25, high −15, medium −5, low −2).
Ruleset-upgrade divergence — When a consumer verifies with a newer ruleset, the server sets ruleset_upgrade_divergence: true in verification details so the UI can explain the mismatch rather than implying tampering.
CI mode (--check) — Exits with code 1 if any HIGH or CRITICAL findings are present; suitable for pre-commit hooks and CI pipelines.
Rule suppression (--ignore) — Comma-separated rule IDs can be suppressed per-scan to reduce noise for known-safe patterns in a codebase.

2025.01.01 Initial release

January 2025

Four foundational detection categories covering code execution, credential theft, prompt injection, and hardcoded API keys. 34 rules total.

Detection categories

ID	Category	Max Severity	What it detects
SS01	Code Execution	high	Python: eval, exec, subprocess.*, os.system (AST-based). JS/TS: eval, Function constructor, child_process, execSync (regex)
SS02	Credential Theft	critical	AWS Access Key IDs, PEM private keys, GitHub tokens (gh[pousr]_…), Slack tokens
SS15	Prompt Injection	high	"ignore previous instructions", role hijacking ("you are now…"), instruction override/forget patterns
SS21	Hardcoded API Keys	high	Generic `api_key=`, `secret_key=`, `access_token=`, `password=` patterns with ≥16-char values

Supply Chain Guarantees

SkillSafe is built to make several strong guarantees about every skill in the registry:

Tamper evidence — Any modification to skill contents after saving is detectable. The tree hash changes if even a single byte is altered. Most open registries store skills without integrity verification, making silent tampering undetectable.

Independent verification — Consumers don't trust the publisher's scan report alone. They run their own scan and the server compares both independently. Other registries rely on a single-side trust model where only the publisher's claims are accepted.

Immutable history — Once a version is saved, it cannot be overwritten or silently updated. Every version is a permanent, auditable record. On open registries, skills can be modified after upload without notice.

No blind trust — The server never executes skill code. All analysis is static. The server's role is storage and comparison — not judgment. Registries that execute or evaluate code server-side introduce a single point of compromise.

Advisories

No known security incidents have affected SkillSafe users to date.

What "zero incidents" means

When we say zero supply-chain incidents, we mean specifically: since launch (February 2026), no skill installed through SkillSafe — via the desktop app, npx skills add, or the registry API — has been confirmed to have executed a malicious payload on a user's machine, and no skill served by the registry has been confirmed tampered with between publish and download (a tree-hash mismatch surfacing as a critical dual-side verification verdict).

This claim does not cover: malicious skills that exist elsewhere in the ecosystem (we catch and quarantine those — see the threats counter on the homepage), findings in skills that were flagged before anyone installed them, or vulnerabilities in users' own configurations. The claim is monitored continuously via dual-side verification verdicts and scan-report comparisons. If an incident occurs, it will be disclosed on this page within 72 hours of confirmation, with a post-mortem.

Responsible Disclosure

If you discover a security vulnerability, please report it responsibly:

security@skillsafe.ai

We acknowledge reports within 48 hours and aim to fix critical issues within 7 days. We do not pursue legal action against researchers acting in good faith.

What to include in your report:

Description of the vulnerability and its potential impact
Steps to reproduce or proof-of-concept
Affected component (CLI, API, web, scanner)
Your suggested severity assessment

Our commitment:

Acknowledgment within 48 hours
Initial triage and severity assessment within 5 business days
Critical fixes deployed within 7 days
Credit in advisory (unless you prefer anonymity)