Best AI Testing and QA Skills for Developers [2026]
We installed and scored 18 testing skills. These 5 earned their spot — from a 61-file Playwright encyclopedia to strict TDD enforcement.
We installed 18 testing and QA skills, read every file in every archive, and scored them on five dimensions: file depth, behavioral specificity, real-world coverage, anti-pattern guidance, and install volume. Most were thin: a single Markdown file listing testing principles, little different from a generic prompt you’d write yourself. A few stood out. One ships 61 reference files across 8 directories. Another opens with “NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST” and then spends 300 lines rebutting every excuse developers use to skip that rule. A third bundles working Python scripts alongside the instruction files.
These five are worth your time.
Quick Comparison
| Skill | Score | Key Feature | Frameworks |
|---|---|---|---|
| @currents-dev/playwright-best-practices-skill | 46/50 | 61-file Playwright reference library | Playwright, React, Angular, Vue, Next.js |
| @obra/test-driven-development | 44/50 | Enforced TDD with anti-rationalization guide | Framework-agnostic |
| @daymade/qa-expert | 43/50 | Full QA lifecycle with bundled Python scripts | Framework-agnostic |
| @langgenius/frontend-testing | 42/50 | Vitest + RTL with 100% function coverage goal | Vitest, React Testing Library |
| @obra/systematic-debugging | 40/50 | 4-phase root cause process + test pollution script | Framework-agnostic |
Browse all testing skills: /tags/testing
How We Scored
Five dimensions, 10 points each:
- File depth — How many reference files ship in the archive? Does the skill go beyond a single Markdown file?
- Behavioral specificity — Does it encode a concrete methodology, or hand-wave with generic principles?
- Real-world coverage — Does it address the actual hard problems: flakiness, mock boundaries, CI integration, coverage goals?
- Anti-pattern guidance — Does it tell the AI what not to do? Skills that enumerate failure modes prevent more bugs than skills that only describe the happy path.
- Install volume — How much real-world adoption has the skill demonstrated?
Skills were evaluated on the archive contents we actually installed. No credit for claims not backed by files.
1. @currents-dev/playwright-best-practices-skill — 46/50
Score: 46/50
Best for: Teams doing serious Playwright work who want the AI to have encyclopedic knowledge of Playwright patterns — not just the basics, but accessibility testing, GraphQL, OAuth flows, browser extensions, canvas testing, and Electron.
This is the most comprehensive testing skill we found. The archive contains 61 Markdown reference files organized across 8 directories:
- `core/` — locators, assertions, fixtures, POM, config
- `testing-patterns/` — API testing, component testing, accessibility, forms
- `advanced/` — drag-drop, i18n, security testing, visual regression
- `debugging/` — flaky test analysis, trace viewer, test isolation
- `architecture/` — test organization, data management, parallelism
- `frameworks/` — React, Angular, Vue, Next.js adapters
- `infrastructure-ci-cd/` — GitHub Actions, GitLab CI, Docker, sharding, coverage
- `browser-apis/` — canvas, WebGL, Electron, browser extensions, OAuth
The main SKILL.md at 303 lines is not itself a guide — it’s a decision tree. It reads the task description, identifies which layer of testing is involved (component, API, auth flow, CI configuration), and routes the model to the appropriate reference file. That’s the right architecture for a skill of this scope: instead of forcing the AI to hold all 61 files in context simultaneously, the skill uses structured selection to load only what’s relevant.
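The routing idea is simple enough to sketch. Here is a minimal, hypothetical version (these file names and keyword mappings are illustrative, not the skill's actual index):

```typescript
// Keyword-based routing from a task description to reference files.
// File names and keywords are hypothetical, for illustration only.
const routes: Record<string, string[]> = {
  "core/locators.md": ["locator", "selector"],
  "testing-patterns/api-testing.md": ["api", "graphql", "request"],
  "infrastructure-ci-cd/sharding.md": ["shard", "parallel", "ci"],
};

function routeTask(description: string): string[] {
  const text = description.toLowerCase();
  return Object.entries(routes)
    .filter(([, keywords]) => keywords.some((k) => text.includes(k)))
    .map(([file]) => file);
}
```

Only the matched files get loaded into context; the rest stay on disk until a task needs them.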
The coverage is genuinely impressive. Most Playwright skills stop at locators, assertions, and basic POM. This one ships dedicated files for authentication flows including OAuth, multi-user testing, network interception, visual regression with screenshot comparison, canvas and WebGL testing, mobile viewport emulation, and browser extension testing. For teams whose applications have any of these requirements, the difference between having these reference files and not having them is the difference between the AI giving you a working implementation and a plausible-looking one that breaks under real conditions.
The infrastructure-ci-cd/ directory alone is worth examining. It covers GitHub Actions workflow configuration for Playwright, GitLab CI integration, Docker containerization for consistent test environments, test sharding for parallel execution, and coverage report configuration. These are the operational problems that slow down E2E test adoption in real projects — having the AI consult purpose-built reference files for each is substantially better than asking it to reason from general knowledge.
8,828 installs. Source: currents.dev.
skillsafe install @currents-dev/playwright-best-practices-skill
2. @obra/test-driven-development — 44/50
Score: 44/50
Best for: Developers who want the AI to treat TDD as a non-negotiable workflow constraint, not a suggestion — including when the AI would prefer to skip it.
The archive is two files: SKILL.md at 371 lines and testing-anti-patterns.md at 300 lines. That’s a lean footprint for a 44/50 score, but these files are dense.
SKILL.md opens with this:
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
Not “try to write tests before implementation.” Not “consider TDD where appropriate.” The instruction is categorical, and the skill spends considerable space on why the categorical version is necessary. AI assistants — like developers — will rationalize around TDD when the constraint isn’t enforced. The skill anticipates this by explicitly listing the rationalizations and rebutting each:
- “The implementation is obvious, I’ll add tests after” — rebutted: even obvious implementations have edge cases that tests reveal
- “This is just a refactor” — rebutted: refactors that break behavior are exactly what tests catch
- “There’s no good way to test this” — rebutted: that’s a design problem, not a testing problem
The testing-anti-patterns.md file at 300 lines is the other major piece. It covers:
- Testing mock behavior: writing tests that verify the mock was called rather than verifying what the code actually does
- Adding test-only methods: exposing private state to make tests pass, rather than designing testable interfaces
- Mocking without understanding: pasting mock setup without knowing what the mock is supposed to simulate
- Incomplete mocks: mocks that don’t faithfully represent the behavior of the thing they replace
- Tests as afterthought: adding tests after the implementation is “done,” which produces tests that confirm the implementation rather than the specification
Each anti-pattern includes a “gate function” — a decision procedure the model can apply to determine whether it’s falling into the trap. This is the right structure. Generic anti-pattern lists get ignored; decision procedures get applied.
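The first anti-pattern is easiest to see in code. A minimal sketch with a hand-rolled fake (the checkout example is ours, not taken from the skill's files):

```typescript
// Function under test: applies a discount, then persists via an injected saver.
type Saver = (total: number) => void;

function checkout(prices: number[], save: Saver): number {
  // Rounded to whole currency units to keep the arithmetic exact.
  const total = Math.round(prices.reduce((sum, p) => sum + p, 0) * 0.9);
  save(total);
  return total;
}

// Anti-pattern: the assertion only proves the mock was called,
// not that the computed total is correct.
let called = false;
checkout([10, 20], () => { called = true; });
// `called === true` passes even if the discount math is wrong.

// Better: assert on the observable behavior the caller cares about.
let saved = 0;
const total = checkout([10, 20], (t) => { saved = t; });
// Both `total` and `saved` should be 27 (30 minus the 10% discount).
```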
65 stars. 8,087 installs. Source: github.com/obra/superpowers.
skillsafe install @obra/test-driven-development
3. @daymade/qa-expert — 43/50
Score: 43/50
Best for: QA engineers and developers who need the AI to manage the full QA lifecycle — not just write test cases, but initialize projects, track execution, file structured bug reports, calculate quality metrics, and run OWASP security tests.
The archive ships 11 files: SKILL.md at 289 lines, 2 Python scripts, 1 template file, and 5 reference documents. The Python scripts are not documentation — they’re executable code that the skill instructs the model to run:
- `init_qa_project.py` — scaffolds a QA project with the correct directory structure, configuration files, and tracking spreadsheets
- `calculate_metrics.py` — computes quality metrics from test execution data: pass rate, defect density, coverage percentage, and risk-weighted quality score
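For context, here is a sketch of the kind of arithmetic a script like `calculate_metrics.py` performs. The formulas below are standard QA metrics, not necessarily the exact ones the bundled script uses:

```typescript
// Standard QA metrics; the skill's actual script may use different formulas.
interface RunData {
  passed: number;
  failed: number;
  blocked: number;  // blocked cases are excluded from the pass rate
  defects: number;  // defects found during the run
  kloc: number;     // thousands of lines of code under test
}

function passRate(r: RunData): number {
  const executed = r.passed + r.failed;
  return executed === 0 ? 0 : (r.passed / executed) * 100;
}

function defectDensity(r: RunData): number {
  return r.kloc === 0 ? 0 : r.defects / r.kloc; // defects per KLOC
}
```

The value of bundling this as a script rather than prose is determinism: the model runs the same computation every time instead of re-deriving it.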
Most testing skills stop at “write better tests.” This one has the AI standing up the QA infrastructure.
The seven core capabilities encoded in the skill are:
- QA project initialization — run `init_qa_project.py` to create project structure
- Test case writing — AAA pattern (Arrange, Act, Assert) with explicit boundary condition coverage
- Test execution tracking — status tracking per test case with pass/fail/blocked states
- Bug reporting — P0-P4 severity classification with reproduction steps, expected vs. actual, and environment details
- Quality metrics dashboards — run `calculate_metrics.py`, interpret output, surface trends
- Progress reporting — structured status reports for stakeholders with coverage and defect trends
- OWASP security testing — maps OWASP Top 10 categories to concrete test cases for each
The autonomous execution feature is worth noting separately. The skill includes a “master prompt” pattern: paste a single prompt that encodes the entire test plan, and the model executes all test cases sequentially, tracking results and filing bug reports as it goes. For regression testing on a known feature set, this is a meaningful productivity gain.
The OWASP section is genuinely substantive. It doesn’t just say “test for XSS.” It maps specific OWASP categories (A1 through A10) to concrete test procedures: injection tests, authentication bypass attempts, sensitive data exposure checks, broken access control scenarios. The model can use this to conduct a systematic security review against a real standard.
9,191 installs.
skillsafe install @daymade/qa-expert
4. @langgenius/frontend-testing — 42/50
Score: 42/50
Best for: React developers using Vitest and React Testing Library who want aggressive coverage targets enforced and concrete guidance on what to mock and what not to mock.
@langgenius publishes the Dify project — one of the more widely deployed LLM application frameworks. This skill reflects the testing discipline they apply to a large, actively-developed React codebase. The archive has 11 files: SKILL.md at 336 lines, 3 test templates (.tsx/.ts), and 6 reference docs.
The coverage targets are stated without hedging:
- 100% function coverage and statement coverage per file
- 95%+ branch coverage and line coverage per file
These are not aspirational numbers — they’re the thresholds the skill instructs the model to verify before declaring a file complete. For most projects this is aggressive. For a production application with active users, it’s appropriate.
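In Vitest, thresholds like these are typically enforced in configuration rather than by convention. A sketch, assuming Vitest's `coverage.thresholds` option (older Vitest versions nested the threshold keys directly under `coverage`; check the docs for your version):

```typescript
// vitest.config.ts — illustrative; verify option names against your Vitest version.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        functions: 100,
        statements: 100,
        branches: 95,
        lines: 95,
      },
    },
  },
});
```

With thresholds in config, the coverage run itself fails when a file falls short, so "declaring a file complete" becomes a CI check rather than a judgment call.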
The mock guidance is the most opinionated part of the skill, and the most useful. It draws a hard line:
Mock only:
- API service calls and HTTP requests
- Browser navigation (`useRouter`, `useNavigate`)
Never mock:
- Base components from the design system
- Sibling components in the same feature module
The reasoning behind “never mock base components” is that your tests should verify how your code behaves when composed with the real components — mocking them means your tests pass even when the component interface changes in a breaking way. This is a subtler point than most testing guides make, and it’s correct.
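The boundary can be illustrated without any test framework. In this sketch (the function names are hypothetical), only the HTTP layer is faked; the code under test runs through the real helper it composes with:

```typescript
// Real helper from the shared/design-system layer. Not mocked.
function formatUserName(first: string, last: string): string {
  return `${last}, ${first}`;
}

// Code under test: composes the HTTP layer with the real helper.
type FetchUser = (id: number) => { first: string; last: string };

function userLabel(id: number, fetchUser: FetchUser): string {
  const u = fetchUser(id);
  return formatUserName(u.first, u.last);
}

// Test setup: only the HTTP boundary is faked.
const fakeFetch: FetchUser = () => ({ first: "Ada", last: "Lovelace" });
const label = userLabel(1, fakeFetch);
// If formatUserName's contract changes, this test fails, as it should.
```

Had `formatUserName` been mocked too, a breaking change to its signature or output would sail through the test suite unnoticed.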
The incremental workflow instruction is also substantive. The skill doesn’t say “write tests for the component.” It says:
Process one file at a time. Order files by complexity: utilities first, then hooks, then components, then integration tests. Don’t move to the next file until the current one meets coverage thresholds.
That sequencing matters. Utilities are easiest to test correctly because they have no side effects. Starting there lets the AI establish good patterns before tackling components with complex state and event interactions. The skill is teaching a process, not just a standard.
The 3 test templates are *.tsx and *.ts files with pre-written test structure including the import patterns, describe organization, and cleanup hooks the Dify team uses. Giving the model concrete examples is more effective than describing the pattern in prose.
8,582 installs.
skillsafe install @langgenius/frontend-testing
5. @obra/systematic-debugging — 40/50
Score: 40/50
Best for: Developers whose first instinct when a test fails is to ask the AI to “just fix it” — and who want to break that pattern with a structured root cause process.
The archive has 12 files: SKILL.md at 296 lines, 5 technique reference files, find-polluter.sh, and 3 eval files. The shell script is the most concrete artifact. find-polluter.sh automates test pollution detection — the class of failures where test A passes in isolation but fails when run after test B, because test B left shared state that A didn't expect. These failures are among the hardest to diagnose manually; having a script that bisects the test suite to find the polluting test turns a day-long debugging session into a ten-minute one.
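The bisection idea behind a polluter-finding script can be sketched in a few lines. This illustrates the general technique, not the actual contents of `find-polluter.sh`:

```typescript
// Sketch of polluter bisection: given the tests that ran before the failing
// "victim" test, halve the candidate set until one polluting test remains.
// Assumes a single polluter whose effect doesn't depend on other tests.
type Test = { name: string; run: (state: Record<string, unknown>) => boolean };

// True if the victim fails after running `prefix` against fresh shared state.
function victimFailsAfter(prefix: Test[], victim: Test): boolean {
  const state: Record<string, unknown> = {};
  for (const t of prefix) t.run(state);
  return !victim.run(state);
}

function findPolluter(tests: Test[], victim: Test): Test | null {
  let candidates = tests;
  if (!victimFailsAfter(candidates, victim)) return null; // no pollution
  while (candidates.length > 1) {
    const half = candidates.slice(0, Math.ceil(candidates.length / 2));
    candidates = victimFailsAfter(half, victim)
      ? half
      : candidates.slice(half.length);
  }
  return candidates[0];
}
```

In the real world each `run` is a subprocess invoking the test runner, so the search costs O(log n) suite runs instead of n.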
The core skill encodes a 4-phase debugging process:
- Root Cause Investigation — characterize the failure before proposing any fix. What type of failure is this? What is the minimal reproduction case?
- Pattern Analysis — is this failure similar to known patterns? Is it a race condition, a test isolation issue, a logic error, or an environmental problem?
- Hypothesis Testing — form specific, falsifiable hypotheses. Test each one. Don’t propose a fix until a hypothesis has been confirmed.
- Implementation — write the fix for the confirmed root cause, not for the symptom.
The “3+ fixes failed” rule is the most distinctive feature. If the model has attempted three or more fixes and the test is still failing, the skill instructs it to stop and escalate: the problem is likely architectural, the diagnosis is wrong, or there’s a constraint that hasn’t been surfaced. Rather than continuing to iterate on the wrong approach, the skill forces a step back.
Most AI debugging behavior violates this rule constantly. Given a failing test, the model will propose fix after fix, each slightly different, until one happens to work — which doesn’t mean the root cause was found, only that the symptom was masked. The escalation rule is the right intervention.
The 5 technique reference files cover specific debugging scenarios: race conditions, test isolation failures, environment-dependent failures, assertion message interpretation, and debugging CI failures that don’t reproduce locally. The last one is particularly useful — CI-only failures are often the hardest category, and having structured guidance for the specific ways CI environments differ from local machines (ordering, timing, environment variables, filesystem state) is worth the install on its own.
30 stars. 5,873 installs. Source: github.com/obra/superpowers.
skillsafe install @obra/systematic-debugging
Frequently Asked Questions
What makes a good AI testing skill?
The best testing skills do three things that generic prompts don’t. First, they encode a specific methodology — TDD, AAA pattern, coverage-first — rather than general advice. Second, they enumerate anti-patterns explicitly, so the AI knows what failure modes to avoid, not just what success looks like. Third, they provide concrete artifacts: reference files, templates, scripts, decision trees. A skill that ships executable code or a decision procedure for boundary conditions will produce consistently better AI behavior than one that describes the same ideas in prose. File depth is a reliable proxy for how much work the publisher actually did.
Do these skills work with Vitest, Playwright, and pytest?
Yes, but with different coverage. @currents-dev/playwright-best-practices-skill is Playwright-specific — it won’t help you with Vitest or pytest. @langgenius/frontend-testing is Vitest and React Testing Library. @obra/test-driven-development is framework-agnostic: the TDD discipline applies to any language or testing tool, and the anti-patterns file doesn’t reference a specific framework. @daymade/qa-expert is similarly framework-agnostic — its Python scripts work regardless of what test runner you use. @obra/systematic-debugging works across all of them; debugging methodology doesn’t depend on the test framework. If you’re working across multiple frameworks, combine @obra/test-driven-development and @obra/systematic-debugging with whichever framework-specific skill matches your stack.
How were these skills scored?
We installed all 18 skills in our evaluation set and read every file in every archive. Skills were scored across five dimensions (10 points each): file depth, behavioral specificity, real-world coverage, anti-pattern guidance, and install volume. File depth rewards skills that go beyond a single Markdown file — reference files, templates, and scripts that the AI can actively consult. Behavioral specificity rewards skills that encode a concrete process rather than general principles. Real-world coverage rewards skills that address the actual hard problems practitioners face: flakiness, mock boundaries, CI integration. Anti-pattern guidance rewards skills that enumerate failure modes and tell the AI what not to do. Install volume rewards demonstrated adoption. Skills with incomplete archives, placeholder content, or single-file implementations scored proportionally lower.
Conclusion
If you install nothing else from this list, start with @obra/test-driven-development and @obra/systematic-debugging. Together they address the two most common ways AI-assisted testing goes wrong: implementation written before tests, and test failures fixed by symptom masking rather than root cause analysis. Both come from the same publisher (@obra/superpowers) and are designed to work together.
For Playwright projects, @currents-dev/playwright-best-practices-skill at 46/50 is the clear recommendation — nothing else in the testing skill ecosystem comes close in terms of depth or coverage. For React projects with Vitest, @langgenius/frontend-testing gives you the concrete coverage targets and mock discipline that production frontend work requires.
@daymade/qa-expert is the right choice if you need the full QA lifecycle — bug tracking, metrics, OWASP security testing — rather than just a better test-writing assistant.
skillsafe install @obra/test-driven-development
skillsafe install @obra/systematic-debugging
skillsafe install @currents-dev/playwright-best-practices-skill
Browse the full testing tag to find skills filtered by framework, tool, or testing layer.
Related roundups: Browse all Best Of roundups