# harnessed.md
Verification is the highest-leverage layer of the harness — Anthropic calls it “the single highest-leverage thing you can do.” Without it, the bottleneck just moves from writing code to reviewing it.
## The verification stack
Three tiers, ordered by cost. Fail fast, fail cheap — each tier exists to keep the next from doing work it shouldn’t have to.
| Layer | Examples | When it runs | Cost |
|---|---|---|---|
| Deterministic | Types, linters, formatters, builds | Every file edit | Free, instant |
| Tests | Unit, integration, end-to-end | Every change, every PR | Cheap |
| Agentic review | Security, style, perf subagents; PR reviewers | Pre-merge | Tokens + minutes |
## Deterministic checks
The floor. Types, linters, and formatters fire on every edit and convert whole classes of error into compile-time failures.
Snyk’s research suggests that around 40% of AI-generated code contains a vulnerability, so this layer carries more weight in the agent era than it did before. Agent-specific anti-patterns worth a custom lint rule:
- `any` and untyped escape hatches. Agents reach for `any` whenever the type is hard. Block with `@typescript-eslint/no-explicit-any` or Pyright strict mode.
- Stub implementations. `TODO`, `pass`, `throw new Error('not implemented')`. A grep-level rule catches these before review.
- Security smells. Run Semgrep’s free `p/owasp-top-ten` ruleset (or equivalent) in CI.
Wire these into a PostToolUse hook so failures land in the agent’s next turn, not in CI.
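As a rough illustration of the grep-level rule for stub implementations, here is a minimal Python sketch. The function name `find_stubs` and the exact patterns are assumptions made for this example, not part of any tool named above, and the `pass` pattern is deliberately blunt: it will also flag legitimate uses, so treat the hits as prompts for a closer look.

```python
import re

# Hypothetical patterns for the stub markers listed above; tune them per codebase.
STUB_PATTERNS = [
    re.compile(r"\bTODO\b"),
    re.compile(r"^\s*pass\s*(#.*)?$"),
    re.compile(r"throw new Error\(\s*['\"]not implemented['\"]\s*\)", re.IGNORECASE),
]


def find_stubs(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs for lines that look like stubs."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in STUB_PATTERNS):
            hits.append((number, line.strip()))
    return hits


sample = "def handler(event):\n    # TODO: validate input\n    pass\n"
for number, line in find_stubs(sample):
    print(f"line {number}: {line}")
```

A hook script could run a check like this over the file that was just edited and report the hits back to the agent. How the hook receives the file path and how failures are surfaced depends on the hook contract, so check the Claude Code hooks documentation before relying on any particular payload or exit code.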
## Tests
The pyramid still applies — unit, integration, end-to-end — but agents introduce specific failure modes:
- Mocks of the system under test. The test mocks the thing it’s meant to verify, then asserts the mock returned what the mock was told to. Enforce real dependencies in integration suites.
- Tautological assertions. The assertion re-encodes the implementation, not the contract. Ask: “what would this catch if the implementation regressed?” If nothing, it’s decoration.
- Tests written after the fact. Tests written second mirror the code, not the spec. Make the agent follow red-green: failing test first, then the code that makes it pass.
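To make the first failure mode concrete, here is a small Python sketch. `Cart` and `BrokenCart` are hypothetical classes invented for this example. The mocked "test" keeps passing against a regressed implementation because it never runs the real code, while the test that exercises the real object fails as it should.

```python
from unittest.mock import Mock


class Cart:
    """Hypothetical system under test."""

    def total(self, prices: list[float]) -> float:
        return sum(prices)


class BrokenCart(Cart):
    """A regressed implementation that silently drops the last item."""

    def total(self, prices: list[float]) -> float:
        return sum(prices[:-1])


def mocked_test_passes(cart_cls: type) -> bool:
    # Anti-pattern: the system under test is replaced by a mock,
    # so the assertion only checks what the mock was told to return.
    cart = Mock(spec=cart_cls)
    cart.total.return_value = 30.0
    return cart.total([10.0, 20.0]) == 30.0


def real_test_passes(cart_cls: type) -> bool:
    # Exercises the real implementation against the contract.
    return cart_cls().total([10.0, 20.0]) == 30.0


for cls in (Cart, BrokenCart):
    print(cls.__name__, "mocked:", mocked_test_passes(cls), "real:", real_test_passes(cls))
```

Only the real test distinguishes `Cart` from `BrokenCart`; the mocked one passes for both, which is the "decoration" described above.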
Coverage is the wrong target — line coverage means lines ran, not that assertions are meaningful.
The honest signal is mutation testing: Stryker (JS/TS/.NET), PIT (Java), mutmut (Python), and cargo-mutants (Rust) introduce small changes (`>` → `>=`, dropping a `!`), then re-run your tests.
Surviving mutants mark behaviour your tests weren’t actually watching; feed them back as the next round’s target.
Still niche, but the Thoughtworks Technology Radar Vol. 34 put it in Trial as “the most honest signal for evaluating the real fault-detection capability of a test suite.”
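The idea can be shown by hand without any of the tools above. In this Python sketch, `is_over_limit` and its hand-written mutant are hypothetical. A real mutation tester generates and runs mutants automatically, but the logic is the same: a suite that never probes the boundary lets the `>` → `>=` mutant survive, and adding one boundary case kills it.

```python
def is_over_limit(count: int) -> bool:
    """Original: strictly more than 100 is over the limit."""
    return count > 100


def is_over_limit_mutant(count: int) -> bool:
    """Mutant: the comparison operator was changed from > to >=."""
    return count >= 100


def weak_suite(fn) -> bool:
    # Only checks values far from the boundary.
    return fn(500) is True and fn(3) is False


def boundary_suite(fn) -> bool:
    # Adds the one case where the original and the mutant disagree.
    return weak_suite(fn) and fn(100) is False


print("weak suite kills mutant:", not weak_suite(is_over_limit_mutant))
print("boundary suite kills mutant:", not boundary_suite(is_over_limit_mutant))
```

Both suites pass against the original function, so line coverage looks identical; only the mutant reveals that the weak suite was not checking the boundary.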
## Agentic review
Once the cheap checks pass, layer on review by other agents. LLMs are better at reviewing than generating — a reviewer reliably catches what the author missed.
### Multi-agent PR review
Anthropic’s Code Review for Claude Code (March 2026) dispatches specialist agents in parallel — logic, boundary conditions, API misuse, auth flaws, project conventions — and a meta-agent verifies each finding before posting. Substantive comments rose from 16% of PRs to 54% post-adoption, with <1% engineer disagreement on findings.
### The hosted reviewer landscape
Beyond Anthropic’s own offering, a handful of options have emerged:
- CodeRabbit — cross-platform support across GitHub, GitLab, Bitbucket, and Azure DevOps
- Greptile — full-repo context, pitched at large/legacy codebases
- cubic — emphasis on high-signal output; used by Granola, n8n, Cal.com
- Graphite Diamond — built around Graphite’s stacked-PR workflow
- Vercel Agent — sandbox-validates suggested patches against your real build, tests, and linters before posting
They all do roughly the same job — pick by where your code already lives.
### The subagent reviewer pattern
Same shape, locally. Drop subagent definitions into `.claude/agents/`; each is a markdown file with a focused prompt and tool allowlist:

```markdown
---
name: security-reviewer
description: Review diff for auth, secret-handling, and injection risks
tools: Read, Grep, Glob
---
Audit the diff for OWASP Top 10 patterns. Flag only verified findings,
with file:line and a one-sentence fix. No speculation.
```

Add `style-reviewer.md` and `performance-reviewer.md`, dispatch concurrently, then loop:
Implement → Review (n parallel subagents) → Resolve → Re-review
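The control flow of that loop can be sketched in Python. Everything here is a stand-in: the reviewers are plain functions rather than real subagents, `resolve` is a toy fixer, and `max_rounds` is a hypothetical parameter. The sketch only illustrates parallel dispatch, the resolve step, and a hard cap on rounds.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Reviewer = Callable[[str], list[str]]  # takes a diff, returns findings


def review_round(diff: str, reviewers: list[Reviewer]) -> list[str]:
    """Run all reviewers concurrently and flatten their findings."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        results = list(pool.map(lambda reviewer: reviewer(diff), reviewers))
    return [finding for findings in results for finding in findings]


def review_loop(diff: str, reviewers: list[Reviewer],
                resolve: Callable[[str, list[str]], str],
                max_rounds: int = 5) -> tuple[str, int]:
    """Review, resolve, and re-review until clean or the cap is hit.

    Returns the final diff and the number of review rounds that ran,
    counting the final clean round.
    """
    for round_number in range(1, max_rounds + 1):
        findings = review_round(diff, reviewers)
        if not findings:
            return diff, round_number
        diff = resolve(diff, findings)
    return diff, max_rounds


# Toy stand-ins for subagent reviewers and the resolve step.
def security_reviewer(diff: str) -> list[str]:
    return ["eval() on user input"] if "eval(" in diff else []


def style_reviewer(diff: str) -> list[str]:
    return ["tab indentation"] if "\t" in diff else []


def resolve(diff: str, findings: list[str]) -> str:
    return diff.replace("eval(", "safe_parse(").replace("\t", "    ")


final_diff, rounds = review_loop("\tx = eval(raw)", [security_reviewer, style_reviewer], resolve)
print(repr(final_diff), "after", rounds, "rounds")
```

In a real setup each reviewer call would be a subagent invocation and `resolve` would be the implementing agent addressing the findings; the cap guards against the loop never converging.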
Rounds 1–2 capture ~75% of the improvement (round 1 catches half the errors, round 2 catches half of what’s left); cap the loop at 5–6 rounds to avoid oscillation. Width beats depth: three reviewers in 30 seconds beat one monolithic prompt in two minutes.
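The ~75% figure follows from simple arithmetic. The Python sketch below extends it under one loud assumption: that every round catches the same fraction of whatever errors remain. The text above only states this for rounds 1 and 2, so the later rounds are an extrapolation, and the model says nothing about oscillation.

```python
def cumulative_caught(rounds: int, catch_rate: float = 0.5) -> float:
    """Fraction of the original errors caught after `rounds` review rounds,
    assuming each round catches `catch_rate` of what is still left."""
    return 1 - (1 - catch_rate) ** rounds


for n in range(1, 7):
    print(f"round {n}: {cumulative_caught(n):.1%} caught so far")
```

Under that assumption, rounds 3 onwards share the remaining 25% between them, with each round contributing half as much as the one before, which is why a small cap gives up little.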
Starting points: Anthropic’s built-in `/security-review`, or Every’s compound-engineering plugin for a full plan → work → review → compound loop.
## Closing the loop
Verification only compounds when its signals flow back into the guides. When the same class of bug appears twice, the fix isn’t a better prompt — push it into the harness. If a deterministic check can catch the pattern — a lint rule, a hook, or a test — use that. Otherwise pick the advisory form that fits: a path-scoped rule for code-area specifics, a skill for workflows, AGENTS.md for what every session needs to know.
Three questions to ask after every failure:
- Should this have been caught earlier? Push enforcement upstream — to whichever cheaper layer would catch it next time.
- Pattern or one-off? Patterns get hard-coded into the harness; one-offs just get fixed.
- Did an advisory rule fail to stick? Upgrade it to a deterministic check — a hook, lint, or test.
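The routing described in this section can be written down as a small decision function. This Python sketch is purely illustrative: the `Defect` fields and the `scope` labels are hypothetical names chosen for the example, and judging whether a pattern is deterministically checkable remains a human call.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    recurring: bool                     # pattern, or one-off?
    deterministic_check_possible: bool  # could a lint rule, hook, or test catch it?
    scope: str                          # "code-area", "workflow", or "every-session"


# Advisory forms, used only when no deterministic check fits.
ADVISORY_FORMS = {
    "code-area": "path-scoped rule",
    "workflow": "skill",
    "every-session": "AGENTS.md entry",
}


def harness_change(defect: Defect) -> str:
    """Pick where a lesson from a failure should live in the harness."""
    if not defect.recurring:
        return "fix it once; no harness change"
    if defect.deterministic_check_possible:
        return "deterministic check (lint rule, hook, or test)"
    return ADVISORY_FORMS[defect.scope]


print(harness_change(Defect(recurring=True, deterministic_check_possible=False, scope="workflow")))
```

The ordering matters: deterministic checks are tried before any advisory form, and an advisory rule that fails to stick is a signal to revisit whether a deterministic check is possible after all.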
### Wiring up the loop
Four levels of automation, ordered by friction:
- Quick capture. In Claude Code, prefix a prompt with `#` to append a memory entry. Zero ceremony.
- A `/learn` skill. Slash command that takes a bug plus the diff and proposes a candidate rule, lint, or test for review.
- Session-end hooks. A `Stop` or `SessionEnd` hook reflects over the session and appends learnings to a tracked file. Starters: claude-mem, claude-memory-compiler.
- Dreaming. Anthropic’s Managed Agents (May 2026) curate memory automatically via scheduled reflection. Harvey saw a ~6× completion-rate lift.
Whatever the mechanism: push one improvement into the harness (rule, test, hook) per recurring defect — never a one-off prompt fix.
Verification is where the agent meets reality. The cleaner the layers beneath, the less you need humans at the top.
## Further reading
- Code Review for Claude Code, Anthropic. Multi-agent PR review with a verification pass.
- Honk Part 3: Feedback Loops for Background Coding Agents, Spotify Engineering. Deterministic verifiers and an LLM judge in production.
- Mutation testing for the agentic era, Trail of Bits. Why mutation testing matters when agents write the tests.
- AI Is Writing Our Code Faster Than We Can Verify It, Andrew Stellman. The growing gap between generation speed and verification.