#harnessed.md

Verification is the highest-leverage layer of the harness — Anthropic calls it “the single highest-leverage thing you can do.” Without it, the bottleneck just moves from writing code to reviewing it.

##The verification stack

Three tiers, ordered by cost. Fail fast, fail cheap — each tier exists to keep the next from doing work it shouldn’t have to.

LayerExamplesWhen it runsCost
DeterministicTypes, linters, formatters, buildsEvery file editFree, instant
TestsUnit, integration, end-to-endEvery change, every PRCheap
Agentic reviewSecurity, style, perf subagents; PR reviewersPre-mergeTokens + minutes

##Deterministic checks

The floor. Types, linters, and formatters fire on every edit and convert whole classes of error into compile-time failures.

Snyk’s research puts around 40% of AI-generated code as containing a vulnerability, so this layer carries more weight in the agent era than it did before. Agent-specific anti-patterns worth a custom lint rule:

Wire these into a PostToolUse hook so failures land in the agent’s next turn, not in CI.

##Tests

The pyramid still applies — unit, integration, end-to-end — but agents introduce specific failure modes:

Coverage is the wrong target — line coverage means lines ran, not that assertions are meaningful.

The honest signal is mutation testing: Stryker (JS/TS/.NET), PIT (Java), mutmut (Python), and cargo-mutants (Rust) introduce small changes — >>=, drop a ! — then re-run your tests.

Surviving mutants are tests that weren’t actually watching that behaviour; feed them back as the next round’s target.

Still niche, but ThoughtWorks Vol. 34 put it in Trial as “the most honest signal for evaluating the real fault-detection capability of a test suite.”

##Agentic review

Once the cheap checks pass, layer on review by other agents. LLMs are better at reviewing than generating — a reviewer reliably catches what the author missed.

###Multi-agent PR review

Anthropic’s Code Review for Claude Code (March 2026) dispatches specialist agents in parallel — logic, boundary conditions, API misuse, auth flaws, project conventions — and a meta-agent verifies each finding before posting. Substantive comments rose from 16% of PRs to 54% post-adoption, with <1% engineer disagreement on findings.

###The hosted reviewer landscape

Beyond Anthropic’s own offering, a handful of options have emerged:

They all do roughly the same job — pick by where your code already lives.

###The subagent reviewer pattern

Same shape, locally. Drop subagent definitions into .claude/agents/ — each is a markdown file with a focused prompt and tool allowlist:

---
name: security-reviewer
description: Review diff for auth, secret-handling, and injection risks
tools: Read, Grep, Glob
---
Audit the diff for OWASP Top 10 patterns. Flag only verified findings,
with file:line and a one-sentence fix. No speculation.

Add style-reviewer.md and performance-reviewer.md, dispatch concurrently, then loop:

Implement → Review (n parallel subagents) → Resolve → Re-review

Rounds 1–2 capture ~75% of the improvement (round 1 catches half the errors, round 2 catches half of what’s left); cap at 5–6 to avoid oscillation. Width beats depth — three reviewers in 30 seconds beats one monolithic prompt in two minutes.

Starting points: Anthropic’s built-in /security-review, or Every’s compound-engineering plugin for a full plan → work → review → compound loop.

##Closing the loop

Verification only compounds when its signals flow back into the guides. When the same class of bug appears twice, the fix isn’t a better prompt — push it into the harness. If a deterministic check can catch the pattern — a lint rule, a hook, or a test — use that. Otherwise pick the advisory form that fits: a path-scoped rule for code-area specifics, a skill for workflows, AGENTS.md for what every session needs to know.

Three questions to ask after every failure:

  1. Should this have been caught earlier? Push enforcement upstream — to whichever cheaper layer would catch it next time.
  2. Pattern or one-off? Patterns get hard-coded into the harness; one-offs just get fixed.
  3. Did an advisory rule fail to stick? Upgrade it to a deterministic check — a hook, lint, or test.

###Wiring up the loop

Four levels of automation, ordered by friction:

Whatever the mechanism: push one improvement into the harness (rule, test, hook) per recurring defect — never a one-off prompt fix.

Verification is where the agent meets reality. The cleaner the layers beneath, the less you need humans at the top.


##Further reading