
Your code reviews are passing. Your production is breaking.

Across multiple internal reports and industry observations, nearly half of AI-generated code requires manual debugging after it clears QA and staging. Not because QA was sloppy. Because the code looked correct: syntactically clean, well-formatted, handling the obvious edge cases. It triggered merge confidence. It shipped. Then it failed in ways that only became visible under production conditions.

More careful review won’t fix this. It’s a structural problem with where review happens.

Code review was designed to force comprehension. To approve a pull request, a reviewer had to understand the code well enough to spot problems. That was the mechanism that made it work. At AI velocity, the volume overwhelms that mechanism. Reviewers approve code they don’t genuinely understand. The protection looks intact, but the codebase hollows out underneath. Call this comprehension debt: the growing gap between code that exists in your system and code any human actually understands.

An Anthropic randomized controlled trial (52 engineers, short structured tasks, controlled setting) found that engineers using AI assistance completed tasks in comparable time to the control group but scored 17% lower on comprehension tests afterward (50% vs. 67%). The biggest drops were in debugging and conceptual understanding. The code got written. Understanding didn’t.

The solution isn’t reviewing faster. It isn’t hiring more senior engineers to absorb the volume. It’s reviewing at a different point in the process, where you can still understand what you’re approving.

The Collapse of Code Review as a Safety Mechanism

In a previous piece in this series, I covered the speed problem: AI generates code faster than humans can meaningfully review it, and the cognitive limits of review (200 to 400 lines per hour) don’t scale with AI output. Here I want to focus on the trust problem, which is more fundamental.

In teams I’ve spoken with, agentic AI pull requests sit significantly longer than unassisted PRs before anyone picks them up. This isn’t procrastination born of laziness. It’s a rational response to a real signal: engineers know, intuitively, that reviewing AI-generated code is different. The code is superficially correct in a way that human code isn’t. It handles all the obvious cases. It’s formatted cleanly. It compiles. And yet something about it resists the quick mental model that makes human code reviewable.

What resists is the intent. Human code carries traces of the decision-making that produced it. Variable names reflect the author’s mental model. Comments appear at inflection points. The code has a shape that reflects how a person thought through the problem. AI-generated code has none of that. It’s optimized for correctness, not for communicating intent.

The Anthropic study explains the mechanism. When engineers delegate code generation to AI without actively engaging with the problem, their comprehension of the resulting code is genuinely lower. They can read it. They can’t reason about it. Reviews become pattern matching rather than understanding. The PR gets approved. The comprehension debt accrues.

This is why production failures are increasing even as code review processes appear unchanged. The reviews are happening. The protection is gone. You’re checking the code. Not the thinking.

Review where understanding is still possible: before the code exists.

The Hierarchy of Leverage

Before describing the solution, it’s worth grounding it in a framework that has been established for four decades.

Barry Boehm’s research, dating to the 1980s, found that defects caught at the requirements stage cost 50 to 200 times less to fix than the same defects caught in production. Widely cited industry estimates put a design bug at $25 to $45 to fix during design review, and $2,500 to $9,000 in production: a roughly 100x difference.

A wrong requirement produces a wrong design. A wrong design produces hundreds of lines of wrong code.

Donella Meadows made the same observation from systems thinking: the highest leverage interventions act on intent, not implementation. Change what a system is trying to do, and the implementation follows.

Map this to software development:

  • Research (what problem, what constraints): errors here create thousands of bad lines
  • Planning (what approach, what interfaces): errors here create hundreds of bad lines
  • Implementation (the code itself): errors here create one bad line
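The leverage in that hierarchy can be made concrete with a little arithmetic. The stage multipliers below are illustrative placeholders in the spirit of Boehm’s ranges, not measured data:

```python
# Rough cost-of-fix arithmetic. The stage costs are illustrative
# placeholders chosen to echo Boehm-style escalation, not measured values.
FIX_COST = {"research": 1, "planning": 5, "implementation": 20, "production": 100}

def escaped_cost(stage_found: str, stage_introduced: str = "research") -> float:
    """Relative cost of fixing a defect at stage_found vs. at its introduction."""
    return FIX_COST[stage_found] / FIX_COST[stage_introduced]

# A requirements defect that escapes all the way to production costs
# two orders of magnitude more than one caught in research review.
print(escaped_cost("production"))  # 100.0
print(escaped_cost("planning"))    # 5.0
```

The exact multipliers vary by study and by system; the shape of the curve is what matters.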

AI doesn’t change the hierarchy. It accelerates it. If AI generates code at 10x speed, a bad plan produces bad code 10x faster. The hierarchy of leverage becomes more critical precisely when AI is involved.

Review where the leverage is highest: on research and plans, not on the implementation they produce. That’s where mistakes are still small.

Stop reviewing code. Start reviewing decisions.

The RPI Method in Practice

The workflow that follows from this is called RPI: Research, Plan, Implement. Each phase produces a reviewable artifact. Review happens at the artifact level, not the code level.

Research: Define the problem before touching the codebase

The research phase answers three questions before any code is written or planned: What problem are we solving? What constraints apply? What does the existing codebase already do in this area that is relevant?

The output is a short, structured document that captures this context clearly enough for someone else to review.

The review question at this stage is simple: are we solving the right problem? Have we understood the existing system accurately? Is there context we’re missing?

This is the highest-leverage review in the entire workflow. A misunderstood requirement caught here costs almost nothing. The same misunderstanding caught in production costs multiple redeploy cycles and, in production systems with real users, may cost considerably more.

The comprehension problem doesn’t exist here. A research document describing what exists and how it works is entirely understandable. It’s also the place where the errors that produce the worst downstream consequences live.

Plan: The artifact that replaces the PR

The plan is the core innovation in this workflow. It defines what will change, and just as importantly, what won’t. Scope exclusions are explicit, not woven into prose. Success criteria are defined upfront: what can be validated automatically, and what requires a human to verify. Review responsibility is assigned inside the plan itself, not left implicit.

Dex Horthy of HumanLayer arrived at this through necessity, not theory. His team was receiving 20,000-line pull requests from their best AI coder and found them simply unreviewable. His framing is direct: “I can’t read 2,000 lines of Go every day, but I can sure as heck read 200 lines of an implementation plan.”

That’s the key trade. Reviewing a 200-line plan takes 20 minutes and produces genuine understanding. Reviewing 2,000 lines of AI-generated code takes hours and produces, at best, surface-level pattern matching. The plan review is not a lighter version of the code review. It’s a more effective one, because the reviewer can actually understand what they’re approving.

Academic research supports this directly. A peer-reviewed study published in IEEE Transactions on Software Engineering tested an interactive spec-clarification workflow across four LLMs and two Python datasets. Within five user interactions, code generation accuracy improved by an average of 45.97 percentage points. In a controlled user study, task correctness jumped from 0.40 to 0.84, and cognitive load dropped by 38%. When AI generates from a precise, reviewed spec, it has clear intent to follow. When it generates from a vague prompt, it fills ambiguity with assumptions, and assumptions are where the failures hide.

The review questions at the plan stage are: Does this approach make sense for our architecture? Are the interfaces right? Does the scope exclusion section cover what it should? These are questions a reviewer can answer without reading code. The comprehension is restored because the artifact is comprehensible.
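Part of plan review can even be mechanized: before a human evaluates the approach, a gate can confirm the plan artifact has the required structure. A minimal sketch, assuming a markdown plan template; the section names and the regex are my own convention, not a standard:

```python
import re

# Sketch of a plan-review gate: verify the plan document contains the
# sections described in this article. The section names are assumptions;
# adapt them to your team's plan template.
REQUIRED_SECTIONS = [
    "Current State",
    "Scope",             # including explicit exclusions
    "Key Discoveries",
    "Success Criteria",  # split into automated vs. manual checks
]

def missing_sections(plan_text: str) -> list[str]:
    """Return required section headings absent from the plan document."""
    return [
        s for s in REQUIRED_SECTIONS
        if not re.search(rf"^#+\s*{re.escape(s)}", plan_text,
                         re.IGNORECASE | re.MULTILINE)
    ]

plan = """# Plan: payment retry logic
## Current State
## Scope
## Key Discoveries
## Success Criteria
"""
print(missing_sections(plan))  # []
```

A check like this doesn’t judge the approach; it only guarantees the reviewer has something complete to judge.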

Implement: AI generates from an approved plan

With a reviewed plan, the AI implements against a known, approved specification. The reviewer has already seen what will be built and signed off on the approach. Code review at this stage becomes confirmation of an approved design, not discovery of unknown behavior.

HumanLayer’s production results from this workflow: Dex Horthy ships six pull requests in a single day. An intern on his team went from two PRs on day one to ten PRs by day eight. He describes not having opened a non-markdown file in two months. His primary review surface is the plan. Verification follows the plan: automated checks plus defined human checkpoints.

At scale, Amazon Kiro implements the same three-phase structure: Specify (requirements.md with user stories and acceptance criteria), Plan (design.md with architecture, data models, interfaces), then Execute. Each phase gates the next. No code is generated until the plan is reviewed and approved. Agent Hooks provide continuous enforcement: automated security scans, style checks, and test suites run on every file change and validate output against the approved spec.
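The gating logic itself is simple. A minimal sketch of “each phase gates the next,” with an invented `Phase`/`Workflow` API that is not Kiro’s actual interface:

```python
from enum import Enum

# Sketch of phase gating: a phase may start only after every earlier
# phase's artifact has been approved. The API here is invented for
# illustration, not taken from any real tool.
class Phase(Enum):
    SPECIFY = 1
    PLAN = 2
    EXECUTE = 3

class Workflow:
    def __init__(self) -> None:
        self.approved: set[Phase] = set()

    def approve(self, phase: Phase) -> None:
        self.approved.add(phase)

    def can_start(self, phase: Phase) -> bool:
        # All earlier phases must be approved before this one begins.
        return all(p in self.approved for p in Phase if p.value < phase.value)

wf = Workflow()
print(wf.can_start(Phase.PLAN))   # False: the spec isn't approved yet
wf.approve(Phase.SPECIFY)
print(wf.can_start(Phase.PLAN))   # True
```

The point of the structure is that no implementation can begin against an unreviewed plan, by construction rather than by discipline.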

The Stanford research on this question (Yegor Denisov-Blanch, 100,000+ developers across 600+ companies) found that unstructured AI use in complex, brownfield environments shows 0 to 10% productivity improvement and sometimes produces negative results. Structured spec-driven workflows are one of the clearest candidates for recovering that lost leverage. The difference, in those environments, isn’t model quality. It’s the clarity of intent the model has to work from.

What Spec Review Actually Looks Like

The before-and-after is worth making concrete.

Before. A pull request arrives. It’s 400 lines. The description says “adds payment retry logic.” You start reading. Ten minutes in, you realize the code touches three services you didn’t know were involved, adds a new queue, and modifies error handling you thought was stable. You have two options: approve with low confidence (risky), or ask for a synchronous meeting to understand what’s happening (slow). Neither option is good. In one case I’m aware of, a retry system that passed staging duplicated charges in production. Not because the code was wrong, but because no one had reviewed the interaction with the billing service it quietly touched.

After. A plan arrives before any code is written. It has a current state section explaining how the existing retry path works and which services it touches. A scope section states explicitly: will not modify BillingService or the existing error logging pipeline. A key discoveries section includes the relevant file and line references. The success criteria split automated checks (retry cap enforced, idempotency key present) from manual ones (verify no duplicate charges appear in staging under network partition). Each section is understandable without reading any code.

You evaluate this in five minutes. You know exactly what will be built. You know what won’t be touched. You can verify that the approach fits your architecture. You approve with genuine confidence, because you understood the change before a single line of code was written.

When the code arrives, it confirms an approved design. The review is fast because there are no surprises. The comprehension debt doesn’t accumulate, because the reviewer engaged with the problem at the planning stage, not the implementation stage.

There’s a second-order effect here that the Anthropic study points to. The same research found that the comprehension gap between AI-assisted and unassisted engineers was specifically driven by delegation: using AI to generate without engaging with the problem. Engineers who used AI for exploration, asking questions, considering tradeoffs, scored above 65% on comprehension tests. Those who delegated generation scored below 40%.

Spec review forces engineers into the exploration mode rather than the delegation mode. Reviewing a plan requires engaging with tradeoffs. The team stays cognitively present in the problem. The work gets done, and the understanding follows.

The Objection: What About Spec Drift?

Specs can fall out of sync with code. But this isn’t new territory. Outdated PR descriptions, misleading commit messages, comments describing what the code used to do: documentation drift already exists in every codebase. Spec drift is the same failure mode with a different name.

In practice, teams handle it three ways: spec-first (the spec guides development but isn’t enforced afterward), spec-anchored (CI/CD enforces the spec on every commit, catching drift before it merges), and spec-as-source (code is generated entirely from specs and never manually edited).

Most teams start with spec-first and add enforcement as the codebase matures. On balance, spec drift costs less than what you’re already managing. And spec review catches more problems before drift can accumulate.
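A spec-anchored check doesn’t have to be elaborate. One sketch, assuming the plan declares which path prefixes it may touch; the file names and prefixes here are hypothetical:

```python
# Sketch of a spec-anchored CI check: flag diff files that fall outside
# the plan's declared scope. Paths below are hypothetical examples.
def out_of_scope(changed_files: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Files in the diff that the approved plan did not authorize."""
    return [
        f for f in changed_files
        if not any(f.startswith(p) for p in allowed_prefixes)
    ]

allowed = ["services/retry/", "tests/retry/"]
diff = ["services/retry/backoff.py", "services/billing/charge.py"]
print(out_of_scope(diff, allowed))  # ['services/billing/charge.py']
```

A check like this would have surfaced the billing-service interaction in the retry example above before merge, not after a duplicated charge.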

Starting the Transition

You don’t flip a switch on this. You pick one team and one type of work.

The most natural starting point is the next feature where requirements are genuinely ambiguous, or the last pull request that took three rounds of review to merge. Introduce plan review there. Before the AI generates code, require a written plan. Review the plan. Then implement.

The economics are obvious. An hour spent reviewing a plan replaces four hours reviewing the code it generates, and a significant fraction of that code would have required production debugging regardless of how carefully you reviewed it after the fact.

Code review isn’t going away. Automated quality gates still validate security, performance, and standards. Human review still assesses design appropriateness after those gates pass. What changes is the primary artifact of review: not the generated code, but the plan that authorized it.

The teams operating at AI velocity aren’t reviewing less rigorously. They’re reviewing at the right altitude, where a single hour of attention catches errors before they propagate into thousands of lines of code and then into production.

Start this week: take the next feature with any ambiguity and require a written plan before the AI writes a line. Review the plan. Then implement. One feature is enough to see it.

The next piece in this series covers what this workflow looks like at enterprise scale, where governance requirements, compliance obligations, and organizational inertia add constraints that a three-person startup doesn’t face. The principles are the same. The adoption path is different.