
Why Code Review Breaks at AI Velocity

Nearly half of all code is now AI-generated. Twenty million developers use GitHub Copilot. They’re writing code 55% faster than before.

Your code review process wasn’t designed for this.

I learned this the hard way. Earlier in my career, when I was a director, one of my teams was struggling with code reviews for a new identity system. I stepped in to help. The pull request was dense, covering authentication flows, connection pooling, and error handling. I left a comment about a socket being passed around without clarity on whether it was being closed properly.

Three weeks later, the identity system went down in production. File descriptors exhausted. Socket leak.

I caught the problem and left a comment. It was lost in other feedback, and the code shipped anyway.

Here’s what should have happened: a static analyzer should have flagged the resource leak. An automated quality gate should have blocked the merge. The build should have failed until the socket handling was fixed.

Instead, we relied on human code review to catch a deterministic quality issue. Noticing isn’t enough when humans are the enforcement mechanism.

That wasn’t a review failure. That was code review revealing a structural problem we’d been ignoring: we were using human judgment to validate things that should be automatically enforced.

AI just made that problem impossible to ignore.

What Code Review Was Really Doing

When Alberto Bacchelli and Christian Bird analyzed 570 code review comments at Microsoft, they found something surprising: 75% of review comments focused on evolvability and maintainability, not bugs.

Code review was supposed to catch defects. That’s what teams said they were doing. But the data showed three distinct purposes:

  1. Bug detection (what teams said they wanted)
  2. Knowledge transfer (38% mentioned this)
  3. Design assessment (where 75% of comments actually landed)

Code review became the Swiss Army knife of software quality. One process serving three masters.

Research from SmartBear, based on large-scale studies with teams like Cisco, found that developers can effectively review about 200 to 400 lines of code per hour. Once review speed exceeds roughly 500 lines per hour, defect detection rates drop sharply and feedback becomes superficial.

When humans wrote code slowly, this cognitive limit didn’t matter.

AI changed that.

The AI Velocity Problem

AI doesn’t respect cognitive limits.

At 200 to 400 lines per hour, a 2,000-line AI-generated pull request would take 5 to 10 hours to review effectively. Most reviews get 30 minutes. At that pace, you’re not catching bugs or assessing design. You’re skimming and hoping.

Recent security research underscores the risks of rapid AI code generation. In Veracode’s 2025 security report analyzing more than 100 large language models, nearly 45% of AI-generated code samples introduced known security vulnerabilities across a broad set of coding tasks, even though the code often compiled and passed basic tests.

Researchers conducting large-scale empirical analysis of 7,703 AI-generated files also identified hundreds of distinct vulnerability types that static analysis can surface consistently.

You can’t use a process designed for human writing velocity to validate AI writing velocity. The cognitive limits don’t change. You just get code that looks complete but hides problems you’ll discover later.

Like my socket leak.

The question isn’t whether code review is failing. It’s what we were actually validating when it worked, and how to validate those things at AI velocity.

The mistake isn’t trusting AI too much. It’s treating confidence as a single thing when it never was.

This article is the first in a three-part series on software quality in the AI era.

Here, I focus on the model: what “confidence” actually means. The next two pieces focus on workflow and enterprise adoption.

The Three Confidence Dimensions

Break down “I’m confident in this code” and you find three distinct questions:

1. Behavior Confidence: Does it do the right thing?

This is functional correctness. Does the authentication system actually authenticate users? Does the payment processor handle transactions correctly?

Who validates it: Humans, through specs and test plans.

Why humans? Because correctness requires understanding intent. What did the stakeholder actually want? What does “correctly authenticated” mean in this business context? What edge cases matter?

Automated tools can’t weigh the developer’s intent or the broader business logic. An AI can check whether code matches a spec. It can’t tell you whether the spec captured the right intent.

2. Design Confidence: Is it structured for change?

This is architecture, modularity, clarity. Will future developers understand it? Can we extend it without rewriting everything? Does the design align with our principles?

Who validates it: Humans, through design review.

Why humans? Because design quality requires judgment about appropriateness. A microservices architecture is great for scaling teams but overkill for a prototype.

Remember Bacchelli and Bird’s finding: 75% of code review comments focus on evolvability and maintainability. This is what humans actually do in code review.

3. Quality Attributes: Does it meet non-functional requirements?

This is security, performance, compliance. Does the authentication system have SQL injection vulnerabilities? Is the search performant at scale? Does it follow code standards? Are resources being properly managed and cleaned up?

Who validates it: Automated systems. Security scanners, performance tests, linters, static analysis.

Why automation? Because quality attributes are, as Bass et al. defined them, “measurable and testable.” Security scanners check for vulnerability patterns. Performance tests measure response times. Linters enforce standards. Static analyzers detect resource leaks.

My socket leak? That was a quality attribute issue: a deterministic problem that a static analyzer should have caught and blocked automatically. Instead, we relied on human code review. I noticed it, but the code still shipped.

Research shows automated security scanning is “faster, more consistent, and has a more comprehensive scope” than human review. Humans get fatigued. We miss patterns. Even when we catch issues, we can’t enforce fixes. Automated tools don’t have those limitations.

For exhaustive, deterministic checking, automation wins.
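To make that concrete, here is a minimal Python sketch, not the actual identity-system code, of the pattern a static analyzer can flag deterministically, next to the fix a quality gate could have enforced:

```python
import socket

def check_upstream_leaky(host: str, port: int) -> bool:
    # Opens a socket but never closes it. Under load, unclosed sockets
    # exhaust file descriptors, exactly the deterministic pattern a
    # static analyzer can flag and a quality gate can block.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    return sock.connect_ex((host, port)) == 0  # socket leaks on return

def check_upstream(host: str, port: int) -> bool:
    # The fix: a context manager guarantees cleanup on every path,
    # so correctness no longer depends on a reviewer noticing.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) == 0
```

No judgment call separates these two functions. The leaky version is a pattern a tool can reject every single time.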

The Division of Labor

| Confidence Type | What It Validates | Best Validator | How |
| --- | --- | --- | --- |
| Behavior | Does it do the right thing? | Humans | Spec review, test plan review |
| Design | Is it structured for change? | Humans | Architecture review, design assessment |
| Quality Attributes | Does it meet non-functional requirements? | Automation | Security scans, performance tests, linters |

This isn’t humans versus automation. It’s humans plus automation, with the right validator for each dimension.

Humans excel at: Context understanding. Business logic assessment. Design judgment. Understanding what “correct” and “appropriate” mean for this specific problem.

Automation excels at: Consistency. Exhaustive pattern matching. Deterministic validation. Checking whether code meets measurable criteria.

The best approach, as multiple research studies conclude, “is a combination of both manual and automated tools, leveraging the strengths of each.”

The Academic Grounding

Software engineering academics have known this for decades.

In Software Architecture in Practice, Len Bass, Paul Clements, and Rick Kazman introduced the quality attributes framework, grounded in a simple premise: “A quality attribute, by definition, should be measurable and testable.”

They were not talking about vague notions of “good code.” They meant specific, observable properties beyond basic functionality, such as security, performance, availability, and modifiability.

The framework made a clear distinction between two concerns:

  • Functional behavior: what the system does
  • Quality attributes: how well it does it

This distinction has existed in software engineering literature since the 1980s.

Practitioners mostly ignored it, not because it was wrong, but because it did not matter operationally. When a human wrote a few hundred lines of code over several days, a single review could validate both behavior and quality attributes in one sitting.

AI changed that.

At AI velocity, you can no longer validate everything through human review. Pull requests are too large, and the cognitive load is too high. The distinction between behavior and quality attributes becomes unavoidable.

As Bass later summarized: “Whether a system will exhibit its desired quality attributes is substantially determined by its architecture.” That means quality attributes should be validated systematically through measurement and testing, not through human inspection of implementation details.

What was once a theoretical distinction has become an operational requirement.

How This Changes Development Workflow

The breakthrough: Humans only review code that has already passed tests and automated quality gates.

When an AI agent implements a feature, it doesn’t immediately request human review. Instead, it enters an autonomous feedback loop: write code, run tests, analyze failures, adjust implementation, run tests again.

The agent iterates until tests pass, security scans are clean, linters approve, and quality gates turn green. Only then does human review begin.
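In code, the loop is roughly the following simplified Python sketch. The callables stand in for whatever agent framework and CI tooling you actually use; the names are illustrative, not any specific product’s API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    passed: bool
    report: str

def iterate_until_green(
    spec: str,
    generate_patch: Callable[[str, str], str],  # (spec, feedback) -> candidate patch
    run_checks: Callable[[str], CheckResult],   # tests + security scans + linters
    max_attempts: int = 10,
) -> str:
    """Return a patch only after automated validation passes."""
    feedback = ""
    for _ in range(max_attempts):
        patch = generate_patch(spec, feedback)  # agent writes or revises code
        result = run_checks(patch)              # behavior and quality-attribute gates
        if result.passed:
            return patch                        # only now does human review begin
        feedback = result.report                # failures feed the next iteration
    raise RuntimeError("Quality gates still failing after max attempts; escalate to a human")
```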

Research from FSE 2025 validates this: agentic systems that iterate based on feedback outperformed single-shot approaches by 21%. The SWE-bench Verified benchmark jumped from 33% (August 2024) to 70%+ (2025). The difference? Feedback loops and autonomous iteration.

This creates an economic shift in code review:

Traditional model: Humans review potentially broken code at 200 to 400 lines per hour, trying to catch bugs, assess design, and check quality attributes simultaneously.

New model: AI agents iterate autonomously until automated quality gates pass. Humans review specs and already-validated implementations for design appropriateness only.

You’re not reviewing to find bugs. The tests found them. You’re not checking for security issues. The scanners caught them. You’re asking: “Is this design appropriate for our needs?”

That’s what humans are actually good at.

This is why enterprise adoption is accelerating:

  • Amazon Kiro: 250,000+ developers using spec-driven workflow (Specify → Plan → Execute)
  • GitHub Spec-Kit: 55,000+ stars demonstrating rapid adoption
  • Thoughtworks Technology Radar: Featured spec-driven development as an emerging practice (November 2025)

The pattern is consistent: humans define behavior and design confidence through specs and design review. Automation validates quality attributes through continuous testing. AI agents iterate autonomously in between.

Autonomous quality gates block human review until validation succeeds.

That socket leak that took down my identity system? With proper quality gates, the static analyzer would have flagged the resource leak. The build would have failed. The code would never have reached human review. I wouldn’t have needed to notice it because the system would have enforced the fix automatically.

We didn’t arrive at this framework academically. We backed into it while trying to make AI-assisted teams workable at scale, and only later recognized the pattern in existing research.

Putting This Into Practice

Once you see this framework, you can’t unsee it.

Code review wasn’t failing. It was revealing its hidden structure under AI velocity pressure.

The question isn’t “Should I trust AI code?” It’s “Which confidence dimension am I validating, and what’s the right tool for that?”

1. Diagnose Your Current State

Look at your last ten code review comments. Categorize them:

  • How many were about functional correctness? (Behavior)
  • How many were about design quality? (Design)
  • How many were about security, performance, or standards? (Quality attributes)

You’ll likely find the same 75% pattern: most comments focus on design and architecture, not behavior or bugs.

That’s not a problem. That’s revealing what code review was always doing.
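If you want the tally to be explicit, a throwaway script is enough. The comments below are made-up examples; substitute your own last ten.

```python
from collections import Counter

# Made-up sample comments, each tagged with the confidence dimension it addresses.
comments = [
    ("Can we split this module along the billing boundary?", "design"),
    ("This retry loop never backs off", "behavior"),
    ("Search endpoint is missing input sanitization", "quality attributes"),
    ("Extract this into the shared HTTP client", "design"),
    ("Naming here hides what the job actually does", "design"),
]

tally = Counter(dimension for _, dimension in comments)
total = len(comments)
for dimension, count in tally.most_common():
    print(f"{dimension}: {count}/{total} ({count / total:.0%})")
```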

2. Separate Concerns in Your Workflow

Before AI writes code:

  • Humans write specs that capture intent and context
  • Humans review and approve test plans (what “correct” means)
  • Humans review and approve design decisions (architectural approach)

When AI generates code:

  • AI agents iterate autonomously until tests pass
  • Automated quality gates run (security scans, linters, performance tests)
  • Human review is blocked until all automated validation passes

During human review:

  • Focus on design appropriateness, not bug hunting
  • Ask: “Is this design right for our needs?”
  • Leave quality attribute checks to automation

3. Set Up Autonomous Quality Gates

Progressive gates in your CI/CD pipeline:

  1. Linters (code standards)
  2. Security scans (SAST tools)
  3. Test suites (behavior validation)
  4. Coverage checks (completeness)

Block human review until all gates pass. No exceptions.
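A gate runner can be as simple as the following Python sketch. The specific tools shown (ruff, bandit, pytest, coverage) are examples, not a prescription; substitute whatever your pipeline already uses. The point is that the pipeline, not a reviewer’s memory, enforces the outcome.

```python
import subprocess
import sys

# Gates run in order; the build fails on the first red gate, so the
# pull request never reaches human review until every gate is green.
GATES = [
    ("lint", ["ruff", "check", "."]),                         # code standards
    ("security", ["bandit", "-r", "src", "-q"]),              # SAST scan
    ("tests", ["pytest", "-q"]),                              # behavior validation
    ("coverage", ["coverage", "report", "--fail-under=80"]),  # completeness
]

def main() -> int:
    for name, command in GATES:
        print(f"Running gate: {name}")
        if subprocess.run(command).returncode != 0:
            print(f"Gate '{name}' failed: blocking human review")
            return 1
    print("All gates green: ready for human design review")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```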

This eliminates a large share of the time developers spend debugging AI-generated security issues.

4. Shift Your Mental Model

Stop treating automated testing as “nice to have.” It’s the proper validator for quality attributes.

Stop trying to catch everything in code review. Use code review for what humans do best: design assessment.

Recognize that AI agents should iterate autonomously until quality gates pass, not immediately escalate to human review.

The Model Everything Else Depends On

If you separate concerns intentionally, you get AI velocity plus quality: humans for behavior and design, automation for quality attributes.

If you blindly adopt AI-generated code without understanding which confidence dimension you’re validating, you’ll get a higher churn rate and persistent security vulnerabilities.

The confidence dimensions framework explains why spec-driven development works. Not as a trend, but as the logical outcome of assigning the right validator to each dimension.

This is how you maintain confidence at AI velocity.