
AI can be highly capable and still fundamentally unreliable. That difference matters most where mistakes can’t be undone.

In December 2025, an Amazon engineer asked Kiro, the company’s AI coding assistant, to fix a minor issue in AWS Cost Explorer. Small task. Routine.

Kiro deleted the production environment and rebuilt it from scratch. The outage lasted thirteen hours and affected one of Amazon’s China regions. The Financial Times reported the incident in February 2026, based on internal accounts from multiple anonymous AWS employees.

Amazon’s official explanation was user error: the access controls had been misconfigured. That explanation is accurate. It is also insufficient. Kiro had been granted operator-level permissions across production systems. There was no destructive-action blocklist. No requirement for peer review before production changes. The agent optimized for its goal: fix the system. It did exactly that, in the most complete interpretation available to it.

Kiro was not inaccurate. It violated an invariant that had never been specified.

Context matters here. Weeks before the incident, Amazon’s senior leadership had issued what engineers called the Kiro Mandate: an internal memo establishing Kiro as the standard AI coding assistant company-wide, with an 80% weekly usage target. By January 2026, 70% of Amazon’s engineers had used it during at least one sprint. The organization was moving fast to deploy a powerful tool. The safety architecture had not kept pace.

This is an industry-wide pattern: organizations making irreversible architectural decisions under the assumption that AI reliability will catch up to AI capability. Most of these conversations are framed around accuracy. The missing term is reliability. They are not the same thing, and the gap between them is not closing.

The part worth naming: AI has produced genuine productivity gains in specific domains. Software development is the clearest case. Developers writing code faster, reviewing more, shipping more. The gains are real. But the domains where AI has delivered most convincingly share a built-in checkpoint: human review is embedded by default. Code doesn’t ship because an agent wrote it. It ships after review, CI, testing, and deployment gates. The moment agents have direct production access, that checkpoint disappears. You are no longer in augmentation. You are in autonomous execution with real consequences.

There is a way to do this safely. It requires separating planning from execution. AI as planner over deterministic executors: AI proposes decisions, deterministic systems enforce invariants, humans approve actions that cross defined boundaries. Kiro needed exactly this: a destructive-action blocklist, bounded execution authority, architectural enforcement of what it could and could not do unilaterally. That work had not been done.

The rest of this piece explains why the reliability ceiling is structural, why adding validation layers doesn’t solve it, and what the planner-executor architecture looks like in practice.

Accuracy and Reliability Measure Different Things

Here’s the problem: accuracy is the metric used to evaluate AI, but it is the wrong metric for transactional deployment.

A system processing one million transactions daily at 99% accuracy produces 10,000 errors per day. That alone is enough to end careers and licenses. But the deeper problem is what “accuracy” measures: an average across all transaction types. A model can be highly accurate on common, well-represented cases and completely unreliable on rare ones. In transactional systems, rare cases are not edge cases to be tolerated. They are the cases that matter most: the fraud pattern seen for the first time, the unusual refund sequence, the account state that only occurs under specific combinations of prior transactions.
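The arithmetic is worth making concrete. A minimal sketch, using the article's numbers and assuming errors scale linearly with volume:

```python
# Expected errors per day at a given accuracy, assuming errors
# scale linearly with volume (illustrative, not a benchmark).
daily_volume = 1_000_000

for accuracy in (0.95, 0.99, 0.999):
    errors_per_day = daily_volume * (1 - accuracy)
    print(f"{accuracy:.1%} accurate -> {errors_per_day:>8,.0f} errors/day")
# 95.0% accurate ->   50,000 errors/day
# 99.0% accurate ->   10,000 errors/day
# 99.9% accurate ->    1,000 errors/day
```

Even two extra nines leave a thousand failures a day at this volume, and an average says nothing about which transactions they land on.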

Researchers at OpenAI found the structural reason for this (Kalai et al., 2025): hallucinations are not random noise. They are provably tied to the singleton rate in training data. A model hallucinates most on what it has seen least. You cannot train this away without infinite data. The long tail of transaction types is precisely where probabilistic models fail most consistently.

Most practitioners carry a number: 95% accuracy is the threshold for production-ready AI. Consider that standard in a different context. If an API failed 5% of the time, you would not ship it. Software engineers hold production systems to 99.99% uptime, which allows 52 minutes of downtime per year. We do not accept 95% reliability anywhere else in production. Yet somehow 95% accuracy became the AI equivalent of “good enough.”

Where did it come from? Not from transactional systems. It congealed from two places: the 95% confidence interval embedded in statistical culture (a convention that traces back to Fisher's 5% significance threshold), and early AI successes in augmentation contexts (spam filters, image classifiers) where 95% genuinely was sufficient. In those systems, a wrong answer is recoverable and a human remains downstream.

The problem isn’t that people deployed AI at 95% accuracy into autonomous systems. The problem is that 95%, and the benchmark culture it represents, trained everyone to ask the wrong question. “Is this accurate enough?” is the right question for a spam filter. It is the wrong question for a payments workflow. The right question is: can this system guarantee behavior on specific invariants? Benchmark scores do not answer that. They were never designed to.

There is also a structural problem with treating 95% as the goal. The errors that remain at 99% accuracy are not the same kind as the errors at 95%. They are the hardest ones: rare inputs, unusual transaction types, cases the model has seen least. Each additional percentage point requires eliminating the most structurally resistant failures. The OpenAI finding above is precise about why: hallucinations cluster at the singleton rate in training data. The long tail does not shrink as benchmark scores improve. You’re not climbing toward 100%. You’re running into a ceiling.

Step back: capability is improving quickly. Reliability isn’t. That gap is the constraint. Everything that follows comes from that mismatch.

Here’s the implication. Human-built deterministic systems make errors too, but they offer something probabilistic models cannot: guarantees. A deterministic accounting system enforces invariants mechanically: never debit without a corresponding credit, never process a duplicate transaction, never apply a write that leaves the ledger inconsistent. These are not soft guidelines. They are properties the system cannot violate. A probabilistic model can perform well on average. It cannot commit to correct behavior on specific invariants.
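To make "mechanical" concrete, here is a minimal sketch of a ledger that enforces those three invariants in code. The class and names are illustrative, not any real accounting system's API:

```python
class LedgerError(Exception):
    pass

class Ledger:
    def __init__(self):
        self.seen_txn_ids = set()
        self.entries = []  # (txn_id, account, amount); negative = debit

    def post(self, txn_id, legs):
        # Invariant 1: never process a duplicate transaction.
        if txn_id in self.seen_txn_ids:
            raise LedgerError(f"duplicate transaction {txn_id}")
        # Invariant 2: debits and credits must balance to zero.
        if sum(amount for _, amount in legs) != 0:
            raise LedgerError("unbalanced legs; rejecting the whole write")
        # Invariant 3: the write is atomic -- validated first, then applied.
        self.entries.extend((txn_id, account, amount) for account, amount in legs)
        self.seen_txn_ids.add(txn_id)

ledger = Ledger()
ledger.post("t1", [("cash", -100), ("revenue", +100)])  # accepted
# ledger.post("t1", [...])          -> raises: duplicate transaction
# ledger.post("t2", [("cash", -1)]) -> raises: unbalanced legs
```

No amount of upstream confidence changes what `post` will accept. That is the difference between a guideline and an invariant.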

Think of the difference between a bridge engineer and a weather forecaster. The forecaster gives you an 80% chance of rain. The bridge engineer does not give you an 80% chance the bridge stays up. The standards differ because the failure modes differ. Transactional systems are built on bridge-engineer standards. Probabilistic AI operates on forecaster standards. The question was never how good the forecast is. It was whether you need a forecast or a guarantee.

The Gap Is Measurable. And It Is Not Closing.

What makes the Princeton research useful is that it gives the problem a vocabulary.

“Towards a Science of AI Agent Reliability” (Rabanser, Kapoor, Narayanan et al., 2026) proposes four dimensions for measuring reliability beyond accuracy: consistency (does the agent behave the same way on repeated tasks?), robustness (does it hold up under non-ideal conditions?), calibration (does it know what it doesn’t know?), and safety (when it fails, does it fail gently?).

The results are specific. The best-performing model in the study, Claude Opus 4.5, achieves 73% consistency. That means roughly one in four tasks produces a different outcome when given the same input twice. On general benchmarks, reliability is improving at half the rate of accuracy. On customer service benchmarks, the kind of transactional, policy-following work most production agents actually do, reliability is improving at one-seventh the rate.
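The paper's exact metric may differ, but one simple way to operationalize consistency makes the 73% figure concrete: run each task several times and count the fraction of tasks whose outcomes all agree.

```python
# One way to operationalize consistency (not necessarily the
# paper's exact definition): fraction of tasks whose repeated
# runs all produce the same outcome.
def consistency(runs_per_task: list[list[str]]) -> float:
    """runs_per_task[i] holds the outcomes of repeated runs of task i."""
    agreeing = sum(1 for runs in runs_per_task if len(set(runs)) == 1)
    return agreeing / len(runs_per_task)

print(consistency([["refund", "refund"], ["approve", "deny"], ["hold", "hold"]]))
# 0.666... -- the middle task flipped its answer on the second run
```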

The researchers frame the implication directly: “For automation, reliability is a hard prerequisite for deployment: an agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system.”

The pharmaceutical analogy is instructive. A drug that works on 90% of patients and causes serious adverse events in 10% does not get approved at 90% efficacy. The distribution of outcomes matters, not the average. The customer whose transaction fails does not experience the 999 that succeeded.

The optimists will note that models are improving quickly. They are right. Capability benchmarks have improved dramatically. But the Princeton finding is specific: reliability is improving more slowly, by a factor of two on general tasks and seven on customer service tasks. There is no mechanism by which that gap closes automatically.

Chains Make It Worse Faster Than You Expect

Single-model accuracy is already a hard standard for transactional deployment. In practice, most production AI workflows chain multiple steps together. The math is not forgiving.

A 2025 study on AI-assisted medical diagnostics chained three tools: one with 90% accuracy, one with 85%, one with 97%. Individual performance looked reasonable across all three. Combined reliability: 74%. Roughly one in four patients received a misdiagnosis, not because any single tool was badly wrong, but because errors compound rather than cancel.

The general principle holds across domains. At 95% accuracy per step, a 20-step workflow produces end-to-end reliability of roughly 36%. This doesn’t require pessimistic assumptions about individual steps. It only requires that errors at each step be independent, which they generally are.
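The compounding is one line of arithmetic. A sketch under that independence assumption:

```python
# Error compounding across chained steps, assuming independent errors.

# The three-tool diagnostic chain from the study:
print(0.90 * 0.85 * 0.97)   # 0.742... -> roughly one failure in four

# Per-step accuracy p across an n-step workflow:
p, n = 0.95, 20
print(p ** n)               # 0.358... end-to-end
```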

Sierra Research’s tau-bench benchmark (2024) was designed specifically to measure the production-relevant version of this problem. Rather than measuring single-attempt accuracy, it uses pass^k: the agent must succeed on k consecutive attempts at the same task. GPT-4o scores 65% on single-attempt accuracy. On pass^8, eight consecutive correct completions, it drops below 25%: a relative decline of more than 60%. Most deployment decisions are made using the single-attempt number. They are measuring the wrong thing.
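For intuition, here is a sketch of a pass^k-style estimator, assumed by analogy with the standard pass@k combinatorics (tau-bench's exact formulation may differ): from n recorded attempts with c successes, estimate the chance that k sampled attempts all succeed.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that k sampled attempts all succeed,
    given c successes observed across n attempts at a task."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

print(pass_hat_k(8, 6, 1))  # 0.75 -- looks fine at k=1
print(pass_hat_k(8, 6, 8))  # 0.0  -- one failed attempt sinks pass^8
```

A task solved on six of eight attempts looks deployable by the single-attempt number and is hopeless by the all-attempts number. That gap is the whole argument.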

Watson Made the Same Mistake at $4 Billion Scale

Between 2015 and 2018, IBM Watson was deployed across 230 hospitals for cancer treatment recommendations. It had been trained on synthetic cases developed by oncologists at Memorial Sloan Kettering, learning MSK’s clinical preferences for a specific patient population.

At other hospitals, with different patients and clinical contexts, Watson made recommendations that reflected its training rather than what each patient needed. In one documented case, it recommended a treatment carrying an FDA black box warning against use in patients with serious bleeding, to a patient who had serious bleeding. The recommendation was not random. It was confident. It was a direct contraindication.

MD Anderson ended its Watson pilot after spending $62 million. IBM Watson Health was sold off entirely in 2022. Total investment: approximately $4 billion. What failed was not a broken model. It was a model deployed in contexts its training did not cover, with no architectural layer to enforce the constraints that mattered.

Kiro and Watson are the same failure in different domains. In both cases, the boundary between AI planning and system execution was not designed.

Guardrails Move the Problem. They Do Not Solve It.

The standard response to reliability failures is to add validation layers: output filters, safety classifiers, additional checks.

Here’s the structural problem: catching all model failures requires reliably detecting them. That requires one of three things: an oracle that knows the correct answer (in which case you don’t need the AI), a second probabilistic system to detect failures in the first (which introduces a new failure mode with its own accuracy ceiling), or a human reviewer (which reintroduces the bottleneck the AI was meant to remove).

There is also a mechanism problem. Reinforcement learning from human feedback trains models to express confidence regardless of accuracy: human raters prefer confident responses, so the reward model learns to prefer them. One 2025 evaluation found Kimi K2 with a calibration error of 0.726 despite only 23.3% accuracy, expressing extreme confidence on answers it got wrong three-quarters of the time. A guardrail looking for uncertainty signals will not catch this. The training process has suppressed that signal.
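The evaluation's exact metric is not specified here, but one simple gap measure shows why a number like 0.726 can coexist with 23.3% accuracy:

```python
# One simple miscalibration measure (not necessarily the evaluation's
# metric): the gap between average stated confidence and actual accuracy.
def calibration_gap(confidences: list[float], correct: list[bool]) -> float:
    avg_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return avg_conf - accuracy

# A model answering at ~96% confidence while right 23.3% of the time:
print(calibration_gap([0.96] * 1000, [True] * 233 + [False] * 767))  # ~0.727
```

A model like this emits no uncertainty signal to filter on. That is the point.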

Guardrails catch detectable failures: malformed outputs, out-of-range values, obvious contradictions. They miss confident-but-wrong answers on novel inputs. In transactional systems, that is the failure mode that matters.

(A reasonable objection: better training data and faster scaling might close this gap. The concern is timing. Architectural decisions made now are hard to reverse in two years. Design around the documented gap, not the assumed trajectory.)

Regulators have accepted the ceiling. The FDA’s Predetermined Change Control Plan for AI medical devices manages probabilistic drift within bounded parameters rather than guaranteeing reliability. It builds governance structures around the ceiling, not through it.

The Architecture Already Exists. It Just Needs to Be Applied.

AI as planner over deterministic executors. This pattern isn’t new. It answers the question guardrails can’t: can this system guarantee behavior on specific invariants?

The AI decides what to do: which transaction class, which action to take, which protocol to follow. A deterministic system executes the decision with guarantees. The executor enforces invariants. The planner navigates possibilities. Neither does the other’s job.

In practice, this means three things:

  • AI proposes decisions
  • Deterministic systems enforce invariants mechanically
  • Humans approve actions that cross defined boundaries

This is the pattern behind double-entry bookkeeping paired with AI fraud detection. The AI identifies suspicious patterns and flags transactions. The accounting system enforces that flagged transactions are held. The AI can be wrong; the accounting system cannot be overridden by the AI’s confidence. The invariant is mechanical.
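A minimal sketch of that boundary, with illustrative names rather than any real framework's API. Note that the planner's confidence is an input the executor deliberately ignores:

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    action: str        # what the planner wants to do
    txn_id: str
    confidence: float  # deliberately ignored by the executor

DESTRUCTIVE = {"release_held_funds", "delete_environment"}

def execute(p: Proposal, approved_by_human: bool = False) -> str:
    # Invariant: flagged transactions are held, no matter how confident
    # the planner is that they are fine.
    if p.action == "hold":
        return f"HELD {p.txn_id}"
    # Invariant: destructive actions never run on model output alone.
    if p.action in DESTRUCTIVE and not approved_by_human:
        return f"BLOCKED {p.action} on {p.txn_id}: human approval required"
    return f"EXECUTED {p.action} on {p.txn_id}"

print(execute(Proposal("release_held_funds", "t42", confidence=0.99)))
# BLOCKED release_held_funds on t42: human approval required
```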

Database researchers arrived at the same architecture from a different direction. SagaLLM (Chang and Geng, VLDB 2025) applies the Saga pattern from distributed systems, a classic approach to managing long-running transactions through compensating actions, to multi-agent LLM workflows. Each operation is paired with a rollback. Global invariants are enforced across agents. The researchers arrived at reliability not by making the LLM more reliable, but by wrapping it in deterministic transaction semantics.
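The core mechanism fits in a few lines. A sketch of the Saga idea with illustrative names, not SagaLLM's actual interface: each step carries a compensating action, and a failure unwinds everything already done, in reverse order.

```python
def run_saga(steps):
    """steps: a list of (do, undo) callables; undo compensates do."""
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
    except Exception:
        for undo in reversed(done):
            undo()  # compensate in reverse order
        raise

run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge card"),       lambda: print("refund card")),
])
```

If "charge card" raised, "release inventory" would run before the exception propagated. The LLM inside each step can be as unreliable as it likes; the transaction semantics around it are deterministic.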

Audit every AI system you already run with one question: is the execution layer mechanically enforced or just assumed?

If you’re integrating AI today, this is the minimum bar:

  • Define the explicit boundary between AI planning and system execution
  • Enforce that boundary in code, not policy
  • Enumerate destructive or irreversible actions that require human approval before execution
  • Pair each AI decision point with a deterministic executor that enforces invariants
  • Treat rollback capability as a requirement, not an afterthought

The Kiro incident makes the implication plain. The access controls were a partial implementation of this pattern. They were not a substitute for it.

What It Looks Like When It Works

Teams that have this right do not look like control rooms.

An engineer reviews a plan the AI has generated: here is what I intend to do, here are the systems I will touch, here are the actions I will not take. The plan is reviewable in five minutes because it describes decisions, not code. The engineer approves it. The agent executes within the approved scope. When it encounters something outside that scope, it stops and asks. The invariants are defined before the agent is invoked. They are enforced by the execution layer, not negotiated during the run.

The reliability ceiling is real. It is not closing on its own. The architectural decisions being made now will determine whether AI agents in transactional systems have bounded failure modes or unbounded ones.

Before the next integration decision, ask one question: where is the planning and execution boundary in this system, and is it enforced mechanically or assumed? If the answer is “assumed,” you don’t have an AI system. You have a failure waiting to happen.