The productivity gains from AI coding tools are real. What those gains do not yet demonstrate is proportional improvement in software quality, product judgment, or organizational learning. That gap is worth understanding before the bill comes due.
Key findings:
- Anthropic's CEO puts the productivity impact of AI coding tools at roughly 15–20% — a conservative estimate from someone with every reason to claim higher. Anthropic's own engineers self-reported a 50% boost. The 30-point difference between the two numbers is the starting point for a more honest conversation about what these tools are actually doing.
- A randomized controlled trial found experienced developers working on complex codebases they had contributed to for years were measurably slower with AI tools — while still believing they were faster.
- Goldman Sachs finds no meaningful relationship between AI adoption and productivity at the economy-wide level, despite real gains in specific, targeted use cases.
- The risk is not the tools. It is the competitive pressure that makes thoughtful adoption feel like a liability — and careless adoption feel like boldness.
- The friction of building together — grooming sessions, code reviews, the developer who says "these requirements contradict each other" — was producing something beyond software. AI agents don't replicate that byproduct.
What does "vibe coding" mean — and why does the definition matter?
The confusion starts with the word. In February 2025, Andrej Karpathy coined the term "vibe coding" in a tweet — a casual description of a weekend workflow where you fully surrender to the AI, accept all diffs unread, and stop caring whether you understand the code. He scoped it explicitly to throwaway projects.
What happened next is instructive. Karpathy himself has since said he prefers "agentic engineering" for serious work — not because he retreated from AI-assisted coding, but because he thinks the practice deserves oversight, rigor, and craft. In a March 2026 interview on the No Priors podcast, he said he hadn't typed a line of code since December.
Meanwhile, Gene Kim and Steve Yegge published Vibe Coding: Building Production-Grade Software (2025), defining a rigorous professional practice under the same name, with a foreword by Dario Amodei. Elsewhere, the term circulates as a pejorative, as a badge of pride, as a description of democratization, and as a near-synonym for AI-assisted development generally.
The definitional chaos matters because it allows very different practices to circulate under the same label. A team using AI with disciplined review, clear ownership, and careful verification can sound linguistically similar to a team accepting generated output with minimal scrutiny. The term is slippery; the governance choices are not.
This post is not an argument against AI-assisted development. It is a question about what happens when the cultural pressure to adopt fast, adopt broadly, and announce transformation causes organizations to believe they are doing the careful thing while actually doing the careless one.
What does the evidence show about AI coding productivity?
The most credible estimates are more modest than the headlines suggest — and the gap between perception and reality is striking.
Dario Amodei, speaking on the Dwarkesh Podcast in February 2026, gave probably the most honest estimate available from someone with skin in the game: he put the productivity impact of AI coding tools at roughly 15–20%, a deliberately conservative figure from someone who would have every reason to claim higher.
A separate Anthropic internal study, published in late 2025 and based on surveys of 132 of the company's own engineers, found self-reported productivity gains of around 50%. The two numbers measure different things: one is a broad, deliberately conservative estimate of overall impact; the other is self-reported gains by engineers at an AI company with privileged early access to the tools. Neither is precise. But a 30-point spread between two rough estimates still says something, particularly when most AI-driven headcount decisions are built on felt productivity rather than anything independently assessed.
METR's randomized controlled trial — among the most rigorous independent studies of AI coding tools conducted so far — focused on exactly the context most relevant to engineering leaders: experienced developers working in large, complex, mature codebases they had contributed to for years. In that setting, developers took 19% longer with AI tools than without them, despite predicting a 24% speedup beforehand. More telling still: after completing the study, they estimated AI had made them 20% faster, while the objective data showed the opposite.
METR is careful to note this finding may not generalize beyond its setting. AI tools may perform differently for less experienced developers, unfamiliar codebases, or greenfield work. But the perception gap — the systematic disconnect between how fast developers believe they are working and how fast they actually are — applies much more broadly. It is this gap, more than the slowdown itself, that explains much of what is driving the current AI productivity debate.
Goldman Sachs economists, in a Q4 2025 analysis, found "no meaningful relationship between productivity and AI adoption at the economy-wide level." The same analysis found roughly 30% productivity gains in specific, localized use cases — software coding and customer service. The AI productivity story is real in pockets. It is not yet the broad transformation being claimed.
What the current evidence supports is narrower than both the hype and the backlash. AI coding tools can produce meaningful gains in specific contexts. They also create a recurring perception gap, where subjective acceleration exceeds measured improvement. What the evidence does not yet establish is that faster code production reliably translates into proportional gains in software quality, product judgment, or team learning.
How does the layoff narrative hold up against the evidence?
Some of the companies loudest about cutting headcount because of AI productivity are also the ones where the evidence is thinnest.
Klarna is the most instructive case — reported in full by Fortune. The company claimed its AI assistant was performing the work of 700 customer service agents. Within months, customers were complaining about generic responses and the lack of human support, and the company resumed hiring human staff. CEO Sebastian Siemiatkowski was direct about what happened: "As cost unfortunately seems to have been a too predominant evaluation factor… what you end up having is lower quality." Klarna is not proof that AI replacement strategies always fail. It is a vivid example of what gets missed when volume handled is mistaken for quality delivered.
Klarna is the most documented case — but it is not isolated. Across 2025 and 2026, multiple companies announced AI-driven headcount reductions while analysts and insiders pointed to pandemic over-hiring as the simpler explanation. The market rewarded the AI framing in almost every case regardless.
An IBM CEO study of 2,000 executives found that just one in four AI projects delivers the return on investment it promised. Yet nearly two-thirds say the risk of falling behind drives them to invest before they have a clear understanding of the value.
That last statistic is worth sitting with. The primary driver of AI investment is not evidence of returns — it is anxiety about what happens if competitors move faster. That anxiety is understandable and, in some contexts, rational. But it creates the conditions for a particular kind of collective self-deception: organizations announcing AI-driven transformations not because the transformation has occurred, but because announcing it is competitively necessary. The narrative travels faster than the results.
And by the time the results become clear, the narrative has already shaped hiring decisions, investor expectations, and strategic commitments that are hard to unwind.
The tools are real. The narrative is doing work the tools are not.
Why is faster code production not the same as better software?
This is the more important question — and it matters beyond the productivity debate.
Open the Claude app on your iPad. It works. It is also, if you use it regularly, clearly a product that has not been thought about hard from the perspective of someone who holds an iPad in their hands and wants something to feel right.
This is not primarily a code-generation problem. It is a judgment problem: someone had to decide what the app should be, what it should feel like to use, and what would make you reach for it rather than the browser. Better code generation may speed implementation, but it does not by itself solve that layer of decision-making. Solving it requires sustained attention to what other human beings actually experience — and that kind of attention is slow, uncertain, and cannot be vibed into existence.
When production becomes cheap and fast, the economic pressure to think carefully before building is reduced. You can ship something plausible in days and iterate. The compounding effect is that this logic actively discourages the investment in genuine user understanding that makes software worth using. Why spend weeks in careful user research when you can launch and see what happens? The trouble is that "see what happens" only produces useful learning if you've already defined what you're looking for and why. Without that, what launches reveals less than you expect, and fixing what you find costs more than you planned. By the time that becomes clear, the decisions are already made. It is the same instinct in a different form: optimize for what's visible and fast, defer what's slow and hard to measure.
This tendency — optimizing for the exciting metric while neglecting the enabling mechanism — is not new. When a customer complained about the brakes on his Type 35 racing car in the 1920s, Ettore Bugatti reportedly replied: "I make my cars to go, not to stop." The quote is approximately 100 years old. The thinking it represents is evidently eternal. Bugatti's cars were beautiful, powerful, and notoriously underbraked. They went very fast until they didn't.
In 2023, Gene Kim and Steven Spear published Wiring the Winning Organization, a study of high-performing organizations that named the antidote: slowification. Taking time to plan, prepare, and build shared understanding before executing is not the opposite of speed — it is what makes speed sustainable. The organizations that invested in deliberate preparation consistently outperformed those that treated every moment of deliberation as a cost to be eliminated. Kim went on to co-author Vibe Coding: Building Production-Grade Software — arguing for exactly this kind of rigorous, deliberate practice. The principle he identified in 2023 applies directly to how that development should be approached.
The ceiling has lifted on production. The ceiling on understanding hasn't moved.
This is not a new pattern. Every generation of better tooling — more expressive languages, richer frameworks, cloud infrastructure — made it cheaper and faster to build things without understanding them. The low-code movement promised to democratize software creation and largely delivered software that was fast to create and frustrating to use. AI coding tools are the most powerful instance of this pattern so far, not its invention.
The gap between what organizations can build and what they genuinely understand has been reproduced at every level of abstraction. It is now being reproduced faster than ever.
What gets lost when AI agents replace human developers?
The judgment that makes software good was never held only by product managers. It was distributed — across the team, through the friction of building together.
Not all friction is valuable. Some of it is bureaucracy, latency, and avoidable drag. But some of it is where collective understanding gets produced — and the current wave of automation makes it easy to remove both at once.
This isn't purely theoretical. Across engineering communities right now, something is showing up consistently in how developers describe their own experience: shipping faster, feeling more productive, and quietly losing grip on what the system is actually doing and why. The research above shows the pattern at the macro level; developers' own accounts show it on the ground.
When a developer pushed back in a grooming session — "I can't build this because two of these requirements directly contradict each other" — the organization was forced to confront something it had successfully avoided. That discomfort was not a failure of the planning process. It was the planning process working. The requirement had been written in a way that looked coherent until someone actually tried to implement it. The developer's confusion was the organization's mirror.
When a code review happened between a senior and a junior engineer, both came away with something they didn't have before. The junior learned how to think about a class of problem. The senior had to articulate something they had previously only known tacitly — and the act of articulating it sharpened it. The conversation was the learning. The code was its residue.
This distinction matters more than it might appear. Michael Polanyi spent a career making it precise. "We know more than we can tell," he wrote in The Tacit Dimension — and the corollary is that what we can tell is always less than what we actually know.
Understanding built through practice is different in kind from information acquired through reading or instruction. You cannot read your way into knowing how a complex system fails under load, or how a particular kind of user encounters a particular kind of confusion. You have to have been there — to have held the model, seen it break, and built the revised understanding from the wreckage of the previous one.
That is what the grooming session and the code review were producing, as a byproduct of the friction, whether anyone named it or not.
An AI agent does not produce these byproducts. It proceeds without complaint. The agent's compliance is not evidence that your requirements were clear. It is evidence that the agent does not complain.
You can review an AI agent's code. You cannot have that conversation with it. The agent will not carry the review forward as understanding. The junior developer who would have been formed by years of such exchanges is increasingly not being hired. And the organization that replaced those exchanges with generated code has removed a mechanism it may not have known it depended on — one that was doing something quite different from producing software.
The question worth asking
The relevant metric is not just how much faster you are shipping. It is whether gains in production are being matched by gains in clarity, judgment, and learning.
This is not an argument against AI tools. Used well — with the oversight, craft, and judgment that Karpathy, Kim, and Yegge all insist upon — these tools are genuinely valuable. The question is whether the current moment is creating the conditions for that kind of use. The competitive anxiety, the pressure to ship before thinking, the growing sense that slowing down to understand something is a form of underperformance: none of this is produced by the tools themselves. It is the cultural air that currently surrounds them.
Racing drivers understand that brakes are not for slowing down. They are for enabling commitment — for making it possible to push harder into corners because you trust your ability to stop if something goes wrong. The friction that teams are currently removing in the name of velocity was often exactly this: not overhead, not inefficiency, but the mechanism that made going fast possible in the first place. The rigorous practitioner of AI-assisted development knows this. The organization swept up in competitive anxiety often doesn't.
Shipping more code is not the same as building better software — and it never was. Code was always an artifact — the visible residue of the understanding that produced it, not the understanding itself. What has changed is not the confusion between artifact and understanding. It is how fast and cheap it has become to sustain that confusion at scale.
The question worth asking — honestly, specifically, about your own team — is which kind of adoption you are actually doing. That question usually benefits from an outside perspective — someone who has seen both kinds and knows what to look for.
These questions don't have clean answers — and anyone claiming otherwise probably isn't asking them seriously enough. At DoiT, we've worked alongside hundreds of engineering and technology leaders navigating exactly this tension — figuring out what to keep, what to change, and what they were about to lose without realizing it. We've found that the right questions, asked with the right people, tend to produce better answers than any framework can. If you're seeing something we're not, or navigating this in ways that are holding up, we'd like to hear about it.
Frequently asked questions
What are the different meanings of "vibe coding"? The term has at least three distinct meanings in current use. In its original sense — coined by Andrej Karpathy in February 2025 — it described a casual, surrender-based workflow appropriate for throwaway projects. Gene Kim and Steve Yegge subsequently defined a serious professional practice under the same name in their 2025 book Vibe Coding: Building Production-Grade Software, emphasizing rigor, oversight, and craft. In common usage it also serves as a general synonym for AI-assisted development, and alternately as a pejorative for careless AI adoption. The definitional chaos matters: when the same term describes both mindless acceptance of AI output and disciplined agentic engineering, meaningful conversation about what any particular team is doing becomes very difficult.
Can AI coding tools improve productivity without improving software quality? Yes. The evidence suggests they already are doing so at scale. AI tools can measurably increase code output and shorten delivery cycles without improving the judgment that determines what gets built, the clarity of requirements, or the depth of user understanding behind the product. Goldman Sachs finds real productivity gains in software coding as a specific use case. What it does not find — and what the broader evidence does not yet show — is that those gains translate into proportional improvements in software quality or product outcomes.
What did the METR AI coding study actually find? METR's 2025 randomized controlled trial tested 16 experienced developers on 246 real tasks across large, mature open-source codebases they knew well. Developers using AI tools took 19% longer to complete tasks than without, despite predicting a 24% speedup beforehand and still believing they were faster afterward. METR notes the finding is specific to this context and may not generalize to less experienced developers or unfamiliar codebases. The perception gap — systematically believing AI is helping when the data shows otherwise — is the most broadly applicable finding.
Why is there such a large gap between self-reported and measured AI productivity? Two data points illustrate the gap from different angles. In his February 2026 podcast appearance, Dario Amodei estimated the productivity impact of AI coding tools at roughly 15–20%. A separate Anthropic internal study from late 2025 found that Anthropic's own engineers self-reported a 50% productivity boost — a different measurement, but the gap between the two is instructive. METR's study found a similar perceptual pattern: developers believed AI had sped them up by 20% while objective measurement showed a 19% slowdown. METR's own analysis of the gap points to several factors: AI-assisted coding requires less cognitive effort and feels faster even when it isn't; developers may be confusing ease with speed; and time spent reviewing, correcting, and cleaning up AI-generated code is typically not counted in self-reports.
What happened when Klarna replaced customer service agents with AI? Klarna replaced approximately 700 customer service roles with an AI assistant developed with OpenAI, claiming the AI was performing equivalent work. Within months, customers were complaining about generic responses and the lack of human support, as AI agents struggled with complex, nuanced, and emotionally charged interactions. CEO Sebastian Siemiatkowski acknowledged the mistake directly: "As cost unfortunately seems to have been a too predominant evaluation factor… what you end up having is lower quality." The company resumed hiring human customer service staff in 2025. The case is instructive not just as a failure story but as an illustration of what gets lost when the metric is volume handled rather than understanding demonstrated.
What is slowification, and why does it matter for software teams? Slowification is a concept from Gene Kim and Steven Spear's 2023 book Wiring the Winning Organization. It describes the deliberate investment in planning, preparation, and shared understanding before execution — slowing down in order to move faster and more reliably downstream. Their research across high-performing organizations found this was a consistent differentiator: teams that slowified consistently outperformed those that treated every moment of deliberation as a cost. The principle applies directly to software development: the grooming session, the careful user research, the code review that felt slow were often doing exactly this — producing the shared knowledge that confident execution depends on.
Can AI tools increase output while reducing understanding? Yes — and this is arguably the more important question than whether they increase productivity. Output and understanding are produced by different mechanisms. Code is produced by execution; understanding is produced by the friction of working through problems together — the developer who pushes back on contradictory requirements, the code review where a senior engineer has to articulate something they previously only knew tacitly, the grooming session where a team discovers that what seemed like a clear specification isn't. AI agents proceed without generating these byproducts. They don't complain, clarify, or carry the conversation forward as learning. An organization can ship more software while simultaneously losing grip on what its systems are doing, why they were built the way they were, and what its users actually need. The output rises. The understanding that should accompany it doesn't keep pace.
Does Goldman Sachs find AI has no productivity impact? Not exactly. Goldman Sachs economists, in a Q4 2025 analysis, find no meaningful relationship between AI adoption and productivity at the economy-wide level. The same analysis found roughly 30% productivity gains in specific, localized use cases — software coding and customer service being the two most cited. The picture is one of real gains in specific contexts that have not yet translated into broad economic productivity improvement. That distinction matters for how organizations should think about their AI investments: targeted adoption in contexts where the gains are well-evidenced is different from treating AI as a general-purpose productivity multiplier.