Research Article — March 2026

Fast Is Not Enough

Why AI makes your engineering team faster but not better — and what the math says about the optimal team size in the age of AI-augmented development.

Hilmi Tolga Sahin
Engineering Manager
Curious, Engineer, Maker, Mathematician. In That Order...

Every engineering leader says the same thing: "My team ships code 3× faster with AI." They're right — sort of. The code is being written faster. But here's the uncomfortable truth buried in every dataset from 2024–2026: the software isn't getting delivered any better.

Despite 93% of developers now using AI coding tools, net productivity gains have plateaued at roughly 10% [1]. The most rigorous causal study available — METR's randomized controlled trial with experienced open-source developers — found AI tools actually made them 19% slower, even as those same developers perceived a 20% speedup [2].

That's a 39-percentage-point gap between belief and reality. And it sits at the heart of what I'm calling The Engineering Paradox: AI amplifies velocity while degrading the quality signals teams depend on — creating a system that feels faster but delivers less reliably.

This has massive implications for how we size teams. Should you go lean — a 2-person "skeleton crew" leveraging AI as a force multiplier? Or is there a mathematically optimal team size? I built six simulation models to find out. Each one is interactive — plug in your own numbers and see what happens.

1. Mathematical Mental Models

Before we run any simulation, we need shared language. I've synthesized three core models from the research that capture the dynamics at play in AI-augmented engineering teams.

Model 1: The Effective Output Equation

Effective Output Model
O_eff = (P × V) × (1 − R)
Where P = base productivity per engineer, V = AI velocity multiplier, R = rework rate (proportion of output that must be redone).

This looks deceptively simple — but the devil is in the interaction effects. When V goes up (AI makes you faster), R also tends to go up (AI-generated code has more defects). The CodeRabbit analysis of 470 real-world pull requests found AI-authored PRs contain 1.7× more issues overall [3], and GitClear's analysis of 211 million changed lines showed code churn doubled between 2021 and 2024 [4].

At a rework rate of ~67%, a team running at V=3.0× (heavy AI) produces less effective output than a team with no AI at all. The simulation below lets you see exactly where your team sits.

SIM 01 Effective Output Model

Explore how AI velocity multipliers interact with rework rates. At what rework rate does AI help become AI harm?

[Interactive chart: no-AI output (V = 1.0×) vs. AI-assisted output, with the crossover point marked. Defaults: V = 2.0×, R = 25%.]
O_eff = (P × V) × (1 − R)

Parameters: P (base productivity) is normalized to 1.0 for a single engineer. V (velocity multiplier) represents the throughput gain from AI tools — the DX Research framework across 450+ companies finds typical gains of 1.1×, with experienced devs on greenfield tasks reaching up to 3–5×. R (rework rate) represents the fraction of output that must be redone — CodeRabbit found 1.7× more issues in AI PRs, and GitClear documented a doubling of code churn.

Crossover point: Solve for R where AI output equals no-AI output: 1.0 × (1 − R₀) = V × (1 − R), giving R_crossover = 1 − (1 − R₀)/V. For R₀ = 12% (typical human rework rate), V=2.0 gives R_crossover = 56%. Beyond this, AI is net-negative.
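The crossover algebra is easy to sanity-check in code. A minimal sketch in Python, with P normalized to 1.0 and the 12% human-baseline rework rate used above:

```python
def effective_output(v, r, p=1.0):
    """O_eff = (P × V) × (1 − R): effective output per engineer."""
    return p * v * (1.0 - r)

def rework_crossover(v, r0=0.12):
    """Rework rate at which AI-assisted output drops back to the no-AI
    baseline: solve (1 − R₀) = V × (1 − R) for R."""
    return 1.0 - (1.0 - r0) / v

break_even = rework_crossover(2.0)      # ≈ 0.56, matching the worked example
heavy_ai = effective_output(3.0, 0.67)  # ≈ 0.99, below a zero-rework baseline of 1.0
```

The second line reproduces the ~67% claim from Model 1: at V = 3.0× and R = 67%, output falls just under what a single engineer with no AI and no rework would produce.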

Model 2: The Oversight Tax

Every AI-generated artifact demands a human decision: accept, reject, or modify. This creates a cognitive tax that compounds throughout the day:

Oversight Tax Formula
T_oversight = n_reviews × c_per_review × (1 + d / D_max)
Where n_reviews = AI outputs to review, c_per_review = cognitive cost per review, d = decisions already made today, and D_max = max decision capacity (~200–300/day per Baumeister's ego depletion research). The escalating term (1 + d/D_max) models increasing cost as fatigue accumulates.

Sonar's 2026 survey of 1,100 developers found that 96% don't fully trust AI-generated code, yet only 48% always verify it before committing [5]. That 48-point gap between distrust and verification is the Oversight Tax made visible.
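A sketch of the tax in Python, using a running-total variant of the formula in which each review itself consumes one decision; D_max = 250 is the midpoint of the 200–300/day range cited above:

```python
def oversight_tax(n_reviews, c_per_review, d_start=0, d_max=250):
    """Cumulative cognitive cost of reviewing n_reviews AI outputs,
    with the per-review cost escalating as the day's decision count
    grows: T = Σ c × (1 + d_i / D_max)."""
    total = 0.0
    d = d_start
    for _ in range(n_reviews):
        total += c_per_review * (1.0 + d / d_max)
        d += 1  # each accept/reject/modify is itself a decision
    return total

fresh = oversight_tax(100, 2.0)               # 100 reviews at 2 min each ≈ 240 min
tired = oversight_tax(100, 2.0, d_start=150)  # same queue, 150 decisions already spent
```

The same 100-review queue costs roughly 50% more in the afternoon than in the morning — the tax compounds even before review quality degrades.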

Model 3: The Reckoning Formula (Technical Debt)

Technical Debt Compounding Model
D(t) = D₀ × (1 + g)^t + Σ_{i=1..t} α × V_i × (1 − q_i)
Where D₀ = existing debt, g = natural debt growth rate, α = AI code proportion, V_i = code volume at period i, q_i = AI quality score at period i. The first term captures organic growth; the second captures additional debt from AI code.

A 2025 InfoQ analysis found AI-generated code creates entirely new categories of technical debt [6], and copy-pasted code rose from 8.3% to 12.3% while refactoring activity collapsed from 25% of changed lines to less than 10% [4].
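A minimal sketch of the compounding model. The parameter values below are illustrative assumptions, not figures from the research:

```python
def technical_debt(months, d0=100.0, g=0.02, alpha=0.4,
                   volume=1000.0, quality=0.5):
    """D(t) = D₀(1+g)^t + Σᵢ α·Vᵢ·(1−qᵢ): organic growth plus the
    debt added each period by AI-authored code of quality qᵢ.
    (Constant per-period volume and quality, for simplicity.)"""
    organic = d0 * (1.0 + g) ** months
    ai_added = sum(alpha * volume * (1.0 - quality) for _ in range(months))
    return organic + ai_added

d24 = technical_debt(24)  # with these inputs, AI-added debt dwarfs organic growth
```

The point of the sketch: the first term grows at a few percent per month, while the second term grows linearly with shipped volume — so at high AI code share, volume, not interest, dominates the debt balance.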


2. The Verification Bottleneck

Code generation used to be the bottleneck. AI removed it. But it didn't remove the next bottleneck — it made it worse. Humans can effectively review approximately 400 lines of code per hour and hit a cognitive wall after roughly 60 minutes of sustained review. These are hard biological limits that have not changed despite AI scaling production by orders of magnitude [7].

The fundamental asymmetry: Generation scales with silicon. Verification scales with neurons. Neurons don't get faster.

Salesforce Engineering documented this directly: after AI adoption, code volume increased 30%, PRs regularly expanded beyond 20 files and 1,000 lines of change, and review time began to plateau — indicating reviewers were no longer meaningfully engaging with the code [8].

SIM 02 Verification Ceiling Function

Your team's effective throughput is capped by the lesser of generation capacity and verification capacity. This model shows where the ceiling is for your team size.

[Interactive chart: generation capacity vs. verification capacity (LOC/day), with the resulting effective throughput. Defaults: 4 engineers, V = 2.5×, R = 25%, 4 review hours.]
T_eff = min(G_total, V_total) × (1 − R)

Generation: G = n × P_base × V. Base LOC/day = 200 per engineer. Verification: V = n_reviewers × 400 LOC/hr × review_hrs × fatigue_decay. Reviewers = floor(n/2). Fatigue decay follows F(d) = 1 − (1 − e^(-hrs/D_max)) × 0.4 where D_max = 4 sustainable hours. Based on Qodo's projection of 40% quality deficit as AI velocity outpaces review capacity [7].
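The ceiling function follows directly from those definitions — 200 base LOC/day, a 400 LOC/hr review rate, floor(n/2) reviewers, and the exponential fatigue decay:

```python
import math

def verification_ceiling(n, v=2.5, rework=0.25, review_hrs=4.0,
                         base_loc=200, review_rate=400, d_max=4.0):
    """T_eff = min(G, V_cap) × (1 − R): throughput is capped by the
    lesser of generation and (fatigue-adjusted) verification capacity."""
    generation = n * base_loc * v
    fatigue = 1.0 - (1.0 - math.exp(-review_hrs / d_max)) * 0.4
    reviewers = n // 2                    # floor(n/2) engineers reviewing
    verification = reviewers * review_rate * review_hrs * fatigue
    return min(generation, verification) * (1.0 - rework)
```

With the default inputs a 4-person team is still generation-bound (2,000 LOC/day generated vs. roughly 2,390 reviewable); push V past about 3× or cut the review hours and the team flips to verification-bound, and extra AI speed buys nothing.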


3. The 24-Month Simulation

Now let's model what happens over time. I'm comparing three team structures building the same product: a 10-person legacy team (no AI), a 2-person skeleton crew (heavy AI), and a 4-person expert pod (the "golden ratio" hypothesis).

Key assumptions grounded in the research: AI quality starts low (high rework) and improves ~25% every 6 months. Human review capacity stays constant. Communication overhead follows Brooks's Law: n(n-1)/2 channels. Rework rate for AI code decreases as AI quality improves.

SIM 03 24-Month Team Comparison

Watch how three team structures perform over 24 months as AI quality improves but human review capacity stays fixed. Where does the skeleton crew peak? When does the pod overtake?

[Interactive chart: 24-month output curves for the legacy team (10 eng, no AI), the skeleton crew (2 + AI), and the expert pod (4 + AI), with month-24 readouts for each. Defaults: 25% AI quality improvement per 6 months, 2% cost per communication channel.]
O(n,t) = n × P × V(t) × (1−R(t)) × (1−C(n)) × (1−VP(n,t))

AI quality trajectory: Q(t) = min(0.10 × 1.25^(t/6), 0.85). Rework R(t) = max(0.15, 1 − Q(t)). Communication overhead: C(n) = n(n−1)/2 × c. Verification penalty: VP = max(0, (G − V_cap) / G) × 0.4. Based on METR RCT data, DORA longitudinal reports, and CMU's Cursor adoption study [2][9][10].
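A compact implementation of the recurrence, using the stated Q(t), R(t), C(n), and VP terms. The legacy team's values (V = 1.0, R = 12%) and the LOC figures for generation and review capacity are my assumptions, not the article's:

```python
def monthly_output(n, t, use_ai=True, p=1.0, v=2.5, c=0.02,
                   r_human=0.12, gen_per_eng=500, cap_per_reviewer=1200):
    """O(n,t) = n·P·V·(1−R(t))·(1−C(n))·(1−VP(n,t)) at month t."""
    if use_ai:
        q = min(0.10 * 1.25 ** (t / 6), 0.85)  # AI quality trajectory Q(t)
        r = max(0.15, 1.0 - q)                 # rework rate R(t)
        vel = v
    else:
        r, vel = r_human, 1.0
    comm = n * (n - 1) / 2 * c                 # Brooks's Law channels × cost
    gen = n * gen_per_eng * vel                # LOC generated per month
    cap = (n // 2) * cap_per_reviewer          # LOC reviewable per month
    vp = max(0.0, (gen - cap) / gen) * 0.4     # verification penalty
    return n * p * vel * (1.0 - r) * (1.0 - comm) * (1.0 - vp)
```

With these assumptions the 4-person pod clearly leads both alternatives by month 24; shifting the generation/review LOC figures moves the crossover months but not the ordering at the horizon.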


4. Finding the Optimal Team Size

This is the model that directly answers "what's my ideal headcount?" It combines all the forces — output, Brooks's Law overhead, verification penalty, and bus factor risk — into a single utility function you can optimize:

SIM 04 Optimal Team Size Function

This utility function balances effective output, communication overhead, verification penalties, and bus factor risk. The blue dot marks your optimal team size.

[Interactive chart: effective output, communication overhead, verification penalty, bus factor cost, and net utility U(n), with the optimal team size, peak utility, and utility at n=2 reported. Defaults: V = 2.5×, R = 22%, c = 4%, k = 2.0.]
U(n) = n·P·V·(1−R) − n(n−1)/2·c − max(0, G_n−V_n)·λ − k/n

Derived from Hackman & Vidmar's optimal team size research (4.6 members) [11], Brooks's Law communication overhead, and the verification ceiling model. The bus factor term k/n captures the catastrophic risk of losing 1 member in a 2-person team versus the manageable impact in a 4+ person team. The Scrum Guide recommends 3–9 developers; Amazon's two-pizza teams land at 4–6 [12].
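Finding the argmax needs no calculus — brute force over a candidate range suffices. The sketch below uses V = 2.5×, R = 22%, c = 4%, and k = 2.0 from the simulation's defaults; the fixed verification budget V_n (a single dedicated review pipeline, in normalized output units) is my own simplifying assumption, not something the model specifies:

```python
def utility(n, p=1.0, v=2.5, r=0.22, c=0.04, lam=1.0, k=2.0, v_budget=10.0):
    """U(n) = n·P·V·(1−R) − n(n−1)/2·c − max(0, G_n − V_n)·λ − k/n.
    G_n = n·V in normalized units; V_n = v_budget is an assumed fixed
    verification budget (illustrative, not from the article)."""
    g_n = n * v
    return (n * p * v * (1.0 - r)          # effective output
            - n * (n - 1) / 2 * c          # Brooks's Law overhead
            - max(0.0, g_n - v_budget) * lam  # verification penalty
            - k / n)                       # bus factor risk

best = max(range(2, 13), key=utility)      # → 4 with these inputs
```

The interior maximum appears exactly where the narrative predicts: below it, the bus factor term punishes tiny teams; above it, unverifiable generation and communication channels eat the marginal engineer's output.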


5. Risk Analysis: Model Collapse & Code Quality Decay

A landmark 2024 paper in Nature by Shumailov et al. demonstrated that AI models trained on recursively generated data suffer irreversible quality degradation — a phenomenon they termed "model collapse" [13]. AI-generated code now constitutes 25–42% of committed code at major companies — and these repositories become training data for the next generation of code models.

Anthropic's 2026 RCT found AI-assisted junior engineers scored 17% lower on comprehension quizzes [14], and MIT Media Lab's EEG study showed LLM users displayed the weakest neural connectivity patterns. The researchers introduced the concept of "cognitive debt" — the delayed cost to attention, learning, and mental health from chronic AI reliance [15].

SIM 05 Code Quality Decay Function

Models how code quality evolves over 24 months as AI-generated code accumulates. Higher review intensity slows the decay — but can the 2-person team keep up?

[Interactive chart: quality trajectories over 24 months for a 2-person team (50% review), a 4-person team (50% review), and a 4-person team (75% review), against a danger threshold of 0.6, with readouts for pod quality at M12/M24 and the danger month for the 2-person team. Slider defaults: 3%, 1.5, 0.25.]
Q(t) = Q₀ × e^(−β × α(t)) + γ × (n_r / n)

Based on GitClear's analysis of 211M changed lines (code churn doubled 2021–2024) [4], CMU's Cursor study (static analysis warnings +30%, complexity +41% persisting after velocity gains faded) [10], and Shumailov et al.'s model collapse research [13]. The 2-person team additionally suffers compounding reviewer fatigue (decay factor 1 − t × 0.02).
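A sketch of the decay curve. The mapping of the simulation's default inputs to α growth (3%/month), β (1.5), and γ (0.25) is my reading of the sliders, and the 90% cap on AI code share is an assumption:

```python
import math

def quality(t, n, reviewers_frac=0.5, q0=1.0, beta=1.5, gamma=0.25,
            alpha_growth=0.03):
    """Q(t) = Q₀·e^(−β·α(t)) + γ·(n_r/n), where α(t) is the AI-authored
    share of the codebase, assumed to grow 3%/month (capped at 90%)."""
    alpha = min(0.9, alpha_growth * t)
    q = q0 * math.exp(-beta * alpha) + gamma * reviewers_frac
    if n == 2:  # compounding reviewer fatigue hits only the 2-person team
        q *= max(0.0, 1.0 - t * 0.02)
    return q

danger_month = next(t for t in range(1, 25) if quality(t, 2) < 0.6)
```

Raising `reviewers_frac` lifts the whole curve by a constant, which is why the 4-person team at 75% review intensity stays above the threshold longest: review effort buys quality linearly while AI share erodes it exponentially.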


6. The Headcount Question: Jevons Paradox

Most headcount reduction calculations ignore the Jevons Paradox: when AI makes development cheaper, demand for software expands, partially offsetting efficiency gains. A February 2026 Fortune analysis of 6,000 executives found nearly 90% reporting no measurable AI impact on employment or productivity [16]. Google's Sundar Pichai revealed 25% of Google's code is now AI-assisted, but the company is hiring more engineers, not fewer.

SIM 06 Headcount Reduction (Jevons Paradox)

The naive "40% efficiency = fire 40% of people" calculation ignores demand elasticity. This model shows the true headcount adjustment when new demand absorbs efficiency gains.

[Interactive chart: naive headcount vs. Jevons-adjusted headcount, with the demand absorbed by new work. Defaults: H₀ = 20, E_ai = 40%, ε = 0.60.]
H_new = H₀ × (1 − E_ai / (1 + ε × E_ai)), where H_new = demand-adjusted headcount, E_ai = AI efficiency gain, ε = demand elasticity.

Klarna cut from 5,500 to 3,400 before their CEO admitted it "went too far" [17]. Deloitte's 2026 survey found 70% of tech leaders plan to grow teams despite AI [18]. Nobel laureate Daron Acemoglu projects only 0.5% TFP growth from AI [16]. The Jevons Paradox: when a resource becomes cheaper to use, total consumption often increases rather than decreases.
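The adjustment itself is a two-liner. Naive planning removes H₀ × E_ai heads; the demand-adjusted version shrinks the effective cut to E_ai / (1 + ε·E_ai), using the simulation's default inputs:

```python
def jevons_headcount(h0, e_ai, elasticity):
    """Return (naive, adjusted) headcount after an AI efficiency gain.
    Naive: cut the full gain. Adjusted: new demand absorbs part of it,
    so the effective cut shrinks to E_ai / (1 + ε·E_ai)."""
    naive = h0 * (1.0 - e_ai)
    adjusted = h0 * (1.0 - e_ai / (1.0 + elasticity * e_ai))
    return naive, adjusted

# 20 engineers, 40% efficiency gain, demand elasticity 0.60:
naive, adjusted = jevons_headcount(20, 0.40, 0.60)  # 12.0 vs ≈ 13.5 remaining
```

With ε = 0.60, roughly a fifth of the "freed" capacity is immediately consumed by demand the cheaper development unlocked — the gap between the naive and adjusted numbers is headcount you would have to rehire.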


7. The Evidence Base

The pattern across every rigorous study is striking: a persistent gap between what developers perceive and what's measured.

| Source | Finding | Implication |
| --- | --- | --- |
| METR 2025 (RCT, n=16) [2] | AI made experienced devs 19% slower despite perceived 20% speedup | Perception is dangerously unreliable |
| DORA 2024 (n=39,000) [9] | Each 25% AI adoption → 1.5% speed dip, 7.2% stability drop | AI amplifies existing dysfunction |
| CodeRabbit (470 PRs) [3] | AI code: 1.7× issues, 8× performance bugs | Review capacity is non-negotiable |
| CMU Cursor study [10] | 281% LOC spike month 1, baseline by month 3, quality stayed degraded | Velocity gains are transient, quality costs persist |
| Sonar 2026 (1,100 devs) [5] | 96% distrust AI code, only 48% verify it | The verification gap is real |
| Fortune/NBER 2026 [16] | 90% of 6,000 executives say AI had no measurable productivity impact | Macro-level Solow Paradox 2.0 |

8. Conclusion: The Paradox Resolves Through Context

The 2-person skeleton crew is not mathematically wrong — it's contextually limited. It excels for prototyping and MVPs where speed of iteration matters more than sustained quality. But it hits a wall when the verification bottleneck becomes binding, which the simulation shows happening around month 8–10 for any non-trivial production system.

The 4-person expert pod isn't magic — it's structurally resilient. Its advantage isn't that 4 > 2 (trivially true). It's that 4 humans can distribute verification load, maintain cognitive sharpness through role rotation, survive key-person loss, and sustain delivery over 24+ months.

The binding constraint on AI-augmented teams is not technical but cognitive. The verification bottleneck, the perception gap, cognitive debt, and skill atrophy are fundamentally human limitations that AI acceleration actively worsens. The faster AI generates code, the more valuable slow, careful human judgment becomes.

The future belongs not to the fastest teams, but to the most discerning ones.

Found this useful?

Share the simulations with your team. Plug in your own numbers. The models are designed to be tweaked, exported, and used in your next planning conversation.

References & Sources

[1] DX Research AI Measurement Framework, 2026. 121,000+ developers surveyed; 93% use AI, ~10% net productivity gain. getdx.com
[2] METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," 2025. RCT with 16 devs: 19% actual slowdown vs. 20% perceived speedup. metr.org · arXiv:2507.09089
[3] CodeRabbit, "State of AI vs Human Code Generation Report," 2025. 470 PRs: AI code produces 1.7× more issues, 8× more performance bugs. coderabbit.ai
[4] GitClear, "AI Copilot Code Quality: 2025 Data," 2025. 211M changed lines: code churn doubled, copy-paste up 48%, refactoring down 60%. gitclear.com
[5] Sonar, "Critical Verification Gap in AI Coding," 2026. 1,100 developers: 96% distrust, only 48% verify. sonarsource.com
[6] InfoQ, "AI-Generated Code Creates New Wave of Technical Debt," 2025. infoq.com
[7] Qodo, "Best Automated Code Review Tools for Enterprise," 2026. Projects 40% quality deficit as AI generation outpaces review. qodo.ai
[8] Salesforce Engineering, "Scaling Code Reviews: Adapting to a Surge in AI-Generated Code," 2025. engineering.salesforce.com
[9] Google Cloud DORA Report, 2024 & 2025. 39,000+ respondents (2024), 5,000 (2025). cloud.google.com
[10] Carnegie Mellon University, study of 807 GitHub repos adopting Cursor vs. 1,380 controls. 281% LOC spike, baseline reversion by M3, persistent quality issues.
[11] Hackman & Vidmar, "Effects of Size and Task Type on Group Performance," 1970. Optimal satisfaction at 4.6 members.
[12] Optimum Partners, "Engineering Management 2026: Structuring an AI-Native Team." Centaur Pod model. optimumpartners.com
[13] Shumailov et al., "AI models collapse when trained on recursively generated data," Nature, 2024. DOI: 10.1038/s41586-024-07566-y
[14] Anthropic Research, "How AI Assistance Impacts the Formation of Coding Skills," 2026. 52 junior engineers, 17% lower comprehension. anthropic.com
[15] MIT Media Lab, "Your Brain on ChatGPT: Accumulation of Cognitive Debt," 2025. EEG study, n=54. arXiv:2506.08872
[16] Fortune, "AI Productivity Paradox," Feb 2026. NBER study of 6,000 executives. Acemoglu: 0.5% TFP growth. fortune.com
[17] CNBC, May 2025: Klarna 40% workforce cut. CEO later admitted "went too far" and resumed hiring. cnbc.com
[18] Deloitte Insights, "The Great Rebuild: Architecting an AI-native Tech Organization," 2026. 70% plan to grow teams. deloitte.com