Every engineering leader says the same thing: "My team ships code 3× faster with AI." They're right — sort of. The code is being written faster. But here's the uncomfortable truth buried in every dataset from 2024–2026: the software isn't getting delivered any better.
Despite 93% of developers now using AI coding tools, net productivity gains have plateaued at roughly 10%.1 The most rigorous causal study available — METR's randomized controlled trial with experienced open-source developers — found AI tools actually made them 19% slower, even as those same developers perceived a 20% speedup.2
That's a 39-percentage-point gap between belief and reality. And it sits at the heart of what I'm calling The Engineering Paradox: AI amplifies velocity while degrading the quality signals teams depend on — creating a system that feels faster but delivers less reliably.
This has massive implications for how we size teams. Should you go lean — a 2-person "skeleton crew" leveraging AI as a force multiplier? Or is there a mathematically optimal team size? I built six simulation models to find out. Each one is interactive — plug in your own numbers and see what happens.
1. Mathematical Mental Models
Before we run any simulation, we need shared language. I've synthesized three core models from the research that capture the dynamics at play in AI-augmented engineering teams.
Model 1: The Effective Output Equation
The equation: E = P × V × (1 − R), effective output equals base productivity times the AI velocity multiplier times the fraction of work that survives rework. It looks deceptively simple, but the devil is in the interaction effects. When V goes up (AI makes you faster), R also tends to go up (AI-generated code has more defects). The CodeRabbit analysis of 470 real-world pull requests found AI-authored PRs contain 1.7× more issues overall,3 and GitClear's analysis of 211 million changed lines showed code churn doubled between 2021 and 2024.4
At a rework rate above ~67% (the crossover against a zero-rework baseline), a team running at V=3.0× (heavy AI) produces less effective output than a team with no AI at all. The simulation below lets you see exactly where your team sits.
Explore how AI velocity multipliers interact with rework rates. At what rework rate does AI help become AI harm?
Parameters: P (base productivity) is normalized to 1.0 for a single engineer. V (velocity multiplier) represents the throughput gain from AI tools — the DX Research framework across 450+ companies finds typical gains of 1.1×, with experienced devs on greenfield tasks reaching up to 3–5×. R (rework rate) represents the fraction of output that must be redone — CodeRabbit found 1.7× more issues in AI PRs, and GitClear documented a doubling of code churn.
Crossover point: Solve for R where AI output equals no-AI output: 1.0 × (1 − R₀) = V × (1 − R), giving R_crossover = 1 − (1 − R₀)/V. For R₀ = 12% (typical human rework rate), V=2.0 gives R_crossover = 56%. Beyond this, AI is net-negative.
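The crossover arithmetic is easy to sanity-check yourself. A minimal sketch using the parameter values quoted above (function names are mine, not from the simulation):

```python
def effective_output(v, r, p=1.0):
    """E = P * V * (1 - R): base productivity x velocity multiplier x surviving work."""
    return p * v * (1 - r)

def crossover_rework(v, r0=0.12):
    """Rework rate R at which AI-assisted output falls back to the no-AI baseline,
    from 1.0 * (1 - r0) = v * (1 - R)  =>  R = 1 - (1 - r0) / v."""
    return 1 - (1 - r0) / v

threshold = crossover_rework(2.0)   # ~0.56 for V=2x, matching the worked example
```

Past that threshold, `effective_output(2.0, r)` drops below `effective_output(1.0, 0.12)`: faster generation, less delivered software.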
Model 2: The Oversight Tax
Every AI-generated artifact demands a human decision: accept, reject, or modify. This creates a cognitive tax that compounds throughout the day.
Sonar's 2026 survey of 1,100 developers found that 96% don't fully trust AI-generated code, yet only 48% always verify it before committing.5 That 48-point gap between distrust and verification is the Oversight Tax made visible.
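One way to see why the tax compounds rather than adds: model each accept/reject/modify decision as getting slightly slower as fatigue builds. Every number here (suggestion volume, per-decision minutes, fatigue growth) is an illustrative assumption of mine, not a figure from the surveys:

```python
def oversight_hours(suggestions=120, base_minutes=1.5, fatigue_growth=0.01):
    """Total daily review time when each decision is a bit costlier than
    the last (illustrative parameters, not survey data)."""
    total, cost = 0.0, base_minutes
    for _ in range(suggestions):
        total += cost
        cost *= 1 + fatigue_growth   # fatigue compounds per decision
    return total / 60                # hours

flat = 120 * 1.5 / 60                # 3.0 hours if decisions never got harder
```

With even 1% compounding, the day's oversight bill comes out well above the naive flat estimate, which is exactly the gap that tempts developers to stop verifying.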
Model 3: The Reckoning Formula (Technical Debt)
A 2025 InfoQ analysis found AI-generated code creates entirely new categories of technical debt,6 and copy-pasted code rose from 8.3% to 12.3% while refactoring activity collapsed from 25% of changed lines to less than 10%.4
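The interactive formula isn't reproduced in this text, so here is a hedged stand-in for the dynamic it models: debt grows with duplicated code, shrinks with refactoring, and compounds when unpaid. The GitClear rates are real; the accumulator structure and interest rate are my assumptions:

```python
def tech_debt(months, copy_paste=0.123, refactor_rate=0.10, interest=0.02):
    """Illustrative monthly debt accumulator, not the article's exact formula.
    copy_paste and refactor_rate echo GitClear's 12.3% / <10% figures."""
    debt = 0.0
    for _ in range(months):
        debt *= 1 + interest                      # unpaid debt compounds
        debt += copy_paste * (1 - refactor_rate)  # new duplication, minus refactoring
    return debt
```

The compounding term is the "reckoning": deferring the cleanup makes every future month's bill larger.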
2. The Verification Bottleneck
Code generation used to be the bottleneck. AI removed it. But it didn't remove the next bottleneck — it made it worse. Humans can effectively review approximately 400 lines of code per hour and hit a cognitive wall after roughly 60 minutes of sustained review. These are hard biological limits that have not changed despite AI scaling production by orders of magnitude.7
Salesforce Engineering documented this directly: after AI adoption, code volume increased 30%, PRs regularly expanded beyond 20 files and 1,000 lines of change, and review time began to plateau — indicating reviewers were no longer meaningfully engaging with the code.8
Your team's effective throughput is capped by the lesser of generation capacity and verification capacity. This model shows where the ceiling is for your team size.
Generation: G = n × P_base × V. Base LOC/day = 200 per engineer. Verification: V_cap = n_reviewers × 400 LOC/hr × review_hrs × fatigue_decay, with reviewers = floor(n/2). Fatigue decay follows F(d) = 1 − (1 − e^(−d/D_max)) × 0.4, where d is hours of sustained review and D_max = 4 sustainable hours. Based on Qodo's projection of a 40% quality deficit as AI velocity outpaces review capacity.7
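Those two capacities translate directly into a min() ceiling. A sketch using the model's stated constants (the 2-hour daily review budget is my assumption):

```python
import math

def generation(n, v_mult, base_loc=200):
    """LOC/day the team can produce: G = n * P_base * V."""
    return n * base_loc * v_mult

def verification(n, review_hrs=2.0, loc_per_hr=400, d_max=4.0):
    """LOC/day the team can meaningfully review, with fatigue decay
    F(d) = 1 - (1 - e^(-d/d_max)) * 0.4 and floor(n/2) reviewers."""
    reviewers = n // 2
    fatigue = 1 - (1 - math.exp(-review_hrs / d_max)) * 0.4
    return reviewers * loc_per_hr * review_hrs * fatigue

def throughput(n, v_mult, review_hrs=2.0):
    """Effective throughput is capped by the lesser of the two capacities."""
    return min(generation(n, v_mult), verification(n, review_hrs))
```

For a 4-person team at V=3×, generation more than doubles verification capacity, so the review ceiling, not the AI, sets throughput.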
3. The 24-Month Simulation
Now let's model what happens over time. I'm comparing three team structures building the same product: a 10-person legacy team (no AI), a 2-person skeleton crew (heavy AI), and a 4-person expert pod (the "golden ratio" hypothesis).
Key assumptions grounded in the research: AI quality starts low (high rework) and improves ~25% every 6 months. Human review capacity stays constant. Communication overhead follows Brooks's Law: n(n-1)/2 channels. Rework rate for AI code decreases as AI quality improves.
Watch how three team structures perform over 24 months as AI quality improves but human review capacity stays fixed. Where does the skeleton crew peak? When does the pod overtake?
AI quality trajectory: Q(t) = min(0.10 × 1.25^(t/6), 0.85). Rework R(t) = max(0.15, 1 − Q(t)). Communication overhead: C(n) = n(n−1)/2 × c. Verification penalty: VP = max(0, (G − V_cap) / G) × 0.4. Based on METR RCT data,2 DORA longitudinal reports,9 and CMU's Cursor adoption study.10
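The trajectory equations above can be sketched directly; only the per-channel communication cost c is my placeholder value:

```python
def ai_quality(t_months):
    """Q(t) = min(0.10 * 1.25^(t/6), 0.85): ~25% improvement every 6 months."""
    return min(0.10 * 1.25 ** (t_months / 6), 0.85)

def rework(t_months):
    """R(t) = max(0.15, 1 - Q(t)): high early rework, floor at 15%."""
    return max(0.15, 1 - ai_quality(t_months))

def comm_overhead(n, c=0.01):
    """Brooks's Law channels n(n-1)/2, scaled by an assumed per-channel cost c."""
    return n * (n - 1) / 2 * c
```

Running `rework` over 24 months shows why the early quarters favor the legacy team: AI rework starts near 90% and is still above 75% at month 24 under this trajectory.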
4. Finding the Optimal Team Size
This is the model that directly answers "what's my ideal headcount?" It combines all the forces — output, Brooks's Law overhead, verification penalty, and bus factor risk — into a single utility function you can optimize:
This utility function balances effective output, communication overhead, verification penalties, and bus factor risk. The blue dot marks your optimal team size.
Derived from Hackman & Vidmar's optimal team size research (4.6 members),11 Brooks's Law communication overhead, and the verification ceiling model. The bus factor term k/n captures the catastrophic risk of losing 1 member in a 2-person team versus the manageable impact in a 4+ person team. The Scrum Guide recommends 3–9 developers; Amazon's two-pizza teams land at 4–6.12
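A toy version of that utility function makes the shape of the optimum visible. The structure (capped output, Brooks overhead, k/n bus-factor risk) follows the description above; the specific constants are illustrative choices of mine, tuned only to show a peak, not calibrated to the interactive model:

```python
def utility(n, v_mult=2.0, c=0.035, k=1.0):
    """Illustrative utility: normalized capped throughput, discounted by
    Brooks's Law overhead, minus bus-factor risk k/n. Constants are assumptions."""
    gen = n * 200 * v_mult                        # LOC/day generated
    ver = (n // 2) * 400 * 2                      # reviewers * 400 LOC/hr * 2 hrs
    output = min(gen, ver) / 200                  # engineer-equivalents of output
    overhead = max(0.0, 1 - n * (n - 1) / 2 * c)  # communication discount
    return output * overhead - k / n

best = max(range(2, 10), key=utility)             # -> 4 under these assumptions
```

The peak lands at 4 here because odd sizes waste a generator (floor(n/2) reviewers) and large sizes drown in channels, which is the same mechanism the full model encodes.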
5. Risk Analysis: Model Collapse & Code Quality Decay
A landmark 2024 paper in Nature by Shumailov et al. demonstrated that AI models trained on recursively generated data suffer irreversible quality degradation — a phenomenon they termed "model collapse."13 AI-generated code now constitutes 25–42% of committed code at major companies — and these repositories become training data for the next generation of code models.
Anthropic's 2026 RCT found AI-assisted junior engineers scored 17% lower on comprehension quizzes,14 and MIT Media Lab's EEG study showed LLM users displayed the weakest neural connectivity patterns. The researchers introduced the concept of "cognitive debt" — the delayed cost to attention, learning, and mental health from chronic AI reliance.15
Models how code quality evolves over 24 months as AI-generated code accumulates. Higher review intensity slows the decay — but can the 2-person team keep up?
Based on GitClear's analysis of 211M changed lines (code churn doubled 2021–2024),4 CMU's Cursor study (static analysis warnings +30%, complexity +41% persisting after velocity gains faded),10 and Shumailov et al.'s model collapse research.13 The 2-person team additionally suffers compounding reviewer fatigue (decay factor 1 − t × 0.02).
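The decay dynamic can be sketched as a monthly multiplicative erosion. The fatigue factor (1 − t × 0.02) comes from the model description above; the 4%/month base decay rate is an assumption of mine for illustration:

```python
def code_quality(months, team_size, review_intensity=0.5):
    """Illustrative quality decay: AI debt erodes quality each month,
    review intensity slows it, and the 2-person team's reviewer fatigue
    factor (1 - t * 0.02) amplifies the decay over time."""
    q = 1.0
    for t in range(months):
        decay = 0.04 * (1 - review_intensity)  # assumed base erosion rate
        if team_size == 2:
            decay /= max(1 - t * 0.02, 0.2)    # fatigue compounds for the pair
        q *= 1 - decay
    return q
```

By month 24 the 2-person curve sits visibly below the 4-person one at the same review intensity, which is the gap the simulation visualizes.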
6. The Headcount Question: Jevons Paradox
Most headcount reduction calculations ignore the Jevons Paradox: when AI makes development cheaper, demand for software expands, partially offsetting efficiency gains. A February 2026 Fortune analysis of 6,000 executives found nearly 90% reporting no measurable AI impact on employment or productivity.16 Google's Sundar Pichai revealed 25% of Google's code is now AI-assisted, but the company is hiring more engineers, not fewer.
The naive "40% efficiency = fire 40% of people" calculation ignores demand elasticity. This model shows the true headcount adjustment when new demand absorbs efficiency gains.
Klarna cut from 5,500 to 3,400 before their CEO admitted it "went too far."17 Deloitte's 2026 survey found 70% of tech leaders plan to grow teams despite AI.18 Nobel laureate Daron Acemoglu projects only 0.5% TFP growth from AI.16 The Jevons Paradox: when a resource becomes cheaper to use, total consumption often increases rather than decreases.
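The demand-elasticity correction is one line of algebra. A sketch where the 1.4× multiplier and 0.8 elasticity are illustrative inputs, not figures from the cited surveys:

```python
def required_headcount(n0, productivity_mult, demand_elasticity):
    """Headcount needed after an AI productivity gain, once cheaper software
    expands demand (Jevons). demand_elasticity=0 reproduces the naive cut."""
    demand = 1 + demand_elasticity * (productivity_mult - 1)
    return n0 * demand / productivity_mult

naive = required_headcount(100, 1.4, 0.0)    # ~71: the layoff spreadsheet's answer
jevons = required_headcount(100, 1.4, 0.8)   # ~94: demand absorbs most of the gain
```

Same efficiency gain, wildly different staffing answers, and the difference is entirely in how much new work the cheaper development unlocks.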
7. The Evidence Base
The pattern across every rigorous study is striking: a persistent gap between what developers perceive and what's measured.
| Source | Finding | Implication |
|---|---|---|
| METR 2025 (RCT, n=16)2 | AI made experienced devs 19% slower despite perceived 20% speedup | Perception is dangerously unreliable |
| DORA 2024 (n=39,000)9 | Each 25% AI adoption → 1.5% speed dip, 7.2% stability drop | AI amplifies existing dysfunction |
| CodeRabbit (470 PRs)3 | AI code: 1.7× issues, 8× performance bugs | Review capacity is non-negotiable |
| CMU Cursor study10 | 281% LOC spike month 1, baseline by month 3, quality stayed degraded | Velocity gains are transient, quality costs persist |
| Sonar 2026 (1,100 devs)5 | 96% distrust AI code, only 48% verify it | The verification gap is real |
| Fortune/NBER 202616 | 90% of 6,000 CEOs say AI had no measurable productivity impact | Macro-level Solow Paradox 2.0 |
8. Conclusion: The Paradox Resolves Through Context
The 2-person skeleton crew is not mathematically wrong — it's contextually limited. It excels for prototyping and MVPs where speed of iteration matters more than sustained quality. But it hits a wall when the verification bottleneck becomes binding, which the simulation shows happening around month 8–10 for any non-trivial production system.
The 4-person expert pod isn't magic — it's structurally resilient. Its advantage isn't that 4 > 2 (trivially true). It's that 4 humans can distribute verification load, maintain cognitive sharpness through role rotation, survive key-person loss, and sustain delivery over 24+ months.
The future belongs not to the fastest teams, but to the most discerning ones.
Found this useful?
Share the simulations with your team. Plug in your own numbers. The models are designed to be tweaked, exported, and used in your next planning conversation.