Goodhart’s Law

When a measure becomes a target, it ceases to be a good measure. – Charles Goodhart

There is a moment, in the life of every flawed metric, when it stops measuring the thing you care about and starts measuring how badly people want to look good on it. That moment is invisible while it’s happening. You only see it in the rear-view mirror, usually after the budget meeting where someone asks why costs tripled and outcomes didn’t move.

This week, 4 of the largest technology companies on earth hit that moment more or less simultaneously (read the article on thestreet.com), and to their credit, said so out loud. Amazon shut down an internal AI leaderboard called KiroRank, which had been ranking developers by how many AI tokens they consumed on the company’s Kiro platform. Meta killed its equivalent, charmingly named Claudenomics, which had tracked token usage across 85,000 employees and singled out the top 250 for recognition. Uber’s COO admitted the company had found no clear link between increased AI spending and successful products, this after burning through its annual budget for certain AI coding tools by April. Microsoft cancelled Claude Code licenses across one division entirely, citing cost. I read the Amazon story twice, not because it surprised me, but because it didn’t. I want to tell you with as much precision as I can manage: this is not a story about AI. It’s a story about measurement. AI just happens to be the most recent thing we’ve pointed a bad measurement at.

The shape of the mistake

Let me describe the pattern in the abstract first, because the abstraction is the useful part. The specifics will date; the shape will not. A new tool arrives. It’s genuinely capable. This part is important, because the pattern doesn’t work with tools that don’t do anything. The tool is hard to measure directly, because its value is diffuse and shows up downstream and is tangled up with a hundred other things. But the tool produces, as a side effect, one number that is trivially easy to count. And that number is loosely, plausibly, correlated with value.

So leadership latches onto the easy number. Not because they’re foolish but because they’re responsible. They’ve spent a fortune on the tool, they need to know it’s working, and the easy number is sitting right there, updating in real time, fitting neatly onto a slide. Adopting it is the rational thing to do. I have adopted these numbers myself. I am not throwing stones from outside the building. Then the number becomes a target. And the moment a number becomes a target, the people being measured (who aren’t stupid, who are responding exactly as you’d expect rational people to respond to the incentives you’ve handed them) begin optimizing the number directly, severing whatever loose connection it once had to the thing you actually wanted.

At one company I worked for, there was a program called “Kill the Tail,” aimed at reducing the number of applications in the portfolio. The idea was straightforward: fewer applications, lower cost. The problem was definitional. What counted as an “application” was never clearly understood, which led to a surge of loosely defined entries; many of which weren’t really applications at all. When the mandate came down to reduce the count, teams responded the way metrics tend to drive behavior. Instead of actually simplifying the landscape, many simply consolidated everything under a single “application” label. The metric improved. The reality didn’t.

The economist Charles Goodhart gave us the law in 1975: when a measure becomes a target, it ceases to be a good measure. Amazon’s employees demonstrated the corollary in real time. They started “tokenmaxxing” – running meaningless tasks through AI agents purely to consume tokens and climb the leaderboard. Costs went up. Value/Outcome did not. The leaderboard was, by the end, measuring nothing except people’s willingness to game a leaderboard. This isn’t a bug in the employees. This is the metric working exactly as a bad metric works.

We have been here so many times

The reason I felt no surprise reading about KiroRank is that I have watched this identical movie with different actors multiple times in my career. The token leaderboard is a sequel. A high-budget, AI-themed sequel but a sequel.

Lines of code. For a glorious, stupid window in the history of this industry, somebody decided that the way to measure a programmer’s productivity was to count the lines of code they produced. The result was predictable to anyone who has ever actually written software: you reward verbosity and punish elegance. Put this in the context of Richard Pattis’s quote, when debugging, novices insert corrective code; experts remove defective code. By counting lines of commit, you reward the novice for adding and punish the expert for deleting. The most valuable engineer on your team, on a good day, ships a negative number. Your dashboard never records this.

Story points. Then we got more sophisticated, or thought we did. We’d measure velocity; story points completed per sprint. And teams, being made of intelligent people, learned to inflate their estimates. Velocity went up and to the right on every burndown chart in the building. Customer impact stayed exactly where it was, because story points completed has never, in the history of software, been the same thing as a customer’s problem solved.

I’ve lost count of how many times I’ve heard engineering leaders ask, “Why is this team’s velocity so much lower than another’s?” Instead of correcting the misunderstanding, the pressure trickles down. Teams respond the only way the system rewards them to. They quietly double or triple their story point estimates. On paper, velocity improves. In reality, nothing changes.

Hours worked. And underneath all of it, the oldest false proxy of them all: time at the desk. Parkinson’s law settled this in 1955, work expands so as to fill the time available for its completion. Measure hours, and you will reliably get hours. You will not get outcomes. You’ll get people performing busyness, which is a different and more expensive thing than being busy.

Early in my career, a manager asked me to manually enter a large dataset from Excel into a Vignette CMS and gave me a week to get it done. The catch: the system required Tk/Tcl scripts for input. Rather than grind through it by hand, I wrote a Perl script that read the Excel file and generated the necessary Tk/Tcl code. One thing my parents taught me at a young age, work smarter, not harder. A bit of copy-paste later, I was finished in a day. In my naivety (unknown of Parkinson’s law), I told him I was done. I don’t think he believed me at first.

Lines of code. Story points. Hours logged. Tokens consumed. The vocabulary changed, the error is identical every time. We reach for the thing that is easy to count and quietly substitute it for the thing that is hard to count but actually matters, and then we act surprised when people give us more of what we measured and less of what we wanted.

What’s genuinely different and what isn’t

I try to be fair to each new cycle, so let me name what’s actually new here, because something is. The feedback loop is faster and more brutal than it used to be. The lines-of-code era took years to discredit itself, because the cost of bad code was diffuse and deferred. The token era discredited itself in months, because tokens carry a direct, metered, immediately-visible cost. Every gamed leaderboard entry showed up on a cloud bill. When Uber exhausted its annual budget for AI coding tools by April, that wasn’t a philosophical objection to a flawed metric. It was an invoice. The proxy and the cost are, for once, denominated in the same currency, and that made the reckoning arrive on a schedule even the most committed dashboard-believer couldn’t ignore.

What is not different is the underlying confusion, and I want to be careful here, because the easy misreading of this week’s news is “AI didn’t deliver.” That’s not the lesson, and I don’t believe it. Amazon’s $200 billion capital commitment for 2026 is intact. These companies are not retreating from AI. They’re retreating from a bad way of measuring AI, which is a sign of organizational health, not organizational doubt. The tool is real. The measurement was junk. Those are completely separate facts, and conflating them is how you end up making the opposite mistake next quarter.

I have an upcoming post on hype cycles that applies directly: the technology being real is not an argument that the claims about it are proportionate. The mirror image is also true. A metric being broken is not an argument that the thing it failed to measure is worthless. Both errors come from the same place: mistaking the proxy for the territory.

The only number that survived

Here is what I find quietly encouraging about the Amazon story, and it’s the part that didn’t make the headline.

Amazon didn’t just delete KiroRank and walk away. They replaced it with something they’re calling “normalised deployments”; a measure of AI-assisted code that actually ships, rather than tokens consumed in the act of looking productive. Dave Treadwell, their SVP of engineering, told staff the thing that should have been on the wall from day one: please don’t use AI just for the sake of using AI. Use AI to help you solve customer problems, to help you solve business problems, to innovate. This is the whole game, stated plainly. Not “use more.” Use toward something.

Shipped code that solves a real problem is harder to count than tokens. It’s harder to count than lines, or points, or hours. That difficulty is not a defect to be engineered around. It is the entire reason the measure is worth anything. The metrics that are easy to game are easy to game precisely because they’re loosely coupled to value. The metric that’s hard to fake is hard to fake because it’s tied to the only question that has ever actually mattered: What value was shipped?

Every other number on the dashboard is, in the end, a proxy hoping you won’t notice it’s a proxy. Some proxies are useful as leading indicators, read with suspicion and held loosely. None of them survive being turned into a target. The moment you promote a proxy to a goal, you’ve started manufacturing the proxy and abandoning the goal, and the only question left is how many quarters and how many billions it takes before someone in the budget meeting says it out loud. Amazon, Meta, Uber, and Microsoft said it out loud this week. Good. That’s the system working; late, and after the bill arrived, but working.

I’ve seen this before and I’ll almost certainly see it again, with a new tool, a new easy number, and a new word for gaming it. When I do, I’ll reach for the same question I always do, the one that has outlasted every leaderboard I’ve ever been handed: Is this value? Or is it just activity with a dashboard?