BLOG · 4 MIN · BY RALF KLEIN

Why most GenAI projects fail to prove AI ROI

MIT found 95 percent of GenAI pilots show no measurable return. The diagnosis is wrong: AI is not failing on the model side, it is failing on measurement.

  • company
  • metrics

The MIT NANDA project's State of AI in Business 2025 report found that 95 percent of enterprise generative AI pilots produced no measurable business return after roughly $30 to $40 billion of investment. The number gets repeated everywhere. The diagnosis underneath it is wrong.

Most GenAI projects are not failing on the model side. They are failing on measurement.

What actually breaks

Forrester approached the same question from a different angle. In their analysis, only 15 percent of AI decision makers report an EBITDA lift, and fewer than 1 in 3 can tie AI value to P&L. That gap, between an automation that runs and an automation whose value lands on a financial statement, is the entire problem.

When CFOs cancel an AI initiative, they almost never do it because the model degraded. They do it because the team running the program could not produce a defensible number for the value created. The agent kept executing. Tickets kept getting deflected. Reports kept getting drafted. And the budget conversation still ended in "we cannot prove this is working."

Proving AI ROI in a way that survives a finance review is work that happens before the agent ships, not after.

Four reporting habits that separate growing budgets from shrinking ones

Across the AI ops teams that successfully defend budget at year end, four habits show up consistently. None of them require new technology. All four are about what gets measured and when.

1. A baseline that was set before deployment

Teams whose AI budget grows can produce, on demand, the per-task minute estimate they were working from before they turned the agent on. They ran a time-tracked sample, reached an expert estimate consensus, or analysed historical task logs. The baseline is documented, dated, and tied to a specific task type.

Teams whose budget shrinks cannot produce a baseline. Six months in, they are arguing about whether human ticket triage was 4 minutes or 8 minutes per ticket, and the savings number swings by 100 percent depending on which estimate the room agrees to. The baseline is the floor of every number that follows.
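For concreteness, here is a minimal sketch of what a documented baseline can look like, written in Python since that is what one of the agents later in this post runs on. Every task type, minute value, and date in it is a made-up illustration, not a recommendation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Baseline:
    task_type: str
    minutes_per_task: float
    method: str   # time-tracked sample, expert estimate consensus, or historical log analysis
    set_on: date  # must predate the agent going live

# Hypothetical values for illustration only.
BASELINES = {
    "ticket_triage": Baseline("ticket_triage", 6.0, "time-tracked sample of 40 tickets", date(2025, 1, 15)),
    "report_draft": Baseline("report_draft", 25.0, "expert estimate consensus", date(2025, 1, 20)),
}
```

The exact storage does not matter. What matters is that the method and the date exist before the first execution is logged.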

2. Per-execution logging, not monthly aggregates

The first instinct is to track AI savings in monthly summaries. The teams whose numbers survive a CFO review do the opposite. Each execution writes a row. Each row has a timestamp, a task type, a baseline minute value, an outcome, and a workspace.

Monthly aggregates collapse the signal. A 60 percent deflection rate on support tickets means nothing if you cannot answer "which 60 percent, and what did the other 40 percent cost?" Per-execution logs let the finance team reconstruct any cut of the data later, including the cuts they did not know they would want.
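As a sketch of what "each execution writes a row" can mean in practice, the snippet below appends one CSV row per execution. The file name, field names, and outcome labels are assumptions; the point is the shape, not the storage choice.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("executions.csv")  # hypothetical location; a database table works the same way
FIELDS = ["timestamp", "task_type", "baseline_minutes", "outcome", "workspace"]

def log_execution(task_type: str, baseline_minutes: float, outcome: str, workspace: str) -> None:
    """Append one row per agent execution; aggregates are computed later, never stored."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "task_type": task_type,
            "baseline_minutes": baseline_minutes,
            "outcome": outcome,  # e.g. "accepted", "rewritten", "deflected", "escalated"
            "workspace": workspace,
        })
```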

3. Tracking after the approval gate, not before

Most agents draft, classify, or summarise for human review. Counting savings on the draft step inflates every number, because rejected drafts do not save anyone time. They cost time.

The teams whose ROI claims hold up move the tracking node behind the approval gate. An AI draft that the human accepts logs as savings. An AI draft that the human rewrites does not. The same flow, with the tracking moved two nodes to the right, can take a 70 percent reported savings rate down to a defensible 45 percent. The 45 is the number the CFO believes the second time you bring it.
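In code, moving the tracking node behind the gate is a single condition: only executions the human accepted contribute their baseline minutes. A minimal sketch, assuming the row shape from the logging example above:

```python
def minutes_saved(row: dict) -> float:
    # Only drafts that cleared the human approval gate count as savings;
    # rewritten or rejected drafts contribute nothing.
    return row["baseline_minutes"] if row["outcome"] == "accepted" else 0.0

def savings_summary(rows: list[dict]) -> tuple[float, float]:
    """Return total minutes saved and the post-gate acceptance rate."""
    total_minutes = sum(minutes_saved(r) for r in rows)
    acceptance_rate = sum(r["outcome"] == "accepted" for r in rows) / max(len(rows), 1)
    return total_minutes, acceptance_rate
```

The acceptance rate this produces is the post-gate figure, not the draft count, and that is the number that holds up in the review.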

4. A 30/60/90 horizon dashboard

The reports that become standing artefacts in finance reviews are structured around three time horizons, not one. Microsoft Worklab's research on AI value at work and the playbooks documented in Larridin's AI ROI measurement guide converge on the same horizons.

At 30 days, the dashboard shows adoption and quality signals: how many users invoked the agent, what the rejection rate looked like, whether output passed the eval bar. At 60 to 90 days, it shifts to operational improvements: hours saved, cycle time reduction, deflection rate stability. At 6 to 12 months, it produces the financial outcome: human equivalent hours converted to money saved, mapped to a P&L line.

The 30/60/90 frame matters because each horizon answers a different question for a different audience. Skipping straight to financial outcomes at month one produces numbers no one trusts. Reporting only adoption metrics at month nine produces a budget cut.
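Rolled into code, all three horizons come out of the same per-execution log; nothing new gets instrumented at month six. A sketch under the same assumed row shape, with a hypothetical loaded hourly cost:

```python
from datetime import datetime

def horizon_report(rows: list[dict], deployed_at: datetime, now: datetime, hourly_cost: float = 55.0) -> dict:
    """Roll the per-execution log up into the 30/60/90 horizons.
    hourly_cost is a hypothetical loaded cost per human hour."""
    age_days = (now - deployed_at).days
    accepted = [r for r in rows if r["outcome"] == "accepted"]
    report = {"30d": {  # adoption and quality signals
        "executions": len(rows),
        "rejection_rate": 1 - len(accepted) / max(len(rows), 1),
    }}
    if age_days >= 60:  # operational improvement
        report["60-90d"] = {"hours_saved": sum(r["baseline_minutes"] for r in accepted) / 60}
    if age_days >= 180:  # financial outcome for the P&L line
        report["6-12m"] = {"money_saved": report["60-90d"]["hours_saved"] * hourly_cost}
    return report
```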

What this looks like in practice

A team running 8,000 monthly executions across three n8n agents and one custom Python agent does not need a separate analytics platform to defend the budget. They need one consistent unit (human equivalent hours), one consistent baseline per task type, and one place where every execution lands. The teams whose budget grew this year were the ones doing exactly that, six months before the budget conversation started.
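The "one consistent unit" part can be as small as one aggregation over the shared log, assuming the same row shape as above: every agent, n8n or Python, lands in the same place and folds into human equivalent hours per workspace.

```python
from collections import defaultdict

def human_equivalent_hours(rows: list[dict]) -> dict[str, float]:
    """One unit across every agent: human equivalent hours per workspace."""
    hours: dict[str, float] = defaultdict(float)
    for r in rows:
        if r["outcome"] == "accepted":  # still counted behind the approval gate
            hours[r["workspace"]] += r["baseline_minutes"] / 60
    return dict(hours)
```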

The 95 percent failure number is not a verdict on the technology. It is a verdict on a measurement gap. AI ops leaders preparing for the next budget defence have one job between now and that conversation: produce the four artefacts above, with dates that predate the conversation by at least a quarter. The number you can defend is the number that wins the room.