BLOG #001 · 4 MIN · BY RALF KLEIN

Why we measure AI agents in human hours

Tokens and API calls don't answer the only question your CFO is asking. Here's why we built HumanHours around hours saved, and what one HTTP call per task gets you.

  • product
  • metrics

Every team building agents eventually gets the same question from someone who signs the budget: what did this thing actually do for us last month?

The honest answer is usually some version of "it ran 47,000 times and cost us €820 in inference." That is a true statement. It also tells the asker exactly nothing about whether the agent earned its keep.

Tokens are an input, not an outcome

Most agent observability today is built around the wrong unit. Tokens, API calls, even traces and spans: those are inputs to the work. They tell you how much fuel you burned. They do not tell you what got moved.

A support classifier that sorts 50,000 tickets a month is impressive on a dashboard. A support classifier that replaces three FTEs of manual sorting is impressive in a board meeting. Same agent, same logs, completely different conversations.

The gap between those two framings is what the entire AI industry is currently struggling to close. We kept seeing teams build sophisticated agents and then fail at the post-launch question: prove the value. Not because there was no value, but because nobody had the right unit on hand.

The human-hours framing

So we picked a unit that everyone in the room already understands.

For every task an agent completes, ask one question: how long would this have taken a human? Then multiply by what that human costs.

That is it. That is the entire idea behind HumanHours.

A ticket classification that takes a person four minutes? The agent saved four minutes of human work, and at €45 per hour, three euros of cost. A document extraction that takes a person ninety seconds? Same math, smaller number. Run it ten thousand times a month and you have a defensible number to put in front of a CFO.
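In code, the whole model is two lines of arithmetic. A minimal sketch (the savings function below is just that math spelled out, using the €45-per-hour rate from the example; it is not part of the API):

def savings(baseline_minutes: float, hourly_rate: float, runs: int = 1):
    # Hours a human would have spent on these runs, and what that costs.
    hours = runs * baseline_minutes / 60
    return hours, hours * hourly_rate

savings(4, 45)           # one classification: (0.0667 h, €3.00)
savings(1.5, 45)         # one extraction:     (0.025 h,  €1.13)
savings(4, 45, 10_000)   # a month of tickets: (666.7 h,  €30,000)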

The catch is that this is not a metric you can derive from your existing logs. Your application code knows what task type just completed. Nothing else in the stack does. Which is why we made the integration a single HTTP call.

What it looks like in practice

curl -X POST https://humanhours.dev/api/v1/track \
  -H "Authorization: Bearer hh_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "support-classifier",
    "task_type": "email_classification",
    "outcome": "success"
  }'

You send a task_type. We look up the baseline minutes for that task. You get back the hours saved and the money saved, denominated in your workspace currency.

{
  "event_id": "evt_8aa6e7d8…",
  "resolved_baseline_minutes": 4,
  "resolved_baseline_source": "builtin",
  "hours_saved": 0.0656,
  "cost_saved": 2.95,
  "currency": "EUR"
}

That is the whole contract. One call per completed task, one number per row in the report. No SDK, no instrumentation framework, no protocol to learn.
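If you would rather make that call from application code than from curl, it is one function call. A sketch in Python using the requests library (the endpoint, payload, and response fields are exactly the ones shown above):

import requests

# Fire one tracking event after the agent finishes a task.
resp = requests.post(
    "https://humanhours.dev/api/v1/track",
    headers={"Authorization": "Bearer hh_live_..."},
    json={
        "agent_id": "support-classifier",
        "task_type": "email_classification",
        "outcome": "success",
    },
)
event = resp.json()
print(event["hours_saved"], event["cost_saved"], event["currency"])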

Behind the scenes we maintain a library of baselines for the most common agent tasks (ticket triage, classification, extraction, drafting, summarisation, routing) and you can override any of them with your own measured baseline. The math is boring on purpose. CFOs do not want a probabilistic claim. They want a number that adds up.
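Conceptually, the baseline lookup is a two-level dictionary: your workspace overrides win, the builtin library is the fallback, and that is where the resolved_baseline_source field in the response comes from. A rough sketch, not our actual server code (the ticket_triage number is invented for illustration; the other two come from the examples above):

# Workspace overrides take precedence over the builtin library.
BUILTIN_BASELINES = {
    "email_classification": 4,    # minutes, as in the response above
    "document_extraction": 1.5,   # the ninety-second example
    "ticket_triage": 3,           # illustrative only
}

def resolve_baseline(task_type, workspace_overrides):
    if task_type in workspace_overrides:
        return workspace_overrides[task_type], "override"
    return BUILTIN_BASELINES[task_type], "builtin"

resolve_baseline("email_classification", {})
# -> (4, "builtin"): the resolved_baseline_minutes and _source fields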

Why this is the right level of abstraction

There is a temptation, especially in technical teams, to want richer data. Latency percentiles. Token cost per task. Per-call success rates. We track those too, because they matter for engineering. But they are not the headline metric.

The headline metric is the one a non-engineer can repeat in a meeting.

"Our support agent saved us 312 hours last month. That is roughly two FTE worth of triage work. The agent itself cost us €840 to run. Net: about €13,200 in salary cost avoided."

Try writing that sentence from your current observability stack. Most teams cannot, not because the data is missing, but because it is scattered across three vendors and never reduced to a single unit.

That sentence is what HumanHours is built to produce. It is what we ship reports on. It is what shows up in your weekly digest email. It is what your CFO will eventually ask you to put in front of the board.

What is next

We are launching with task baselines for the most common agent workloads, but the interesting work is in the long tail: an agent that does one specific thing your business cares about needs a baseline that reflects your team, not an industry average. That is where overrides come in, and where the next set of features is heading.

If you are building agents and the question of value keeps coming up unanswered, we would like to hear from you. Drop a note at hello@humanhours.dev, or start with the docs and try the API.

Every agent, measured in human hours.