BY RALF KLEIN · 3 MIN

Human-in-the-loop ROI: only count what actually replaced human work

An agent drafts 800 reply suggestions. Humans approve 240. The dashboard reads 800. Place the tracking node after the approval gate, not before, and the number stops inflating.

  • metrics
  • agents

A support agent drafts 800 reply suggestions in a week. The dashboard logs 800 savings events at five minutes each. That reads as 67 hours saved. The team actually approved 240 of those drafts, edited another 320, and rejected the rest. The CFO asks the obvious question and the number does not survive the meeting.
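The gap is plain arithmetic. A quick check using only the figures from the example week (the variable names are illustrative, not part of any tracking API):

```python
# Figures from the example week.
drafts = 800
approved_as_is = 240
edited = 320
rejected = drafts - approved_as_is - edited  # "the rest"
baseline_minutes = 5

# What the dashboard logged: one savings event per draft.
logged_hours = drafts * baseline_minutes / 60

# What unchanged approvals alone account for.
approved_hours = approved_as_is * baseline_minutes / 60

# Ceiling if every edited draft were credited at full baseline.
ceiling_hours = (approved_as_is + edited) * baseline_minutes / 60

print(f"logged: {logged_hours:.1f} h")            # logged: 66.7 h
print(f"approved as-is: {approved_hours:.1f} h")  # approved as-is: 20.0 h
print(f"ceiling: {ceiling_hours:.1f} h")          # ceiling: 46.7 h
print(f"rejected drafts: {rejected}")             # rejected drafts: 240
```

Even the most generous reading of the edited drafts lands well short of the 67 hours on the dashboard.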

The fault is not the agent. The fault is where the tracking node fires.

The structural mistake

In most human-in-the-loop flows the savings call sits next to the draft step. Agent produces a draft, tracking event fires, draft moves into the approval queue. That ordering is convenient because the agent context is already loaded, and it is wrong because the approval has not happened yet.

A rejected draft saved nothing. A heavily edited draft saved a fraction. A draft a reviewer rewrote from scratch cost more in context-switching than the agent saved by producing it. Counting each of these the same as a draft approved unchanged inflates the number by 2x to 4x in any real approval workflow.

The fix is mechanical. Move the tracking node behind the approval gate. The gate becomes the source of truth for whether an agent execution replaced human work or merely staged it.
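A minimal sketch of that ordering, assuming a generic workflow where each step is a plain function. `SavingsLog`, `draft_step`, and `approval_gate` are hypothetical stand-ins, not the API of any particular tracking product or workflow tool:

```python
from dataclasses import dataclass, field


@dataclass
class SavingsLog:
    """Hypothetical stand-in for whatever tracking client the workflow uses."""
    events: list = field(default_factory=list)

    def track(self, ticket_id: str, approval_outcome: str) -> None:
        self.events.append(
            {"ticket_id": ticket_id, "approval_outcome": approval_outcome}
        )


def draft_step(ticket_id: str) -> dict:
    # The agent produces a draft. No savings event fires here.
    return {"ticket_id": ticket_id, "text": "suggested reply"}


def approval_gate(draft: dict, decision: str, log: SavingsLog) -> None:
    # The reviewer's decision resolves first; only then is the event
    # logged, and it carries the decision with it.
    log.track(draft["ticket_id"], decision)


log = SavingsLog()
queue = [draft_step(f"T-{n}") for n in range(3)]
assert log.events == []  # drafting alone records nothing

approval_gate(queue[0], "approved_as_is", log)
approval_gate(queue[1], "rejected", log)
print(len(log.events))  # 2: the third draft is still waiting for review
```

A draft that never reaches a reviewer never produces an event, which is the point of the move.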

A four-branch approval flow

A typical reviewer queue resolves into four outcomes. Each branch needs its own savings payload.

Approved as-is. Reviewer clicks approve, message ships unchanged. Log the full baseline, say five minutes for a reply a human would have written from scratch.

Approved with minor edits. Reviewer changes a sentence or two, sends. Log the baseline minus the edit time. If the human alternative was five minutes and the edit took ninety seconds, log 3.5 minutes saved. The dashboard rolls this up cleanly because the baseline is still tied to the same task type.

Heavily edited and sent. Reviewer rewrites most of it but keeps the structure. Log a fraction, often 30% to 50% of baseline. Set a workspace policy on what counts as heavy and stick to it, because reviewer judgement varies and the rule needs to be auditable.

Rejected. Reviewer discards the draft and writes their own, or closes the ticket without sending. Log zero. If you want to be honest about the cost of the review itself, log a negative event with the time the reviewer spent reading and dismissing the draft, typically thirty to sixty seconds.
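The four branches reduce to one small function. This is a sketch under stated assumptions: the function name and parameters are illustrative, and the 0.4 heavy-edit fraction is a placeholder inside the 30% to 50% range that a workspace policy would pin down.

```python
def saved_minutes(
    outcome: str,
    baseline_minutes: float = 5.0,
    edit_seconds: float = 0.0,
    review_seconds: float = 0.0,
    heavy_edit_fraction: float = 0.4,  # assumed policy value, not a standard
) -> float:
    """Minutes of human work replaced, by approval branch."""
    if outcome == "approved_as_is":
        # Shipped unchanged: the full baseline was replaced.
        return baseline_minutes
    if outcome == "approved_with_edits":
        # Baseline minus the time the reviewer spent editing.
        return baseline_minutes - edit_seconds / 60
    if outcome == "heavily_edited":
        # A fixed, auditable fraction of baseline.
        return baseline_minutes * heavy_edit_fraction
    if outcome == "rejected":
        # Zero by default; negative if the review cost is recorded.
        return -review_seconds / 60 if review_seconds else 0.0
    raise ValueError(f"unknown approval outcome: {outcome!r}")


print(saved_minutes("approved_as_is"))                        # 5.0
print(saved_minutes("approved_with_edits", edit_seconds=90))  # 3.5
print(saved_minutes("heavily_edited"))                        # 2.0
print(saved_minutes("rejected", review_seconds=45))           # -0.75
```

Keeping the heavy-edit fraction as an explicit parameter makes the policy visible in one place instead of leaving it to each reviewer.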

Payload that supports the branches

One field carries this: the approval outcome. Send it on every savings event so the dashboard can split by branch and so the per-execution log is auditable later.

{
  "agent_id": "support-replies",
  "task_type": "support_reply_draft",
  "outcome": "success",
  "human_baseline_minutes": 5,
  "metadata": {
    "approval_outcome": "approved_with_edits",
    "edit_seconds": 90,
    "ticket_id": "T-48211"
  }
}

The approval_outcome enum stays small: approved_as_is, approved_with_edits, heavily_edited, rejected. Anything finer is reviewer noise. Anything coarser collapses back into the original inflation problem.
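One way to keep the enum from drifting is to validate it where the event is built. The sketch below assumes the payload shape shown above; `ApprovalOutcome` and `build_savings_event` are hypothetical names, and how the event is actually sent is left out.

```python
import json
from enum import Enum


class ApprovalOutcome(str, Enum):
    APPROVED_AS_IS = "approved_as_is"
    APPROVED_WITH_EDITS = "approved_with_edits"
    HEAVILY_EDITED = "heavily_edited"
    REJECTED = "rejected"


def build_savings_event(
    approval_outcome: str, ticket_id: str, edit_seconds: int = 0
) -> dict:
    # Enum lookup by value raises ValueError for anything outside the four.
    outcome = ApprovalOutcome(approval_outcome)
    return {
        "agent_id": "support-replies",
        "task_type": "support_reply_draft",
        "outcome": "success",
        "human_baseline_minutes": 5,
        "metadata": {
            "approval_outcome": outcome.value,
            "edit_seconds": edit_seconds,
            "ticket_id": ticket_id,
        },
    }


event = build_savings_event("approved_with_edits", "T-48211", edit_seconds=90)
print(json.dumps(event, indent=2))

try:
    build_savings_event("mostly_fine", "T-48212")
except ValueError as err:
    print(f"refused: {err}")
```

Rejecting unknown values at build time means a new reviewer label cannot silently add a fifth branch to the dashboard.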

The savings call still fires once per replaced task, same rule as any other agent flow. The difference is that the call fires from the approval node, not the draft node, and the baseline value is conditional on which branch resolved.

Why this matters more than calibration

Teams spend weeks calibrating baseline minute values per task type. That work is necessary and it caps your accuracy at roughly plus or minus 15% once dialled in. A misplaced tracking node in a human-in-the-loop flow blows past that error budget in the first week of production traffic. An automation running between 5k and 20k drafts a month, the typical n8n or LangGraph review-queue range, can show a 67-hour week as a 24-hour week or a 90-hour week depending on where the call fires.

Approval-gated work is the most common place this goes wrong, because the draft-then-approve pattern is everywhere: support replies, sales follow-ups, marketing copy, code suggestions, contract redlines. Every one of those flows needs the tracking node behind the gate.

The rule is the same one that holds for tool-call counting and per-record tracking: log savings events where the human work would have ended, not where the agent work began.