BLOG · 6 MIN · BY RALF KLEIN

Where to track AI hours saved: a decision tree for any automation stack

Eight automation patterns, eight different places to drop the tracking event. A decision tree for keeping AI hours saved defensible across n8n, Zapier and custom Python.

  • engineering

Most teams run AI automations across at least three platforms before they hit their first quarter of meaningful volume. The n8n workflow that processes inbound tickets. The Zapier flow that books demos. The custom Python agent that summarises Slack threads at 6am. Maybe a Salesforce flow that classifies leads. Each one quietly saves real time, and each one reports it differently.

n8n's Insights feature ships a Time Saved metric, but only on parent workflows and only on Pro plans and above. The Microsoft Copilot Dashboard tracks Copilot. GitHub Copilot stats track GitHub Copilot. None of them measure what your custom Python agent saves, and none of them speak to each other. So the question for anyone running a mixed stack is the same: where exactly do you drop the tracking event so the numbers add up at the end of the month?

The answer depends on what the workflow looks like. Here are eight common patterns and the rule for each one.

1. Webhook intake

A webhook fires on every inbound event, and the workflow either processes the record or routes it. Examples: a support ticket coming in from Intercom, a form submission, a Stripe webhook routing to a fulfilment flow.

Where to track: at the end of the success branch, after any branching logic has resolved. Not at the start, because you don't yet know the record was actually processed. Not on the failure branch, because you saved no time.

The volume on these tends to land in the 5k to 50k events per month range for a mid-sized team, well within realistic agent volumes for n8n. One tracking event per processed record, no aggregation.
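
In code, the rule is one call on the success branch and nothing anywhere else. A minimal sketch, assuming the track_completion helper defined under pattern 8 is in scope, with process_ticket standing in for your real logic and an illustrative 6-minute baseline:

def process_ticket(event: dict) -> bool:
    return True  # placeholder: your real processing or routing logic

def handle_webhook(event: dict) -> None:
    # assumes track_completion from pattern 8 is in scope
    if process_ticket(event):
        # Success branch only: the record was actually processed.
        track_completion("ticket_intake", 6, {"source": event.get("source")})
    # Failure branch: no call, because no human time was saved.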

2. Scheduled batch

A flow runs on a cron schedule and processes a list of records. Examples: a nightly job classifying yesterday's leads, a 6am Salesforce flow updating opportunity stages.

Where to track: inside the per-record loop, once per record successfully processed. Not once per scheduled run. A run that processes 80 records saved 80 records' worth of human time, not one.

This is the pattern teams get wrong most often. Tracking at the run level looks tidy on the cron log but turns into a 50x undercount on the dashboard.
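
A sketch of the per-record rule, again assuming the pattern-8 helper, with classify_lead as a placeholder and an illustrative 5-minute baseline:

def classify_lead(record: dict) -> bool:
    return True  # placeholder: your real per-record work

def run_nightly_batch(records: list[dict]) -> None:
    # assumes track_completion from pattern 8 is in scope
    for record in records:
        if classify_lead(record):
            # One event per record, not per run: an 80-record run saved
            # 80 records' worth of human time.
            track_completion("lead_classification", 5, {"lead_id": record["id"]})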

3. AI agent loop

An LLM agent that reasons in a loop, calling tools, observing results, and continuing until it finishes. Examples: a research agent collecting information across five tools, a triage agent classifying and acting on a ticket.

Where to track: once at the end of the loop, when the agent terminates with a final answer. Not on each tool call. The human alternative is one human session producing one outcome, not one human per tool call.

Agent loops in production tend to run between 1k and 20k completions per month for a focused use case, which sits inside the typical LangSmith observability range. One terminating-event tracking call per agent run.
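
One way to place that call, assuming the same track_completion helper and a hypothetical agent_step that returns a final answer or None. The 25-minute baseline is illustrative:

def agent_step(task: str) -> str | None:
    return "done"  # placeholder: one reason/act/observe iteration

def run_agent(task: str) -> str:
    # assumes track_completion from pattern 8 is in scope
    steps = 0
    answer = None
    while answer is None:
        answer = agent_step(task)  # no tracking here: a tool call is not an outcome
        steps += 1
    # One call at the terminating event, once the agent has a final answer.
    track_completion("research_agent_run", 25, {"loop_steps": steps})
    return answer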

4. Sub-workflow tool

A reusable sub-workflow called by parents to do one specific job. Examples: an enrichment sub-workflow that takes an email and returns LinkedIn data, a translation sub-workflow that takes text and returns a translated string.

Where to track: at the parent, not the sub-workflow. The sub-workflow is a function call. Tracking inside it double-counts the moment a parent calls the same sub twice.

This is also the pattern that breaks the platform-native metric. n8n's Insights docs confirm time saved tracking applies to parent workflows only, with sub-workflow support listed as a future addition. If your tracking layer is platform-agnostic, you sidestep this gap entirely.
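
The double-counting risk is easiest to see in code. A sketch with enrich standing in for the sub-workflow and an illustrative 10-minute baseline:

def enrich(email: str) -> dict:
    return {}  # placeholder: the sub-workflow, a plain function call, no tracking inside

def parent_workflow(contact: dict) -> None:
    # assumes track_completion from pattern 8 is in scope
    profile = enrich(contact["email"])          # first call to the sub
    manager = enrich(contact["manager_email"])  # second call, same sub
    # One call at the parent. Tracking inside enrich() would have counted
    # this single outcome twice.
    track_completion("contact_enrichment", 10, {"contact_id": contact["id"]})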

5. Approval gate

The agent drafts something, a human approves or rejects, and only the approved version ships. Examples: a sales agent drafting outreach for human review, a finance agent generating an invoice memo for sign-off.

Where to track: after the approval gate, on the approval branch only. Tracking before the gate inflates the number with every rejected draft.

This is the single biggest source of dashboard inflation in real teams. Move the node. The drop in reported hours saved when you do this is real, not noise. The remaining number is the one that survives a CFO review.
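
In code, moving the node means the tracking call lives behind the approval check. A sketch, with send_outreach as a placeholder and an illustrative 12-minute baseline:

def send_outreach(draft: dict) -> None:
    pass  # placeholder: ship the approved version

def on_review(draft: dict, approved: bool) -> None:
    # assumes track_completion from pattern 8 is in scope
    if not approved:
        return  # rejected drafts saved no human time: no tracking call
    send_outreach(draft)
    # Approval branch only, after the gate has resolved.
    track_completion("outreach_draft", 12, {"draft_id": draft["id"]})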

6. Multi-channel triage

One workflow handles intake from email, chat, and a form, classifies the request, and routes to the right team. Examples: a support triage handling Intercom, email and form, a sales triage routing inbound demos.

Where to track: once per resolved record, after classification has succeeded and routing has fired. Not on each branch separately. The human alternative is one triage decision per record, regardless of where the record came in.

If different channels have different baseline minutes (email triage takes longer than chat triage in most teams), pass a task_type field to the tracking call so the dashboard splits them. One event, with metadata, beats three branch-specific events.
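
A sketch of the one-event-with-metadata shape, with classify_and_route as a placeholder and assumed per-channel baselines:

BASELINE_MINUTES = {"email": 8, "chat": 3, "form": 5}  # assumed baselines, not measurements

def classify_and_route(record: dict) -> str:
    return "support"  # placeholder: classification plus routing

def triage(record: dict) -> None:
    # assumes track_completion from pattern 8 is in scope
    team = classify_and_route(record)
    channel = record["channel"]
    # One event per resolved record; the channel travels as metadata so
    # the dashboard can split baselines without branch-specific events.
    track_completion(f"triage_{channel}", BASELINE_MINUTES[channel], {"team": team, "channel": channel})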

7. RAG pipeline

Retrieve, augment, generate. The user asks a question, the system retrieves context, an LLM generates an answer. Examples: an internal support copilot answering questions over policies, a customer-facing chatbot grounded in product docs.

Where to track: at the response, after generation succeeds. Not at retrieval, because retrieval without generation is a database query and a database query saves no time on its own.

Volume here can scale fast: 10k to 100k completions per month is realistic for a customer-facing RAG bot. Track per response, with a confidence score in metadata so low-confidence responses can be filtered or weighted differently in reporting.
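
A sketch with retrieve and generate as placeholders; the confidence value and the 4-minute baseline are illustrative:

def retrieve(question: str) -> list[str]:
    return []  # placeholder: your vector search

def generate(question: str, context: list[str]) -> tuple[str, float]:
    return "answer", 0.9  # placeholder: your LLM call, returning (answer, confidence)

def answer_question(question: str) -> str:
    # assumes track_completion from pattern 8 is in scope
    context = retrieve(question)  # retrieval alone saves no time: no call here
    answer, confidence = generate(question, context)
    # Track at the response, after generation succeeds, with confidence in
    # metadata so low-confidence answers can be filtered or down-weighted.
    track_completion("rag_answer", 4, {"confidence": confidence})
    return answer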

8. Custom Python agent

An agent written outside any platform: a standalone Python service, a Lambda, a worker process. Examples: a custom embedding pipeline, a homegrown classifier service, a Python script doing nightly enrichment.

Where to track: at the end of each successful task completion, with one HTTP call to your tracking layer.

import requests

def track_completion(task_type: str, baseline_minutes: int, metadata: dict) -> None:
    # Best-effort delivery: a tracking failure must never break the work itself.
    try:
        requests.post(
            "https://humanhours.dev/api/v1/track",
            headers={"Authorization": "Bearer hh_live_..."},
            json={
                "agent_id": "custom-python-agent",
                "task_type": task_type,
                "outcome": "success",
                "human_baseline_minutes": baseline_minutes,
                "metadata": metadata,
            },
            timeout=2,  # fail fast rather than hold up the worker
        )
    except requests.RequestException:
        pass  # swallow tracking errors; the task itself already succeeded

A 2-second timeout and best-effort delivery keep the tracking layer from ever blocking the actual work. Custom agents are where teams have the most freedom and the least guidance, so this is also where the rule "track at the success boundary, once per outcome" matters most.
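
Dropping that call at the success boundary of, say, the 6am Slack summariser from the intro is then one line per completed task. The 15-minute baseline here is an assumption; use what a human actually spends:

track_completion("slack_thread_summary", 15, {"channel": "#support"})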

The decision tree, in one sentence

Track once per outcome that replaced human work, on the success branch, after any approval gate, at the level the human alternative would have operated at. That covers all eight patterns. The rest is metadata.

The reason this rule survives is that it maps directly to the unit a CFO understands: we did N of these instead of a human, and each one was worth X minutes. Per-call counts, per-batch totals, per-tool tallies all break that mapping. The success-boundary count does not.