HumanHours Data-Quality Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make HumanHours enrichment data genuinely trustworthy, with continuous per-datapoint confidence scored to the integer (74, 83, 88 must be reachable, never snapped to 0.4/0.6/0.8), and fix the verified data-honesty bugs that currently let coarse or wrong numbers reach customers.

Architecture: The enrichment engine (apps/web/lib/enrichment/service.ts) assembles a per-domain record from extracted roles, batched wage resolution, and a confidence rollup. Today confidence is a cost-weighted average over a 5-value tier ladder (packages/core/src/enrichment/confidence.ts), and headcount/roles each collapse to a single tier, so overall lands on ~12 discrete blobs. We replace the discrete tier-score with a continuous per-datapoint scorer driven by source tier (prior), match quality, corroboration, recency, and fallback flags, then roll up continuously. We also fix model-price id matching, wage-fallback honesty, silent catch blocks, the confidence-tier inversion, and add an unmatched cost provenance state.

Tech Stack: Next.js 16 (App Router), TypeScript, Supabase (Postgres) EU, Vitest. Pure scoring logic lives in @agent-metrics/core (unit-tested without DB). Confidence is stored in company_research.confidence jsonb and presented as integer percent.

Measured baseline (prod, 2026-06-05): 12 distinct confidence blobs over 426 researches; 51% of researches share the single blob {roles:0.6,wages:0.8,headcount:0.6,overall:0.71}. Only 12 of 7,607 events (0.16%) carry a model, and 2 of the 5 distinct model strings are unpriced. 81% of events resolve baselines from builtin defaults.

Phase 0: Capture the research

Task 0.1: Save the data-quality roadmap to Second Brain

Files:

Create: /Users/ralf/Documents/Second Brain/My Companies/Triad/Producten/HumanHours/HumanHours - Data Quality Roadmap 2026-06-05.md
Step 1: Write the roadmap note

Write the full workflow output (executive read, 22-item prioritized table, the 5 verified bugs with file:line, the confidence-redesign / NDI wedge section, and the explicitly-dropped items with reasons) as a Markdown note. Front-matter:

---
type: product-roadmap
product: "[[HumanHours.dev]]"
source: "multi-agent deep research 2026-06-05 (run wf_26b618c7-e50)"
related: "[[New Digital Intelligence - HumanHours Partnership]]"
date: "2026-06-05"
tags: [humanhours, roadmap, data-quality, confidence, enrichment, ndi]
---

Link [[New Digital Intelligence - HumanHours Partnership]] in the confidence section (the per-datapoint confidence is the commercial wedge Burian called "goud", Dutch for "gold").

Step 2: If an index note exists in My Companies/Triad/Producten/HumanHours/, add a pointer to the new roadmap note there; otherwise skip.

This task is documentation only, no code, no commit to the agent-metrics repo.

Phase 1: Verified bug quick-wins (data honesty)

Each bug is independently shippable and S-effort. Ship before the confidence rewrite so the rewrite builds on honest inputs.

Task 1.1: Model-id normalization for pricing

Files:

Modify: apps/web/lib/model-prices.ts
Test: apps/web/lib/model-prices.test.ts (create)

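The price book is keyed on the exact OpenRouter id (dots: anthropic/claude-sonnet-4.6), but events send dashes (anthropic/claude-sonnet-4-6) and may omit the provider prefix. lookupModelPrice does an exact prices.get(model), so real models go unpriced and net ROI silently falls back to gross. The canonicalization in Step 2 addresses casing and the dash/dot mismatch only; ids that omit the provider prefix still miss and are surfaced by the unmatched provenance in Task 1.2.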

Step 1: Write failing tests

// apps/web/lib/model-prices.test.ts
import { describe, expect, it } from "vitest";
import { canonicalizeModelId } from "./model-prices";
 
describe("canonicalizeModelId", () => {
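  it("lowercases and keeps a known dotted id stable", () => {
    expect(canonicalizeModelId("anthropic/claude-sonnet-4.6")).toBe("anthropic/claude-sonnet-4.6");
    // Mixed-case input, so the lowercasing claimed in the test name is actually exercised.
    expect(canonicalizeModelId("Anthropic/Claude-Sonnet-4.6")).toBe("anthropic/claude-sonnet-4.6");
  });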
  it("converts a dashed version to a dotted version", () => {
    expect(canonicalizeModelId("anthropic/claude-sonnet-4-6")).toBe("anthropic/claude-sonnet-4.6");
  });
  it("does not mangle a trailing suffix", () => {
    expect(canonicalizeModelId("anthropic/claude-opus-4-8-fast")).toBe(
      "anthropic/claude-opus-4.8-fast",
    );
  });
  it("leaves a single trailing number alone (not a version pair)", () => {
    expect(canonicalizeModelId("openai/gpt-4")).toBe("openai/gpt-4");
  });
});

Run: pnpm exec vitest run apps/web/lib/model-prices.test.ts from repo root. Expected: FAIL (canonicalizeModelId not exported).

Step 2: Implement canonicalize + a second index

In apps/web/lib/model-prices.ts, add the exported helper and index a canonical map alongside the exact one. Only the FIRST -<digit>-<digit> pair becomes <digit>.<digit>, so suffixes like -fast are preserved.

// Canonical key for fuzzy-but-safe matching: lowercase, and collapse the first
// "-<digit>-<digit>" version pair to "<digit>.<digit>" so dashed event ids
// (anthropic/claude-sonnet-4-6) match dotted price ids (…-4.6). Suffixes after
// the version (…-4.8-fast) are preserved.
export function canonicalizeModelId(model: string): string {
  return model
    .trim()
    .toLowerCase()
    .replace(/-(\d+)-(\d+)/, "-$1.$2");
}

Update loadPrices to also build byCanonical, and lookupModelPrice to fall back to it. Keep exact match first so nothing regresses:

let cache: {
  at: number;
  byId: Map<string, ModelPrice>;
  byCanonical: Map<string, ModelPrice>;
} | null = null;
// in loadPrices, after byId.set(...):
//   byCanonical.set(canonicalizeModelId(r.model_id), { ...same... });
// lookupModelPrice:
export async function lookupModelPrice(model: string): Promise<ModelPrice | null> {
  if (!model) return null;
  const { byId, byCanonical } = await loadPricesIndexed();
  return byId.get(model) ?? byCanonical.get(canonicalizeModelId(model)) ?? null;
}

(Refactor loadPrices to loadPricesIndexed returning both maps; preserve the stale-cache-on-error behaviour.)

Step 3: Run tests → PASS. Then pnpm --filter @agent-metrics/web exec tsc --noEmit.
Step 4: Commit fix(pricing): canonical model-id matching so dashed ids resolve a price.
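
The two-index lookup from Step 2 can be sketched as a pure function that is testable without the cache or the DB. This is a minimal sketch, not the real loadPricesIndexed: PriceRow, buildPriceIndex and findPrice are hypothetical names, the price fields are placeholders, and the stale-cache-on-error behaviour is left out.

```typescript
// Hypothetical row shape; the real ModelPrice type lives in model-prices.ts.
interface PriceRow {
  model_id: string;
  input_per_mtok: number; // placeholder field and value, not a real price
  output_per_mtok: number; // placeholder field and value, not a real price
}

interface PriceIndex {
  byId: Map<string, PriceRow>;
  byCanonical: Map<string, PriceRow>;
}

// Same rule as canonicalizeModelId in the plan: lowercase, then turn the first
// "-<digits>-<digits>" pair into "-<digits>.<digits>".
function canonicalizeId(model: string): string {
  return model.trim().toLowerCase().replace(/-(\d+)-(\d+)/, "-$1.$2");
}

// Build the exact and the canonical index in one pass over the price rows.
function buildPriceIndex(rows: PriceRow[]): PriceIndex {
  const byId = new Map<string, PriceRow>();
  const byCanonical = new Map<string, PriceRow>();
  for (const row of rows) {
    byId.set(row.model_id, row);
    byCanonical.set(canonicalizeId(row.model_id), row);
  }
  return { byId, byCanonical };
}

// Exact match first, canonical fallback second, null when neither hits.
function findPrice(index: PriceIndex, model: string): PriceRow | null {
  if (!model) return null;
  return index.byId.get(model) ?? index.byCanonical.get(canonicalizeId(model)) ?? null;
}

const priceIndex = buildPriceIndex([
  { model_id: "anthropic/claude-sonnet-4.6", input_per_mtok: 1, output_per_mtok: 2 },
]);
const dashedHit = findPrice(priceIndex, "anthropic/claude-sonnet-4-6");
```

Because both the price ids and the event ids pass through the same canonicalization, a dashed event id lands on the dotted price row, while an id with no counterpart in the price book still returns null.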

Task 1.2: `unmatched` cost provenance + log the offending id

Files:

Modify: apps/web/app/api/v1/track/route.ts (cost-resolution block, ~lines 220-260)
Modify: packages/types/src/* and packages/sdk-js/src/types.ts (the CostSource union)
Create: packages/db/migrations/0053_cost_source_unmatched.sql

Today a model that fails to match a price and a request with no model at all both record cost_source: 'none', so the class of miss fixed in Task 1.1 is invisible in the data. Add a distinct unmatched value.

Step 1: Migration to widen the CHECK constraint:

-- 0053_cost_source_unmatched.sql
-- Distinguish "caller passed a model we could not price" (unmatched) from
-- "no cost signal at all" (none), so pricing-coverage gaps are measurable.
alter table public.events drop constraint if exists events_agent_cost_source_check;
alter table public.events
  add constraint events_agent_cost_source_check
  check (agent_cost_source is null or agent_cost_source in ('provided','computed','none','unmatched'));

Step 2: Add "unmatched" to the CostSource union in both type files (grep CostSource to find them; the research located packages/sdk-js/src/types.ts:38 and a @agent-metrics/types copy).
Step 3: In track/route.ts, when a model is present but lookupModelPrice returns null, set agent_cost_source = "unmatched" and write cost_basis = { unmatched_model: <model string> } instead of none. Leave none for the genuinely-absent case.
Step 4: Test: extend the track route test (or add one) asserting an unknown model yields cost_source: 'unmatched' and the model string is captured. Run vitest + tsc.
Step 5: Commit feat(track): record unmatched cost provenance for unpriced models.
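
The branching in Step 3 can be sketched as a small pure classifier. This is an illustration under assumptions: classifyCostSource, CostInput and CostProvenance are hypothetical names, and the precedence of a caller-provided cost over everything else is assumed rather than taken from track/route.ts.

```typescript
type CostSource = "provided" | "computed" | "none" | "unmatched";

// Hypothetical input shape for the sketch.
interface CostInput {
  providedCost?: number; // cost sent by the caller, if any
  model?: string; // model id sent by the caller, if any
  priceFound: boolean; // whether lookupModelPrice returned a price
}

interface CostProvenance {
  source: CostSource;
  basis: Record<string, unknown> | null;
}

function classifyCostSource(input: CostInput): CostProvenance {
  // Assumption: an explicit cost from the caller wins over any computed value.
  if (typeof input.providedCost === "number") {
    return { source: "provided", basis: null };
  }
  // Genuinely absent model: keep "none".
  if (!input.model) {
    return { source: "none", basis: null };
  }
  // Model present but unpriced: record the offending id so coverage gaps are measurable.
  if (!input.priceFound) {
    return { source: "unmatched", basis: { unmatched_model: input.model } };
  }
  return { source: "computed", basis: null };
}

const unpriced = classifyCostSource({ model: "vendor/some-new-model", priceFound: false });
const absent = classifyCostSource({ priceFound: false });
```

Keeping the classifier pure makes the Step 4 assertion (unknown model yields unmatched plus the captured string) a plain unit test rather than a route test.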

Task 1.3: Wage-fallback honesty

Files:

Modify: apps/web/lib/enrichment/wage-provider.ts
Modify: apps/web/lib/enrichment/service.ts (the HARD_FALLBACK_GROSS_EUR path + the silent tier upgrade)
Step 1: In wage-provider.ts, the no-source LLM estimate already sets tier: "llm_inferred" (good). Confirm it never claims a source. If any path writes a bare estimate with a source label, change the stored source to "llm_estimate" and do not set a source URL.
Step 2: In service.ts, the hard-fallback wage row (HARD_FALLBACK_GROSS_EUR) must be flagged. Add is_fallback: true to that resolution and carry it into the wage record so the report and the confidence scorer can see it (consumed in Phase 2).
Step 3: Remove the silent upgrade of a missing tier to seeded_reference (0.6); default a missing tier DOWN to llm_inferred (or hard_fallback when there is truly nothing).
Step 4: Test the resolution helper (extract a pure function if needed) asserting a fallback wage carries is_fallback. Run vitest + tsc. Commit fix(enrichment): flag fallback wages instead of laundering them to seeded.
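
A minimal sketch of the pure resolution helper that Step 4 suggests extracting. normalizeTier and resolveWage are hypothetical names, and hardFallbackGrossEur is a parameter standing in for HARD_FALLBACK_GROSS_EUR, whose value is not given in this plan.

```typescript
type ConfidenceTier =
  | "official_statistic"
  | "fetched_cited"
  | "seeded_reference"
  | "llm_inferred"
  | "hard_fallback";

interface WageResolution {
  grossEur: number;
  tier: ConfidenceTier;
  is_fallback: boolean;
}

// A missing tier defaults DOWN: llm_inferred when some estimate exists,
// hard_fallback when there is truly nothing. Never up to seeded_reference.
function normalizeTier(tier: ConfidenceTier | undefined, hasEstimate: boolean): ConfidenceTier {
  if (tier) return tier;
  return hasEstimate ? "llm_inferred" : "hard_fallback";
}

// estimate === null models "no wage could be resolved at all".
function resolveWage(
  estimate: { grossEur: number; tier?: ConfidenceTier } | null,
  hardFallbackGrossEur: number,
): WageResolution {
  if (estimate === null) {
    // The hard-fallback row is flagged so the report and the scorer can see it.
    return { grossEur: hardFallbackGrossEur, tier: "hard_fallback", is_fallback: true };
  }
  return {
    grossEur: estimate.grossEur,
    tier: normalizeTier(estimate.tier, true),
    is_fallback: false,
  };
}

const fallbackWage = resolveWage(null, 1);
const untieredWage = resolveWage({ grossEur: 2 }, 1);
```

The numeric arguments are arbitrary test values; only the flag and the tier are of interest here.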

Task 1.4: Distinguish `lookup_error` from `not_found`

Files:

Modify: apps/web/lib/enrichment/service.ts (the four catch { } blocks around the wage/cache resolution)
Step 1: Replace the silent catch {} blocks with handlers that record a resolution_status of "resolved" | "not_found" | "lookup_error". A DB/network error must NOT degrade a role to hard_fallback in a way that is indistinguishable from a genuine miss.
Step 2: Exclude lookup_error datapoints from the confidence rollup (or mark them for retry) rather than scoring them as low-confidence real data.
Step 3: Test + tsc. Commit fix(enrichment): surface lookup errors instead of swallowing them as fallbacks.
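
One way to shape the three-state result is a small wrapper around each lookup plus a filter applied before the rollup. This is a sketch under assumptions: classifyLookup, rollupEligible and the LookupResult type are hypothetical, and a lookup that returns null is assumed to mean a genuine miss.

```typescript
type LookupResult<T> =
  | { resolution_status: "resolved"; value: T }
  | { resolution_status: "not_found" }
  | { resolution_status: "lookup_error"; error: string };

// Wrap a lookup so a thrown DB/network error is kept apart from a genuine miss.
async function classifyLookup<T>(run: () => Promise<T | null>): Promise<LookupResult<T>> {
  try {
    const value = await run();
    return value === null
      ? { resolution_status: "not_found" }
      : { resolution_status: "resolved", value };
  } catch (err) {
    // Log instead of swallowing, so an outage is visible.
    const message = err instanceof Error ? err.message : String(err);
    console.warn("wage lookup failed:", message);
    return { resolution_status: "lookup_error", error: message };
  }
}

// Drop lookup_error datapoints before the confidence rollup (Step 2);
// genuine misses stay in and are scored as fallbacks.
function rollupEligible<T>(results: LookupResult<T>[]): LookupResult<T>[] {
  return results.filter((r) => r.resolution_status !== "lookup_error");
}

const eligible = rollupEligible<number>([
  { resolution_status: "resolved", value: 1 },
  { resolution_status: "lookup_error", error: "timeout" },
  { resolution_status: "not_found" },
]);
```

Marking the excluded datapoints for retry, the alternative named in Step 2, would replace the filter with a partition into two lists.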

Task 1.5: Confidence-tier ordering + official-source over-tagging

Files:

Modify: packages/core/src/enrichment/confidence.ts (handled by the Phase 2 rewrite, see Task 2.1: official_statistic prior now ABOVE fetched_cited)
Modify: apps/web/lib/enrichment/wage-provider.ts (OFFICIAL_SOURCE regex) and the BE estimated-split rows
Step 1: The tier inversion (fetched_cited 0.95 > official_statistic 0.8) is fixed by the Phase 2 priors (official 0.90 > fetched_cited 0.78). No separate change needed; cross-reference Task 2.1.
Step 2: The OFFICIAL_SOURCE regex tags any .gov/eurostat URL as official, including the seeded BE rows whose role split is explicitly ESTIMATED (migration 0046). Mark those seed rows with a flag (e.g. estimated_split: true) and have the scorer treat them as seeded_reference with matchQuality < 1, not official_statistic. Verify against 0046_wage_reference_official.sql.
Step 3: Test + tsc. Commit fix(enrichment): stop tagging estimated wage splits as official statistics.
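
The intended effect of the estimated_split flag in Step 2 can be sketched as below. Everything here is illustrative: OFFICIAL_SOURCE_SKETCH is a stand-in for the real OFFICIAL_SOURCE regex (not reproduced in this plan), classifySeedRow and SeedRow are hypothetical names, the helper only covers seed rows, and the 0.6 match quality is the value proposed in Task 2.2.

```typescript
type SeedTier = "official_statistic" | "seeded_reference";

interface SeedRow {
  source_url: string;
  estimated_split?: boolean; // hypothetical flag proposed in Step 2
}

// Illustrative stand-in only: matches ".gov" hosts and any URL mentioning eurostat.
const OFFICIAL_SOURCE_SKETCH = /\.gov\b|eurostat/i;

function classifySeedRow(row: SeedRow): { tier: SeedTier; matchQuality: number } {
  const looksOfficial = OFFICIAL_SOURCE_SKETCH.test(row.source_url);
  // An official-looking URL is not enough: an estimated role split is not a statistic.
  if (looksOfficial && !row.estimated_split) {
    return { tier: "official_statistic", matchQuality: 1 };
  }
  return { tier: "seeded_reference", matchQuality: row.estimated_split ? 0.6 : 1 };
}

const estimatedRow = classifySeedRow({
  source_url: "https://ec.europa.eu/eurostat",
  estimated_split: true,
});
const officialRow = classifySeedRow({ source_url: "https://ec.europa.eu/eurostat" });
```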

Phase 2: Continuous per-datapoint confidence (the headline)

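Replace the discrete tier-score rollup with a continuous scorer so confidence is realistic to the integer. Design goal: differentiated inputs should not collapse onto a small grid. The only deliberate plateaus are the 0.3 cap on flagged fallbacks and the [0.05, 0.98] clamp.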

Task 2.1: Continuous confidence scorer in `@agent-metrics/core`

Files:

Modify: packages/core/src/enrichment/confidence.ts
Test: packages/core/src/enrichment/confidence.test.ts (create)
Step 1: Write failing tests

// packages/core/src/enrichment/confidence.test.ts
import { describe, expect, it } from "vitest";
import { datapointConfidence, rollupConfidence, toConfidencePct, tierPrior } from "./confidence";
 
describe("datapointConfidence", () => {
  it("ranks official statistics above a random cited page", () => {
    expect(tierPrior("official_statistic")).toBeGreaterThan(tierPrior("fetched_cited"));
  });
  it("produces continuous, non-grid values", () => {
    const a = datapointConfidence({
      tier: "official_statistic",
      matchQuality: 1,
      corroboration: 2,
      staleYears: 1,
    });
    const b = datapointConfidence({
      tier: "official_statistic",
      matchQuality: 0.7,
      corroboration: 1,
      staleYears: 3,
    });
    expect(a).not.toBeCloseTo(b, 2);
    // not snapped to 0.4/0.6/0.8
    for (const v of [a, b])
      expect([0.4, 0.6, 0.8].some((g) => Math.abs(v - g) < 0.005)).toBe(false);
  });
  it("caps a flagged fallback regardless of tier", () => {
    expect(
      datapointConfidence({ tier: "official_statistic", isFallback: true }),
    ).toBeLessThanOrEqual(0.3);
  });
  it("stays within [0.05, 0.98]", () => {
    expect(
      datapointConfidence({ tier: "hard_fallback", matchQuality: 0.4 }),
    ).toBeGreaterThanOrEqual(0.05);
    expect(
      datapointConfidence({ tier: "official_statistic", corroboration: 99 }),
    ).toBeLessThanOrEqual(0.98);
  });
});
 
describe("rollupConfidence", () => {
  it("is the cost-weighted mean of continuous scores", () => {
    const r = rollupConfidence([
      { weight: 100, score: 0.9 },
      { weight: 100, score: 0.5 },
    ]);
    expect(r).toBeCloseTo(0.7, 5);
  });
  it("returns 0 on no weight", () => {
    expect(rollupConfidence([])).toBe(0);
  });
});
 
describe("toConfidencePct", () => {
  it("rounds to integer percent so 0.743 -> 74", () => {
    expect(toConfidencePct(0.743)).toBe(74);
    expect(toConfidencePct(0.835)).toBe(84);
  });
});

Run: pnpm exec vitest run packages/core/src/enrichment/confidence.test.ts. Expected: FAIL.

Step 2: Rewrite confidence.ts (full replacement)

import type { ConfidenceTier } from "@agent-metrics/types";
 
// Continuous trust prior per source tier (0..1). official_statistic ranks ABOVE
// fetched_cited: Eurostat/BLS must beat a random salary page. These are PRIORS,
// not final scores; per-datapoint modifiers move them continuously so confidence
// is realistic to the integer (74, 83, 88), never snapped to a {0.4,0.6,0.8} grid.
const TIER_PRIOR: Record<ConfidenceTier, number> = {
  official_statistic: 0.9,
  fetched_cited: 0.78,
  seeded_reference: 0.62,
  llm_inferred: 0.42,
  hard_fallback: 0.18,
};
 
export function tierPrior(tier: ConfidenceTier): number {
  return TIER_PRIOR[tier];
}
 
const clamp = (x: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, x));
 
export interface DatapointSignals {
  tier: ConfidenceTier;
  matchQuality?: number; // 1 = exact role+country match; <1 for DEFAULT-country fallback / estimated split
  corroboration?: number; // independent agreeing sources, >=1; saturating bonus
  staleYears?: number; // years the source is beyond the reference year
  isFallback?: boolean; // explicit fallback/estimate; hard-caps the score
}
 
// The single continuous scorer. A continuous matchQuality multiplier, a
// saturating corroboration bonus and a linear recency penalty keep
// differentiated inputs off a round-number grid. The only plateaus are the
// 0.3 fallback cap and the [0.05, 0.98] clamp.
export function datapointConfidence(s: DatapointSignals): number {
  const prior = TIER_PRIOR[s.tier];
  const match = clamp(s.matchQuality ?? 1, 0.4, 1);
  const n = Math.max(1, s.corroboration ?? 1);
  const corroborationBonus = 0.1 * (1 - 1 / n); // 0 at n=1, ~0.067 at n=3, ->0.1
  const recencyPenalty = 0.02 * Math.max(0, s.staleYears ?? 0);
  let score = prior * match + corroborationBonus - recencyPenalty;
  if (s.isFallback) score = Math.min(score, 0.3);
  return clamp(score, 0.05, 0.98);
}
 
export interface WeightedDatapoint {
  weight: number; // cost weight (EUR)
  score: number; // continuous confidence 0..1
}
 
// Cost-weighted mean of continuous per-datapoint scores. Continuous in/out.
export function rollupConfidence(items: WeightedDatapoint[]): number {
  const total = items.reduce((s, i) => s + Math.max(0, i.weight), 0);
  if (total <= 0) return 0;
  const weighted = items.reduce((s, i) => s + Math.max(0, i.weight) * i.score, 0);
  return weighted / total;
}
 
// Round a 0..1 confidence to integer-percent precision (0.743 -> 74).
export function toConfidencePct(x: number): number {
  return Math.round(clamp(x, 0, 1) * 100);
}

Step 3: Update the barrel export if tierScore/WeightedTier are re-exported from packages/core/src/index.ts; replace with tierPrior/datapointConfidence/WeightedDatapoint/toConfidencePct.
Step 4: Run tests → PASS. pnpm --filter @agent-metrics/core exec tsc --noEmit. Commit feat(confidence): continuous per-datapoint confidence scorer.
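
As a worked check that the target values from the Goal (74, 83, 88) are reachable, the sketch below restates the scorer from Step 2 in compact form and evaluates three inputs. The shortened names (PRIOR, bound, scoreDatapoint, asPct) are local to this example; the formula and priors are the ones given above.

```typescript
type Tier =
  | "official_statistic"
  | "fetched_cited"
  | "seeded_reference"
  | "llm_inferred"
  | "hard_fallback";

// Priors copied from the TIER_PRIOR table in Step 2.
const PRIOR: Record<Tier, number> = {
  official_statistic: 0.9,
  fetched_cited: 0.78,
  seeded_reference: 0.62,
  llm_inferred: 0.42,
  hard_fallback: 0.18,
};

const bound = (x: number, lo: number, hi: number): number => Math.min(hi, Math.max(lo, x));

interface Signals {
  tier: Tier;
  matchQuality?: number;
  corroboration?: number;
  staleYears?: number;
  isFallback?: boolean;
}

// score = prior * match + 0.1 * (1 - 1/n) - 0.02 * staleYears, capped and clamped.
function scoreDatapoint(s: Signals): number {
  const match = bound(s.matchQuality ?? 1, 0.4, 1);
  const n = Math.max(1, s.corroboration ?? 1);
  let score = PRIOR[s.tier] * match + 0.1 * (1 - 1 / n) - 0.02 * Math.max(0, s.staleYears ?? 0);
  if (s.isFallback) score = Math.min(score, 0.3);
  return bound(score, 0.05, 0.98);
}

const asPct = (x: number): number => Math.round(bound(x, 0, 1) * 100);

// 0.78 - 0.02 * 2 = 0.74
const pct74 = asPct(scoreDatapoint({ tier: "fetched_cited", staleYears: 2 }));
// 0.78 + 0.1 * (1 - 1/2) = 0.83
const pct83 = asPct(scoreDatapoint({ tier: "fetched_cited", corroboration: 2 }));
// 0.90 - 0.02 * 1 = 0.88
const pct88 = asPct(scoreDatapoint({ tier: "official_statistic", staleYears: 1 }));

console.log(pct74, pct83, pct88); // 74 83 88
```

A cited page that is two years stale, a cited page with one corroborating source, and a one-year-old official statistic land on 74, 83 and 88 respectively, none of which the old tier ladder could produce.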

Task 2.2: Wire continuous scoring into the enrichment assembly

Files:

Modify: apps/web/lib/enrichment/service.ts (lines ~290-359)
Modify: the RoleWageResolution type to carry matchQuality/isFallback/staleYears
Step 1: Where wages resolve (the cache / seeded / researched / hard-fallback branches around lines 77-208), attach per-datapoint signals to RoleWageResolution:
- exact role+country hit → matchQuality: 1
- DEFAULT-country fallback → matchQuality: 0.7
- seeded estimated-split row (BE, Task 1.5) → matchQuality: 0.6
- hard fallback → isFallback: true
- staleYears = current reference year minus the row's year (from wage_reference).
- corroboration = number of sources that agreed (1 unless the batch returned multiple).
Step 2: Replace weighted.push({ weight: annual, tier }) with:

const score = datapointConfidence({
  tier: res.tier,
  matchQuality: res.matchQuality,
  corroboration: res.corroboration,
  staleYears: res.staleYears,
  isFallback: res.isFallback,
});
weighted.push({ weight: annual, score });
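
The Step 1 mapping from resolution branch to signals can be sketched as one pure function. The branch names, signalsFor and WageSignals are hypothetical; the match-quality values are the ones listed in Step 1, and treating an unknown row year as zero staleness is an assumption.

```typescript
// Hypothetical labels for the resolution branches described in Step 1.
type WageBranch = "exact" | "default_country" | "estimated_split" | "hard_fallback";

interface WageSignals {
  matchQuality: number;
  isFallback: boolean;
  staleYears: number;
  corroboration: number;
}

function signalsFor(
  branch: WageBranch,
  rowYear: number | null, // year of the wage_reference row, if known
  referenceYear: number, // current reference year
  agreeingSources: number, // sources that agreed in the batch
): WageSignals {
  const matchQuality =
    branch === "default_country" ? 0.7 : branch === "estimated_split" ? 0.6 : 1;
  return {
    matchQuality,
    isFallback: branch === "hard_fallback",
    // Assumption: an unknown row year contributes no recency penalty.
    staleYears: rowYear === null ? 0 : Math.max(0, referenceYear - rowYear),
    corroboration: Math.max(1, agreeingSources),
  };
}

const defaultCountry = signalsFor("default_country", 2024, 2026, 1);
const hardFallback = signalsFor("hard_fallback", null, 2026, 0);
```

The result spreads directly into the datapointConfidence call from Step 2 once the tier is added.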

Step 3: Make headcount + roles confidence continuous (replace the single-tier collapse at lines 345-355):

// headcount: continuous, blends source strength with how well the role split
// reconciles to the stated headcount (a real cross-check, doubles as a tripwire).
const rolesSum = [...headcountByRole.values()].reduce((a, b) => a + b, 0);
const reconcile =
  headcountKnown && extracted.headcount_estimate! > 0
    ? clamp(
        1 - Math.abs(rolesSum - extracted.headcount_estimate!) / extracted.headcount_estimate!,
        0.4,
        1,
      )
    : 0.4;
const headcountConfidence = datapointConfidence({
  tier: headcountKnown ? (sourceCount >= 2 ? "seeded_reference" : "llm_inferred") : "hard_fallback",
  matchQuality: reconcile,
  corroboration: Math.max(1, sourceCount),
});
// roles: continuous, scaled by taxonomy-mapping coverage (Phase 3 Task 3.2 feeds
// `taxonomyCoverage`; default 0.7 until then).
const rolesConfidence = datapointConfidence({
  tier: sourceCount >= 3 ? "seeded_reference" : "llm_inferred",
  matchQuality: taxonomyCoverage ?? 0.7,
  corroboration: Math.max(1, sourceCount),
});

(clamp is module-private in confidence.ts as written in Task 2.1, so either export it from there or add a small local helper in service.ts.)
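
The reconcile term from Step 3 is easiest to read with numbers. The sketch below isolates it as a pure function; reconcileQuality and clampRange are hypothetical names, and the formula is the one in the snippet above.

```typescript
const clampRange = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

// 1 minus the relative gap between the summed roles and the stated headcount,
// floored at 0.4. An unknown or non-positive headcount also yields the 0.4 floor.
function reconcileQuality(rolesSum: number, statedHeadcount: number | null): number {
  if (statedHeadcount === null || statedHeadcount <= 0) return 0.4;
  const relativeGap = Math.abs(rolesSum - statedHeadcount) / statedHeadcount;
  return clampRange(1 - relativeGap, 0.4, 1);
}

const closeMatch = reconcileQuality(90, 100); // gap 10% -> 0.9
const exactMatch = reconcileQuality(100, 100); // gap 0% -> 1
const wildMiss = reconcileQuality(300, 100); // gap 200% -> floored at 0.4
const unknownHeadcount = reconcileQuality(120, null); // nothing to reconcile -> 0.4
```

Since datapointConfidence clamps matchQuality to [0.4, 1] itself, the clamp here is redundant for scoring but keeps the stored signal within the same range.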

Step 4: Keep the overall weighting but store continuous values to 2 dp (already done via Math.round(x*100)/100); confirm overall, wages, headcount, roles are each continuous. Optionally also store integer-percent mirrors using toConfidencePct if the UI prefers ints.
Step 5: tsc + run the enrichment tests. The wiring cannot be unit-tested without a DB, so if the logic is pulled out into a pure helper, add or extend a unit test for that helper; otherwise rely on verify-enrichment.ts (Task 2.4). Commit feat(enrichment): continuous confidence across wages, headcount and roles.

Task 2.3: Persist per-datapoint provenance

Files:

Modify: apps/web/lib/enrichment/service.ts (the company_research upsert)
Modify: packages/types/src/enrichment.ts (extend the per-role/per-wage shape with an optional numeric confidence_score alongside the tier)
Step 1: Today roles[].confidence and wages[].confidence store only the tier enum. Add an optional confidence_score: number (0..1) and the signals used (match_quality, corroboration, stale_years, is_fallback) so the score is auditable and re-rollable. Keep the tier enum for backward-compatible reads (schema is .passthrough() on nested objects, so old rows still parse).
Step 2: Write these per-datapoint scores into the persisted record. No migration needed (jsonb columns).
Step 3: tsc + types test. Commit feat(enrichment): persist per-datapoint confidence + provenance.
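
To illustrate what "auditable and re-rollable" could mean in practice, the sketch below shows one possible persisted wage shape and a re-roll over it. The snake_case signal fields follow Step 1; PersistedWage, rerollWages, annual_cost_eur and the choice to skip legacy rows without a score are assumptions made for this example.

```typescript
// Hypothetical persisted shape for one wage datapoint.
interface PersistedWage {
  role: string;
  annual_cost_eur: number; // hypothetical name for the cost weight
  confidence: string; // legacy tier enum, kept for backward-compatible reads
  confidence_score?: number; // new: continuous score in 0..1
  match_quality?: number;
  corroboration?: number;
  stale_years?: number;
  is_fallback?: boolean;
}

// Recompute the cost-weighted wage confidence from persisted scores.
// Legacy rows without confidence_score are skipped; null means nothing to re-roll.
function rerollWages(rows: PersistedWage[]): number | null {
  let total = 0;
  let weighted = 0;
  for (const row of rows) {
    if (typeof row.confidence_score !== "number") continue;
    const weight = Math.max(0, row.annual_cost_eur);
    total += weight;
    weighted += weight * row.confidence_score;
  }
  return total > 0 ? weighted / total : null;
}

const rerolled = rerollWages([
  { role: "a", annual_cost_eur: 100, confidence: "official_statistic", confidence_score: 0.9 },
  { role: "b", annual_cost_eur: 100, confidence: "llm_inferred", confidence_score: 0.5 },
  { role: "c", annual_cost_eur: 500, confidence: "seeded_reference" }, // legacy row, no score
]);
```

With the signals stored next to the score, a later change to the priors can re-score old records without re-running enrichment.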

Task 2.4: Backfill verification + smoke

Files:

Modify: packages/db/scripts/verify-enrichment.ts
Step 1: Add an assertion path that, after re-running enrichment on a sample of domains, checks that the distinct-confidence-blob count rises well above the baseline of 12 (target: several dozen or more) and that no two materially different companies share an identical overall. This is a coarse guard against re-introducing the grid.
Step 2: Run verify-enrichment.ts against a handful of domains locally (read-only). Confirm continuous values like 0.74/0.83/0.88 appear. Do NOT mass-refresh prod here (that consumes lookups); a manual targeted refresh is a separate, owner-approved step.
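
The blob-count guard in Step 1 could be built on two small helpers like these. This is a sketch, not the contents of verify-enrichment.ts: ConfidenceBlob, blobKey, distinctBlobCount and largestBlobShare are hypothetical names, and the second sample record uses made-up values.

```typescript
interface ConfidenceBlob {
  roles: number;
  wages: number;
  headcount: number;
  overall: number;
}

// Two blobs are "the same" when all four values agree to 2 decimal places.
function blobKey(c: ConfidenceBlob): string {
  return [c.roles, c.wages, c.headcount, c.overall].map((v) => v.toFixed(2)).join("|");
}

function distinctBlobCount(blobs: ConfidenceBlob[]): number {
  return new Set(blobs.map(blobKey)).size;
}

// Share of records sitting on the single most common blob (0.51 at baseline).
function largestBlobShare(blobs: ConfidenceBlob[]): number {
  if (blobs.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const b of blobs) {
    const key = blobKey(b);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let max = 0;
  counts.forEach((c) => {
    if (c > max) max = c;
  });
  return max / blobs.length;
}

const sample: ConfidenceBlob[] = [
  { roles: 0.6, wages: 0.8, headcount: 0.6, overall: 0.71 }, // the dominant baseline blob
  { roles: 0.6, wages: 0.8, headcount: 0.6, overall: 0.71 },
  { roles: 0.58, wages: 0.83, headcount: 0.61, overall: 0.74 }, // made-up continuous record
];
const sampleDistinct = distinctBlobCount(sample);
const sampleShare = largestBlobShare(sample);
```

Tracking both numbers matters: the distinct count can rise while one blob still dominates, which is what the 51% baseline figure captures.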

Phase 3: Enrichment quality hardening (makes the numbers defensible)

These move the underlying accuracy, not just the confidence label. Sequenced after Phase 2 because several feed its signals.

Task 3.1: Headcount sanity bounds + reconcile tripwire

Files: apps/web/lib/enrichment/service.ts, packages/core/src/enrichment/* (a pure headcountSanity helper + test)

Bound extracted headcount against gross implausibility (decimal/thousands mis-scale, ~1000x outliers); when sum(roles) diverges hard from stated headcount, lower matchQuality (already wired in Task 2.2) AND log a tripwire. Pure helper, unit-tested. Commit per the task.
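
A possible shape for the pure headcountSanity helper is sketched below. The verdict names and both limits are hypothetical placeholders chosen for illustration, not measured thresholds; they would need tuning against real data.

```typescript
type HeadcountVerdict = "ok" | "non_positive" | "out_of_bounds" | "mis_scaled";

interface SanityLimits {
  maxPlausible: number; // upper bound on a believable headcount
  misScaleRatio: number; // stated vs summed-roles ratio that signals a scale error
}

// Hypothetical defaults for the sketch only.
const DEFAULT_LIMITS: SanityLimits = { maxPlausible: 5000000, misScaleRatio: 100 };

function headcountSanity(
  stated: number,
  rolesSum: number,
  limits: SanityLimits = DEFAULT_LIMITS,
): HeadcountVerdict {
  if (!isFinite(stated) || stated <= 0) return "non_positive";
  if (stated > limits.maxPlausible) return "out_of_bounds";
  if (rolesSum > 0) {
    // Symmetric ratio: catches both "1.2" read as 1 and "1,200" read as 1200000.
    const ratio = Math.max(stated / rolesSum, rolesSum / stated);
    if (ratio >= limits.misScaleRatio) return "mis_scaled";
  }
  return "ok";
}

const plausible = headcountSanity(1200, 1150);
const misScaled = headcountSanity(1, 1200);
```

Any verdict other than "ok" is the point at which the caller would log the tripwire; milder divergence is already absorbed by the reconcile term in Task 2.2.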

Task 3.2: Role normalization to a standard taxonomy

Files: packages/core/src/enrichment/normalize-roles.ts, tests

Replace greedy regex bucketing with a mapping to ISCO-08 / ESCO occupation groups; emit a taxonomyCoverage fraction (mapped roles / total) that feeds rolesConfidence (Task 2.2 Step 3). Log unmapped roles. Pure + unit-tested. Commit per the task.
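
The taxonomyCoverage fraction can be sketched with a tiny lookup table. The two entries are ISCO-08 unit groups (2512 Software developers, 2411 Accountants) included purely for illustration; a real mapping would be far larger, and mapRoles, ISCO_SKETCH and the exact-title matching are assumptions of this example.

```typescript
// Illustrative title -> ISCO-08 unit group table (two entries only).
const ISCO_SKETCH = new Map<string, string>([
  ["software developer", "2512"],
  ["accountant", "2411"],
]);

interface RoleMapping {
  mapped: { title: string; isco: string }[];
  unmapped: string[];
  taxonomyCoverage: number; // mapped roles / total roles
}

function mapRoles(titles: string[]): RoleMapping {
  const mapped: { title: string; isco: string }[] = [];
  const unmapped: string[] = [];
  for (const title of titles) {
    const code = ISCO_SKETCH.get(title.trim().toLowerCase());
    if (code !== undefined) mapped.push({ title, isco: code });
    else unmapped.push(title); // the caller logs these
  }
  // Assumption: an empty role list counts as zero coverage.
  const taxonomyCoverage = titles.length === 0 ? 0 : mapped.length / titles.length;
  return { mapped, unmapped, taxonomyCoverage };
}

const roleMapping = mapRoles(["Software Developer", "Accountant", "Chief Vibes Officer"]);
```

Here two of three titles map, so taxonomyCoverage is 2/3 and would replace the 0.7 default used for rolesConfidence in Task 2.2.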

Task 3.3: SDK auto-capture of model + tokens

Files: packages/sdk-js/*, docs

In the JS SDK wrappers (and documented adapters for Vercel AI / Anthropic / OpenAI), auto-populate model, tokens_in, tokens_out when available so net-ROI coverage rises from its 0.16% baseline (12 of 7,607 events carry a model). This is an adoption lever, not a server change. Tests on the wrapper. Commit per the task.
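
Auto-capture could start from a defensive extractor over the provider response. The sketch below reads the usage field names used by Anthropic Messages responses (input_tokens / output_tokens) and OpenAI Chat Completions responses (prompt_tokens / completion_tokens); extractUsage and CapturedUsage are hypothetical names, and how the SDK wrapper would call it is not specified here.

```typescript
interface CapturedUsage {
  model?: string;
  tokens_in?: number;
  tokens_out?: number;
}

// Pull model and token counts out of a provider response without trusting its shape.
function extractUsage(resp: unknown): CapturedUsage {
  if (typeof resp !== "object" || resp === null) return {};
  const r = resp as { model?: unknown; usage?: unknown };
  const out: CapturedUsage = {};
  if (typeof r.model === "string") out.model = r.model;
  if (typeof r.usage === "object" && r.usage !== null) {
    const u = r.usage as Record<string, unknown>;
    // Anthropic-style names first, OpenAI-style names as fallback.
    const tokensIn = u["input_tokens"] ?? u["prompt_tokens"];
    const tokensOut = u["output_tokens"] ?? u["completion_tokens"];
    if (typeof tokensIn === "number") out.tokens_in = tokensIn;
    if (typeof tokensOut === "number") out.tokens_out = tokensOut;
  }
  return out;
}

const anthropicStyle = extractUsage({
  model: "example-model",
  usage: { input_tokens: 120, output_tokens: 45 },
});
const openaiStyle = extractUsage({
  model: "example-model",
  usage: { prompt_tokens: 80, completion_tokens: 20 },
});
```

Fields that are absent stay undefined rather than defaulting to zero, so a missing token count is not mistaken for a free call.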

Task 3.4 (LARGER, schedule explicitly): broaden wage reference

Files: packages/db/migrations/00XX_wage_reference_eurostat.sql, an ingestion script

Pull Eurostat SES (earn_ses) keyed by ISCO-08 across ~27 EU members to replace the 7-country/14-role/2024 seed with broad official_statistic coverage; key wage_reference on ISCO codes. Effort: L, data-acquisition heavy. This is real data work, not a code refactor; scope and source-verify before starting. Until done, the DEFAULT fallback + matchQuality<1 (Task 2.2) keeps confidence honest.

Out of scope (verified-rejected in the deep research, do NOT build)

Anthropic Citations API for per-datapoint source binding (incompatible with our OpenRouter + structured-output path). Build the plain fetch-and-verify-the-number-is-on-the-page loop instead if/when needed.
Tiered family-fallback pricing (median-of-class): net_saved is not clamped to >= 0, so an over-estimated model cost would push reported ROI below its true value, possibly negative; it also conflicts with the "no speculative savings" rule. Keep exact + canonical matching only.
Bayesian baseline calibration on events: requires measured human time, which the product does not collect (agent_duration_seconds is agent runtime). Defer until a real human-time capture exists.
Country dimension on task baselines: no credible source for country-varying task minutes. Only the cheap effective_year versioning is worth doing.

Self-Review

Confidence granularity (the core ask): Task 2.1 + 2.2 make every dimension a continuous function of continuous inputs (matchQuality, corroboration, staleYears), and Task 2.1 tests explicitly assert non-grid output and toConfidencePct(0.743)===74. Requirement met.
Type consistency: tierScore/WeightedTier are removed and replaced everywhere they were imported (service.ts, barrel). Verify with a repo-wide grep for tierScore before finishing.
No placeholders: the novel/tricky code (canonicalize, the scorer) is given in full; bug fixes carry exact files + test specs.
Scope honesty: Phase 3.4 (wage breadth) is flagged L / data-acquisition and gated; not bundled into the confidence work.