For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Make HumanHours enrichment data genuinely trustworthy, with continuous per-datapoint confidence scored to the integer (74, 83, 88 must be reachable, never snapped to 0.4/0.6/0.8), and fix the verified data-honesty bugs that currently let coarse or wrong numbers reach customers.
Architecture: The enrichment engine (apps/web/lib/enrichment/service.ts) assembles a per-domain record from extracted roles, batched wage resolution, and a confidence rollup. Today confidence is a cost-weighted average over a 5-value tier ladder (packages/core/src/enrichment/confidence.ts), and headcount/roles each collapse to a single tier, so overall lands on ~12 discrete blobs. We replace the discrete tier-score with a continuous per-datapoint scorer driven by source tier (prior), match quality, corroboration, recency, and fallback flags, then roll up continuously. We also fix model-price id matching, wage-fallback honesty, silent catch blocks, the confidence-tier inversion, and add an unmatched cost provenance state.
Tech Stack: Next.js 16 (App Router), TypeScript, Supabase (Postgres) EU, Vitest. Pure scoring logic lives in @agent-metrics/core (unit-tested without DB). Confidence is stored in company_research.confidence jsonb and presented as integer percent.
Measured baseline (prod, 2026-06-05): 12 distinct confidence blobs over 426 researches; 51% share {roles:0.6,wages:0.8,headcount:0.6,overall:0.71}. Only 12/7607 events carry a model; 2 of 5 distinct model strings are unpriced. 81% of events resolve baselines from builtin defaults.
Phase 0: Capture the research
Task 0.1: Save the data-quality roadmap to Second Brain
Files:
- Create: /Users/ralf/Documents/Second Brain/My Companies/Triad/Producten/HumanHours/HumanHours - Data Quality Roadmap 2026-06-05.md
- Step 1: Write the roadmap note
Write the full workflow output (executive read, 22-item prioritized table, the 5 verified bugs with file:line, the confidence-redesign / NDI wedge section, and the explicitly-dropped items with reasons) as a Markdown note. Front-matter:
---
type: product-roadmap
product: "[[HumanHours.dev]]"
source: "multi-agent deep research 2026-06-05 (run wf_26b618c7-e50)"
related: "[[New Digital Intelligence - HumanHours Partnership]]"
date: "2026-06-05"
tags: [humanhours, roadmap, data-quality, confidence, enrichment, ndi]
---
Link [[New Digital Intelligence - HumanHours Partnership]] in the confidence section (per-datapoint confidence is the commercial wedge Burian called "goud", Dutch for "gold").
- Step 2: Add a pointer in My Companies/Triad/Producten/HumanHours/index if one exists; otherwise skip.
This task is documentation only, no code, no commit to the agent-metrics repo.
Phase 1: Verified bug quick-wins (data honesty)
Each bug is independently shippable and S-effort. Ship before the confidence rewrite so the rewrite builds on honest inputs.
Task 1.1: Model-id normalization for pricing
Files:
- Modify: apps/web/lib/model-prices.ts
- Test: apps/web/lib/model-prices.test.ts (create)
The price book is keyed on the exact OpenRouter id (dots: anthropic/claude-sonnet-4.6), but events send dashes (anthropic/claude-sonnet-4-6) and may omit the provider prefix. lookupModelPrice does an exact prices.get(model), so real models go unpriced and net ROI silently falls back to gross.
- Step 1: Write failing tests
// apps/web/lib/model-prices.test.ts
import { describe, expect, it } from "vitest";
import { canonicalizeModelId } from "./model-prices";
describe("canonicalizeModelId", () => {
  it("keeps a known dotted id stable", () => {
    expect(canonicalizeModelId("anthropic/claude-sonnet-4.6")).toBe("anthropic/claude-sonnet-4.6");
  });
  it("trims and lowercases before matching", () => {
    expect(canonicalizeModelId(" Anthropic/Claude-Sonnet-4-6 ")).toBe("anthropic/claude-sonnet-4.6");
  });
  it("converts a dashed version to a dotted version", () => {
    expect(canonicalizeModelId("anthropic/claude-sonnet-4-6")).toBe("anthropic/claude-sonnet-4.6");
  });
  it("does not mangle a trailing suffix", () => {
    expect(canonicalizeModelId("anthropic/claude-opus-4-8-fast")).toBe(
      "anthropic/claude-opus-4.8-fast",
    );
  });
  it("leaves a single trailing number alone (not a version pair)", () => {
    expect(canonicalizeModelId("openai/gpt-4")).toBe("openai/gpt-4");
  });
});
Run: pnpm exec vitest run apps/web/lib/model-prices.test.ts from repo root. Expected: FAIL (canonicalizeModelId not exported).
- Step 2: Implement canonicalize + a second index
In apps/web/lib/model-prices.ts, add the exported helper and index a canonical map alongside the exact one. Only the FIRST -<digit>-<digit> pair becomes <digit>.<digit>, so suffixes like -fast are preserved.
// Canonical key for fuzzy-but-safe matching: lowercase, and collapse the first
// "-<digit>-<digit>" version pair to "<digit>.<digit>" so dashed event ids
// (anthropic/claude-sonnet-4-6) match dotted price ids (…-4.6). Suffixes after
// the version (…-4.8-fast) are preserved.
export function canonicalizeModelId(model: string): string {
  return model
    .trim()
    .toLowerCase()
    .replace(/-(\d+)-(\d+)/, "-$1.$2");
}
Update loadPrices to also build byCanonical, and lookupModelPrice to fall back to it. Keep exact match first so nothing regresses:
let cache: {
  at: number;
  byId: Map<string, ModelPrice>;
  byCanonical: Map<string, ModelPrice>;
} | null = null;
// in loadPrices, after byId.set(...):
// byCanonical.set(canonicalizeModelId(r.model_id), { ...same... });
// lookupModelPrice:
export async function lookupModelPrice(model: string): Promise<ModelPrice | null> {
  if (!model) return null;
  const { byId, byCanonical } = await loadPricesIndexed();
  return byId.get(model) ?? byCanonical.get(canonicalizeModelId(model)) ?? null;
}
(Refactor loadPrices to loadPricesIndexed returning both maps; preserve the stale-cache-on-error behaviour.)
- Step 3: Run tests → PASS. Then pnpm --filter @agent-metrics/web exec tsc --noEmit.
- Step 4: Commit fix(pricing): canonical model-id matching so dashed ids resolve a price.
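The loadPricesIndexed refactor is described only in prose, so here is a minimal sketch of one way to build both indexes while keeping the stale-cache-on-error behaviour. The PriceRow shape, the ModelPrice fields, the fetchRows parameter and the one-hour TTL are assumptions for illustration, not the actual contents of model-prices.ts.

```typescript
// Sketch only: PriceRow, the ModelPrice fields, fetchRows and CACHE_TTL_MS are
// hypothetical stand-ins for whatever model-prices.ts really uses.
interface ModelPrice { inputPerMTok: number; outputPerMTok: number }
interface PriceRow extends ModelPrice { model_id: string }
interface PriceIndex {
  at: number;
  byId: Map<string, ModelPrice>;
  byCanonical: Map<string, ModelPrice>;
}

const CACHE_TTL_MS = 60 * 60 * 1000; // assumed TTL
let cache: PriceIndex | null = null;

function canonicalizeModelId(model: string): string {
  return model.trim().toLowerCase().replace(/-(\d+)-(\d+)/, "-$1.$2");
}

function buildIndex(rows: PriceRow[], now: number): PriceIndex {
  const byId = new Map<string, ModelPrice>();
  const byCanonical = new Map<string, ModelPrice>();
  for (const r of rows) {
    const price: ModelPrice = { inputPerMTok: r.inputPerMTok, outputPerMTok: r.outputPerMTok };
    byId.set(r.model_id, price);
    // First writer wins, so a canonical collision never silently replaces a price.
    const key = canonicalizeModelId(r.model_id);
    if (!byCanonical.has(key)) byCanonical.set(key, price);
  }
  return { at: now, byId, byCanonical };
}

// Exact match first, canonical match second, mirroring lookupModelPrice.
function lookupInIndex(index: PriceIndex, model: string): ModelPrice | null {
  return index.byId.get(model) ?? index.byCanonical.get(canonicalizeModelId(model)) ?? null;
}

async function loadPricesIndexed(
  fetchRows: () => Promise<PriceRow[]>,
  now: number = Date.now(),
): Promise<PriceIndex> {
  const current = cache;
  if (current && now - current.at < CACHE_TTL_MS) return current;
  try {
    const fresh = buildIndex(await fetchRows(), now);
    cache = fresh;
    return fresh;
  } catch (err) {
    if (current) return current; // stale-cache-on-error: serve the old index
    throw err;
  }
}
```

Because both the price-book ids and the event ids pass through the same canonicalizeModelId, any reshaping it does is symmetric; the first-writer-wins guard is there only for the case where two distinct price ids happen to canonicalize to the same key.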
Task 1.2: unmatched cost provenance + log the offending id
Files:
- Modify: apps/web/app/api/v1/track/route.ts (cost-resolution block, ~lines 220-260)
- Modify: packages/types/src/* and packages/sdk-js/src/types.ts (the CostSource union)
- Create: packages/db/migrations/0053_cost_source_unmatched.sql
Today a model that fails to match a price and an event that passes no model at all both record cost_source: 'none', so the Task 1.1 class of miss is invisible. Add a distinct 'unmatched' value.
- Step 1: Migration to widen the CHECK constraint:
-- 0053_cost_source_unmatched.sql
-- Distinguish "caller passed a model we could not price" (unmatched) from
-- "no cost signal at all" (none), so pricing-coverage gaps are measurable.
alter table public.events drop constraint if exists events_agent_cost_source_check;
alter table public.events
  add constraint events_agent_cost_source_check
  check (agent_cost_source is null or agent_cost_source in ('provided','computed','none','unmatched'));
- Step 2: Add "unmatched" to the CostSource union in both type files (grep CostSource to find them; the research located packages/sdk-js/src/types.ts:38 and a @agent-metrics/types copy).
- Step 3: In track/route.ts, when a model is present but lookupModelPrice returns null, set agent_cost_source = "unmatched" and write cost_basis = { unmatched_model: <model string> } instead of none. Leave none for the genuinely absent case.
- Step 4: Test: extend the track route test (or add one) asserting an unknown model yields cost_source: 'unmatched' and the model string is captured. Run vitest + tsc.
- Step 5: Commit feat(track): record unmatched cost provenance for unpriced models.
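The branching in Step 3 can be sketched as a pure function, which also makes Step 4 testable without the route. CostInput, CostResolution and resolveCost are hypothetical names, and the rule that a caller-provided cost takes precedence over a computed one is an assumption; the real logic lives inline in track/route.ts.

```typescript
// Sketch only: CostInput, CostResolution and resolveCost are hypothetical names.
type CostSource = "provided" | "computed" | "none" | "unmatched";

interface CostInput {
  providedCostEur?: number; // caller-supplied cost, if any (assumed to win)
  model?: string;           // model id from the event, if any
  priceFound: boolean;      // whether lookupModelPrice returned a price
}

interface CostResolution {
  agent_cost_source: CostSource;
  cost_basis?: { unmatched_model: string };
}

function resolveCost(input: CostInput): CostResolution {
  if (input.providedCostEur !== undefined) return { agent_cost_source: "provided" };
  const model = input.model?.trim();
  if (!model) return { agent_cost_source: "none" }; // genuinely no cost signal
  if (input.priceFound) return { agent_cost_source: "computed" };
  // A model was sent but could not be priced: keep the offending id for triage.
  return { agent_cost_source: "unmatched", cost_basis: { unmatched_model: model } };
}
```

Keeping the raw model string in cost_basis is what turns the pricing-coverage gap into something that can be grouped and counted later.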
Task 1.3: Wage-fallback honesty
Files:
- Modify: apps/web/lib/enrichment/wage-provider.ts
- Modify: apps/web/lib/enrichment/service.ts (the HARD_FALLBACK_GROSS_EUR path + the silent tier upgrade)
- Step 1: In wage-provider.ts, the no-source LLM estimate already sets tier: "llm_inferred" (good). Confirm it never claims a source. If any path writes a bare estimate with a source label, change the stored source to "llm_estimate" and do not set a source URL.
- Step 2: In service.ts, the hard-fallback wage row (HARD_FALLBACK_GROSS_EUR) must be flagged. Add is_fallback: true to that resolution and carry it into the wage record so the report and the confidence scorer can see it (consumed in Phase 2).
- Step 3: Remove the silent upgrade of a missing tier to seeded_reference (0.6); default a missing tier DOWN to llm_inferred (or hard_fallback when there is truly nothing).
- Step 4: Test the resolution helper (extract a pure function if needed) asserting a fallback wage carries is_fallback. Run vitest + tsc. Commit fix(enrichment): flag fallback wages instead of laundering them to seeded.
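A minimal sketch of the pure helper that Step 4 suggests extracting, covering Steps 2 and 3 together. RawWage, WageRecord and finalizeWage are hypothetical names; the tier names are the plan's ConfidenceTier ladder, and the fallback amount is passed in rather than assumed.

```typescript
// Sketch only: RawWage, WageRecord and finalizeWage are hypothetical names.
type ConfidenceTier =
  | "official_statistic"
  | "fetched_cited"
  | "seeded_reference"
  | "llm_inferred"
  | "hard_fallback";

interface RawWage { grossEur?: number; tier?: ConfidenceTier }
interface WageRecord { grossEur: number; tier: ConfidenceTier; is_fallback: boolean }

function finalizeWage(raw: RawWage | null, hardFallbackGrossEur: number): WageRecord {
  if (raw === null || raw.grossEur === undefined) {
    // Nothing resolved: use the hard fallback and say so explicitly.
    return { grossEur: hardFallbackGrossEur, tier: "hard_fallback", is_fallback: true };
  }
  // A missing tier defaults DOWN to llm_inferred, never up to seeded_reference.
  const tier: ConfidenceTier = raw.tier ?? "llm_inferred";
  return { grossEur: raw.grossEur, tier, is_fallback: tier === "hard_fallback" };
}
```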
Task 1.4: Distinguish lookup_error from not_found
Files:
- Modify: apps/web/lib/enrichment/service.ts (the four catch { } blocks around the wage/cache resolution)
- Step 1: Replace silent catch {} blocks with a resolution_status of "resolved" | "not_found" | "lookup_error". A DB/network error must NOT degrade a role to hard_fallback indistinguishably from a genuine miss.
- Step 2: Exclude lookup_error datapoints from the confidence rollup (or mark them for retry) rather than scoring them as low-confidence real data.
- Step 3: Test + tsc. Commit fix(enrichment): surface lookup errors instead of swallowing them as fallbacks.
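One possible shape for the status-carrying result, shown as a sketch. Resolution, resolveWithStatus and scorable are hypothetical names, and treating a null lookup result as not_found is an assumption about how the existing lookups signal a miss.

```typescript
// Sketch only: Resolution, resolveWithStatus and scorable are hypothetical names.
type Resolution<T> =
  | { resolution_status: "resolved"; value: T }
  | { resolution_status: "not_found" }
  | { resolution_status: "lookup_error"; error: string };

async function resolveWithStatus<T>(lookup: () => Promise<T | null>): Promise<Resolution<T>> {
  try {
    const value = await lookup();
    if (value === null) return { resolution_status: "not_found" };
    return { resolution_status: "resolved", value };
  } catch (err) {
    // A DB/network failure is reported as such, not folded into not_found.
    const error = err instanceof Error ? err.message : String(err);
    return { resolution_status: "lookup_error", error };
  }
}

// Step 2: only real outcomes (resolved or genuinely missing) reach the rollup.
function scorable<T>(items: Resolution<T>[]): Resolution<T>[] {
  return items.filter((r) => r.resolution_status !== "lookup_error");
}
```

The discriminated union forces every caller to handle the error case explicitly, which is the property the silent catch blocks were missing.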
Task 1.5: Confidence-tier ordering + official-source over-tagging
Files:
- Modify: packages/core/src/enrichment/confidence.ts (handled by the Phase 2 rewrite, see Task 2.1: official_statistic prior now ABOVE fetched_cited)
- Modify: apps/web/lib/enrichment/wage-provider.ts (OFFICIAL_SOURCE regex) and the BE estimated-split rows
- Step 1: The tier inversion (fetched_cited 0.95 > official_statistic 0.8) is fixed by the Phase 2 priors (official 0.90 > fetched_cited 0.78). No separate change needed; cross-reference Task 2.1.
- Step 2: The OFFICIAL_SOURCE regex tags any .gov/eurostat URL as official, including the seeded BE rows whose role split is explicitly ESTIMATED (migration 0046). Mark those seed rows with a flag (e.g. estimated_split: true) and have the scorer treat them as seeded_reference with matchQuality < 1, not official_statistic. Verify against 0046_wage_reference_official.sql.
- Step 3: Test + tsc. Commit fix(enrichment): stop tagging estimated wage splits as official statistics.
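The Step 2 rule can be sketched as a small classifier. OFFICIAL_SOURCE_SKETCH is an illustrative pattern, not the real OFFICIAL_SOURCE regex, and WageRow and officialTier are hypothetical names; the 0.6 match quality is the value Task 2.2 assigns to seeded estimated-split rows.

```typescript
// Sketch only: the pattern below is illustrative, not the real OFFICIAL_SOURCE
// regex; WageRow and officialTier are hypothetical names.
const OFFICIAL_SOURCE_SKETCH = /\.gov\b|eurostat/i;

interface WageRow { sourceUrl?: string; estimated_split?: boolean }
interface TierDecision { tier: "official_statistic" | "seeded_reference"; matchQuality: number }

// Returns null when the URL is not an official source; other tiers are then
// decided elsewhere.
function officialTier(row: WageRow): TierDecision | null {
  if (row.sourceUrl === undefined || !OFFICIAL_SOURCE_SKETCH.test(row.sourceUrl)) return null;
  if (row.estimated_split === true) {
    // Official total but estimated role split: not an official per-role figure.
    return { tier: "seeded_reference", matchQuality: 0.6 };
  }
  return { tier: "official_statistic", matchQuality: 1 };
}
```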
Phase 2: Continuous per-datapoint confidence (the headline)
Replace the discrete tier-score rollup with a continuous scorer so confidence is realistic to the integer. Guarantee: differentiated inputs cannot collapse onto a small grid.
Task 2.1: Continuous confidence scorer in @agent-metrics/core
Files:
- Modify: packages/core/src/enrichment/confidence.ts
- Test: packages/core/src/enrichment/confidence.test.ts (create)
- Step 1: Write failing tests
// packages/core/src/enrichment/confidence.test.ts
import { describe, expect, it } from "vitest";
import { datapointConfidence, rollupConfidence, toConfidencePct, tierPrior } from "./confidence";
describe("datapointConfidence", () => {
  it("ranks official statistics above a random cited page", () => {
    expect(tierPrior("official_statistic")).toBeGreaterThan(tierPrior("fetched_cited"));
  });
  it("produces continuous, non-grid values", () => {
    const a = datapointConfidence({
      tier: "official_statistic",
      matchQuality: 1,
      corroboration: 2,
      staleYears: 1,
    });
    const b = datapointConfidence({
      tier: "official_statistic",
      matchQuality: 0.7,
      corroboration: 1,
      staleYears: 3,
    });
    expect(a).not.toBeCloseTo(b, 2);
    // not snapped to 0.4/0.6/0.8
    for (const v of [a, b])
      expect([0.4, 0.6, 0.8].some((g) => Math.abs(v - g) < 0.005)).toBe(false);
  });
  it("caps a flagged fallback regardless of tier", () => {
    expect(
      datapointConfidence({ tier: "official_statistic", isFallback: true }),
    ).toBeLessThanOrEqual(0.3);
  });
  it("stays within [0.05, 0.98]", () => {
    expect(
      datapointConfidence({ tier: "hard_fallback", matchQuality: 0.4 }),
    ).toBeGreaterThanOrEqual(0.05);
    expect(
      datapointConfidence({ tier: "official_statistic", corroboration: 99 }),
    ).toBeLessThanOrEqual(0.98);
  });
});
describe("rollupConfidence", () => {
  it("is the cost-weighted mean of continuous scores", () => {
    const r = rollupConfidence([
      { weight: 100, score: 0.9 },
      { weight: 100, score: 0.5 },
    ]);
    expect(r).toBeCloseTo(0.7, 5);
  });
  it("returns 0 on no weight", () => {
    expect(rollupConfidence([])).toBe(0);
  });
});
describe("toConfidencePct", () => {
  it("rounds to integer percent so 0.743 -> 74", () => {
    expect(toConfidencePct(0.743)).toBe(74);
    expect(toConfidencePct(0.835)).toBe(84);
  });
});
Run: pnpm exec vitest run packages/core/src/enrichment/confidence.test.ts. Expected: FAIL.
- Step 2: Rewrite confidence.ts (full replacement)
import type { ConfidenceTier } from "@agent-metrics/types";
// Continuous trust prior per source tier (0..1). official_statistic ranks ABOVE
// fetched_cited: Eurostat/BLS must beat a random salary page. These are PRIORS,
// not final scores; per-datapoint modifiers move them continuously so confidence
// is realistic to the integer (74, 83, 88), never snapped to a {0.4,0.6,0.8} grid.
const TIER_PRIOR: Record<ConfidenceTier, number> = {
  official_statistic: 0.9,
  fetched_cited: 0.78,
  seeded_reference: 0.62,
  llm_inferred: 0.42,
  hard_fallback: 0.18,
};
export function tierPrior(tier: ConfidenceTier): number {
  return TIER_PRIOR[tier];
}
const clamp = (x: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, x));
export interface DatapointSignals {
  tier: ConfidenceTier;
  matchQuality?: number; // 1 = exact role+country match; <1 for DEFAULT-country fallback / estimated split
  corroboration?: number; // independent agreeing sources, >=1; saturating bonus
  staleYears?: number; // years the source lags the current reference year
  isFallback?: boolean; // explicit fallback/estimate; hard-caps the score
}
// The single continuous scorer. A continuous matchQuality multiplier, a
// saturating corroboration bonus and a linear recency penalty make round-number
// collapse impossible for differentiated inputs.
export function datapointConfidence(s: DatapointSignals): number {
  const prior = TIER_PRIOR[s.tier];
  const match = clamp(s.matchQuality ?? 1, 0.4, 1);
  const n = Math.max(1, s.corroboration ?? 1);
  const corroborationBonus = 0.1 * (1 - 1 / n); // 0 at n=1, ~0.067 at n=3, ->0.1
  const recencyPenalty = 0.02 * Math.max(0, s.staleYears ?? 0);
  let score = prior * match + corroborationBonus - recencyPenalty;
  if (s.isFallback) score = Math.min(score, 0.3);
  return clamp(score, 0.05, 0.98);
}
export interface WeightedDatapoint {
  weight: number; // cost weight (EUR)
  score: number; // continuous confidence 0..1
}
// Cost-weighted mean of continuous per-datapoint scores. Continuous in/out.
export function rollupConfidence(items: WeightedDatapoint[]): number {
  const total = items.reduce((s, i) => s + Math.max(0, i.weight), 0);
  if (total <= 0) return 0;
  const weighted = items.reduce((s, i) => s + Math.max(0, i.weight) * i.score, 0);
  return weighted / total;
}
// Round a 0..1 confidence to integer-percent precision (0.743 -> 74).
export function toConfidencePct(x: number): number {
  return Math.round(clamp(x, 0, 1) * 100);
}
- Step 3: Update the barrel export if tierScore / WeightedTier are re-exported from packages/core/src/index.ts; replace with tierPrior / datapointConfidence / WeightedDatapoint / toConfidencePct.
- Step 4: Run tests → PASS. pnpm --filter @agent-metrics/core exec tsc --noEmit. Commit feat(confidence): continuous per-datapoint confidence scorer.
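To make the scorer's arithmetic concrete, the formula in datapointConfidence and the two inputs from the "continuous, non-grid values" test work out as below. The symbols are shorthand introduced here: p is the tier prior, m the clamped matchQuality, n the corroboration count and y the staleYears; the fallback cap at 0.3 applies before the final clamp and is not triggered in either example.

```latex
\[
\text{score} = \operatorname{clamp}\!\Bigl(p\,m + 0.1\bigl(1 - \tfrac{1}{n}\bigr) - 0.02\,y,\; 0.05,\; 0.98\Bigr)
\]
\[
a:\quad 0.9 \cdot 1 + 0.1\bigl(1 - \tfrac{1}{2}\bigr) - 0.02 \cdot 1 = 0.90 + 0.05 - 0.02 = 0.93 \;\rightarrow\; 93\%
\]
\[
b:\quad 0.9 \cdot 0.7 + 0.1\bigl(1 - \tfrac{1}{1}\bigr) - 0.02 \cdot 3 = 0.63 + 0 - 0.06 = 0.57 \;\rightarrow\; 57\%
\]
```

Both results sit well away from the old 0.4/0.6/0.8 grid, which is what the test asserts.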
Task 2.2: Wire continuous scoring into the enrichment assembly
Files:
- Modify: apps/web/lib/enrichment/service.ts (lines ~290-359)
- Modify: the RoleWageResolution type to carry matchQuality / corroboration / isFallback / staleYears
- Step 1: Where wages resolve (the cache / seeded / researched / hard-fallback branches around lines 77-208), attach per-datapoint signals to RoleWageResolution:
  - exact role+country hit → matchQuality: 1
  - DEFAULT-country fallback → matchQuality: 0.7
  - seeded estimated-split row (BE, Task 1.5) → matchQuality: 0.6
  - hard fallback → isFallback: true
  - staleYears = current reference year minus the row's year (from wage_reference).
  - corroboration = number of sources that agreed (1 unless the batch returned multiple).
- Step 2: Replace weighted.push({ weight: annual, tier }) with:
const score = datapointConfidence({
  tier: res.tier,
  matchQuality: res.matchQuality,
  corroboration: res.corroboration,
  staleYears: res.staleYears,
  isFallback: res.isFallback,
});
weighted.push({ weight: annual, score });
- Step 3: Make headcount + roles confidence continuous (replace the single-tier collapse at lines 345-355):
// headcount: continuous, blends source strength with how well the role split
// reconciles to the stated headcount (a real cross-check, doubles as a tripwire).
const rolesSum = [...headcountByRole.values()].reduce((a, b) => a + b, 0);
const reconcile =
  headcountKnown && extracted.headcount_estimate! > 0
    ? clamp(
        1 - Math.abs(rolesSum - extracted.headcount_estimate!) / extracted.headcount_estimate!,
        0.4,
        1,
      )
    : 0.4;
const headcountConfidence = datapointConfidence({
  tier: headcountKnown ? (sourceCount >= 2 ? "seeded_reference" : "llm_inferred") : "hard_fallback",
  matchQuality: reconcile,
  corroboration: Math.max(1, sourceCount),
});
// roles: continuous, scaled by taxonomy-mapping coverage (Phase 3 Task 3.2 feeds
// `taxonomyCoverage`; default 0.7 until then).
const rolesConfidence = datapointConfidence({
  tier: sourceCount >= 3 ? "seeded_reference" : "llm_inferred",
  matchQuality: taxonomyCoverage ?? 0.7,
  corroboration: Math.max(1, sourceCount),
});
(clamp is module-private in confidence.ts as written in Task 2.1, so either export it from @agent-metrics/core or add a small local helper.)
- Step 4: Keep the overall weighting but store continuous values to 2 dp (already done via Math.round(x*100)/100); confirm overall, wages, headcount, roles are each continuous. Optionally also store integer-percent mirrors using toConfidencePct if the UI prefers ints.
- Step 5: tsc + run the enrichment tests. There is no DB here for the wiring, so add/extend a unit test on an extracted helper if the logic is pulled out; otherwise rely on verify-enrichment.ts (Task 2.4). Commit feat(enrichment): continuous confidence across wages, headcount and roles.
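The Step 1 mapping from resolution branch to signals is a natural candidate for the extracted helper that Step 5 mentions. A minimal sketch follows; WageBranch, WageSignals and signalsFor are hypothetical names, the numeric values are the ones listed in Step 1, and treating a missing row year as zero staleness is an assumption.

```typescript
// Sketch only: WageBranch, WageSignals and signalsFor are hypothetical names;
// the matchQuality values are the ones listed in Step 1.
type WageBranch = "exact" | "default_country" | "estimated_split" | "hard_fallback";

interface WageSignals {
  matchQuality: number;
  isFallback: boolean;
  staleYears: number;
  corroboration: number;
}

function signalsFor(
  branch: WageBranch,
  rowYear: number | undefined,
  referenceYear: number,
  agreeingSources: number = 1,
): WageSignals {
  const matchQuality =
    branch === "default_country" ? 0.7 : branch === "estimated_split" ? 0.6 : 1;
  return {
    matchQuality,
    isFallback: branch === "hard_fallback",
    // Assumption: with no row year (e.g. hard fallback) no recency penalty applies.
    staleYears: rowYear === undefined ? 0 : Math.max(0, referenceYear - rowYear),
    corroboration: Math.max(1, agreeingSources),
  };
}
```

The returned object can be spread straight into the datapointConfidence call from Step 2 alongside the tier.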
Task 2.3: Persist per-datapoint provenance
Files:
- Modify: apps/web/lib/enrichment/service.ts (the company_research upsert)
- Modify: packages/types/src/enrichment.ts (extend the per-role/per-wage shape with an optional numeric confidence_score alongside the tier)
- Step 1: Today roles[].confidence and wages[].confidence store only the tier enum. Add an optional confidence_score: number (0..1) and the signals used (match_quality, corroboration, stale_years, is_fallback) so the score is auditable and re-rollable. Keep the tier enum for backward-compatible reads (schema is .passthrough() on nested objects, so old rows still parse).
- Step 2: Write these per-datapoint scores into the persisted record. No migration needed (jsonb columns).
- Step 3: tsc + types test. Commit feat(enrichment): persist per-datapoint confidence + provenance.
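To illustrate what "auditable and re-rollable" buys, here is a sketch of the extended per-wage shape and a re-roll over persisted rows. PersistedWage, its role and annual_cost_eur fields, and rerollWages are hypothetical; only confidence_score and the four signal fields come from Step 1.

```typescript
// Sketch only: PersistedWage, role, annual_cost_eur and rerollWages are
// hypothetical; the optional fields follow Step 1.
interface PersistedWage {
  role: string;
  annual_cost_eur: number;
  confidence: string;        // tier enum, kept for backward-compatible reads
  confidence_score?: number; // 0..1, absent on rows written before this change
  match_quality?: number;
  corroboration?: number;
  stale_years?: number;
  is_fallback?: boolean;
}

// Cost-weighted mean over rows that carry a stored score. Legacy rows without
// one are skipped rather than guessed.
function rerollWages(rows: PersistedWage[]): number {
  let total = 0;
  let weighted = 0;
  for (const r of rows) {
    if (r.confidence_score === undefined) continue;
    const w = Math.max(0, r.annual_cost_eur);
    total += w;
    weighted += w * r.confidence_score;
  }
  return total > 0 ? weighted / total : 0;
}
```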
Task 2.4: Backfill verification + smoke
Files:
- Modify: packages/db/scripts/verify-enrichment.ts
- Step 1: Add an assertion path that checks, after re-running enrichment on a sample of domains, that the distinct-confidence-blob count rises sharply (target: from 12 toward dozens+) and that no two materially different companies share an identical overall. This is a coarse guard against re-introducing the grid.
- Step 2: Run verify-enrichment.ts against a handful of domains locally (read-only). Confirm continuous values like 0.74/0.83/0.88 appear. Do NOT mass-refresh prod here (that consumes lookups); a manual targeted refresh is a separate, owner-approved step.
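The Step 1 guard can be sketched as a pure report over the stored confidence objects, so the script only has to compare two numbers against the measured baseline (12 distinct blobs, 51% on the most common one). ConfidenceBlob and gridReport are hypothetical names, and the thresholds to assert on are left to the script.

```typescript
// Sketch only: ConfidenceBlob and gridReport are hypothetical names.
interface ConfidenceBlob { roles: number; wages: number; headcount: number; overall: number }

function gridReport(blobs: ConfidenceBlob[]): { distinct: number; largestShare: number } {
  const counts = new Map<string, number>();
  for (const b of blobs) {
    const key = [b.roles, b.wages, b.headcount, b.overall].join("|");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const largest = Math.max(0, ...Array.from(counts.values()));
  return {
    distinct: counts.size,
    // Share of researches sitting on the single most common blob.
    largestShare: blobs.length > 0 ? largest / blobs.length : 0,
  };
}
```

Tracking largestShare next to the distinct count catches the case where many blobs exist but most researches still pile onto one of them.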
Phase 3: Enrichment quality hardening (makes the numbers defensible)
These move the underlying accuracy, not just the confidence label. Sequenced after Phase 2 because several feed its signals.
Task 3.1: Headcount sanity bounds + reconcile tripwire
Files: apps/web/lib/enrichment/service.ts, packages/core/src/enrichment/* (a pure headcountSanity helper + test)
- Bound extracted headcount against gross implausibility (decimal/thousands mis-scale, ~1000x outliers); when sum(roles) diverges hard from stated headcount, lower matchQuality (already wired in Task 2.2) AND log a tripwire. Pure helper, unit-tested. Commit per the task.
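A minimal sketch of what the pure headcountSanity helper might return. The upper bound, the mis-scale ratio and the divergence threshold are hypothetical values chosen for illustration and would need tuning against real data.

```typescript
// Sketch only: all three thresholds are hypothetical illustration values.
const MAX_PLAUSIBLE_HEADCOUNT = 5_000_000;
const MIS_SCALE_RATIO = 500;     // stated vs. roles sum off by roughly 1000x
const DIVERGENCE_TRIPWIRE = 0.5; // relative gap that should be logged

interface HeadcountCheck {
  plausible: boolean;
  likelyMisScaled: boolean;
  divergence: number; // |rolesSum - stated| / stated, or 1 when implausible
  tripwire: boolean;
}

function headcountSanity(stated: number, rolesSum: number): HeadcountCheck {
  const plausible =
    Number.isFinite(stated) && stated >= 1 && stated <= MAX_PLAUSIBLE_HEADCOUNT;
  const ratio =
    plausible && rolesSum > 0
      ? Math.max(stated, rolesSum) / Math.min(stated, rolesSum)
      : 1;
  const likelyMisScaled = ratio >= MIS_SCALE_RATIO;
  const divergence = plausible ? Math.abs(rolesSum - stated) / stated : 1;
  return {
    plausible,
    likelyMisScaled,
    divergence,
    tripwire: !plausible || likelyMisScaled || divergence > DIVERGENCE_TRIPWIRE,
  };
}
```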
Task 3.2: Role normalization to a standard taxonomy
Files: packages/core/src/enrichment/normalize-roles.ts, tests
- Replace greedy regex bucketing with a mapping to ISCO-08 / ESCO occupation groups; emit a taxonomyCoverage fraction (mapped roles / total) that feeds rolesConfidence (Task 2.2 Step 3). Log unmapped roles. Pure + unit-tested. Commit per the task.
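The coverage fraction itself is simple to sketch. The two-entry lookup below stands in for a real ISCO-08 / ESCO mapping table, and its group labels are placeholders rather than verified occupation codes; taxonomyCoverage as a function name is also an assumption.

```typescript
// Sketch only: the lookup stands in for a real ISCO-08 / ESCO mapping and the
// group labels are placeholders, not verified occupation codes.
const ROLE_TO_GROUP = new Map<string, string>([
  ["software engineer", "placeholder-group-a"],
  ["accountant", "placeholder-group-b"],
]);

function taxonomyCoverage(roles: string[]): { coverage: number; unmapped: string[] } {
  const unmapped = roles.filter((r) => !ROLE_TO_GROUP.has(r.trim().toLowerCase()));
  // mapped roles / total; an empty role list has no coverage to claim.
  const coverage = roles.length > 0 ? (roles.length - unmapped.length) / roles.length : 0;
  return { coverage, unmapped };
}
```

The unmapped list is what gets logged, and coverage is the value that replaces the 0.7 default in rolesConfidence.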
Task 3.3: SDK auto-capture of model + tokens
Files: packages/sdk-js/*, docs
- In the JS SDK wrappers (and documented adapters for Vercel AI / Anthropic / OpenAI), auto-populate model, tokens_in, tokens_out when available so net-ROI coverage rises from 0.16%. Adoption lever, not a server change. Tests on the wrapper. Commit per the task.
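As a rough sketch of the capture step, a wrapper could read the model and token counts defensively from whatever response object it sees. extractUsage and CapturedUsage are hypothetical names; the fields read here are usage.input_tokens / usage.output_tokens (Anthropic-style responses) and usage.prompt_tokens / usage.completion_tokens (OpenAI-style responses), and any other provider shape would need its own adapter.

```typescript
// Sketch only: extractUsage and CapturedUsage are hypothetical names.
interface CapturedUsage { model?: string; tokens_in?: number; tokens_out?: number }

function asCount(v: unknown): number | undefined {
  return typeof v === "number" && Number.isFinite(v) ? v : undefined;
}

function extractUsage(response: unknown): CapturedUsage {
  if (typeof response !== "object" || response === null) return {};
  const r = response as Record<string, unknown>;
  const rawUsage = r["usage"];
  const usage = (typeof rawUsage === "object" && rawUsage !== null ? rawUsage : {}) as Record<string, unknown>;
  const out: CapturedUsage = {};
  const model = r["model"];
  if (typeof model === "string") out.model = model;
  // Prefer Anthropic-style names, then fall back to OpenAI-style names.
  const tokensIn = asCount(usage["input_tokens"]) ?? asCount(usage["prompt_tokens"]);
  const tokensOut = asCount(usage["output_tokens"]) ?? asCount(usage["completion_tokens"]);
  if (tokensIn !== undefined) out.tokens_in = tokensIn;
  if (tokensOut !== undefined) out.tokens_out = tokensOut;
  return out;
}
```

Fields that cannot be read are simply left off, so the server still sees "no model passed" rather than a fabricated value.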
Task 3.4 (LARGER, schedule explicitly): broaden wage reference
Files: packages/db/migrations/00XX_wage_reference_eurostat.sql, an ingestion script
- Pull Eurostat SES (earn_ses) keyed by ISCO-08 across ~27 EU members to replace the 7-country/14-role/2024 seed with broad official_statistic coverage; key wage_reference on ISCO codes. Effort: L, data-acquisition heavy. This is real data work, not a code refactor; scope and source-verify before starting. Until done, the DEFAULT fallback + matchQuality<1 (Task 2.2) keeps confidence honest.
Out of scope (verified-rejected in the deep research, do NOT build)
- Anthropic Citations API for per-datapoint source binding (incompatible with our OpenRouter + structured-output path). Build the plain fetch-and-verify-the-number-is-on-the-page loop instead if/when needed.
- Tiered family-fallback pricing (median-of-class): net_saved is not clamped >=0, so an over-estimate pushes real ROI down; conflicts with the "no speculative savings" rule. Keep exact+alias matching only.
- Bayesian baseline calibration on events: requires measured human time, which the product does not collect (agent_duration_seconds is agent runtime). Defer until a real human-time capture exists.
- Country dimension on task baselines: no credible source for country-varying task minutes. Only the cheap effective_year versioning is worth doing.
Self-Review
- Confidence granularity (the core ask): Task 2.1 + 2.2 make every dimension a continuous function of continuous inputs (matchQuality, corroboration, staleYears), and Task 2.1 tests explicitly assert non-grid output and toConfidencePct(0.743) === 74. Requirement met.
- Type consistency: tierScore / WeightedTier are removed and replaced everywhere they were imported (service.ts, barrel). Verify with a repo-wide grep for tierScore before finishing.
- No placeholders: the novel/tricky code (canonicalize, the scorer) is given in full; bug fixes carry exact files + test specs.
- Scope honesty: Phase 3.4 (wage breadth) is flagged L / data-acquisition and gated; not bundled into the confidence work.