Enrichment Engine, Phase 2b: Global Grounded Wages + Accuracy Eval, Implementation Plan · HumanHours docs

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement task-by-task. Steps use checkbox (- [ ]) syntax.

Goal: Make wage resolution work for ANY country via on-demand grounded research (cached per country/role), replacing the hard_fallback placeholder, and add a code-based accuracy eval against the ground-truth dataset that reports headcount/country/wage accuracy.

Architecture: A new wage provider does grounded research (Sonar) + extraction (Claude) to get a gross hourly wage in EUR for a given country/role, with a source. The enrichment service's wage step becomes: wages cache -> grounded research (write result into wages) -> hard_fallback, so the engine is global and pays once per country/role. A pure ground-truth parser handles the dirty CSV; an opt-in live eval test scores accuracy.

Tech Stack: TypeScript strict, OpenRouter (sonar + claude), Vitest, Supabase service-role client. No country is special-cased.
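
The wage step order described under Architecture (cache, then grounded research with a write-back, then fallback) can be sketched with injected dependencies. This is a minimal illustration under stated assumptions, not the service code: resolveWage, WageSteps and WageResult are hypothetical names, and the real function talks to Supabase and OpenRouter directly.

```typescript
// Hypothetical names throughout; only the order of the steps comes from the plan.
type Tier =
  | "fetched_cited"
  | "official_statistic"
  | "seeded_reference"
  | "llm_inferred"
  | "hard_fallback";

interface WageResult {
  hourlyEur: number;
  tier: Tier;
}

interface WageSteps {
  readCache: (key: string) => Promise<WageResult | null>;
  research: (country: string, role: string) => Promise<WageResult>;
  writeCache: (key: string, value: WageResult) => Promise<void>;
  fallback: WageResult;
}

async function resolveWage(
  country: string,
  role: string,
  steps: WageSteps,
): Promise<WageResult> {
  const key = `${country}:${role}`;

  // 1. Cache first, so each country/role pair is researched (and paid for) once.
  // Simplification: the cached value keeps its stored tier here, whereas the
  // plan's service labels every cache hit seeded_reference.
  const cached = await steps.readCache(key);
  if (cached) return cached;

  try {
    // 2. Grounded research, written back into the cache for later callers.
    const fresh = await steps.research(country, role);
    await steps.writeCache(key, fresh);
    return fresh;
  } catch {
    // 3. Research failed: return the explicit low-confidence fallback.
    return steps.fallback;
  }
}
```

Injecting the three steps keeps the ordering testable without network or database access, which is the same reason the plan keeps the ground-truth parser pure.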

Context the engineer needs

Branch: continue on feat/enrichment-engine.
Phase 2a is done: apps/web/lib/enrichment/provider.ts (exports researchCompany, extractCompany, writeBusinessCase, RESEARCH_MODEL, REASONING_MODEL; the internal chat(model, messages, apiKey) helper is currently NOT exported), packages/core enrichment pure fns, and apps/web/lib/enrichment/service.ts (orchestrator with a private resolveLoadedHourly(country, role) that currently does wages-cache -> wage_reference -> hard_fallback at EUR 25). OPENROUTER_API_KEY is in apps/web/.env.local and in env.ts.
Vitest already stubs server-only via vitest.config.ts alias, so test files may import the service. Tests run from repo root: pnpm exec vitest run <path>. Vitest does NOT auto-load .env.local; a test that needs real secrets must load dotenv itself before importing modules that read env.
wages table: (country, role, year) unique, wage_data jsonb, source text, source_url text, confidence numeric, last_researched, ttl_days (90). This is the global per-country/role loaded-cost cache.
employer_factors: seeded for NL/DE/UK/US/FR/BE. For any other country the service uses a 1.3 default (kept; refining per-country factors is a later concern, not a country restriction).
ConfidenceTier: fetched_cited > official_statistic > seeded_reference > llm_inferred > hard_fallback.
Ground-truth file: ~/Downloads/HH Research Engine — Ground Truth Dataset - Blad2.csv. Header: domain,legal_name,true_primary_country,true_headcount,role_1,true_avg_wage_1. Quirks: European decimals where a dot is a thousands separator (65.340 = 65340, 1.156 = 1156, 414.000 = 414000); wage strings like €40.70/hr, €40.39hr (missing slash), €62.67hr, trailing spaces in role names; countries are full names ("Netherlands", "Germany"). Copy this file into the repo as a fixture in Task 1 so the eval is versioned and not dependent on a Downloads path.
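
The ConfidenceTier ordering listed above can be made comparable with a small ranking helper. This is a sketch only: the tier names and their order come from the list above, while tierRank and weakestTier are hypothetical helpers, and the "weakest input wins" rule is an assumption about how a rollup might combine tiers, not something the plan specifies.

```typescript
// Order copied from the context list: strongest evidence first.
const TIER_ORDER = [
  "fetched_cited",
  "official_statistic",
  "seeded_reference",
  "llm_inferred",
  "hard_fallback",
] as const;

type Tier = (typeof TIER_ORDER)[number];

// Lower rank means stronger evidence.
function tierRank(tier: Tier): number {
  return TIER_ORDER.indexOf(tier);
}

// Assumed rollup rule: a combined figure is only as trustworthy as its
// weakest input. An empty input is treated as hard_fallback.
function weakestTier(tiers: readonly Tier[]): Tier {
  if (tiers.length === 0) return "hard_fallback";
  return tiers.reduce((worst, t) => (tierRank(t) > tierRank(worst) ? t : worst));
}
```

For example, combining a fetched_cited headcount with an llm_inferred wage would yield llm_inferred under this rule.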

File structure

Create: apps/web/lib/enrichment/ground-truth.ts (+ test) — pure CSV/value parsing.
Create: apps/web/lib/enrichment/__fixtures__/ground-truth.csv — copied dataset.
Modify: apps/web/lib/enrichment/provider.ts — export the chat helper.
Create: apps/web/lib/enrichment/wage-provider.ts (+ test) — global grounded wage lookup.
Modify: apps/web/lib/enrichment/service.ts — use the wage provider, cache into wages.
Create: apps/web/lib/enrichment/eval.test.ts — opt-in live accuracy eval.

Task 1: Ground-truth parser (pure)

Files: Create apps/web/lib/enrichment/ground-truth.ts, apps/web/lib/enrichment/ground-truth.test.ts, and copy the fixture.

Step 1: Copy the dataset into the repo. Run:

mkdir -p apps/web/lib/enrichment/__fixtures__ && cp "$HOME/Downloads/HH Research Engine — Ground Truth Dataset - Blad2.csv" apps/web/lib/enrichment/__fixtures__/ground-truth.csv && wc -l apps/web/lib/enrichment/__fixtures__/ground-truth.csv

Expected: the file copies and reports its line count (a header + N rows).

Step 2: Write the failing test apps/web/lib/enrichment/ground-truth.test.ts:

import { describe, expect, it } from "vitest";
 
import { parseHeadcount, parseWageEur, parseCountryToIso, parseGroundTruth } from "./ground-truth";
 
describe("parseHeadcount", () => {
  it("reads European thousands-separated integers", () => {
    expect(parseHeadcount("65.340")).toBe(65340);
    expect(parseHeadcount("1.156")).toBe(1156);
    expect(parseHeadcount("414.000")).toBe(414000);
    expect(parseHeadcount("464")).toBe(464);
  });
});
 
describe("parseWageEur", () => {
  it("extracts a euro hourly number from dirty strings", () => {
    expect(parseWageEur("€40.70/hr")).toBeCloseTo(40.7, 2);
    expect(parseWageEur("€40.39hr")).toBeCloseTo(40.39, 2);
    expect(parseWageEur("€62.67hr")).toBeCloseTo(62.67, 2);
    expect(parseWageEur("€15.4/hr")).toBeCloseTo(15.4, 2);
  });
  it("returns null when there is no number", () => {
    expect(parseWageEur("n/a")).toBeNull();
  });
});
 
describe("parseCountryToIso", () => {
  it("maps full country names to ISO-2", () => {
    expect(parseCountryToIso("Netherlands")).toBe("NL");
    expect(parseCountryToIso("Germany")).toBe("DE");
    expect(parseCountryToIso("United States")).toBe("US");
  });
  it("passes through an already-ISO code", () => {
    expect(parseCountryToIso("FR")).toBe("FR");
  });
});
 
describe("parseGroundTruth", () => {
  it("parses CSV rows, trimming role whitespace", () => {
    const csv =
      "domain,legal_name,true_primary_country,true_headcount,role_1,true_avg_wage_1\n" +
      "afas.nl,AFAS Software B.V.,Netherlands,739,Consultant ,€32.31/hr\n";
    const rows = parseGroundTruth(csv);
    expect(rows).toHaveLength(1);
    expect(rows[0]!.domain).toBe("afas.nl");
    expect(rows[0]!.trueHeadcount).toBe(739);
    expect(rows[0]!.role).toBe("Consultant");
    expect(rows[0]!.trueCountry).toBe("NL");
    expect(rows[0]!.trueWageEur).toBeCloseTo(32.31, 2);
  });
});

Step 3: Run, verify FAIL. pnpm exec vitest run apps/web/lib/enrichment/ground-truth.test.ts
Step 4: Implement apps/web/lib/enrichment/ground-truth.ts:

// Pure parsing of the ground-truth dataset. Defensive about European decimals
// (a dot used as a thousands separator) and dirty wage strings.
 
export interface GroundTruthRow {
  domain: string;
  legalName: string;
  trueCountry: string; // ISO-2
  trueHeadcount: number | null;
  role: string;
  trueWageEur: number | null;
}
 
// "65.340" / "1.156" / "414.000" are thousands-separated integers, not decimals.
export function parseHeadcount(raw: string): number | null {
  const s = raw.trim();
  if (!s) return null;
  const digits = s.replace(/[.\s]/g, "");
  if (!/^\d+$/.test(digits)) return null;
  return Number(digits);
}
 
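// Takes the first number in the string after normalising the first comma to a
// dot, so "€40.39hr" and "€40,39/hr" both yield 40.39.
export function parseWageEur(raw: string): number | null {
  const m = raw.replace(",", ".").match(/(\d+(?:\.\d+)?)/);
  return m ? Number(m[1]) : null;
}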
 
const COUNTRY_TO_ISO: Record<string, string> = {
  netherlands: "NL",
  germany: "DE",
  "united kingdom": "GB",
  uk: "GB",
  "united states": "US",
  usa: "US",
  france: "FR",
  belgium: "BE",
};
 
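export function parseCountryToIso(raw: string): string {
  const s = raw.trim();
  // Check the name map first so aliases such as "UK" resolve to "GB" instead
  // of passing through unchanged as a two-letter code.
  const mapped = COUNTRY_TO_ISO[s.toLowerCase()];
  if (mapped) return mapped;
  // Already an ISO-2 code, or an unmapped name: return it upper-cased.
  return s.toUpperCase();
}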
 
// Minimal CSV split (the dataset has no quoted commas in the fields we use).
export function parseGroundTruth(csv: string): GroundTruthRow[] {
  const lines = csv.split(/\r?\n/).filter((l) => l.trim().length > 0);
  const rows: GroundTruthRow[] = [];
  for (let i = 1; i < lines.length; i++) {
    const cols = lines[i]!.split(",");
    if (cols.length < 6) continue;
    rows.push({
      domain: cols[0]!.trim().toLowerCase(),
      legalName: cols[1]!.trim(),
      trueCountry: parseCountryToIso(cols[2]!),
      trueHeadcount: parseHeadcount(cols[3]!),
      role: cols[4]!.trim(),
      trueWageEur: parseWageEur(cols[5]!),
    });
  }
  return rows;
}

Step 5: Run, verify PASS. Fix only the impl if needed.
Step 6: Commit.

git add apps/web/lib/enrichment/ground-truth.ts apps/web/lib/enrichment/ground-truth.test.ts apps/web/lib/enrichment/__fixtures__/ground-truth.csv
git commit -m "feat(enrichment): ground-truth dataset parser + fixture"

Let prettier run. Body line: Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>.
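
The parser in Task 1 splits each line on bare commas, which holds only while no field contains a quoted comma. If the fixture ever gains such a field (for example a legal name with a comma), a quote-aware split along these lines could replace lines[i].split(","). This is a hedged sketch: splitCsvLine is a hypothetical helper that is not part of the plan, and it handles single-line records only.

```typescript
// Hypothetical helper, not part of the plan's ground-truth.ts.
// Splits one CSV line on commas, honouring double-quoted fields and the
// doubled-quote escape ("" inside a quoted field means a literal ").
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }

  fields.push(current); // last field has no trailing comma
  return fields;
}
```

Whitespace is preserved inside fields, so the existing trim() calls in parseGroundTruth would still be needed for role names with trailing spaces.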

Task 2: Export the chat helper from the provider

Files: Modify apps/web/lib/enrichment/provider.ts.

Step 1: Change the line async function chat( to export async function chat( in apps/web/lib/enrichment/provider.ts. No other change.
Step 2: Verify nothing broke. pnpm exec vitest run apps/web/lib/enrichment/provider.test.ts (expect still pass) and pnpm --filter @agent-metrics/web exec tsc --noEmit.
Step 3: Commit.

git add apps/web/lib/enrichment/provider.ts
git commit -m "refactor(enrichment): export chat helper for reuse"

Body line as above.

Task 3: Global grounded wage provider

Files: Create apps/web/lib/enrichment/wage-provider.ts, apps/web/lib/enrichment/wage-provider.test.ts.

Step 1: Write the failing test apps/web/lib/enrichment/wage-provider.test.ts:

import { afterEach, describe, expect, it, vi } from "vitest";
 
import { researchGrossHourly } from "./wage-provider";
 
afterEach(() => vi.restoreAllMocks());
 
// Queues two OpenRouter-shaped fetch responses in call order: the first answers
// the research call, the second answers the extraction call.
function mockTwoCalls(researchText: string, extractJson: string) {
  const fetchMock = vi.spyOn(globalThis, "fetch");
  for (const content of [researchText, extractJson]) {
    fetchMock.mockResolvedValueOnce(
      new Response(JSON.stringify({ choices: [{ message: { content } }] }), { status: 200 }),
    );
  }
  return fetchMock;
}
 
describe("researchGrossHourly", () => {
  it("returns a euro hourly number and a fetched_cited tier when a source is found", async () => {
    mockTwoCalls(
      "A software engineer in Brazil earns about R$ ... ~ EUR 12/hour. Source: https://example.org/br",
      '```json\n{"gross_hourly_eur": 12.0, "source_url": "https://example.org/br"}\n```',
    );
    const out = await researchGrossHourly("BR", "engineering_it", 2025, { apiKey: "k" });
    expect(out.grossHourlyEur).toBeCloseTo(12, 2);
    expect(out.tier).toBe("fetched_cited");
    expect(out.sourceUrl).toContain("example.org");
  });
 
  it("falls back to llm_inferred tier when no source url is present", async () => {
    mockTwoCalls(
      "Roughly EUR 30/hour for this role.",
      '{"gross_hourly_eur": 30, "source_url": null}',
    );
    const out = await researchGrossHourly("JP", "sales", 2025, { apiKey: "k" });
    expect(out.grossHourlyEur).toBeCloseTo(30, 2);
    expect(out.tier).toBe("llm_inferred");
  });
 
  it("throws when no number can be extracted", async () => {
    mockTwoCalls("No reliable data.", '{"gross_hourly_eur": null, "source_url": null}');
    await expect(researchGrossHourly("ZZ", "legal", 2025, { apiKey: "k" })).rejects.toThrow();
  });
});

Step 2: Run, verify FAIL. pnpm exec vitest run apps/web/lib/enrichment/wage-provider.test.ts
Step 3: Implement apps/web/lib/enrichment/wage-provider.ts:

import "server-only";
 
import { type ConfidenceTier } from "@agent-metrics/types";
 
import { chat, REASONING_MODEL, RESEARCH_MODEL } from "./provider";
import type { ProviderOpts } from "./provider";
 
export interface WageResearch {
  grossHourlyEur: number;
  sourceUrl: string | null;
  tier: ConfidenceTier;
}
 
// Global wage lookup for ANY country/role. Sonar researches the prevailing
// gross wage with a source; Claude extracts a single EUR-per-hour number.
export async function researchGrossHourly(
  country: string,
  role: string,
  year: number,
  opts: ProviderOpts,
): Promise<WageResearch> {
  const research = await chat(
    RESEARCH_MODEL,
    [
      {
        role: "system",
        content:
          "You research labour-market wages. Use web search. Give the typical GROSS wage (before employer costs) and include a source URL. If the local currency is not EUR, convert to EUR.",
      },
      {
        role: "user",
        content: `What is the typical gross hourly wage in EUR for the role "${role}" in country "${country}" around ${year}? If only an annual or monthly salary is available, convert to an hourly figure assuming ~1720 working hours per year. Give the number and a source URL.`,
      },
    ],
    opts.apiKey,
  );
 
  const extract = await chat(
    REASONING_MODEL,
    [
      {
        role: "system",
        content:
          'Extract a single JSON object {"gross_hourly_eur": number|null, "source_url": string|null} from the text. gross_hourly_eur is EUR per hour. Output only the JSON.',
      },
      { role: "user", content: research },
    ],
    opts.apiKey,
  );
 
  const parsed = JSON.parse(extractJsonBlock(extract)) as {
    gross_hourly_eur: number | null;
    source_url: string | null;
  };
  if (typeof parsed.gross_hourly_eur !== "number" || !(parsed.gross_hourly_eur > 0)) {
    throw new Error(`No usable wage for ${role} in ${country}.`);
  }
  const sourceUrl = parsed.source_url ?? null;
  return {
    grossHourlyEur: parsed.gross_hourly_eur,
    sourceUrl,
    tier: sourceUrl ? "fetched_cited" : "llm_inferred",
  };
}
 
function extractJsonBlock(s: string): string {
  const fenced = s.match(/```(?:json)?\s*([\s\S]*?)```/i);
  if (fenced?.[1]) return fenced[1].trim();
  const a = s.indexOf("{");
  const b = s.lastIndexOf("}");
  if (a !== -1 && b > a) return s.slice(a, b + 1);
  throw new Error("No JSON in wage extraction output.");
}

Step 4: Run, verify PASS. Fix only the impl if needed. Then pnpm --filter @agent-metrics/web exec tsc --noEmit.
Step 5: Commit.

git add apps/web/lib/enrichment/wage-provider.ts apps/web/lib/enrichment/wage-provider.test.ts
git commit -m "feat(enrichment): global grounded wage provider"

Body line as above.
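
The research prompt in Task 3 asks the model to convert annual or monthly salaries to an hourly figure at roughly 1720 working hours per year. The same arithmetic can be written out as a sanity check on model output. This is a sketch under assumptions: annualToHourlyEur and monthlyToHourlyEur are hypothetical helpers, and the monthly conversion assumes 12 equal payments (no 13th month or holiday allowance).

```typescript
// The plan's assumed working hours per year.
const HOURS_PER_YEAR = 1720;

// Hypothetical helper: the plan delegates this conversion to the research model.
function annualToHourlyEur(
  annualGrossEur: number,
  hoursPerYear: number = HOURS_PER_YEAR,
): number {
  if (!(annualGrossEur > 0) || !(hoursPerYear > 0)) {
    throw new Error("Annual salary and hours per year must be positive.");
  }
  return annualGrossEur / hoursPerYear;
}

// Assumption: 12 equal monthly payments make up the annual gross salary.
function monthlyToHourlyEur(
  monthlyGrossEur: number,
  hoursPerYear: number = HOURS_PER_YEAR,
): number {
  return annualToHourlyEur(monthlyGrossEur * 12, hoursPerYear);
}
```

As a worked case, an annual gross of EUR 51,600 divided by 1720 hours gives EUR 30 per hour, and a monthly gross of EUR 4,300 gives the same figure under the 12-payment assumption.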

Task 4: Wire the global wage lookup into the service

Files: Modify apps/web/lib/enrichment/service.ts.

Step 1: In apps/web/lib/enrichment/service.ts, add this import near the other ./ imports:

import { researchGrossHourly } from "./wage-provider";

Step 2: Replace the entire resolveLoadedHourly function with this version. It puts a grounded-research step between the wages cache and the hard fallback (taking the place of the wage_reference lookup) and writes the researched result into the wages cache:

// Loaded hourly cost for a country/role. Cache-first on wages, then grounded
// research (works for ANY country, cached into wages), then a hard fallback.
async function resolveLoadedHourly(
  country: string,
  role: string,
): Promise<{ loadedHourlyEur: number; tier: ConfidenceTier }> {
  const db = createSupabaseAdminClient();
 
  const { data: cached } = await db
    .from("wages")
    .select("wage_data, confidence")
    .eq("country", country)
    .eq("role", role)
    .eq("year", DEFAULT_YEAR)
    .maybeSingle();
  const cachedLoaded = (cached?.wage_data as { blended_hourly_eur?: number } | undefined)
    ?.blended_hourly_eur;
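  if (typeof cachedLoaded === "number") {
    // Every cache hit is labelled seeded_reference, including rows written by
    // the grounded-research step below; the selected confidence is not read here.
    return { loadedHourlyEur: cachedLoaded, tier: "seeded_reference" };
  }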
 
  const { data: factorRow } = await db
    .from("employer_factors")
    .select("factor")
    .eq("country", country)
    .maybeSingle();
  const factor = (factorRow?.factor as number | undefined) ?? 1.3;
 
  try {
    const research = await researchGrossHourly(country, role, DEFAULT_YEAR, {
      apiKey: env.OPENROUTER_API_KEY,
    });
    const loaded = loadedHourly(research.grossHourlyEur, factor);
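    // Cache once: pay per country/role, reuse globally. supabase-js reports
    // failures via a returned { error } rather than throwing, so a failed cache
    // write is ignored here and the researched value is still returned.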
    await db.from("wages").upsert(
      {
        country,
        role,
        year: DEFAULT_YEAR,
        wage_data: {
          blended_hourly_eur: loaded,
          gross_hourly_eur: research.grossHourlyEur,
          employer_factor: factor,
          hours_per_year: 1720,
        },
        source: "grounded research",
        source_url: research.sourceUrl,
        last_researched: new Date().toISOString(),
      },
      { onConflict: "country,role,year" },
    );
    return { loadedHourlyEur: loaded, tier: research.tier };
  } catch {
    // Research failed: conservative, explicitly low-confidence, never a silent 0.
    return { loadedHourlyEur: loadedHourly(25, factor), tier: "hard_fallback" };
  }
}

Step 3: Ensure ConfidenceTier is imported in service.ts. If the import line import { type EnrichedCompany } from "@agent-metrics/types"; exists, change it to:

import { type ConfidenceTier, type EnrichedCompany } from "@agent-metrics/types";

Also confirm the wages.push({...}) block in enrichCompany still compiles (its confidence: tier now receives a ConfidenceTier from the new return type, which is correct).

Step 4: Typecheck. pnpm --filter @agent-metrics/web exec tsc --noEmit. Fix only service.ts if needed.
Step 5: Live smoke (real calls, a few cents). Create temp apps/web/scripts/smoke-wage.ts:

import { enrichCompany } from "../lib/enrichment/service";
const r = await enrichCompany("catawiki.com", { forceRefresh: true });
console.log(
  "country",
  r.record.country,
  "annual",
  r.record.business_case?.annual_labour_cost_eur,
  "conf",
  r.record.confidence.overall,
);
process.exit(0);

Run: pnpm --filter @agent-metrics/web exec tsx scripts/smoke-wage.ts then rm apps/web/scripts/smoke-wage.ts. Expected: confidence noticeably above 0.2 now (grounded wages instead of hard_fallback). If tsx cannot run it due to server-only, report DONE_WITH_CONCERNS (the eval in Task 5 exercises it via vitest). Do NOT commit the temp script.

Step 6: Commit.

git add apps/web/lib/enrichment/service.ts
git commit -m "feat(enrichment): global grounded wage resolution with wages cache"

Body line as above.
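
The wage_data payload written in Task 4 combines the researched gross wage with the employer factor. The arithmetic can be sketched as a pure function. Assumptions are labelled in the code: the real loadedHourly lives in packages/core and is not shown in this plan, so loadedHourlySketch only assumes it multiplies gross by the factor; buildWageData is a hypothetical name.

```typescript
// Shape of the wage_data jsonb written by the service in Task 4.
interface WageData {
  blended_hourly_eur: number;
  gross_hourly_eur: number;
  employer_factor: number;
  hours_per_year: number;
}

// Default employer factor for countries without a seeded employer_factors row.
const DEFAULT_EMPLOYER_FACTOR = 1.3;

// Assumption: loaded cost = gross wage x employer factor. The real
// loadedHourly in packages/core may round or differ in other ways.
function loadedHourlySketch(grossHourlyEur: number, factor: number): number {
  return grossHourlyEur * factor;
}

// Hypothetical helper that builds the cached payload for one country/role.
function buildWageData(grossHourlyEur: number, seededFactor?: number): WageData {
  const factor = seededFactor ?? DEFAULT_EMPLOYER_FACTOR;
  return {
    blended_hourly_eur: loadedHourlySketch(grossHourlyEur, factor),
    gross_hourly_eur: grossHourlyEur,
    employer_factor: factor,
    hours_per_year: 1720,
  };
}
```

Under that assumption, a researched gross wage of EUR 12 per hour in an unseeded country becomes a loaded cost of about EUR 15.60 per hour (12 x 1.3), and the hard fallback of EUR 25 gross becomes EUR 32.50.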

Task 5: Live accuracy eval (opt-in) + run it

Files: Create apps/web/lib/enrichment/eval.test.ts.

Step 1: Implement apps/web/lib/enrichment/eval.test.ts. It is gated by RUN_EVAL=1 so the normal suite never makes paid calls; it loads .env.local and a sample of the fixture, runs enrichment, and scores accuracy:

import { readFileSync } from "node:fs";
import { join } from "node:path";
 
import { describe, expect, it } from "vitest";
 
import { parseGroundTruth } from "./ground-truth";
 
const RUN = process.env.RUN_EVAL === "1";
const SAMPLE = Number(process.env.EVAL_N ?? "8");
 
// Headcount is judged on order-of-magnitude (outside-in estimates are coarse);
// wage within +/-40%; country exact.
function headcountClose(pred: number, truth: number): boolean {
  if (truth <= 0 || pred <= 0) return false;
  const ratio = pred / truth;
  return ratio >= 1 / 3 && ratio <= 3;
}
function wageClose(pred: number, truth: number): boolean {
  if (truth <= 0 || pred <= 0) return false;
  return Math.abs(pred - truth) / truth <= 0.4;
}
 
describe.runIf(RUN)("enrichment accuracy eval", () => {
  it(
    `scores a sample of ${SAMPLE} companies`,
    async () => {
      const { config } = await import("dotenv");
      config({ path: join(process.cwd(), "apps/web/.env.local") });
      const { enrichCompany } = await import("./service");
 
      const csv = readFileSync(
        join(process.cwd(), "apps/web/lib/enrichment/__fixtures__/ground-truth.csv"),
        "utf8",
      );
      const rows = parseGroundTruth(csv).slice(0, SAMPLE);
 
      let countryHits = 0;
      let headcountHits = 0;
      let wageHits = 0;
      let wageJudged = 0;
 
      for (const row of rows) {
        const { record } = await enrichCompany(row.domain, { forceRefresh: true });
        const country = (record.country ?? "").toUpperCase();
        if (country === row.trueCountry) countryHits++;
        if (
          row.trueHeadcount &&
          record.headcount_estimate &&
          headcountClose(record.headcount_estimate, row.trueHeadcount)
        ) {
          headcountHits++;
        }
        if (row.trueWageEur) {
          wageJudged++;
          const predWage = record.wages[0]?.wage_data.gross_hourly_eur;
          if (predWage && wageClose(predWage, row.trueWageEur)) wageHits++;
        }
        // eslint-disable-next-line no-console
        console.log(
          `${row.domain}: country ${country}/${row.trueCountry}, headcount ${record.headcount_estimate}/${row.trueHeadcount}`,
        );
      }
 
      const n = rows.length;
      const countryAcc = countryHits / n;
      const headcountAcc = headcountHits / n;
      const wageAcc = wageJudged ? wageHits / wageJudged : 0;
      const overall = (countryAcc + headcountAcc + wageAcc) / 3;
      // eslint-disable-next-line no-console
      console.log(
        `\nACCURACY country=${countryAcc.toFixed(2)} headcount=${headcountAcc.toFixed(2)} wage=${wageAcc.toFixed(2)} overall=${overall.toFixed(2)}`,
      );
 
      expect(overall).toBeGreaterThan(0); // informational gate; see report below
    },
    1000 * 60 * 10,
  );
});

Step 2: Confirm the normal suite still ignores the eval. Without RUN_EVAL, pnpm exec vitest run apps/web/lib/enrichment/eval.test.ts should report the suite as skipped with no tests run, and the full pnpm exec vitest run must stay green.
Step 3: RUN THE EVAL from the repo root (real calls; ~8 companies x several LLM calls each, a few euros). The cd path in the command is machine-specific; substitute your own repo root.

cd /Users/ralf/projects/agent-metrics && RUN_EVAL=1 EVAL_N=8 pnpm exec vitest run apps/web/lib/enrichment/eval.test.ts 2>&1 | tail -40

Capture the per-company lines and the final ACCURACY ... line. Report those numbers verbatim. Do not tune the engine to the test; just report what it scores.

Step 4: Commit the eval.

git add apps/web/lib/enrichment/eval.test.ts
git commit -m "test(enrichment): opt-in live accuracy eval vs ground truth"

Body line as above.
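
The scoring rules used by the eval in Task 5 can be exercised on their own, without paid calls. The sketch below repeats the two tolerance checks from the eval and adds a hypothetical summarise helper that turns hit counts into the ACCURACY figures. Applying the 0.8 target to the overall mean is an assumption; the plan only states a >=80% target without saying which figure it applies to.

```typescript
// Same tolerances as the eval: headcount within a factor of 3, wage within +/-40%.
function headcountClose(pred: number, truth: number): boolean {
  if (truth <= 0 || pred <= 0) return false;
  const ratio = pred / truth;
  return ratio >= 1 / 3 && ratio <= 3;
}

function wageClose(pred: number, truth: number): boolean {
  if (truth <= 0 || pred <= 0) return false;
  return Math.abs(pred - truth) / truth <= 0.4;
}

interface Tally {
  n: number; // companies scored
  countryHits: number;
  headcountHits: number;
  wageHits: number;
  wageJudged: number; // rows that had a ground-truth wage
}

interface Accuracy {
  country: number;
  headcount: number;
  wage: number;
  overall: number;
  meetsTarget: boolean;
}

// Hypothetical helper mirroring the eval's final arithmetic.
// Assumption: the target is compared against the unweighted mean.
function summarise(t: Tally, target: number = 0.8): Accuracy {
  if (t.n <= 0) throw new Error("Empty sample.");
  const country = t.countryHits / t.n;
  const headcount = t.headcountHits / t.n;
  const wage = t.wageJudged > 0 ? t.wageHits / t.wageJudged : 0;
  const overall = (country + headcount + wage) / 3;
  return { country, headcount, wage, overall, meetsTarget: overall >= target };
}
```

Against the AFAS row from the parser test (739 employees, EUR 32.31 per hour), a predicted headcount of 2000 passes and 2500 fails, while a predicted wage of EUR 45 passes and EUR 46 fails. With hypothetical tallies of 7/8 country, 5/8 headcount and 4/8 wage hits, the overall mean is about 0.67, which would fall short of a 0.8 target.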

Self-review (plan author)

Global, not country-limited: Task 3/4 make wages resolve via grounded research for any country, cached into wages. No country set is hard-coded; employer_factors defaults to 1.3 for unseeded countries (a factor, not a gate).
Spec coverage: wage data source (global grounded) + the >=80%-style eval are delivered; the eval is honest (reports the real number, no tuning to the test). Confidence tiers flow from the wage provider into the rollup.
No placeholders: parser, wage provider, service change, and eval are concrete. The eval's >0 assertion is intentionally informational; the printed ACCURACY line is the real deliverable to judge against the >=80% target.
Cost control: the eval is opt-in (RUN_EVAL=1) and sampled (EVAL_N), so the normal suite makes zero paid calls.

Out of scope / next

Per-country employer factors beyond the seeded six (refine later; default 1.3 is a factor not a restriction).
Industry-specific role distributions; FX precision for non-EUR wages (Sonar/Claude convert for now).
If the eval comes in under target, a follow-up iteration plan tunes prompts/headcount heuristics; that is a separate, evidence-driven plan, not a tweak buried here.
Phases 3-7 (sync API + lookups + hard cap, bulk + worker, pricing, frontend, docs).