For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement task-by-task. Steps use checkbox (`- [ ]`) syntax.
Goal: Make wage resolution work for ANY country via on-demand grounded research (cached per country/role), replacing the hard_fallback placeholder, and add a code-based accuracy eval against the ground-truth dataset that reports headcount/country/wage accuracy.
Architecture: A new wage provider does grounded research (Sonar) + extraction (Claude) to get a gross hourly wage in EUR for a given country/role, with a source. The enrichment service's wage step becomes: wages cache -> grounded research (write result into wages) -> hard_fallback, so the engine is global and pays once per country/role. A pure ground-truth parser handles the dirty CSV; an opt-in live eval test scores accuracy.
Tech Stack: TypeScript strict, OpenRouter (sonar + claude), Vitest, Supabase service-role client. No country is special-cased.
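The resolution order described under Architecture can be sketched as follows. This is a minimal illustration, not the service code: `resolveWage`, `WageResult`, and the injected `research` function are hypothetical names, the `Map` stands in for the `wages` table, and the sketch returns whatever tier was stored rather than the tier the real service reports for cache hits.

```typescript
// Hypothetical sketch of the cache -> grounded research -> hard_fallback chain.
// The Map stands in for the `wages` table; `research` stands in for the wage provider.
type Tier = "fetched_cited" | "llm_inferred" | "seeded_reference" | "hard_fallback";

interface WageResult {
  grossHourlyEur: number;
  tier: Tier;
}

type ResearchFn = (country: string, role: string) => Promise<WageResult>;

// Assumption: same EUR 25 placeholder as the current hard fallback.
const FALLBACK_GROSS_EUR = 25;

async function resolveWage(
  cache: Map<string, WageResult>,
  research: ResearchFn,
  country: string,
  role: string,
): Promise<WageResult> {
  const key = `${country}:${role}`;

  // 1. Cache hit: no paid call is made.
  const hit = cache.get(key);
  if (hit) return hit;

  try {
    // 2. Cache miss: research once, then store so later lookups are free.
    const found = await research(country, role);
    cache.set(key, found);
    return found;
  } catch {
    // 3. Research failed: return an explicit low-confidence value and do not cache it.
    return { grossHourlyEur: FALLBACK_GROSS_EUR, tier: "hard_fallback" };
  }
}
```

Not caching the fallback means a later call can still retry research for that country/role instead of pinning the placeholder.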
Context the engineer needs
- Branch: continue on `feat/enrichment-engine`.
- Phase 2a is done: `apps/web/lib/enrichment/provider.ts` (exports `researchCompany`, `extractCompany`, `writeBusinessCase`, `RESEARCH_MODEL`, `REASONING_MODEL`; the internal `chat(model, messages, apiKey)` helper is currently NOT exported), `packages/core` enrichment pure fns, and `apps/web/lib/enrichment/service.ts` (orchestrator with a private `resolveLoadedHourly(country, role)` that currently does wages-cache -> wage_reference -> hard_fallback at EUR 25). `OPENROUTER_API_KEY` is in `apps/web/.env.local` and in `env.ts`.
- Vitest already stubs `server-only` via a `vitest.config.ts` alias, so test files may import the service. Tests run from repo root: `pnpm exec vitest run <path>`. Vitest does NOT auto-load `.env.local`; a test that needs real secrets must load dotenv itself before importing modules that read `env`.
- `wages` table: `(country, role, year)` unique, `wage_data jsonb`, `source text`, `source_url text`, `confidence numeric`, `last_researched`, `ttl_days` (90). This is the global per-country/role loaded-cost cache.
- `employer_factors`: seeded for NL/DE/UK/US/FR/BE. For any other country the service uses a 1.3 default (kept; refining per-country factors is a later concern, not a country restriction).
- ConfidenceTier: `fetched_cited` > `official_statistic` > `seeded_reference` > `llm_inferred` > `hard_fallback`.
- Ground-truth file: `~/Downloads/HH Research Engine — Ground Truth Dataset - Blad2.csv`. Header: `domain,legal_name,true_primary_country,true_headcount,role_1,true_avg_wage_1`. Quirks: European number formatting where a dot is a thousands separator (`65.340` = 65340, `1.156` = 1156, `414.000` = 414000); wage strings like `€40.70/hr`, `€40.39hr` (missing slash), `€62.67hr`; trailing spaces in role names; countries are full names ("Netherlands", "Germany"). Copy this file into the repo as a fixture in Task 1 so the eval is versioned and not dependent on a Downloads path.
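The `last_researched` and `ttl_days` columns on `wages` imply a freshness check for cached rows. This plan does not define one, so the following is only a sketch of how such a check might look; `isFresh` and the `CachedWageRow` shape are hypothetical and not part of the existing service.

```typescript
// Hypothetical freshness check for a cached `wages` row.
interface CachedWageRow {
  last_researched: string; // ISO timestamp
  ttl_days: number; // 90 per the table description
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isFresh(row: CachedWageRow, now: Date = new Date()): boolean {
  const researchedAt = Date.parse(row.last_researched);
  // An unparseable timestamp is treated as stale so the row gets re-researched.
  if (Number.isNaN(researchedAt)) return false;
  const ageDays = (now.getTime() - researchedAt) / MS_PER_DAY;
  // A timestamp in the future (negative age) is also treated as stale.
  return ageDays >= 0 && ageDays <= row.ttl_days;
}
```

Passing `now` as a parameter keeps the function pure and easy to test with fixed dates.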
File structure
- Create: `apps/web/lib/enrichment/ground-truth.ts` (+ test) - pure CSV/value parsing.
- Create: `apps/web/lib/enrichment/__fixtures__/ground-truth.csv` - copied dataset.
- Modify: `apps/web/lib/enrichment/provider.ts` - export the `chat` helper.
- Create: `apps/web/lib/enrichment/wage-provider.ts` (+ test) - global grounded wage lookup.
- Modify: `apps/web/lib/enrichment/service.ts` - use the wage provider, cache into `wages`.
- Create: `apps/web/lib/enrichment/eval.test.ts` - opt-in live accuracy eval.
Task 1: Ground-truth parser (pure)
Files: Create apps/web/lib/enrichment/ground-truth.ts, apps/web/lib/enrichment/ground-truth.test.ts, and copy the fixture.
- Step 1: Copy the dataset into the repo. Run:
mkdir -p apps/web/lib/enrichment/__fixtures__ && cp "$HOME/Downloads/HH Research Engine — Ground Truth Dataset - Blad2.csv" apps/web/lib/enrichment/__fixtures__/ground-truth.csv && wc -l apps/web/lib/enrichment/__fixtures__/ground-truth.csv
Expected: the file copies and reports its line count (a header + N rows).
- Step 2: Write the failing test `apps/web/lib/enrichment/ground-truth.test.ts`:
import { describe, expect, it } from "vitest";
import { parseHeadcount, parseWageEur, parseCountryToIso, parseGroundTruth } from "./ground-truth";
describe("parseHeadcount", () => {
it("reads European thousands-separated integers", () => {
expect(parseHeadcount("65.340")).toBe(65340);
expect(parseHeadcount("1.156")).toBe(1156);
expect(parseHeadcount("414.000")).toBe(414000);
expect(parseHeadcount("464")).toBe(464);
});
});
describe("parseWageEur", () => {
it("extracts a euro hourly number from dirty strings", () => {
expect(parseWageEur("€40.70/hr")).toBeCloseTo(40.7, 2);
expect(parseWageEur("€40.39hr")).toBeCloseTo(40.39, 2);
expect(parseWageEur("€62.67hr")).toBeCloseTo(62.67, 2);
expect(parseWageEur("€15.4/hr")).toBeCloseTo(15.4, 2);
});
it("returns null when there is no number", () => {
expect(parseWageEur("n/a")).toBeNull();
});
});
describe("parseCountryToIso", () => {
it("maps full country names to ISO-2", () => {
expect(parseCountryToIso("Netherlands")).toBe("NL");
expect(parseCountryToIso("Germany")).toBe("DE");
expect(parseCountryToIso("United States")).toBe("US");
});
it("passes through an already-ISO code", () => {
expect(parseCountryToIso("FR")).toBe("FR");
});
});
describe("parseGroundTruth", () => {
it("parses CSV rows, trimming role whitespace", () => {
const csv =
"domain,legal_name,true_primary_country,true_headcount,role_1,true_avg_wage_1\n" +
"afas.nl,AFAS Software B.V.,Netherlands,739,Consultant ,€32.31/hr\n";
const rows = parseGroundTruth(csv);
expect(rows).toHaveLength(1);
expect(rows[0]!.domain).toBe("afas.nl");
expect(rows[0]!.trueHeadcount).toBe(739);
expect(rows[0]!.role).toBe("Consultant");
expect(rows[0]!.trueCountry).toBe("NL");
expect(rows[0]!.trueWageEur).toBeCloseTo(32.31, 2);
});
});
- Step 3: Run, verify FAIL. `pnpm exec vitest run apps/web/lib/enrichment/ground-truth.test.ts`
- Step 4: Implement `apps/web/lib/enrichment/ground-truth.ts`:
// Pure parsing of the ground-truth dataset. Defensive about European decimals
// (a dot used as a thousands separator) and dirty wage strings.
export interface GroundTruthRow {
domain: string;
legalName: string;
trueCountry: string; // ISO-2
trueHeadcount: number | null;
role: string;
trueWageEur: number | null;
}
// "65.340" / "1.156" / "414.000" are thousands-separated integers, not decimals.
export function parseHeadcount(raw: string): number | null {
const s = raw.trim();
if (!s) return null;
const digits = s.replace(/[.\s]/g, "");
if (!/^\d+$/.test(digits)) return null;
return Number(digits);
}
export function parseWageEur(raw: string): number | null {
const m = raw.replace(",", ".").match(/(\d+(?:\.\d+)?)/);
return m ? Number(m[1]) : null;
}
const COUNTRY_TO_ISO: Record<string, string> = {
netherlands: "NL",
germany: "DE",
"united kingdom": "GB",
uk: "GB",
"united states": "US",
usa: "US",
france: "FR",
belgium: "BE",
};
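export function parseCountryToIso(raw: string): string {
  const s = raw.trim();
  // Look up the name map first so two-letter aliases such as "UK" resolve to "GB";
  // checking for a two-letter code first would make the `uk` entry unreachable.
  const mapped = COUNTRY_TO_ISO[s.toLowerCase()];
  if (mapped) return mapped;
  // Anything else (including an already-ISO code like "FR") is upper-cased as-is.
  return s.toUpperCase();
}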
// Minimal CSV split (the dataset has no quoted commas in the fields we use).
export function parseGroundTruth(csv: string): GroundTruthRow[] {
const lines = csv.split(/\r?\n/).filter((l) => l.trim().length > 0);
const rows: GroundTruthRow[] = [];
for (let i = 1; i < lines.length; i++) {
const cols = lines[i]!.split(",");
if (cols.length < 6) continue;
rows.push({
domain: cols[0]!.trim().toLowerCase(),
legalName: cols[1]!.trim(),
trueCountry: parseCountryToIso(cols[2]!),
trueHeadcount: parseHeadcount(cols[3]!),
role: cols[4]!.trim(),
trueWageEur: parseWageEur(cols[5]!),
});
}
return rows;
}
- Step 5: Run, verify PASS. Fix only the impl if needed.
- Step 6: Commit.
git add apps/web/lib/enrichment/ground-truth.ts apps/web/lib/enrichment/ground-truth.test.ts apps/web/lib/enrichment/__fixtures__/ground-truth.csv
git commit -m "feat(enrichment): ground-truth dataset parser + fixture"
Let prettier run. Body line: Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>.
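The parser in Task 1 splits rows on bare commas, which is fine while the fixture has no quoted fields. If a legal name such as "Acme, Inc." ever appears, a quote-aware splitter would be needed. The following is a minimal sketch of one; `splitCsvLine` is a hypothetical helper, not part of the plan's deliverables, and it handles only single-line records.

```typescript
// Hypothetical quote-aware splitter for one CSV line (RFC 4180 style quoting).
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line[i]!;
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // doubled quote inside a quoted field is a literal quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current); // unquoted comma ends the field
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current); // last field has no trailing comma
  return fields;
}
```

In `parseGroundTruth`, this would take the place of `lines[i]!.split(",")`; the rest of the row mapping would stay the same.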
Task 2: Export the chat helper from the provider
Files: Modify apps/web/lib/enrichment/provider.ts.
- Step 1: Change the line `async function chat(` to `export async function chat(` in `apps/web/lib/enrichment/provider.ts`. No other change.
- Step 2: Verify nothing broke. Run `pnpm exec vitest run apps/web/lib/enrichment/provider.test.ts` (expect it to still pass) and `pnpm --filter @agent-metrics/web exec tsc --noEmit`.
- Step 3: Commit.
git add apps/web/lib/enrichment/provider.ts
git commit -m "refactor(enrichment): export chat helper for reuse"
Body line as above.
Task 3: Global grounded wage provider
Files: Create apps/web/lib/enrichment/wage-provider.ts, apps/web/lib/enrichment/wage-provider.test.ts.
- Step 1: Write the failing test `apps/web/lib/enrichment/wage-provider.test.ts`:
import { afterEach, describe, expect, it, vi } from "vitest";
import { researchGrossHourly } from "./wage-provider";
afterEach(() => vi.restoreAllMocks());
function mockTwoCalls(researchText: string, extractJson: string) {
const fetchMock = vi.spyOn(globalThis, "fetch");
for (const content of [researchText, extractJson]) {
fetchMock.mockResolvedValueOnce(
new Response(JSON.stringify({ choices: [{ message: { content } }] }), { status: 200 }),
);
}
return fetchMock;
}
describe("researchGrossHourly", () => {
it("returns a euro hourly number and a fetched_cited tier when a source is found", async () => {
mockTwoCalls(
"A software engineer in Brazil earns about R$ ... ~ EUR 12/hour. Source: https://example.org/br",
'```json\n{"gross_hourly_eur": 12.0, "source_url": "https://example.org/br"}\n```',
);
const out = await researchGrossHourly("BR", "engineering_it", 2025, { apiKey: "k" });
expect(out.grossHourlyEur).toBeCloseTo(12, 2);
expect(out.tier).toBe("fetched_cited");
expect(out.sourceUrl).toContain("example.org");
});
it("falls back to llm_inferred tier when no source url is present", async () => {
mockTwoCalls(
"Roughly EUR 30/hour for this role.",
'{"gross_hourly_eur": 30, "source_url": null}',
);
const out = await researchGrossHourly("JP", "sales", 2025, { apiKey: "k" });
expect(out.grossHourlyEur).toBeCloseTo(30, 2);
expect(out.tier).toBe("llm_inferred");
});
it("throws when no number can be extracted", async () => {
mockTwoCalls("No reliable data.", '{"gross_hourly_eur": null, "source_url": null}');
await expect(researchGrossHourly("ZZ", "legal", 2025, { apiKey: "k" })).rejects.toThrow();
});
});
- Step 2: Run, verify FAIL. `pnpm exec vitest run apps/web/lib/enrichment/wage-provider.test.ts`
- Step 3: Implement `apps/web/lib/enrichment/wage-provider.ts`:
import "server-only";
import { type ConfidenceTier } from "@agent-metrics/types";
import { chat, REASONING_MODEL, RESEARCH_MODEL } from "./provider";
import type { ProviderOpts } from "./provider";
export interface WageResearch {
grossHourlyEur: number;
sourceUrl: string | null;
tier: ConfidenceTier;
}
// Global wage lookup for ANY country/role. Sonar researches the prevailing
// gross wage with a source; Claude extracts a single EUR-per-hour number.
export async function researchGrossHourly(
country: string,
role: string,
year: number,
opts: ProviderOpts,
): Promise<WageResearch> {
const research = await chat(
RESEARCH_MODEL,
[
{
role: "system",
content:
"You research labour-market wages. Use web search. Give the typical GROSS wage (before employer costs) and include a source URL. If the local currency is not EUR, convert to EUR.",
},
{
role: "user",
content: `What is the typical gross hourly wage in EUR for the role "${role}" in country "${country}" around ${year}? If only an annual or monthly salary is available, convert to an hourly figure assuming ~1720 working hours per year. Give the number and a source URL.`,
},
],
opts.apiKey,
);
const extract = await chat(
REASONING_MODEL,
[
{
role: "system",
content:
'Extract a single JSON object {"gross_hourly_eur": number|null, "source_url": string|null} from the text. gross_hourly_eur is EUR per hour. Output only the JSON.',
},
{ role: "user", content: research },
],
opts.apiKey,
);
const parsed = JSON.parse(extractJsonBlock(extract)) as {
gross_hourly_eur: number | null;
source_url: string | null;
};
if (typeof parsed.gross_hourly_eur !== "number" || !(parsed.gross_hourly_eur > 0)) {
throw new Error(`No usable wage for ${role} in ${country}.`);
}
const sourceUrl = parsed.source_url ?? null;
return {
grossHourlyEur: parsed.gross_hourly_eur,
sourceUrl,
tier: sourceUrl ? "fetched_cited" : "llm_inferred",
};
}
function extractJsonBlock(s: string): string {
const fenced = s.match(/```(?:json)?\s*([\s\S]*?)```/i);
if (fenced?.[1]) return fenced[1].trim();
const a = s.indexOf("{");
const b = s.lastIndexOf("}");
if (a !== -1 && b > a) return s.slice(a, b + 1);
throw new Error("No JSON in wage extraction output.");
}
- Step 4: Run, verify PASS. Fix only the impl if needed. Then run `pnpm --filter @agent-metrics/web exec tsc --noEmit`.
- Step 5: Commit.
git add apps/web/lib/enrichment/wage-provider.ts apps/web/lib/enrichment/wage-provider.test.ts
git commit -m "feat(enrichment): global grounded wage provider"
Body line as above.
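The research prompt in Task 3 asks the model to convert annual or monthly salaries to an hourly figure using roughly 1720 working hours per year. The arithmetic behind that instruction is sketched below as a sanity check on returned values; `annualToHourly` and `monthlyToHourly` are hypothetical helpers, since in the plan the conversion is done by the model, not in code.

```typescript
// Assumption taken from the research prompt: ~1720 working hours per year.
const HOURS_PER_YEAR = 1720;

function annualToHourly(annualGrossEur: number): number {
  return annualGrossEur / HOURS_PER_YEAR;
}

function monthlyToHourly(monthlyGrossEur: number): number {
  // Twelve monthly payments are assumed; 13th-month schemes are ignored here.
  return (monthlyGrossEur * 12) / HOURS_PER_YEAR;
}

// Worked example: EUR 60,000 per year is about EUR 34.88 per hour.
const hourly = annualToHourly(60_000);
console.log(hourly.toFixed(2)); // prints "34.88"
```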
Task 4: Wire the global wage lookup into the service
Files: Modify apps/web/lib/enrichment/service.ts.
- Step 1: In `apps/web/lib/enrichment/service.ts`, add this import near the other `./` imports:
import { researchGrossHourly } from "./wage-provider";
- Step 2: Replace the entire `resolveLoadedHourly` function with this version (which adds the grounded-research step before the hard fallback and writes the result into the `wages` cache):
// Loaded hourly cost for a country/role. Cache-first on wages, then grounded
// research (works for ANY country, cached into wages), then a hard fallback.
async function resolveLoadedHourly(
country: string,
role: string,
): Promise<{ loadedHourlyEur: number; tier: ConfidenceTier }> {
const db = createSupabaseAdminClient();
const { data: cached } = await db
.from("wages")
.select("wage_data, confidence")
.eq("country", country)
.eq("role", role)
.eq("year", DEFAULT_YEAR)
.maybeSingle();
const cachedLoaded = (cached?.wage_data as { blended_hourly_eur?: number } | undefined)
?.blended_hourly_eur;
if (typeof cachedLoaded === "number") {
return { loadedHourlyEur: cachedLoaded, tier: "seeded_reference" };
}
const { data: factorRow } = await db
.from("employer_factors")
.select("factor")
.eq("country", country)
.maybeSingle();
const factor = (factorRow?.factor as number | undefined) ?? 1.3;
try {
const research = await researchGrossHourly(country, role, DEFAULT_YEAR, {
apiKey: env.OPENROUTER_API_KEY,
});
const loaded = loadedHourly(research.grossHourlyEur, factor);
// Cache once: pay per country/role, reuse globally.
await db.from("wages").upsert(
{
country,
role,
year: DEFAULT_YEAR,
wage_data: {
blended_hourly_eur: loaded,
gross_hourly_eur: research.grossHourlyEur,
employer_factor: factor,
hours_per_year: 1720,
},
source: "grounded research",
source_url: research.sourceUrl,
last_researched: new Date().toISOString(),
},
{ onConflict: "country,role,year" },
);
return { loadedHourlyEur: loaded, tier: research.tier };
} catch {
// Research failed: conservative, explicitly low-confidence, never a silent 0.
return { loadedHourlyEur: loadedHourly(25, factor), tier: "hard_fallback" };
}
}
- Step 3: Ensure `ConfidenceTier` is imported in `service.ts`. If the import line `import { type EnrichedCompany } from "@agent-metrics/types";` exists, change it to:
import { type ConfidenceTier, type EnrichedCompany } from "@agent-metrics/types";
Also confirm the `wages.push({...})` block in `enrichCompany` still compiles (its `confidence: tier` now receives a `ConfidenceTier` from the new return type, which is correct).
- Step 4: Typecheck. `pnpm --filter @agent-metrics/web exec tsc --noEmit`. Fix only `service.ts` if needed.
- Step 5: Live smoke (real calls, a few cents). Create temp `apps/web/scripts/smoke-wage.ts`:
import { enrichCompany } from "../lib/enrichment/service";
const r = await enrichCompany("catawiki.com", { forceRefresh: true });
console.log(
"country",
r.record.country,
"annual",
r.record.business_case?.annual_labour_cost_eur,
"conf",
r.record.confidence.overall,
);
process.exit(0);
Run: `pnpm --filter @agent-metrics/web exec tsx scripts/smoke-wage.ts` then `rm apps/web/scripts/smoke-wage.ts`. Expected: confidence noticeably above 0.2 now (grounded wages instead of hard_fallback). If tsx cannot run it due to `server-only`, report DONE_WITH_CONCERNS (the eval in Task 5 exercises it via vitest). Do NOT commit the temp script.
- Step 6: Commit.
git add apps/web/lib/enrichment/service.ts
git commit -m "feat(enrichment): global grounded wage resolution with wages cache"
Body line as above.
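The loaded-cost arithmetic in Task 4 can be checked by hand. The sketch below ASSUMES that `loadedHourly` multiplies the gross wage by the employer factor; the real function lives in `packages/core` and may differ (for example, in rounding). `loadedHourlySketch` is a hypothetical stand-in, not the actual helper.

```typescript
// ASSUMPTION: loaded cost = gross wage x employer factor.
// The real `loadedHourly` is defined elsewhere and is not shown in this plan.
const DEFAULT_EMPLOYER_FACTOR = 1.3; // default the plan uses for unseeded countries
const HARD_FALLBACK_GROSS_EUR = 25; // gross value used when research fails

function loadedHourlySketch(
  grossHourlyEur: number,
  factor: number = DEFAULT_EMPLOYER_FACTOR,
): number {
  return grossHourlyEur * factor;
}

// Hard fallback for an unseeded country: 25 x 1.3, about EUR 32.50 per hour.
const fallbackLoaded = loadedHourlySketch(HARD_FALLBACK_GROSS_EUR);

// Grounded result of EUR 12/hour gross (the value used in the wage-provider test):
// 12 x 1.3, about EUR 15.60 per hour loaded.
const groundedLoaded = loadedHourlySketch(12);

console.log(fallbackLoaded.toFixed(2), groundedLoaded.toFixed(2));
```

Under this assumption, the value written to `wage_data.blended_hourly_eur` is the loaded figure, while `gross_hourly_eur` keeps the researched gross figure alongside it.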
Task 5: Live accuracy eval (opt-in) + run it
Files: Create apps/web/lib/enrichment/eval.test.ts.
- Step 1: Implement `apps/web/lib/enrichment/eval.test.ts`. It is gated by `RUN_EVAL=1` so the normal suite never makes paid calls; it loads `.env.local` and a sample of the fixture, runs enrichment, and scores accuracy:
import { readFileSync } from "node:fs";
import { join } from "node:path";
import { describe, expect, it } from "vitest";
import { parseGroundTruth } from "./ground-truth";
const RUN = process.env.RUN_EVAL === "1";
const SAMPLE = Number(process.env.EVAL_N ?? "8");
// Headcount counts as a hit within a factor of 3 either way (outside-in
// estimates are coarse); wage within +/-40% of truth; country must match exactly.
function headcountClose(pred: number, truth: number): boolean {
if (truth <= 0 || pred <= 0) return false;
const ratio = pred / truth;
return ratio >= 1 / 3 && ratio <= 3;
}
function wageClose(pred: number, truth: number): boolean {
if (truth <= 0 || pred <= 0) return false;
return Math.abs(pred - truth) / truth <= 0.4;
}
describe.runIf(RUN)("enrichment accuracy eval", () => {
it(
`scores a sample of ${SAMPLE} companies`,
async () => {
const { config } = await import("dotenv");
config({ path: join(process.cwd(), "apps/web/.env.local") });
const { enrichCompany } = await import("./service");
const csv = readFileSync(
join(process.cwd(), "apps/web/lib/enrichment/__fixtures__/ground-truth.csv"),
"utf8",
);
const rows = parseGroundTruth(csv).slice(0, SAMPLE);
let countryHits = 0;
let headcountHits = 0;
let wageHits = 0;
let wageJudged = 0;
for (const row of rows) {
const { record } = await enrichCompany(row.domain, { forceRefresh: true });
const country = (record.country ?? "").toUpperCase();
if (country === row.trueCountry) countryHits++;
if (
row.trueHeadcount &&
record.headcount_estimate &&
headcountClose(record.headcount_estimate, row.trueHeadcount)
) {
headcountHits++;
}
if (row.trueWageEur) {
wageJudged++;
const predWage = record.wages[0]?.wage_data.gross_hourly_eur;
if (predWage && wageClose(predWage, row.trueWageEur)) wageHits++;
}
// eslint-disable-next-line no-console
console.log(
`${row.domain}: country ${country}/${row.trueCountry}, headcount ${record.headcount_estimate}/${row.trueHeadcount}`,
);
}
const n = rows.length;
const countryAcc = countryHits / n;
const headcountAcc = headcountHits / n;
const wageAcc = wageJudged ? wageHits / wageJudged : 0;
const overall = (countryAcc + headcountAcc + wageAcc) / 3;
// eslint-disable-next-line no-console
console.log(
`\nACCURACY country=${countryAcc.toFixed(2)} headcount=${headcountAcc.toFixed(2)} wage=${wageAcc.toFixed(2)} overall=${overall.toFixed(2)}`,
);
expect(overall).toBeGreaterThan(0); // informational gate; see report below
},
1000 * 60 * 10,
);
});
- Step 2: Confirm the normal suite still ignores the eval. `pnpm exec vitest run apps/web/lib/enrichment/eval.test.ts` (without `RUN_EVAL`) should report the test as skipped (0 ran), and the full `pnpm exec vitest run` must stay green.
- Step 3: RUN THE EVAL (real calls; ~8 companies x several LLM calls each, a few euros).
cd /Users/ralf/projects/agent-metrics && RUN_EVAL=1 EVAL_N=8 pnpm exec vitest run apps/web/lib/enrichment/eval.test.ts 2>&1 | tail -40
Capture the per-company lines and the final ACCURACY ... line. Report those numbers verbatim. Do not tune the engine to the test; just report what it scores.
- Step 4: Commit the eval.
git add apps/web/lib/enrichment/eval.test.ts
git commit -m "test(enrichment): opt-in live accuracy eval vs ground truth"
Body line as above.
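To make the eval's tolerance bands concrete, here is a worked example that reuses the two scoring functions from Task 5 against the AFAS row from the parser test (739 staff, EUR 32.31/hr). The predicted values are made up for illustration and are not eval results.

```typescript
// Same tolerance bands as the eval: headcount within a factor of 3, wage within +/-40%.
function headcountClose(pred: number, truth: number): boolean {
  if (truth <= 0 || pred <= 0) return false;
  const ratio = pred / truth;
  return ratio >= 1 / 3 && ratio <= 3;
}

function wageClose(pred: number, truth: number): boolean {
  if (truth <= 0 || pred <= 0) return false;
  return Math.abs(pred - truth) / truth <= 0.4;
}

// Illustrative predictions only; truth values come from the AFAS test row.
const worked = {
  headcount250: headcountClose(250, 739), // ratio ~0.34, just inside [1/3, 3]
  headcount200: headcountClose(200, 739), // ratio ~0.27, outside the band
  wage40: wageClose(40, 32.31), // relative error ~0.24, inside 40%
  wage50: wageClose(50, 32.31), // relative error ~0.55, outside 40%
};

console.log(worked);
```

The bands are deliberately wide, so a hit here means "right ballpark", not "precise".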
Self-review (plan author)
- Global, not country-limited: Task 3/4 make wages resolve via grounded research for any country, cached into `wages`. No country set is hard-coded; `employer_factors` defaults to 1.3 for unseeded countries (a factor, not a gate).
- Spec coverage: wage data source (global grounded) + the >=80%-style eval are delivered; the eval is honest (reports the real number, no tuning to the test). Confidence tiers flow from the wage provider into the rollup.
- No placeholders: parser, wage provider, service change, and eval are concrete. The eval's >0 assertion is intentionally informational; the printed ACCURACY line is the real deliverable to judge against the >=80% target.
- Cost control: the eval is opt-in (`RUN_EVAL=1`) and sampled (`EVAL_N`), so the normal suite makes zero paid calls.
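The tier ordering listed under Context can be expressed as data, which makes "which tier is weaker" a simple index comparison. This plan does not specify how the rollup combines tiers, so "weakest input wins" below is an assumption, and `weakestTier` is a hypothetical helper.

```typescript
// Tiers from strongest to weakest, in the order given in this plan.
const TIER_ORDER = [
  "fetched_cited",
  "official_statistic",
  "seeded_reference",
  "llm_inferred",
  "hard_fallback",
] as const;

type Tier = (typeof TIER_ORDER)[number];

// ASSUMPTION: a rollup is only as trustworthy as its weakest input.
function weakestTier(tiers: readonly Tier[]): Tier | null {
  let worst: Tier | null = null;
  for (const t of tiers) {
    // A larger index in TIER_ORDER means lower confidence.
    if (worst === null || TIER_ORDER.indexOf(t) > TIER_ORDER.indexOf(worst)) {
      worst = t;
    }
  }
  return worst; // null when no tiers were supplied
}
```

For example, `weakestTier(["fetched_cited", "llm_inferred", "seeded_reference"])` returns `"llm_inferred"`.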
Out of scope / next
- Per-country employer factors beyond the seeded six (refine later; default 1.3 is a factor not a restriction).
- Industry-specific role distributions; FX precision for non-EUR wages (Sonar/Claude convert for now).
- If the eval comes in under target, a follow-up iteration plan tunes prompts/headcount heuristics; that is a separate, evidence-driven plan, not a tweak buried here.
- Phases 3-7 (sync API + lookups + hard cap, bulk + worker, pricing, frontend, docs).