Scope: Local Eval System

Replace LangFuse dependency with an in-repo eval that runs against the same CF Workers AI model used in production.

What it does

A single command — npm run eval — that:

  1. Fetches all seeds from the live API (or local D1)
  2. Runs each seed through the conjugation prompt (same model: @cf/openai/gpt-oss-120b)
  3. Runs 12 judge prompts against each conjugation (using Anthropic or the same CF model)
  4. Outputs a scorecard to stdout + writes a JSON results file
  5. Optionally fails CI if any judge score drops below a threshold
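The five steps above can be sketched as a single runner loop. This is a minimal sketch, not the real `runner.ts`; `fetchSeeds`, `conjugate`, and the judge functions are hypothetical stand-ins for the actual implementations.

```typescript
// Illustrative shape of the eval runner (steps 1–3). The real runner.ts
// would wire these to the live API / D1, the conjugation prompt, and the
// 12 judge prompts; here they are injected as plain async functions.
type Seed = { slug: string; text: string };
type JudgeResult = { judge: string; score: number; rationale: string };

async function runEval(
  fetchSeeds: () => Promise<Seed[]>,                       // step 1
  conjugate: (seed: Seed) => Promise<string>,              // step 2
  judges: ((output: string) => Promise<JudgeResult>)[],    // step 3
): Promise<Map<string, JudgeResult[]>> {
  const results = new Map<string, JudgeResult[]>();
  for (const seed of await fetchSeeds()) {
    const output = await conjugate(seed);
    const scores = await Promise.all(judges.map((j) => j(output)));
    results.set(seed.slug, scores);
  }
  return results; // steps 4–5 (scorecard, threshold check) consume this map
}
```

Steps 4 and 5 (report formatting and threshold checks) would live in `report.ts` and consume the returned map.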

Architecture

seeds-link/
├── src/lib/
│   ├── prompts.ts          ← conjugation prompt (already exists)
│   ├── analyze.ts          ← conjugation runner (already exists)
│   └── eval/
│       ├── judges.ts       ← 12 judge prompt strings (move from run-experiment.mjs)
│       ├── runner.ts       ← orchestrator: fetch seeds → conjugate → judge → score
│       └── report.ts       ← scorecard formatting (terminal + JSON + optional HTML)
├── eval/
│   ├── results/            ← timestamped JSON results (gitignored)
│   └── thresholds.json     ← minimum acceptable scores per judge
└── package.json            ← "eval" script added
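For illustration, `thresholds.json` could simply map each judge name to its minimum acceptable score. The judge names below are placeholders, not the actual 12 judges:

```json
{
  "faithfulness": 0.8,
  "grammar": 0.9,
  "tone": 0.7
}
```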

Judge model choice

Two options for the judge LLM:

Option A: CF Workers AI (same model)
Free, no API key needed, uses the wrangler OAuth token we already have. The model judges its own output — circular but consistent. Good for fast iteration.
Option B: Anthropic Claude (external judge)
Independent evaluation — a different model judges the output. More trustworthy scores. Needs ANTHROPIC_API_KEY in env. This is what we've been using.

Recommendation: default to Option B (Anthropic) for quality, with a --self-judge flag for Option A when iterating fast.
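The default-to-B-with-a-flag behavior could be resolved up front in the runner. A sketch, assuming the CLI args and env are passed in; the backend names are illustrative:

```typescript
// Option B (Anthropic) is the default; --self-judge opts into Option A.
type JudgeBackend = "anthropic" | "cf-self";

function pickJudgeBackend(
  argv: string[],
  env: Record<string, string | undefined>,
): JudgeBackend {
  if (argv.includes("--self-judge")) return "cf-self"; // Option A: fast, free, circular
  if (!env.ANTHROPIC_API_KEY) {
    // Fail early with a hint rather than mid-run on the first judge call.
    throw new Error("ANTHROPIC_API_KEY not set; pass --self-judge to use the CF model");
  }
  return "anthropic"; // Option B: independent judge
}
```

Failing fast on a missing key keeps a 16-seed run from dying partway through the judge phase.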

What changes vs current setup

| Current (LangFuse) | New (Local) |
| --- | --- |
| Judge prompts in run-experiment.mjs (3000+ lines) | Judge prompts in src/lib/eval/judges.ts (clean, typed) |
| LangFuse SDK for tracing + scoring | Direct API calls, JSON output to eval/results/ |
| Rate limited to 30 req/min (Hobby plan) | No rate limits (only model API limits) |
| ~20 min for 16 seeds | ~3-5 min for 16 seeds |
| Results in LangFuse cloud dashboard | Results in terminal + JSON file + optional HTML report |
| CF token expires every ~1 hour | Same issue, but the runner reads the live token (already fixed) |
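A timestamped results file in eval/results/ might look like this. The shape is illustrative only; field and judge names are placeholders:

```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "model": "@cf/openai/gpt-oss-120b",
  "judge": "anthropic",
  "seeds": [
    {
      "slug": "gmail.api.ready",
      "scores": { "faithfulness": 0.92, "grammar": 1.0 }
    }
  ]
}
```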

CLI interface

# Run full eval on all seeds
npm run eval

# Run eval on specific seeds
npm run eval -- --slug gmail.api.ready --slug gmail.inbox.health

# Use CF model as judge (fast, free, circular)
npm run eval -- --self-judge

# CI mode: fail if any score below threshold
npm run eval -- --ci

# Output HTML report
npm run eval -- --html eval/results/latest.html
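The `--ci` mode reduces to comparing per-judge means against `thresholds.json` and exiting non-zero on any failure. A sketch under those assumptions; names are illustrative:

```typescript
// Return a human-readable failure line for every judge whose mean score
// falls below its threshold. Empty array means the CI gate passes.
function ciFailures(
  means: Record<string, number>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([judge, min]) => (means[judge] ?? 0) < min)
    .map(([judge, min]) => `${judge}: ${means[judge] ?? 0} < ${min}`);
}

// In --ci mode the runner would then do:
//   const failures = ciFailures(means, thresholds);
//   if (failures.length > 0) { failures.forEach((f) => console.error(f)); process.exit(1); }
```

Treating a judge missing from `means` as 0 makes a judge that silently never ran fail the gate instead of passing it.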

What I need from design-engineer

This is code work: I write the judge prompts and thresholds, and design-engineer implements the TypeScript runner.

Effort estimate

~2-3 hours of coding work. Most of the logic already exists in eval/run-experiment.mjs — it just needs to be cleaned up, typed, and moved into the main codebase.

Next steps:
  1. Dom approves scope
  2. I write judges.ts (prompt strings) + thresholds.json
  3. Open a thread in #design-engineer for the TypeScript implementation
  4. Once built: npm run eval replaces the entire LangFuse workflow