Scope: Local Eval System

Replace LangFuse dependency with an in-repo eval that runs against the same CF Workers AI model used in production.

What it does

A single command — npm run eval — that:

  1. Fetches all seeds from the live API (or local D1)
  2. Runs each seed through the conjugation prompt (same model: @cf/openai/gpt-oss-120b)
  3. Runs 12 judge prompts against each conjugation (using Anthropic or the same CF model)
  4. Outputs a scorecard to stdout + writes a JSON results file
  5. Optionally fails CI if any judge score drops below a threshold
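The five steps above can be sketched as a single runner loop. This is a minimal sketch, not the real `runner.ts`; `fetchSeeds`, `conjugate`, and the judge functions are hypothetical stand-ins for the actual implementations.

```typescript
// Illustrative shape of the eval runner (steps 1–3). The real runner.ts
// would wire these to the live API / D1, the conjugation prompt, and the
// 12 judge prompts; here they are injected as plain async functions.
type Seed = { slug: string; text: string };
type JudgeResult = { judge: string; score: number; rationale: string };

async function runEval(
  fetchSeeds: () => Promise<Seed[]>,                       // step 1
  conjugate: (seed: Seed) => Promise<string>,              // step 2
  judges: ((output: string) => Promise<JudgeResult>)[],    // step 3
): Promise<Map<string, JudgeResult[]>> {
  const results = new Map<string, JudgeResult[]>();
  for (const seed of await fetchSeeds()) {
    const output = await conjugate(seed);
    const scores = await Promise.all(judges.map((j) => j(output)));
    results.set(seed.slug, scores);
  }
  return results; // steps 4–5 (scorecard, threshold check) consume this map
}
```

Steps 4 and 5 (report formatting and threshold checks) would live in `report.ts` and consume the returned map.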

Architecture

seeds-link/
├── src/lib/
│   ├── prompts.ts          ← conjugation prompt (already exists)
│   ├── analyze.ts          ← conjugation runner (already exists)
│   └── eval/
│       ├── judges.ts       ← 12 judge prompt strings (move from run-experiment.mjs)
│       ├── runner.ts       ← orchestrator: fetch seeds → conjugate → judge → score
│       └── report.ts       ← scorecard formatting (terminal + JSON + optional HTML)
├── eval/
│   ├── results/            ← timestamped JSON results (gitignored)
│   └── thresholds.json     ← minimum acceptable scores per judge
└── package.json            ← "eval" script added
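For illustration, `thresholds.json` could simply map each judge name to its minimum acceptable score. The judge names below are placeholders, not the actual 12 judges:

```json
{
  "faithfulness": 0.8,
  "grammar": 0.9,
  "tone": 0.7
}
```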

Judge model choice

Two options for the judge LLM:

Option A: CF Workers AI (same model)
Free, no API key needed, uses the wrangler OAuth token we already have. The model judges its own output — circular but consistent. Good for fast iteration.
Option B: Anthropic Claude (external judge)
Independent evaluation — a different model judges the output. More trustworthy scores. Needs ANTHROPIC_API_KEY in env. This is what we've been using.

Recommendation: default to Option B (Anthropic) for quality, with a --self-judge flag for Option A when iterating fast.
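The default-to-B-with-a-flag behavior could be resolved up front in the runner. A sketch, assuming the CLI args and env are passed in; the backend names are illustrative:

```typescript
// Option B (Anthropic) is the default; --self-judge opts into Option A.
type JudgeBackend = "anthropic" | "cf-self";

function pickJudgeBackend(
  argv: string[],
  env: Record<string, string | undefined>,
): JudgeBackend {
  if (argv.includes("--self-judge")) return "cf-self"; // Option A: fast, free, circular
  if (!env.ANTHROPIC_API_KEY) {
    // Fail early with a hint rather than mid-run on the first judge call.
    throw new Error("ANTHROPIC_API_KEY not set; pass --self-judge to use the CF model");
  }
  return "anthropic"; // Option B: independent judge
}
```

Failing fast on a missing key keeps a 16-seed run from dying partway through the judge phase.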

What changes vs current setup

| Current (LangFuse) | New (Local) |
| --- | --- |
| Judge prompts in run-experiment.mjs (3000+ lines) | Judge prompts in src/lib/eval/judges.ts (clean, typed) |
| LangFuse SDK for tracing + scoring | Direct API calls, JSON output to eval/results/ |
| Rate limited to 30 req/min (Hobby plan) | No rate limits (only model API limits) |
| ~20 min for 16 seeds | ~3-5 min for 16 seeds |
| Results in LangFuse cloud dashboard | Results in terminal + JSON file + optional HTML report |
| CF token expires every ~1 hour | Same issue, but the runner reads the live token (already fixed) |
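A timestamped results file in eval/results/ might look like this. The shape is illustrative only; field and judge names are placeholders:

```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "model": "@cf/openai/gpt-oss-120b",
  "judge": "anthropic",
  "seeds": [
    {
      "slug": "gmail.api.ready",
      "scores": { "faithfulness": 0.92, "grammar": 1.0 }
    }
  ]
}
```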

CLI interface

# Run full eval on all seeds
npm run eval

# Run eval on specific seeds
npm run eval -- --slug gmail.api.ready --slug gmail.inbox.health

# Use CF model as judge (fast, free, circular)
npm run eval -- --self-judge

# CI mode: fail if any score below threshold
npm run eval -- --ci

# Output HTML report
npm run eval -- --html eval/results/latest.html
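The `--ci` mode reduces to comparing per-judge means against `thresholds.json` and exiting non-zero on any failure. A sketch under those assumptions; names are illustrative:

```typescript
// Return a human-readable failure line for every judge whose mean score
// falls below its threshold. Empty array means the CI gate passes.
function ciFailures(
  means: Record<string, number>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([judge, min]) => (means[judge] ?? 0) < min)
    .map(([judge, min]) => `${judge}: ${means[judge] ?? 0} < ${min}`);
}

// In --ci mode the runner would then do:
//   const failures = ciFailures(means, thresholds);
//   if (failures.length > 0) { failures.forEach((f) => console.error(f)); process.exit(1); }
```

Treating a judge missing from `means` as 0 makes a judge that silently never ran fail the gate instead of passing it.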

What I need from design-engineer

This is code work: I write the judge prompts and thresholds, and design-engineer implements the TypeScript runner.

Effort estimate

~2-3 hours of coding work. Most of the logic already exists in eval/run-experiment.mjs — it just needs to be cleaned up, typed, and moved into the main codebase.

Next steps:
  1. Dom approves scope
  2. I write judges.ts (prompt strings) + thresholds.json
  3. Open a thread in #design-engineer for the TypeScript implementation
  4. Once built: npm run eval replaces the entire LangFuse workflow