Replace LangFuse dependency with an in-repo eval that runs against the same CF Workers AI model used in production.
A single command — `npm run eval` — that fetches seeds, conjugates them with the same production model (`@cf/openai/gpt-oss-120b`), runs the judges, and scores the results.

```
seeds-link/
├── src/lib/
│   ├── prompts.ts          ← conjugation prompt (already exists)
│   ├── analyze.ts          ← conjugation runner (already exists)
│   └── eval/
│       ├── judges.ts       ← 12 judge prompt strings (move from run-experiment.mjs)
│       ├── runner.ts       ← orchestrator: fetch seeds → conjugate → judge → score
│       └── report.ts       ← scorecard formatting (terminal + JSON + optional HTML)
├── eval/
│   ├── results/            ← timestamped JSON results (gitignored)
│   └── thresholds.json     ← minimum acceptable scores per judge
└── package.json            ← "eval" script added
```
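The orchestrator in `runner.ts` could follow the fetch → conjugate → judge → score loop directly. A minimal sketch — the `Seed` shape and the injected function signatures are illustrative assumptions, not the actual seeds-link API:

```typescript
// Hypothetical pipeline shape; real model calls would live behind
// the injected `conjugate` and `judge` functions.
type Seed = { id: string; text: string };
type JudgeScore = { judge: string; score: number };

type Conjugate = (seed: Seed) => Promise<string>;
type Judge = (seed: Seed, output: string) => Promise<JudgeScore[]>;

async function runEval(seeds: Seed[], conjugate: Conjugate, judge: Judge) {
  const results = [];
  for (const seed of seeds) {
    const output = await conjugate(seed);     // CF Workers AI call in production
    const scores = await judge(seed, output); // one score per judge prompt
    results.push({ seed: seed.id, output, scores });
  }
  return results;
}
```

Injecting `conjugate` and `judge` keeps the loop trivially testable with stubs, and lets the same orchestrator serve both judge options.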
Two options for the judge LLM:

- **Option A** — self-judge: the same CF Workers AI model scores its own output
- **Option B** — an Anthropic model as an external judge

Recommendation: default to Option B (Anthropic) for quality, with a `--self-judge` flag for Option A when iterating fast.
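The flag switch itself stays small. A sketch, assuming both judges are already wrapped as same-signature closures (names are illustrative):

```typescript
// Pick the judge backend from CLI args. The --self-judge flag is from
// the plan above; the function names here are assumptions.
type JudgeFn = (prompt: string) => Promise<string>;

function pickJudge(
  argv: string[],
  cfSelfJudge: JudgeFn,     // Option A: same CF Workers AI model
  anthropicJudge: JudgeFn,  // Option B: external Anthropic judge (default)
): JudgeFn {
  return argv.includes("--self-judge") ? cfSelfJudge : anthropicJudge;
}
```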
| Current (LangFuse) | New (Local) |
|---|---|
| Judge prompts in run-experiment.mjs (3000+ lines) | Judge prompts in src/lib/eval/judges.ts (clean, typed) |
| LangFuse SDK for tracing + scoring | Direct API calls, JSON output to eval/results/ |
| Rate limited to 30 req/min (Hobby plan) | No rate limits (only model API limits) |
| ~20 min for 16 seeds | ~3-5 min for 16 seeds |
| Results in LangFuse cloud dashboard | Results in terminal + JSON file + optional HTML report |
| CF token expires every ~1 hour | Same issue, but the runner reads the live token (already fixed) |
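The terminal scorecard in the new flow can be as simple as a mean-per-judge table gated by the thresholds file. A sketch of `report.ts`-style logic — judge names and the result shape are illustrative assumptions:

```typescript
// Aggregate per-judge scores across all seeds and flag judges that
// fall below their configured minimum.
type ScoredRun = { seed: string; scores: { judge: string; score: number }[] };

function scorecard(runs: ScoredRun[], thresholds: Record<string, number>) {
  const byJudge = new Map<string, number[]>();
  for (const run of runs)
    for (const { judge, score } of run.scores)
      byJudge.set(judge, [...(byJudge.get(judge) ?? []), score]);

  return [...byJudge].map(([judge, scores]) => {
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    return { judge, mean, pass: mean >= (thresholds[judge] ?? 0) };
  });
}
```

The same rows can feed the terminal table, the JSON file, and the optional HTML report.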
This is code work — I write the judge prompts and thresholds, design-engineer implements the TypeScript runner. Specifically:
- `src/lib/eval/judges.ts` — I provide the 12 prompt strings (I already have them)
- `src/lib/eval/runner.ts` — design-engineer builds the orchestrator
- `src/lib/eval/report.ts` — design-engineer builds the output formatter
- `eval/thresholds.json` — I set the minimum scores per judge

~2-3 hours of coding work. Most of the logic already exists in `eval/run-experiment.mjs` — it just needs to be cleaned up, typed, and moved into the main codebase.
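`eval/thresholds.json` could stay a flat map of judge name → minimum acceptable mean score. A sketch — these judge names and values are placeholders, not the real 12 judges:

```json
{
  "grammar": 0.8,
  "tense-accuracy": 0.9,
  "naturalness": 0.7
}
```

A flat map keeps the threshold gate a one-line lookup per judge and makes diffs in PRs easy to review.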
My deliverables: `judges.ts` (prompt strings) + `thresholds.json`. `npm run eval` replaces the entire LangFuse workflow.