`npm install autoevals` → 12 custom judges → `npm run eval`
Autoevals provides `LLMClassifierFromTemplate`: you give it a prompt template with `{{input}}` / `{{output}}` / `{{expected}}` slots plus choice scores, and it calls any OpenAI-compatible API and returns a 0-1 score. Our CF Workers AI endpoint is OpenAI-compatible. So:
```
seeds-link/
├── src/lib/eval/
│   ├── judges.ts       ← 12 judge prompts as LLMClassifierFromTemplate configs
│   ├── runner.ts       ← fetch seeds → conjugate → judge → aggregate
│   └── report.ts       ← terminal scorecard + JSON output
├── eval/
│   ├── thresholds.json ← minimum scores per judge (CI gate)
│   └── results/        ← timestamped JSON results (gitignored)
├── scripts/
│   └── eval.ts         ← CLI entry point
└── package.json        ← "eval": "npx tsx scripts/eval.ts"
```
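Wiring autoevals to our endpoint is a one-time setup step. A minimal sketch, assuming autoevals' `init` accepts a custom OpenAI client (the `baseURL` and env var name here are placeholders, not our real config):

```typescript
import OpenAI from "openai";
import { init } from "autoevals";

// Point autoevals at an OpenAI-compatible endpoint (e.g. CF Workers AI).
// The URL and env var are hypothetical placeholders.
const client = new OpenAI({
  baseURL: "https://example.workers.dev/v1",
  apiKey: process.env.EVAL_API_KEY ?? "",
});

// All judges created after this call will use this client.
init({ client });
```

Alternatively, since autoevals defaults to the standard OpenAI client, setting `OPENAI_BASE_URL` / `OPENAI_API_KEY` in the environment should work without any code.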
Our 12 judges become `LLMClassifierFromTemplate` instances. Example for Contract Independence:
Note: autoevals uses A/B/C/D choice classification, not 0-10 numeric scoring. We'd adapt our judges to a binary or multi-choice format, which is actually more reliable: LLMs are better at classification than at numeric scoring.
| Judge | Evaluates | Format |
|---|---|---|
| Seed Value | prompt | A=valuable / B=obvious to model |
| Prompt Specificity | prompt | A=specific / B=vague |
| Contract Independence | conjugation | A=independent / B=parroting |
| Contract Concreteness | conjugation | A=concrete / B=abstract |
| Navigability | conjugation | A=testable / B=untestable |
| Token Density | conjugation | A=dense / B=padded |
| Slug Compression | conjugation | A=3 axes / B=overlapping |
| Distinctiveness | conjugation | A=unique / B=confusable |
| Grammar | conjugation | A=correct / B=errors |
| Referential Clarity | conjugation | A=self-contained / B=assumes context |
| Parent-Child Coherence | conjugation | A=coherent / B=mismatch |
| Verification Accuracy | conjugation | A=correct calls / B=false reject/pass |
- `npm install autoevals openai` (openai is a peer dep)
- `src/lib/eval/judges.ts`: 12 `LLMClassifierFromTemplate` configs
- `src/lib/eval/runner.ts`: fetch seeds from the API, conjugate, run judges, aggregate
- `src/lib/eval/report.ts`: terminal formatting + JSON serialization
- `scripts/eval.ts`: CLI entry point with flags (`--slug`, `--self-judge`, `--ci`, `--html`)
- Add `"eval": "npx tsx scripts/eval.ts"` to package.json
- `eval/thresholds.json` with minimum pass rates per judge
- `npm run eval` should produce the same scorecard we've been getting from LangFuse

Nothing runtime. LangFuse stays as a historical reference (all our past experiments are there), but the active development loop moves to `npm run eval`. If we ever want to push results back to LangFuse for comparison, we can add a `--langfuse` flag later.
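The CI gate itself can be plain arithmetic over judge results: aggregate per-judge pass rates, then compare them against the minimums in `eval/thresholds.json`. A sketch with illustrative names and numbers (the real aggregation lives in `runner.ts`):

```typescript
// Each sample judged by each judge yields a 0-or-1 score
// (from the A/B choiceScores mapping).
type JudgeResult = { judge: string; score: number };

// Aggregate raw results into a per-judge pass rate (0-1).
function passRates(results: JudgeResult[]): Record<string, number> {
  const sums: Record<string, { pass: number; total: number }> = {};
  for (const r of results) {
    const s = (sums[r.judge] ??= { pass: 0, total: 0 });
    s.pass += r.score;
    s.total += 1;
  }
  return Object.fromEntries(
    Object.entries(sums).map(([judge, s]) => [judge, s.pass / s.total]),
  );
}

// Return the judges whose pass rate falls below their threshold;
// a non-empty result fails the CI run.
function ciGate(
  rates: Record<string, number>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([judge, min]) => (rates[judge] ?? 0) < min)
    .map(([judge]) => judge);
}
```

With `--ci`, the script would exit non-zero when `ciGate` returns any judges, which is all a CI pipeline needs.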