Autoevals Integration Plan

npm install autoevals → 12 custom judges → npm run eval

How autoevals works

Autoevals provides LLMClassifierFromTemplate — you give it a prompt template with {{input}} / {{output}} / {{expected}} slots plus choice scores; it calls any OpenAI-compatible API and returns a 0-1 score. Our CF Workers AI endpoint is OpenAI-compatible. So:

// Point autoevals at CF Workers AI (OpenAI-compatible endpoint)
import OpenAI from "openai";
import { init } from "autoevals";

const client = new OpenAI({
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT}/ai/v1`,
  apiKey: getLiveCfToken(),
});
init({ client }); // all scorers now route through this client

// Or point at Anthropic's OpenAI-compatible layer for independent judging
const anthropicClient = new OpenAI({
  baseURL: "https://api.anthropic.com/v1",
  apiKey: process.env.ANTHROPIC_API_KEY,
});
init({ client: anthropicClient });

File structure

seeds-link/
├── src/lib/eval/
│   ├── judges.ts          ← 12 judge prompts as LLMClassifierFromTemplate configs
│   ├── runner.ts          ← fetch seeds → conjugate → judge → aggregate
│   └── report.ts          ← terminal scorecard + JSON output
├── eval/
│   ├── thresholds.json    ← minimum scores per judge (CI gate)
│   └── results/           ← timestamped JSON results (gitignored)
├── scripts/
│   └── eval.ts            ← CLI entry point
└── package.json           ← "eval": "npx tsx scripts/eval.ts"

Each judge as an autoevals scorer

Our 12 judges become LLMClassifierFromTemplate instances. Example for Contract Independence:

import { LLMClassifierFromTemplate } from "autoevals";
export const contractIndependence = LLMClassifierFromTemplate({
  name: "ContractIndependence",
  promptTemplate: `You judge whether a seed's contract is a
    genuine world-state assertion or just a restatement...
    SEED PROMPT: {{input}}
    CONTRACT: {{output}}
    Score A (independent) or B (parroting)`,
  choiceScores: { A: 1, B: 0 },
  useCoT: true,
});
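Consuming a scorer is a single await — e.g. `await contractIndependence({ input: seedPrompt, output: contract })` resolves to a Score object. A sketch of how the runner would read it (the Score shape is trimmed to the fields we use, and the sample result is invented, standing in for a real judge-model response):

```typescript
// Minimal view of the autoevals Score object as the runner consumes it.
// Real call (network, omitted here):
//   const result = await contractIndependence({ input: seedPrompt, output: contract });
interface Score {
  name: string;
  score: number | null; // 1 or 0 per choiceScores above; null on judge failure
}

// A result passes when the classifier picked the full-credit choice.
export function isPass(result: Score): boolean {
  return result.score !== null && result.score >= 1;
}

// Invented sample result, as if the judge chose "A":
const sample: Score = { name: "ContractIndependence", score: 1 };
```

Treating `null` as a failure keeps a flaky judge call from silently inflating pass rates.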

Note: autoevals uses A/B/C/D choice classification, not 0-10 numeric scoring. We'd adapt our judges to binary or multi-choice format, which is more reliable in practice — LLMs are better at classification than at numeric scoring.

The 12 judges, adapted

Judge                    Evaluates    Format
Seed Value               prompt       A=valuable / B=obvious to model
Prompt Specificity       prompt       A=specific / B=vague
Contract Independence    conjugation  A=independent / B=parroting
Contract Concreteness    conjugation  A=concrete / B=abstract
Navigability             conjugation  A=testable / B=untestable
Token Density            conjugation  A=dense / B=padded
Slug Compression         conjugation  A=3 axes / B=overlapping
Distinctiveness          conjugation  A=unique / B=confusable
Grammar                  conjugation  A=correct / B=errors
Referential Clarity      conjugation  A=self-contained / B=assumes context
Parent-Child Coherence   conjugation  A=coherent / B=mismatch
Verification Accuracy    conjugation  A=correct calls / B=false reject/pass
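To let the runner iterate the twelve judges and route each one its input, judges.ts can export a thin registry alongside the scorer instances. A sketch under assumed shapes (names and targets come from the table above; the entry type is invented):

```typescript
// Hypothetical registry shape: "target" tells the runner whether to feed
// the judge the seed prompt or the conjugation output.
type JudgeTarget = "prompt" | "conjugation";

interface JudgeEntry {
  name: string;
  target: JudgeTarget;
}

export const judges: JudgeEntry[] = [
  { name: "SeedValue", target: "prompt" },
  { name: "PromptSpecificity", target: "prompt" },
  { name: "ContractIndependence", target: "conjugation" },
  { name: "ContractConcreteness", target: "conjugation" },
  { name: "Navigability", target: "conjugation" },
  { name: "TokenDensity", target: "conjugation" },
  { name: "SlugCompression", target: "conjugation" },
  { name: "Distinctiveness", target: "conjugation" },
  { name: "Grammar", target: "conjugation" },
  { name: "ReferentialClarity", target: "conjugation" },
  { name: "ParentChildCoherence", target: "conjugation" },
  { name: "VerificationAccuracy", target: "conjugation" },
];
```

In the real file each entry would also carry its LLMClassifierFromTemplate instance; the registry keeps runner.ts a plain loop rather than twelve hand-written calls.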

CLI output

$ npm run eval
🌱 seed.show eval — 16 seeds, 12 judges
Model: @cf/openai/gpt-oss-120b | Judge: claude-sonnet-4
gmail.api.ready ✓ slug=gmail.auth.check
gmail.inbox.health ✓ slug=gmail.inbox.health
...
═══════════════════════════════════════
Seed Value: 87% pass
Prompt Specificity: 93% pass
Contract Independence: 25% pass
Contract Concreteness: 93% pass
...
Results → eval/results/2026-04-06T03:18.json
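The per-judge percentages in that scorecard are plain aggregation, which report.ts can do with a small pure function. A sketch with invented result shapes (only the pass-rate math is implied by the output above):

```typescript
// Sketch: roll per-seed judge scores (1 = pass, 0 = fail, from choiceScores)
// up into the "Judge: NN% pass" lines shown in the CLI output.
interface JudgeResult {
  judge: string;
  score: number; // 0 or 1 from the classifier
}

export function passRates(results: JudgeResult[]): Map<string, number> {
  const totals = new Map<string, { pass: number; n: number }>();
  for (const r of results) {
    const t = totals.get(r.judge) ?? { pass: 0, n: 0 };
    t.pass += r.score;
    t.n += 1;
    totals.set(r.judge, t);
  }
  const rates = new Map<string, number>();
  for (const [judge, t] of totals) {
    rates.set(judge, Math.round((t.pass / t.n) * 100));
  }
  return rates;
}

export function scorecardLine(judge: string, rate: number): string {
  return `${judge}: ${rate}% pass`;
}
```

Keeping aggregation pure makes it trivially unit-testable, independent of the judge API.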

Implementation steps

  1. npm install autoevals openai (openai is a peer dep)
  2. Write src/lib/eval/judges.ts — 12 LLMClassifierFromTemplate configs
  3. Write src/lib/eval/runner.ts — fetch seeds from API, conjugate, run judges, aggregate
  4. Write src/lib/eval/report.ts — terminal formatting + JSON serialization
  5. Write scripts/eval.ts — CLI entry point with flags (--slug, --self-judge, --ci, --html)
  6. Add "eval": "npx tsx scripts/eval.ts" to package.json
  7. Create eval/thresholds.json with minimum pass rates per judge
  8. Test: npm run eval should produce the same scorecard we've been getting from Langfuse
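Step 7's CI gate then reduces to comparing aggregated pass rates against eval/thresholds.json. A minimal sketch (the function and its shapes are assumptions; the threshold numbers in the test are invented):

```typescript
// Sketch of the --ci gate: collect every judge whose pass rate (0-100)
// falls below its configured minimum from eval/thresholds.json.
type Thresholds = Record<string, number>;

export function gate(
  rates: Record<string, number>,
  thresholds: Thresholds,
): string[] {
  const failures: string[] = [];
  for (const [judge, min] of Object.entries(thresholds)) {
    const rate = rates[judge] ?? 0; // a judge missing from results counts as 0%
    if (rate < min) failures.push(`${judge}: ${rate}% < ${min}%`);
  }
  return failures; // non-empty → process.exit(1) under --ci
}
```

Returning the failure list (rather than exiting inside the function) lets the CLI print every violation before setting the exit code.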

What we keep from Langfuse

Nothing at runtime. Langfuse stays as a historical reference (all our past experiments live there), but the active development loop moves to npm run eval. If we ever want to push results back to Langfuse for comparison, we can add a --langfuse flag later.

Effort: ~2-3 hours. I write all of it — no design-engineer needed. It's prompt strings + API glue + formatting, all TypeScript, all in my domain.