Autoevals Integration Plan

npm install autoevals → 12 custom judges → npm run eval

How autoevals works

Autoevals provides LLMClassifierFromTemplate — you give it a prompt template with {{input}} / {{output}} / {{expected}} slots plus choice scores; it calls any OpenAI-compatible API and returns a 0-1 score. Our CF Workers AI endpoint is OpenAI-compatible. So:

// Point autoevals at CF Workers AI (OpenAI-compatible endpoint)
import OpenAI from "openai";
import { init } from "autoevals";

const client = new OpenAI({
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT}/ai/v1`,
  apiKey: getLiveCfToken(),
});
init({ client }); // all scorers now route through this client

// Or point at Anthropic's OpenAI-compatible layer for independent judging
const anthropicClient = new OpenAI({
  baseURL: "https://api.anthropic.com/v1",
  apiKey: process.env.ANTHROPIC_API_KEY,
});
init({ client: anthropicClient });

File structure

seeds-link/
├── src/lib/eval/
│   ├── judges.ts          ← 12 judge prompts as LLMClassifierFromTemplate configs
│   ├── runner.ts          ← fetch seeds → conjugate → judge → aggregate
│   └── report.ts          ← terminal scorecard + JSON output
├── eval/
│   ├── thresholds.json    ← minimum scores per judge (CI gate)
│   └── results/           ← timestamped JSON results (gitignored)
├── scripts/
│   └── eval.ts            ← CLI entry point
└── package.json           ← "eval": "npx tsx scripts/eval.ts"

Each judge as an autoevals scorer

Our 12 judges become LLMClassifierFromTemplate instances. Example for Contract Independence:

import { LLMClassifierFromTemplate } from "autoevals";
export const contractIndependence = LLMClassifierFromTemplate({
  name: "ContractIndependence",
  promptTemplate: `You judge whether a seed's contract is a
    genuine world-state assertion or just a restatement...
    SEED PROMPT: {{input}}
    CONTRACT: {{output}}
    Score A (independent) or B (parroting)`,
  choiceScores: { A: 1, B: 0 },
  useCoT: true,
});
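Consuming a scorer is a single await — e.g. `await contractIndependence({ input: seedPrompt, output: contract })` resolves to a Score object. A sketch of how the runner would read it (the Score shape is trimmed to the fields we use, and the sample result is invented, standing in for a real judge-model response):

```typescript
// Minimal view of the autoevals Score object as the runner consumes it.
// Real call (network, omitted here):
//   const result = await contractIndependence({ input: seedPrompt, output: contract });
interface Score {
  name: string;
  score: number | null; // 1 or 0 per choiceScores above; null on judge failure
}

// A result passes when the classifier picked the full-credit choice.
export function isPass(result: Score): boolean {
  return result.score !== null && result.score >= 1;
}

// Invented sample result, as if the judge chose "A":
const sample: Score = { name: "ContractIndependence", score: 1 };
```

Treating `null` as a failure keeps a flaky judge call from silently inflating pass rates.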

Note: autoevals uses A/B/C/D choice classification, not 0-10 numeric scoring. We'd adapt our judges to binary or multi-choice format, which is more reliable in practice — LLMs are better at classification than at numeric scoring.

The 12 judges, adapted

Judge                    Evaluates    Format
Seed Value               prompt       A=valuable / B=obvious to model
Prompt Specificity       prompt       A=specific / B=vague
Contract Independence    conjugation  A=independent / B=parroting
Contract Concreteness    conjugation  A=concrete / B=abstract
Navigability             conjugation  A=testable / B=untestable
Token Density            conjugation  A=dense / B=padded
Slug Compression         conjugation  A=3 axes / B=overlapping
Distinctiveness          conjugation  A=unique / B=confusable
Grammar                  conjugation  A=correct / B=errors
Referential Clarity      conjugation  A=self-contained / B=assumes context
Parent-Child Coherence   conjugation  A=coherent / B=mismatch
Verification Accuracy    conjugation  A=correct calls / B=false reject/pass
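To let the runner iterate the twelve judges and route each one its input, judges.ts can export a thin registry alongside the scorer instances. A sketch under assumed shapes (names and targets come from the table above; the entry type is invented):

```typescript
// Hypothetical registry shape: "target" tells the runner whether to feed
// the judge the seed prompt or the conjugation output.
type JudgeTarget = "prompt" | "conjugation";

interface JudgeEntry {
  name: string;
  target: JudgeTarget;
}

export const judges: JudgeEntry[] = [
  { name: "SeedValue", target: "prompt" },
  { name: "PromptSpecificity", target: "prompt" },
  { name: "ContractIndependence", target: "conjugation" },
  { name: "ContractConcreteness", target: "conjugation" },
  { name: "Navigability", target: "conjugation" },
  { name: "TokenDensity", target: "conjugation" },
  { name: "SlugCompression", target: "conjugation" },
  { name: "Distinctiveness", target: "conjugation" },
  { name: "Grammar", target: "conjugation" },
  { name: "ReferentialClarity", target: "conjugation" },
  { name: "ParentChildCoherence", target: "conjugation" },
  { name: "VerificationAccuracy", target: "conjugation" },
];
```

In the real file each entry would also carry its LLMClassifierFromTemplate instance; the registry keeps runner.ts a plain loop rather than twelve hand-written calls.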

CLI output

$ npm run eval
🌱 seed.show eval — 16 seeds, 12 judges
Model: @cf/openai/gpt-oss-120b | Judge: claude-sonnet-4
gmail.api.ready ✓ slug=gmail.auth.check
gmail.inbox.health ✓ slug=gmail.inbox.health
...
═══════════════════════════════════════
Seed Value: 87% pass
Prompt Specificity: 93% pass
Contract Independence: 25% pass
Contract Concreteness: 93% pass
...
Results → eval/results/2026-04-06T03:18.json
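The per-judge percentages in that scorecard are plain aggregation, which report.ts can do with a small pure function. A sketch with invented result shapes (only the pass-rate math is implied by the output above):

```typescript
// Sketch: roll per-seed judge scores (1 = pass, 0 = fail, from choiceScores)
// up into the "Judge: NN% pass" lines shown in the CLI output.
interface JudgeResult {
  judge: string;
  score: number; // 0 or 1 from the classifier
}

export function passRates(results: JudgeResult[]): Map<string, number> {
  const totals = new Map<string, { pass: number; n: number }>();
  for (const r of results) {
    const t = totals.get(r.judge) ?? { pass: 0, n: 0 };
    t.pass += r.score;
    t.n += 1;
    totals.set(r.judge, t);
  }
  const rates = new Map<string, number>();
  for (const [judge, t] of totals) {
    rates.set(judge, Math.round((t.pass / t.n) * 100));
  }
  return rates;
}

export function scorecardLine(judge: string, rate: number): string {
  return `${judge}: ${rate}% pass`;
}
```

Keeping aggregation pure makes it trivially unit-testable, independent of the judge API.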

Implementation steps

  1. npm install autoevals openai (openai is a peer dep)
  2. Write src/lib/eval/judges.ts — 12 LLMClassifierFromTemplate configs
  3. Write src/lib/eval/runner.ts — fetch seeds from API, conjugate, run judges, aggregate
  4. Write src/lib/eval/report.ts — terminal formatting + JSON serialization
  5. Write scripts/eval.ts — CLI entry point with flags (--slug, --self-judge, --ci, --html)
  6. Add "eval": "npx tsx scripts/eval.ts" to package.json
  7. Create eval/thresholds.json with minimum pass rates per judge
  8. Test: npm run eval should produce the same scorecard we've been getting from Langfuse
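Step 7's CI gate then reduces to comparing aggregated pass rates against eval/thresholds.json. A minimal sketch (the function and its shapes are assumptions; the threshold numbers in the test are invented):

```typescript
// Sketch of the --ci gate: collect every judge whose pass rate (0-100)
// falls below its configured minimum from eval/thresholds.json.
type Thresholds = Record<string, number>;

export function gate(
  rates: Record<string, number>,
  thresholds: Thresholds,
): string[] {
  const failures: string[] = [];
  for (const [judge, min] of Object.entries(thresholds)) {
    const rate = rates[judge] ?? 0; // a judge missing from results counts as 0%
    if (rate < min) failures.push(`${judge}: ${rate}% < ${min}%`);
  }
  return failures; // non-empty → process.exit(1) under --ci
}
```

Returning the failure list (rather than exiting inside the function) lets the CLI print every violation before setting the exit code.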

What we keep from Langfuse

Nothing at runtime. Langfuse stays as a historical reference (all our past experiments live there), but the active development loop moves to npm run eval. If we ever want to push results back to Langfuse for comparison, we can add a --langfuse flag later.

Effort: ~2-3 hours. I write all of it — no design-engineer needed. It's prompt strings + API glue + formatting, all TypeScript, all in my domain.