Base 16 v3 — 12-Judge Scorecard

16/16 seeds passed. Model: @cf/openai/gpt-oss-120b. Judges: claude-sonnet-4.

🌱 Prompt Quality (NEW judges)

Seed Value: 8.1 (NEW)
Prompt Specificity: 8.2 (NEW)
Contract Independence: 2.4 (was 2.5)
Token Density: 4.1 (was 4.1)
Parent-Child Coherence: 3.2 (was 5.0)
Grammar: 5.2 (was 5.9)
Distinctiveness: 5.8 (was 4.6)
Referential Clarity: 7.7 (was 6.3)
Navigability: 8.3 (was 7.4)
Slug Compression: 8.4 (was 8.3)
Contract Concreteness: 9.3 (was 8.8)

Verification ✓ 0 false rejections and 0 false passes across 64 checks
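
For reference, a minimal sketch of how a tally like the verification line could be computed. It assumes a false rejection means the verifier rejected an output that should have passed, and a false pass means it accepted one that should have failed. The record shape (expected_ok, verifier_ok) and the toy data are hypothetical, not taken from the actual run.

```python
from collections import Counter


def tally(checks):
    """Count verifier mistakes over a list of (expected_ok, verifier_ok) pairs.

    The pair layout is an assumption made for this sketch.
    """
    counts = Counter()
    for expected_ok, verifier_ok in checks:
        if expected_ok and not verifier_ok:
            counts["false_rejections"] += 1  # good output, verifier said no
        elif verifier_ok and not expected_ok:
            counts["false_passes"] += 1  # bad output, verifier said yes
    counts["total"] = len(checks)
    return counts


if __name__ == "__main__":
    # Toy data for illustration only (not the 64 real checks).
    toy = [(True, True), (False, False), (True, False), (False, True)]
    result = tally(toy)
    print(
        f"{result['false_rejections']} false rejections, "
        f"{result['false_passes']} false passes / {result['total']} checks"
    )
```

A clean run like the one reported here would show both counters at zero.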

Key takeaways:

Seed Value 8.1 and Prompt Specificity 8.2: the prompts themselves are worth encoding. A model would be unlikely to produce contracts of this quality zero-shot.
Contract Independence 2.4 (was 2.5): still the #1 problem. The model still parrots prompts into contracts. This is a model behavior issue, not a prompt quality issue.
Parent-Child Coherence 3.2: dropped from 5.0. Many parent contracts are empty because the new seeds haven't been conjugated yet, so the judge finds no coherence to score.
Contract Concreteness 9.3 (was 8.8), Navigability 8.3 (was 7.4), Referential Clarity 7.7 (was 6.3): all improved. Richer prompts yield richer conjugations.
Next bottleneck: Contract Independence. The conjugation system prompt needs even stronger anti-parroting instructions, or we need a different model for conjugation.
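
The takeaways above come down to two things: the change per judge and which judges score lowest. A small sketch of that arithmetic, using the scores from this scorecard; the names SCORES, deltas, and lowest are made up for illustration and are not part of the scoring pipeline.

```python
# (previous, current) per judge, copied from the scorecard; None marks a new judge.
SCORES = {
    "Seed Value": (None, 8.1),
    "Prompt Specificity": (None, 8.2),
    "Contract Independence": (2.5, 2.4),
    "Token Density": (4.1, 4.1),
    "Parent-Child Coherence": (5.0, 3.2),
    "Grammar": (5.9, 5.2),
    "Distinctiveness": (4.6, 5.8),
    "Referential Clarity": (6.3, 7.7),
    "Navigability": (7.4, 8.3),
    "Slug Compression": (8.3, 8.4),
    "Contract Concreteness": (8.8, 9.3),
}


def deltas(scores):
    """Return {judge: current - previous}, skipping judges with no previous score."""
    return {
        name: round(cur - prev, 1)
        for name, (prev, cur) in scores.items()
        if prev is not None
    }


def lowest(scores, n=3):
    """Return the n judges with the lowest current score, worst first."""
    return sorted(scores, key=lambda name: scores[name][1])[:n]


if __name__ == "__main__":
    # Biggest regressions first.
    for name, d in sorted(deltas(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:<24} {d:+.1f}")
    print("Lowest:", ", ".join(lowest(SCORES)))
```

On these numbers the largest drop is Parent-Child Coherence (-1.8) and the lowest current score is Contract Independence (2.4), which matches the bottleneck called out above.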