Base 16 v3: 12-Judge Scorecard
16/16 seeds passed. Model: @cf/openai/gpt-oss-120b. Judges: claude-sonnet-4.
Prompt Quality (NEW judges)
Prompt Specificity (NEW): 8.2
Conjugation Quality (vs baseline)
Contract Independence: 2.4 (was 2.5)
Parent-Child Coherence: 3.2 (was 5.0)
Distinctiveness: 5.8 (was 4.6)
Referential Clarity: 7.7 (was 6.3)
Slug Compression: 8.4 (was 8.3)
Contract Concreteness: 9.3 (was 8.8)
Verification
0 false rejections, 0 false passes / 64 checks
Key takeaways:
- Seed Value 8.1 + Prompt Specificity 8.2: the prompts themselves are worth encoding. Models wouldn't do this well zero-shot.
- Contract Independence 2.4: still the #1 problem. The model still parrots prompts into contracts. This is a model behavior issue, not a prompt quality issue.
- Parent-Child Coherence 3.2: dropped from 5.0. Many parent contracts are empty (new seeds haven't been conjugated yet), so the judge sees no coherence.
- Concreteness 9.3, Navigability 8.3, Ref Clarity 7.7: all improved. Richer prompts lead to richer conjugations.
- Next bottleneck: Contract independence. The conjugation system prompt needs even stronger anti-parroting instructions, or we need a different model for conjugation.
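
To make the movement against baseline easier to scan, here is a minimal sketch that computes the per-metric change from the conjugation scores listed in the scorecard. The `scores` dict and the `deltas` helper are illustrative names, not part of the evaluation pipeline; only the six metrics that report a "was" value are included.

```python
# Scores copied from the scorecard: metric -> (baseline, current).
# The dict layout and helper name are assumptions made for illustration.
scores = {
    "Contract Independence": (2.5, 2.4),
    "Parent-Child Coherence": (5.0, 3.2),
    "Distinctiveness": (4.6, 5.8),
    "Referential Clarity": (6.3, 7.7),
    "Slug Compression": (8.3, 8.4),
    "Contract Concreteness": (8.8, 9.3),
}


def deltas(table):
    """Return (metric, current - baseline) pairs, largest regression first."""
    rows = [(name, round(cur - base, 1)) for name, (base, cur) in table.items()]
    return sorted(rows, key=lambda row: row[1])


for name, change in deltas(scores):
    print(f"{name:<24} {change:+.1f}")
```

Sorted this way, Parent-Child Coherence shows the largest drop, while Contract Independence is essentially flat against baseline but remains the lowest absolute score.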