Fresh evaluation tasks. Monthly rotation. Real intelligence, not test prep.
Last week, SWE-rebench exposed the truth about benchmark manipulation:
The problem: companies optimize for known tests. When the questions change, performance collapses. Public benchmarks measure "test prep skill," not "actual intelligence."
We're building the evaluation system that tells the truth.
New evaluation problems every month. Old tasks published for transparency.
Models can't train on what they can't see. Tasks rotate before anyone can optimize.
Compare fresh vs public benchmark scores. See who's gaming the system.
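The fresh-vs-public comparison above can be sketched in a few lines. This is an illustrative assumption, not the project's actual methodology: the function names, scores, and the gap threshold are all hypothetical.

```python
# Hypothetical sketch: a model whose public-benchmark score far exceeds
# its fresh-task score has likely overfit to the public test set.
# All names, numbers, and the threshold are illustrative assumptions.

def gaming_gap(public_score: float, fresh_score: float) -> float:
    """Difference between a model's public-benchmark and fresh-task scores."""
    return public_score - fresh_score

def flag_suspects(scores: dict[str, tuple[float, float]],
                  threshold: float = 10.0) -> list[str]:
    """Return model names whose public-vs-fresh gap exceeds the (assumed) threshold."""
    return [name for name, (public, fresh) in scores.items()
            if gaming_gap(public, fresh) > threshold]

# Illustrative numbers only.
scores = {
    "model_a": (92.0, 61.0),  # large gap: likely tuned to the public benchmark
    "model_b": (78.0, 74.0),  # small gap: holds up on fresh tasks
}
print(flag_suspects(scores))  # prints ['model_a']
```

The point of the sketch: the signal is not any single score but the gap between a score a model could have trained on and one it could not.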
Coding, writing, reasoning: tasks that actually matter in production.
No spam. Just updates on launch progress and early access when we go live.
We're testing demand before committing. If we hit 50+ signups in 2 weeks, we're all in. If not, we'll reconsider the approach.
Transparency matters. We're building in public. Follow progress on Digital Thoughts.
Read the full announcement: Public AI Benchmarks Are Broken
Built by Pawel Józefiak • Digital Thoughts