Fresh evaluation tasks. Monthly rotation. Real intelligence, not test prep.
This week, Berkeley researchers showed that AI benchmarks can be beaten 100% of the time without solving a single task. The gap between public and fresh benchmarks is real.
The problem: companies optimize for known tests. When the questions change, performance collapses. Public benchmarks measure "test-prep skill," not "actual intelligence."
We're building the evaluation system that tells the truth.
New evaluation problems every month. Old tasks published for transparency.
Models can't train on what they can't see. Tasks rotate out before anyone can optimize for them.
Compare fresh vs public benchmark scores. See who's gaming the system (sketch below).
Coding, writing, reasoning – tasks that actually matter in production.
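To make the fresh-vs-public comparison concrete, here's a minimal sketch, assuming each model has one aggregate score per benchmark on a 0–100 scale. The function names and the 10-point threshold are illustrative assumptions, not our production tooling.

```python
# Hedged sketch: flag a possible "contamination gap" by comparing a model's
# public-benchmark score against its score on the fresh, unseen task set.
# Threshold and example numbers are illustrative only.

def contamination_gap(public_score: float, fresh_score: float) -> float:
    """Difference between public and fresh benchmark scores (0-100 scale)."""
    return public_score - fresh_score

def flag_gaming(models: dict[str, tuple[float, float]],
                threshold: float = 10.0) -> list[str]:
    """Return models whose public score beats their fresh score by > threshold points."""
    return [
        name
        for name, (public, fresh) in models.items()
        if contamination_gap(public, fresh) > threshold
    ]

if __name__ == "__main__":
    scores = {
        "model_a": (92.0, 71.0),  # large gap -> likely optimized for the public test
        "model_b": (68.0, 65.0),  # small gap -> holds up on fresh tasks
    }
    print(flag_gaming(scores))    # ['model_a']
```

The exact threshold isn't the point; the point is that the same model gets two numbers, and a large gap between them is the tell.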
No spam. Just updates on launch progress and early access when we go live.
We're testing demand before committing. If we hit 50+ signups this week, we're all in. If not, we'll reconsider the approach.
Transparency matters. We're building in public. Follow progress on Digital Thoughts.
Read the full announcement: The Benchmark Contamination Crisis
Built by Pawel Józefiak • Digital Thoughts