The AI Benchmark That Can't Be Gamed

Fresh evaluation tasks. Monthly rotation. Real intelligence, not test prep.

🚨 Public AI Benchmarks Are Broken

This week, Berkeley researchers proved AI benchmarks can be beaten 100% of the time without solving a single task. The gap between public and fresh benchmarks is real:

80%: MiniMax on public benchmark
39%: MiniMax on fresh problems

The problem: companies optimize for known tests. When the questions change, performance collapses. Public benchmarks measure "test prep skill," not "actual intelligence."

✨ Introducing: Decontaminated AI Evaluations

We're building the evaluation system that tells the truth.

🔄 Fresh Tasks Monthly

New evaluation problems every month. Old tasks published for transparency.

🔒 Never Published

Models can't train on what they can't see. Active tasks stay private and rotate out before anyone can optimize for them.

📊 Overfitting Detection

Compare fresh vs public benchmark scores. See who's gaming the system.
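
As a rough illustration of what such a comparison could look like, here is a minimal Python sketch. It is not our scoring pipeline: the `BenchmarkScores` type and both function names are hypothetical, and the only real numbers are the MiniMax scores quoted above (80% public, 39% fresh).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkScores:
    """Hypothetical container for one model's two scores."""
    model: str
    public_pct: float  # score on the public benchmark, in percent
    fresh_pct: float   # score on fresh, unseen tasks, in percent


def overfitting_gap(scores: BenchmarkScores) -> float:
    """Percentage points lost when moving from public to fresh tasks."""
    return scores.public_pct - scores.fresh_pct


def relative_drop(scores: BenchmarkScores) -> float:
    """Fraction of the public score that disappears on fresh tasks."""
    if scores.public_pct <= 0:
        raise ValueError("public_pct must be positive")
    return overfitting_gap(scores) / scores.public_pct


# The MiniMax figures quoted on this page.
minimax = BenchmarkScores("MiniMax", public_pct=80.0, fresh_pct=39.0)
print(f"{minimax.model}: gap = {overfitting_gap(minimax):.0f} points, "
      f"relative drop = {relative_drop(minimax):.0%}")
# prints "MiniMax: gap = 41 points, relative drop = 51%"
```

A large gap does not prove contamination on its own, since fresh tasks may simply be harder, but it is a simple signal worth tracking month over month.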

🌍 Real-World Problems

Coding, writing, reasoning β€” tasks that actually matter in production.

Get Early Access

... / 50 to launch
Early access open

No spam. Just updates on launch progress and early access when we go live.

⏱️ Validation Phase: Apr 14-18, 2026

We're testing demand before committing. If we hit 50+ signups this week, we're all in. If not, we'll reconsider the approach.

Transparency matters. We're building in public. Follow progress on Digital Thoughts.

Pricing (When We Launch)

Free Tier

$0
  • Monthly public leaderboard (top 10 models)
  • Access to decontamination reports
  • Blog updates on fresh vs public benchmark results

Pro

$99/month
  • Submit your own model for evaluation
  • Weekly re-testing as your model improves
  • Private leaderboard for internal tracking
  • Overfitting detection scores
  • API access for results

Enterprise

$500/month
  • Everything in Pro, plus:
  • Custom evaluation suites for your domain
  • Private tasks (never published)
  • API access for CI/CD integration
  • White-label reports
  • Direct consultation on model selection

Questions?

Read the full announcement: The Benchmark Contamination Crisis

Built by Pawel Józefiak • Digital Thoughts