The AI Benchmark That Can't Be Gamed

Fresh evaluation tasks. Monthly rotation. Real intelligence, not test prep.

🚨 Public AI Benchmarks Are Broken

Last week, SWE-rebench exposed the truth about benchmark manipulation:

MiniMax on public benchmark: 80%
MiniMax on fresh problems: 39%

The problem: companies optimize for known tests. When the questions change, performance collapses. Public benchmarks measure "test prep skill," not "actual intelligence."

✨ Introducing: Decontaminated AI Evaluations

We're building the evaluation system that tells the truth.

🔄 Fresh Tasks Monthly

New evaluation problems every month. Retired tasks are published for transparency.

🔒 Never Published

Models can't train on what they can't see. Active tasks stay private and rotate out before anyone can optimize for them.

📊 Overfitting Detection

Compare fresh vs public benchmark scores. See who's gaming the system.
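
One way to picture that comparison is a rough sketch like the one below. It is not our scoring method. It assumes a "gap" is simply the public score minus the fresh score, in percentage points. The names ScorePair, overfitting_gap, relative_drop and flag_overfit, and the 15-point threshold, are hypothetical and used only for illustration. The only real figures are the 80% and 39% quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorePair:
    """Scores in percent for one model (illustrative structure, not a real API)."""
    model: str
    public_score: float  # score on a public, possibly contaminated benchmark
    fresh_score: float   # score on fresh, unpublished tasks


def overfitting_gap(pair: ScorePair) -> float:
    """Absolute drop, in percentage points, from public to fresh tasks."""
    return pair.public_score - pair.fresh_score


def relative_drop(pair: ScorePair) -> float:
    """Fraction of the public score lost on fresh tasks (0.0 if public score is 0)."""
    if pair.public_score == 0:
        return 0.0
    return overfitting_gap(pair) / pair.public_score


def flag_overfit(pairs, threshold_points=15.0):
    """Names of models whose gap exceeds an assumed threshold, largest gap first."""
    flagged = [p for p in pairs if overfitting_gap(p) > threshold_points]
    return [p.model for p in sorted(flagged, key=overfitting_gap, reverse=True)]


# Figures quoted above: 80% on the public benchmark, 39% on fresh problems.
minimax = ScorePair("MiniMax", public_score=80.0, fresh_score=39.0)
print(overfitting_gap(minimax))          # 41.0
print(round(relative_drop(minimax), 2))  # 0.51
```

A large gap is a signal, not proof: fresh tasks can also just be harder, so a real report would want difficulty-matched task sets before calling anything "gaming."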

🌍 Real-World Problems

Coding, writing, reasoning: tasks that actually matter in production.

Get Early Access


No spam. Just updates on launch progress and early access when we go live.

⏱️ Validation Phase: Mar 9-21, 2026

We're testing demand before committing. If we hit 50+ signups in 2 weeks, we're all in. If not, we'll reconsider the approach.

Transparency matters. We're building in public. Follow progress on Digital Thoughts.

Pricing (When We Launch)

Free Tier

$0
  • Monthly public leaderboard (top 10 models)
  • Access to decontamination reports
  • Blog updates on fresh vs public benchmark results

Pro

$99/month
  • Submit your own model for evaluation
  • Weekly re-testing as your model improves
  • Private leaderboard for internal tracking
  • Overfitting detection scores
  • API access for results

Enterprise

$500/month
  • Everything in Pro, plus:
  • Custom evaluation suites for your domain
  • Private tasks (never published)
  • API access for CI/CD integration
  • White-label reports
  • Direct consultation on model selection

Questions?

Read the full announcement: Public AI Benchmarks Are Broken

Built by Pawel Józefiak • Digital Thoughts