The new benchmarks
🏁 Humanity's Last Exam & SimpleQA: The new high for AI
AI companies aren't just racing to build smarter models anymore – they're racing to pass the hardest tests humanity can throw at them. Enter: Humanity's Last Exam and SimpleQA. These two new benchmarks are making waves in the AI world, and not just because they're tough – they're redefining what it means to be an "intelligent" model.
Humanity's Last Exam is exactly what it sounds like: 3,000 brutally hard, multimodal questions spanning math, science, philosophy, and more – designed to stump AI. No regurgitating training data, no shortcuts. The latest results are eye-opening: GPT-4o scores just 3.1% accuracy, with Grok-2 barely ahead at 3.9%. Claude 3.5 Sonnet hits 4.8%, GPT-4.5 Preview 6.4%, DeepSeek-R1 8.6%, o1 8.8%, and Claude 3.7 Sonnet 8.9%. Even the leaders – o3-mini at 14% and Gemini 2.5 Pro at 18.8% – can't crack a fifth of the exam. The name is dramatic, sure, but the point is real: if an AI ever aces this thing, we may no longer have tests that separate human from machine.
Then there's SimpleQA – deceptively named. It's a benchmark of short, fact-based questions that expose hallucinations. The latest scores tell a fascinating story: GPT-4.5-preview leads the pack at 62.5%, while OpenAI's o1 and o1-preview both score around 42%. GPT-4o models hover around 39%, Claude 3.5 Sonnet reaches 28.9%, and Claude 3 Opus sits at 23.5%. The smaller models struggle significantly – o3-mini variants barely clear 13%, and gpt-4o-mini falls below 10%. It's a reminder: our smartest AIs still bluff when they don't know, though some are getting much better at admitting their limitations.
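To make the idea concrete, here's a minimal sketch of how a SimpleQA-style benchmark can separate hallucination from honest uncertainty: each answer is graded as correct, incorrect, or not attempted, so a model that admits it doesn't know isn't counted as a bluffer. This is a toy illustration, not OpenAI's actual grading pipeline (the real benchmark uses an LLM grader rather than string matching), and the questions, answers, and function names are made up for the example.

```python
# Toy SimpleQA-style grading sketch (hypothetical names and data).
# Real grading uses an LLM judge; exact string matching stands in here.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    gold_answer: str
    model_answer: str  # empty or "I don't know" means the model declined

def grade(item: Item) -> str:
    """Classify one response as 'correct', 'incorrect', or 'not_attempted'."""
    answer = item.model_answer.strip().lower()
    if not answer or "i don't know" in answer:
        return "not_attempted"  # declining to answer is not a hallucination
    return "correct" if answer == item.gold_answer.strip().lower() else "incorrect"

def summarize(items: list[Item]) -> dict[str, float]:
    """Report the share of answers in each of the three buckets."""
    counts = {"correct": 0, "incorrect": 0, "not_attempted": 0}
    for item in items:
        counts[grade(item)] += 1
    total = len(items)
    return {bucket: n / total for bucket, n in counts.items()}

if __name__ == "__main__":
    toy_run = [
        Item("Capital of Australia?", "Canberra", "Canberra"),
        Item("Year the transistor was invented?", "1947", "1952"),       # confident but wrong
        Item("Smallest prime greater than 100?", "101", "I don't know"), # honest refusal
    ]
    print(summarize(toy_run))
```

The "leads at 62.5%" numbers above are this kind of correct-answer rate; the incorrect bucket is what reveals how often a model bluffs rather than abstains.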
What do these tests mean for the AI race? Three things:
Benchmarks are branding – OpenAI, Google, and Anthropic all want to top the leaderboards because a high score signals capability and trust.
Safety by testing – These aren't just scorecards; they help us catch hallucinations, overconfidence, and reasoning gaps before they cause real-world damage.
The bar keeps rising – Just as models start to "pass" today's exams, researchers invent harder ones. And that's the point.
These benchmarks are more than academic games. They're measuring whether AI is getting smarter, more truthful, and more useful – or just better at faking it.
The race is on. And for once, we actually have a scoreboard worth watching.