A new exam has been designed to challenge the most advanced artificial intelligence systems – and early results suggest the machines have a long way to go.
Developed by the Center for AI Safety and Scale AI, Humanity’s Last Exam (HLE) consists of thousands of questions covering subjects from medicine and mathematics to classics and computer science. It has been described as the most difficult test yet created for large language models (LLMs) such as ChatGPT.
Why the test was created
Until recently, the main benchmark used to measure AI ability was MMLU (Massive Multitask Language Understanding), a set of multiple-choice questions across many topics. But modern LLMs now score above 90% on MMLU, leading researchers to warn that the benchmark has reached “saturation” – it can no longer tell the strongest models apart.
HLE was conceived by researcher Dan Hendrycks after a conversation with Elon Musk, who argued that current exams were too easy. To make things tougher, experts around the world submitted obscure and highly specialised questions. Submissions were then screened by both machines and humans before being included.
To encourage quality contributions, prize money of $500,000 was awarded to the authors of the best questions.
How the exam works
The exam currently includes 2,500 public questions, with more kept private to prevent AI models from “cheating” by training directly on them. Subjects are weighted towards mathematics, biology and physics, but also include humanities and engineering.
Unlike earlier benchmarks, most of the HLE questions are not multiple choice. Instead, they demand precise short answers, sometimes requiring models to interpret images as well as text.
LLMs are scored not only on accuracy but also on how closely their stated confidence matches their actual performance – a feature that could help researchers study when AI models are over-confident or prone to generating false information.
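The idea of scoring confidence alongside accuracy can be illustrated with a standard calibration measure. The sketch below is not the official HLE scoring code – it is a minimal example, assuming each answer comes with a stated confidence between 0 and 1 and a correct/incorrect flag, of how a “calibration error” can quantify over-confidence.

```python
# Illustrative sketch (not the official HLE methodology): given a model's
# stated confidence for each answer and whether the answer was correct,
# compute overall accuracy and a simple expected calibration error (ECE).

def accuracy(results):
    """results: list of (confidence, correct) pairs, confidence in [0, 1]."""
    return sum(correct for _, correct in results) / len(results)

def calibration_error(results, n_bins=10):
    """Average gap between stated confidence and actual accuracy,
    weighted by how many answers fall into each confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, correct))
    total = len(results)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(x for _, x in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - avg_acc)
    return ece

# A model that claims 90% confidence on every answer but is right only half the time:
results = [(0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0)]
print(accuracy(results))           # 0.5
print(calibration_error(results))  # ≈ 0.4 – a large gap: the model is over-confident
```

A well-calibrated model would show a small gap: when it claims 90% confidence, it should be right about 90% of the time.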
How AI is performing
So far, the results show that even leading systems find the exam extremely challenging. According to figures published in August 2025, OpenAI’s GPT-5 achieved the highest score at just over 25% accuracy. Google’s Gemini 2.5 Pro followed at 21%, while Anthropic’s Claude Opus scored 11%.
Other well-known models, including Meta’s Llama 4 and Amazon’s Nova Pro, struggled to get above 5%. By contrast, a human expert would be expected to score far higher.
Researchers believe that with rapid advances in AI, models could reach 50% accuracy before the end of 2025.
What happens if AI “passes”?
If a model eventually scores close to 100% on HLE, it would be a striking achievement. But experts stress that this would not prove the system has achieved human-like general intelligence. The exam focuses on structured academic problems rather than creativity, open-ended reasoning or real-world adaptability.
Nonetheless, high-scoring models would be immensely powerful research tools, capable of assisting scientists in solving complex problems across many fields.
A modern Turing test?
The exam recalls the famous Turing Test, proposed by Alan Turing in 1950 to determine whether a machine could convincingly imitate human conversation. While today’s chatbots are widely considered able to pass that test, HLE sets a new bar for measuring machine intelligence.
As Hendrycks and colleagues point out, the race between benchmarks and breakthroughs is likely to continue. For now, Humanity’s Last Exam provides a rare opportunity: a test that still leaves humans clearly ahead – but perhaps not for long.