# Introduction
Humanity’s Last Exam (HLE) is a benchmark designed to measure the reasoning and deep-knowledge capabilities of modern AI systems. Its defining trait is difficulty: the evaluation is deliberately pushed to the extreme. Think of it as a present-day evolution of the Turing test, which was proposed back in 1950.
This article takes a gentle dive into this benchmark, outlining why it was created, gathering diverse opinions about it from experts in the field, and wrapping up with a summary of the most widely accepted verdict.
# Why Was It Built, and What Does It Consist Of?
Earlier benchmarks became obsolete as AI systems evolved and began to score near-perfectly on them with little effort. For this reason, the Center for AI Safety and Scale AI created a new benchmark, HLE, with the help of subject-matter experts from around the world. The benchmark was published in Nature, one of the most prestigious scientific journals, in January 2026. It was carefully designed to avoid the repeating patterns that weakened previous evaluation frameworks.
So, what is HLE about? It is an exam to be taken by state-of-the-art AI systems like language models, and it consists of over 2,500 expert-level questions spanning more than a hundred academic disciplines, including physics, math, biology, and the humanities. Importantly, the questions cannot be answered through memorization, nor are they limited to simple information retrieval or multiple-choice answering. Instead, they demand complex deductive reasoning and deep understanding.
Here is an example of two such questions:
Two example HLE questions. Image source: Center for AI Safety
So, how do the most advanced models fare to date? Even sophisticated frontier models like GPT, Gemini, or Claude barely surpass 45-50% accuracy overall, a figure that speaks for itself about how difficult the exam is. Moreover, these models are often overconfident when they answer incorrectly, stating wrong answers with high certainty.
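The two quantities mentioned above, accuracy and overconfidence, can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions, not the benchmark's official scoring code: the `GradedAnswer` record, the function names, and the 10-bin scheme are all hypothetical, and it assumes each graded answer comes with a confidence the model reported for itself.

```python
import math
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One graded model answer (hypothetical record layout)."""
    correct: bool      # did the answer match the reference answer?
    confidence: float  # confidence stated by the model, in [0, 1]


def accuracy(answers):
    """Fraction of answers graded as correct."""
    if not answers:
        raise ValueError("need at least one graded answer")
    return sum(a.correct for a in answers) / len(answers)


def rms_calibration_error(answers, n_bins=10):
    """Root-mean-square gap between stated confidence and observed
    accuracy, computed over equal-width confidence bins and weighted
    by the number of answers in each bin."""
    if not answers:
        raise ValueError("need at least one graded answer")
    bins = [[] for _ in range(n_bins)]
    for a in answers:
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(a.confidence * n_bins), n_bins - 1)
        bins[index].append(a)

    weighted_squared_gap = 0.0
    for group in bins:
        if not group:
            continue
        mean_confidence = sum(a.confidence for a in group) / len(group)
        observed_accuracy = sum(a.correct for a in group) / len(group)
        weight = len(group) / len(answers)
        weighted_squared_gap += weight * (mean_confidence - observed_accuracy) ** 2
    return math.sqrt(weighted_squared_gap)


# Illustrative data only: a model that is very sure of itself but mostly wrong.
sample = [
    GradedAnswer(correct=True, confidence=0.9),
    GradedAnswer(correct=False, confidence=0.9),
    GradedAnswer(correct=False, confidence=0.9),
    GradedAnswer(correct=False, confidence=0.9),
]
print("accuracy:", accuracy(sample))
print("calibration error:", rms_calibration_error(sample))
```

A well-calibrated model that is right about a quarter of the time would state confidences near 25%, which keeps the calibration error close to zero. A model that claims 90% confidence at that same accuracy produces a large gap, which is the overconfident behavior described above.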
# What Is the Dominant Expert Opinion About HLE?
The honest answer is that there is little consensus. Opinion is divided across the tech, developer, and academic communities, with a slight but predominant leaning toward accepting that HLE has some real utility. There are critical nuances, though.
In general, experts and members of the wider public who are acquainted with HLE do not consider it a meaningless initiative, but they do object to its name, which they see as exaggerated and marketing-driven.
Broadly, there are three dominant opinion groups regarding HLE:
## 1. HLE Is Truly Useful and Necessary
About 60% of opinions fall into this group, which holds that there is a technical reason why HLE is essential at present: previous benchmarks for AI systems, including not-so-old language model benchmarks like Massive Multitask Language Understanding (MMLU), became saturated or obsolete, with nearly every modern AI system scoring over 90% on them. This made it impossible to meaningfully compare the latest models against each other and determine which one is best. One salient reason why many experts praise HLE is that it measures whether a model is willing to say “I don’t know” instead of hallucinating when faced with complex problems or questions it cannot address.
## 2. HLE Is a Distraction From Real AI
About 30% of opinions adopt this skeptical viewpoint. These experts argue that the test, being based purely on overly academic and obscure knowledge, does not evaluate how AI performs in everyday scenarios. Some engineers even venture to say, rather ironically, that as soon as AI systems start scoring over 90% on HLE en masse, enterprises will rush to create HLE 2, and so on, thus establishing a marketing hamster wheel that favors large corporations.
## 3. HLE Is Flawed
This is the smallest of the three dominant opinion groups, and it is mostly voiced in places like data science forums. Its proponents claim that some HLE answers labeled as correct contain errors, particularly in niche questions from areas like chemistry and advanced mathematics. Rather poetically, it was the most powerful AI systems themselves that started to detect such errors in the benchmark.
# Wrapping Up
To summarize, HLE’s usefulness is not denied, and many experts even underscore its significance, although its name is widely considered sheer marketing drama. The benchmark seems unlikely to signal the birth of a super AI or the true emergence of artificial general intelligence (AGI), a concept that has been discussed for many years but still belongs more to fiction than to reality. Nonetheless, HLE is seen as a very ambitious tool for discerning which company has the best model in terms of knowledge and logical reasoning capabilities.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
