
    Flawed AI benchmarks put enterprise budgets at risk

    By gvfx00@gmail.com · November 4, 2025 · 7 min read


    A new academic review suggests AI benchmarks are flawed, potentially leading enterprises to make high-stakes decisions based on “misleading” data.

    Enterprise leaders are committing budgets of eight or nine figures to generative AI programmes. These procurement and development decisions often rely on public leaderboards and benchmarks to compare model capabilities.

    A large-scale study, ‘Measuring what Matters: Construct Validity in Large Language Model Benchmarks,’ analysed 445 separate LLM benchmarks from leading AI conferences. A team of 29 expert reviewers found that “almost all articles have weaknesses in at least one area,” undermining the claims they make about model performance.

    For CTOs and Chief Data Officers, this finding strikes at the heart of AI governance and investment strategy. If a benchmark claiming to measure ‘safety’ or ‘robustness’ doesn’t actually capture those qualities, an organisation could deploy a model that exposes it to serious financial and reputational risk.


    The ‘construct validity’ problem

    The researchers focused on a core scientific principle known as construct validity. In simple terms, this is the degree to which a test measures the abstract concept it claims to be measuring.

    For example, while ‘intelligence’ cannot be measured directly, tests are created to serve as measurable proxies. The paper notes that if a benchmark has low construct validity, “then a high score may be irrelevant or even misleading”.

    This problem is widespread in AI evaluation. The study found that key concepts are often “poorly defined or operationalised”. This can lead to “poorly supported scientific claims, misdirected research, and policy implications that are not grounded in robust evidence”.

    When vendors compete for enterprise contracts by highlighting their top scores on benchmarks, leaders are effectively trusting that these scores are a reliable proxy for real-world business performance. This new research suggests that trust may be misplaced.

    Where the enterprise AI benchmarks are failing

    The review identified systemic failings across the board, from how benchmarks are designed to how their results are reported.

    Vague or contested definitions: You cannot measure what you cannot define. The study found that even when definitions for a phenomenon were provided, 47.8 percent were “contested,” addressing concepts with “many possible definitions or no clear definition at all”.

    The paper uses ‘harmlessness’ – a key goal in enterprise safety alignment – as an example of a phenomenon that often lacks a clear, agreed-upon definition. If two vendors score differently on a ‘harmlessness’ benchmark, it may only reflect two different, arbitrary definitions of the term, not a genuine difference in model safety.
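To see how definition choice alone can move a score, consider two toy ‘harmlessness’ judges applied to the same model outputs. Both definitions below are deliberately simplistic and invented for illustration; neither comes from the study:

```python
# Two illustrative, equally arbitrary 'harmlessness' definitions.
BLOCKLIST = {"explosive", "poison"}

def judge_keyword(response: str) -> bool:
    """Definition A: harmless iff no blocklisted word appears."""
    return not any(w in response.lower() for w in BLOCKLIST)

def judge_refusal(prompt_is_unsafe: bool, response: str) -> bool:
    """Definition B: harmless iff unsafe prompts are refused."""
    refused = response.lower().startswith(("i can't", "i cannot"))
    return refused or not prompt_is_unsafe

cases = [
    # (prompt_is_unsafe, model response)
    (True, "I cannot help with making an explosive device."),
    (False, "Poison dart frogs get toxins from their diet."),
]
score_a = sum(judge_keyword(r) for _, r in cases) / len(cases)
score_b = sum(judge_refusal(u, r) for u, r in cases) / len(cases)
print(score_a, score_b)  # same outputs score 0% under A, 100% under B
```

The same two responses are rated 0 percent harmless under the keyword definition and 100 percent under the refusal definition, which is exactly the kind of gap that can masquerade as a real safety difference between vendors.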

    Lack of statistical rigour: Perhaps most alarming for data-driven organisations, the review found that only 16 percent of the 445 benchmarks used uncertainty estimates or statistical tests to compare model results.

    Without statistical analysis, it’s impossible to know if a 2 percent lead for Model A over Model B is a genuine capability difference or simple random chance. Enterprise decisions are being guided by numbers that would not pass a basic scientific or business intelligence review.
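The check the study calls for is not heavy: a paired bootstrap over per-item results gives a confidence interval on the gap. A minimal sketch on simulated per-item correctness data (all numbers here are illustrative, not from the study):

```python
import random

def bootstrap_gap_ci(scores_a, scores_b, iters=5000, seed=0):
    """Paired bootstrap: resample benchmark items with replacement and
    return a 95% confidence interval for the accuracy gap A - B."""
    rng = random.Random(seed)
    n = len(scores_a)
    gaps = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    gaps.sort()
    return gaps[int(0.025 * iters)], gaps[int(0.975 * iters)]

# Simulated per-item correctness (1/0) for two models on 200 items,
# drawn from underlying accuracies of roughly 72% and 70%.
random.seed(1)
a = [1 if random.random() < 0.72 else 0 for _ in range(200)]
b = [1 if random.random() < 0.70 else 0 for _ in range(200)]
lo, hi = bootstrap_gap_ci(a, b)
print(f"95% CI for accuracy gap: [{lo:+.3f}, {hi:+.3f}]")
```

With a benchmark of this size, the interval on a 2-point gap is typically wide enough to include zero, meaning the leaderboard ordering could flip on a resample.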

    Data contamination and memorisation: Many benchmarks, especially those for reasoning (like the widely used GSM8K), are undermined when their questions and answers appear in the model’s pre-training data.

    When this happens, the model isn’t reasoning to find the answer; it’s simply memorising it. A high score may indicate a good memory, not the advanced reasoning capability an enterprise actually needs for a complex task. The paper warns this “undermine[s] the validity of the results” and recommends building contamination checks directly into the benchmark.
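The paper doesn’t prescribe a single method for such checks, but one common lightweight approach is token n-gram overlap between benchmark items and a sample of the pre-training corpus. A rough sketch, with the threshold and example texts invented for illustration:

```python
def ngrams(text, n=8):
    """Return the set of n-token shingles in a lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(benchmark_items, corpus_docs, n=8, threshold=0.5):
    """Flag benchmark items whose n-grams overlap heavily with any
    document in a (sample of the) pre-training corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for item in benchmark_items:
        grams = ngrams(item, n)
        if grams and len(grams & corpus_grams) / len(grams) >= threshold:
            flagged.append(item)
    return flagged

corpus = ["natalia sold clips to 48 of her friends in april and then half as many in may"]
items = [
    "Natalia sold clips to 48 of her friends in April and then half as many in May. How many?",
    "A train travels 60 km in one hour. How far in three hours?",
]
flagged = contaminated(items, corpus)
print(flagged)  # only the first, near-verbatim item is flagged
```

Production contamination checks usually hash the shingles and stream over the corpus rather than holding it in memory, but the overlap logic is the same.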

    Unrepresentative datasets: The study found that 27 percent of benchmarks used “convenience sampling,” such as reusing data from existing benchmarks or human exams. This data is often not representative of the real-world phenomenon.

    For example, the authors note that reusing questions from a “calculator-free exam” means the problems use numbers chosen to be easy for basic arithmetic. A model might score well on this test, but this score “would not predict performance on larger numbers, where LLMs struggle”. This creates a critical blind spot, hiding a known model weakness.
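One way to surface that blind spot is to generate test items stratified by operand magnitude and report accuracy per bucket rather than as a single number. An illustrative sketch (bucket boundaries and counts are arbitrary choices, not from the paper):

```python
import random

def make_arithmetic_suite(per_bucket=20, seed=0):
    """Generate multiplication probes stratified by operand size, so a
    model's accuracy can be broken down by difficulty bucket."""
    rng = random.Random(seed)
    buckets = {
        "2-digit": (10, 99),
        "4-digit": (1000, 9999),
        "8-digit": (10**7, 10**8 - 1),
    }
    suite = {}
    for name, (lo, hi) in buckets.items():
        items = []
        for _ in range(per_bucket):
            a, b = rng.randint(lo, hi), rng.randint(lo, hi)
            items.append({"prompt": f"What is {a} * {b}?", "answer": a * b})
        suite[name] = items
    return suite

suite = make_arithmetic_suite()
for bucket, items in suite.items():
    print(bucket, items[0]["prompt"])
```

Scoring each bucket separately makes the expected drop-off on larger operands visible instead of letting easy 2-digit items mask it in an aggregate score.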

    From public metrics to internal validation

    For enterprise leaders, the study serves as a strong warning: public AI benchmarks are not a substitute for internal and domain-specific evaluation. A high score on a public leaderboard is not a guarantee of fitness for a specific business purpose.

    Isabella Grandi, Director for Data Strategy & Governance at NTT DATA UK&I, commented: “A single benchmark might not be the right way to capture the complexity of AI systems, and expecting it to do so risks reducing progress to a numbers game rather than a measure of real-world responsibility. What matters most is consistent evaluation against clear principles that ensure technology serves people as well as progress.

    “Good methodology – as laid out by ISO/IEC 42001:2023 – reflects this balance through five core principles: accountability, fairness, transparency, security and redress. Accountability establishes ownership and responsibility for any AI system that is deployed. Transparency and fairness guide decisions toward outcomes that are ethical and explainable. Security and privacy are non-negotiable, preventing misuse and reinforcing public trust. Redress and contestability provide a vital mechanism for oversight, ensuring people can challenge and correct outcomes when necessary.

    “Real progress in AI depends on collaboration that brings together the vision of government, the curiosity of academia and the practical drive of industry. When partnerships are underpinned by open dialogue and shared standards take hold, it builds the transparency needed for people to instil trust in AI systems. Responsible innovation will always rely on cooperation that strengthens oversight while keeping ambition alive.”

    The paper’s eight recommendations provide a practical checklist for any enterprise looking to build its own internal AI benchmarks and evaluations, in line with this principles-based approach. Among them:

    • Define your phenomenon: Before testing models, organisations must first create a “precise and operational definition for the phenomenon being measured”. What does a ‘helpful’ response mean in the context of your customer service? What does ‘accurate’ mean for your financial reports?
    • Build a representative dataset: The most valuable benchmark is one built from your own data. The paper urges developers to “construct a representative dataset for the task”. This means using task items that reflect the real-world scenarios, formats, and challenges your employees and customers face.
    • Conduct error analysis: Go beyond the final score. The report recommends teams “conduct a qualitative and quantitative analysis of common failure modes”. Analysing why a model fails is more instructive than just knowing its score. If its failures are all on low-priority, obscure topics, it may be acceptable; if it fails on your most common and high-value use cases, that single score becomes irrelevant.
    • Justify validity: Finally, teams must “justify the relevance of the benchmark for the phenomenon with real-world applications”. Every evaluation should come with a clear rationale explaining why this specific test is a valid proxy for business value.
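Taken together, the checklist above can be sketched as a small internal harness: each test item encodes an operational definition of ‘correct’ and is tagged with the business scenario it represents, and failures are tallied by scenario rather than hidden in one aggregate score. Everything below, including the item contents and the stub standing in for a real model call, is hypothetical:

```python
from collections import Counter

# Hypothetical internal eval items: each carries the business scenario
# it represents and a checker encoding our definition of 'correct'.
ITEMS = [
    {"scenario": "refund_policy", "prompt": "Can I return an opened item?",
     "check": lambda out: "30 days" in out},
    {"scenario": "refund_policy", "prompt": "How long do refunds take?",
     "check": lambda out: "5-7 business days" in out},
    {"scenario": "edge_case", "prompt": "Return without a receipt?",
     "check": lambda out: "store credit" in out},
]

def evaluate(model, items):
    """Score the model and tally failure modes by scenario,
    so error analysis goes beyond the headline number."""
    failures = Counter()
    passed = 0
    for item in items:
        if item["check"](model(item["prompt"])):
            passed += 1
        else:
            failures[item["scenario"]] += 1
    return passed / len(items), failures

def stub_model(prompt):
    # Stand-in for a real LLM call.
    return "Returns accepted within 30 days; refunds take 5-7 business days."

score, failures = evaluate(stub_model, ITEMS)
print(f"score={score:.2f}, failure modes={dict(failures)}")
```

The per-scenario tally is the point: a 0.67 score that fails only on a rare edge case means something very different from one that fails on the highest-volume scenario.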

    The race to deploy generative AI is pushing organisations to move faster than their governance frameworks can keep up. This report shows that the very tools used to measure progress are often flawed. The only reliable path forward is to stop trusting generic AI benchmarks and start “measuring what matters” for your own enterprise.

