AI model using AMD GPUs for training hits milestone

November 25, 2025


    Zyphra, AMD, and IBM spent a year testing whether AMD’s GPUs and platform can support large-scale AI model training, and the result is ZAYA1.

    In partnership, the three companies trained ZAYA1 – described as the first major Mixture-of-Experts foundation model built entirely on AMD GPUs and networking – which they see as proof that the market doesn’t have to depend on NVIDIA to scale AI.

    The model was trained on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software, all running across IBM Cloud’s infrastructure. What’s notable is how conventional the setup looks. Instead of experimental hardware or obscure configurations, Zyphra built the system much like any enterprise cluster—just without NVIDIA’s components.

    Zyphra says ZAYA1 performs on par with, and in some areas ahead of, well-established open models in reasoning, maths, and code. For businesses frustrated by supply constraints or spiralling GPU pricing, it amounts to something rare: a second option that doesn’t require compromising on capability.


    How Zyphra used AMD GPUs to cut costs without gutting AI training performance

    Most organisations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput. 

    MI300X’s 192GB of high-bandwidth memory per GPU gives engineers some breathing room, allowing early training runs without immediately resorting to heavy parallelism. That tends to simplify projects that are otherwise fragile and time-consuming to tune.
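That headroom can be made concrete with a back-of-envelope memory budget. The sketch below is illustrative, not Zyphra's actual configuration: it assumes bf16 weights and gradients plus AdamW-style fp32 optimiser state, and ignores activations.

```python
# Rough memory budget for a single MI300X (192 GB HBM), illustrating why
# large per-GPU memory delays the need for model parallelism. All numbers
# are illustrative assumptions, not Zyphra's actual configuration.

def training_memory_gb(params_billion: float,
                       bytes_per_param: int = 2,     # bf16 weights
                       optimizer_bytes: int = 12):   # fp32 master + 2 moments
    """Approximate GB for weights + gradients + optimiser state."""
    params = params_billion * 1e9
    weights = params * bytes_per_param
    grads = params * bytes_per_param
    opt_state = params * optimizer_bytes
    return (weights + grads + opt_state) / 1e9

MI300X_HBM_GB = 192

for size in (1, 8.3, 20):
    need = training_memory_gb(size)
    fits = "fits" if need < MI300X_HBM_GB else "needs sharding"
    print(f"{size:>5} B params -> ~{need:.0f} GB ({fits})")
```

On these assumptions, an 8.3B-parameter model's training state (~133 GB) still fits on one GPU, which is the kind of slack that lets early runs skip heavy parallelism.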

    Zyphra built each node with eight MI300X GPUs connected over InfinityFabric and paired each one with its own Pollara network card. A separate network handles dataset reads and checkpointing. It’s an unfussy design, but that seems to be the point; the simpler the wiring and network layout, the lower the switch costs and the easier it is to keep iteration times steady.

    ZAYA1: An AI model that punches above its weight

    ZAYA1-base activates 760 million parameters out of a total 8.3 billion and was trained on 12 trillion tokens in three stages. The architecture leans on compressed attention, a refined routing system to steer tokens to the right experts, and lighter-touch residual scaling to keep deeper layers stable.
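The "activates 760 million of 8.3 billion" figure comes from the Mixture-of-Experts design: a router sends each token to only a few experts. A minimal top-k routing sketch (expert count and k are illustrative, not ZAYA1's actual values):

```python
# Minimal sketch of Mixture-of-Experts top-k routing, the mechanism that
# lets an MoE model activate only a fraction of its parameters per token.
# Expert count and k below are illustrative, not ZAYA1's real settings.
import numpy as np

def route_tokens(logits: np.ndarray, k: int = 2):
    """Pick top-k experts per token and softmax-normalise their weights."""
    topk = np.argsort(logits, axis=-1)[:, -k:]          # (tokens, k) expert ids
    picked = np.take_along_axis(logits, topk, axis=-1)  # their router scores
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # per-token mixture weights
    return topk, w

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(4, 8))   # 4 tokens, 8 candidate experts
experts, weights = route_tokens(router_logits)
print(experts.shape, weights.sum(axis=-1))  # (4, 2), each row sums to 1
```

Only the selected experts run their feed-forward layers for that token, which is why per-token compute and inference memory stay far below the full parameter count.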

The model was trained with a mix of the Muon and AdamW optimisers. To make Muon efficient on AMD hardware, Zyphra fused kernels and trimmed unnecessary memory traffic so the optimiser wouldn’t dominate each iteration. Batch sizes were increased over time, a strategy that depends heavily on storage pipelines able to deliver tokens quickly enough.
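Running two optimisers means partitioning the parameters between them. A common convention, assumed here rather than confirmed by the report, is to give Muon the 2-D weight matrices and AdamW everything else (embeddings, norms, biases):

```python
# Sketch of splitting parameters between two optimisers, as in a Muon +
# AdamW mix. The split rule below is an assumed convention, not Zyphra's
# published recipe; parameter names are invented for illustration.
params = {
    "attn.w_q":    {"ndim": 2},
    "mlp.w_in":    {"ndim": 2},
    "embedding":   {"ndim": 2},  # embeddings often stay on AdamW despite being 2-D
    "layernorm.g": {"ndim": 1},
    "bias":        {"ndim": 1},
}

muon_group = [n for n, p in params.items()
              if p["ndim"] == 2 and "embedding" not in n]
adamw_group = [n for n in params if n not in muon_group]
print(muon_group)   # matrices handled by Muon
print(adamw_group)  # everything else handled by AdamW
```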

    All of this leads to an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a sliver of the model runs at once, which helps manage inference memory and reduces serving cost.

    A bank, for example, could train a domain-specific model for investigations without needing convoluted parallelism early on. The MI300X’s memory headroom gives engineers space to iterate, while ZAYA1’s compressed attention cuts prefill time during evaluation.

    Making ROCm behave with AMD GPUs

    Zyphra didn’t hide the fact that moving a mature NVIDIA-based workflow onto ROCm took work. Instead of porting components blindly, the team spent time measuring how AMD hardware behaved and reshaping model dimensions, GEMM patterns, and microbatch sizes to suit MI300X’s preferred compute ranges.

    InfinityFabric operates best when all eight GPUs in a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized fusion buffers accordingly. Long-context training, from 4k up to 32k tokens, relied on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
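Sizing fusion buffers for larger messages amounts to gradient bucketing: coalescing many small tensors into fewer, bigger collectives. A greedy sketch (the 64 MB bucket size is an illustrative assumption):

```python
# Sketch of gradient bucketing: packing many small tensors into fewer,
# larger messages so the interconnect runs near peak throughput.
# The bucket size is an illustrative assumption, not Zyphra's setting.
def bucket_tensors(sizes_mb, bucket_mb=64):
    """Greedily pack tensor sizes (MB) into buckets of at most bucket_mb."""
    buckets, current, current_mb = [], [], 0.0
    for s in sizes_mb:
        if current and current_mb + s > bucket_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
        current.append(s)
        current_mb += s
    if current:
        buckets.append(current)
    return buckets

grads = [4, 30, 30, 8, 60, 2, 2, 2]             # per-tensor gradient sizes in MB
buckets = bucket_tensors(grads)
print(len(buckets), [sum(b) for b in buckets])  # fewer, larger collectives
```

Eight tensors collapse into four messages, two of them at the full 64 MB, which is the regime where NICs like Pollara tend to hit peak throughput.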

    Storage considerations were equally practical. Smaller models hammer IOPS; larger ones need sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and increased per-node page caches to speed checkpoint recovery, which is vital during long runs where rewinds are inevitable.
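Shard bundling turns many scattered small reads (IOPS-bound) into a few sequential ones (bandwidth-bound). A toy version, with an invented blob-plus-index layout:

```python
# Sketch of dataset shard bundling: concatenating small shards into one
# large blob with an (offset, length) index, so reads become contiguous.
# The file layout here is an invented illustration.
import io

def bundle(shards: list) -> tuple:
    """Concatenate shards; return the blob plus an (offset, length) index."""
    blob, index, offset = io.BytesIO(), [], 0
    for s in shards:
        blob.write(s)
        index.append((offset, len(s)))
        offset += len(s)
    return blob.getvalue(), index

def read_shard(blob: bytes, index, i: int) -> bytes:
    off, ln = index[i]
    return blob[off:off + ln]  # one contiguous read, not many small ones

shards = [b"tokens-0", b"tokens-1", b"tokens-22"]
blob, idx = bundle(shards)
print(read_shard(blob, idx, 2))  # b'tokens-22'
```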

    Keeping clusters on their feet

    Training jobs that run for weeks rarely behave perfectly. Zyphra’s Aegis service monitors logs and system metrics, identifies failures such as NIC glitches or ECC blips, and takes straightforward corrective actions automatically. The team also increased RCCL timeouts to keep short network interruptions from killing entire jobs.
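The shape of such a remediation loop is simple: match known failure signatures in logs and map each to a corrective action. The signatures and actions below are invented for illustration; the report does not publish Aegis's rules.

```python
# Toy sketch of an Aegis-style remediation loop: scan log lines for known
# failure signatures and map each to a corrective action. Signatures and
# actions are invented for illustration, not Aegis's actual rule set.
REMEDIATIONS = {
    "NIC link flap":     "reset_nic",
    "ECC uncorrectable": "cordon_node",
    "RCCL timeout":      "extend_timeout_and_retry",
}

def triage(log_lines):
    """Return (signature, action) pairs for every recognised failure."""
    actions = []
    for line in log_lines:
        for signature, action in REMEDIATIONS.items():
            if signature in line:
                actions.append((signature, action))
    return actions

logs = [
    "12:01 worker3: NIC link flap detected on port 1",
    "12:02 worker3: resumed",
    "12:07 worker5: ECC uncorrectable error, bank 4",
]
print(triage(logs))
```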

    Checkpointing is distributed across all GPUs rather than forced through a single chokepoint. Zyphra reports more than ten-fold faster saves compared with naïve approaches, which directly improves uptime and cuts operator workload.
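The arithmetic behind that speedup is straightforward: with every rank writing its own slice in parallel, save time scales with the slowest shard rather than the whole state. The state size and bandwidth below are illustrative assumptions:

```python
# Back-of-envelope for sharded checkpointing: parallel writers divide the
# wall-clock save time by the rank count. Numbers are illustrative, not
# Zyphra's measured figures.
def save_time_s(state_gb: float, write_gbps: float, ranks: int = 1):
    """Seconds to write state_gb, split evenly over ranks parallel writers."""
    return (state_gb / ranks) / write_gbps

STATE_GB, BW = 800, 2.0  # assumed total training state and per-writer GB/s
naive = save_time_s(STATE_GB, BW)              # one rank funnels everything
sharded = save_time_s(STATE_GB, BW, ranks=64)  # 64 GPUs write concurrently
print(f"naive: {naive:.0f}s, sharded: {sharded:.2f}s, "
      f"speedup: {naive / sharded:.0f}x")
```

The ten-fold-plus gain Zyphra reports is consistent with this kind of parallel layout, with real-world overheads (metadata, shared filesystem contention) eating into the ideal linear speedup.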

    What the ZAYA1 AMD training milestone means for AI procurement

The report draws a clean line between NVIDIA’s ecosystem and AMD’s equivalents: NVLink vs InfinityFabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue the AMD stack is now mature enough for serious large-scale model development.

    None of this suggests enterprises should tear out existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while using AMD for stages that benefit from the memory capacity of MI300X GPUs and ROCm’s openness. It spreads supplier risk and increases total training volume without major disruption.

    This all leads us to a set of recommendations: treat model shape as adjustable, not fixed; design networks around the collective operations your training will actually use; build fault tolerance that protects GPU hours rather than merely logging failures; and modernise checkpointing so it no longer derails training rhythm.

    It’s not a manifesto, just our practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE AI model on AMD GPUs. For organisations looking to expand AI capacity without relying solely on one vendor, it’s a potentially useful blueprint.

    See also: Google commits to 1000x more AI infrastructure in next 4-5 years


Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is part of TechEx and is co-located with other leading technology events including the Cyber Security Expo.

    AI News is powered by TechForge Media. Explore other upcoming enterprise technology events and webinars here.
