Close Menu

    Subscribe to Updates

    Get the latest news from tastytech.

    What's Hot

    The best Nintendo Switch and Switch 2 accessories for Pokémon superfans

    March 22, 2026

    Michael Shannon’s Big Year | Little White Lies

    March 22, 2026

    BMW tuner AC Schnitzer will shutdown by end of 2026

    March 22, 2026
    Facebook X (Twitter) Instagram
    Facebook X (Twitter) Instagram
    tastytech.intastytech.in
    Subscribe
    • AI News & Trends
    • Tech News
    • AI Tools
    • Business & Startups
    • Guides & Tutorials
    • Tech Reviews
    • Automobiles
    • Gaming
    • movies
    tastytech.intastytech.in
    Home»AI Tools»AI model using AMD GPUs for training hits milestone
    AI model using AMD GPUs for training hits milestone
    AI Tools

    AI model using AMD GPUs for training hits milestone

    gvfx00@gmail.comBy gvfx00@gmail.comNovember 25, 2025No Comments5 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email


    Zyphra, AMD, and IBM spent a year testing whether AMD’s GPUs and platform can support large-scale AI model training, and the result is ZAYA1.

    In partnership, the three companies trained ZAYA1 – described as the first major Mixture-of-Experts foundation model built entirely on AMD GPUs and networking – which they see as proof that the market doesn’t have to depend on NVIDIA to scale AI.

    The model was trained on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software, all running across IBM Cloud’s infrastructure. What’s notable is how conventional the setup looks. Instead of experimental hardware or obscure configurations, Zyphra built the system much like any enterprise cluster—just without NVIDIA’s components.

    Zyphra says ZAYA1 performs on par with, and in some areas ahead of, well-established open models in reasoning, maths, and code. For businesses frustrated by supply constraints or spiralling GPU pricing, it amounts to something rare: a second option that doesn’t require compromising on capability.

    Table of Contents

    Toggle
      • How Zyphra used AMD GPUs to cut costs without gutting AI training performance
      • ZAYA1: An AI model that punches above its weight
      • Making ROCm behave with AMD GPUs
      • Keeping clusters on their feet
      • What the ZAYA1 AMD training milestone means for AI procurement
      • Related posts:
    • Russia-Ukraine war: List of key events, day 1,326 | Russia-Ukraine war News
    • Infosys AI implementation framework offers business leaders guidance
    • US forces stormed cargo ship travelling from China to Iran: Report | International Trade News

    How Zyphra used AMD GPUs to cut costs without gutting AI training performance

    Most organisations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput. 

    MI300X’s 192GB of high-bandwidth memory per GPU gives engineers some breathing room, allowing early training runs without immediately resorting to heavy parallelism. That tends to simplify projects that are otherwise fragile and time-consuming to tune.

    Zyphra built each node with eight MI300X GPUs connected over InfinityFabric and paired each one with its own Pollara network card. A separate network handles dataset reads and checkpointing. It’s an unfussy design, but that seems to be the point; the simpler the wiring and network layout, the lower the switch costs and the easier it is to keep iteration times steady.

    ZAYA1: An AI model that punches above its weight

    ZAYA1-base activates 760 million parameters out of a total 8.3 billion and was trained on 12 trillion tokens in three stages. The architecture leans on compressed attention, a refined routing system to steer tokens to the right experts, and lighter-touch residual scaling to keep deeper layers stable.

    The model uses a mix of Muon and AdamW. To make Muon efficient on AMD hardware, Zyphra fused kernels and trimmed unnecessary memory traffic so the optimiser wouldn’t dominate each iteration. Batch sizes were increased over time, but that depends heavily on having storage pipelines that can deliver tokens quickly enough.

    All of this leads to an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a sliver of the model runs at once, which helps manage inference memory and reduces serving cost.

    A bank, for example, could train a domain-specific model for investigations without needing convoluted parallelism early on. The MI300X’s memory headroom gives engineers space to iterate, while ZAYA1’s compressed attention cuts prefill time during evaluation.

    Making ROCm behave with AMD GPUs

    Zyphra didn’t hide the fact that moving a mature NVIDIA-based workflow onto ROCm took work. Instead of porting components blindly, the team spent time measuring how AMD hardware behaved and reshaping model dimensions, GEMM patterns, and microbatch sizes to suit MI300X’s preferred compute ranges.

    InfinityFabric operates best when all eight GPUs in a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized fusion buffers accordingly. Long-context training, from 4k up to 32k tokens, relied on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.

    Storage considerations were equally practical. Smaller models hammer IOPS; larger ones need sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and increased per-node page caches to speed checkpoint recovery, which is vital during long runs where rewinds are inevitable.

    Keeping clusters on their feet

    Training jobs that run for weeks rarely behave perfectly. Zyphra’s Aegis service monitors logs and system metrics, identifies failures such as NIC glitches or ECC blips, and takes straightforward corrective actions automatically. The team also increased RCCL timeouts to keep short network interruptions from killing entire jobs.

    Checkpointing is distributed across all GPUs rather than forced through a single chokepoint. Zyphra reports more than ten-fold faster saves compared with naïve approaches, which directly improves uptime and cuts operator workload.

    What the ZAYA1 AMD training milestone means for AI procurement

    The report draws a clean line between NVIDIA’s ecosystem and AMD’s equivalents: NVLINK vs InfinityFabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue the AMD stack is now mature enough for serious large-scale model development.

    None of this suggests enterprises should tear out existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while using AMD for stages that benefit from the memory capacity of MI300X GPUs and ROCm’s openness. It spreads supplier risk and increases total training volume without major disruption.

    This all leads us to a set of recommendations: treat model shape as adjustable, not fixed; design networks around the collective operations your training will actually use; build fault tolerance that protects GPU hours rather than merely logging failures; and modernise checkpointing so it no longer derails training rhythm.

    It’s not a manifesto, just our practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE AI model on AMD GPUs. For organisations looking to expand AI capacity without relying solely on one vendor, it’s a potentially useful blueprint.

    See also: Google commits to 1000x more AI infrastructure in next 4-5 years

    Banner for AI & Big Data Expo by TechEx events.

    Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is part of TechEx and is co-located with other leading technology events including the Cyber Security Expo. Click here for more information.

    AI News is powered by TechForge Media. Explore other upcoming enterprise technology events and webinars here.

    Related posts:

    Israel to block dozens of aid groups working in war-battered Gaza | Human Rights News

    Manufacturing's pivot: AI as a strategic driver

    Tunisian opposition figures join hunger strike to support jailed politician | Politics News

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleJavaScript Is Weird. And That’s Why We Love It.
    Next Article MG QS Super Hybrid: Plug-in hybrid large SUV coming to take on Kluger, Sorento
    gvfx00@gmail.com
    • Website

    Related Posts

    AI Tools

    Lebanon’s Aoun warns Israeli attack on bridge ‘prelude to ground invasion’ | Israel attacks Lebanon News

    March 22, 2026
    AI Tools

    Iran says will hit region’s energy sites if US, Israel target power plants | US-Israel war on Iran News

    March 22, 2026
    AI Tools

    Evloev upsets Murphy, sets up featherweight title shot against Volkanovski | Mixed Martial Arts News

    March 22, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    BMW Will Put eFuel In Cars Made In Germany From 2028

    October 14, 202511 Views

    Best Sonic Lego Deals – Dr. Eggman’s Drillster Gets Big Price Cut

    December 16, 20259 Views

    What is Fine-Tuning? Your Ultimate Guide to Tailoring AI Models in 2025

    October 14, 20259 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram

    Subscribe to Updates

    Get the latest tech news from tastytech.

    About Us
    About Us

    TastyTech.in brings you the latest AI, tech news, cybersecurity tips, and gadget insights all in one place. Stay informed, stay secure, and stay ahead with us!

    Most Popular

    BMW Will Put eFuel In Cars Made In Germany From 2028

    October 14, 202511 Views

    Best Sonic Lego Deals – Dr. Eggman’s Drillster Gets Big Price Cut

    December 16, 20259 Views

    What is Fine-Tuning? Your Ultimate Guide to Tailoring AI Models in 2025

    October 14, 20259 Views

    Subscribe to Updates

    Get the latest news from tastytech.

    Facebook X (Twitter) Instagram Pinterest
    • Homepage
    • About Us
    • Contact Us
    • Privacy Policy
    © 2026 TastyTech. Designed by TastyTech.

    Type above and press Enter to search. Press Esc to cancel.

    Ad Blocker Enabled!
    Ad Blocker Enabled!
    Our website is made possible by displaying online advertisements to our visitors. Please support us by disabling your Ad Blocker.