    The “Robust” Data Scientist: Winning with Messy Data and Pingouin

By gvfx00@gmail.com | May 1, 2026



Image by Editor


# Introduction

A harsh truth to begin with: textbook data science rarely survives contact with the real world. Concepts and techniques are taught on carefully curated, beautifully bell-shaped variables, but as soon as we venture into the wild of real projects, we are hit with outliers, heavily skewed distributions, and wildly unequal variances.

A previous article on building an exploratory data analysis (EDA) pipeline with Pingouin showed how to detect, through tests, cases where the data violates assumptions like homoscedasticity and normality. But what happens when those tests fail? Throwing the data away isn't the solution: becoming robust is.

This article explores the craft of using robust statistics in data science. These are methods built to yield reliable, valid results even when the data fails classical assumptions or is riddled with outliers and noise. Taking a "choose your own adventure" approach, we will walk through three scenarios using Python's Pingouin to handle some of the ugliest data you are likely to meet in your daily work.

     

# Initial Setup

Let's start by installing (if needed) and importing Pingouin and pandas, then loading the wine quality dataset from the URL shown in the code below.

    !pip install pingouin pandas
    
    import pandas as pd
    import pingouin as pg
    
    # Loading our messy, real-world-like dataset, containing red and white wine samples
    url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
    df = pd.read_csv(url)
    
    # Take a small peek at what we are about to deal with
    df.head()

     

If you read the previous Pingouin article, you already know this is a notoriously messy dataset that fails several common assumptions. Now we will embark on three different "adventures", each highlighting a scenario, a core problem, and a robust fix for it.

     

    // Adventure 1: When the Normality Test Fails

    Suppose we run normality tests on two groups: white wine samples and red wine samples.

    white_wine_alcohol = df[df['type'] == 'white']['alcohol']
    red_wine_alcohol = df[df['type'] == 'red']['alcohol']
    
    print("Normality test for White Wine Alcohol content:")
    print(pg.normality(white_wine_alcohol))
    print("\nNormality test for Red Wine Alcohol content:")
    print(pg.normality(red_wine_alcohol))

     

You will find that neither distribution is normal: both p-values are extremely low. Non-normality doesn't by itself prove there are outliers or skewness, but such a strong deviation often suggests they are present. Comparing means with a t-test in this situation would be risky and likely to yield unreliable results.
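For intuition, pg.normality() defaults to the Shapiro-Wilk test, which Pingouin runs via SciPy under the hood. Here is a small sketch of the same rejection on deliberately skewed synthetic data (the lognormal parameters are arbitrary, chosen only to mimic a messy real-world variable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# A heavily right-skewed sample, a stand-in for a messy real-world variable
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Shapiro-Wilk: the null hypothesis is that the sample is normal
w_stat, p_value = stats.shapiro(skewed)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.2e}")
# The tiny p-value rejects normality, just like pg.normality() does on the wine data
```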

The robust fix for this scenario is the Mann-Whitney U test. Instead of comparing averages, it compares ranks: all wines are sorted from lowest to highest alcohol content, and only each observation's position in that ordering matters. This rank-based approach is the trick that strips outliers of their dangerous magnitude, since the most extreme value is simply the top rank no matter how far out it lies. Here's how:

    # Separating our two groups
    red_wine = df[df['type'] == 'red']['alcohol']
    white_wine = df[df['type'] == 'white']['alcohol']
    
    # Running the robust Mann-Whitney U test
    mwu_results = pg.mwu(x=red_wine, y=white_wine)
    print(mwu_results)

     

    Output:

             U_val alternative     p_val       RBC      CLES
    MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903

     

Since the p-value (about 0.18) is not below 0.05, there is no statistically significant difference in alcohol content between the two wine types, and unlike a t-test result, this conclusion is not at the mercy of outliers or skewness.
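To see why ranks matter, here is a small synthetic experiment (using SciPy, which Pingouin builds on; all numbers are invented for illustration): a genuine group difference that a t-test loses to a handful of outliers, while the rank-based Mann-Whitney U test still detects it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Group b genuinely sits one unit higher than group a...
a = rng.normal(10.0, 1.0, 100)
b = rng.normal(11.0, 1.0, 100)
# ...but a few wild outliers contaminate group a
a = np.concatenate([a, [95.0, 100.0, 105.0]])

t_p = stats.ttest_ind(a, b).pvalue                              # mean-based
u_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue  # rank-based
print(f"t-test p = {t_p:.3f}  |  Mann-Whitney p = {u_p:.2e}")
# The outliers inflate a's mean and variance enough to obscure the real shift
# from the t-test, while the rank-based test still finds it easily.
```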

     

    // Adventure 2: When the Paired T-Test Fails

Say you now want to compare two measurements taken from the same subject, e.g. a patient's blood sugar before and after an experimental drug, or two properties measured in the same bottle of wine. What matters here is how the differences between paired measurements are distributed: when those differences are not normally distributed, a standard paired t-test yields unreliable p-values and confidence intervals.

The ideal fix in this scenario is the Wilcoxon signed-rank test: the robust sibling of the paired t-test, which takes the difference within each pair and ranks the absolute values of those differences. In Pingouin, the test is called with pg.wilcoxon(), passing in the two columns of paired measurements taken on the same subject, e.g. two types of wine acidity.

    # Run the robust Wilcoxon signed-rank test for paired data
    wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
    print(wilcoxon_results)

     

    Result:

              W_val alternative  p_val  RBC  CLES
    Wilcoxon    0.0   two-sided    0.0  1.0   1.0

     

The result above shows a statistically significant difference, or "perfect separation," between the two measurements: a W statistic of 0 and an RBC of 1 mean fixed acidity exceeds volatile acidity in every single pair. Not only are the two wine properties different, they operate at entirely different magnitude tiers across the dataset.
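What "perfect separation" means mechanically can be reproduced on synthetic pairs (SciPy backend; the value ranges are invented to roughly mimic fixed vs. volatile acidity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Paired measurements where x exceeds y in every single pair,
# loosely mimicking fixed acidity (~6-10) vs volatile acidity (~0.2-0.7)
y = rng.uniform(0.2, 0.7, 50)
x = y + rng.uniform(6.0, 9.0, 50)  # every within-pair difference is positive

res = stats.wilcoxon(x, y)
print(f"W = {res.statistic}, p = {res.pvalue:.2e}")
# W = 0.0: the rank sum on the "losing" sign is empty, i.e. perfect separation
```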

     

    // Adventure 3: When ANOVA Fails

    In this third and final adventure, we want to check whether residual sugar levels in wine differ significantly across distinct quality ratings — note that the latter range between 3 and 9, taking integer values, and can therefore be treated as discrete categories.

    If Pingouin’s Levene test of homoscedasticity fails dramatically — for instance, because sugar variance in mediocre wines is huge but very small in top-quality wines — a classical one-way ANOVA may produce misleading results, as this test assumes equal variances among groups.
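If you want to run that failing check yourself, Pingouin exposes it as pg.homoscedasticity(data=df, dv='residual sugar', group='quality'), which wraps SciPy's Levene test. A self-contained sketch with synthetic groups (the group spreads are invented to mimic the scenario described above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sugar levels for three quality tiers with very unequal spreads
low_q  = rng.normal(6.0, 5.0, 200)   # mediocre wines: huge variance
mid_q  = rng.normal(6.0, 2.0, 200)
high_q = rng.normal(6.0, 0.5, 200)   # top wines: tight variance

# Levene's test: the null hypothesis is that all groups share one variance
w_stat, p_value = stats.levene(low_q, mid_q, high_q)
print(f"Levene W = {w_stat:.1f}, p = {p_value:.2e}")
# A tiny p-value rejects equal variances, so classical one-way ANOVA is unsafe
```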

The fix is Welch's ANOVA, which drops the equal-variance assumption: each group's mean is weighted by how reliably it is estimated, so high-variance groups no longer distort the comparison, and the degrees of freedom are adjusted accordingly. Here is how to run this robust alternative to classical ANOVA in Pingouin:

    # Run Welch's ANOVA to compare sugar across quality ratings
    welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
    print(welch_results)

     

    Result:

        Source  ddof1      ddof2          F         p_unc       np2
    0  quality      6  54.507934  10.918282  5.937951e-08  0.008353

     

    Even where a one-way ANOVA might have struggled due to unequal variances, Welch’s ANOVA delivers a solid conclusion. The very small p-value is clear evidence that residual sugar levels differ significantly across wine quality ratings. Bear in mind, however, that sugar is only a small piece of the puzzle influencing wine quality — a point underscored by the low eta-squared value of 0.008.

     

# Wrapping Up

Through three example scenarios, each pairing a messy-data problem with a robust statistical strategy, we have seen that being a skilled data scientist doesn't mean having perfect data or tuning it to perfection: it means knowing what to do when the data gets difficult, for whatever reason. Pingouin implements a variety of robust tests that help you escape the failed-assumptions trap and extract mathematically sound insights with little extra effort.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning, and LLMs. He trains and guides others in harnessing AI in the real world.
