Building Modern EDA Pipelines with Pingouin

Business & Startups

By gvfx00@gmail.com | May 8, 2026 | 6 Mins Read
Table of Contents

• # Introduction
• # Initial Setup
• # Checking Univariate Normality
• # Checking Multivariate Normality
• # Checking Homoscedasticity
• # Checking Sphericity
• # Checking Multicollinearity
• # Wrapping Up

    # Introduction

     
Anyone who has spent a fair amount of time doing data science sooner or later learns the golden rule of downstream machine learning modeling: garbage in, garbage out (GIGO).

    For example, feeding a linear regression model with highly collinear data, or running ANOVA tests on heteroscedastic variances, is the perfect recipe… for ineffective models that won’t learn properly.

Exploratory data analysis (EDA) has a lot to offer in terms of visualizations like scatter plots and histograms, yet these aren't sufficient when we need rigorous validation of data against the mathematical assumptions required by downstream analyses or models. Pingouin helps with this by bridging the gap between two well-known libraries in data science and statistics: SciPy and pandas. It can also be a great ally in building solid, automated EDA pipelines. This article shows you how to build a holistic pipeline for rigorous, statistical EDA that validates several important data properties.

     

    # Initial Setup

     
    Let’s start by making sure we install Pingouin in our Python environment (and pandas, in case you don’t have it yet):

    !pip install pingouin pandas

     

After that, it's time to import the key libraries and load our data. As an example, we will use an open dataset containing samples of wine properties and their quality ratings.

    import pandas as pd
    import pingouin as pg
    
    # Loading the wine dataset from an open dataset GitHub repository
    url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
    df = pd.read_csv(url)
    
    # Displaying the first few rows to understand our features
    df.head()

     

    # Checking Univariate Normality

     
The first of our specific exploratory analyses is a check on univariate normality. Many traditional machine learning algorithms, as well as statistical tests like ANOVA and the t-test, assume that continuous variables follow a normal (a.k.a. Gaussian) distribution. Pingouin's pg.normality() function performs this check via a Shapiro-Wilk test across the entire dataframe:

    # Selecting a subset of continuous features for normality checks
    features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']
    
    # Running the normality test
    normality_results = pg.normality(df[features])
    print(normality_results)

     

    Output:

                             W          pval  normal
    fixed acidity     0.879789  2.437973e-57   False
    volatile acidity  0.875867  6.255995e-58   False
    citric acid       0.964977  5.262332e-37   False
    pH                0.991448  2.204049e-19   False
    alcohol           0.953532  2.918847e-41   False

     

It seems none of the numeric features at hand satisfy normality. This doesn't mean something is wrong with the data; it's simply one of its traits. The message is that, in later preprocessing steps beyond our EDA, we might want to apply transformations like the log-transform or Box-Cox to make the raw data look more normal-like, and thus more suitable for models that assume normality.

     

    # Checking Multivariate Normality

     
    Similarly, evaluating normality not feature by feature, but accounting for the interaction between features, is another interesting aspect to inspect. Let’s see how to check for multivariate normality: a key requirement in techniques like multivariate ANOVA (MANOVA), for instance.

    # Henze-Zirkler multivariate normality test
    multivariate_normality_results = pg.multivariate_normality(df[features])
    print(multivariate_normality_results)

     

    Output:

    HZResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)

     

This means multivariate normality doesn't hold either. If you are going to train a machine learning model on this dataset, non-parametric, tree-based models like gradient boosting and random forests might be a more robust alternative than parametric, weight-based models like linear regression.
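As a quick illustration of why tree-based models shrug off non-normal features, here is a hypothetical sketch on synthetic data (not the wine dataset), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Heavily skewed, non-normal predictors, much like the wine features
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(size=200)  # target driven by the first feature

# Trees split on thresholds, so they make no distributional assumptions
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(f"In-sample R^2: {model.score(X, y):.2f}")
```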

     

    # Checking Homoscedasticity

     
Next comes a tricky word for a rather simple concept: homoscedasticity. It refers to equal or constant variance, in this case of a continuous variable across groups, and many downstream tests rely on it. We will test this property (sorry, too hard to write its name again!) with Pingouin's implementation of Levene's test, as follows:

    # Levene's test for equal variances across groups
    # 'dv' is the target, dependent variable, 'group' is the categorical variable
    homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
    print(homoscedasticity_results)

     

    Result:

                    W          pval  equal_var
    levene  66.338684  2.317649e-80      False

     

    Since we got False once again, we have a so-called heteroscedasticity problem, which should be accounted for in downstream analyses. One possible way could be by employing robust standard errors when training regression models.

     

    # Checking Sphericity

     
Another statistical property to analyze is sphericity, which tests whether the variances of the differences between all pairwise combinations of conditions are equal. Checking it is usually desirable before running principal component analysis (PCA) for dimensionality reduction, as it tells us whether there are correlations between variables; PCA is rendered rather useless if there aren't any:

    # Mauchly's sphericity test
    sphericity_results = pg.sphericity(df[features])
    print(sphericity_results)

     

    Result:

    SpherResults(spher=False, W=np.float64(0.004437706589942777), chi2=np.float64(35184.26583883276), dof=9, pval=np.float64(0.0))

     

Looks like we have chosen a pretty indomitable, arid dataset! But fear not: this article is intentionally designed to focus on the EDA process and help you identify plenty of data issues like these. At the end of the day, detecting them and knowing what to do about them before downstream machine learning analysis is far better than building a potentially flawed model. In this case, there is a catch: the p-value of 0.0 means the null hypothesis of an identity correlation matrix is rejected, i.e. meaningful correlations do exist between the variables. So if we had many features and wanted to reduce dimensionality, applying PCA might be a good idea.
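Since a rejected sphericity test implies the features are correlated, PCA has something to compress. A small synthetic sketch, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.decomposition import PCA

# Four features built from only two underlying signals: strongly correlated,
# exactly the situation a rejected sphericity test points to
rng = np.random.default_rng(2)
base = rng.normal(size=(300, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=300),
])

# Two components should capture almost all of the variance
pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))
```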

     

    # Checking Multicollinearity

     
Last, we will check multicollinearity: the presence of highly correlated predictors. This is an undesirable property in interpretable models like linear regression. Let's check it:

    # Calculating a robust correlation matrix with p-values
    correlation_matrix = pg.rcorr(df[features], method='pearson')
    print(correlation_matrix)

     

    Output matrix:

                     fixed acidity volatile acidity citric acid     pH alcohol
    fixed acidity                -              ***         ***    ***     ***
    volatile acidity         0.219                -         ***    ***      **
    citric acid              0.324           -0.378           -    ***        
    pH                      -0.253            0.261       -0.33      -     ***
    alcohol                 -0.095           -0.038       -0.01  0.121       -

     

    While pandas’ corr() can also be used, Pingouin’s counterpart uses asterisks to indicate the statistical significance level of each correlation (* for p < 0.05, ** for p < 0.01, and *** for p < 0.001). A correlation can be statistically significant yet still small in magnitude — multicollinearity becomes a concern when the absolute value of the correlation is high (typically above 0.8). In our case, none of the pairwise correlations are dangerously large, with all five evaluated features providing largely non-overlapping, unique information of their own for further analyses.
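A common complementary check here is the variance inflation factor (VIF). The following sketch uses statsmodels (an assumption, not part of this article's setup) on synthetic predictors where one column nearly duplicates another:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# x2 is almost a copy of x0, so both should show a heavily inflated VIF
rng = np.random.default_rng(3)
X = pd.DataFrame({"x0": rng.normal(size=200), "x1": rng.normal(size=200)})
X["x2"] = X["x0"] + 0.05 * rng.normal(size=200)

# Rule of thumb: a VIF above ~5-10 flags problematic multicollinearity
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)
```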

     

    # Wrapping Up

     
Through a series of examples applied and explained one by one, we have seen how to unleash the potential of Pingouin, an open-source Python library, to build robust, modern EDA pipelines. The statistical tests it offers help you make better decisions in data preprocessing, choose the right transformations to apply, and pick the right models for downstream analysis.
     
     

    Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
