Building Modern EDA Pipelines with Pingouin

Business & Startups

By gvfx00@gmail.com | May 8, 2026 | 6 Mins Read
Table of Contents

• # Introduction
• # Initial Setup
• # Checking Univariate Normality
• # Checking Multivariate Normality
• # Checking Homoscedasticity
• # Checking Sphericity
• # Checking Multicollinearity
• # Wrapping Up

    # Introduction

     
Anyone who has spent a fair amount of time doing data science sooner or later learns the golden rule of downstream machine learning modeling: garbage in, garbage out (GIGO).

    For example, feeding a linear regression model with highly collinear data, or running ANOVA tests on heteroscedastic variances, is the perfect recipe… for ineffective models that won’t learn properly.

Exploratory data analysis (EDA) has a lot to offer in terms of visualizations like scatter plots and histograms, yet these aren't sufficient when we need rigorous validation of data against the mathematical assumptions required by downstream analyses or models. Pingouin helps with this by bridging the gap between two well-known libraries in data science and statistics: SciPy and pandas. It can also be a great ally in building solid, automated EDA pipelines. This article shows you how to build a holistic pipeline for rigorous, statistical EDA that validates several important data properties.

     

    # Initial Setup

     
    Let’s start by making sure we install Pingouin in our Python environment (and pandas, in case you don’t have it yet):

    !pip install pingouin pandas

     

After that, it's time to import the key libraries and load our data. As an example, we will use an open dataset containing samples of wine properties and their quality ratings.

    import pandas as pd
    import pingouin as pg
    
    # Loading the wine dataset from an open dataset GitHub repository
    url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
    df = pd.read_csv(url)
    
    # Displaying the first few rows to understand our features
    df.head()

     

    # Checking Univariate Normality

     
The first of our specific exploratory analyses is a check on univariate normality. Many traditional machine learning algorithms, as well as statistical tests like ANOVA and the t-test, assume that continuous variables follow a normal (a.k.a. Gaussian) distribution. Pingouin's pg.normality() function performs this check via a Shapiro-Wilk test across the entire dataframe:

    # Selecting a subset of continuous features for normality checks
    features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']
    
    # Running the normality test
    normality_results = pg.normality(df[features])
    print(normality_results)

     

    Output:

                             W          pval  normal
    fixed acidity     0.879789  2.437973e-57   False
    volatile acidity  0.875867  6.255995e-58   False
    citric acid       0.964977  5.262332e-37   False
    pH                0.991448  2.204049e-19   False
    alcohol           0.953532  2.918847e-41   False

     

It seems none of the numeric features at hand satisfy normality. This doesn't mean something is wrong with the data; it's simply one of its traits. The message is that, in later preprocessing steps beyond our EDA, we might want to apply transformations like the log-transform or Box-Cox to make the raw data look more normal-like, and thus more suitable for models that assume normality.

     

    # Checking Multivariate Normality

     
    Similarly, evaluating normality not feature by feature, but accounting for the interaction between features, is another interesting aspect to inspect. Let’s see how to check for multivariate normality: a key requirement in techniques like multivariate ANOVA (MANOVA), for instance.

    # Henze-Zirkler multivariate normality test
    multivariate_normality_results = pg.multivariate_normality(df[features])
    print(multivariate_normality_results)

     

    Output:

    HZResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)

     

This means multivariate normality doesn't hold either. If you are going to train a machine learning model on this dataset, non-parametric, tree-based models like gradient boosting and random forests might be a more robust alternative than parametric, weight-based models like linear regression.
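As a quick illustration of why tree-based models shrug off non-normal features, here is a hypothetical sketch on synthetic data (not the wine dataset), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Heavily skewed, non-normal predictors, much like the wine features
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(size=200)  # target driven by the first feature

# Trees split on thresholds, so they make no distributional assumptions
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(f"In-sample R^2: {model.score(X, y):.2f}")
```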

     

    # Checking Homoscedasticity

     
Next comes a tricky word for a rather simple concept: homoscedasticity. It refers to equal or constant variance, in this case of a continuous variable across groups, and many downstream tests rely on it. We will test this property (sorry, too hard to write its name again!) with Pingouin's implementation of Levene's test, as follows:

    # Levene's test for equal variances across groups
    # 'dv' is the target, dependent variable, 'group' is the categorical variable
    homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
    print(homoscedasticity_results)

     

    Result:

                    W          pval  equal_var
    levene  66.338684  2.317649e-80      False

     

    Since we got False once again, we have a so-called heteroscedasticity problem, which should be accounted for in downstream analyses. One possible way could be by employing robust standard errors when training regression models.

     

    # Checking Sphericity

     
Another statistical property to analyze is sphericity, which tests whether the variances of the differences between all pairwise combinations of conditions are equal. Checking it is usually desirable before running principal component analysis (PCA) for dimensionality reduction, as it tells us whether there are correlations between variables; PCA is rendered rather useless if there aren't any:

    # Mauchly's sphericity test
    sphericity_results = pg.sphericity(df[features])
    print(sphericity_results)

     

    Result:

    SpherResults(spher=False, W=np.float64(0.004437706589942777), chi2=np.float64(35184.26583883276), dof=9, pval=np.float64(0.0))

     

Looks like we have chosen a pretty indomitable, arid dataset! But fear not: this article is intentionally designed to focus on the EDA process and help you identify plenty of data issues like these. At the end of the day, detecting them and knowing what to do about them before downstream machine learning analysis is far better than building a potentially flawed model. In this case, there is a catch: the p-value of 0.0 means the null hypothesis of an identity correlation matrix is rejected, i.e. meaningful correlations do exist between the variables. So if we had many features and wanted to reduce dimensionality, applying PCA might be a good idea.
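Since a rejected sphericity test implies the features are correlated, PCA has something to compress. A small synthetic sketch, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.decomposition import PCA

# Four features built from only two underlying signals: strongly correlated,
# exactly the situation a rejected sphericity test points to
rng = np.random.default_rng(2)
base = rng.normal(size=(300, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=300),
])

# Two components should capture almost all of the variance
pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))
```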

     

    # Checking Multicollinearity

     
Last, we will check multicollinearity: the presence of highly correlated predictors. This is an undesirable property in interpretable models like linear regression. Let's check it:

    # Calculating a robust correlation matrix with p-values
    correlation_matrix = pg.rcorr(df[features], method='pearson')
    print(correlation_matrix)

     

    Output matrix:

                     fixed acidity volatile acidity citric acid     pH alcohol
    fixed acidity                -              ***         ***    ***     ***
    volatile acidity         0.219                -         ***    ***      **
    citric acid              0.324           -0.378           -    ***        
    pH                      -0.253            0.261       -0.33      -     ***
    alcohol                 -0.095           -0.038       -0.01  0.121       -

     

    While pandas’ corr() can also be used, Pingouin’s counterpart uses asterisks to indicate the statistical significance level of each correlation (* for p < 0.05, ** for p < 0.01, and *** for p < 0.001). A correlation can be statistically significant yet still small in magnitude — multicollinearity becomes a concern when the absolute value of the correlation is high (typically above 0.8). In our case, none of the pairwise correlations are dangerously large, with all five evaluated features providing largely non-overlapping, unique information of their own for further analyses.
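A common complementary check here is the variance inflation factor (VIF). The following sketch uses statsmodels (an assumption, not part of this article's setup) on synthetic predictors where one column nearly duplicates another:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# x2 is almost a copy of x0, so both should show a heavily inflated VIF
rng = np.random.default_rng(3)
X = pd.DataFrame({"x0": rng.normal(size=200), "x1": rng.normal(size=200)})
X["x2"] = X["x0"] + 0.05 * rng.normal(size=200)

# Rule of thumb: a VIF above ~5-10 flags problematic multicollinearity
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)
```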

     

    # Wrapping Up

     
Through a series of examples applied and explained one by one, we have seen how to unleash the potential of Pingouin, an open-source Python library, to build robust, modern EDA pipelines. The statistical tests it offers help you make better decisions in data preprocessing, choose the right transformations to apply, and pick the right models for downstream analysis.
     
     

    Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
