# Introduction
Exploratory data analysis (EDA) is a crucial stage prior to deeper data analysis processes or building data-driven AI systems, such as those based on machine learning models. While fixing common, real-world data quality issues and inconsistencies is often deferred to subsequent stages of the data pipeline, EDA is also an excellent opportunity to proactively detect these issues early on, before they silently bias results, degrade model performance, or compromise downstream decision-making.
Below, we curate a list of 7 Python tricks for your early EDA processes, each aimed at effectively identifying and fixing a particular kind of data quality issue.
To illustrate these tricks, we will use a synthetically generated employees dataset into which we intentionally inject a variety of data quality issues to show how to detect and handle them. Before trying the tricks out, make sure you first copy and paste the following preamble code into your coding environment:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# PREAMBLE CODE THAT RANDOMLY CREATES A DATASET AND INTRODUCES QUALITY ISSUES IN IT
np.random.seed(42)
n = 1000

df = pd.DataFrame({
    "age": np.random.normal(40, 12, n).round(),
    "income": np.random.normal(60000, 15000, n),
    "experience_years": np.random.normal(10, 5, n),
    "department": np.random.choice(
        ["Sales", "Engineering", "HR", "sales", "Eng", "HR "], n
    ),
    "performance_score": np.random.normal(3, 0.7, n)
})

# Randomly injecting data issues to the dataset
# 1. Missing values
df.loc[np.random.choice(n, 80, replace=False), "income"] = np.nan
df.loc[np.random.choice(n, 50, replace=False), "department"] = np.nan

# 2. Outliers
df.loc[np.random.choice(n, 10), "income"] *= 5
df.loc[np.random.choice(n, 10), "age"] = -5

# 3. Invalid values
df.loc[np.random.choice(n, 15), "performance_score"] = 7

# 4. Skewness
df["bonus"] = np.random.exponential(2000, n)

# 5. Highly correlated features
df["income_copy"] = df["income"] * 1.02

# 6. Duplicated entries
df = pd.concat([df, df.iloc[:20]], ignore_index=True)

df.head()
```
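As an optional sanity check (not part of the tricks themselves), you can quickly inspect the freshly generated dataset to confirm the injected issues are present, for example:

```python
# Quick overview of the synthetic dataset: column dtypes, non-null counts,
# and basic summary statistics for both numeric and categorical columns
df.info()
print(df.describe(include="all"))
```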
# 1. Detecting Missing Values via Heatmaps
While Python libraries like Pandas include functions that count the number of missing values for each attribute in your dataset, an attractive way to get a quick glimpse of all missing values at once, and of which columns or attributes contain them, is to plot a heatmap of the boolean matrix returned by isnull(). Each attribute becomes a column of the heatmap, and every missing value shows up as a light, barcode-like tick within it.
```python
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Value Heatmap")
plt.show()

# Numeric complement to the plot: missing values per column
df.isnull().sum().sort_values(ascending=False)
```
Heatmap to detect missing values | Image by Author
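If you prefer relative figures over raw counts, a minimal complementary sketch (assuming the same df as above) expresses the missing values as percentages per column:

```python
# Percentage of missing values per column, sorted from most to least affected
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.round(2))
```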
# 2. Removing Duplicates
This trick is a classic: simple, yet very effective for counting the number of duplicated instances (rows) in your dataset, after which you can apply drop_duplicates() to remove them. By default, this function preserves the first occurrence of each duplicated row and eliminates the rest. Nonetheless, this behavior can be modified, for instance, by using the keep="last" option to retain the last occurrence instead of the first one, or keep=False to get rid of all duplicated rows entirely. The behavior to choose will depend on your specific problem needs; both variants are illustrated in the short sketch after the code below.
```python
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Remove duplicates
df = df.drop_duplicates()
```
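As a quick illustration of the keep options mentioned above, here is a minimal sketch on a throwaway copy of the dataframe (the reintroduced duplicates are only there to make the differences visible):

```python
# Illustrating the keep parameter on a temporary copy of the data
tmp = pd.concat([df, df.iloc[:5]], ignore_index=True)  # reintroduce a few duplicates

kept_first = tmp.drop_duplicates()             # default: keep the first occurrence
kept_last = tmp.drop_duplicates(keep="last")   # keep the last occurrence instead
dropped_all = tmp.drop_duplicates(keep=False)  # drop every row that has a duplicate

print(len(tmp), len(kept_first), len(kept_last), len(dropped_all))
```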
# 3. Identifying Outliers Using the Inter-Quartile Range Method
The inter-quartile range (IQR) method is a statistics-backed approach to identifying data points that may be deemed outliers or extreme values because they lie substantially far from the rest of the points. This trick provides an implementation of the IQR method that can be replicated for different numeric attributes, such as “income”:
```python
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] < lower) | (data[column] > upper)]

outliers_income = detect_outliers_iqr(df, "income")
print(f"Income outliers: {len(outliers_income)}")

# Optional: cap them
Q1 = df["income"].quantile(0.25)
Q3 = df["income"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df["income"] = df["income"].clip(lower, upper)
```
# 4. Managing Inconsistent Categories
Unlike outliers, which are normally associated with numeric features, inconsistent categories in categorical variables can stem from diverse factors, e.g. manual entry inconsistencies such as mixed uppercase and lowercase spellings or stray whitespace, as well as domain-specific naming variations. Therefore, handling them properly might partly involve subject matter expertise to decide on the set of categories deemed valid. This example cleans up department names that are inconsistent spellings of the same departments.
print("Before cleaning:")
print(df["department"].value_counts(dropna=False))
df["department"] = (
df["department"]
.str.strip()
.str.lower()
.replace({
"eng": "engineering",
"sales": "sales",
"hr": "hr"
})
)
print("\nAfter cleaning:")
print(df["department"].value_counts(dropna=False))
# 5. Checking and Validating Ranges
While outliers are statistically distant values, invalid values violate domain-specific constraints, e.g. values for an “age” attribute cannot be negative. This example identifies negative values for the “age” attribute and replaces them with NaN. Notice that these invalid values are turned into missing values, hence a downstream strategy for handling them may also be needed; one possible strategy is sketched after the code.
invalid_age = df[df["age"] < 0]
print(f"Invalid ages: {len(invalid_age)}")
# Fix by setting to NaN
df.loc[df["age"] < 0, "age"] = np.nan
# 6. Applying Log-Transform for Skewed Data
Skewed data attributes like “bonus” in our example dataset are usually better transformed into something that more closely resembles a normal distribution, as this facilitates the majority of downstream machine learning analyses. This trick applies a log transformation and displays the distribution of the feature before and after the transform.
skewness = df["bonus"].skew()
print(f"Bonus skewness: {skewness:.2f}")
plt.hist(df["bonus"], bins=40)
plt.title("Bonus Distribution (Original)")
plt.show()
# Log transform
df["bonus_log"] = np.log1p(df["bonus"])
plt.hist(df["bonus_log"], bins=40)
plt.title("Bonus Distribution (Log Transformed)")
plt.show()
Before log-transform | Image by Author
After log-transform | Image by Author
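To confirm that the transformation actually reduced the skewness, a quick numeric check using the columns created above can be added:

```python
# Compare skewness before and after the log transform
print(f"Skewness before: {df['bonus'].skew():.2f}")
print(f"Skewness after:  {df['bonus_log'].skew():.2f}")
```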
# 7. Detecting Redundant Features via Correlation Matrix
We wrap up the list just the way we started: with a visual touch. Correlation matrices displayed as heatmaps help quickly identify pairs of features that are highly correlated, a strong sign that they may carry redundant information that is often best reduced before subsequent analysis. This example also prints the five most highly correlated pairs of attributes for further interpretability (note that the unstacked matrix lists each pair twice, once in each direction):
```python
corr_matrix = df.corr(numeric_only=True)

plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

# Find high correlations
high_corr = (
    corr_matrix
    .abs()
    .unstack()
    .sort_values(ascending=False)
)
high_corr = high_corr[high_corr < 1]
print(high_corr.head(5))
```
Correlation matrix to detect redundant features | Image by Author
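If you decide that one feature of each highly correlated pair should be dropped, the following is a minimal sketch of how that could be done; the 0.95 threshold, the df_reduced name, and the choice to drop the later feature of each pair are assumptions made for illustration:

```python
# Drop one feature from every pair whose absolute correlation exceeds a threshold
threshold = 0.95

# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.abs().where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
)
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(f"Candidate columns to drop: {to_drop}")

df_reduced = df.drop(columns=to_drop)
```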
# Wrapping Up
With the above list, you have learned 7 useful tricks to make the most of your exploratory data analysis, helping you reveal and deal with different kinds of data quality issues and inconsistencies effectively and intuitively.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
