# Introduction
Exploratory data analysis (EDA) is a crucial stage prior to deeper data analysis processes or building data-driven AI systems, such as those based on machine learning models. While fixing common, real-world data quality issues and inconsistencies is often deferred to subsequent stages of the data pipeline, EDA is also an excellent opportunity to proactively detect these issues early on, before they silently bias results, degrade model performance, or compromise downstream decision-making.
Below, we curate a list of 7 Python tricks for your early EDA processes, each aimed at effectively identifying and fixing a particular kind of data quality issue.
To illustrate these tricks, we will use a synthetically generated employees dataset into which we intentionally inject a variety of data quality issues to show how to detect and handle them. Before trying the tricks out, make sure you first copy and paste the following preamble code into your coding environment:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# PREAMBLE CODE THAT RANDOMLY CREATES A DATASET AND INTRODUCES QUALITY ISSUES IN IT
np.random.seed(42)
n = 1000

df = pd.DataFrame({
    "age": np.random.normal(40, 12, n).round(),
    "income": np.random.normal(60000, 15000, n),
    "experience_years": np.random.normal(10, 5, n),
    "department": np.random.choice(
        ["Sales", "Engineering", "HR", "sales", "Eng", "HR "], n
    ),
    "performance_score": np.random.normal(3, 0.7, n)
})

# Randomly injecting data issues to the dataset
# 1. Missing values
df.loc[np.random.choice(n, 80, replace=False), "income"] = np.nan
df.loc[np.random.choice(n, 50, replace=False), "department"] = np.nan

# 2. Outliers
df.loc[np.random.choice(n, 10), "income"] *= 5
df.loc[np.random.choice(n, 10), "age"] = -5

# 3. Invalid values
df.loc[np.random.choice(n, 15), "performance_score"] = 7

# 4. Skewness
df["bonus"] = np.random.exponential(2000, n)

# 5. Highly correlated features
df["income_copy"] = df["income"] * 1.02

# 6. Duplicated entries
df = pd.concat([df, df.iloc[:20]], ignore_index=True)

df.head()
```
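As an optional sanity check (not part of the tricks themselves), you can quickly inspect the freshly generated dataset to confirm the injected issues are present, for example:

```python
# Quick overview of the synthetic dataset: column dtypes, non-null counts,
# and basic summary statistics for both numeric and categorical columns
df.info()
print(df.describe(include="all"))
```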
# 1. Detecting Missing Values via Heatmaps
While Python libraries like Pandas include functions that count the number of missing values for each attribute in your dataset, an attractive way to get a quick glimpse of all missing values at once, and of which columns or attributes contain them, is to plot a heatmap of the boolean matrix returned by isnull(). Each attribute becomes a column of the heatmap, and every missing value shows up as a light, barcode-like tick within it.
```python
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Value Heatmap")
plt.show()

# Numeric complement to the plot: missing values per column
df.isnull().sum().sort_values(ascending=False)
```
Heatmap to detect missing values | Image by Author
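If you prefer relative figures over raw counts, a minimal complementary sketch (assuming the same df as above) expresses the missing values as percentages per column:

```python
# Percentage of missing values per column, sorted from most to least affected
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.round(2))
```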
# 2. Removing Duplicates
This trick is a classic: simple, yet very effective for counting the number of duplicated instances (rows) in your dataset, after which you can apply drop_duplicates() to remove them. By default, this function preserves the first occurrence of each duplicated row and eliminates the rest. Nonetheless, this behavior can be modified, for instance, by using the keep="last" option to retain the last occurrence instead of the first one, or keep=False to get rid of all duplicated rows entirely. The behavior to choose will depend on your specific problem needs; both variants are illustrated in the short sketch after the code below.
```python
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Remove duplicates
df = df.drop_duplicates()
```
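As a quick illustration of the keep options mentioned above, here is a minimal sketch on a throwaway copy of the dataframe (the reintroduced duplicates are only there to make the differences visible):

```python
# Illustrating the keep parameter on a temporary copy of the data
tmp = pd.concat([df, df.iloc[:5]], ignore_index=True)  # reintroduce a few duplicates

kept_first = tmp.drop_duplicates()             # default: keep the first occurrence
kept_last = tmp.drop_duplicates(keep="last")   # keep the last occurrence instead
dropped_all = tmp.drop_duplicates(keep=False)  # drop every row that has a duplicate

print(len(tmp), len(kept_first), len(kept_last), len(dropped_all))
```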
# 3. Identifying Outliers Using the Inter-Quartile Range Method
The inter-quartile range (IQR) method is a statistics-backed approach to identifying data points that may be deemed outliers or extreme values because they lie substantially far from the rest of the points. This trick provides an implementation of the IQR method that can be replicated for different numeric attributes, such as “income”:
```python
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] < lower) | (data[column] > upper)]

outliers_income = detect_outliers_iqr(df, "income")
print(f"Income outliers: {len(outliers_income)}")

# Optional: cap them
Q1 = df["income"].quantile(0.25)
Q3 = df["income"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df["income"] = df["income"].clip(lower, upper)
```
# 4. Managing Inconsistent Categories
Unlike outliers, which are normally associated with numeric features, inconsistent categories in categorical variables can stem from diverse factors, e.g. manual entry inconsistencies such as mixed uppercase and lowercase spellings or stray whitespace, as well as domain-specific naming variations. Therefore, handling them properly might partly involve subject matter expertise to decide on the set of categories deemed valid. This example cleans up department names that are inconsistent spellings of the same departments.
print("Before cleaning:")
print(df["department"].value_counts(dropna=False))
df["department"] = (
df["department"]
.str.strip()
.str.lower()
.replace({
"eng": "engineering",
"sales": "sales",
"hr": "hr"
})
)
print("\nAfter cleaning:")
print(df["department"].value_counts(dropna=False))
# 5. Checking and Validating Ranges
While outliers are statistically distant values, invalid values violate domain-specific constraints, e.g. values for an “age” attribute cannot be negative. This example identifies negative values for the “age” attribute and replaces them with NaN. Notice that these invalid values are turned into missing values, hence a downstream strategy for handling them may also be needed; one possible strategy is sketched after the code.
invalid_age = df[df["age"] < 0]
print(f"Invalid ages: {len(invalid_age)}")
# Fix by setting to NaN
df.loc[df["age"] < 0, "age"] = np.nan
# 6. Applying Log-Transform for Skewed Data
Skewed data attributes like “bonus” in our example dataset are usually better transformed into something that more closely resembles a normal distribution, as this facilitates the majority of downstream machine learning analyses. This trick applies a log transformation and displays the distribution of the feature before and after the transform.
skewness = df["bonus"].skew()
print(f"Bonus skewness: {skewness:.2f}")
plt.hist(df["bonus"], bins=40)
plt.title("Bonus Distribution (Original)")
plt.show()
# Log transform
df["bonus_log"] = np.log1p(df["bonus"])
plt.hist(df["bonus_log"], bins=40)
plt.title("Bonus Distribution (Log Transformed)")
plt.show()
Before log-transform | Image by Author
After log-transform | Image by Author
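To confirm that the transformation actually reduced the skewness, a quick numeric check using the columns created above can be added:

```python
# Compare skewness before and after the log transform
print(f"Skewness before: {df['bonus'].skew():.2f}")
print(f"Skewness after:  {df['bonus_log'].skew():.2f}")
```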
# 7. Detecting Redundant Features via Correlation Matrix
We wrap up the list just the way we started: with a visual touch. Correlation matrices displayed as heatmaps help quickly identify pairs of features that are highly correlated, a strong sign that they may carry redundant information that is often best reduced before subsequent analysis. This example also prints the five most highly correlated pairs of attributes for further interpretability (note that the unstacked matrix lists each pair twice, once in each direction):
```python
corr_matrix = df.corr(numeric_only=True)

plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

# Find high correlations
high_corr = (
    corr_matrix
    .abs()
    .unstack()
    .sort_values(ascending=False)
)
high_corr = high_corr[high_corr < 1]
print(high_corr.head(5))
```
Correlation matrix to detect redundant features | Image by Author
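If you decide that one feature of each highly correlated pair should be dropped, the following is a minimal sketch of how that could be done; the 0.95 threshold, the df_reduced name, and the choice to drop the later feature of each pair are assumptions made for illustration:

```python
# Drop one feature from every pair whose absolute correlation exceeds a threshold
threshold = 0.95

# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.abs().where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
)
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(f"Candidate columns to drop: {to_drop}")

df_reduced = df.drop(columns=to_drop)
```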
# Wrapping Up
With the above list, you have learned 7 useful tricks to make the most of your exploratory data analysis, helping you reveal and deal with different kinds of data quality issues and inconsistencies effectively and intuitively.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
