    7 Python EDA Tricks to Find and Fix Data Issues


    Image by Editor

     

    Table of Contents

    • # Introduction
    • # 1. Detecting Missing Values via Heatmaps
    • # 2. Removing Duplicates
    • # 3. Identifying Outliers Using the Inter-Quartile Range Method
    • # 4. Managing Inconsistent Categories
    • # 5. Checking and Validating Ranges
    • # 6. Applying Log-Transform for Skewed Data
    • # 7. Detecting Redundant Features via Correlation Matrix
    • # Wrapping Up
    # Introduction

     
    Exploratory data analysis (EDA) is a crucial stage before deeper data analysis or the building of data-driven AI systems, such as those based on machine learning models. While fixing common, real-world data quality issues and inconsistencies is often deferred to later stages of the data pipeline, EDA is also an excellent opportunity to proactively detect these issues early on, before they silently bias results, degrade model performance, or compromise downstream decision-making.

    Below, we curate a list of 7 Python tricks applicable to your early EDA processes, each aimed at effectively identifying and fixing a particular kind of data quality issue.

    To illustrate these tricks, we will use a synthetically generated employees dataset into which we intentionally inject a variety of data quality issues, exemplifying how to detect and handle them. Before trying the tricks out, copy and paste the following preamble code into your coding environment:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # PREAMBLE CODE THAT RANDOMLY CREATES A DATASET AND INTRODUCES QUALITY ISSUES IN IT
    np.random.seed(42)
    
    n = 1000
    
    df = pd.DataFrame({
        "age": np.random.normal(40, 12, n).round(),
        "income": np.random.normal(60000, 15000, n),
        "experience_years": np.random.normal(10, 5, n),
        "department": np.random.choice(
            ["Sales", "Engineering", "HR", "sales", "Eng", "HR "], n
        ),
        "performance_score": np.random.normal(3, 0.7, n)
    })
    
    # Randomly injecting data issues to the dataset
    
    # 1. Missing values
    df.loc[np.random.choice(n, 80, replace=False), "income"] = np.nan
    df.loc[np.random.choice(n, 50, replace=False), "department"] = np.nan
    
    # 2. Outliers
    df.loc[np.random.choice(n, 10), "income"] *= 5
    df.loc[np.random.choice(n, 10), "age"] = -5
    
    # 3. Invalid values
    df.loc[np.random.choice(n, 15), "performance_score"] = 7
    
    # 4. Skewness
    df["bonus"] = np.random.exponential(2000, n)
    
    # 5. Highly correlated features
    df["income_copy"] = df["income"] * 1.02
    
    # 6. Duplicated entries
    df = pd.concat([df, df.iloc[:20]], ignore_index=True)
    
    df.head()
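
    Before diving into the tricks, it is worth getting a quick structural overview of the freshly generated dataframe. A minimal first look:

    # Quick first look at dtypes, non-null counts, and summary statistics
    df.info()
    df.describe(include="all")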

     

    # 1. Detecting Missing Values via Heatmaps

     
    While Python libraries like Pandas include functions that count the number of missing values per attribute, an attractive way to get a quick glimpse of all missing values in your dataset, and of which columns contain them, is to plot a heatmap of the isnull() mask: every missing value shows up as a white, barcode-like line, horizontally arranged by attribute.

    plt.figure(figsize=(10, 5))
    sns.heatmap(df.isnull(), cbar=False)
    plt.title("Missing Value Heatmap")
    plt.show()
    
    df.isnull().sum().sort_values(ascending=False)

     

    Heatmap to detect missing values | Image by Author
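
    Detection is only half the job. As a minimal sketch of a fix, assuming median imputation is acceptable for “income” and the most frequent category for “department” (the right strategy depends on your data and downstream task):

    # Optional: simple imputation (assumption: median/mode are acceptable here)
    df["income"] = df["income"].fillna(df["income"].median())
    df["department"] = df["department"].fillna(df["department"].mode()[0])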

     

    # 2. Removing Duplicates

     
    This trick is a classic: simple, yet very effective. Count the number of duplicated instances (rows) in your dataset, then apply drop_duplicates() to remove them. By default, this function preserves the first occurrence of each duplicated row and eliminates the rest. This behavior can be modified: keep="last" retains the last occurrence instead of the first, while keep=False gets rid of all duplicated rows entirely, as shown in the sketch after the code below. Which behavior to choose depends on your specific problem needs.

    duplicate_count = df.duplicated().sum()
    print(f"Number of duplicate rows: {duplicate_count}")
    
    # Remove duplicates
    df = df.drop_duplicates()

     

    # 3. Identifying Outliers Using the Inter-Quartile Range Method

     
    The inter-quartile range (IQR) method is a statistics-backed approach to identifying data points that may be deemed outliers or extreme values, due to being substantially distant from the rest of the points. This trick provides an implementation of the IQR method that can be replicated across different numeric attributes, such as “income”:

    def iqr_bounds(data, column):
        # IQR-based lower and upper outlier bounds for a numeric column
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        return Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    
    def detect_outliers_iqr(data, column):
        lower, upper = iqr_bounds(data, column)
        return data[(data[column] < lower) | (data[column] > upper)]
    
    outliers_income = detect_outliers_iqr(df, "income")
    print(f"Income outliers: {len(outliers_income)}")
    
    # Optional: cap them at the IQR bounds
    lower, upper = iqr_bounds(df, "income")
    df["income"] = df["income"].clip(lower, upper)

     

    # 4. Managing Inconsistent Categories

     
    Unlike outliers, which are normally associated with numeric features, inconsistent categories in categorical variables can stem from diverse factors, e.g. manual-entry inconsistencies such as mixed upper- and lowercase spellings, stray whitespace, or domain-specific naming variations. Therefore, the right approach to handle them might partly involve subject matter expertise to decide on the set of categories deemed valid. This example normalizes inconsistent department names that refer to the same department.

    print("Before cleaning:")
    print(df["department"].value_counts(dropna=False))
    
    df["department"] = (
        df["department"]
        .str.strip()
        .str.lower()
        .replace({
            "eng": "engineering",
            "sales": "sales",
            "hr": "hr"
        })
    )
    
    print("\nAfter cleaning:")
    print(df["department"].value_counts(dropna=False))

     

    # 5. Checking and Validating Ranges

     
    While outliers are statistically distant values, invalid values violate domain-specific constraints, e.g. values of an “age” attribute cannot be negative. This example identifies negative values of the “age” attribute and replaces them with NaN. Notice that these invalid values are turned into missing values, hence a downstream strategy for handling them may also be needed.

    invalid_age = df[df["age"] < 0]
    print(f"Invalid ages: {len(invalid_age)}")
    
    # Fix by setting to NaN
    df.loc[df["age"] < 0, "age"] = np.nan

     

    # 6. Applying Log-Transform for Skewed Data

     
    Skewed attributes like “bonus” in our example dataset are usually better transformed into something that resembles a normal distribution, as this facilitates most downstream machine learning analyses. This trick applies a log transformation, displaying the before and after of the feature.

    skewness = df["bonus"].skew()
    print(f"Bonus skewness: {skewness:.2f}")
    
    plt.hist(df["bonus"], bins=40)
    plt.title("Bonus Distribution (Original)")
    plt.show()
    
    # Log transform
    df["bonus_log"] = np.log1p(df["bonus"])
    
    plt.hist(df["bonus_log"], bins=40)
    plt.title("Bonus Distribution (Log Transformed)")
    plt.show()

     

    Before log-transform | Image by Author

     
    After log-transform | Image by Author
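
    A quick sanity check is to recompute the skewness after the transform; a value much closer to 0 indicates a more symmetric distribution (np.log1p is used above rather than np.log so that zero bonuses are handled safely):

    print(f"Bonus skewness after log-transform: {df['bonus_log'].skew():.2f}")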

     

    # 7. Detecting Redundant Features via Correlation Matrix

     
    We wrap up the list just the way we started: with a visual touch. Correlation matrices displayed as heatmaps help quickly identify pairs of features that are highly correlated, a strong sign that they might contain redundant information which is often best minimized in subsequent analysis. This example also prints the top 5 most highly correlated pairs of attributes for further interpretability:

    corr_matrix = df.corr(numeric_only=True)
    
    plt.figure(figsize=(10, 6))
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Correlation Matrix")
    plt.show()
    
    # Find high correlations: keep each pair once and drop self-correlations
    high_corr = (
        corr_matrix
        .abs()
        .unstack()
        .sort_values(ascending=False)
    )
    high_corr = high_corr[[a < b for a, b in high_corr.index]]
    print(high_corr.head(5))

     

    Correlation matrix to detect redundant features | Image by Author
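
    Once a redundant pair is confirmed, the straightforward fix is to drop one member of the pair; in our dataset, income_copy is the injected near-duplicate of income:

    # Drop the redundant near-duplicate feature
    df = df.drop(columns=["income_copy"])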

     

    # Wrapping Up

     
    With the above list, you have learned 7 useful tricks to make the most of your exploratory data analysis, helping you reveal and deal with different kinds of data quality issues and inconsistencies effectively and intuitively.

    Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
