
    The Data Detox: Training Yourself for the Messy, Noisy, Real World

    December 16, 2025 · 9 Mins Read



     

    Table of Contents

    • # Introduction
    • # NoBroker Data Project: A Hands-On Test of Real-World Chaos
    • # Comparing Clean Interview Data Versus Real Production Data: The Reality Check
    • # Practice #1: Handling Missing Data
    • # Practice #2: Detecting Outliers
    • # Practice #3: Dealing with Duplicates and Inconsistencies
    • # Practice #4: Data Type Validation and Schema Checks
    • # Documenting Your Cleaning Steps
    • # Final Thoughts

    # Introduction

     
    We have all spent hours debugging a model, only to discover that the problem wasn’t the algorithm but a stray null value in row 47,832 skewing the results. Kaggle competitions give the impression that data arrives as clean, well-labeled CSVs with no class-imbalance issues, but real-world data rarely looks like that.

    In this article, we’ll use a real data project to explore four practices that prepare you for messy, real-world datasets.

     

    # NoBroker Data Project: A Hands-On Test of Real-World Chaos

     
    NoBroker is an Indian property technology (prop-tech) company that connects property owners and tenants directly in a broker-free marketplace.

     
     

    This data project is used during the recruitment process for data science positions at NoBroker.

    In this data project, NoBroker wants you to build a predictive model that estimates how many interactions a property will receive within a given time frame. We won’t complete the entire project here, but it’ll help us discover methods for training ourselves on messy real-world data.

    It has three datasets:

    • property_data_set.csv
      • Contains property details such as type, location, amenities, size, rent, and other housing features.
    • property_photos.tsv
      • Contains the photo URLs for each property.
    • property_interactions.csv
      • Contains the timestamps of interactions (contact requests) for each property.
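
    Before any cleaning, it helps to load all three files and take a quick look at their shapes. The sketch below is a minimal starting point, not the project’s own loading code; it assumes the files sit in the working directory under the names listed above, and the pics variable matches the name used in the later snippets.

    import pandas as pd

    # Load the three NoBroker files (assumed to be in the working directory)
    properties = pd.read_csv('property_data_set.csv')
    pics = pd.read_csv('property_photos.tsv', sep='\t')
    interactions = pd.read_csv('property_interactions.csv')

    # A quick sanity check on what we are dealing with
    for name, frame in [('properties', properties), ('photos', pics), ('interactions', interactions)]:
        print(name, frame.shape)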

     

    # Comparing Clean Interview Data Versus Real Production Data: The Reality Check

     
    Interview datasets are polished, balanced, and boring. Real production data? It’s a dumpster fire with missing values, duplicate rows, inconsistent formats, and silent errors that wait until Friday at 5 PM to break your pipeline.

    Take the NoBroker property dataset, a real-world mess with 28,888 properties across three tables. At first glance, it looks fine. But dig deeper, and you’ll find 11,022 missing photo uniform resource locators (URLs), corrupted JSON strings with rogue backslashes, and more.

    This is the line between clean and chaotic. Clean data trains you to build models; production data trains you to survive.

    We’ll explore four practices to train yourself.

     
     

    # Practice #1: Handling Missing Data

     
    Missing data isn’t just annoying; it’s a decision point. Delete the row? Fill it with the mean? Flag it as unknown? The answer depends on why the data is missing and how much you can afford to lose.

    The NoBroker dataset had three types of missing data. The photo_urls column was missing 11,022 values out of 28,888 rows, roughly 38% of the dataset.
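
    The original check isn’t reproduced on this page, but a quick null count along these lines surfaces the problem. This is a minimal sketch: pics is the photos dataframe used in the snippets below, and the literal 'NaN' string check mirrors the correction function that follows.

    # Count true nulls and literal 'NaN' strings in the photo URLs column
    missing = pics['photo_urls'].isna() | (pics['photo_urls'] == 'NaN')
    print(f"Missing photo_urls: {missing.sum()} of {len(pics)} rows ({missing.mean() * 100:.1f}%)")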

     

    On the NoBroker photos table, this check reports the 11,022 missing entries, roughly 38% of the rows.
     

    Deleting these rows would wipe out valuable property records. Instead, the solution was to treat missing photos as zero photos and move on.

    import json
    import pandas as pd

    def correction(x):
        # Missing or literal 'NaN' entries mean the listing has no photos
        if pd.isna(x) or x == 'NaN':
            return 0
        # Repair the malformed JSON string, then count the photo entries
        return len(json.loads(x.replace('\\', '').replace('{title', '{"title')))

    pics['photo_count'] = pics['photo_urls'].apply(correction)

     

    For numerical columns like total_floor (23 missing) and categorical columns like building_type (38 missing), the strategy was imputation. Fill numerical gaps with the mean, and categorical gaps with the mode.

    # Fill numerical gaps with the column mean
    for col in x_remain_withNull.columns:
        x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())

    # Fill categorical gaps with the most frequent value (mode)
    for col in x_cat_withNull.columns:
        x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])

     

    The first rule: do not delete anything without questioning why it is missing. Understand the pattern first; in this case, the missing photo URLs were not random.
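
    One way to probe the pattern is to turn missingness into a flag and compare its rate across property attributes after joining the photos table to the property table. The sketch below is purely illustrative: the property_id join key is an assumption, and building_type is just one plausible column to group by.

    # Hypothetical check: does the share of missing photos vary by building type?
    pics['photo_missing'] = pics['photo_urls'].isna() | (pics['photo_urls'] == 'NaN')
    merged = properties.merge(pics[['property_id', 'photo_missing']], on='property_id', how='left')
    print(merged.groupby('building_type')['photo_missing'].mean().sort_values(ascending=False))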

     

    # Practice #2: Detecting Outliers

     
    An outlier is not always an error, but it is always suspicious.

    Can you imagine a property with 21 bathrooms, an age of 800 years, or 40,000 square feet of space? Either you found your dream place or someone made a data entry error.

    The NoBroker dataset was full of these red flags. Box plots revealed extreme values across multiple columns: property ages over 100, sizes beyond 10,000 square feet (sq ft), and deposits exceeding 3.5 million. Some were legitimate luxury properties. Most were data entry mistakes.

    import matplotlib.pyplot as plt
    # One box plot per numerical column to surface extreme values
    df_num.plot(kind='box', subplots=True, figsize=(22, 10))
    plt.show()

     

    The resulting box plots make the extreme values easy to spot, with long tails in the property age, size, and deposit columns.
     

    The solution was interquartile range (IQR)-based outlier removal, a simple statistical method that flags values more than two IQRs below the first quartile or above the third quartile.

    To handle this, we first write a function that removes those outliers.

    def remove_outlier(df_in, col_name):
        q1 = df_in[col_name].quantile(0.25)
        q3 = df_in[col_name].quantile(0.75)
        iqr = q3 - q1
        # Fences at 2 * IQR (wider than the textbook 1.5) so fewer legitimate extremes are dropped
        fence_low = q1 - 2 * iqr
        fence_high = q3 + 2 * iqr
        # Keep only the rows that fall inside the fences
        df_out = df_in.loc[(df_in[col_name] <= fence_high) & (df_in[col_name] >= fence_low)]
        return df_out

     

    And we run this code on numerical columns.

    df = dataset.copy()
    for col in df_num.columns:
        if col in ['gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d']:
            continue  # Skip binary and target columns
        df = remove_outlier(df, col)
    print(f"Before: {dataset.shape[0]} rows")
    print(f"After: {df.shape[0]} rows")
    print(f"Removed: {dataset.shape[0] - df.shape[0]} rows ({((dataset.shape[0] - df.shape[0]) / dataset.shape[0] * 100):.1f}% reduction)")

     

     

    After removing outliers, the dataset shrank from 17,386 rows to 15,170, losing 12.7% of the data while keeping the model sane. The trade-off was worth it.

    For target variables like request_day_within_3d, capping was used instead of deletion. Values above 10 were capped at 10 to prevent extreme outliers from skewing predictions. In the following code, we also compare the results before and after.

    def capping_for_3days(x):
        num = 10
        return num if x > num else x
    df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)
    before_count = (df['request_day_within_3d'] > 10).sum()
    after_count = (df['request_day_within_3d_capping'] > 10).sum()
    total_rows = len(df)
    change_count = before_count - after_count
    percent_change = (change_count / total_rows) * 100
    print(f"Before capping (>10): {before_count}")
    print(f"After capping (>10): {after_count}")
    print(f"Reduced by: {change_count} ({percent_change:.2f}% of total rows affected)")

     

    The result?

     
     

    A cleaner distribution, better model performance, and fewer debugging sessions.

     

    # Practice #3: Dealing with Duplicates and Inconsistencies

     
    Duplicates are easy. Inconsistencies are hard. A duplicate row is just df.drop_duplicates(). An inconsistent format, like a JSON string that’s been mangled by three different systems, requires detective work.

    The NoBroker dataset had one of the worst JSON inconsistencies I’ve seen. The photo_urls column was supposed to contain valid JSON arrays, but instead, it was filled with malformed strings, missing quotes, escaped backslashes, and random trailing characters.

    text_before = pics['photo_urls'][0]
    print('Before Correction: \n\n', text_before)

     

    Printing the raw string shows the corrupted formatting: escaped backslashes, missing quotes around keys, and stray trailing characters that break json.loads.
     

    The fix required multiple string replacements to correct the formatting before parsing. Here is the code.

    text_after = text_before.replace('\\', '').replace('{title', '{"title').replace(']"', ']').replace('],"', ']","')
    parsed_json = json.loads(text_after)

     

     

    The JSON was indeed valid and parseable after the fix. It is not the cleanest way to do this kind of string manipulation, but it works.

    You see inconsistent formats everywhere: dates saved as strings, typos in categorical values, and numerical IDs saved as floats.

    The solution is standardization, as we did with the JSON formatting.
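
    As a rough illustration of what that standardization can look like elsewhere, the snippet below fixes the three issues just mentioned on a hypothetical listings dataframe; the variable and column names are invented for the example, not taken from the NoBroker schema.

    # Illustrative standardization steps on a hypothetical dataframe
    listings['city'] = listings['city'].str.strip().str.lower()        # normalize whitespace and casing in categories
    listings['listing_id'] = listings['listing_id'].astype('Int64')    # numerical IDs stored as floats back to integers
    listings['listed_on'] = pd.to_datetime(listings['listed_on'], dayfirst=True, errors='coerce')  # dates saved as strings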

     

    # Practice #4: Data Type Validation and Schema Checks

     
    Validation starts the moment you load your data. Finding out later that your dates are strings or your numbers are objects is a waste of debugging time.

    In the NoBroker project, the types were validated during the CSV read itself, enforcing the right data types upfront with pandas parameters. Here is the code.

    # Without parse_dates, activation_date loads as a plain object (string) column
    data = pd.read_csv('property_data_set.csv')
    print(data['activation_date'].dtype)

    # Re-read with explicit day-first date parsing
    # (infer_datetime_format is deprecated in pandas 2.0+ and can simply be dropped there)
    data = pd.read_csv('property_data_set.csv',
                       parse_dates=['activation_date'],
                       infer_datetime_format=True,
                       dayfirst=True)
    print(data['activation_date'].dtype)

     

    The dtype changes from object before parsing to datetime64[ns] afterwards.
     

    The same validation was applied to the interaction dataset.

    interaction = pd.read_csv('property_interactions.csv',
        parse_dates=['request_date'], 
        infer_datetime_format=True, 
        dayfirst=True)

     

    Not only was this good practice, but it was essential for anything downstream. The project required calculations of date and time differences between the activation and request dates.

    So the following code would produce an error if dates are strings.

    num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')

     

    Schema checks ensure that the structure does not change, but in reality the data will also drift: its distribution tends to shift over time. You can mimic this drift by varying the input proportions slightly and checking whether your model, or its validation step, detects and responds to the change.
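
    As a rough sketch of both ideas, the snippet below asserts a couple of expected dtypes and then compares a column against a deliberately perturbed copy with a two-sample Kolmogorov–Smirnov test from scipy. The rent column name and its expected dtype are assumptions made for the example, and the perturbation exists only to simulate drift.

    import numpy as np
    from scipy.stats import ks_2samp

    # Minimal schema check: fail fast if expected columns or dtypes change
    expected = {'activation_date': 'datetime64[ns]', 'rent': 'float64'}  # assumed column name and dtype
    for col, dtype in expected.items():
        assert str(data[col].dtype) == dtype, f"Schema drift: {col} is {data[col].dtype}, expected {dtype}"

    # Simulate distribution drift with a ~10% upward shift, then test for it
    original = data['rent'].dropna()
    drifted = original * np.random.normal(1.1, 0.05, size=len(original))
    stat, p_value = ks_2samp(original, drifted)
    print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")  # a tiny p-value flags a distribution change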

     

    # Documenting Your Cleaning Steps

     
    In three months, you won’t remember why you limited request_day_within_3d to 10. Six months from now, your teammate will break the pipeline by removing your outlier filter. In a year, the model will hit production, and no one will understand why it simply fails.

    Documentation isn’t optional. That is the difference between a reproducible pipeline and a voodoo script that works until it doesn’t.

    The NoBroker project documented every transformation in code comments and structured notebook sections with explanations and a table of contents.

    # Assignment
    # Read and Explore All Datasets
    # Data Engineering
    #     Handling Pics Data
    #     Number of Interactions Within 3 Days
    #     Number of Interactions Within 7 Days
    #     Merge Data
    # Exploratory Data Analysis and Processing
    # Feature Engineering
    #     Remove Outliers
    #     One-Hot Encoding
    #     MinMaxScaler
    #     Classical Machine Learning
    #     Predicting Interactions Within 3 Days
    #     Deep Learning
    # Try to correct the first Json
    # Try to replace corrupted values then convert to json
    # Function to correct corrupted json and get count of photos

     

    Version control matters too. Track changes to your cleaning logic. Save intermediate datasets. Keep a changelog of what you tried and what worked.
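
    A lightweight way to do this, sketched here under assumed file names rather than prescribed ones, is to stamp each intermediate snapshot with a date and append a one-line note to a changelog whenever the cleaning logic changes.

    from datetime import date

    # Save the cleaned snapshot with a date stamp so older versions stay reproducible
    snapshot_path = f"cleaned_properties_{date.today():%Y%m%d}.csv"
    df.to_csv(snapshot_path, index=False)

    # Append a human-readable note about what changed and why
    with open('CLEANING_CHANGELOG.md', 'a') as log:
        log.write(f"- {date.today()}: capped request_day_within_3d at 10, removed 2 x IQR outliers, saved {snapshot_path}\n")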

    The goal isn’t perfection. The goal is clarity. If you can’t explain why you made a decision, you can’t defend it when the model fails.

     

    # Final Thoughts

     
    Clean data is a myth. The best data scientists are not the ones who run away from messy datasets; they are the ones who know how to tame them. They find the missing values before training, catch the outliers before they skew predictions, check schemas before joining tables, and write everything down so that the next person doesn’t have to start from zero.

    No real impact comes from perfect data. It comes from the ability to deal with erroneous data and still construct something functional.

    So when you open a dataset and see null values, broken strings, and outliers, don’t panic. What you are looking at is not a problem but an opportunity to prove your skills on a real-world dataset.
     
     

    Nate Rosidi is a data scientist who works in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


