    How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset

By gvfx00@gmail.com | October 17, 2025


Image by Editor


    # Introduction

     
According to CrowdFlower’s survey, data scientists spend about 60% of their time organizing and cleaning data.

In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records and over a dozen features, such as delivery timestamps, total item count, and store category (e.g., Mexican, Thai, or American cuisine).

     

    # Predicting Food Delivery Times with DoorDash Data

     
     
    DoorDash aims to estimate the time it takes to deliver food accurately, from the moment a customer places an order to the time it arrives at their door. In this data project, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.

However, we won’t do the whole project; that is, we won’t build a predictive model. Instead, we’ll use the dataset provided in the project and create a data cleaning pipeline.

    Our workflow consists of two major steps.

     
[Image: the two-step data cleaning workflow]
     

     

    # Data Exploration

     
     

    Let’s start by loading and viewing the first few rows of the dataset.

     

    // Load and Preview the Dataset

    import pandas as pd
    df = pd.read_csv("historical_data.csv")
    df.head()

     

    Here is the output.

     
[Screenshot: df.head() output]
     

    This dataset includes datetime columns that capture the order creation time and actual delivery time, which can be used to calculate delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We can already see that there are some NaN values, which we’ll explore more closely in the following step.

     

    // Explore The Columns With info()

Let’s inspect the columns with the info() method. We will use this method throughout the article to track changes in each column’s non-null count; it’s a good indicator of missing data and overall data health.
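The call itself is a one-liner. Here is a minimal, self-contained sketch; a tiny synthetic frame stands in for historical_data.csv purely so the snippet runs on its own:

```python
import pandas as pd

# Tiny stand-in for historical_data.csv, for illustration only
df = pd.DataFrame({
    "created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25", None],
    "store_primary_category": ["american", None, "mexican"],
    "total_items": [4, 1, 2],
})

# info() prints each column's name, non-null count, and dtype
df.info()
```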

     

    Here is the output.

     
[Screenshot: df.info() output]
     

    As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which could affect our analysis if not handled properly. One last thing: the created_at and actual_delivery_time data types are objects; these should be datetime.

     

    # Building Data Cleaning Pipeline

     
    In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses common issues such as timestamp formatting, missing values, and irrelevant features.
     
[Image: data cleaning pipeline steps]
     

    // Fixing the Date and Time Columns Data Types

Before doing any analysis, we need to fix the columns that store timestamps. Otherwise, the calculation we mentioned (actual_delivery_time − created_at) will be wrong.

    What we’re fixing:

    • created_at: when the order was placed
    • actual_delivery_time: when the food arrived

    These two columns are stored as objects, so to be able to do calculations correctly, we have to convert them to the datetime format. To do that, we can use datetime functions in pandas. Here is the code.

    import pandas as pd
    df = pd.read_csv("historical_data.csv")
    # Convert timestamp strings to datetime objects
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
    df.info()

     

    Here is the output.

     
[Screenshot: df.info() output after the conversion]
     

    As you can see from the screenshot above, the created_at and actual_delivery_time are datetime objects now.

     
[Screenshot: non-null counts by column]
     

    Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we’ll focus on cleaning it first.

     

    // Data Imputation With mode()

One of the messiest columns in the dataset, judging by its high number of missing values, is store_primary_category. It tells us what kind of food each store serves, like Mexican, American, or Thai. However, many rows are missing this information, which limits how we can group or analyze the data. So how do we fix it?

Rather than dropping these rows, we will fill them with a smarter imputation strategy.

We build a mapping from each store_id to its most frequent category, then use that mapping to fill in the missing values. Let’s see the dataset before doing that.

     
[Screenshot: dataset before imputation]
     

    Here is the code.

    import numpy as np
    
    # Global most-frequent category as a fallback
    global_mode = df["store_primary_category"].mode().iloc[0]
    
    # Build store-level mapping to the most frequent category (fast and robust)
    store_mode = (
        df.groupby("store_id")["store_primary_category"]
          .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
    )
    
    # Fill missing categories using the store-level mode, then fall back to global mode
    df["store_primary_category"] = (
        df["store_primary_category"]
          .fillna(df["store_id"].map(store_mode))
          .fillna(global_mode)
    )
    
    df.info()

     

    Here is the output.

     
[Screenshot: df.info() output after imputation]
     

    As you can see from the screenshot above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

    df["store_primary_category"].isna().sum()

     

    Here is the output showing the number of NaN values. It’s zero; we got rid of all of them.

     
[Screenshot: NaN count output]
     

    And let’s see the dataset after the imputation.

     
[Screenshot: dataset after imputation]

     

    // Dropping Remaining NaNs

In the previous step, we corrected the store_primary_category column, but did you notice something? The non-null counts across the columns still don’t match!

    This is a clear sign that we’re still dealing with missing values in some part of the dataset. Now, when it comes to data cleaning, we have two options:

    • Fill these missing values
    • Drop them

Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you’d need to be more cautious. In that case, it is advisable to analyze each column, establish standards for how missing values will be filled (using the mean, median, most frequent value, or domain-specific defaults), and then fill them.
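As a sketch of that column-by-column approach, here is what it might look like on a tiny synthetic frame (the column names mirror the dataset, but the values are made up): a numeric gap takes the median, and a categorical gap takes the most frequent value.

```python
import pandas as pd

# Synthetic example frame, for illustration only
df_small = pd.DataFrame({
    "subtotal": [1200.0, None, 3400.0, 2100.0],
    "store_primary_category": ["mexican", "thai", None, "thai"],
})

# Numeric column: fill with the median (robust to outliers)
df_small["subtotal"] = df_small["subtotal"].fillna(df_small["subtotal"].median())

# Categorical column: fill with the most frequent value
df_small["store_primary_category"] = df_small["store_primary_category"].fillna(
    df_small["store_primary_category"].mode().iloc[0]
)

print(df_small.isna().sum().sum())  # 0
```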

    To remove the NaNs, we will use the dropna() method from the pandas library. We are setting inplace=True to apply the changes directly to the DataFrame without needing to assign it again. Let’s see the dataset at this point.

     
[Screenshot: dataset before dropping NaNs]
     

    Here is the code.

    df.dropna(inplace=True)
    df.info()

     

    Here is the output.

     
[Screenshot: df.info() output after dropna]
     

    As you can see from the screenshot above, each column now has the same number of non-null values.

    Let’s see the dataset after all the changes.

     
[Screenshot: dataset after all cleaning steps]
     

     

    // What Can You Do Next?

    Now that we have a clean dataset, here are a few things you can do next:

    • Perform EDA to understand delivery patterns.
    • Engineer new features like delivery hours or busy dashers ratio to add more meaning to your analysis.
    • Analyze correlations between variables to increase your model’s performance.
    • Build different regression models and find the best-performing model.
    • Predict the delivery duration with the best-performing model.
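For instance, the target variable itself, delivery duration, falls out of the two datetime columns we fixed earlier. A sketch on two synthetic orders (the column names match the dataset; real code would use the cleaned df from above):

```python
import pandas as pd

# Two synthetic orders, for illustration only
orders = pd.DataFrame({
    "created_at": pd.to_datetime(["2015-02-06 22:24:17", "2015-02-10 21:49:25"]),
    "actual_delivery_time": pd.to_datetime(["2015-02-06 23:27:16", "2015-02-10 22:56:29"]),
})

# Total delivery duration in minutes, the quantity a model would predict
orders["delivery_minutes"] = (
    (orders["actual_delivery_time"] - orders["created_at"]).dt.total_seconds() / 60
)

print(orders["delivery_minutes"].round(1).tolist())  # [63.0, 67.1]
```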

     

    # Final Thoughts

     
In this article, we cleaned a real-life dataset from DoorDash by addressing common data quality issues, such as incorrect data types and missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.

    Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these issues. Thanks for reading!
     
     

Nate Rosidi is a data scientist working in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for interviews with real questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.



