
    How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset

    By gvfx00@gmail.com · October 17, 2025 · 6 Mins Read


    Image by Editor

     


    # Introduction

     
    According to a CrowdFlower survey, data scientists spend about 60% of their time organizing and cleaning data.

    In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each of which includes dozens of features such as delivery time, total items, and store category (e.g., Mexican, Thai, or American cuisine).

     

    # Predicting Food Delivery Times with DoorDash Data

     
    [Image: Predicting Food Delivery Times with DoorDash Data]
     
    DoorDash aims to estimate the time it takes to deliver food accurately, from the moment a customer places an order to the time it arrives at their door. In this data project, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.

    However, we won’t do the whole project—i.e., we won’t build a predictive model. Instead, we’ll use the dataset provided in the project and create a data cleaning pipeline.

    Our workflow consists of two major steps: exploring the data, then building the cleaning pipeline.

     
    [Image: Data Cleaning Pipeline]
     

     

    # Data Exploration

     
    [Image: Data Cleaning Pipeline]
     

    Let’s start by loading and viewing the first few rows of the dataset.

     

    // Load and Preview the Dataset

    import pandas as pd
    df = pd.read_csv("historical_data.csv")
    df.head()

     

    Here is the output.

     
    [Image: Data Cleaning Pipeline]
     

    This dataset includes datetime columns that capture the order creation time and actual delivery time, which can be used to calculate delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We can already see that there are some NaN values, which we’ll explore more closely in the following step.

     

    // Explore The Columns With info()

    Let’s inspect all the columns with the info() method. We will use this method throughout the article to track how the non-null counts change; it’s a good indicator of missing data and overall data health.

    df.info()

    Here is the output.

     
    [Image: Data Cleaning Pipeline]
     

    As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which could affect our analysis if not handled properly. One last thing: the created_at and actual_delivery_time data types are objects; these should be datetime.

     

    # Building the Data Cleaning Pipeline

     
    In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses common issues such as timestamp formatting, missing values, and irrelevant features.
     
    [Image: Building the Data Cleaning Pipeline]
     

    // Fixing the Date and Time Columns Data Types

    Before doing any data analysis, we need to fix the columns that represent time. Otherwise, the duration calculation we mentioned (actual_delivery_time - created_at) will go wrong, because you can’t subtract one string from another.

    What we’re fixing:

    • created_at: when the order was placed
    • actual_delivery_time: when the food arrived

    These two columns are stored as objects, so to do calculations on them correctly, we have to convert them to the datetime format with pandas’ to_datetime() function. Here is the code.

    import pandas as pd
    df = pd.read_csv("historical_data.csv")
    # Convert timestamp strings to datetime objects
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
    df.info()

     

    Here is the output.

     
    [Image: Building the Data Cleaning Pipeline]
     

    As you can see from the screenshot above, the created_at and actual_delivery_time are datetime objects now.

     
    [Image: Building the Data Cleaning Pipeline]
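    With the timestamps converted, the delivery duration becomes a simple subtraction. Here is a minimal sketch, using a hypothetical two-row sample in place of the real file:

    ```python
    import pandas as pd

    # Hypothetical two-row sample standing in for historical_data.csv
    df = pd.DataFrame({
        "created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25"],
        "actual_delivery_time": ["2015-02-06 23:27:16", "2015-02-10 22:56:29"],
    })
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")

    # Datetime subtraction yields a Timedelta; convert it to minutes
    df["delivery_minutes"] = (
        df["actual_delivery_time"] - df["created_at"]
    ).dt.total_seconds() / 60
    print(df["delivery_minutes"].round(1).tolist())  # [63.0, 67.1]
    ```

    The same two lines would work on the full dataset once the conversion above has been applied.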
     

    Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we’ll focus on cleaning it first.

     

    // Data Imputation With mode()

    One of the messiest columns in the dataset, evident from its high number of missing values, is store_primary_category. It tells us what kind of food stores are available, like Mexican, American, and Thai. However, many rows are missing this information, which is a problem. For instance, it can limit how we can group or analyze the data. So how do we fix it?

    We will fill these rows instead of dropping them, using a smarter imputation strategy: build a mapping from each store_id to its most frequent category, then use that mapping to fill in the missing values, falling back to the global mode for stores with no known category. Let’s see the dataset before doing that.

     
    [Image: Data Imputation With mode]
     

    Here is the code.

    import numpy as np
    
    # Global most-frequent category as a fallback
    global_mode = df["store_primary_category"].mode().iloc[0]
    
    # Build store-level mapping to the most frequent category (fast and robust)
    store_mode = (
        df.groupby("store_id")["store_primary_category"]
          .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
    )
    
    # Fill missing categories using the store-level mode, then fall back to global mode
    df["store_primary_category"] = (
        df["store_primary_category"]
          .fillna(df["store_id"].map(store_mode))
          .fillna(global_mode)
    )
    
    df.info()

     

    Here is the output.

     
    [Image: Data Imputation With mode]
     

    As you can see from the screenshot above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

    df["store_primary_category"].isna().sum()

     

    Here is the output showing the number of NaN values. It’s zero; we got rid of all of them.

     
    [Image: Data Imputation With mode]
     

    And let’s see the dataset after the imputation.

     
    [Image: Data Imputation With mode]

     

    // Dropping Remaining NaNs

    In the previous step, we corrected the store_primary_category, but did you notice something? The non-null counts across the columns still don’t match!

    This is a clear sign that we’re still dealing with missing values in some part of the dataset. Now, when it comes to data cleaning, we have two options:

    • Fill these missing values
    • Drop them

    Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you’d need to be more cautious. In that case, it is advisable to analyze each column, establish standards (decide how missing values will be filled—using the mean, median, most frequent value, or domain-specific defaults), and then fill them.
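    For illustration, a column-wise fill on a small hypothetical frame might look like this (median for a numeric column, most frequent value for a categorical one):

    ```python
    import pandas as pd
    import numpy as np

    # Hypothetical mini-frame with one numeric and one categorical gap
    small = pd.DataFrame({
        "subtotal": [1200.0, np.nan, 3400.0, 2100.0],
        "store_primary_category": ["mexican", None, "thai", "thai"],
    })

    # Numeric column: fill with the median of the observed values
    small["subtotal"] = small["subtotal"].fillna(small["subtotal"].median())

    # Categorical column: fill with the most frequent value (the mode)
    small["store_primary_category"] = small["store_primary_category"].fillna(
        small["store_primary_category"].mode().iloc[0]
    )
    print(small.isna().sum().sum())  # 0
    ```

    In our case the dataset is large enough that dropping is the simpler, safer choice.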

    To remove the NaNs, we will use the dropna() method from the pandas library. We are setting inplace=True to apply the changes directly to the DataFrame without needing to assign it again. Let’s see the dataset at this point.

     
    [Image: Dropping NaNs]
     

    Here is the code.

    df.dropna(inplace=True)
    df.info()

     

    Here is the output.

     
    [Image: Dropping NaNs]
     

    As you can see from the screenshot above, each column now has the same number of non-null values.

    Let’s see the dataset after all the changes.

     
    [Image: Dropping NaNs]
     

     

    // What Can You Do Next?

    Now that we have a clean dataset, here are a few things you can do next:

    • Perform EDA to understand delivery patterns.
    • Engineer new features like delivery hours or busy dashers ratio to add more meaning to your analysis.
    • Analyze correlations between variables to increase your model’s performance.
    • Build different regression models and find the best-performing model.
    • Predict the delivery duration with the best-performing model.
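    As a sketch of the feature-engineering idea: a busy dashers ratio could be derived roughly like this, assuming the dataset has total_busy_dashers and total_onshift_dashers columns (treat the names as an assumption if your copy differs):

    ```python
    import pandas as pd
    import numpy as np

    # Hypothetical sample; the two dasher columns are assumed to exist in the dataset
    df = pd.DataFrame({
        "total_onshift_dashers": [10, 8, 0],
        "total_busy_dashers": [7, 8, 0],
    })

    # Share of on-shift dashers that are busy; map a zero denominator to NaN
    # to avoid division by zero
    df["busy_dashers_ratio"] = (
        df["total_busy_dashers"] / df["total_onshift_dashers"].replace(0, np.nan)
    )
    print(df["busy_dashers_ratio"].tolist())
    ```

    A ratio near 1 suggests the local fleet is saturated, which plausibly correlates with longer delivery times.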

     

    # Final Thoughts

     
    In this article, we cleaned a real-life dataset from DoorDash by addressing common data quality issues, such as incorrect data types and missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.

    Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these issues. Thanks for reading!
     
     

    Nate Rosidi is a data scientist and product strategist. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.



