
    Data Cleaning at the Command Line for Beginner Data Scientists


    Table of Contents

    • # Introduction
    • # Setting Up Sample Data
    • # 1. Exploring Your Data with head, tail, and wc
    • # 2. Viewing Specific Columns with cut
    • # 3. Removing Duplicate Rows with sort and uniq
    • # 4. Searching and Filtering with grep
    • # 5. Trimming Whitespace with sed
    • # 6. Replacing Values with sed
    • # 7. Counting and Summarizing with awk
    • # 8. Combining Commands with Pipes
    • # 9. Converting Data Formats
    • # 10. A Complete Cleaning Pipeline
    • # Conclusion

    # Introduction

     
    When you think about data cleaning, you probably think of spinning up a Jupyter Notebook. But here’s something that might surprise you: some of the most powerful data cleaning can happen right in your terminal, using tools that are already installed on your system.

    In this article, you will learn how to use basic command-line utilities to clean, transform, and explore data files. No installations are required; just your terminal and some CSV files.

    Before we get started cleaning data at the command line, let’s talk about why this matters:

    • Command-line tools are easy to use, fast, and efficient, especially for large files.
    • These tools are built into Linux/macOS and are available on Windows.
    • They are great for getting a first look at data before loading and analyzing it with Python.
    • It is easy to chain commands together in scripts and use these scripts for automation.

    Now, let’s start coding!

    Note: You can find all the commands in this Bash script on GitHub. To get the most out of this article, I encourage you to open your terminal and code along.

     

    # Setting Up Sample Data

     
    Let’s create a messy CSV file to work with. This simulates real-world data issues you will likely run into.

    cat > messy_data.csv << 'EOF'
    name,age,salary,department,email
    John Lee,32,50000,Engineering,john@example.com
    Jane Smith,28,55000,Marketing,jane@example.com
       Bob Davis    ,35,60000,Engineering,bob@example.com
    Alice Williams,29,,Marketing,alice@example.com
    Charlie Brown,45,70000,Sales,charlie@example.com
    Dave Wilson,31,52000,Engineering,
    Emma Davis,,58000,Marketing,emma@example.com
    Frank Miller,38,65000,Sales,frank@example.com
    John Lee,32,50000,Engineering,john@example.com
    Grace Lee,27,51000,Engineering,grace@example.com
    EOF

     

    This dataset has several common issues: leading and trailing whitespace, missing values, and duplicate rows. It is great for learning!

     

    # 1. Exploring Your Data with head, tail, and wc

     
    Before cleaning your data, you need to understand what you are working with. Let’s start with the basics.

    # See the first 5 rows (including header)
    head -n 5 messy_data.csv
    
    # See the last 3 rows
    tail -n 3 messy_data.csv
    
    # Count total rows (including header)
    wc -l messy_data.csv

     

    Here is what is happening:

    • head -n 5 shows the first 5 lines, giving you a quick preview.
    • tail -n 3 shows the last 3 lines (useful for checking if data is complete).
    • wc -l counts lines — subtract 1 for the header to get your record count.

    Output:

    name,age,salary,department,email
    John Lee,32,50000,Engineering,john@example.com
    Jane Smith,28,55000,Marketing,jane@example.com
       Bob Davis    ,35,60000,Engineering,bob@example.com
    Alice Williams,29,,Marketing,alice@example.com
    Frank Miller,38,65000,Sales,frank@example.com
    John Lee,32,50000,Engineering,john@example.com
    Grace Lee,27,51000,Engineering,grace@example.com
    11 messy_data.csv
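
    Before moving on, one more quick structural check is worth running: confirm that every row has the same number of fields. Here is a short awk sketch (NF is awk's built-in field count, NR the line number):

    # Flag any row whose field count differs from the expected 5 columns
    awk -F',' 'NF != 5 {print "Line " NR ": " NF " fields"}' messy_data.csv

    For our sample file this prints nothing, which is exactly what you want: every row, even the ones with empty values, still has five comma-separated fields.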

     

    # 2. Viewing Specific Columns with cut

     
    You do not always need to see all the columns. Let’s extract only the names and departments.

    cut -d',' -f1,4 messy_data.csv

     

    Breaking it down:

    • cut is a tool for extracting sections from each line.
    • -d',' sets the delimiter to a comma (for CSV files).
    • -f1,4 selects fields (columns) 1 and 4.
    • You can also use ranges: -f1-3 for columns 1 through 3.

    Here is the output:

    name,department
    John Lee,Engineering
    Jane Smith,Marketing
       Bob Davis    ,Engineering
    Alice Williams,Marketing
    Charlie Brown,Sales
    Dave Wilson,Engineering
    Emma Davis,Marketing
    Frank Miller,Sales
    John Lee,Engineering
    Grace Lee,Engineering
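
    If the raw comma-separated output is hard to scan, most systems also ship the column utility, which can align fields into a table. This is purely a readability aid, assuming column is available on your system:

    # Align the extracted fields into readable columns
    cut -d',' -f1,4 messy_data.csv | column -s',' -t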

     

    # 3. Removing Duplicate Rows with sort and uniq

     
    Notice that “John Lee” appears twice in our dataset. Let’s fix that.

    # Save the header first
    head -n 1 messy_data.csv > cleaned_data.csv
    
    # Remove duplicates from the data (excluding header)
    tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv

     

    Here is what each command does:

    • head -n 1 grabs just the header row.
    • tail -n +2 gets everything starting from line 2, skipping the header.
    • sort sorts the lines first, because uniq only removes adjacent duplicate lines, so it needs sorted input.
    • >> appends to the file (versus >, which overwrites).
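
    Since sort and uniq appear back to back here, you can usually collapse them into one step: sort's standard -u flag de-duplicates while sorting.

    # Equivalent shortcut: sort -u sorts and removes duplicates in one pass
    tail -n +2 messy_data.csv | sort -u >> cleaned_data.csv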

     

    # 4. Searching and Filtering with grep

     
    Let’s now do some searching and filtering. Want to find all engineers, or filter out rows with missing data? grep handles both.

    # Find all engineers
    grep "Engineering" messy_data.csv
    
    # Find rows with empty fields (two consecutive commas)
    grep ",," messy_data.csv
    
    # Exclude rows with missing data
    grep -v ",," messy_data.csv > no_missing.csv

     

    Here, grep "pattern" searches for lines containing that pattern. grep -v inverts the match (shows lines that DO NOT match). This is a quick way to filter out incomplete records, provided the missing value results in a double comma (,,).
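
    One caveat: two consecutive commas only flag an empty field in the middle of a row. A row whose last field is empty, like Dave Wilson’s missing email, just ends in a single trailing comma, so it slips through. grep’s -E flag lets you combine both patterns:

    # Exclude rows with an empty middle field or an empty last field
    grep -vE ",,|,$" messy_data.csv > no_missing.csv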

     

    # 5. Trimming Whitespace with sed

     
    See how the record of “Bob Davis” has extra spaces? Let’s clean that up.

    sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' messy_data.csv > trimmed_data.csv

     

    Now let’s understand the command: sed is a stream editor for text transformation. s/pattern/replacement/ is the substitution syntax. ^[ \t]* matches spaces/tabs at the start of a line, and [ \t]*$ matches them at the end. Those two alone would miss the padding in “Bob Davis    ,” because it sits in the middle of the line rather than at its edge, so the third expression, [ \t]*,[ \t]*, also trims spaces/tabs around every comma. The semicolons separate the operations.

     

    # 6. Replacing Values with sed

     
    Sometimes you need to standardize values or fix typos. Let’s replace all occurrences of “Engineering” with “Tech”.

    # Replace all "Engineering" with "Tech"
    sed 's/Engineering/Tech/g' messy_data.csv
    

     

    Next, let’s fill empty email fields (denoted by a comma at the end of the line) with a default email value.

    # Replace empty email fields with "no-email@example.com"
    sed 's/,$/,no-email@example.com/' messy_data.csv

     

    Run the above commands and observe the output. I have excluded the output here to avoid being repetitive.

    Breaking it down:

    • The g flag means “global” — replace all occurrences on each line.
    • ,$ matches a comma at the end of a line (indicating an empty last field).
    • You can chain multiple replacements with ; between them.
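
    By default, sed writes to standard output and leaves the input file untouched. To apply a change in place, GNU sed supports the -i flag (on macOS/BSD sed, pass an empty backup suffix: -i ''):

    # Edit the file in place (GNU sed; use -i '' on macOS/BSD)
    sed -i 's/Engineering/Tech/g' messy_data.csv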

     

    # 7. Counting and Summarizing with awk

     
    awk is super useful for field-based operations. Let’s do some basic analysis.

    # Count records by department
    tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c
    
    # Calculate average age (excluding header and empty values)
    tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'

     

    In this awk command, -F',' sets the field separator to a comma, and $2 refers to the second field (age). The condition if($2) ensures only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block executes after all lines are read to calculate and print the average age.

    Output:

          5 Engineering
          3 Marketing
          2 Sales
    Average age: 33
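
    awk’s associative arrays turn grouped summaries into one-liners, too. As a sketch, here is the average salary per department; rows with an empty salary field, like Alice’s, are skipped:

    tail -n +2 messy_data.csv | awk -F',' '
      $3 != "" { sum[$4] += $3; count[$4]++ }
      END { for (d in sum) printf "%s: %.0f\n", d, sum[d]/count[d] }
    '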
    

     

    # 8. Combining Commands with Pipes

     
    These tools become far more powerful when you chain them together with pipes.

    # Get unique departments, sorted alphabetically
    tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq
    
    # Find engineers with salary > 55000
    tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3
    
    # Count employees per department with counts
    tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn

     

    Here, each | passes the output of one command as input to the next. This lets you build complex data transformations step by step. The last step sorts by count in reverse numerical order (-rn).

    This outputs:

    Engineering
    Marketing
    Sales
       Bob Davis    ,60000
          5 Engineering
          3 Marketing
          2 Sales
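
    sort itself can also key on a specific field, which fits these pipelines naturally. For example, to find the highest-paid employee: -t',' sets the delimiter, -k3,3 picks the salary column, and -rn sorts it numerically in reverse.

    # Show the highest-paid employee
    tail -n +2 messy_data.csv | sort -t',' -k3,3 -rn | head -n 1

    With our sample data, this prints Charlie Brown’s row.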

     

    # 9. Converting Data Formats

     
    Sometimes you need to work with different delimiters. Here, we convert the comma-separated file to a tab-separated one.

    # Convert CSV to TSV (tab-separated)
    sed 's/,/\t/g' messy_data.csv > data.tsv
    
    # Add a new column with a fixed value
    awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv

     

    In this awk command, BEGIN{OFS=","} sets the output field separator to a comma. $0 represents the entire input line, and print $0, "2024" appends “2024” as a new column to each line of output.
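
    Two refinements, offered as sketches: not every sed expands \t in the replacement text (GNU sed does; some BSD versions do not), so tr is the more portable choice for a simple character swap. Also, the awk one-liner above appends “2024” to the header row as well; an NR==1 branch lets you give the new column a proper name instead.

    # Portable CSV-to-TSV conversion: tr swaps every comma for a tab
    tr ',' '\t' < messy_data.csv > data.tsv

    # Name the new column in the header row, then append the value to data rows
    awk -F',' 'BEGIN{OFS=","} NR==1 {print $0, "year"; next} {print $0, "2024"}' messy_data.csv > data_with_year.csv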

     

    # 10. A Complete Cleaning Pipeline

     
    Let’s put it all together into a single pipeline that cleans our messy data:

    # Save header
    head -n 1 messy_data.csv > final_clean.csv
    
    # Clean the data: remove duplicates, trim whitespace, exclude missing values
    tail -n +2 messy_data.csv | \
      sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' | \
      grep -v ",," | \
      sort | \
      uniq >> final_clean.csv
    
    echo "Cleaning complete! Check final_clean.csv"

     

    This pipeline first saves the header to preserve the column names, then skips it while processing the data rows. It trims whitespace at the ends of each line and around every comma, removes any rows containing empty fields (specifically double commas), sorts the data, and eliminates duplicate entries. Finally, it appends the cleaned data to the output file.
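
    The introduction promised that these commands are easy to script for automation, so here is one way to package the pipeline. This is a minimal sketch: the script name clean_csv.sh and the argument names are hypothetical, and it borrows the extended grep pattern from the filtering section so rows with an empty last field are dropped too.

    #!/usr/bin/env bash
    # clean_csv.sh -- the cleaning pipeline as a reusable script (hypothetical name)
    # Usage: ./clean_csv.sh input.csv output.csv
    set -euo pipefail

    input="${1:?usage: $0 input.csv output.csv}"
    output="${2:?usage: $0 input.csv output.csv}"

    # Preserve the header, then trim, filter, sort, and de-duplicate the data rows
    head -n 1 "$input" > "$output"
    tail -n +2 "$input" \
      | sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' \
      | grep -vE ',,|,$' \
      | sort -u >> "$output"

    echo "Cleaned $(( $(wc -l < "$output") - 1 )) records into $output"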

     

    # Conclusion

     
    Command-line data cleaning is a powerful yet underrated skill for data scientists. These tools are fast and reliable. While you will still use Python for complex analysis, mastering these basics will make you more efficient and give you options when Python isn’t ideal.

    The best part is that these skills transfer to data engineering, DevOps, and system administration roles. Learning to manipulate data at the command line makes you a more versatile developer.

    Start practicing with your own datasets, and you will be surprised how often you reach for these tools instead of spinning up Python notebooks. Happy data cleaning!
     
     

    Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


