    Testing SQL Like a Software Engineer: Unit Testing, CI/CD, and Data Quality Automation

    May 5, 2026



     

    Table of Contents

    • # Introduction
    • # Step 1: Solving an Interview-Style SQL Question
        • // Understanding the Problem
        • // Understanding the Dataset
        • // Writing the SQL Solution
        • // Defining the Expected Output
    • # Step 2: Making the SQL Logic Reliable with Unit Tests
        • // Turning the Query into a Reusable Component
        • // Defining Test Input and Expected Output
        • // Writing SQL Unit Tests
    • # Step 3: Automating SQL Tests with Continuous Integration and Continuous Deployment
        • // Organizing the Project
        • // Creating the GitHub Actions Workflow
        • // Adding the Workflow Steps
    • # Step 4: Automating Data Quality
        • // Understanding Why Data Quality Checks Matter for SQL Workflows
        • // Turning Data Assumptions into Automated Rules
        • // Converting Rules into an Automated Check
    • # Concluding Remarks

    # Introduction

     
    Everyone focuses on writing SQL that “works,” but very few test whether it keeps working tomorrow. A single new row, a changed assumption, or a refactor can break a query silently. This article walks through a complete workflow, showing how to treat SQL like software: versioned, tested, and automated. We’ll use a real Amazon interview question about identifying customers with the highest daily spending. Then we will convert the SQL into a testable component, define expected outputs, and automate testing with continuous integration and continuous deployment (CI/CD).

     

    # Step 1: Solving an Interview-Style SQL Question

     

    // Understanding the Problem


    In this interview question from Amazon, you are asked to find the customers with the highest daily total order cost within a given date range.

     

    // Understanding the Dataset

    There are two data tables in this project: customers and orders.

    The customers table stores one row per customer; the query relies on its id and first_name columns. The orders table stores one row per order; the query relies on its cust_id, order_date, and total_order_cost columns.

    This problem is perfect for illustrating how SQL can be treated like software: the query must be correct, stable, and resistant to regressions.

     

    // Writing the SQL Solution

    The logic breaks down into three parts:

    1. Aggregate each customer’s total spending per day
    2. Rank customers by total spending for each date
    3. Return only the daily top spenders

    Here is the final PostgreSQL solution:

    WITH customer_daily_totals AS (
      SELECT
        o.cust_id,
        o.order_date,
        SUM(o.total_order_cost) AS total_daily_cost
      FROM orders o
      WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
      GROUP BY o.cust_id, o.order_date
    ),
    ranked_daily_totals AS (
      SELECT
        cust_id,
        order_date,
        total_daily_cost,
        RANK() OVER (
          PARTITION BY order_date
          ORDER BY total_daily_cost DESC
        ) AS rnk
      FROM customer_daily_totals
    )
    SELECT
      c.first_name,
      rdt.order_date,
      rdt.total_daily_cost AS max_cost
    FROM ranked_daily_totals rdt
    JOIN customers c ON rdt.cust_id = c.id
    WHERE rdt.rnk = 1
    ORDER BY rdt.order_date;

     

    // Defining the Expected Output

    The expected output contains one row per order_date: the top spender's first_name, the order_date, and that day's highest daily total, aliased max_cost.

    At this stage, most people stop.

     

    # Step 2: Making the SQL Logic Reliable with Unit Tests

     
    SQL breaks more easily than most people think. A changed default, a renamed column, or a new data source can introduce silent errors. Testing protects you from these issues. We will cover three steps: converting the logic into a reusable component, defining the expected output, and writing a unit test suite.

     

    // Turning the Query into a Reusable Component

    To test the SQL code, we begin by storing the query in a Python variable so it can be executed against a test database by a lightweight framework like Python's built-in unittest. First, we define the query that we want to test:

        query = """
            WITH customer_daily_totals AS (
              SELECT
                o.cust_id,
                o.order_date,
                SUM(o.total_order_cost) AS total_daily_cost
              FROM orders o
              WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
              GROUP BY o.cust_id, o.order_date
            ),
            ranked_daily_totals AS (
              SELECT
                cust_id,
                order_date,
                total_daily_cost,
                RANK() OVER (
                  PARTITION BY order_date
                  ORDER BY total_daily_cost DESC
                ) AS rnk
              FROM customer_daily_totals
            )
            SELECT
              c.first_name,
              rdt.order_date,
              rdt.total_daily_cost AS max_cost
            FROM ranked_daily_totals rdt
            JOIN customers c ON rdt.cust_id = c.id
            WHERE rdt.rnk = 1
            ORDER BY rdt.order_date;
        """

     

    // Defining Test Input and Expected Output

    Next, we must create a controlled sample dataset to test against.

    test_customers = [
        (15, "Mia"),
        (7, "Jill"),
        (3, "Farida")
    ]
    
    test_orders = [
        (1, 3, "2019-03-04", 100),
        (2, 3, "2019-03-01", 80),
        (4, 7, "2019-02-01", 25),
        (6, 15, "2019-02-01", 100)
    ]

     

    We also create the expected output:

    expected = [
        ("Mia", "2019-02-01", 100),
        ("Farida", "2019-03-01", 80),
        ("Farida", "2019-03-04", 100)
    ]

     

    Why? Because defining the expected output creates a fixed benchmark that the query's actual result can be checked against automatically.

     

    // Writing SQL Unit Tests

    Now we have the query defined, the test inputs, and the expected outputs. We can write an actual unit test. The idea is simple:

    1. Create an isolated, in-memory database
    2. Load controlled test data
    3. Execute the SQL query
    4. Assert that the result obtained matches the expected output


    Python’s built-in unittest framework is highly effective because it allows us to keep dependencies minimal while providing structure and repeatability. We start by creating an in-memory SQLite database:

    conn = sqlite3.connect(":memory:")
    cursor = conn.cursor()

     

    Using :memory: ensures that:

    • the test database is fully isolated
    • no external state can affect the result
    • the database is discarded automatically once the test finishes

    Next, we recreate only the tables required by the query:

    CREATE TABLE customers (...)
    CREATE TABLE orders (...)
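The elided schema definitions might look like the minimal sketch below. These column names are taken from the query itself; everything else (types, the PRIMARY KEY choice) is an assumption for illustration, since the real tables carry more fields:

```python
import sqlite3

# Connection as created above
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical minimal schemas covering only the columns the query touches
cursor.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,   -- joined on rdt.cust_id = c.id
        first_name TEXT           -- returned in the final output
    )
""")
cursor.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        cust_id INTEGER,          -- references customers.id
        order_date TEXT,          -- ISO-8601 text compares correctly in SQLite
        total_order_cost REAL     -- aggregated with SUM()
    )
""")
```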

     

    The schemas only need to cover the columns the query touches, but mirroring the realistic production tables reduces the risk of false confidence caused by oversimplified schemas. We then insert the controlled test data defined earlier:

    cursor.executemany("INSERT INTO customers VALUES (?, ?)", test_customers)
    cursor.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", test_orders)
    conn.commit()

     

    At this point, the database contains a known, deterministic state, which is essential for meaningful tests. Before executing the query, we load and print the test tables using Pandas:

    customers_df = pd.read_sql("SELECT * FROM customers", conn)
    orders_df = pd.read_sql("SELECT * FROM orders", conn)

     

    While this step is not strictly required for automation, it is very useful during development and debugging: when a test fails, inspecting the input data immediately often pinpoints the problem faster than re-reading the SQL logic. Now we run the query under test:

    result = pd.read_sql(query, conn)

     

    The result is loaded into a DataFrame, which provides:

    • structured access to rows and columns
    • easy comparison with expected outputs
    • readable printing for debugging

    Next, we must verify the results row by row. The verification logic makes a manual assertion between the query output and the expected result:

    all_correct = True
    
    if len(result) != len(expected):
        all_correct = False

     

    The first check confirms whether the number of rows returned by the query matches what we expect. A mismatch here immediately indicates missing or extra records. Next, we iterate through the expected output and compare it to the actual query result row by row:

    for i, (fname, date, cost) in enumerate(expected):
        if i < len(result):
            actual = result.iloc[i]
            if not (
                actual["first_name"] == fname
                and actual["order_date"] == date
                and actual["max_cost"] == cost
            ):
                all_correct = False

     

    Each row is checked on all relevant dimensions:

    • customer name
    • order date
    • aggregated daily cost

    If any value differs from the expected, the test is marked as failed. Finally, the test result is summarized in a clear pass/fail message:

    if all_correct and len(result) == len(expected):
        print("ALL TESTS PASSED")
    else:
        print("SOME TESTS FAILED")

     

    The database connection is then closed:

    conn.close()

    If the tests pass, the script prints:

    ALL TESTS PASSED

    This test carries some assumptions worth noting:

    • a stable row order (ORDER BY order_date)
    • exact matches on all values
    • no tolerance for ties or duplicate winners per day
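The tie assumption is easy to see in isolation: with RANK(), two customers who tie on a day's total both receive rank 1. A toy sketch (not part of the original solution, table and names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cust TEXT, total REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("Mia", 100), ("Jill", 100), ("Farida", 80)])
rows = conn.execute(
    "SELECT cust, RANK() OVER (ORDER BY total DESC) AS rnk FROM t"
).fetchall()
winners = [cust for cust, rnk in rows if rnk == 1]
print(winners)   # both tied customers appear with rank 1
```

If ties are possible in production, the test must either allow multiple winners per day or the query must define a deterministic tie-breaker.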

    The full script, ready to be used, can be seen here.
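Putting the pieces together, a condensed, runnable version of the test might look like the sketch below. It uses plain sqlite3 rather than Pandas so it stays dependency-free, and it assumes the minimal two- and four-column schemas; run it with `python -m unittest`:

```python
import sqlite3
import unittest

QUERY = """
WITH customer_daily_totals AS (
  SELECT o.cust_id, o.order_date,
         SUM(o.total_order_cost) AS total_daily_cost
  FROM orders o
  WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
  GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
  SELECT cust_id, order_date, total_daily_cost,
         RANK() OVER (PARTITION BY order_date
                      ORDER BY total_daily_cost DESC) AS rnk
  FROM customer_daily_totals
)
SELECT c.first_name, rdt.order_date, rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN customers c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
"""

class TestDailyTopSpenders(unittest.TestCase):
    def test_top_spenders(self):
        # Isolated in-memory database with a known, deterministic state
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT)")
        conn.execute(
            "CREATE TABLE orders (id INTEGER, cust_id INTEGER,"
            " order_date TEXT, total_order_cost REAL)"
        )
        conn.executemany("INSERT INTO customers VALUES (?, ?)",
                         [(15, "Mia"), (7, "Jill"), (3, "Farida")])
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                         [(1, 3, "2019-03-04", 100), (2, 3, "2019-03-01", 80),
                          (4, 7, "2019-02-01", 25), (6, 15, "2019-02-01", 100)])
        # Execute the query under test and compare against the benchmark
        result = conn.execute(QUERY).fetchall()
        expected = [("Mia", "2019-02-01", 100),
                    ("Farida", "2019-03-01", 80),
                    ("Farida", "2019-03-04", 100)]
        self.assertEqual(result, expected)
        conn.close()
```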

     

    # Step 3: Automating SQL Tests with Continuous Integration and Continuous Deployment

     
    A test suite is only useful if it runs consistently whenever it is needed. We use CI/CD to run the tests automatically whenever a code change is made.

     

    // Organizing the Project

    A minimal repository structure can look like this:

    .github/workflows/test_sql.yml   (the CI workflow)
    tests/                           (unit test files discovered by unittest)
    requirements.txt                 (Python dependencies, e.g. pandas)

    // Creating the GitHub Actions Workflow

    The next step is to ensure these tests run automatically whenever the code changes. For this, we use GitHub Actions. This tool allows us to define a CI workflow that runs the SQL tests every time code is pushed or a pull request is opened.

    Create the workflow file: In your repository, create the following folder structure if it doesn’t already exist: .github/workflows/. Inside this folder, create a new file called test_sql.yml. The name is not special; GitHub only cares that the file lives inside the .github/workflows/ directory. You can name it anything, but test_sql.yml keeps things clear and simple.

    Define when the workflow should run: Here is the full workflow file:

    name: Run SQL Tests
    
    on:
      push:
        branches: [ "main" ]
      pull_request:
        branches: [ "main" ]

     

    This section defines when the workflow runs:

    • on every push to the main branch
    • on every pull request targeting main

    In practice, this means:

    • pushing directly to main will trigger the tests
    • opening or updating a pull request will also trigger the tests

    This helps catch SQL regressions before they get merged.

    Define the test job: Next, we define a job called test:

    jobs:
      test:
        runs-on: ubuntu-latest

     

    This tells GitHub to:

    • create a fresh Linux machine
    • run all test steps inside it

    Each workflow run starts from a clean environment, which prevents “it works on my machine” problems.

     

    // Adding the Workflow Steps

    Now we define the steps the machine should execute:

    - name: Checkout repository
      uses: actions/checkout@v4

     

    This step downloads your repository’s code into the runner so it can access your SQL files and tests.

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.10"

     

    This installs Python 3.10, ensuring a consistent runtime across all runs.

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

     

    This installs all required Python libraries (such as Pandas) defined in requirements.txt.

    - name: Run unit tests
      run: python -m unittest discover

     

    Finally, this command:

    • automatically discovers test files
    • runs all SQL tests defined in the tests/ folder
    • fails the workflow if any test fails

    The full workflow can be found here.
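Assembled from the fragments above, the complete test_sql.yml reads (the `steps:` key, standard GitHub Actions syntax, groups the four steps under the job):

```yaml
name: Run SQL Tests

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run unit tests
        run: python -m unittest discover
```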

    Running the workflow: You don’t need to run this file manually. Once committed:

    • pushing to main will trigger the workflow
    • opening a pull request will trigger the workflow

    You can view the results directly in GitHub by navigating to your repository’s Actions tab.

     
     

    Each run will show whether your SQL tests passed or failed.

     

    # Step 4: Automating Data Quality

     
    Unit tests confirm whether the logic still returns the expected output, and CI ensures these tests run automatically. But in real data environments, the input data itself can cause failures: late-arriving rows, malformed dates, missing keys, and unexpected duplicates can break queries long before the SQL logic does. This is where data quality automation comes in. Testing and versioning form a safety net for code changes; data quality automation extends that safety net to the data itself, preventing downstream issues before they impact results.

     

    // Understanding Why Data Quality Checks Matter for SQL Workflows

    In our interview problem, the following issues could make the query return incorrect results:

    • A customer’s first name is no longer unique.
    • An order arrives with a negative cost.
    • Dates fall outside the expected range.
    • Daily aggregates contain duplicate rows for the same customer and date.
    • A customer exists in orders but not in customers.

     
     

    Without automated checks, these issues may silently distort results. Because SQL doesn’t raise obvious exceptions in many of these scenarios, errors spread unnoticed. Automated data quality checks detect these issues early and prevent the pipeline from running with corrupted or incomplete data.

     

    // Turning Data Assumptions into Automated Rules

    Every SQL query relies on assumptions about the data. The problem is that these assumptions are rarely written down and almost never enforced. In our daily spenders query, correctness depends not only on SQL logic, but also on the shape and validity of the input data. Instead of trusting those assumptions implicitly, we can turn them into automated data quality rules. The idea is simple:

    • express each assumption as a SQL check
    • run those checks automatically
    • fail fast if any assumption is violated

    First names must be unique: Our query joins customers by ID, but returns first_name as an identifier. If first names are no longer unique, the output becomes ambiguous.

    SELECT first_name, COUNT(*)
    FROM customers
    GROUP BY first_name
    HAVING COUNT(*) > 1;

     

    If this query returns any rows, the assumption is broken.

    Order costs must be non-negative: Negative order values usually indicate ingestion or upstream transformation issues.

    SELECT *
    FROM orders
    WHERE total_order_cost < 0;

     

    Even a single row here invalidates financial aggregates.

    Order dates must be valid and within expectations: Dates that are missing or wildly out of range often reveal synchronization or parsing errors.

    SELECT *
    FROM orders
    WHERE order_date IS NULL
       OR order_date < '2010-01-01'
       OR order_date > CURRENT_DATE;

     

    This protects the query from silently including bad temporal data.

    Every order must reference a valid customer: If an order refers to a non-existent customer, joins will silently drop rows.

    SELECT o.*
    FROM orders o
    LEFT JOIN customers c ON c.id = o.cust_id
    WHERE c.id IS NULL;

     

    This rule ensures referential integrity before analytics logic runs.

     

    // Converting Rules into an Automated Check

    Instead of running these checks manually, we can wrap them into a single Python function that fails immediately if any rule is violated.

    import pandas as pd
    
    def run_data_quality_checks(conn):
        checks = {
            "Duplicate first names": """
                SELECT first_name
                FROM customers
                GROUP BY first_name
                HAVING COUNT(*) > 1;
            """,
            "Negative order costs": """
                SELECT *
                FROM orders
                WHERE total_order_cost < 0;
            """,
            "Invalid order dates": """
                SELECT *
                FROM orders
                WHERE order_date IS NULL
                   OR order_date < '2010-01-01'
                   OR order_date > CURRENT_DATE;
            """,
            "Orders without customers": """
                SELECT o.*
                FROM orders o
                LEFT JOIN customers c ON c.id = o.cust_id
                WHERE c.id IS NULL;
            """
        }
    
        for rule_name, query in checks.items():
            result = pd.read_sql(query, conn)
            if not result.empty:
                raise ValueError(f"Data quality check failed: {rule_name}")
    
        print("All data quality checks passed.")

     

    This function:

    • executes each rule
    • checks whether any rows are returned
    • raises an error immediately if a violation is found

    At this point, data quality rules behave just like unit tests: pass or fail. If the checks pass, the function prints: All data quality checks passed.

    Because the data quality checks run inside Python, they are automatically picked up by the existing GitHub Actions workflow:

    - name: Run unit tests
      run: python -m unittest discover

     

    The CI pipeline will stop immediately when a rule is violated, as long as:

    • the function is imported or executed by your test file
    • a failure raises an exception
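For example, a hypothetical tests/test_data_quality.py could exercise the rules through unittest discovery. The file name, the minimal schemas, and the use of plain sqlite3 instead of Pandas are all assumptions made to keep the sketch self-contained; the rule queries mirror the checks defined above:

```python
import sqlite3
import unittest

# The four data quality rules from the article, one SQL check per rule
CHECKS = {
    "Duplicate first names":
        "SELECT first_name FROM customers GROUP BY first_name HAVING COUNT(*) > 1",
    "Negative order costs":
        "SELECT * FROM orders WHERE total_order_cost < 0",
    "Invalid order dates":
        "SELECT * FROM orders WHERE order_date IS NULL "
        "OR order_date < '2010-01-01' OR order_date > CURRENT_DATE",
    "Orders without customers":
        "SELECT o.* FROM orders o LEFT JOIN customers c ON c.id = o.cust_id "
        "WHERE c.id IS NULL",
}

class TestDataQuality(unittest.TestCase):
    def setUp(self):
        # Minimal schemas assumed; a real test would point at staging data
        self.conn = sqlite3.connect(":memory:")
        cur = self.conn.cursor()
        cur.execute("CREATE TABLE customers (id INTEGER, first_name TEXT)")
        cur.execute("CREATE TABLE orders (id INTEGER, cust_id INTEGER,"
                    " order_date TEXT, total_order_cost REAL)")
        cur.executemany("INSERT INTO customers VALUES (?, ?)",
                        [(3, "Farida"), (7, "Jill")])
        cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                        [(1, 3, "2019-03-04", 100), (4, 7, "2019-02-01", 25)])
        self.conn.commit()

    def tearDown(self):
        self.conn.close()

    def test_all_rules_pass_on_clean_data(self):
        # Each rule must return zero rows; any row is a violation
        for rule_name, query in CHECKS.items():
            rows = self.conn.execute(query).fetchall()
            self.assertEqual(rows, [], f"Data quality check failed: {rule_name}")
```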

     

    # Concluding Remarks

     
    Most people stop once the SQL query produces a correct answer. But real data environments reward those who make their queries stable, testable, and version-controlled.

     
     

    Combining the following practices ensures the query continues to deliver reliable results, even as data changes over time:

    • a clear solution
    • a reusable component
    • unit tests
    • automated CI

    Correctness is good, but reliability is essential.
     
     

    Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


