Skip to content
Close Menu

    Subscribe to Updates

    Get the latest news from tastytech.

    What's Hot

    Israel fetes Somaliland’s leader as it seeks to expand Red Sea influence | Border Disputes News

    June 20, 2026

    Practical SQL Tricks Every Data Scientist Should Know

    June 20, 2026

    Apple patches eavesdropping vulnerability in Beats Studio Buds

    June 20, 2026
    Facebook X (Twitter) Instagram
    Facebook X (Twitter) Instagram
    tastytech.intastytech.in
    Subscribe
    • AI News & Trends
    • Tech News
    • AI Tools
    • Business & Startups
    • Guides & Tutorials
    • Tech Reviews
    • Automobiles
    • Gaming
    • movies
    tastytech.intastytech.in
    Home»Business & Startups»Practical SQL Tricks Every Data Scientist Should Know
    Practical SQL Tricks Every Data Scientist Should Know
    Business & Startups

    Practical SQL Tricks Every Data Scientist Should Know

    gvfx00@gmail.comBy gvfx00@gmail.comJune 20, 2026No Comments11 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email



     

    Table of Contents

    Toggle
    • # Introduction
    • # Setting Up the Dataset
    • # 1. Measuring Time Between Events with LAG()
    • # 2. Comparing a Row to Other Rows in the Same Table with a Self-Join
    • # 3. Selecting the Top Row per Group with ROW_NUMBER()
    • # 4. Segmenting Customers by Spend with NTILE(n)
    • # 5. Smoothing Noisy Data with a Rolling Window
    • # 6. Aggregating Conditionally with FILTER
    • # 7. Detecting Consecutive Activity Streaks with Window Functions
    • # Quick Reference
      • Related posts:
    • 7 Key Benefits Of Using Natural Language Processing In Business
    • What Actually Improved and What Still Breaks
    • Every Feature Explained for Developers

    # Introduction

     
    Focusing only on SELECT, WHERE, and GROUP BY is enough for basic aggregation, but many real analytical tasks require patterns that go beyond simple queries. Examples include detecting consecutive activity streaks, segmenting customers by spend tier, smoothing noisy time-series data, or tracing plan upgrade paths across rows.

    This article walks through 7 practical SQL patterns beyond the basics, focusing on techniques that solve real analytical problems.

     

    # Setting Up the Dataset

     
    We’ll use a sample customer transactions table from a fictional subscription software as a service (SaaS) company:

    CREATE TABLE transactions (
        transaction_id   SERIAL PRIMARY KEY,
        customer_id      INT,
        plan_type        VARCHAR(20),   -- 'starter', 'pro', 'enterprise'
        amount           NUMERIC(10,2),
        status           VARCHAR(20),   -- 'completed', 'refunded', 'failed'
        created_at       TIMESTAMP
    );

     

    The full dataset of 36 transactions across 7 customers, spanning September 2023 through June 2024, is available in seed.sql. Run it before you move on to the queries.

     

    # 1. Measuring Time Between Events with LAG()

     
    LAG() and LEAD() let you access a previous or next row’s value without a self-join. They’re particularly useful for calculating gaps between events like renewal cadence, churn signals, and re-engagement delays.

    Task: Calculate how many days elapsed between each customer’s successive completed transactions.

    SELECT
        customer_id,
        created_at,
        LAG(created_at) OVER (
            PARTITION BY customer_id
            ORDER BY created_at
        ) AS previous_transaction_at,
        ROUND(
            EXTRACT(EPOCH FROM (
                created_at - LAG(created_at) OVER (
                    PARTITION BY customer_id
                    ORDER BY created_at
                )
            )) / 86400
        ) AS days_since_last
    FROM transactions
    WHERE status="completed"
    ORDER BY customer_id, created_at;

     

    Output (truncated):

    customer_id |     created_at      | previous_transaction_at | days_since_last
    -------------+---------------------+-------------------------+-----------------
            3317 | 2024-01-03 11:02:00 |                         |
            3317 | 2024-03-15 10:45:00 | 2024-01-03 11:02:00     |              72
            3317 | 2024-05-22 09:30:00 | 2024-03-15 10:45:00     |              68
            4482 | 2023-09-10 09:00:00 |                         |
            4482 | 2023-10-10 09:00:00 | 2023-09-10 09:00:00     |              30
            4482 | 2023-11-10 09:14:00 | 2023-10-10 09:00:00     |              31
            4482 | 2024-01-03 09:14:00 | 2023-11-10 09:14:00     |              54
            4482 | 2024-03-03 08:20:00 | 2024-01-03 09:14:00     |              60
            4482 | 2024-04-03 10:00:00 | 2024-03-03 08:20:00     |              31
            4482 | 2024-05-01 11:00:00 | 2024-04-03 10:00:00     |              28
            ...
            7891 | 2024-02-01 09:00:00 |                         |
            7891 | 2024-04-01 09:00:00 | 2024-02-01 09:00:00     |              60
            7891 | 2024-05-15 09:00:00 | 2024-04-01 09:00:00     |              44
            8810 | 2024-01-05 12:00:00 |                         |
            8810 | 2024-02-05 12:00:00 | 2024-01-05 12:00:00     |              31
            8810 | 2024-04-05 12:00:00 | 2024-02-05 12:00:00     |              60
    (29 rows)

     

    The first row per customer always has NULL for both columns — there’s no prior event to reference. EXTRACT(EPOCH ...) converts the timestamp interval to seconds; dividing by 86400 gives days.

    LEAD() works the same way but looks forward instead of backward, making it useful for calculating time-to-next-renewal or flagging the last transaction before churn.

     

    # 2. Comparing a Row to Other Rows in the Same Table with a Self-Join

     
    A self-join relates rows within the same table to each other. It’s the right tool when you need to compare two events for the same entity across time — upgrades, downgrades, re-activations, or any before/after pattern.

    Task: Find customers who upgraded from starter to pro (or pro to enterprise) at any point.

    SELECT DISTINCT t1.customer_id
    FROM transactions t1
    JOIN transactions t2
        ON  t1.customer_id = t2.customer_id
        AND t1.plan_type="starter"
        AND t2.plan_type="pro"
        AND t2.created_at  > t1.created_at
    WHERE t1.status="completed"
      AND t2.status="completed"
    ORDER BY t1.customer_id;

     

    Output:

    customer_id
    -------------
            4482
            6204
            7891
    (3 rows)

     

    The table is aliased twice (t1, t2) so each alias can represent a different point in time for the same customer. The condition t2.created_at > t1.created_at enforces temporal order — without it, you’d match customers who simply had both plan types in any order, including the wrong one. DISTINCT collapses cases where a customer had multiple starter transactions before upgrading, which would otherwise produce duplicate rows.

    This same structure works for detecting downgrades, finding customers who churned and came back, or comparing any two states that need to be ordered by time.

     

    # 3. Selecting the Top Row per Group with ROW_NUMBER()

     
    When you need the top-N rows per category — highest transaction per customer, most recent event per account, first purchase per cohort — ROW_NUMBER() inside a common table expression (CTE) is the standard approach.

    Task: Get each customer’s single highest completed transaction.

    WITH ranked AS (
        SELECT
            customer_id,
            transaction_id,
            amount,
            plan_type,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY amount DESC, created_at DESC
            ) AS rn
        FROM transactions
        WHERE status="completed"
    )
    SELECT customer_id, transaction_id, amount, plan_type
    FROM ranked
    WHERE rn = 1
    ORDER BY customer_id;

     

    Output:

    customer_id  | transaction_id  | amount  | plan_type
    -------------+----------------+--------+------------
            3317 |             12 |  19.00 | starter
            4482 |              8 | 299.00 | enterprise
            5901 |             19 | 299.00 | enterprise
            6103 |             25 | 299.00 | enterprise
            6204 |             28 |  79.00 | pro
            7891 |             32 |  79.00 | pro
            8810 |             36 |  79.00 | pro
    (7 rows)

     

    ROW_NUMBER() assigns 1 to the row that sorts first within each partition. The outer query then filters to only those rows. The secondary sort on created_at DESC acts as a tiebreaker; when two transactions have the same amount, the more recent one wins.

    If you want ties included rather than broken, swap ROW_NUMBER() for RANK(). RANK() assigns the same number to tied rows and skips the next rank (1, 1, 3), while DENSE_RANK() does the same without skipping (1, 1, 2).

     

    # 4. Segmenting Customers by Spend with NTILE(n)

     
    NTILE(n) divides ordered rows into n roughly equal buckets and assigns each row a bucket number. It’s the right tool for customer tiering, spend quartiles, or building cohorts for A/B analysis without hardcoding thresholds.

    Task: Rank customers into spend quartiles based on their total completed transaction value.

    WITH customer_spend AS (
        SELECT
            customer_id,
            SUM(amount) AS total_spend,
            COUNT(*) AS total_transactions
        FROM transactions
        WHERE status="completed"
        GROUP BY customer_id
    )
    SELECT
        customer_id,
        total_spend,
        total_transactions,
        NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
    FROM customer_spend
    ORDER BY total_spend DESC;

     

    Output:

    customer_id | total_spend | total_transactions | spend_quartile
    -------------+-------------+--------------------+----------------
            5901 |     1495.00 |                  5 |              4
            6103 |      835.00 |                  5 |              3
            4482 |      653.00 |                  7 |              3
            8810 |      237.00 |                  3 |              2
            6204 |      177.00 |                  3 |              2
            7891 |      177.00 |                  3 |              1
            3317 |       57.00 |                  3 |              1
    (7 rows)

     

    Quartile 4 is your highest spenders; quartile 1 is your lowest. NTILE() doesn’t hardcode spend thresholds, so the buckets recalibrate automatically as new customers are added. This makes it more robust than static cutoffs like CASE WHEN total_spend > 500.

     

    # 5. Smoothing Noisy Data with a Rolling Window

     
    A rolling (or moving) average smooths out month-to-month volatility, making trends in time-series data much easier to read. Window functions with an explicit ROWS BETWEEN frame give you precise control over how many periods to include.

    Task: Calculate a 3-month rolling average of monthly revenue to smooth out noise.

    WITH monthly AS (
        SELECT
            DATE_TRUNC('month', created_at)::DATE AS month,
            SUM(amount) AS monthly_revenue
        FROM transactions
        WHERE status="completed"
        GROUP BY DATE_TRUNC('month', created_at)
    )
    SELECT
        month,
        monthly_revenue,
        ROUND(AVG(monthly_revenue) OVER (
            ORDER BY month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ), 2) AS revenue_3mo_avg
    FROM monthly
    ORDER BY month;

     

    Output:

    month    | monthly_revenue | revenue_3mo_avg
    -------------+-----------------+-----------------
     2023-09-01  |           19.00 |           19.00
     2023-10-01  |           19.00 |           19.00
     2023-11-01  |           79.00 |           39.00
     2024-01-01  |          275.00 |          124.33
     2024-02-01  |          476.00 |          276.67
     2024-03-01  |          555.00 |          435.33
     2024-04-01  |          835.00 |          622.00
     2024-05-01  |          775.00 |          721.67
     2024-06-01  |          598.00 |          736.00
    (9 rows)

     

    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW tells the window function to look at the current row and the two rows before it. The first two rows use fewer inputs since there’s no prior history, so they act as a 1-month and 2-month average respectively.

    Swap ROWS for RANGE if you want to include all rows with the same ORDER BY value (useful when multiple rows share a timestamp). For longer smoothing, change 2 PRECEDING to 5 PRECEDING for a 6-month window.

     

    # 6. Aggregating Conditionally with FILTER

     
    FILTER lets you apply a WHERE condition to a specific aggregate without splitting the query into multiple subqueries. The result is multiple conditional aggregations in a single pass over the data.

    Task: Get total revenue, refunds, and failed transaction counts broken out by month — all in one row per month.

    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(amount) FILTER (WHERE status="completed") AS revenue_completed,
        SUM(amount) FILTER (WHERE status="refunded")  AS revenue_refunded,
        COUNT(*)    FILTER (WHERE status="failed")    AS failed_count
    FROM transactions
    GROUP BY DATE_TRUNC('month', created_at)
    ORDER BY month;

     

    Output:

    month             | revenue_completed | revenue_refunded | failed_count
    ------------------------+-------------------+------------------+--------------
     2023-09-01 00:00:00+00 |             19.00 |                  |            0
     2023-10-01 00:00:00+00 |             19.00 |                  |            0
     2023-11-01 00:00:00+00 |             79.00 |                  |            0
     2024-01-01 00:00:00+00 |            275.00 |                  |            0
     2024-02-01 00:00:00+00 |            476.00 |            79.00 |            1
     2024-03-01 00:00:00+00 |            555.00 |            79.00 |            0
     2024-04-01 00:00:00+00 |            835.00 |           299.00 |            0
     2024-05-01 00:00:00+00 |            775.00 |                  |            1
     2024-06-01 00:00:00+00 |            598.00 |                  |            2
    (9 rows)

     

    The alternative to FILTER is three separate subqueries joined together — more code, harder to read, and often slower. Note that SUM with FILTER returns NULL (not zero) when no rows match in a given month, which is accurate: there genuinely were no refunds in those months. Wrap in COALESCE(..., 0) if you prefer zeros.

    FILTER is standard SQL and works in PostgreSQL and BigQuery. In Snowflake and some others, use SUM(CASE WHEN status="completed" THEN amount END) instead.

     

    # 7. Detecting Consecutive Activity Streaks with Window Functions

     
    Finding unbroken sequences — active months without a gap, consecutive days with transactions, subscription streaks — is one of the trickier SQL problems. The classic solution uses a window function to group rows into streaks without a recursive CTE.

    The technique: assign each active month a sequential row number within its customer partition. If the months are truly consecutive, subtracting that row number from the month date produces the same constant value for every month in the streak. A gap breaks the constant.

    Task: Find each customer’s consecutive active months (months with at least one completed transaction).

    WITH monthly_activity AS (
        SELECT
            customer_id,
            DATE_TRUNC('month', created_at)::DATE AS active_month
        FROM transactions
        WHERE status="completed"
        GROUP BY customer_id, DATE_TRUNC('month', created_at)
    ),
    with_prev AS (
        SELECT
            customer_id,
            active_month,
            LAG(active_month) OVER (
                PARTITION BY customer_id
                ORDER BY active_month
            ) AS prev_month
        FROM monthly_activity
    ),
    streak_groups AS (
        SELECT
            customer_id,
            active_month,
            SUM(CASE WHEN active_month = prev_month + INTERVAL '1 month' THEN 0 ELSE 1 END)
                OVER (PARTITION BY customer_id ORDER BY active_month) AS streak_id
        FROM with_prev
    ),
    streaks AS (
        SELECT
            customer_id,
            streak_id,
            MIN(active_month) AS streak_start,
            MAX(active_month) AS streak_end,
            COUNT(*) AS streak_length_months
        FROM streak_groups
        GROUP BY customer_id, streak_id
    )
    SELECT customer_id, streak_start, streak_end, streak_length_months
    FROM streaks
    ORDER BY customer_id, streak_start;

     

    Output:

    customer_id | streak_start | streak_end | streak_length_months
    -------------+--------------+------------+----------------------
            3317 | 2024-01-01   | 2024-01-01 |                    1
            3317 | 2024-03-01   | 2024-03-01 |                    1
            3317 | 2024-05-01   | 2024-05-01 |                    1
            4482 | 2023-09-01   | 2023-11-01 |                    3
            4482 | 2024-01-01   | 2024-01-01 |                    1
            4482 | 2024-03-01   | 2024-05-01 |                    3
            5901 | 2024-02-01   | 2024-06-01 |                    5
            6103 | 2024-01-01   | 2024-04-01 |                    4
            6103 | 2024-06-01   | 2024-06-01 |                    1
            6204 | 2024-01-01   | 2024-01-01 |                    1
            6204 | 2024-03-01   | 2024-03-01 |                    1
            6204 | 2024-05-01   | 2024-05-01 |                    1
            7891 | 2024-02-01   | 2024-02-01 |                    1
            7891 | 2024-04-01   | 2024-05-01 |                    2
            8810 | 2024-01-01   | 2024-02-01 |                    2
            8810 | 2024-04-01   | 2024-04-01 |                    1
    (16 rows)

     

    # Quick Reference

     
    These patterns work in standard SQL without relying on database-specific features, and they appear frequently in analytical workflows such as retention analysis, upgrade funnel tracking, and revenue reporting.

     

    Tip When to Use It
    LAG() / LEAD() Time between events, before/after comparisons per entity
    Self-join Detect transitions between states (upgrades, re-activations)
    ROW_NUMBER() Top-N rows per group, deduplication
    NTILE(n) Customer segmentation into spend/activity tiers
    Rolling window (ROWS BETWEEN) Smooth noisy time-series, moving averages
    FILTER Multiple conditional aggregations in one query pass
    Consecutive streak detection Subscription streaks, retention analysis, session gaps

     

    Once you’re comfortable with them, many multi-step data transformations that are often handled in Python can be expressed more cleanly and efficiently in a single SQL query.
     
     

    Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



    Related posts:

    Advanced NotebookLM Tips & Tricks for Power Users

    What is gpt-oss-safeguard? OpenAI's Policy-Driven Safety Model

    5 N8N Projects to Master Low-Code AI Automation

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleApple patches eavesdropping vulnerability in Beats Studio Buds
    Next Article Israel fetes Somaliland’s leader as it seeks to expand Red Sea influence | Border Disputes News
    gvfx00@gmail.com
    • Website

    Related Posts

    Business & Startups

    Loss Function Explained For Noobs (How Models Know They Are Wrong)

    June 19, 2026
    Business & Startups

    Machine Learning System Design: 10 Interview Problems Solved

    June 19, 2026
    Business & Startups

    Advanced Join Techniques: LATERAL Joins, Semi Joins, Anti Joins

    June 18, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    Black Swans in Artificial Intelligence — Dan Rose AI

    October 2, 2025202 Views

    Every Clue That Tony Stark Was Always Doctor Doom

    October 20, 2025129 Views

    We let ChatGPT judge impossible superhero debates — here’s how it ruled

    December 31, 202599 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram

    Subscribe to Updates

    Get the latest tech news from tastytech.

    About Us
    About Us

    TastyTech.in brings you the latest AI, tech news, cybersecurity tips, and gadget insights all in one place. Stay informed, stay secure, and stay ahead with us!

    Most Popular

    Black Swans in Artificial Intelligence — Dan Rose AI

    October 2, 2025202 Views

    Every Clue That Tony Stark Was Always Doctor Doom

    October 20, 2025129 Views

    We let ChatGPT judge impossible superhero debates — here’s how it ruled

    December 31, 202599 Views

    Subscribe to Updates

    Get the latest news from tastytech.

    Facebook X (Twitter) Instagram Pinterest
    • Homepage
    • About Us
    • Contact Us
    • Privacy Policy
    © 2026 TastyTech. Designed by TastyTech.

    Type above and press Enter to search. Press Esc to cancel.

    Ad Blocker Enabled!
    Ad Blocker Enabled!
    Our website is made possible by displaying online advertisements to our visitors. Please support us by disabling your Ad Blocker.