
    Using Polars Instead of Pandas: Performance Deep Dive

    By gvfx00@gmail.com | May 12, 2026 | 12 min read


    [Image: Polars vs Pandas]

    Table of Contents

    • # Introduction
    • # Using rank() vs. with_row_count(): Activity Rank
        • // Data View
        • // Common Mistake
        • // Solutions
        • // Performance Comparison
    • # Using cumcount() + pivot() vs. over(): Finding User Purchases
        • // Data View
        • // Edge Case
        • // Solutions
        • // Performance Comparison
    • # Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average
        • // Data View
        • // Trade-Offs
        • // Solutions
        • // Performance Comparison
    • # Conclusion

    # Introduction

     
    Over the last decade, Pandas has been the foundation for data work in Python. For datasets that fit in memory, it is fast and familiar enough that switching libraries rarely crosses any programmer’s mind.

    However, once you start working with millions of rows, the flaws start to appear: groupby operations that take several seconds, intermediate copies that consume RAM, and window functions that run as Python-level loops rather than vectorized C or Rust code.

    Polars is a DataFrame library built in Rust on top of Apache Arrow. It was designed with parallelism and lazy evaluation as first-class features. Pandas executes each operation upfront and in sequence, whereas Polars can build up a query plan and optimize it prior to executing, with most operations executing concurrently across all available CPU cores automatically.

    In this article, we explore three real data problems using real questions from the StrataScratch coding platform. For each problem, we compare both libraries’ solutions and point out where the performance difference matters most.

     

    # Using rank() vs. with_row_count(): Activity Rank

     
    In this question, the goal is to find the email activity rank for each user based on the total number of emails sent. The user with the most emails gets rank 1. Results must be sorted by total emails in descending order, using alphabetical order as a tiebreaker, and each rank must be distinct, even if two users have the same email count.

     

    // Data View

    The google_gmail_emails table stores one row per email sent, with a sender ID (from_user), recipient ID (to_user), and the day the email was sent. Here is a preview of the table:

     

    | id  | from_user          | to_user            | day |
    |-----|--------------------|--------------------|-----|
    | 0   | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10  |
    | 1   | 6edf0be4b2267df1fa | 32ded68d89443e808  | 6   |
    | 2   | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10  |
    | 3   | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6   |
    | …   | …                  | …                  | …   |
    | 314 | e6088004caf0c8cc51 | e6088004caf0c8cc51 | 5   |

     

    Grain (what one output row means): one user, with their total email count and unique activity rank.

     

    // Common Mistake

    The question asks for a unique rank even when two users have the same email count. A common mistake is to use rank(method='dense') in Pandas, which assigns the same rank to tied users. The correct choice is method='first', which breaks ties by position in the frame. Since the groupby output is ordered alphabetically by user_id before ranking, the resulting ranks are unique and deterministic.

    The Polars optimal solution avoids the rank function entirely. After sorting by ["total_emails", "user_id"] in descending and ascending order, respectively, the .with_row_count("activity_rank", offset=1) clause assigns sequential integers starting from 1. No tie-breaking logic is needed because the sort already handled it.
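    To see the tie-breaking difference concretely, here is a minimal sketch on a toy Series with a tie (the toy values are assumed, not taken from the dataset):

```python
import pandas as pd

# Two users tied at 19 emails, one at 18
totals = pd.Series([19, 19, 18])

# 'dense' assigns the same rank to ties, violating the uniqueness requirement
print(totals.rank(method="dense", ascending=False).tolist())  # [1.0, 1.0, 2.0]

# 'first' breaks ties by row position, so every rank is distinct
print(totals.rank(method="first", ascending=False).tolist())  # [1.0, 2.0, 3.0]
```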

     

    // Solutions

    1. Pandas Solution

    We rename from_user to user_id, group by user, count emails, compute the first rank, and sort by email count in descending order, with alphabetical tie-breaking.

    import pandas as pd
    google_gmail_emails = google_gmail_emails.rename(columns={"from_user": "user_id"})
    # count emails per sender
    result = google_gmail_emails.groupby(
        ['user_id']).size().to_frame('total_emails').reset_index()
    # method='first' breaks ties by position; cast to int for a clean rank column
    result['activity_rank'] = result['total_emails'].rank(method='first', ascending=False).astype(int)
    result = result.sort_values(by=['total_emails', 'user_id'], ascending=[False, True])

     

    2. Polars Solution

    We use a lazy chain that renames, groups, sorts, and assigns row numbers in a single pass. Calling .collect() at the end materializes the result.

    import polars as pl
    google_gmail_emails = google_gmail_emails.rename({"from_user": "user_id"})
    result = (
        google_gmail_emails.lazy()
        .group_by("user_id")
        .agg(total_emails = pl.count())
        .sort(
            by=["total_emails", "user_id"],
            descending=[True, False]
        )
        .with_row_count("activity_rank", offset=1)
        .select([
            pl.col("user_id"),
            "total_emails",
            "activity_rank"
        ])
        .collect()
    )

     

    // Performance Comparison

     
    [Chart: Polars vs Pandas performance comparison]

    The Pandas solution iterates over the data twice after grouping: once to compute sizes and once to assign ranks. Internally, rank(method='first') allocates a rank array, resolves ties via argsort, and writes back — which is considerably more expensive than it looks for a single column. The Polars group_by function divides the workload across all available CPU cores, resulting in significantly faster aggregation for large tables. And since the .with_row_count() clause is a single O(n) sequential pass after sorting, it replaces the rank function with the cheapest possible operation. On a table containing millions of email records, the use of parallel aggregation without a rank function can result in a 5–10x improvement in wall-clock time compared to the Pandas approach.

    Here is the code output preview:

     

    | user_id            | total_emails | activity_rank |
    |--------------------|--------------|---------------|
    | 32ded68d89443e808  | 19           | 1             |
    | ef5fe98c6b9f313075 | 19           | 2             |
    | 5b8754928306a18b68 | 18           | 3             |
    | 55e60cfcc9dc49c17e | 16           | 4             |
    | 91f59516cb9dee1e88 | 16           | 5             |
    | …                  | …            | …             |
    | e6088004caf0c8cc51 | 6            | 25            |

     

    # Using cumcount() + pivot() vs. over(): Finding User Purchases

     
    In this question, we’re asked to identify returning active users: specifically, those who made a second purchase between 1 and 7 days after their first. Purchases made on the same day should not be included. The result is simply a list of qualifying user_id values.

     

    // Data View

    The amazon_transactions table has one row per purchase, with user_id, item, created_at date, and revenue.

    Here is a preview of the table:

     

    | id  | user_id | item    | created_at | revenue |
    |-----|---------|---------|------------|---------|
    | 1   | 109     | milk    | 2020-03-03 | 123     |
    | 2   | 139     | biscuit | 2020-03-18 | 421     |
    | 3   | 120     | milk    | 2020-03-18 | 176     |
    | …   | …       | …       | …          | …       |
    | 100 | 117     | bread   | 2020-03-10 | 209     |

     

    Grain (what one output row means): one user ID that made a qualifying return purchase within 7 days of their first.

     

    // Edge Case

    Same-day purchases should be ignored, meaning the gap between first and second purchase must exceed 0 days and be at most 7 days. A customer who buys twice on the same day does not qualify.

     

    // Solutions

    Both solutions find each user’s earliest purchase date and then filter for subsequent purchases within the 1- to 7-day timeframe. One thing to watch: if created_at has timestamps instead of plain dates, you need to truncate to the date before comparing. Otherwise, two purchases made at different times on the same day would incorrectly pass the strict inequality.
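    To illustrate the truncation pitfall, here is a small sketch with two hypothetical same-day timestamps:

```python
import pandas as pd

# Two purchases on the same day, hours apart (hypothetical timestamps)
a = pd.Timestamp("2020-03-03 09:00")
b = pd.Timestamp("2020-03-03 18:00")

print(b > a)                          # True: raw timestamps count this as a later purchase
print(b.normalize() > a.normalize())  # False: truncated to midnight, the dates are equal
```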

    1. Pandas Solution

    In Pandas, the solution involves isolating unique purchase dates per user, ranking them with cumcount(), pivoting to get first and second dates side by side, and computing the day difference.

    import pandas as pd
    # truncate timestamps to dates and drop same-day duplicates per user
    amazon_transactions["purchase_date"] = pd.to_datetime(amazon_transactions["created_at"]).dt.date
    daily = amazon_transactions[["user_id", "purchase_date"]].drop_duplicates()
    # number each user's distinct purchase dates chronologically
    ranked = daily.sort_values(["user_id", "purchase_date"])
    ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
    # pivot the first two dates side by side, one row per user
    first_two = (ranked[ranked["rn"] <= 2]
                 .pivot(index="user_id", columns="rn", values="purchase_date")
                 .reset_index()
                 .rename(columns={1: "first_date", 2: "second_date"}))
    first_two = first_two.dropna(subset=["second_date"])  # users with only one purchase date
    first_two["diff"] = (pd.to_datetime(first_two["second_date"]) - pd.to_datetime(first_two["first_date"])).dt.days
    result = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)][["user_id"]]
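    As a quick sanity check, the same pipeline on a tiny made-up table (user 1 returns after 3 days, user 2 only buys same-day, user 3 returns after 19 days) should keep only user 1:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3],
    "created_at": ["2020-03-01", "2020-03-04",   # 3-day gap: qualifies
                   "2020-03-01", "2020-03-01",   # same day only: excluded
                   "2020-03-01", "2020-03-20"],  # 19-day gap: excluded
})
toy["purchase_date"] = pd.to_datetime(toy["created_at"]).dt.date
daily = toy[["user_id", "purchase_date"]].drop_duplicates()
ranked = daily.sort_values(["user_id", "purchase_date"])
ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
first_two = (ranked[ranked["rn"] <= 2]
             .pivot(index="user_id", columns="rn", values="purchase_date")
             .reset_index()
             .rename(columns={1: "first_date", 2: "second_date"})
             .dropna(subset=["second_date"]))
first_two["diff"] = (pd.to_datetime(first_two["second_date"])
                     - pd.to_datetime(first_two["first_date"])).dt.days
qualifying = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)]["user_id"].tolist()
print(qualifying)  # [1]
```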

     

    2. Polars Solution

    The Polars solution involves computing the first purchase date per user as a window expression with .over("user_id"), filtering to purchases that fit the time window, and returning a deduplicated user_id list.

    import polars as pl
    # returning active users: 2nd purchase 1–7 days after the first (ignore same-day)
    returning_users = (
        amazon_transactions
        .lazy()
        # first purchase date per user, computed as a window expression
        .with_columns(
            pl.col("created_at").min().over("user_id").alias("first_purchase_date")
        )
        # keep transactions strictly 1-7 days after that first purchase
        .filter(
            (pl.col("created_at") > pl.col("first_purchase_date")) &
            (pl.col("created_at") <= pl.col("first_purchase_date") + pl.duration(days=7))
        )
        # distinct user list
        .select("user_id")
        .unique()
        .sort("user_id")
        # materialize the LazyFrame into a DataFrame
        .collect()
    )

     

    // Performance Comparison

     
    [Chart: Polars vs Pandas performance comparison]

    Notice the number of distinct DataFrame allocations in the Pandas solution: the deduplicated daily table, the sorted ranked table, the pivoted frame, the dropna result, and the filtered output. That is five separate objects, each copying data into a new memory block. On a large transactions table, the pivot step alone can significantly increase memory usage, as it reshapes the entire dataset into a wide format.

    The Polars lazy chain materializes nothing until .collect(). The .over("user_id") window expression computes each user’s earliest purchase date in one pass, the .filter() applies within the same optimized plan, and .unique() runs concurrently across CPU cores. There is no pivot, no intermediate sorted copy, and no separate date-casting step: Polars handles date arithmetic natively inside the expression engine. This approach consumes less memory and runs faster, even on moderately sized datasets.

    Here is the code output preview:

     

    | user_id |
    |---------|
    | 100     |
    | 103     |
    | 105     |
    | …       |
    | 143     |

     

    # Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average

     
    In this question, we’re asked to determine a cumulative average for monthly book sales throughout 2022. The average grows each month using all preceding months: February averages January and February, March averages all three, and so on. The output should include the month, that month’s total sales, and the cumulative average rounded to the nearest whole number.
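    Using the monthly totals from the output preview (145 for January, 250 for February), the February entry works out as:

```python
# cumulative average for February = mean of the January and February totals
jan, feb = 145, 250
print(round((jan + feb) / 2))  # 198
```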

     

    // Data View

    The amazon_books table has one row per book and its unit price. The book_orders table has one row per order, linking a book ID to a quantity and an order date. Here is a preview of the amazon_books table:

     

    | book_id | book_title               | unit_price |
    |---------|--------------------------|------------|
    | B001    | The Hunger Games         | 25         |
    | B002    | The Outsiders            | 50         |
    | B003    | To Kill a Mockingbird    | 100        |
    | …       | …                        | …          |
    | B020    | The Pillars of the Earth | 60         |

     

    The book_orders table has one row per book order, linking each order ID to an order date, book ID, and the quantity ordered:

     

    | order_id | order_date | book_id | quantity |
    |----------|------------|---------|----------|
    | 1001     | 2022-01-10 | B001    | 1        |
    | 1002     | 2022-01-10 | B009    | 1        |
    | 1003     | 2022-01-15 | B012    | 2        |
    | …        | …          | …       | …        |
    | 1084     | 2023-02-01 | B009    | 1        |

     

    Grain (what one output row means): one month in 2022, with total sales for that month and a cumulative average of all monthly sales up to and including that month.

     

    // Trade-Offs

    Using Pandas, .expanding().mean() is convenient, but it runs through the general window-aggregation machinery, which carries per-window overhead that a plain cumulative sum avoids. For a 12-row monthly summary, this cost is negligible. For daily or hourly data at scale (say, three years of hourly transactions), that overhead compounds row by row.
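    The two computations are numerically equivalent; here is a minimal check on three toy values:

```python
import pandas as pd
import numpy as np

s = pd.Series([145, 250, 315])
expanding = s.expanding().mean().to_numpy()
# same result as a single cumulative-sum pass divided by the running count
single_pass = np.cumsum(s.to_numpy()) / np.arange(1, len(s) + 1)
print(np.allclose(expanding, single_pass))  # True
```

Identical values, very different cost profiles at scale.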

    Polars’ cum_mean() runs a single pass in Rust and is inherently faster at scale. There is one catch: the question requires rounding to the nearest whole number, and Pandas uses banker’s rounding (round half to even) by default. The Polars solution uses NumPy’s cumsum with an explicit floor(x + 0.5) formula to enforce round-half-up behavior. If you need an exact match to the expected output, the NumPy method is more reliable than the built-in rounding in either library.
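    A quick demonstration of the two rounding conventions on half-values:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5])
print(np.round(halves))        # [0. 2. 2.] round half to even (banker's rounding)
print(np.floor(halves + 0.5))  # [1. 2. 3.] round half up
```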

     

    // Solutions

    1. Pandas Solution

    We merge books with orders, filter to 2022, aggregate monthly sales, and apply .expanding().mean() to compute the cumulative average.

    import pandas as pd
    merged = pd.merge(book_orders, amazon_books, on="book_id", how="inner")
    merged["order_date"] = pd.to_datetime(merged["order_date"])
    merged["order_month"] = merged["order_date"].dt.month
    merged["year"] = merged["order_date"].dt.year
    merged["sales"] = merged["unit_price"] * merged["quantity"]
    merged = merged.loc[merged["year"] == 2022]
    result = (
        merged.groupby("order_month")["sales"]
        .sum()
        .to_frame("monthly_sales")
        .sort_values(by="order_month")
        .reset_index()
    )
    # note: .round(0) rounds half to even; see the Trade-Offs section above
    result["rolling_average"] = result["monthly_sales"].expanding().mean().round(0).astype(int)
    result

     

    2. Polars: Building the Lazy Pipeline and Collecting

    We join the two tables inside a lazy chain, compute sales as unit_price * quantity, filter to 2022, aggregate by month, and call .collect() to switch to eager mode before the NumPy rolling step.

    import polars as pl
    import numpy as np
    # Step 1: Prepare monthly sales (LazyFrame)
    monthly_sales_lazy = (
        book_orders.lazy()
        .join(amazon_books.lazy(), on="book_id", how="inner")
        # cast first in its own step: expressions inside a single with_columns
        # all read from the input frame, so the year/month extraction below
        # must come after the cast, not beside it
        .with_columns(pl.col("order_date").cast(pl.Datetime))
        .with_columns([
            (pl.col("unit_price") * pl.col("quantity")).alias("sales"),
            pl.col("order_date").dt.year().alias("year"),
            pl.col("order_date").dt.month().alias("order_month")
        ])
        .filter(pl.col("year") == 2022)
        .group_by("order_month")
        .agg(pl.col("sales").sum().alias("monthly_sales"))
        .sort("order_month")
    )
    # Step 2: Switch to eager mode for the rolling computation
    monthly_sales = monthly_sales_lazy.collect()

     

    3. Computing the Rolling Average and Finalizing

    With the monthly sales as a NumPy array, we apply round-half-up rounding, add the result back to the Polars DataFrame, and select the output columns.

    # Step 3: Rolling average with round-half-up
    sales_np = monthly_sales["monthly_sales"].to_numpy()
    cumsum = np.cumsum(sales_np)
    rolling_avg = np.floor(cumsum / np.arange(1, len(cumsum)+1) + 0.5).astype(int)
    # Step 4: Add back to Polars DataFrame
    monthly_sales = monthly_sales.with_columns([
        pl.Series("rolling_average", rolling_avg)
    ])
    # Step 5: Final result with correct column names
    result = monthly_sales.select(["order_month", "monthly_sales", "rolling_average"])

     

    // Performance Comparison

     
    [Chart: Polars vs Pandas performance comparison]

    This question has two operations that affect performance the most: the join and the cumulative window. In Pandas, pd.merge joins all rows from both tables before filtering for 2022, so every year’s worth of orders is processed before rows outside the target period are discarded. Polars builds a lazy query plan and pushes the filter(year == 2022) condition below the join, so it joins a smaller dataset from the start. That predicate pushdown happens automatically, with no extra code required.
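    In Pandas you can imitate that pushdown by hand: filter to 2022 before merging. Here is a sketch on toy stand-ins for the two tables (the values are assumed for illustration):

```python
import pandas as pd

# toy stand-ins for the book_orders / amazon_books tables
book_orders = pd.DataFrame({
    "order_date": ["2021-12-30", "2022-01-10", "2023-02-01"],
    "book_id":    ["B001", "B001", "B009"],
    "quantity":   [1, 2, 1],
})
amazon_books = pd.DataFrame({"book_id": ["B001", "B009"], "unit_price": [25, 40]})

# filter first, then join: the hand-rolled equivalent of predicate pushdown
orders_2022 = book_orders[pd.to_datetime(book_orders["order_date"]).dt.year == 2022]
merged = orders_2022.merge(amazon_books, on="book_id", how="inner")
print(len(merged))  # 1 (only the 2022 order reaches the join)
```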

    The larger gap is the rolling average. Pandas’ .expanding().mean() goes through the general window-aggregation machinery, growing its window one row at a time, while Polars’ cum_mean() computes the whole column in a single native pass with no per-window overhead. The difference is imperceptible on 12 monthly rows, but on daily or hourly data spanning several years, the single-pass approach pulls measurably ahead.

    Here is the code output preview:

     

    | order_month | monthly_sales | rolling_average |
    |-------------|---------------|-----------------|
    | 1           | 145           | 145             |
    | 2           | 250           | 198             |
    | 3           | 315           | 237             |
    | …           | …             | …               |
    | 12          | 710           | 402             |

     

     

    # Conclusion

     
    Across all three problems, the Polars solutions follow the same pattern: build a lazy query plan, push as much computation as possible into the optimizer, and call .collect() only when you need a concrete result.

    The syntax takes some adjustment if you, like most analysts, have years of Pandas habits, but the operations align closely. .groupby() becomes .group_by(), .rename() takes a plain dict instead of a columns= keyword, and ranking becomes a sort followed by .with_row_count().

    The true difference shows at scale. On small datasets, both libraries return results fast enough that the gap is not noticeable. As row counts reach the millions, Polars’ Rust-level parallelism and single-pass algorithms pull decisively ahead. If you’re hitting performance walls with Pandas, these three challenges are a great starting point for migration.

    Nate Rosidi is a data scientist working in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


