# Introduction
Over the last decade, Pandas has been the foundation for data work in Python. For datasets that fit in memory, it is fast and familiar enough that switching libraries rarely crosses any programmer’s mind.
However, once you start working with millions of rows, the flaws start to appear: groupby operations that take several seconds, intermediate copies that consume RAM, and window functions that run as Python-level loops rather than vectorized C or Rust code.
Polars is a DataFrame library written in Rust on top of Apache Arrow, designed with parallelism and lazy evaluation as first-class features. Pandas executes each operation eagerly and in sequence, whereas Polars can build up a query plan, optimize it before executing, and run most operations concurrently across all available CPU cores automatically.
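Here is a minimal sketch of that difference; the file name and columns (events.csv, amount, country) are made up for illustration:
import pandas as pd
import polars as pl
# Eager: each line runs immediately and materializes a full intermediate DataFrame
df = pd.read_csv("events.csv")  # hypothetical file
df = df[df["amount"] > 100]
out_pd = df.groupby("country")["amount"].sum()
# Lazy: nothing executes until .collect(); Polars can push the filter into the scan
out_pl = (
    pl.scan_csv("events.csv")  # lazy scan, no data read yet
    .filter(pl.col("amount") > 100)
    .group_by("country")
    .agg(pl.col("amount").sum())
    .collect()  # the optimized plan runs here, in parallel
)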
In this article, we work through three data problems built on real interview questions from the StrataScratch coding platform. For each problem, we compare both libraries’ solutions and point out where the performance difference matters most.
# Using rank() vs. with_row_count(): Activity Rank
In this question, the goal is to find the email activity rank for each user based on the total number of emails sent. The user with the most emails gets rank 1. Results must be sorted by total emails in descending order, using alphabetical order as a tiebreaker, and each rank must be distinct, even if two users have the same email count.
// Data View
The google_gmail_emails table stores one row per email sent, with a sender ID (from_user), recipient ID (to_user), and the day the email was sent. Here is a preview of the table:
| id | from_user | to_user | day |
|---|---|---|---|
| 0 | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10 |
| 1 | 6edf0be4b2267df1fa | 32ded68d89443e808 | 6 |
| 2 | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10 |
| 3 | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6 |
| 4 | … | … | … |
| 314 | e6088004caf0c8cc51 | e6088004caf0c8cc51 | 5 |
Grain (what one output row means): one user, with their total email count and unique activity rank.
// Common Mistake
The question asks for a unique rank even when two users have the same email count. A common mistake is to use rank(method='dense') in Pandas, which assigns the same rank to tied users. The correct method is 'first', which breaks ties by row position. Because groupby sorts its keys alphabetically by default, the frame is already ordered by user_id when the ranks are assigned, so the resulting ranks are unique and deterministic.
The optimal Polars solution avoids the rank function entirely. After sorting by ["total_emails", "user_id"] in descending and ascending order, respectively, the .with_row_count("activity_rank", offset=1) call assigns sequential integers starting from 1 (newer Polars releases name this method .with_row_index()). No tie-breaking logic is needed because the sort already handled it.
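On a tiny throwaway Series, the two rank methods behave like this (the user IDs are made up):
import pandas as pd
counts = pd.Series([19, 19, 18], index=["a_user", "b_user", "c_user"])
counts.rank(method="dense", ascending=False)  # a_user 1.0, b_user 1.0, c_user 2.0 -- ties share a rank
counts.rank(method="first", ascending=False)  # a_user 1.0, b_user 2.0, c_user 3.0 -- ties broken by position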
// Solutions
1. Pandas Solution
We rename from_user to user_id, group by user to count emails, assign ranks with rank(method='first'), and sort by email count in descending order with alphabetical tie-breaking.
import pandas as pd
# Rename the sender column so the output matches the expected schema
google_gmail_emails = google_gmail_emails.rename(columns={"from_user": "user_id"})
# Count emails per sender; groupby sorts its keys alphabetically by default
result = google_gmail_emails.groupby("user_id").size().to_frame("total_emails").reset_index()
# method='first' breaks ties by row position, i.e. alphabetically by user_id here
result["activity_rank"] = result["total_emails"].rank(method="first", ascending=False)
result = result.sort_values(by=["total_emails", "user_id"], ascending=[False, True])
2. Polars Solution
We use a lazy chain that renames, groups, sorts, and assigns row numbers in a single pass. Calling .collect() at the end materializes the result.
import polars as pl
google_gmail_emails = google_gmail_emails.rename({"from_user": "user_id"})
result = (
    google_gmail_emails.lazy()
    .group_by("user_id")
    # pl.count() counts rows per group (newer Polars releases prefer pl.len())
    .agg(total_emails=pl.count())
    .sort(by=["total_emails", "user_id"], descending=[True, False])
    # sequential row numbers replace rank(); newer releases name this .with_row_index()
    .with_row_count("activity_rank", offset=1)
    .select(["user_id", "total_emails", "activity_rank"])
    .collect()
)
// Performance Comparison
The Pandas solution does extra work after grouping: rank(method='first') allocates a rank array, resolves ties via an argsort, and writes the results back, which is more expensive than it looks for a single column. The Polars group_by divides the workload across all available CPU cores, so the aggregation itself is significantly faster on large tables. And because .with_row_count() is a single O(n) sequential pass over the already-sorted frame, it replaces the rank function with the cheapest possible operation. On a table containing millions of email records, parallel aggregation plus a row count instead of a rank can translate into a 5–10x improvement in wall-clock time over the Pandas approach.
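If you want to check the gap on your own hardware, a minimal benchmark sketch looks like the following; the data is synthetic, and the actual timings depend entirely on your machine and library versions:
import time
import numpy as np
import pandas as pd
import polars as pl
# Synthetic email log: 5 million rows, 10,000 senders (illustrative only)
rng = np.random.default_rng(0)
senders = rng.integers(0, 10_000, size=5_000_000)
pdf = pd.DataFrame({"user_id": senders})
plf = pl.DataFrame({"user_id": senders})
t0 = time.perf_counter()
pd_out = pdf.groupby("user_id").size().to_frame("total_emails").reset_index()
pd_out["activity_rank"] = pd_out["total_emails"].rank(method="first", ascending=False)
t1 = time.perf_counter()
pl_out = (
    plf.lazy()
    .group_by("user_id")
    .agg(total_emails=pl.count())
    .sort(["total_emails", "user_id"], descending=[True, False])
    .with_row_count("activity_rank", offset=1)
    .collect()
)
t2 = time.perf_counter()
print(f"pandas: {t1 - t0:.3f}s  polars: {t2 - t1:.3f}s")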
Here is the code output preview:
| user_id | total_emails | activity_rank |
|---|---|---|
| 32ded68d89443e808 | 19 | 1 |
| ef5fe98c6b9f313075 | 19 | 2 |
| 5b8754928306a18b68 | 18 | 3 |
| 55e60cfcc9dc49c17e | 16 | 4 |
| 91f59516cb9dee1e88 | 16 | 5 |
| … | … | … |
| e6088004caf0c8cc51 | 6 | 25 |
# Using cumcount() + pivot() vs. over(): Finding User Purchases
In this question, we’re asked to identify returning active users: specifically, those who made a second purchase between 1 and 7 days after their first purchase. Purchases made on the same day do not count. The result is simply a list of qualifying user_id values.
// Data View
The amazon_transactions table has one row per purchase, with user_id, item, created_at date, and revenue.
Here is a preview of the table:
| id | user_id | item | created_at | revenue |
|---|---|---|---|---|
| 1 | 109 | milk | 2020-03-03 | 123 |
| 2 | 139 | biscuit | 2020-03-18 | 421 |
| 3 | 120 | milk | 2020-03-18 | 176 |
| … | … | … | … | … |
| 100 | 117 | bread | 2020-03-10 | 209 |
Grain (what one output row means): one user ID that made a qualifying return purchase within 7 days of their first.
// Edge Case
Same-day purchases should be ignored, meaning the gap between first and second purchase must exceed 0 days and be at most 7 days. A customer who buys twice on the same day does not qualify.
// Solutions
Both solutions find each user’s earliest purchase date and then filter for subsequent purchases within the 1- to 7-day timeframe. One thing to watch: if created_at has timestamps instead of plain dates, you need to truncate to the date before comparing. Otherwise, two purchases made at different times on the same day would incorrectly pass the strict inequality.
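The Pandas solution below already truncates with .dt.date; the equivalent Polars step would be a cast to pl.Date before the comparison. A small sketch, assuming created_at arrives as a Datetime column:
import polars as pl
# Truncate timestamps to calendar dates so same-day purchases compare as equal
amazon_transactions = amazon_transactions.with_columns(
    pl.col("created_at").cast(pl.Date)
)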
1. Pandas Solution
In Pandas, the solution involves isolating unique purchase dates per user, ranking them with cumcount(), pivoting to get first and second dates side by side, and computing the day difference.
import pandas as pd
# Work with calendar dates only, so repeat same-day purchases collapse to one row
amazon_transactions["purchase_date"] = pd.to_datetime(amazon_transactions["created_at"]).dt.date
daily = amazon_transactions[["user_id", "purchase_date"]].drop_duplicates()
# Number each user's distinct purchase dates in chronological order
ranked = daily.sort_values(["user_id", "purchase_date"])
ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
# Pivot the first two dates into columns, then keep users with a second purchase
first_two = (ranked[ranked["rn"] <= 2]
             .pivot(index="user_id", columns="rn", values="purchase_date")
             .reset_index()
             .rename(columns={1: "first_date", 2: "second_date"}))
first_two = first_two.dropna(subset=["second_date"])
# Qualify users whose second purchase lands 1-7 days after the first
first_two["diff"] = (pd.to_datetime(first_two["second_date"]) - pd.to_datetime(first_two["first_date"])).dt.days
result = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)][["user_id"]]
2. Polars Solution
The Polars solution involves computing the first purchase date per user as a window expression with .over("user_id"), filtering to purchases that fit the time window, and returning a deduplicated user_id list.
import polars as pl
# Returning active users: a purchase 1-7 days after the first (same-day ignored)
returning_users = (
    amazon_transactions
    .lazy()
    # first purchase date per user, computed as a window expression
    .with_columns(
        pl.col("created_at").min().over("user_id").alias("first_purchase_date")
    )
    # keep purchases strictly after the first date but within 7 days of it
    # (created_at is assumed to be a Date/Datetime column; see the note above)
    .filter(
        (pl.col("created_at") > pl.col("first_purchase_date")) &
        (pl.col("created_at") <= pl.col("first_purchase_date") + pl.duration(days=7))
    )
    # distinct, sorted user list
    .select("user_id")
    .unique()
    .sort("user_id")
    .collect()
)
// Performance Comparison
Notice the number of distinct DataFrame allocations in the Pandas solution: the deduplicated daily table, the sorted ranked table, the pivoted frame, the dropna result, and the filtered output. That is five separate objects, each of which copies data into a new block of memory. On a large transactions table, the pivot step alone can significantly increase memory usage, because it reshapes the entire dataset into a wide format.
The Polars lazy chain does not allocate any memory until .collect(). The .over("user_id") window expression computes each user’s earliest purchase date in one pass, the .filter() applies immediately in the same step, and .unique() runs concurrently across CPU cores. There is no pivot, no intermediate sorted copy, and no separate date-casting step — Polars handles date arithmetic natively inside the expression engine. This approach consumes less memory and runs faster, even on moderately sized datasets.
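To see what the window expression actually produces, here is a minimal sketch on a throwaway three-row frame with made-up user IDs and dates:
import polars as pl
from datetime import date
toy = pl.DataFrame({
    "user_id": [1, 1, 2],
    "created_at": [date(2020, 3, 1), date(2020, 3, 4), date(2020, 3, 2)],
})
# .over("user_id") broadcasts each user's minimum date back onto every one of their rows
print(toy.with_columns(
    pl.col("created_at").min().over("user_id").alias("first_purchase_date")
))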
Here is the code output preview:
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| … |
| 143 |
# Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average
In this question, we’re asked to determine a cumulative average for monthly book sales throughout 2022. The average grows each month using all preceding months: February averages January and February, March averages all three, and so on. The output should include the month, that month’s total sales, and the cumulative average rounded to the nearest whole number.
// Data View
The amazon_books table has one row per book, with its title and unit price. Here is a preview of the table:
| book_id | book_title | unit_price |
|---|---|---|
| B001 | The Hunger Games | 25 |
| B002 | The Outsiders | 50 |
| B003 | To Kill a Mockingbird | 100 |
| … | … | … |
| B020 | The Pillars of the Earth | 60 |
The book_orders table has one row per book order, linking each order ID to an order date, book ID, and the quantity ordered:
| order_id | order_date | book_id | quantity |
|---|---|---|---|
| 1001 | 2022-01-10 | B001 | 1 |
| 1002 | 2022-01-10 | B009 | 1 |
| 1003 | 2022-01-15 | B012 | 2 |
| … | … | … | … |
| 1084 | 2023-02-01 | B009 | 1 |
Grain (what one output row means): one month in 2022, with total sales for that month and a cumulative average of all monthly sales up to and including that month.
// Trade-Offs
Using Pandas, .expanding().mean() is convenient, but it evaluates a growing window for every row, and that per-window overhead accumulates. For a 12-row monthly summary, the cost is negligible. For daily or hourly data at scale (say, three years of hourly transactions), each expanding window slice adds overhead that compounds row by row.
Polars’ cum_mean() runs a single pass in Rust and is inherently faster at scale. There is one catch: the question requires rounding to the nearest whole number, and Pandas uses banker’s rounding (round half to even) by default. The Polars solution uses NumPy’s cumsum with an explicit floor(x + 0.5) formula to enforce round-half-up behavior. If you need an exact match to the expected output, the NumPy method is more reliable than the built-in rounding in either library.
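A quick sketch of the rounding difference on half-way values:
import numpy as np
import pandas as pd
s = pd.Series([197.5, 198.5])
print(s.round(0))         # 198.0, 198.0  (round half to even)
print(np.floor(s + 0.5))  # 198.0, 199.0  (round half up)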
// Solutions
1. Pandas Solution
We merge books with orders, filter to 2022, aggregate monthly sales, and apply .expanding().mean() to compute the cumulative average.
import pandas as pd
# Join orders to books to get unit prices, then compute per-order sales
merged = pd.merge(book_orders, amazon_books, on="book_id", how="inner")
merged["order_date"] = pd.to_datetime(merged["order_date"])
merged["order_month"] = merged["order_date"].dt.month
merged["year"] = merged["order_date"].dt.year
merged["sales"] = merged["unit_price"] * merged["quantity"]
merged = merged.loc[merged["year"] == 2022, :]
# Total sales per month, then an expanding (cumulative) average across months
result = (
    merged.groupby("order_month")["sales"]
    .sum()
    .to_frame("monthly_sales")
    .sort_values(by="order_month")
    .reset_index()
)
result["rolling_average"] = result["monthly_sales"].expanding().mean().round(0)
result
2. Polars: Building the Lazy Pipeline and Collecting
We join the two tables inside a lazy chain, compute sales as unit_price * quantity, filter to 2022, aggregate by month, and call .collect() to switch to eager mode before the NumPy rolling step.
import polars as pl
import numpy as np
# Step 1: Prepare monthly sales (LazyFrame)
monthly_sales_lazy = (
    book_orders.lazy()
    .join(amazon_books.lazy(), on="book_id", how="inner")
    # cast in its own step so the year/month expressions below see a Datetime column
    .with_columns(
        (pl.col("unit_price") * pl.col("quantity")).alias("sales"),
        pl.col("order_date").cast(pl.Datetime),
    )
    .with_columns(
        pl.col("order_date").dt.year().alias("year"),
        pl.col("order_date").dt.month().alias("order_month"),
    )
    .filter(pl.col("year") == 2022)
    .group_by("order_month")
    .agg(pl.col("sales").sum().alias("monthly_sales"))
    .sort("order_month")
)
# Step 2: Switch to eager mode for the rolling computation
monthly_sales = monthly_sales_lazy.collect()
3. Computing the Rolling Average and Finalizing
With the monthly sales as a NumPy array, we apply round-half-up rounding, add the result back to the Polars DataFrame, and select the output columns.
# Step 3: Rolling average with round-half-up
sales_np = monthly_sales["monthly_sales"].to_numpy()
cumsum = np.cumsum(sales_np)
rolling_avg = np.floor(cumsum / np.arange(1, len(cumsum)+1) + 0.5).astype(int)
# Step 4: Add back to Polars DataFrame
monthly_sales = monthly_sales.with_columns([
    pl.Series("rolling_average", rolling_avg)
])
# Step 5: Final result with correct column names
result = monthly_sales.select(["order_month", "monthly_sales", "rolling_average"])
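If an exact round-half-up match against the expected output is not required, the whole computation can stay inside Polars. A sketch, written with cum_sum() and cum_count() (spelled cumsum()/cumcount() in older Polars releases) and using the default rounding:
# Pure-Polars alternative: cumulative mean as cum_sum / cum_count, default rounding
result_alt = monthly_sales.with_columns(
    (pl.col("monthly_sales").cum_sum() / pl.col("monthly_sales").cum_count())
    .round(0)
    .alias("rolling_average")
)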
// Performance Comparison
This question has two operations that affect performance the most: the join and the cumulative window. In Pandas, pd.merge joins all rows from both tables before filtering for 2022, which means every year’s worth of orders is processed before rows outside the target period are discarded. Polars builds a lazy query plan and can push the year == 2022 filter down toward the scan before the join executes, so the join can operate on a smaller dataset from the start. When it applies, that predicate pushdown happens automatically, with no extra code required.
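You can confirm where the optimizer placed the filter by printing the optimized plan before collecting; the exact plan text varies by Polars version:
# Print the optimized query plan to see where Polars placed the year == 2022
# filter relative to the join
print(monthly_sales_lazy.explain())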
The other noticeable difference is the cumulative average. Pandas’ .expanding().mean() evaluates a growing window for every row, so the per-window overhead scales with the number of rows. Polars computes a cumulative mean for the whole column in a single Rust pass with no Python overhead. The difference is imperceptible on twelve monthly rows, but run the same query on daily data for three years (roughly 1,000 rows) and beyond, and the gap becomes measurable and keeps growing with the size of the input.
Here is the code output preview:
| order_month | monthly_sales | rolling_average |
|---|---|---|
| 1 | 145 | 145 |
| 2 | 250 | 198 |
| 3 | 315 | 237 |
| … | … | … |
| 12 | 710 | 402 |
# Conclusion
Across all three problems, the Polars solutions follow the same pattern: build a lazy query plan, push as much computation as possible into the optimizer, and call .collect() only when you need a concrete result.
The syntax takes some adjustment if you, like most analysts, have years of Pandas habits, but the operations align closely. .groupby() becomes .group_by(), .rename() takes a plain dict instead of a columns= keyword, and ranking becomes a sort followed by .with_row_count().
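As a tiny side-by-side sketch of those translations, on a throwaway frame:
import pandas as pd
import polars as pl
pdf = pd.DataFrame({"from_user": ["a", "a", "b"]})
plf = pl.DataFrame({"from_user": ["a", "a", "b"]})
pdf.rename(columns={"from_user": "user_id"}).groupby("user_id").size()    # Pandas spelling
plf.rename({"from_user": "user_id"}).group_by("user_id").agg(pl.count())  # Polars spelling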
The true difference shows at scale. On small datasets, both libraries return results fast enough that the difference is not noticeable. As row counts reach the millions, Polars’ Rust-level parallelism and single-pass algorithms significantly outperform Pandas. If you’re running into performance issues with Pandas, these three challenges are a good starting point for a migration.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
