
    Using Polars Instead of Pandas: Performance Deep Dive

    By gvfx00@gmail.com | May 12, 2026 | 12 min read


    [Image: Polars vs Pandas]

    Table of Contents

    • # Introduction
    • # Using rank() vs. with_row_count(): Activity Rank
        • // Data View
        • // Common Mistake
        • // Solutions
        • // Performance Comparison
    • # Using cumcount() + pivot() vs. over(): Finding User Purchases
        • // Data View
        • // Edge Case
        • // Solutions
        • // Performance Comparison
    • # Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average
        • // Data View
        • // Trade-Offs
        • // Solutions
        • // Performance Comparison
    • # Conclusion

    # Introduction

     
    Over the last decade, Pandas has been the foundation for data work in Python. For datasets that fit in memory, it is fast and familiar enough that switching libraries rarely crosses any programmer’s mind.

    However, once you start working with millions of rows, the flaws start to appear: groupby operations that take several seconds, intermediate copies that consume RAM, and window functions that run as Python-level loops rather than vectorized C or Rust code.

    Polars is a DataFrame library built in Rust on top of Apache Arrow. It was designed with parallelism and lazy evaluation as first-class features. Pandas executes each operation upfront and in sequence, whereas Polars can build up a query plan and optimize it prior to executing, with most operations executing concurrently across all available CPU cores automatically.

    In this article, we explore three real data problems using real questions from the StrataScratch coding platform. For each problem, we compare both libraries’ solutions and point out where the performance difference matters most.

     

    # Using rank() vs. with_row_count(): Activity Rank

     
    In this question, the goal is to find the email activity rank for each user based on the total number of emails sent. The user with the most emails gets rank 1. Results must be sorted by total emails in descending order, using alphabetical order as a tiebreaker, and each rank must be distinct, even if two users have the same email count.

     

    // Data View

    The google_gmail_emails table stores one row per email sent, with a sender ID (from_user), recipient ID (to_user), and the day the email was sent. Here is a preview of the table:

     

    | id  | from_user          | to_user            | day |
    |-----|--------------------|--------------------|-----|
    | 0   | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10  |
    | 1   | 6edf0be4b2267df1fa | 32ded68d89443e808  | 6   |
    | 2   | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10  |
    | 3   | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6   |
    | …   | …                  | …                  | …   |
    | 314 | e6088004caf0c8cc51 | e6088004caf0c8cc51 | 5   |

     

    Grain (what one output row means): one user, with their total email count and unique activity rank.

     

    // Common Mistake

    The question asks for a unique rank even when two users have the same email count. A common mistake is to use rank(method='dense') in Pandas, which assigns the same rank to tied users. The correct choice is method='first', which breaks ties by position in the frame. Since the groupby output is ordered alphabetically by user_id before ranking, the resulting ranks are unique and deterministic.

    The Polars optimal solution avoids the rank function entirely. After sorting by ["total_emails", "user_id"] in descending and ascending order, respectively, the .with_row_count("activity_rank", offset=1) clause assigns sequential integers starting from 1. No tie-breaking logic is needed because the sort already handled it.
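    To see the tie-breaking difference concretely, here is a minimal sketch on a toy Series with a tie (the toy values are assumed, not taken from the dataset):

```python
import pandas as pd

# Two users tied at 19 emails, one at 18
totals = pd.Series([19, 19, 18])

# 'dense' assigns the same rank to ties, violating the uniqueness requirement
print(totals.rank(method="dense", ascending=False).tolist())  # [1.0, 1.0, 2.0]

# 'first' breaks ties by row position, so every rank is distinct
print(totals.rank(method="first", ascending=False).tolist())  # [1.0, 2.0, 3.0]
```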

     

    // Solutions

    1. Pandas Solution

    We rename from_user to user_id, group by user, count emails, compute the first rank, and sort by email count in descending order, with alphabetical tie-breaking.

    import pandas as pd
    google_gmail_emails = google_gmail_emails.rename(columns={"from_user": "user_id"})
    # count emails per sender
    result = google_gmail_emails.groupby(
        ['user_id']).size().to_frame('total_emails').reset_index()
    # method='first' breaks ties by position; cast to int for a clean rank column
    result['activity_rank'] = result['total_emails'].rank(method='first', ascending=False).astype(int)
    result = result.sort_values(by=['total_emails', 'user_id'], ascending=[False, True])

     

    2. Polars Solution

    We use a lazy chain that renames, groups, sorts, and assigns row numbers in a single pass. Calling .collect() at the end materializes the result.

    import polars as pl
    google_gmail_emails = google_gmail_emails.rename({"from_user": "user_id"})
    result = (
        google_gmail_emails.lazy()
        .group_by("user_id")
        .agg(total_emails = pl.count())
        .sort(
            by=["total_emails", "user_id"],
            descending=[True, False]
        )
        .with_row_count("activity_rank", offset=1)
        .select([
            pl.col("user_id"),
            "total_emails",
            "activity_rank"
        ])
        .collect()
    )

     

    // Performance Comparison

     
    [Chart: Polars vs Pandas performance comparison]

    The Pandas solution iterates over the data twice after grouping: once to compute sizes and once to assign ranks. Internally, rank(method='first') allocates a rank array, resolves ties via argsort, and writes back — which is considerably more expensive than it looks for a single column. The Polars group_by function divides the workload across all available CPU cores, resulting in significantly faster aggregation for large tables. And since the .with_row_count() clause is a single O(n) sequential pass after sorting, it replaces the rank function with the cheapest possible operation. On a table containing millions of email records, the use of parallel aggregation without a rank function can result in a 5–10x improvement in wall-clock time compared to the Pandas approach.

    Here is the code output preview:

     

    | user_id            | total_emails | activity_rank |
    |--------------------|--------------|---------------|
    | 32ded68d89443e808  | 19           | 1             |
    | ef5fe98c6b9f313075 | 19           | 2             |
    | 5b8754928306a18b68 | 18           | 3             |
    | 55e60cfcc9dc49c17e | 16           | 4             |
    | 91f59516cb9dee1e88 | 16           | 5             |
    | …                  | …            | …             |
    | e6088004caf0c8cc51 | 6            | 25            |

     

    # Using cumcount() + pivot() vs. over(): Finding User Purchases

     
    In this question, we’re asked to identify returning active users: specifically, those who made a second purchase between 1 and 7 days after their first. Purchases made on the same day should not be included. The result is simply a list of qualifying user_id values.

     

    // Data View

    The amazon_transactions table has one row per purchase, with user_id, item, created_at date, and revenue.

    Here is a preview of the table:

     

    | id  | user_id | item    | created_at | revenue |
    |-----|---------|---------|------------|---------|
    | 1   | 109     | milk    | 2020-03-03 | 123     |
    | 2   | 139     | biscuit | 2020-03-18 | 421     |
    | 3   | 120     | milk    | 2020-03-18 | 176     |
    | …   | …       | …       | …          | …       |
    | 100 | 117     | bread   | 2020-03-10 | 209     |

     

    Grain (what one output row means): one user ID that made a qualifying return purchase within 7 days of their first.

     

    // Edge Case

    Same-day purchases should be ignored, meaning the gap between first and second purchase must exceed 0 days and be at most 7 days. A customer who buys twice on the same day does not qualify.

     

    // Solutions

    Both solutions find each user’s earliest purchase date and then filter for subsequent purchases within the 1- to 7-day timeframe. One thing to watch: if created_at has timestamps instead of plain dates, you need to truncate to the date before comparing. Otherwise, two purchases made at different times on the same day would incorrectly pass the strict inequality.
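    To illustrate the truncation pitfall, here is a small sketch with two hypothetical same-day timestamps:

```python
import pandas as pd

# Two purchases on the same day, hours apart (hypothetical timestamps)
a = pd.Timestamp("2020-03-03 09:00")
b = pd.Timestamp("2020-03-03 18:00")

print(b > a)                          # True: raw timestamps count this as a later purchase
print(b.normalize() > a.normalize())  # False: truncated to midnight, the dates are equal
```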

    1. Pandas Solution

    In Pandas, the solution involves isolating unique purchase dates per user, ranking them with cumcount(), pivoting to get first and second dates side by side, and computing the day difference.

    import pandas as pd
    # truncate timestamps to dates and drop same-day duplicates per user
    amazon_transactions["purchase_date"] = pd.to_datetime(amazon_transactions["created_at"]).dt.date
    daily = amazon_transactions[["user_id", "purchase_date"]].drop_duplicates()
    # number each user's distinct purchase dates chronologically
    ranked = daily.sort_values(["user_id", "purchase_date"])
    ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
    # pivot the first two dates side by side, one row per user
    first_two = (ranked[ranked["rn"] <= 2]
                 .pivot(index="user_id", columns="rn", values="purchase_date")
                 .reset_index()
                 .rename(columns={1: "first_date", 2: "second_date"}))
    first_two = first_two.dropna(subset=["second_date"])  # users with only one purchase date
    first_two["diff"] = (pd.to_datetime(first_two["second_date"]) - pd.to_datetime(first_two["first_date"])).dt.days
    result = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)][["user_id"]]
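    As a quick sanity check, the same pipeline on a tiny made-up table (user 1 returns after 3 days, user 2 only buys same-day, user 3 returns after 19 days) should keep only user 1:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3],
    "created_at": ["2020-03-01", "2020-03-04",   # 3-day gap: qualifies
                   "2020-03-01", "2020-03-01",   # same day only: excluded
                   "2020-03-01", "2020-03-20"],  # 19-day gap: excluded
})
toy["purchase_date"] = pd.to_datetime(toy["created_at"]).dt.date
daily = toy[["user_id", "purchase_date"]].drop_duplicates()
ranked = daily.sort_values(["user_id", "purchase_date"])
ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
first_two = (ranked[ranked["rn"] <= 2]
             .pivot(index="user_id", columns="rn", values="purchase_date")
             .reset_index()
             .rename(columns={1: "first_date", 2: "second_date"})
             .dropna(subset=["second_date"]))
first_two["diff"] = (pd.to_datetime(first_two["second_date"])
                     - pd.to_datetime(first_two["first_date"])).dt.days
qualifying = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)]["user_id"].tolist()
print(qualifying)  # [1]
```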

     

    2. Polars Solution

    The Polars solution involves computing the first purchase date per user as a window expression with .over("user_id"), filtering to purchases that fit the time window, and returning a deduplicated user_id list.

    import polars as pl
    # returning active users: 2nd purchase 1–7 days after the first (ignore same-day)
    returning_users = (
        amazon_transactions
        .lazy()
        # first purchase date per user, computed as a window expression
        .with_columns(
            pl.col("created_at").min().over("user_id").alias("first_purchase_date")
        )
        # keep transactions strictly 1-7 days after that first purchase
        .filter(
            (pl.col("created_at") > pl.col("first_purchase_date")) &
            (pl.col("created_at") <= pl.col("first_purchase_date") + pl.duration(days=7))
        )
        # distinct user list
        .select("user_id")
        .unique()
        .sort("user_id")
        # materialize the LazyFrame into a DataFrame
        .collect()
    )

     

    // Performance Comparison

     
    [Chart: Polars vs Pandas performance comparison]

    Notice the number of distinct DataFrame allocations in the Pandas solution: the deduplicated daily table, the sorted ranked table, the pivoted frame, the dropna result, and the filtered output. That is five separate objects, each copying data into a new memory block. On a large transactions table, the pivot step alone can significantly increase memory usage, as it reshapes the entire dataset into a wide format.

    The Polars lazy chain materializes nothing until .collect(). The .over("user_id") window expression computes each user’s earliest purchase date in one pass, the .filter() applies within the same optimized plan, and .unique() runs concurrently across CPU cores. There is no pivot, no intermediate sorted copy, and no separate date-casting step: Polars handles date arithmetic natively inside the expression engine. This approach consumes less memory and runs faster, even on moderately sized datasets.

    Here is the code output preview:

     

    | user_id |
    |---------|
    | 100     |
    | 103     |
    | 105     |
    | …       |
    | 143     |

     

    # Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average

     
    In this question, we’re asked to determine a cumulative average for monthly book sales throughout 2022. The average grows each month using all preceding months: February averages January and February, March averages all three, and so on. The output should include the month, that month’s total sales, and the cumulative average rounded to the nearest whole number.
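    Using the monthly totals from the output preview (145 for January, 250 for February), the February entry works out as:

```python
# cumulative average for February = mean of the January and February totals
jan, feb = 145, 250
print(round((jan + feb) / 2))  # 198
```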

     

    // Data View

    The amazon_books table has one row per book and its unit price. The book_orders table has one row per order, linking a book ID to a quantity and an order date. Here is a preview of the amazon_books table:

     

    | book_id | book_title               | unit_price |
    |---------|--------------------------|------------|
    | B001    | The Hunger Games         | 25         |
    | B002    | The Outsiders            | 50         |
    | B003    | To Kill a Mockingbird    | 100        |
    | …       | …                        | …          |
    | B020    | The Pillars of the Earth | 60         |

     

    The book_orders table has one row per book order, linking each order ID to an order date, book ID, and the quantity ordered:

     

    | order_id | order_date | book_id | quantity |
    |----------|------------|---------|----------|
    | 1001     | 2022-01-10 | B001    | 1        |
    | 1002     | 2022-01-10 | B009    | 1        |
    | 1003     | 2022-01-15 | B012    | 2        |
    | …        | …          | …       | …        |
    | 1084     | 2023-02-01 | B009    | 1        |

     

    Grain (what one output row means): one month in 2022, with total sales for that month and a cumulative average of all monthly sales up to and including that month.

     

    // Trade-Offs

    Using Pandas, .expanding().mean() is convenient, but it runs through the general window-aggregation machinery, which carries per-window overhead that a plain cumulative sum avoids. For a 12-row monthly summary, this cost is negligible. For daily or hourly data at scale (say, three years of hourly transactions), that overhead compounds row by row.
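    The two computations are numerically equivalent; here is a minimal check on three toy values:

```python
import pandas as pd
import numpy as np

s = pd.Series([145, 250, 315])
expanding = s.expanding().mean().to_numpy()
# same result as a single cumulative-sum pass divided by the running count
single_pass = np.cumsum(s.to_numpy()) / np.arange(1, len(s) + 1)
print(np.allclose(expanding, single_pass))  # True
```

Identical values, very different cost profiles at scale.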

    Polars’ cum_mean() runs a single pass in Rust and is inherently faster at scale. There is one catch: the question requires rounding to the nearest whole number, and Pandas uses banker’s rounding (round half to even) by default. The Polars solution uses NumPy’s cumsum with an explicit floor(x + 0.5) formula to enforce round-half-up behavior. If you need an exact match to the expected output, the NumPy method is more reliable than the built-in rounding in either library.
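    A quick demonstration of the two rounding conventions on half-values:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5])
print(np.round(halves))        # [0. 2. 2.] round half to even (banker's rounding)
print(np.floor(halves + 0.5))  # [1. 2. 3.] round half up
```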

     

    // Solutions

    1. Pandas Solution

    We merge books with orders, filter to 2022, aggregate monthly sales, and apply .expanding().mean() to compute the cumulative average.

    import pandas as pd
    merged = pd.merge(book_orders, amazon_books, on="book_id", how="inner")
    merged["order_date"] = pd.to_datetime(merged["order_date"])
    merged["order_month"] = merged["order_date"].dt.month
    merged["year"] = merged["order_date"].dt.year
    merged["sales"] = merged["unit_price"] * merged["quantity"]
    merged = merged.loc[merged["year"] == 2022]
    result = (
        merged.groupby("order_month")["sales"]
        .sum()
        .to_frame("monthly_sales")
        .sort_values(by="order_month")
        .reset_index()
    )
    # note: .round(0) rounds half to even; see the Trade-Offs section above
    result["rolling_average"] = result["monthly_sales"].expanding().mean().round(0).astype(int)
    result

     

    2. Polars: Building the Lazy Pipeline and Collecting

    We join the two tables inside a lazy chain, compute sales as unit_price * quantity, filter to 2022, aggregate by month, and call .collect() to switch to eager mode before the NumPy rolling step.

    import polars as pl
    import numpy as np
    # Step 1: Prepare monthly sales (LazyFrame)
    monthly_sales_lazy = (
        book_orders.lazy()
        .join(amazon_books.lazy(), on="book_id", how="inner")
        # cast first in its own step: expressions inside a single with_columns
        # all read from the input frame, so the year/month extraction below
        # must come after the cast, not beside it
        .with_columns(pl.col("order_date").cast(pl.Datetime))
        .with_columns([
            (pl.col("unit_price") * pl.col("quantity")).alias("sales"),
            pl.col("order_date").dt.year().alias("year"),
            pl.col("order_date").dt.month().alias("order_month")
        ])
        .filter(pl.col("year") == 2022)
        .group_by("order_month")
        .agg(pl.col("sales").sum().alias("monthly_sales"))
        .sort("order_month")
    )
    # Step 2: Switch to eager mode for the rolling computation
    monthly_sales = monthly_sales_lazy.collect()

     

    3. Computing the Rolling Average and Finalizing

    With the monthly sales as a NumPy array, we apply round-half-up rounding, add the result back to the Polars DataFrame, and select the output columns.

    # Step 3: Rolling average with round-half-up
    sales_np = monthly_sales["monthly_sales"].to_numpy()
    cumsum = np.cumsum(sales_np)
    rolling_avg = np.floor(cumsum / np.arange(1, len(cumsum)+1) + 0.5).astype(int)
    # Step 4: Add back to Polars DataFrame
    monthly_sales = monthly_sales.with_columns([
        pl.Series("rolling_average", rolling_avg)
    ])
    # Step 5: Final result with correct column names
    result = monthly_sales.select(["order_month", "monthly_sales", "rolling_average"])

     

    // Performance Comparison

     
    [Chart: Polars vs Pandas performance comparison]

    This question has two operations that affect performance the most: the join and the cumulative window. In Pandas, pd.merge joins all rows from both tables before filtering for 2022, so every year’s worth of orders is processed before rows outside the target period are discarded. Polars builds a lazy query plan and pushes the filter(year == 2022) condition below the join, so it joins a smaller dataset from the start. That predicate pushdown happens automatically, with no extra code required.
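    In Pandas you can imitate that pushdown by hand: filter to 2022 before merging. Here is a sketch on toy stand-ins for the two tables (the values are assumed for illustration):

```python
import pandas as pd

# toy stand-ins for the book_orders / amazon_books tables
book_orders = pd.DataFrame({
    "order_date": ["2021-12-30", "2022-01-10", "2023-02-01"],
    "book_id":    ["B001", "B001", "B009"],
    "quantity":   [1, 2, 1],
})
amazon_books = pd.DataFrame({"book_id": ["B001", "B009"], "unit_price": [25, 40]})

# filter first, then join: the hand-rolled equivalent of predicate pushdown
orders_2022 = book_orders[pd.to_datetime(book_orders["order_date"]).dt.year == 2022]
merged = orders_2022.merge(amazon_books, on="book_id", how="inner")
print(len(merged))  # 1 (only the 2022 order reaches the join)
```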

    The larger gap is the rolling average. Pandas’ .expanding().mean() goes through the general window-aggregation machinery, growing its window one row at a time, while Polars’ cum_mean() computes the whole column in a single native pass with no per-window overhead. The difference is imperceptible on 12 monthly rows, but on daily or hourly data spanning several years, the single-pass approach pulls measurably ahead.

    Here is the code output preview:

     

    | order_month | monthly_sales | rolling_average |
    |-------------|---------------|-----------------|
    | 1           | 145           | 145             |
    | 2           | 250           | 198             |
    | 3           | 315           | 237             |
    | …           | …             | …               |
    | 12          | 710           | 402             |

     

     

    # Conclusion

     
    Across all three problems, the Polars solutions follow the same pattern: build a lazy query plan, push as much computation as possible into the optimizer, and call .collect() only when you need a concrete result.

    The syntax takes some adjustment if you, like most analysts, have years of Pandas habits, but the operations align closely. .groupby() becomes .group_by(), .rename() takes a plain dict instead of a columns= keyword, and ranking becomes a sort followed by .with_row_count().

    The true difference shows at scale. On small datasets, both libraries return results fast enough that the gap is not noticeable. As row counts reach the millions, Polars’ Rust-level parallelism and single-pass algorithms pull decisively ahead. If you’re hitting performance walls with Pandas, these three challenges are a great starting point for migration.

    Nate Rosidi is a data scientist working in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


