# Introduction
If you’ve been working with data in Python, you’ve almost certainly used pandas. It’s been the go-to library for data manipulation for over a decade. But recently, Polars has been gaining serious traction. Polars promises to be faster, more memory-efficient, and more intuitive than pandas. But is it worth learning? And how different is it really?
In this article, we’ll compare pandas and Polars side-by-side. You’ll see performance benchmarks, and learn the syntax differences. By the end, you’ll be able to make an informed decision for your next data project.
You can find the code on GitHub.
# Getting Started
Let’s get both libraries installed first:
```
pip install pandas polars
```
Note: This article uses pandas 2.2.2 and Polars 1.31.0.
For this comparison, we’ll also use a dataset that’s large enough to show real performance differences. We’ll generate it with the Faker library, so install that too with `pip install faker`.
Now we’re ready to start coding.
# Measuring Speed By Reading Large CSV Files
Let’s start with one of the most common operations: reading a CSV file. We’ll create a dataset with 1 million rows to see real performance differences.
First, let’s generate our sample data:
```python
import pandas as pd
from faker import Faker
import random

# Generate a large CSV file for testing
fake = Faker()
Faker.seed(42)
random.seed(42)

data = {
    'user_id': range(1000000),
    'name': [fake.name() for _ in range(1000000)],
    'email': [fake.email() for _ in range(1000000)],
    'age': [random.randint(18, 80) for _ in range(1000000)],
    'salary': [random.randint(30000, 150000) for _ in range(1000000)],
    'department': [random.choice(['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'])
                   for _ in range(1000000)]
}

df_temp = pd.DataFrame(data)
df_temp.to_csv('large_dataset.csv', index=False)
print("✓ Generated large_dataset.csv with 1M rows")
```
This code creates a CSV file with realistic data. Now let’s compare reading speeds:
```python
import pandas as pd
import polars as pl
import time

# pandas: Read CSV
start = time.time()
df_pandas = pd.read_csv('large_dataset.csv')
pandas_time = time.time() - start

# Polars: Read CSV
start = time.time()
df_polars = pl.read_csv('large_dataset.csv')
polars_time = time.time() - start

print(f"Pandas read time: {pandas_time:.2f} seconds")
print(f"Polars read time: {polars_time:.2f} seconds")
print(f"Polars is {pandas_time/polars_time:.1f}x faster")
```
Output when reading the sample CSV:
```
Pandas read time: 1.92 seconds
Polars read time: 0.23 seconds
Polars is 8.2x faster
```
Here’s what’s happening: we time how long each library takes to read the same CSV file, then compute the speedup factor. By default, pandas uses its traditional single-threaded CSV reader, while Polars automatically parallelizes reading across multiple CPU cores.
On most machines, Polars reads CSVs several times faster than pandas, and the gap widens with larger files.
# Measuring Memory Usage During Operations
Speed isn’t the only consideration. Let’s see how much memory each library uses by performing a series of operations and measuring memory consumption. Install psutil with `pip install psutil` if it isn’t already in your environment:
```python
import pandas as pd
import polars as pl
import psutil
import os
import gc  # Garbage collector, for better memory release between tests

def get_memory_usage():
    """Get current process memory usage in MB"""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

# --- Test with pandas ---
gc.collect()
initial_memory_pandas = get_memory_usage()
df_pandas = pd.read_csv('large_dataset.csv')
filtered_pandas = df_pandas[df_pandas['age'] > 30]
grouped_pandas = filtered_pandas.groupby('department')['salary'].mean()
pandas_memory = get_memory_usage() - initial_memory_pandas
print(f"Pandas memory delta: {pandas_memory:.1f} MB")

del df_pandas, filtered_pandas, grouped_pandas
gc.collect()

# --- Test with Polars (eager mode) ---
gc.collect()
initial_memory_polars = get_memory_usage()
df_polars = pl.read_csv('large_dataset.csv')
filtered_polars = df_polars.filter(pl.col('age') > 30)
grouped_polars = filtered_polars.group_by('department').agg(pl.col('salary').mean())
polars_memory = get_memory_usage() - initial_memory_polars
print(f"Polars memory delta: {polars_memory:.1f} MB")

del df_polars, filtered_polars, grouped_polars
gc.collect()

# --- Summary ---
if pandas_memory > 0 and polars_memory > 0:
    print(f"Memory savings (Polars vs Pandas): {(1 - polars_memory/pandas_memory) * 100:.1f}%")
elif pandas_memory == 0 and polars_memory > 0:
    print(f"Polars used {polars_memory:.1f} MB while Pandas used 0 MB.")
elif polars_memory == 0 and pandas_memory > 0:
    print(f"Polars used 0 MB while Pandas used {pandas_memory:.1f} MB.")
else:
    print("Cannot compute memory savings due to zero or negative memory usage delta in both frameworks.")
```
This code measures the memory footprint:
- We use the psutil library to track memory usage before and after operations
- Both libraries read the same file and perform filtering and grouping
- We calculate the difference in memory consumption
Sample output:
```
Pandas memory delta: 44.4 MB
Polars memory delta: 1.3 MB
Memory savings (Polars vs Pandas): 97.1%
```
The results above show the memory usage delta for both pandas and Polars when performing filtering and aggregation operations on the large_dataset.csv.
- pandas memory delta: Indicates the memory consumed by pandas for the operations.
- Polars memory delta: Indicates the memory consumed by Polars for the same operations.
- Memory savings (Polars vs pandas): This metric provides a percentage of how much less memory Polars used compared to pandas.
It’s common for Polars to show better memory efficiency thanks to its columnar data storage and optimized execution engine. Savings vary by workload: 30% to 70% is typical, though simple pipelines like this one can show even larger deltas.
Note: Sequential memory measurements within the same Python process using `psutil.Process(...).memory_info().rss` can be misleading. Python’s memory allocator doesn’t always release memory back to the operating system immediately, so a “cleaned” baseline for a subsequent test may still be influenced by prior operations. For the most accurate comparisons, run each test in a separate, isolated Python process.
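Here is a minimal sketch of that isolated-process approach, assuming psutil is installed (the file name `mem_demo.csv` is just for illustration):

```python
import subprocess
import sys

import pandas as pd

# A small file so the example is self-contained
pd.DataFrame({'x': range(10000)}).to_csv('mem_demo.csv', index=False)

# Run the measurement in a fresh interpreter so this process's
# allocator state can't skew the baseline
snippet = """
import os, psutil
import pandas as pd
p = psutil.Process(os.getpid())
before = p.memory_info().rss
df = pd.read_csv('mem_demo.csv')
after = p.memory_info().rss
print((after - before) / 1024 / 1024)
"""
out = subprocess.run([sys.executable, "-c", snippet],
                     capture_output=True, text=True, check=True)
delta_mb = float(out.stdout)
print(f"Isolated pandas read delta: {delta_mb:.1f} MB")
```

Running the Polars half the same way, in its own subprocess, gives each library a clean baseline.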
# Comparing Syntax For Basic Operations
Now let’s look at how syntax differs between the two libraries. We’ll cover the most common operations you’ll use.
## Selecting Columns
Let’s select a subset of columns. We’ll create a much smaller DataFrame for this (and subsequent examples).
```python
import pandas as pd
import polars as pl

# Create sample data
data = {
    'name': ['Anna', 'Betty', 'Cathy'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
}

# pandas approach
df_pandas = pd.DataFrame(data)
result_pandas = df_pandas[['name', 'salary']]

# Polars approach
df_polars = pl.DataFrame(data)
result_polars = df_polars.select(['name', 'salary'])

# Alternative: more expressive
result_polars_alt = df_polars.select([pl.col('name'), pl.col('salary')])

print("Pandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)
```
The key differences here:
- pandas uses bracket notation: `df[['col1', 'col2']]`
- Polars uses the `.select()` method
- Polars also supports the more expressive `pl.col()` syntax, which becomes powerful for complex operations
Output:

```
Pandas result:
    name  salary
0   Anna   50000
1  Betty   60000
2  Cathy   70000

Polars result:
shape: (3, 2)
┌───────┬────────┐
│ name  ┆ salary │
│ ---   ┆ ---    │
│ str   ┆ i64    │
╞═══════╪════════╡
│ Anna  ┆ 50000  │
│ Betty ┆ 60000  │
│ Cathy ┆ 70000  │
└───────┴────────┘
```
Both produce the same output, but Polars’ syntax is more explicit about what you’re doing.
## Filtering Rows
Now let’s filter rows:
```python
# pandas: Filter rows where age > 28
filtered_pandas = df_pandas[df_pandas['age'] > 28]

# Alternative pandas syntax with query
filtered_pandas_alt = df_pandas.query('age > 28')

# Polars: Filter rows where age > 28
filtered_polars = df_polars.filter(pl.col('age') > 28)

print("Pandas filtered:")
print(filtered_pandas)
print("\nPolars filtered:")
print(filtered_polars)
```
Notice the differences:
- In pandas, we use boolean indexing with bracket notation. You can also use the `.query()` method.
- Polars uses the `.filter()` method with `pl.col()` expressions.
- Polars’ syntax reads more like SQL: “filter where column age is greater than 28”.
Output:

```
Pandas filtered:
    name  age  salary
1  Betty   30   60000
2  Cathy   35   70000

Polars filtered:
shape: (2, 3)
┌───────┬─────┬────────┐
│ name  ┆ age ┆ salary │
│ ---   ┆ --- ┆ ---    │
│ str   ┆ i64 ┆ i64    │
╞═══════╪═════╪════════╡
│ Betty ┆ 30  ┆ 60000  │
│ Cathy ┆ 35  ┆ 70000  │
└───────┴─────┴────────┘
```
## Adding New Columns
Now let’s add new columns to the DataFrame:
```python
# pandas: Add new columns
df_pandas['bonus'] = df_pandas['salary'] * 0.1
df_pandas['total_comp'] = df_pandas['salary'] + df_pandas['bonus']

# Polars: Add new columns
df_polars = df_polars.with_columns([
    (pl.col('salary') * 0.1).alias('bonus'),
    (pl.col('salary') * 1.1).alias('total_comp')
])

print("Pandas with new columns:")
print(df_pandas)
print("\nPolars with new columns:")
print(df_polars)
```
Output:

```
Pandas with new columns:
    name  age  salary   bonus  total_comp
0   Anna   25   50000  5000.0     55000.0
1  Betty   30   60000  6000.0     66000.0
2  Cathy   35   70000  7000.0     77000.0

Polars with new columns:
shape: (3, 5)
┌───────┬─────┬────────┬────────┬────────────┐
│ name  ┆ age ┆ salary ┆ bonus  ┆ total_comp │
│ ---   ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str   ┆ i64 ┆ i64    ┆ f64    ┆ f64        │
╞═══════╪═════╪════════╪════════╪════════════╡
│ Anna  ┆ 25  ┆ 50000  ┆ 5000.0 ┆ 55000.0    │
│ Betty ┆ 30  ┆ 60000  ┆ 6000.0 ┆ 66000.0    │
│ Cathy ┆ 35  ┆ 70000  ┆ 7000.0 ┆ 77000.0    │
└───────┴─────┴────────┴────────┴────────────┘
```
Here’s what is happening:
- pandas uses direct column assignment, which modifies the DataFrame in place
- Polars uses `.with_columns()` and returns a new DataFrame (immutable by default)
- In Polars, you use `.alias()` to name the new column
The Polars approach promotes immutability and makes data transformations more readable.
# Measuring Performance In Grouping And Aggregating
Let’s look at a more useful example: grouping data and calculating multiple aggregations. This code shows how we group data by department, calculate multiple statistics on different columns, and time both operations to see the performance difference:
```python
import time

# Load our large dataset
df_pandas = pd.read_csv('large_dataset.csv')
df_polars = pl.read_csv('large_dataset.csv')

# pandas: Group by department and calculate stats
start = time.time()
result_pandas = df_pandas.groupby('department').agg({
    'salary': ['mean', 'median', 'std'],
    'age': 'mean'
}).reset_index()
result_pandas.columns = ['department', 'avg_salary', 'median_salary', 'std_salary', 'avg_age']
pandas_time = time.time() - start

# Polars: Same operation
start = time.time()
result_polars = df_polars.group_by('department').agg([
    pl.col('salary').mean().alias('avg_salary'),
    pl.col('salary').median().alias('median_salary'),
    pl.col('salary').std().alias('std_salary'),
    pl.col('age').mean().alias('avg_age')
])
polars_time = time.time() - start

print(f"Pandas time: {pandas_time:.3f}s")
print(f"Polars time: {polars_time:.3f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x")
print("\nPandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)
```
Output:

```
Pandas time: 0.126s
Polars time: 0.077s
Speedup: 1.6x

Pandas result:
    department    avg_salary  median_salary    std_salary    avg_age
0  Engineering  89954.929266        89919.0  34595.585863  48.953405
1      Finance  89898.829762        89817.0  34648.373383  49.006690
2           HR  90080.629637        90177.0  34692.117761  48.979005
3    Marketing  90071.721095        90154.0  34625.095386  49.085454
4        Sales  89980.433386        90065.5  34634.974505  49.003168

Polars result:
shape: (5, 5)
┌─────────────┬──────────────┬───────────────┬──────────────┬───────────┐
│ department  ┆ avg_salary   ┆ median_salary ┆ std_salary   ┆ avg_age   │
│ ---         ┆ ---          ┆ ---           ┆ ---          ┆ ---       │
│ str         ┆ f64          ┆ f64           ┆ f64          ┆ f64       │
╞═════════════╪══════════════╪═══════════════╪══════════════╪═══════════╡
│ HR          ┆ 90080.629637 ┆ 90177.0       ┆ 34692.117761 ┆ 48.979005 │
│ Sales       ┆ 89980.433386 ┆ 90065.5       ┆ 34634.974505 ┆ 49.003168 │
│ Engineering ┆ 89954.929266 ┆ 89919.0       ┆ 34595.585863 ┆ 48.953405 │
│ Marketing   ┆ 90071.721095 ┆ 90154.0       ┆ 34625.095386 ┆ 49.085454 │
│ Finance     ┆ 89898.829762 ┆ 89817.0       ┆ 34648.373383 ┆ 49.00669  │
└─────────────┴──────────────┴───────────────┴──────────────┴───────────┘
```
Breaking down the syntax:
- pandas uses a dictionary to specify aggregations, which can be confusing with complex operations
- Polars uses method chaining: each operation is clear and named
The Polars syntax is more verbose but also more readable. You can immediately see what statistics are being calculated.
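For what it’s worth, pandas also supports named aggregation, which avoids the manual column-renaming step used above. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['HR', 'HR', 'Sales', 'Sales'],
    'salary': [50000, 60000, 70000, 80000],
    'age': [25, 35, 30, 40],
})

# Named aggregation: output_column = (input_column, function)
result = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
    avg_age=('age', 'mean'),
).reset_index()

print(result.columns.tolist())
# ['department', 'avg_salary', 'median_salary', 'avg_age']
```

This brings pandas closer to the Polars style: each output column is named where it is defined.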
# Understanding Lazy Evaluation In Polars
Lazy evaluation is one of Polars’ most helpful features: instead of executing each operation immediately, Polars plans the entire query and optimizes it before running anything.
Let’s see this in action:
```python
import polars as pl
import time

# Read in lazy mode
df_lazy = pl.scan_csv('large_dataset.csv')

# Build a complex query
result = (
    df_lazy
    .filter(pl.col('age') > 30)
    .filter(pl.col('salary') > 50000)
    .group_by('department')
    .agg([
        pl.col('salary').mean().alias('avg_salary'),
        pl.len().alias('employee_count')
    ])
    .filter(pl.col('employee_count') > 1000)
    .sort('avg_salary', descending=True)
)

# Nothing has been executed yet!
print("Query plan created, but not executed")

# Now execute the optimized query
start = time.time()
result_df = result.collect()  # This runs the query
execution_time = time.time() - start

print(f"\nExecution time: {execution_time:.3f}s")
print(result_df)
```
Output:

```
Query plan created, but not executed

Execution time: 0.177s
shape: (5, 3)
┌─────────────┬───────────────┬────────────────┐
│ department  ┆ avg_salary    ┆ employee_count │
│ ---         ┆ ---           ┆ ---            │
│ str         ┆ f64           ┆ u32            │
╞═════════════╪═══════════════╪════════════════╡
│ HR          ┆ 100101.595816 ┆ 132212         │
│ Marketing   ┆ 100054.012365 ┆ 132470         │
│ Sales       ┆ 100041.01049  ┆ 132035         │
│ Finance     ┆ 99956.527217  ┆ 132143         │
│ Engineering ┆ 99946.725458  ┆ 132384         │
└─────────────┴───────────────┴────────────────┘
```
Here, scan_csv() doesn’t load the file immediately; it only plans to read it. We chain multiple filters, groupings, and sorts. Polars analyzes the entire query and optimizes it. For example, it might filter before reading all data.
Only when we call .collect() does the actual computation happen. The optimized query runs much faster than executing each step separately.
# Wrapping Up
As we’ve seen, Polars is a compelling option for data processing in Python: faster, more memory-efficient, and with a clean, expressive API. That said, pandas isn’t going anywhere. It has over a decade of development behind it, a massive ecosystem, and millions of users. For many projects, pandas is still the right choice.
Learn Polars if you’re doing large-scale analysis or building data engineering pipelines. The syntax differences aren’t huge, and the performance gains are real. But keep pandas in your toolkit for compatibility and quick exploratory work.
Start by trying Polars on a side project or a data pipeline that’s running slowly. You’ll quickly get a feel for whether it’s right for your use case. Happy data wrangling!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
