    Working with Billion-Row Datasets in Python (Using Vaex)

By Shittu Olumide | February 3, 2026

     

    Table of Contents

    • # Introduction
    • # What Is Vaex?
    • # Comparing Vaex And Dask
    • # Why Traditional Tools Struggle
    • # How Vaex Works Under The Hood
        • // Out-of-Core Execution
        • // Lazy Evaluation
        • // Virtual Columns
    • # Getting Started With Vaex
        • // Installing Vaex
        • // Opening Large Datasets
        • // Core Operations In Vaex
        • // Demonstrating With A Taxi Dataset
        • // Vaex vs. Pandas Performance
        • // Advanced Vaex Features
        • // Data Export
    • # Concluding Thoughts

    # Introduction

     
Handling massive datasets containing billions of rows is a major challenge in data science and analytics. Traditional tools like Pandas work well for small to medium datasets that fit in system memory, but as datasets grow they slow down, consume large amounts of random access memory (RAM), and often crash with out-of-memory (OOM) errors.

This is where Vaex, a high-performance Python library for out-of-core data processing, comes in. Vaex lets you inspect, transform, visualize, and analyze large tabular datasets efficiently and with minimal memory use, even on a standard laptop.

     

    # What Is Vaex?

     
    Vaex is a Python library for lazy, out-of-core DataFrames (similar to Pandas) designed for data larger than your RAM.

    Key characteristics:

    • Out-of-core processing: Vaex works directly with data on disk and reads only the portions needed, so entire files are never loaded into memory.
    • Lazy evaluation: operations are computed only when results are actually requested, and columnar file formats (which store data by column instead of by row) such as HDF5, Apache Arrow, and Parquet open instantly via memory mapping.
    • Speed: built on optimized C/C++ backends, Vaex can compute statistics and perform operations on billions of rows per second, making large-scale analysis fast even on modest hardware.
    • Familiar interface: the Pandas-like application programming interface (API) smooths the transition for users already familiar with Pandas, giving them big data capabilities without a steep learning curve.
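
    For example (a minimal sketch using Vaex's small built-in demo dataset), the API reads much like Pandas:

    import vaex

    df = vaex.example()              # small demo dataset that ships with Vaex
    print(df[df.x > 0].y.mean())     # pandas-style filtering and aggregation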

     

    # Comparing Vaex And Dask

     
Vaex is not comparable to Dask as a whole, but rather to Dask DataFrames, which are built on top of Pandas DataFrames. This means Dask inherits some Pandas limitations, such as needing to load data fully into RAM for certain operations. This is not the case for Vaex: it does not make a DataFrame copy, so it can process larger DataFrames on machines with less main memory. Both libraries use lazy processing. The primary difference is that Vaex computes a result automatically when it is needed, whereas Dask requires an explicit call to the compute() function. Note that data needs to be in HDF5 or Apache Arrow format to take full advantage of Vaex.
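
    A minimal sketch of that difference, assuming a Parquet file data.parquet with a sales column (both names are hypothetical):

    import dask.dataframe as dd
    import vaex

    ddf = dd.read_parquet('data.parquet')
    dask_mean = ddf['sales'].mean().compute()   # Dask: nothing runs until .compute()

    vdf = vaex.open('data.parquet')
    vaex_mean = vdf['sales'].mean()             # Vaex: evaluation happens implicitly here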

     

    # Why Traditional Tools Struggle

     
    Tools like Pandas load the entire dataset into RAM before processing. For datasets larger than memory, this leads to:

    • Slow performance
    • System crashes (OOM errors)
    • Limited interactivity

    Vaex never loads the entire dataset into memory; instead, it:

    • Streams data from disk
    • Uses virtual columns and lazy evaluation to delay computation
    • Only materializes results when explicitly needed

    This enables analysis of large datasets even on modest hardware.
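
    Vaex also exposes this explicitly through its delayed-evaluation API, which batches several computations into a single pass over the data (a sketch using the built-in demo dataset):

    import vaex

    df = vaex.example()                    # small built-in demo dataset
    mean_x = df.mean(df.x, delay=True)     # returns a promise; no pass over the data yet
    mean_y = df.mean(df.y, delay=True)
    df.execute()                           # a single pass computes both statistics
    print(mean_x.get(), mean_y.get())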

     

    # How Vaex Works Under The Hood

     

    // Out-of-Core Execution

    Vaex reads data from disk as needed using memory mapping. This allows it to operate on data files much larger than RAM can hold.

     

    // Lazy Evaluation

    Instead of performing each operation immediately, Vaex builds a computation graph. Calculations are only executed when you request a result (e.g. when printing or plotting).
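
    A minimal sketch of this behavior, again using the built-in demo dataset:

    import vaex

    df = vaex.example()        # small built-in demo dataset
    expr = df.x + df.y         # builds an expression; nothing is computed yet
    print(expr.mean())         # the scan over the data happens only here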

     

    // Virtual Columns

    Virtual columns are expressions defined on the dataset that do not occupy memory until computed. This saves RAM and speeds up workflows.

     

    # Getting Started With Vaex

     

    // Installing Vaex

    Create a clean virtual environment:

    conda create -n vaex_demo python=3.9
    conda activate vaex_demo

     

    Install Vaex with pip:

    pip install vaex-core vaex-hdf5 vaex-viz

     

    Upgrade Vaex:

    pip install --upgrade vaex

     

    Install supporting libraries:

    pip install pandas numpy matplotlib

     

     

    // Opening Large Datasets

    Vaex supports various popular storage formats for handling large datasets. It can work directly with HDF5, Apache Arrow, and Parquet files, all of which are optimized for efficient disk access and fast analytics. While Vaex can also read CSV files, it first needs to convert them to a more efficient format to improve performance when working with large datasets.
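
    For example, a large CSV can be converted once and then reopened as HDF5 (a sketch; the file name is hypothetical and chunk_size controls how many rows are read per chunk):

    import vaex

    # convert=True writes an HDF5 copy next to the CSV and returns the converted DataFrame
    df = vaex.from_csv('your_huge_dataset.csv', convert=True, chunk_size=5_000_000)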

    How to open a Parquet file:

    import vaex
    
    df = vaex.open("your_huge_dataset.parquet")
    print(df)

     

    Now you can inspect the dataset structure without loading it into memory.
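
    For example, these calls touch only file metadata or a handful of rows:

    print(df.column_names)   # column names, read from metadata
    print(df.dtypes)         # column types, also without scanning the data
    print(df.head(5))        # reads only the first few rows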

     

    // Core Operations In Vaex

    Filtering data:

    filtered = df[df.sales > 1000]

     

    This does not compute the result immediately; instead, the filter is registered and applied only when needed.
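
    Requesting a concrete result is what triggers the scan, for example:

    n_filtered = len(filtered)      # forces the filter mask to be evaluated
    print(filtered.sales.mean())    # aggregations on the filtered view also trigger a pass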

    Group-by and aggregations:

    result = df.groupby("category", agg=vaex.agg.mean("sales"))
    print(result)

     

    Vaex computes aggregations efficiently using parallel algorithms and minimal memory.

    Computing statistics:

    mean_price = df["price"].mean()
    print(mean_price)

     

    Vaex computes this on the fly by scanning the dataset in chunks.
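
    Other single-pass statistics are available as well; for instance, minmax returns the minimum and maximum in one scan:

    lo, hi = df.minmax('price')     # both extremes computed in a single pass
    print(lo, hi)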

     

    // Demonstrating With A Taxi Dataset

We will create a realistic, 50-million-row taxi dataset to demonstrate Vaex’s capabilities:

    import vaex
    import numpy as np
    import pandas as pd
    import time

     

    Set random seed for reproducibility:

    np.random.seed(42)
    print("Creating 50 million row dataset...")
    n = 50_000_000

     

    Generate realistic taxi trip data:

    data = {
        'passenger_count': np.random.randint(1, 7, n),
        'trip_distance': np.random.exponential(3, n),
        'fare_amount': np.random.gamma(10, 1.5, n),
        'tip_amount': np.random.gamma(2, 1, n),
        'total_amount': np.random.gamma(12, 1.8, n),
        'payment_type': np.random.choice(['credit', 'cash', 'mobile'], n),
        'pickup_hour': np.random.randint(0, 24, n),
        'pickup_day': np.random.randint(1, 8, n),
    }

     

    Create Vaex DataFrame:

    df_vaex = vaex.from_dict(data)

     

    Export to HDF5 format (efficient for Vaex):

df_vaex.export_hdf5('taxi_50M.hdf5')
print(f"Shape: {df_vaex.shape}")
print(f"Created dataset with {n:,} rows")

     

    Output:

    Shape: (50000000, 8)
    Created dataset with 50,000,000 rows

     

We now have a 50-million-row dataset with 8 columns.

     

    // Vaex vs. Pandas Performance

    Opening large files with Vaex memory-mapped opening:

    start = time.time()
    df_vaex = vaex.open('taxi_50M.hdf5')
    vaex_time = time.time() - start
    
    print(f"Vaex opened {df_vaex.shape[0]:,} rows in {vaex_time:.4f} seconds")
    print(f"Memory usage: ~0 MB (memory-mapped)")

     

    Output:

    Vaex opened 50,000,000 rows in 0.0199 seconds
    Memory usage: ~0 MB (memory-mapped)

     

    Pandas: Load into memory (do not try this with 50M rows!):

    # This would fail on most machines
    df_pandas = pd.read_hdf('taxi_50M.hdf5')

     

This will result in a memory error! (In addition, pandas.read_hdf expects PyTables-formatted files, so it may fail on Vaex's HDF5 layout before memory even becomes the bottleneck.) Vaex, by contrast, opens files almost instantly, regardless of size, because it does not load data into memory.

    Basic aggregations: Calculate statistics on 50 million rows:

    start = time.time()
    stats = {
        'mean_fare': df_vaex.fare_amount.mean(),
        'mean_distance': df_vaex.trip_distance.mean(),
        'total_revenue': df_vaex.total_amount.sum(),
        'max_fare': df_vaex.fare_amount.max(),
        'min_fare': df_vaex.fare_amount.min(),
    }
    agg_time = time.time() - start
    
    print(f"\nComputed 5 aggregations in {agg_time:.4f} seconds:")
    print(f"  Mean fare: ${stats['mean_fare']:.2f}")
    print(f"  Mean distance: {stats['mean_distance']:.2f} miles")
    print(f"  Total revenue: ${stats['total_revenue']:,.2f}")
    print(f"  Fare range: ${stats['min_fare']:.2f} - ${stats['max_fare']:.2f}")

     

    Output:

    Computed 5 aggregations in 0.8771 seconds:
      Mean fare: $15.00
      Mean distance: 3.00 miles
      Total revenue: $1,080,035,827.27
      Fare range: $1.25 - $55.30

     

    Filtering operations: Filter long trips:

    start = time.time()
    long_trips = df_vaex[df_vaex.trip_distance > 10]
    filter_time = time.time() - start
    
    print(f"\nFiltered for trips > 10 miles in {filter_time:.4f} seconds")
    print(f"  Found: {len(long_trips):,} long trips")
    print(f"  Percentage: {(len(long_trips)/len(df_vaex)*100):.2f}%")

     

    Output:

    Filtered for trips > 10 miles in 0.0486 seconds
    Found: 1,784,122 long trips
    Percentage: 3.57%

     

    Multiple conditions:

    start = time.time()
    premium_trips = df_vaex[(df_vaex.trip_distance > 5) & 
                            (df_vaex.fare_amount > 20) & 
                            (df_vaex.payment_type == 'credit')]
    multi_filter_time = time.time() - start
    
    print(f"\nMultiple condition filter in {multi_filter_time:.4f} seconds")
    print(f"  Premium trips (>5mi, >$20, credit): {len(premium_trips):,}")

     

    Output:

    Multiple condition filter in 0.0582 seconds
    Premium trips (>5mi, >$20, credit): 457,191

     

    Group-by operations:

    start = time.time()
    by_payment = df_vaex.groupby('payment_type', agg={
        'mean_fare': vaex.agg.mean('fare_amount'),
        'mean_tip': vaex.agg.mean('tip_amount'),
        'total_trips': vaex.agg.count(),
        'total_revenue': vaex.agg.sum('total_amount')
    })
    groupby_time = time.time() - start
    
    print(f"\nGroupBy operation in {groupby_time:.4f} seconds")
    print(by_payment.to_pandas_df())

     

    Output:

    GroupBy operation in 5.6362 seconds
      payment_type  mean_fare  mean_tip  total_trips  total_revenue
    0       credit  15.001817  2.000065     16663623   3.599456e+08
    1       mobile  15.001200  1.999679     16667691   3.600165e+08
    2         cash  14.999397  2.000115     16668686   3.600737e+08

     

    More complex group-by:

    start = time.time()
    by_hour = df_vaex.groupby('pickup_hour', agg={
        'avg_distance': vaex.agg.mean('trip_distance'),
        'avg_fare': vaex.agg.mean('fare_amount'),
        'trip_count': vaex.agg.count()
    })
    complex_groupby_time = time.time() - start
    
    print(f"\nGroupBy by hour in {complex_groupby_time:.4f} seconds")
    print(by_hour.to_pandas_df().head(10))

     

    Output:

    GroupBy by hour in 1.6910 seconds
       pickup_hour  avg_distance   avg_fare  trip_count
    0            0      2.998120  14.997462     2083481
    1            1      3.000969  14.998814     2084650
    2            2      3.003834  15.001777     2081962
    3            3      3.001263  14.998196     2081715
    4            4      2.998343  14.999593     2083882
    5            5      2.997586  15.003988     2083421
    6            6      2.999887  15.011615     2083213
    7            7      3.000240  14.996892     2085156
    8            8      3.002640  15.000326     2082704
    9            9      2.999857  14.997857     2082284

     

    // Advanced Vaex Features

    Virtual columns (computed columns) allow adding columns with no data copying:

    df_vaex['tip_percentage'] = (df_vaex.tip_amount / df_vaex.fare_amount) * 100
    df_vaex['is_generous_tipper'] = df_vaex.tip_percentage > 20
    df_vaex['rush_hour'] = (df_vaex.pickup_hour >= 7) & (df_vaex.pickup_hour <= 9) | \
                            (df_vaex.pickup_hour >= 17) & (df_vaex.pickup_hour <= 19)

     

    These are computed on the fly with no memory overhead:

    print("Added 3 virtual columns with zero memory overhead")
    generous_tippers = df_vaex[df_vaex.is_generous_tipper]
    print(f"Generous tippers (>20% tip): {len(generous_tippers):,}")
    
    rush_hour_trips = df_vaex[df_vaex.rush_hour]
    print(f"Rush hour trips: {len(rush_hour_trips):,}")

     

    Output:

    Added 3 virtual columns with zero memory overhead
    Generous tippers (>20% tip): 11,997,433
    Rush hour trips: 12,498,848
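
    If a virtual column is reused heavily, it can be materialized into a real in-memory column, trading RAM for repeated-computation speed:

    df_vaex = df_vaex.materialize('tip_percentage')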

     

Correlation analysis (our columns were generated independently, so the correlation should be near zero):

    corr = df_vaex.correlation(df_vaex.trip_distance, df_vaex.fare_amount)
    print(f"Correlation (distance vs fare): {corr:.4f}")

     

Percentiles (note: percentile_approx uses a binned approximation, so values can deviate from exact quantiles or come back as nan when the default bin limits do not cover the data well):

# percentile_approx computes approximate percentiles with a binned algorithm;
# percentages are given on a 0-100 scale
percentiles = df_vaex.percentile_approx('fare_amount', [25, 50, 75, 90, 95, 99])
    
    print(f"\nFare percentiles:")
    print(f"25th: ${percentiles[0]:.2f}")
    print(f"50th (median): ${percentiles[1]:.2f}")
    print(f"75th: ${percentiles[2]:.2f}")
    print(f"90th: ${percentiles[3]:.2f}")
    print(f"95th: ${percentiles[4]:.2f}")
    print(f"99th: ${percentiles[5]:.2f}")

     

    Standard deviation:

    std_fare = df_vaex.fare_amount.std()
    print(f"\nStandard deviation of fares: ${std_fare:.2f}")

     

    Additional useful statistics:

    print(f"\nAdditional statistics:")
    print(f"Mean: ${df_vaex.fare_amount.mean():.2f}")
    print(f"Min: ${df_vaex.fare_amount.min():.2f}")
    print(f"Max: ${df_vaex.fare_amount.max():.2f}")

     

    Output:

    Correlation (distance vs fare): -0.0001
    
    Fare percentiles:
      25th: $11.57
      50th (median): $nan
      75th: $nan
      90th: $nan
      95th: $nan
      99th: $nan
    
    Standard deviation of fares: $4.74
    
    Additional statistics:
      Mean: $15.00
      Min: $1.25
      Max: $55.30

     

     

    // Data Export

    # Export filtered data
    high_value_trips = df_vaex[df_vaex.total_amount > 50]

     

    Exporting to different formats:

    start = time.time()
    high_value_trips.export_hdf5('high_value_trips.hdf5')
    export_time = time.time() - start
    print(f"Exported {len(high_value_trips):,} rows to HDF5 in {export_time:.4f}s")

     

    You can also export to CSV, Parquet, etc.:

    high_value_trips.export_csv('high_value_trips.csv')
    high_value_trips.export_parquet('high_value_trips.parquet')
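
    Exports also accept progress=True to display a progress bar, which is handy for large files:

    high_value_trips.export_parquet('high_value_trips.parquet', progress=True)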

     

    Output:

    Exported 13,054 rows to HDF5 in 5.4508s

     

// Performance Summary Dashboard

    print("VAEX PERFORMANCE SUMMARY")
    print(f"Dataset size:           {n:,} rows")
    print(f"File size on disk:      ~2.4 GB")
    print(f"RAM usage:              ~0 MB (memory-mapped)")
    print()
    print(f"Open time:              {vaex_time:.4f} seconds")
    print(f"Single aggregation:     {agg_time:.4f} seconds")
    print(f"Simple filter:          {filter_time:.4f} seconds")
    print(f"Complex filter:         {multi_filter_time:.4f} seconds")
    print(f"GroupBy operation:      {groupby_time:.4f} seconds")
    print()
    print(f"Throughput:             ~{n/groupby_time:,.0f} rows/second")

     

    Output:

    VAEX PERFORMANCE SUMMARY
    Dataset size:           50,000,000 rows
    File size on disk:      ~2.4 GB
    RAM usage:              ~0 MB (memory-mapped)
    
    Open time:              0.0199 seconds
    Single aggregation:     0.8771 seconds
    Simple filter:          0.0486 seconds
    Complex filter:         0.0582 seconds
    GroupBy operation:      5.6362 seconds
    
    Throughput:             ~8,871,262 rows/second

     

     

    # Concluding Thoughts

     
Vaex is ideal when you are working with datasets larger than about 1 GB that do not fit in RAM, exploring big data interactively, performing feature engineering on millions of rows, or building data preprocessing pipelines.

    You should not use Vaex for datasets smaller than 100MB. For these, using Pandas is simpler. If you are dealing with complex joins across multiple tables, using structured query language (SQL) databases may be better. When you need the full Pandas API, note that Vaex has limited compatibility. For real-time streaming data, other tools are more appropriate.

    Vaex fills a gap in the Python data science ecosystem: the ability to work on billion-row datasets efficiently and interactively without loading everything into memory. Its out-of-core architecture, lazy execution model, and optimized algorithms make it a powerful tool for big data exploration even on a laptop. Whether you are exploring massive logs, scientific surveys, or high-frequency time series, Vaex helps bridge the gap between ease of use and big data scalability.
     
     

    Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.


