Close Menu

    Subscribe to Updates

    Get the latest news from tastytech.

    What's Hot

    US lawmakers press Israel to let cancer patients out of Gaza for treatment | Gaza News

    June 11, 2026

    Feature Stores from Scratch: A Minimal Working Implementation

    June 11, 2026

    High-severity vulnerability in Linux caused by a single faulty character

    June 11, 2026
    Facebook X (Twitter) Instagram
    Facebook X (Twitter) Instagram
    tastytech.intastytech.in
    Subscribe
    • AI News & Trends
    • Tech News
    • AI Tools
    • Business & Startups
    • Guides & Tutorials
    • Tech Reviews
    • Automobiles
    • Gaming
    • movies
    tastytech.intastytech.in
    Home»Business & Startups»Feature Stores from Scratch: A Minimal Working Implementation
    Feature Stores from Scratch: A Minimal Working Implementation
    Business & Startups

    Feature Stores from Scratch: A Minimal Working Implementation

    gvfx00@gmail.comBy gvfx00@gmail.comJune 11, 2026No Comments9 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email


    Feature Stores
     

    Table of Contents

    Toggle
    • # Introduction
    • # What a Feature Store Actually Solves
    • # The Five Components
    • # Running Example: A Personalized LLM Recommender
        • // 1. Defining the Feature Registry
        • // 2. Building the Offline Store with DuckDB and Parquet
        • // 3. Setting Up the Online Store on Redis
        • // 4. Running the Materialization Pipeline
        • // 5. Exposing the FastAPI Retrieval Service
    • # Where the Feature Store Ends and the Vector Database Begins
    • # Common Anti-Patterns
    • # Comparing This to Feast, Tecton, and Databricks
    • # Conclusion
      • Related posts:
    • 5 Powerful Python Decorators for High-Performance Data Pipelines
    • Data Engineer Roadmap 2026: 6-Month Learning Plan
    • 7 Steps to Mastering Data Storytelling for Business Impact

    # Introduction

     
    Most teams discover they need a feature store the hard way. A fraud model works in the notebook and quietly breaks in production. A support agent gives a generic answer because it has no idea who the user is. A recommender pipeline duplicates the same “30-day spend” calculation across three jobs, and two of them disagree.

    A feature store is the piece of infrastructure that fixes those problems. It defines features once, stores them in two shapes (one for training, one for serving), and keeps both in sync. We are going to build a minimal one from scratch in Python, using DuckDB, Parquet, Redis, and FastAPI. Then we will look at how AI applications change what we actually use it for.

    The full code is short enough that we will walk through every component.

     
    Feature Stores
     

    # What a Feature Store Actually Solves

     
    The classic pitch is training-serving skew: the SQL that built your training set is not the same code path that runs at inference, so the values drift. That problem is real, and the offline plus online split is the standard fix.

    The modern pitch is broader. Large language model (LLM) agents and retrieval-augmented generation (RAG) pipelines need structured user context at inference time, on every request, in under 10ms. An LLM has no memory of who the user is. If we want personalized output, we have to inject the user’s plan tier, recent activity, and account state into the prompt, and we need a system that can return those values fast and consistently. That is exactly what a feature store’s online store and retrieval API give us.

    So we build for both. The same five components handle the predictive machine learning use case and the LLM context use case.

     

    # The Five Components

     

    • A feature registry that defines features as code.
    • An offline store on Parquet, queried with DuckDB, for training and backfills.
    • An online store on Redis for low-latency lookups at inference.
    • A materialization pipeline that pushes the latest values from offline to online.
    • A FastAPI service that exposes a typed retrieval API.

     
    Feature Stores
     

    # Running Example: A Personalized LLM Recommender

     
    We are running a streaming service. When a user opens the app, an LLM generates a short, personalized “what to watch next” message. The LLM needs three things about the user:

     

    Feature Type Freshness
    user_segment string daily
    watch_count_30d int hourly
    last_genre string per-event

     

    The entity is user_id. We will register these three features, materialize them, and serve them to the LLM at request time.

     

    // 1. Defining the Feature Registry

    A registry is just a place where features are declared once, with their entity, dtype, and source. We use a dataclass.

    from dataclasses import dataclass
    from typing import Literal
    
    @dataclass(frozen=True)
    class Feature:
        name: str
        entity: str
        dtype: Literal["int", "float", "str"]
        source: str  # path to a Parquet file or a SQL view
    
    REGISTRY: dict[str, Feature] = {
        "user_segment": Feature("user_segment", "user_id", "str", "data/user_segment.parquet"),
        "watch_count_30d": Feature("watch_count_30d", "user_id", "int", "data/watch_count_30d.parquet"),
        "last_genre": Feature("last_genre", "user_id", "str", "data/last_genre.parquet"),
    }

     

    The full code can be found here.

    When you run it, the output shows:

    Registered features:
    user_segment  entity=user_id  dtype=str  source=data/user_segment.parquet
    watch_count_30d  entity=user_id  dtype=int  source=data/watch_count_30d.parquet
    last_genre  entity=user_id  dtype=str  source=data/last_genre.parquet

     

    This is the contract. Every other component reads from REGISTRY, so renaming a feature, changing its dtype, or pointing it at a new source happens in one place. In production systems, this would be YAML or a Python module checked into a Git repo, with code review on every change.

     

    // 2. Building the Offline Store with DuckDB and Parquet

    The offline store holds the full history of every feature value. We use Parquet files as the storage layer and DuckDB as the query engine. DuckDB reads Parquet directly, which means no separate database to run.

    Here is a sample of the code:

    import duckdb
    import pandas as pd
    
    def get_historical_features(
        entity_df: pd.DataFrame, features: list[str]
    ) -> pd.DataFrame:
        con = duckdb.connect()
        con.register("entities", entity_df)
        base = "SELECT * FROM entities"
        for fname in features:
            f = REGISTRY[fname]
            src = f.source.replace("'", "''")
            con.execute(f"CREATE VIEW {fname}_src AS SELECT * FROM '{src}'")
            base = f"""
                SELECT t.*, s.{fname}
                FROM ({base}) t
                ASOF LEFT JOIN {fname}_src s
                  ON t.user_id = s.user_id
                 AND t.event_timestamp >= s.event_timestamp
            """
        return con.execute(base).df()

     

    The full code can be found here.

    When you run it, the output shows:

     

    user_id event_timestamp user_segment watch_count_30d last_genre
    8a2f 2026-05-05 12:00:00 casual 22 NaN
    b13c 2026-05-07 20:00:00 casual 5 thriller
    8a2f 2026-05-07 22:00:00 power_user 47 documentary

     

    The AsOf join is the point-in-time join. For every entity row, it picks the most recent feature value where the feature’s timestamp is at or before the event timestamp. That is what prevents leakage — where a training row is built with a feature value that did not exist yet at the moment we are predicting for.

    Point-in-time joins are still the right answer for any model we plan to train or fine-tune. For a pure inference-time LLM use case, we may never call this function. We still want the offline store, since it is where backfills, evaluation datasets, and audits come from.

     

    // 3. Setting Up the Online Store on Redis

    The online store keeps only the latest value per entity. Redis is the standard choice because hash lookups are sub-millisecond.

    import json
    import fakeredis  # use redis.Redis() against a real server in production
    
    r = fakeredis.FakeRedis(decode_responses=True)
    
    def write_online(entity: str, entity_id: str, values: dict) -> None:
        r.hset(
            f"{entity}:{entity_id}",
            mapping={k: json.dumps(v) for k, v in values.items()},
        )
    
    def read_online(entity: str, entity_id: str, features: list[str]) -> dict:
        raw = r.hmget(f"{entity}:{entity_id}", features)
        return {f: json.loads(v) if v else None for f, v in zip(features, raw)}

     

    The full code can be found here.

    When you run it, the output shows:

    read_online -> {'user_segment': 'power_user', 'watch_count_30d': 47, 'last_genre': 'documentary'}
    missing key -> {'user_segment': None}

     

    The key shape is entity:entity_id. The value is a hash with one field per feature. A single HMGET returns all the features we asked for in one round trip. On a local Redis instance with three features, this finishes in well under 1ms.

     

    // 4. Running the Materialization Pipeline

    Materialization moves values from offline to online. In a real system this runs on a schedule (Airflow, cron, a streaming job). Here it is a function.

    def materialize(features: list[str]) -> None:
        by_entity: dict[str, dict] = {}
        for fname in features:
            f = REGISTRY[fname]
            src = f.source.replace("'", "''")
            df = duckdb.sql(f"""
                SELECT {f.entity}, {fname}
                FROM '{src}'
                QUALIFY ROW_NUMBER() OVER (
                    PARTITION BY {f.entity}
                    ORDER BY event_timestamp DESC
                ) = 1
            """).df()
            for _, row in df.iterrows():
                by_entity.setdefault(row[f.entity], {})[fname] = row[fname]
        for entity_id, values in by_entity.items():
            write_online("user_id", entity_id, values)

     

    The full code can be found here.

    When you run it, the output shows:

    user_id:8a2f -> {'user_segment': 'power_user', 'watch_count_30d': 47, 'last_genre': 'documentary'}
    user_id:b13c -> {'user_segment': 'casual', 'watch_count_30d': 5, 'last_genre': 'thriller'}

     

    The QUALIFY clause keeps the latest row per entity. We group all features for the same user into one Redis write to cut round trips. Run this on the cadence each feature needs: hourly for watch_count_30d, near-real-time for last_genre, daily for user_segment. The registry is the right place to encode that cadence in a real implementation.

     

    // 5. Exposing the FastAPI Retrieval Service

    The retrieval service is the production surface. It is what the LLM application calls.

    f = resp.json()["features"]
    print("\nPrompt the LLM would receive:")
    print(
        f"  System: You recommend shows for a streaming service.\n"
        f"  User context: segment={f['user_segment']}, "
        f"watched {f['watch_count_30d']} titles in last 30 days, "
        f"last genre watched: {f['last_genre']}.\n"
        f"  Task: suggest 3 titles in a friendly, short message."
    )

     

    The full code can be found here.

    When you run it, the output shows:

    POST /get-online-features -> 200
    body: {'user_id': '8a2f', 'features': {'user_segment': 'power_user', 'watch_count_30d': 47, 'last_genre': 'documentary'}}
    Prompt the LLM would receive:
      System: You recommend shows for a streaming service.
      User context: segment=power_user, watched 47 titles in last 30 days, last genre watched: documentary.
      Task: suggest 3 titles in a friendly, short message.

     

    The feature store is the piece that turns “user 8a2f” into a structured context the LLM can use.

     

    # Where the Feature Store Ends and the Vector Database Begins

     
    A vector database (Pinecone, Weaviate, pgvector) is not a feature store, even though both sit in front of a model at inference. They solve different retrieval problems.

     
    Feature Stores
     

    A real LLM stack uses both. The vector database returns the three most similar past viewing sessions. The feature store returns the user’s segment and recent counts. The prompt combines them.

     

    # Common Anti-Patterns

     
    A few patterns that we keep seeing fail:

    • Computing features inside the model service. The same logic ends up in the training notebook and the API, and the two definitions drift within a quarter.
    • Treating the online store as the source of truth. Redis loses data on a bad restart. The offline store is canonical; the online store is a cache.
    • Skipping the registry. Three teams independently define active_user and the dashboards stop matching the model.
    • Calling a vector database a feature store. It cannot do entity-keyed structured lookups, and a prompt that needs both will end up wired to two systems anyway.
    • Backfilling without point-in-time joins. The training set looks great, the production model looks broken, and the gap is the leakage.

     

    # Comparing This to Feast, Tecton, and Databricks

     
    Our ~200 lines do the same job in miniature.

     
    Feature Stores
     

    Feast is the closest comparison if we want to go further on the same pattern, self-hosted. Tecton and Databricks are the managed paths and have explicit LLM features (Tecton’s Feature Retrieval API for LLMs, Databricks Feature Serving for compound generative AI systems). Picking between them is mostly a question of how much we want to operate ourselves and whether the rest of our stack already lives in Databricks.

     

    # Conclusion

     
    A working feature store fits in five components: a registry, an offline store, an online store, a materialization step, and a retrieval API. Building it once teaches us why the production systems look the way they do. It also shows where the design changes for AI: the online retrieval path is the surface the LLM hits, point-in-time joins matter when we train or evaluate, and the vector database sits next to the feature store, not inside it.

    Once we have these pieces, swapping our minimal version for Feast, Tecton, or Databricks is mostly a migration of the registry. The shape of the system stays the same.
     
     

    Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.



    Related posts:

    Top 10 AI Coding Assistants of 2026

    7 Ways to Build Investment Portfolio Tracker in Excel

    A 6-Month Guide to Mastering AI Agents

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleHigh-severity vulnerability in Linux caused by a single faulty character
    Next Article US lawmakers press Israel to let cancer patients out of Gaza for treatment | Gaza News
    gvfx00@gmail.com
    • Website

    Related Posts

    Business & Startups

    Google’s Faster Text Generation Model

    June 11, 2026
    Business & Startups

    5 Must-Know Python Concepts for AI Engineers

    June 11, 2026
    Business & Startups

    5 Useful Python Scripts to Automate Boring PDF Tasks

    June 11, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    Black Swans in Artificial Intelligence — Dan Rose AI

    October 2, 2025191 Views

    Every Clue That Tony Stark Was Always Doctor Doom

    October 20, 2025117 Views

    We let ChatGPT judge impossible superhero debates — here’s how it ruled

    December 31, 202595 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram

    Subscribe to Updates

    Get the latest tech news from tastytech.

    About Us
    About Us

    TastyTech.in brings you the latest AI, tech news, cybersecurity tips, and gadget insights all in one place. Stay informed, stay secure, and stay ahead with us!

    Most Popular

    Black Swans in Artificial Intelligence — Dan Rose AI

    October 2, 2025191 Views

    Every Clue That Tony Stark Was Always Doctor Doom

    October 20, 2025117 Views

    We let ChatGPT judge impossible superhero debates — here’s how it ruled

    December 31, 202595 Views

    Subscribe to Updates

    Get the latest news from tastytech.

    Facebook X (Twitter) Instagram Pinterest
    • Homepage
    • About Us
    • Contact Us
    • Privacy Policy
    © 2026 TastyTech. Designed by TastyTech.

    Type above and press Enter to search. Press Esc to cancel.

    Ad Blocker Enabled!
    Ad Blocker Enabled!
    Our website is made possible by displaying online advertisements to our visitors. Please support us by disabling your Ad Blocker.