
    Building Practical MLOps for a Personal ML Project

    February 12, 2026 • 11 Mins Read


    MLOps for a Personal ML Project
    Image by Author

     

    Table of Contents

    • # Introduction
    • # Understanding the Task and the Dataset
    • # Starting with Version Control
    • # Building Reproducible Pipelines (and Leaving Notebook Chaos Behind)
        • // From Top-of-Notebook Cells to a Reusable Function
        • // What the Preprocessing Pipeline Actually Does
    • # Saving Your Statistical Models and Artifacts
    • # Making It Work Locally (With an API or Tiny Web UI)
        • // Turning Your Analyses Into a Tiny “Analysis API”
        • // A Simple Local Entry Point
    • # Logging Everything (Even the Details)
        • // Basic Logging Setup
        • // Logging the Preprocessing and Analyses
    • # Telling the Story (AKA Writing for Humans)
    • # Conclusion

    # Introduction

     
    You’ve probably done your fair share of data science and machine learning projects.

    They are great for sharpening skills and showing off what you know and have learned. But here’s the thing: they often stop short of what real-world, production-level data science looks like.

    In this article, we take a project — the U.S. Occupational Wage Analysis — and turn it into something that says, “This is ready for real-world use.”

    For this, we will walk through a simple but solid machine learning operations (MLOps) setup that covers everything from version control to deployment.

    It’s great for early-career data folks, freelancers, and portfolio builders: anyone who wants their work to look like it came from a professional setup, even if it did not.

    In this article, we go beyond the notebook project: we set up an MLOps structure with reproducible pipelines, model artifacts, a simple local application programming interface (API), logging, and, finally, useful documentation.

     

    MLOps for a Personal ML Project
    Image by Author

     

    # Understanding the Task and the Dataset

     
    The project uses a national U.S. dataset containing annual occupational wage and employment data for all 50 states and U.S. territories. It details employment totals, mean wages, occupational groups, wage percentiles, and geographic identifiers.

     
    MLOps for a Personal ML Project
     

    Your main objectives are:

    • Comparing differences in wages across different states and job categories
    • Running statistical tests (T-tests, Z-tests, F-tests)
    • Building regressions to understand the relationship between employment and wages
    • Visualizing wage distributions and occupation trends

    Some key columns of the dataset:

    • OCC_TITLE — Occupation name
    • TOT_EMP — Total employment
    • A_MEAN — Average annual wage
    • PRIM_STATE — State abbreviation
    • O_GROUP — Occupation category (Major, Total, Detailed)

     
    MLOps for a Personal ML Project
     

    Your mission here is to produce reliable insights about wage disparities, job distribution, and statistical relationships, but it does not stop there.

    The challenge is also to structure the project so that it becomes reusable, reproducible, and clean, which is an essential skill for working data scientists today.

     

    # Starting with Version Control

     
    Let’s not skip the basics. Even small projects deserve a clean structure and proper version control. Here’s a folder setup that’s both intuitive and reviewer-friendly:

     
    MLOps for a Personal ML Project
     

    A few best practices:

    • Keep raw data immutable. Never edit it in place; work on a copy for processing.
    • Consider using Git LFS if your datasets get big and chunky.
    • Keep each script in src/ focused on one thing. Your future self will thank you.
    • Commit often and use clear messages like:
      feat: add T-test comparison between management and production wages.

    Even with this simple structure, you are showing hiring managers that you’re thinking and planning like a professional, not like a junior.
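
    As a concrete (and entirely illustrative) starting point, the skeleton can be created with a few lines of Python; the folder names here are assumptions matching the structure shown above:

    ```python
    # Sketch: scaffold an MLOps-friendly project layout (names are illustrative).
    from pathlib import Path

    def scaffold_project(root: str = "wage-analysis") -> Path:
        """Create a minimal project skeleton and return its root path."""
        base = Path(root)
        for sub in ["data/raw", "data/processed", "src", "models", "logs", "notebooks"]:
            (base / sub).mkdir(parents=True, exist_ok=True)
        # Keep raw data immutable: document the rule next to the data itself
        (base / "data/raw/README.md").write_text(
            "Raw files are read-only; copy them before processing.\n"
        )
        return base
    ```

    From there, `git init` and a first commit with a clear message get you the history hiring managers like to see.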

     

    # Building Reproducible Pipelines (and Leaving Notebook Chaos Behind)

     
    Notebooks are amazing for exploration. You try something, tweak a filter, re-run a cell, copy a chart, and before you know it, you’ve got 40 cells and no idea what actually produced the final answer.

    To make this project feel “production-ish”, we’ll take the logic that already lives in the notebook and wrap it in a single preprocessing function. That function becomes the one, canonical place where the U.S. occupational wage data is:

    • Loaded from the Excel file
    • Cleaned and converted to numeric
    • Normalized (states, occupation groups, occupation codes)
    • Enriched with helper columns like total payroll

    From then on, every analysis — plots, T-tests, regressions, correlations, Z-tests — will reuse the same cleaned DataFrame.

     

    // From Top-of-Notebook Cells to a Reusable Function

    Right now, the notebook roughly does this:

    • Loads the file: state_M2024_dl.xlsx
    • Parses the first sheet into a DataFrame
    • Converts columns like A_MEAN, TOT_EMP to numeric
    • Uses those columns in:
      • State-level wage comparisons
      • Linear regression (TOT_EMP → A_MEAN)
      • Pearson correlation (Q6)
      • Z-test for tech vs non-tech (Q7)
      • Levene test for wage variance

    We’ll turn that into a single function called preprocess_wage_data that you can call from anywhere in the project:

    from src.preprocessing import preprocess_wage_data
    df = preprocess_wage_data("data/raw/state_M2024_dl.xlsx")

     

    Now your notebook, scripts, or future API call all agree on what “clean data” means.

     

    // What the Preprocessing Pipeline Actually Does

    MLOps for a Personal ML Project
     

    For this dataset, the preprocessing pipeline will:

    1. Load the Excel file once.

    xls = pd.ExcelFile(file_path)
    df_raw = xls.parse(xls.sheet_names[0])
    df_raw.head()

     
    MLOps for a Personal ML Project
     

    2. Convert key numeric columns to numeric.

    These are the columns your analysis actually uses:

    • Employment and intensity: TOT_EMP, EMP_PRSE, JOBS_1000, LOC_QUOTIENT
    • Wage measures: H_MEAN, A_MEAN, MEAN_PRSE
    • Wage percentiles: H_PCT10, H_PCT25, H_MEDIAN, H_PCT75, H_PCT90, A_PCT10, A_PCT25, A_MEDIAN, A_PCT75, A_PCT90

    We coerce them safely:

    df = df_raw.copy()
    numeric_cols = [
        "TOT_EMP", "EMP_PRSE", "JOBS_1000", "LOC_QUOTIENT",
        "H_MEAN", "A_MEAN", "MEAN_PRSE",
        "H_PCT10", "H_PCT25", "H_MEDIAN", "H_PCT75", "H_PCT90",
        "A_PCT10", "A_PCT25", "A_MEDIAN", "A_PCT75", "A_PCT90",
    ]
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

     

    If a future file contains odd placeholder values (e.g. ‘**’ or ‘N/A’), your code will not crash; those entries are simply treated as missing and the pipeline keeps running.
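
    You can see this coercion behavior on a toy series (the values below are made up for illustration):

    ```python
    # Placeholder strings become NaN instead of raising an exception.
    import pandas as pd

    s = pd.Series(["52000", "**", "N/A", "61000"])
    cleaned = pd.to_numeric(s, errors="coerce")
    # the two placeholder entries are now missing values
    print(cleaned.isna().sum())
    ```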

    3. Normalize text identifiers.

    For consistent grouping and filtering:

    • PRIM_STATE to uppercase (e.g. “ca” → “CA”)
    • O_GROUP to lowercase (e.g. “Major” → “major”)
    • OCC_CODE to string (for .str.startswith("15") in the tech vs non-tech Z-test)

    4. Add helper columns used in analyses.

    These are simple but handy. Total payroll per row is approximated using the mean wage:

     df["TOTAL_PAYROLL"] = df["A_MEAN"] * df["TOT_EMP"]

     

    The wage-to-employment ratio is useful for spotting high wage / low employment niches, with protection against division by zero:

     df["WAGE_EMP_RATIO"] = df["A_MEAN"] / df["TOT_EMP"].replace({0: np.nan})

     

    5. Return a clean DataFrame for the rest of the project.

    Your later code for:

    1. Plotting top/bottom states
    2. T-tests (Management vs Production)
    3. Regression (TOT_EMP → A_MEAN)
    4. Correlations (Q6)
    5. Z-tests (Q7)
    6. Levene’s test

    can all start with:

     df = preprocess_wage_data("state_M2024_dl.xlsx")

     

    Full preprocessing function:

    Drop this into src/preprocessing.py:

    import pandas as pd
    import numpy as np
    def preprocess_wage_data(file_path: str = "state_M2024_dl.xlsx") -> pd.DataFrame:
        """Load and clean the U.S. occupational wage data from Excel.
        - Reads the first sheet of the Excel file.
        - Ensures key numeric columns are numeric.
        - Normalizes text identifiers (state, occupation group, occupation code).
        - Adds helper columns used in later analysis.
        """
        # Load raw Excel file
        xls = pd.ExcelFile(file_path)

     

    Check the rest of the code here.
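
    The linked version is canonical; purely for reference, here is one possible completion of the function, with the cleaning steps split into a helper so they can be exercised without an Excel file (the column names are the ones listed earlier in the article; the helper name is an assumption):

    ```python
    import pandas as pd
    import numpy as np

    NUMERIC_COLS = [
        "TOT_EMP", "EMP_PRSE", "JOBS_1000", "LOC_QUOTIENT",
        "H_MEAN", "A_MEAN", "MEAN_PRSE",
        "H_PCT10", "H_PCT25", "H_MEDIAN", "H_PCT75", "H_PCT90",
        "A_PCT10", "A_PCT25", "A_MEDIAN", "A_PCT75", "A_PCT90",
    ]

    def clean_wage_frame(df_raw: pd.DataFrame) -> pd.DataFrame:
        """Apply the article's cleaning steps to an already-loaded frame."""
        df = df_raw.copy()
        for col in NUMERIC_COLS:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce")
        # Normalize text identifiers for consistent grouping and filtering
        if "PRIM_STATE" in df.columns:
            df["PRIM_STATE"] = df["PRIM_STATE"].astype(str).str.upper()
        if "O_GROUP" in df.columns:
            df["O_GROUP"] = df["O_GROUP"].astype(str).str.lower()
        if "OCC_CODE" in df.columns:
            df["OCC_CODE"] = df["OCC_CODE"].astype(str)
        # Helper columns used in later analyses
        if {"A_MEAN", "TOT_EMP"} <= set(df.columns):
            df["TOTAL_PAYROLL"] = df["A_MEAN"] * df["TOT_EMP"]
            df["WAGE_EMP_RATIO"] = df["A_MEAN"] / df["TOT_EMP"].replace({0: np.nan})
        return df

    def preprocess_wage_data(file_path: str = "state_M2024_dl.xlsx") -> pd.DataFrame:
        """Load the first sheet of the Excel file and clean it."""
        xls = pd.ExcelFile(file_path)
        return clean_wage_frame(xls.parse(xls.sheet_names[0]))
    ```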

     

    # Saving Your Statistical Models and Artifacts

     
    What are model artifacts? Some examples: regression models, correlation matrices, cleaned datasets, and figures.

    import joblib
    joblib.dump(model, "models/employment_wage_regression.pkl")

     

    Why save artifacts?

    • You avoid recomputing results during API calls or dashboards
    • You preserve versions for future comparisons
    • You keep analysis and inference separate

    These small habits elevate your project from exploratory to production-friendly.
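
    Pickled models aside, lightweight artifacts such as test statistics can be versioned the same way. A sketch using timestamped JSON files (the paths and the save_results name are illustrative):

    ```python
    # Sketch: persist analysis results as timestamped JSON artifacts so every
    # run leaves a comparable record next to the pickled models.
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def save_results(results: dict, out_dir: str = "models") -> Path:
        """Write results to a timestamped JSON file and return its path."""
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        path = Path(out_dir) / f"results_{stamp}.json"
        path.write_text(json.dumps(results, indent=2))
        return path
    ```

    A later run can then load any previous file and diff the statistics, which is exactly the "preserve versions for future comparisons" habit above.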

     

    # Making It Work Locally (With an API or Tiny Web UI)

     
    You don’t need to jump straight into Docker and Kubernetes to “deploy” this. For a lot of real-world analytics work, your first API is simply:

    • A clean preprocessing function
    • A few well-named analysis functions
    • A small script or notebook cell that wires them together

    That alone makes your project easy to call from:

    • Another notebook
    • A Streamlit/Gradio dashboard
    • A future FastAPI or Flask app

     

    // Turning Your Analyses Into a Tiny “Analysis API”

    You already have the core logic in the notebook:

    • T-test: Management vs Production wages
    • Regression: TOT_EMP → A_MEAN
    • Pearson correlation (Q6)
    • Z-test tech vs non-tech (Q7)
    • Levene’s test for wage variance

    We’ll wrap at least one of them into a function so it behaves like a tiny API endpoint.

    Example: “Compare management vs production wages”

    This is a function version of the T-test code that’s already in the notebook:

    from scipy.stats import ttest_ind
    import pandas as pd
    def compare_management_vs_production(df: pd.DataFrame):
        """Two-sample T-test between Management and Production occupations."""
        # Filter for relevant occupations
        mgmt = df[df["OCC_TITLE"].str.contains("Management", case=False, na=False)]
        prod = df[df["OCC_TITLE"].str.contains("Production", case=False, na=False)]
        # Drop missing values
        mgmt_wages = mgmt["A_MEAN"].dropna()
        prod_wages = prod["A_MEAN"].dropna()
        # Perform two-sample T-test (Welch's t-test)
        t_stat, p_value = ttest_ind(mgmt_wages, prod_wages, equal_var=False)
        return t_stat, p_value

     

    Now this test can be reused from:

    • A main script
    • A Streamlit slider
    • A future FastAPI route

    without copying any notebook cells.

     

    // A Simple Local Entry Point

    Here’s how all the pieces fit together in a plain Python script, which you can call main.py or run in one notebook cell:

    from preprocessing import preprocess_wage_data
    # careful: a local module named statistics.py shadows Python's standard-library
    # statistics module; a name like stats_tests.py is safer
    from statistics import run_q6_pearson_test, run_q7_ztest  # move these from the notebook
    from analysis import compare_management_vs_production      # the function above
    if __name__ == "__main__":
        # 1. Load and preprocess the data
        df = preprocess_wage_data("state_M2024_dl.xlsx")
        # 2. Run core analyses
        t_stat, p_value = compare_management_vs_production(df)
        print(f"T-test (Management vs Production) -> t={t_stat:.2f}, p={p_value:.4f}")
        corr_q6, p_q6 = run_q6_pearson_test(df)
        print(f"Pearson correlation (TOT_EMP vs A_MEAN) -> r={corr_q6:.4f}, p={p_q6:.4f}")
        z_q7 = run_q7_ztest(df)
        print(f"Z-test (Tech vs Non-tech median wages) -> z={z_q7:.4f}")

     

    This does not look like a web API yet, but conceptually it is:

    • Input: the cleaned DataFrame
    • Operations: named analytical functions
    • Output: well-defined numbers you can surface in a dashboard, a report, or, later, a REST endpoint.
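
    One way to make that conceptual API concrete without pulling in a web framework is a small dispatch table mapping route names to analysis functions; the route name and placeholder handler below are illustrative:

    ```python
    # Stdlib-only sketch of the "analysis API" idea: named routes dispatch to
    # functions over the cleaned DataFrame, the way a REST framework routes URLs.
    from typing import Any, Callable, Dict

    ROUTES: Dict[str, Callable[..., Any]] = {}

    def route(name: str):
        """Register an analysis function under a route name."""
        def decorator(fn):
            ROUTES[name] = fn
            return fn
        return decorator

    @route("mgmt-vs-prod")
    def mgmt_vs_prod(df):
        # in the real project this would call compare_management_vs_production(df)
        return {"test": "t-test"}

    def call(name: str, df) -> Any:
        """Dispatch a request by route name."""
        if name not in ROUTES:
            raise KeyError(f"unknown route: {name}")
        return ROUTES[name](df)
    ```

    Swapping this for FastAPI later is mostly a matter of replacing the decorator, since each analysis is already a named, self-contained function.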

     

    # Logging Everything (Even the Details)

     
    Most people overlook logging, but it is how you make your project debuggable and trustworthy.
    Even in a beginner-friendly analytics project like this one, it’s useful to know:

    • Which file you loaded
    • How many rows survived preprocessing
    • Which tests ran
    • What the key test statistics were

    Instead of manually printing everything and scrolling through notebook output, we’ll set up a simple logging configuration that you can reuse in scripts and notebooks.

     

    // Basic Logging Setup

    Create a logs/ folder in your project, and then add this somewhere early in your code (e.g. at the top of main.py or in a dedicated logging_config.py):

    import logging
    from pathlib import Path
    # Make sure logs/ exists
    Path("logs").mkdir(exist_ok=True)
    logging.basicConfig(
        filename="logs/pipeline.log",
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s"
    )

     

    Now, every time you run your pipeline, a logs/pipeline.log file will be updated.
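
    A small variation on this setup, useful in notebooks, adds a console handler so log lines appear on screen as well as in the file; the get_logger name is an assumption:

    ```python
    # Sketch: a logger that writes to logs/pipeline.log AND echoes to the console.
    import logging
    from pathlib import Path

    def get_logger(name: str = "pipeline", log_dir: str = "logs") -> logging.Logger:
        """Return a logger with file and console handlers attached once."""
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        logger = logging.getLogger(name)
        logger.setLevel(logging.INFO)
        if not logger.handlers:  # avoid duplicate handlers on notebook re-runs
            fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
            for handler in (logging.FileHandler(f"{log_dir}/pipeline.log"),
                            logging.StreamHandler()):
                handler.setFormatter(fmt)
                logger.addHandler(handler)
        return logger
    ```

    The handler guard matters in notebooks, where re-running a cell would otherwise attach a fresh pair of handlers and duplicate every line.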

     

    // Logging the Preprocessing and Analyses

    We can extend the main example from Step 5 to log what’s happening:

    from preprocessing import preprocess_wage_data
    from statistics import run_q6_pearson_test, run_q7_ztest  # a local statistics.py shadows the stdlib module of the same name
    from analysis import compare_management_vs_production
    import logging
    if __name__ == "__main__":
        logging.info("Starting wage analysis pipeline.")
        # 1. Preprocess data
        df = preprocess_wage_data("state_M2024_dl.xlsx")
        logging.info("Loaded cleaned dataset with %d rows and %d columns.", df.shape[0], df.shape[1])
        # 2. T-test: Management vs Production
        t_stat, p_value = compare_management_vs_production(df)
        logging.info("T-test (Mgmt vs Prod) -> t=%.3f, p=%.4f", t_stat, p_value)
        # 3. Pearson correlation (Q6)
        corr_q6, p_q6 = run_q6_pearson_test(df)
        logging.info("Pearson (TOT_EMP vs A_MEAN) -> r=%.4f, p=%.4f", corr_q6, p_q6)
        # 4. Z-test (Q7)
        z_q7 = run_q7_ztest(df)
        logging.info("Z-test (Tech vs Non-tech median wages) -> z=%.3f", z_q7)
        logging.info("Pipeline finished successfully.")

     

    Now, instead of guessing what happened last time you ran the notebook, you can open logs/pipeline.log and see a timeline of:

    • When preprocessing started
    • How many rows/columns you had
    • What the test statistics were

    That’s a small step, but a very “MLOps” thing to do: you’re not just running analyses, you’re observing them.

     

    # Telling the Story (AKA Writing for Humans)

     
    Documentation matters, especially when dealing with wages, occupations, and regional comparisons: topics real decision-makers care about.

    Your README or final notebook should include:

    • Why this analysis matters
    • A summary of wage and employment patterns
    • Key visualizations (top/bottom states, wage distributions, group comparisons)
    • Explanations of each statistical test and why it was chosen
    • Clear interpretations of regression and correlation results
    • Limitations (e.g. missing state records, sampling variance)
    • Next steps for deeper analysis or dashboard deployment

    Good documentation turns a dataset project into something anyone can use and understand.

     

    # Conclusion

     
    Why does all of this matter?

    Because in the real world, data science doesn’t live in a vacuum. Your beautiful model isn’t helpful if no one else can run it, understand it, or trust it. That’s where MLOps comes in, not as a buzzword, but as the bridge between a cool experiment and an actual, usable product.

    In this article, we started with a typical notebook-based assignment and showed how to give it structure and staying power. We introduced:

    • Version control to keep our work organized
    • Clean, reproducible pipelines for preprocessing and analysis
    • Model serialization so we can re-use (not re-train) our models
    • A lightweight API for local deployment
    • Logging to track what’s going on behind the scenes
    • And finally, documentation that speaks to both techies and business folks

     

    MLOps for a Personal ML Project
    Image by Author

     
     

    Nate Rosidi is a data scientist working in product strategy. He’s also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for interviews with real questions from top companies. Nate writes on career-market trends, gives interview advice, shares data science projects, and covers everything SQL.


