    Docker for Python & Data Projects: A Beginner’s Guide

By Bala Priya C | April 17, 2026 | 12 Mins Read



    Image by Author

     

    Table of Contents

    • # Introduction
    • # Prerequisites
    • # Containerizing a Python Script with Pinned Dependencies
        • // Structuring the Project
        • // Writing the Script
        • // Pinning Dependencies
        • // Defining the Dockerfile
        • // Building and Running
    • # Serving a Machine Learning Model with FastAPI
        • // Structuring the Project
        • // Writing the App
        • // Defining the Dockerfile
        • // Building and Running
    • # Building a Multi-Service Pipeline with Docker Compose
        • // Structuring the Project
        • // Defining the Compose File
        • // Writing the Loader Script
        • // Starting Everything
    • # Scheduling Jobs with a Cron Container
        • // Structuring the Project
        • // Writing the Fetch Script
        • // Defining the Crontab
        • // Defining the Dockerfile
        • // Building and Running
    • # Wrapping Up

    # Introduction

     
    Python and data projects have a dependency problem. Between Python versions, virtual environments, system-level packages, and operating system differences, getting someone else’s code to run on your machine can sometimes take longer than understanding the code itself.

    Docker solves this by packaging your code and its entire environment — Python version, dependencies, system libraries — into a single artifact called an image. From the image you can start containers that run identically on your laptop, your teammate’s machine, and a cloud server. You stop debugging environments and start shipping work.

    In this article, you’ll learn Docker through practical examples with a focus on data projects: containerizing a script, serving a machine learning model with FastAPI, wiring up a multi-service pipeline with Docker Compose, and scheduling a job with a cron container.

     

    # Prerequisites

     
    Before working through the examples, you’ll need:

    • Docker and Docker Compose installed for your operating system. Follow the official installation guide for your platform.
    • Familiarity with the command line and Python.
    • Familiarity with writing a Dockerfile, building an image, and running a container from that image.

    You don’t need deep Docker knowledge to follow along; each example explains what’s happening as it goes.

     

    # Containerizing a Python Script with Pinned Dependencies

     
    Let’s start with the most common use case: you have a Python script and a requirements.txt, and you want it to run reliably anywhere.

    We’ll build a data cleaning script that reads a raw sales CSV file, removes duplicates, fills in missing values, and writes a cleaned version to disk.

     

    // Structuring the Project

    The project is organized as follows:

    data-cleaner/
    ├── Dockerfile
    ├── requirements.txt
    ├── clean_data.py
    └── data/
        └── raw_sales.csv

     

    // Writing the Script

    Here’s the data cleaning script that uses Pandas to do the heavy lifting:

    # clean_data.py
    import pandas as pd
    import os
    
    INPUT_PATH = "data/raw_sales.csv"
    OUTPUT_PATH = "data/cleaned_sales.csv"
    
    print("Reading data...")
    df = pd.read_csv(INPUT_PATH)
    print(f"Rows before cleaning: {len(df)}")
    
    # Drop duplicate rows
    df = df.drop_duplicates()
    
    # Fill missing numeric values with column median
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())
    
    # Fill missing text values with 'Unknown'
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].fillna('Unknown')
    
    print(f"Rows after cleaning: {len(df)}")
    df.to_csv(OUTPUT_PATH, index=False)
    print(f"Cleaned file saved to {OUTPUT_PATH}")
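
To see the two fill rules in action, here’s a tiny self-contained example (toy data, not part of the project above) applying the same logic to a three-row frame:

```python
import pandas as pd

# Toy frame with one missing numeric value and one missing text value
df = pd.DataFrame({
    "units": [10.0, None, 30.0],
    "region": ["North", None, "South"],
})

# Numeric columns get the column median (here: median of 10 and 30 is 20)
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# Text columns get the literal string 'Unknown'
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna("Unknown")

print(df)
# units: 10.0, 20.0, 30.0 / region: North, Unknown, South
```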

     

    // Pinning Dependencies

    Pinning exact versions is important. Without it, pip install pandas might install different versions on different machines. Pinned versions guarantee everyone gets the same behavior. You can define the exact versions in the requirements.txt file like so:

    pandas==2.2.0
    openpyxl==3.1.2
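
If you want to verify that an environment actually matches these pins, a small stdlib-only sketch can parse the file and compare it against what’s installed. The helper names here are made up for illustration, not part of the project:

```python
from importlib.metadata import version, PackageNotFoundError

def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip()
    return pins

def check_pins(pins):
    """Return (package, pinned, installed-or-None) tuples for any mismatch."""
    mismatches = []
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches

pins = parse_pins("pandas==2.2.0\nopenpyxl==3.1.2")
for name, pinned, installed in check_pins(pins):
    print(f"{name}: pinned {pinned}, installed {installed or 'missing'}")
```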

     

    // Defining the Dockerfile

    This Dockerfile builds a minimal, cache-friendly image for the cleaning script:

    # Use a slim Python 3.11 base image
    FROM python:3.11-slim
    
    # Set the working directory inside the container
    WORKDIR /app
    
    # Copy and install dependencies first (for layer caching)
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy the script into the container
    COPY clean_data.py .
    
    # Default command to run when the container starts
    CMD ["python", "clean_data.py"]

     
    There are a few things worth explaining here. We use python:3.11-slim instead of the full Python image because it’s significantly smaller and strips out packages you don’t need.

    We copy requirements.txt before copying the rest of the code, and this is intentional. Docker builds images in layers and caches each one. If you change only clean_data.py, Docker won’t reinstall your dependencies on the next build; it reuses the cached pip layer and jumps straight to copying the updated script. That small ordering decision can save minutes of rebuild time.

     

    // Building and Running

    With the image built, you can run the container and mount your local data folder:

    # Build the image and tag it
    docker build -t data-cleaner .
    
    # Run it, mounting your local data/ folder into the container
    docker run --rm -v $(pwd)/data:/app/data data-cleaner

     
    The -v $(pwd)/data:/app/data flag mounts your local data/ folder into the container at /app/data. This is how the script reads your CSV and how the cleaned output gets written back to your machine. Nothing is baked into the image, and the data stays on your filesystem.

    The --rm flag automatically removes the container after it finishes. Since this is a one-off script, there’s no reason to keep a stopped container lying around.

     

    # Serving a Machine Learning Model with FastAPI

     
    You’ve trained a model and you want to make it available over HTTP so other services can send data and get predictions back. FastAPI works great for this: it’s fast, lightweight, and handles input validation with Pydantic.

     

    // Structuring the Project

    The project separates the model artifact from the application code:

    ml-api/
    ├── Dockerfile
    ├── requirements.txt
    ├── app.py
    └── model.pkl

     

    // Writing the App

    The following app loads the model once at startup and exposes a /predict endpoint:

    # app.py
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    import pickle
    import numpy as np
    
    app = FastAPI(title="Sales Forecast API")
    
    # Load the model once at startup
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    
    class PredictRequest(BaseModel):
        region: str
        month: int
        marketing_spend: float
        units_in_stock: int
    
    class PredictResponse(BaseModel):
        region: str
        predicted_revenue: float
    
    @app.get("/health")
    def health():
        return {"status": "ok"}
    
    @app.post("/predict", response_model=PredictResponse)
    def predict(request: PredictRequest):
        try:
            features = [[
                request.month,
                request.marketing_spend,
                request.units_in_stock
            ]]
            prediction = model.predict(features)
            return PredictResponse(
                region=request.region,
                predicted_revenue=round(float(prediction[0]), 2)
            )
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))

     
    The PredictRequest class does the input validation for you. If someone sends a request with a missing field or a string where a number is expected, FastAPI rejects it with a clear error message before your model code even runs. The model is loaded once at startup — not on every request — which keeps response times fast.

    The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is actually up and ready.
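
For a quick programmatic check, here’s a hypothetical client sketch using only the standard library. It assumes the container from the Building and Running step below is up on localhost:8000:

```python
import json
from urllib import request

def build_payload(region, month, marketing_spend, units_in_stock):
    # Assemble the JSON body in the shape PredictRequest expects
    return json.dumps({
        "region": region,
        "month": month,
        "marketing_spend": marketing_spend,
        "units_in_stock": units_in_stock,
    }).encode("utf-8")

def predict(payload, url="http://localhost:8000/predict"):
    # POST the payload and decode the PredictResponse JSON
    req = request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())

payload = build_payload("North", 3, 5000.0, 320)
# predict(payload) returns a dict with "region" and "predicted_revenue"
```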

     

    // Defining the Dockerfile

    This Dockerfile bakes the model directly into the image so the container is fully self-contained:

    FROM python:3.11-slim
    
    WORKDIR /app
    
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy the model and the app together
    COPY model.pkl .
    COPY app.py .
    
    EXPOSE 8000
    
    CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

     
    The model.pkl is baked into the image at build time. This means the container is completely self-contained, and you don’t need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without this, you won’t be able to reach the API from outside the container.

     

    // Building and Running

    Build the image and start the API server:

    docker build -t ml-api .
    docker run --rm -p 8000:8000 ml-api

     
    Test it with curl:

    curl -X POST http://localhost:8000/predict \
      -H "Content-Type: application/json" \
      -d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'

     

    # Building a Multi-Service Pipeline with Docker Compose

     
    Real data projects rarely involve just one process. You might need a database, a script that loads data into it, and a dashboard that reads from it — all running together.

    Docker Compose lets you define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.

     

    // Structuring the Project

    The pipeline splits each service into its own subdirectory:

    pipeline/
    ├── docker-compose.yml
    ├── loader/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   ├── load_data.py
    │   └── sales_data.csv
    └── dashboard/
        ├── Dockerfile
        ├── requirements.txt
        └── app.py

     

    // Defining the Compose File

    This Compose file declares all three services and wires them together with a health check and a shared DATABASE_URL environment variable:

    # docker-compose.yml
    version: "3.9"
    
    services:
    
      db:
        image: postgres:15
        environment:
          POSTGRES_USER: admin
          POSTGRES_PASSWORD: secret
          POSTGRES_DB: analytics
        volumes:
          - pgdata:/var/lib/postgresql/data
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
          interval: 5s
          retries: 5
    
      loader:
        build: ./loader
        depends_on:
          db:
            condition: service_healthy
        environment:
          DATABASE_URL: postgresql://admin:secret@db:5432/analytics
    
      dashboard:
        build: ./dashboard
        depends_on:
          db:
            condition: service_healthy
        ports:
          - "8501:8501"
        environment:
          DATABASE_URL: postgresql://admin:secret@db:5432/analytics
    
    volumes:
      pgdata:

     

    // Writing the Loader Script

    This script waits briefly for the database, then loads a CSV into the sales table using SQLAlchemy:

    # loader/load_data.py
    import pandas as pd
    from sqlalchemy import create_engine
    import os
    import time
    
    DATABASE_URL = os.environ["DATABASE_URL"]
    
    # Give the DB a moment to be fully ready
    time.sleep(3)
    
    engine = create_engine(DATABASE_URL)
    
    df = pd.read_csv("sales_data.csv")
    df.to_sql("sales", engine, if_exists="replace", index=False)
    
    print(f"Loaded {len(df)} rows into the sales table.")

     
    Let’s take a closer look at the Compose file. Each service runs in its own container, but they’re all on the same Docker-managed network, so they can reach each other using the service name as a hostname. The loader connects to db:5432 — and not localhost — because db is the service name, and Docker handles the DNS resolution automatically.

    The healthcheck on the PostgreSQL service is important. depends_on alone only waits for the container to start, not for PostgreSQL to be ready to accept connections. The healthcheck uses pg_isready to confirm the database is actually up before the loader tries to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline won’t wipe your data.
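
One related note: the loader’s fixed time.sleep(3) is a blunt instrument. Since the healthcheck already gates startup it’s usually fine, but a retry loop is more robust. Here’s a sketch; the wait_for helper is hypothetical, not part of the loader above:

```python
import time

def wait_for(connect, attempts=10, delay=2.0):
    # Call connect() until it succeeds or attempts run out
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:
            last_error = exc
            print(f"Attempt {attempt} failed: {exc}; retrying in {delay}s")
            time.sleep(delay)
    raise RuntimeError(f"service never became ready: {last_error}")

# In load_data.py you could replace the sleep with something like:
#   conn = wait_for(lambda: create_engine(DATABASE_URL).connect())
```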

     

    // Starting Everything

    Bring up all services with a single command:

    docker compose up --build

     
    To stop everything, run:

    docker compose down

     
    This stops and removes the containers but keeps the pgdata volume; add the --volumes flag to remove it as well.

    # Scheduling Jobs with a Cron Container

     
    Sometimes you need a script to run on a schedule. Maybe it fetches data from an API every hour and writes it to a database or a file. You don’t want to set up a full orchestration system like Airflow for something this simple. A cron container does the job cleanly.

     

    // Structuring the Project

    The project includes a crontab file alongside the script and Dockerfile:

    data-fetcher/
    ├── Dockerfile
    ├── requirements.txt
    ├── fetch_data.py
    └── crontab

     

    // Writing the Fetch Script

    This script uses Requests to hit an API endpoint and saves the results as a timestamped CSV:

    # fetch_data.py
    import requests
    import pandas as pd
    from datetime import datetime
    import os
    
    API_URL = "https://api.example.com/sales/latest"
    OUTPUT_DIR = "/app/output"
    
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    
    print(f"[{datetime.now()}] Fetching data...")
    
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    
    data = response.json()
    df = pd.DataFrame(data["records"])
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
    df.to_csv(output_path, index=False)
    
    print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
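
As a quick illustration of the filename format, here’s the same strftime pattern applied to a fixed datetime so the output is deterministic:

```python
from datetime import datetime

# Same format string as fetch_data.py, applied to a fixed datetime
ts = datetime(2026, 4, 17, 9, 5).strftime("%Y%m%d_%H%M")
print(f"sales_{ts}.csv")  # → sales_20260417_0905.csv
```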

     

    // Defining the Crontab

    The crontab schedules the script to run every hour and redirects all output to a log file:

    # Run every hour, on the hour
    0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1

     
    The full path /usr/local/bin/python matters: cron runs jobs with a minimal PATH that doesn’t include /usr/local/bin, where the slim Python image installs the interpreter, so a bare python would fail with “command not found.” The >> /var/log/fetch.log 2>&1 part redirects both standard output and error output to a log file. This is how you inspect what happened after the fact.

     

    // Defining the Dockerfile

    This Dockerfile installs cron, registers the schedule, and keeps it running in the foreground:

    FROM python:3.11-slim
    
    # Install cron
    RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*
    
    WORKDIR /app
    
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    COPY fetch_data.py .
    COPY crontab /etc/cron.d/fetch-job
    
    # Set correct permissions and register the crontab
    RUN chmod 0644 /etc/cron.d/fetch-job && crontab /etc/cron.d/fetch-job
    
    # cron -f runs cron in the foreground, which is required for Docker
    CMD ["cron", "-f"]

     
    The cron -f flag is important here. Docker keeps a container alive as long as its main process is running. If cron ran in the background (its default), the main process would exit immediately and Docker would stop the container. The -f flag keeps cron running in the foreground so the container stays alive.

     

    // Building and Running

    Build the image and start the container in detached mode:

    docker build -t data-fetcher .
    docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher

     
    Check the logs any time:

    docker exec fetcher cat /var/log/fetch.log

     
    The output folder is mounted from your local machine, so the CSV files land on your filesystem even though the script runs inside the container.

     

    # Wrapping Up

     
    I hope you found this Docker article helpful. Docker doesn’t have to be complicated. Start with the first example, swap in your own script and dependencies, and get comfortable with the build-run cycle. Once you’ve done that, the other patterns follow naturally. Docker is a good fit when:

    • You need reproducible environments across machines or team members
    • You’re sharing scripts or models that have specific dependency requirements
    • You’re building multi-service systems that need to run together reliably
    • You want to deploy anywhere without setup friction

    That said, you don’t always need to use Docker for all of your Python work. It’s probably overkill when:

    • You’re doing quick, exploratory analysis only for yourself
    • Your script has no external dependencies beyond the standard library
    • You’re early in a project and your requirements are changing rapidly

    If you’re interested in going further, check out 5 Simple Steps to Mastering Docker for Data Science.

    Happy coding!
     
     

    Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



