
    How to Crawl an Entire Documentation Site with Olostep

    April 20, 2026

    Image by Author


    # Introduction

     
    Web crawling is the process of automatically visiting web pages, following links, and collecting content from a website in a structured way. It is commonly used to gather large amounts of information from documentation sites, articles, knowledge bases, and other web resources.

    Crawling an entire website and then converting that content into a format that an AI agent can actually use is not as simple as it sounds. Documentation sites often contain nested pages, repeated navigation links, boilerplate content, and inconsistent page structures. On top of that, the extracted content needs to be cleaned, organized, and saved in a way that is useful for downstream AI workflows such as retrieval, question-answering, or agent-based systems.

    In this guide, we will look at why Olostep is a better fit than Scrapy or Selenium for this task, set up everything needed for the web crawling project, write a simple crawling script to scrape a documentation website, and finally build a Gradio frontend so that anyone can supply a link and a few settings to crawl a site's pages.

     

    # Choosing Olostep Over Scrapy or Selenium

     
    Scrapy is powerful, but it is built as a full scraping framework. That is useful when you want deep control, but it also means more setup and more engineering work.

    Selenium is better known for browser automation. It is useful for interacting with JavaScript-heavy pages, but it was not designed to serve as a documentation-crawling workflow on its own.

    With Olostep, the pitch is a lot more direct: search, crawl, scrape, and structure web data through one application programming interface (API), with support for LLM-friendly outputs like Markdown, text, HTML, and structured JSON. That means you do not have to manually stitch together pieces for discovery, extraction, formatting, and downstream AI use in the same way.

    For documentation sites, that can give you a much faster path from URL to usable content because you are spending less time building the crawling stack yourself and more time working with the content you actually need.

     

    # Installing the Packages and Setting the API Key

     
    First, install the Python packages used in this project. The official Olostep software development kit (SDK) requires Python 3.11 or later.

    pip install olostep python-dotenv tqdm

     

    These packages handle the main parts of the workflow:

    • olostep connects your script to the Olostep API
    • python-dotenv loads your API key from a .env file
    • tqdm adds a progress bar so you can track saved pages

    Next, create a free Olostep account, open the dashboard, and generate an API key from the API keys page. Olostep’s official docs and integrations point users to the dashboard for API key setup.

     

    Olostep Dashboard API Key Setup

     

    Then create a .env file in your project folder:

    OLOSTEP_API_KEY=your_real_api_key_here

     

    This keeps your API key separate from your Python code, which is a cleaner and safer way to manage credentials.
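    If you would rather not add the python-dotenv dependency, the .env format used here is simple enough to load with a few lines of standard-library code. This is a minimal sketch: it handles only plain KEY=value lines, not quoting, export prefixes, or variable interpolation.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Load plain KEY=value lines from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an equals sign
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault so real environment variables take precedence
        os.environ.setdefault(key.strip(), value.strip())
```

    Either approach ends with the key available via os.getenv("OLOSTEP_API_KEY"), which is what the crawler script checks for later.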

     

    # Creating the Crawler Script

     
    In this part of the project, we will build the Python script that crawls a documentation website, extracts each page in Markdown format, cleans the content, and saves it locally as individual files. We will create the project folder, add a Python file, and then write the code step by step so it is easy to follow and test.

    First, create a project folder for your crawler. Inside that folder, create a new Python file named crawl_docs_with_olostep.py.

    Now we will add the code to this file one section at a time. This makes it easier to understand what each part of the script does and how the full crawler works together.

     

    // Defining the Crawl Settings

    Start by importing the required libraries. Then define the main crawl settings, such as the starting URL, crawl depth, page limit, include and exclude rules, and the output folder where the Markdown files will be saved. These values control how much of the documentation site gets crawled and where the results are stored.

    import os
    import re
    from pathlib import Path
    from urllib.parse import urlparse
    
    from dotenv import load_dotenv
    from tqdm import tqdm
    from olostep import Olostep
    
    START_URL = "https://docs.olostep.com/"
    MAX_PAGES = 10
    MAX_DEPTH = 1
    
    INCLUDE_URLS = [
        "/**"
    ]
    
    EXCLUDE_URLS = []
    
    OUTPUT_DIR = Path("olostep_docs_output")

     

    // Creating a Helper Function to Generate Safe File Names

    Each crawled page needs to be saved as its own Markdown file. To do that, we need a helper function that converts a URL into a clean and filesystem-safe file name. This avoids problems with slashes, symbols, and other characters that do not work well in file names.

    def slugify_url(url: str) -> str:
        parsed = urlparse(url)
        path = parsed.path.strip("/")
    
        if not path:
            path = "index"
    
        filename = re.sub(r"[^a-zA-Z0-9/_-]+", "-", path)
        filename = filename.replace("/", "__").strip("-_")
    
        return f"{filename or 'page'}.md"
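    A quick standalone sanity check shows how URLs map to file names; the slug logic is repeated inline here so the snippet runs on its own.

```python
import re
from urllib.parse import urlparse


def slugify_url(url: str) -> str:
    # Same logic as the crawler's helper, inlined so this runs standalone
    path = urlparse(url).path.strip("/")
    if not path:
        path = "index"
    filename = re.sub(r"[^a-zA-Z0-9/_-]+", "-", path)
    filename = filename.replace("/", "__").strip("-_")
    return f"{filename or 'page'}.md"


print(slugify_url("https://docs.olostep.com/"))
# index.md
print(slugify_url("https://docs.olostep.com/get-started/quickstart"))
# get-started__quickstart.md
```

    The site root becomes index.md, and nested paths use double underscores instead of slashes, so every page lands as a flat, filesystem-safe file.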

     

    // Creating a Helper Function to Save Markdown Files

    Next, add helper functions to process the extracted content before saving it.

    The first function cleans the Markdown by removing extra interface text, repeated blank lines, and unwanted page elements such as feedback prompts. This helps keep the saved files focused on the actual documentation content.

    def clean_markdown(markdown: str) -> str:
        text = markdown.replace("\r\n", "\n").strip()
        text = re.sub(r"\[\s*\u200b?\s*\]\(#.*?\)", "", text, flags=re.DOTALL)
    
        lines = [line.rstrip() for line in text.splitlines()]
    
        start_index = 0
        for index in range(len(lines) - 1):
            title = lines[index].strip()
            underline = lines[index + 1].strip()
            if title and underline and set(underline) == {"="}:
                start_index = index
                break
        else:
            for index, line in enumerate(lines):
                if line.lstrip().startswith("# "):
                    start_index = index
                    break
    
        lines = lines[start_index:]
    
        for index, line in enumerate(lines):
            if line.strip() == "Was this page helpful?":
                lines = lines[:index]
                break
    
        cleaned_lines: list[str] = []
        for line in lines:
            stripped = line.strip()
            if stripped in {"Copy page", "YesNo", "⌘I"}:
                continue
            if not stripped and cleaned_lines and not cleaned_lines[-1]:
                continue
            cleaned_lines.append(line)
    
        return "\n".join(cleaned_lines).strip()
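    To see what the first cleaning step does, here is a standalone check of the regex that strips the invisible in-page anchor links (a zero-width character wrapped in a Markdown link to a `#fragment`) that documentation exports often embed before headings. The sample string is made up for illustration.

```python
import re

# A heading preceded by an invisible anchor link, as often seen in exported docs
sample = "[\u200b](#introduction)Introduction\n==========\n\nReal content here."

# The same pattern used in clean_markdown: an empty or zero-width-space link
# pointing at an in-page fragment
cleaned = re.sub(r"\[\s*\u200b?\s*\]\(#.*?\)", "", sample, flags=re.DOTALL)

print(cleaned.splitlines()[0])
# Introduction
```

    With the anchor removed, the Setext-style heading detection in clean_markdown can correctly identify "Introduction" as the start of the page content.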

     

    The second function saves the cleaned Markdown into the output folder and adds the source URL at the top of the file as front matter.

    def save_markdown(output_dir: Path, url: str, markdown: str) -> None:
        output_dir.mkdir(parents=True, exist_ok=True)
        filepath = output_dir / slugify_url(url)
    
        content = f"""---
    source_url: {url}
    ---
    
    {markdown}
    """
        filepath.write_text(content, encoding="utf-8")

     

    There is also a small helper function to clear old Markdown files before saving a new crawl result.

    def clear_output_dir(output_dir: Path) -> None:
        if not output_dir.exists():
            return
    
        for filepath in output_dir.glob("*.md"):
            filepath.unlink()

     

    // Creating the Main Crawler Logic

    This is the main part of the script. It loads the API key from the .env file, creates the Olostep client, starts the crawl, waits for it to finish, retrieves each crawled page as Markdown, cleans the content, and saves it locally.

    This section ties everything together and turns the individual helper functions into a working documentation crawler.

    def main() -> None:
        load_dotenv()
        api_key = os.getenv("OLOSTEP_API_KEY")
    
        if not api_key:
            raise RuntimeError("Missing OLOSTEP_API_KEY in your .env file.")
    
        client = Olostep(api_key=api_key)
    
        crawl = client.crawls.create(
            start_url=START_URL,
            max_pages=MAX_PAGES,
            max_depth=MAX_DEPTH,
            include_urls=INCLUDE_URLS,
            exclude_urls=EXCLUDE_URLS,
            include_external=False,
            include_subdomain=False,
            follow_robots_txt=True,
        )
    
        print(f"Started crawl: {crawl.id}")
        crawl.wait_till_done(check_every_n_secs=5)
    
        pages = list(crawl.pages())
        clear_output_dir(OUTPUT_DIR)
    
        for page in tqdm(pages, desc="Saving pages"):
            try:
                content = page.retrieve(["markdown"])
                markdown = getattr(content, "markdown_content", None)
    
                if markdown:
                    save_markdown(OUTPUT_DIR, page.url, clean_markdown(markdown))
            except Exception as exc:
                print(f"Failed to retrieve {page.url}: {exc}")
    
        print(f"Done. Files saved in: {OUTPUT_DIR.resolve()}")
    
    
    if __name__ == "__main__":
        main()

     

    Note: The full script is available here: kingabzpro/web-crawl-olostep, a web crawler and starter web app built with Olostep.

     

    // Testing the Web Crawling Script

    Once the script is complete, run it from your terminal:

    python crawl_docs_with_olostep.py

     

    As the script runs, you will see the crawler process the pages and save them one by one as Markdown files in your output folder.

     

    Olostep Crawler Terminal Progress

     

    After the crawl finishes, open the saved files to check the extracted content. You should see clean, readable Markdown versions of the documentation pages.

     

    Clean Markdown Output Example

     

    At that point, your documentation content is ready to use in AI workflows such as search, retrieval, or agent-based systems.

     

    # Creating the Olostep Web Crawling Web Application

     
    In this part of the project, we will build a simple web application on top of the crawler script. Instead of editing the Python file every time, this application gives you an easier way to enter a documentation URL, choose crawl settings, run the crawl, and preview the saved Markdown files in one place.

    The frontend code for this application is available in app.py in the repository: web-crawl-olostep/app.py.

    This application does a few useful things:

    • Lets you enter a starting URL for the crawl
    • Lets you set the maximum number of pages to crawl
    • Lets you control crawl depth
    • Lets you add include and exclude URL patterns
    • Runs the backend crawler directly from the interface
    • Saves the crawled pages into a folder based on the URL
    • Shows all saved Markdown files in a dropdown
    • Previews each Markdown file directly inside the application
    • Lets you clear previous crawl results with one button

    To start the application, run:

    python app.py

    After that, Gradio will start a local web server and provide a link like this:

    * Running on local URL: http://127.0.0.1:7860
    * To create a public link, set `share=True` in `launch()`.

     

    Once the application is running, open the local URL in your browser. In our example, we gave the application the Claude Code documentation URL and asked it to crawl 50 pages with a depth of 5.

     

    Gradio Interface for Documentation Crawling

     

    When you click Run Crawl, the application passes your settings to the backend crawler and starts the crawl. In the terminal, you can watch the progress as pages are crawled and saved one by one.

     

    Crawler Terminal Output

     

    After the crawl finishes, the output folder will contain the saved Markdown files. In this example, you would see that 50 files were added.

     

    Saved Markdown Files in Output Folder

     

    The dropdown in the application is then updated automatically, so you can open any saved file and preview it directly in the web interface as properly formatted Markdown.

     

    Markdown Preview in Gradio Application

     

    This makes the crawler much easier to use. Instead of changing values in code every time, you can test different documentation sites and crawl settings through a simple interface. That also makes the project easier to share with other people who may not want to work directly in Python.

     

    # Final Takeaway

     
    Web crawling is not only about collecting pages from a website. The real challenge is turning that content into clean, structured files that an AI system can actually use. In this project, we used a simple Python script and a Gradio application to make that process much easier.

    Just as importantly, the workflow is fast enough for real use. In our example, crawling 50 pages with a depth of 5 took only around 50 seconds, which shows that you can prepare documentation data quickly without building a heavy pipeline.

    This setup can also go beyond a one-time crawl. You can schedule it to run every day with cron or Task Scheduler, and even update only the pages that have changed. That keeps your documentation fresh while using only a small number of credits.
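    On Linux or macOS, a crontab entry like the following would re-run the crawler every morning; the project path and interpreter path are placeholders you would adjust for your own machine.

```shell
# Re-crawl the documentation daily at 06:00; paths are placeholders
0 6 * * * cd /path/to/project && /usr/bin/python3 crawl_docs_with_olostep.py >> crawl.log 2>&1
```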

    For teams that need this kind of workflow to make business sense, Olostep is built with that in mind. It is significantly more affordable than building or maintaining an internal crawling solution, and at least 50% cheaper than comparable alternatives on the market.

    As your usage grows, the cost per request continues to decrease, which makes it a practical choice for larger documentation pipelines. That combination of reliability, scalability, and strong unit economics is why some of the fastest-growing AI-native startups rely on Olostep to power their data infrastructure.
     
     

    Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
