
    How to Crawl an Entire Documentation Site with Olostep

    April 20, 2026

    Image by Author


    # Introduction

     
    Web crawling is the process of automatically visiting web pages, following links, and collecting content from a website in a structured way. It is commonly used to gather large amounts of information from documentation sites, articles, knowledge bases, and other web resources.

    Crawling an entire website and then converting that content into a format that an AI agent can actually use is not as simple as it sounds. Documentation sites often contain nested pages, repeated navigation links, boilerplate content, and inconsistent page structures. On top of that, the extracted content needs to be cleaned, organized, and saved in a way that is useful for downstream AI workflows such as retrieval, question-answering, or agent-based systems.

    In this guide, we will look at why Olostep is a better fit than Scrapy or Selenium for this task, set up everything needed for the web crawling project, write a simple crawling script to scrape a documentation website, and finally build a Gradio frontend so that anyone can supply a link and a few settings to crawl a site's pages.

     

    # Choosing Olostep Over Scrapy or Selenium

     
    Scrapy is powerful, but it is built as a full scraping framework. That is useful when you want deep control, but it also means more setup and more engineering work.

    Selenium is better known for browser automation. It is useful for interacting with JavaScript-heavy pages, but it was not designed to serve as a documentation-crawling workflow on its own.

    With Olostep, the pitch is a lot more direct: search, crawl, scrape, and structure web data through one application programming interface (API), with support for LLM-friendly outputs like Markdown, text, HTML, and structured JSON. That means you do not have to manually stitch together pieces for discovery, extraction, formatting, and downstream AI use in the same way.

    For documentation sites, that can give you a much faster path from URL to usable content because you are spending less time building the crawling stack yourself and more time working with the content you actually need.

     

    # Installing the Packages and Setting the API Key

     
    First, install the Python packages used in this project. The official Olostep software development kit (SDK) requires Python 3.11 or later.

    pip install olostep python-dotenv tqdm

     

    These packages handle the main parts of the workflow:

    • olostep connects your script to the Olostep API
    • python-dotenv loads your API key from a .env file
    • tqdm adds a progress bar so you can track saved pages

    Next, create a free Olostep account, open the dashboard, and generate an API key from the API keys page. Olostep’s official docs and integrations point users to the dashboard for API key setup.

     

    Olostep Dashboard API Key Setup

     

    Then create a .env file in your project folder:

    OLOSTEP_API_KEY=your_real_api_key_here

     

    This keeps your API key separate from your Python code, which is a cleaner and safer way to manage credentials.
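    If you would rather not add the python-dotenv dependency, the .env format used here is simple enough to load with a few lines of standard-library code. This is a minimal sketch: it handles only plain KEY=value lines, not quoting, export prefixes, or variable interpolation.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Load plain KEY=value lines from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an equals sign
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault so real environment variables take precedence
        os.environ.setdefault(key.strip(), value.strip())
```

    Either approach ends with the key available via os.getenv("OLOSTEP_API_KEY"), which is what the crawler script checks for later.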

     

    # Creating the Crawler Script

     
    In this part of the project, we will build the Python script that crawls a documentation website, extracts each page in Markdown format, cleans the content, and saves it locally as individual files. We will create the project folder, add a Python file, and then write the code step by step so it is easy to follow and test.

    First, create a project folder for your crawler. Inside that folder, create a new Python file named crawl_docs_with_olostep.py.

    Now we will add the code to this file one section at a time. This makes it easier to understand what each part of the script does and how the full crawler works together.

     

    // Defining the Crawl Settings

    Start by importing the required libraries. Then define the main crawl settings, such as the starting URL, crawl depth, page limit, include and exclude rules, and the output folder where the Markdown files will be saved. These values control how much of the documentation site gets crawled and where the results are stored.

    import os
    import re
    from pathlib import Path
    from urllib.parse import urlparse
    
    from dotenv import load_dotenv
    from tqdm import tqdm
    from olostep import Olostep
    
    START_URL = "https://docs.olostep.com/"
    MAX_PAGES = 10
    MAX_DEPTH = 1
    
    INCLUDE_URLS = [
        "/**"
    ]
    
    EXCLUDE_URLS = []
    
    OUTPUT_DIR = Path("olostep_docs_output")

     

    // Creating a Helper Function to Generate Safe File Names

    Each crawled page needs to be saved as its own Markdown file. To do that, we need a helper function that converts a URL into a clean and filesystem-safe file name. This avoids problems with slashes, symbols, and other characters that do not work well in file names.

    def slugify_url(url: str) -> str:
        parsed = urlparse(url)
        path = parsed.path.strip("/")
    
        if not path:
            path = "index"
    
        filename = re.sub(r"[^a-zA-Z0-9/_-]+", "-", path)
        filename = filename.replace("/", "__").strip("-_")
    
        return f"{filename or 'page'}.md"
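    A quick standalone sanity check shows how URLs map to file names; the slug logic is repeated inline here so the snippet runs on its own.

```python
import re
from urllib.parse import urlparse


def slugify_url(url: str) -> str:
    # Same logic as the crawler's helper, inlined so this runs standalone
    path = urlparse(url).path.strip("/")
    if not path:
        path = "index"
    filename = re.sub(r"[^a-zA-Z0-9/_-]+", "-", path)
    filename = filename.replace("/", "__").strip("-_")
    return f"{filename or 'page'}.md"


print(slugify_url("https://docs.olostep.com/"))
# index.md
print(slugify_url("https://docs.olostep.com/get-started/quickstart"))
# get-started__quickstart.md
```

    The site root becomes index.md, and nested paths use double underscores instead of slashes, so every page lands as a flat, filesystem-safe file.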

     

    // Creating a Helper Function to Save Markdown Files

    Next, add helper functions to process the extracted content before saving it.

    The first function cleans the Markdown by removing extra interface text, repeated blank lines, and unwanted page elements such as feedback prompts. This helps keep the saved files focused on the actual documentation content.

    def clean_markdown(markdown: str) -> str:
        text = markdown.replace("\r\n", "\n").strip()
        text = re.sub(r"\[\s*\u200b?\s*\]\(#.*?\)", "", text, flags=re.DOTALL)
    
        lines = [line.rstrip() for line in text.splitlines()]
    
        start_index = 0
        for index in range(len(lines) - 1):
            title = lines[index].strip()
            underline = lines[index + 1].strip()
            if title and underline and set(underline) == {"="}:
                start_index = index
                break
        else:
            for index, line in enumerate(lines):
                if line.lstrip().startswith("# "):
                    start_index = index
                    break
    
        lines = lines[start_index:]
    
        for index, line in enumerate(lines):
            if line.strip() == "Was this page helpful?":
                lines = lines[:index]
                break
    
        cleaned_lines: list[str] = []
        for line in lines:
            stripped = line.strip()
            if stripped in {"Copy page", "YesNo", "⌘I"}:
                continue
            if not stripped and cleaned_lines and not cleaned_lines[-1]:
                continue
            cleaned_lines.append(line)
    
        return "\n".join(cleaned_lines).strip()
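    To see what the first cleaning step does, here is a standalone check of the regex that strips the invisible in-page anchor links (a zero-width character wrapped in a Markdown link to a `#fragment`) that documentation exports often embed before headings. The sample string is made up for illustration.

```python
import re

# A heading preceded by an invisible anchor link, as often seen in exported docs
sample = "[\u200b](#introduction)Introduction\n==========\n\nReal content here."

# The same pattern used in clean_markdown: an empty or zero-width-space link
# pointing at an in-page fragment
cleaned = re.sub(r"\[\s*\u200b?\s*\]\(#.*?\)", "", sample, flags=re.DOTALL)

print(cleaned.splitlines()[0])
# Introduction
```

    With the anchor removed, the Setext-style heading detection in clean_markdown can correctly identify "Introduction" as the start of the page content.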

     

    The second function saves the cleaned Markdown into the output folder and adds the source URL at the top of the file as front matter.

    def save_markdown(output_dir: Path, url: str, markdown: str) -> None:
        output_dir.mkdir(parents=True, exist_ok=True)
        filepath = output_dir / slugify_url(url)
    
        content = f"""---
    source_url: {url}
    ---
    
    {markdown}
    """
        filepath.write_text(content, encoding="utf-8")

     

    There is also a small helper function to clear old Markdown files before saving a new crawl result.

    def clear_output_dir(output_dir: Path) -> None:
        if not output_dir.exists():
            return
    
        for filepath in output_dir.glob("*.md"):
            filepath.unlink()

     

    // Creating the Main Crawler Logic

    This is the main part of the script. It loads the API key from the .env file, creates the Olostep client, starts the crawl, waits for it to finish, retrieves each crawled page as Markdown, cleans the content, and saves it locally.

    This section ties everything together and turns the individual helper functions into a working documentation crawler.

    def main() -> None:
        load_dotenv()
        api_key = os.getenv("OLOSTEP_API_KEY")
    
        if not api_key:
            raise RuntimeError("Missing OLOSTEP_API_KEY in your .env file.")
    
        client = Olostep(api_key=api_key)
    
        crawl = client.crawls.create(
            start_url=START_URL,
            max_pages=MAX_PAGES,
            max_depth=MAX_DEPTH,
            include_urls=INCLUDE_URLS,
            exclude_urls=EXCLUDE_URLS,
            include_external=False,
            include_subdomain=False,
            follow_robots_txt=True,
        )
    
        print(f"Started crawl: {crawl.id}")
        crawl.wait_till_done(check_every_n_secs=5)
    
        pages = list(crawl.pages())
        clear_output_dir(OUTPUT_DIR)
    
        for page in tqdm(pages, desc="Saving pages"):
            try:
                content = page.retrieve(["markdown"])
                markdown = getattr(content, "markdown_content", None)
    
                if markdown:
                    save_markdown(OUTPUT_DIR, page.url, clean_markdown(markdown))
            except Exception as exc:
                print(f"Failed to retrieve {page.url}: {exc}")
    
        print(f"Done. Files saved in: {OUTPUT_DIR.resolve()}")
    
    
    if __name__ == "__main__":
        main()

     

    Note: The full script is available here: kingabzpro/web-crawl-olostep, a web crawler and starter web app built with Olostep.

     

    // Testing the Web Crawling Script

    Once the script is complete, run it from your terminal:

    python crawl_docs_with_olostep.py

     

    As the script runs, you will see the crawler process the pages and save them one by one as Markdown files in your output folder.

     

    Olostep Crawler Terminal Progress

     

    After the crawl finishes, open the saved files to check the extracted content. You should see clean, readable Markdown versions of the documentation pages.

     

    Clean Markdown Output Example

     

    At that point, your documentation content is ready to use in AI workflows such as search, retrieval, or agent-based systems.

     

    # Creating the Olostep Web Crawling Web Application

     
    In this part of the project, we will build a simple web application on top of the crawler script. Instead of editing the Python file every time, this application gives you an easier way to enter a documentation URL, choose crawl settings, run the crawl, and preview the saved Markdown files in one place.

    The frontend code for this application is available in app.py in the repository: web-crawl-olostep/app.py.

    This application does a few useful things:

    • Lets you enter a starting URL for the crawl
    • Lets you set the maximum number of pages to crawl
    • Lets you control crawl depth
    • Lets you add include and exclude URL patterns
    • Runs the backend crawler directly from the interface
    • Saves the crawled pages into a folder based on the URL
    • Shows all saved Markdown files in a dropdown
    • Previews each Markdown file directly inside the application
    • Lets you clear previous crawl results with one button

    To start the application, run:

    python app.py

    After that, Gradio will start a local web server and provide a link like this:

    * Running on local URL: http://127.0.0.1:7860
    * To create a public link, set `share=True` in `launch()`.

     

    Once the application is running, open the local URL in your browser. In our example, we gave the application the Claude Code documentation URL and asked it to crawl 50 pages with a depth of 5.

     

    Gradio Interface for Documentation Crawling

     

    When you click Run Crawl, the application passes your settings to the backend crawler and starts the crawl. In the terminal, you can watch the progress as pages are crawled and saved one by one.

     

    Crawler Terminal Output

     

    After the crawl finishes, the output folder will contain the saved Markdown files. In this example, you would see that 50 files were added.

     

    Saved Markdown Files in Output Folder

     

    The dropdown in the application is then updated automatically, so you can open any saved file and preview it directly in the web interface as properly formatted Markdown.

     

    Markdown Preview in Gradio Application

     

    This makes the crawler much easier to use. Instead of changing values in code every time, you can test different documentation sites and crawl settings through a simple interface. That also makes the project easier to share with other people who may not want to work directly in Python.

     

    # Final Takeaway

     
    Web crawling is not only about collecting pages from a website. The real challenge is turning that content into clean, structured files that an AI system can actually use. In this project, we used a simple Python script and a Gradio application to make that process much easier.

    Just as importantly, the workflow is fast enough for real use. In our example, crawling 50 pages with a depth of 5 took only around 50 seconds, which shows that you can prepare documentation data quickly without building a heavy pipeline.

    This setup can also go beyond a one-time crawl. You can schedule it to run every day with cron or Task Scheduler, and even update only the pages that have changed. That keeps your documentation fresh while using only a small number of credits.
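    On Linux or macOS, a crontab entry like the following would re-run the crawler every morning; the project path and interpreter path are placeholders you would adjust for your own machine.

```shell
# Re-crawl the documentation daily at 06:00; paths are placeholders
0 6 * * * cd /path/to/project && /usr/bin/python3 crawl_docs_with_olostep.py >> crawl.log 2>&1
```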

    For teams that need this kind of workflow to make business sense, Olostep is built with that in mind. It is significantly more affordable than building or maintaining an internal crawling solution, and at least 50% cheaper than comparable alternatives on the market.

    As your usage grows, the cost per request continues to decrease, which makes it a practical choice for larger documentation pipelines. That combination of reliability, scalability, and strong unit economics is why some of the fastest-growing AI-native startups rely on Olostep to power their data infrastructure.
     
     

    Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
