
    Beginner’s Guide to Data Extraction with LangExtract and LLMs

November 4, 2025

    Image by Author

     

Table of Contents

• # Introduction
• # 1. Installing and Setting Up
• # 2. Setting Up API Keys (for Cloud Models)
• # 3. Defining an Extraction Task
• # 4. Running the Extraction
• # 5. Handling Output and Visualization
• # 6. Supporting Input Formats
• # 7. Conclusion

    # Introduction

     
A large portion of valuable information still lives in unstructured text: research papers, clinical notes, financial reports, and so on. Extracting reliable, structured information from these sources has always been a challenge. LangExtract is an open-source Python library released by Google that tackles this problem with large language models (LLMs). You define what to extract via a simple prompt and a few examples, and the library uses an LLM (such as Google's Gemini, OpenAI models, or local models) to pull that information out of documents of any length. It also handles very long documents (through chunking and multi-pass processing) and offers interactive visualization of results. Let's explore the library in more detail.

     

    # 1. Installing and Setting Up

     
To install LangExtract locally, first ensure you have Python 3.10+ installed. The library is available on PyPI. For an isolated setup, create and activate a virtual environment, then install with pip:

    python -m venv langextract_env
    source langextract_env/bin/activate  # On Windows: .\langextract_env\Scripts\activate
    pip install langextract
    

     

Installation from source and a Docker-based setup are also documented in the project's repository.

     

    # 2. Setting Up API Keys (for Cloud Models)

     
    LangExtract itself is free and open-source, but if you use cloud-hosted LLMs (like Google Gemini or OpenAI GPT models), you must supply an API key. You can set the LANGEXTRACT_API_KEY environment variable or store it in a .env file in your working directory. For example:

    export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"

     
    or in a .env file:

    cat >> .env << 'EOF'
    LANGEXTRACT_API_KEY=your-api-key-here
    EOF
    echo '.env' >> .gitignore

     
On-device LLMs via Ollama or other local backends do not require an API key. To enable OpenAI support, run pip install "langextract[openai]", set your OPENAI_API_KEY, and pass an OpenAI model_id. For Vertex AI (enterprise users), service account authentication is supported.
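If you prefer not to add a dependency such as python-dotenv, a minimal stand-in loader for the .env file above can be written with the standard library alone. This is a sketch: load_dotenv_minimal is a hypothetical helper, not part of LangExtract, which simply reads LANGEXTRACT_API_KEY from the process environment.

```python
import os
from pathlib import Path

def load_dotenv_minimal(path=".env"):
    """Minimal .env loader: put KEY=VALUE pairs into os.environ
    without overwriting variables that are already set."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_dotenv_minimal()
api_key = os.environ.get("LANGEXTRACT_API_KEY")
```

Because it uses setdefault, a key exported in your shell always wins over the .env file.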

     

    # 3. Defining an Extraction Task

     
    LangExtract works by you telling it what information to extract. You do this by writing a clear prompt description and supplying one or more ExampleData annotations that show what a correct extraction looks like on sample text. For instance, to extract characters, emotions, and relationships from a line of literature, you might write:

    import langextract as lx
    
    prompt = """
      Extract characters, emotions, and relationships in order of appearance.
      Use exact text for extractions. Do not paraphrase or overlap entities.
      Provide meaningful attributes for each entity to add context."""
    examples = [
        lx.data.ExampleData(
            text="ROMEO. But soft! What light through yonder window breaks? ...",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character",
                    extraction_text="ROMEO",
                    attributes={"emotional_state": "wonder"}
                ),
                lx.data.Extraction(
                    extraction_class="emotion",
                    extraction_text="But soft!",
                    attributes={"feeling": "gentle awe"}
                )
            ]
        )
    ]

     
    These examples (taken from LangExtract’s README) tell the model exactly what kind of structured output is expected. You can create similar examples for your domain.

     

    # 4. Running the Extraction

     
    Once your prompt and examples are defined, you simply call the lx.extract() function. The key arguments are:

    • text_or_documents: Your input text, a list of texts, or a URL string (LangExtract can fetch and process text from Project Gutenberg or another URL).
    • prompt_description: The extraction instructions (a string).
    • examples: A list of ExampleData that illustrate the desired output.
    • model_id: The identifier of the LLM to use (e.g. "gemini-2.5-flash" for Google Gemini Flash, or an Ollama model like "gemma2:2b", or an OpenAI model like "gpt-4o").
    • Other optional parameters: extraction_passes (to re-run extraction for higher recall on long texts), max_workers (to do parallel processing on chunks), fence_output, use_schema_constraints, etc.

    For example:

    input_text = """JULIET. O Romeo, Romeo! wherefore art thou Romeo?
    Deny thy father and refuse thy name;
    Or, if thou wilt not, be but sworn my love,
    And I'll no longer be a Capulet.
    ROMEO. Shall I hear more, or shall I speak at this?
    JULIET. 'Tis but thy name that is my enemy;
    Thou art thyself, though not a Montague.
    What's in a name? That which we call a rose
    By any other name would smell as sweet."""
    
    
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash"
    )

     
    This sends the prompt and examples along with the text to the chosen LLM and returns a Result object. LangExtract automatically handles tokenizing long texts into chunks, batching calls in parallel, and merging the outputs.
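For intuition, the chunking step can be pictured as splitting the input into overlapping windows, so that an entity falling near a boundary is still seen whole in at least one window. This is a toy sketch of the idea only; LangExtract's actual chunker is token-aware and considerably smarter.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into overlapping character windows so that
    spans near a chunk boundary appear whole in at least one chunk."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # advance by window minus overlap
    return chunks

# Usage: a 2,500-character document with 1,000-char windows.
pieces = chunk_text("a" * 2500, max_chars=1000, overlap=100)
```

Each chunk would then be sent to the LLM independently (in parallel), and the per-chunk extractions merged back against the original character offsets.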

     

    # 5. Handling Output and Visualization

     
    The output of lx.extract() is a Python object (often called result) that contains the extracted entities and attributes. You can inspect it programmatically or save it for later. LangExtract also provides helper functions to save results: for example, you can write the results to a JSONL (JSON Lines) file (one document per line) and generate an interactive HTML review. For example:

    lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
    html = lx.visualize("extraction_results.jsonl")
    with open("viz.html", "w") as f:
        f.write(html if isinstance(html, str) else html.data)

     
    This writes an extraction_results.jsonl file and an interactive viz.html file. The JSONL format is convenient for large datasets and further processing, and the HTML file highlights each extracted span in context (color-coded by class) for easy human inspection like this:
     
    Output and Visualization: LangExtract
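Once saved, the JSONL file can be post-processed with the standard library alone. The sketch below assumes one JSON object per line (which is what JSON Lines specifies); the field names extractions and extraction_class mirror the data classes shown earlier, but verify them against your actual output file.

```python
import json
from collections import Counter

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def class_counts(records, key="extractions"):
    """Count extracted entities per extraction_class across all documents."""
    counts = Counter()
    for rec in records:
        for ex in rec.get(key, []):
            counts[ex.get("extraction_class", "unknown")] += 1
    return counts
```

For example, class_counts(read_jsonl("extraction_results.jsonl")) gives a quick per-class tally before deeper analysis.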
     

    # 6. Supporting Input Formats

     
    LangExtract is flexible about input. You can supply:

    • Plain text strings: Any text you load into Python (e.g. from a file or database) can be processed.
    • URLs: As shown above, you can pass a URL (e.g. a Project Gutenberg link) as text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt". LangExtract will download and extract from that document.
    • List of texts: Pass a Python list of strings to process multiple documents in one call.
    • Rich text or Markdown: Since LangExtract works at the text level, you can also feed in Markdown or HTML if you pre-process it to raw text. (LangExtract itself doesn't parse PDFs or images; you need to extract the text first.)
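As a concrete example of that pre-processing step, here is a minimal HTML-to-text converter built on the standard library's html.parser. It is a sketch; for production use you would likely reach for a dedicated parser such as BeautifulSoup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

The resulting plain string can then be passed straight to lx.extract() as text_or_documents.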

     

    # 7. Conclusion

     
    LangExtract makes it easy to turn unstructured text into structured data. With high accuracy, clear source mapping, and simple customization, it works well when rule-based methods fall short. It is especially useful for complex or domain-specific extractions. While there is room for improvement, LangExtract is already a strong tool for extracting grounded information in 2025.
     
     

    Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
