    The Better Way For Document Chatbots?

    By gvfx00@gmail.com | March 21, 2026 | 13 Mins Read


    What if the way we build AI document chatbots today is flawed? Most systems use RAG. They split documents into chunks, create embeddings, and retrieve answers using similarity search. It works in demos but often fails in real use. It misses obvious answers or picks the wrong context. Now there is a new approach called PageIndex. It does not use chunking, embeddings, or vector databases. Yet it reaches up to 98.7% accuracy on tough document Q&A tasks. In this article, we will break down how PageIndex works, why it performs better on structured documents, and how you can build your own chatbot using it.


    Table of Contents

    • The Problem with Traditional RAG
      • Problem 1: Arbitrary chunking destroys context
      • Problem 2: Similarity is not the same as relevance
      • Problem 3: It’s a black box
      • Problem 4: It doesn’t scale to long documents
    • What is PageIndex?
      • 1. Tree Search (Navigation)
    • How it Works: Deep Dive
      • The Tree Index – Building Phase
      • Why This Beats Chunking
      • The Search Phase – Reasoning, Not Math
      • The Numbers
      • What is it Best For?
    • Hands-on With Jupyter Notebook
      • Install PageIndex
      • Imports & API Setup
      • OpenAI Setup
      • Submit the Document
      • Wait for Processing & Get the Tree
      • Tree Search with the LLM
      • Fetch Text and Generate Answer
      • The ask() Function
    • Conclusion

    The Problem with Traditional RAG

    Here’s the classic RAG pipeline you’ve probably seen a hundred times.

    • You take your document – could be a PDF, a report, a contract – and you chop it into chunks. Maybe 512 tokens each, maybe with some overlap.
    • You run each chunk through an embedding model to turn it into a vector — a long list of numbers that represents the “meaning” of that chunk.
    • You store all those vectors in a vector database — Pinecone, Weaviate, Chroma, whatever your flavour is.
    • When the user asks a question, you embed the question the same way, and you do a cosine similarity search to find the chunks whose vectors are closest to the question vector.
    • You hand those chunks to the LLM as context, and it writes the answer.

    Simple. Elegant. And absolutely riddled with failure modes.
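To make those steps concrete, here is a toy version of the pipeline in plain Python. The bag-of-words "embedding" and the three-chunk corpus are stand-ins for a real embedding model and vector database; only the shape of the pipeline matters.

```python
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. "Chunk" the document (here: one sentence per chunk)
chunks = [
    "Revenue grew 12% year over year in Q3.",
    "Section 14.3 covers dissolution of the agreement.",
    "Employees accrue 20 vacation days per year.",
]

# 2-3. Embed each chunk and store the vectors (our "vector database" is a list)
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4. Embed the question and retrieve the most similar chunk
question = "How many vacation days do employees get?"
q_vec = embed(question)
best_chunk = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# 5. best_chunk would now be handed to the LLM as context
print(best_chunk)  # → the vacation-days chunk, since it shares the question's words
```

This works here precisely because the question repeats the chunk's wording, which is the happy path the failure modes below break.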

    Problem 1: Arbitrary chunking destroys context

    When you slice a document at 512 tokens, you’re not respecting the document’s actual structure. A single table might get split across three chunks. A footnote that’s critical to understanding the main text ends up in a completely different chunk. The answer you need might literally span two adjacent chunks, of which the retriever picks only one.

    Problem 2: Similarity is not the same as relevance

    This is the big one. Vector similarity finds text that sounds like your question. But documents often don’t repeat the question’s phrasing when they answer it. Ask “What is the termination clause?” and the contract might just say “Section 14.3 — Dissolution of Agreement.” Low cosine similarity. Missed entirely.
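You can see this failure with nothing more than word overlap. The toy bag-of-words vector below exaggerates the effect — real embedding models capture some paraphrase similarity — but the direction of the failure is the same: paraphrased answers score lower than verbatim matches.

```python
from collections import Counter
import math

def bow(text):
    # Toy bag-of-words vector (stand-in for a real embedding model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

question = bow("what is the termination clause")
answer_text = bow("section 14.3 dissolution of agreement")

# The contract text answers the question but shares zero words with it
print(cosine(question, answer_text))  # → 0.0
```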

    Problem 3: It’s a black box

    You get three chunks back. Why those three? You have no idea. It’s pure math. There’s no reasoning, no explanation, no audit trail. For financial documents, legal contracts, and medical records? That opacity is a serious problem.

    Problem 4: It doesn’t scale to long documents

    A 300-page technical manual with complex cross-references? The sheer number of chunks makes retrieval noisy. You end up getting chunks that are vaguely related instead of the exact section you need.

    These aren’t edge cases. These are the everyday failures that RAG engineers spend most of their time fighting. And the reason they happen is actually pretty simple — the entire architecture is borrowed from search engines, not from how humans actually read and understand documents.

    When a human expert needs to answer a question from a document, they don’t scan every sentence looking for the one that sounds most similar to the question. They open the table of contents, skim the chapter headings, navigate, and reason about where the answer should be before they even start reading.

    That’s the insight behind PageIndex.

    What is PageIndex?

    PageIndex was built by VectifyAI and open-sourced on GitHub. The core idea is deceptively simple:

    Instead of searching a document, navigate it, the way a human expert would.

    Here’s the key mental shift. Traditional RAG asks: “Which chunks look most similar to my question?”

    PageIndex asks: “Where in this document would a smart human look for the answer to this question?”

    Those are two very different questions. And the second one turns out to produce dramatically better results.

    PageIndex does this by building what it calls a Reasoning Tree. It is essentially an intelligent, AI-generated table of contents for your document.

    Here’s how to visualize it. At the top, you have a root node that represents the entire document. Below that, you have nodes for each major section or chapter. Each of those branches into subsections. Each subsection branches into specific topics or paragraphs. Every single node in this tree has two things:

    1. A title: what this section is about
    2. A summary: a concise AI-generated description of what’s in this section

    This tree is built once, when you first submit the document. It’s your index.
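A reasoning tree for an HR policy document might look roughly like the sketch below. The field names (node_id, title, summary, nodes) are illustrative assumptions, not the exact schema the PageIndex API returns.

```python
# Illustrative sketch of a reasoning tree; field names are assumptions,
# not the exact PageIndex schema.
tree = {
    "node_id": "0000",
    "title": "HR Policies",
    "summary": "Company-wide HR policies covering communication, conduct, and grievances.",
    "nodes": [
        {
            "node_id": "0001",
            "title": "Electronic Communication Policy",
            "summary": "Acceptable use of email, chat, and company devices.",
            "nodes": [],
        },
        {
            "node_id": "0002",
            "title": "Sexual Harassment Policy",
            "summary": "Definitions, reporting channels, and penalties.",
            "nodes": [],
        },
    ],
}

# Every node carries a title and a summary; during tree search only these
# travel to the LLM, never the full section text.
```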

    Now here’s where it gets clever. When you ask a question, PageIndex does two things:

    1. Tree Search (Navigation)

    It sends the question to an LLM along with the tree, but just the titles and summaries, not the full text. The LLM reads through the tree like a human reads a table of contents, and it reasons: “Okay, given this question, which branches of the tree are most likely to contain the answer?”

    The LLM returns a list of specific node IDs, and you can see its reasoning. It literally tells you why it chose those sections. Full transparency.

    Then comes step two: PageIndex fetches only the full text of those selected nodes, hands it to the LLM as context, and the LLM writes the final answer, grounded entirely in the real document text.

    Two LLM calls. No embeddings. No vector database. Just reasoning.

    And because every answer is tied to specific nodes in the tree, you always know exactly which page, which section, which part of the document the answer came from. Complete audit trail. Complete explainability.

    How it Works: Deep Dive

    Let me go deeper into the mechanics, because this is the really interesting part.

    The Tree Index – Building Phase

    When you call submit_document(), PageIndex reads your PDF or text file and does something remarkable: it doesn’t just extract text, it also understands the structure. Using a combination of layout analysis and LLM reasoning, it identifies:

    • What are the natural sections and subsections?
    • Where does one topic end and another begin?
    • How do the pieces relate to each other hierarchically?

    It then constructs the tree and generates a summary for every node. Not just a title. An actual condensed description of what’s in that section. This is what enables the smart navigation later.

    The tree uses a numeric node ID system that mirrors real document structure: 0001 might be Chapter 1, 0002 Chapter 2, 0003 the first section inside Chapter 1, and so on. The hierarchy is preserved.
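A depth-first walk makes that preserved hierarchy visible. This is a hypothetical sketch over the same kind of nested-dict tree; PageIndex ships its own utils.print_tree helper for the real structure.

```python
def walk(node, depth=0):
    """Depth-first walk yielding (node_id, indented title) for each node."""
    yield node["node_id"], "  " * depth + node["title"]
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)

# Hypothetical mini-tree in the spirit of the numeric-ID scheme above
tree = {"node_id": "0001", "title": "Chapter 1", "nodes": [
    {"node_id": "0003", "title": "Section 1.1", "nodes": []},
]}

for node_id, title in walk(tree):
    print(node_id, title)
# Prints:
# 0001 Chapter 1
# 0003   Section 1.1
```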

    Why This Beats Chunking

    Think about what chunking does to a 50-page financial report. You get maybe 300 chunks, each with zero awareness of whether it’s from the executive summary or a footnote on page 47. The embedder treats them all equally.

    The PageIndex tree, on the other hand, knows that node 0012 is the “Revenue Breakdown” subsection under the “Q3 Financial Results” section under “Annual Report 2024.” That structural awareness is enormously valuable when you’re trying to find something specific.

    The Search Phase – Reasoning, Not Math

    Here’s the other thing that makes PageIndex special. The search step is not a mathematical operation. It’s a cognitive operation performed by an LLM.

    When you ask, “What were the main risk factors disclosed in this report?”, the LLM doesn’t measure cosine distance. It reads the tree, recognizes that the “Risk Factors” section is exactly what’s needed, and selects those nodes, just like you would.

    This means PageIndex naturally handles the semantic mismatch that kills vector search. The document calls it “Risk Factors.” Your question calls it “main dangers.” A vector search might miss it. An LLM reading the tree structure will not.

    The Numbers

    PageIndex powered Mafin 2.5, VectifyAI’s financial RAG system, which achieved 98.7% accuracy on FinanceBench. For those unaware, this is a benchmark specifically designed to test AI systems on financial document questions, where the documents are long, complex, and full of tables and cross-references. That’s the hardest environment for traditional RAG. It’s where PageIndex shines most.

    What is it Best For?

    PageIndex is particularly powerful for:

    • Financial reports: earnings statements, SEC filings, 10-Ks
    • Legal contracts: where every clause matters and context is everything
    • Technical manuals: complex cross-referenced documentation
    • Policy documents: HR policies, compliance documents, regulatory filings
    • Research papers: structured academic content

    Basically: anywhere your document has real structure that chunking would destroy.

    And the really exciting thing? You can use it with any LLM. OpenAI, Anthropic, Gemini — the tree search and answer generation steps are just prompts. You’re in full control.

    Hands-on With Jupyter Notebook

    Okay. You now know the theory. You know why PageIndex exists, what it does, and how it works under the hood. Now let’s actually build something with it.

    I’m going to open a Jupyter notebook and walk you through the complete PageIndex pipeline: uploading a document, getting the reasoning tree back, navigating it with an LLM, and asking questions. Every line of code is explained. No hand-waving.

    Install PageIndex

    %pip install -q --upgrade pageindex

     First things first. We install the pageindex Python library. One line, done. No vector database to set up. No embedding model to download. This is already simpler than any traditional RAG setup.

    Imports & API Setup

    import os
    from pageindex import PageIndexClient
    import pageindex.utils as utils
    from dotenv import load_dotenv
    load_dotenv()
    PAGEINDEX_API_KEY = os.getenv("PAGEINDEX_API_KEY")
    pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

    We import the PageIndexClient. This is our connection to the PageIndex API. All the heavy lifting of building the tree happens on their end, so we don’t need a beefy machine. We also load API keys from a .env file — always keep your keys out of your code.

    OpenAI Setup

    import openai

    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

    async def call_llm(prompt, model="gpt-4.1-mini", temperature=0):
        client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
        response = await client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    Here we define our LLM helper function. We’re using GPT-4.1-mini for cost efficiency — but this works with any OpenAI model, and you could swap in Claude or Gemini with a one-line change. Temperature zero keeps the answers factual and consistent.

    Submit the Document

    pdf_path = "/Users/soumil/Desktop/PageIndex/HR Policies-1.pdf" 
    doc_id = pi_client.submit_document(pdf_path)["doc_id"] 
    print('Document Submitted:', doc_id)

    This is the magic line. We point to our PDF — in this case an HR policy document — and submit it. PageIndex takes the file, reads its structure, and starts building the reasoning tree in the background. We get back a doc_id, a unique identifier for this document that we’ll use in every subsequent call. Notice there’s no chunking code, no embedding call, no vector database connection.

    Wait for Processing & Get the Tree

    import time

    while not pi_client.is_retrieval_ready(doc_id):
        print("Still processing... retrying in 10 seconds")
        time.sleep(10)

    tree = pi_client.get_tree(doc_id, node_summary=True)['result']
    utils.print_tree(tree)

    PageIndex processes the document asynchronously — we just poll every 10 seconds until it’s ready. Then we call get_tree() with node_summary=True, which gives us the full tree structure including summaries.

    Look at this output. This is the reasoning tree. You can see the hierarchy — the top-level HR Policies node, then Electronic Communication Policy, Sexual Harassment Policy, Grievance Redressal Policy, each branching into its subsections. Every node has an ID, a title, and a summary of what’s in it.

    This is what traditional RAG throws away. The structure. The relationships. The hierarchy. PageIndex keeps all of it.

    Tree Search with the LLM

    import json

    query = "What are the key HR policies and employee guidelines?"
    tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])
    search_prompt = f"""
    You are given a question and a tree structure of a document...
    Question: {query}
    Document tree structure: {json.dumps(tree_without_text, indent=2)}
    Reply in JSON: {{ "thinking": "...", "node_list": [...] }}
    """
    tree_search_result = await call_llm(search_prompt)

    Now we search. For this, we build a prompt that includes the question and the entire tree — but crucially, without the full text content of each node. Just the titles and summaries. This keeps the prompt manageable while giving the LLM everything it needs to navigate.

    The LLM is instructed to return a JSON object with two things: its thinking process and the list of relevant node IDs.

    Look at the output. The LLM tells us exactly why it chose each section. It reasoned through the tree like a human would. And it gave us a list of 30 node IDs — every section of this HR document, because the question is broad.

    This transparency is something you simply can’t get with cosine similarity.

    Fetch Text and Generate Answer

    import json

    # Parse the JSON reply from the tree-search step
    tree_search_result_json = json.loads(tree_search_result)
    node_list = tree_search_result_json["node_list"]

    # node_map maps each node_id to its full node (including text)
    relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)

    answer_prompt = f"""Answer the question based on the context:
    Question: {query}
    Context: {relevant_content}"""
    answer = await call_llm(answer_prompt)
    utils.print_wrapped(answer)

    Step two. Now that we know which nodes are relevant, we fetch their full text — only those nodes, nothing else. We join the text and build a clean context prompt. One more LLM call, and we get our answer.

    Look at this answer. Detailed, structured, accurate. And every single claim can be traced back to a specific node in the tree, which maps to a specific page in the PDF. Full audit trail. Full explainability.

    The ask() Function

    async def ask(query): 
        # Full pipeline: tree search → text retrieval → answer generation 
        ... 
     
    user_query = input("Enter your query: ") 
    await ask(user_query)

    Now we package the entire pipeline into a single ask() function. Submit a question, get an answer — the tree search, retrieval, and generation all happen under the hood. Let me show you a couple of live examples.
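A minimal version of that ask() function is sketched below, written so the LLM caller and the node lookup are passed in as arguments. The prompt wording and parameter names are assumptions based on the steps above, not the notebook's exact code.

```python
import json

async def ask(query, tree_without_text, node_map, call_llm):
    """Two-call pipeline: tree search first, then grounded answer generation."""
    # Call 1: let the LLM navigate the tree (titles + summaries only)
    search_prompt = (
        "You are given a question and a tree structure of a document.\n"
        f"Question: {query}\n"
        f"Document tree structure: {json.dumps(tree_without_text)}\n"
        'Reply in JSON: {"thinking": "...", "node_list": [...]}'
    )
    selected = json.loads(await call_llm(search_prompt))

    # Call 2: fetch the full text of only the selected nodes and answer
    context = "\n\n".join(node_map[n]["text"] for n in selected["node_list"])
    answer_prompt = (
        f"Answer the question based on the context:\nQuestion: {query}\nContext: {context}"
    )
    return await call_llm(answer_prompt)
```

Injecting call_llm this way also makes the pipeline trivial to unit-test with a stubbed LLM, and to swap between OpenAI, Claude, or Gemini without touching the navigation logic.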

    Type a question: e.g., “What are the penalties for sexual harassment?”

    Watch what happens. It searches the tree, identifies the Sexual Harassment Policy nodes specifically, pulls their text, and gives us a precise, cited answer in seconds. This is the experience you want to deliver to your users.

    Another one. Again, it finds exactly the right section. No confusion, no noise, no hallucination. Just the answer, from the document, with a clear trail showing where it came from.

    Conclusion

    Let’s bring this together. Traditional RAG finds text that looks similar to a question. But the real goal is to find the right answer in a structured document. PageIndex solves this better: it builds a reasoning tree and lets the model navigate it intelligently. The result is accurate, explainable answers, with up to 98.7% accuracy on FinanceBench. It is not perfect for every use case; vector search still works well for large-scale semantic search. But for long, structured documents, PageIndex is the stronger approach. All of the code is in the walkthrough above. Add your API keys and get started.



    I am a Data Science Trainee at Analytics Vidhya, passionately working on the development of advanced AI solutions such as Generative AI applications, Large Language Models, and cutting-edge AI tools that push the boundaries of technology. My role also involves creating engaging educational content for Analytics Vidhya’s YouTube channels, developing comprehensive courses that cover the full spectrum of machine learning to generative AI, and authoring technical blogs that connect foundational concepts with the latest innovations in AI. Through this, I aim to contribute to building intelligent systems and share knowledge that inspires and empowers the AI community.
