    The Better Way For Document Chatbots?

    By gvfx00@gmail.com | March 21, 2026 | 13 Mins Read


    What if the way we build AI document chatbots today is flawed? Most systems use RAG. They split documents into chunks, create embeddings, and retrieve answers using similarity search. It works in demos but often fails in real use. It misses obvious answers or picks the wrong context. Now there is a new approach called PageIndex. It does not use chunking, embeddings, or vector databases. Yet it reaches up to 98.7% accuracy on tough document Q&A tasks. In this article, we will break down how PageIndex works, why it performs better on structured documents, and how you can build your own chatbot using it.


    Table of Contents

    • The Problem with Traditional RAG
      • Problem 1: Arbitrary chunking destroys context
      • Problem 2: Similarity is not the same as relevance
      • Problem 3: It’s a black box
      • Problem 4: It doesn’t scale to long documents
    • What is PageIndex?
      • 1. Tree Search (Navigation)
    • How it Works: Deep Dive
      • The Tree Index – Building Phase
      • Why This Beats Chunking
      • The Search Phase – Reasoning, Not Math
      • The Numbers
      • What is it Best For?
    • Hands-on With Jupyter Notebook
      • Install PageIndex
      • Imports & API Setup
      • OpenAI Setup
      • Submit the Document
      • Wait for Processing & Get the Tree
      • Tree Search with the LLM
      • Fetch Text and Generate Answer
      • The ask() Function
    • Conclusion

    The Problem with Traditional RAG

    Here’s the classic RAG pipeline you’ve probably seen a hundred times.

    • You take your document – could be a PDF, a report, a contract – and you chop it into chunks. Maybe 512 tokens each, maybe with some overlap.
    • You run each chunk through an embedding model to turn it into a vector — a long list of numbers that represents the “meaning” of that chunk.
    • You store all those vectors in a vector database — Pinecone, Weaviate, Chroma, whatever your flavour is.
    • When the user asks a question, you embed the question the same way, and you do a cosine similarity search to find the chunks whose vectors are closest to the question vector.
    • You hand those chunks to the LLM as context, and it writes the answer.

    Simple. Elegant. And absolutely riddled with failure modes.
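To make those steps concrete, here is a toy version of the pipeline in plain Python. The bag-of-words "embedding" and the three-chunk corpus are stand-ins for a real embedding model and vector database; only the shape of the pipeline matters.

```python
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. "Chunk" the document (here: one sentence per chunk)
chunks = [
    "Revenue grew 12% year over year in Q3.",
    "Section 14.3 covers dissolution of the agreement.",
    "Employees accrue 20 vacation days per year.",
]

# 2-3. Embed each chunk and store the vectors (our "vector database" is a list)
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4. Embed the question and retrieve the most similar chunk
question = "How many vacation days do employees get?"
q_vec = embed(question)
best_chunk = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# 5. best_chunk would now be handed to the LLM as context
print(best_chunk)  # → the vacation-days chunk, since it shares the question's words
```

This works here precisely because the question repeats the chunk's wording, which is the happy path the failure modes below break.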

    Problem 1: Arbitrary chunking destroys context

    When you slice a document at 512 tokens, you’re not respecting the document’s actual structure. A single table might get split across three chunks. A footnote that’s critical to understanding the main text ends up in a completely different chunk. The answer you need might literally span two adjacent chunks, of which the retriever picks only one.

    Problem 2: Similarity is not the same as relevance

    This is the big one. Vector similarity finds text that sounds like your question. But documents often don’t repeat the question’s phrasing when they answer it. Ask “What is the termination clause?” and the contract might just say “Section 14.3 — Dissolution of Agreement.” Low cosine similarity. Missed entirely.
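You can see this failure with nothing more than word overlap. The toy bag-of-words vector below exaggerates the effect — real embedding models capture some paraphrase similarity — but the direction of the failure is the same: paraphrased answers score lower than verbatim matches.

```python
from collections import Counter
import math

def bow(text):
    # Toy bag-of-words vector (stand-in for a real embedding model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

question = bow("what is the termination clause")
answer_text = bow("section 14.3 dissolution of agreement")

# The contract text answers the question but shares zero words with it
print(cosine(question, answer_text))  # → 0.0
```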

    Problem 3: It’s a black box

    You get three chunks back. Why those three? You have no idea. It’s pure math. There’s no reasoning, no explanation, no audit trail. For financial documents, legal contracts, and medical records? That opacity is a serious problem.

    Problem 4: It doesn’t scale to long documents

    A 300-page technical manual with complex cross-references? The sheer number of chunks makes retrieval noisy. You end up getting chunks that are vaguely related instead of the exact section you need.

    These aren’t edge cases. These are the everyday failures that RAG engineers spend most of their time fighting. And the reason they happen is actually pretty simple — the entire architecture is borrowed from search engines, not from how humans actually read and understand documents.

    When a human expert needs to answer a question from a document, they don’t scan every sentence looking for the one that sounds most similar to the question. They open the table of contents, skim the chapter headings, navigate, and reason about where the answer should be before they even start reading.

    That’s the insight behind PageIndex.

    What is PageIndex?

    PageIndex was built by VectifyAI and open-sourced on GitHub. The core idea is deceptively simple:

    Instead of searching a document, navigate it, the way a human expert would.

    Here’s the key mental shift. Traditional RAG asks: “Which chunks look most similar to my question?”

    PageIndex asks: “Where in this document would a smart human look for the answer to this question?”

    Those are two very different questions. And the second one turns out to produce dramatically better results.

    PageIndex does this by building what it calls a Reasoning Tree. It is essentially an intelligent, AI-generated table of contents for your document.

    Here’s how to visualize it. At the top, you have a root node that represents the entire document. Below that, you have nodes for each major section or chapter. Each of those branches into subsections. Each subsection branches into specific topics or paragraphs. Every single node in this tree has two things:

    1. A title: what this section is about
    2. A summary: a concise AI-generated description of what’s in this section

    This tree is built once, when you first submit the document. It’s your index.
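A reasoning tree for an HR policy document might look roughly like the sketch below. The field names (node_id, title, summary, nodes) are illustrative assumptions, not the exact schema the PageIndex API returns.

```python
# Illustrative sketch of a reasoning tree; field names are assumptions,
# not the exact PageIndex schema.
tree = {
    "node_id": "0000",
    "title": "HR Policies",
    "summary": "Company-wide HR policies covering communication, conduct, and grievances.",
    "nodes": [
        {
            "node_id": "0001",
            "title": "Electronic Communication Policy",
            "summary": "Acceptable use of email, chat, and company devices.",
            "nodes": [],
        },
        {
            "node_id": "0002",
            "title": "Sexual Harassment Policy",
            "summary": "Definitions, reporting channels, and penalties.",
            "nodes": [],
        },
    ],
}

# Every node carries a title and a summary; during tree search only these
# travel to the LLM, never the full section text.
```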

    Now here’s where it gets clever. When you ask a question, PageIndex does two things:

    1. Tree Search (Navigation)

    It sends the question to an LLM along with the tree, but just the titles and summaries, not the full text. The LLM reads through the tree like a human reads a table of contents, and it reasons: “Okay, given this question, which branches of the tree are most likely to contain the answer?”

    The LLM returns a list of specific node IDs, and you can see its reasoning. It literally tells you why it chose those sections. Full transparency.

    Then comes step two: PageIndex fetches only the full text of those selected nodes, hands it to the LLM as context, and the LLM writes the final answer, grounded entirely in the real document text.

    Two LLM calls. No embeddings. No vector database. Just reasoning.

    And because every answer is tied to specific nodes in the tree, you always know exactly which page, which section, which part of the document the answer came from. Complete audit trail. Complete explainability.

    How it Works: Deep Dive

    Let me go deeper into the mechanics, because this is the really interesting part.

    The Tree Index – Building Phase

    When you call submit_document(), PageIndex reads your PDF or text file and does something remarkable: it doesn’t just extract text, it also understands the structure. Using a combination of layout analysis and LLM reasoning, it identifies:

    • What are the natural sections and subsections?
    • Where does one topic end and another begin?
    • How do the pieces relate to each other hierarchically?

    It then constructs the tree and generates a summary for every node. Not just a title. An actual condensed description of what’s in that section. This is what enables the smart navigation later.

    The tree uses a numeric node ID system that mirrors real document structure: 0001 might be Chapter 1, 0002 Chapter 2, 0003 the first section inside Chapter 1, and so on. The hierarchy is preserved.
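A depth-first walk makes that preserved hierarchy visible. This is a hypothetical sketch over the same kind of nested-dict tree; PageIndex ships its own utils.print_tree helper for the real structure.

```python
def walk(node, depth=0):
    """Depth-first walk yielding (node_id, indented title) for each node."""
    yield node["node_id"], "  " * depth + node["title"]
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)

# Hypothetical mini-tree in the spirit of the numeric-ID scheme above
tree = {"node_id": "0001", "title": "Chapter 1", "nodes": [
    {"node_id": "0003", "title": "Section 1.1", "nodes": []},
]}

for node_id, title in walk(tree):
    print(node_id, title)
# Prints:
# 0001 Chapter 1
# 0003   Section 1.1
```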

    Why This Beats Chunking

    Think about what chunking does to a 50-page financial report. You get maybe 300 chunks, each with zero awareness of whether it’s from the executive summary or a footnote on page 47. The embedder treats them all equally.

    The PageIndex tree, on the other hand, knows that node 0012 is the “Revenue Breakdown” subsection under the “Q3 Financial Results” section under “Annual Report 2024.” That structural awareness is enormously valuable when you’re trying to find something specific.

    The Search Phase – Reasoning, Not Math

    Here’s the other thing that makes PageIndex special. The search step is not a mathematical operation. It’s a cognitive operation performed by an LLM.

    When you ask, “What were the main risk factors disclosed in this report?”, the LLM doesn’t measure cosine distance. It reads the tree, recognizes that the “Risk Factors” section is exactly what’s needed, and selects those nodes, just like you would.

    This means PageIndex naturally handles the semantic mismatch that kills vector search. The document calls it “Risk Factors.” Your question calls it “main dangers.” A vector search might miss it. An LLM reading the tree structure will not.

    The Numbers

    PageIndex powered Mafin 2.5, VectifyAI’s financial RAG system, which achieved 98.7% accuracy on FinanceBench. For those unaware, this is a benchmark specifically designed to test AI systems on financial document questions, where the documents are long, complex, and full of tables and cross-references. That’s the hardest environment for traditional RAG. It’s where PageIndex shines most.

    What is it Best For?

    PageIndex is particularly powerful for:

    • Financial reports: earnings statements, SEC filings, 10-Ks
    • Legal contracts: where every clause matters and context is everything
    • Technical manuals: complex cross-referenced documentation
    • Policy documents: HR policies, compliance documents, regulatory filings
    • Research papers: structured academic content

    Basically: anywhere your document has real structure that chunking would destroy.

    And the really exciting thing? You can use it with any LLM. OpenAI, Anthropic, Gemini — the tree search and answer generation steps are just prompts. You’re in full control.

    Hands-on With Jupyter Notebook

    Okay. You now know the theory. You know why PageIndex exists, what it does, and how it works under the hood. Now let’s actually build something with it.

    I’m going to open a Jupyter notebook and walk you through the complete PageIndex pipeline: uploading a document, getting the reasoning tree back, navigating it with an LLM, and asking questions. Every line of code is explained. No hand-waving.

    Install PageIndex

    %pip install -q --upgrade pageindex

     First things first. We install the pageindex Python library. One line, done. No vector database to set up. No embedding model to download. This is already simpler than any traditional RAG setup.

    Imports & API Setup

    import os
    from pageindex import PageIndexClient
    import pageindex.utils as utils
    from dotenv import load_dotenv
    load_dotenv()
    PAGEINDEX_API_KEY = os.getenv("PAGEINDEX_API_KEY")
    pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

    We import the PageIndexClient. This is our connection to the PageIndex API. All the heavy lifting of building the tree happens on their end, so we don’t need a beefy machine. We also load API keys from a .env file — always keep your keys out of your code.

    OpenAI Setup

    import openai

    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

    async def call_llm(prompt, model="gpt-4.1-mini", temperature=0):
        client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
        response = await client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    Here we define our LLM helper function. We’re using GPT-4.1-mini for cost efficiency — but this works with any OpenAI model, and you could swap in Claude or Gemini with a one-line change. Temperature zero keeps the answers factual and consistent.

    Submit the Document

    pdf_path = "/Users/soumil/Desktop/PageIndex/HR Policies-1.pdf" 
    doc_id = pi_client.submit_document(pdf_path)["doc_id"] 
    print('Document Submitted:', doc_id)

    This is the magic line. We point to our PDF — in this case an HR policy document — and submit it. PageIndex takes the file, reads its structure, and starts building the reasoning tree in the background. We get back a doc_id, a unique identifier for this document that we’ll use in every subsequent call. Notice there’s no chunking code, no embedding call, no vector database connection.

    Wait for Processing & Get the Tree

    import time

    while not pi_client.is_retrieval_ready(doc_id):
        print("Still processing... retrying in 10 seconds")
        time.sleep(10)

    tree = pi_client.get_tree(doc_id, node_summary=True)['result']
    utils.print_tree(tree)

    PageIndex processes the document asynchronously — we just poll every 10 seconds until it’s ready. Then we call get_tree() with node_summary=True, which gives us the full tree structure including summaries.

    Look at this output. This is the reasoning tree. You can see the hierarchy — the top-level HR Policies node, then Electronic Communication Policy, Sexual Harassment Policy, Grievance Redressal Policy, each branching into its subsections. Every node has an ID, a title, and a summary of what’s in it.

    This is what traditional RAG throws away. The structure. The relationships. The hierarchy. PageIndex keeps all of it.

    Tree Search with the LLM

    import json

    query = "What are the key HR policies and employee guidelines?"
    tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])
    search_prompt = f"""
    You are given a question and a tree structure of a document...
    Question: {query}
    Document tree structure: {json.dumps(tree_without_text, indent=2)}
    Reply in JSON: {{ "thinking": "...", "node_list": [...] }}
    """
    tree_search_result = await call_llm(search_prompt)

    Now we search. For this, we build a prompt that includes the question and the entire tree — but crucially, without the full text content of each node. Just the titles and summaries. This keeps the prompt manageable while giving the LLM everything it needs to navigate.

    The LLM is instructed to return a JSON object with two things: its thinking process and the list of relevant node IDs.

    Look at the output. The LLM tells us exactly why it chose each section. It reasoned through the tree like a human would. And it gave us a list of 30 node IDs — every section of this HR document, because the question is broad.

    This transparency is something you simply can’t get with cosine similarity.

    Fetch Text and Generate Answer

    import json

    # Parse the JSON reply from the tree-search step
    tree_search_result_json = json.loads(tree_search_result)
    node_list = tree_search_result_json["node_list"]

    # node_map maps each node_id to its full node (including text)
    relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)

    answer_prompt = f"""Answer the question based on the context:
    Question: {query}
    Context: {relevant_content}"""
    answer = await call_llm(answer_prompt)
    utils.print_wrapped(answer)

    Step two. Now that we know which nodes are relevant, we fetch their full text — only those nodes, nothing else. We join the text and build a clean context prompt. One more LLM call, and we get our answer.

    Look at this answer. Detailed, structured, accurate. And every single claim can be traced back to a specific node in the tree, which maps to a specific page in the PDF. Full audit trail. Full explainability.

    The ask() Function

    async def ask(query): 
        # Full pipeline: tree search → text retrieval → answer generation 
        ... 
     
    user_query = input("Enter your query: ") 
    await ask(user_query)

    Now we package the entire pipeline into a single ask() function. Submit a question, get an answer — the tree search, retrieval, and generation all happen under the hood. Let me show you a couple of live examples.
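A minimal version of that ask() function is sketched below, written so the LLM caller and the node lookup are passed in as arguments. The prompt wording and parameter names are assumptions based on the steps above, not the notebook's exact code.

```python
import json

async def ask(query, tree_without_text, node_map, call_llm):
    """Two-call pipeline: tree search first, then grounded answer generation."""
    # Call 1: let the LLM navigate the tree (titles + summaries only)
    search_prompt = (
        "You are given a question and a tree structure of a document.\n"
        f"Question: {query}\n"
        f"Document tree structure: {json.dumps(tree_without_text)}\n"
        'Reply in JSON: {"thinking": "...", "node_list": [...]}'
    )
    selected = json.loads(await call_llm(search_prompt))

    # Call 2: fetch the full text of only the selected nodes and answer
    context = "\n\n".join(node_map[n]["text"] for n in selected["node_list"])
    answer_prompt = (
        f"Answer the question based on the context:\nQuestion: {query}\nContext: {context}"
    )
    return await call_llm(answer_prompt)
```

Injecting call_llm this way also makes the pipeline trivial to unit-test with a stubbed LLM, and to swap between OpenAI, Claude, or Gemini without touching the navigation logic.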

    Type a question: e.g., “What are the penalties for sexual harassment?”

    Watch what happens. It searches the tree, identifies the Sexual Harassment Policy nodes specifically, pulls their text, and gives us a precise, cited answer in seconds. This is the experience you want to deliver to your users.

    Another one. Again, it finds exactly the right section. No confusion, no noise, no hallucination. Just the answer, from the document, with a clear trail showing where it came from.

    Conclusion

    Let’s bring this together. Traditional RAG finds text that looks similar to a question. But the real goal is to find the right answer in a structured document. PageIndex solves this better: it builds a reasoning tree and lets the model navigate it intelligently. The result is accurate, explainable answers, with up to 98.7% accuracy on FinanceBench. It is not perfect for every use case; vector search still works well for large-scale semantic search. But for long, structured documents, PageIndex is the stronger approach. All of the code is in the walkthrough above. Add your API keys and get started.



    I am a Data Science Trainee at Analytics Vidhya, passionately working on the development of advanced AI solutions such as Generative AI applications, Large Language Models, and cutting-edge AI tools that push the boundaries of technology. My role also involves creating engaging educational content for Analytics Vidhya’s YouTube channels, developing comprehensive courses that cover the full spectrum of machine learning to generative AI, and authoring technical blogs that connect foundational concepts with the latest innovations in AI. Through this, I aim to contribute to building intelligent systems and share knowledge that inspires and empowers the AI community.
