Do you build GenAI systems and want to deploy them, or do you just want to learn more about FastAPI? Either way, this is exactly what you were looking for! Imagine you have lots of PDF reports and want to find specific answers in them. You could spend hours scrolling, or you could build a system that reads them for you and answers your questions. We are building a RAG system that will be deployed and accessed through an API using FastAPI. So without further ado, let’s dive in.
What is FastAPI?
FastAPI is a Python framework for building APIs. It lets us use HTTP methods to communicate with the server.
One of its most useful features is that it auto-generates documentation for the APIs you create. After writing your code and creating the endpoints, you can visit a URL and use the interface (Swagger UI) to test your endpoints without writing any frontend code.
Understanding REST APIs
A REST API (short for Representational State Transfer API) is an interface that enables communication between a client and a server. The client sends HTTP requests to a specific API endpoint, and the server processes those requests. There are quite a few HTTP methods, a few of which we will implement in our project using FastAPI.
HTTP Methods:
In our project, we will use two methods to communicate:
- GET: Used to retrieve information. We will use a GET request to /health to check if the server is running.
- POST: Used to send data to the server to create or process something. We will use POST requests for /ingest and /query because they involve sending complex data like files or JSON objects. More about this in the implementation section.
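To make the difference concrete, here is a small sketch using Python’s standard urllib to construct (not send) both kinds of requests. The URLs assume the local address our server will run on later in this article:

```python
import json
import urllib.request

# A GET request has no body -- the URL alone identifies what to fetch.
get_req = urllib.request.Request("http://127.0.0.1:8000/health", method="GET")

# A POST request carries a payload in its body, here a JSON question.
payload = json.dumps({"question": "What is RAG?", "top_k": 4}).encode()
post_req = urllib.request.Request(
    "http://127.0.0.1:8000/query",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Sending `urllib.request.urlopen(post_req)` would actually hit the server once it is running; for now the point is simply that the POST request bundles structured data in its body while the GET request does not.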
What is RAG?
Retrieval-Augmented Generation (RAG) is one way to give an LLM access to specific knowledge it wasn’t originally trained on.
RAG components:
- Retrieval: Finding relevant sentences from the document(s) based on the query.
- Generation: Passing those sentences to an LLM so it can summarize them into an answer.
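The two components can be sketched with a toy example in plain Python. This is only an illustration: the real pipeline we build below uses embeddings and a vector database for retrieval and an LLM for generation, not word-overlap scoring and string stitching.

```python
# Tiny "knowledge base" of sentences standing in for document chunks.
sentences = [
    "FAISS is a library for efficient similarity search.",
    "FastAPI auto-generates Swagger documentation.",
    "RAG combines retrieval with LLM generation.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive retrieval: rank sentences by shared words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        sentences,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call: stitches the context into an answer."""
    return f"Based on: {' '.join(context)}"

context = retrieve("What is similarity search?")
print(generate("What is similarity search?", context))
```

Swap the overlap scoring for vector similarity and the string template for a gpt-4.1-mini call, and you have the architecture of the system we are about to build.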
Let’s understand more about RAG in the upcoming implementation section.
Implementation
Problem Statement: Create a system that lets users upload documents (.txt files or PDFs), indexes them into a searchable database, and enables an LLM to answer questions about the new data. The system will be deployed and used through API endpoints that we will create with FastAPI.
Pre-Requisites
– We will require an OpenAI API key, and we will use the gpt-4.1-mini model as the brain of the system. You can get your API key here: (https://platform.openai.com/settings/organization/api-keys)
– An IDE for executing the Python scripts; I’ll be using VSCode for the demo. Create a new project (folder).
– Create a .env file in your project and add your OpenAI key exactly like this:
OPENAI_API_KEY=sk-proj...
– Create a Virtual Environment for This Project (To isolate the project’s dependencies).
Note:
- Ensure that the virtual environment (fast_env) is created inside your project, as path errors may occur if the working directory is not set to the project directory.
- Once activated, any packages you install will be contained within this environment.
– Download a blog post as a PDF to use in our RAG system (we will use it later as demo.pdf):
Requirements
To solve this, we need a stack that handles heavy lifting efficiently:
- FastAPI: To handle the web requests and file uploads.
- LangChain: To extend the capabilities of the LLM.
- FAISS (Facebook AI Similarity Search): Enables fast similarity search over text chunks. We will use it as our vector database.
- Uvicorn: To host the server.
You can create a requirements.txt in your project and run ‘pip install -r requirements.txt’:
fastapi==0.129.0
uvicorn[standard]==0.41.0
python-multipart==0.0.22
langchain==1.2.10
langchain-community==0.4.1
langchain-openai==1.1.10
langchain-core==1.2.13
faiss-cpu==1.13.2
openai==2.21.0
pypdf==6.7.1
python-dotenv==1.2.1
Implementation Approach
We will implement two FastAPI endpoints:
1. The Ingestion Pipeline (/ingest)
When a user uploads a file, we use the RecursiveCharacterTextSplitter from LangChain. This splitter breaks long documents into smaller chunks (we will configure each chunk to be 500 characters, with a 50-character overlap).
These chunks are then converted into embeddings and stored in our FAISS index (vector database). We will persist the FAISS index to local storage so that the uploaded documents aren’t lost even if the server restarts.
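The size/overlap idea behind chunking can be sketched in a few lines of plain Python. The RecursiveCharacterTextSplitter we use later is smarter (it prefers splitting on paragraph and sentence boundaries before falling back to raw character counts), but the mechanics are the same:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Simplified fixed-size chunking: each chunk repeats the last
    `overlap` characters of the previous one so that no sentence is
    cut off without context on at least one side."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance, stepping back by the overlap
    return chunks

doc = "x" * 1200          # stand-in for a long document
chunks = chunk_text(doc)
print(len(chunks))        # 3
print(len(chunks[0]))     # 500
```

The overlap means neighbouring chunks share a 50-character window, which helps the retriever find a passage even when the relevant sentence straddles a chunk boundary.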
2. The Query Pipeline (/query)
When you ask a question, the question is converted into a vector. We then use FAISS to retrieve the top k (4 by default) chunks of text that are most similar to the question.
Finally, we use LCEL (LangChain Expression Language) to implement the Generation component of RAG. We send the question and those chunks to gpt-4.1-mini along with our prompt to get the answer.
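LCEL’s `|` operator chains runnables so that the output of one step becomes the input of the next. This toy sketch mimics the idea with plain Python classes; it is not LangChain’s actual Runnable implementation, just an illustration of pipe-style composition:

```python
class Step:
    """Minimal stand-in for an LCEL runnable: wraps a function and
    supports composition with the `|` operator."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # left | right => a new Step that runs left first, then right
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

retrieve = Step(lambda q: f"context for: {q}")
generate = Step(lambda ctx: f"answer based on ({ctx})")

chain = retrieve | generate
print(chain.invoke("What is RAG?"))
```

In the real pipeline below, `prompt | llm | StrOutputParser()` works the same way: each stage transforms the output of the previous one, and `invoke()` runs the whole chain.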
Python Code
In the same project folder, create two scripts, rag_pipeline.py and main.py:
rag_pipeline.py:
Imports
import os
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from dotenv import load_dotenv
from typing import List
Configuration
# Loading OpenAI API key
load_dotenv()
# Config
FAISS_INDEX_PATH = "faiss_index"
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4.1-mini"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
Note: Ensure you have added the API key in the .env file
Initializations and Defining the Functions
# Shared state
_vectorstore: FAISS | None = None
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
def _load_vectorstore() -> FAISS | None:
    """Load existing FAISS index from disk if it exists."""
    global _vectorstore
    if _vectorstore is None and os.path.exists(FAISS_INDEX_PATH):
        _vectorstore = FAISS.load_local(
            FAISS_INDEX_PATH,
            embeddings,
            allow_dangerous_deserialization=True
        )
    return _vectorstore

def ingest_document(file_path: str, filename: str = "") -> int:
    """
    Chunks, embeds, stores in FAISS, and returns the number of chunks stored.
    """
    global _vectorstore
    # 1. Load
    if file_path.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    documents = loader.load()
    # Overwriting source with the filename
    display_name = filename or os.path.basename(file_path)
    for doc in documents:
        doc.metadata["source"] = display_name
    # 2. Chunk
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    # 3. Embed and store
    if _vectorstore is None:
        _load_vectorstore()
    if _vectorstore is None:
        _vectorstore = FAISS.from_documents(chunks, embeddings)
    else:
        _vectorstore.add_documents(chunks)
    # 4. Persist to disk
    _vectorstore.save_local(FAISS_INDEX_PATH)
    return len(chunks)

def _format_docs(docs: List[Document]) -> str:
    """Concatenate document page_content to add to the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)
These functions chunk the documents, convert the chunks into embeddings (using the embedding model text-embedding-3-small), and store them in the FAISS index (vector store).
Defining the Retriever and Generator
def query_rag(question: str, top_k: int = 4) -> dict:
    """
    Returns answer text and source references.
    """
    vs = _load_vectorstore()
    if vs is None:
        return {
            "answer": "No documents have been ingested yet. Please upload a document first.",
            "sources": [],
        }
    # Retriever
    retriever = vs.as_retriever(
        search_type="similarity",
        search_kwargs={"k": top_k}
    )
    # Prompt
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are a helpful assistant. Use only the context below to answer the question.
If the answer is not in the context, say "I don't know based on the provided documents."

Context:
{context}

Question: {question}

Answer:"""
    )
    llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
    # LCEL chain
    # Step 1: retrieve documents, format the context, and pass the question through
    retrieve = RunnableParallel(
        {
            "source_documents": retriever,
            "context": retriever | _format_docs,
            "question": RunnablePassthrough(),
        }
    )
    # Step 2: prompt -> LLM -> string output
    answer_chain = prompt | llm | StrOutputParser()
    # Invoke
    retrieved = retrieve.invoke(question)
    answer = answer_chain.invoke(retrieved)
    # Extracting sources
    sources = list({
        doc.metadata.get("source", "unknown")
        for doc in retrieved["source_documents"]
    })
    return {
        "answer": answer,
        "sources": sources,
    }
We have implemented our RAG pipeline, which retrieves 4 chunks (by default) using similarity search and passes the question, context, and prompt to the generator (gpt-4.1-mini).
First, the relevant documents are fetched using the query; then the answer_chain is invoked, which returns the answer as a string via StrOutputParser().
Note: The top_k and question will be passed as arguments to the function.
main.py
Imports
import os
import tempfile
from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
from rag_pipeline import ingest_document, query_rag
We have imported the ingest_document and query_rag functions, which will be used by the API Endpoints we will define.
Configuration
app = FastAPI(
    title="RAG API",
    description="Upload documents and query them using RAG",
    version="1.0.0"
)

ALLOWED_EXTENSIONS = {
    "application/pdf": ".pdf",
    "text/plain": ".txt",
}

class QueryRequest(BaseModel):
    question: str
    top_k: int = 4

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
We use Pydantic to strictly define the structure of inputs to the API.
Note: Validators can be added here as well to perform certain checks (for example, verifying that a phone number is exactly 10 digits).
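As a sketch of the validator idea (not part of our project’s code), here is how a Pydantic v2 field_validator could enforce a sanity check on top_k before the request ever reaches the RAG pipeline:

```python
from pydantic import BaseModel, ValidationError, field_validator

class QueryRequest(BaseModel):
    question: str
    top_k: int = 4

    @field_validator("top_k")
    @classmethod
    def top_k_must_be_positive(cls, v: int) -> int:
        # Reject non-positive values before they reach the retrieval step
        if v < 1:
            raise ValueError("top_k must be at least 1")
        return v

try:
    QueryRequest(question="What is RAG?", top_k=0)
except ValidationError:
    print("validation failed as expected")
```

When such a model is used as a FastAPI request body, a failing validator surfaces to the client as a 422 response automatically.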
/health API
@app.get("/health", tags=["Health"])
def health():
    """Check if the API is running."""
    return {"status": "ok"}
This endpoint is useful for confirming that the server is running.
Note: We wrap the API functions with a decorator; here we use @app because we initialized the FastAPI instance with this variable earlier. The decorator is followed by the HTTP method, get in this case, and we pass the path for the endpoint, which is “/health” here.
/ingest API (To take the document from the user)
@app.post("/ingest", tags=["Ingestion"], summary="Upload and index a document")
async def ingest(file: UploadFile = File(...)):
    """
    Upload a **.txt** or **.pdf** file.
    """
    if file.content_type not in ALLOWED_EXTENSIONS:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported file type '{file.content_type}'. Only .txt and .pdf are supported."
        )
    suffix = ALLOWED_EXTENSIONS[file.content_type]
    contents = await file.read()
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(contents)
        tmp_path = tmp.name
    try:
        num_chunks = ingest_document(tmp_path, filename=file.filename)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        os.unlink(tmp_path)
    return {
        "message": f"Successfully ingested '{file.filename}'",
        "chunks_indexed": num_chunks
    }
This function ensures only .txt or .pdf files are accepted, and then calls the ingest_document() function defined in the rag_pipeline.py script.
/query API (To run the RAG pipeline)
@app.post("/query", response_model=QueryResponse, tags=["Query"], summary="Ask a question about your documents")
def query(request: QueryRequest):
    """
    Ask a question related to the provided document.
    The pipeline will return the answer and the source file names used to generate it.
    """
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty.")
    try:
        result = query_rag(request.question, request.top_k)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return QueryResponse(answer=result["answer"], sources=result["sources"])
Finally, we defined the endpoint that calls the query_rag() function and returns the answer, grounded in the documents, to the user. Let’s quickly test it.
Running the App
– Run the below command on your command prompt or terminal:
uvicorn main:app --reload
Note: Ensure your environment is activated and all the dependencies are installed; otherwise, you may see import errors.
– Now the app should be up and running here: http://127.0.0.1:8000
– Open Swagger UI (Interface) using the URL below:
http://127.0.0.1:8000/docs
Great! We can test our APIs using the interface just by passing the arguments to the APIs.
Testing Both the APIs
1. /ingest API:
Click on ‘Try it out’, upload demo.pdf (you can replace it with any other PDF as well), and click Execute.
Great! The API processed our request and created the vector store using the PDF. You can verify the same by looking at your project folder, where you can see the new faiss_index folder.
2. /query API:
Now, click on ‘Try it out’ and pass the arguments below (feel free to use different prompts and PDFs):
{
  "question": "Name 3 applications of Machine Learning",
  "top_k": 4
}
As expected, the response is closely related to the content of the PDF. You can go ahead and play with the top_k parameter and also test it with different questions.
Understanding HTTP Status Codes
HTTP status codes inform the client whether a request was successful or if something went wrong.
Status Code Categories:
Success (2xx)
The request was successfully received and processed.
In our project:
- /health returns 200 OK when the server is running.
- /ingest and /query return 200 OK when successful.
Client Errors (4xx)
The error is caused by something the client sent.
In our project:
- If you upload an unexpected file type (not a PDF or .txt file), the API returns status code 400.
- If the question is empty in /query, the API returns status code 400.
- FastAPI returns status code 422 if the request body does not match the expected Pydantic model that we defined.
Server Errors (5xx)
These indicate something went wrong on the server side.
In our project:
- If the ingestion or querying code fails due to a FAISS or OpenAI error, the API returns status code 500.
Conclusion
We successfully learned to build and deploy a RAG system using FastAPI. We created an API that ingests PDF and .txt files, retrieves relevant information, and generates relevant answers. Deployment makes GenAI systems, and traditional ML systems, easy to access in real-world applications. We can further improve our RAG pipeline by optimizing the chunking strategy and combining different retrieval methods for our queries.
Frequently Asked Questions
Q1. What does the --reload flag do?
A. It makes the FastAPI server auto-restart whenever the code changes, reflecting updates without manually restarting the server.
Q2. Why do we use POST instead of GET for queries?
A. We use POST because queries include structured data like JSON objects, which can be large and complex, unlike GET requests, which are used for simple retrievals.
Q3. What is MMR retrieval?
A. MMR (Maximal Marginal Relevance) balances relevance and diversity when selecting document chunks, ensuring retrieved results are useful without being redundant.
Q4. What happens if we increase top_k?
A. Increasing top_k retrieves more chunks for the LLM, which can introduce noise into generated answers due to the presence of irrelevant content.
