    Building a RAG API with FastAPI

    Business & Startups

    By gvfx00@gmail.com · March 2, 2026 · 12 Mins Read


    Do you build GenAI systems and want to deploy them, or do you simply want to learn more about FastAPI? Then this is exactly what you were looking for. Imagine you have lots of PDF reports and want to find specific answers in them: you could spend hours scrolling, or you could build a system that reads them for you and answers your questions. In this article, we build a RAG system that is deployed and accessed through an API using FastAPI. So without any further ado, let’s dive in.

    Table of Contents

    • What is FastAPI?
    • Understanding REST APIs
    • What is RAG?
    • Implementation
      • Pre-Requisites
      • Requirements
      • Implementation Approach
        • 1. The Ingestion Pipeline (/ingest)
        • 2. The Query Pipeline (/query)
    • Python Code
      • rag_pipeline.py
        • Imports
        • Configuration
        • Initializations and Defining the Functions
        • Defining the Retriever and Generator
      • main.py
        • Imports
        • Configuration
        • /health API
        • /ingest API (To take the document from the user)
        • /query API (To run the RAG pipeline)
        • Running the App
        • Testing Both the APIs
          • 1. /ingest API
          • 2. /query API
    • Understanding HTTP Status Codes
      • Status Code Categories
    • Conclusion
    • Frequently Asked Questions

    What is FastAPI?

    FastAPI is a modern Python framework for building APIs. It lets clients communicate with the server using standard HTTP methods.

    One of its most useful features is that it auto-generates documentation for the APIs you create. After writing your code and creating the APIs, you can visit a URL and use the interface (Swagger UI) to test your endpoints without writing any frontend code.

    Understanding REST APIs

    A REST API (short for Representational State Transfer API) is an interface that enables communication between a client and a server. The client sends HTTP requests to a specific API endpoint, and the server processes those requests. There are quite a few HTTP methods; we will implement two of them in our project using FastAPI.

    HTTP Methods:

    In our project, we will use two methods to communicate:

    • GET: Used to retrieve information. We will use a GET request on /health to check whether the server is running.
    • POST: Used to send data to the server to create or process something. We will use POST requests for /ingest and /query because they involve sending complex data such as files or JSON objects. More about this in the implementation section.

    What is RAG?

    Retrieval-Augmented Generation (RAG) is one way to give an LLM access to specific knowledge it wasn’t originally trained on.

    RAG components:

    • Retrieval: Finding relevant sentences from the document(s) based on the query.
    • Generation: Passing those sentences to an LLM so it can summarize them into an answer.

    Let’s understand more about RAG in the upcoming implementation section.
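    To make the two components concrete, here is a toy, dependency-free sketch where word overlap stands in for real embedding similarity and the LLM call is omitted:

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy 'retrieval': rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)[:k]

chunks = [
    "fastapi serves web apis",
    "faiss stores embedding vectors",
    "cats sleep a lot",
]
context = retrieve("which tool serves web apis", chunks)
# In real RAG, `context` plus the question would now be sent to the LLM.
```

    A real system replaces the overlap score with vector similarity over embeddings, which is exactly what FAISS will do for us below.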

    Implementation

    Problem Statement: Build a system that lets users upload documents (.txt files or PDFs), indexes them into a searchable database, and enables an LLM to answer questions about the new data. The system will be deployed and used through API endpoints that we create with FastAPI.

    Pre-Requisites

    – We will require an OpenAI API key, and we will use the gpt-4.1-mini model as the brain of the system. You can get your API key from this link: (https://platform.openai.com/settings/organization/api-keys)

    – An IDE for running the Python scripts; I’ll be using VSCode for the demo. Create a new project (folder).

    – Create a .env file in your project and add your OpenAI key exactly like:

    OPENAI_API_KEY=sk-proj... 

    – Create a Virtual Environment for This Project (To isolate the project’s dependencies).
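    For example, using fast_env as the environment name (the activation command differs on Windows):

```shell
# run inside the project folder
python -m venv fast_env

# activate it (Linux/macOS)
source fast_env/bin/activate
# on Windows: fast_env\Scripts\activate
```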

    Note:

    • Ensure that the virtual environment (fast_env) is created in your project folder, as path errors may occur if the working directory is not set to the project directory.
    • Once activated, any packages you install will be contained within this environment.

    – Download a blog post as a PDF (using its ‘download icon’) to use in our RAG system; we will refer to it as demo.pdf in the testing section below.

    Requirements

    To solve this, we need a stack that handles heavy lifting efficiently:

    • FastAPI: To handle the web requests and file uploads.
    • LangChain: To extend the capabilities of the LLM.
    • FAISS (Facebook AI Similarity Search): Helps search through text chunks. We will use it as a vector database.
    • Uvicorn: An ASGI server to host the app.

    You can create a requirements.txt in your project and run ‘pip install -r requirements.txt’:

    fastapi==0.129.0
    uvicorn[standard]==0.41.0
    python-multipart==0.0.22
    langchain==1.2.10
    langchain-community==0.4.1
    langchain-openai==1.1.10
    langchain-core==1.2.13
    faiss-cpu==1.13.2
    openai==2.21.0
    pypdf==6.7.1
    python-dotenv==1.2.1

    Implementation Approach

    We will implement two FastAPI endpoints:

    1. The Ingestion Pipeline (/ingest)

    When a user uploads a file, we make use of the RecursiveCharacterTextSplitter from LangChain. This function breaks long documents into smaller chunks (we will configure each chunk to be at most 500 characters).

    These chunks are then converted into embeddings and stored in our FAISS index (vector database). We will use the local storage for FAISS so that even if the server restarts, the uploaded documents aren’t lost.
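    The sliding-window idea behind chunking can be sketched in plain Python (a simplified character-window version; the real RecursiveCharacterTextSplitter additionally prefers to split on separators such as paragraphs and sentences):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-window chunking with overlap between consecutive chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # each window starts 450 chars after the last
    return chunks

parts = chunk_text("x" * 1000)
# windows start at 0, 450, 900 -> 3 chunks of lengths 500, 500, 100
```

    The 50-character overlap keeps a sentence that straddles a chunk boundary visible in both neighboring chunks.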

    2. The Query Pipeline (/query)

    When you ask a question, it is first converted into an embedding vector. We then use FAISS to retrieve the top k (usually 4) chunks of text that are most similar to the question.

    Finally, we use LCEL (LangChain Expression Language) to implement the Generation component of the RAG. We send the question and those 4 chunks to gpt-4.1-mini along with our prompt to get the answer.
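    The `|` piping that LCEL provides can be understood through a plain-Python analogy (this is not the LangChain API, only the composition idea):

```python
class Runnable:
    """Minimal stand-in for an LCEL runnable: `a | b` feeds a's output into b."""
    def __init__(self, fn):
        self.fn = fn
    def __or__(self, other):
        return Runnable(lambda x: other.fn(self.fn(x)))
    def invoke(self, x):
        return self.fn(x)

format_prompt = Runnable(lambda q: f"Context: ...\nQuestion: {q}\nAnswer:")
fake_llm = Runnable(lambda p: "LLM saw -> " + p)   # stand-in for gpt-4.1-mini
chain = format_prompt | fake_llm
result = chain.invoke("What is RAG?")
```

    In the real pipeline, prompt | llm | StrOutputParser() chains LangChain objects in exactly this fashion.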

    Python Code

    In the same project folder, create two scripts, rag_pipeline.py and main.py:

    rag_pipeline.py:

    Imports

    import os
    from langchain_community.document_loaders import TextLoader, PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import FAISS
    from langchain_core.runnables import RunnablePassthrough, RunnableParallel
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import PromptTemplate
    from langchain_core.documents import Document
    from dotenv import load_dotenv
    from typing import List 

    Configuration

    # Loading OpenAI API key
    load_dotenv()
    #  Config
    FAISS_INDEX_PATH = "faiss_index"
    EMBEDDING_MODEL  = "text-embedding-3-small"
    LLM_MODEL        = "gpt-4.1-mini"
    CHUNK_SIZE       = 500
    CHUNK_OVERLAP    = 50

    Note: Ensure you have added the API key in the .env file

    Initializations and Defining the Functions

    #  Shared state
    _vectorstore: FAISS | None = None
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    def _load_vectorstore() -> FAISS | None:
        """Load existing FAISS index from disk if it exists."""
        global _vectorstore
        if _vectorstore is None and os.path.exists(FAISS_INDEX_PATH):
            _vectorstore = FAISS.load_local(
                FAISS_INDEX_PATH,
                embeddings,
                allow_dangerous_deserialization=True
            )
        return _vectorstore
    def ingest_document(file_path: str, filename: str = "") -> int:
        """
        Chunks, Embeds, Stores in FAISS and returns the number of chunks stored.
        """ 
        global _vectorstore
        # 1. Load
        if file_path.endswith(".pdf"):
            loader = PyPDFLoader(file_path)
        else:
            loader = TextLoader(file_path)
        documents = loader.load()
        # Overwriting source with the filename
        display_name = filename or os.path.basename(file_path)
        for doc in documents:
            doc.metadata["source"] = display_name
        # 2. Chunk
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=CHUNK_SIZE,
            chunk_overlap=CHUNK_OVERLAP,
            separators=["\n\n", "\n", ".", " ", ""]
        )
        chunks = splitter.split_documents(documents)
        # 3. Embed and Store
        if _vectorstore is None:
            _load_vectorstore()
        if _vectorstore is None:
            _vectorstore = FAISS.from_documents(chunks, embeddings)
        else:
            _vectorstore.add_documents(chunks)
        # 4. Persist to disk
        _vectorstore.save_local(FAISS_INDEX_PATH)
        return len(chunks)
    def _format_docs(docs: List[Document]) -> str:
        """Concatenate document page_content to add to the prompt."""
        return "\n\n".join(doc.page_content for doc in docs)

    These functions load the documents, split the text into chunks, convert the chunks into embeddings (using the embedding model text-embedding-3-small), and store them in the FAISS index (vector store).

    Defining the Retriever and Generator

    def query_rag(question: str, top_k: int = 4) -> dict:
        """
        Returns answer text and source references.
        """
        vs = _load_vectorstore()
        if vs is None:
            return {
                "answer": "No documents have been ingested yet. Please upload one via /ingest first.",
                "sources": [],
            }
        #  Retriever
        retriever = vs.as_retriever(
            search_type="similarity",
            search_kwargs={"k": top_k}
        )
        #  Prompt
        prompt = PromptTemplate(
            input_variables=["context", "question"],
            template="""You are a helpful assistant. Use only the context below to answer the question.
    If the answer is not in the context, say "I don't know based on the provided documents."
    Context:
    {context}
    Question: {question}
    Answer:"""
        )
        llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
        #  LCEL chain
        # Step 1: fetch documents and assemble the prompt inputs in parallel
        retrieve = RunnableParallel(
            {
                "source_documents": retriever,
                "context":          retriever | _format_docs,
                "question":         RunnablePassthrough(),
            }
        )
        # Step 2: generate the answer
        answer_chain = prompt | llm | StrOutputParser()
        #  Invoke
        retrieved = retrieve.invoke(question)
        answer    = answer_chain.invoke(retrieved)
        #  Extracting sources
        sources = list({
            doc.metadata.get("source", "unknown")
            for doc in retrieved["source_documents"]
        })
        return {
            "answer":  answer,
            "sources": sources,
        }

    We have implemented our RAG, which retrieves 4 documents using similarity search and passes the question, context, and prompt to the Generator (gpt-4.1-mini).

    First, the relevant documents are fetched using the query; then the answer_chain is invoked, which produces the answer as a string via StrOutputParser().

    Note: The top-k and question will be passed as arguments to the function.

    main.py

    Imports

    import os
    import tempfile 
    from fastapi import FastAPI, UploadFile, File, HTTPException
    from pydantic import BaseModel
    from rag_pipeline import ingest_document, query_rag

    We have imported the ingest_document and query_rag functions, which will be used by the API Endpoints we will define.

    Configuration

    app = FastAPI(
        title="RAG API",
        description="Upload documents and query them using RAG",
        version="1.0.0"
    ) 
    ALLOWED_EXTENSIONS = {
        "application/pdf": ".pdf",
        "text/plain": ".txt",
    }
    class QueryRequest(BaseModel):
        question: str 
        top_k: int = 4 
    class QueryResponse(BaseModel):
        answer: str 
        sources: list[str]

    We use Pydantic to strictly define the structure of the API’s inputs and outputs.

    Note: Validators can be added here as well to perform certain checks (for example, to check that a phone number is exactly 10 digits).
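    For instance, the phone-number check mentioned above could look like this (a hypothetical model using Pydantic v2’s field_validator, not part of this project’s code):

```python
from pydantic import BaseModel, field_validator

class ContactRequest(BaseModel):   # hypothetical model, for illustration only
    phone: str

    @field_validator("phone")
    @classmethod
    def must_be_ten_digits(cls, v: str) -> str:
        # keep only the digits, then enforce the length
        digits = "".join(ch for ch in v if ch.isdigit())
        if len(digits) != 10:
            raise ValueError("phone number must contain exactly 10 digits")
        return digits
```

    When validation like this fails, FastAPI automatically answers with a 422 response.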

    /health API

    @app.get("/health", tags=["Health"])
    def health():
        """Check if the API is running."""
        return {"status": "ok"}

    This API is useful to confirm if the server is running.

    Note: We wrap the API functions with a decorator; we use @app because we initialized FastAPI with that variable earlier. The decorator also names the HTTP method, here get(), and takes the path for the endpoint, which is “/health”.

     /ingest API (To take the document from the user)

    @app.post("/ingest", tags=["Ingestion"], summary="Upload and index a document")
    async def ingest(file: UploadFile = File(...)):
        """
        Upload a **.txt** or **.pdf** file.
        """
        if file.content_type not in ALLOWED_EXTENSIONS:
            raise HTTPException(
                status_code=400,
                detail=f"Unsupported file type '{file.content_type}'. Only .txt and .pdf are supported."
            )
        suffix = ALLOWED_EXTENSIONS[file.content_type]
        contents = await file.read()
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(contents)
            tmp_path = tmp.name
        try:
            num_chunks = ingest_document(tmp_path, filename=file.filename)
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))
        finally:
            os.unlink(tmp_path) 
        return {
            "message": f"Successfully ingested '{file.filename}'",
            "chunks_indexed": num_chunks
        }

    This function ensures that only .txt or .pdf files are accepted and then calls the ingest_document() function defined in the rag_pipeline.py script.

    /query API (To run the RAG pipeline)

    @app.post("/query", response_model=QueryResponse, tags=["Query"], summary="Ask a question about your documents")
    def query(request: QueryRequest):
        """
        Ask a question related to the provided document.
        The pipeline will return the answer and the source file names used to generate it.
        """
        if not request.question.strip():
            raise HTTPException(status_code=400, detail="Question cannot be empty.")
        try:
            result = query_rag(request.question, request.top_k)
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))
        return QueryResponse(answer=result["answer"], sources=result["sources"])

    Finally, we defined the API that calls the query_rag() function and returns the response according to the documents to the user. Let’s quickly test it.

    Running the App

    – Run the below command on your command prompt or terminal:

    uvicorn main:app --reload

    Note: Ensure your environment is activated and all the dependencies are installed; otherwise you may see errors related to missing packages.

    – Now the app should be up and running here: http://127.0.0.1:8000

    – Open Swagger UI (Interface) using the URL below:
    http://127.0.0.1:8000/docs

    Great! We can test our APIs using the interface just by passing the arguments to the APIs.
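    If you prefer the command line, the same endpoints can be exercised with curl (assuming the server is running locally on port 8000 and a demo.pdf sits in your current directory):

```shell
# health check
curl http://127.0.0.1:8000/health

# upload a document
curl -X POST http://127.0.0.1:8000/ingest \
  -F "file=@demo.pdf;type=application/pdf"

# ask a question
curl -X POST http://127.0.0.1:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Name 3 applications of Machine Learning", "top_k": 4}'
```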

    Testing Both the APIs

    1. /ingest API:

    Click on ‘Try it out’, pass the demo.pdf (you can replace it with any other PDF as well), and click Execute.

    Great! The API processed our request and created the vector store using the PDF. You can verify the same by looking at your project folder, where you can see the new faiss_index folder.

    2. /query API:

    Now, click on Try it Out and pass the arguments below (Feel free to use different prompts and PDFs).

    {
      "question": "Name 3 applications of Machine Learning",
      "top_k": 4
    }
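    The response follows the QueryResponse schema defined in main.py; its shape (with placeholder values, since the actual answer depends on your PDF) is:

```json
{
  "answer": "<model-generated answer based on the document>",
  "sources": ["demo.pdf"]
}
```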

    As expected, the response looks very related to the content in the PDF. You can go ahead and play with the top-k parameter and also test it out with different questions.

    Understanding HTTP Status Codes

    HTTP status codes inform the client whether a request was successful or if something went wrong.

    Status Code Categories:

    2xx – Success

    The request was successfully received and processed.

    In our project:

    • /health returns 200 OK when the server is running.
    • /ingest and /query return 200 OK when successful.

    4xx – Client Errors

    The error is caused by something the client sent.

    In our project:

    • If you upload an unexpected file type (not a PDF or txt file), the API returns status code 400.
    • If the question is empty in /query, the API returns the status code 400.
    • FastAPI returns status code 422 if the request body does not match the expected Pydantic model that we defined.

    5xx – Server Errors

    These indicate that something went wrong on the server side.

    In our project:

    • If the ingestion or querying code fails due to a FAISS or OpenAI error, the API returns status code 500.


    Conclusion

    We successfully learnt to build and deploy a RAG system using FastAPI. We created an API that ingests PDF/.txt files, retrieves relevant information, and generates relevant answers. The deployment step makes GenAI systems, and traditional ML systems, easy to access in real-world applications. We can further improve our RAG by optimizing the chunking strategy and combining different retrieval methods for our queries.

    Frequently Asked Questions

    Why is --reload used in the command?

    --reload makes the FastAPI server auto-restart whenever code changes, reflecting updates without manually restarting the server.

    Why is POST used for the /query endpoint?

    We use POST because queries include structured data like JSON objects, which can be large and complex, unlike GET requests, which are used for simple retrievals.

    What is MMR in retrieval?

    MMR (Maximal Marginal Relevance) balances relevance and diversity when selecting document chunks, ensuring retrieved results are useful without being redundant.

    What happens if top_k is increased to high values?

    Increasing top_k retrieves more chunks for the LLM, which can lead to potential noise in generated answers due to the presence of irrelevant content.

     


    Mounish V

    Passionate about technology and innovation, a graduate of Vellore Institute of Technology. Currently working as a Data Science Trainee, focusing on Data Science. Deeply interested in Deep Learning and Generative AI, eager to explore cutting-edge techniques to solve complex problems and create impactful solutions.

