
    7 Steps to Build a Simple RAG System from Scratch

    By gvfx00@gmail.com | November 18, 2025 | 13 Mins Read


    7 Steps to Build a Simple RAG System from Scratch
    Image by Author

     

    Table of Contents

    • # Introduction
    • # Understanding the Retrieval-Augmented Generation Workflow
    • # Step 1: Preprocessing the Data
    • # Step 2: Converting Text into Chunks
    • # Step 3: Creating and Storing Vector Embeddings
    • # Step 4: Retrieving Relevant Information
    • # Step 5: Combining the Retrieved Context
    • # Step 6: Using a Large Language Model to Generate the Answer
    • # Step 7: Running the Full Retrieval-Augmented Generation Pipeline
    • # Wrapping Up

    # Introduction

     
    These days, almost everyone uses ChatGPT, Gemini, or another large language model (LLM). They make life easier but can still get things wrong. For example, I remember asking a generative model who won the most recent U.S. presidential election and getting the previous president’s name back. It sounded confident, but the model was simply relying on training data collected before the election took place. This is where retrieval-augmented generation (RAG) helps LLMs give more accurate and up-to-date responses. Instead of depending only on the model’s internal knowledge, a RAG system pulls information from external sources (such as PDFs, documents, or APIs) and uses it to build a more contextual and reliable answer. In this guide, I’ll walk you through seven practical steps to build a simple RAG system from scratch.

     

    # Understanding the Retrieval-Augmented Generation Workflow

     
    Before we proceed to code, here’s the idea in plain terms. A RAG system has two core pieces: the retriever and the generator. The retriever searches your knowledge base and pulls out the most relevant chunks of text. The generator is the language model that takes those snippets and turns them into a natural, useful answer. The process is straightforward, as follows:

    1. A user asks a question.
    2. The retriever searches your indexed documents or database and returns the best matching passages.
    3. Those passages are handed to the LLM as context.
    4. The LLM then generates a response grounded in that retrieved context.
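The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the `retrieve` function ranks passages by shared words instead of embeddings, and `generate` is a stand-in that builds the prompt an LLM would receive rather than calling a model. Both function names and the sample passages are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, passages, top_k=2):
    """Toy retriever: rank passages by how many words they share with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(p)), p) for p in passages]
    # Highest overlap first; the sort is stable, so ties keep their input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def generate(question, context_passages):
    """Stand-in for the LLM call: returns the prompt the model would be given."""
    context = "\n".join(context_passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


passages = [
    "Supervised learning trains a model on labeled data.",
    "Unsupervised learning finds patterns in unlabeled data.",
    "FAISS is a library for fast similarity search.",
]
question = "What is supervised learning?"         # step 1: the user asks
top_passages = retrieve(question, passages, 1)    # step 2: retrieve
prompt = generate(question, top_passages)         # steps 3 and 4: context goes to the LLM
print(prompt)
```

The rest of the article replaces the word-overlap retriever with embeddings and FAISS, and the stand-in generator with a real model.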

    Now we will break that flow down into seven simple steps and build it end-to-end.

     

    # Step 1: Preprocessing the Data

     
    Even though large language models already know a lot from textbooks and web data, they don’t have access to your private or newly generated information like research notes, company documents, or project files. RAG helps you feed the model your own data, reducing hallucinations and making responses more accurate and up-to-date. For the sake of this article, we’ll keep things simple and use a few short text files about machine learning concepts.

    data/
     ├── supervised_learning.txt
     └── unsupervised_learning.txt
    

     

    supervised_learning.txt:
    In this type of machine learning (supervised), the model is trained on labeled data. 
    In simple terms, every training example has an input and an associated output label. 
    The objective is to build a model that generalizes well on unseen data. 
    Common algorithms include:
    - Linear Regression
    - Decision Trees
    - Random Forests
    - Support Vector Machines
    
    Classification and regression tasks are performed in supervised machine learning.
    For example: spam detection (classification) and house price prediction (regression).
    They can be evaluated using accuracy, F1-score, precision, recall, or mean squared error.
    

     

    unsupervised_learning.txt:
    In this type of machine learning (unsupervised), the model is trained on unlabeled data. 
    Popular algorithms include:
    - K-Means
    - Principal Component Analysis (PCA)
    - Autoencoders
    
    There are no predefined output labels; the algorithm automatically detects 
    underlying patterns or structures within the data.
    Typical use cases include anomaly detection, customer clustering, 
    and dimensionality reduction.
    Performance can be measured qualitatively or with metrics such as silhouette score 
    and reconstruction error.

     
    The next task is to load this data. For that, we will create a Python file, load_data.py:

    import os
    
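    def load_documents(folder_path):
        """Reads every .txt file in folder_path and returns the contents as a list of strings."""
        docs = []
        # Sort the file names so documents are loaded in a stable order
        for file in sorted(os.listdir(folder_path)):
            if file.endswith(".txt"):
                with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
                    docs.append(f.read())
        return docs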

     
    Before we use the data, we will clean it. If the text is messy, the model may retrieve irrelevant or incorrect passages, increasing hallucinations. Now, let’s create another Python file, clean_data.py:

    import re
    
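    def clean_text(text: str) -> str:
        # Replace runs of non-ASCII characters with a space (this also drops accented letters)
        text = re.sub(r'[^\x00-\x7F]+', ' ', text)
        # Collapse whitespace last so the replacement above cannot leave double spaces
        text = re.sub(r'\s+', ' ', text)
        return text.strip()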
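To make the two regular expressions used in the cleaning step concrete, here is a small sketch that applies each pattern on its own to a made-up sample string. The sample text is an assumption chosen only to show the effect.

```python
import re

# Made-up sample with extra spaces, a newline, a tab, and one accented letter
sample = "Supervised   learning:\n\tna\u00efve Bayes"

# Pattern 1: collapse any run of whitespace (spaces, tabs, newlines) into one space
collapsed = re.sub(r"\s+", " ", sample)
print(collapsed)   # → Supervised learning: naïve Bayes

# Pattern 2: replace any run of non-ASCII characters with a space
ascii_only = re.sub(r"[^\x00-\x7F]+", " ", collapsed)
print(ascii_only)  # → Supervised learning: na ve Bayes
```

Note that the ASCII filter is lossy: it is fine for the English sample files here, but it would damage text in most other languages, so you may want to drop or relax it for your own data.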

     
    Finally, combine everything into a new file called prepare_data.py to load and clean your documents together:

    from load_data import load_documents
    from clean_data import clean_text
    
    def prepare_docs(folder_path="data/"):
        """
        Loads and cleans all text documents from the given folder.
        """
        # Load Documents
        raw_docs = load_documents(folder_path)
    
        # Clean Documents
        cleaned_docs = [clean_text(doc) for doc in raw_docs]
    
        print(f"Prepared {len(cleaned_docs)} documents.")
        return cleaned_docs

     

    # Step 2: Converting Text into Chunks

     
    LLMs have a limited context window, i.e. they can process only a limited amount of text at once. We work around this by dividing long documents into short, overlapping pieces (a common choice is roughly 300 to 500 words per chunk). We’ll use LangChain’s RecursiveCharacterTextSplitter, which by default tries to split text at natural boundaries such as paragraph breaks, line breaks, and spaces. Note that its chunk_size and chunk_overlap are measured in characters by default, so the chunk_size=500 used below yields chunks of at most 500 characters, which is plenty for our small sample files. Because each piece stays coherent, the retriever can quickly find the relevant one when answering.

    split_text.py
    
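    try:
        # Newer LangChain releases ship the splitters in a separate package
        from langchain_text_splitters import RecursiveCharacterTextSplitter
    except ImportError:
        # Older releases expose the same class here
        from langchain.text_splitter import RecursiveCharacterTextSplitter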
    
    def split_docs(documents, chunk_size=500, chunk_overlap=100):
     
       # define the splitter
       splitter = RecursiveCharacterTextSplitter(
           chunk_size=chunk_size,
           chunk_overlap=chunk_overlap
       )
    
       # use the splitter to split docs into chunks
       chunks = splitter.create_documents(documents)
       print(f"Total chunks created: {len(chunks)}")
    
       return chunks

     
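    Chunking keeps each piece small enough to embed and retrieve while preserving its meaning. The overlap matters because a sentence cut at a chunk boundary would otherwise be split across two chunks, neither of which contains the full thought; repeating a little text at the edges keeps that context intact.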
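To see what overlap looks like in practice, here is a minimal sketch of a fixed-size sliding window over characters. It is not how RecursiveCharacterTextSplitter works internally (that class splits on separators first); the `chunk_with_overlap` helper is hypothetical and exists only to show how neighboring chunks share text.

```python
def chunk_with_overlap(text, chunk_size=20, chunk_overlap=5):
    """Split text into character windows where neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "The model is trained on labeled data."
chunks = chunk_with_overlap(text, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk begins with the last five characters of the previous one, so text that straddles a boundary appears in both chunks.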

     

    # Step 3: Creating and Storing Vector Embeddings

     
    A computer does not understand textual information; it only understands numbers. So, we need to convert our text chunks into numbers. These numbers are called vector embeddings, and they help the computer understand the meaning behind the text. We can use tools like OpenAI, SentenceTransformers, or Hugging Face for this. Let’s create a new file called create_embeddings.py and use SentenceTransformers to generate embeddings.

    from sentence_transformers import SentenceTransformer
    import numpy as np
    
    def get_embeddings(text_chunks):
      
       # Load embedding model
       model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
      
       print(f"Creating embeddings for {len(text_chunks)} chunks:")
       embeddings = model.encode(text_chunks, show_progress_bar=True)
      
       print(f"Embeddings shape: {embeddings.shape}")
       return np.array(embeddings)
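As a rough illustration of how embeddings are compared, the sketch below computes cosine similarity between a few made-up vectors. The three-dimensional vectors and their names are assumptions for readability; the all-MiniLM-L6-v2 model used above produces 384-dimensional vectors, but the comparison works the same way.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means they point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 3-dimensional vectors standing in for real embeddings
labeled_training = np.array([0.9, 0.1, 0.0])
supervised_models = np.array([0.8, 0.2, 0.1])
customer_clustering = np.array([0.1, 0.1, 0.9])

sim_related = cosine_similarity(labeled_training, supervised_models)
sim_unrelated = cosine_similarity(labeled_training, customer_clustering)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two vectors that point in a similar direction score much higher than the pair that does not, which is the property retrieval relies on.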
    

     
    Each embedding captures the semantic meaning of its chunk, so similar text chunks have embeddings that are close to each other in vector space. Next, we store the embeddings in a vector store such as FAISS (Facebook AI Similarity Search), Chroma, or Pinecone, which enables fast similarity search. Here we’ll use FAISS, a lightweight, local option. You can install the CPU build with pip install faiss-cpu.

     
    Next, let’s create a file called store_faiss.py. First, we make necessary imports:

    import faiss
    import numpy as np
    import pickle

     
    Now we’ll create a FAISS index from our embeddings using the function build_faiss_index().

    def build_faiss_index(embeddings, save_path="faiss_index"):
       """
       Builds FAISS index and saves it.
       """
       dim = embeddings.shape[1]
       print(f"Building FAISS index with dimension: {dim}")
    
       # Use a simple flat L2 index
       index = faiss.IndexFlatL2(dim)
       index.add(embeddings.astype('float32'))
    
       # Save FAISS index
       faiss.write_index(index, f"{save_path}.index")
       print(f"Saved FAISS index to {save_path}.index")
    
       return index
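Conceptually, a flat L2 index performs an exhaustive search: it compares the query against every stored vector and reports squared Euclidean distances. The NumPy sketch below mirrors that idea on made-up two-dimensional vectors. The `brute_force_l2_search` function is a hypothetical illustration, not part of FAISS, and it makes no attempt to match FAISS's performance.

```python
import numpy as np


def brute_force_l2_search(stored, queries, top_k):
    """Return (squared distances, indices) of the top_k nearest stored vectors per query."""
    # Pairwise differences, shape (num_queries, num_stored, dim)
    diffs = queries[:, None, :] - stored[None, :, :]
    squared = (diffs ** 2).sum(axis=2)
    # Sort each row by distance and keep the top_k closest positions
    indices = np.argsort(squared, axis=1)[:, :top_k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices


stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")
distances, indices = brute_force_l2_search(stored, query, top_k=2)
print(indices)    # positions of the nearest stored vectors, closest first
print(distances)  # their squared L2 distances
```

The returned positions are what tie a search hit back to a text chunk, which is why the chunks must be saved in the same order as the vectors.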

     
    Each vector in the index corresponds to one text chunk, and FAISS will later retrieve the nearest vectors when a user asks a question. Because the index stores only vectors, we also need to save the text chunks themselves (the metadata) in a pickle file so they can be reloaded at retrieval time.

    def save_metadata(text_chunks, path="faiss_metadata.pkl"):
       """
       Saves the mapping of vector positions to text chunks.
       """
       with open(path, "wb") as f:
           pickle.dump(text_chunks, f)
       print(f"Saved text metadata to {path}")

     

    # Step 4: Retrieving Relevant Information

     
    In this step, the user’s question is first converted into a vector using the same embedding model we used for the text chunks. The question’s vector is then compared with the chunk vectors to find the closest ones. This process is called similarity search.
    Let’s create a new file called retrieve_faiss.py and add the required imports:

    import faiss
    import pickle
    import numpy as np
    from sentence_transformers import SentenceTransformer

     
    Now, create a function to load the previously saved FAISS index from disk so it can be searched.

    def load_faiss_index(index_path="faiss_index.index"):
        """
        Loads the saved FAISS index from disk.
        """
        print("Loading FAISS index.")
        return faiss.read_index(index_path)

     

    We’ll also need another function that loads the metadata, which contains the text chunks we stored earlier.

    def load_metadata(metadata_path="faiss_metadata.pkl"):
        """
        Loads text chunk metadata (the actual text pieces).
        """
        print("Loading text metadata.")
        with open(metadata_path, "rb") as f:
            return pickle.load(f)

     

    The original text chunks are stored in a metadata file (faiss_metadata.pkl) and are used to map FAISS results back to readable text. At this point, we will be creating another function that takes a user’s query, embeds it, and finds the top matching chunks from the FAISS index. The semantic search takes place here.

    def retrieve_similar_chunks(query, index, text_chunks, top_k=3):
        """
        Retrieves top_k most relevant chunks for a given query.
      
        Parameters:
            query (str): The user's input question.
            index (faiss.Index): FAISS index object.
            text_chunks (list): Original text chunks.
            top_k (int): Number of top results to return.
      
        Returns:
            list: Top matching text chunks.
        """
      
        # Embed the query
        model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        # Ensure query vector is float32 as required by FAISS
        query_vector = model.encode([query]).astype('float32')
      
        # Search FAISS for nearest vectors
        distances, indices = index.search(query_vector, top_k)
      
        print(f"Retrieved top {top_k} similar chunks.")
        # FAISS pads results with -1 when the index holds fewer than top_k vectors
        return [text_chunks[i] for i in indices[0] if i != -1]

     
    This gives you the top three most relevant text chunks to use as context.
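The search call also returns distances, which can be useful for inspecting how good each match is. Here is a minimal sketch that pairs each chunk with its distance, using NumPy arrays shaped like the output of index.search() so it runs without FAISS. The `map_results_to_chunks` helper and the sample values are hypothetical; the padded distance is a placeholder, since the only detail relied on is that FAISS marks missing results with an index of -1.

```python
import numpy as np


def map_results_to_chunks(distances, indices, text_chunks):
    """Pair each retrieved chunk with its distance, skipping -1 padding entries."""
    results = []
    for distance, position in zip(distances[0], indices[0]):
        if position == -1:
            # No vector was found for this slot (index smaller than top_k)
            continue
        results.append((text_chunks[int(position)], float(distance)))
    return results


text_chunks = ["chunk about supervised learning", "chunk about clustering"]

# Arrays shaped like the result of searching a 2-vector index with top_k=3
distances = np.array([[0.25, 1.5, np.finfo("float32").max]], dtype="float32")
indices = np.array([[1, 0, -1]], dtype="int64")

results = map_results_to_chunks(distances, indices, text_chunks)
for chunk, distance in results:
    print(f"{distance:.2f}  {chunk}")
```

With an L2 index, smaller distances mean closer matches, so the first entry is the best hit.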

     

    # Step 5: Combining the Retrieved Context

     
    Once we have the most relevant chunks, the next step is to combine them into a single context block. This context is then placed in the prompt together with the user’s query before it is passed to the LLM, which gives the model the information it needs to generate accurate, grounded responses. You can combine the chunks like this:

    context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=3)
    context = "\n\n".join(context_chunks)

     
    This merged context will later be used when building the final prompt for the LLM.
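With larger documents, the joined chunks can grow past what the model's context window allows. One possible variation, sketched below, numbers each chunk and stops adding chunks once a rough character budget is reached. The `build_context` helper and its `max_chars` parameter are assumptions introduced for illustration, and the budget is approximate because it ignores the separators.

```python
def build_context(chunks, max_chars=1000):
    """Join chunks into one numbered context block, stopping at a rough character budget."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk}"
        # Chunks arrive ranked best-first, so stop at the first one that does not fit
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


chunks = [
    "Supervised learning uses labeled data.",
    "K-Means is a clustering algorithm.",
]
print(build_context(chunks))                # both chunks fit in the default budget
print(build_context(chunks, max_chars=60))  # only the first chunk fits
```

Numbering the chunks also makes it easier to ask the model to point to the passage it relied on.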

     

    # Step 6: Using a Large Language Model to Generate the Answer

     
    Now, we combine the retrieved context with the user query and feed it into an LLM to generate the final answer. Here, we’ll use a freely available open-source model from Hugging Face, but you can use any model you prefer.

    Let’s create a new file called generate_answer.py and add the imports:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    from retrieve_faiss import load_faiss_index, load_metadata, retrieve_similar_chunks

     
    Now define a function generate_answer() that performs the complete process:

    def generate_answer(query, top_k=3):
        """
        Retrieves relevant chunks and generates a final answer.
        """
        # Load FAISS index and metadata
        index = load_faiss_index()
        text_chunks = load_metadata()
    
        # Retrieve top relevant chunks
        context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=top_k)
        context = "\n\n".join(context_chunks)
    
        # Load open-source LLM
        print("Loading LLM...")
        model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        # Load tokenizer and model, using a device map for efficient loading
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    
        # Build the prompt
        prompt = f"""
        Context:
        {context}
        Question:
        {query}
        Answer:
        """
    
        # Generate output
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Use the correct input for model generation
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
        
        # Decode only the newly generated tokens, skipping the prompt tokens
        prompt_length = inputs["input_ids"].shape[1]
        answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
        
        print("\nFinal Answer:")
        print(answer)
        # Return the answer so callers can use it, not just read the printout
        return answer
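The wording of the prompt has a large effect on how closely the model sticks to the retrieved context. Below is a sketch of a small prompt builder that adds an explicit grounding instruction and avoids the leading spaces an indented triple-quoted string carries. The `build_prompt` helper and the instruction text are assumptions, not a recommended or model-specific template.

```python
def build_prompt(context, query):
    """Assemble a RAG prompt with an explicit instruction to stay within the context."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    context="Unsupervised learning is trained on unlabeled data.",
    query="Does unsupervised ML cover regression tasks?",
)
print(prompt)
```

Keeping prompt construction in its own function makes it easy to try different instructions without touching the retrieval or generation code.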

     

    # Step 7: Running the Full Retrieval-Augmented Generation Pipeline

     
    This final step brings everything together. We’ll create a main.py file that automates the entire workflow from data loading to generating the final answer.

    # Data preparation
    from prepare_data import prepare_docs
    from split_text import split_docs
    
    # Embedding and storage
    from create_embeddings import get_embeddings
    from store_faiss import build_faiss_index, save_metadata
    
    # Retrieval and answer generation
    from generate_answer import generate_answer

     

    Now define the main function:

    def run_pipeline():
        """
        Runs the full end-to-end RAG workflow.
        """
        print("\nLoad and Clean Data:")
        documents = prepare_docs("data/")
        print(f"Loaded {len(documents)} clean documents.\n")
    
        print("Split Text into Chunks:")
        # create_documents() accepts a list of plain strings and returns
        # a list of LangChain Document objects
        chunks = split_docs(documents, chunk_size=500, chunk_overlap=100)
    
        # Extract the text content from each Document object
        texts = [c.page_content for c in chunks]
        print(f"Created {len(texts)} text chunks.\n")
    
        print("Generate Embeddings:")
        embeddings = get_embeddings(texts)
      
        print("Store Embeddings in FAISS:")
        index = build_faiss_index(embeddings)
        save_metadata(texts)
        print("Stored embeddings and metadata successfully.\n")
    
        print("Retrieve & Generate Answer:")
        query = "Does unsupervised ML cover regression tasks?"
        generate_answer(query)

     

    Finally, run the pipeline:

    if __name__ == "__main__":
        run_pipeline()

     

    Output:
     

    [Figure: Screenshot of the Output | Image by Author]
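The pipeline above rebuilds the embeddings and the index on every run. If your data rarely changes, one option is to check whether the saved files already exist and skip straight to answering. The `index_is_cached` helper below is a hypothetical sketch; its default paths match the default file names used earlier (faiss_index.index and faiss_metadata.pkl).

```python
from pathlib import Path


def index_is_cached(index_path="faiss_index.index", metadata_path="faiss_metadata.pkl"):
    """Return True only if both the saved FAISS index and its metadata file exist."""
    return Path(index_path).is_file() and Path(metadata_path).is_file()


if index_is_cached():
    print("Found a saved index; the embedding and indexing steps could be skipped.")
else:
    print("No saved index found; run the full pipeline first.")
```

This check only looks at file existence, so remember to delete the saved files (or rebuild) whenever the documents in data/ change.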

     

    # Wrapping Up

     
    RAG closes the gap between what an LLM “already knows” and the constantly changing information out in the world. I have implemented a very basic pipeline so you can understand how RAG works. At the enterprise level, many advanced concepts come into play, such as guardrails, hybrid search, streaming, and context optimization techniques. If you’re interested in exploring more advanced concepts, here are a few of my personal favorites:

     
     

    Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
