    How to Build Vector Search From Scratch in Python

    By Bala Priya C
    May 8, 2026


     

    Table of Contents

    • Introduction
    • What Is Vector Search?
    • Setting Up the Dataset
    • Building the Index
    • Running Queries
    • Visualizing the Embedding Space
    • Visualizing the Similarity Score Distribution
    • Wrapping Up

    # Introduction

     
    You’ve probably typed a question into a search bar and gotten results that matched your words but completely missed your meaning. Or watched a recommendation engine surface something eerily relevant even though you never searched for it directly. The gap between “finding exact words” and “understanding what someone actually means” is what makes a search feature useful.

    Vector search closes that gap by representing text as points in high-dimensional space, where geometric proximity encodes semantic similarity. Two sentences can share zero words and still end up neighbors because the model learned that their meanings are close.

    This article builds a vector search engine from scratch in Python using only NumPy, so you can see exactly what happens at each step: how embeddings get stored and normalized, why cosine similarity reduces to a dot product, and what the resulting search space actually looks like when you project it down to two dimensions.

    You can get the code on GitHub.

     

    # What Is Vector Search?

     
    Traditional keyword search looks for exact word matches. Vector search works differently: it converts documents and queries into numerical vectors called embeddings, then finds the vectors that are closest to each other in high-dimensional space.

    The key insight is that closeness in vector space means semantic similarity. Two sentences that mean the same thing — even if they share no words — will have embeddings that are near each other.

    The distance metric you use to measure “closeness” is what drives the whole system. The most common one is cosine similarity, which measures the angle between two vectors rather than their absolute distance. This makes it scale-invariant — useful when you care about direction or meaning rather than magnitude or word count.
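    As a quick sketch of that scale-invariance (the vectors below are made up for illustration), note that scaling a vector leaves the cosine unchanged, and that for unit-length vectors the cosine reduces to a plain dot product:

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity: (a . b) / (|a| |b|)."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 3.0, 4.0])

    print(cosine(a, b))       # ~0.9926
    print(cosine(a * 10, b))  # same value: scaling doesn't change the angle

    # For unit-length vectors, cosine similarity is just the dot product
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    print(float(a_unit @ b_unit))  # matches cosine(a, b)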

     

    # Setting Up the Dataset

     
    We’ll work with a set of short product descriptions from a fictional e-commerce catalog, pre-embedded as 8-dimensional vectors. Real embedding models produce hundreds of dimensions or more; 8 is deliberately small but enough to demonstrate the concepts.

    In a real system, you’d generate these embeddings from a model like sentence-transformers. For this tutorial, we simulate that step with controlled random data that has a clear cluster structure.

    import numpy as np
    
    np.random.seed(42)
    
    # Product catalog — 3 semantic clusters: electronics, clothing, furniture
    products = [
        "Wireless noise-cancelling headphones with 30-hour battery",
        "Bluetooth speaker with waterproof design",
        "USB-C hub with 7 ports and power delivery",
        "4K HDMI cable 6ft braided",
        "Mechanical keyboard with RGB backlight",
        "Men's slim-fit chino pants navy blue",
        "Women's merino wool turtleneck sweater",
        "Unisex running jacket lightweight windbreaker",
        "Leather chelsea boots for men",
        "Organic cotton crew neck t-shirt",
        "Solid oak dining table seats 6",
        "Ergonomic mesh office chair lumbar support",
        "Linen sofa 3-seater natural beige",
        "Bamboo bookshelf 5-tier adjustable",
        "Memory foam mattress queen size medium firm",
    ]
    
    # Simulate embeddings with cluster structure
    # Cluster centers in 8D space
    electronics_center = np.array([0.9, 0.1, 0.2, 0.8, 0.1, 0.3, 0.7, 0.2])
    clothing_center    = np.array([0.1, 0.8, 0.7, 0.1, 0.9, 0.2, 0.1, 0.8])
    furniture_center   = np.array([0.2, 0.3, 0.9, 0.2, 0.1, 0.9, 0.3, 0.1])
    
    n_per_cluster = 5
    noise = 0.08
    
    embeddings = np.vstack([
        electronics_center + np.random.randn(n_per_cluster, 8) * noise,
        clothing_center    + np.random.randn(n_per_cluster, 8) * noise,
        furniture_center   + np.random.randn(n_per_cluster, 8) * noise,
    ])
    
    print(f"Embeddings shape: {embeddings.shape}")

     

    Output:

    Embeddings shape: (15, 8)

     

    Each row is a product. Each column is one dimension of its embedding. The product names won’t be used by the search engine; only the embeddings matter.

     


     

    # Building the Index

     
    The “index” in a vector search engine is just the stored set of normalized embeddings. Normalization is important here because it makes cosine similarity equivalent to a dot product, which is cheaper to compute.

    def normalize(vectors: np.ndarray) -> np.ndarray:
        """L2-normalize each row vector."""
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Avoid division by zero
        norms = np.where(norms == 0, 1e-10, norms)
        return vectors / norms
    
    class VectorIndex:
        def __init__(self):
            self.vectors = None
            self.labels = None
    
        def add(self, vectors: np.ndarray, labels: list):
            self.vectors = normalize(vectors)
            self.labels = labels
            print(f"Indexed {len(labels)} items with {vectors.shape[1]}-dimensional embeddings.")
    
        def search(self, query_vector: np.ndarray, top_k: int = 3):
            query_norm = normalize(query_vector.reshape(1, -1))
            # Cosine similarity = dot product of normalized vectors
            scores = self.vectors @ query_norm.T  # shape: (n_items, 1)
            scores = scores.flatten()
            # Get top-k indices sorted by descending score
            top_indices = np.argsort(scores)[::-1][:top_k]
            return [(self.labels[i], float(scores[i])) for i in top_indices]
    
    index = VectorIndex()
    index.add(embeddings, products)

     

    Output:

    Indexed 15 items with 8-dimensional embeddings.

     

    The search method does three things: normalizes the query, computes dot products against every stored vector, then sorts by score and returns the top-k results. That matrix multiplication (self.vectors @ query_norm.T) is the entire retrieval step.
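    One scaling note: np.argsort sorts the entire score array, which is fine for 15 items but wasteful for millions. A common refinement, sketched below as an optional alternative rather than part of the original class, is np.argpartition, which selects the top-k in linear time and then sorts only those k candidates:

    def top_k_fast(scores: np.ndarray, top_k: int = 3) -> np.ndarray:
        """Return indices of the top_k highest scores, sorted descending."""
        # argpartition places the k largest scores in the last k slots (unordered)
        candidates = np.argpartition(scores, -top_k)[-top_k:]
        # Sort just those k candidates by descending score
        return candidates[np.argsort(scores[candidates])[::-1]]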

     

    # Running Queries

     
    Now let’s test what we’ve built with a few queries. We construct query vectors by starting from one of the cluster centers and adding a little noise to simulate a real query embedding.

    def make_query(center: np.ndarray, noise_scale: float = 0.05) -> np.ndarray:
        return center + np.random.randn(8) * noise_scale
    
    
    queries = {
        "audio equipment": make_query(electronics_center),
        "casual wear":     make_query(clothing_center),
        "home furniture":  make_query(furniture_center),
    }
    
    for query_name, q_vec in queries.items():
        print(f"\nQuery: '{query_name}'")
        results = index.search(q_vec, top_k=3)
        for rank, (label, score) in enumerate(results, 1):
            print(f"  {rank}. [{score:.4f}] {label}")

     

    Output:

    
    Query: 'audio equipment'
      1. [0.9856] Wireless noise-cancelling headphones with 30-hour battery
      2. [0.9840] USB-C hub with 7 ports and power delivery
      3. [0.9829] Mechanical keyboard with RGB backlight
    
    Query: 'casual wear'
      1. [0.9960] Men's slim-fit chino pants navy blue
      2. [0.9958] Leather chelsea boots for men
      3. [0.9916] Women's merino wool turtleneck sweater
    
    Query: 'home furniture'
      1. [0.9929] Bamboo bookshelf 5-tier adjustable
      2. [0.9902] Linen sofa 3-seater natural beige
      3. [0.9881] Solid oak dining table seats 6

     

    Scores close to 1.0 mean near-identical direction in embedding space, which is exactly what you expect for queries constructed from the same cluster center as their target documents.

     

    # Visualizing the Embedding Space

     
    High-dimensional data is hard to reason about visually. Principal component analysis (PCA) projects the 8-dimensional embeddings down to 2D so we can see the cluster structure. We’ll implement a minimal PCA using only NumPy.
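    The pca_2d helper itself doesn’t appear in the text as published, so here is a minimal sketch of what it could look like: fit the principal components once on the product embeddings (centering plus an SVD), then project any vectors, including the query vectors later on, with that same fit so everything lands in a shared 2D space.

    # Fit PCA once on the product embeddings (centering + SVD)
    emb_mean = embeddings.mean(axis=0)
    _, _, Vt = np.linalg.svd(embeddings - emb_mean, full_matrices=False)
    pc_axes = Vt[:2]  # top-2 principal directions

    def pca_2d(X: np.ndarray) -> np.ndarray:
        """Project rows of X onto the catalog's top-2 principal components."""
        return (X - emb_mean) @ pc_axes.T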

    The following code computes the 2D PCA projection and plots all product embeddings with labels and cluster colors:

    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    
    projected = pca_2d(embeddings)
    
    cluster_colors = (
        ["#4A90D9"] * 5 +   # electronics — blue
        ["#E8734A"] * 5 +   # clothing — orange
        ["#5BAD72"] * 5     # furniture — green
    )
    cluster_labels = ["Electronics"] * 5 + ["Clothing"] * 5 + ["Furniture"] * 5
    
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(projected[:, 0], projected[:, 1],
               c=cluster_colors, s=100, edgecolors="white", linewidths=0.7, zorder=3)

     

    This part projects query vectors into the same space, overlays them, and finalizes the plot:

    # Project the query vectors with the same PCA fit as the catalog
    q_projected = pca_2d(np.vstack(list(queries.values())))
    for (qname, _), (qx, qy) in zip(queries.items(), q_projected):
        ax.scatter(qx, qy, marker="*", s=200, color="gold",
                   edgecolors="#333", linewidths=0.6, zorder=4)
        ax.annotate(f"⟵ query: {qname}", (qx, qy),
                    textcoords="offset points", xytext=(6, -8),
                    fontsize=7, color="#555555", style="italic")
    
    legend_patches = [
        mpatches.Patch(color="#4A90D9", label="Electronics"),
        mpatches.Patch(color="#E8734A", label="Clothing"),
        mpatches.Patch(color="#5BAD72", label="Furniture"),
        mpatches.Patch(color="gold",    label="Query vectors"),
    ]
    ax.legend(handles=legend_patches, loc="upper left", fontsize=6)
    ax.set_title("Vector Search — Embedding Space (PCA projection)", fontsize=10, pad=10)
    ax.set_xlabel("PC 1"); ax.set_ylabel("PC 2")
    ax.grid(True, linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.savefig("embedding_space_queries_only.png", dpi=150)
    plt.show()

     

    Output:

     

    Vector Search — Embedding Space (PCA projection)

     

    The clusters separate cleanly. Each gold star (query vector) lands inside the cluster it was constructed from. This is the geometry that vector search makes use of.

     

    # Visualizing the Similarity Score Distribution

     
    For any given query, it’s useful to see how similarity scores are distributed across the whole index, not just the top-k. This tells you whether the top result is a clear winner or only marginally better than everything else.

    q_vec_furniture = queries["home furniture"]
    q_norm_furniture = normalize(q_vec_furniture.reshape(1, -1))
    all_scores_furniture = (index.vectors @ q_norm_furniture.T).flatten()
    
    sorted_idx_furniture = np.argsort(all_scores_furniture)[::-1]
    sorted_scores_furniture = all_scores_furniture[sorted_idx_furniture]
    sorted_labels_furniture = [products[i][:30] + "…" if len(products[i]) > 30
                               else products[i] for i in sorted_idx_furniture]
    
    # Define bar colors: green for furniture items, gray for others
    bar_colors_furniture = []
    for i in sorted_idx_furniture:
        if i >= 10 and i <= 14:  # Furniture items are originally at indices 10-14
            bar_colors_furniture.append("#5BAD72") # Green for furniture
        else:
            bar_colors_furniture.append("#cccccc") # Gray for others
    
    fig, ax = plt.subplots(figsize=(10, 5))
    bars = ax.barh(sorted_labels_furniture[::-1], sorted_scores_furniture[::-1],
                   color=bar_colors_furniture[::-1], edgecolor="white", height=0.65)
    
    ax.axvline(sorted_scores_furniture[2], color="#5BAD72", linestyle="--",
               linewidth=1.2, label="Top-3 cutoff")
    ax.set_xlim(sorted_scores_furniture.min() - 0.002, 1.001)
    ax.set_xlabel("Cosine Similarity Score")
    ax.set_title("Query: 'home furniture' — Similarity Across All Products", fontsize=11, pad=12)
    ax.legend(fontsize=8)
    ax.grid(axis="x", linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.savefig("score_distribution_furniture.png", dpi=150)
    plt.show()

     

    Output:

     

    Query: 'home furniture' — Similarity Across All Products

     

    There’s a visible gap between the furniture cluster (top 5 bars) and everything else. In practice, you’d use this gap to set a similarity threshold below which results are suppressed entirely.
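    As a sketch of that idea (the 0.95 cutoff below is illustrative, not a recommendation), thresholding is a one-line filter on top of the existing search method:

    def search_with_threshold(index: VectorIndex, query_vector: np.ndarray,
                              top_k: int = 3, min_score: float = 0.95):
        """Return top-k results, dropping anything below the similarity cutoff."""
        results = index.search(query_vector, top_k=top_k)
        return [(label, score) for label, score in results if score >= min_score]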

     

    # Wrapping Up

     
    You built a vector search engine with about 50 lines of NumPy: an index class that normalizes and stores embeddings, a search method that uses matrix multiplication to compute cosine similarity, and two visualizations that reveal the geometry behind the results.

    The next step is to replace the simulated embeddings with real ones. Try loading sentence-transformers and embedding your own text corpus. The index code here will work without any changes.
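    A minimal sketch of that swap, assuming the sentence-transformers package is installed (the model name and query string here are illustrative):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
    real_embeddings = model.encode(products)         # shape: (15, 384)

    real_index = VectorIndex()
    real_index.add(real_embeddings, products)

    query_vec = model.encode(["portable speaker for the beach"])[0]
    print(real_index.search(query_vec, top_k=3))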

    If you’d like to read more “from scratch” articles, let us know what you’d like to see next!
     
     

    Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


