    Business & Startups

    Build a Text-to-SQL System: Replicating Pinterest’s Approach

    By gvfx00@gmail.com | October 4, 2025


    Data is essential for modern business decisions. Many employees, however, are unfamiliar with SQL. This creates a bottleneck between questions and answers. A Text-to-SQL system solves this problem directly. It translates simple questions into database queries. This article shows you how to build a SQL generator. We will follow the ideas from Pinterest’s Text-to-SQL engineering team. You will learn how to convert natural language to SQL. We will also use advanced techniques like RAG for table selection.

    Table of Contents

    • Understanding Pinterest’s Approach
        • The First Version
        • The Second Version
    • Our Plan: A Simplified Replication
    • Hands-on Guide: Building Your Own SQL Generator
        • Step 1: Setting Up Your Environment
        • Step 2: Simulating the Database
        • Step 3: Building the Core Text-to-SQL Chain
        • Step 4: Enhancing with RAG for Table Selection
        • Step 5: Combining Everything into a RAG-Powered Chain
    • Conclusion
    • Frequently Asked Questions

    Understanding Pinterest’s Approach

    Pinterest wanted to make data accessible to everyone. Their employees needed insights from vast datasets. Most of them were not SQL experts. This challenge led to the creation of Pinterest’s Text-to-SQL platform. Their journey provides a great roadmap for building similar tools.

    The First Version

    Their first system was straightforward. A user would ask a question and also list the database tables they thought were relevant. The system would then generate a SQL query.

    Let’s take a closer look at its architecture:

    The user asks an analytical question, choosing the tables to be used.

    1. The relevant table schemas are retrieved from the table metadata store.
    2. The question, selected SQL dialect, and table schemas are compiled into a Text-to-SQL prompt.
    3. The prompt is fed into the LLM.
    4. A streaming response is generated and displayed to the user.

    This approach worked, but it had a major flaw. Users often had no idea which tables contained their answers.
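    The prompt assembly in step 2 can be sketched as a simple template fill. The names below (`PROMPT_TEMPLATE`, `build_prompt`) are illustrative, not Pinterest's actual code:

```python
PROMPT_TEMPLATE = """You are an expert {dialect} analyst.
Given these table schemas:
{schemas}

Write a single {dialect} query answering:
{question}"""


def build_prompt(question: str, dialect: str, schemas: str) -> str:
    # Compile the question, SQL dialect, and table schemas into one
    # Text-to-SQL prompt, mirroring step 2 of the first-version flow.
    return PROMPT_TEMPLATE.format(dialect=dialect, schemas=schemas, question=question)
```

    The filled prompt is then sent to the LLM (step 3), which streams the SQL back to the user (step 4).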

    The Second Version

    To solve this, their team built a smarter system. It used a technique called Retrieval-Augmented Generation (RAG). Instead of asking the user for tables, the system found them automatically. It searched a collection of table descriptions to find the most relevant ones for the question. This use of RAG for table selection made the tool much more user-friendly.

    1. An offline job is employed to generate a vector index of tables’ summaries and historical queries against them.
    2. If the user does not specify any tables, their question is transformed into embeddings, and a similarity search is conducted against the vector index to infer the top N suitable tables.
    3. The top N tables, along with the table schema and analytical question, are compiled into a prompt for LLM to select the top K most relevant tables.
    4. The top K tables are returned to the user for validation or alteration.
    5. The standard Text-to-SQL process is resumed with the user-confirmed tables.

     We will replicate this powerful two-step approach.

    Our Plan: A Simplified Replication

    This guide will help you build a SQL generator in two parts. First, we will create the core engine that converts natural language to SQL. Second, we will add the intelligent table-finding feature.

    1. The Core System: We will build a basic chain. It takes a question and a list of table names to create a SQL query.
    Schema retrieval flowchart
    • User input: Provides an analytical question, selected tables, and SQL dialect.
    • Schema Retrieval: The system fetches relevant table schemas from the metadata store.
    • Prompt Assembly: Combines question, schemas, and dialect into a prompt.
    • LLM Generation: Model outputs the SQL query.
    • Validation & Execution: Query is checked for safety, executed, and results are returned.
    2. The RAG-Enhanced System: We will add a retriever. This component automatically suggests the correct tables for any question.
    Query embeddings flowchart
    • Offline Indexing: SQL query logs are summarized by an LLM, embedded, and stored in a vector index with metadata.
    • User Query: The user provides a natural-language analytical question.
    • Retrieval: The question is embedded, matched against the vector store, and Top-N candidate tables are returned.
    • Table Selection: An LLM ranks and selects the Top-K most relevant tables.
    • Schema Retrieval & Prompting: The system fetches schemas for those tables and builds a Text-to-SQL prompt.
    • SQL Generation: An LLM generates the SQL query.
    • Validation & Execution: The query is checked, executed, and the results + SQL are returned to the user.

    We will use Python, LangChain, and OpenAI to build this Text-to-SQL system. An in-memory SQLite database will act as our data source.

    Hands-on Guide: Building Your Own SQL Generator

    Let’s begin building our system. Follow these steps to create a working prototype.

    Step 1: Setting Up Your Environment

    First, we install the necessary Python libraries. LangChain helps us connect components, and the langchain-openai package provides the connection to the LLM. FAISS powers our retriever, and Pandas displays results nicely.

    !pip install -qU langchain langchain-openai faiss-cpu pandas langchain_community

    Next, you must configure your OpenAI API key. This key allows our application to use OpenAI’s models.

    import os
    
    from getpass import getpass
    
    OPENAI_API_KEY = getpass("Enter your OpenAI API key: ")
    
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

    Step 2: Simulating the Database

    A Text-to-SQL system needs a database to query. For this demo, we create a simple, in-memory SQLite database. It will contain three tables: users, pins, and boards. This setup mimics a basic version of Pinterest’s data structure.

    import sqlite3
    
    import pandas as pd
    
    # Create a connection to an in-memory SQLite database
    
    conn = sqlite3.connect(':memory:')
    
    cursor = conn.cursor()
    
    # Create tables
    
    cursor.execute('''
    
    CREATE TABLE users (
    
       user_id INTEGER PRIMARY KEY,
    
       username TEXT NOT NULL,
    
       join_date DATE NOT NULL,
    
       country TEXT
    
    )
    
    ''')
    
    cursor.execute('''
    
    CREATE TABLE pins (
    
       pin_id INTEGER PRIMARY KEY,
    
       user_id INTEGER,
    
       board_id INTEGER,
    
       image_url TEXT,
    
       description TEXT,
    
       created_at DATETIME,
    
       FOREIGN KEY(user_id) REFERENCES users(user_id),
    
       FOREIGN KEY(board_id) REFERENCES boards(board_id)
    
    )
    
    ''')
    
    cursor.execute('''
    
    CREATE TABLE boards (
    
       board_id INTEGER PRIMARY KEY,
    
       user_id INTEGER,
    
       board_name TEXT NOT NULL,
    
       category TEXT,
    
       FOREIGN KEY(user_id) REFERENCES users(user_id)
    
    )
    
    ''')
    
    # Insert sample data
    
    cursor.execute("INSERT INTO users (user_id, username, join_date, country) VALUES (1, 'alice', '2023-01-15', 'USA')")
    
    cursor.execute("INSERT INTO users (user_id, username, join_date, country) VALUES (2, 'bob', '2023-02-20', 'Canada')")
    
    cursor.execute("INSERT INTO boards (board_id, user_id, board_name, category) VALUES (101, 1, 'DIY Crafts', 'DIY')")
    
    cursor.execute("INSERT INTO boards (board_id, user_id, board_name, category) VALUES (102, 1, 'Travel Dreams', 'Travel')")
    
    cursor.execute("INSERT INTO pins (pin_id, user_id, board_id, description, created_at) VALUES (1001, 1, 101, 'Handmade birthday card', '2024-03-10 10:00:00')")
    
    cursor.execute("INSERT INTO pins (pin_id, user_id, board_id, description, created_at) VALUES (1002, 2, 102, 'Eiffel Tower at night', '2024-05-15 18:30:00')")
    
    cursor.execute("INSERT INTO pins (pin_id, user_id, board_id, description, created_at) VALUES (1003, 1, 101, 'Knitted scarf pattern', '2024-06-01 12:00:00')")
    
    conn.commit()
    
    print("Database created and populated successfully.")

    Output:

    Database created and populated successfully.

    Step 3: Building the Core Text-to-SQL Chain

    The language model cannot see our database directly. It needs to know the table structures, or schemas. We create a function to get the CREATE TABLE statements. This information tells the model about columns, data types, and keys.

    def get_table_schemas(conn, table_names):
    
       """Fetches the CREATE TABLE statement for a list of tables."""
    
       schemas = []
    
       cursor = conn.cursor() # Get cursor from the passed connection
    
       for table_name in table_names:
    
           query = f"SELECT sql FROM sqlite_master WHERE type='table' AND name='{table_name}';"
    
           cursor.execute(query)
    
           result = cursor.fetchone()
    
           if result:
    
               schemas.append(result[0])
    
       return "\n\n".join(schemas)
    
    # Example usage
    
    sample_schemas = get_table_schemas(conn, ['users', 'pins'])
    
    print(sample_schemas)

    Output:

    CREATE TABLE users (
       user_id INTEGER PRIMARY KEY,
       username TEXT NOT NULL,
       join_date DATE NOT NULL,
       country TEXT
    )

    CREATE TABLE pins (
       pin_id INTEGER PRIMARY KEY,
       user_id INTEGER,
       board_id INTEGER,
       image_url TEXT,
       description TEXT,
       created_at DATETIME,
       FOREIGN KEY(user_id) REFERENCES users(user_id),
       FOREIGN KEY(board_id) REFERENCES boards(board_id)
    )

    With the schema function ready, we build our first chain. A prompt template instructs the model on its task. It combines the schemas and the user’s question. We then connect this prompt to the model.

    from langchain_core.prompts import ChatPromptTemplate
    
    from langchain_openai import ChatOpenAI
    
    from langchain_core.output_parsers import StrOutputParser
    
    from langchain_core.runnables import RunnablePassthrough, RunnableLambda
    
    import sqlite3 # Import sqlite3
    
    template = """
    
    You are a master SQL expert. Based on the provided table schema and a user's question, write a syntactically correct SQLite SQL query.
    
    Only return the SQL query and nothing else.
    
    Here is the database schema:
    
    {schema}
    
    Here is the user's question:
    
    {question}
    
    """
    
    prompt = ChatPromptTemplate.from_template(template)
    
    llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
    
    sql_chain = prompt | llm | StrOutputParser()
    
    Let's test our chain with a question where we explicitly provide the table names.
    
    user_question = "How many pins has alice created?"
    
    table_names_provided = ["users", "pins"]
    
    # Retrieve the schema in the main thread before invoking the chain
    
    schema = get_table_schemas(conn, table_names_provided)
    
    # Pass the schema directly to the chain
    
    generated_sql = sql_chain.invoke({"schema": schema, "question": user_question})
    
    print("User Question:", user_question)
    
    print("Generated SQL:", generated_sql)
    
    # Clean the generated SQL by removing markdown code block syntax
    
    cleaned_sql = generated_sql.strip()
    
    if cleaned_sql.startswith("```sql"):
    
       cleaned_sql = cleaned_sql[len("```sql"):].strip()
    
    if cleaned_sql.endswith("```"):
    
       cleaned_sql = cleaned_sql[:-len("```")].strip()
    
    print("Cleaned SQL:", cleaned_sql)
    
    # Let's run the generated SQL to verify it works
    
    try:
    
       result_df = pd.read_sql_query(cleaned_sql, conn)
    
       display(result_df)
    
    except Exception as e:
    
       print(f"Error executing SQL query: {e}")

    Output:

    The generated SQL joins users and pins to count alice's pins; running it returns 2, matching pins 1001 and 1003 in our sample data.

    The system correctly generated the SQL and found the right answer.
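    The markdown-fence cleanup above is repeated again in Step 5, so it is worth factoring into one reusable helper. A minimal sketch:

```python
import re


def strip_sql_fences(text: str) -> str:
    """Remove a surrounding markdown code fence (```sql ... ```), if present."""
    text = text.strip()
    # Match an optional language tag after the opening fence,
    # capture everything up to the closing fence.
    match = re.match(r"^```(?:sql)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text
```

    With this helper, the post-processing in both steps collapses to `cleaned_sql = strip_sql_fences(generated_sql)`.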

    Step 4: Enhancing with RAG for Table Selection

    Our core system works well, but requires users to know table names. This is the exact problem Pinterest’s Text-to-SQL team solved. We will now implement RAG for table selection. We start by writing simple, natural language summaries for each table. These summaries capture the meaning of each table’s content.

    table_summaries = {
    
       "users": "Contains information about individual users, including their username, join date, and country of origin.",
    
       "pins": "Contains data about individual pins, linking to the user who created them and the board they belong to. Includes descriptions and creation timestamps.",
    
       "boards": "Stores information about user-created boards, including the board's name, category, and the user who owns it."
    
    }

    Next, we create a vector store. This tool converts our summaries into numerical representations (embeddings). It allows us to find the most relevant table summaries for a user’s question through a similarity search.

    from langchain_openai import OpenAIEmbeddings
    
    from langchain_community.vectorstores import FAISS
    
    from langchain.schema import Document
    
    # Create LangChain Document objects for each summary
    
    summary_docs = [
    
       Document(page_content=summary, metadata={"table_name": table_name})
    
       for table_name, summary in table_summaries.items()
    
    ]
    
    embeddings = OpenAIEmbeddings()
    
    vector_store = FAISS.from_documents(summary_docs, embeddings)
    
    retriever = vector_store.as_retriever()
    
    print("Vector store created successfully.")
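    Calling `retriever.invoke(question)` requires the OpenAI embeddings API. As an offline illustration of what the similarity search does, here is a toy retriever that ranks table summaries by word overlap instead of embeddings (illustration only, not part of the real pipeline):

```python
import re


def toy_retrieve(question: str, summaries: dict, top_n: int = 2) -> list:
    """Rank tables by word overlap between the question and each summary.
    A toy stand-in for embedding similarity search."""
    tokenize = lambda text: set(re.findall(r"[a-z]+", text.lower()))
    q_words = tokenize(question)
    scored = sorted(
        ((len(q_words & tokenize(summary)), table)
         for table, summary in summaries.items()),
        reverse=True,
    )
    # Return the top-N table names, best overlap first
    return [table for _, table in scored[:top_n]]
```

    Embeddings do the same ranking job, but match on meaning ("photos" finds the pins table) rather than exact words.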

    Step 5: Combining Everything into a RAG-Powered Chain

    We now construct the final, intelligent chain. This chain automates the entire process. It takes a question, uses the retriever to find relevant tables, fetches their schemas, and then passes everything to our sql_chain.

    def get_table_names_from_docs(docs):
    
       """Extracts table names from the metadata of retrieved documents."""
    
       return [doc.metadata['table_name'] for doc in docs]
    
    # Helper to fetch schemas for the retrieved tables using the shared connection
    
    def get_schema_for_rag(x):
    
       table_names = get_table_names_from_docs(x['table_docs'])
    
       schema = get_table_schemas(conn, table_names)
    
       return {"question": x['question'], "table_names": table_names, "schema": schema}
    
    full_rag_chain = (
    
       RunnablePassthrough.assign(
    
           table_docs=lambda x: retriever.invoke(x['question'])
    
       )
    
       | RunnableLambda(get_schema_for_rag) # Use RunnableLambda to call the schema fetching function
    
       | sql_chain # Pass the dictionary with question, table_names, and schema to sql_chain
    
    )
    
    Let's test the complete system. We ask a question without mentioning any tables. The system should handle everything.
    
    user_question_no_tables = "Show me all the boards created by users from the USA."
    
    # Pass the user question within a dictionary
    
    final_sql = full_rag_chain.invoke({"question": user_question_no_tables})
    
    print("User Question:", user_question_no_tables)
    
    print("Generated SQL:", final_sql)
    
    # Clean the generated SQL by removing markdown code block syntax, being more robust
    
    cleaned_sql = final_sql.strip()
    
    if cleaned_sql.startswith("```sql"):
    
       cleaned_sql = cleaned_sql[len("```sql"):].strip()
    
    if cleaned_sql.endswith("```"):
    
       cleaned_sql = cleaned_sql[:-len("```")].strip()
    
    # Also handle cases where there might be leading/trailing newlines after cleaning
    
    cleaned_sql = cleaned_sql.strip()
    
    print("Cleaned SQL:", cleaned_sql)
    
    # Verify the generated SQL
    
    try:
    
       result_df = pd.read_sql_query(cleaned_sql, conn)
    
       display(result_df)
    
    except Exception as e:
    
       print(f"Error executing SQL query: {e}")

    Output:

    The cleaned SQL selects boards owned by USA-based users; the result lists alice's two boards, 'DIY Crafts' and 'Travel Dreams'.

    Success! The system automatically identified the users and boards tables. It then generated the correct query to answer the question. This shows the power of using RAG for table selection.
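    The "Validation & Execution" step in our flowcharts deserves more than a try/except. A minimal guard that only allows read-only statements might look like this (a sketch; a production system should use a real SQL parser or a read-only database connection):

```python
def is_safe_select(sql: str) -> bool:
    """Allow only a single read-only statement (SELECT, or a WITH...SELECT CTE)."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped or ";" in stripped:  # empty, or more than one statement
        return False
    first_word = stripped.split(None, 1)[0].upper()
    return first_word in ("SELECT", "WITH")
```

    Checking `is_safe_select(cleaned_sql)` before `pd.read_sql_query` prevents the LLM from ever issuing a DROP or DELETE against the database.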

    Conclusion

    We have built a working prototype of a SQL generator. Moving it to a production environment requires more steps: you could automate the table summarization process, and you could include historical queries in the vector store to improve retrieval accuracy. Both follow the path taken by Pinterest's Text-to-SQL team, and this foundation provides a clear route to a powerful data tool.
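    The first suggestion, automating summarization, can be sketched as a helper that asks the LLM to describe each schema. The function name `summarize_table` and its `llm_call` parameter are hypothetical, not from the article's code:

```python
def summarize_table(table_name: str, schema: str, llm_call) -> str:
    """Draft a one-sentence summary of a table from its CREATE TABLE statement.

    `llm_call` is any callable mapping a prompt string to a completion
    string (e.g. a thin wrapper around ChatOpenAI.invoke)."""
    prompt = (
        f"In one sentence, describe what the table '{table_name}' stores, "
        f"based on its schema:\n{schema}"
    )
    return llm_call(prompt)
```

    Running this over every table in `sqlite_master` would replace the hand-written `table_summaries` dictionary from Step 4.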

    Frequently Asked Questions

    Q1. What is a Text-to-SQL system?

    A. A Text-to-SQL system translates questions written in plain language (like English) into SQL database queries, letting non-technical users get data without writing code.

    Q2. Why is RAG useful for Text-to-SQL?

    A. RAG helps the system automatically find the most relevant database tables for a user’s question. This removes the need for users to know the database structure.

    Q3. What is LangChain?

    A. LangChain is a framework for developing applications powered by language models. It helps connect different components like prompts, models, and retrievers into a single chain.


    Harsh Mishra

    Harsh Mishra is an AI/ML Engineer who spends more time talking to Large Language Models than actual humans. Passionate about GenAI, NLP, and making machines smarter (so they don’t replace him just yet). When not optimizing models, he’s probably optimizing his coffee intake. 🚀☕
