    10 Ways to Slash Inference Costs with OpenAI LLMs

    Business & Startups | December 14, 2025 | 9 Mins Read

    Large Language Models (LLMs) are at the heart of agentic and RAG systems, and building with LLMs is exciting until scale makes them expensive. There is always a tradeoff between cost and quality, but in this article we will explore the 10 best ways, in my experience, to slash LLM costs while maintaining the quality of the system. Note that I'll be using the OpenAI API for inference, but the techniques can be applied to other model providers as well. So without further ado, let's understand the cost equation and then look at ways to optimize LLM costs. 

    Table of Contents

    • Prerequisite: Understanding the Cost Equation
    • 1. Route Requests to the Right Model
        • Configure the API key  
        • Define the functions 
    • 2. Use Models according to the task
    • 3. Using Prompt Caching
    • 4. Use the Batch API for Tasks that can wait
    • 5. Trim the Outputs with max_tokens and stop Parameters
    • 6. Make Use of RAG
    • 7. Always Manage the Conversation History
      • Function for summarization 
      • Inference 
    • 8. Upgrade to Efficient Model Modes
    • 9. Enforce Structured Outputs (JSON)
      • Imports and Structure Definition  
      • Inference 
    • 10. Cache Queries
    • Conclusion 
    • Frequently Asked Questions 

    Prerequisite: Understanding the Cost Equation

    Before we start, let's get familiar with costs, tokens, and the context window: 

    • Tokens: These are small units of text. For all practical purposes, you can assume 1,000 tokens is roughly 750 words.  
    • Prompt Tokens: These are the input tokens we send to the model. They are generally cheaper. 
    • Completion Tokens: These are the tokens generated by the model. They are often 3-4 times more expensive than input tokens. 
    • Context Window: This is like a short-term memory (it can include old inputs + outputs). If you exceed this limit, the model leaves out the earlier parts of the conversation. If you send 10 previous messages in the context window, those count as input tokens for the current request and add to the cost. 
    • Total Cost: (Input Tokens x Per Input Token Cost) + (Output Tokens x Per Output Token Cost). A small worked example follows this list. 
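
    For example, here is a minimal sketch of the cost equation in Python. The per-token prices below are placeholder numbers for illustration only; check OpenAI's pricing page for the actual rates of the model you use.

    # A minimal sketch of the cost equation. The prices are placeholders,
    # not OpenAI's actual rates.
    INPUT_PRICE_PER_1M = 1.25     # hypothetical $ per 1M input tokens
    OUTPUT_PRICE_PER_1M = 10.00   # hypothetical $ per 1M output tokens

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        """Total cost = input tokens * input rate + output tokens * output rate."""
        return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M + \
               (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

    # Example: a 2,000-token prompt that produces a 500-token answer
    print(f"${request_cost(2_000, 500):.4f}")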

    Note: For OpenAI you can use the billing dashboard to track costs: https://platform.openai.com/settings/organization/billing/overview 

    To learn how to get an OpenAI API key, read this article.

    1. Route Requests to the Right Model

    Not every task requires the best, state-of-the-art model. You can experiment with a cheaper model, or use few-shot prompting with a cheaper model to approximate the output of a bigger one.  

    Configure the API key  

    from google.colab import userdata 
    import os 
    
    os.environ['OPENAI_API_KEY']=userdata.get('OPENAI_API_KEY') 

    Define the functions 

    from openai import OpenAI 
    
    client = OpenAI() 
    SYSTEM_PROMPT = "You are a concise, helpful assistant. You answer in 25-30 words" 
    
    def generate_examples(questions, n=3): 
        examples = [] 
        for q in questions[:n]: 
            # Ask the larger model for a reference answer to each question
            response = client.chat.completions.create( 
                model="gpt-5.1", 
                messages=[{"role": "system", "content": SYSTEM_PROMPT}, 
                          {"role": "user", "content": q}] 
            ) 
            examples.append({"q": q, "a": response.choices[0].message.content}) 
    
        return examples

    This function uses the larger GPT-5.1 model and answers each question in 25-30 words.  

    # Example usage 
    
    questions = [ 
       "What is overfitting?", 
       "What is a confusion matrix?", 
       "What is gradient descent?" 
    ] 
    
    few_shot = generate_examples(questions, n=3)

    Great, we got our question-answer pairs.  

    def build_prompt(examples, question): 
        # Concatenate the Q/A pairs into a few-shot prompt for the new question
        prompt = "" 
        for ex in examples: 
            prompt += f"Q: {ex['q']}\nA: {ex['a']}\n\n" 
        return prompt + f"Q: {question}\nA:" 
    
    def ask_small_model(examples, question): 
        # Send the few-shot prompt to the cheaper model
        prompt = build_prompt(examples, question) 
        response = client.chat.completions.create( 
            model="gpt-5-nano", 
            messages=[{"role": "system", "content": SYSTEM_PROMPT}, 
                      {"role": "user", "content": prompt}] 
        ) 
        return response.choices[0].message.content

    Here, we have one function that builds the prompt from the question-answer pairs and another that sends it to the smaller gpt-5-nano model.  

    answer = ask_small_model(few_shot, "Explain regularization in ML.") 
    
    print(answer)

    Let’s pass a question to the model. 

    Output:

    Regularization adds a penalty to the loss for model complexity to reduce overfitting. Common forms include L1 (lasso) promoting sparsity and L2 (ridge) shrinking weights; elastic net blends. 

    Great! We have used a much cheaper model (gpt-5-nano) to get our output, but surely we can’t use the cheaper model for every task.  

    2. Use Models according to the task

    The idea here is to use a smaller model for routine tasks and reserve the larger models for complex reasoning. So how do we do this? We define a classifier that rates how complex a question is and route queries accordingly. This helps us save costs on routine requests.  

    Example: 

    from openai import OpenAI 
    
    client = OpenAI() 
    
    def get_complexity(question): 
        # Ask the model to rate how hard the question is (1 = trivial, 10 = very complex)
        prompt = f"Rate the complexity of the question from 1 to 10 for an LLM to answer. Provide only the number.\nQuestion: {question}" 
    
        res = client.chat.completions.create( 
            model="gpt-5.1", 
            messages=[{"role": "user", "content": prompt}], 
        ) 
    
        return int(res.choices[0].message.content.strip()) 
    
    print(get_complexity("Explain convolutional neural networks"))

    Output:

    4 

    So our classifier says the complexity is 4. Don't worry about the extra LLM call, as it generates only a single number. This complexity score can then be used to route the task: for example, if the complexity is below 7, route to a smaller model, otherwise use a larger one, as sketched below.   
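
    Here is a minimal routing sketch built on the get_complexity function above. The threshold of 7 and the two model names are assumptions you should tune for your own workload.

    def route_and_answer(question, threshold=7): 
        # Route cheap questions to the small model, hard ones to the large model
        model = "gpt-5-nano" if get_complexity(question) < threshold else "gpt-5.1" 
        response = client.chat.completions.create( 
            model=model, 
            messages=[{"role": "user", "content": question}] 
        ) 
        return model, response.choices[0].message.content 
    
    model_used, answer = route_and_answer("Explain convolutional neural networks") 
    print(model_used, answer)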

    3. Using Prompt Caching

    If your LLM system reuses bulky system instructions or lots of few-shot examples across many calls, make sure to place them at the start of your messages so that OpenAI's prompt caching can reuse the shared prefix across requests.

    Few important points here: 

    • Ensure the prefix is exactly identical across requests (every character, whitespace included). 
    • According to OpenAI, supported models automatically benefit from caching, but the prompt has to be longer than 1,024 tokens. 
    • Requests that hit the prompt cache report a cached_tokens value as part of the response; the sketch below shows where to find it. 
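
    Here is a minimal sketch for checking whether a request benefited from caching. It assumes a long (>1,024 token), perfectly identical system prompt that you reuse across requests (LONG_SYSTEM_PROMPT is a placeholder); the cached token count is reported under usage.prompt_tokens_details in recent versions of the OpenAI Python SDK.

    from openai import OpenAI 
    
    client = OpenAI() 
    
    # Placeholder: a long, static system prompt reused verbatim across calls
    LONG_SYSTEM_PROMPT = "..." 
    
    response = client.chat.completions.create( 
        model="gpt-5.1", 
        messages=[ 
            {"role": "system", "content": LONG_SYSTEM_PROMPT},   # static prefix first 
            {"role": "user", "content": "Summarize today's support tickets."}  # variable part last 
        ] 
    ) 
    
    # Usage details report how many prompt tokens were served from the cache
    usage = response.usage 
    print("prompt tokens:", usage.prompt_tokens) 
    print("cached tokens:", usage.prompt_tokens_details.cached_tokens)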

    4. Use the Batch API for Tasks that can wait

    Many tasks don't require immediate responses, and this is where we can use the asynchronous Batch API for inference. You submit a file of requests and give OpenAI up to 24 hours to process them; in return, token costs are roughly 50% lower than for the usual synchronous API calls. 
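
    Here is a sketch of the Batch API flow: write the requests to a JSONL file, upload it, and create a batch job with a 24-hour completion window. The file name and custom_id values are placeholders for illustration.

    import json 
    from openai import OpenAI 
    
    client = OpenAI() 
    
    # 1. Write one request per line in JSONL format (custom_id identifies each request)
    requests = [ 
        { 
            "custom_id": f"req-{i}", 
            "method": "POST", 
            "url": "/v1/chat/completions", 
            "body": {"model": "gpt-5-nano", 
                     "messages": [{"role": "user", "content": q}]} 
        } 
        for i, q in enumerate(["What is overfitting?", "What is gradient descent?"]) 
    ] 
    with open("batch_requests.jsonl", "w") as f: 
        f.write("\n".join(json.dumps(r) for r in requests)) 
    
    # 2. Upload the file and create the batch job
    batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch") 
    batch = client.batches.create( 
        input_file_id=batch_file.id, 
        endpoint="/v1/chat/completions", 
        completion_window="24h" 
    ) 
    print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)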

    5. Trim the Outputs with max_tokens and stop Parameters

    What we're trying to do here is stop uncontrolled token generation. Let's say you need a 75-word summary or a specific JSON object; don't let the model keep generating unnecessary text. Instead, we can make use of the max_tokens and stop parameters:

    Example:

    from openai import OpenAI 
    client = OpenAI() 
    
    response = client.chat.completions.create( 
        model="gpt-5.1", 
        messages=[ 
            { 
                "role": "system", 
                "content": "You are a data extractor. Output only raw JSON." 
            } 
        ], 
        max_tokens=100,          # hard cap on the number of completion tokens 
        stop=["\n\n", "}"]       # cut generation as soon as a delimiter appears 
    )

    We have set max_tokens as 100 as it’s roughly 75 words.  

    6. Make Use of RAG

    Instead of flooding the context window, we can use Retrieval-Augmented Generation (RAG). The knowledge base is converted into embeddings and stored in a vector database. When a user queries, the entire knowledge base is not stuffed into the context window; only the top few most relevant retrieved chunks are passed as context. A minimal sketch follows the figure below.  

    RAG System Architecture
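
    Below is a minimal retrieval sketch using OpenAI embeddings and an in-memory list instead of a real vector database. The sample documents, the text-embedding-3-small model choice, and the top_k value are assumptions for illustration.

    import numpy as np 
    from openai import OpenAI 
    
    client = OpenAI() 
    
    docs = [ 
        "Embeddings turn text into numeric vectors.", 
        "Regularization adds a penalty to reduce overfitting.", 
        "Prompt caching discounts repeated prompt prefixes." 
    ] 
    
    def embed(texts): 
        # One embedding vector per input text
        res = client.embeddings.create(model="text-embedding-3-small", input=texts) 
        return np.array([d.embedding for d in res.data]) 
    
    doc_vectors = embed(docs)  # in a real system these live in a vector database 
    
    def retrieve(query, top_k=2): 
        # Cosine similarity between the query and every stored chunk
        q = embed([query])[0] 
        sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)) 
        return [docs[i] for i in np.argsort(sims)[::-1][:top_k]] 
    
    context = "\n".join(retrieve("How do I stop my model from overfitting?")) 
    answer = client.chat.completions.create( 
        model="gpt-5-nano", 
        messages=[{"role": "user", 
                   "content": f"Context:\n{context}\n\nQuestion: How do I stop overfitting?"}] 
    ) 
    print(answer.choices[0].message.content)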

    7. Always Manage the Conversation History

    Here our focus is on the conversation history, where we pass older inputs and outputs back to the model. Instead of appending every turn forever, we can implement a "sliding window" approach.  

    We drop the oldest messages once the context gets too long (set a threshold), or summarize previous turns into a single system message before continuing. Keeping the active context window short is crucial for long-running sessions. A sliding-window sketch is shown below, followed by a summarization helper. 
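
    A minimal sliding-window sketch. The max_messages threshold is an assumption; a token-based threshold (for example, counted with the tiktoken library) would be more precise.

    def sliding_window(messages, max_messages=8): 
        # Always keep the system prompt, then only the most recent turns
        system = [m for m in messages if m["role"] == "system"] 
        turns = [m for m in messages if m["role"] != "system"] 
        return system + turns[-max_messages:] 
    
    history = [{"role": "system", "content": "You are a helpful assistant."}] 
    # ... after many turns, trim before each new request: 
    trimmed = sliding_window(history + [{"role": "user", "content": "New question"}])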

    Function for summarization 

    from openai import OpenAI 
    
    client = OpenAI() 
    
    SYSTEM_PROMPT = "You are a concise assistant. Summarize the chat history in 30-40 words." 
    
    def summarize_chat(history_text): 
        # Compress the older turns into a short summary
        response = client.chat.completions.create( 
            model="gpt-5.1", 
            messages=[ 
                {"role": "system", "content": SYSTEM_PROMPT}, 
                {"role": "user", "content": history_text} 
            ] 
        ) 
        return response.choices[0].message.content

    Inference 

    chat_history = """ 
    
    User: Hi, I'm trying to understand how embeddings work. 
    Assistant: Embeddings turn text into numeric vectors. 
    
    User: Can I use them for similarity search?
    Assistant: Yes, that’s a common use case. 
    
    User: Nice, show me simple code. 
    Assistant: Sure, here's a short example... 
    
    """ 
    
    summary = summarize_chat(chat_history) 
    print(summary)

    Output:

    User asked what embeddings are; assistant explained they convert text to numeric vectors. User then asked about using embeddings for similarity search; assistant confirmed and provided a short example code snippet demonstrating basic similarity search. 

    We now have a summary which can be added to the model’s context window when the input tokens are above a defined threshold. 

    8. Upgrade to Efficient Model Modes

    OpenAI frequently releases optimized versions of their models. Always check for newer "Mini" or "Nano" variants of the latest models. These are built for efficiency, often delivering similar performance on certain tasks at a fraction of the cost.

    Upgrade options to efficient models

    9. Enforce Structured Outputs (JSON)

    When you need data extracted or formatted, defining a strict schema forces the model to cut unnecessary tokens and return only the exact data fields requested. Denser responses mean fewer generated tokens on your bill. 

    Imports and Structure Definition  

    from openai import OpenAI 
    
    import json 
    
    client = OpenAI() 
    
    prompt = """ 
    You are an extraction engine. Output ONLY valid JSON. 
    No explanations. No natural language. No extra keys. 
    
    Extract these fields: 
    
    - title (string)
    - date (string, format: YYYY-MM-DD) 
    - entities (array of strings) 
    
    Text: 
    
    "On 2025-12-05, OpenAI introduced Structured Outputs, allowing developers to enforce strict JSON schemas. This improved reliability was welcomed by many engineers." 
    
    Return JSON in this exact format: 
    
    { 
     "title": "", 
     "date": "", 
     "entities": [] 
    } 
    
    """

    Inference 

    response = client.chat.completions.create( 
       model="gpt-5.1", 
       messages=[{"role": "user", "content": prompt}] 
    ) 
    
    data = response.choices[0].message.content 
    
    json_data = json.loads(data) 
    
    print(json_data)

    Output:

    {'title': 'OpenAI Introduces Structured Outputs', 'date': '2025-12-05', 'entities': ['OpenAI', 'Structured Outputs', 'JSON', 'developers', 'engineers']} 

    As we can see, only the required dictionary with the requested fields is returned, and the output is neatly structured as key-value pairs. 
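
    If you want a stronger guarantee than prompt instructions alone, the Chat Completions API also accepts a response_format parameter. The sketch below (reusing the client and json import from above) uses JSON mode to make the model emit valid JSON; check the current OpenAI docs for which models support it and for the stricter json_schema variant.

    response = client.chat.completions.create( 
        model="gpt-5.1", 
        response_format={"type": "json_object"},  # ask the API for valid JSON output 
        messages=[ 
            {"role": "system", "content": "Extract title, date (YYYY-MM-DD) and entities as JSON."}, 
            {"role": "user", "content": "On 2025-12-05, OpenAI introduced Structured Outputs..."} 
        ] 
    ) 
    print(json.loads(response.choices[0].message.content))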

    10. Cache Queries

    Unlike the prompt caching idea from earlier, this caching happens entirely on your side. If users frequently ask the exact same questions, cache the LLM's response in your own database and check that cache before calling the API. A cached response is faster for the user and practically free. If you're working with LangGraph for agents, you can also explore node-level caching: Caching in LangGraph. A minimal sketch follows.
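
    A minimal in-memory response cache, assuming exact-match queries; a production version would use a persistent store (for example, Redis) and normalize or fuzzily match queries.

    import hashlib 
    
    response_cache = {}  # maps a query hash to a previously generated answer 
    
    def cached_answer(question): 
        key = hashlib.sha256(question.strip().lower().encode()).hexdigest() 
        if key in response_cache:            # cache hit: no API call, no cost 
            return response_cache[key] 
        response = client.chat.completions.create( 
            model="gpt-5-nano", 
            messages=[{"role": "user", "content": question}] 
        ) 
        answer = response.choices[0].message.content 
        response_cache[key] = answer         # store for next time 
        return answer 
    
    print(cached_answer("What is overfitting?"))  # first call hits the API 
    print(cached_answer("What is overfitting?"))  # second call is served from the cache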

    Conclusion 

    Building with LLMs is powerful, but scale can quickly make them expensive, so understanding the cost equation becomes essential. By applying the right mix of model routing, caching, structured outputs, RAG, and efficient context management, we can significantly slash inference costs. These techniques help maintain the quality of the system while keeping overall LLM usage practical and cost-effective. Don't forget to check the billing dashboard after implementing each technique.

    Frequently Asked Questions 

    Q1. What is a token in the context of LLMs?

    A. A token is a small unit of text, where roughly 1,000 tokens correspond to about 750 words. 

    Q2. Why are completion tokens more costly than prompt tokens?

    A. Because output tokens (from the model) are often several times more expensive per token than input (prompt) tokens. 

    Q3. What is the “context window” and why does it matter for cost?

    A. The context window is the short-term memory (previous inputs and outputs) sent to the model; a longer context increases token usage and thus cost. 


    Mounish V

    Passionate about technology and innovation, a graduate of Vellore Institute of Technology. Currently working as a Data Science Trainee, focusing on Data Science. Deeply interested in Deep Learning and Generative AI, eager to explore cutting-edge techniques to solve complex problems and create impactful solutions.
