How to Reduce LLM Costs by 40% in 24 Hours (2026)

ScaleMind Editorial Team

An AI gateway is the fastest path to LLM cost optimization, but you can start cutting costs in minutes with these 5 tested strategies in 2026.

TL;DR

If you’re spending over $1k/month on LLM APIs without optimization, you’re overpaying by at least 40%. This guide covers five strategies you can implement today:

| Strategy | Time to Implement | Expected Savings |
|---|---|---|
| Prompt Caching | 10 minutes | 50-90% on cached tokens |
| Model Routing | 2-4 hours | 20-60% |
| Semantic Caching | 1-2 hours | 15-30% |
| Batch Processing | 30 minutes | 50% on async workloads |
| AI Gateway | 5 minutes | 40-70% combined |

Why Are LLM Costs Spiraling Out of Control?

LLM costs scale linearly with usage, but most teams don’t notice the bleeding until it’s too late. The problem isn’t model pricing; it’s routing every request to the most expensive model regardless of task complexity.

Here’s the current cost landscape for 1 million input tokens (December 2025):

| Model Family | Model | Cost / 1M Input | Best For |
|---|---|---|---|
| Frontier | GPT-4o | $2.50 | Complex reasoning, coding |
| Frontier | Claude 3.5 Sonnet | $3.00 | Nuanced writing, RAG |
| Efficient | GPT-4o Mini | $0.15 | Summarization, classification |
| Efficient | Claude 3.5 Haiku | $0.80 | Speed-critical tasks |
| Ultra-Low | Gemini 1.5 Flash-8B | $0.0375 | Bulk processing, extraction |

Sources: OpenAI Pricing, Anthropic Pricing, Google AI Pricing

The math is brutal: processing 100M tokens/month with Claude 3.5 Sonnet costs ~$300 in input tokens alone. Route 50% of that traffic to Gemini 1.5 Flash-8B and that portion drops from $150 to $1.88. The savings compound fast.
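
A quick back-of-the-envelope check, using the prices from the table above:

# Input-token cost at 100M tokens/month, prices in $/1M input from the table above
tokens_m = 100
sonnet = 3.00        # Claude 3.5 Sonnet
flash_8b = 0.0375    # Gemini 1.5 Flash-8B

all_sonnet = tokens_m * sonnet                 # $300.00
half_routed = 50 * sonnet + 50 * flash_8b      # $150.00 + $1.88 = $151.88
print(f"All Sonnet: ${all_sonnet:.2f} | 50% routed: ${half_routed:.2f}")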

This guide is for engineering teams ready to stop the bleeding today, not next quarter.


How Does Prompt Caching Reduce LLM Costs?

Prompt caching stores frequently-used context (system prompts, RAG documents, few-shot examples) so you don’t pay full price every time you resend them. Both OpenAI and Anthropic offer native prompt caching with substantial discounts.

When it works best:

  • Chatbots - System prompt + conversation history sent repeatedly
  • RAG applications - Same documents analyzed across multiple queries
  • Coding assistants - Full codebase context included with every request

Implementation: Anthropic Prompt Caching

import anthropic

client = anthropic.Anthropic()

# Before: Paying full price for the same 10k-token context every request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a legal analyst assistant. " + large_context,
    messages=[{"role": "user", "content": query}]
)

# After: 90% discount on cached tokens
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal analyst assistant.",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        },
        {
            "type": "text",
            "text": large_context,  # Your 10k-token document
            "cache_control": {"type": "ephemeral"}  # Cached at 90% discount
        }
    ],
    messages=[{"role": "user", "content": query}]
)

# Check your savings
print(f"Cache Creation: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache Read: {response.usage.cache_read_input_tokens} tokens (90% off)")

Savings breakdown:

| Provider | Cache Discount | Notes |
|---|---|---|
| Anthropic | ~90% on reads | Explicit cache_control block required |
| OpenAI | ~50% on reads | Automatic for prompts of 1,024+ tokens |

Time to implement: 10 minutes for existing prompts.

See the Anthropic Prompt Caching docs for minimum token requirements and TTL details.
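
As a rough sanity check, you can turn the usage fields printed above into an effective cost. This is a sketch, assuming Claude 3.5 Sonnet’s published rates (base input $3.00/1M, cache writes billed at 1.25x, cache reads at 0.1x); the estimate_cached_cost helper is illustrative, so verify current pricing before trusting its output.

def estimate_cached_cost(usage, base_price_per_mtok: float = 3.00) -> float:
    # Assumed multipliers: cache writes cost 1.25x base input, cache reads 0.1x
    write_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached_tokens = usage.input_tokens  # tokens billed at the base rate
    return (
        uncached_tokens * base_price_per_mtok
        + write_tokens * base_price_per_mtok * 1.25
        + read_tokens * base_price_per_mtok * 0.10
    ) / 1_000_000

print(f"Approx. input cost for this call: ${estimate_cached_cost(response.usage):.4f}")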


How Does Model Routing Cut Costs Without Sacrificing Quality?

Model routing directs each request to the cheapest model capable of handling it. Using GPT-4o for password reset instructions is like hiring a PhD to answer the phone: expensive and unnecessary.

The routing principle:

| Task Type | Use This Model | Why |
|---|---|---|
| Creative writing, complex reasoning, coding | GPT-4o, Claude 3.5 Sonnet | Requires frontier intelligence |
| Summarization, classification, extraction | GPT-4o Mini, Haiku | 10-20x cheaper, comparable quality for these tasks |
| Bulk data processing | Gemini Flash, open-source | Sub-penny per request |

Implementation: Basic Complexity Router

from openai import OpenAI

client = OpenAI()


def classify_complexity(prompt: str) -> str:
    """
    Simple heuristic router. Production systems often use:
    - A small classifier model (BERT, DistilBERT)
    - Keyword/regex matching
    - Token count thresholds
    """
    complexity_signals = ["code", "reason", "analyze", "compare", "debug"]
    
    if len(prompt) > 2000:
        return "complex"
    if any(signal in prompt.lower() for signal in complexity_signals):
        return "complex"
    return "simple"


def route_request(prompt: str):
    complexity = classify_complexity(prompt)
    
    if complexity == "simple":
        # 94% cheaper than GPT-4o
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
    else:
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )


# Real-world scenario:
# - 70% of requests are simple (FAQ, summaries, classification)
# - 30% require frontier models
# 
# Cost calculation:
# Before: 100% at $2.50 = $2.50 avg
# After: (0.7 × $0.15) + (0.3 × $2.50) = $0.855 avg
# Savings: 66%

Expected savings: 20-60% depending on your traffic mix. Most B2B apps see 60-70% of requests routable to efficient models.


What is Semantic Caching and How Does It Work?

Semantic caching uses vector embeddings to recognize that “How do I reset my password?” and “I forgot my password, help!” are the same question. Instead of hitting the LLM again, it returns the cached response at zero API cost.

Standard Redis caching only works on exact string matches. Semantic caching works on meaning.

Implementation: LangChain + Redis

from langchain.cache import RedisSemanticCache
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")  # the model whose calls get cached

# One-time setup
set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.2  # Lower = stricter matching
))

# First request: hits API, costs money, stores response
response_1 = llm.invoke("What's the refund policy?")

# Second request: semantically similar, returns from cache ($0)
response_2 = llm.invoke("How can I get my money back?")

# Third request: different enough, hits API
response_3 = llm.invoke("What products do you sell?")

When semantic caching shines:

  • Customer support bots (20-40% query overlap typical)
  • FAQ-style applications
  • Search result explanations
  • Any high-repetition use case

Expected savings: If 20% of your queries are semantically similar to earlier ones, roughly that share of requests never hits the LLM. The embedding lookup cost is negligible (~$0.02/1M tokens with text-embedding-3-small).

For a deeper dive, see our Semantic Caching for LLMs guide.


When Should You Use Batch Processing for LLM Requests?

Batch processing is the right choice for any workload that doesn’t need real-time responses. OpenAI’s Batch API offers a flat 50% discount for requests that can wait up to 24 hours; in practice, batches often finish well before the deadline.

Ideal batch candidates:

  • Nightly content generation
  • Sentiment analysis on yesterday’s support tickets
  • Bulk document summarization
  • Evaluation and testing pipelines

Implementation: OpenAI Batch API

from openai import OpenAI
client = OpenAI()

# Step 1: Create JSONL file with requests
# batch_requests.jsonl:
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}
# {"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}

# Step 2: Upload the file
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Step 3: Create the batch job (50% discount tier)
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch {batch_job.id} created. Status: {batch_job.status}")

# GPT-4o batch pricing:
# - Standard: $2.50/1M input, $10.00/1M output
# - Batch:    $1.25/1M input, $5.00/1M output
# Savings: 50% flat

Expected savings: 50% on all eligible workloads. If 30% of your LLM usage is async, that’s 15% off your total bill.

See OpenAI’s Batch API cookbook for error handling and retrieval patterns.
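
For orientation, here is a minimal polling-and-retrieval sketch; error handling is deliberately omitted, and the one-minute polling interval is an arbitrary choice.

import time

# Poll until the batch reaches a terminal state, then download the results file
while True:
    batch_job = client.batches.retrieve(batch_job.id)
    if batch_job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if batch_job.status == "completed":
    results = client.files.content(batch_job.output_file_id)
    for line in results.text.splitlines():
        print(line)  # one JSON object per request, matched back via custom_id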


What is an AI Gateway and Why Does It Matter?

An AI gateway is a proxy layer between your application and LLM providers that handles routing, caching, fallbacks, and cost optimization automatically. Instead of implementing the four strategies above separately, a gateway gives you all of them out of the box.

What an AI gateway handles:

| Feature | DIY Effort | With Gateway |
|---|---|---|
| Model routing | Custom classifier + routing logic | Config file |
| Prompt caching | Provider-specific implementation | Automatic |
| Semantic caching | Redis + embeddings + maintenance | Built-in |
| Failover (OpenAI down → Anthropic) | Complex error handling | Automatic |
| Cost tracking | Custom logging + dashboards | Real-time UI |

The tradeoff: You’re adding a dependency. The benefit is shipping faster and not maintaining infrastructure that isn’t your core product.

# Before: Direct OpenAI call (no caching, no fallback, no cost tracking)
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# After: ScaleMind gateway (same API, automatic optimization)
from scalemind import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # Routes to cheapest capable model
    messages=[{"role": "user", "content": prompt}]
)
# Automatic: caching, fallback to Anthropic if OpenAI fails, cost logging
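
For contrast, even the simplest DIY version of the failover row above takes real code. A minimal sketch, where the model names and single-retry policy are illustrative choices rather than a recommendation:

import anthropic
import openai

openai_client = openai.OpenAI()
anthropic_client = anthropic.Anthropic()

def complete_with_failover(prompt: str) -> str:
    try:
        resp = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    except openai.OpenAIError:
        # Primary provider failed; fall back to Anthropic once
        resp = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

This covers only failover; routing, caching, and cost tracking would each need their own layer.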

Expected savings: 40-70% combined, depending on workload characteristics.

Building an AI-powered frontend? Tools like Forge can generate the UI in minutes while an AI gateway handles your backend cost optimization.

Read more: What is an AI Gateway? and AI Gateway vs API Gateway: What’s the Difference?


Case Studies: Real Companies, Real Savings

Jellypod: 88% Cost Reduction

Problem: Jellypod converts newsletters into podcasts. They were using GPT-4 for every summarization task, burning cash as usage scaled.

Solution: Implemented model routing (Strategy 2) and fine-tuned a smaller Mistral model for their specific summarization task.

Result: Inference costs dropped from ~$10/1M tokens to ~$1.20/1M tokens, an 88% reduction without quality loss for their use case.

Supernormal: 80% Cost Reduction

Problem: Supernormal’s AI meeting note-taker faced spiraling costs as user growth exploded.

Solution: Moved to specialized fine-tuned infrastructure, optimized prompt context length, and implemented intelligent routing.

Result: 80% cost reduction, enabling them to scale to thousands of daily meetings without linear cost growth.

Source: Confident AI Case Study


The 24-Hour Implementation Checklist

Here’s your action plan, prioritized by effort-to-impact ratio.

Hours 0-2: Audit and Baseline

  • Export usage logs from OpenAI/Anthropic dashboards
  • Identify your top 3 most expensive prompts (longest or most frequent)
  • Calculate your current cost-per-user or cost-per-request baseline (see the sketch after this list)
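
If your usage export lands as a CSV, a few lines of pandas produce the baseline. The column names below (model, requests, cost_usd) are assumptions; adapt them to whatever your provider’s export actually contains.

import pandas as pd

usage = pd.read_csv("usage_export.csv")  # hypothetical export file

total_cost = usage["cost_usd"].sum()
total_requests = usage["requests"].sum()

print(f"Monthly spend:    ${total_cost:,.2f}")
print(f"Cost per request: ${total_cost / total_requests:.4f}")
print("Top models by spend:")
print(usage.groupby("model")["cost_usd"].sum().sort_values(ascending=False).head())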

Hours 2-4: Quick Wins (No Code)

  • Move all background jobs to Batch API (50% savings, 30 min work)
  • Switch obvious low-stakes features to gpt-4o-mini
  • Review system prompts: can any be shortened?

Hours 4-8: Code Changes

  • Enable prompt caching on all system prompts > 1,024 tokens (Anthropic) or let it auto-enable (OpenAI)
  • Set up semantic caching with Redis if you use LangChain

Hours 8-24: Routing Infrastructure

  • Build and deploy classify_complexity() router
  • Start at 30% traffic to cheaper models and monitor quality (see the rollout sketch below)
  • Increase routing percentage as confidence grows
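
A minimal rollout sketch, building on the classify_complexity() router and client from earlier; ROLLOUT_FRACTION and the route_request_v2 name are illustrative, and you would raise the fraction only as your quality metrics hold.

import random

ROLLOUT_FRACTION = 0.30  # start conservative, increase as confidence grows

def route_request_v2(prompt: str):
    # Route only a sampled slice of "simple" traffic to the cheaper model
    if classify_complexity(prompt) == "simple" and random.random() < ROLLOUT_FRACTION:
        model = "gpt-4o-mini"
    else:
        model = "gpt-4o"
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )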

What Results Can You Expect?

| Optimization Level | Strategies Implemented | Typical Savings |
|---|---|---|
| Basic | Prompt caching + batch API | 30-40% |
| Intermediate | + Model routing | 50-60% |
| Advanced | + Semantic caching + AI gateway | 60-70% |

ROI example: A startup spending $5,000/month on LLM APIs implements basic + routing optimizations. At 50% savings, that’s $30,000/year back in the budget, from one day of engineering work.


Key Takeaways

  1. Start with prompt caching. It’s 10 minutes of work for immediate savings on any repeated context.
  2. Route by complexity. Most production traffic doesn’t need GPT-4o. Build a simple classifier and start at 30% routing.
  3. Batch everything async. If it can wait 24 hours, it should use the Batch API (50% off).
  4. Semantic caching compounds. High-repetition use cases (support, FAQ, search) see 20%+ savings.
  5. Gateways handle the complexity. If you don’t want to maintain routing/caching infrastructure, tools like ScaleMind handle it automatically.

The tools to cut your bill in half exist right now. You don’t need to wait for GPT-5 to lower prices.

Try ScaleMind for automated cost optimization →


Resources