AI Summary - 20-sec read - Reviewed by experts
- This is about running an LLM feature cheaply, not what it costs to build one. The bill is paid per request, forever, so small per-call savings compound fast.
- The levers that actually move the token bill: prompt caching, model routing or a cascade, trimming context and retrieved chunks, capping output tokens, batching where latency allows, and semantic caching of repeated answers.
- Most teams overpay because they run one frontier model for every call - classification, routing, and easy questions do not need frontier prices. Right-size the model per task.
- You cannot optimise what you do not measure. Track cost-per-request and you will find the expensive 5 percent of calls that drive most of the bill.
- Short on time? Book a free call.
The feature shipped, people actually use it, and that is the problem. Your LLM bill is climbing faster than your usage chart, finance has started asking pointed questions, and the honest answer is that nobody on the team has looked at where the tokens go. A demo that cost a few dollars a day quietly became a four-figure monthly line, and doubling your users would double a number that already hurts. This is the part of the job that the build-cost articles skip: not what an AI feature costs to make, but what it costs to keep running once real traffic hits it.
This is an engineering runbook, not a strategy memo. Every lever below is something a team can ship this week, and they are ordered roughly by how much they move the bill for the least effort. The token bill is paid per request and never stops, so a saving that looks small per call - a few cents here, a halved context there - compounds across millions of requests into the difference between a feature that pays for itself and one finance wants killed.
First, measure cost-per-request or you are guessing
You cannot cut a bill you cannot see. Before any optimisation, instrument every call so you log the model used, input tokens, output tokens, and the computed dollar cost, tagged by feature and route. Do this first because it changes what you work on. Almost every LLM workload follows the same shape: a small fraction of requests - often the longest contexts, the retries, the run-on generations - drives most of the spend. Find that expensive 5 percent and you optimise the calls that matter instead of micro-tuning the cheap ones. We walk through this measurement discipline in our breakdown of what an AI agent costs to build and run, where the run cost is exactly the line teams forget to track.
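As a rough illustration of that instrumentation, the sketch below logs one record per call and reports how much of the spend the most expensive slice of calls drives. The model names, the prices in `PRICES`, and the `CallRecord` fields are placeholders invented for this example, not any provider's real rates or schema.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens; substitute your provider's real rates.
PRICES = {
    "small-model": {"input": 0.20, "output": 0.80},
    "frontier-model": {"input": 3.00, "output": 15.00},
}


@dataclass
class CallRecord:
    feature: str        # which product feature made the call
    route: str          # which code path or endpoint
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        """Dollar cost of this single call."""
        price = PRICES[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000


def top_share(records: list[CallRecord], fraction: float = 0.05) -> float:
    """Share of total spend driven by the most expensive `fraction` of calls."""
    costs = sorted((r.cost for r in records), reverse=True)
    total = sum(costs)
    if total == 0:
        return 0.0
    top_n = max(1, round(len(costs) * fraction))
    return sum(costs[:top_n]) / total
```

Grouping the same records by `feature` and `route` is the natural next step, since it shows which part of the product owns the expensive calls.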
Prompt caching: stop paying for the same prefix twice
If every request sends the same long system prompt, tool definitions, or reference document, you are paying full input price for tokens that never change. Prompt caching lets the provider reuse a cached prefix at a steep discount - cached input tokens cost a fraction of fresh ones. The trick is structure: put everything stable at the front of the prompt - instructions, schemas, examples, the retrieved document that does not change within a session - and the variable user input last. For a chat or agent loop that replays a growing history on every turn, caching the static prefix can take a large slice off the input bill for almost no engineering effort. The only discipline it asks is prompt hygiene: keep the cacheable part byte-for-byte identical, because a single changed token at the front busts the cache.
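The prompt hygiene described above can be sketched without tying it to any one provider. The example assembles the stable material first and the user input last, and exposes a fingerprint of the prefix to log per request, so a silent change to the cacheable part shows up in your metrics. `SYSTEM_PROMPT`, `TOOL_SCHEMAS`, and the function names are hypothetical. How caching is switched on differs by provider and is not shown here.

```python
import hashlib
import json

# Hypothetical stable material: identical on every request.
SYSTEM_PROMPT = "You are a support assistant for Acme. Answer briefly."
TOOL_SCHEMAS = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]


def stable_prefix() -> str:
    """Serialise the unchanging parts deterministically so the bytes never drift."""
    # sort_keys and fixed separators keep the JSON byte-for-byte identical across calls.
    tools = json.dumps(TOOL_SCHEMAS, sort_keys=True, separators=(",", ":"))
    return f"{SYSTEM_PROMPT}\n\nTOOLS:\n{tools}\n\n"


def build_prompt(user_input: str) -> str:
    """Stable prefix first, variable user input last."""
    return stable_prefix() + f"USER:\n{user_input}\n"


def prefix_fingerprint() -> str:
    """Hash to log with each request; a changed value means the cached prefix was busted."""
    return hashlib.sha256(stable_prefix().encode("utf-8")).hexdigest()
```

The common way to break this by accident is to interpolate something variable, such as a timestamp or a user name, into the system prompt. Anything that varies belongs after the prefix.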
Model routing and cascades: do not pay frontier prices for easy work
This is usually the single biggest lever. Running one frontier model for every request means you pay top rates for the simple turns - the greetings, the classifications, the questions a small model answers perfectly well - to cover the hard ones. A cascade fixes that. Send every request to a cheap, fast model first; if it answers confidently, you are done. If the task is genuinely hard, or the small model signals low confidence, escalate to the big model. In a typical support or assistant workload, a cascade that resolves 80 percent of traffic on a model 15 times cheaper and escalates only the hard 20 percent can cut the model bill by more than half while the user notices nothing - because the easy questions never needed the expensive brain.
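A minimal sketch of the cascade, with the arithmetic behind the figures above. The `Answer` type, the `confidence` field, and the 0.8 threshold are assumptions for illustration. How you derive a confidence signal (a self-rating, a verifier, log-probabilities) depends on your stack. The cost formula also assumes both models see similar token counts per request.

```python
from typing import Callable, NamedTuple


class Answer(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0, however your system estimates it


Model = Callable[[str], Answer]


def cascade(prompt: str, cheap: Model, strong: Model,
            threshold: float = 0.8) -> tuple[Answer, str]:
    """Try the cheap model first; escalate only when it is not confident."""
    first = cheap(prompt)
    if first.confidence >= threshold:
        return first, "cheap"
    return strong(prompt), "strong"


def relative_cost(escalation_rate: float, price_ratio: float) -> float:
    """Cascade cost as a fraction of the all-frontier baseline.

    Every request pays the cheap model (1 / price_ratio of the frontier price);
    escalated requests pay the frontier model on top of that.
    """
    return 1.0 / price_ratio + escalation_rate
```

With 20 percent escalation and a 15x price gap, `relative_cost(0.2, 15)` comes to about 0.27 of the baseline. That is consistent with "more than half", with headroom for a higher escalation rate or a smaller price gap.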
Right-size the model per task, not per product
Routing is not only about hard versus easy. Different jobs in the same product have different needs. Classification, intent detection, routing decisions, short extractions, and tagging are tasks where a small model is both cheaper and faster, and where frontier quality buys you nothing measurable. Reserve the expensive model for open-ended generation, multi-step reasoning, and the final user-facing answer where quality is visible. The mistake is picking one model for the whole product; the fix is picking the smallest model that passes your eval for each distinct task. If you are weighing the model question more broadly, our analysis of self-hosted LLMs versus API-based LLMs covers when owning the model changes this maths.
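"The smallest model that passes your eval" can be expressed as a short selection rule. The model names, scores, and pass marks below are invented for illustration. In practice the scores come from your own eval suite, run per task.

```python
def pick_model(eval_scores: dict[str, float], cheapest_first: list[str],
               pass_mark: float) -> str:
    """Return the cheapest model whose eval score for this task meets the bar."""
    for model in cheapest_first:
        if eval_scores.get(model, 0.0) >= pass_mark:
            return model
    # Nothing passed: fall back to the most capable (last) model.
    return cheapest_first[-1]


# Hypothetical models, ordered cheapest first, with made-up eval scores per task.
MODELS = ["small", "medium", "frontier"]
SCORES = {
    "classify": {"small": 0.97, "medium": 0.98, "frontier": 0.98},
    "final_answer": {"small": 0.71, "medium": 0.84, "frontier": 0.93},
}

classify_model = pick_model(SCORES["classify"], MODELS, pass_mark=0.95)
answer_model = pick_model(SCORES["final_answer"], MODELS, pass_mark=0.90)
```

With these numbers classification lands on the small model and the final answer stays on the frontier one, which is the per-task split the section argues for.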
Trim the context: RAG bloat is the silent cost
Retrieval is where input tokens quietly balloon. A RAG pipeline that stuffs the top 20 chunks into every prompt "to be safe" is paying for thousands of tokens the model mostly ignores, on every single query. The fix is to retrieve fewer, better chunks: rerank what the vector search returns and pass only the top few that actually answer the question, rather than everything that scored above a loose threshold. Cut the chunk size, drop near-duplicate passages, and stop sending boilerplate the model does not need. Context trimming that halves input tokens per query - going from a lazy top-20 to a reranked top-4 - is one of the most common big wins we see, and it usually improves answer quality too because the model is not distracted by irrelevant text. The cost shape of all this is laid out in what a RAG app on AWS really costs at a million queries a month.
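A sketch of the trimming step, assuming each retrieved chunk already carries a score from a reranker. The word-overlap (Jaccard) test for near-duplicates is a deliberately crude stand-in chosen for brevity. `trim_context`, `top_k`, and `max_overlap` are hypothetical names and defaults.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Word-set overlap between two passages, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def trim_context(chunks: list[tuple[str, float]], top_k: int = 4,
                 max_overlap: float = 0.8) -> list[str]:
    """Keep the top_k highest-scoring chunks, skipping near-duplicates.

    `chunks` holds (text, rerank_score) pairs; higher scores are better.
    """
    kept: list[str] = []
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        words = set(text.lower().split())
        if any(jaccard(words, set(k.lower().split())) >= max_overlap for k in kept):
            continue  # almost the same passage is already in the prompt
        kept.append(text)
        if len(kept) == top_k:
            break
    return kept
```

Because an aggressive `top_k` can drop the one chunk that held the answer, log answer quality alongside token counts whenever you lower it.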
Cap output tokens and stop run-on generations
Output tokens usually cost more per token than input, and an unbounded generation is a blank cheque. Set a sensible max-output-tokens limit for each call so a model that decides to write an essay cannot run up the bill, and use stop sequences to end generation the moment the useful answer is complete. Ask for the format you actually need - a short JSON object, a one-line classification, a two-sentence summary - rather than letting the model pad. A surprising share of run-on cost comes from prompts that never told the model to be brief, so it filled the space it was given. This is cheap to fix and the savings land on the most expensive tokens on the bill.
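One way to make the cap explicit is a per-task table of output limits, plus a helper that prices the worst case. The task names, budgets, and stop sequences are illustrative assumptions, and the keys `max_output_tokens` and `stop` are generic labels to map onto whatever your SDK calls these parameters.

```python
# Hypothetical per-task output limits; tune these against your own traffic.
OUTPUT_LIMITS = {
    "classify": {"max_output_tokens": 8, "stop": ["\n"]},
    "summarise": {"max_output_tokens": 120, "stop": ["\n\n"]},
    "answer": {"max_output_tokens": 600, "stop": []},
}
DEFAULT_LIMIT = {"max_output_tokens": 256, "stop": []}


def limits_for(task: str) -> dict:
    """Output cap and stop sequences to attach to a call for this task."""
    return OUTPUT_LIMITS.get(task, DEFAULT_LIMIT)


def worst_case_output_cost(task: str, output_price_per_million: float) -> float:
    """Upper bound on output spend for one call, in dollars."""
    return limits_for(task)["max_output_tokens"] * output_price_per_million / 1_000_000
```

At an assumed output price of 15 dollars per million tokens, a capped `answer` call can cost at most 0.009 dollars in output. The cap turns the blank cheque into a bounded one.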
Takeaways
- Measure cost-per-request first. A small slice of calls drives most of the bill - find it before you optimise anything.
- Two levers move the most money: a cheap-model-first cascade for easy traffic, and prompt caching for the stable prefix you send every call.
- Trim retrieved context. A reranked top-4 beats a lazy top-20 on both cost and quality - RAG bloat is the silent line item.
- Cap output tokens, cache repeated answers, and batch where latency allows. Each is cheap to ship and compounds across every request.
Semantic caching: answer the same question once
In most real workloads users ask the same things in slightly different words. A semantic cache stores past question-and-answer pairs and, when a new query is close enough in meaning to one already answered, returns the cached answer instead of calling the model at all. A cache hit costs effectively nothing in inference, so for a FAQ-shaped or support workload where a chunk of traffic is repetitive, semantic caching can remove a meaningful share of calls from the bill entirely. The honest trade-off is staleness: a cached answer can go out of date when the underlying facts change, so put a time-to-live on entries, scope the cache to content that is stable, and invalidate it when the source data updates. Used carelessly it serves yesterday's answer; used with a sensible expiry it is close to free money.
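The mechanics fit in a few lines. This sketch assumes you supply an `embed` function that turns text into a vector. The similarity threshold, the TTL, and the linear scan over entries are simplifications for illustration, and a production cache would normally sit on a vector index.

```python
import math
import time
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.9,
                 ttl_seconds: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: list[tuple[list[float], str, float]] = []  # (vector, answer, stored_at)

    def get(self, query: str) -> Optional[str]:
        """Return a cached answer if a live entry is close enough in meaning."""
        now = self.clock()
        # Drop expired entries so stale answers are never served.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer, self.clock()))

    def invalidate(self) -> None:
        """Call when the source data changes."""
        self.entries.clear()
```

On a miss, call the model as usual and `put` the result. Set the threshold conservatively at first, because a false hit returns a confident answer to a different question.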
Batch and go async where latency allows
Not every LLM call is a user waiting on a screen. Background work - generating embeddings, summarising documents overnight, enriching records, running evals - does not need a real-time response, and several providers price batch and asynchronous processing well below interactive rates. Move everything that can tolerate delay onto a batch or async path and you pay the lower rate for a large slice of your volume. The discipline is simply to separate the two: keep the interactive path lean and fast, and sweep the rest into off-peak batch jobs where the cheaper pricing and higher throughput do the work.
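The separation can start as a flag on each job plus a quick estimate of what the split is worth. The `Job` fields and the 50 percent `batch_discount` are assumptions for this sketch, so check the actual batch pricing of the provider you use.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    needs_realtime: bool   # is a user waiting on this response?
    est_cost: float        # dollars at interactive rates


def split_paths(jobs: list[Job]) -> tuple[list[Job], list[Job]]:
    """Partition jobs into (interactive, batch) by latency requirement."""
    interactive = [j for j in jobs if j.needs_realtime]
    batch = [j for j in jobs if not j.needs_realtime]
    return interactive, batch


def blended_cost(jobs: list[Job], batch_discount: float = 0.5) -> float:
    """Total spend once delay-tolerant jobs move to the discounted batch path."""
    interactive, batch = split_paths(jobs)
    return (sum(j.est_cost for j in interactive)
            + sum(j.est_cost for j in batch) * (1.0 - batch_discount))
```

For a workload of 10 dollars of chat traffic and 10 dollars of background jobs, the blended cost under this assumed discount is 15 dollars instead of 20.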
The honest trade-offs
None of these levers is free of risk, and pretending otherwise is how teams get burned. A cheap model in a cascade will sometimes get a hard question wrong, so you need a confidence signal and a clean escalation path, plus evals that catch quality regressions before users do. Semantic caching can serve a stale answer if the facts moved, so scope it to stable content and expire entries. Aggressive context trimming can drop the one chunk that held the answer, so measure answer quality as you cut, not just token count. The right way to apply this runbook is to ship one lever at a time behind your cost and quality dashboards, confirm the bill dropped and the evals held, then move to the next. The teams that get this right treat cost optimisation as an ongoing part of AI development services rather than a one-off cleanup, and where the model itself needs to change they reach for the right approach - sometimes that means fine-tuning a smaller model so it can do a job that used to need a frontier one. The same discipline runs through how we build and run custom AI agents and how we cost workloads on AWS.
FAQ
What is the single biggest lever to reduce LLM cost?
For most teams it is model routing - running a cheap, fast model on the easy majority of traffic and escalating only the hard cases to a frontier model. A cascade that resolves 80 percent of requests on a model many times cheaper can cut the model bill by more than half while users notice no drop in quality, because the easy questions never needed the expensive model. Pair it with prompt caching for the second biggest win.
Does prompt caching really save money?
Yes, when you have a stable prefix you send on every call - a long system prompt, tool definitions, or a reference document. Cached input tokens cost a fraction of fresh ones, so an agent or chat loop that replays the same instructions every turn can take a large slice off its input bill. The requirement is keeping the cacheable part identical byte-for-byte, because any change at the front of the prompt busts the cache.
How does retrieved context drive up my token bill?
A RAG pipeline that stuffs the top 20 chunks into every prompt pays for thousands of input tokens the model mostly ignores, on every query, forever. Reranking and passing only the few chunks that actually answer the question - a reranked top-4 instead of a lazy top-20 - often halves input tokens per query and improves answer quality at the same time, because the model is not distracted by irrelevant text.
Is it safe to use a cheaper model to save cost?
It is safe when you route deliberately and measure. Use small models for classification, routing, and easy answers where frontier quality buys nothing, and keep the expensive model for hard reasoning and final user-facing answers. The risk is a cheap model getting a hard question wrong, so you need a confidence signal, a clean escalation path, and evals that catch regressions before users do.
The bottom line: running an LLM in production is a per-request bill that never stops, which is exactly why small per-call savings are worth chasing. Measure cost-per-request first, then ship the levers in order - cascade the easy traffic to a cheap model, cache the stable prefix, trim the retrieved context, cap the output, and sweep background work into batch. Do it one lever at a time behind your cost and quality dashboards, and a bill that was scaling faster than usage becomes a flat, predictable line you control.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
