AI Summary - 20-sec read - Reviewed by experts
- Most production LLM traffic is repetitive. A large share of your calls are near-duplicates of questions the model already answered - and you pay full price for every one.
- There are two different caches. Prompt caching (from the model provider) discounts the repeated part of your prompt; semantic caching (in your own app) skips the model entirely when a new question means the same thing as a past one.
- Stack both and teams routinely cut 60-80 percent of spend on repetitive workloads. The lever is the cache hit rate, and you must measure it, not assume it.
- Caching helps most on high-volume, low-variety traffic - support FAQs, product lookups, classification. It helps least on personalised, fast-changing, or one-off answers, where a stale cache is a correctness risk.
- Short on time? We will look at your traffic, estimate your real hit rate, and tell you what caching would actually save. Book a free call.
Short on time? Book a free call.
Open your LLM bill and one uncomfortable fact is usually hiding in it: you are paying, over and over, for answers you have already generated. The same twenty support questions asked a thousand different ways. The same product spec looked up all afternoon. The model does the full, expensive work every single time, because by default it has no memory of having just answered that. Caching is how you stop paying twice for the same thought - and on repetitive workloads it is the single highest-return cost lever you have.
Two kinds of caching, and they are not the same
People say "we added caching" and mean one of two very different things. Getting them straight is the whole game, because they save money in different places.
- Prompt caching lives with the model provider. When many calls share a long, identical opening - a big system prompt, a fixed instruction block, a document you keep re-sending - the provider caches that prefix and charges a fraction of the normal input price to re-read it. Anthropic, for example, bills cache reads at a tenth of the base input rate; some providers apply it automatically once the shared prefix is long enough. You change little or no code; you just stop paying full price for the boilerplate you repeat on every call.
- Semantic caching lives in your application, and it is the bigger prize. Instead of discounting the prompt, it decides whether to call the model at all. You embed each incoming question, compare it to questions you have already answered, and if a past one is close enough in meaning, you return the stored answer and skip the model completely. "What is your refund window?" and "how long do I have to return something?" are different strings but the same question - exact-match caching misses that, semantic caching catches it.
The distinction matters because prompt caching shaves the cost of each call while semantic caching removes calls entirely. The second is where the large savings come from, and it is closer to a retrieval problem than a model one - it leans on the same vector database machinery that powers good retrieval, just pointed at your own past answers.
Not sure how much of your traffic is actually repetitive?
We will sample your real prompts, estimate the semantic hit rate you would get, and put a number on the saving before you build anything. No pitch, reply in 2 hrs, no card needed, NDA on request.
Get a free auditThe cost math that makes it worth doing
Caching only pays if enough traffic repeats, so the number that decides everything is your cache hit rate - the share of requests served from the cache instead of the model. The arithmetic is blunt and honest: a 60 percent hit rate means roughly 60 percent of those calls cost you almost nothing instead of full model price. On a workload where the model bill is the dominant line item, that is most of the bill gone.
The gains are real but they are not free money, and you should size them from your own data, not a vendor's best case. Exact-match caching on naive traffic often lands in the high teens; switching to semantic matching commonly pushes that past 60 percent, because it captures all the rephrasings exact-match throws away. Stacking application-level semantic caching on top of provider prompt caching is where teams report cutting 60 to 80 percent of spend on repetitive workloads. The same discipline of measuring before you optimise runs through all our AI development work - and if your model is running on AWS, the caching layer sits naturally alongside the rest of your AI infrastructure on AWS.
Paying full price for repeat questions adds up fast.
We will estimate your hit rate and the monthly saving from a caching layer - no build required to get the number. Reply in 2 hrs, NDA on request.
Book a free callWhere caching helps - and where it quietly hurts
A cache is a promise that a past answer is still the right answer. When that promise holds, caching is close to free savings. When it does not, a cache serves confidently wrong answers, which is worse than being slow. Know which side of the line you are on.
- Caching helps most on high-volume, low-variety, slow-changing traffic: support FAQs, policy questions, product lookups, document Q and A, classification and routing. Many users ask the same small set of things, and the right answer rarely moves - exactly the shape a cache rewards.
- Caching hurts when answers are personalised, live, or one-off. Anything keyed to a specific user's account, a real-time price, or this minute's stock is a correctness risk to cache - a hit returns another customer's context or yesterday's number. Here you either do not cache, or you cache only the non-personal scaffolding and fill the live parts fresh.
The trap is the similarity threshold on the semantic cache. Set it too loose and it treats "cancel my order" and "change my order" as the same question, serving the wrong answer with total confidence. Set it too tight and your hit rate collapses to exact-match levels and the savings vanish. That threshold is not a set-and-forget constant; it is tuned against real traffic and watched.
Takeaways
- Prompt caching discounts the repeated prompt; semantic caching removes the call entirely. The second saves more.
- The cache hit rate is the lever - measure it on your own traffic before you promise a saving.
- Semantic matching turns high-teens exact-match hit rates into 60 percent-plus by catching rephrasings.
- Stacking both layers cuts 60-80 percent of spend on genuinely repetitive workloads.
- Do not cache personalised or live answers - a stale hit is a correctness bug, not a saving.
How to actually roll it out
Start where the promise is safest. Turn on provider prompt caching first - it is often a config change and it lowers the floor on every call with a shared prefix, with no correctness risk. Then add a semantic cache in front of your single highest-volume, lowest-variety endpoint - usually support or FAQ - and instrument the hit rate per layer from day one. You are not chasing a headline percentage; you are watching whether the cached answers are actually right, and only widening once they are. Set a conservative similarity threshold, log every cache hit so you can audit it, and expire entries whenever the underlying facts change. Treat it like any other production system that can be wrong - which is how we approach every AI agent we put into production. If your calls are also large because you are re-sending big documents, pair caching with the retrieval choices we cover in fine-tuning versus RAG, since a leaner prompt caches better and costs less either way.
Frequently asked questions
Is semantic caching the same as prompt caching?
No. Prompt caching is a provider feature that discounts the repeated prefix of your prompt, so each call is cheaper. Semantic caching is something you build in your app that returns a stored answer when a new question means the same as a past one, so the call does not happen at all. They stack, and used together they save the most.
What cache hit rate should I expect?
It depends entirely on how repetitive your traffic is, so measure rather than assume. Exact-match caching on unnormalised prompts often sits in the high teens; semantic matching commonly pushes past 60 percent on support-style workloads by catching rephrasings. Personalised or one-off traffic will hit far lower, and that is a signal caching is not the right lever there.
Will a cache ever return a wrong answer?
Yes, if you let it. The two failure modes are a similarity threshold set too loose, which matches questions that are not really the same, and a stale entry that was not expired when the underlying fact changed. Both are managed with a conservative threshold, logging of every hit for audit, and expiry tied to your source data - not by hoping.
Does caching reduce latency as well as cost?
Yes, and often that is the bigger win for users. A cache hit skips the model call, so the answer returns in milliseconds instead of seconds. On high-traffic endpoints that faster response can matter as much as the money saved, and both improve together from the same hit.
The short version: a large part of most LLM bills is repetition, and caching is how you stop paying for it twice. Turn on prompt caching for the easy floor, add semantic caching where your traffic genuinely repeats, and let the measured hit rate - not a vendor slide - tell you what you saved. Cache the stable, never the personal, watch the threshold, and you take real cost out of the system without taking any correctness out with it.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
