AI Summary - 20-sec read - Reviewed by experts
- A RAG app on AWS has four cost centers: embeddings, the vector store, retrieval compute, and LLM inference. At 1M queries a month, inference and the vector store dominate - embeddings are almost free by comparison.
- The model you pick swings the bill more than anything else. A small, fast model can run the same workload for a few thousand dollars a month; a frontier model can push it past twenty thousand.
- OpenSearch Serverless has a real monthly floor (hundreds of dollars even when idle); pgvector on RDS or a usage-based vector service can be cheaper at smaller scale.
- The biggest levers are caching repeated queries, trimming retrieved context, and using a smaller model for routing or reranking - each cuts real money without hurting answer quality.
- Short on time? Book a free call.
The retrieval-augmented generation demo cost almost nothing to run. A few dollars of API calls, a handful of documents, and a chatbot that answered questions about your data. Then someone in finance asked the question that kills proof-of-concepts: what does this cost when the whole company uses it, a million queries a month? The honest answer is "it depends" - but you can model it line by line, and the shape of the bill is predictable once you know where the money actually goes.
This is that model: the four cost centers of a production RAG app on AWS, a worked monthly bill at 1M queries with the assumptions stated openly, and the levers that move the total. The numbers are illustrative ranges for planning - your real bill depends on your model choice, document size, and region - so treat them as the shape, not a quote. Verify current AWS and Bedrock pricing before you commit.
The four cost centers
Every RAG query does the same four things, and each costs money. Understand these and you understand the whole bill. If you are new to how the pieces fit together, our explainer on what RAG actually is covers the architecture before the economics.
- Embeddings. Turning the user's question (and, once, your documents) into vectors. Cheap per call.
- Vector store. Holding those vectors and searching them. This is a standing monthly cost, not per-query.
- Retrieval compute. The Lambda or container that orchestrates the query, plus the API layer in front.
- LLM inference. The model that reads the retrieved context and writes the answer. Usually the largest line.
A worked bill at 1M queries a month
Assumptions, stated so you can adjust them: 1,000,000 queries a month; each query embeds a short question (~50 tokens), retrieves a handful of chunks totalling ~2,000 tokens of context, adds a ~500-token system prompt, and the model writes a ~400-token answer. That is 2,550 input tokens per query, rounded here to 2,500, and 400 output tokens - about 2.5 billion input and 0.4 billion output tokens a month.
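The arithmetic behind those totals is easy to re-run with your own numbers. A minimal sketch, using only the assumptions stated above; the function and key names are illustrative and not part of any AWS tooling.

```python
def monthly_tokens(queries: int, question: int, context: int,
                   system_prompt: int, answer: int) -> dict:
    """Return per-query and monthly token counts for one RAG workload."""
    # Everything the model reads counts as input: question, context, system prompt.
    input_per_query = question + context + system_prompt
    return {
        "input_per_query": input_per_query,
        "output_per_query": answer,
        "monthly_input": queries * input_per_query,
        "monthly_output": queries * answer,
    }


totals = monthly_tokens(
    queries=1_000_000, question=50, context=2_000, system_prompt=500, answer=400
)
print(totals)
```

The exact sum is 2,550 input tokens per query (2.55 billion a month), which the text rounds to 2.5 billion. Changing `context` is the quickest way to see how retrieval depth moves the monthly volume.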
Embeddings: tens of dollars
Query-time embeddings are 1M times ~50 tokens, about 50M tokens a month. At commodity embedding prices that is a few dollars. Even re-embedding a large corpus periodically lands in the tens of dollars. Embeddings are the line nobody needs to worry about - call it 5 to 30 dollars a month.
Vector store: hundreds to low thousands
This is a standing cost. Amazon OpenSearch Serverless bills by OpenSearch Compute Units (OCUs) per hour, with a practical floor of several hundred dollars a month even at modest scale, rising with index size and query rate - budget roughly 700 to 2,000 dollars a month at this volume. pgvector on an RDS instance you already run can be far cheaper (a few hundred dollars), but you manage scaling yourself. A usage-based managed vector service sits in between. The right pick depends on scale and ops appetite; we compared the main options in OpenSearch Serverless vs Pinecone and, for the database-native route, Pinecone vs pgvector.
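What makes this a standing cost is that the formula has no query count in it. A rough sketch of hourly-billed capacity follows; the unit count and the hourly price are placeholders chosen for illustration, not quoted AWS prices, so substitute the current figures for your region.

```python
HOURS_PER_MONTH = 730  # average hours in a month, a common billing convention


def standing_monthly_cost(units: float, price_per_unit_hour: float) -> float:
    """Cost of capacity billed per hour whether or not anyone queries it."""
    return units * price_per_unit_hour * HOURS_PER_MONTH


# Placeholder inputs (assumptions): 4 compute units at 0.25 USD per unit-hour.
floor = standing_monthly_cost(units=4, price_per_unit_hour=0.25)
print(f"standing cost: {floor:.0f} USD/month")
```

With these placeholder inputs the floor is 730 dollars a month at zero traffic, which is why over-provisioning capacity costs the same in a quiet month as in a busy one.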
Retrieval compute: low hundreds
A Lambda per query plus API Gateway for 1M invocations a month is modest - on the order of 50 to 300 dollars depending on memory, duration, and whether you front it with a cache. Containers on ECS or Fargate land in a similar range if sized to the load. This is rarely where the bill is won or lost.
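Lambda charges for duration (in GB-seconds) plus a per-request fee, so the estimate reduces to two multiplications. The sketch below assumes placeholder prices and a hypothetical 1 GB function averaging 3 seconds per query; it ignores any free tier and leaves out the API Gateway charge, so check current pricing before relying on the result.

```python
def lambda_monthly_cost(invocations: int, memory_gb: float, avg_duration_s: float,
                        price_per_gb_second: float,
                        price_per_million_requests: float) -> float:
    """Duration charge (GB-seconds) plus request charge, ignoring any free tier."""
    gb_seconds = invocations * memory_gb * avg_duration_s
    duration_cost = gb_seconds * price_per_gb_second
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return duration_cost + request_cost


# Placeholder prices and sizing (assumptions), for illustration only.
compute = lambda_monthly_cost(
    invocations=1_000_000,
    memory_gb=1.0,
    avg_duration_s=3.0,
    price_per_gb_second=0.00002,
    price_per_million_requests=0.20,
)
print(f"retrieval compute: {compute:.2f} USD/month")
```

Duration dominates the request fee here, and most of that duration is time spent waiting on the model's response, so memory size and timeout settings matter more than invocation count.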
LLM inference: the line that decides everything
This is where model choice dominates. At 2.5B input and 0.4B output tokens a month, a small, fast model on Amazon Bedrock (Haiku-class or comparable) runs the workload for roughly 3,000 to 5,000 dollars a month. A mid-tier model multiplies that several times. A frontier model (Sonnet- or Opus-class) can push inference past 20,000 dollars a month for the exact same query volume. Nothing else on the bill moves the total like this single choice - which is why the right architecture uses the strongest model only where it earns its cost.
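To see how the per-token price drives this line, here is a small sketch that prices the same monthly token volume under two tiers. The tier names and the prices per million tokens are hypothetical placeholders picked to land inside the ranges quoted above; they are not Bedrock list prices.

```python
MONTHLY_INPUT_TOKENS = 2_500_000_000
MONTHLY_OUTPUT_TOKENS = 400_000_000

# Hypothetical prices in USD per million tokens (assumptions, not a price list).
PRICE_TIERS = {
    "small": {"input": 1.00, "output": 5.00},
    "frontier": {"input": 5.00, "output": 25.00},
}


def inference_cost(input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Monthly inference cost given token volumes and per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price \
        + (output_tokens / 1_000_000) * output_price


bill = {
    tier: inference_cost(MONTHLY_INPUT_TOKENS, MONTHLY_OUTPUT_TOKENS,
                         prices["input"], prices["output"])
    for tier, prices in PRICE_TIERS.items()
}
for tier, cost in bill.items():
    print(f"{tier}: {cost:,.0f} USD/month")
```

Under these placeholder prices the small tier comes to 4,500 dollars and the frontier tier to 22,500 dollars for identical traffic. Output tokens are about 14 percent of the volume but 44 percent of the cost in both cases, so answer length is worth controlling as well as context length.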
Takeaways
- Plan for a total of roughly 4,000 to 7,000 dollars a month with a small model (the four lines above summed), and well over 20,000 with a frontier model, at 1M queries. The model choice is the bill.
- Inference and the vector store are the two real lines. Embeddings and retrieval compute are rounding errors by comparison.
- OpenSearch Serverless has a standing floor whether you query it or not - right-size it, do not over-provision "just in case."
- Context length is a cost lever: every extra retrieved token is paid on every single query, a million times over.
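The context-length point can be put in dollars with one line of arithmetic. A minimal sketch, assuming a placeholder input price of 1.00 USD per million tokens and a hypothetical 500-token chunk; neither figure comes from a price list.

```python
def extra_context_cost(extra_tokens_per_query: int, queries_per_month: int,
                       input_price_per_million: float) -> float:
    """Monthly cost of sending extra input tokens on every query."""
    extra_tokens = extra_tokens_per_query * queries_per_month
    return extra_tokens / 1_000_000 * input_price_per_million


# One extra ~500-token chunk on every one of 1M queries (placeholder price).
one_chunk = extra_context_cost(500, 1_000_000, input_price_per_million=1.00)
print(f"one extra chunk per query: {one_chunk:.0f} USD/month")
```

At that placeholder price a single unnecessary chunk costs 500 dollars a month, and the figure scales linearly with both the model's input price and the number of chunks.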
The levers that cut the bill
Once you see where the money goes, the savings are obvious. Cache repeated and near-duplicate queries - in most knowledge bases a meaningful share of questions repeat, and a cache hit costs nothing in inference. Trim retrieved context aggressively: retrieving ten chunks when three answer the question multiplies your input tokens for no quality gain, and input tokens are paid on every query. Use a small, cheap model for the easy work - routing, reranking, classifying - and reserve the expensive model for the final answer, or only for the questions that genuinely need it. Right-size the vector store to your real index, and batch your document ingestion rather than re-embedding constantly. Together these routinely cut a RAG bill by half or more without the user noticing any drop in answer quality. Building those controls in from the start is the core of how we architect AI on AWS, and keeping the bill flat as traffic grows is what ongoing managed cloud services are for.
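As an illustration of the caching lever, here is a minimal exact-match answer cache. It is a sketch under stated assumptions: `AnswerCache`, `normalize`, and `fake_generate` are hypothetical names, the in-memory dict stands in for a shared store with expiry, and only case and whitespace differences are folded together. Catching true near-duplicates would need embedding similarity, which is not shown.

```python
import hashlib
from typing import Callable, Dict, List


def normalize(question: str) -> str:
    """Fold case and collapse whitespace so trivial variants share a key."""
    return " ".join(question.lower().split())


class AnswerCache:
    """Exact-match cache in front of an expensive answer generator."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def answer(self, question: str) -> str:
        key = hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1  # no retrieval or inference cost on this path
            return self._store[key]
        self.misses += 1
        result = self._generate(question)
        self._store[key] = result
        return result


# Stand-in for the real retrieve-and-generate pipeline (hypothetical).
calls: List[str] = []


def fake_generate(question: str) -> str:
    calls.append(question)
    return f"answer to: {question}"


cache = AnswerCache(fake_generate)
cache.answer("What is our refund policy?")
cache.answer("  what is our REFUND policy? ")
print(cache.hits, cache.misses, len(calls))
```

The second question is served from the cache, so the generator runs once for two requests. A cached answer can go stale when the underlying documents change, so a real deployment would also need an expiry or invalidation rule tied to ingestion.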
The cost nobody budgets for
Two lines hide outside the per-query math. Document ingestion at scale - chunking, embedding, and indexing a large corpus, then keeping it fresh - is real compute, especially if your data changes daily. And observability: logging every query, its retrieved context, and its answer for quality monitoring adds storage and CloudWatch cost, but skipping it means you cannot tell when retrieval quality degrades. Budget for both. A RAG app that nobody can observe is a RAG app nobody can trust, and a stale index quietly returns worse answers until someone complains. Getting the ingestion pipeline and monitoring right is part of any production-grade build, which is why our AWS consulting engagements cost them in from day one rather than discovering them in month three.
FAQ
How much does a RAG app cost to run on AWS?
At 1M queries a month, plan for roughly a few thousand dollars with a small, fast model and tens of thousands with a frontier model. Inference and the vector store are the two dominant lines; embeddings and retrieval compute are minor. Your real number depends on model choice, how much context you retrieve per query, and your vector store, so model it against your own workload before committing.
What is the biggest cost in a RAG app?
LLM inference, driven by which model you choose and how many tokens you feed it per query. The same query volume can cost a few thousand or twenty-thousand-plus dollars a month purely on model selection. After inference, the vector store is the next largest line because it is a standing monthly cost rather than per-query.
Is OpenSearch Serverless or pgvector cheaper for RAG?
At smaller scale, pgvector on an RDS instance you already run is often cheaper because you avoid OpenSearch Serverless's standing compute floor - but you manage scaling and tuning yourself. OpenSearch Serverless costs more at the low end but scales with less operational work. The crossover depends on your index size and query rate; model both before deciding.
How do I reduce RAG costs without hurting quality?
Cache repeated queries, retrieve fewer and better chunks so you pay for less context on every call, and use a small model for routing and reranking while reserving the expensive model for the final answer. Right-size the vector store and batch your embeddings. These typically cut the bill by half or more with no visible drop in answer quality.
The bottom line: a RAG app's bill is not mysterious. Four lines - embeddings, vector store, retrieval, inference - and two of them, inference and the vector store, are where the money lives. Pick the model deliberately, keep the retrieved context tight, cache what repeats, and right-size the store. Do that and a million queries a month is a planned line item, not a finance surprise.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
