Production-Ready RAG Architecture on AWS (Complete Guide)
Published on February 25, 2026
Most teams ship a RAG proof-of-concept in two weeks, then spend the next four months watching it hallucinate, time out under load, and quietly drain $3,200/month in inference costs — for answers that are wrong 31% of the time.
We have seen this exact failure pattern across more than 60 AI deployments we have run on AWS for clients in the US, UAE, and UK.
Here is the full architecture we actually use in production — no fluff, no toy examples.
Your RAG Prototype Is Not a Product
The moment real users hit your RAG system, three things break simultaneously.
The Three Simultaneous Failures
Flat-file chunking: Splitting PDFs every 512 tokens starts returning chunks that cut sentences in half, stripping context from the passage the LLM needs.
Retrieval bottleneck: A quick FAISS index on a single EC2 instance cannot handle concurrent queries above 40 requests per minute without latency climbing past 8 seconds.
Zero cost guardrails: A single runaway batch embedding job costs $1,140 in a weekend.
We have audited RAG systems where $14,700/month was being spent on Bedrock inference — and 38% of those calls were re-fetching cached context that had not changed in three weeks.
The AWS Services Stack That Holds Under Pressure
Here is the exact stack we deploy for production RAG on AWS, with the rationale for every choice.
Data Ingestion Layer
Amazon S3
Document source — versioned, with S3 Event Notifications triggering ingestion automatically.
AWS Lambda
Pre-processes raw files — stripping headers, OCR-correcting scanned PDFs, and normalizing encoding before any chunking happens.
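As a sketch of this stage (the cleanup rules and event handling below are illustrative; real preprocessing logic is client-specific), the S3-triggered Lambda reduces to:

```python
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize Unicode encoding and drop page-number-only lines
    before any chunking happens."""
    text = unicodedata.normalize("NFKC", raw)
    lines = [ln.rstrip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln and not ln.strip().isdigit())

def handler(event, context):
    """One invocation per S3 Event Notification, i.e. per uploaded file."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # In production: s3.get_object(Bucket=bucket, Key=key), OCR-correct
    # scanned pages, then write normalize_text(...) output to a staging
    # prefix that the embedding pipeline consumes.
    return {"bucket": bucket, "key": key}
```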
AWS Glue
Large-scale document ETL when processing 50,000+ pages in a single run.
Embedding & Indexing Layer
Amazon Titan Embeddings V2
Up to 1,024 output dimensions (configurable at 256, 512, or 1,024; the 1,536-dimension output belongs to Titan Embeddings V1). Serverless, bills per token. Native IAM integration cuts security overhead by 23 engineering hours per quarter.
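A minimal request builder for Titan Text Embeddings V2 via the `bedrock-runtime` `invoke_model` API. The `inputText`, `dimensions`, and `normalize` fields are the model's documented request body; the rest is a sketch:

```python
import json

TITAN_V2 = "amazon.titan-embed-text-v2:0"

def embed_request(text: str, dimensions: int = 1024) -> dict:
    """Build kwargs for bedrock_runtime.invoke_model()."""
    if dimensions not in (256, 512, 1024):
        raise ValueError("Titan V2 supports 256, 512, or 1024 dimensions")
    return {
        "modelId": TITAN_V2,
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps({"inputText": text,
                            "dimensions": dimensions,
                            "normalize": True}),
    }

# usage: bedrock_runtime.invoke_model(**embed_request("chunk text here"))
```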
OpenSearch Serverless
Vector store using an approximate nearest neighbor (ANN) algorithm with cosine similarity for semantic matching.
Hierarchical Chunking
Parent chunks of 1,024 tokens + child chunks of 256 tokens stored separately. Improved answer accuracy by 19.3 percentage points vs. flat chunking.
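The parent/child split can be sketched in a few lines (token lists in, chunk records out; the 1,024/256 sizes match the setup above). Children are what gets embedded and searched; at query time the matching child's parent supplies the full context:

```python
def hierarchical_chunks(tokens, parent_size=1024, child_size=256):
    """Split a token sequence into parent chunks, each carrying the
    child chunks that will actually be embedded and indexed."""
    parents = []
    for p_start in range(0, len(tokens), parent_size):
        parent = tokens[p_start:p_start + parent_size]
        children = [parent[c:c + child_size]
                    for c in range(0, len(parent), child_size)]
        parents.append({"parent": parent, "children": children})
    return parents
```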
Retrieval & Generation Layer
Bedrock Knowledge Bases
Managed RAG — retrieve-and-generate calls without building orchestration yourself.
Hybrid Search
Vector similarity + BM25 keyword matching running in parallel. Pure vector search misses exact-match queries by 27%. Do not skip BM25.
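A hybrid query against OpenSearch combines a BM25 `match` clause with a `knn` clause; score fusion is handled by a search pipeline with a normalization processor, configured separately. The field names (`chunk_text`, `embedding`) are placeholders for your own index mapping:

```python
def hybrid_query(query_text: str, query_vector: list, k: int = 10) -> dict:
    """Build an OpenSearch hybrid query body: BM25 keyword matching and
    approximate k-NN vector search run in parallel over the same index."""
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Exact-match / keyword leg (BM25).
                    {"match": {"chunk_text": {"query": query_text}}},
                    # Semantic leg (ANN over the embedding field).
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }
```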
Claude 3.5 Sonnet
Generation with hard max_tokens guard of 1,024 and a prompt template that forces citation of source chunks.
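The managed retrieve-and-generate path, with the max_tokens guard applied, can be sketched as a request builder for `bedrock-agent-runtime` (the knowledge base ID and model ARN below are placeholders):

```python
def rag_request(question: str, kb_id: str, model_arn: str) -> dict:
    """Build kwargs for bedrock_agent_runtime.retrieve_and_generate()."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "generationConfiguration": {
                    # Hard guard: never generate more than 1,024 tokens.
                    "inferenceConfig": {
                        "textInferenceConfig": {"maxTokens": 1024}
                    },
                },
            },
        },
    }

# usage: bedrock_agent_runtime.retrieve_and_generate(**rag_request(q, kb, arn))
```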
Why "Just Add More Context" Is Burning Your Budget
Here is the controversial take most AWS consultants will not give you: stuffing more retrieved chunks into your prompt is not improving accuracy — it is increasing your cost by $0.87 per thousand tokens while actually confusing the LLM.
The Reranking Test That Changed Everything
We ran an A/B test on a client's contract-review RAG system. Retrieving the top-7 chunks versus the top-3 most precise chunks (using a reranking step via Amazon Bedrock's built-in reranker):
Hallucination rate: 22% down to 6.4%
Inference cost: $11,200/month down to $4,900/month
The reranking step adds roughly 80ms of latency. That is a trade every engineer should make.
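The overfetch-then-rerank pattern is model-agnostic: in our stack the scores come from Bedrock's reranker, but the control flow reduces to the sketch below (`score_fn` stands in for whatever reranking call you use):

```python
def rerank_top_k(question, candidates, score_fn, keep=3):
    """Overfetch candidates (e.g. top-7 from vector search), rescore
    each against the question, and keep only the most precise few."""
    ranked = sorted(candidates, key=lambda c: score_fn(question, c),
                    reverse=True)
    return ranked[:keep]
```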
The $47,000 Guardrails Lesson
Most teams skip guardrails. Amazon Bedrock has a native Guardrails feature that blocks prompt injection attacks, redacts PII, and enforces topic restrictions.
We have seen one unguarded RAG deployment leak confidential HR data into responses because an internal user bypassed the system prompt. One incident. That company spent $47,000 on a compliance audit. Use guardrails.
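Attaching a guardrail is two extra parameters on the inference call: `invoke_model` accepts `guardrailIdentifier` and `guardrailVersion`, and the guardrail then screens both the prompt and the generated response. The IDs below are placeholders:

```python
def guarded_invoke_kwargs(model_id: str, body: str,
                          guardrail_id: str,
                          guardrail_version: str = "1") -> dict:
    """Build kwargs for bedrock_runtime.invoke_model() with a
    Bedrock Guardrail attached to the call."""
    return {
        "modelId": model_id,
        "body": body,
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": guardrail_version,
    }

# usage: bedrock_runtime.invoke_model(**guarded_invoke_kwargs(mid, body, gid))
```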
The Chunking Strategy That Actually Works at Scale
Fixed-size chunking (512 tokens, split on whitespace) is what every tutorial shows. It is also responsible for the majority of "the LLM has the right document but gives the wrong answer" failures we debug.
Semantic Chunking via Bedrock Advanced Parsing
Uses a Foundation Model to identify natural topic boundaries within a document, rather than splitting on token count. Preserves table structure — critical for financial documents and product catalogues. Generates chunk-level metadata (section heading, page number, document type) stored as OpenSearch fields for metadata filtering.
Manufacturing client: 12,000 technical manuals processed
Irrelevant retrievals fell from 34% to 8.7%, alongside a 41% drop in weekly engineer escalations.
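In boto3 terms, semantic chunking is selected in the data source's `vectorIngestionConfiguration` when calling `create_data_source` on the `bedrock-agent` client. A sketch of that fragment (the default values here are illustrative, not tuned recommendations):

```python
def semantic_chunking_config(max_tokens: int = 300,
                             buffer_size: int = 1,
                             threshold: int = 95) -> dict:
    """Build the chunkingConfiguration fragment that switches a Bedrock
    Knowledge Base data source from fixed-size to semantic chunking."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": max_tokens,
                "bufferSize": buffer_size,
                # Higher percentile = fewer, larger semantic chunks.
                "breakpointPercentileThreshold": threshold,
            },
        }
    }
```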
Security, Monitoring, and the Cost Reality
Production RAG on AWS is not a set-and-forget system. Here is what we monitor every single week:
Cost Signals to Watch
Bedrock inference tokens per query: alert if the average exceeds 3,500 tokens; that means the context window is bloated.
OpenSearch Serverless OCU hours: one client hit $4,100 in a single month because they forgot to set an OCU capacity limit during a load test.
S3 cross-region transfer: a $0.09/GB mistake that adds up to $1,800/month at scale.
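The tokens-per-query alert maps onto CloudWatch's `AWS/Bedrock` namespace, which publishes `InputTokenCount` per invocation. A sketch of the `put_metric_alarm` kwargs (the SNS topic ARN and evaluation period are assumptions to adapt):

```python
def token_alarm_kwargs(model_id: str, topic_arn: str,
                       threshold: int = 3500) -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm(): fire when average
    input tokens per invocation exceed the threshold (bloated context)."""
    return {
        "AlarmName": f"bedrock-input-tokens-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Average",
        "Period": 3600,          # evaluate hourly
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }
```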
Accuracy Signals & Multi-Tenant Security
Retrieval precision at k=3: measure with RAGAS.
Answer faithfulness: a score below 0.85 signals a hallucination problem.
Query latency at p99, not the average: the p99 is what your worst-case users experience.
Multi-tenant RAG: if different companies share the same RAG backend, you need namespace isolation in OpenSearch Serverless and tenant-scoped IAM policies. Use Bedrock Knowledge Bases metadata filtering with a tenant_id field enforced at the query level — not the application layer, where it can be bypassed.
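Enforcing `tenant_id` at the query level means the filter lives inside the `retrieve` call itself, using the Knowledge Bases metadata filter syntax (`equals` on a `key`/`value` pair). A sketch with placeholder IDs:

```python
def tenant_retrieve_kwargs(kb_id: str, query: str,
                           tenant_id: str, k: int = 3) -> dict:
    """Build kwargs for bedrock_agent_runtime.retrieve() with tenant
    isolation enforced in the retrieval layer, not application code."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": k,
                # Only chunks tagged with this tenant can ever match.
                "filter": {"equals": {"key": "tenant_id",
                                      "value": tenant_id}},
            }
        },
    }

# usage: bedrock_agent_runtime.retrieve(**tenant_retrieve_kwargs(kb, q, tid))
```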
The Implementation Reality (Week 1 vs. Week 8)
| Timeline | What Gets Built |
|---|---|
| Week 1–2 | S3 bucket setup, document preprocessing Lambda, Titan Embeddings pipeline, OpenSearch Serverless collection creation via CloudFormation |
| Week 3–4 | Knowledge Base configuration, semantic chunking setup, hybrid search tuning, first end-to-end query test |
| Week 5–6 | Guardrails, IAM hardening, CloudWatch dashboards, cost alerting thresholds |
| Week 7–8 | Load testing, reranker integration, RAGAS evaluation baseline, go-live with staged rollout |
Build Right vs. Rebuild Later
Cost to build it right the first time: approximately $22,000–$38,000 in engineering, depending on document volume and complexity.
Cost to rebuild a broken RAG system that went live too early: we have seen that bill hit $91,000, including the compliance investigation.
Stop Shipping RAG Prototypes as Production Systems
If you are running a RAG system that was built in under four weeks, has no guardrails, uses fixed-size chunking, and lacks CloudWatch cost alerts — you are not running a production system. You are running a liability. We have deployed production RAG processing from 5,000 to 2.3 million documents across healthcare, legal, manufacturing, and e-commerce. Do not let a $22K build turn into a $91K rebuild.
Frequently Asked Questions
Can we use Amazon Bedrock Knowledge Bases without any custom code?
Yes, for basic use cases. Bedrock Knowledge Bases handles ingestion, chunking, embedding, and retrieval end-to-end with zero custom orchestration code. However, production systems with custom metadata filtering, reranking, or multi-tenant isolation will need Lambda functions and custom query logic on top of the managed service.
What is the latency difference between OpenSearch Serverless and self-hosted Pinecone or Weaviate?
In our benchmarks across 11 production deployments, OpenSearch Serverless returns ANN results in 35–80ms at p50. Self-hosted Pinecone on equivalent hardware runs 20–55ms. The 25ms difference rarely matters in real applications where generation latency (700–1,400ms) dominates total response time. Staying in the AWS ecosystem saves $1,100–$2,400/month in cross-cloud data transfer and simplifies IAM.
How do we evaluate whether our RAG system is actually accurate before going live?
Use RAGAS — specifically the context precision, answer faithfulness, and answer relevancy metrics. Run a golden dataset of 200 representative queries with known correct answers. Any faithfulness score below 0.82 means your retrieval is returning wrong context. Do not go live until you hit 0.87+ on faithfulness and 0.80+ on context precision.
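Those thresholds make a simple go/no-go gate over your golden-dataset run; a sketch (the metric names follow RAGAS conventions):

```python
def go_live_gate(scores: dict) -> bool:
    """Block go-live until faithfulness >= 0.87 and
    context precision >= 0.80 on the golden dataset."""
    return (scores["faithfulness"] >= 0.87
            and scores["context_precision"] >= 0.80)
```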
What happens to existing documents in SharePoint or Confluence during migration?
Amazon Bedrock Knowledge Bases supports data source connectors for S3 natively. For SharePoint and Confluence, you will need a sync Lambda or a third-party connector to push documents into S3 first. This sync layer takes 3–5 days to build and handles incremental updates so only changed documents re-embed, keeping ongoing embedding costs under $180/month for mid-size document libraries.
How do we handle multi-language documents in a production RAG pipeline?
Amazon Titan Embeddings V2 handles multilingual content across 100+ languages in a single embedding space. However, your chunking strategy must be language-aware — semantic chunking on Arabic or Japanese requires different boundary detection than English. Configure language-specific preprocessing in the Lambda layer and validate cross-language retrieval with a bilingual golden dataset before go-live.