How We Build Production RAG Applications with AWS Bedrock

What is RAG and Why It Matters

RAG (Retrieval-Augmented Generation) lets you connect LLMs to your own data. Instead of relying only on what a model learned during training, you retrieve relevant documents at query time and pass them to the model as context.

This is the foundation of most enterprise AI apps we build — chatbots over documentation, search over internal knowledge bases, AI assistants for SaaS products.

The AWS-Native RAG Stack

We build RAG pipelines entirely within AWS, which matters for clients with data sovereignty requirements:

AWS Bedrock — Claude or Titan as the LLM (no data leaves AWS)
OpenSearch Serverless — vector store for embeddings
Lambda — orchestration layer
S3 — document storage
Textract — PDF/document parsing

The Pipeline

Step 1: Document Ingestion

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

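def get_embedding(text: str) -> list[float]:
    """Return the Titan embedding vector for a piece of text."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    # The response body is a stream: read it, then parse the JSON payload.
    return json.loads(response["body"].read())["embedding"]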
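
The embedding call above expects plain text, and the stack lists Textract as the parser that produces it. The post does not show that step, so the following is only a minimal sketch. It assumes the response shape returned by Textract's synchronous detect_document_text call (a "Blocks" list whose LINE entries carry a "Text" field), and the helper name extract_text_lines is hypothetical.

```python
def extract_text_lines(textract_response: dict) -> str:
    """Join the LINE blocks of a Textract response into plain text.

    Assumes the response shape of detect_document_text: a top-level
    "Blocks" list where each block has a "BlockType" and, for LINE
    and WORD blocks, a "Text" field.
    """
    lines = [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        # WORD blocks repeat the same text, so only LINE blocks are kept.
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)
```

The resulting text can then be chunked and passed to get_embedding. Multi-page PDFs generally go through Textract's asynchronous API rather than the synchronous call, which this sketch does not cover.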

Step 2: Vector Search

When a user sends a query, we embed it and search OpenSearch for the most similar chunks:

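def search_documents(query: str, index: str, top_k: int = 5) -> list[str]:
    # opensearch_client is assumed to be an opensearch-py client created elsewhere,
    # and "vector" is assumed to be the knn_vector field in the index mapping.
    query_embedding = get_embedding(query)
    # kNN search against OpenSearch Serverless; the knn clause must sit inside "query"
    results = opensearch_client.search(
        index=index,
        body={
            "size": top_k,
            "query": {"knn": {"vector": {"vector": query_embedding, "k": top_k}}},
        },
    )
    return [hit["_source"]["text"] for hit in results["hits"]["hits"]]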

Step 3: LLM Generation with Context

def answer_question(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    prompt = f"""Use the following context to answer the question.
    
Context:
{context}

Question: {question}

Answer:"""
    
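    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            # Required by the Anthropic Messages API on Bedrock
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    # The first content block holds the generated text
    return json.loads(response["body"].read())["content"][0]["text"]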
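
The steps above can be composed into a single entry point. This is a rough sketch, not the exact orchestration we run in Lambda: the function name rag_answer, the fallback message, and the choice to skip the LLM call when retrieval returns nothing are all assumptions made for illustration. The search and answer functions are passed in so the wiring can be exercised without AWS credentials.

```python
from typing import Callable

# Hypothetical fallback text; adjust to the product's voice.
NO_CONTEXT_REPLY = "I could not find relevant documents for that question."


def rag_answer(
    question: str,
    search_fn: Callable[[str, int], list[str]],
    answer_fn: Callable[[str, list[str]], str],
    top_k: int = 5,
) -> str:
    """Retrieve context for a question, then generate an answer from it."""
    chunks = search_fn(question, top_k)
    if not chunks:
        # Assumption: with no retrieved context, skip the LLM call
        # instead of letting the model answer from memory.
        return NO_CONTEXT_REPLY
    return answer_fn(question, chunks)
```

With the functions from this post, a call could look like rag_answer(question, lambda q, k: search_documents(q, "docs", k), answer_question), where "docs" is a placeholder index name.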

Common Pitfalls

Chunk size matters. Too large and you lose precision. Too small and you lose context. We default to 512 tokens with 50-token overlap.
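
A sliding window is one simple way to get fixed-size chunks with overlap. The sketch below uses the 512/50 defaults mentioned above, but it splits on whitespace as a stand-in for real tokens, which is an assumption: a production pipeline would count tokens with the tokenizer of the embedding model. The names chunk_tokens and chunk_text are hypothetical.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 50
) -> list[list[str]]:
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Chunk text using whitespace-separated words as a rough token proxy."""
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, chunk_size, overlap)]
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.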

Metadata filtering. Always store document metadata (source, date, tenant ID) alongside embeddings so you can filter before vector search.
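
As an illustration, a tenant filter can be placed inside the kNN clause so that it is applied as part of the vector search instead of after it. This is a sketch under assumptions: the metadata field name tenant_id and the helper name build_filtered_knn_query are hypothetical, the vector field is called "vector" to match the search example earlier, and filtering inside the knn clause depends on the k-NN engine configured for the index, so it is worth confirming against your collection's mapping.

```python
def build_filtered_knn_query(
    query_embedding: list[float],
    tenant_id: str,
    top_k: int = 5,
    vector_field: str = "vector",
) -> dict:
    """Build an OpenSearch kNN query body restricted to one tenant."""
    return {
        "size": top_k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": query_embedding,
                    "k": top_k,
                    # Only documents matching this filter are candidates.
                    "filter": {
                        "bool": {
                            "filter": [{"term": {"tenant_id": tenant_id}}]
                        }
                    },
                }
            }
        },
    }
```

The returned dict can be passed as the body of the search call. More conditions, such as a date range, would be extra entries in the inner filter list.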

Cost control. Embedding calls add up. Cache embeddings for static documents and only re-embed on updates.
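
One way to cache is to key each embedding by a hash of the chunk text, so unchanged chunks never trigger a second Bedrock call. The sketch below is illustrative only: the EmbeddingCache class is hypothetical, and it keeps results in an in-memory dict, whereas a real deployment would likely back this with a persistent store.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings by the SHA-256 hash of the input text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        # Assumption: in-memory storage, for illustration only.
        self._store: dict[str, list[float]] = {}

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: pay for the embedding call once.
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Wrapping the function from Step 1 would look like EmbeddingCache(get_embedding). Because the key is derived from content, an edited chunk hashes to a new key and gets re-embedded automatically.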

When to Use RAG vs Fine-tuning

Use RAG when:

Your data changes frequently
You need source attribution
Data volume is large (>10k documents)

Use fine-tuning when:

You need specific tone/style
Task is narrow and well-defined
Data is stable

Want This Built for Your Product?

We've built RAG pipelines for SaaS products, internal tools, and customer-facing AI assistants. Book a call and let's talk about your use case.