AI app development agency · AWS Bedrock · RAG · LLM

How We Build Production RAG Applications with AWS Bedrock

A practical walkthrough of building RAG (Retrieval Augmented Generation) pipelines using AWS Bedrock, OpenSearch Serverless, and Lambda — patterns from real AI app development projects.

Zyron Technologies·April 15, 2025·10 min read

What is RAG and Why It Matters

RAG (Retrieval-Augmented Generation) lets you connect LLMs to your own data. Instead of relying only on what the model learned during training, you retrieve relevant documents at query time and pass them to the model as context.

This is the foundation of most enterprise AI apps we build — chatbots over documentation, search over internal knowledge bases, AI assistants for SaaS products.

The AWS-Native RAG Stack

We build RAG pipelines entirely within AWS, which matters for clients with data sovereignty requirements:

  • AWS Bedrock — Claude or Titan as the LLM (no data leaves AWS)
  • OpenSearch Serverless — vector store for embeddings
  • Lambda — orchestration layer
  • S3 — document storage
  • Textract — PDF/document parsing

The Pipeline

Step 1: Document Ingestion

During ingestion, each document is parsed and split into chunks, and every chunk is embedded with Titan before it is indexed:

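import boto3
import json

# Bedrock runtime client; the models must be enabled for your account in this region.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def get_embedding(text: str) -> list[float]:
    """Embed one piece of text with Titan (chunks at ingestion, queries at search time)."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    # The response body is a stream: read it, parse the JSON, return the vector.
    return json.loads(response["body"].read())["embedding"]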
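
To make the rest of the ingestion step concrete, here is a minimal sketch of turning chunks into documents ready for indexing. The `build_index_docs` helper and the field names (`text`, `vector`, `source`, `tenant_id`, `chunk_index`) are assumptions for illustration, not a fixed schema. The embedding function is passed in so the sketch runs without AWS access; in the pipeline you would pass `get_embedding`.

```python
from typing import Callable

def build_index_docs(
    chunks: list[str],
    embed: Callable[[str], list[float]],
    source: str,
    tenant_id: str,
) -> list[dict]:
    """Pair every non-empty chunk with its embedding and metadata."""
    docs = []
    for position, chunk in enumerate(chunks):
        if not chunk.strip():
            continue  # skip empty chunks rather than embedding whitespace
        docs.append({
            "text": chunk,              # returned to the LLM as context
            "vector": embed(chunk),     # searched with kNN
            "source": source,           # hypothetical metadata fields
            "tenant_id": tenant_id,
            "chunk_index": position,
        })
    return docs
```

Each returned dict can then be written to the index, for example with the client's bulk or index API.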

Step 2: Vector Search

When a user sends a query, we embed it and search OpenSearch for the most similar chunks:

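def search_documents(query: str, index: str, top_k: int = 5) -> list[str]:
    query_embedding = get_embedding(query)
    # kNN search against OpenSearch Serverless.
    # `opensearch_client` is assumed to be an opensearch-py client created elsewhere;
    # "vector" is the name of the knn_vector field in the index.
    results = opensearch_client.search(
        index=index,
        body={
            "size": top_k,
            # The knn clause must sit inside "query".
            "query": {"knn": {"vector": {"vector": query_embedding, "k": top_k}}},
        },
    )
    return [hit["_source"]["text"] for hit in results["hits"]["hits"]]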
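
The search above assumes the index has a `knn_vector` field named `vector` next to the chunk text. A sketch of such an index definition follows. The 1536 dimension matches the output of `amazon.titan-embed-text-v1`; the engine, the distance metric, the metadata fields, and the `build_index_body` name are assumptions to adjust for your collection.

```python
def build_index_body(dimension: int = 1536) -> dict:
    """Index settings and mappings for chunk text plus its embedding."""
    return {
        "settings": {"index": {"knn": True}},  # enable kNN search on this index
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": dimension,  # must equal the embedding length
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                },
                "text": {"type": "text"},
                "source": {"type": "keyword"},      # hypothetical metadata field
                "tenant_id": {"type": "keyword"},   # hypothetical metadata field
            }
        },
    }
```

With an opensearch-py client this body would typically be passed to `opensearch_client.indices.create(index=index, body=build_index_body())`.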

Step 3: LLM Generation with Context

def answer_question(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    prompt = f"""Use the following context to answer the question.

Context:
{context}

Question: {question}

Answer:"""

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            # Required for Claude 3 models on Bedrock (Messages API).
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    # The Messages API returns a list of content blocks; take the first text block.
    return json.loads(response["body"].read())["content"][0]["text"]
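
Since Lambda is the orchestration layer, the retrieval and generation steps can be tied together in a handler. This is a sketch under assumptions: the event shape (an API Gateway-style JSON `body` with a `question` field), the `make_handler` factory, and the default index name are hypothetical. The search and answer functions are injected so the example runs without AWS.

```python
import json
from typing import Callable

def make_handler(
    search: Callable[[str, str], list[str]],
    answer: Callable[[str, list[str]], str],
    index: str = "docs",  # hypothetical index name
):
    """Build a Lambda-style handler from a search function and an answer function."""
    def handler(event: dict, context: object) -> dict:
        body = json.loads(event.get("body") or "{}")
        question = (body.get("question") or "").strip()
        if not question:
            # Reject empty requests before spending embedding or LLM calls.
            return {"statusCode": 400, "body": json.dumps({"error": "question is required"})}
        chunks = search(question, index)   # retrieve context
        return {
            "statusCode": 200,
            "body": json.dumps({"answer": answer(question, chunks)}),  # generate
        }
    return handler
```

In the pipeline above this would be wired as `handler = make_handler(search_documents, answer_question)`.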

Common Pitfalls

Chunk size matters. Too large and you lose precision. Too small and you lose context. We default to 512 tokens with 50-token overlap.
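
A sliding window is one simple way to get fixed-size chunks with overlap. The sketch below uses the 512/50 defaults mentioned above; splitting on whitespace is only a stand-in for a real tokenizer, and the function names are illustrative.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Assumption: whitespace-separated words approximate model tokens.
    return [" ".join(c) for c in chunk_tokens(text.split(), size, overlap)]
```

Because word counts and model token counts differ, a production version would count tokens with the tokenizer that matches the embedding model.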

Metadata filtering. Always store document metadata (source, date, tenant ID) alongside embeddings so you can filter before vector search.
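
One way to apply such a filter is to put it inside the kNN clause so that only matching documents are considered as neighbours. The sketch below assumes the index stores a `tenant_id` keyword field and a `vector` field, and that the engine in use supports filters inside `knn`; check this for your OpenSearch version and engine. The `build_filtered_knn_query` name is illustrative.

```python
def build_filtered_knn_query(embedding: list[float], tenant_id: str, top_k: int = 5) -> dict:
    """Build a kNN search body restricted to a single tenant's documents."""
    return {
        "size": top_k,
        "query": {
            "knn": {
                "vector": {  # assumed name of the knn_vector field
                    "vector": embedding,
                    "k": top_k,
                    # Restrict candidates to one tenant (assumed field name).
                    "filter": {"term": {"tenant_id": tenant_id}},
                }
            }
        },
    }
```

The returned dict is passed as the `body` of a search request, in place of an unfiltered kNN query.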

Cost control. Embedding calls add up. Cache embeddings for static documents and only re-embed on updates.
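
A content hash makes a convenient cache key: unchanged text maps to the same key and is never re-embedded, while edited text gets a new key. The sketch below keeps the cache in memory for illustration; a persistent store would be needed across Lambda invocations, and the `EmbeddingCache` class is a hypothetical helper.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Cache embeddings by the SHA-256 hash of the text they were computed from."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self._embed = embed
        self._store: dict[str, list[float]] = {}

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: pay for one embedding call and remember the result.
            self._store[key] = self._embed(text)
        return self._store[key]
```

Wrapping the embedding function as `EmbeddingCache(get_embedding)` and calling `.get(chunk)` during ingestion means only new or changed chunks trigger a Bedrock call.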

When to Use RAG vs Fine-tuning

Use RAG when:

  • Your data changes frequently
  • You need source attribution
  • Data volume is large (>10k documents)

Use fine-tuning when:

  • You need specific tone/style
  • Task is narrow and well-defined
  • Data is stable

Want This Built for Your Product?

We've built RAG pipelines for SaaS products, internal tools, and customer-facing AI assistants. Book a call and let's talk about your use case.
