
Building Production-Grade RAG Systems: An Architectural Deep Dive

Introduction

Retrieval-Augmented Generation (RAG) has become a cornerstone of modern AI applications, but building a production-ready RAG system requires far more than connecting a vector database to an LLM. This article shares architectural patterns and implementation strategies from building an enterprise RAG system that handles real-world complexity at scale.

The system we’ll explore demonstrates how to move beyond basic RAG implementations to create a robust, performant solution that handles query expansion, multi-source retrieval, domain filtering, and intelligent routing. These patterns are applicable across industries and use cases.

The Challenge: Beyond Basic RAG

Most RAG tutorials show you how to embed documents, store them in a vector database, and retrieve relevant chunks. That’s table stakes. The real challenges emerge when you need to:

  • Handle ambiguous queries with acronyms and domain-specific terminology
  • Retrieve from multiple knowledge sources with different characteristics
  • Filter out irrelevant content that passes semantic similarity checks
  • Manage performance at scale with hundreds of concurrent requests
  • Provide accurate citations while preventing hallucinations
  • Route queries intelligently between retrieval and direct responses

Let’s examine how to solve these challenges architecturally.

Core Architecture: The Multi-Stage Pipeline

The foundation is a three-stage retrieval pipeline that progressively refines results:

Stage 1: Query Expansion

Rather than searching with the user’s raw query, we expand it into multiple variations. This dramatically improves retrieval quality for domain-specific queries.

The key insight: use LLM-based expansion instead of static dictionaries. When a user asks about “HELOAN requirements,” the system generates:

  • “What are Home Equity Loan requirements?”
  • “How do I qualify for a home equity loan?”
  • “What documentation is needed for a HELOAN?”

This agentic approach handles new terminology automatically and understands context. The implementation uses structured prompting to generate exactly N alternative queries, maintaining control while leveraging LLM reasoning.

Performance consideration: We optimized from 3 to 2 alternative queries, reducing knowledge base calls by 25% while maintaining quality. Each additional query variation multiplies your retrieval costs across all knowledge sources.
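To make the expansion step concrete, here's a minimal sketch with an injectable async model call. The prompt wording, JSON-list output format, and fail-open fallback are illustrative choices, not the production implementation:

```python
import asyncio
import json

EXPANSION_PROMPT = (
    "Rewrite the user question as exactly {n} alternative search queries "
    "that expand acronyms and domain-specific terms. "
    "Respond with a JSON list of strings only.\n\nQuestion: {query}"
)

async def expand_query(query: str, llm, n: int = 2) -> list[str]:
    """Return the original query plus up to n LLM-generated variations.

    `llm` is any async callable taking a prompt string and returning the
    model's text output (e.g. a thin wrapper around Bedrock Converse).
    """
    raw = await llm(EXPANSION_PROMPT.format(n=n, query=query))
    try:
        variations = json.loads(raw)
    except json.JSONDecodeError:
        variations = []  # fail open: search with the raw query alone
    # Keep the original query first, then at most n well-formed variations
    return [query] + [v for v in variations if isinstance(v, str)][:n]
```

Because the LLM call is injected, the expansion logic stays testable without a live model, and dialing `n` from 3 down to 2 is a one-argument change.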

Stage 2: Dual Knowledge Base Retrieval with Fusion

Enterprise knowledge exists in multiple places. Our architecture combines internal proprietary knowledge with external authoritative sources using LangChain’s EnsembleRetriever pattern.

The implementation uses AWS Bedrock Knowledge Bases with configurable weighting:

  • Internal knowledge base (company-specific content): 50% weight
  • External knowledge base (industry standards, regulations): 50% weight

Each knowledge base uses HYBRID search mode, combining semantic vector search with keyword matching. This prevents the common failure mode where semantically similar but factually wrong content ranks higher than exact keyword matches.

Critical implementation detail: Score-based filtering happens at the retriever level, not after fusion. Documents below the threshold are filtered before the ensemble combines results. This prevents low-quality results from one source from diluting high-quality results from another.
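LangChain's EnsembleRetriever combines sources with weighted reciprocal-rank fusion. To show why the filter-before-fuse ordering matters, here's a standalone sketch of that logic (document IDs and the RRF constant are illustrative; the real system delegates fusion to EnsembleRetriever):

```python
def filter_then_fuse(results_by_source, weights, min_score, c=60):
    """Weighted reciprocal-rank fusion over per-source result lists.

    results_by_source: one ranked list of (doc_id, score) per knowledge base.
    Score filtering happens per source, BEFORE fusion, so weak results from
    one source cannot dilute strong results from another.
    """
    fused: dict[str, float] = {}
    for ranked, weight in zip(results_by_source, weights):
        # Drop below-threshold documents before they earn any rank credit
        kept = [(doc_id, score) for doc_id, score in ranked if score >= min_score]
        for rank, (doc_id, _) in enumerate(kept):
            # Reciprocal-rank contribution, scaled by the source weight
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (c + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)
```

Note that a document filtered out of one source can still surface via the other: it simply earns fusion credit only from the rankings where it cleared the threshold.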

Stage 3: Domain Scope Filtering

Here’s where most RAG systems fail in production: semantic similarity doesn’t guarantee domain relevance. A query about “mortgage insurance” might retrieve content about auto insurance or life insurance if those documents discuss similar concepts.

The solution: LLM-based domain classification after retrieval. Each retrieved document is evaluated for domain relevance before being passed to the response generator.

The filter uses a fast, cost-effective model with structured output:

{
    "is_mortgage_related": bool,
    "confidence": float,
    "reasoning": str,
    "detected_topics": List[str]
}

Documents below the confidence threshold are removed. This prevents the agent from providing advice on out-of-scope topics even when the knowledge base contains that content.

Performance optimization: Known-good sources (like company FAQ pages) bypass LLM classification entirely. This reduces latency and costs while maintaining accuracy.
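Putting the filter, the trusted-source bypass, and the fail-open behavior together looks roughly like this sketch (the `TRUSTED_SOURCES` set, document shape, and classifier interface are illustrative assumptions):

```python
import asyncio

# Hypothetical allowlist of curated hosts that skip LLM classification entirely
TRUSTED_SOURCES = {"faq.example.com"}

async def filter_by_domain(documents, classify, min_confidence=0.6):
    """Drop retrieved documents the classifier judges out-of-domain.

    `classify` is any async callable returning a dict with the structured
    fields shown above (is_mortgage_related, confidence, ...).
    """
    kept = []
    for doc in documents:
        if doc.get("source") in TRUSTED_SOURCES:
            kept.append(doc)  # known-good source: bypass classification
            continue
        try:
            verdict = await classify(doc["content"])
        except Exception:
            kept.append(doc)  # fail open: never block on classifier errors
            continue
        if verdict["is_mortgage_related"] and verdict["confidence"] >= min_confidence:
            kept.append(doc)
    return kept
```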

Intelligent Routing: When to Retrieve vs. Respond

Not every query needs retrieval. “Help me apply for a mortgage” is action-oriented, not informational. Routing every query through retrieval adds latency and cost without value.

The solution: agentic routing using LLM reasoning to classify query intent.

The router analyzes:

  • Is this a factual question or an action request?
  • Would knowledge base content add value?
  • Does this need tools (calculators, forms) instead of information?

This pure agentic approach outperforms rule-based routing because it understands context and nuance. A query like “What happens after I submit my application?” is factual (route to knowledge base), while “Submit my application” is action-oriented (route to agent).

Architecture decision: The router uses structured output with TypedDict rather than Pydantic models. AWS Bedrock’s Converse API has more reliable TypedDict support, especially with Amazon Nova models.
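A minimal sketch of the router's contract, with the LLM call injected (in a LangChain setup, `classify` would be something like an `llm.with_structured_output(RouteDecision)` chain; the field names here are illustrative):

```python
import asyncio
from typing import Literal, TypedDict

class RouteDecision(TypedDict):
    # Plain TypedDict rather than Pydantic: its schema serializes cleanly
    # through Bedrock Converse tool-use, per the note above
    destination: Literal["knowledge_base", "agent"]
    reasoning: str

async def route_query(query: str, classify) -> str:
    """`classify` is an async callable that returns a RouteDecision."""
    decision = await classify(query)
    return decision["destination"]
```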

Performance Architecture: Async All The Way

Production RAG systems must handle concurrent requests efficiently. Our architecture uses async/await throughout:

Parallel Query Execution

When query expansion generates multiple variations, all knowledge base retrievals happen in parallel:

async def retrieve_for_query(q: str):
    return await retriever.ainvoke(q)

# Execute all retrievals concurrently
results = await asyncio.gather(*[
    retrieve_for_query(q) for q in queries_to_search
])

This reduces total retrieval time from (N queries × retrieval latency) to (max retrieval latency). With 3 query variations and 2 knowledge bases, this changes 6 sequential calls into 6 parallel calls.

Throttling Protection

AWS Bedrock has rate limits. Naive parallel execution causes throttling errors under load. The solution: semaphore-based request limiting with exponential backoff retry.

The implementation uses a factory pattern with centralized throttling:

import asyncio
import random
from asyncio import Semaphore

class LLMClientFactory:
    _bedrock_semaphore = Semaphore(30)  # Max 30 concurrent requests

    @classmethod
    async def invoke_with_retry(cls, llm, messages, max_retries=6):
        async with cls._bedrock_semaphore:
            for attempt in range(max_retries):
                # Add jitter to prevent thundering herd
                await asyncio.sleep(random.uniform(0, 0.1))
                try:
                    return await llm.ainvoke(messages)
                except Exception:  # e.g. a Bedrock ThrottlingException
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff on throttling errors
                    await asyncio.sleep(2 ** attempt + random.uniform(0, 1))

This pattern provides:

  • Request queuing to prevent cascade failures
  • Circuit breaker pattern for automatic recovery
  • Per-model invocation tracking for observability
  • Jitter to prevent synchronized retry storms

Real-world impact: This reduced throttling errors from ~15% of requests under load to <0.1%.

Caching Strategy

Guardrail checks and domain filtering are expensive LLM calls. Caching identical checks within a session eliminates redundant work:

from cachetools import TTLCache

class GuardrailCache:
    def __init__(self, maxsize=1000, ttl=300):
        self.cache = TTLCache(maxsize=maxsize, ttl=ttl)

The cache is session-scoped with 5-minute TTL. This handles users who rephrase the same question multiple times without caching stale results across sessions.

Performance gain: Cache hit rates of 30-40% in typical conversations, reducing LLM calls by the same percentage.
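If you want to avoid the cachetools dependency, the same get-or-compute pattern fits in a few lines of stdlib Python. This sketch simplifies the eviction policy (oldest insertion rather than true LRU) and the class name is illustrative:

```python
import time

class SessionCache:
    """Stdlib-only TTL cache sketch (production code uses cachetools.TTLCache)."""

    def __init__(self, maxsize=1000, ttl=300):
        self.maxsize, self.ttl = maxsize, ttl
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: skip the expensive LLM call
        if len(self._store) >= self.maxsize:
            self._store.pop(next(iter(self._store)))  # evict oldest insertion
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value
```

The cache key would typically be a hash of the normalized query plus the session ID, so rephrased-but-identical checks hit while other sessions never see each other's results.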

Citation Management: Accuracy and Transparency

RAG systems must provide citations, but naive implementations show citations even when the LLM ignores retrieved content. This erodes trust.

Our architecture implements two-phase citation validation:

Phase 1: Content Usage Verification

After generating a response, verify that the LLM actually used the retrieved content:

async def _verify_kb_content_usage(answer, documents, query):
    # Use a fast LLM to judge whether the answer contains information from
    # the documents; returns True only if KB content was demonstrably used
    prompt = build_verification_prompt(query, answer, documents)  # helper name illustrative
    result = await verifier_llm.ainvoke(prompt)
    return result.content.strip().upper().startswith("YES")

This prevents showing citations when the LLM generates generic responses or uses its training data instead of retrieved content.

Phase 2: Strict KB Usage Mode

If verification fails but we have good documents, retry with a stricter prompt that enforces KB usage:

if not kb_content_used and documents:
    strict_answer = await self._generate_response(
        query, documents, conversation_history,
        enforce_kb_usage=True
    )

This two-phase approach balances flexibility (allowing the LLM to use its knowledge when appropriate) with accuracy (ensuring citations match content).

Critical detail: Citations are extracted from document metadata, deduplicated by URL, and limited to a maximum count (default: 5). Each citation includes:

  • Title (cleaned of markdown/HTML artifacts)
  • URL (validated and formatted)
  • Relevance score (optional, configurable)
  • Source type (internal vs. external)
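The dedup-and-cap logic described above is a small pure function. Here's a sketch assuming documents carry a `metadata` dict with `url`, `title`, and `source_type` keys (the field names are illustrative):

```python
def extract_citations(documents, max_citations=5):
    """Deduplicate citations by URL, preserving retrieval order, capped at max_citations."""
    seen_urls = set()
    citations = []
    for doc in documents:
        meta = doc.get("metadata", {})
        url = meta.get("url")
        if not url or url in seen_urls:
            continue  # skip documents without a URL or already-cited pages
        seen_urls.add(url)
        citations.append({
            "title": meta.get("title", "").strip(),
            "url": url,
            "source_type": meta.get("source_type", "internal"),
        })
        if len(citations) == max_citations:
            break
    return citations
```

Deduplicating in retrieval order means the highest-ranked chunk from each page supplies its citation, which keeps the list stable across minor reorderings of lower-ranked chunks.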

Guardrails: Input vs. Output

A subtle but critical architectural decision: where to apply guardrails.

Input guardrails: Applied to user queries before processing. Blocks inappropriate content, off-topic queries, and potential prompt injections.

Output guardrails: Applied to LLM-generated responses. Masks sensitive information like phone numbers and email addresses.

The mistake: applying output guardrails to RAG responses. When your knowledge base contains legitimate contact information (like a Better Business Bureau phone number), output guardrails mask it as {PHONE}, breaking the user experience.

Solution: Disable output guardrails for RAG responses. Input guardrails already prevented inappropriate queries, and the knowledge base is curated content. The response generator uses:

llm = LLMClientFactory.create_client(
    task_type="conversation",
    enable_guardrails=False  # Disable for KB-sourced responses
)

This architectural decision requires careful consideration of your threat model and content sources.

Configuration-Driven Design

Production systems need tunable parameters without code changes. Our architecture uses hierarchical configuration:

class KnowledgeBaseConfig:
    # Retrieval parameters
    kb_min_score: float = 0.6
    kb_results_per_source: int = 2
    internal_kb_weight: float = 0.5
    kb_search_type: str = "HYBRID"  # or "SEMANTIC"

    # Citation parameters
    enable_citations: bool = True
    show_citation_scores: bool = True
    max_citations: int = 5

    # Filtering parameters
    enable_domain_scope_filtering: bool = True
    domain_filter_min_confidence: float = 0.6

All parameters are environment-variable configurable, allowing A/B testing and gradual rollout of changes.

Operational insight: Start with conservative settings (higher min_score, fewer results) and relax based on quality metrics. It’s easier to increase recall than to fix precision problems in production.
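One way to wire environment-variable overrides onto a config class like the one above is to derive each variable name from the field name and coerce by the default's type. This is a sketch with a reduced field set; the naming convention and coercion rules are illustrative:

```python
import os
from dataclasses import dataclass

@dataclass
class KBConfig:
    kb_min_score: float = 0.6
    kb_results_per_source: int = 2
    enable_citations: bool = True

    @classmethod
    def from_env(cls) -> "KBConfig":
        cfg = cls()
        for name in vars(cfg):
            raw = os.environ.get(name.upper())  # e.g. KB_MIN_SCORE=0.7
            if raw is None:
                continue  # keep the coded default
            current = getattr(cfg, name)
            if isinstance(current, bool):  # check bool first: bool subclasses int
                setattr(cfg, name, raw.strip().lower() in ("1", "true", "yes"))
            else:
                setattr(cfg, name, type(current)(raw))
        return cfg
```

Because overrides are read at startup, an A/B test is just two deployments with different environment blocks and no code diff to review.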

Error Handling and Fallbacks

Production systems fail gracefully. Our architecture implements multiple fallback layers:

Retrieval Failures

If knowledge base retrieval fails, return a helpful fallback message rather than an error:

if not documents:
    return self._create_fallback_response(query)

The fallback message is configurable and can suggest alternative actions (contact support, try rephrasing, etc.).

LLM Failures

If response generation fails, the system falls back to showing retrieved documents directly with citations. This provides value even when the LLM is unavailable.

Filtering Failures

Domain filtering uses a fail-open strategy. If classification fails, allow the content through rather than blocking potentially relevant information. Log the failure for investigation.

Philosophy: Degraded functionality is better than no functionality. Each component has a defined fallback behavior.

Observability and Debugging

Production RAG systems need comprehensive logging for debugging and optimization:

logger.info(f"Query expansion generated {len(queries)} queries")
logger.info(f"Retrieved {len(documents)} documents from {sources_used} sources")
logger.info(f"Domain filter removed {removed_count} non-relevant documents")
logger.info(f"KB content verification: {kb_content_used}")

The logging strategy captures:

  • Query transformations (expansion, routing decisions)
  • Retrieval metrics (document counts, sources, scores)
  • Filtering decisions (what was removed and why)
  • Performance metrics (latency, cache hit rates, throttling events)
  • Citation extraction (how many, from which sources)

Operational practice: Use structured logging with correlation IDs to trace a single query through the entire pipeline. This is invaluable for debugging quality issues.
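A lightweight way to get correlation IDs into every log line without threading them through each function is a `contextvars` variable plus a logging filter. This sketch is one possible wiring, not the production logging stack:

```python
import logging
import uuid
from contextvars import ContextVar

# Current request's trace ID; contextvars keep it correct across async tasks
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()  # stamp every record
        return True

def start_trace() -> str:
    """Call once per incoming query; all downstream logs share the same ID."""
    cid = uuid.uuid4().hex[:8]
    correlation_id.set(cid)
    return cid

logger = logging.getLogger("rag")
logger.addFilter(CorrelationFilter())
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[%(correlation_id)s] %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

With this in place, grepping logs for one eight-character ID reconstructs a single query's path through expansion, retrieval, filtering, and citation extraction.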

Performance Trade-offs and Tuning

Every architectural decision involves trade-offs. Here are the key ones:

Query Expansion: Quality vs. Latency

  • More query variations = better recall, higher latency and cost
  • Our optimization: 2 variations (down from 3) = 25% fewer KB calls
  • Trade-off: Slightly lower recall for significantly better performance

Domain Filtering: Precision vs. Recall

  • Stricter filtering = fewer false positives, more false negatives
  • Our setting: 0.6 confidence threshold
  • Trade-off: Blocks some borderline-relevant content to prevent off-topic responses

Search Type: HYBRID vs. SEMANTIC

  • HYBRID = semantic + keyword matching, better for exact terms
  • SEMANTIC = vector search only, better for conceptual queries
  • Our choice: HYBRID for production (handles both cases)
  • Trade-off: Slightly higher latency for significantly better accuracy

Parallel Retrieval: Speed vs. Cost

  • Parallel execution = faster responses, more concurrent API calls
  • Sequential execution = slower responses, lower peak load
  • Our choice: Parallel with throttling protection
  • Trade-off: Higher infrastructure requirements for better user experience

Lessons Learned

Building this system taught several valuable lessons:

  1. Agentic beats rules: LLM-based query expansion, routing, and filtering outperform hand-crafted rules. The flexibility to handle edge cases is worth the additional LLM calls.
  2. Async is non-negotiable: Synchronous RAG systems don’t scale. The performance difference between sequential and parallel retrieval is dramatic.
  3. Citations require validation: Don’t show citations unless you verify the LLM used the content. Users notice when citations don’t match answers.
  4. Fail-open for filtering: When in doubt, allow content through. False negatives (missing relevant content) are worse than false positives (showing borderline content) in most applications.
  5. Configuration is architecture: Make everything tunable. You’ll need to adjust thresholds based on real usage patterns.
  6. Observability from day one: You can’t optimize what you can’t measure. Comprehensive logging is not optional.

Conclusion

Production RAG systems require careful architectural planning beyond basic retrieval. The patterns shared here – multi-stage pipelines, intelligent routing, domain filtering, async execution, and comprehensive error handling – are applicable across industries and use cases.

The key is balancing quality, performance, and cost through thoughtful design decisions and continuous optimization based on real-world usage.

Start with a solid foundation, make everything configurable, and iterate based on metrics. Your RAG system will evolve as you learn from production traffic, but these architectural patterns provide a robust starting point.
