The Challenge
When we hit 500K daily AI requests in Q3 2024, our naive RAG implementation started showing cracks. p95 latency spiked to 800ms, vector search was eating 40% of our compute budget, and our chunking strategy was producing hallucinations on long-form documents.
We needed to rebuild the entire pipeline — while keeping the service running for 8,000+ teams depending on it daily.
Architecture Overview
Our RAG pipeline has four stages: document ingestion, chunking & embedding, retrieval, and generation. Each stage had its own scaling bottleneck, and we ended up redesigning all four.
Stage 1: Document Ingestion
The first bottleneck was parsing. We support 15+ document formats (PDF, DOCX, HTML, Markdown, Google Docs, Notion exports), and each parser had different failure modes. PDFs with scanned images would silently drop content. Google Docs with nested tables would corrupt formatting.
Our solution: a unified document AST (Abstract Syntax Tree) that normalizes all formats into a common representation before chunking. This added 50ms to ingestion but eliminated an entire class of parsing bugs.
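The idea can be sketched with a minimal node type that every format-specific parser targets. The `DocNode` shape and the toy Markdown parser below are illustrative assumptions, not our production schema:

```python
from dataclasses import dataclass, field

# Hypothetical AST node: every format-specific parser emits this shape,
# so chunking only ever sees one representation.
@dataclass
class DocNode:
    kind: str                      # "document", "heading", "paragraph", "code", "table", ...
    text: str = ""
    level: int = 0                 # heading depth, list nesting, etc.
    children: list["DocNode"] = field(default_factory=list)

def parse_markdown(raw: str) -> DocNode:
    """Toy Markdown parser (headings and paragraphs only) to show the shape."""
    root = DocNode(kind="document")
    for block in raw.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            level = len(block) - len(block.lstrip("#"))
            root.children.append(DocNode(kind="heading", text=block.lstrip("# "), level=level))
        else:
            root.children.append(DocNode(kind="paragraph", text=block))
    return root

doc = parse_markdown("# Setup\n\nInstall the CLI.")
print([(n.kind, n.text) for n in doc.children])
# → [('heading', 'Setup'), ('paragraph', 'Install the CLI.')]
```

A parser for each of the 15+ formats only has to get one thing right: emitting well-formed `DocNode` trees. Everything downstream is format-agnostic.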
Stage 2: Smart Chunking
The standard approach — split on token count with overlap — produces terrible results for structured content. A 512-token chunk might split a code example in half, or separate a heading from its body.
We built a semantic chunker that respects document structure: headings, lists, code blocks, and tables stay intact. Chunks are sized from 256 to 1024 tokens based on content type, with semantic overlap (we embed the last sentence of the previous chunk as context).
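A minimal sketch of the structure-aware pass, assuming blocks arrive already typed (e.g. from the document AST) and approximating token counts by word counts. The bounds and the atomic-kind list are illustrative:

```python
MIN_TOKENS, MAX_TOKENS = 256, 1024  # chunk size bounds (approximated by word count)

def chunk_blocks(blocks):
    """blocks: list of (kind, text) pairs. Atomic kinds (code, table) are
    never split; prose accumulates until the chunk reaches MIN_TOKENS."""
    chunks, current, size = [], [], 0

    def flush():
        nonlocal current, size
        if current:
            chunks.append(" ".join(current))
            current, size = [], 0

    for kind, text in blocks:
        n = len(text.split())
        if kind in ("code", "table"):
            flush()
            chunks.append(text)          # keep atomic blocks intact
        elif size + n > MAX_TOKENS:
            flush()
            current, size = [text], n
        else:
            current.append(text)
            size += n
            if size >= MIN_TOKENS:
                flush()
    flush()
    return chunks
```

The semantic overlap described above would be layered on afterwards, by prepending each chunk's embedding input with the final sentence of its predecessor.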
Stage 3: Hybrid Retrieval
Pure vector search misses exact matches (product names, API endpoints, error codes). Pure keyword search misses semantic similarity. We use a hybrid approach: BM25 for keyword relevance + cosine similarity for semantic matching, with a learned re-ranker that combines both scores.
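Before the learned re-ranker enters the picture, the two scores have to live on a comparable scale. Here is a sketch of a simple min-max fusion baseline; the fixed `alpha` weight stands in for what the cross-encoder learns, and the scores and document IDs are made up:

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} map onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(bm25, cosine, alpha=0.5):
    """Fuse BM25 (keyword) and cosine (semantic) scores with a weighted sum."""
    b, c = normalize(bm25), normalize(cosine)
    docs = set(b) | set(c)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * c.get(d, 0.0) for d in docs}
    return sorted(docs, key=fused.get, reverse=True)

bm25   = {"doc1": 12.0, "doc2": 3.0, "doc3": 7.5}   # raw BM25 scores
cosine = {"doc1": 0.62, "doc2": 0.91, "doc3": 0.40}  # embedding similarities
print(hybrid_rank(bm25, cosine))
# → ['doc1', 'doc2', 'doc3']
```

The weighted sum is a reasonable default, but it treats the trade-off as global; the learned re-ranker can instead weigh keyword versus semantic evidence per query-document pair.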
The re-ranker is a small cross-encoder model fine-tuned on our query-document pairs. It added 30ms of latency but improved retrieval precision by 23%.
Stage 4: Context-Aware Generation
The final piece: how we feed retrieved context to the LLM. We experimented with three approaches — stuff all chunks into context, map-reduce over chunks, and iterative refinement. For most queries, a compressed context window with the top-5 chunks performs best.
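The winning approach can be sketched as a budget-aware prompt builder. The budget, the separator, and the prompt template below are hypothetical placeholders, and token counts are again approximated by word counts:

```python
TOP_K = 5              # number of retrieved chunks to include
CONTEXT_BUDGET = 3000  # approximate tokens reserved for retrieved context

def build_prompt(query, ranked_chunks):
    """Stuff the top-K ranked chunks into the prompt, stopping early
    if the context budget would be exceeded."""
    picked, used = [], 0
    for chunk in ranked_chunks[:TOP_K]:
        n = len(chunk.split())
        if used + n > CONTEXT_BUDGET:
            break
        picked.append(chunk)
        used += n
    context = "\n---\n".join(picked)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Capping both the chunk count and the total budget is what keeps the context window small, which also feeds directly into the compute savings described below.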
Results
After the rebuild, our p95 latency dropped from 800ms to 180ms. Hallucination rate on document-grounded queries fell from 12% to 1.3%. And our compute costs dropped 40% thanks to smarter caching and smaller context windows.
The key lesson: RAG is not a solved problem. Every component — parsing, chunking, retrieval, generation — has significant room for optimization, and the right approach depends entirely on your data and use case.