Engineering · 11 min read

We cut our LLM inference costs by 73% — here's how

Alex Rivera

Head of Engineering · Jan 31, 2025

The $47K Monthly Wake-Up Call

In October 2024, our LLM inference bill hit $47,000 for the month. We were processing 1.8M requests daily across GPT-4o, Claude 3.5, and Gemini — and growing 15% month-over-month. On that trajectory, we'd have hit $200K/month by mid-2025. Something had to change.

Three months later, we're processing 2.1M daily requests for $12,700/month — a 73% cost reduction while handling 17% more traffic. Here's exactly what we did.

Strategy 1: Intelligent Model Routing

Not every request needs GPT-4o. Our analysis showed that 62% of requests were simple tasks (summarization, reformatting, basic Q&A) that smaller, cheaper models handle equally well. We built a router that classifies incoming requests by complexity and routes them to the most cost-effective model.

Simple tasks (62%) go to GPT-4o-mini at $0.15/1M input tokens. Medium complexity (28%) goes to Claude 3.5 Sonnet. Only complex tasks requiring deep reasoning or long-context understanding (10%) hit GPT-4o at $2.50/1M input tokens.

The router itself is a small classifier fine-tuned on 50,000 labeled request-complexity pairs. It adds <5ms latency and reduced our average cost per request by 44%.
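To make the dispatch logic concrete, here's a minimal sketch. The model IDs and the keyword heuristic standing in for the classifier are illustrative, not our production code (the real classifier is described in the next section):

```python
# Sketch of the routing layer: classify a request's complexity,
# then dispatch to the cheapest model in that tier.
# Tier percentages and prices come from our traffic analysis above.
ROUTES = {
    "simple": "gpt-4o-mini",        # ~62% of traffic, $0.15/1M input tokens
    "medium": "claude-3.5-sonnet",  # ~28% of traffic
    "complex": "gpt-4o",            # ~10% of traffic, $2.50/1M input tokens
}

def classify(request_text: str) -> str:
    """Stand-in for the complexity classifier.

    A crude keyword heuristic for illustration only; production uses
    a fine-tuned DistilBERT model.
    """
    if any(k in request_text.lower() for k in ("prove", "derive", "step by step")):
        return "complex"
    if len(request_text.split()) > 200:
        return "medium"
    return "simple"

def route(request_text: str) -> str:
    """Return the model ID that should serve this request."""
    return ROUTES[classify(request_text)]
```

Because the router sits in front of every call, keeping its decision logic this simple (a lookup table keyed by a single classifier output) is what lets it stay under 5ms.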

How We Built the Classifier

We sampled 10,000 historical requests and had three annotators label each as simple, medium, or complex based on the output quality across models. Inter-annotator agreement was 87%. We fine-tuned a DistilBERT model on these labels — it's tiny (66M params), fast (3ms inference), and achieves 91% accuracy on our held-out test set.
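For context on that 87% figure, raw pairwise agreement across three annotators can be computed like this. This is a simple sketch of the measurement; a chance-corrected statistic such as Fleiss' kappa would be a reasonable alternative:

```python
from itertools import combinations

def pairwise_agreement(annotations: list[list[str]]) -> float:
    """Fraction of annotator pairs that agree, averaged over all items.

    `annotations` holds one list of labels per item, one entry per
    annotator (e.g. ["simple", "simple", "medium"]).
    """
    total = matches = 0
    for labels in annotations:
        for a, b in combinations(labels, 2):
            total += 1
            matches += (a == b)
    return matches / total
```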

Strategy 2: Semantic Caching

Many requests are semantically identical even if the exact wording differs. "Summarize this article" and "Give me a summary of this piece" should return the same result for the same input document.

We implemented a semantic cache: each request is embedded using a small embedding model, and we check if any cached response has a cosine similarity above 0.95 with the incoming request. Cache hits skip the LLM entirely.

Our cache hit rate stabilized at 23% — meaning nearly a quarter of all requests are served from cache at effectively zero marginal cost. For popular templates and repeated workflows, the hit rate exceeds 40%.

Strategy 3: Prompt Compression

System prompts are expensive because they're sent with every request. Our average system prompt was 1,200 tokens — and for enterprise customers with custom brand guidelines, it could exceed 3,000 tokens.

We built a prompt compression layer that uses a technique inspired by LLMLingua: it identifies and removes redundant tokens from system prompts while preserving semantic meaning. Average compression ratio: 2.4×, meaning a 1,200-token prompt becomes ~500 tokens with no measurable impact on output quality.

We validated this with A/B testing: compressed vs. uncompressed prompts across 100,000 requests, measured by human evaluation scores. The difference was not statistically significant (p=0.42).
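To illustrate the shape of the transform (and only the shape), here is a deliberately crude stand-in: dropping filler words and collapsing whitespace. The real compressor scores token informativeness with a small language model, LLMLingua-style; the word list below is an arbitrary illustration, not our actual filter:

```python
import re

# Toy filler-word list for illustration only — the production system
# scores every token with a small LM rather than using a fixed list.
FILLER = {
    "please", "kindly", "very", "really", "just", "that",
    "the", "a", "an", "in", "of", "to", "and", "or",
}

def compress_prompt(prompt: str) -> str:
    """Drop low-information words and collapse whitespace."""
    words = re.findall(r"\S+", prompt)
    kept = [w for w in words if w.lower().strip(".,;:") not in FILLER]
    return " ".join(kept)
```

Even a toy version like this makes the trade-off visible: every dropped token saves money on every future request, but also risks dropping an instruction the model needed, which is why the A/B validation above was non-negotiable.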

Strategy 4: Batching and Scheduling

Not all requests are time-sensitive. Batch operations (tone analysis, content scoring, bulk reformatting) can be queued and processed during off-peak hours when API pricing is lower and rate limits are more forgiving.

We introduced a priority queue with three tiers: real-time (interactive chat, <2s SLA), near-time (background enrichment, <30s SLA), and batch (bulk operations, <5min SLA). Batch tier requests are accumulated and sent in optimized batches of 20-50, reducing per-request overhead.
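A sketch of that scheduler, using a stdlib heap. The tier names and the 20-50 batch size come from the setup above; the scheduling details (FIFO tie-breaking, greedy batch accumulation) are illustrative assumptions:

```python
import heapq

TIER_PRIORITY = {"real-time": 0, "near-time": 1, "batch": 2}
BATCH_SIZE = 20  # batch-tier requests are flushed in groups of 20-50

class Scheduler:
    """Three-tier priority queue with batch accumulation."""

    def __init__(self):
        self._heap: list[tuple[int, int, object]] = []
        self._counter = 0  # tie-breaker keeps FIFO order within a tier

    def submit(self, tier: str, request) -> None:
        heapq.heappush(self._heap, (TIER_PRIORITY[tier], self._counter, request))
        self._counter += 1

    def next_batch(self) -> list:
        """Pop the highest-priority request; if it is batch-tier,
        greedily pull up to BATCH_SIZE batch requests into one call."""
        if not self._heap:
            return []
        prio, _, req = heapq.heappop(self._heap)
        if prio < TIER_PRIORITY["batch"]:
            return [req]  # real-time and near-time go out individually
        batch = [req]
        while (self._heap
               and self._heap[0][0] == TIER_PRIORITY["batch"]
               and len(batch) < BATCH_SIZE):
            batch.append(heapq.heappop(self._heap)[2])
        return batch
```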

The Results

Combining all four strategies, our cost per request dropped from $0.026 to $0.006 — a 77% reduction at the per-request level. After accounting for our 17% traffic growth, that translates to 73% lower monthly costs.
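The numbers reconcile as a quick arithmetic check, using the figures above:

```python
# Per-request cost fell 77%; traffic grew ~17%; together they give
# the 73% monthly reduction. All inputs are the figures quoted above.
old_cost, new_cost = 0.026, 0.006   # $ per request
traffic_growth = 2.1 / 1.8          # 1.8M -> 2.1M daily requests

per_request_reduction = 1 - new_cost / old_cost                  # ~0.77
monthly_reduction = 1 - (new_cost / old_cost) * traffic_growth   # ~0.73

# Cross-check against the raw bills: $47,000 -> $12,700
monthly_reduction_from_bills = 1 - 12_700 / 47_000               # ~0.73
```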

Latency actually improved: p50 dropped from 1.2s to 0.8s (thanks to model routing sending simple requests to faster models) and p95 dropped from 3.4s to 1.9s. Cache hits return in <50ms.

What We'd Do Differently

If we were starting over, we'd implement semantic caching on day one. It's the highest ROI optimization with the lowest implementation complexity. Model routing requires ongoing maintenance as new models launch and pricing changes. Prompt compression is powerful but requires careful validation. Start with caching, then layer in routing, then optimize prompts.


Alex Rivera

Head of Engineering at Aria

Passionate about building AI systems that amplify human creativity. Previously at Google DeepMind and Stanford NLP Group.
