The Ambition
In Q4 2024, we set an ambitious goal: make Aria work natively in 48 languages. The aim was not just to translate prompts and responses, but to genuinely understand and generate content that reads naturally to native speakers, with correct grammar, idiomatic expressions, and culturally appropriate tone.
Six months later, we support 48 languages with an average quality score of 8.2/10 (rated by native-speaker evaluators). Here's what we learned along the way.
Challenge 1: Evaluation at Scale
The first problem isn't technical — it's organizational. How do you evaluate content quality in 48 languages when your team speaks 8? You can't rely on BLEU scores or other automated metrics for creative content; they correlate poorly with human-perceived quality.
We built a network of 96 native-speaker evaluators (2 per language) who rate AI-generated content on five dimensions: grammar, naturalness, tone accuracy, cultural appropriateness, and task completion. Evaluators are compensated at above-market rates and receive calibration training to ensure consistency.
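Aggregation of those ratings could look something like the sketch below. The dimension names mirror the five rubric dimensions from the text; the function names and the flat averaging scheme are assumptions, not Aria's actual scoring pipeline.

```python
from statistics import mean

# The five rubric dimensions each evaluator rates (1-10 scale assumed).
DIMENSIONS = [
    "grammar",
    "naturalness",
    "tone_accuracy",
    "cultural_appropriateness",
    "task_completion",
]

def sample_score(ratings: dict[str, float]) -> float:
    """Average one evaluator's ratings for one sample across dimensions."""
    return mean(ratings[d] for d in DIMENSIONS)

def language_score(rated_samples: list[dict[str, float]]) -> float:
    """Mean sample score for a language, pooled across its evaluators."""
    return mean(sample_score(r) for r in rated_samples)
```

A weighted average (e.g. weighting cultural appropriateness more heavily for marketing copy) would slot in the same place.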
The Calibration Challenge
"Good writing" means different things in different cultures. German business writing favors formal, structured prose. Brazilian Portuguese marketing copy is warm and conversational. Japanese documentation uses specific honorific patterns that AI frequently gets wrong.
We spent four weeks building language-specific evaluation rubrics with our evaluators. Each rubric defines what "natural" means for that language across content types. This was the highest-ROI investment of the entire project.
Challenge 2: Tokenizer Bias
Modern LLM tokenizers are heavily biased toward English. The sentence "The cat sat on the mat" tokenizes into 7 tokens. The equivalent in Japanese (猫がマットの上に座った) can require 15-20 tokens. This means Japanese users pay 2-3× more per request and hit context window limits faster.
We addressed this with two approaches: (1) language-aware prompt compression that's more aggressive for high-token-density languages, and (2) cost normalization in our billing so users pay per "semantic unit" rather than per token. A 1,000-word blog post costs the same regardless of language.
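The normalization idea can be sketched in a few lines. The density factors below are illustrative placeholders (not Aria's real numbers); they express roughly how many tokens a language needs to carry the same content as one English token.

```python
# Hypothetical per-language token-density factors relative to English.
TOKEN_DENSITY = {
    "en": 1.0,   # baseline
    "de": 1.2,   # slightly denser tokenization
    "ja": 2.5,   # CJK text often needs far more tokens per sentence
}

def billable_units(token_count: int, lang: str) -> float:
    """Normalize raw token usage into 'semantic units' so that
    equivalent content costs the same regardless of language."""
    return token_count / TOKEN_DENSITY[lang]
```

Under this scheme, a Japanese request that consumed 2,500 tokens and an English request that consumed 1,000 tokens would bill the same number of units.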
Challenge 3: Voice Vectors Across Languages
Our voice vector system (128-dimensional style representation) was trained primarily on English content. It captures dimensions like formality, sentence complexity, and vocabulary sophistication — but these dimensions manifest differently across languages.
For example, formality in English is about word choice ("use" vs. "utilize"). In Japanese, it's about grammatical conjugation patterns. In French, it's about pronoun usage (tu vs. vous) and subjunctive mood frequency.
We retrained the voice vector model with multilingual data: 50,000 writing samples across 20 languages, labeled on the same 128 dimensions by native speakers. The resulting model captures cross-lingual style patterns — you can describe your voice in English, and Aria generates appropriately styled content in German.
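Because styles live in a shared vector space, comparing a voice described in one language against content generated in another reduces to a distance measure. A minimal sketch, assuming cosine similarity over the style vectors (the 3-dimensional toy vectors stand in for the real 128 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compare two style vectors, independent of source language."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy example: an English-described voice vs. a German generation,
# both projected into the same (here 3-dim) style space.
english_voice = [0.9, 0.2, 0.4]
german_output = [0.85, 0.25, 0.38]
```

A high similarity between the target voice vector and the vector extracted from generated text would indicate the style transferred across languages.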
Challenge 4: Cultural Localization
Translation is not localization. A US-focused case study that references "Series A funding" and "Y Combinator" needs complete reworking for a Japanese audience, not just word-for-word translation. Metaphors, examples, humor, and references must be culturally appropriate.
We built a "cultural context" layer that detects region-specific references in prompts and either adapts or flags them for human review. For example, if you ask Aria to "write a marketing email with a Black Friday promotion" for a Brazilian audience, it will suggest referencing local events instead.
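The detection side of such a layer might start as simply as a lexicon lookup before graduating to a model-driven classifier. A minimal sketch (the lexicon entries and function name are hypothetical; the real system is presumably much larger):

```python
import re

# Illustrative lexicon mapping region-specific references to the
# regions where they are widely understood.
REGION_SPECIFIC = {
    "Black Friday": {"US"},
    "Series A": {"US"},
    "Y Combinator": {"US"},
}

def flag_cultural_references(prompt: str, target_region: str) -> list[str]:
    """Return references in the prompt that may not land in the target region."""
    flagged = []
    for term, regions in REGION_SPECIFIC.items():
        if target_region not in regions and re.search(
            re.escape(term), prompt, re.IGNORECASE
        ):
            flagged.append(term)
    return flagged
```

Flagged terms can then be adapted automatically or routed to human review, as described above.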
Challenge 5: Right-to-Left and Script Complexity
Arabic, Hebrew, Persian, and Urdu required significant UI work. Text rendering, cursor behavior, and layout all needed RTL support. Mixed-direction content (Arabic text with embedded English code examples) was particularly challenging — we implemented the Unicode Bidirectional Algorithm with custom overrides for code blocks.
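The custom override for code blocks amounts to wrapping each embedded LTR span in Unicode directional isolates (LRI/PDI from UAX #9), so the bidirectional algorithm treats it as an opaque left-to-right island inside the RTL paragraph. A minimal sketch, assuming code spans have already been located as (start, end) offsets:

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate_code_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each code span in LTR isolates so embedded English or code
    renders left-to-right inside right-to-left text."""
    pieces, last = [], 0
    for start, end in sorted(spans):
        pieces.append(text[last:start])
        pieces.append(LRI + text[start:end] + PDI)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)
```

The isolate pair keeps the code's internal ordering from reordering the surrounding Arabic or Hebrew text, which is the failure mode plain embedding characters are prone to.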
For CJK languages (Chinese, Japanese, Korean), we dealt with input method editor (IME) compatibility. Real-time streaming of AI responses while the user's IME is active can cause input conflicts. We solved this by buffering AI output during active IME composition.
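The buffering logic is a small state machine keyed off composition start/end events (in a browser, `compositionstart`/`compositionend`). A language-agnostic sketch; the class and callback names are illustrative, not Aria's actual code:

```python
class StreamBuffer:
    """Hold streamed AI output while an IME composition is active,
    then flush it once the composition ends."""

    def __init__(self, render):
        self.render = render           # callback that writes to the UI
        self.composing = False
        self.pending: list[str] = []

    def on_composition_start(self) -> None:
        self.composing = True

    def on_composition_end(self) -> None:
        self.composing = False
        if self.pending:
            self.render("".join(self.pending))
            self.pending.clear()

    def on_ai_chunk(self, chunk: str) -> None:
        if self.composing:
            self.pending.append(chunk)  # defer output during composition
        else:
            self.render(chunk)
```

Deferring rendering rather than dropping chunks means the user sees the full response once their IME candidate window closes, without the stream fighting the composition for the caret.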
Where We Are Now
48 languages live, with an average quality score of 8.2/10. Tier 1 languages (English, Spanish, French, German, Japanese, Portuguese, Chinese) score 8.7+. Tier 3 languages (Hungarian, Vietnamese, Thai) average 7.4 — still above our 7.0 quality bar, but with room to improve.
The biggest lesson: multilingual AI is not an engineering problem with an engineering solution. It's a cultural, linguistic, and organizational challenge that requires native speakers at every stage — from evaluation to product decisions. Our evaluator network is now one of our most valuable assets.