India’s AI Dream Is Getting Lost in Translation

India's ambition to become an AI superpower is being undermined by a persistent language gap that threatens to exclude the vast majority of its population. Prime Minister Narendra Modi has championed AI as a tool for inclusion and empowerment, and Silicon Valley sees India as a critical growth market, second only to the US in usage of major AI platforms. However, the country's linguistic diversity, with nearly two dozen official languages and over a hundred dialects, poses a fundamental barrier. If AI cannot understand Bengali voice notes, Gujarati payment queries, or code-switched Hindi-English business calls, it risks becoming yet another technology that divides the English-speaking elite from everyone else.

The core problem is data: Indic languages are severely underrepresented in training datasets, and even advanced models struggle with accuracy. One study found that GPT-5 achieved only about 45% accuracy on a benchmark covering 11 Indic languages, including Modi's mother tongue, Gujarati. While technical improvements have helped recent models perform better in low-resource languages, the gap persists, especially for speech, the most intuitive interface for many in developing regions. AI systems that cannot comprehend voice-based interactions will be useless for automating daily commerce and public services, and potentially dangerous in critical applications like healthcare and law.

India Inc. and global tech giants are racing to solve this, but the challenges of quality, ethics, and safety remain formidable. Startups like Sarvam AI are building models that understand local voices and documents, while OpenAI and the Indian government have launched evaluation frameworks and data-collection platforms. Yet crowdsourcing alone is not enough; a Stanford study warned that quality control and ethical concerns around fair pay and data sovereignty are critical.

What to watch next: Whether India's government will mandate language-inclusive AI standards and whether model builders can deliver systems that truly comprehend the country's linguistic diversity, or if the AI revolution will deepen existing inequalities.

Key Takeaways

India's linguistic diversity is the single biggest obstacle to its AI ambitions, risking exclusion of non-English speakers.
Current AI models show poor accuracy on Indic languages, with GPT-5 achieving only about 45% on a benchmark covering 11 of them.
Voice-based AI is critical for India, but speech data remains scarce, noisy, and poorly benchmarked.
Safety alignment degrades in low-resource languages, leaving the most vulnerable populations least protected from AI risks.

Insights & Analysis

The language gap in AI is not just a technical problem but a strategic vulnerability for India's economic and social inclusion goals.
Success in cracking India's language challenge could give any AI company a massive competitive advantage in the Global South, where similar linguistic diversity exists.

Original source

https://www.bloomberg.com/opinion/articles/2026-06-28/india-s-ai-dream-is-getting-lost-in-translation?srnd=homepage-asia

Full article body

When Indian Prime Minister Narendra Modi hosted world leaders and tech chiefs earlier this year in New Delhi, he declared that AI must be “democratized” and a “medium of inclusion and empowerment, especially across the Global South.”

It’s a convenient vision for Silicon Valley, which is in the midst of a land grab for the lucrative market. Young, tech-savvy and mobile-first, India has become one of the most important growth regions for AI, ranking behind only the US in usage for both OpenAI’s ChatGPT and Anthropic’s Claude.

But the key to both the “inclusion” and business dreams of tech diffusion is overcoming the barriers to speech. India has nearly two dozen official languages and more than a hundred dialects. If AI can’t close this gap, it will just become another technology that divides the English-speaking elite and everyone else. True localization will depend on whether models can comprehend Bengali voice notes, Gujarati payment queries, and code-switched Hindi-English business calls — all the messy and real-world spoken words that drive daily commerce and public life.

More than a billion people speak Indic languages. Yet one study found that GPT-5 achieved only about 45% accuracy on a human-curated benchmark covering 11 of them, including Modi’s mother tongue, Gujarati.

The first generation of AI tools was trained on internet text, most of which is in English. Technical improvements and better datasets have helped recent models improve in non-English and so-called “low-resource languages,” those with less data to train on. But the language gap persists, especially for speech, which is forecast to become the next mass way of interacting with models.

“Voice is the most intuitive interface for humans, especially in more developing regions,” Sandeep Chinchali, the co-founder of Poseidon, an Andreessen Horowitz-backed data infrastructure startup, told me. And South Asia, he added, “uses voice for everything,” with businesses running through phone calls, WhatsApp voice memos, speech-based payments and increasingly voice-enabled coding tools. AI systems that can’t comprehend these interactions will be useless in automating this work, not to mention potentially dangerous in public services.

One problem is a lack of proper benchmarks for non-English models. Leading models, for example, can’t even agree on what proper Bengali, a language spoken by more than 280 million people, should look like. The heart of the issue is still data, Chinchali says, and not just quantity (Bengali makes up less than 0.1% of web text) but also quality.

Spoken Indic languages add another layer of difficulty: regional variants, background noise, and frequent code-switching in technical and financial conversations. Speech data for AI training requires accurate transcription, longer clips, varied acoustic environments, demographic and regional variants, as well as careful human review before it can truly improve AI models. Systems trained on narrower datasets often fail in the real world, where conversations mix local slang and borrowed English words in varied settings.

India Inc. understands the stakes. Cracking the language challenge has become central to the country’s broader sovereign AI push. Building AI that works “at India’s scale” presents a massive opportunity, Pratyush Kumar, the co-founder of domestic AI hopeful Sarvam AI, said in a statement this month announcing a new funding round. That means models that “understand our voices” and “read our documents.” In April, the startup, on which so many are pinning their hopes of catching up in a US-China race, launched a new evaluation for Indic speech recognition, arguing that the standard metrics were not built for these languages and can distort how such systems are judged.

US tech giants are paying attention, too. OpenAI last year unveiled a framework for evaluating AI systems on Indian culture and language. And Modi’s government has launched a translation platform that also collects spoken data to improve multilingual models.

But crowdsourcing is no cure-all. High standards and human curation remain vital. A Stanford team warned in a paper last year that quality has become a key challenge when trying to scale such endeavors. It also raised ethical questions in a sector with a long history of poor pay and exploitation.

Poseidon’s Chinchali says it worked with a supplier in India that is committed to fair pay and is exploring blockchain tools that would give contributors even more say over how their data is deployed, for example whether it is used by companies within their own country or leaked out to train foreign AI tools. These are good steps and should become the baseline, not the exception.

The government is already forcing high school students to learn three languages, including two indigenous ones. If Modi is serious about making India an AI superpower, he should demand something similar from model builders and craft policy that ensures systems can truly comprehend the country’s linguistic diversity.

There’s also a safety issue. As AI moves into schools, hospitals, courts and public services, language failures have consequences. Researchers have found that safety alignment tends to deteriorate when people engage with AI in low-resource languages. That means the people most likely to be left behind by the tech revolution may also be the least protected from its risks.

Bridging the language divide is key to Silicon Valley cracking this next growth market. And Modi’s promise that it will deliver “happiness for all, welfare for all” will ring hollow if the technology cannot even understand the people it claims to empower.
