IN THE INTERNET age, when we face a language barrier, there is a host of internet resources to overcome it: things like translation apps, dictionary websites, versions of Wikipedia in other languages, and the simple "click to translate" option. But there are about 7000 languages spoken in the world today. The top 10 or so are spoken by hundreds of millions of people; the bottom third have 1000 speakers or fewer.
In the murky middle ground, though, are a couple hundred languages with speakers numbering in the millions. These midsize languages are still fairly widely spoken, but they have vastly inconsistent levels of support online. There's Swedish, which has 9.6 million speakers, the third-largest Wikipedia with over 3 million articles, and support in Google Translate, Bing Translate, Facebook, Siri, YouTube captions, and so on. But there's also Odia, the official language of the Odisha state in India, with 38 million speakers, which has no presence in Google Translate. And Oromo, a language spoken by some 34 million people, mostly in Ethiopia, which has just 772 articles in its Wikipedia.
Why do Greek, Czech, Hungarian, and Swedish, with their 8 to 13 million speakers, have Google Translate support and robust Wikipedia presences, while languages the same size or larger, like Bhojpuri (51 million), Fula (24 million), Sylheti (11 million), Quechua (9 million), and Kirundi (9 million) languish in technological obscurity?
Part of the reason is that Greek, Czech, Hungarian, and Swedish are among the 24 official languages of the European Union, which means that a small army of human translators translates many official European Parliament documents every year. Human-translated documents make a great base for what linguists call a parallel corpus — a large mass of text that's equivalent, sentence-by-sentence, in multiple languages. Machine translation engines use parallel corpora to figure out regular correspondences between languages: if "regering" or "κυβέρνηση" or "kormány" or "vláda" all frequently appear in parallel to "government," then the machine concludes these words are equivalent.
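The core of that idea can be sketched in a few lines: count how often each word pair co-occurs across aligned sentences, then pick the most frequent partner. This is only a toy illustration — real translation engines use statistical alignment models and, these days, neural networks — and the three-sentence English–Swedish "corpus" below is invented for demonstration.

```python
from collections import Counter

# Toy parallel corpus: (English, Swedish) sentence pairs.
# Real systems train on millions of such pairs, e.g. EU proceedings.
parallel = [
    ("the government fell", "regeringen föll"),
    ("the government voted", "regeringen röstade"),
    ("the minister voted", "ministern röstade"),
]

# Count how often each (English word, Swedish word) pair appears
# together in aligned sentences.
cooccur = Counter()
for en, sv in parallel:
    for e in en.split():
        for s in sv.split():
            cooccur[(e, s)] += 1

def best_match(english_word):
    """Return the Swedish word that most often co-occurs with it."""
    candidates = {s: c for (e, s), c in cooccur.items() if e == english_word}
    return max(candidates, key=candidates.get)
```

Because "regeringen" appears alongside "government" in two of the three sentence pairs — more than any other Swedish word — `best_match("government")` returns `"regeringen"`. With enough parallel text, this kind of co-occurrence counting is what lets an engine work out that "regering," "κυβέρνηση," "kormány," and "vláda" all line up with "government."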
In order to be reasonably effective, machine translation requires an enormous parallel corpus for each language. Ideally, this corpus contains documents from a variety of genres: not just parliamentary proceedings but news reports, novels, film scripts, and so on. The machine can't translate informal social media posts very well if it's been trained only on formal legal documents. Translation tools are already scraping the bottom of the parallel corpus barrel: In many languages, the largest parallel translated text is the Bible, which leads to peculiar circumstances where Google translates nonsense syllables into prophecies of doom.
In addition to EU documents, Swedish, Greek, Hungarian, and Czech have a wealth of language resources, created one human at a time over centuries. They're the languages of entire nation-states, with national TV and radio recordings that can be used as the foundation for text-to-speech models. Their speakers have the kind of disposable income that makes media companies translate popular novels and subtitle foreign movies and TV shows. They're found in countries that tech companies imagine their customers might be living in or might at least visit on holiday, meaning it's worth localizing interfaces and adding them as translation options. They have regularized spelling systems and dictionaries that can be rolled into spellcheckers and predictive text models. They have highly literate speakers with internet access who can contribute to projects like Wikipedia. (Speakers who can even, in the case of Swedish, create a bot to automatically make basic Wikipedia articles for rivers, mountains, and other natural features.)
Language resources don't just appear. People have to decide to create them, and those people need to be fed and watered and educated and housed and supported, whether that's by governments or by companies or by the kind of personal wealth that lets individuals take on time-consuming intellectual hobbies. Creating parallel corpora and other language resources takes years, if it happens at all, and costs tens of millions of dollars per language.
Meanwhile, we know that catastrophes periodically happen around the world: earthquakes, floods, hurricanes, cyclones, diseases, famines, fires. Some of them will happen in areas where people speak a large, well-resourced language, and organizations will rush to their aid. But the odds are good that some of the world's future crises will happen in areas where people speak one of these medium-size but low-resource languages. In those cases, aid organizations and governments will face an urgent language barrier.
The problem is, we don't know which language will desperately need the world's attention next. When an earthquake hit Haiti in 2010, international organizations suddenly required Haitian Creole resources. Ebola outbreaks in West and Central Africa have affected speakers of languages like Swahili, Nande, Mbuba, Krio, Mende, and Themne. Asylum seekers from Central America often speak languages like Zapotec, Q'anjob'al, K'iche', and Mam. These speakers aren't the ideal customers of big tech companies. They don't have leisure time to edit Wikipedia. They may not even be literate in their mother tongue, communicating by voice memo instead of by text message. But when a crisis hits, internet communication tools will be crucial.
Researchers at Darpa, the Defense Advanced Research Projects Agency, decided to tackle the problem by rethinking the way we translate languages. Instead of creating language-specific tools, Darpa is attempting to build language-agnostic tools that, once created, could spring into action in times of crisis and be tuned to any language with minor tweaking — even if all that's available is monolingual text scraped from social media rather than carefully translated parallel corpora.
They also changed their goals. It's too hard to jump right to full-blown machine translators that produce idiomatic prose, according to Dr. Boyan Onyshkevych, program manager at Darpa's Information Innovation Office. Instead, they carve out more manageable tasks, such as linking all the proper nouns in a passage with their equivalents in a more widely-spoken language. Automatically identifying entities in this way can help provide clues about the overall situation — say, which rivers are flooding, which villages are affected by an outbreak, or which people are missing.
Darpa funds researchers year-round at a couple dozen universities and companies; then, twice a year, it tests them in a "linguistic crisis simulation" event, where teams of researchers translate imaginary catastrophe reports in a surprise mystery language. For the first round, the teams have 24 hours to figure out as much useful information as possible from social media, blogs, and news reports, with the help of a few resources like a basic dictionary and an hour of time with a native speaker of the language. Then Darpa adds in more social media data and more time with a speaker, and the teams go at it again. Later, the results and data sets from such simulations are often published online so they can eventually be rolled into tools like Siri and Google Translate.
Methods like these use the resources of the internet age to solve the problems of the internet age. Smaller languages may not have extensive books or parliamentary records to train a language processor; they may not have very many professional translators. But they do have thousands or millions of speakers hanging out on social media and posting, like all of us do, about the weather and what they had for lunch. These posters are potentially sowing the seeds of their own survival, should catastrophe strike — their tweets and blog posts could get scooped up to teach the rest of the world how to help.