Services like Google Translate and Bing work great with English and Spanish, because they have plentiful and deep data sets to draw upon: lots of text exists in both of those languages. But the trouble with big data is that it needs big data. This leaves languages like Galician, Welsh, and Faroese out in the cold, translation-wise, because there’s just not much of them online to work with.
So linguists from the University of Copenhagen found a different solution for translating these minority languages: the Bible. They didn’t just pray for better algorithms, though. “The Bible has been translated into more than 1,500 languages, even the smallest and most ‘exotic’ ones,” says Anders Søgaard, a professor at the University of Copenhagen. “The translations are extremely conservative; the verses have a completely uniform structure across the many different languages, which means that we can make suitable computer models of even very small languages where we only have a couple of hundred pages of biblical text.”
The Bible acts like a Rosetta Stone for all these languages, not only giving exact translations of words, but allowing the researchers’ computers to apply knowledge of one well-understood language to another “low-resource” language.
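To see why uniform verse structure matters, here is a minimal sketch of how verse-aligned text can seed a bilingual word list. The verse texts and Faroese “translations” below are toy data invented for illustration, not the researchers’ actual corpus or method: because every translation shares the same verse IDs, the same ID picks out the same sentence in every language, and words that repeatedly co-occur in aligned verses become candidate translations.

```python
# Toy sketch: verse IDs give free sentence alignment across translations.
# All verse texts here are illustrative stand-ins, not real corpus data.
from collections import Counter
from itertools import product

english = {
    "John 3:16": "for god so loved the world",
    "Gen 1:3":   "and god said let there be light",
}
faroese = {  # invented "translations" for illustration only
    "John 3:16": "ti so elskadi gud verdina",
    "Gen 1:3":   "og gud segdi verdi ljos",
}

# Count how often each (English word, Faroese word) pair shares a verse.
cooc = Counter()
for verse_id in english:
    for e, f in product(english[verse_id].split(), faroese[verse_id].split()):
        cooc[(e, f)] += 1

# Pairs that co-occur in more than one verse are candidate translations.
candidates = [pair for pair, n in cooc.items() if n > 1]
print(candidates)  # [('god', 'gud')] -- the only pair shared by both verses
```

Real systems use far more verses and statistical alignment models rather than raw counts, but the principle is the same: alignment comes for free from the verse numbering.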
The other tool that Søgaard and his team are using is Wikipedia. Wikipedia comes in 129 languages that share around 10,000 articles. And while these articles are not exact translations of each other, their narrow focus makes them similar enough to be used. In fact, the lack of direct translation might even be an advantage, as it lets the machines see how different languages express the same concepts, which is seldom word-for-word equivalent.
“This allows us to do what we call ‘inverted indexing,’” says Søgaard. “If the English word ‘glasses’ appears in the English Wikipedia entry on Harry Potter, and the German word ‘Brille’ is used in the equivalent German entry, it is very likely that the two words will be represented in a similar fashion in our models.”
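The idea behind inverted indexing can be sketched in a few lines. In this toy version (the article texts are invented, and the real models are far more sophisticated), each word is represented by the set of language-independent article IDs it appears in; words from different languages whose article sets overlap heavily, like “glasses” and “Brille,” are likely translations.

```python
# Toy sketch of inverted indexing: represent each word by the set of
# shared article IDs it occurs in. Article texts are invented examples.

en_articles = {
    "Harry Potter": "harry wears round glasses",
    "Sunglasses":   "glasses that shield the eyes",
}
de_articles = {
    "Harry Potter": "harry traegt eine runde brille",
    "Sunglasses":   "brille die die augen schuetzt",
}

def inverted_index(articles):
    """Map each word to the set of article IDs containing it."""
    index = {}
    for article_id, text in articles.items():
        for word in text.split():
            index.setdefault(word, set()).add(article_id)
    return index

en_idx = inverted_index(en_articles)
de_idx = inverted_index(de_articles)

def jaccard(a, b):
    """Overlap between two article-ID sets (1.0 = identical)."""
    return len(a & b) / len(a | b)

# 'glasses' and 'brille' occur in the same two articles: strong match.
print(jaccard(en_idx["glasses"], de_idx["brille"]))  # 1.0
print(jaccard(en_idx["harry"], de_idx["brille"]))    # 0.5
```

Because the article IDs are shared across language editions, no direct translation is ever needed: the overlap of article sets alone signals which words correspond.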
These models can compare 100 different languages at a time, teasing out relationships between all of them, something impossible for humans to even try. They also help to untangle ambiguous words, those words that have many disparate meanings, so the correct translation can be inferred from context.
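Context-based disambiguation can also be illustrated with a toy sketch (the sense inventory and context words below are invented, and this is a generic overlap heuristic, not the researchers’ model): an ambiguous word like English “bank” gets the translation whose typical context words best match the sentence it appears in.

```python
# Toy sketch: pick the translation of an ambiguous word by overlap
# between the sentence and each sense's typical context words.
# Senses and context words are invented for illustration.

senses = {
    "Bank (institution)": {"money", "account", "loan"},
    "Ufer (riverbank)":   {"river", "water", "shore"},
}

def disambiguate(sentence, senses):
    """Return the sense whose context words overlap the sentence most."""
    words = set(sentence.split())
    return max(senses, key=lambda sense: len(senses[sense] & words))

print(disambiguate("she opened an account at the bank", senses))
# -> Bank (institution)
```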
The results from 25 test languages, published in this report, “are consistently better than the unsupervised baselines, and by a very large margin.”
One irony of machine translation is that it offers the worst service to those who need it most. If you speak Spanish or English, you can ignore other languages altogether, but if you can’t find much to read in your native language, then you’re much more likely to rely on translation. Søgaard’s research, with its easily scalable machine learning, is ripe for adoption by the likes of Google and Bing, and it may even help assistants like Apple’s Siri to understand us better.