While Google Translate can help users understand words in more than 100 languages, there are thousands more dialects used around the world that remain largely a mystery to those who aren’t native speakers.
A team of Johns Hopkins University scientists is working to bridge that gap with the help of a $10.7 million federal grant from the Office of the Director of National Intelligence, which oversees the nation’s intelligence community.
They will use the funding to create a translation and information retrieval system for obscure languages.
Philipp Koehn, a computer science professor at Hopkins, is leading a team of about 20 researchers in a quest to build a system that can translate foreign languages into English while also summarizing the material.
Koehn said he believes the nation’s intelligence agencies will direct his team to focus on languages that are spoken by millions but have relatively little written material. Those languages — such as Kurdish, Serbo-Croatian, Khmer, Hmong and Somali — are referred to as “low resource” languages.
The project is slated to start in a few weeks and will begin with data collection. Researchers involved will compile documents, including books, tweets, blogs and speeches, that were written in their target language and have previously been translated into English. It’s estimated they will gather enough text to fill the equivalent of 10 books of about 350 pages each, or roughly a million words.
“The biggest challenge for us is we usually have more data,” Koehn said. “This requires lots and lots of data.”
Scientists then will use machines to analyze the language patterns, including sentence structure and the way verbs, adjectives and nouns are positioned.
This analysis will enable the scientists to develop algorithms that automatically translate their target language.
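The article does not say which algorithms the team will use. As a rough illustration of how word-level translation patterns can be learned from previously translated text, the sketch below applies a classic textbook technique, IBM Model 1 word alignment, to a three-sentence toy corpus. The function names, the German-English example sentences and the iteration count are all assumptions made for illustration. They are not details of the MATERIAL project.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate word-translation probabilities t(english | source).

    pairs: list of (source_tokens, english_tokens) sentence pairs.
    Illustrative sketch of IBM Model 1 trained with expectation-maximization;
    not the method described in the article.
    """
    english_vocab = {e for _, eng in pairs for e in eng}
    # Start from a uniform guess: every English word is equally likely.
    uniform = 1.0 / len(english_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (english, source) links
        total = defaultdict(float)  # expected count of each source word

        # E-step: share each English word's credit among the source words
        # in the same sentence pair, in proportion to the current estimates.
        for src, eng in pairs:
            for e in eng:
                norm = sum(t[(e, s)] for s in src)
                for s in src:
                    share = t[(e, s)] / norm
                    count[(e, s)] += share
                    total[s] += share

        # M-step: re-estimate probabilities from the expected counts.
        for (e, s), c in count.items():
            t[(e, s)] = c / total[s]

    return t


def best_translation(t, source_word, english_vocab):
    """Return the English word with the highest estimated probability."""
    return max(english_vocab, key=lambda e: t.get((e, source_word), 0.0))


# Hypothetical toy parallel corpus (German stands in for a target language).
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
english_vocab = {e for _, eng in pairs for e in eng}
model = train_ibm_model1(pairs)

for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(model, word, english_vocab))
```

Even with only three sentence pairs, repeated passes let the estimates separate words that merely appear together from words that consistently correspond. With so little text the estimates are fragile, which is one way to see why Koehn stresses the need for data.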
Intelligence gathering has come to involve a growing number of languages. For most uncommon languages, there are few to no automated tools available for machine translation.
“We need to figure out what people are talking about, and it’s often in languages where we don’t have these translation technologies yet,” Koehn said. Intelligence agencies are “interested in what people are saying all around the world about events that are unfolding, that are being talked about in hundreds of languages.”
The U.S. intelligence community hopes this project will cut down on the amount of time needed to produce accurately translated information.
Beyond Johns Hopkins, researchers at several other institutions will work toward a similar goal, with the teams competing against one another, according to a news release. The University of Southern California, Columbia University and Raytheon BBN Technologies also are involved in the project, dubbed MATERIAL, which stands for Machine Translation for English Retrieval of Information in Any Language.
The Office of the Director of National Intelligence is likely to give the research results to a private company to build the system to be used by the government, Koehn said.
The project will run for four years, with the work divided into three phases.
Carl Rubino, the project’s manager within the Office of the Director of National Intelligence, said the teams are working toward a “monumental task” that will improve communications between countries.
“We’re trying to push technology in ways that it hasn’t been pushed before,” he said.