Jim and Janet Baker's home in Newton, Mass., is full of dragons - dragon statues, dragon kites and dragon curtains. A 200-pound dragon sculpted by Jim's sister sits in his living room.
It's a fitting symbol for a couple who spent 25 years conquering one of digital technology's great dragons - converting human speech to written text.
Today, at 62, Jim Baker is on a new quest. He is coming to Baltimore to direct a center devoted to language technology at the Johns Hopkins University.
It's an urgent assignment. The U.S. Department of Defense, desperate for technology to sift through the huge amounts of raw voice traffic and text it collects each day, has awarded Hopkins a $48 million grant through 2015 to open the Human Language Technology Center of Excellence near its Homewood campus.
The Bakers' program, Dragon NaturallySpeaking, was one of the first to make speech-recognition practical - on a personal computer, no less. And Jim Baker's mathematical model for speech recognition remains the industry's gold standard today. But Baker is convinced it's time for something better.
"I'm not satisfied. I'm still trying to leap-frog," he says.
There was no question that Jim Baker, currently a professor at Carnegie Mellon University, was the best fit for the new Hopkins post, said Gary Strong, who will serve as the center's executive director. "He was written into the proposal Hopkins made."
Once he relocates to Hopkins - less than an hour from the National Security Agency, presumably one of the center's major customers - Baker wants to create even more robust speech-recognition tools, as well as tools that can intelligently sort through billions of words on millions of Web sites and blogs. About half the research is expected to be classified.
"Jim has the broadest, deepest knowledge in all aspects of speech recognition," said Baker's friend and fellow Carnegie Mellon researcher Richard Stern. "And he is one of those rare individuals who is not afraid to think about really big ideas."
A good team
But speech recognition wasn't on Baker's radar back in the 1970s, when the young mathematician took notice of research conducted by Janet MacIver, a neurophysiologist and fellow graduate student at Rockefeller University in New York City, whom he later married.
"It was my wife who introduced me to speech," Baker said. "She was studying humpback [whale] wave songs and bird songs and trying to understand their messages."
Drawn to MacIver and her research, Baker started to study graphic displays of animal sounds, and later human speech, at her side.
"I started doing simple computations, trying to understand speech as a mathematician," Baker recalled.
"We hit it off right away," said Janet, "We met in May, he proposed in August and we were married by October.
"We were a very good team," she added.
Together they decided they could make a difference by exploring a difficult but solvable problem: speech recognition. And given rapid improvement in computer technology, they were convinced they could do it within their lifetimes.
Invited to work at Carnegie Mellon in Pittsburgh, a university at the forefront of computer innovation, Jim Baker decided to approach speech using the so-called Hidden Markov Model, a statistical process that makes inferences about unknown or future events - for example, the next sound or letter in a word - based on what has come before.
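The idea behind the Hidden Markov Model can be sketched in a few lines of code: the hidden states (phonemes, say) are never observed directly, but the most likely hidden sequence can be recovered from the sounds they emit. The states, symbols and probabilities below are invented for illustration; this is a toy Viterbi decoder, not Dragon's actual system.

```python
# Toy Hidden Markov Model: hidden states (two made-up phonemes) emit
# observable acoustic symbols; Viterbi decoding recovers the most
# likely hidden sequence. All values here are invented for illustration.

states = ["s", "iy"]                       # hypothetical phonemes
start = {"s": 0.6, "iy": 0.4}              # P(first state)
trans = {"s": {"s": 0.3, "iy": 0.7},       # P(next state | state)
         "iy": {"s": 0.4, "iy": 0.6}}
emit = {"s": {"hiss": 0.8, "tone": 0.2},   # P(observation | state)
        "iy": {"hiss": 0.1, "tone": 0.9}}

def viterbi(observations):
    """Return the most probable hidden state sequence for the observations."""
    # best[state] = (probability, path) of the best path ending in state
    best = {st: (start[st] * emit[st][observations[0]], [st]) for st in states}
    for obs in observations[1:]:
        best = {
            st: max(
                ((p * trans[prev][st] * emit[st][obs], path + [st])
                 for prev, (p, path) in best.items()),
                key=lambda t: t[0],
            )
            for st in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["hiss", "tone", "tone"]))  # -> ['s', 'iy', 'iy']
```

The key property is the one Baker exploited: each step's inference depends only on the previous state and the current observation, which makes decoding fast enough to run on the limited computers of the era.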
Eventually, he created a speech recognition system with the lowest error rate to date, which he and Janet used when they started Dragon Systems. It's still the backbone of their hit product, NaturallySpeaking.
Just as she did with speech research, Janet introduced Jim to dragons. "I started getting into dragons a little bit before I met Jim," she said, and their company logo was based on the Eastern dragon, which she called "a symbol of good fortune, wisdom, creativity."
Humble beginnings
Without a penny in venture capital, the couple bet everything on their tiny company, working for 18 months with no salaries and hauling their young son and daughter to conferences and shows, where the youngsters demonstrated how easily words spoken into a computer could be translated into text on the screen.
The company grew from 10 employees to 350, and from almost no revenue to $70 million a year by the time they sold the firm in 2000. While the Bakers pursued other ventures, the company changed hands several times, and NaturallySpeaking is now marketed by Nuance Communications of Burlington, Mass.
While Janet, now 60, lectures and pursues her interests in Celtic folk singing and dancing, Jim Baker has unfinished business.
What surprises him, he says, is that the Hidden Markov Model - his milestone in speech recognition - is still the best the industry has.
"If you asked me in the 1980s, I would have said it's a good way to get a quick and dirty approximation [of speech-to-text], but we'll have something better in 20 years," Baker said.
Unfortunately, we don't.
"Speech recognition has gotten better, but it's because computers have more memory and are faster," Baker said. "The Markov Model isn't smart enough ... and personally, I think I'm bored with it. It's been over 30 years."
He hopes the center's multimillion-dollar funding will allow it to leap beyond Markov to the next step: converting recordings of casual conversation, such as phone or radio chatter, into text.
Current technologies only work well when users are deliberately dictating, Baker said. "Because it's a live system, the computer can train itself, and you get trained to learn how to best talk to the computer. On the other hand, recognizing casual conversational speech - that's more difficult."
Challenge of wiretaps
But that casual conversation, caught by wiretaps and eavesdroppers, is what the Defense Department is trying to decipher, said Daniel Goure, a national security expert and vice president of the Lexington Institute, a Washington think tank.
Whether it's conducted in English, Arabic, Chinese or any other language, casual conversation is a tough nut to crack. "Speech is highly variable," Baker said, and the same word spoken by the same person can vary greatly from time to time.
In casual conversation, Baker said, today's speech recognizers probably would not pick up on the difference between a person saying "recognize speech" and "wreck a nice beach."
The Defense Department also wants to scan large amounts of text to look for critical patterns. "You want to be able to survey, analyze and gather data from the common weal of the Web," Goure said.
Given the vastness of the Internet, and the world's 6,000-plus languages, this is an impossible task to do manually.
"The idea would be to develop a computer program that can understand 'intent,'" said Strong, a veteran language technology researcher at the National Science Foundation before taking the Hopkins post. "If a blog was talking about the United States and being inflammatory or making threats, we could identify this."
The task is an enormous one that depends on bringing together many types of research. Bonnie Dorr, a computer science professor at the University of Maryland, will lead several projects at the new Hopkins center.
Initially, Dorr said, she will work on a problem called cross-sentence co-referencing, which involves trying to form connections between sentence pairs.
Consider a news article about President Bush meeting with French President Nicolas Sarkozy.
One sentence might start with the words "Bush and Sarkozy had a wide-ranging discussion of foreign affairs," while another sentence further down might say, "The two leaders got along well."
Today, Dorr said, it's hard for a computer to recognize that "Bush and Sarkozy" and "the two leaders" refer to the same people, even though it's easy for humans.
Different groups will try to develop novel solutions to these problems, including researchers from Carnegie Mellon, University of Maryland, Baltimore, Columbia University and BBN Technologies, a defense contractor with expertise in language processing.
Quest for efficiency
These tools will help analysts fluent in a language analyze audio and text more efficiently, said Strong. "Otherwise, a government analyst spends days and days sorting through data, in multiple languages," he said.
Eventually the center will also focus on machine translators that convert speech and text from one language to another, Dorr said.
Today, the very best machine translators are designed to handle a specialized vocabulary, such as a Canadian program that translates weather forecasts from English to French, said Robert Frederking, a language technology specialist at Carnegie Mellon.
It's much harder to translate a newspaper, Web site or blog - especially the latter. "Blogs have a very informal style, and that makes the task a lot harder," Maryland's Dorr said.
Other experts cautioned against pinning too much hope on computers.
"It's not that easy," said Melissa Ngo, a director at the Electronic Privacy Information Center, a Washington-based watchdog group. "You can't tell by sifting through many terabytes of information whether someone is going to harm you, or when, or why or how. A computer can't replace a person."
The Bush administration acknowledged this problem last year when it launched the National Security Language Initiative. The $750 million, five-year effort will teach Americans from kindergarten through graduate school critical foreign languages such as Arabic, Hindi, Chinese, Russian and Farsi.
Even language technology researchers concede that computerized translators will not solve the problem. "We will always need human translators," Dorr said.
sindya.bhanoo@baltsun.com