
COMPUTER SCIENTISTS ARE CLOSER TO MODELS THAT PROCESS SPEECH But hurdles are many--and costly

THE BALTIMORE SUN

In a few years, when you curse at your computer, don't be surprised if it tries to obey your command.

After decades of work, computer scientists are coming closer to realizing one of the discipline's most sought-after goals: creating computers that respond to natural human speech.

There are already a few products that let computers "understand" the spoken word. Most of them understand only one or two people who have taken time to "train" the system to understand their style of speech. But more advanced systems, ones that can understand anyone speaking to them without specific training, appear on the verge of moving from laboratory to market.

AT&T announced early this month that it planned to replace thousands of telephone operators over the next few years with computers that can understand the instructions of people trying to place a collect or person-to-person call.

For the past several weeks, Apple Computer has been publicly demonstrating a Macintosh-based speech recognition system dubbed Casper, after the friendly cartoon ghost, that lets a Macintosh personal computer accept spoken commands from any user.

The system, which the company claims has a vocabulary of 40,000 to 50,000 words that can activate pre-written command "scripts," reportedly may be offered for sale as early as the end of this year.

And researchers at International Business Machines Corp. are working on advanced speech recognition systems for personal computers and work stations to allow them, among other things, to take dictation. SRI International in Menlo Park, Calif., is developing its own system that could be used to take a wide variety of dictation.

Speech recognition, along with pen control, is seen by experts as playing a crucial role in the development of personal digital assistants, hand-held communications and information-retrieval devices. These devices could respond to spoken commands and would render keyboards largely unnecessary.

With advances like these, it's easy to see why many people might believe we're only a few years away from the computers of science fiction, like HAL of "2001," which routinely conversed with its human companions. But researchers say that, given the complexities of the problem, such a scenario is still decades away.

"Speech recognition capability is coming along, as far as discernment of sounds," said David Roe, who's been involved in speech recognition research for a decade at AT&T's Bell Laboratories. "What's hard is language understanding, what sentence is likely to be made out of those sounds."

Speech recognition shares much in common with two other vexing problems in computer science: handwriting recognition and computer vision systems. All three fields require computers to analyze huge amounts of data at high speed, searching for meaningful patterns in the enormous digital clutter.

The difficulty in achieving the results of science fiction stories is that even the most powerful computers, which can calculate far faster than any human, are a poor match for the human brain when it comes to the complex task of making sense out of the constant stream of patterns in speech.

The first step in speech recognition is what researchers call "feature extraction." Essentially, it involves converting the sound waves entering a microphone into digital form, compressing the raw stream of bits into a more compact, manageable size, and extracting meaningful spoken sounds while rejecting background noise.
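The frame-and-filter idea behind that first step can be sketched in a few lines of Python. This is a toy illustration, not any company's actual method: it slices a digitized signal into short frames, computes each frame's energy, and discards frames too quiet to be speech. Real systems extract far richer features, such as spectral coefficients; the function name and threshold here are invented for the example.

```python
def extract_frames(samples, frame_size=4, energy_threshold=1.0):
    """Return the average energy of each frame loud enough to be speech."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        if energy >= energy_threshold:  # reject quiet background frames
            features.append(energy)
    return features

# A quiet noise frame (near zero) is dropped; the louder burst is kept.
signal = [0.1, -0.1, 0.1, 0.0,   # background noise
          2.0, -1.5, 1.8, -2.2]  # speech-like burst
print(extract_frames(signal))
```

The single number that survives stands in for the "meaningful spoken sound" the article describes; everything else is treated as clutter.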

"There's so much information that's irrelevant," said David Nahamoo, manager of speech recognition modeling at IBM's Thomas J. Watson Research Center in Hawthorne, N.Y. "Pitch, for example, doesn't mean anything in our current system."

Once the computer extracts meaningful sounds, it must then assemble them into the "phonemes" that form the basic components of English speech, from which all words are built. This is generally done by matching successions of sounds with a library of phonemes.
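That library-matching step can be sketched as a nearest-template search: compare an extracted sound's features against stored phoneme templates and pick the closest. The feature vectors and templates below are invented purely for illustration; as the article notes, real libraries may hold thousands of variants per phoneme.

```python
# Hypothetical phoneme templates keyed by phoneme symbol.
PHONEME_TEMPLATES = {
    "/k/": [0.9, 0.1, 0.2],
    "/ae/": [0.2, 0.8, 0.7],
    "/t/": [0.8, 0.2, 0.1],
}

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def match_phoneme(features):
    """Return the phoneme whose stored template is nearest the features."""
    return min(PHONEME_TEMPLATES,
               key=lambda p: distance(features, PHONEME_TEMPLATES[p]))

print(match_phoneme([0.88, 0.12, 0.18]))  # closest to the /k/ template
```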

The English language consists of 46 to 48 phonemes, depending on how they are counted, but individual differences in pronunciation, from subtle variations to pronounced accents, complicate matters. Speech recognition systems built to understand many speakers may actually have to store thousands of phoneme variations in their libraries to handle these differences.

Once the phonemes are extracted, the computer begins its most difficult process: assembling them into precise English words and sentences the computer can then act upon according to pre-determined rules. This "grammar processing," Mr. Roe said, involves one of speech recognition's most persistent problems: creating a working mathematical model of language.

"That is a tough nut to crack," he said. The language model, for example, must know what words are likely to follow others in English, to help differentiate among similar sounds -- but it's still an imprecise art.
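One simple form such a language model can take is a bigram table: given the previous word, score which of several acoustically similar candidates is more likely to follow. The counts below are invented for illustration, loosely echoing the collect-call domain mentioned earlier; this is a sketch of the idea Mr. Roe describes, not the Bell Labs system.

```python
# Invented bigram counts: how often word B followed word A in training text.
BIGRAM_COUNTS = {
    ("place", "a"): 50,
    ("place", "uh"): 1,
    ("a", "collect"): 30,
    ("a", "correct"): 2,
}

def pick_likely(prev_word, candidates):
    """Choose the candidate most likely to follow prev_word."""
    return max(candidates,
               key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

# "collect" and "correct" sound alike; after "a" in this domain,
# "collect" (as in a collect call) is the better bet.
print(pick_likely("a", ["correct", "collect"]))
```

Even this toy version shows why the problem is hard: the table only helps for word pairs it has seen, and natural speech constantly produces pairs it hasn't.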

Indeed, some companies developing speech recognition programs have decided to simplify grammar processing by reducing the number of possible sound permutations they will accept. A system from Verbex Voice Systems Inc. in Edison, N.J., for example, requires users to decide in advance which precise word combinations they will use to activate a program's functions. Only those are programmed into the system's language model, so it doesn't have to deal with a huge number of possible ways to say a certain thing.

"It's not a free-form kind of thing; you could have two, three or six ways of saying a thing, but not an infinite number," said Mike Perkons, vice president of marketing. Such a system can be run on a much less powerful computer than one that allows users to speak completely naturally.
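The constrained approach reduces recognition to lookup against a short list of pre-programmed phrasings per command, which is why it runs on modest hardware. The command names and phrases below are hypothetical, not taken from any Verbex product:

```python
# Hypothetical commands, each with its two or three accepted phrasings.
COMMAND_PHRASES = {
    "open_file": {"open file", "open the file", "file open"},
    "save_file": {"save file", "save the file"},
}

def recognize(utterance):
    """Return the command matching the utterance, or None if unrecognized."""
    spoken = utterance.lower().strip()
    for command, phrases in COMMAND_PHRASES.items():
        if spoken in phrases:
            return command
    return None

print(recognize("Open the file"))             # matches a programmed phrasing
print(recognize("Would you kindly open it"))  # free-form speech is rejected
```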

Because speech recognition involves such complex problems, researchers have tackled different pieces of it, resulting in four distinct categories of systems.

First, systems can either be trained to respond to a single speaker's voice -- so-called speaker-dependent systems -- or be built to recognize a wide variety of people, a speaker-independent system.

Typically, a system trained to recognize only one voice will understand a larger vocabulary, because it doesn't need to account for the wide variation of accents and nuances different people apply to the same words. That also means that a computer that accepts one user's commands will utterly ignore another's. A speaker-dependent system is a reasonable choice for a dictation device or a hand-held personal digital assistant. But for a system that will route phone calls, or a portable system that might have many users, speaker-independent recognition is a must.

Either of those types of speech recognition systems can be further characterized by how people must speak to be understood.

The first is the most commonly found system today, a so-called isolated system in which users speak individual words distinctly, pausing discernibly after each one.

Because it must understand the smallest number of distinct sounds, the easiest kind of speech recognition system to develop is a speaker-dependent, isolated system.

IBM has a working system of that type in its labs that can recognize 20,000 words with about 98 percent accuracy, said Mr. Nahamoo.

The drawback to such a system is that it is unnatural for people to pause distinctly after each word, making such systems appealing to users only in certain cases.

The far more difficult scheme -- but the one most people would likely use when it becomes available -- is one that can understand ordinary conversation, where spoken words run together.

Copyright © 2021, The Baltimore Sun, a Baltimore Sun Media Group publication