Link to Computerworld
Is Speech Recognition Finally Good Enough?
Better hardware and algorithms nudge
the technology closer to its 10-year promise of supplanting
keyboards.
By Lamont Wood
(Computerworld, May 18, 2007)
"For me it's a lifesaver,"
said Paul Langer, an attorney at the Chicago office of Mayer, Brown,
Rowe & Maw. "I never learned to type."
His alternative to the
keyboard is speech recognition (SR) software, in this case Dragon
NaturallySpeaking (DNS) from
Nuance Communications Inc. in Burlington, Mass. DNS is now in Version
9.0; its introduction a decade ago marked the birth of continuous
speech recognition -- previous SR software required the user to
pause between words.
But problems with
accuracy, and the need for an hour-long "enrollment" process to
train the software to follow the user's voice, meant that typing
didn't become obsolete in the intervening decade. However, things
have changed.
"I don't know how
accurate it is, but if it were not accurate enough, I would go back
to typing," said Peter Laipson, a DNS 9.0 user who, unlike Langer,
is also a fast typist.
"I use it to do nearly
all my grading," continued Laipson, a history teacher from
Arlington, Mass., working at a temporary job in San Francisco. "I
will dictate comments on relevant parts of an essay and then summary
comments at the end. With Dragon I need about 60% as much time to
comment on a paper."
He does not claim it's
100% accurate, saying it is not suitable for text with a lot of
slang, and recalling a time when it rendered "I really admire your
analysis" as "I really admire urinalysis."
"It helps to have a sense
of humor, but simple proofreading is enough," Laipson said.
Actually, retaining a
sense of humor has been important for multiple reasons in the SR
field. In 1993 two executives from Kurzweil Applied Intelligence
(which pioneered SR for the medical market) went to prison for
faking sales. That firm was sold in 1997 to a Belgian SR firm,
Lernout and Hauspie (L&H), which was reporting phenomenal sales
growth at the time. Dragon Systems, which originated DNS that year,
was reporting only anemic growth, and L&H had no trouble acquiring
Dragon Systems in early 2000 in a stock deal.
Within a year a series of
accounting frauds came to light and L&H collapsed into bankruptcy.
Its SR technology was sold in late 2001 to ScanSoft Inc., which kept
the DNS line going. (It was then at Version 6.0.) ScanSoft later
acquired Nuance and adopted its name.
Starting To Deliver
Thereafter, "It was with
the launch of Version 8.0 (in November 2004) that the market became
reinvigorated and took off," said Chris Strammiello, director of
product management at Nuance. "We crossed an invisible line with
Version 8.0, where the software actually delivered on its promises
and offered real utility for the users. Sales have been growing at a
rate of 30% yearly since then, except that we expect it to do better
than 30% this year.
"About 60% of the buyers
are consumers or what we call proficient professionals," added
Strammiello. "The rest are from vertical markets, especially health
care and law, whose practitioners are used to dictating and in the
past have paid for transcription services. About 10% are people who
use speech recognition for accessibility reasons, and that cuts
across the other segments," he added.
Version 8.0 reduced the
error rate by 30% compared with Version 7.0, while Version
9.0 reduced errors by another 20%, said Strammiello. Overall, about
25% of the accuracy improvements can be credited to today's faster
hardware, while the rest stems from improved algorithms, he added.
The personal version costs about $200, the professional version
costs about $765, and there are also specialized medical and law
office versions. Version 9 also includes tools for deploying the
software over the network.
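The two error-rate figures compound rather than add. As a rough worked example, assuming the 20% reduction in Version 9.0 is measured against Version 8.0's error rate (the article does not say so explicitly), the combined effect relative to Version 7.0 can be sketched as follows. The function name is illustrative only.

```python
def compound_reduction(*reductions):
    """Overall fractional error reduction after applying each
    relative reduction in turn (e.g. 0.30, then 0.20)."""
    remaining = 1.0  # fraction of the original errors still present
    for r in reductions:
        remaining *= (1.0 - r)
    return 1.0 - remaining


# Assumption: 30% (v7 -> v8) and 20% (v8 -> v9) are successive relative cuts.
overall = compound_reduction(0.30, 0.20)
print(f"Errors remaining vs. Version 7.0: {1 - overall:.0%}")
print(f"Overall reduction: {overall:.0%}")
```

Under that reading, Version 9.0 would make a little more than half as many errors as Version 7.0 on the same material, not half as many.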
Today, "A person can get
95% accuracy right out of the box, and enrollment is optional and
only takes five minutes," said Howard Parks, president of Microref
Systems Inc., a firm in Highland Park, Ill., that sells SR systems
and trains users.
As for input speed,
Strammiello said that DNS can keep up with someone talking 160 words
per minute, which is about as fast as most people converse. As for
typing speed, Rich Stroud, spokesman for the International
Association of Administrative Professionals (IAAP), said that ads
for clerical jobs usually ask for at least 40 words per minute.
But despite the obvious
speed advantages, there has been no evident rush to adopt the
technology. For instance, Stroud noted that only 5% of the IAAP's
membership reports using SR software at work. When asked what
software they wish the boss would supply, none mentioned SR.
95% Isn't Good Enough
That resistance is at
least partly because, in his experience, 95% accuracy is not good
enough, indicated Parks. "Most users are not happy until they get to
the 98% level," he said. "It's only when you become skillful that
you can say that it becomes productive, and it takes five to 20
hours of intensive use to become skillful."
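A back-of-the-envelope sketch suggests why the gap between 95% and 98% matters more than it sounds. It assumes accuracy is measured per word, uses the 160 words-per-minute dictation rate quoted earlier, and picks a purely hypothetical 10 seconds to find and fix each misrecognized word. None of these correction costs come from the article.

```python
def net_wpm(dictation_wpm, accuracy, fix_seconds, words=100):
    """Effective words per minute once correction time is included.

    accuracy    -- fraction of words recognized correctly (assumed per-word)
    fix_seconds -- hypothetical time to spot and correct one wrong word
    """
    errors = words * (1 - accuracy)
    minutes = words / dictation_wpm + errors * fix_seconds / 60
    return words / minutes


# Hypothetical correction cost of 10 seconds per error.
for accuracy in (0.95, 0.98):
    rate = net_wpm(160, accuracy, fix_seconds=10)
    print(f"{accuracy:.0%} accurate: about {rate:.0f} net words per minute")
```

With these assumed numbers, moving from five errors per 100 words to two raises the net rate by roughly half, which is consistent with Parks' point that the software only becomes productive near the 98% level. A different correction cost would change the figures but not the direction.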
Not being accustomed to
dictation is a problem, he noted, but the main pitfalls involve the
need for clear, consistent pronunciation, plus a mastery of the
correction procedures by which the software learns from its
mistakes, steadily increasing its accuracy rate.
Without help on those
issues, about three-fourths of the people who attempt to use SR
eventually put it aside and go back to keyboarding, Parks said, and
even among those with training the rate is about 20%.
Trying a brand of
large-vocabulary desktop SR other than DNS is rarely mentioned as an
option because there are few alternatives. After the L&H debacle
there were basically three entries: DNS, ViaVoice from
IBM and software from Philips that was not actively marketed in
the U.S., explained Parks. IBM later sold the marketing rights for
ViaVoice to Nuance, which uses it as an entry-level product, he
noted.
On the other hand, the
most widely owned form of SR is probably a version developed by
Microsoft, since it is included free in
Office XP -- a fact that appears to be unknown to most Office XP
users.
"Office XP had it but
Microsoft did not promote it -- it was a beta test and they were not
comfortable about the quality of the user interface," said
Bill Meisel, head of TMA Associates, a speech industry
consulting firm in Tarzana, Calif. Unlike DNS, Office XP's SR
required that the user rely on the mouse to navigate and make
corrections. (Microsoft did not respond to requests for comment.)
Microsoft Vista also has
SR built in but uses a voice correction interface similar to the one
in DNS, Meisel noted.
Vista's Version Not Up To Speed
"It's good, but it's not
in the same league as DNS yet," said Parks, who has used both Vista
and DNS SR. "But it's a foundation they can improve on over the
years. Dragon's research and development is measured in hundreds of
person-years, so it will take a few years for Microsoft to catch
up."
At Nuance, Strammiello
said he saw Vista as more of a promotional vehicle than as a
competitor. "It will expose people to what the technology can do,
and those who like it will then seek out a premium product," he
predicted.
But will that lead to the
day when people set aside their keyboards for SR, having discovered
that with a few minutes' training they can achieve several times the
throughput that they could reach after investing a semester in a
typing class?
"If you had asked me 10
years ago (when DNS came out) I would've said that in three or four
years the world would be converted," said Parks. "But here we are 10
years later and I still don't know when it will take off. It is
expensive, but there is no question that it offers far greater
benefits than anything else. My average user creates text 50 to 100%
faster than they did before."
"It's misleading to think
of SR as a replacement for the keyboard for the average person,"
cautioned Meisel. "Where the keyboard is really effective is with
editing. Getting around is harder with voice -- you can do it but it
requires a new learning experience."
Strammiello preferred to
talk about the future of the product itself. "We will be broadening
the bell curve and getting more and more users to 99% accuracy," he
said. "Speaker independence is on the horizon, but how soon that
will arrive is unclear. What we can expect in the meantime are more
natural commands and a more conversational interface, and more noise
robustness so we can speak in a crowded room."