“Dragon doesn't understand me”: Why people give up on speech recognition

This article was originally published on LinkedIn in February 2018.

Introduction

I have relied on speech recognition technologies for 25 years, so voice control of computers is second nature to me. For many tasks, although not all, I work faster and more easily by voice than with a keyboard, mouse, or touch screen. For those tasks, I find that speech recognition is the best tool.

My longstanding use makes me, I suppose, an early adopter. But not everyone takes to the technology as I did. Heidi Horstmann Koester, in Abandonment of Speech Recognition by New Users (2003), reported that only one of the eight participants in her study was still using a speech recognition product after six months.

During the 15 years since Koester's paper, speech recognition technology has greatly improved. I suspect that a larger percentage of new users stick with it. But I am also aware that many people who could benefit from the technology use it for a short period, and then give up in frustration.

In 2015, former MacSpeech, Inc. employee Chuck Rogers estimated that 15% to 20% of computer owners got consistent and productive results from speech recognition, but admitted the proportion may be “closer to 5%.” He suggested two reasons: the learning curve, while not as steep as before, is still very steep; and the disappointment people feel when they realize that “speech recognition just doesn't work like it does on Star Trek.”

Undoubtedly, ideas about speech recognition took root in the popular imagination through science fiction. But with the buzz around voice-enabled digital assistants like Siri, Google Assistant, Alexa, and Cortana, people are discovering that speech recognition exists today. The most versatile and accurate product for Windows and Mac computers is Dragon Professional, aka Dragon.

(Although the Windows and macOS versions of Dragon use the same underlying recognition engine and are equally accurate, the Windows version has more features and is more customizable. The following descriptions apply primarily to the Windows version.)

I use Dragon to write short documents and emails, format and revise text, enter data into spreadsheets, browse the web, and more. I don't rely on Dragon for everything, but I do use it almost every day. The current release is Version 15. Although fast and accurate, it is not without limitations, problems, and bugs. For example, Dragon 15 works nicely in Internet Explorer, less well in Firefox and Chrome, and hardly at all in Edge.

Over the years I have noticed that new users often have unrealistic expectations about Dragon. In this article, I will review four misconceptions that may predispose a person to abandon Dragon.

Four common misconceptions about Dragon

1. “Dragon doesn't understand me”

It's easy to anthropomorphize Dragon. When everything is going well, Dragon seems to understand every word.

But Dragon understands zilch. Dragon doesn't know the meaning of words, and it certainly doesn't understand language — at least, not as humans do.

Dragon is a transcription machine “made” of mathematical and statistical rules held together with logic. Instead of understanding language, Dragon applies complex pattern-matching algorithms to audio signals, and outputs the most likely words the signals contain.

Dragon's accuracy peaks when a person speaks in long phrases. Context matters: Dragon guesses each word in relation to its guesses for adjacent words. More words means more context and better guesses. Dragon is always just guessing — it's a probabilistic machine.
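To make the idea of context-based guessing concrete, here is a minimal sketch in Python. It is a toy illustration only, not a description of Dragon's actual engine: the word-pair counts are invented, and the names PAIR_COUNTS, score, and best_guess are hypothetical. It shows how a sound-alike word can be guessed one way in isolation and another way once neighbouring words supply context.

```python
from itertools import product

# Hypothetical counts of how often one word follows another.
# These numbers are made up for illustration; they are not Dragon data.
PAIR_COUNTS = {
    ("<start>", "right"): 9,
    ("<start>", "write"): 6,
    ("write", "a"): 12,
    ("right", "a"): 1,
    ("a", "letter"): 15,
}


def score(words):
    """Multiply (count + 1) for each adjacent word pair in the sequence."""
    total = 1
    previous = "<start>"
    for word in words:
        total *= PAIR_COUNTS.get((previous, word), 0) + 1
        previous = word
    return total


def best_guess(slots):
    """Pick the highest-scoring sequence, given candidate words per sound."""
    return max(product(*slots), key=score)


# One word alone: the more common opening word wins.
print(best_guess([["right", "write"]]))
# A longer phrase: the neighbouring words tip the guess the other way.
print(best_guess([["right", "write"], ["a"], ["letter"]]))
```

With these invented counts, the lone word comes out as "right," while the three-word phrase comes out as "write a letter." The sound is the same in both cases; only the amount of context has changed.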

I stack the odds in my favour when I dictate by thinking about what I want to say, and then blurting out the idea in a single breath. Fast talking is not a problem provided I articulate every word clearly and distinctly. I find that speaking quickly and at length actually boosts accuracy, and avoids the surprises that result from saying one or two words at a time. For example, saying “minimize” by itself may cause the current window to minimize!

Many new users are surprised to learn that speaking too slowly degrades performance. Perhaps they liken Dragon to a non-native language speaker. Talking slowly might help non-native speakers understand, but it won't help Dragon.

2. “Isn't speech a natural way to communicate?”

When I describe speech recognition to people who are not familiar with it, I am often met with puzzled looks when I explain that Dragon is challenging to learn, and that some folks spend months or years mastering it. A common response is, “What's so hard about talking?”

Maybe watching Star Trek episodes has influenced their perceptions, but I wonder whether the idea that voice control is straightforward is a result of marketing. Nuance's television advertisements and promotional videos underplay the effort involved in becoming a competent user. Furthermore, from 1997 until 2016, “Dragon Professional” was called “Dragon NaturallySpeaking.” Perhaps the old name gives credence to the notion that the program transcribes “natural” or conversational speech.

Dragon isn't designed to handle the messiness of conversational speech. Conversations are punctuated with mumbles, hesitations, pauses, breath sounds, and, umm... you-know... what's-that-word?... fillers! Furthermore, Dragon has no awareness of non-verbal cues — body language, facial expressions, and hand gestures — that help to convey meaning.

Dragon responds best to “structured speech.” Think of structured speech as text that someone has already mulled over, written down, and revised. When you hear a newscaster reading from a teleprompter, that's structured speech. I advise people who are learning Dragon to emulate a newscaster's tone and pace.

There is nothing “natural” about dictating fully-formed, grammatical statements. It's a hard skill to develop. I have been practicing for years, am better than before, but still find it challenging. Because structured speech is not how people talk, most folks require ongoing effort to modify lifelong speaking habits.

3. “Dragon needs more training to recognize my voice”

In Dragon, training has two connotations: when creating a new profile, training means preparing Dragon to recognize an individual's vocal characteristics; and while dictating, training means pronouncing a word to strengthen the association between its spoken and written forms.

When Dragon Systems released NaturallySpeaking in 1997, users were forced to read a story out loud, one screenful at a time, for 20 or 30 minutes. Afterwards, the program crunched the acoustic data it had collected for another 20 or 30 minutes. So it really did take an hour just to begin using Dragon.

Initial training is no longer required. For the previous four or five Dragon versions, users could skip it. Starting with Version 15, initial training is no longer offered at all: users cannot do it even if they want to.

Nowadays, it takes five or ten minutes to create a super-accurate profile from scratch. Most new users can expect 90% to 95% accuracy from the get-go — and some achieve even better results. People with strong accents, dysarthric or otherwise unclear speech are achieving better accuracy than was previously possible, without training.

However, if a newly-minted profile is only, say, 80% accurate, no amount of training will bring it up to 95%. Poor accuracy, especially with a new profile, cannot be fixed by training; the problem lies elsewhere. Rather than Dragon needing training, the user may need instruction on how to properly set up and use Dragon!
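To put those percentages in perspective, here is a small worked calculation in Python. It is a rough sketch that assumes accuracy is measured per word and that errors are spread evenly, which is a simplification; the function name expected_errors and the 500-word document length are my own choices for illustration.

```python
def expected_errors(word_count, accuracy):
    """Estimate misrecognized words, assuming accuracy is a per-word rate."""
    return round(word_count * (1 - accuracy))


# A 500-word document at three accuracy levels mentioned in this article.
for accuracy in (0.95, 0.90, 0.80):
    errors = expected_errors(500, accuracy)
    print(f"{accuracy:.0%} accurate: about {errors} words to fix")
```

Under these assumptions, a 500-word document leaves about 25 words to fix at 95% accuracy, about 50 at 90%, and about 100 at 80%. The gap between 80% and 95% is four times as much correction work.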

If your initial accuracy is subpar, consider the following explanations. Notice that none are training-related:

The microphone or soundcard is faulty. It could be a frayed wire, a loose connection, a failing USB port, and so on. Furthermore, the sonic characteristics of some microphones make them ill-suited for speech recognition.
The microphone is incorrectly positioned. Ensure the element is 1 - 2 cm (0.5 - 1 in.) from the corner of the mouth. Never position the element directly in the breath stream.
There is excessive ambient noise. Although Dragon 15 filters out background sounds better than previous versions, it still prefers quiet environments, so avoid listening to music while dictating. Fans, air conditioners, and overhead ventilation systems within earshot of the microphone can also muddy the audio stream. A microphone with superior noise cancellation is essential when working in noisy environments.
You are dictating words that are not in Dragon's vocabulary. If you regularly dictate a word (or phrase, name, or acronym) that is not in the vocabulary, add it. Going forward, Dragon is more likely to transcribe it correctly.
You are not using structured speech and/or you are speaking haltingly. Dragon responds best to long, unbroken phrases. I have met users who had developed the habit of dictating one or two words at a time, and then stopping to check their progress. I have had modest success coaching them to modify their dictation style: I ask that they self-consciously practice three-word phrases for a while; then four; then five; and so on. Even four or five words per phrase noticeably improves accuracy and speed. There are individuals, however, who cannot physically or cognitively speak more than a word or two at a time. Dragon 15 works reasonably well for some, although their accuracy is unlikely to reach the high 90s.
Dragon may not be able to adapt to your accent, your manner of speaking, or your speech difficulty. Although Dragon 15 handles non-standard speech better than earlier versions, there are limits to how much a profile can improve. (But see the next section.)

4. “Dragon's performance improves with use”

This expectation contains a kernel of truth. Dragon's performance is improved by adding new words to its vocabulary. But overall accuracy increases only incrementally, and quickly levels off. For me, the accuracy of a new profile plateaus in about an hour.

In addition to adding new words and phrases to Dragon's vocabulary, here are three other strategies for reducing your misrecognition rate:

Correct words or phrases that Dragon misrecognizes. Use the “spell XYZ” or “correct XYZ” commands. According to Nuance, you can also improve accuracy by correcting misrecognitions using select and delete commands (e.g., “undo that” and “delete that”), or by using a keyboard or mouse.
Provide Dragon with a sample of your writing and/or emails to analyze.
After dictating for a while and making corrections, run Dragon's Accuracy Tuning. This feature optimizes a profile by analyzing audio and text data that Dragon collects during dictation.

For people who have nonstandard accents, atypical manners of speaking, or speech difficulties, these techniques are the best ways to boost accuracy. Gains may be noticeable, but are likely to be small.

Dragon occasionally responds slowly or inaccurately. If the problem persists longer than a minute or two, run “Check Microphone...” from the “Audio” menu on the DragonBar. This 30-second procedure often fixes the problem. If it doesn't help, relaunch Dragon and the applications you were dictating into. When Dragon's performance unexpectedly declines for me, however, my inclination is to exit all applications (including Dragon) and shut down my computer. When I reboot, I start Dragon first.

Different assumptions for better performance

The art of fixing misrecognitions and making revisions

With the recent integration of deep learning into speech recognition, the evolution of the technology is bound to quicken. Based on the remarkable improvements I have seen over the past two years, I think future systems will handle diverse accents, function adequately in noisy environments, and accurately transcribe unstructured speech — including informal conversations between different people.

Yet errors are inevitable. Because speech recognition software will continue to be probabilistic and context-dependent, future systems will occasionally make mistakes. You will dictate, “I knead the dough,” but the system will output, “I need the dough” or “I kneed the doe.”

And despite fewer mistakes, people will still want to change what they have written. Revision is an essential aspect of writing. In The Elements of Style, William Strunk, Jr. and E. B. White explain: “Few writers are so expert that they can produce what they are after on the first try.” So you might dictate, “I love you.” Then on second thought, you will decide to change it to, “I like you, a lot.”

Even with today's Dragon, it's possible to delete extra words, substitute one word for another, and rearrange sentences and paragraphs. But these kinds of fixes are, for some users, time-consuming, error-prone, and frustrating.

For example, many Dragon users wanting to delete the third word in this paragraph will “translate” these steps into voice commands:

Move the mouse cursor to the left side of many.
Hold down the mouse button.
Drag right until the cursor passes the final letter and release the mouse button.
Right-click, and select “cut” from the context menu.

Performing these steps by voice takes 30 seconds, a minute, or longer. And I have seen Dragon users spend several minutes doing exactly this kind of thing — and not always succeeding.

Yet there is a Dragon command that will instantly erase the word: “delete many.” Bonus: substituting a single command for a multi-step process reduces the likelihood of unintentionally introducing new errors, all of which must be dealt with.

Users who have habituated to typing, pointing, clicking, and dragging — which means just about everybody! — may not realize that text can be quickly deleted without the need to even think about the mouse. Taking advantage of voice input takes a new mindset. To those who have relied on a keyboard and mouse to get things done, Dragon techniques seem counterintuitive.

Toward graceful recovery from error

I refer to these counterintuitive techniques as “voice-centric” strategies, and the approach to writing and editing that they engender as “graceful recovery from error.” The techniques are not intuitive at first, but once grasped, they leverage the unique possibilities of voice control. While internalizing these techniques, users arrive at a different set of assumptions and expectations about human-computer interaction. The shift in thinking is comparable to the one many of us went through during the 1980s and 1990s when we transitioned from a typewriter mentality to a word processor sensibility.

Knowing how to gracefully recover from error allows Dragon users to spend far less time and energy on the mechanics of writing. They make fewer errors, get more done, and hopefully, get a little less frustrated!

In my next article, I will describe counterintuitive, voice-centric ways to respond when Dragon misrecognizes something you have said, or when you want to quickly revise text. As you will see, the two are closely linked.

Acknowledgements

I thank my RESNA colleagues Heidi Horstmann Koester and Ray Grott for reviewing an early draft. Their excellent suggestions helped sharpen my focus and improve the readability of this article.