Speech recognition circa 2004

If you've never tried dictation, you can get a sense of how it works by watching a ~~video~~ screencast I made shortly after I installed Version 8 of NaturallySpeaking. The out-of-the-box experience was dramatically better than before. *via* [Jon's Radio]

Not long ago I had a chat with one of Microsoft’s recognition guys (J. Pittmann) and learned some interesting factoids about how voice recognition works. The most interesting is that they throw out *tons* of data and can’t take advantage of the added horsepower available in offline mode. You see, for me voice dictation would be much, *much* more useful if I could transcribe notes taken while driving or walking – when I’m at my computer in a quiet environment, I’d just as soon use the keyboard or pen. The realtime aspect of recognition is not of much interest to me, but I would be thrilled if I could say “Take the next 18 hours to recognize what I babbled on my two-hour drive.”

Not with today’s algorithms. Every speech recognizer on the market uses algorithms that began life several Moore’s-law generations ago, in the pre-Pentium era, and were designed for very low sampling rates and bit depths. (Today’s recognizers probably use higher-quality signals, but the point is that the algorithmic assumptions are what we’d today consider low-fidelity, low-memory, low-CPU.)
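
To give a rough sense of scale (the exact figures here are illustrative assumptions on my part, not anything a vendor told me), compare the data budget those old algorithms assumed with what a sound card can capture today:

```python
# Back-of-the-envelope comparison of audio data rates. The specific figures
# (telephone-ish 8 kHz / 8-bit vs. CD-quality 44.1 kHz / 16-bit) are
# illustrative assumptions, not numbers from any recognizer vendor.

def bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    return sample_rate_hz * bits_per_sample * channels // 8

legacy = bytes_per_second(8_000, 8)      # roughly the budget the old algorithms assumed
modern = bytes_per_second(44_100, 16)    # what a modern sound card can capture

print(f"legacy assumption: {legacy:,} bytes/s")
print(f"modern capture:    {modern:,} bytes/s ({modern / legacy:.1f}x more signal)")
```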

Basically, they quickly throw out everything to try to get to vectors representing sub-phonemes (“codes”) that they template-match to produce phonemes. They then try to match the stream of phonemes to complete words and word sequences, using probabilistic pattern matching and language models.
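
In outline, the pipeline looks something like the toy sketch below. This isn’t how NaturallySpeaking or Microsoft’s engine actually works, just the general shape of it, with made-up codebook, phoneme, and lexicon tables standing in for the real models:

```python
import numpy as np

# Toy sketch of the bottom-up pipeline described above. CODEBOOK, PHONEME_TABLE,
# and LEXICON are made-up stand-ins for the real acoustic and language models.

CODEBOOK = {"c1": np.array([0.1, 0.9]), "c2": np.array([0.8, 0.2])}   # sub-phoneme templates
PHONEME_TABLE = {("c1", "c2"): "k", ("c2", "c2"): "ae", ("c2", "c1"): "t"}
LEXICON = {("k", "ae", "t"): "cat"}

def quantize(frames):
    """Stage 1: replace each raw feature vector with its nearest code label.
    The original signal is thrown away here."""
    return [min(CODEBOOK, key=lambda c: np.linalg.norm(f - CODEBOOK[c])) for f in frames]

def decode_phonemes(codes):
    """Stage 2: template-match pairs of codes into phonemes; now the codes are gone too."""
    return [PHONEME_TABLE[p] for p in zip(codes[::2], codes[1::2]) if p in PHONEME_TABLE]

def decode_words(phones, lm_prob=lambda w: 1.0):
    """Stage 3: look the phoneme sequence up in the lexicon, weighted by a
    language-model probability. Only phonemes survive to this stage, so there
    is no way to revisit the raw audio."""
    candidates = [(lm_prob(w), w) for seq, w in LEXICON.items() if seq == tuple(phones)]
    return max(candidates)[1] if candidates else "<unrecognized>"

frames = [np.array(v) for v in
          [[0.15, 0.85], [0.75, 0.25], [0.9, 0.1], [0.7, 0.3], [0.85, 0.15], [0.2, 0.8]]]
codes = quantize(frames)          # ['c1', 'c2', 'c2', 'c2', 'c2', 'c1']
phones = decode_phonemes(codes)   # ['k', 'ae', 't']
print(decode_words(phones))       # cat
```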

This bottom-up approach has gotten to the point where it works pretty well. But at every step up the abstraction chain, they abandon data (raw signal to code, code to phoneme, etc.), so that by the time they get to the language model, they don’t have the ability to revisit their initial data. For instance, if you use the words “C Plus Plus” in a document, what you’ll find is that you break the phrases used *afterwards*, because the phonemes of “++” are always screwing up the model. And once the language model is screwed up, you get this bizarre semi-phonetic mess that’s virtually impossible to work with. I’d probably *prefer* a pure phonetic output attached to the original note – the phonetic model would be indexable and searchable. It’s sort of like working with Ink – you learn that translating 100% of your handwriting into text is not necessary, so long as 99% of your handwriting is searchable.
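
Here’s a rough sketch of what I mean by an indexable phonetic output. The pronouncing dictionary and notes below are invented for illustration, and a real system would need fuzzy phonetic matching rather than exact substring search:

```python
# Sketch of the "index the phonemes, keep the audio" idea. The tiny pronouncing
# dictionary is a made-up stand-in for a real one (CMUdict, say), and a real
# system would need approximate matching, not exact substring search.

PRONOUNCE = {                      # hypothetical word-to-phoneme mappings
    "meeting": "M IY T IH NG",
    "driving": "D R AY V IH NG",
}

# Each note keeps its raw phonetic transcript plus a pointer back to the audio.
notes = [
    {"audio": "drive-note.wav",  "phones": "T EY K N OW T S W AY L D R AY V IH NG"},
    {"audio": "office-note.wav", "phones": "S T AA F M IY T IH NG AE T N UW N"},
]

def search(word):
    """Return the audio files whose phonetic transcript contains the query word."""
    target = PRONOUNCE[word]
    return [n["audio"] for n in notes if target in n["phones"]]

print(search("driving"))   # ['drive-note.wav']
print(search("meeting"))   # ['office-note.wav']
```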

By the way, one thing that really struck me about Jon’s screencast is how differently he approaches voice dictation. When composing text at my computer, words come out of me in a way that is entirely alien to my speech habits. My chief pitfall as a public speaker is that I talk fast. On the other hand, my professional writing comes out of me at the pace of about one phrase every 30 seconds, and that’s when I’m on a roll. One of the interesting things about Jon’s screencast is that he’s actually *talking*. When I’ve tried to use voice dictation, I’ve always basically voiced the words that I would otherwise type, which makes me sound like a robot with a run-down battery. “Voice dictation is…intriguing…to me…”