Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision
(CS 528)

Entropy and Pronunciation in Speech Recognition and Synthesis

Dan Jurafsky
Linguistics Department
Stanford University
Monday, February 2, 2004, 4:15PM
TCSeq 200


Automatic speech recognition (ASR) has made fantastic progress in the last few decades, and current systems achieve word error rates below 5% on many tasks. Well, ok, but if we're so smart, how come we don't already have software automatically closed-captioning the TV news, or transcribing business meetings?

The problem is that our successes have been mainly on human-to-computer speech. Word error rates on the more difficult task of recognizing human-to-human speech are often 20% (or even higher if you don't cheat on the test sets, but let's not get into that).

So why are ASR systems so much worse at understanding people than people are? Many studies have pointed to pronunciation variation as one likely cause of higher human-human error rates. That is, some very clever previous experiments have shown that when people speak to humans (as opposed to machines) they pronounce words differently. But previous attempts to model this via `pronunciation models' have not had much success. I'm going to discuss a series of analytic experiments trying to understand why our clever ideas, and everybody else's even-more-clever ideas, haven't worked, and what is really causing this pronunciation variation in human lexical production. Hint: people don't like to bore other people, but they don't mind boring their computers. I'll conclude with a discussion of some neat new work we are just beginning which applies some of these ideas to TTS (Text-to-Speech synthesis). Because people also don't like their computers boring.

About the Speaker

Dan Jurafsky is a newly-arrived associate professor of Linguistics at Stanford. He has a BA (1983) in Linguistics and a PhD (1992) in Computer Science from UC Berkeley, and spent 8 years in the Linguistics and Computer Science departments at the University of Colorado, Boulder before coming to Stanford just this month. His research focuses on statistical models of human and machine language processing, especially automatic speech recognition and understanding, natural language processing, and computational psycholinguistics. He received the National Science Foundation CAREER award in 1998, the MacArthur Fellowship in 2002, and also has high hopes for his recipe for Three Cups Chicken. His most recent book, with James H. Martin, is the widely-used textbook "Speech and Language Processing". He also plays the drums in mediocre pop bands.


Back to the Colloquium Page