Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision
(CS 528)
Entropy and Pronunciation in Speech Recognition and Synthesis
Dan Jurafsky
Linguistics Department
Stanford University
Monday, February 2, 2004, 4:15PM
TCSeq 200
http://graphics.stanford.edu/ba-colloquium/
Abstract
Automatic speech recognition (ASR) has made fantastic progress in recent
decades, and current systems achieve word error rates below 5% on
many tasks. Well, ok, but if we're so smart, how come we don't
already have software automatically closed-captioning the TV news, or
transcribing business meetings?
The problem is that our successes have been mainly on human-to-computer
speech. Word error rates on the more difficult task of recognizing
human-to-human speech are often 20% (or even higher if you don't cheat
on the test sets, but let's not get into that).
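(For readers outside the field: word error rate is the standard ASR metric,
computed as the word-level edit distance between the reference and hypothesis
transcripts, i.e. substitutions + insertions + deletions, divided by the number
of reference words. A minimal illustrative Python sketch, not from the talk:)

    # Illustrative sketch: word error rate (WER) via word-level
    # Levenshtein distance, normalized by reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between first i ref words, first j hyp words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                          # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j                          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution/match
        return dp[len(ref)][len(hyp)] / len(ref)

    # One substitution over five reference words -> WER = 0.2 (20%)
    print(wer("want to fly to boston", "want to fly to austin"))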
So why are ASR systems so much worse at understanding people than
people are? Many studies have pointed to pronunciation variation as
one likely cause of higher human-human error rates. That is, some very
clever previous experiments have shown that when people speak to humans
(as opposed to machines) they pronounce words differently. But
previous attempts to model this via "pronunciation models" have not had
much success. I'm going to discuss a series of analytic experiments
trying to understand why our clever ideas, and everybody else's
even-more-clever ideas, haven't worked, and what is really causing this
pronunciation variation in human lexical production. Hint: people don't
like to bore other people, but they don't mind boring their computers.
I'll conclude with a discussion of some neat new work we are just
beginning which applies some of these ideas to TTS (Text-to-Speech
synthesis). Because people also don't like their computers boring them.
About the Speaker
Dan Jurafsky is a newly-arrived associate professor of Linguistics at
Stanford. He has a BA (1983) in linguistics and a PhD (1992) in Computer
Science from UC Berkeley, and spent 8 years in the Linguistics and Computer
Science departments at the University of Colorado, Boulder before
coming to Stanford just this month. His research focuses on
statistical models of human and machine language processing, especially
automatic speech recognition and understanding, natural language
processing, and computational psycholinguistics. He received the
National Science Foundation CAREER award in 1998, the MacArthur
Fellowship in 2002, and also has high hopes for his recipe for Three
Cups Chicken. His most recent book, with James H. Martin, is the
widely-used textbook "Speech and Language Processing". He also plays
the drums in mediocre pop bands.
Contact: bac-coordinators@cs.stanford.edu