Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision

Machine Learning of Natural Languages via Translingual Projection

David Yarowsky
Johns Hopkins University

Monday, May 12, 2003, 4:15PM
TCSeq 200


This talk will present a paradigm for transferring diverse linguistic knowledge and analysis capabilities between languages via statistically aligned parallel bilingual text corpora. The overwhelming majority of the world's investment in computational language processing has been dedicated to English and a small group of resource-rich languages. In contrast, few or no analysis tools or labeled training data sets are available for the great majority of the world's remaining languages. This talk will present an approach that helps bridge this gap through translingual information projection, using examples from syntactic, semantic and morphological analysis. Straightforward annotation transfer is ineffective, however, and the talk will present noise-robust techniques for inducing stand-alone analysis tools in linguistically diverse foreign languages, starting with no existing capability in the given language. It will also present and contrast alternative weakly supervised approaches for inducing part-of-speech and morphological analyses from word association statistics in monolingual text corpora alone. Together, these machine learning techniques offer the potential for rapidly transferring fundamental linguistic analysis and information extraction capabilities to 50-100 new languages with minimal new investment.

About the Speaker

David Yarowsky is an associate professor of computer science at Johns Hopkins University and a member of its Center for Language and Speech Processing. His research interests include machine translation, corpus-based natural language processing and minimally supervised machine learning, with a focus on lexical ambiguity resolution and broad-coverage multilingual language processing and induction.
