Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision
Machine Learning of Natural Languages via Translingual Projection
David Yarowsky
Johns Hopkins University
Monday, May 12, 2003, 4:15PM
TCSeq 200
http://robotics.stanford.edu/ba-colloquium/
Abstract
This talk will present a paradigm for the transfer of diverse
linguistic knowledge and analysis capabilities between languages
via statistically aligned parallel bilingual text corpora.
The overwhelming majority of the world's investment in
computational language processing has been dedicated to English
and a small group of resource-rich languages. In contrast,
few or no analysis tools or labeled training data sets are
available for the vast majority of the world's other languages.
This talk will present an approach that helps bridge this
gap by translingual information projection, using examples
from syntactic, semantic and morphological analysis.
Straightforward annotation transfer is ineffective, however,
and the talk will present noise-robust techniques for
inducing stand-alone analysis tools in linguistically diverse
foreign languages, starting with no existing capability in the
given language. It will also present and contrast alternative
weakly supervised approaches for inducing part-of-speech
and morphological analyses derived from word association
statistics in monolingual text corpora alone. Together these
machine learning techniques offer the potential for rapidly
transferring fundamental linguistic analysis and information
extraction capabilities to 50-100 new languages with minimal
new investment.
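
As a rough illustration of what annotation projection over a word-aligned
parallel corpus can look like, the sketch below copies part-of-speech tags
from source tokens to the target tokens aligned with them, then keeps only
word types whose projected tag is reasonably consistent. This is a generic
toy example and not the speaker's method: the function names, the tag set,
the alignment format (source index, target index) and the 0.6 agreement
threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def project_tags(src_tags, alignment, tgt_len):
    """Copy each source tag onto the target tokens aligned to it.

    src_tags:  tags for the source-language tokens
    alignment: (src_index, tgt_index) pairs (assumed format)
    tgt_len:   number of target-language tokens
    Unaligned target positions are left as None.
    """
    projected = [None] * tgt_len
    for s, t in alignment:
        if projected[t] is None:  # keep the first link, for simplicity
            projected[t] = src_tags[s]
    return projected


def induce_lexicon(projected_corpus, min_share=0.6):
    """Build a word -> tag lexicon from noisy projected tags.

    A word type is kept only if its most frequent projected tag accounts
    for at least min_share of its tagged occurrences (a simple,
    hypothetical noise filter).
    """
    counts = defaultdict(Counter)
    for tokens, tags in projected_corpus:
        for tok, tag in zip(tokens, tags):
            if tag is not None:
                counts[tok][tag] += 1
    lexicon = {}
    for tok, tag_counts in counts.items():
        tag, n = tag_counts.most_common(1)[0]
        if n / sum(tag_counts.values()) >= min_share:
            lexicon[tok] = tag
    return lexicon


# Toy English -> French sentence pair with a one-to-one alignment.
en_tags = ["DET", "NOUN", "VERB"]
fr_tokens = ["le", "chat", "dort"]
links = [(0, 0), (1, 1), (2, 2)]

fr_tags = project_tags(en_tags, links, len(fr_tokens))
lexicon = induce_lexicon([(fr_tokens, fr_tags)])
```

Aggregating projected tags over many sentences before trusting any of them
is one simple way to tolerate alignment errors; the techniques presented in
the talk may differ substantially.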