Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision
Machine Learning and Extracting Information from the Web
Tom M. Mitchell
Vice President and Chief Scientist, WhizBang! Labs
Fredkin Professor of Learning and AI, Carnegie Mellon University
Monday, Jan 29, 2001, 4:15PM
Herrin Hall, BioT 175
http://robotics.stanford.edu/ba-colloquium/
Abstract
Today's search engines can retrieve and display over a billion web pages,
but their use is limited by the fact that they don't analyze the content of
these pages in any depth.
What if these search engines could extract the factual content from the
pages they retrieve? Then, instead of asking for pages that contain the
keyword "java," we could ask directly for the facts we are after, such as
"What Java programming jobs are available in Palo Alto?" or "Are there any
evening courses on Java available in the Palo Alto area during spring 2001?"
This talk will describe research that has resulted in systems that answer
these kinds of questions by extracting detailed factual information
automatically from millions of web pages. Our approach relies heavily on
machine learning algorithms to train the system to find and extract targeted
information. For example, in one case we trained our system to find and
extract job postings from the web, resulting in the world's largest database
of job openings (over 600,000 jobs; see www.flipdog.com). This talk will
describe machine learning algorithms for classifying and extracting
information from web pages, including results of recent research on using
unlabeled data and other kinds of information to improve learning accuracy.
About the Speaker
Tom M. Mitchell is Vice President and Chief Scientist at WhizBang! Labs. He
is currently on a two-year leave of absence from Carnegie Mellon University,
where he is the Fredkin Professor of Learning and AI in the School of
Computer Science, and Director of CMU's Center for Automated Learning and
Discovery. Mitchell's primary research interest is in Machine Learning
theory and practice. Mitchell is the author of the textbook "Machine
Learning" (McGraw Hill, 1997), incoming President of the American
Association for Artificial Intelligence, and a member of the National
Research Council's Computer Science and Telecommunications Board.
Contact: bac-coordinators@cs.stanford.edu