Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision
(CS 528)
Information Extraction, Social Network Analysis and Joint Inference
Andrew McCallum
October 10, 2005, 4:15PM
Hewlett (TCSeq) 200
http://graphics.stanford.edu/ba-colloquium/
Abstract
Although information extraction and data mining appear together in
many applications, their interface in most current systems would
better be described as serial juxtaposition than as tight integration.
Information extraction populates slots in a database by identifying
relevant subsequences of text, but is usually not aware of the
emerging patterns and regularities in the database. Data mining
methods begin from a populated database, and are often unaware of
where the data came from, or its inherent uncertainties. The result
is that the accuracy of both suffers, and accurate mining of complex
text sources has been beyond reach.
In this talk I will describe work on probabilistic models that perform
joint inference across multiple components of an information
processing pipeline in order to avoid the brittle accumulation of
errors. After briefly introducing conditional random fields, I will
describe recent work in information extraction leveraging factorial
state representations, object deduplication, and transfer learning, as
well as scalable methods of inference and learning.
I will then describe two methods of integrating textual data into a
particular type of data mining---social network analysis. The
Author-Recipient-Topic (ART) model performs summarization and question
routing from large quantities of email or other message data by
discovering clusters of words associated with topics, and also
role-similarity among entities based on those topics. The Group-Topic
(GT) model captures relational data along with accompanying text by
discovering how entities fall into groups---capturing the different
coalitions that arise depending on the topic at hand. I will
demonstrate this on several decades of voting records in the U.N. and
U.S. Senate.
If there is time, I will also give a demo of the new research paper
search engine we are creating at UMass.
Joint work with colleagues at UMass: Charles Sutton, Chris Pal, Ben
Wellner, Michael Hay, Xuerui Wang, Natasha Mohanty, and Andres
Corrada.
About the Speaker
Andrew McCallum is an Associate Professor at the University of
Massachusetts, Amherst. He was previously Vice President of Research
and Development at WhizBang Labs, a company that used machine learning
for information extraction from the Web. In the late 1990s, he was a
Research Scientist and Coordinator at Justsystem Pittsburgh Research
Center, where he spearheaded the creation of CORA, an early research
paper search engine that used machine learning for spidering,
extraction, classification and citation analysis. He was a
post-doctoral fellow at Carnegie Mellon University after receiving his
PhD from the University of Rochester in 1995. He is an action editor
for the Journal of Machine Learning Research. For the past ten years,
McCallum has been active in research on statistical machine learning
applied to text, especially information extraction, document
classification, clustering, finite state models, semi-supervised
learning, and social network analysis.
Contact: bac-coordinators@cs.stanford.edu