Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision
Inner-loop Statistics in Automated
Scientific Discovery from Massive Datasets
Andrew Moore
Robotics Institute and Computer Science CMU
and Schenley Park Research, Inc.
Wednesday, February 2, 2000
refreshments 4:05PM, talk begins 4:15PM
TCseq201, Lecture Hall B
http://robotics.stanford.edu/ba-colloquium/
Abstract
Intensive statistical analysis of massive data sources ("data mining")
has been embraced as one of the final areas with a need for massive
computation beyond that available on a $2000 computer or $200
videogame. We begin this talk with two examples of software, instead
of hardware, giving 1000-fold speedups over traditional
implementations of statistical algorithms for prediction, density
estimation, and clustering.
We then pause to examine directions in which these software solutions
seemed blocked when faced with Physics, Biology and commercial
scientific data discovery problems. The primary blocks were a curse of
dimensionality and limitations on machine main memories. This is
followed by four examples of new pieces of research that circumvent
these barriers: lazy cached sufficient statistics, exact accelerated
k-means, multiresolution ball-trees for very high dimensional
real-valued data, and filament identifiers.
We then reveal the reason for our new-found respect for
super-computation: when an algorithm you previously ran overnight
executes in seconds, you find yourself wanting to run it ten thousand
times. We show the impact that being able to run intensive statistics
as an inner loop has had on our analysis of cosmology data (preliminary
data from the Sloan Digital Sky Survey) and biotoxin identification,
where desirable but hopelessly extravagant operations such as model
selection, bootstrapping, backfitting, randomization and graphical
model design now become somewhat non-hopeless.
Joint work with Andy Connolly (U Pitt Physics), Artur Dubrawski
(Schenley Park Research), Geoff Gordon (Auton Lab), Paul Komarek (Auton
Lab), Bob Nichol (CMU Physics), Dan Pelleg (Auton Lab) and Larry
Wasserman (CMU Statistics).
About the Speaker
Andrew Moore (www.cs.cmu.edu/~awm) is the A. Nico Habermann Associate
Professor of Computer Science and Robotics at CMU. He received a PhD
in Computer Science from the University of Cambridge in 1991 (thesis
topic: Robot Learning). He has worked with robots that learn,
factories that learn and supply chains that learn. His research
interests include: statistical foundations, autonomous learning
systems for manufacturing, efficient algorithms for machine learning
from massive data and reinforcement learning, finite production
scheduling, and machine learning applied to optimization. He is the
co-owner and CTO of Schenley Park Research, Inc., a 12-person
Pittsburgh-based AI startup supplying data mining and decision theory
products and solutions to manufacturing, business-to-business and
biotechnology clients.
bac-coordinators@cs.stanford.edu
Last modified: Mon Jan 31 14:52:02 PST 2000