Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision

Inner-loop Statistics in Automated Scientific Discovery from Massive Datasets

Andrew Moore
Robotics Institute and Computer Science CMU
and Schenley Park Research, Inc.

Wednesday, February 2, 2000
refreshments 4:05PM, talk begins 4:15PM
TCseq201, Lecture Hall B


Intensive statistical analysis of massive data sources ("data mining") has been embraced as one of the final areas with a need for massive computation beyond that available on a $2000 computer or $200 videogame. We begin this talk with two examples of software, instead of hardware, giving 1000-fold speedups over traditional implementations of statistical algorithms for prediction, density estimation, and clustering. We then pause to examine directions in which these software solutions seemed to be blocked when faced with physics, biology, and commercial scientific data discovery problems. The primary blocks were the curse of dimensionality and limitations on machine main memory. This is followed by four examples of new pieces of research that circumvent these barriers: lazy cached sufficient statistics, exact accelerated k-means, multiresolution ball-trees for very high dimensional real-valued data, and filament identifiers. We then reveal the reason for our new-found respect for super-computation: when an algorithm you previously ran overnight executes in seconds, you find yourself wanting to run it ten thousand times. We show the impact that being able to run intensive statistics as an inner loop has had on our analysis of cosmology data (preliminary data from the Sloan Digital Sky Survey) and biotoxin identification, where desirable but hopelessly extravagant operations such as model selection, bootstrapping, backfitting, randomization, and graphical model design now become somewhat non-hopeless.

Joint work with Andy Connolly (U Pitt Physics), Artur Dubrawski (Schenley Park Research), Geoff Gordon (Auton Lab), Paul Komarek (Auton Lab), Bob Nichol (CMU Physics), Dan Pelleg (Auton Lab), and Larry Wasserman (CMU Statistics).

About the Speaker

Andrew Moore is the A. Nico Haberman Associate Professor of Computer Science and Robotics at CMU. He received a PhD in Computer Science from the University of Cambridge in 1991 (thesis topic: Robot Learning). He has worked with robots that learn, factories that learn, and supply chains that learn. His research interests include statistical foundations, autonomous learning systems for manufacturing, efficient algorithms for machine learning from massive data and reinforcement learning, finite production scheduling, and machine learning applied to optimization. He is the co-owner and CTO of Schenley Park Research Inc., a 12-person Pittsburgh-based AI startup supplying data mining and decision theory products and solutions to manufacturing, business-to-business, and biotechnology clients.
Last modified: Mon Jan 31 14:52:02 PST 2000