Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision

Training Products of Experts By Minimizing Contrastive Divergence

Geoffrey Hinton
Gatsby Computational Neuroscience Unit
University College London

Monday, Apr 9, 2001, 4:15PM


It is possible to combine multiple non-linear latent variable models of the same data by multiplying the probability distributions together and then renormalizing. This is a very efficient way to model data which simultaneously satisfies many different constraints. For example, one expert model of a word string can ensure that the tenses agree and another can ensure that the numbers agree. It is hard to generate samples from a product of experts, but it is very easy to infer the distributions of the latent variables given an observation, because the latent variables of different experts are conditionally independent. Maximum likelihood fitting of a product of experts is difficult because, in addition to maximizing the log probabilities that each expert assigns to the observed data, it is necessary to minimize the normalization term, which involves a weighted sum over all possible observations. This appears to require tedious Monte Carlo methods or dubious approximations. Fortunately, there is an efficient alternative to maximum likelihood fitting which works remarkably well. Instead of just maximizing the log probability of the data, we also minimize the log probability of the reconstructions of the data that are produced by a single full step of Gibbs sampling. In effect, we initialize a Markov chain at the distribution that we would LIKE to be its equilibrium distribution and watch how it starts to wander away. We then lower the free energy of the place it started from and raise the free energy of wherever it wants to go to. This eliminates the model's desire to corrupt the data. Some examples of product of expert models trained in this way will be described. Products of experts work very well for handwritten digit recognition, and the same algorithm can be used to fit products of Hidden Markov Models, which can have exponentially more representational power than single Hidden Markov Models.
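The learning rule sketched in the abstract can be illustrated with a one-step contrastive divergence (CD-1) update for a restricted Boltzmann machine, a simple product of experts in which each hidden unit acts as one expert. This is only an illustrative sketch: the RBM layer sizes, the learning rate, and the omission of bias terms are assumptions, not details given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, lr=0.1):
    """One CD-1 weight update for a binary RBM with weight matrix W.

    v_data: (n_samples, n_visible) batch of binary data vectors.
    Returns the updated weights.  Bias terms are omitted for brevity.
    """
    # Up pass: infer the hidden units given the data.  Because the
    # experts are conditionally independent given an observation,
    # this is a single matrix product followed by a sigmoid.
    h_prob = sigmoid(v_data @ W)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)

    # One full step of Gibbs sampling: reconstruct the data from the
    # sampled hidden units, then re-infer the hidden probabilities.
    v_recon = sigmoid(h_samp @ W.T)
    h_recon = sigmoid(v_recon @ W)

    # Lower the free energy where the chain started (the data) and
    # raise it where the chain wandered to (the reconstructions):
    # delta_W  proportional to  <v h>_data - <v h>_reconstruction.
    pos = v_data.T @ h_prob
    neg = v_recon.T @ h_recon
    return W + lr * (pos - neg) / v_data.shape[0]

# Toy usage: 6 visible units, 3 hidden units (experts).
W = 0.01 * rng.standard_normal((6, 3))
batch = (rng.random((20, 6)) < 0.5).astype(float)
W = cd1_update(batch, W)
```

Note that the expensive normalization term never appears: both the positive and negative statistics are cheap batch averages, which is what makes this alternative to maximum likelihood practical.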

About the Speaker

Geoffrey Hinton received his BA in experimental psychology from Cambridge in 1970 and his PhD in Artificial Intelligence from Edinburgh in 1978. He was a member of the PDP group at the University of California, San Diego, an assistant professor at Carnegie-Mellon University, and a professor at the University of Toronto. He is currently the director of the Gatsby Computational Neuroscience Unit at University College London. He does research on ways of using neural networks for learning, memory, perception and symbol processing. He was one of the researchers who introduced the back-propagation algorithm. His other contributions to neural network research include Boltzmann machines, distributed representations, time-delay neural nets, mixtures of experts and Helmholtz machines. His current main interest is learning procedures for products of latent variable models.


Back to the Colloquium Page