## Broad Area Colloquium For AI-Geometry-Graphics-Robotics-Vision

### Training Products of Experts By Minimizing Contrastive Divergence

Geoffrey Hinton

Gatsby Computational Neuroscience Unit

University College London

Monday, Apr 9, 2001, 4:15PM

TCSEQ 201

`http://robotics.stanford.edu/ba-colloquium/`

#### Abstract

It is possible to combine multiple non-linear latent variable models
of the same data by multiplying the probability distributions together
and then renormalizing. This is a very efficient way to model data
which simultaneously satisfies many different constraints. For
example, one expert model of a word string can ensure that the tenses
agree and another can ensure that the numbers agree.
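As a toy illustration of this combination rule (not an example from the talk), the sketch below multiplies two hypothetical expert distributions over four discrete outcomes and renormalizes; the numbers are invented purely to show that an outcome survives only if every expert tolerates it.

```python
import numpy as np

# Two hypothetical "expert" distributions over the same four outcomes.
# Each expert assigns low probability to outcomes violating its constraint.
expert_a = np.array([0.4, 0.4, 0.1, 0.1])  # penalizes outcomes 3 and 4
expert_b = np.array([0.4, 0.1, 0.4, 0.1])  # penalizes outcomes 2 and 4

# Product of experts: multiply the distributions, then renormalize.
product = expert_a * expert_b
product /= product.sum()

# Only outcome 1 satisfies both experts' constraints, so it dominates.
print(product)
```

Note that the renormalizing denominator here is a sum over only four outcomes; for realistic models it is a sum over all possible observations, which is exactly the term that makes maximum likelihood fitting hard.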
It is hard to generate samples from a product of experts, but it is
very easy to infer the distributions of the latent variables given an
observation because the latent variables of different experts are
conditionally independent. Maximum likelihood fitting of a product of
experts is difficult because, in addition to maximizing the log
probabilities that each expert assigns to the observed data, it is
necessary to minimize the normalization term which involves a weighted
sum over all possible observations. This appears to require tedious
Monte Carlo methods or dubious approximations. Fortunately, there is
an efficient alternative to maximum likelihood fitting which works
remarkably well. Instead of just maximizing the log probability of the
data, we also minimize the log probability of the reconstructions of
the data that are produced by a single full step of Gibbs sampling.
In effect, we initialize a Markov chain at the distribution that we
would LIKE to be its equilibrium distribution and watch how it starts
to wander away. We then lower the free energy of the place it started
from and raise the free energy of wherever it wants to go. This
eliminates the model's desire to corrupt the data.
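The procedure just described can be sketched for a small binary restricted Boltzmann machine (a common setting for this learning rule, though the sizes, learning rate, and omission of bias terms here are illustrative assumptions, not details from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny binary RBM: 6 visible units, 3 hidden units (illustrative sizes;
# bias terms are omitted for brevity).
n_vis, n_hid = 6, 3
W = rng.normal(0.0, 0.1, size=(n_vis, n_hid))

def cd1_update(v0, W, lr=0.1):
    """One contrastive-divergence (CD-1) weight update for one example."""
    # Positive phase: infer hidden probabilities from the data.
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(n_hid) < h0_prob).astype(float)
    # One full step of Gibbs sampling: reconstruct the visible units,
    # then re-infer the hidden probabilities from the reconstruction.
    v1_prob = sigmoid(h0 @ W.T)
    h1_prob = sigmoid(v1_prob @ W)
    # Lower the free energy of the data (positive term) and raise the
    # free energy of the one-step reconstruction (negative term).
    return W + lr * (np.outer(v0, h0_prob) - np.outer(v1_prob, h1_prob))

v0 = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])  # one binary data vector
W = cd1_update(v0, W)
```

The key point the code makes concrete is that no equilibrium samples are needed: the negative statistics come from a single Gibbs step away from the data, which is what avoids the tedious Monte Carlo methods mentioned above.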
Some examples of product-of-experts models trained in this way will be
described. Products of experts work very well for handwritten digit
recognition, and the same algorithm can be used to fit products of
Hidden Markov Models, which can have exponentially more
representational power than single Hidden Markov Models.

#### About the Speaker

Geoffrey Hinton received his BA in experimental psychology from Cambridge in
1970 and his PhD in Artificial Intelligence from Edinburgh in 1978. He was a
member of the PDP group at the University of California, San Diego, an
assistant professor at Carnegie-Mellon University, and a professor at the
University of
Toronto. He is currently the director of the Gatsby Computational Neuroscience
Unit at University College London. He does research on ways of using neural
networks for learning, memory, perception and symbol processing. He was one of
the researchers who introduced the back-propagation algorithm. His other
contributions to neural network research include Boltzmann machines,
distributed representations, time-delay neural nets, mixtures of experts
and Helmholtz
machines. His current main interest is learning procedures for products of
latent variable models.

Contact: `bac-coordinators@cs.stanford.edu`