Gaussian Process Regression

J Dorsey

Roadmap

  • The problem: Non-parametric curve-fitting
  • A Bayesian solution
  • Gaussian Processes
  • Why Gaussians?
  • Sampling from Gaussian processes
  • Inference
  • Hyperparameter learning
  • Applications
  • Further Resources

The Problem

  • Infer / learn a scalar-valued function of vectors, without assuming a parametric form
  • For this talk we will just use a scalar-valued function of scalars
  • We also want a measure of uncertainty

Bayesian Approach

  • If we have a parametric family of functions and a sensible prior over the parameters, then Bayes' theorem (the rules of probability) can take over and give us reasonable estimates with uncertainty accounted for
  • How can we apply Bayesian techniques in a non-parametric setting?
  • When there are no parameters, what do we put a prior on?
  • The whole function!

Gaussian Process prior

  • How do we get a prior over functions? (infinite dimensional space)
  • Consider each individual y value a Gaussian random variable.
  • Think of the whole function as just like one really long vector.
  • An infinite collection of Gaussian variables like this is called a Gaussian process.
  • Just as a finite-dimensional multivariate Gaussian is specified by a mean vector and a covariance matrix, an infinite-dimensional Gaussian process is specified by a mean function and a covariance function.

Why Gaussians?

  • We're really good at integrating Gaussians
  • That means we can easily perform marginalization and conditioning
  • And Gaussians are so nice that when we do that we get back another Gaussian
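
For reference, these are the two standard facts being used: if

\[ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \]

then marginally \( x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}) \), and conditionally

\[ x_1 \mid x_2 \sim \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right) \]

Both results are again Gaussian, which is what makes the rest of the machinery work.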

Example Covariance Functions

  • The most common covariance function is called “squared exponential” (really should be “exponentiated square”) \[ K(x_1, x_2) = \sigma_f^2 \exp\left( -\frac{(x_1 - x_2) ^ 2}{2 l ^ 2} \right) \]
  • Brownian motion / Wiener process \[ K(x_1, x_2) = \text{min}(x_1, x_2) \]

  • “Deep” kernels constructed via limiting arguments to mimic deep nets
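
A minimal R sketch of the first two covariance functions (the argument names and defaults are just for illustration; \( \sigma_f \) and \( l \) are the amplitude and length-scale hyperparameters that show up again later):

```r
# Squared-exponential (exponentiated quadratic) covariance
k_se <- function(x1, x2, sigma_f = 1, l = 1) {
  sigma_f^2 * exp(-(x1 - x2)^2 / (2 * l^2))
}

# Brownian motion / Wiener process covariance (inputs assumed >= 0)
k_brownian <- function(x1, x2) {
  pmin(x1, x2)
}
```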

How Did I Make Those Graphs?

  • I.e. how can we sample an infinite-dimensional object?
  • We can't! But we don't have to.
  • We only ever need to know the function at finitely many inputs
  • On any finite set of inputs, the function values are a finite-dimensional multivariate Gaussian
  • (In fact, this is basically the technical definition of a Gaussian process)
  • Plug the inputs into the mean and covariance functions to get a mean vector and covariance matrix
  • To make those graphs, I just specified a grid of x values and sampled from a high-dimensional Gaussian (sketched in R below)
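
A minimal sketch of that recipe, assuming a zero mean function and the squared-exponential covariance with unit hyperparameters; the small jitter on the diagonal is only there to keep the Cholesky factorization numerically stable:

```r
set.seed(1)

# Grid of inputs where we want to look at the function
x <- seq(-5, 5, length.out = 200)

# Covariance matrix from the squared-exponential kernel
K <- outer(x, x, function(a, b) exp(-(a - b)^2 / 2))

# chol() returns upper-triangular U with t(U) %*% U = K
U <- chol(K + 1e-6 * diag(length(x)))

# One draw from N(0, K): multiply iid standard normals by t(U)
f <- as.vector(t(U) %*% rnorm(length(x)))

plot(x, f, type = "l", ylab = "f(x)")
```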

Inference

  • For fixed hyperparameters (length scale, function amplitude), inference can be done exactly
  • Group all the training data (function value known) and testing data (function value unknown, prediction wanted) into one input set
  • Then the values of the function on that set form a finite-dimensional multivariate Gaussian
  • Condition on the training data using good old-fashioned matrix algebra
  • The resulting conditional distribution over the test set will again be multivariate Gaussian, with some mean vector and covariance matrix
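
With a zero prior mean, and writing \( K_{XX} \), \( K_{*X} \), \( K_{**} \) for the train/train, test/train, and test/test blocks of the covariance matrix, the conditioning step gives

\[ \bar{f}_* = K_{*X} K_{XX}^{-1} y, \qquad \mathrm{cov}(f_*) = K_{**} - K_{*X} K_{XX}^{-1} K_{X*} \]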

Inference

  • The Bayes estimator for any symmetric loss is the mean vector. The covariance matrix gives error estimates
  • To handle noisy observations, add a constant noise variance \( \sigma_n^2 \) to the diagonal of the covariance matrix for the training data
  • You can add different constants if you know that some observations are more precise than others
  • The main cost of inference is inverting the \( n \times n \) training data covariance matrix which is \( \mathcal{O}(n^3) \)
  • Once that matrix has been inverted, further predictions are less costly
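
A bare-bones R sketch of the last two slides (fixed squared-exponential hyperparameters; `sigma_n` is the noise standard deviation, so \( \sigma_n^2 \) goes on the training diagonal). A careful implementation would use a Cholesky factorization rather than `solve()`, but this shows the algebra:

```r
# Squared-exponential covariance matrix between two input vectors
k_se_mat <- function(a, b, sigma_f = 1, l = 1) {
  sigma_f^2 * exp(-outer(a, b, "-")^2 / (2 * l^2))
}

# Exact GP regression: condition the joint Gaussian on the training data
gp_predict <- function(x_train, y_train, x_test,
                       sigma_n = 0.1, sigma_f = 1, l = 1) {
  K_xx <- k_se_mat(x_train, x_train, sigma_f, l) +
          sigma_n^2 * diag(length(x_train))        # noise variance on the diagonal
  K_sx <- k_se_mat(x_test, x_train, sigma_f, l)
  K_ss <- k_se_mat(x_test, x_test, sigma_f, l)

  K_inv <- solve(K_xx)                             # the O(n^3) step
  mu    <- K_sx %*% K_inv %*% y_train              # posterior mean (Bayes estimate)
  S     <- K_ss - K_sx %*% K_inv %*% t(K_sx)       # posterior covariance

  list(mean = as.vector(mu), var = pmax(diag(S), 0))
}

# Toy usage
x_train <- c(-4, -2, 0, 1, 3)
y_train <- sin(x_train)
x_test  <- seq(-5, 5, length.out = 100)
fit <- gp_predict(x_train, y_train, x_test)
```

The returned `mean` is the point prediction and `var` gives the pointwise error estimates.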

Inference Pictures

(Plots omitted.)

How to Handle Hyperparameters

  • If you can afford it, full Bayes is best (MCMC)
  • I want to talk about Type II maximum likelihood
  • If we want to compare models in the presence of data, Bayes says: \[ P(M|D) = \frac{P(D|M)P(M)}{P(D)} \] where \[ P(D|M) = \int P(D|\theta, M)P(\theta | M) d\theta \]
  • So if we have no reason to prefer model 1 over model 2 then: \[ \frac{P(M_1|D)}{P(M_2|D)} = \frac{P(D|M_1)\,(1/2)\,/\,P(D)}{P(D|M_2)\,(1/2)\,/\,P(D)} = \frac{P(D|M_1)}{P(D|M_2)} \]

How to Handle Hyperparameters

  • So the “best” model is the one that maximizes \( P(D|M) \), the marginal likelihood
  • Not strictly Bayesian; a form of maximum likelihood
  • Not a problem unless the marginal likelihood has more than one good mode, but this can happen
  • For Gaussian processes, the marginal likelihood has a relatively simple, differentiable closed form!
  • This means we can use gradient-based optimization to tune hyperparameters
  • Far more efficient than the grid search or random search you'd have to use for most approaches
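
For reference, with \( K_y = K + \sigma_n^2 I \) the log marginal likelihood is

\[ \log P(D|M) = -\tfrac{1}{2} y^\top K_y^{-1} y - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi \]

A rough R sketch of Type II maximum likelihood for the squared-exponential kernel, letting `optim` use numerical gradients for simplicity (the analytic gradients also have a closed form):

```r
# Negative log marginal likelihood; theta = (log sigma_f, log l, log sigma_n)
# Working on the log scale keeps the hyperparameters positive
neg_log_marglik <- function(theta, x, y) {
  sigma_f <- exp(theta[1]); l <- exp(theta[2]); sigma_n <- exp(theta[3])
  n  <- length(x)
  Ky <- sigma_f^2 * exp(-outer(x, x, "-")^2 / (2 * l^2)) + sigma_n^2 * diag(n)
  U  <- chol(Ky)                                  # t(U) %*% U = Ky
  alpha <- backsolve(U, forwardsolve(t(U), y))    # Ky^{-1} y via triangular solves
  0.5 * sum(y * alpha) + sum(log(diag(U))) + 0.5 * n * log(2 * pi)
}

# Toy data, then tune hyperparameters by minimizing the objective
x_train <- c(-4, -2, 0, 1, 3)
y_train <- sin(x_train)
fit <- optim(c(0, 0, log(0.1)), neg_log_marglik,
             x = x_train, y = y_train, method = "L-BFGS-B")
exp(fit$par)   # fitted (sigma_f, l, sigma_n)
```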

Applications

  • Machine learning technique in its own right
    • can be used for classification, though not discussed here
    • ongoing research into more expressive kernels
    • research into fast approximations to inference
  • Used to optimize hyperparameters for other techniques (deep nets)
    • Google Vizier uses Gaussian processes for certain cases
  • Other function optimization applications where evaluation of the function is costly
    • Oil drilling, gold mining

Further Resources

A github repo with links to lots of books, papers, blog posts, and videos