Gaussian Process Regression

J Dorsey

Roadmap

  • The problem: Non-parametric curve-fitting
  • A Bayesian solution
  • Gaussian Processes
  • Why Gaussians?
  • Sampling from Gaussian processes
  • Inference
  • Hyperparameter learning
  • Applications
  • Further Resources

The Problem

  • Infer / learn a scalar-valued function of vectors, without assuming a parametric form
  • For this talk we will just use a scalar-valued function of scalars
  • We also want a measure of uncertainty

Bayesian Approach

  • If we have a parametric family of functions and a sensible prior over the parameters, then Bayes' theorem (the rules of probability) can take over and give us reasonable estimates with uncertainty accounted for
  • How can we apply Bayesian techniques in a non-parametric setting?
  • When there are no parameters, what do we put a prior on?
  • The whole function!

Gaussian Process prior

  • How do we get a prior over functions? (infinite dimensional space)
  • Consider each individual y value a Gaussian random variable.
  • Think of the whole function as just like one really long vector.
  • An infinite collection of Gaussian variables like this is called a Gaussian process.
  • Just as a finite-dimensional multivariate Gaussian is specified by a mean vector and a covariance matrix, an infinite-dimensional Gaussian process is specified by a mean function and a covariance function.

Why Gaussians?

  • We're really good at integrating Gaussians
  • That means we can easily perform marginalization and conditioning
  • And Gaussians are so nice that when we do that we get back another Gaussian
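
For reference, these are the two standard facts being used: if

\[ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \]

then marginally \( x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}) \), and conditionally

\[ x_1 \mid x_2 \sim \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right) \]

Both results are again Gaussian, which is what makes the rest of the machinery work.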

Example Covariance Functions

  • The most common covariance function is called “squared exponential” (really should be “exponentiated square”) \[ K(x_1, x_2) = \sigma_f^2 \exp\left( -\frac{(x_1 - x_2) ^ 2}{2 l ^ 2} \right) \]
  • Brownian motion / Wiener process \[ K(x_1, x_2) = \text{min}(x_1, x_2) \]

  • “Deep” kernels constructed via limiting arguments to mimic deep nets
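
A minimal R sketch of the first two covariance functions (the argument names and defaults are just for illustration; \( \sigma_f \) and \( l \) are the amplitude and length-scale hyperparameters that show up again later):

```r
# Squared-exponential (exponentiated quadratic) covariance
k_se <- function(x1, x2, sigma_f = 1, l = 1) {
  sigma_f^2 * exp(-(x1 - x2)^2 / (2 * l^2))
}

# Brownian motion / Wiener process covariance (inputs assumed >= 0)
k_brownian <- function(x1, x2) {
  pmin(x1, x2)
}
```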

How Did I Make Those Graphs?

  • I.e. how can we sample an infinite-dimensional object?
  • We can't! But we don't have to.
  • We only ever need to know the function at finitely many inputs
  • On any finite set of inputs, the function values are a finite-dimensional multivariate Gaussian
  • (In fact, this is basically the technical definition of a Gaussian process)
  • Plug the inputs into the mean and covariance functions to get a mean vector and covariance matrix
  • To make those graphs, I just specified a grid of x values and sampled from a high-dimensional Gaussian (sketched in R below)
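
A minimal sketch of that recipe, assuming a zero mean function and the squared-exponential covariance with unit hyperparameters; the small jitter on the diagonal is only there to keep the Cholesky factorization numerically stable:

```r
set.seed(1)

# Grid of inputs where we want to look at the function
x <- seq(-5, 5, length.out = 200)

# Covariance matrix from the squared-exponential kernel
K <- outer(x, x, function(a, b) exp(-(a - b)^2 / 2))

# chol() returns upper-triangular U with t(U) %*% U = K
U <- chol(K + 1e-6 * diag(length(x)))

# One draw from N(0, K): multiply iid standard normals by t(U)
f <- as.vector(t(U) %*% rnorm(length(x)))

plot(x, f, type = "l", ylab = "f(x)")
```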

Inference

  • For fixed hyperparameters (length scale, function amplitude), inference can be done exactly
  • Group all the training data (function value known) and testing data (function value unknown, prediction wanted) into one input set
  • Then the values of the function on that set form a finite-dimensional multivariate Gaussian
  • Condition on the training data using good old-fashioned matrix algebra
  • The resulting conditional distribution over the test set will again be multivariate Gaussian, with some mean vector and covariance matrix
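
With a zero prior mean, and writing \( K_{XX} \), \( K_{*X} \), \( K_{**} \) for the train/train, test/train, and test/test blocks of the covariance matrix, the conditioning step gives

\[ \bar{f}_* = K_{*X} K_{XX}^{-1} y, \qquad \mathrm{cov}(f_*) = K_{**} - K_{*X} K_{XX}^{-1} K_{X*} \]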

Inference

  • The Bayes estimator for any symmetric loss is the mean vector. The covariance matrix gives error estimates
  • To handle noisy observations, add a constant noise variance \( \sigma_n^2 \) to the diagonal of the covariance matrix for the training data
  • You can add different constants if you know that some observations are more precise than others
  • The main cost of inference is inverting the \( n \times n \) training data covariance matrix which is \( \mathcal{O}(n^3) \)
  • Once that matrix has been inverted, further predictions are less costly
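
A bare-bones R sketch of the last two slides (fixed squared-exponential hyperparameters; `sigma_n` is the noise standard deviation, so \( \sigma_n^2 \) goes on the training diagonal). A careful implementation would use a Cholesky factorization rather than `solve()`, but this shows the algebra:

```r
# Squared-exponential covariance matrix between two input vectors
k_se_mat <- function(a, b, sigma_f = 1, l = 1) {
  sigma_f^2 * exp(-outer(a, b, "-")^2 / (2 * l^2))
}

# Exact GP regression: condition the joint Gaussian on the training data
gp_predict <- function(x_train, y_train, x_test,
                       sigma_n = 0.1, sigma_f = 1, l = 1) {
  K_xx <- k_se_mat(x_train, x_train, sigma_f, l) +
          sigma_n^2 * diag(length(x_train))        # noise variance on the diagonal
  K_sx <- k_se_mat(x_test, x_train, sigma_f, l)
  K_ss <- k_se_mat(x_test, x_test, sigma_f, l)

  K_inv <- solve(K_xx)                             # the O(n^3) step
  mu    <- K_sx %*% K_inv %*% y_train              # posterior mean (Bayes estimate)
  S     <- K_ss - K_sx %*% K_inv %*% t(K_sx)       # posterior covariance

  list(mean = as.vector(mu), var = pmax(diag(S), 0))
}

# Toy usage
x_train <- c(-4, -2, 0, 1, 3)
y_train <- sin(x_train)
x_test  <- seq(-5, 5, length.out = 100)
fit <- gp_predict(x_train, y_train, x_test)
```

The returned `mean` is the point prediction and `var` gives the pointwise error estimates.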

Inference Pictures

(Plots omitted.)

How to Handle Hyperparameters

  • If you can afford it, full Bayes is best (MCMC)
  • I want to talk about Type II maximum likelihood
  • If we want to compare models in the presence of data, Bayes says: \[ P(M|D) = \frac{P(D|M)P(M)}{P(D)} \] where \[ P(D|M) = \int P(D|\theta, M)P(\theta | M) d\theta \]
  • So if we have no reason to prefer model 1 over model 2 then: \[ \frac{P(M_1|D)}{P(M_2|D)} = \frac{P(D|M_1)\,(1/2)\,/\,P(D)}{P(D|M_2)\,(1/2)\,/\,P(D)} = \frac{P(D|M_1)}{P(D|M_2)} \]

How to Handle Hyperparameters

  • So the “best” model is the one that maximizes \( P(D|M) \), the marginal likelihood
  • Not strictly Bayesian; a form of maximum likelihood
  • Not a problem unless the marginal likelihood has more than one good mode, but this can happen
  • For Gaussian processes, the marginal likelihood has a relatively simple, differentiable closed form!
  • This means we can use gradient-based optimization to tune hyperparameters
  • Far more efficient than the grid search or random search you'd have to use for most approaches
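
For reference, with \( K_y = K + \sigma_n^2 I \) the log marginal likelihood is

\[ \log P(D|M) = -\tfrac{1}{2} y^\top K_y^{-1} y - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi \]

A rough R sketch of Type II maximum likelihood for the squared-exponential kernel, letting `optim` use numerical gradients for simplicity (the analytic gradients also have a closed form):

```r
# Negative log marginal likelihood; theta = (log sigma_f, log l, log sigma_n)
# Working on the log scale keeps the hyperparameters positive
neg_log_marglik <- function(theta, x, y) {
  sigma_f <- exp(theta[1]); l <- exp(theta[2]); sigma_n <- exp(theta[3])
  n  <- length(x)
  Ky <- sigma_f^2 * exp(-outer(x, x, "-")^2 / (2 * l^2)) + sigma_n^2 * diag(n)
  U  <- chol(Ky)                                  # t(U) %*% U = Ky
  alpha <- backsolve(U, forwardsolve(t(U), y))    # Ky^{-1} y via triangular solves
  0.5 * sum(y * alpha) + sum(log(diag(U))) + 0.5 * n * log(2 * pi)
}

# Toy data, then tune hyperparameters by minimizing the objective
x_train <- c(-4, -2, 0, 1, 3)
y_train <- sin(x_train)
fit <- optim(c(0, 0, log(0.1)), neg_log_marglik,
             x = x_train, y = y_train, method = "L-BFGS-B")
exp(fit$par)   # fitted (sigma_f, l, sigma_n)
```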

Applications

  • Machine learning technique in its own right
    • can be used for classification, though not discussed here
    • ongoing research into more expressive kernels
    • research into fast approximations to inference
  • Used to optimize hyperparameters for other techniques (deep nets)
    • Google Vizier uses Gaussian processes for certain cases
  • Other function optimization applications where evaluation of the function is costly
    • Oil drilling, gold mining

Further Resources

A github repo with links to lots of books, papers, blog posts, and videos