Infer / learn a scalar-valued function of vectors, without assuming a parametric form
For this talk we will just use a scalar-valued function of scalars
We also want a measure of uncertainty
Bayesian Approach
If we have a parametric family of functions and a sensible prior over the parameters, then Bayes' theorem (the rules of probability) can take over and give us reasonable estimates with uncertainty accounted for
How can we apply Bayesian techniques in a non-parametric setting?
When there are no parameters, what do we put a prior on?
The whole function!
Gaussian Process prior
How do we get a prior over functions? (infinite dimensional space)
Consider each individual y value a Gaussian random variable.
Think of the whole function as just like one really long vector.
An infinite collection of Gaussian variables like this is called a Gaussian process.
Just as a finite-dimensional multivariate Gaussian is specified by a mean vector and a covariance matrix, an infinite-dimensional Gaussian process is specified by a mean function and a covariance function.
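In symbols: writing \( m \) for the mean function and \( K \) for the covariance function, any finite collection of inputs \( x_1, \dots, x_n \) gets the joint distribution
\[
\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{pmatrix},
\begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_n) \\ \vdots & \ddots & \vdots \\ K(x_n, x_1) & \cdots & K(x_n, x_n) \end{pmatrix}
\right)
\]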
Why Gaussians?
We're really good at integrating Gaussians
That means we can easily perform marginalization and conditioning
And Gaussians are so nice that when we do that we get back another Gaussian
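As a reminder, for a Gaussian vector split into two blocks both operations have closed forms, and both results are again Gaussian:
\[
\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix},
\begin{pmatrix} A & C \\ C^{\top} & B \end{pmatrix}
\right)
\;\Longrightarrow\;
\mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_a, A), \qquad
\mathbf{a} \mid \mathbf{b} \sim \mathcal{N}\!\big( \boldsymbol{\mu}_a + C B^{-1} (\mathbf{b} - \boldsymbol{\mu}_b),\; A - C B^{-1} C^{\top} \big)
\]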
Example Covariance Functions
The most common covariance function is the “squared exponential” (really it should be called the “exponentiated square”); it and the next kernel are sketched in code below
\[
K(x_1, x_2) = \sigma_f \exp\left( -\frac{(x_1 - x_2)^2}{2 l^2} \right)
\]
Brownian motion / Wiener process
\[
K(x_1, x_2) = \min(x_1, x_2)
\]
“Deep” kernels constructed via limiting arguments to mimic deep nets
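Here is a minimal NumPy sketch of the two kernels above (the names se_kernel, brownian_kernel, cov_matrix and the default hyperparameter values are mine, not part of the talk):
```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    # Squared exponential ("exponentiated square") covariance.
    return sigma_f * np.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))

def brownian_kernel(x1, x2):
    # Brownian motion / Wiener process covariance (inputs assumed >= 0).
    return np.minimum(x1, x2)

def cov_matrix(kernel, xs):
    # Evaluate the kernel at every pair of points in a 1-D input grid.
    return kernel(xs[:, None], xs[None, :])
```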
How Did I Make Those Graphs?
I.e. how can we sample an infinite-dimensional object?
We can't! But we don't have to.
We only ever need to know the function at finitely many inputs
On any finite set of inputs, the function values are a finite-dimensional multivariate Gaussian
(In fact, this is basically the technical definition of a Gaussian process)
Plug the inputs into the mean and covariance functions to get a mean vector and a covariance matrix
To make those graphs, I just specified a grid of x values and sampled from a high-dimensional Gaussian
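For instance, assuming the se_kernel and cov_matrix helpers sketched earlier, one such sample might be drawn like this (the grid size and jitter value are arbitrary choices of mine):
```python
import numpy as np

xs = np.linspace(0.0, 10.0, 200)     # the only inputs we ever need the function at
mean = np.zeros_like(xs)             # zero mean function
K = cov_matrix(se_kernel, xs)        # covariance matrix from the covariance function
K += 1e-9 * np.eye(len(xs))          # tiny "jitter" keeps K numerically positive definite

# One draw from this finite-dimensional Gaussian is one sampled function on the grid.
sample = np.random.multivariate_normal(mean, K)
```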
Inference
For fixed hyperparameters (length scale, function amplitude), inference can be done exactly
Group all the training data (function value known) and testing data (function value unknown, prediction wanted) into one input set
Then the function values on that set are jointly a finite-dimensional multivariate Gaussian
Condition on the training data using good old-fashioned matrix algebra
The resulting conditional distribution over the test set will again be multivariate Gaussian, with some mean vector and covariance matrix
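Concretely, with a zero mean function, writing \( X \) for the training inputs (with observed values \( \mathbf{f} \)) and \( X_* \) for the test inputs, the conditioning identity from earlier gives
\[
\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix}
\sim \mathcal{N}\!\left( \mathbf{0},
\begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)
\quad\Longrightarrow\quad
\mathbf{f}_* \mid \mathbf{f} \sim \mathcal{N}\!\big( K(X_*, X) K(X, X)^{-1} \mathbf{f},\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \big)
\]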
Inference
Because the posterior is Gaussian (symmetric about its mean), its mean vector is the Bayes estimator for any symmetric loss; the covariance matrix gives error estimates
To handle noisy observations, add the noise variance \( \sigma_n^2 \) to the diagonal of the covariance matrix for the training data
You can add different constants if you know that some observations are more precise than others
The main cost of inference is inverting (or factorizing) the \( n \times n \) training-data covariance matrix, which is \( \mathcal{O}(n^3) \)
Once that work has been done, each further prediction is much cheaper
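Putting the last two slides together, a sketch of the computation might look like this (using a Cholesky factorization rather than an explicit inverse, the usual numerically stable choice; gp_posterior and its arguments are names I made up):
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(x_train, y_train, x_test, kernel, sigma_n=0.1):
    # Posterior mean and covariance of a zero-mean GP with noisy observations.
    K = kernel(x_train[:, None], x_train[None, :])      # n x n training covariance
    K += sigma_n ** 2 * np.eye(len(x_train))            # noise variance on the diagonal
    K_s = kernel(x_test[:, None], x_train[None, :])     # test-vs-train covariance
    K_ss = kernel(x_test[:, None], x_test[None, :])     # test-vs-test covariance

    # The O(n^3) step: factor the training covariance once, reuse it for every prediction.
    factor = cho_factor(K)
    alpha = cho_solve(factor, y_train)

    mean = K_s @ alpha                                  # Bayes estimate at the test inputs
    cov = K_ss - K_s @ cho_solve(factor, K_s.T)         # error bars come from the diagonal
    return mean, cov
```
The factorization is the expensive step; reusing it for new test inputs only costs matrix-vector work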
Inference Pictures
How to Handle Hyperparameters
If you can afford it, full Bayes is best (MCMC)
I want to talk about Type II maximum likelihood
If we want to compare models in the presence of data, Bayes says:
\[
P(M|D) = \frac{P(D|M)P(M)}{P(D)}
\]
where
\[
P(D|M) = \int P(D|\theta, M)P(\theta | M) d\theta
\]
So if we have no reason to prefer model 1 over model 2, i.e. \( P(M_1) = P(M_2) = 1/2 \), then:
\[
\frac{P(M_1|D)}{P(M_2|D)} = \frac{P(D|M_1) (1/2)}{P(D)} \cdot \frac{P(D)}{P(D|M_2) (1/2)} = \frac{P(D|M_1)}{P(D|M_2)}
\]
How to Handle Hyperparameters
So the “best” model is the one that maximizes \( P(D|M) \), the marginal likelihood
Not strictly Bayesian; a form of maximum likelihood
This is usually fine unless the marginal likelihood has more than one good mode, which can happen
For Gaussian processes, the marginal likelihood has a relatively simple, differentiable closed form (shown below)!
This means we can use gradient descent to tune hyperparameters
Way more efficient than the grid search or random search you're stuck with for most other approaches
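For a zero-mean GP with noisy training targets \( \mathbf{y} \) at inputs \( X \), writing \( K_y = K(X, X) + \sigma_n^2 I \), that closed form is
\[
\log P(D|M) = \log P(\mathbf{y} | X, \theta) = -\tfrac{1}{2} \mathbf{y}^{\top} K_y^{-1} \mathbf{y} - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi
\]
and each term is differentiable with respect to the hyperparameters \( \theta = (l, \sigma_f, \sigma_n) \), which is exactly what gradient-based tuning needs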
Applications
Machine learning technique in its own right
can be used for classification, though not discussed here
on-going research into more expressive kernels
research in fast approximations to inference
Used to optimize hyperparameters for other techniques (deep nets)
Google Vizier uses Gaussian processes for certain cases
Other function optimization applications where evaluation of the function is costly
Oil drilling, gold mining
Further Resources
Gaussian Processes for Machine Learning by Rasmussen and Williams