Dr Niamh Cahill (she/her)
Early Career Workshop on Bayesian Statistics
\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]
All quantities are divided up into data (i.e., things which have been observed) and parameters (i.e., things we want to estimate).
We use the Bayesian interpretation of the theorem to get the posterior probability distribution. The posterior tells us everything we need to know about what we are trying to estimate.
Bayes’ theorem can be written in words as:
\[\mbox{posterior is proportional to likelihood times prior}\] … or …
\[\mbox{posterior} \propto \mbox{likelihood} \times \mbox{prior}\]
Each of the three terms (posterior, likelihood, and prior) is a probability distribution (pdf).
In a Bayesian model, every item of interest is either data (which we will write as \(y\) or \(x\)) or parameters (which we will write as \(\theta\) or \(\sigma\)). Often the parameters are divided up into those of interest, and other nuisance parameters.
Given a set of observed data points \(y\) and a set of parameters \(\theta\), we write Bayes’ rule as
\[\underset{\text{posterior}}{P(\theta|y)} = \frac{\underset{\text{likelihood}}{P(y|\theta)}\underset{\text{prior}}{P(\theta)}}{\underset{\text{marginal likelihood}}{P(y)}}\] \(P(y)\) is a constant that is often difficult to calculate, so Bayes’ rule is often written more simply as a statement of proportionality:
\[\underset{\text{posterior}}{P(\theta|y)} \propto \underset{\text{likelihood}}{P(y|\theta)}\underset{\text{prior}}{P(\theta)}\]
\(P(y|\theta)\) is the probability distribution of the data given the parameters and is known as the likelihood. This summarises what we know about the data and is also known as the data model.
\(P(\theta)\) is the probability distribution of the parameters and is known as the prior. The prior represents what we know about the parameters before the data are observed.
\(P(\theta|y)\) is the probability distribution of the parameters given the data and is known as the posterior. The posterior represents our updated knowledge about parameters after the data are observed.
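Putting the three together, Bayes’ rule is just the joint distribution of \(\theta\) and \(y\) factorised in two ways:
\[P(\theta|y)P(y) = P(\theta, y) = P(y|\theta)P(\theta)\]
Dividing through by \(P(y)\) gives the rule above, and because \(P(y)\) does not depend on \(\theta\) it can be dropped whenever we only need the shape of the posterior.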
Let’s suppose we want to estimate an unknown parameter \(\theta\) which represents the 21st century rate of global sea-level rise.
Assume that a set of previous studies have estimated the rate of sea-level rise to be \(3 \pm .2 \text{ mm/yr}\) where \(.2\) represents one standard deviation.
We will assume a normal distribution for our prior to reflect this external information, so
\[\theta \sim N(3, .2^2)\]
Assume we have a new study which gives us a new rate data point: \(y = 3.7 \text{ mm/yr}\). Suppose we know that the standard deviation for this new rate is \(0.5 \text{ mm/yr}\).
We will assume a data model (likelihood) in which \(\theta\) is the expected value of \(y\): given \(\theta\), the value \(y\) is normally distributed with mean \(\theta\) and standard deviation .5, such that
\[y|\theta \sim N(\theta, .5^2)\]
\[\theta \sim N(3, .2^2)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.1 mm/yr and standard deviation is 0.19 mm/yr.
\[\theta \sim N(3, .4^2)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.27 mm/yr and standard deviation is 0.31 mm/yr.
\[\theta \sim N(3, .6^2)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.35 mm/yr and standard deviation is 0.35 mm/yr.
\[\theta \sim U(0, 10)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.7 mm/yr and standard deviation is 0.5 mm/yr.
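These posterior summaries follow from the normal-normal conjugate update. A minimal Python sketch (an illustration assumed here, not taken from the workshop materials) of the update behind these numbers:

```python
# Sketch of the conjugate normal-normal update (illustration only).
def normal_update(prior_mean, prior_sd, y, data_sd):
    """Posterior mean and sd for theta ~ N(prior_mean, prior_sd^2)
    after observing y | theta ~ N(theta, data_sd^2)."""
    prior_prec = 1 / prior_sd**2            # precision = 1 / variance
    data_prec = 1 / data_sd**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * y)
    return post_mean, post_var**0.5

# Prior N(3, .2^2) and one observation y = 3.7 with sd .5:
print(normal_update(3, 0.2, 3.7, 0.5))      # roughly (3.10, 0.19)
```

Running the same update with wider priors shows the pattern above: the less informative the prior, the more the posterior is pulled towards the data point, until with the flat \(U(0, 10)\) prior the posterior essentially matches the likelihood.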
There are two ways to get at a posterior: calculate it exactly using mathematics (analytically), or approximate it using simulation.
The exact approach is impractical once you move beyond a few parameters, so simulation is used by almost everybody.
This means that we create samples from the posterior distribution.
This typically relies on a set of algorithms known as Markov chain Monte Carlo (MCMC).
We often create thousands of posterior samples to represent the posterior distribution.
Having the posterior distribution enables you to get at exactly the quantities you are interested in.
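For intuition, here is a toy random-walk Metropolis sampler (a sketch, assuming the one-observation sea-level example above; not the workshop’s own code). All it needs is the unnormalised log posterior:

```python
import numpy as np

# Toy random-walk Metropolis for the sea-level example:
# prior theta ~ N(3, .2^2), one observation y = 3.7 with sd .5.
rng = np.random.default_rng(1)
y, data_sd, prior_mean, prior_sd = 3.7, 0.5, 3.0, 0.2

def log_post(theta):
    # log likelihood + log prior, up to an additive constant
    return (-0.5 * ((y - theta) / data_sd) ** 2
            - 0.5 * ((theta - prior_mean) / prior_sd) ** 2)

theta, samples = 3.0, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.3)                 # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                                  # accept the move
    samples.append(theta)                                 # otherwise keep the current value

samples = np.array(samples[2_000:])                       # discard burn-in
print(samples.mean(), samples.std())                      # roughly 3.1 and 0.19
```

In practice, software such as Stan or JAGS does this far more efficiently, but the end product is the same: thousands of posterior draws that we summarise with means, standard deviations, and quantiles.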
Armed with some background information about Bayesian methods and some data we want to analyse, we will now explore some regression models.
We will use the following general notation:
When it comes to specifying Bayesian models, we typically define a process model, a data model, and priors.
We now know:
Combining the priors and the likelihood gives us the posterior.
We can use simulation-based methods to get samples from the posterior.
The posterior tells us everything we need to know about what we are trying to estimate.
\(\tilde{y} = f(\theta,x)\), where \(f\) = a function, e.g., linear
\(y \sim \pi (y|\theta,\sigma,x)\), where \(\pi\) = probability distribution, e.g., Normal
Priors = Constraints
Now let’s assume \(y\) is a set of measurements of relative sea level (RSL) and that we are using age as an explanatory variable.
Process model
\(\tilde{y} = f(\theta,x = age)\), where f = linear
\[\tilde{y} = \alpha + \beta age\]
Data model
\(y \sim \pi (y|\theta,\sigma,x = age)\), where \(\pi\) = Normal
\[y|\theta, \sigma \sim N(\tilde{y},\sigma^2)\]
Priors = Constraints
We might assume we don’t know much about \(\alpha\) and \(\beta\)
\(\sigma\) must be positive
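Putting the three pieces above together, here is a hedged sketch of how this model could be coded, using Python and PyMC (the workshop’s own implementation may differ; the data below are simulated stand-ins, not the real RSL record):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated stand-in data (the real RSL/age data are not reproduced here)
rng = np.random.default_rng(2)
age = rng.uniform(0, 2, size=30)
rsl = -4 + 1.9 * age + rng.normal(0, 0.05, size=30)

with pm.Model():
    alpha = pm.Normal("alpha", mu=0, sigma=10)      # vague prior: not much known
    beta = pm.Normal("beta", mu=0, sigma=10)        # vague prior
    sigma = pm.HalfNormal("sigma", sigma=1)         # constrained to be positive
    y_tilde = alpha + beta * age                    # process model
    pm.Normal("y", mu=y_tilde, sigma=sigma, observed=rsl)   # data model
    idata = pm.sample()                             # MCMC samples of the posterior

print(az.summary(idata, var_names=["alpha", "beta", "sigma"]))
```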
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | -4.0769872 | -4.3666123 | -3.7898817 |
| beta | 1.9112899 | 1.7514559 | 2.0764976 |
| sigma | 0.0169143 | 0.0011484 | 0.0420047 |
These data actually have measurement uncertainty associated with age as well as with RSL.
The uncertainty is often represented by boxes.
Instead of \[\tilde{y} = f(\theta,age) = \alpha + \beta age\] What we want is \[\tilde{y} = f(\theta,\tilde{age}) = \alpha + \beta \tilde{age}\] Where \(\tilde{age}\) is the “true” age.
But we don’t know \(\tilde{age}\) 😿
We can assume that
\(x = \tilde{x} + error\), where error \(\sim\) Normal
\[age = \tilde{age} + error\] \[error \sim Normal(0, \sigma^2_{age})\] which is the same as
\[age \sim N(\tilde{age}, \sigma^2_{age})\]
Priors = Constraints
This is called an Errors-in-Variables model.
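A corresponding sketch of the errors-in-variables version (again in PyMC, again an assumption, with simulated stand-in data): the true ages become latent parameters, and the observed ages are modelled as noisy measurements of them.

```python
import numpy as np
import pymc as pm

# Simulated stand-in data with known age uncertainties (illustration only)
rng = np.random.default_rng(3)
n = 30
true_age = rng.uniform(0, 2, size=n)
age_sd = np.full(n, 0.05)                            # assumed known age errors
age_obs = true_age + rng.normal(0, age_sd)           # noisy observed ages
rsl = -4 + 1.9 * true_age + rng.normal(0, 0.02, size=n)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 10)
    beta = pm.Normal("beta", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    age_true = pm.Normal("age_true", mu=1, sigma=10, shape=n)   # vague prior on true ages
    # Measurement model: age ~ N(true age, sigma_age^2)
    pm.Normal("age_meas", mu=age_true, sigma=age_sd, observed=age_obs)
    # Process and data models now use the latent true ages
    pm.Normal("y", mu=alpha + beta * age_true, sigma=sigma, observed=rsl)
    idata = pm.sample()
```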
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | -4.0435842 | -4.3040220 | -3.7435280 |
| beta | 1.8929372 | 1.7252241 | 2.0391864 |
| sigma | 0.0164157 | 0.0012616 | 0.0426489 |
We’ll look at this web application to see the impact of accounting for vs not accounting for age errors in a simple linear regression model.
We can extend the Errors-in-Variables idea to be used with more complex process models.
Process model
\(\tilde{y} = f(\theta,\tilde{x})\), where f = change point model
\[\tilde{y} = \alpha + \beta_1 (\tilde{age} - cp) \hspace{2em} \text{for $\tilde{age}$ $<$ $cp$}\]
\[\tilde{y} = \alpha + \beta_2 (\tilde{age} - cp) \hspace{2em} \text{for $\tilde{age}$ $\geq$ $cp$}\]
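A sketch of the change point process model in the same PyMC style (an illustration with simulated data; the errors-in-variables layer is omitted here for brevity, so the observed ages stand in for \(\tilde{age}\)):

```python
import numpy as np
import pymc as pm

# Simulated stand-in data with a change in slope (illustration only)
rng = np.random.default_rng(4)
age = np.sort(rng.uniform(1.0, 2.0, size=40))
rsl = np.where(age < 1.87,
               -0.6 + 1.5 * (age - 1.87),
               -0.6 + 4.0 * (age - 1.87)) + rng.normal(0, 0.02, size=40)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 10)
    beta = pm.Normal("beta", 0, 10, shape=2)                 # slopes before/after cp
    cp = pm.Uniform("cp", lower=age.min(), upper=age.max())  # change point location
    sigma = pm.HalfNormal("sigma", 1)
    slope = pm.math.switch(age < cp, beta[0], beta[1])       # pick the relevant slope
    pm.Normal("y", mu=alpha + slope * (age - cp), sigma=sigma, observed=rsl)
    idata = pm.sample()
```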
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | -0.6006002 | -0.8107149 | -0.3775718 |
| beta[1] | 1.4692624 | 1.0578567 | 1.8249054 |
| beta[2] | 4.0727349 | 2.5725674 | 7.6454496 |
| cp | 1.8692600 | 1.7819783 | 1.9559629 |
| sigma | 0.0142591 | 0.0012358 | 0.0341196 |
Process model
\(\tilde{y} = f(\theta,\tilde{x})\), where f = Gaussian process model
Consider the multivariate normal distribution for some k-dimensional random vector \(\tilde{y}\) \[\tilde{y} \sim MVN(0, \Sigma)\] The covariance matrix \(\Sigma\) currently looks like this, i.e., the elements of \(\tilde{y}\) are currently assumed independent: \[\Sigma = \left[ \begin{array}{cccc} \sigma_g^2 & 0 & \ldots & 0 \\ 0 & \sigma_g^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_g^2 \end{array} \right]\]
A zero-mean Gaussian process is defined as follows:
\[\tilde{y} \sim GP(0,K)\]
\[K_{i,j} = \sigma_g^2\exp\bigg(-\rho^2(\tilde{x}_i-\tilde{x}_j)^2\bigg)\]
\(\sigma_g^2\) should be high for functions that cover a broad range on the y-axis
if \(\tilde{x}_i\) is distant from \(\tilde{x}_j\) then \(K_{i,j} \approx 0\)
the effect of \(\tilde{x}_i-\tilde{x}_j\) on \(K_{i,j}\) will depend on \(\rho\)
Some intuition for the correlation function.
In this case the correlation decay is “slow” and there is still some correlation even between points that are 2000 years apart.
Let’s assume that the data suggests that the decay should be much faster than this. This is what \(\rho\) controls. For example, larger values of \(\rho\) might result in a correlation function that looks like this:
So what this means in practice is that the GP parameters will control how “wiggly” the curves are:
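To make this concrete, a small numpy sketch (assumed scales, for illustration only) that builds the squared-exponential covariance and draws prior curves for two values of \(\rho\): the larger value gives faster correlation decay and hence wigglier curves.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 2, 100)                        # e.g. age, on an assumed scale

def sq_exp_cov(x, sigma_g, rho):
    """K_ij = sigma_g^2 * exp(-rho^2 * (x_i - x_j)^2)."""
    d = x[:, None] - x[None, :]
    return sigma_g**2 * np.exp(-(rho**2) * d**2)

for rho in (1.0, 5.0):
    K = sq_exp_cov(x, sigma_g=1.0, rho=rho) + 1e-8 * np.eye(len(x))   # jitter for stability
    curves = rng.multivariate_normal(np.zeros(len(x)), K, size=3)     # draws from the GP prior
    corr = np.exp(-(rho**2) * (x - x[0]) ** 2)                        # correlation with the first point
    lag = x[np.argmax(corr < 0.05)]
    print(f"rho = {rho}: decorrelation lag ~ {lag:.2f}, "
          f"mean |step| of prior draws ~ {np.abs(np.diff(curves)).mean():.3f}")
```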
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | 0.4662679 | -3.2557742 | 5.5385965 |
| phi | 1.2327465 | 0.3897824 | 2.8894180 |
| sigma | 0.0147301 | 0.0009978 | 0.0372906 |
| sigma_g | 3.1253644 | 0.6055551 | 8.3126435 |
Process model
\(\tilde{y} = f(\theta,\tilde{x})\), where f = integrated Gaussian process model
We assume a GP model for the rate of sea-level change
\[\omega \sim GP(0, K)\]
Then the sea-level process is obtained by integrating the rate process
\[\tilde{y} = \int_0^x \omega(u) du\]
The key idea in the integrated Gaussian process is that you get higher correlations between rates over shorter time lags.
The advantage of this model is you get an estimate of the rate process directly from the model.
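A rough numpy sketch of the integration idea (for intuition only; in the full model the integral is typically handled through the covariance of the integrated process rather than numerically): draw a rate curve \(\omega\) from a GP and accumulate it to obtain the sea-level curve.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 2, 200)                       # assumed age grid
dx = x[1] - x[0]

# Draw the rate process omega ~ GP(0, K) on the grid
d = x[:, None] - x[None, :]
K = 1.0 * np.exp(-(2.0**2) * d**2) + 1e-8 * np.eye(len(x))
omega = rng.multivariate_normal(np.zeros(len(x)), K)

# Sea level is the integral of the rate: y_tilde(x) = int_0^x omega(u) du
y_tilde = np.cumsum(omega) * dx                  # simple Riemann-sum approximation

print(y_tilde[-1])                               # total change over the grid
```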
| variable | mean | q5 | q95 |
|---|---|---|---|
| phi | 0.1626012 | 0.0344187 | 0.3518412 |
| sigma | 0.0146700 | 0.0013467 | 0.0340291 |
| sigma_g | 2.8305896 | 0.9060651 | 6.3122754 |