Dr Niamh Cahill (she/her)
Early Career Workshop on Bayesian Statistics
\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]
All quantities are divided up into data (i.e., things which have been observed) and parameters (i.e., things we want to estimate).
We use the Bayesian interpretation of the theorem to get the posterior probability distribution. The posterior tells us everything we need to know about what we are trying to estimate.
Bayes’ theorem can be written in words as:
\[\mbox{posterior is proportional to likelihood times prior}\] … or …
\[\mbox{posterior} \propto \mbox{likelihood} \times \mbox{prior}\]
Each of the three terms (posterior, likelihood, and prior) is a probability distribution (pdf).
In a Bayesian model, every item of interest is either data (which we will write as \(y\) or \(x\)) or parameters (which we will write as \(\theta\) or \(\sigma\)). Often the parameters are divided up into those of interest, and other nuisance parameters.
Given a set of observed data points \(y\) and a set of parameters \(\theta\), we write Bayes’ rule as
\[\underset{\text{posterior}}{P(\theta|y)} = \frac{\underset{\text{likelihood}}{P(y|\theta)}\underset{\text{prior}}{P(\theta)}}{\underset{\text{marginal likelihood}}{P(y)}}\] \(P(y)\) is a constant that is often difficult to calculate, so Bayes’ rule is often written more simply as a statement of proportionality:
\[\underset{\text{posterior}}{P(\theta|y)} \propto \underset{\text{likelihood}}{P(y|\theta)}\underset{\text{prior}}{P(\theta)}\]
\(P(y|\theta)\) is the probability distribution of the data given the parameters and is known as the likelihood. This summarises what we know about the data and is also known as the data model.
\(P(\theta)\) is the probability distribution of the parameters and is known as the prior. The prior represents what we know about the parameters before the data are observed.
\(P(\theta|y)\) is the probability distribution of the parameters given the data and is known as the posterior. The posterior represents our updated knowledge about parameters after the data are observed.
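Putting the three together, Bayes’ rule is just the joint distribution of \(\theta\) and \(y\) factorised in two ways:
\[P(\theta|y)P(y) = P(\theta, y) = P(y|\theta)P(\theta)\]
Dividing through by \(P(y)\) gives the rule above, and because \(P(y)\) does not depend on \(\theta\) it can be dropped whenever we only need the shape of the posterior.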
Let’s suppose we want to estimate an unknown parameter \(\theta\) which represents the 21st century rate of global sea-level rise.
Assume that a set of previous studies have estimated the rate of sea-level rise to be \(3 \pm .2 \text{ mm/yr}\) where \(.2\) represents one standard deviation.
We will assume a normal distribution for our prior to reflect this external information, so
\[\theta \sim N(3, .2^2)\]
Assume we have a new study which gives us a new rate data point: \(y = 3.7 \text{ mm/yr}\). Suppose we know that the standard deviation for this new rate is \(0.5 \text{ mm/yr}\).
We will assume a data model (likelihood) in which \(\theta\) is the expected value of \(y\): given \(\theta\), the value \(y\) is normally distributed with mean \(\theta\) and standard deviation .5, such that
\[y|\theta \sim N(\theta, .5^2)\]
\[\theta \sim N(3, .2^2)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.1 mm/yr and standard deviation is 0.19 mm/yr.
\[\theta \sim N(3, .4^2)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.27 mm/yr and standard deviation is 0.31 mm/yr.
\[\theta \sim N(3, .6^2)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.35 mm/yr and standard deviation is 0.35 mm/yr.
\[\theta \sim U(0, 10)\] \[y|\theta \sim N(\theta, .5^2)\]
Note: posterior mean for \(\theta|y\) is 3.7 mm/yr and standard deviation is 0.5 mm/yr.
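These posterior summaries follow from the normal-normal conjugate update. A minimal Python sketch (an illustration assumed here, not taken from the workshop materials) of the update behind these numbers:

```python
# Sketch of the conjugate normal-normal update (illustration only).
def normal_update(prior_mean, prior_sd, y, data_sd):
    """Posterior mean and sd for theta ~ N(prior_mean, prior_sd^2)
    after observing y | theta ~ N(theta, data_sd^2)."""
    prior_prec = 1 / prior_sd**2            # precision = 1 / variance
    data_prec = 1 / data_sd**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * y)
    return post_mean, post_var**0.5

# Prior N(3, .2^2) and one observation y = 3.7 with sd .5:
print(normal_update(3, 0.2, 3.7, 0.5))      # roughly (3.10, 0.19)
```

Running the same update with wider priors shows the pattern above: the less informative the prior, the more the posterior is pulled towards the data point, until with the flat \(U(0, 10)\) prior the posterior essentially matches the likelihood.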
There are two ways to get at a posterior: calculate it exactly using mathematics (analytically), or approximate it using simulation.
The exact approach is impractical once you move beyond a few parameters, so simulation is used by almost everybody.
This means that we create samples from the posterior distribution.
This typically relies on a set of algorithms known as Markov chain Monte Carlo (MCMC).
We often create thousands of posterior samples to represent the posterior distribution.
Having the posterior distribution enables you to get at exactly the quantities you are interested in.
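For intuition, here is a toy random-walk Metropolis sampler (a sketch, assuming the one-observation sea-level example above; not the workshop’s own code). All it needs is the unnormalised log posterior:

```python
import numpy as np

# Toy random-walk Metropolis for the sea-level example:
# prior theta ~ N(3, .2^2), one observation y = 3.7 with sd .5.
rng = np.random.default_rng(1)
y, data_sd, prior_mean, prior_sd = 3.7, 0.5, 3.0, 0.2

def log_post(theta):
    # log likelihood + log prior, up to an additive constant
    return (-0.5 * ((y - theta) / data_sd) ** 2
            - 0.5 * ((theta - prior_mean) / prior_sd) ** 2)

theta, samples = 3.0, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.3)                 # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                                  # accept the move
    samples.append(theta)                                 # otherwise keep the current value

samples = np.array(samples[2_000:])                       # discard burn-in
print(samples.mean(), samples.std())                      # roughly 3.1 and 0.19
```

In practice, software such as Stan or JAGS does this far more efficiently, but the end product is the same: thousands of posterior draws that we summarise with means, standard deviations, and quantiles.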
Armed with some background information about Bayesian methods and some data we want to analyse, we will now explore some regression models.
We will use the following general notation:
When it comes to specifying Bayesian models, we typically define a process model, a data model, and priors.
We now know:
Combining the priors and the likelihood gives us the posterior.
We can use simulation-based methods to get samples from the posterior.
The posterior tells us everything we need to know about what we are trying to estimate.
\(\tilde{y} = f(\theta,x)\), where \(f\) = a function, e.g., linear
\(y \sim \pi (y|\theta,\sigma,x)\), where \(\pi\) = probability distribution, e.g., Normal
Priors = Constraints
Now let’s assume \(y\) is a set of measurements of relative sea level (RSL) and that we are using age as an explanatory variable.
Process model
\(\tilde{y} = f(\theta,x = age)\), where f = linear
\[\tilde{y} = \alpha + \beta age\]
Data model
\(y \sim \pi (y|\theta,\sigma,x = age)\), where \(\pi\) = Normal
\[y|\theta, \sigma \sim N(\tilde{y},\sigma^2)\]
Priors = Constraints
We might assume we don’t know much about \(\alpha\) and \(\beta\)
\(\sigma\) must be positive
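Putting the three pieces above together, here is a hedged sketch of how this model could be coded, using Python and PyMC (the workshop’s own implementation may differ; the data below are simulated stand-ins, not the real RSL record):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated stand-in data (the real RSL/age data are not reproduced here)
rng = np.random.default_rng(2)
age = rng.uniform(0, 2, size=30)
rsl = -4 + 1.9 * age + rng.normal(0, 0.05, size=30)

with pm.Model():
    alpha = pm.Normal("alpha", mu=0, sigma=10)      # vague prior: not much known
    beta = pm.Normal("beta", mu=0, sigma=10)        # vague prior
    sigma = pm.HalfNormal("sigma", sigma=1)         # constrained to be positive
    y_tilde = alpha + beta * age                    # process model
    pm.Normal("y", mu=y_tilde, sigma=sigma, observed=rsl)   # data model
    idata = pm.sample()                             # MCMC samples of the posterior

print(az.summary(idata, var_names=["alpha", "beta", "sigma"]))
```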
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | -4.0769872 | -4.3666123 | -3.7898817 |
| beta | 1.9112899 | 1.7514559 | 2.0764976 |
| sigma | 0.0169143 | 0.0011484 | 0.0420047 |
These data actually have measurement uncertainty associated with age as well as with RSL.
The uncertainty is often represented by boxes.
Instead of \[\tilde{y} = f(\theta,age) = \alpha + \beta age\] What we want is \[\tilde{y} = f(\theta,\tilde{age}) = \alpha + \beta \tilde{age}\] Where \(\tilde{age}\) is the “true” age.
But we don’t know \(\tilde{age}\) 😿
We can assume that
\(x = \tilde{x} + error\), where error \(\sim\) Normal
\[age = \tilde{age} + error\] \[error \sim Normal(0, \sigma^2_{age})\] which is the same as
\[age \sim N(\tilde{age}, \sigma^2_{age})\]
Priors = Constraints
This is called an Errors-in-Variables model.
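A corresponding sketch of the errors-in-variables version (again in PyMC, again an assumption, with simulated stand-in data): the true ages become latent parameters, and the observed ages are modelled as noisy measurements of them.

```python
import numpy as np
import pymc as pm

# Simulated stand-in data with known age uncertainties (illustration only)
rng = np.random.default_rng(3)
n = 30
true_age = rng.uniform(0, 2, size=n)
age_sd = np.full(n, 0.05)                            # assumed known age errors
age_obs = true_age + rng.normal(0, age_sd)           # noisy observed ages
rsl = -4 + 1.9 * true_age + rng.normal(0, 0.02, size=n)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 10)
    beta = pm.Normal("beta", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    age_true = pm.Normal("age_true", mu=1, sigma=10, shape=n)   # vague prior on true ages
    # Measurement model: age ~ N(true age, sigma_age^2)
    pm.Normal("age_meas", mu=age_true, sigma=age_sd, observed=age_obs)
    # Process and data models now use the latent true ages
    pm.Normal("y", mu=alpha + beta * age_true, sigma=sigma, observed=rsl)
    idata = pm.sample()
```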
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | -4.0435842 | -4.3040220 | -3.7435280 |
| beta | 1.8929372 | 1.7252241 | 2.0391864 |
| sigma | 0.0164157 | 0.0012616 | 0.0426489 |
We’ll look at this web application to see the impact of accounting for vs not accounting for age errors in a simple linear regression model.
We can extend the Errors-in-Variables idea to be used with more complex process models.
Process model
\(\tilde{y} = f(\theta,\tilde{x})\), where f = change point model
\[\tilde{y} = \alpha + \beta_1 (\tilde{age} - cp) \hspace{2em} \text{for $\tilde{age}$ $<$ $cp$}\]
\[\tilde{y} = \alpha + \beta_2 (\tilde{age} - cp) \hspace{2em} \text{for $\tilde{age}$ $\geq$ $cp$}\]
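A sketch of the change point process model in the same PyMC style (an illustration with simulated data; the errors-in-variables layer is omitted here for brevity, so the observed ages stand in for \(\tilde{age}\)):

```python
import numpy as np
import pymc as pm

# Simulated stand-in data with a change in slope (illustration only)
rng = np.random.default_rng(4)
age = np.sort(rng.uniform(1.0, 2.0, size=40))
rsl = np.where(age < 1.87,
               -0.6 + 1.5 * (age - 1.87),
               -0.6 + 4.0 * (age - 1.87)) + rng.normal(0, 0.02, size=40)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 10)
    beta = pm.Normal("beta", 0, 10, shape=2)                 # slopes before/after cp
    cp = pm.Uniform("cp", lower=age.min(), upper=age.max())  # change point location
    sigma = pm.HalfNormal("sigma", 1)
    slope = pm.math.switch(age < cp, beta[0], beta[1])       # pick the relevant slope
    pm.Normal("y", mu=alpha + slope * (age - cp), sigma=sigma, observed=rsl)
    idata = pm.sample()
```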
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | -0.6006002 | -0.8107149 | -0.3775718 |
| beta[1] | 1.4692624 | 1.0578567 | 1.8249054 |
| beta[2] | 4.0727349 | 2.5725674 | 7.6454496 |
| cp | 1.8692600 | 1.7819783 | 1.9559629 |
| sigma | 0.0142591 | 0.0012358 | 0.0341196 |
Process model
\(\tilde{y} = f(\theta,\tilde{x})\), where f = Gaussian process model
Consider the multivariate normal distribution for some k-dimensional random vector \(\tilde{y}\) \[\tilde{y} \sim MVN(0, \Sigma)\] The covariance matrix \(\Sigma\) currently looks like this, i.e., the elements of \(\tilde{y}\) are currently assumed independent: \[\Sigma = \left[ \begin{array}{cccc} \sigma_g^2 & 0 & \ldots & 0 \\ 0 & \sigma_g^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_g^2 \end{array} \right]\]
A zero-mean Gaussian process is defined as follows:
\[\tilde{y} \sim GP(0,K)\]
\[K_{i,j} = \sigma_g^2\exp\bigg(-\rho^2(\tilde{x}_i-\tilde{x}_j)^2\bigg)\]
\(\sigma_g^2\) should be high for functions that cover a broad range on the y-axis
if \(\tilde{x}_i\) is distant from \(\tilde{x}_j\) then \(K_{i,j} \approx 0\)
the effect of \(\tilde{x}_i-\tilde{x}_j\) on \(K_{i,j}\) will depend on \(\rho\)
Some intuition for the correlation function.
In this case the correlation decay is “slow” and there is still some correlation even between points that are 2000 years apart.
Let’s assume that the data suggests that the decay should be much faster than this. This is what \(\rho\) controls. For example, larger values of \(\rho\) might result in a correlation function that looks like this:
So what this means in practice is that the GP parameters will control how “wiggly” the curves are:
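To make this concrete, a small numpy sketch (assumed scales, for illustration only) that builds the squared-exponential covariance and draws prior curves for two values of \(\rho\): the larger value gives faster correlation decay and hence wigglier curves.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 2, 100)                        # e.g. age, on an assumed scale

def sq_exp_cov(x, sigma_g, rho):
    """K_ij = sigma_g^2 * exp(-rho^2 * (x_i - x_j)^2)."""
    d = x[:, None] - x[None, :]
    return sigma_g**2 * np.exp(-(rho**2) * d**2)

for rho in (1.0, 5.0):
    K = sq_exp_cov(x, sigma_g=1.0, rho=rho) + 1e-8 * np.eye(len(x))   # jitter for stability
    curves = rng.multivariate_normal(np.zeros(len(x)), K, size=3)     # draws from the GP prior
    corr = np.exp(-(rho**2) * (x - x[0]) ** 2)                        # correlation with the first point
    lag = x[np.argmax(corr < 0.05)]
    print(f"rho = {rho}: decorrelation lag ~ {lag:.2f}, "
          f"mean |step| of prior draws ~ {np.abs(np.diff(curves)).mean():.3f}")
```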
| variable | mean | q5 | q95 |
|---|---|---|---|
| alpha | 0.4662679 | -3.2557742 | 5.5385965 |
| phi | 1.2327465 | 0.3897824 | 2.8894180 |
| sigma | 0.0147301 | 0.0009978 | 0.0372906 |
| sigma_g | 3.1253644 | 0.6055551 | 8.3126435 |
Process model
\(\tilde{y} = f(\theta,\tilde{x})\), where f = integrated Gaussian process model
We assume a GP model for the rate of sea-level change
\[\omega \sim GP(0, K)\]
Then the sea-level process is obtained by integrating the rate process
\[\tilde{y} = \int_0^x \omega(u) du\]
The key idea in the integrated Gaussian process is that you get higher correlations between rates over shorter time lags.
The advantage of this model is you get an estimate of the rate process directly from the model.
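A rough numpy sketch of the integration idea (for intuition only; in the full model the integral is typically handled through the covariance of the integrated process rather than numerically): draw a rate curve \(\omega\) from a GP and accumulate it to obtain the sea-level curve.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 2, 200)                       # assumed age grid
dx = x[1] - x[0]

# Draw the rate process omega ~ GP(0, K) on the grid
d = x[:, None] - x[None, :]
K = 1.0 * np.exp(-(2.0**2) * d**2) + 1e-8 * np.eye(len(x))
omega = rng.multivariate_normal(np.zeros(len(x)), K)

# Sea level is the integral of the rate: y_tilde(x) = int_0^x omega(u) du
y_tilde = np.cumsum(omega) * dx                  # simple Riemann-sum approximation

print(y_tilde[-1])                               # total change over the grid
```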
| variable | mean | q5 | q95 |
|---|---|---|---|
| phi | 0.1626012 | 0.0344187 | 0.3518412 |
| sigma | 0.0146700 | 0.0013467 | 0.0340291 |
| sigma_g | 2.8305896 | 0.9060651 | 6.3122754 |