Introduction to Bayesian analysis

Jim Savage

Introductions

  • Breaks
  • Structure
  • Expectations

Very quick outline of course

  • Why do Bayesian analysis?
  • Introduction to generative models
  • Some useful distributions
  • The linear model and (maximum) likelihood
  • Priors, likelihood, and posterior

Very quick outline of course

  • Introduction to Stan
  • A Bayesian linear model
  • Varying-intercept varying-slope models
  • Modelling latent data 1: Time-varying parameter model
  • Modelling latent data 2: Nowcasting

Why do Bayesian data analysis?

Dialectic

Why do Bayesian data analysis?

  • A natural way of formulating scientific learning in our data analysis
  • We always have information (priors) before we perform a piece of analysis
  • We should use this information

Why do Bayesian data analysis?

p-value

Why do Bayesian data analysis?

What even is a p-value?

Under the assumption that the true parameter of the model is 0 (and the model is correctly specified, and the observations are unbiased), the p-value tells us the probability of obtaining an estimate at least as extreme as the one observed.

Why do Bayesian data analysis?

What even is a p-value?

  • Says nothing about the distribution of plausible values of your parameter
  • Says nothing about the practical significance of your parameter
  • Is basically a measure of sample size

Why do Bayesian data analysis?

  • What we'd like to be able to say is “I'm 95% confident that the parameter is greater than/less than 0”
  • Bayesian analysis allows this interpretation

Why do Bayesian data analysis?

Model uncertainty

[Figure: spaghetti plot]

Why do Bayesian data analysis?

  • When we make predictions, we want to include two sources of uncertainty:
    • How wrong is our model on average if we believe it is the correct model?
    • How uncertain are we that our model is correct?
  • Bayesian analysis includes both these sources of uncertainty. Frequentists typically only include the first.

What we need to learn

Bayesian statistics is difficult in the sense that thinking is difficult. - Donald Berry, Duke

  • Thinking in terms of a generative model. This means assigning prior distributions to all unknowns (parameters) (conceptually difficult at first, assigning priors always takes work)
    • If you're used to turn-key model estimation, this will be the hardest step, but most valuable.
  • Writing out the model to be estimated (learning Stan) (easiest bit, still more difficult than what you're doing now)
  • Checking to see whether the estimated model is consistent with the data (posterior predictive checking) (easy)
  • Using the output of models (difficult)

Generative models

Definition

  • In mathematical terms: a model of the joint probability of the data and parameters, \( p(y, \theta | x) \)
  • In layman's terms: a model which allows us to generate new (fake) “observations”

Exercise

  • Generate a data frame with some random series: one uniformly distributed, one normally distributed, and one binary.
  • Create some parameters
  • Create some “outcome” data (a sketch in R follows below)
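
A minimal sketch of this exercise in R (the variable names x1, x2, x3, beta0 to beta3, sigma, and y are just illustrative choices):

set.seed(42)
N <- 500

# A data frame of random series: one uniform, one normal, one binary
df <- data.frame(x1 = runif(N),
                 x2 = rnorm(N),
                 x3 = rbinom(N, 1, 0.5))

# Some "true" parameters
beta0 <- 1; beta1 <- 2; beta2 <- -1; beta3 <- 0.5
sigma <- 1.5

# Generate the "outcome" data from the parameters and predictors
df$y <- beta0 + beta1 * df$x1 + beta2 * df$x2 + beta3 * df$x3 +
  rnorm(N, 0, sigma)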

Generative models

  • Now let's estimate the parameters of this model using R's lm function
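
For example, a minimal sketch, assuming the simulated data frame from the previous exercise is called df with columns x1, x2, x3, and y:

# Recover the "true" parameters from the simulated data
fit <- lm(y ~ x1 + x2 + x3, data = df)
summary(fit)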

Generative models

  • Notice that we can write a generative model in probability form.

\[ y_{i} = \beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + \epsilon_{i} \] with \( \epsilon_{i} \sim\mathcal{N}(0,\sigma) \)

is the same as writing

\[ y_{i} \sim \mathcal{N}(\beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i}, \sigma) \]

\[ y_{i} \sim \mathcal{N}(\mbox{Conditional mean}, \mbox{Potentially conditional volatility}) \]

Generative models

Note that the above model is the joint distribution of the data and parameters only if we think the parameters are fixed and known. In reality, we are uncertain about the value of the \( \beta \)s

So our generative model typically takes the form \( p(y_{i}, \beta) \)

Some distributions - the Normal

  • Normal distribution \( \mathcal{N}(\mu, \sigma) \) with \( \mu \) being the mean and \( \sigma \) being the standard deviation
  • Often used as a default distribution by frequentists.
  • Criticised for “thin tails”

In R:

# Random number generation of n draws
rnorm(n, mean, sd)

# Height of the density at point x
dnorm(x, mean, sd)

# Cumulative probability function up to point q
pnorm(q, mean, sd)

Some distributions - Student t

  • Student t distribution \( t(df, \mu) \) with \( df \) being the degrees of freedom parameter and \( \mu \) being the location (the mean when \( df > 1 \)). By default the location is 0 and the scale is 1.
  • Similar to the normal distribution, but can have fat tails (give probability to extreme events)

In R:

# Random number generation (base R's t functions take only the degrees
# of freedom; the distribution is centred at 0 with scale 1)
rt(n, df)

# Height of the density at point x
dt(x, df)

# Cumulative probability function up to point q
pt(q, df)

Some distributions - Inverse Gamma

  • Very flexible distribution often used for standard deviations
  • Strictly positive, right skewed

    In R:

    install.packages("MCMCpack")
    library(MCMCpack)
    
    # Random number generation
    rinvgamma(n, shape, scale)
    
    # Height of the density at point x
    dinvgamma(x, shape, scale)
    

Some exercises

  • Generate some random numbers from each of these distributions and plot them using plot(density(x)), where x is your vector of random numbers (a sketch follows below)
  • Do this several times for each distribution with different values of the parameters
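
A minimal sketch of the first exercise (the parameter values here are arbitrary; change them and re-plot):

# Normal
x <- rnorm(1000, mean = 5, sd = 2)
plot(density(x))

# Student t (fat tails relative to the normal)
x <- rt(1000, df = 3)
plot(density(x))

# Inverse gamma (strictly positive, right skewed)
library(MCMCpack)
x <- rinvgamma(1000, shape = 2, scale = 1)
plot(density(x))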

Multivariate distributions

  • So far we have played with distributions where random variables are independent of one another.
  • We want to learn how to incorporate inter-dependency

Multivariate distributions: Correlation matrix

Correlation is a measure between -1 and 1, where 0 implies no linear relationship between two variables and -1 or 1 implies a perfect (negative or positive) linear relationship.

A Correlation matrix contains the two-way correlations between all variables in the system.

           V1        V2        V3
V1  1.0000000 -0.482711 0.1231727
V2 -0.4827110  1.000000 0.4823770
V3  0.1231727  0.482377 1.0000000

Multivariate distributions: Correlation matrix

[Figure: plot of the correlated series]

Multivariate distributions: Scale matrices

  • The correlation matrix summarises the degree to which random variables are linearly related.
  • It does not tell us anything about the standard deviation of each of those variables.
  • We need to introduce a scale variable.

A univariate distribution's scale is determined by the \( \sigma \) value. Each variable in a multivariate distribution also has its own scale variable. We collect them together in a vector called \( \tau \).

\[ \tau = [\sigma_{1}, \sigma_{2}, \dots, \sigma_{P}]' \]

Multivariate distributions: Covariance matrix

Covariance matrices combine the information from a scale vector with the correlation matrix. We typically use \( \Sigma \) to denote a covariance matrix and \( \Omega \) for a correlation matrix.

\[ \Sigma = \mbox{diag}(\tau)\Omega\mbox{diag}(\tau) \]

Where the diag operator places the values in \( \tau \) along the diagonal elements of a matrix (with zeroes everywhere else)

Multivariate distributions: The multivariate normal distribution

  • Each individual series is normally distributed with standard deviation \( \sigma_{i} \)
  • The correlation between all series is given by \( \Omega \)
  • Each series has its own average value, \( \mu_{i} \). These are collected together in a vector \( \mu = [\mu_{1}, \dots, \mu_{P}] \)

Multivariate distributions: The multivariate normal distribution

In R:

library(MASS)
n <- 1000  # number of draws
# Define your correlation matrix
cormat <- matrix(c(1, -0.5, 0.1, -0.5, 1, 0.5, 0.1, 0.5, 1), 3, 3)
# Define your scale vector
scalevec <- c(4, 2, 1)
# Create your covariance matrix
Sigma <- diag(scalevec) %*% cormat %*% diag(scalevec)
# Create the mean vector
means <- c(3, 7, 2)
# Draw correlated normal observations
xs <- as.data.frame(mvrnorm(n = n, means, Sigma))
cor(xs); plot(xs)

The LKJ distribution

  • A one-parameter \( [1, \infty) \) distribution over correlation matrices
  • When its parameter is 1, the distribution is uniform over correlation matrices, so strong correlations between variables are entirely plausible
  • When its parameter is large, it collapses on an identity matrix (no correlation between variables)

  • Not yet available in R, but available in Stan

Back to generative models

  • Let's simulate a real generative model
    • \( \theta\sim \mathcal{MVN}([-2, 3, 1, 5]', \Sigma) \) (you can decide \( \Sigma \))
    • \( \sigma \sim \mathcal{IG}(1, 1) \)
    • Use your data from before
    • Simulate new values of \( y2_{i} \)
  • What does the distribution of the new \( y2 \) look like relative to \( y \)? (A sketch follows below.)
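
A minimal sketch of one way to do this, assuming the simulated data frame df from earlier; the choice of \( \Sigma \) here is arbitrary:

library(MASS); library(MCMCpack)

# Draw the parameters from their distributions
Sigma_theta <- diag(c(1, 0.5, 0.5, 0.5))          # an arbitrary choice of Sigma
theta <- mvrnorm(1, c(-2, 3, 1, 5), Sigma_theta)  # an intercept and three slopes
sigma <- rinvgamma(1, 1, 1)

# Simulate new outcomes from the generative model
X <- cbind(1, df$x1, df$x2, df$x3)
y2 <- as.numeric(X %*% theta) + rnorm(nrow(df), 0, sigma)

# Compare the distribution of y2 with the original y
plot(density(df$y)); lines(density(y2), col = "red")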

The linear model

  • There's not really a huge point in estimating this model the Bayesian way, but it's a good learning tool.

\[ y_{i} = \beta_{0} + \beta_{1}x_{1,i} + \dots + \beta_{P}x_{P,i} + \epsilon_{i} \]

\[ y = X\beta + \epsilon \]

with \( \epsilon \sim \mathcal{N}(0, \sigma) \)

What is likelihood?

  • In summary,

    • Treating \( X \) as fixed
    • Given a (guess of a) set of \( \beta \) s
    • A single number that is greater when the observations of \( y_{i} \) are more probable under the model.
    • \( \prod_{i}f(y_{i} | \beta, X_{i}) \) where \( f() \) is the density of the model at point \( y_{i} \)
    • In R, the density is given by dnorm, dt, dinvgamma etc.
  • Example in R: a two-parameter likelihood (see the sketch below).
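
A minimal sketch of what such an example might look like: the likelihood of some data under a normal model with two unknown parameters (the names mu and sigma are illustrative):

# Some observed data
y <- rnorm(100, mean = 2, sd = 1)

# The likelihood: the product of the density at each observation,
# for a given guess of (mu, sigma)
likelihood <- function(mu, sigma, y) {
  prod(dnorm(y, mean = mu, sd = sigma))
}

likelihood(2, 1, y)  # a guess near the truth
likelihood(0, 1, y)  # a worse guess gives a (much) smaller likelihood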

Maximum likelihood

  • Work with logs! Multiplying many very small numbers together runs into numerical issues.
  • We want to pick the parameters that maximise the log likelihood.
  • Most optimisation algorithms minimise something, so we typically minimise the negative log likelihood (see the sketch below).
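
A minimal sketch in R, continuing the two-parameter normal example above (the function name neg_log_lik is illustrative):

# Negative log likelihood of a normal model with parameters par = c(mu, sigma)
neg_log_lik <- function(par, y) {
  -sum(dnorm(y, mean = par[1], sd = par[2], log = TRUE))
}

# Minimise the negative log likelihood, keeping sigma positive
mle <- optim(par = c(0, 1), fn = neg_log_lik, y = y,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
mle$par  # should be close to the mean and sd used to simulate y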

How do these work?

Maximum likelihood

[Figure: marble in a bowl]

Likelihood of the linear model

Let's take our linear model from before:

\[ y = X\beta + \epsilon \]

With \( \epsilon \sim \mathcal{N}(0, \sigma) \)

As before, we can express this in probabilistic notation:

\[ y \sim\mathcal{N}(X\beta, \sigma) \]
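
A minimal sketch of evaluating this likelihood (on the log scale) for a given guess of \( \beta \) and \( \sigma \), assuming the simulated data frame df from earlier; the guesses are illustrative:

X <- cbind(1, df$x1, df$x2, df$x3)  # design matrix with an intercept column
beta_guess <- c(1, 2, -1, 0.5)
sigma_guess <- 1.5

# Log likelihood of the linear model at this guess
sum(dnorm(df$y, mean = as.numeric(X %*% beta_guess), sd = sigma_guess, log = TRUE))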

What's wrong with maximum likelihood?

  • It's a wonderful method
  • Gives us the parameters that make the most sense of the data
  • We want it the other way around: we want to develop a probabilistic understanding of the parameters given the data.
  • Does not allow for probabilistic uncertainty about parameters.

Bayes' Rule

\[ p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)} \]

Because the denominator \( p(y) \) does not depend on the parameters \( \theta \), we typically write this out in proportional notation:

\[ p(\theta | y) \propto p(y | \theta)p(\theta) \]

That is, if we want to make inference about the posterior \( p(\theta | y) \), we need to have the likelihood \( p(y | \theta) \), and a prior distribution for the parameters of the model \( p(\theta) \).

Bayes' Rule

[Figure: Thomas Bayes]

Bayes' Rule and Monte Carlo theory

Again: We want to make inference (want to know the mean, sd, quantiles etc.) about the posterior. Unfortunately, for all but a few simple models, the right-hand side \( p(y | \theta)p(\theta) \) is not analytically solvable.

That's OK!

Monte Carlo

  • To make inference about \( p(\theta|y) \), I don't need the analytical form. As long as I can generate many random draws from the distribution, I can then calculate the statistics for these draws.
  • For instance, let's say I want the average of \( \theta_{1} \). If I have 500 draws of \( \theta_{1} \) from the posterior, I can simply calculate mean(theta[,1]) (see the sketch below)
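
A minimal sketch of the idea, using draws from a known distribution to stand in for posterior draws (in a real analysis the matrix theta would come from the sampler):

# Pretend these are 500 posterior draws of two parameters
theta <- cbind(rnorm(500, mean = 1, sd = 0.3),
               rnorm(500, mean = -2, sd = 0.5))

# Inference is just summarising the draws
mean(theta[, 1])
sd(theta[, 1])
quantile(theta[, 1], c(0.025, 0.975))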

Markov chains

  • We want to generate draws from the posterior
  • We don't know the functional form of the posterior
  • BUT we do know that
    • If we generate each draw \( \theta^{(i+1)} \) from a distribution that depends only on the previous draw \( \theta^{(i)} \) and the data \( y \), and
    • The distribution of the draws does not change as we progress forward (no change in the mean or variance of the draws), then
    • We must have discovered an invariant distribution
    • This is possibly our posterior (though we should do some more checks, often by running multiple chains and making sure they “mix”)

An introduction to Stan

  • Amazing language
  • We write the model, not the sampler
  • It translates our code into a C++ implementation of Hamiltonian Monte Carlo (an algorithm to generate draws from the posterior), compiles it, and simulates.
  • Also allows us to write functions to export into R
    • In my tests, they can be around 1000 times as fast as native R functions (especially when you have nested loops)

An introduction to Stan

Code blocks:

  • functions — any user-defined functions that you want to use inside or outside Stan
  • data — the data inputs, including the values of any prior hyperparameters
  • parameters — defining the parameters that you want to estimate
  • transformed parameters — any transformations of the parameters, such as generating covariance matrices from scale and correlation matrices.
  • model — where you define the priors and likelihood
  • generated quantities — where you define any output from the model, including predictions or values of a loss function.

An introduction to Stan

  • Exercise: implementing a linear model
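
A minimal sketch of what a solution might look like: an R script that passes a Stan model string to rstan (the priors are just one reasonable choice, and the data are assumed to be the simulated df from earlier):

library(rstan)

model_code <- "
data {
  int N;            // number of observations
  int P;            // number of predictors
  matrix[N, P] X;   // predictor matrix
  vector[N] y;      // outcome
}
parameters {
  real alpha;
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  // Priors
  alpha ~ normal(0, 10);
  beta ~ normal(0, 5);
  sigma ~ cauchy(0, 2.5);
  // Likelihood
  y ~ normal(alpha + X * beta, sigma);
}
"

X <- cbind(df$x1, df$x2, df$x3)
fit <- stan(model_code = model_code,
            data = list(N = nrow(X), P = ncol(X), X = X, y = df$y))
print(fit)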