Introduction to Bayesian analysis

Jim Savage

Introductions

  • Breaks
  • Structure
  • Expectations

Very quick outline of course

  • Why do Bayesian analysis?
  • Introduction to generative models
  • Some useful distributions
  • The linear model and (maximum) likelihood
  • Priors, likelihood, and posterior

Very quick outline of course

  • Introduction to Stan
  • A Bayesian linear model
  • Varying-intercept varying-slope models
  • Modelling latent data 1: Time-varying parameter model
  • Modelling latent data 2: Nowcasting

Why do Bayesian data analysis?

Dialectic

Why do Bayesian data analysis?

  • A natural way of formulating scientific learning in our data analysis
  • We always have information (priors) before we perform a piece of analysis
  • We should use this information

Why do Bayesian data analysis?

p-value

Why do Bayesian data analysis?

What even is a p-value?

Under the assumption that the true parameter of the model is 0 (and the model is correctly specified, and the observations are unbiased), the p-value tells us the probability of obtaining an estimate at least as extreme as the one observed.

Why do Bayesian data analysis?

What even is a p-value?

  • Says nothing about the distribution of plausible values of your parameter
  • Says nothing about the practical significance of your parameter
  • Is basically a measure of sample size

Why do Bayesian data analysis?

  • What we'd like to be able to say is “I'm 95% confident that the parameter is greater than/less than 0”
  • Bayesian analysis allows this interpretation

Why do Bayesian data analysis?

Model uncertainty

[Figure: spaghetti plot]

Why do Bayesian data analysis?

  • When we make predictions, we want to include two sources of uncertainty:
    • How wrong is our model on average if we believe it is the correct model?
    • How uncertain are we that our model is correct?
  • Bayesian analysis includes both these sources of uncertainty. Frequentists typically only include the first.

What we need to learn

Bayesian statistics is difficult in the sense that thinking is difficult. - Donald Berry, Duke

  • Thinking in terms of a generative model. This means assigning prior distributions to all unknowns (parameters) (conceptually difficult at first, assigning priors always takes work)
    • If you're used to turn-key model estimation, this will be the hardest step, but most valuable.
  • Writing out the model to be estimated (learning Stan) (easiest bit, still more difficult than what you're doing now)
  • Checking to see whether the estimated model is consistent with the data (posterior predictive checking) (easy)
  • Using the output of models (difficult)

Generative models

Definition

  • In mathematical terms: a model of the joint probability of the data and parameters, \( p(y, \theta | x) \)
  • In layman's terms: a model which allows us to generate new (fake) “observations”

Exercise

  • Generate a data frame with some random series: one uniformly distributed, one normally distributed, and one binary.
  • Create some parameters
  • Create some “outcome” data (a sketch in R follows below)
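
A minimal sketch of this exercise in R (the variable names x1, x2, x3, beta0 to beta3, sigma, and y are just illustrative choices):

set.seed(42)
N <- 500

# A data frame of random series: one uniform, one normal, one binary
df <- data.frame(x1 = runif(N),
                 x2 = rnorm(N),
                 x3 = rbinom(N, 1, 0.5))

# Some "true" parameters
beta0 <- 1; beta1 <- 2; beta2 <- -1; beta3 <- 0.5
sigma <- 1.5

# Generate the "outcome" data from the parameters and predictors
df$y <- beta0 + beta1 * df$x1 + beta2 * df$x2 + beta3 * df$x3 +
  rnorm(N, 0, sigma)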

Generative models

  • Now let's estimate the parameters of this model using R's lm function
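
For example, a minimal sketch, assuming the simulated data frame from the previous exercise is called df with columns x1, x2, x3, and y:

# Recover the "true" parameters from the simulated data
fit <- lm(y ~ x1 + x2 + x3, data = df)
summary(fit)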

Generative models

  • Notice that we can write a generative model in probability form.

\[ y_{i} = \beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + \epsilon_{i} \] with \( \epsilon_{i} \sim\mathcal{N}(0,\sigma) \)

is the same as writing

\[ y_{i} \sim \mathcal{N}(\beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i}, \sigma) \]

\[ y_{i} \sim \mathcal{N}(\mbox{Conditional mean}, \mbox{Potentially conditional volatility}) \]

Generative models

Note that the above model is the joint distribution of the data and parameters only if we think the parameters are fixed and known. In reality, we are uncertain about the value of the \( \beta \)s

So our generative model typically takes the form \( p(y_{i}, \beta) \)

Some distributions - the Normal

  • Normal distribution \( \mathcal{N}(\mu, \sigma) \) with \( \mu \) being the mean and \( \sigma \) being the standard deviation
  • Often used as a default distribution by frequentists.
  • Criticised for “thin tails”

In R:

# Random number generation of n draws
rnorm(n, mean, sd)

# Height of the density at point x
dnorm(x, mean, sd)

# Cumulative probability function up to point q
pnorm(q, mean, sd)

Some distributions - Student t

  • Student t distribution \( t(df, \mu) \) with \( df \) being the degrees of freedom parameter and \( \mu \) being the location (the mean when \( df > 1 \)). By default the location is 0 and the scale is 1.
  • Similar to the normal distribution, but can have fat tails (give probability to extreme events)

In R:

# Random number generation (base R's t functions take only the degrees
# of freedom; the distribution is centred at 0 with scale 1)
rt(n, df)

# Height of the density at point x
dt(x, df)

# Cumulative probability function up to point q
pt(q, df)

Some distributions - Inverse Gamma

  • Very flexible distribution often used for standard deviations
  • Strictly positive, right skewed

    In R:

    install.packages("MCMCpack")
    library(MCMCpack)
    
    # Random number generation
    rinvgamma(n, shape, scale)
    
    # Height of the density at point x
    dinvgamma(x, shape, scale)
    

Some exercises

  • Generate some random numbers from each of these distributions and plot them using plot(density(x)), where x is your vector of random numbers (a sketch follows below)
  • Do this several times for each distribution with different values of the parameters
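
A minimal sketch of the first exercise (the parameter values here are arbitrary; change them and re-plot):

# Normal
x <- rnorm(1000, mean = 5, sd = 2)
plot(density(x))

# Student t (fat tails relative to the normal)
x <- rt(1000, df = 3)
plot(density(x))

# Inverse gamma (strictly positive, right skewed)
library(MCMCpack)
x <- rinvgamma(1000, shape = 2, scale = 1)
plot(density(x))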

Multivariate distributions

  • So far we have played with distributions where random variables are independent of one another.
  • We want to learn how to incorporate inter-dependency

Multivariate distributions: Correlation matrix

Correlation is a measure between -1 and 1, where 0 implies no linear relationship between two variables and -1 or 1 implies a perfect (negative or positive) linear relationship.

A Correlation matrix contains the two-way correlations between all variables in the system.

           V1        V2        V3
V1  1.0000000 -0.482711 0.1231727
V2 -0.4827110  1.000000 0.4823770
V3  0.1231727  0.482377 1.0000000

Multivariate distributions: Correlation matrix

[Figure: plot of the correlated series]

Multivariate distributions: Scale matrices

  • The correlation matrix summarises the degree to which random variables are linearly related.
  • It does not tell us anything about the standard deviation of each of those variables.
  • We need to introduce a scale variable.

A univariate distribution's scale is determined by the \( \sigma \) value. Each variable in a multivariate distribution also has its own scale variable. We collect them together in a vector called \( \tau \).

\[ \tau = [\sigma_{1}, \sigma_{2}, \dots, \sigma_{P}]' \]

Multivariate distributions: Covariance matrix

Covariance matrices combine the information from a scale vector with the correlation matrix. We typically use \( \Sigma \) to denote a covariance matrix and \( \Omega \) for a correlation matrix.

\[ \Sigma = \mbox{diag}(\tau)\Omega\mbox{diag}(\tau) \]

Where the diag operator places the values in \( \tau \) along the diagonal elements of a matrix (with zeroes everywhere else)

Multivariate distributions: The multivariate normal distribution

  • Each individual series is normally distributed with standard deviation \( \sigma_{i} \)
  • The correlation between all series is given by \( \Omega \)
  • Each series has its own average value, \( \mu_{i} \). These are collected together in a vector \( \mu = [\mu_{1}, \dots, \mu_{P}] \)

Multivariate distributions: The multivariate normal distribution

In R:

library(MASS)
n <- 1000  # number of draws
# Define your correlation matrix
cormat <- matrix(c(1, -0.5, 0.1, -0.5, 1, 0.5, 0.1, 0.5, 1), 3, 3)
# Define your scale vector
scalevec <- c(4, 2, 1)
# Create your covariance matrix
Sigma <- diag(scalevec) %*% cormat %*% diag(scalevec)
# Create the mean vector
means <- c(3, 7, 2)
# Draw correlated normal observations
xs <- as.data.frame(mvrnorm(n = n, means, Sigma))
cor(xs); plot(xs)

The LKJ distribution

  • A one-parameter \( [1, \infty) \) distribution over correlation matrices
  • When its parameter is 1, the distribution is uniform over correlation matrices, so strong correlations between variables are entirely plausible
  • When its parameter is large, it collapses on an identity matrix (no correlation between variables)

  • Not yet available in R, but available in Stan

Back to generative models

  • Let's simulate a real generative model
    • \( \theta\sim \mathcal{MVN}([-2, 3, 1, 5]', \Sigma) \) (you can decide \( \Sigma \))
    • \( \sigma \sim \mathcal{IG}(1, 1) \)
    • Use your data from before
    • Simulate new values of \( y2_{i} \)
  • What does the distribution of the new \( y2 \) look like relative to \( y \)? (A sketch follows below.)
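
A minimal sketch of one way to do this, assuming the simulated data frame df from earlier; the choice of \( \Sigma \) here is arbitrary:

library(MASS); library(MCMCpack)

# Draw the parameters from their distributions
Sigma_theta <- diag(c(1, 0.5, 0.5, 0.5))          # an arbitrary choice of Sigma
theta <- mvrnorm(1, c(-2, 3, 1, 5), Sigma_theta)  # an intercept and three slopes
sigma <- rinvgamma(1, 1, 1)

# Simulate new outcomes from the generative model
X <- cbind(1, df$x1, df$x2, df$x3)
y2 <- as.numeric(X %*% theta) + rnorm(nrow(df), 0, sigma)

# Compare the distribution of y2 with the original y
plot(density(df$y)); lines(density(y2), col = "red")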

The linear model

  • There's not really a huge point in estimating this model the Bayesian way, but it's a good learning tool.

\[ y_{i} = \beta_{0} + \beta_{1}x_{1,i} + \dots + \beta_{P}x_{P,i} + \epsilon_{i} \]

\[ y = X\beta + \epsilon \]

with \( \epsilon \sim \mathcal{N}(0, \sigma) \)

What is likelihood?

  • In summary,

    • Treating \( X \) as fixed
    • Given a (guess of a) set of \( \beta \) s
    • A single number that is greater when the observations of \( y_{i} \) are more probable under the model.
    • \( \prod_{i}f(y_{i} | \beta, X_{i}) \) where \( f() \) is the density of the model at point \( y_{i} \)
    • In R, the density is given by dnorm, dt, dinvgamma etc.
  • Example in R: a two-parameter likelihood (see the sketch below).
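
A minimal sketch of what such an example might look like: the likelihood of some data under a normal model with two unknown parameters (the names mu and sigma are illustrative):

# Some observed data
y <- rnorm(100, mean = 2, sd = 1)

# The likelihood: the product of the density at each observation,
# for a given guess of (mu, sigma)
likelihood <- function(mu, sigma, y) {
  prod(dnorm(y, mean = mu, sd = sigma))
}

likelihood(2, 1, y)  # a guess near the truth
likelihood(0, 1, y)  # a worse guess gives a (much) smaller likelihood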

Maximum likelihood

  • Work with logs! Multiplying many very small numbers together runs into numerical issues.
  • We want to pick the parameters that maximise the log likelihood.
  • Most optimisation algorithms minimise something, so we typically minimise the negative log likelihood (see the sketch below).
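
A minimal sketch in R, continuing the two-parameter normal example above (the function name neg_log_lik is illustrative):

# Negative log likelihood of a normal model with parameters par = c(mu, sigma)
neg_log_lik <- function(par, y) {
  -sum(dnorm(y, mean = par[1], sd = par[2], log = TRUE))
}

# Minimise the negative log likelihood, keeping sigma positive
mle <- optim(par = c(0, 1), fn = neg_log_lik, y = y,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
mle$par  # should be close to the mean and sd used to simulate y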

How do these work?

Maximum likelihood

[Figure: marble in a bowl]

Likelihood of the linear model

Let's take our linear model from before:

\[ y = X\beta + \epsilon \]

With \( \epsilon \sim \mathcal{N}(0, \sigma) \)

As before, we can express this in probabilistic notation:

\[ y \sim\mathcal{N}(X\beta, \sigma) \]
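
A minimal sketch of evaluating this likelihood (on the log scale) for a given guess of \( \beta \) and \( \sigma \), assuming the simulated data frame df from earlier; the guesses are illustrative:

X <- cbind(1, df$x1, df$x2, df$x3)  # design matrix with an intercept column
beta_guess <- c(1, 2, -1, 0.5)
sigma_guess <- 1.5

# Log likelihood of the linear model at this guess
sum(dnorm(df$y, mean = as.numeric(X %*% beta_guess), sd = sigma_guess, log = TRUE))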

What's wrong with maximum likelihood?

  • It's a wonderful method
  • Gives us the parameters that make the most sense of the data
  • We want it the other way around: we want to develop a probabilistic understanding of the parameters given the data.
  • Does not allow for probabilistic uncertainty about parameters.

Bayes' Rule

\[ p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)} \]

Because the denominator \( p(y) \) does not depend on the parameters \( \theta \), we typically write this out in proportional notation:

\[ p(\theta | y) \propto p(y | \theta)p(\theta) \]

That is, if we want to make inference about the posterior \( p(\theta | y) \), we need to have the likelihood \( p(y | \theta) \), and a prior distribution for the parameters of the model \( p(\theta) \).

Bayes' Rule

[Figure: Thomas Bayes]

Bayes' Rule and Monte Carlo theory

Again: We want to make inference (want to know the mean, sd, quantiles etc.) about the posterior. Unfortunately, for all but a few simple models, the right-hand side \( p(y | \theta)p(\theta) \) is not analytically solvable.

That's OK!

Monte Carlo

  • To make inference about \( p(\theta|y) \), I don't need the analytical form. As long as I can generate many random draws from the distribution, I can then calculate the statistics for these draws.
  • For instance, let's say I want the average of \( \theta_{1} \). If I have 500 draws of \( \theta_{1} \) from the posterior, I can simply calculate mean(theta[,1]) (see the sketch below)
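
A minimal sketch of the idea, using draws from a known distribution to stand in for posterior draws (in a real analysis the matrix theta would come from the sampler):

# Pretend these are 500 posterior draws of two parameters
theta <- cbind(rnorm(500, mean = 1, sd = 0.3),
               rnorm(500, mean = -2, sd = 0.5))

# Inference is just summarising the draws
mean(theta[, 1])
sd(theta[, 1])
quantile(theta[, 1], c(0.025, 0.975))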

Markov chains

  • We want to generate draws from the posterior
  • We don't know the functional form of the posterior
  • BUT we do know that
    • If we generate each draw \( \theta^{(i+1)} \) from a distribution that depends only on the previous draw \( \theta^{(i)} \) and the data \( y \), and
    • The distribution of the draws does not change as we progress forward (no change in the mean or variance of the draws), then
    • We must have discovered an invariant distribution
    • This is possibly our posterior (though we should do some more checks, often by running multiple chains and making sure they “mix”)

An introduction to Stan

  • Amazing language
  • We write the model, not the sampler
  • It translates our code into a C++ implementation of Hamiltonian Monte Carlo (an algorithm to generate draws from the posterior), compiles it, and simulates.
  • Also allows us to write functions to export into R
    • In my tests, they can be around 1000 times as fast as native R functions (especially when you have nested loops)

An introduction to Stan

Code blocks:

  • functions — any user-defined functions that you want to use inside or outside Stan
  • data — the data inputs, including the values of any prior hyperparameters
  • parameters — defining the parameters that you want to estimate
  • transformed parameters — any transformations of the parameters, such as generating covariance matrices from scale and correlation matrices.
  • model — where you define the priors and likelihood
  • generated quantities — where you define any output from the model, including predictions or values of a loss function.

An introduction to Stan

  • Exercise: implementing a linear model
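
A minimal sketch of what a solution might look like: an R script that passes a Stan model string to rstan (the priors are just one reasonable choice, and the data are assumed to be the simulated df from earlier):

library(rstan)

model_code <- "
data {
  int N;            // number of observations
  int P;            // number of predictors
  matrix[N, P] X;   // predictor matrix
  vector[N] y;      // outcome
}
parameters {
  real alpha;
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  // Priors
  alpha ~ normal(0, 10);
  beta ~ normal(0, 5);
  sigma ~ cauchy(0, 2.5);
  // Likelihood
  y ~ normal(alpha + X * beta, sigma);
}
"

X <- cbind(df$x1, df$x2, df$x3)
fit <- stan(model_code = model_code,
            data = list(N = nrow(X), P = ncol(X), X = X, y = df$y))
print(fit)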