The likelihood-prior connection is the most fundamental relationship in Bayesian analysis. The prior distribution must support some of the parameter values in the likelihood, whereas the likelihood must support all possible observations. Here I apply fundamentals of distribution theory to revisit this relationship, including prior weight and conjugacy.

source('clarkFunctions2024.R')
library(MCMCpack)
library(repmis)
library(rjags)


Today

Objectives

Can I trust my estimates when the predictors are correlated?

I want to determine how temperature and precipitation at different times of the year explain variation in species abundances, but these climate variables are all correlated with each other. They are further correlated with soils, because extensive coastal plain sands are mostly in the (warm, moist) Southeast, and fertile, glaciated soils are mostly in the (cold) Northeast. I’ve heard that correlated predictors could be bad, but I can never escape these correlations in observational data. How can I determine their effect on estimates?

In this unit I develop Bayesian regression extending distribution theory from unit 7 to continuous responses. To flesh out the concept of conjugacy, I begin by revisiting a previous example of binary trials. This and the previous examples for the Gaussian likelihood show commonalities in the conjugate pairs. I then turn to Bayesian regression with a MVN prior and examine the question of correlated inputs.

Conjugate prior-likelihood combinations

A conjugate prior distribution has the same form as the posterior distribution. Conjugacy depends on both the prior and the likelihood.

Binomial likelihood/beta prior

In unit 7, I pointed out that a binary (Bernoulli) outcome could be combined with a beta prior distribution to yield a posterior beta distribution. Recall further that the binomial is a generalization of the Bernoulli distribution to allow for multiple trials. The binomial distribution looks like this:

\[ binom(x | n, \theta) \propto \theta^x (1 - \theta)^{n - x} \] The right-hand side has the probability of success in a given trial (\(\theta\)) raised to the power \(x\) (the number of ‘successes’), multiplied by the probability of failure (\(1 - \theta\)) raised to the power \(n - x\) (the number of failures). The Bayesian model now needs a prior distribution for \(\theta\).

Here is the model with a beta prior distribution, with the product of likelihood and prior,

\[ \begin{aligned} binom(x | n, \theta) beta(\theta | a, b) &\propto \theta^x (1 - \theta)^{n - x} \times \theta^{a-1}(1 - \theta)^{b-1} \\ &= \theta^{x + a - 1} (1 - \theta)^{n - x + b - 1} \\ &= beta(\theta | x + a, n - x + b) \end{aligned} \]

As I noted before, there are two parameters, each with two parts. The first part comes from the data, having exponents \(x\) and \(n - x\). The second part comes from the prior, with exponents \(a\) and \(b\). To see how they affect the posterior distribution, I could draw a picture,

n     <- 10                               # no. trials
theta <- .6
x     <- rbinom(1, n, theta)              # observations

a <- b <- .1                              # prior parameter values
tseq   <- seq(0, 1, length = 100)         # sequence of theta values
xn     <- 0:n                             # sequence of x values
bn     <- dbinom( xn, n, theta)           # likelihood
prior  <- dbeta(tseq, a, b)               # prior distribution
post   <- dbeta(tseq, x + a, n - x + b)   # posterior distribution

par(mfrow=c(2,1), bty='l', mar=c(4,5,2,4), las = 1 )

plot( xn, bn, xlab = 'x', ylab = 'Probability' )                  # draw the binomial PMF
segments( xn, xn*0, xn, bn)
points( x, dbinom(x, n, theta), pch = 16, col = 'brown', cex = 2)
title( 'binomial likelihood')

plot( tseq, post, type = 'l', lwd = 2, xlab = 'theta', ylab = 'Density',
      col = 'brown' )                                             # prior/posterior
lines( tseq, prior, lwd = 2, col = 'grey')
title( 'beta prior and posterior' )
*The prior and posterior beta distributions used with a binomial likelihood. The posterior distribution (brown line) balances the prior distribution (grey line) and the observation (brown dot in the likelihood panel).*



I have placed a dot at the observed \(x\). The shape of the beta distribution is determined by the magnitudes of its two parameters. Each posterior parameter combines a term from the data with a term from the prior, and their relative sizes determine the weight of the prior. Here the prior is non-informative, because the posterior parameters used to evaluate post are hardly affected by the small values of a and b from the prior.

Exercise 1. Modify the R code to determine how the beta distribution responds to \(a\) and \(b\).

Exercise 2. Modify the R code to determine how the true value of theta used to simulate the data and \(n\) affect the weight of the prior.

Extension to multinomial/Dirichlet

Recall that the multinomial distribution is an extension of the binomial,

\[ \left[ \mathbf{y} | n, \theta \right ] \propto \prod_{j = 1}^J \theta_j^{y_j} \] with the constraints (support) \(\sum_j \theta_j = 1\) and \(\sum_j y_j = n\). These constraints mean that the last (\(J^{th}\)) class has parameter \(\theta_J = 1 - \sum_{j = 1}^{J-1} \theta_j\) and \(y_J = n - \sum_{j = 1}^{J-1} y_j\). So there are only \(J - 1\) dimensions, the \(J^{th}\) being redundant with the others. If we have \(J = 2\), then this is the binomial distribution.

Just as the binomial likelihood/beta prior is conjugate, so too is the multinomial/Dirichlet. The Dirichlet has the same form as the beta distribution, \(Dir(\theta_{j = 1, \dots, J} | a_{j = 1, \dots, J}) \propto \prod_{j = 1}^J \theta_j^{a_j - 1}\). This has the same form as the multinomial likelihood. We can multiply them to get another Dirichlet distribution:

\[ \begin{aligned} \left[ \theta | \mathbf{y}, n \right] &\propto \left[ \mathbf{y} | n, \theta_{j = 1, \dots, J} \right ][\theta_{j = 1, \dots, J} | a_{j = 1, \dots, J}] \\ &\propto \prod_{j = 1}^J \theta_j^{y_j} \times \prod_{j = 1}^J \theta_j^{a_j - 1} = \prod_{j = 1}^J \theta_j^{y_j + a_j - 1} \\ &= Dir( \theta | y_1 + a_1, \dots, y_J + a_J ) \end{aligned} \] This is the posterior distribution.
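
As a quick numerical illustration of this update, here is a minimal sketch in R (the counts, the prior values, and the use of rdirichlet from MCMCpack, loaded above, are my own choices for illustration):

J     <- 3                                   # number of classes
theta <- c(.5, .3, .2)                       # true class probabilities
n     <- 50                                  # number of trials
y     <- rmultinom(1, n, theta)[,1]          # observed counts for each class

a     <- rep(1, J)                           # Dirichlet prior parameter values
apost <- y + a                               # posterior Dirichlet parameters

postMean   <- apost/sum(apost)               # posterior mean of theta
thetaDraws <- rdirichlet(1000, apost)        # posterior samples (MCMCpack)
rbind( true = theta, postMean = postMean, sampleMean = colMeans(thetaDraws) )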

Normal-normal likelihood-prior combinations

My regression question builds from the basic Gaussian model, previously discussed for a simple mean and variance. I previously noted that a normal prior distribution on the mean of a normal likelihood resulted in a normal posterior distribution. Recall that

\[ \begin{aligned} \left[\mu | \mathbf{y}, m, \sigma^2, M \right] &\propto \prod_{i=1}^n N(y_i |\mu,\sigma^2) N(\mu|m, M) \\ &= N \left(Vv, V \right) \end{aligned} \] where, again, we have two parameters, each with two terms, one from the data (\(n, \bar{y}\)) and one from the prior (\(m, M\)),

\[ \begin{aligned} V &= \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1} \\ v &= \frac{n\bar{y}}{\sigma^2} + \frac{m}{M} \end{aligned} \]

I pointed out that this is a weighted average of data and prior, with the inverse of variances being the weights,

\[ Vv = \frac{\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}} \]

Note how large \(n\) will swamp the prior,

\[ \underset{\scriptscriptstyle \lim n \rightarrow \infty}{Vv} \rightarrow \bar{y} \]

The prior can fight back with a tiny prior variance \(M\), \[ \underset{\scriptscriptstyle \lim M \rightarrow 0}{Vv} \rightarrow m \]
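
To see these limits numerically, here is a minimal sketch of the ‘Vv rule’ for the mean (the values for n, ybar, sigma, m, and M are arbitrary choices for illustration):

vvRule <- function(n, ybar, sigma, m, M){    # posterior mean and sd for mu
  V <- 1/( n/sigma^2 + 1/M )                 # posterior variance
  v <- n*ybar/sigma^2 + m/M
  c( mean = V*v, sd = sqrt(V) )
}
vvRule( n = 5,    ybar = 0, sigma = 1, m = 2, M = .5 )    # small n: pulled toward prior mean m
vvRule( n = 5000, ybar = 0, sigma = 1, m = 2, M = .5 )    # large n: posterior mean near ybar
vvRule( n = 5,    ybar = 0, sigma = 1, m = 2, M = 1e-6 )  # tiny prior variance M: prior wins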

As with the beta-binomial combination for \(\theta\), this normal-normal combination for \(\mu\) is conjugate. Also as with the beta-binomial, there is a transparent contribution of data and prior parameter values to the posterior distribution. This desirable relationship extends beyond the mean to regression parameters.

Regression with a fixed variance

I want to execute a Bayesian regression with a multivariate normal prior distribution on the coefficients in \(\boldsymbol{\beta}\).

Regression extends the previous example to include predictors in a design vector \(\mathbf{x}_{i} = (x_{i1}, \dots, x_{ip})\) multiplied by a coefficient vector \(\boldsymbol{\beta} = (\beta_{1}, \dots, \beta_{p})\) that will be estimated from data, with a contribution from a prior distribution. The posterior distribution I want is this:

\[ [\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}, \sigma^2, \mathbf{b}, \mathbf{B}] \propto \prod_{i=1}^n N(y_i | \mathbf{x}_i' \boldsymbol{\beta}, \sigma^2) \times MVN( \boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) \]

The multivariate normal distribution for a vector can be written as \(MVN( \boldsymbol{\beta}|\mathbf{b}, \mathbf{B})\), where \(\boldsymbol{\beta}\) and \(\mathbf{b}\) are length-\(p\) vectors, and \(\mathbf{B}\) is a \(p \times p\) covariance matrix.

It may not be obvious from the way I have written the posterior that the normal likelihood-multivariate normal prior combination is conjugate. To make this more transparent, I can write the response and the mean as length-\(n\) vectors \(\mathbf{y}\) and \(\mathbf{X} \boldsymbol{\beta}\). An example can help visualize how these elements come together. First, I’ll simulate a few regression data sets, so I’ll put that in a function:

simReg <- function(n, p, s){
  b       <- matrix( rnorm(p), p)
  x       <- matrix( rnorm(n*p), n, p )
  x[,1]   <- 1
  epsilon <- rnorm(n, 0, s)
  y       <- x%*%b + epsilon  
  list(x = x, y = y, beta = b, sigma = s, epsilon = epsilon)
}

Here I use the function to simulate regression data:

sigma <- .5
n     <- 10
p     <- 3
d     <- simReg(n = n, p = p, s = sigma)
x     <- d$x
y     <- d$y
beta  <- d$beta
mu    <- x%*%beta
plot( mu, y, xlab = 'mu', ylab = 'observed', bty = 'n')
abline( 0, 1, lty=2, col = 'grey')
*The regression mean with simulated observations.*



If I were rusty on matrix multiplication I could simply look at the objects x, beta, and y. The operations I have performed here could be written as a multivariate normal distribution for the full vector of \(\mathbf{y}\),

\[ MVN_n(\mathbf{y} | \mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n) \propto (\sigma^2)^{-n/2} exp \left[ -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] where \(\mathbf{I}_n\) is the identity matrix. Because the covariance matrix is \(\sigma^2 \mathbf{I}_n\), the observations are taken as independent, and this is equivalent to taking the product of \(n\) univariate likelihoods, one for each observed \(y_i\). Recall that zero off-diagonal elements of a covariance matrix indicate variables that do not covary.
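
This equivalence between the joint \(MVN\) density with covariance \(\sigma^2 \mathbf{I}_n\) and the product of \(n\) univariate normal densities can be checked numerically. Here is a minimal sketch that assumes the mvtnorm package (not loaded above) for dmvnorm, using the simulated x, y, and beta from the chunk above:

library(mvtnorm)                                   # assumed available for dmvnorm
mu   <- x%*%beta                                   # mean vector from the simulated design
lmv  <- dmvnorm( as.vector(y), as.vector(mu),
                 sigma^2*diag(n), log = TRUE )     # joint MVN log density
luni <- sum( dnorm( y, mu, sigma, log = TRUE ) )   # sum of n univariate log densities
c( multivariate = lmv, sumUnivariate = luni )      # identical up to rounding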

When I combine this \(MVN\) likelihood with the \(MVN\) prior distribution I end up with the same structure I obtained for the univariate normal distribution,

\[ \left[ \boldsymbol{\beta} | \mathbf{y}, \mathbf{X}, \sigma^2, \mathbf{b}, \mathbf{B} \right] = MVN(\boldsymbol{\beta} | \mathbf{Vv}, \mathbf{V}) \] The matrices on the right-hand side are multivariate versions of \((V, v)\) from the (univariate) normal distribution,

\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \] Recall, \(V\) was a posterior variance, with terms from the likelihood and prior. Here, \(\mathbf{V}\) is a \(p \times p\) covariance matrix, with a term contributed by the crossproduct of the design matrix \(\mathbf{X}' \mathbf{X}\) and a term from the prior, \(\mathbf{B}\). Similarly, \(\mathbf{v}\) is now a length-\(p\) vector with a term contributed by the crossproduct of design and observations \(\mathbf{X}' \mathbf{y}\) and prior mean \(\mathbf{b}\). Also as with the univariate case, I obtain the posterior mean with the product \(\hat{\boldsymbol{\beta}} = \mathbf{V} \mathbf{v}\).

Here again, it can help to visualize with R. I begin with the terms from the likelihood,

XX     <- crossprod(x)                  # crossproduct X'X
V      <- sigma^2*solve(XX)             # beta covariance that ignores prior
v      <- crossprod(x, y)/sigma/sigma   
bhat   <- V%*%v
bse    <- sqrt( diag(V))
estimates <- data.frame( beta = beta, bhat = bhat, SE = bse )

These are the true values and the estimates, with standard errors in the estimates data.frame:

 true values     bhat      SE
      2.0265   2.0448  0.1626
     -0.7038  -0.5665  0.2288
      1.6889   1.4809  0.1749


Here I include the prior distribution and pack it away in a function so I can reuse it later:

b    <- matrix(0, p)
B    <- diag(1, p)

mvnPost <- function(x, y, s, b = NULL, B = NULL ){
  
  XX   <- crossprod(x)
  
  if( is.null( b ) )b <- matrix( 0, ncol(x) )
  if( is.null( B ) )B <- nrow(x)*s*solve(XX)   # Zellner's g prior
  
  V    <- solve( XX/s/s + solve(B) )
  v    <- crossprod(x, y)/s/s + solve(B)%*%b
  list( bhat = V%*%v, SE = sqrt(diag(V)), bcov = V )
}

z <- mvnPost(x, y, sigma, b, B)
withPrior <- data.frame( beta = beta, priorMu = b, bhat = z$bhat, SE = z$SE)
 true values   prior mean     bhat      SE
      2.0265            0   1.9873  0.1603
     -0.7038            0  -0.5173  0.2229
      1.6889            0   1.4346  0.1723


The estimates that include the prior (bhat) can be pulled toward the prior, which is centered on zero. Both sets of estimates can differ substantially from the values used to simulate the data (beta) because n is small.

To examine the effects on a larger data set, I increase the sample size, and I generate a correlated version of the design matrix x by constructing a random covariance matrix with the function .rMVN:

# simulate data with and without correlated columns in x
n    <- 100
d    <- simReg(n = n, p = 3, s = sigma)
x    <- d$x
y    <- d$y
beta <- d$beta
error <- d$epsilon
C      <- cov( .rMVN(p+1, beta*0, diag(p) ) )   # covariances can be large
xc     <- .rMVN(n, b*0, C)                      # design with correlated columns
x[,1]  <- xc[,1] <- 1                           # include intercept
lims <- range( rbind(x, xc) )
plot( x[,2], x[,3], xlab = 'x_2', ylab = 'x_3', pch = 16, 
      xlim = lims, ylim = lims, bty = 'n' )
points( xc[,2], xc[,3], col = 3, pch = 16)      # larger covariance than x
*The columns of x can be correlated (green)*.



mu     <- x%*%beta                              # matrix multiplication
mc     <- xc%*%beta                             # with correlation
y      <- matrix( mu + error, n)                # n x 1 response
yc     <- matrix( mc + error, n)                # response with correlated x
plot( mu, y, xlab = 'mu', ylab = 'observed', pch = 16, bty = 'n' )
points( mc, y, col = 3, pch = 16)
abline( 0, 1, lty=2, col = 'grey')
*The design with correlated predictors can generate predictions with broad scatter (green) in this plot of mean(y) and y*.


# summarize posterior distributions
zu      <- mvnPost(x, y, sigma, b, B)
zc      <- mvnPost(xc, y, sigma, b, B)
buncor  <- zu$bhat
SEuncor <- zu$SE
bcor    <- zc$bhat
SEcor   <- zc$SE
betaEstimates <- data.frame( beta = beta, buncor = buncor, SEuncor = SEuncor, 
                             bcor = bcor, SEcor = SEcor)
 true values   uncorrelated   SE (uncor)   correlated   SE (cor)
     -0.5282        -0.4976       0.0503      -0.2603     0.0515
      1.7965         1.8885       0.0537       0.0071     0.0352
     -1.5845        -1.5628       0.0479       0.1020     0.0284


Clearly, the scatters in the predicted y are quite different. When the x columns are uncorrelated, the parameter estimates are closer to the values used to simulate the data. Here is a plot of the posterior distribution, marginalized for each coefficient, with the values used to simulate data (dashed lines):

bseq <- seq(-3, 3, length=200)
plot(NA, xlim = range(bseq), ylim = c(0,15), xlab = 'Beta', ylab = 'Density', bty = 'n')
for(j in 1:p){
  lines(bseq, dnorm( bseq, buncor[j], SEuncor[j]), col = 1, lwd=j)
  lines(bseq, dnorm( bseq, bcor[j], SEcor[j]), col = 3, lwd=j)
}
abline(v = beta, lty=2, lwd=1:3)
legend( 'topleft', c('Uncorrelated x', 'Correlated x'), 
        text.col=c(1,3), bty='n' )
*Posterior marginalized for three coefficients from the correlated and uncorrelated designs. Values used to simulate data are dashed lines*.



It is also clear from the formula for \(\mathbf{V}\) that the correlation structure of the design matrix will appear in the covariance of the estimates, which includes the inverse of the crossproduct \(\mathbf{X}'\mathbf{X}\). Here is a sample of parameter estimates from the posterior distribution, followed by a comparison of the correlation in the design matrix with the correlation in the inverse covariance of the estimates:

bcov    <- zc$bcov
bsample <- .rMVN(1000, b*0, bcov )
plot(bsample[,2], bsample[,3], cex=.3, col = 3, 
     bty = 'n', xlab = 'beta_2', ylab = 'beta_3')
*Scatter plot of samples from the posterior for two slope coefficients show the inverse correlation that occurs within columns of `x`.*



Here is the correlation in the design matrix:

cor( xc[,2:3] )         
        x_2     x_3
x_2   1.000  -0.224
x_3  -0.224   1.000

Here is the correlation of the inverse covariance in the coefficients:

cov2cor( solve(zc$bcov) )[2:3,2:3]    
        x_2     x_3
x_2   1.000  -0.212
x_3  -0.212   1.000


The posterior covariance in \(\boldsymbol{\beta}\) means that the model can have limited capacity to discriminate between the effects of these two predictors.

Exercise 3. Using the FIA basal area data, determine the relationship between the covariance in predictors and in the estimated coefficients. Should some predictors be omitted? Recall that basal area data are continuous with discrete zeros, so I suggest the Tobit.

Here are some suggestions:
  1. Load data from files xdataCluster.rdata and baCluster.rdata (see below).
  2. Select a species to analyze (a column in baCluster.rdata).
  3. Select predictors from xdataCluster.rdata and write a formula.
  4. Create a data.frame holding predictors and response.
  5. Fit a Tobit regression using bayesReg in clarkFunctions2024.R.

Here I load the data:

source_data("https://github.com/jimclarkatduke/gjam/blob/master/xdataCluster.rdata?raw=True")
source_data("https://github.com/jimclarkatduke/gjam/blob/master/baCluster.rdata?raw=True")
y    <- baCluster[,'pinusTaeda']
data <- data.frame( cbind(xdataCluster, y) )


Here is the covariance structure in data:

*Pairs plot shows correlated predictors in `data`.*



Note that pairs of predictors showing high absolute correlation (positive or negative) will have effects that are hard to separate.


It is important to recognize that the simple explorations I’ve done here are possible because the likelihood-prior combination in this regression is conjugate. This relationship allowed me to solve for the posterior distribution and things that can be extracted from it, such as the standard errors. I have ignored the uncertainty in the variance; that comes next.


Residual variance (known \(\boldsymbol{\mu}\))

In the previous example, I solved for regression coefficients assuming that the residual variance parameter \(\sigma^2\) was known. Typically, it will not be known, so it must also be estimated. In other words, it must be on the left side of the bar in the prior and posterior distributions, \([\boldsymbol{\beta}, \sigma^2 | \cdot ]\).

Now I assume that I know the coefficients \(\boldsymbol{\beta}\) and want to estimate the residual variance \(\sigma^2\). Recall the likelihood for the normal distribution, this time only retaining the factors that include \(\sigma^2\),

\[ N(\mathbf{y}|\mu,\sigma^2) \propto \sigma^{-2(n/2)} exp\left[-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \]

(Note: previously I wrote this likelihood in matrix form; the two forms are equivalent.) A commonly used prior distribution is the inverse gamma,

\[ IG \left(\sigma^2 | s_1, s_2 \right) \propto \sigma^{-2(s_1 + 1)} exp\left(-s_2 \sigma^{-2} \right) \]

When I multiply likelihood and prior I get another inverse gamma distribution,

\[ IG \left(\sigma^2 | u_1, u_2 \right) \propto \sigma^{-2(u_1 + 1)} exp\left(- u_2 \sigma^{-2} \right) \]

where \(u_1 = s_1 + n/2\), and \(u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2\). Again, this is a conjugate likelihood-prior pair. Here is a prior and posterior distribution for a sample data set.

par( bty='n', las = 1 )
n  <- 10
y  <- rnorm(n)
s1 <- s2 <- 1
ss <- seq(0, 4, length=100)
u1 <- s1 + n/2
u2 <- s2 + 1/2*sum( ( y - mean(y) )^2) 
plot( ss, dinvgamma(ss, u1, u2), type='l', lwd=2, xlab = 'sigma^2', ylab='Density' )
lines( ss, dinvgamma(ss, s1, s2), col='blue', lwd=2 )
legend('topright', c('prior', 'posterior'), text.col=c('blue','black'), bty = 'n' )
*Prior and posterior IG distribution for a simulated data set*


For a regression model, this conjugate relationship still holds.

Small step to Gibbs sampling

The conditional posterior distributions for coefficients and variance will be combined for posterior simulation using Gibbs sampling. To see how this will come together, consider that we can now sample \([\boldsymbol{\beta} | \sigma^2]\) and, conversely, \([\sigma^2 | \boldsymbol{\beta}]\). If we alternate these two steps repeatedly we have a simulation for their joint distribution, \([\boldsymbol{\beta}, \sigma^2]\).

Here is a simulated data set with some intermediate quantities that will be used to sample the posterior:

p     <- 4
n     <- 100
sigma <- .5
d <- simReg( n = n, p = p, s = sigma ) # simulated data
beta <- d$beta
x    <- d$x
y    <- d$y
s1  <- s2 <- .1                        # prior parameter values
b   <- beta*0

XX  <- crossprod(x)                    # intermediate products
XY  <- crossprod(x,y)
B   <- n*sigma*solve( XX )             # Zellner g prior
IB  <- solve(B)
Bb  <- IB%*%b

Here are functions I can use to alternately sample from \([\boldsymbol{\beta}|\sigma^2]\) and \([\sigma^2 | \boldsymbol{\beta}]\):

sampleBeta <- function(s){          # draw a sample of coefficients
  V  <- solve( XX/s/s + IB )
  v  <- XY/s/s + Bb
  t( .rMVN(1, V%*%v, V) )
}
sampleSigma <- function(b){         # sample residual variance
  1/rgamma(1, s1 + n/2, s2 + .5*crossprod( y - x%*%b ) )
}

Here I alternately sample from the posterior distribution of \(\boldsymbol{\beta}|\sigma^2\) (the current value is bg) followed by \(\sigma^2|\boldsymbol{\beta}\) (current value sg):

bg <- b                         # initial values
sg <- 1

ng <- 5000                      # matrix and vector to hold samples
bgibbs <- matrix( NA, ng, p)
sgibbs <- rep(NA, ng)

for(g in 1:ng){                 # generate MCMC chains
  bg <- sampleBeta(sg)
  sg <- sampleSigma(bg)
  bgibbs[g,] <- bg
  sgibbs[g]  <- sg
}

With each iteration, a new sample is taken from both conditional distributions and stored in bgibbs and sgibbs.
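
Before plotting, the chains can be summarized numerically. Here is a minimal sketch (the choice to discard the first 500 iterations as burn-in is arbitrary):

keep <- 500:ng                                        # discard early iterations as burn-in
est  <- rbind( true   = c( beta, sigma^2 ),           # values used to simulate the data
               median = c( apply( bgibbs[keep,], 2, median ), median( sgibbs[keep] ) ),
               lo     = c( apply( bgibbs[keep,], 2, quantile, .025 ), quantile( sgibbs[keep], .025 ) ),
               hi     = c( apply( bgibbs[keep,], 2, quantile, .975 ), quantile( sgibbs[keep], .975 ) ) )
colnames(est) <- c( paste0('beta', 1:p), 'sigma2' )   # sgibbs holds the variance sigma^2
signif( est, 3 )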

Here are plots of the samples and summary histograms:

par(mfrow = c(p, 2), bty='n', mar=c(2,2,.5,1), oma=c(2,4,2,1) )
for(j in 1:p){
  bp  <- bgibbs[,j]
  q   <- quantile( bp, c(.025, .975) )                    # bound 95%
  lim <- range( c(beta[j], bp) )
  plot(bp, type = 'l', xlab = '', ylab = '', ylim = lim)
  abline(h = q, lty=2)
  abline(h = beta[j], col = 'white', lwd=3)
  abline(h = beta[j], lwd = 2, lty=2)
  
  h <- hist( bp, nclass=40, plot=F )
  plot(h$density, h$mids, type = 's', xlab = '', ylab='', ylim = lim)
  abline(h = q, lty=2)
  abline(h = beta[j], lwd = 2, lty=2)
}
mtext('Beta values', 2, outer=T, las = 0, line = 1)
*Chains at left and summary distributions at right, with true values (dashed)*


jags example

Here is this example in jags. The model file can be written like this:

file <- "lmSimulated.txt" 
cat("model{

  # Likelihood
  for(i in 1:n){
    y[i]   ~ dnorm(mu[i],precision)
    mu[i] <- inprod(beta[],x[i,])
  }

   for (i in 1:p) {
     beta[i] ~ dnorm(0, 1.0E-5)              
   }

  # Prior for the inverse variance
  precision ~ dgamma(0.01, 0.01)
  sigma     <- 1/precision           # note: 1/precision is the variance sigma^2
}", file = file)

Here is the call to jags.model that sets up the posterior sampling:

model <- jags.model(file=file, 
                    data = list(x = x, y = as.vector(y), n=nrow(x), p=ncol(x)))
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 5
##    Total graph size: 713
## 
## Initializing model

I start with 100 burn-in iterations, then sample for 2000:

update(model, 100)
jagsLm <- coda.samples(model, variable.names=c("beta","sigma"), n.iter=2000)
tmp    <- summary(jagsLm)
print(tmp$statistics)
##               Mean         SD    Naive SE Time-series SE
## beta[1]  0.7464039 0.05231721 0.001169848   0.0011698483
## beta[2] -0.2086795 0.05540322 0.001238854   0.0012587956
## beta[3]  0.8921874 0.06248059 0.001397109   0.0013971086
## beta[4]  1.9228684 0.04979419 0.001113432   0.0011533962
## sigma    0.2843073 0.04177950 0.000934218   0.0009663679

Here are plots:

plot( jagsLm, bty = 'n')
*Posterior samples at left with density plots at right*.


This example from jags should have sampled from the posterior in the same way we did (I don’t know how to find out for sure). For me, there are advantages in knowing exactly what I have done. There are not many ways to go wrong in this example, but larger models often require creative algorithms that are not accessible in software.

Recap

Bayesian analysis requires some basic distribution theory to properly combine data (likelihood) and prior information to generate a posterior distribution. Observations are described by probability mass functions (discrete) or probability density functions (continuous). The sample space defines the allowable (non-zero probability) values for a random variable. Parameter distributions are typically continuous. Integrating (continuous) or summing (discrete) over the sample space gives a probability of 1.

Distributions have moments, which are expectations for integer powers of a random variable. The first moment is the mean, and the second central moment is the variance. Higher moments include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).

A joint distribution can be factored into conditional and marginal distributions. A conditional distribution assumes a specific value for the variable that is being conditioned on. Marginalizing over a variable is done with the law of total probability. Bayes theorem relies on a specific factorization giving a posterior distribution in terms of likelihood and prior.

The multivariate normal distribution is commonly used as a prior distribution in regression. When combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.

An informative design does not have large positive or negative correlations between columns in \(\mathbf{X}\). These correlations can be viewed as redundancies that reduce information and influence parameter estimates.


Appendix

Conjugate normal-normal

The conjugacy for the mean parameter of the normal distribution can be checked. Notice that the normal distribution has this form:

\[ N(\mu | Vv, V) \propto exp \left[ -\frac{1}{2V} (\mu - Vv)^2 \right] \] In this form, \(V\) is the variance, and \(Vv\) is the mean parameter of the distribution. Expanding the quadratic in the exponent (and dropping the factor of \(-\frac{1}{2}\)) gives me this,

\[ \frac{1}{V} (\mu - Vv)^2 = \frac{\mu^2}{V} - 2\mu v + Vv^2 \] Dropping the last term (because it does not contain \(\mu\)), I now see that whatever multiplies \(\mu^2\) must be \(V^{-1}\) and whatever multiplies \(-2\mu\) must be \(v\).

So I have

\[ N(\mu | Vv, V) \propto exp \left[ -\frac{1}{2\sigma^2} \sum_i^n (\mu - y_i)^2 \right] \] with the exponent expanding to

\[ \begin{aligned} &= -\frac{1}{2\sigma^2} \left[ n \mu^2 - 2 \mu \sum_{i=1}^n y_i + \sum_{i=1}^n y_i^2 \right] \\ &= -\frac{1}{2} \left[ \frac{n\mu^2}{\sigma^2} - \frac{2 \mu n \bar{y}}{\sigma^2} + \frac{n \overline{y^2}}{\sigma^2} \right] \end{aligned} \] Again, I omit the last term, because it does not include \(\mu\). I identify these two parameters

\[ \begin{aligned} V &= \frac{\sigma^2}{n} \\ v &= \frac{n \bar{y}}{\sigma^2} \end{aligned} \] The mean of this distribution is transparently \(Vv = \bar{y}\) with standard error equal to \(\sqrt{V} = \sigma/\sqrt{n}\).
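
A quick numerical check of this result, with simulated data and a known sigma (the values here are arbitrary):

n0   <- 25                                 # sample size
sig0 <- 2                                  # known sigma
y0   <- rnorm(n0, 10, sig0)                # simulated data
V0   <- sig0^2/n0                          # V = sigma^2/n
v0   <- n0*mean(y0)/sig0^2                 # v = n*ybar/sigma^2
c( Vv = V0*v0, ybar = mean(y0) )           # posterior mean equals the sample mean
c( sqrtV = sqrt(V0), se = sig0/sqrt(n0) )  # posterior sd equals the standard error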

Bayesian regression

I used the normal distribution for regression examples without saying much about it. In Bayesian analysis it is not only used for the likelihood, but also as a prior distribution for parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve the normal distribution.

For the regression model, I start with matrix notation,

\[ \mathbf{y} \sim MVN(\mathbf{X} \boldsymbol{\beta}, \Sigma) \] where \(\mathbf{y}\) is the length-\(n\) vector of responses, \(\mathbf{X}\) is the \(n \times p\) design matrix, \(\boldsymbol{\beta}\) is the length-\(p\) vector of coefficients, and \(\Sigma\) is an \(n \times n\) covariance matrix. I can write this as

\[ (2 \pi)^{-n/2} |\Sigma|^{-1/2} exp \left[ -\frac{1}{2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' \Sigma^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] Because we assume i.i.d (independent, identically distributed) \(y_i\), the covariance matrix is \(\Sigma = \sigma^2 \mathbf{I}\), and \(|\Sigma|^{-1/2} = (\sigma^2)^{-n/2}\), giving us

\[ (2 \pi)^{-n/2} (\sigma^2)^{-n/2} exp \left[ -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] This is the form of the likelihood I use to obtain the conditional posterior for regression coefficients.

The multivariate prior distribution is also multivariate normal,

\[ \begin{aligned} \left[\beta_1, \dots, \beta_p | \mathbf{b}, \mathbf{B} \right] &= MVN(\boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) \\ &= \frac{1}{(2\pi)^{p/2} det(\mathbf{B})^{1/2}} exp\left[-\frac{1}{2}(\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b}) \right] \end{aligned} \]

If there are \(p\) predictors, then \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\). The prior mean is a length-\(p\) vector \(\mathbf{b}\). The prior covariance matrix could be a non-informative diagonal matrix,

\[ \mathbf{B} = \begin{pmatrix} B & 0 & \cdots & 0 \\ 0 & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B \end{pmatrix} \]

for some large value \(B\).

As for the example for the mean of the normal distribution, I apply the “big-V, small-v” method. For matrices the exponent of \(MVN(\boldsymbol{\beta} | \mathbf{Vv}, \mathbf{V})\) is

\[ -\frac{1}{2} (\boldsymbol{\beta} - \mathbf{Vv})'\mathbf{V}^{-1}(\boldsymbol{\beta} - \mathbf{Vv}) = -\frac{1}{2} \left(\boldsymbol{\beta}'\mathbf{V}^{-1}\boldsymbol{\beta} - 2 \boldsymbol{\beta}' \mathbf{v} + \mathbf{v'Vv} \right) \]

As before I find \(\mathbf{V}\) and \(\mathbf{v}\) in the first two terms.

When I combine the regression likelihood with this prior distribution, I have an exponent on the multivariate normal distribution (omitting the factor \(-\frac{1}{2}\)) that looks like this,

\[ \frac{1}{\sigma^2} (y - \mathbf{X}\boldsymbol{\beta})'(y - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b}) \]

Retaining only terms containing \(\boldsymbol{\beta}\) coefficients,

\[ -2 \boldsymbol{\beta}' \left( \sigma^{-2}\mathbf{X'y} + \mathbf{B}^{-1}\mathbf{b} \right) + \boldsymbol{\beta}'(\sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1})\boldsymbol{\beta} \]

I can now identify the matrix \(\mathbf{V}\) and vector \(\mathbf{v}\),

\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \]

These determine the posterior distribution.

Taking limits as I did for the previous example, I obtain the MLE for the mean parameter vector,

\[ \underset{\scriptscriptstyle \lim n \rightarrow \infty}{\mathbf{Vv}} \rightarrow (\mathbf{X'X})^{-1}\mathbf{X'y} \]

Connection to maximum likelihood

Consider again the likelihood, now ignoring the prior distribution, having exponent

\[ \log L \propto - \frac{1}{2 \sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

To maximize the log likelihood I consider only these terms, because others do not contain the coefficients \(\boldsymbol{\beta}\). I differentiate once,

\[ \frac{\partial log L}{\partial \boldsymbol{\beta}} = \sigma^{-2}\mathbf{X'y} - \sigma^{-2}\mathbf{X'X}\boldsymbol{\beta} \]

and again,

\[\frac{\partial^2 log L}{\partial \boldsymbol{\beta}^2} = -\sigma^{-2}\mathbf{X'X}\]

To obtain MLEs I set the first derivative equal to zero and solve,

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'y} \] Again, I can recognize this to be the posterior mean when the prior distribution is non-informative.
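
As a check, this closed-form MLE can be compared with the fit from R's lm. Here is a minimal sketch using the simulated x and y from the Gibbs sampling example above (the -1 in the formula drops lm's intercept, because x already contains a column of ones):

bMLE <- solve( crossprod(x) )%*%crossprod(x, y)        # (X'X)^{-1} X'y
fit  <- lm( as.vector(y) ~ x - 1 )                     # same model fitted with lm
cbind( closedForm = as.vector(bMLE), lm = coef(fit) )  # identical estimates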

How do I estimate the parameter covariance? The matrix of curvatures, or second derivatives, is related to Fisher Information and the covariance of parameter estimates,

\[ \mathbf{I} = -\frac{\partial^2 log L}{\partial \boldsymbol{\beta}^2} = \sigma^{-2}\mathbf{X}'\mathbf{X} \]

The covariance of parameter estimates is \(cov( \hat{\boldsymbol{\beta}} ) = \mathbf{I}^{-1} = \sigma^{2}(\mathbf{X}'\mathbf{X})^{-1}\).

We can go a step further. For transparency, suppose that the columns of the design matrix are centered (subtract the mean), so that \(\mathbf{X}'\mathbf{X} \approx n \times cov( \mathbf{X} )\). If the columns are also uncorrelated, the covariance of parameter estimates has diagonal elements \(diag \left( cov( \hat{\boldsymbol{\beta}} ) \right)_{j} = \sigma^2/ \left( n \, var( \mathbf{x}_j ) \right)\): coefficients are estimated more precisely when predictors vary more and when there are more observations.
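
This relationship can be verified on centered columns of the simulated design from above (a sketch; note that cov divides by \(n - 1\), so the match is approximate for finite \(n\)):

xcen <- scale( x[ , -1], scale = FALSE )    # center the non-intercept columns
crossprod( xcen )[1:2, 1:2]                 # X'X for two centered columns
( nrow(xcen)*cov( xcen ) )[1:2, 1:2]        # n*cov(X) is nearly the same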

How does this MLE relate to the Bayesian posterior?