The likelihood-prior connection is the most fundamental relationship in Bayesian analysis. The prior distribution must support some of the parameter values in the likelihood, whereas the likelihood must support all possible observations. Here I apply fundamentals of distribution theory to revisit this relationship, including prior weight and conjugacy.

source('../clarkFunctions2021.r')

Today

Objectives

Can I trust my estimates when the predictors are correlated?

I want to determine how temperature and precipitation at different times of the year explain variation in species abundances, but these climate variables are all correlated with each other. They are further correlated with soils, because extensive coastal plain sands are mostly in the (warm, moist) Southeast, and fertile, glaciated soils are mostly in the (cold) Northeast. I’ve heard that correlated predictors could be bad, but I can never escape these correlations in observational data. How can I determine their effect on estimates?

In this unit I develop Bayesian regression extending distribution theory from unit 7 to continuous responses. To flesh out the concept of conjugacy, I begin by revisiting a previous example of binary trials. This and the previous examples for the Gaussian likelihood show commonalities in the conjugate pairs. I then turn to Bayesian regression with a MVN prior and examine the question of correlated inputs.

Conjugate prior-likelihood combinations

A conjugate prior distribution has the same form as the posterior distribution. Conjugacy depends on both the prior and the likelihood.

In unit 7, I pointed out that a binary (Bernoulli) outcome could be combined with a beta prior distribution to yield a posterior beta distribution. Recall further that the binomial is a generalization of the Bernoulli distribution to allow for multiple trials. The binomial distribution looks like this:

\[ binom(x | n, \theta) \propto \theta^x (1 - \theta)^{n - x} \] The right hand side has the probability of success in a given trial (\(\theta\)) taken \(x\) times (the number of ‘successes’) multiplied by the probability of failure (\(1 - \theta\)) taken \(n - x\) times (the number of failures). The Bayesian model now needs a prior distribution for \(\theta\).

Here is the model with a beta prior distribution, with the product of likelihood and prior,

\[ \begin{aligned} binom(x | n, \theta) beta(\theta | a, b) &\propto \theta^x (1 - \theta)^{n - x} \times \theta^{a-1}(1 - \theta)^{b-1} \\ &= \theta^{x + a - 1} (1 - \theta)^{n - x + b - 1} \\ &\propto beta(\theta | x + a, n - x + b) \end{aligned} \]

As I noted before, there are two parameters, each with two terms. The first term comes from the data, \(x\) and \(n - x\). The second term comes from the prior, \(a\) and \(b\). To see how they affect the posterior distribution, I could draw a picture,

n     <- 10                               # no. trials
theta <- .6
x     <- rbinom(1, n, theta)              # observations

a <- b <- .1                              # prior parameter values
tseq   <- seq(0, 1, length = 100)         # sequence of theta values
prior  <- dbeta(tseq, a, b)               # prior distribution
post   <- dbeta(tseq, x + a, n - x + b)   # posterior distribution

par(mfrow=c(2,1), bty='l', mar=c(4,5,2,4))
xn <- 0:n
bn <- dbinom( xn, n, theta) 
plot( xn, bn, xlab = 'x', ylab = 'Probability' )
segments( xn, xn*0, xn, bn)
points( x, 0, pch=16, col = 'brown', cex=2)
title( 'binomial likelihood, observed x')

plot( tseq, post, type = 'l', lwd = 2, xlab = 'theta', ylab = 'Density')
lines( tseq, prior, lwd = 2, col = 'grey')
title( 'beta prior and posterior' )

The prior and posterior beta distributions used with a binomial likelihood. The posterior distribution (black line) balances the prior distribution (grey line) and the observation (dot).

I have placed a dot at the observed \(x\). The shape of the beta distribution is determined by the magnitudes of its parameters. Each posterior parameter combines two terms, one from the data (\(x\) and \(n - x\)) and one from the prior (\(a\) and \(b\)); their relative sizes determine the weight of the prior.

Exercise 1. Modify the R code to determine how the beta distribution responds to \(a\) and \(b\).

Exercise 2. Modify the R code to determine how the parameters and \(n\) affect the weight of the prior.

Normal-normal likelihood-prior combinations

My regression question builds from the basic Gaussian model, previously discussed for a simple mean and variance. I previously noted that a normal prior distribution on the mean of a normal likelihood resulted in a normal posterior distribution. Recall that

\[ \begin{aligned} \left[\mu | \mathbf{y}, \sigma^2, m, M \right] &\propto \prod_{i=1}^n N(y_i |\mu,\sigma^2) N(\mu|m, M) \\ &\propto N \left(\mu | Vv, V \right) \end{aligned} \] where, again, we have two parameters, each with two terms, one from the data (\(n, \bar{y}\)) and one from the prior (\(m, M\)),

\[ \begin{aligned} V &= \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1} \\ v &= \frac{n\bar{y}}{\sigma^2} + \frac{m}{M} \end{aligned} \]

I pointed out that this is a weighted average of data and prior, with the inverse of variances being the weights,

\[ Vv = \frac{\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}} \]

Note how large \(n\) will swamp the prior,

\[ \underset{\scriptscriptstyle \lim n \rightarrow \infty}{Vv} \rightarrow \bar{y} \]

The prior can fight back with a tiny prior variance \(M\), \[ \underset{\scriptscriptstyle \lim M \rightarrow 0}{Vv} \rightarrow m \]
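Here is a minimal numerical sketch of that weighting (the sample size, data, and prior values below are my own choices, not part of the earlier example):

n     <- 20                                # sample size
sigma <- 1                                 # known data standard deviation
y     <- rnorm(n, 2, sigma)                # data centered near 2
m     <- 0                                 # prior mean
M     <- .5                                # prior variance
V     <- 1/( n/sigma^2 + 1/M )             # posterior variance
Vv    <- V*( n*mean(y)/sigma^2 + m/M )     # posterior mean, between mean(y) and m
c( postMean = Vv, dataMean = mean(y), priorMean = m )

Increasing n pulls postMean toward dataMean, while shrinking M pulls it toward priorMean.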

As with the beta-binomial combination for \(\theta\), this normal-normal combination for \(\mu\) is conjugate. Also as with the beta-binomial, there is a transparent contribution of data and prior parameter values to the posterior distribution. This desirable relationship extends beyond the mean to regression parameters.

Regression with a fixed variance

I want to execute a Bayesian regression with a multivariate normal prior distribution on the coefficients in \(\boldsymbol{\beta}\).

Regression extends the previous example to include predictors in a design vector \(\mathbf{x}_{ip}\) multiplied by a coefficient \(\beta_p\) that will be estimated from data, with contribution from a prior distribution. The posterior distribution I want is this:

\[ [\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}, \sigma^2, \mathbf{b}, \mathbf{B}] \propto \prod_{i=1}^n N(y_i | \mathbf{x}_i' \boldsymbol{\beta}, \sigma^2) \times MVN( \boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) \]

The multivariate normal distribution for a vector can be written as \(MVN( \boldsymbol{\beta}|\mathbf{b}, \mathbf{B})\), where \(\boldsymbol{\beta}\) and \(\mathbf{b}\) are length-\(p\) vectors, and \(\mathbf{B}\) is a \(p \times p\) covariance matrix.

It may not be obvious from the way I have written the posterior that the normal likelihood-multivariate normal prior combination is conjugate. To make this more transparent, I can write the responses and the regression mean as length-\(n\) vectors \(\mathbf{y}\) and \(\mathbf{X} \boldsymbol{\beta}\). An example can help visualize how these elements come together. First, I’ll simulate a few regression data sets, so I’ll put that in a function:

simReg <- function(n, p, s){                  # simulate a regression data set
  b       <- matrix( rnorm(p), p )            # coefficients, including intercept
  x       <- matrix( rnorm(n*p), n, p )       # design matrix
  x[,1]   <- 1                                # first column is the intercept
  epsilon <- rnorm(n, 0, s)                   # residual error
  y       <- x%*%b + epsilon                  # response
  list(x = x, y = y, beta = b, sigma = s, epsilon = epsilon)
}
sigma <- .5
n     <- 10
p     <- 3
d     <- simReg(n = n, p = p, s = sigma)
x     <- d$x
y     <- d$y
beta  <- d$beta
mu    <- x%*%beta
plot( mu, y, xlab = 'mu', ylab = 'observed')
abline( 0, 1, lty=2, col = 'grey')

The regression mean with simulated observations.

If I were rusty on matrix multiplication I could simply look at the objects x, beta, and y. The operations I have performed here could be written as a multivariate normal distribution for the full vector of \(\mathbf{y}\),

\[ MVN_n(\mathbf{y} | \mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n) \propto (\sigma^2)^{-n/2} exp \left[ -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] where \(\mathbf{I}_n\) is the identity matrix. Because the covariance matrix is \(\sigma^2 \mathbf{I}_n\), the observations are taken as independent, and this is equivalent to taking the product of \(n\) univariate likelihoods, one for each observed \(y_i\).
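A quick numerical check of this equivalence, using the simulated objects above, is sketched below; it assumes the mvtnorm package is available, which is not used elsewhere in this unit:

# joint log likelihood as a sum of univariate normal terms
sum( dnorm( y, x%*%beta, sigma, log = TRUE ) )

# the same value from the multivariate normal form (mvtnorm assumed installed)
mvtnorm::dmvnorm( as.vector(y), as.vector(x%*%beta), sigma^2*diag(n), log = TRUE )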

When I combine this \(MVN\) likelihood with the \(MVN\) prior distribution I end up with the same structure I obtained for the univariate normal distribution,

\[ \left[\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}, \sigma^2, \mathbf{b}, \mathbf{B} \right] = MVN(\boldsymbol{\beta} | \mathbf{Vv}, \mathbf{V}) \] The matrices on the right-hand side are multivariate versions of \((V, v)\) from the (univariate) normal distribution,

\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \] Recall, \(V\) was a posterior variance, with terms from the likelihood and prior. Here, \(\mathbf{V}\) is a \(p \times p\) covariance matrix, with one term contributed by the crossproduct of the design matrix, \(\mathbf{X}' \mathbf{X}\), and one by the prior covariance \(\mathbf{B}\). Similarly, \(\mathbf{v}\) is now a length-\(p\) vector with one term contributed by the crossproduct of design and observations, \(\mathbf{X}' \mathbf{y}\), and one by the prior mean \(\mathbf{b}\). Also as with the univariate case, I obtain the posterior mean with the product \(\hat{\boldsymbol{\beta}} = \mathbf{V} \mathbf{v}\).

Here again, it can help to visualize with R. I begin with the terms from the likelihood,

XX     <- crossprod(x)                  # crossproduct X'X
V      <- sigma^2*solve(XX)             # beta covariance that ignores prior
v      <- crossprod(x, y)/sigma/sigma   # X'y/sigma^2
bdata  <- V%*%v                         # estimates that ignore the prior
SEdata <- sqrt( diag(V))                # standard errors
data.frame( beta = beta, bhat = bdata, SE = SEdata )

These are the true values and the estimates, with standard errors, that I would obtain if I ignored the prior.

Here I include the prior distribution and pack it away in a function so I can reuse it later:

b    <- matrix(0, p)
B    <- diag(1, p)

mvnPost <- function(x, y, b, B, s){             # conditional posterior for beta, fixed sigma = s
  XX   <- crossprod(x)                          # X'X
  V    <- solve( XX/s/s + solve(B) )            # posterior covariance
  v    <- crossprod(x, y)/s/s + solve(B)%*%b    # X'y/sigma^2 + B^{-1}b
  list( bhat = V%*%v, SE = sqrt(diag(V)), bcov = V )
}

z <- mvnPost(x, y, b, B, sigma)
data.frame( beta = beta, bdata = bdata, SEdata = SEdata, bhat = z$bhat, SE = z$SE)

Note the estimates that include the prior (bhat) have been pulled toward the prior, which is centered on zero. There can be big differences in both sets of estimates from those used to simulate the data (beta) because n is small.
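To see the shrinkage more directly, I can compare a tight prior with a diffuse one (the prior variances below are my own choices):

zTight <- mvnPost(x, y, b, B = diag(.01, p), s = sigma)   # strong prior centered on zero
zLoose <- mvnPost(x, y, b, B = diag(100, p), s = sigma)   # nearly non-informative prior
data.frame( beta = beta, tight = zTight$bhat, loose = zLoose$bhat )

The tight prior pulls the estimates much harder toward zero than the diffuse prior does.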

To examine the effects on a larger data set, I increase the sample size, and I generate a correlated version of the design matrix x by constructing a random covariance matrix with the function .rMVN:

# simulate data with and without correlated columns in x
n    <- 100
d    <- simReg(n = n, p = 3, s = sigma)
x    <- d$x
y    <- d$y
beta <- d$beta
epsilon <- d$epsilon
C      <- cov( .rMVN(p+1, beta*0, diag(p) ) )   # covariances can be large
xc     <- .rMVN(n, b*0, C)                      # design with correlated columns
x[,1]  <- xc[,1] <- 1                           # include intercept
lims <- range( rbind(x, xc) )
plot( x[,2], x[,3], xlab = 'x_2', ylab = 'x_3', pch = 16, xlim = lims, ylim = lims)
points( xc[,2], xc[,3], col = 3, pch = 16)

The columns of the design matrix can be correlated (green) or uncorrelated (black).

mu     <- x%*%beta                              # matrix multiplication
mc     <- xc%*%beta                             # with correlation
y      <- matrix( mu + epsilon, n)              # n x 1 response
yc     <- matrix( mc + epsilon, n)              # response with correlated x
plot( mu, y, xlab = 'mu', ylab = 'observed', pch = 16)
points( mc, y, col = 3, pch = 16)
abline( 0, 1, lty=2, col = 'grey')

The design with correlated predictors generates predictions with broad scatter (green) in this plot of mean(y) and y.

# summarize posterior distributions
zu      <- mvnPost(x, y, b, B, sigma)
zc      <- mvnPost(xc, y, b, B, sigma)
buncor  <- zu$bhat
SEuncor <- zu$SE
bcor    <- zc$bhat
SEcor   <- zc$SE
data.frame( beta = beta, buncor = buncor, SEuncor = SEuncor, 
            bcor = bcor, SEcor = SEcor)

Clearly, the scatters in the predicted y are quite different. When the x columns are uncorrelated, the parameter estimates are much closer to the values used to simulate the data. Here is a plot of the posterior distribution, marginalized for each coefficient, with the values used to simulate data (dashed lines):

bseq <- seq(-2, 2, length=200)
plot(NA, xlim = range(bseq), ylim = c(0,15), xlab = 'Beta', ylab = 'Density')
for(j in 1:p){
  lines(bseq, dnorm( bseq, buncor[j], SEuncor[j]), col = 1, lwd=j)
  lines(bseq, dnorm( bseq, bcor[j], SEcor[j]), col = 3, lwd=j)
}
abline(v = beta, lty=2, lwd=1:3)
legend( 'topleft', c('Uncorrelated x (black)', 'Correlated x (green)'), 
        text.col=c(1,3), bty='n' )

Posterior marginalized for three coefficients from the correlated and uncorrelated designs. Values used to simulate data are dashed lines.

It is also clear from the formula for \(\mathbf{V}\) that the correlation structure of the design matrix will appear in the covariance of the estimates, which includes the inverse of the crossproduct \(\mathbf{X}'\mathbf{X}\). Here is a sample of parameter estimates from the posterior distribution, followed by a comparison of the correlation in x with the correlation obtained from the inverse of the estimate covariance:

bcov    <- zc$bcov
bsample <- .rMVN(1000, b*0, bcov )
plot(bsample[,2], bsample[,3], cex=.3, col = 3, xlab = 'beta_2', ylab = 'beta_3')

Scatter plot of samples from the posterior for two slope coefficients shows the negative (inverse) correlation induced by the positive correlation between the corresponding columns of x.

cor(xc[,2:3])
##           [,1]      [,2]
## [1,] 1.0000000 0.7596706
## [2,] 0.7596706 1.0000000
cov2cor(solve(zc$bcov))[2:3,2:3]
##           [,1]      [,2]
## [1,] 1.0000000 0.7610457
## [2,] 0.7610457 1.0000000

Exercise 3. Using the FIA basal area data, determine the relationship between the covariance in predictors and in the estimated coefficients. Should some predictors be omitted? Recall that basal area data are continuous with discrete zeros, so I suggest the Tobit.

Here are some suggestions:
  1. Load data from files xdataCluster.rdata and baCluster.rdata (see below).
  2. Select a species to analyze (a column in baCluster.rdata).
  3. Select predictors from xdataCluster.rdata and write a formula.
  4. Create a data.frame holding predictors and response.
  5. Fit a Tobit regression using bayesReg in the clarkFunctions2021.R.

Here I load the data:

load( 'xdataCluster.rdata', verbose = T  )
load( 'baCluster.rdata', verbose = T )
y    <- baCluster[,'pinusTaeda']
data <- data.frame( cbind(xdataCluster, y) )

Here is the covariance structure in data:

Pairs plot shows correlated predictors in data.
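The code for that figure is not shown above; one minimal way to inspect these correlations (a sketch of my own, assuming xdataCluster is a data.frame with numeric predictor columns) is:

numCols <- sapply( xdataCluster, is.numeric )            # keep numeric predictors only
pairs( xdataCluster[, numCols], pch = '.', col = 'grey' )
round( cor( xdataCluster[, numCols], use = 'complete.obs' ), 2 )   # correlation matrix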

Note that predictor pairs showing high absolute correlation (positive or negative) will have effects that are hard to identify and separate.


It is important to recognize that the simple explorations I’ve done here are possible because the likelihood-prior combination in this regression is conjugate. This relationship allowed me to solve for the posterior distribution and for quantities that can be extracted from it, such as standard errors. I have ignored the uncertainty in the variance; that comes next.


Residual variance (known \(\boldsymbol{\mu}\))

In the previous example, I solved for regression coefficients assuming that the residual variance parameter \(\sigma^2\) was known. Typically, it will not be known, so it must also be estimated. In other words, it must be on the left side of the bar in the prior and posterior distributions, \([\boldsymbol{\beta}, \sigma^2 | \cdot ]\).

Now I assume that I know the coefficients \(\boldsymbol{\beta}\) and want to estimate the residual variance \(\sigma^2\). Recall the likelihood for the normal distribution, this time only retaining the factors that include \(\sigma^2\),

\[ N(\mathbf{y}|\mu,\sigma^2) \propto \sigma^{-2(n/2)} exp\left[-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \]

A prior distribution that is commonly used is the inverse gamma,

\[ IG \left(\sigma^2 | s_1, s_2 \right) \propto \sigma^{-2(s_1 + 1)} exp\left(-s_2 \sigma^{-2} \right) \]

When I multiply likelihood and prior I get another inverse gamma distribution,

\[ IG \left(\sigma^2 | u_1, u_2 \right) \propto \sigma^{-2(u_1 + 1)} exp\left(- u_2 \sigma^{-2} \right) \]

where \(u_1 = s_1 + n/2\), and \(u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2\). Again, this is a conjugate likelihood-prior pair. Here is a prior and posterior distribution for a sample data set.

library(MCMCpack)
par(bty='n')
n  <- 10
mu <- 0                                   # known mean
y  <- rnorm(n, mu)                        # sample data
s1 <- s2 <- 1                             # prior parameter values
ss <- seq(0, 4, length=100)
u1 <- s1 + n/2
u2 <- s2 + 1/2*sum( ( y - mu )^2 ) 
plot( ss, dinvgamma(ss, u1, u2), type='l', lwd=2, xlab = 'sigma^2', ylab='Density' )
lines( ss, dinvgamma(ss, s1, s2), col='blue', lwd=2 )
legend('topright', c('prior', 'posterior'), text.col=c('blue','black'))

Prior and posterior IG distribution for a simulated data set

For a regression model, this conjugate relationship still holds.
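The only change is that the known mean becomes the regression mean, so the rate parameter is \(u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mathbf{x}_i' \boldsymbol{\beta})^2\), with \(u_1 = s_1 + n/2\) unchanged.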

Small step to Gibbs sampling

The conditional posterior distributions for coefficients and variance will be combined for posterior simulation using Gibbs sampling. To see how this will come together, consider that we can now sample \([\boldsymbol{\beta} | \sigma^2]\) and, conversely, \([\sigma^2 | \boldsymbol{\beta}]\). If we alternate these two steps repeatedly we have a simulation for their joint distribution, \([\boldsymbol{\beta}, \sigma^2]\).

Here is a simulated data set with some intermediate products that will be used to sample the posterior:

p     <- 4
n     <- 100
sigma <- .5
d <- simReg( n = n, p = p, s = sigma ) # simulated data
beta <- d$beta
x    <- d$x
y    <- d$y
  
s1  <- s2 <- .1                        # prior parameter values
b   <- beta*0
B   <- 100*diag(p)
XX  <- crossprod(x)                    # intermediate products
XY  <- crossprod(x,y)
IB  <- solve(B)
Bb  <- IB%*%b

Here are functions I can use to alternately sample from \([\boldsymbol{\beta}|\sigma^2]\) and \([\sigma^2 | \boldsymbol{\beta}]\):

sampleBeta <- function(s){          # draw a sample of coefficients
  V  <- solve( XX/s/s + IB )
  v  <- XY/s/s + Bb
  t( .rMVN(1, V%*%v, V) )
}
sampleSigma <- function(b){         # sample residual variance
  1/rgamma(1, s1 + n/2, s2 + .5*crossprod( y - x%*%b ) )
}

Here I alternately sample from the posterior distribution of \(\boldsymbol{\beta}|\sigma^2\) (the current value is bg) followed by \(\sigma^2|\boldsymbol{\beta}\) (current value sg):

bg <- b                         # initial values
sg <- 1

ng <- 5000                      # matrix and vector to hold samples
bgibbs <- matrix( NA, ng, p)
sgibbs <- rep(NA, ng)

for(g in 1:ng){                 # generate MCMC chains
  bg <- sampleBeta(sg)
  sg <- sampleSigma(bg)
  bgibbs[g,] <- bg
  sgibbs[g]  <- sg
}

With each iteration, a new sample is taken from both conditional distributions and stored in bgibbs and sgibbs.
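A quick numerical summary can be computed directly from the stored samples before plotting (the burn-in length here is my own choice):

keep  <- 100:ng                                 # discard a short burn-in
bpost <- colMeans( bgibbs[keep, ] )             # posterior means for beta
spost <- mean( sqrt( sgibbs[keep] ) )           # posterior mean for sigma (not sigma^2)
cbind( true = c(beta, sigma), estimate = c(bpost, spost) )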

Here are plots of the samples and summary histograms:

par(mfrow = c(p, 2), bty='n', mar=c(2,2,.5,1), oma=c(2,4,2,1) )
g <- 10:ng
for(j in 1:p){
  bp  <- bgibbs[g,j]
  q   <- quantile( bp, c(.025, .975) )                    # bound 95%
  lim <- range( c(beta[j], bp) )
  plot(bp, type = 'l', xlab = '', ylab = '', ylim = lim)
  abline(h = q, lty=2)
  abline(h = beta[j], col = 'white', lwd=3); abline(h = beta[j], lwd = 2, lty=2)
  
  h <- hist( bp, nclass=40, plot=F )
  plot(h$density, h$mids, type = 's', xlab = '', ylab='', ylim = lim)
  abline(h = q, lty=2); abline(h = beta[j], lwd = 2, lty=2)
}
mtext('Beta values', 2, outer=T)

Chains at left and summary distributions at right, with true values (dashed)

jags example

Here is this example in jags. The model file can be written like this:

library(rjags)
file <- "lmSimulated.txt" 

cat("model{

  # Likelihood
  for(i in 1:n){
    y[i]   ~ dnorm(mu[i],precision)
    mu[i] <- inprod(beta[],x[i,])
  }

   for (i in 1:p) {
     beta[i] ~ dnorm(0, 1.0E-5)              
   }

  # Prior for the inverse variance
  precision ~ dgamma(0.01, 0.01)
  sigma     <- 1/precision   # note: this is the variance sigma^2
}", file = file)

Here I set up the model object that jags will use for posterior sampling:

model <- jags.model(file=file, 
                    data = list(x = x, y = as.vector(y), n=nrow(x), p=ncol(x)))
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 5
##    Total graph size: 713
## 
## Initializing model

I start with 100 burn-in iterations, then sample for 2000:

update(model, 100)
jagsLm <- coda.samples(model, variable.names=c("beta","sigma"), n.iter=2000)
tmp    <- summary(jagsLm)
print(tmp$statistics)
##              Mean         SD     Naive SE Time-series SE
## beta[1] 0.9454592 0.05378751 0.0012027254   0.0012635487
## beta[2] 0.7870829 0.05415297 0.0012108973   0.0012108973
## beta[3] 0.3177121 0.04801452 0.0010736374   0.0010736374
## beta[4] 1.3081716 0.05514672 0.0012331183   0.0013553401
## sigma   0.2818505 0.04125719 0.0009225389   0.0008858519

Here are plots:

plot(jagsLm)

Posterior samples at left with density plots at right.

This example from jags should have sampled from the posterior in the same way we did (I don’t know how to find out for sure). For me, there are advantages in knowing exactly what I have done. There are not many ways to go wrong in this example, but larger models often require creative algorithms that are not accessible in software.

Recap

Bayesian analysis requires some basic distribution theory to properly combine data (likelihood) and prior information to generate a posterior distribution. Fundamental ways to characterize observations include probability density functions (continuous), probability mass functions (discrete), and distribution functions (both). The sample space defines the allowable (non-zero probability) values for a random variable. Parameter distributions are typically continuous. Integrating (continuous) or summing (discrete) over the sample space gives a probability of 1.

Distributions have moments, which are expectations for integer powers of a random variable. The first moment is the mean, and the second central moment is the variance. Higher moments include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).

A joint distribution can be factored into conditional and marginal distributions. A conditional distribution assumes a specific value for the variable that is being conditioned on. Marginalizing over a variable is done with the law of total probability. Bayes’ theorem relies on a specific factorization giving a posterior distribution in terms of likelihood and prior.

The multivariate normal distribution is commonly used as a prior distribution in regression. When combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.

An informative design does not have large positive or negative correlations between columns in \(\mathbf{X}\). These correlations can be viewed as redundancies that reduce information and influence parameter estimates.


Appendix

I used the normal distribution for regression examples without saying much about it. In Bayesian analysis it is not only used for the likelihood, but also as a prior distribution for parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve the normal distribution.

Conjugate prior-posterior distribution

For the regression model, I start with matrix notation,

\[ \mathbf{y} \sim MVN(\mathbf{X} \boldsymbol{\beta}, \Sigma) \] where \(\mathbf{y}\) is the length-\(n\) vector of responses, \(\mathbf{X}\) is the \(n \times p\) design matrix, \(\boldsymbol{\beta}\) is the length-\(p\) vector of coefficients, and \(\Sigma\) is an \(n \times n\) covariance matrix. I can write this as

\[ (2 \pi)^{-n/2} |\Sigma|^{-1/2} exp \left[ -\frac{1}{2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' \Sigma^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] Because we assume i.i.d (independent, identically distributed) \(y_i\), the covariance matrix is \(\Sigma = \sigma^2 \mathbf{I}\), and \(|\Sigma|^{-1/2} = (\sigma^2)^{-n/2}\), giving us

\[ (2 \pi)^{-n/2} (\sigma^2)^{-n/2} exp \left[ -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] This is the form of the likelihood I use to obtain the conditional posterior for regression coefficients.

The multivariate prior distribution is also multivariate normal,

\[ \begin{aligned} \left[\beta_1, \dots, \beta_p | \mathbf{b}, \mathbf{B} \right] &= MVN(\boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) \\ &= \frac{1}{(2\pi)^{p/2} det(\mathbf{B})^{1/2}} exp\left[-\frac{1}{2}(\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b}) \right] \end{aligned} \]

If there are \(p\) predictors, then \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\). The prior mean is a length-\(p\) vector \(\mathbf{b}\). The prior covariance matrix could be a non-informative diagonal matrix,

\[ \mathbf{B} = \begin{pmatrix} B & 0 & \cdots & 0 \\ 0 & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B \end{pmatrix} \]

for some large value \(B\).

As in the example for the mean of the normal distribution, I apply the “big-V, small-v” method. For matrices the exponent of \(MVN(\boldsymbol{\beta} | \mathbf{Vv}, \mathbf{V})\) is

\[ -\frac{1}{2} (\boldsymbol{\beta} - \mathbf{Vv})'\mathbf{V}^{-1}(\boldsymbol{\beta} - \mathbf{Vv}) = -\frac{1}{2} \left(\boldsymbol{\beta}'\mathbf{V}^{-1}\boldsymbol{\beta} - 2 \boldsymbol{\beta}' \mathbf{v} + \mathbf{v'Vv} \right) \]

As before I find \(\mathbf{V}\) and \(\mathbf{v}\) in the first two terms.

When I combine the regression likelihood with this prior distribution, the exponent of the multivariate normal is \(-\frac{1}{2}\) times this sum of quadratic forms,

\[ \frac{1}{\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b}) \]

Retaining only terms containing coefficients,

\[ -2 \boldsymbol{\beta}' \left( \sigma^{-2}\mathbf{X'y} + \mathbf{B}^{-1}\mathbf{b} \right) + \boldsymbol{\beta}'(\sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1})\boldsymbol{\beta} \]

I can now identify the parameter matrices \(\mathbf{V}\) and \(\mathbf{v}\),

\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \]

These determine the posterior distribution.

Taking limits as I did for the previous example, I obtain the MLE for the mean parameter vector,

\[ \underset{\scriptscriptstyle \lim n \rightarrow \infty}{\mathbf{Vv}} \rightarrow (\mathbf{X'X})^{-1}\mathbf{X'y} \]
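A numerical check of this limit reuses objects already in the workspace; the sketch below uses the simulated x, y, and sigma from the Gibbs sampling example together with a nearly flat prior of my own choosing:

bigB <- diag(1e6, p)                                   # nearly non-informative prior
Vv   <- mvnPost(x, y, b = rep(0, p), B = bigB, s = sigma)$bhat
ols  <- solve( crossprod(x) )%*%crossprod(x, y)        # (X'X)^{-1} X'y
cbind( Vv, ols )                                       # nearly identical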

Connection to maximum likelihood

Consider again the likelihood, now ignoring the prior distribution, having exponent

\[ \log L \propto - \frac{1}{2 \sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

To maximize the log likelihood I consider only these terms, because the others do not contain the coefficients \(\boldsymbol{\beta}\). I differentiate once,

\[ \frac{\partial log L}{\partial \boldsymbol{\beta}} = \sigma^{-2}\mathbf{X'y} - \sigma^{-2}\mathbf{X'X}\boldsymbol{\beta} \]

and again,

\[\frac{\partial^2 log L}{\partial \boldsymbol{\beta}^2} = -\sigma^{-2}\mathbf{X'X}\]

To obtain MLEs I set the first derivative equal to zero and solve,

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'y} \] Again, I can recognize this to be the posterior mean when the prior distribution is non-informative.

How do I estimate the parameter covariance? The matrix of curvatures, or second derivatives, is related to Fisher Information and the covariance of parameter estimates,

\[ \mathbf{I} = -\frac{\partial^2 log L}{\partial \boldsymbol{\beta}^2} = \sigma^{-2}\mathbf{X}'\mathbf{X} \]

The covariance of parameter estimates is \(\mathbf{I}^{-1} = \sigma^{2}(\mathbf{X}'\mathbf{X})^{-1}\).
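As a sketch of how these expressions line up with standard software, I can compare them with lm() for the simulated data above (the '-1' in the formula drops lm's automatic intercept, because x already contains a column of ones):

fit <- lm( as.vector(y) ~ x - 1 )                      # ordinary least squares
cbind( mle = solve( crossprod(x) )%*%crossprod(x, y), lm = coef(fit) )
sigma^2*solve( crossprod(x) )                          # parameter covariance, known sigma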