The likelihood-prior connection is the most fundamental relationship in Bayesian analysis. The prior distribution must support some of the parameter values in the likelihood, whereas the likelihood must support all possible observations. Here I apply fundamentals of distribution theory to revisit this relationship, including prior weight and conjugacy.
source('clarkFunctions2024.R')
library(MCMCpack)
library(repmis)
library(rjags)
A conjugate prior distribution has the same form as the posterior distribution. Conjugacy depends on both the prior and the likelihood.
In unit 7, I pointed out that a binary (Bernoulli) outcome could be combined with a beta prior distribution to yield a posterior beta distribution. Recall further that the binomial is a generalization of the Bernoulli distribution to allow for multiple trials. The binomial distribution looks like this:
\[ binom(x | n, \theta) \propto \theta^x (1 - \theta)^{n - x} \] The right-hand side is the probability of success in a given trial (\(\theta\)) raised to the power \(x\) (the number of ‘successes’), multiplied by the probability of failure (\(1 - \theta\)) raised to the power \(n - x\) (the number of failures). The Bayesian model now needs a prior distribution for \(\theta\).
Here is the model with a beta prior distribution, with the product of likelihood and prior,
\[ \begin{aligned} binom(x | n, \theta) beta(\theta | a, b) &\propto \theta^x (1 - \theta)^{n - x} \times \theta^{a-1}(1 - \theta)^{b-1} \\ &= \theta^{x + a - 1} (1 - \theta)^{n - x + b - 1} \\ &= beta(\theta | x + a, n - x + b) \end{aligned} \]
As I noted before, there are two parameters, each with two parts. The first part comes from the data, having exponents \(x\) and \(n - x\). The second part comes from the prior, with exponents \(a\) and \(b\). To see how they affect the posterior distribution, I could draw a picture,
n <- 10 # no. trials
theta <- .6
x <- rbinom(1, n, theta) # observations
a <- b <- .1 # prior parameter values
tseq <- seq(0, 1, length = 100) # sequence of theta values
xn <- 0:n # sequence of x values
bn <- dbinom( xn, n, theta) # likelihood
prior <- dbeta(tseq, a, b) # prior distribution
post <- dbeta(tseq, x + a, n - x + b) # posterior distribution
par(mfrow=c(2,1), bty='l', mar=c(4,5,2,4), las = 1 )
plot( xn, bn, xlab = 'x', ylab = 'Probability' ) # draw the binomial PMF
segments( xn, xn*0, xn, bn)
points( x, dbinom(x, n, theta), pch = 16, col = 'brown', cex = 2)
title( 'binomial likelihood')
plot( tseq, post, type = 'l', lwd = 2, xlab = 'theta', ylab = 'Density',
col = 'brown' ) # prior/posterior
lines( tseq, prior, lwd = 2, col = 'grey')
title( 'beta prior and posterior' )
The prior and posterior beta distributions used with a binomial likelihood. The posterior distribution (brown line) balances the prior distribution (grey line) and the observation (dot).
I have placed a dot at the observed \(x\). The shape of the beta distribution is determined by the magnitudes of its two parameters. Each posterior parameter combines a term from the data (\(x\) or \(n - x\)) with a term from the prior (\(a\) or \(b\)), and the weight of the prior depends on their relative sizes. Here the prior is non-informative, because the posterior parameters used to evaluate post are hardly affected by a and b from the prior.
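As a quick numerical check of the conjugacy claim (a sketch using the objects x, n, a, and b defined above), the product of the binomial likelihood and the beta prior should be proportional to the posterior beta density, so their ratio is constant across values of \(\theta\):
tcheck <- c(.2, .4, .6, .8)                      # a few interior theta values
dbinom(x, n, tcheck) * dbeta(tcheck, a, b) /     # likelihood x prior ...
  dbeta(tcheck, x + a, n - x + b)                # ... divided by the posterior: constant ratio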
Exercise 1. Modify the R code to determine how the beta distribution responds to \(a\) and \(b\).
Exercise 2. Modify the R code to determine how the true
value of theta used to simulate the data and \(n\) affect the weight of the prior.
Recall that the multinomial distribution is an extension of the binomial,
\[ \left[ \mathbf{y} | n, \theta \right ] \propto \prod_{j = 1}^J \theta_j^{y_j} \] with the constraints (support) \(\sum_j \theta_j = 1\) and \(\sum_j y_j = n\). These constraints mean that the last (\(J^{th}\)) class has parameter \(\theta_J = 1 - \sum_{j = 1}^{J-1} \theta_j\) and \(y_J = n - \sum_{j = 1}^{J-1} y_j\). So there are only \(J - 1\) dimensions, the \(J^{th}\) being redundant with the others. If we have \(J = 2\), then this is the binomial distribution.
Just as the binomial likelihood/beta prior is conjugate, so too is the multinomial/Dirichlet. The Dirichlet has the same form as the beta distribution, \(Dir(\theta_{j = 1, \dots, J} | a_{j = 1, \dots, J}) \propto \prod_{j = 1}^J \theta_j^{a_j - 1}\). This has the same form as the multinomial likelihood. We can multiply them to get another Dirichlet distribution:
\[ \begin{aligned} \left[ \theta | \mathbf{y}, n \right] &\propto \left[ \mathbf{y} | n, \theta_{j = 1, \dots, J} \right ][\theta_{j = 1, \dots, J} | a_{j = 1, \dots, J}] \\ &\propto \prod_{j = 1}^J \theta_j^{y_j} \times \prod_{j = 1}^J \theta_j^{a_j - 1} \\ &= \prod_{j = 1}^J \theta_j^{y_j + a_j - 1} \\ &\propto Dir( \theta | y_1 + a_1, \dots, y_J + a_J ) \end{aligned} \] This is the posterior distribution.
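Here is a minimal numerical sketch of this multinomial-Dirichlet update, using rdirichlet from the MCMCpack package loaded above (the class probabilities and prior values are illustrative, not taken from data):
thetaTrue <- c(.5, .3, .2)                       # assumed class probabilities
yCounts   <- rmultinom(1, size = 100, prob = thetaTrue)[, 1]  # one multinomial sample
aDir      <- rep(1, 3)                           # flat Dirichlet prior
colMeans( rdirichlet(5000, yCounts + aDir) )     # posterior means, close to thetaTrue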
My regression question builds from the basic Gaussian model, previously discussed for a simple mean and variance. I previously noted that a normal prior distribution on the mean of a normal likelihood resulted in a normal posterior distribution. Recall that
\[ \begin{aligned} \left[\mu | \mathbf{y}, \sigma^2, m, M \right] &\propto \prod_{i=1}^n N(y_i |\mu,\sigma^2) N(\mu|m, M) \\ &= N \left(\mu | Vv, V \right) \end{aligned} \] where, again, we have two parameters, each with two terms, one from the data (\(n, \bar{y}\)) and one from the prior (\(m, M\)),
\[ \begin{aligned} V &= \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1} \\ v &= \frac{n\bar{y}}{\sigma^2} + \frac{m}{M} \end{aligned} \]
I pointed out that this is a weighted average of data and prior, with the inverse of variances being the weights,
\[ Vv = \frac{\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}} \]
Note how large \(n\) will swamp the prior,
\[ \underset{\scriptscriptstyle \lim n \rightarrow \infty}{Vv} \rightarrow \bar{y} \]
The prior can fight back with a tiny prior variance \(M\), \[ \underset{\scriptscriptstyle \lim M \rightarrow 0}{Vv} \rightarrow m \]
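Here is a minimal numerical sketch of this weighted average (the helper function postMean and its inputs are hypothetical, not part of the analysis above):
postMean <- function(y, s2, m, M){               # Vv = (n ybar/s2 + m/M) / (n/s2 + 1/M)
  nn <- length(y)
  ( nn*mean(y)/s2 + m/M ) / ( nn/s2 + 1/M )
}
ysm <- rnorm(10, 2, 1)                           # small sample with true mean 2
postMean(ysm, s2 = 1, m = 0, M = 100)            # vague prior: close to mean(ysm)
postMean(ysm, s2 = 1, m = 0, M = 1e-4)           # tiny prior variance: pulled toward m = 0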
As with the beta-binomial combination for \(\theta\), this normal-normal combination for \(\mu\) is conjugate. Also as with the beta-binomial, there is a transparent contribution of data and prior parameter values to the posterior distribution. This desirable relationship extends beyond the mean to regression parameters.
I want to execute a Bayesian regression with a multivariate normal prior distribution on the coefficients in \(\boldsymbol{\beta}\).
Regression extends the previous example to include predictors in a design vector \(\mathbf{x}_{i} = (x_{i1}, \dots, x_{ip})\) multiplied by a vector of coefficients \(\boldsymbol{\beta} = (\beta_{1}, \dots, \beta_{p})\) that will be estimated from data, with a contribution from a prior distribution. The posterior distribution I want is this:
\[ [\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}, \sigma^2, \mathbf{b}, \mathbf{B}] \propto \prod_{i=1}^n N(y_i | \mathbf{x}_i' \boldsymbol{\beta}, \sigma^2) \times MVN( \boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) \]
The multivariate normal distribution for a vector can be written as \(MVN( \boldsymbol{\beta}|\mathbf{b}, \mathbf{B})\), where \(\boldsymbol{\beta}\) and \(\mathbf{b}\) are length-\(p\) vectors, and \(\mathbf{B}\) is a \(p \times p\) covariance matrix.
It may not be obvious from the way I have written the posterior that the normal likelihood-multivariate normal prior combination is conjugate. To make this more transparent, I can write the responses and the regression means as length-\(n\) vectors \(\mathbf{y}\) and \(\mathbf{X} \boldsymbol{\beta}\). An example can help visualize how these elements come together. First, I’ll simulate a few regression data sets, so I’ll put that in a function:
simReg <- function(n, p, s){ # simulate a regression data set
b <- matrix( rnorm(p), p) # coefficients, including intercept
x <- matrix( rnorm(n*p), n, p ) # design matrix
x[,1] <- 1 # intercept column
epsilon <- rnorm(n, 0, s) # residual error with SD s
y <- x%*%b + epsilon # response
list(x = x, y = y, beta = b, sigma = s, epsilon = epsilon)
}
Here I use the function to simulate regression data:
sigma <- .5
n <- 10
p <- 3
d <- simReg(n = n, p = p, s = sigma)
x <- d$x
y <- d$y
beta <- d$beta
mu <- x%*%beta
plot( mu, y, xlab = 'mu', ylab = 'observed', bty = 'n')
abline( 0, 1, lty=2, col = 'grey')
The regression mean with simulated observations.
If I were rusty on matrix multiplication I could simply look at the
objects x, beta, and y. The
operations I have performed here could be written as a multivariate
normal distribution for the full vector of \(\mathbf{y}\),
\[ MVN_n(\mathbf{y} | \mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n) \propto (\sigma^2)^{-n/2} exp \left[ -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] where \(\mathbf{I}_n\) is the identity matrix. Because the covariance matrix is \(\sigma^2 \mathbf{I}_n\), the observations are taken as independent, and this is equivalent to taking the product of \(n\) univariate likelihoods, one for each observed \(y_i\). Recall that zero off-diagonal elements of a covariance matrix indicate variables that do not covary.
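As a quick check (a sketch using the objects x, y, beta, sigma, and n simulated above), the MVN log density with covariance \(\sigma^2 \mathbf{I}_n\) equals the sum of the \(n\) univariate normal log densities:
logMVN <- -n/2*log(2*pi*sigma^2) - c( crossprod(y - x%*%beta) )/(2*sigma^2)  # MVN log density
logUni <- sum( dnorm(y, x%*%beta, sigma, log = TRUE) )                       # sum of univariate log densities
c(logMVN, logUni)                                                            # identical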
When I combine this \(MVN\) likelihood with the \(MVN\) prior distribution I end up with the same structure I obtained for the univariate normal distribution,
\[ \left[\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}, \sigma^2, \mathbf{b}, \mathbf{B} \right] = MVN(\boldsymbol{\beta} | \mathbf{Vv}, \mathbf{V}) \] The matrices on the right-hand side are multivariate versions of \((V, v)\) from the (univariate) normal distribution,
\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \] Recall, \(V\) was a posterior variance, with terms from the likelihood and prior. Here, \(\mathbf{V}\) is a \(p \times p\) covariance matrix, with a term contributed by the crossproduct of the design matrix \(\mathbf{X}' \mathbf{X}\) and a term from the prior, \(\mathbf{B}\). Similarly, \(\mathbf{v}\) is now a length-\(p\) vector with a term contributed by the crossproduct of design and observations \(\mathbf{X}' \mathbf{y}\) and prior mean \(\mathbf{b}\). Also as with the univariate case, I obtain the posterior mean with the product \(\hat{\boldsymbol{\beta}} = \mathbf{V} \mathbf{v}\).
Here again, it can help to visualize with R. I begin with the terms from the likelihood,
XX <- crossprod(x) # crossproduct X'X
V <- sigma^2*solve(XX) # beta covariance that ignores prior
v <- crossprod(x, y)/sigma/sigma
bhat <- V%*%v
bse <- sqrt( diag(V))
estimates <- data.frame( beta = beta, bhat = bhat, SE = bse )
These are the true values and the estimates, with standard errors in
the estimates data.frame:
| true values | bhat | SE |
|---|---|---|
| 2.0265 | 2.0448 | 0.1626 |
| -0.7038 | -0.5665 | 0.2288 |
| 1.6889 | 1.4809 | 0.1749 |
Here I include the prior distribution and pack it away in a function so I can reuse it later:
b <- matrix(0, p)
B <- diag(1, p)
mvnPost <- function(x, y, s, b = NULL, B = NULL ){
XX <- crossprod(x)
if( is.null( b ) )b <- matrix( 0, ncol(x) )
if( is.null( B ) )B <- nrow(x)*s*solve(XX) # Zellner's g prior
V <- solve( XX/s/s + solve(B) )
v <- crossprod(x, y)/s/s + solve(B)%*%b
list( bhat = V%*%v, SE = sqrt(diag(V)), bcov = V )
}
z <- mvnPost(x, y, sigma, b, B)
withPrior <- data.frame( beta = beta, priorMu = b, bhat = z$bhat, SE = z$SE)
| true values | prior mean | bhat | SE |
|---|---|---|---|
| 2.0265 | 0 | 1.9873 | 0.1603 |
| -0.7038 | 0 | -0.5173 | 0.2229 |
| 1.6889 | 0 | 1.4346 | 0.1723 |
The estimates that include the prior (bhat) can be pulled toward the prior mean, which is zero. Both sets of estimates can differ substantially from the values used to simulate the data (beta) because n is small.
To examine the effects on a larger data set, I increase the sample
size, and I generate a correlated version of the design matrix
x by constructing a random covariance matrix with the
function .rMVN:
# simulate data with and without correlated columns in x
n <- 100
d <- simReg(n = n, p = 3, s = sigma)
x <- d$x
y <- d$y
beta <- d$beta
error <- d$epsilon
C <- cov( .rMVN(p+1, beta*0, diag(p) ) ) # covariances can be large
xc <- .rMVN(n, b*0, C) # design with correlated columns
x[,1] <- xc[,1] <- 1 # include intercept
lims <- range( rbind(x, xc) )
plot( x[,2], x[,3], xlab = 'x_2', ylab = 'x_3', pch = 16,
xlim = lims, ylim = lims, bty = 'n' )
points( xc[,2], xc[,3], col = 3, pch = 16) # larger covariance than x
The columns of the design matrix can be correlated (green points show the correlated design xc).
mu <- x%*%beta # matrix multiplication
mc <- xc%*%beta # with correlation
y <- matrix( mu + error, n) # n x 1 response
yc <- matrix( mc + error, n) # response with correlated x
plot( mu, y, xlab = 'mu', ylab = 'observed', pch = 16, bty = 'n' )
points( mc, y, col = 3, pch = 16)
abline( 0, 1, lty=2, col = 'grey')
The design with correlated predictors can generate predictions with broad scatter (green) in this plot of mean(y) and y.
# summarize posterior distributions
zu <- mvnPost(x, y, sigma, b, B)
zc <- mvnPost(xc, y, sigma, b, B)
buncor <- zu$bhat
SEuncor <- zu$SE
bcor <- zc$bhat
SEcor <- zc$SE
betaEstimates <- data.frame( beta = beta, buncor = buncor, SEuncor = SEuncor,
bcor = bcor, SEcor = SEcor)
| true values | bhat (uncorrelated x) | SE (uncorrelated) | bhat (correlated x) | SE (correlated) |
|---|---|---|---|---|
| -0.5282 | -0.4976 | 0.0503 | -0.2603 | 0.0515 |
| 1.7965 | 1.8885 | 0.0537 | 0.0071 | 0.0352 |
| -1.5845 | -1.5628 | 0.0479 | 0.1020 | 0.0284 |
Clearly, the scatters in the predicted y are quite
different. When the x columns are uncorrelated, the
parameter estimates are closer to the values used to simulate the data.
Here is a plot of the posterior distribution, marginalized for each
coefficient, with the values used to simulate data (dashed lines):
bseq <- seq(-3, 3, length=200)
plot(NA, xlim = range(bseq), ylim = c(0,15), xlab = 'Beta', ylab = 'Density', bty = 'n')
for(j in 1:p){
lines(bseq, dnorm( bseq, buncor[j], SEuncor[j]), col = 1, lwd=j)
lines(bseq, dnorm( bseq, bcor[j], SEcor[j]), col = 3, lwd=j)
}
abline(v = beta, lty=2, lwd=1:3)
legend( 'topleft', c('Uncorrelated x', 'Correlated x'),
text.col=c(1,3), bty='n' )
Posterior marginalized for three coefficients from the correlated and uncorrelated designs. Values used to simulate data are dashed lines.
It is also clear from the formula for \(\mathbf{V}\) that the correlation structure of the design matrix will appear in the covariance of the estimates, which includes the inverse of the crossproduct \(\mathbf{X}'\mathbf{X}\). Here is a sample of parameter estimates from the posterior distribution, with a comparison of the inverted correlation in the estimates and the correlation in x:
bcov <- zc$bcov
bsample <- .rMVN(1000, b*0, bcov )
plot(bsample[,2], bsample[,3], cex=.3, col = 3,
bty = 'n', xlab = 'beta_2', ylab = 'beta_3')
Scatter plot of samples from the posterior for two slope
coefficients show the inverse correlation that occurs within columns of
x.
Here is the correlation in the design matrix:
cor( xc[,2:3] )
| x_2 | x_3 | |
|---|---|---|
| x_2 | 1.000 | -0.224 |
| x_3 | -0.224 | 1.000 |
Here is the correlation implied by the inverse covariance of the coefficients:
cov2cor( solve(zc$bcov) )[2:3,2:3]
| x_2 | x_3 | |
|---|---|---|
| x_2 | 1.000 | -0.212 |
| x_3 | -0.212 | 1.000 |
The posterior covariance in \(\boldsymbol{\beta}\) means that the model can have limited capacity to discriminate between the effects of these two predictors.
Exercise 3. Using the FIA basal area data, determine the relationship between the covariance in predictors and in the estimated coefficients. Should some predictors be omitted? Recall that basal area data are continuous with discrete zeros, so I suggest the Tobit.
Here are some suggestions:
- Load the data from the files xdataCluster.rdata and baCluster.rdata (see below).
- Select a response (e.g., a species column in baCluster.rdata).
- Select predictors from xdataCluster.rdata and write a formula.
- Fit the model with the function bayesReg in the file clarkFunctions2022.R.
Here I load the data:
source_data("https://github.com/jimclarkatduke/gjam/blob/master/xdataCluster.rdata?raw=True")
source_data("https://github.com/jimclarkatduke/gjam/blob/master/baCluster.rdata?raw=True")
y <- baCluster[,'pinusTaeda']
data <- data.frame( cbind(xdataCluster, y) )
Here is the covariance structure in data:
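The code that produced this figure is not shown; a minimal sketch that could generate a similar view (assuming the columns of data can be treated as numeric) is:
pairs( data, pch = 16, cex = .3 )                # scatterplot matrix of predictors and response
cor( data[ , sapply(data, is.numeric) ] )        # correlations among the numeric columns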
Pairs plot shows correlated predictors in data.
Note that predictor pairs showing high absolute correlation (positive or negative) will have effects that are hard to identify and separate.
It is important to recognize that the simple explorations I’ve done here are possible because the likelihood-prior combination in this regression is conjugate. This relationship allowed me to solve for the posterior distribution and things that can be extracted from it, such as the standard errors. I have ignored the uncertainty in the variance; that comes next.
In the previous example, I solved for regression coefficients assuming that the residual variance parameter \(\sigma^2\) was known. Typically, it will not be known, so it must also be estimated. In other words, it must be on the left side of the bar in the prior and posterior distributions, \([\boldsymbol{\beta}, \sigma^2 | \cdot ]\).
Now I assume that I know the coefficients \(\boldsymbol{\beta}\) and want to estimate the residual variance \(\sigma^2\). Recall the likelihood for the normal distribution, this time only retaining the factors that include \(\sigma^2\),
\[ N(\mathbf{y}|\mu,\sigma^2) \propto \sigma^{-2(n/2)} exp\left[-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \]
(Note: previously I wrote this likelihood in matrix form; the two forms are equivalent.) A commonly used prior distribution is the inverse gamma,
\[ IG \left(\sigma^2 | s_1, s_2 \right) \propto \sigma^{-2(s_1 + 1)} exp\left(-s_2 \sigma^{-2} \right) \]
When I multiply likelihood and prior I get another inverse gamma distribution,
\[ IG \left(\sigma^2 | u_1, u_2 \right) \propto \sigma^{-2(u_1 + 1)} exp\left(- u_2 \sigma^{-2} \right) \]
where \(u_1 = s_1 + n/2\), and \(u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2\). Again, this is a conjugate likelihood-prior pair. Here is a prior and posterior distribution for a sample data set.
par( bty='n', las = 1 )
n <- 10
y <- rnorm(n)
s1 <- s2 <- 1
ss <- seq(0, 4, length=100)
u1 <- s1 + n/2
u2 <- s2 + 1/2*sum( ( y - mean(y) )^2)
plot( ss, dinvgamma(ss, u1, u2), type='l', lwd=2, xlab = 'sigma^2', ylab='Density' )
lines( ss, dinvgamma(ss, s1, s2), col='blue', lwd=2 )
legend('topright', c('prior', 'posterior'), text.col=c('blue','black'), bty = 'n' )
Prior and posterior IG distribution for a simulated data set
For a regression model, this conjugate relationship still holds.
The conditional posterior distributions for coefficients and variance will be combined for posterior simulation using Gibbs sampling. To see how this will come together, consider that we can now sample \([\boldsymbol{\beta} | \sigma^2]\) and, conversely, \([\sigma^2 | \boldsymbol{\beta}]\). If we alternate these two steps repeatedly we have a simulation for their joint distribution, \([\boldsymbol{\beta}, \sigma^2]\).
Here is a simulated data set with some intermediate coefficients that will be used to sample the posterior:
p <- 4
n <- 100
sigma <- .5
d <- simReg( n = n, p = p, s = sigma ) # simulated data
beta <- d$beta
x <- d$x
y <- d$y
s1 <- s2 <- .1 # prior parameter values
b <- beta*0
XX <- crossprod(x) # intermediate products
XY <- crossprod(x,y)
B <- n*sigma*solve( XX ) # Zellner g prior
IB <- solve(B)
Bb <- IB%*%b
Here are functions I can use to alternately sample from \([\boldsymbol{\beta}|\sigma^2]\) and \([\sigma^2 | \boldsymbol{\beta}]\):
sampleBeta <- function(s){ # draw a sample of coefficients
V <- solve( XX/s/s + IB )
v <- XY/s/s + Bb
t( .rMVN(1, V%*%v, V) )
}
sampleSigma <- function(b){ # sample residual variance from its inverse gamma conditional
1/rgamma(1, s1 + n/2, s2 + .5*crossprod( y - x%*%b ) )
}
Here I alternately sample from the posterior distribution of \(\boldsymbol{\beta}|\sigma^2\) (the current
value is bg) followed by \(\sigma^2|\boldsymbol{\beta}\) (current
value sg):
bg <- b # initial values
sg <- 1
ng <- 5000 # matrix and vector to hold samples
bgibbs <- matrix( NA, ng, p)
sgibbs <- rep(NA, ng)
for(g in 1:ng){ # generate MCMC chains
bg <- sampleBeta(sg)
sg <- sampleSigma(bg)
bgibbs[g,] <- bg
sgibbs[g] <- sg
}
With each iteration, a new sample is taken from both conditional
distributions and stored in bgibbs and
sgibbs.
Here are plots of the samples and summary histograms:
par(mfrow = c(p, 2), bty='n', mar=c(2,2,.5,1), oma=c(2,4,2,1) )
for(j in 1:p){
bp <- bgibbs[,j]
q <- quantile( bp, c(.025, .975) ) # bound 95%
lim <- range( c(beta[j], bp) )
plot(bp, type = 'l', xlab = '', ylab = '', ylim = lim)
abline(h = q, lty=2)
abline(h = beta[j], col = 'white', lwd=3)
abline(h = beta[j], lwd = 2, lty=2)
h <- hist( bp, nclass=40, plot=F )
plot(h$density, h$mids, type = 's', xlab = '', ylab='', ylim = lim)
abline(h = q, lty=2)
abline(h = beta[j], lwd = 2, lty=2)
}
mtext('Beta values', 2, outer=T, las = 0, line = 1)
Chains at left and summary distributions at right, with true values (dashed)
Here is this example in jags. The model file can be written like this:
file <- "lmSimulated.txt"
cat("model{
# Likelihood
for(i in 1:n){
y[i] ~ dnorm(mu[i],precision)
mu[i] <- inprod(beta[],x[i,])
}
for (i in 1:p) {
beta[i] ~ dnorm(0, 1.0E-5)
}
# Prior for the inverse variance
precision ~ dgamma(0.01, 0.01)
sigma <- 1/precision  # note: 1/precision is the variance, sigma^2
}", file = file)
Here I set up the model object used for posterior sampling:
model <- jags.model(file=file,
data = list(x = x, y = as.vector(y), n=nrow(x), p=ncol(x)))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 100
## Unobserved stochastic nodes: 5
## Total graph size: 713
##
## Initializing model
I start with 100 burnin iterations, then sample for 2000:
update(model, 100)
jagsLm <- coda.samples(model, variable.names=c("beta","sigma"), n.iter=2000)
tmp <- summary(jagsLm)
print(tmp$statistics)
## Mean SD Naive SE Time-series SE
## beta[1] 0.7464039 0.05231721 0.001169848 0.0011698483
## beta[2] -0.2086795 0.05540322 0.001238854 0.0012587956
## beta[3] 0.8921874 0.06248059 0.001397109 0.0013971086
## beta[4] 1.9228684 0.04979419 0.001113432 0.0011533962
## sigma 0.2843073 0.04177950 0.000934218 0.0009663679
Here are plots:
plot( jagsLm, bty = 'n')
Posterior samples at left with density plots at right.
This example from jags should have sampled from the posterior in the same way we did (I don’t know how to find out for sure). For me, there are advantages in knowing exactly what I have done. There are not many ways to go wrong in this example, but larger models often require creative algorithms that are not accessible in software.
Bayesian analysis requires some basic distribution theory to properly combine data (likelihood) and prior information to generate a posterior distribution. Fundamental ways to describe observations include probability density functions (continuous), probability mass functions (discrete), and cumulative distribution functions (both). The sample space defines the allowable (non-zero probability) values for a random variable. Parameter distributions are typically continuous. Integrating (continuous) or summing (discrete) over the sample space gives a probability of 1.
Distributions have moments, which are expectations for integer powers of a random variable. The first moment is the mean, and the second central moment is the variance. Higher moments include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).
A joint distribution can be factored into conditional and marginal distributions. A conditional distribution assumes a specific value for the variable that is being conditioned on. Marginalizing over a variable is done with the law of total probability. Bayes’ theorem relies on a specific factorization giving a posterior distribution in terms of likelihood and prior.
The multivariate normal distribution is commonly used as a prior distribution in regression. When combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.
An informative design does not have large positive or negative correlations between columns in \(\mathbf{X}\). These correlations can be viewed as redundancies that reduce information and influence parameter estimates.
The conjugacy for the mean parameter of the normal distribution can be checked. Notice that the normal distribution has this form:
\[ N(\mu | Vv, V) \propto exp \left[ -\frac{1}{2V} (\mu - Vv)^2 \right] \] In this form, \(V\) is the variance, and \(Vv\) is the mean parameter of the distribution. Expanding the exponent gives
\[ -\frac{1}{2V} (\mu - Vv)^2 = -\frac{1}{2} \left( \frac{\mu^2}{V} - 2\mu v + Vv^2 \right) \] Dropping the last term (because it does not contain \(\mu\)), I see that whatever multiplies \(\mu^2\) inside the parentheses must be \(V^{-1}\) and whatever multiplies \(-2\mu\) must be \(v\).
So I have
\[ N(\mu | Vv, V) \propto exp \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mu - y_i)^2 \right] \] with the exponent expanding to
\[ \begin{aligned} -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mu - y_i)^2 &= -\frac{1}{2\sigma^2} \left[ n \mu^2 - 2 \mu \sum_{i=1}^n y_i + \sum_{i=1}^n y_i^2 \right] \\ &= -\frac{1}{2} \left[ \frac{n\mu^2}{\sigma^2} - \frac{2 \mu n \bar{y}}{\sigma^2} + \frac{n \overline{y^2}}{\sigma^2} \right] \end{aligned} \] Again, I omit the last term, because it does not include \(\mu\). I identify these two parameters
\[ \begin{aligned} V &= \frac{\sigma^2}{n} \\ v &= \frac{n \bar{y}}{\sigma^2} \end{aligned} \] The mean of this distribution is transparently \(Vv = \bar{y}\) with standard error equal to \(\sqrt{V} = \sigma/\sqrt{n}\).
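A small numerical sketch of these identities (the variable names here are arbitrary, not part of the analysis above):
sig  <- 2; nobs <- 25                            # assumed known sigma and sample size
yy   <- rnorm(nobs, 3, sig)                      # simulated sample
V    <- sig^2/nobs                               # posterior variance (flat prior limit)
v    <- nobs*mean(yy)/sig^2
c( Vv = V*v, ybar = mean(yy), sqrtV = sqrt(V), se = sig/sqrt(nobs) )  # Vv = ybar, sqrt(V) = SE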
I used the normal distribution for regression examples without saying much about it. In Bayesian analysis it is not only used for the likelihood, but also as a prior distribution for parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve the normal distribution.
For the regression model, I start with matrix notation,
\[ \mathbf{y} \sim MVN(\mathbf{X} \boldsymbol{\beta}, \Sigma) \] where \(\mathbf{y}\) is the length-\(n\) vector of responses, \(\mathbf{X}\) is the \(n \times p\) design matrix, \(\boldsymbol{\beta}\) is the length-\(p\) vector of coefficients, and \(\Sigma\) is an \(n \times n\) covariance matrix. I can write this as
\[ (2 \pi)^{-n/2} |\Sigma|^{-1/2} exp \left[ -\frac{1}{2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' \Sigma^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] Because we assume i.i.d (independent, identically distributed) \(y_i\), the covariance matrix is \(\Sigma = \sigma^2 \mathbf{I}\), and \(|\Sigma|^{-1/2} = (\sigma^2)^{-n/2}\), giving us
\[ (2 \pi)^{-n/2} (\sigma^2)^{-n/2} exp \left[ -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] This is the form of the likelihood I use to obtain the conditional posterior for regression coefficients.
The multivariate prior distribution is also multivariate normal,
\[ \begin{aligned} \left[\beta_1, \dots, \beta_p | \mathbf{b}, \mathbf{B} \right] &= MVN(\boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) \\ &= \frac{1}{(2\pi)^{p/2} det(\mathbf{B})^{1/2}} exp\left[-\frac{1}{2}(\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b}) \right] \end{aligned} \]
If there are \(p\) predictors, then \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\). The prior mean is a length-\(p\) vector \(\mathbf{b}\). The prior covariance matrix could be a non-informative diagonal matrix,
\[ \mathbf{B} = \begin{pmatrix} B & 0 & \cdots & 0 \\ 0 & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B \end{pmatrix} \]
for some large value \(B\).
As for the example for the mean of the normal distribution, I apply the “big-V, small-v” method. For matrices, the exponent of \(MVN(\boldsymbol{\beta} | \mathbf{Vv}, \mathbf{V})\) is
\[ -\frac{1}{2} (\boldsymbol{\beta} - \mathbf{Vv})'\mathbf{V}^{-1}(\boldsymbol{\beta} - \mathbf{Vv}) = -\frac{1}{2} \left(\boldsymbol{\beta}'\mathbf{V}^{-1}\boldsymbol{\beta} - 2 \boldsymbol{\beta}' \mathbf{v} + \mathbf{v'Vv} \right) \]
As before I find \(\mathbf{V}\) and \(\mathbf{v}\) in the first two terms.
When I combine the regression likelihood with this prior distribution, I have an exponent on the multivariate normal distribution (omitting the \(-1/2\) coefficient) that looks like this,
\[ \frac{1}{\sigma^2} (y - \mathbf{X}\boldsymbol{\beta})'(y - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b}) \]
Retaining only terms containing \(\boldsymbol{\beta}\) coefficients,
\[ -2 \boldsymbol{\beta}' \left( \sigma^{-2}\mathbf{X'y} + \mathbf{B}^{-1}\mathbf{b} \right) + \boldsymbol{\beta}'(\sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1})\boldsymbol{\beta} \]
I can now identify \(\mathbf{V}\) and \(\mathbf{v}\),
\[ \begin{aligned} \mathbf{V} &= \left( \sigma^{-2}\mathbf{X'X} + \mathbf{B}^{-1} \right)^{-1} \\ \mathbf{v} &= \sigma^{-2} \mathbf{X'y}+ \mathbf{B}^{-1}\mathbf{b} \end{aligned} \]
These determine the posterior distribution.
Taking limits as I did for the previous example, I obtain the MLE for the mean parameter vector,
\[ \underset{\scriptscriptstyle \lim n \rightarrow \infty}{\mathbf{Vv}} \rightarrow (\mathbf{X'X})^{-1}\mathbf{X'y} \]
Consider again the likelihood, now ignoring the prior distribution, having exponent
\[ \log L \propto - \frac{1}{2 \sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]
To maximize the log likelihood I consider only these terms, because the others do not contain the coefficients \(\boldsymbol{\beta}\). I differentiate once,
\[ \frac{\partial log L}{\partial \boldsymbol{\beta}} = \sigma^{-2}\mathbf{X'y} - \sigma^{-2}\mathbf{X'X}\boldsymbol{\beta} \]
and again,
\[\frac{\partial^2 log L}{\partial \boldsymbol{\beta}^2} = -\sigma^{-2}\mathbf{X'X}\]
To obtain MLEs I set the first derivative equal to zero and solve,
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'y} \] Again, I can recognize this to be the posterior mean when the prior distribution is non-informative.
How do I estimate the parameter covariance? The matrix of curvatures, or second derivatives, is related to Fisher Information and the covariance of parameter estimates,
\[ \mathbf{I} = -\frac{\partial^2 log L}{\partial \boldsymbol{\beta}^2} = \sigma^{-2}\mathbf{X}'\mathbf{X} \]
The covariance of parameter estimates is \(cov( \hat{\boldsymbol{\beta}} ) = \mathbf{I}^{-1} = \sigma^{2}(\mathbf{X}'\mathbf{X})^{-1}\).
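As a sketch, this covariance can be compared with the standard errors reported by lm() for a simulated data set (lm() uses the estimated residual standard deviation rather than the true value, so the two agree approximately rather than exactly):
dd  <- simReg(n = 200, p = 3, s = .5)            # simulated data, true sigma = .5
seF <- sqrt( diag( .5^2 * solve( crossprod(dd$x) ) ) )   # from sigma^2 (X'X)^{-1}
fit <- lm( dd$y ~ dd$x - 1 )                     # -1 because column 1 of x is already the intercept
cbind( formula = seF, lm = summary(fit)$coefficients[, 'Std. Error'] )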
We can go a step further. For transparency, suppose that the design matrix is standardized (subtract the mean, then divide by the standard deviation), so that columns of \(\mathbf{X}\) have mean zero and unit variance. Then \(\mathbf{X}'\mathbf{X} = n \times cov( \mathbf{X} )\), and, when the columns are also uncorrelated, the diagonal of the coefficient covariance has elements \(diag \left( cov( \hat{\boldsymbol{\beta}} ) \right)_{j} = \sigma^2/\left( n \, var( \mathbf{x}_j ) \right)\).
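A quick check of this claim with a standardized design (illustrative values; no objects from the analysis above are used):
ns <- 1000
xs <- scale( matrix( rnorm(ns*3), ns, 3 ) )      # standardized, nearly uncorrelated columns
diag( .5^2 * solve( crossprod(xs) ) )            # approximately sigma^2/(n var(x_j))
.5^2/ns                                          # = .25/1000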
How does this MLE relate to the Bayesian posterior?