Chapter 10: Model Comparison and Hierarchical Modeling

10.1 General formula and the bayes factor

\(m\) = 1 for model 1
\(m\) = 2 for model 2 and so on.
\(P_m\)(y | \(\theta\)\(_m, m)\) ← likelihood function
\(P_m\)(\(\theta\)\(_m, | m)\) ← prior denotation
- Probability density includes a subscript which expresses that different models might involve different distributional forms.
- Bayes’ rule says that the posterior probability of model m is
  - - This is the probability of the data within the model across all possible parameter values.
  - One consequence of Bayesian model comparison is it can be very sensitive to prior choice.
  - - The underbraced region marked with “BF” denotes the Bayes factor for models 1 and 2.
      - The Bayes factor (BF) is the ratio of the probabilities of the data in models 1 and 2.

10.2 Example: two factories of coins

Tail biased factory generates with biases distributed around a mode of \(\omega_1\) = 0.25, with a consistency (or concentration) of \(_K\) = 12, so that the biases are distributed as \(\theta\) ~ beta(\(\theta\) | \(\omega_1\) (\(_K\) - 2) + 1, (1 - \(\omega_1\)(\(_K\) - 2) +1) = beta(\(\theta\) | 3.5, 8.5)
The head biased factor are distributed with a mode of \(\omega_2\) = 0.75, again with a consistency of \(_K\) = 12, such that the biases of coins from the factory are distributed as \(\theta\) ~ beta(\(\theta\) | \(\omega_2\) (\(_K\) - 2) + 1, (1 - \(\omega_2\)(\(_K\) - 2) +1) = beta(\(\theta\) | 8.5, 3.5)
Figure 10.2: Hiarchical diagram for two models of a coin. Model 1 is a tail-biased mint; model 2 is a head-biased mint. This diagram is a specific case of the right panel of figure 10.1, because the likelihood function is the same for both models and only the priors are different

10.3: Two factory coin flip bias. Solution by MCMC

For large and complex models, grid approximation or formal analysis is not possible. Two approaches.

Computes \(P(D|m)\) using MCMC for individual models.
Puts the models together into a hierarchy as diagrammed in Figure 10.1, and the MCMC process visits different values of the model index proportionally to their posterior probabilities.

10.3.1: Nonhierarchical MCMC computation of each model’s marginal likelihood.

source("Jags-Ydich-Xnom1subj-MbernBeta.R")

## 
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************

## Warning: package 'rjags' was built under R version 4.0.3

## Loading required package: coda

## Linked to JAGS 4.3.0

## Loaded modules: basemod,bugs

## Warning: package 'runjags' was built under R version 4.0.3

myData = c(rep(0, 9-6), rep(1,6))
mcmcCoda = genMCMC(data = myData, numSavedSteps = 10000)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 9
##    Unobserved stochastic nodes: 1
##    Total graph size: 21
## 
## Initializing model
## 
## Burning in the MCMC chain...
## Sampling final MCMC chain...

theta = as.matrix(mcmcCoda[, "theta"])

meanTheta = mean(theta)
sdTheta = sd(theta)
aPost = meanTheta * (meanTheta*(1-meanTheta)/sdTheta^2-1)
bPost = (1-meanTheta) * (meanTheta*(1-meanTheta)/sdTheta^2-1)
oneOverPD = mean( dbeta( theta , aPost , bPost ) /
( theta^6*(1-theta)^(9-6)
* dbeta( theta , 0.75*(12-2)+1 , (1-0.75)*(12-2)+1 ) ) )
PD = 1/oneOverPD
show(PD)

## [1] 0.002337505

10.3.2: Hierarchical MCMC computation of relative model probability

Following figure 10.2, top level parameter is the index across models.

Figure 10.4: The prior and posterior distributions for scripts jags-ydich-xnom1subj-MbernBetaModelComp.R. The upper frame, which shows the prior distribution, has labels that indicate \[p(\theta|D)\] but the data set, \(D\), is empty. The lower frame shows the posterior distribution

Bernoulli likelihood function and beta prior distributions are only specified once, even though they participate in both models.
The posterior probability of m = 1 is about 18%. Meaning that m = 1 was visited on only about 18% of the steps.
- The histogram of \(\theta_1\), is based on only 18% of the chain, while the histogram of \(\theta_2\) is based on the complementary 82% of the chain.
- The losing model would have very few representative values in the MCMC sample.

10.3.2.1: Using Pseudo-priors to reduce autocorrelation

using two distinct beta distributions that generate distinct \(\theta_m\) values. Resulting in a method that is more faithful to the diagram in figure 10.2.

Three parameters in this method:

m
\(\theta_1\)
\(\theta_2\)

Only the \(\theta_m\) of the sampled model index m is actually used to describe the data.

The values for the pseudo-priors are determined as follows:

Do an initial run of the analysis with the pseudo-prior set to the true prior.
Set the pseudo-prior constants to values that mimic the currently estimated posterior.

10.5: Model complexity naturally accounted for

A complex model can find some combination of its parameter values that match the data better than the simpler model.
Bayesian model comparison compensates for model complexity by the fact that each model must have a prior distribution over its parameters, and more complex models must dilute their prior distributions over larger parameter spaces than simpler ones.
“Must-be-fair” model:
- One model for a factory creates coins that are fair.
  - Modal bias is \(\omega_s\) = 0.5, and its concentration is \(k_s\) = 1,000.
  - This model is simple insofar as it has limited options for describing data: all the \(\theta\) values for individual coins must be very nearly 0.5.
“Anything’s-possible” model:
- Creates coins of any possible bias, all with equal probability.
  - Central coin bias is \(\omega_c\) = 0.5, its consistency is slight, with \(k_c\) = 2.
  - This model is complex because it has many options for describing data: The individual coins can be anything between 0 and 1.
Coin flip N = 20, z = 15 heads.
- Indication that there is a bias towards heads.
  - Anything’s-model.

Figure 10.5b: Lower part shows the posterior distribution on the parameters.

Penultimate row, showing \(\theta_1\) when m = 1. This represents the actual posterior distribution on \(\theta_1\), when m = 2.

Figure 10.6: Shows the results when running the model with the new specification of the pseudopriors * Note: The pseudo-priors nicely mimic the actual posteriors. + Because the ESS of the model index is far better than in the initial run, we trust its estimate in this run more than in the initial run. + The posterior probability of m = 2 is very nearly 92%.

mcmcMat = as.matrix( codaSamples )

m = mcmcMat[,"m"]

theta1M1 = mcmcMat[,"theta1"][ m ==1 ]
theta1M2 = mcmcMat[,"theta1"][ m ==2 ] 
theta2M1 = mcmcMat[,"theta2"][ m ==1 ] 
theta2M2 = mcmcMat[,"theta2"][ m ==2]

The vector contains a sequence of 1’s and 2’s indicating which model is used at each step.
Increasing the length of the chain we can then increase the nmumber of values. However, doing so causes the program to run slower.
- Solution would be to make JAGS sample more equally between models by changing the prior on the models to compensate for their relative credibilities.

10.3.3: Models with different “noise” distributions in JAGS

\(p(D|\theta)\), is the same for all models.
- Sometimes called the “noise” distribution because it describes the random variability of the data values around the underlying trend.

10.7: Exercises

10.1: [Purpose: To illustrate the fact that models with more distinctive predictions can be more easily discriminated.]

A.)

pD = function(z,N,a,b) { exp( lbeta(z+a,N-z+b) - lbeta(a,b) ) }

omega_1 <- 0.25
omega_2 <- 0.75
kappa <- 6

z <- 7
N <- 10

a_1 <- omega_1 * (kappa - 2) + 1 
b_1 <- (1 - omega_1) * (kappa - 2) +1 
a_2 <- omega_2 * (kappa - 2) + 1
b_2 <- (1 - omega_2) * (kappa - 2) +1 

pd1 <- pD(z = z, N = N, a = a_1, b = b_1)
pd2 <- pD(z = z, N = N, a = a_2, b = b_2)

bf1 <- pd1/pd2

p1 <- .5
p2 <- 1 - p1

bfPrior = (pd1/pd2)*(p1/p2)
p1gd = bfPrior/(1.0+bfPrior)
p2gd = 1.0-p1gd

show(bf1)

## [1] 0.3333333

show(p1gd)

## [1] 0.25

show(p2gd)

## [1] 0.75

B.)

pD = function(z,N,a,b) { exp( lbeta(z+a,N-z+b) - lbeta(a,b) ) }

omega_1 <- 0.25
omega_2 <- 0.75
kappa <- 202

z <- 7
N <- 10

a_1 <- omega_1 * (kappa - 2) + 1 
b_1 <- (1 - omega_1) * (kappa - 2) +1 
a_2 <- omega_2 * (kappa - 2) + 1
b_2 <- (1 - omega_2) * (kappa - 2) +1 

pd1 <- pD(z = z, N = N, a = a_1, b = b_1)
pd2 <- pD(z = z, N = N, a = a_2, b = b_2)

bf1 <- pd1/pd2

p1 <- .5
p2 <- 1 - p1

bfPrior = (pd1/pd2)*(p1/p2)
p1gd = bfPrior/(1.0+bfPrior)
p2gd = 1.0-p1gd

show(bf1)

## [1] 0.01621596

show(p1gd)

## [1] 0.0159572

show(p2gd)

## [1] 0.9840428

C.)

That is because with the larger kappa there is a larger number of consistencies.

10.2: [Purpose: To be sure you really understand the JAGS program for Figure 10.4.]

A.)

Figure computed from file

B.)

Figure 10.2

C.)

Figure for a Figure for b

10.3: [Purpose: To get some hands-on experience with pseudopriors.]

A.)

Figure for a_1 Figure for a_2

B.)

Figure for b_1 Figure for b_2

Chapter 10: Model Comparison and Hierarchical Modeling

Andrew Ithurburn

16/11/2020

Chapter 10: Model Comparison and Hierarchical Modeling

10.1 General formula and the bayes factor

10.2 Example: two factories of coins

10.3: Two factory coin flip bias. Solution by MCMC

10.3.1: Nonhierarchical MCMC computation of each model’s marginal likelihood.

10.3.2: Hierarchical MCMC computation of relative model probability

10.3.2.1: Using Pseudo-priors to reduce autocorrelation

10.5: Model complexity naturally accounted for

10.3.3: Models with different “noise” distributions in JAGS

10.7: Exercises

10.1: [Purpose: To illustrate the fact that models with more distinctive predictions can be more easily discriminated.]

A.)

B.)

C.)

10.2: [Purpose: To be sure you really understand the JAGS program for Figure 10.4.]

A.)

B.)

C.)

10.3: [Purpose: To get some hands-on experience with pseudopriors.]

A.)

B.)