Chapter 10: Model Comparison and Hierarchical Modeling

10.1 General formula and the bayes factor

  • \(m\) = 1 for model 1
  • \(m\) = 2 for model 2 and so on.
  • \(P_m\)(y | \(\theta\)\(_m, m)\) ← likelihood function
  • \(P_m\)(\(\theta\)\(_m, | m)\) ← prior denotation
    • Probability density includes a subscript which expresses that different models might involve different distributional forms.
    • Bayes’ rule says that the posterior probability of model m is
      • Figure 10.3
        • This is the probability of the data within the model across all possible parameter values.
      • One consequence of Bayesian model comparison is it can be very sensitive to prior choice.
      • Figure 10.5: The so-called “posterior odds” of two models are the “prior odds” times the Bayes factor.
        • The underbraced region marked with “BF” denotes the Bayes factor for models 1 and 2.
          • The Bayes factor (BF) is the ratio of the probabilities of the data in models 1 and 2.

10.2 Example: two factories of coins

  1. Tail biased factory generates with biases distributed around a mode of \(\omega_1\) = 0.25, with a consistency (or concentration) of \(_K\) = 12, so that the biases are distributed as \(\theta\) ~ beta(\(\theta\) | \(\omega_1\) (\(_K\) - 2) + 1, (1 - \(\omega_1\)(\(_K\) - 2) +1) = beta(\(\theta\) | 3.5, 8.5)

  2. The head biased factor are distributed with a mode of \(\omega_2\) = 0.75, again with a consistency of \(_K\) = 12, such that the biases of coins from the factory are distributed as \(\theta\) ~ beta(\(\theta\) | \(\omega_2\) (\(_K\) - 2) + 1, (1 - \(\omega_2\)(\(_K\) - 2) +1) = beta(\(\theta\) | 8.5, 3.5)

  3. Figure 10.2: Hiarchical diagram for two models of a coin. Model 1 is a tail-biased mint; model 2 is a head-biased mint. This diagram is a specific case of the right panel of figure 10.1, because the likelihood function is the same for both models and only the priors are different

10.3: Two factory coin flip bias. Solution by MCMC

For large and complex models, grid approximation or formal analysis is not possible. Two approaches.

  1. Computes \(P(D|m)\) using MCMC for individual models.
  2. Puts the models together into a hierarchy as diagrammed in Figure 10.1, and the MCMC process visits different values of the model index proportionally to their posterior probabilities.

10.3.1: Nonhierarchical MCMC computation of each model’s marginal likelihood.

source("Jags-Ydich-Xnom1subj-MbernBeta.R")
## 
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************
## Warning: package 'rjags' was built under R version 4.0.3
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
## Warning: package 'runjags' was built under R version 4.0.3
myData = c(rep(0, 9-6), rep(1,6))
mcmcCoda = genMCMC(data = myData, numSavedSteps = 10000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 9
##    Unobserved stochastic nodes: 1
##    Total graph size: 21
## 
## Initializing model
## 
## Burning in the MCMC chain...
## Sampling final MCMC chain...
theta = as.matrix(mcmcCoda[, "theta"])

meanTheta = mean(theta)
sdTheta = sd(theta)
aPost = meanTheta * (meanTheta*(1-meanTheta)/sdTheta^2-1)
bPost = (1-meanTheta) * (meanTheta*(1-meanTheta)/sdTheta^2-1)
oneOverPD = mean( dbeta( theta , aPost , bPost ) /
( theta^6*(1-theta)^(9-6)
* dbeta( theta , 0.75*(12-2)+1 , (1-0.75)*(12-2)+1 ) ) )
PD = 1/oneOverPD
show(PD)
## [1] 0.002337505

10.3.2: Hierarchical MCMC computation of relative model probability

Following figure 10.2, top level parameter is the index across models.

Figure 10.4: The prior and posterior distributions for scripts jags-ydich-xnom1subj-MbernBetaModelComp.R. The upper frame, which shows the prior distribution, has labels that indicate \[p(\theta|D)\] but the data set, \(D\), is empty. The lower frame shows the posterior distribution

  • Bernoulli likelihood function and beta prior distributions are only specified once, even though they participate in both models.
  • The posterior probability of m = 1 is about 18%. Meaning that m = 1 was visited on only about 18% of the steps.
    • The histogram of \(\theta_1\), is based on only 18% of the chain, while the histogram of \(\theta_2\) is based on the complementary 82% of the chain.
    • The losing model would have very few representative values in the MCMC sample.

10.3.2.1: Using Pseudo-priors to reduce autocorrelation

using two distinct beta distributions that generate distinct \(\theta_m\) values. Resulting in a method that is more faithful to the diagram in figure 10.2.

  • Three parameters in this method:
  1. m
  2. \(\theta_1\)
  3. \(\theta_2\)
  • Only the \(\theta_m\) of the sampled model index m is actually used to describe the data.

The values for the pseudo-priors are determined as follows:

  1. Do an initial run of the analysis with the pseudo-prior set to the true prior.
  2. Set the pseudo-prior constants to values that mimic the currently estimated posterior.

10.5: Model complexity naturally accounted for

  • A complex model can find some combination of its parameter values that match the data better than the simpler model.

  • Bayesian model comparison compensates for model complexity by the fact that each model must have a prior distribution over its parameters, and more complex models must dilute their prior distributions over larger parameter spaces than simpler ones.

  • “Must-be-fair” model:

    • One model for a factory creates coins that are fair.
      • Modal bias is \(\omega_s\) = 0.5, and its concentration is \(k_s\) = 1,000.
      • This model is simple insofar as it has limited options for describing data: all the \(\theta\) values for individual coins must be very nearly 0.5.
  • “Anything’s-possible” model:

    • Creates coins of any possible bias, all with equal probability.
      • Central coin bias is \(\omega_c\) = 0.5, its consistency is slight, with \(k_c\) = 2.
      • This model is complex because it has many options for describing data: The individual coins can be anything between 0 and 1.
  • Coin flip N = 20, z = 15 heads.

    • Indication that there is a bias towards heads.
      • Anything’s-model.

Figure 10.5b: Lower part shows the posterior distribution on the parameters.

  • Penultimate row, showing \(\theta_1\) when m = 1. This represents the actual posterior distribution on \(\theta_1\), when m = 2.

Figure 10.6: Shows the results when running the model with the new specification of the pseudopriors * Note: The pseudo-priors nicely mimic the actual posteriors. + Because the ESS of the model index is far better than in the initial run, we trust its estimate in this run more than in the initial run. + The posterior probability of m = 2 is very nearly 92%.

mcmcMat = as.matrix( codaSamples )

m = mcmcMat[,"m"]

theta1M1 = mcmcMat[,"theta1"][ m ==1 ]
theta1M2 = mcmcMat[,"theta1"][ m ==2 ] 
theta2M1 = mcmcMat[,"theta2"][ m ==1 ] 
theta2M2 = mcmcMat[,"theta2"][ m ==2]  
  • The vector contains a sequence of 1’s and 2’s indicating which model is used at each step.
  • Increasing the length of the chain we can then increase the nmumber of values. However, doing so causes the program to run slower.
    • Solution would be to make JAGS sample more equally between models by changing the prior on the models to compensate for their relative credibilities.

10.3.3: Models with different “noise” distributions in JAGS

  • \(p(D|\theta)\), is the same for all models.
    • Sometimes called the “noise” distribution because it describes the random variability of the data values around the underlying trend.

10.7: Exercises

10.1: [Purpose: To illustrate the fact that models with more distinctive predictions can be more easily discriminated.]

A.)

pD = function(z,N,a,b) { exp( lbeta(z+a,N-z+b) - lbeta(a,b) ) }

omega_1 <- 0.25
omega_2 <- 0.75
kappa <- 6

z <- 7
N <- 10

a_1 <- omega_1 * (kappa - 2) + 1 
b_1 <- (1 - omega_1) * (kappa - 2) +1 
a_2 <- omega_2 * (kappa - 2) + 1
b_2 <- (1 - omega_2) * (kappa - 2) +1 

pd1 <- pD(z = z, N = N, a = a_1, b = b_1)
pd2 <- pD(z = z, N = N, a = a_2, b = b_2)

bf1 <- pd1/pd2

p1 <- .5
p2 <- 1 - p1

bfPrior = (pd1/pd2)*(p1/p2)
p1gd = bfPrior/(1.0+bfPrior)
p2gd = 1.0-p1gd

show(bf1)
## [1] 0.3333333
show(p1gd)
## [1] 0.25
show(p2gd)
## [1] 0.75

B.)

pD = function(z,N,a,b) { exp( lbeta(z+a,N-z+b) - lbeta(a,b) ) }

omega_1 <- 0.25
omega_2 <- 0.75
kappa <- 202

z <- 7
N <- 10

a_1 <- omega_1 * (kappa - 2) + 1 
b_1 <- (1 - omega_1) * (kappa - 2) +1 
a_2 <- omega_2 * (kappa - 2) + 1
b_2 <- (1 - omega_2) * (kappa - 2) +1 

pd1 <- pD(z = z, N = N, a = a_1, b = b_1)
pd2 <- pD(z = z, N = N, a = a_2, b = b_2)

bf1 <- pd1/pd2

p1 <- .5
p2 <- 1 - p1

bfPrior = (pd1/pd2)*(p1/p2)
p1gd = bfPrior/(1.0+bfPrior)
p2gd = 1.0-p1gd

show(bf1)
## [1] 0.01621596
show(p1gd)
## [1] 0.0159572
show(p2gd)
## [1] 0.9840428

C.)

That is because with the larger kappa there is a larger number of consistencies.

10.2: [Purpose: To be sure you really understand the JAGS program for Figure 10.4.]

A.)

Figure computed from file

B.)

Figure 10.2

C.)

Figure for a Figure for b

10.3: [Purpose: To get some hands-on experience with pseudopriors.]

A.)

Figure for a_1 Figure for a_2

B.)

Figure for b_1 Figure for b_2