Chapter 9: Hierarchical Models


The batting example:

  1. Number of hits each player attains
  2. Number of opportunities at bat
  3. Primary fielding position of that player

A child purchasing cafeteria lunch:

  1. Number of days each child bought lunch
  2. Number of school days attended
  3. School and district of the child

Surgeons saving the life of a patient (i.e., a coin flip per patient):

  1. Number of recoveries
  2. Number of surgeries
  3. Hospital and city of the surgery

Each individual-level parameter is informed by the data from the other individuals, via the higher-level distribution.

9.1: A single coin from a single mint

Bernoulli likelihood distribution: \(y_i\) ~ dbern(\(\theta\))

The prior distribution is a beta density: \(\theta\) ~ dbeta(\(a\), \(b\))

This can be reorganized in terms of the mode \(\omega\) and concentration \(K\):

\(\theta\) ~ dbeta(\(\omega(K - 2) + 1\), \((1 - \omega)(K - 2) + 1\))

The value \(K\) controls how close \(\theta\) is to \(\omega\).

  • The larger \(K\) is, the more the generated values of \(\theta\) concentrate around \(\omega\); \(K\) expresses our prior certainty about how close \(\theta\) is to \(\omega\).

Minted coins tend to have a bias near \(\omega\), inherent to the minting process. Some coins have a value of \(\theta\) greater than \(\omega\), while others have \(\theta\) values less than \(\omega\). The larger \(K\) is, the more consistently the manufacturing process mints coins with \(\theta\) values close to \(\omega\). The estimated credible \(\theta\) will be near the proportion of observed heads, and a larger \(K\) increases certainty.

Figure 9.1 A model of hierarchical dependencies for data from a single coin. The chain of arrows illustrates the chain of dependencies in Equations 9.2, 9.4, and 9.5. (At the top of the diagram, the second instantiation of the arrow to \(\omega\) is shown in grey instead of black merely to suggest that it is the same dependence already indicated by the first arrow in black.)

Posterior distribution over \(\omega\)

  • Supply a prior distribution
  • \(p(\omega)\) = beta(\(\omega\)|\(A_\omega\), \(B_\omega\))
  • An arrow is drawn from the \(\theta\)-parameterized Bernoulli distribution to the datum \(y_i\) (e.g., Figure 9.1).

\[p(\theta, \omega|y)=\frac{p(y|\theta, \omega) p(\theta, \omega)}{p(y)} =\frac{p(y|\theta)p(\theta|\omega)p(\omega)}{p(y)}\]

  • Bayesian inference is carried out on the joint parameter space (\(\theta\), \(\omega\)).
  • The semantics of the hierarchical model matter because they make the model easier for people to understand.
  • They also make the model easier for computer algorithms (e.g., JAGS) to process.

9.1.1: Posterior via grid approximation

We can use a grid approximation to get a picture of what is occurring. Because the parameters \(\theta\) and \(\omega\) are both confined to the interval \([0, 1]\), the grid approximation is simple to set up and readily graphed (see the sketch below).

Figure 9.2 (\(A_\omega = 2\), \(B_\omega = 2\), \(K = 100\)):

  • The top-right panel shows the marginal prior distribution on \(\omega\), obtained by summing (collapsing) the joint prior over \(\theta\); the \(p(\omega)\) axis is scaled as a density.
  • The joint prior puts more credibility on \(\theta\) values near \(\omega\) (the conditional distributions are beta densities).
  • The middle row shows the likelihood function, for data of 9 heads and 3 tails.
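Here is a minimal sketch (not the book's code) of such a grid approximation in R, using the constants of Figure 9.2 (\(A_\omega = B_\omega = 2\), \(K = 100\)) and the data of 9 heads in 12 flips:

nGrid = 201
theta = seq(0.001, 0.999, length = nGrid)
omega = seq(0.001, 0.999, length = nGrid)
K = 100; Aw = 2; Bw = 2
# Joint prior p(theta, omega) = p(theta | omega) * p(omega), evaluated on the grid:
prior = outer(theta, omega, function(t, w)
  dbeta(t, w*(K-2)+1, (1-w)*(K-2)+1) * dbeta(w, Aw, Bw))
# Bernoulli likelihood for z = 9 heads in N = 12 flips (depends only on theta):
z = 9; N = 12
lik = outer(theta, omega, function(t, w) t^z * (1-t)^(N-z))
post = prior * lik
post = post / sum(post)        # normalize over the grid cells
# Marginal posteriors, by summing (collapsing) over the other parameter:
pTheta = rowSums(post)         # marginal posterior on theta
pOmega = colSums(post)         # marginal posterior on omega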

9.2: Multiple coins from a single mint

Example:

  • A researcher gives a drug to several people and gives each person a memory test, in which the person tries to remember a list. Again, the outcome is dichotomous: did they remember or not?
  • We are interested in \(\omega\) because it denotes the overall effectiveness of the experimental drug.

9.2.1: Posterior via grid approximation

  • Another example:

    • There are two subjects in the same condition
    • We need estimates of the biases \(\theta_1\) and \(\theta_2\) for the two subjects, and also an estimate of \(\omega\), the overall drug effectiveness.
    • The biases are only weakly dependent on \(\omega\).

Representation of the 3D parameter space:

  • The middle row shows the likelihood function for the data in Figure 9.5.
  • The parallel contours indicate that the likelihood does not depend on \(\omega\).
  • The tightness of the contours reflects the amount of data collected.
  • The lower two rows show the posterior distributions. The width of each distribution reflects its uncertainty; \(\theta_1\) has less uncertainty than \(\theta_2\).
  • The contours reveal that the distribution of the \(\theta\)s is tightly linked to the value of \(\omega\), along a ridge or spindle in the joint parameter space.
  • Because the biases \(\theta\) are strongly dependent on the parameter \(\omega\), the posterior estimates are fairly tightly constrained to be near the estimated value of \(\omega\): the posterior is restricted to a zone within that spindle.
  • One of the desirable aspects of using grid approximation to determine the posterior is that we do not require any analytical derivation of formulas (a sketch follows this list).
  • Grid approximation can use mathematical expressions for the prior as a convenience for determining the prior values at all those thousands of grid points; the particular functional forms used here were chosen merely for the pedagogical goal of using functions that you are familiar with, not because of any mathematical restriction.
  • The approach does not scale, however: with 50 grid points per parameter, a model with four parameters already has \(50^4\) = 6,250,000 grid points, which strains small computers.
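To make the grid computation concrete, here is a minimal sketch (not the book's code) for the two-subject model above. The subject data (3 of 15 and 4 of 5 correct), the fixed \(K = 5\), and the beta(2, 2) prior on \(\omega\) are illustrative assumptions, not necessarily the values used in Figure 9.5.

nGrid = 50
thetaGrid = seq(0.001, 0.999, length = nGrid)
omegaGrid = seq(0.001, 0.999, length = nGrid)
grid = expand.grid(theta1 = thetaGrid, theta2 = thetaGrid, omega = omegaGrid)
K = 5                                   # assumed: weak dependence of the thetas on omega
Aw = 2; Bw = 2                          # assumed beta prior on omega
prior = with(grid,
  dbeta(omega, Aw, Bw) *
  dbeta(theta1, omega*(K-2)+1, (1-omega)*(K-2)+1) *
  dbeta(theta2, omega*(K-2)+1, (1-omega)*(K-2)+1))
# Hypothetical data: subject 1 remembered 3 of 15 items, subject 2 remembered 4 of 5.
lik = with(grid, dbinom(3, 15, theta1) * dbinom(4, 5, theta2))
post = prior * lik
post = post / sum(post)                 # normalize over the grid cells
# Marginal posterior on omega: expand.grid varies theta1 fastest, so the third
# array dimension indexes omega; sum over the two theta dimensions.
margOmega = apply(array(post, dim = c(nGrid, nGrid, nGrid)), 3, sum)
plot(omegaGrid, margOmega, type = "l", xlab = "omega", ylab = "p(omega|D) (grid mass)")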

9.2.2 A realistic model with MCMC

  • The degree of dependency was previously specified as a fixed value of the concentration (consistency) parameter, \(\kappa\).
  • When \(\kappa\) was fixed at a large value, the individual \(\theta_s\) values stayed close to \(\omega\),
  • but when \(\kappa\) was fixed at a small value, the individual \(\theta_s\) values could roam quite far from \(\omega\).
  • When the data from different coins show very similar proportions of heads, we have evidence that \(\kappa\) is high.
  • But when the data from different coins show diverse proportions of heads, we have evidence that \(\kappa\) is small.
  • Instead of specifying a single value of \(\kappa\), we allow a distribution of values of \(\kappa\).
  • One distribution that is convenient for this purpose, because it is familiar in traditional mathematical statistics and available in JAGS (and in Stan and BUGS), is the gamma distribution.
  • The top-left panel shows a frequently used version in which both the shape and rate are set to equal, small values.
  • In the lower-left panel, the standard deviation equals the mean, in which case the gamma distribution has shape = 1 (and is, in fact, an exponential distribution).
  • The gamma distribution becomes less skewed as the standard deviation gets smaller, with the mode getting closer to the mean.
  • Example conversions (a sketch of the underlying formulas follows this list):

    gammaShRaFromMeanSD(mean = 10, sd = 100)   # $shape [1] 0.01      $rate [1] 0.001
    gammaShRaFromModeSD(mode = 10, sd = 100)   # $shape [1] 1.105125  $rate [1] 0.01051249
    gammaParam = gammaShRaFromModeSD(mode = 10, sd = 100)
    gammaParam$shape   # [1] 1.105125
    gammaParam$rate    # [1] 0.01051249

  • If you want to graph a gamma distribution in R, the gamma density is provided by dgamma(x, shape = s, rate = r).
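These conversions follow from the gamma distribution's properties: mean \(= s/r\), variance \(= s/r^2\), and mode \(= (s - 1)/r\) for shape \(s \ge 1\) and rate \(r\). A minimal re-implementation is sketched below (the book's utilities file, DBDA2E-utilities.R, supplies these functions; the formulas here simply reproduce the outputs shown above).

gammaShRaFromMeanSD = function(mean, sd) {
  # Solve mean = shape/rate and sd^2 = shape/rate^2 for shape and rate.
  list(shape = mean^2 / sd^2, rate = mean / sd^2)
}
gammaShRaFromModeSD = function(mode, sd) {
  # Solve mode = (shape-1)/rate and sd^2 = shape/rate^2 for shape and rate.
  rate = (mode + sqrt(mode^2 + 4 * sd^2)) / (2 * sd^2)
  list(shape = 1 + mode * rate, rate = rate)
}
gammaShRaFromMeanSD(mean = 10, sd = 100)   # shape 0.01,     rate 0.001
gammaShRaFromModeSD(mode = 10, sd = 100)   # shape 1.105125, rate 0.01051249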

9.2.3 Doing it with JAGS

model {
  for (i in 1:Ntotal) {
    y[i] ~ dbern(theta[s[i]])
  }
  for (s in 1:Nsubj) {
    theta[s] ~ dbeta(omega*(kappa-2)+1, (1-omega)*(kappa-2)+1)
  }
  omega ~ dbeta(1, 1)
  kappa <- kappaMinusTwo + 2
  kappaMinusTwo ~ dgamma(0.01, 0.01)  # mean = 1, sd = 10 (generic vague)
}
  • kappa <- kappaMinusTwo + 2; kappaMinusTwo ~ dgamma(0.01, 0.01) # mean = 1, sd = 10 (generic vague)
  • This two - line implementation is needed because the gamma distribution covers zero to infinity, but κ must be no smaller than 2.
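For concreteness, here is a minimal sketch (an assumption, not the book's genMCMC wrapper) of the data list that this model expects, given a data frame myData with a column y of 0/1 outcomes and a column s of subject identifiers (as in the therapeutic-touch example of the next section):

y = myData$y
s = as.numeric(factor(myData$s))    # integer subject index, 1..Nsubj
dataList = list(y = y, s = s,
                Ntotal = length(y), Nsubj = length(unique(s)))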

9.2.4 Example: Therapeutic touch

  • The practitioner sat with her hands extended through cutouts in a cardboard screen, which prevented the practitioner from seeing the experimenter.
  • Each trial was scored as correct or wrong.
# Read the data file:
myData = read.csv("TherapeuticTouchData.csv")
# Load the relevant model functions into R's working memory:
source("Jags-Ydich-XnomSsubj-MbernBetaOmegaKappa.R")
# Generate the MCMC chain:
mcmcCoda = genMCMC(data = myData, sName = "s", yName = "y",
                   numSavedSteps = 20000, thinSteps = 10)
# Display diagnostics of chain, for specified parameters:
diagMCMC(codaObject = mcmcCoda, parName = "omega")
diagMCMC(codaObject = mcmcCoda, parName = "kappa")
diagMCMC(codaObject = mcmcCoda, parName = "theta[1]")
# Get summary statistics of chain:
smryMCMC(mcmcCoda, compVal = 0.5, diffIdVec = c(1, 14, 28), compValDiff = 0.0)
# Display posterior information:
plotMCMC(mcmcCoda, data = myData, sName = "s", yName = "y",
         compVal = 0.5, diffIdVec = c(1, 14, 28), compValDiff = 0.0)
  • To keep the saved MCMC sample down to a modest file size (under 5 MB), I chose to set thinSteps = 10 and to save 20,000 of the thinned steps.
  • Only the kappa parameter showed high autocorrelation, which is typical of higher-level parameters that control the variance of lower-level parameters.
  • A numerical summary of the posterior distribution is produced by the smryMCMC function.
  • An argument of the function that is unique to this model is diffIdVec.
  • The posterior distribution reveals several interesting implications, described next.
  • Its most credible value is less than 0.5, and its 95% HDI includes the chance value of 0.5, so we would certainly not want to conclude that the group of practitioners, as a class, detects the energy field of a nearby hand better than chance.
  • You can see that the estimate for theta[1] has its peak around 0.4, even though this subject got only 1 out of 10 trials correct, as indicated by the "+" marked on the horizontal axis at 0.10.
  • The lower three rows of Figure 9.10 also show pairs of individual estimates and their differences.
  • The answer is provided by the marginal posterior on the difference \(\theta_1 - \theta_{28}\), plotted in the right column.
  • The 95% HDI includes a difference of zero.
  • The estimates for the individual practitioners also include chance performance.
  • We used a hierarchical model because it is a reasonable way to capture individual differences and group-level tendencies in the data.

9.3 Shrinkage in hierarchical models

  • In typical hierarchical models, the estimates of low-level parameters are pulled closer together than they would be if there were not a higher-level distribution.
  • The most credible values of the individual-level biases, the \(\theta_s\), were closer to the group-level mode, \(\omega\), than the individual proportions correct, \(z_s/N_s\).
  • Strictly speaking, this pulling together applies only for unimodal distributions of parameters.
  • Shrinkage is a rational implication of hierarchical model structure, and is (usually) desired by the analyst because the shrunken parameter estimates are less affected by random sampling noise than estimates derived without hierarchical structure.
  • Shrinkage occurs because the estimate of each low-level parameter is influenced by two sources:

    1. the subset of data that are directly dependent on the low-level parameter,
    2. the higher-level parameters on which the low-level parameter depends.
  • It is important to understand that shrinkage is a consequence of hierarchical model structure, not Bayesian estimation.
  • The MLE is the set of parameter values that maximizes the probability of the data.
  • The \(y_{i|s}\) values are constants, namely the 0's and 1's in the data.
  • As a concrete numerical example, suppose we have five individuals, each of whom went through 100 trials, and had success counts of 30, 40, 50, 60, and 70.
  • A natural first candidate sets each \(\theta_s\) at its observed proportion correct, with a higher-level beta distribution that is nearly flat.
  • But are there other parameter values that increase the likelihood?
  • Yes: the \(\theta_s\) values in the MLE are shrunken closer to the mode of the higher-level distribution, as emphasized by the arrows in the plot (a numerical sketch follows this list).
  • Intuitively, shrinkage occurs because the data from all individuals influence the higher-level distribution, which in turn influences the estimates of each individual.
  • In other words, shrinkage is caused by the hierarchical structure itself, not by Bayesian estimation per se.
  • The MLE involves no prior on the top-level parameters, of course.
  • When we use Bayesian estimation, we supply a prior distribution on the top-level parameters and infer an entire posterior distribution across the joint parameter space.
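To make the numerical example concrete, here is a minimal sketch (not the book's code) for the 30/40/50/60/70-out-of-100 data. The hierarchical "likelihood" being maximized includes the factors \(p(\theta_s|\omega, \kappa)\) as well as the binomial terms; the fixed values \(\omega = 0.5\) and \(\kappa = 20\) and the 10% pull toward the mode are illustrative assumptions, used only to show that shrunken \(\theta_s\) values can yield a higher joint likelihood than the raw proportions.

z = c(30, 40, 50, 60, 70); N = rep(100, 5)
# Hierarchical log "likelihood": binomial terms for the data times beta terms
# for the individual thetas given omega and kappa.
hierLogLik = function(theta, omega, kappa) {
  a = omega * (kappa - 2) + 1
  b = (1 - omega) * (kappa - 2) + 1
  sum(dbinom(z, N, theta, log = TRUE)) + sum(dbeta(theta, a, b, log = TRUE))
}
propCorrect = z / N                            # raw sample proportions
shrunken = 0.5 + 0.9 * (propCorrect - 0.5)     # pulled 10% toward the mode 0.5 (assumed amount)
hierLogLik(propCorrect, omega = 0.5, kappa = 20)
hierLogLik(shrunken, omega = 0.5, kappa = 20)  # larger: shrinkage increases the joint likelihood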

9.4 Speeding up JAGS

  • Although that statement (above) looks like a for-loop in R that would be run as a sequential procedure, it is really just telling JAGS to set up many copies of the Bernoulli relation.
  • First, we compute the \(z_s\) and \(N_s\) values from the data. The code below assumes that y is a vector of 0's and 1's and that s is a vector of integer subject identifiers.
z = aggregate(y, by = list(s), FUN = sum)$x
N = aggregate(rep(1, length(y)), by = list(s), FUN = sum)$x
dataList = list(z = z, N = N, Nsubj = length(unique(s)))
  • The only information sent to JAGS is \(z_s\), \(N_s\), and the overall number of subjects, Nsubj.
model {
  for (s in 1:Nsubj) {
    z[s] ~ dbin(theta[s], N[s])
    theta[s] ~ dbeta(omega*(kappa-2)+1, (1-omega)*(kappa-2)+1)
  }
  omega ~ dbeta(1, 1)
  kappa <- kappaMinusTwo + 2
  kappaMinusTwo ~ dgamma(0.01, 0.01)  # mean = 1, sd = 10 (generic vague)
}
source ("Jags-Ydich-XnomSsubj-MbinomBetaOmegaKappa.R")
  • An important way to speed up the MCMC is to run the chains in parallel with the runjags package, as explained in Section 8.7, p. 215 (a sketch follows).
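As a concrete illustration, here is a minimal sketch of running the chains in parallel with runjags (the file name, monitored parameters, and chain settings are assumptions, not the book's genMCMC wrapper):

library(runjags)
library(coda)
writeLines(modelString, con = "TEMPmodel.txt")  # modelString holds the model text above (assumed name)
runJagsOut = run.jags(model = "TEMPmodel.txt",
                      monitor = c("theta", "omega", "kappa"),
                      data = dataList,          # the z, N, Nsubj list built above
                      n.chains = 3,
                      method = "parallel",
                      burnin = 500, sample = 10000)
mcmcCoda = as.mcmc.list(runJagsOut)             # coda object, usable by diagMCMC etc.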

9.5 Extending the hierarchy: Subjects within categories

  • consider professional baseball players who, over the course of a year of games, have many opportunities at bat, and who sometimes get a hit.
  • we estimate batting abilities for individual players, and for positions, and for the overarching group of professional players. You may be able to think of a number of analogous structures from other domains.
  • The model has an extra layer added for the category level, along with the subscripts needed to indicate the categories.
  • Each category has its own modal bias \(\omega_c\), from which all subject biases in the category are assumed to be drawn.
  • When \(\kappa_0\) is large, the category biases \(\omega_c\) are tightly concentrated around \(\omega_0\).
  • The prior on \(\kappa_c\) applies independently to each \(\kappa_c\), in a manner fixed by the prior constants \(S_\kappa\) and \(R_\kappa\), and the \(\kappa_c\)'s do not mutually inform each other via that part of the hierarchy.
  • Each row could also contain a unique subject identifier for ease of reference.
model {
  for (s in 1:Nsubj) {
    z[s] ~ dbin(theta[s], N[s])
    theta[s] ~ dbeta(omega[c[s]]*(kappa[c[s]]-2)+1,
                     (1-omega[c[s]])*(kappa[c[s]]-2)+1)
  }
  for (c in 1:Ncat) {
    omega[c] ~ dbeta(omega0*(kappa0-2)+1, (1-omega0)*(kappa0-2)+1)
    kappa[c] <- kappaMinusTwo[c] + 2
    kappaMinusTwo[c] ~ dgamma(0.01, 0.01)  # mean = 1, sd = 10 (generic vague)
  }
  omega0 ~ dbeta(1.0, 1.0)
  kappa0 <- kappaMinusTwo0 + 2
  kappaMinusTwo0 ~ dgamma(0.01, 0.01)  # mean = 1, sd = 10 (generic vague)
}
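The nested indexing omega[c[s]] and kappa[c[s]] requires each subject's category to be supplied as an integer code. A minimal sketch of the corresponding data list (assumed variable names, not the book's wrapper), using the column names of the batting data shown below (Hits, AtBats, PriPos):

catIdx = as.numeric(factor(myData$PriPos))   # integer category code, 1..Ncat
dataList = list(z = myData$Hits, N = myData$AtBats, c = catIdx,
                Nsubj = nrow(myData), Ncat = length(unique(catIdx)))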
  • Notice the nested indexing for categories in omega[c[s]], which uses the mode of the appropriate category for describing the individual theta[s].
  • This example is intended to demonstrate how easy it is to specify a complex hierarchical model in JAGS.
  • When writing a JAGS program, I start by making a diagram to be sure I fully understand the descriptive model, then I create the JAGS code arrow by arrow, starting at the bottom of the diagram.

9.5.1 Example: Baseball batting abilities by position

  • During a year of games, different players have different numbers of opportunities at bat, and on some of these opportunities a player might actually hit the ball.
  • That ratio, of hits divided by opportunities at bat, is called the batting average of each player.
  • The data consist of records from 948 players in the 2012 regular season of Major League Baseball.
  • To give some sense of the data, there were 324 pitchers with a median of 4.0 at-bats, 103 catchers with a median of 170.0 at-bats, and 60 right fielders with a median of 340.5 at-bats, along with 461 players in six other positions.
Player, PriPos, Hits, AtBats, PlayerNumber, PriPosNumber
Fernando Abad, Pitcher, 1, 7, 1, 1
Bobby Abreu, Left Field, 53, 219, 2, 7
Tony Abreu, 2nd Base, 18, 70, 3, 4
Dustin Ackley, 2nd Base, 137, 607, 4, 4
Matt Adams, 1st Base, 21, 86, 5, 3
... [943 more rows] ...
# Read the data
myData = read.csv("BattingAverage.csv")
# Load the relevant model into R's working memory:
source("Jags-Ybinom-XnomSsubjCcat-MbinomBetaOmegaKappa.R")
# Generate the MCMC chain:
mcmcCoda = genMCMC(data = myData, zName = "Hits", NName = "AtBats",
                   sName = "Player", cName = "PriPos",
                   numSavedSteps = 11000, thinSteps = 20)
# Display diagnostics of chain, for specified parameters:
for (parName in c("omega[1]", "omega0", "kappa[1]", "kappa0", "theta[1]")) {
  diagMCMC(codaObject = mcmcCoda, parName = parName,
           saveName = fileNameRoot, saveType = graphFileType)
}
# Get summary statistics of chain:
summaryInfo = smryMCMC(mcmcCoda, compVal = NULL)
# Display posterior information:
plotMCMC(mcmcCoda, data = myData, zName = "Hits", NName = "AtBats",
         sName = "Player", cName = "PriPos", compVal = NULL,
         diffCList = list(c("Pitcher", "Catcher"),
                          c("Catcher", "1st Base")),
         diffSList = list(c("Kyle Blanks", "Bruce Chen"),
                          c("Mike Leake", "Wandy Rodriguez"),
                          c("Andrew McCutchen", "Brett Jackson"),
                          c("ShinSoo Choo", "Ichiro Suzuki")),
         compValDiff = 0.0)
  • The genMCMC command indicates that JAGS should save 11,000 steps thinned by 20.
  • The goal was to have an effective sample size (ESS) of at least 10,000 for the key parameters (and differences of parameters), such as the \(\theta_s\) and \(\omega_c\), while also keeping the saved file size as small as possible.
  • With 968 parameters, 11,000 saved steps take more than 77 MB of computer memory.
  • The last command in the script, plotMCMC, uses new arguments diffCList and diffSList.
  • JAGS produces combinations of parameter values in the 968-dimensional parameter space that are jointly credible, given the data.
  • For every pair of players, we could ask how much their estimated batting abilities differ. For every pair of positions, we can ask how much their batting abilities differ.
  • Do outfielders have different batting averages than basemen?
  • Clearly the modal batting ability of pitchers is lower than that of catchers.
  • These two players happened to have very few opportunities at bat, and therefore their estimated abilities are dominated by the position information.
  • If we had not included the position level in the model, and had instead put all 948 players under a single overarching distribution, then the estimates for two players with identical batting records would be identical regardless of the position they play.
  • Notice that the widths of their 95% HDIs are smaller than those in the left panel because these players have so much more data.
  • The right panel shows two center fielders, one of whom has a large number of at-bats and an exceptional batting record.
  • Because of the large amount of data from these individuals, the posterior distribution of the difference excludes zero despite within-position shrinkage.
  • This example has illustrated many important concepts in hierarchical modeling that are recapitulated and amplified in the following paragraphs.
  • We saw shrinkage of individual-ability estimates based on category (fielding position).
  • Players with many at-bats (large \(N_s\)) had somewhat less shrinkage of their individual estimates than players with few at-bats (small \(N_s\)), who had estimates dominated by the position information.
  • the hierarchical structure is an expression of how you think the data should be meaningfully modeled, and the model description captures aspects of the data that you care about.

9.6: Exercises

Exercise 9.1

A.)

gammaShRaFromMeanSD(1.0, 10)
## $shape
## [1] 0.01
## 
## $rate
## [1] 0.01
gammaShRaFromModeSD(1, 10)
## $shape
## [1] 1.105125
## 
## $rate
## [1] 0.1051249

B.)

k = seq(0, 200, length = 10001)
openGraph(height = 7, width = 7)
layout(matrix(1:3, ncol = 1))
plot(k, dgamma(k, 1.105125, 0.105125), ylab = "dgamma(k)",
     type = "l", main = "Gamma Distributions (SD=10)")
lines(k, dgamma(k, 0.01, 0.01), col = "lightgreen")
legend("topright", c("Mode 1", "Mean 1"),
       lty = c(1, 1), col = c("black", "lightgreen"), text.col = c("black", "lightgreen"))
plot(k, dgamma(k, 1.105125, 0.105125), ylab = "dgamma(k)",
     type = "l", ylim = c(0.07, 0.08), main = "Gamma Distributions (SD=10), zoomed in")
lines(k, dgamma(k, 0.01, 0.01), col = "lightgreen")
legend("topright", c("Mode 1", "Mean 1"),
       lty = c(1, 1), col = c("black", "lightgreen"), text.col = c("black", "lightgreen"))
plot(k, dgamma(k, 1.105125, 0.105125), ylab = "dgamma(k)",
     type = "l", ylim = c(0, 8.0e-5), main = "Gamma Distributions (SD=10), zoomed in")
lines(k, dgamma(k, 0.01, 0.01), col = "lightgreen")
legend("topright", c("Mode 1", "Mean 1"),
       lty = c(1, 1), col = c("black", "lightgreen"), text.col = c("black", "lightgreen"))

C.)

# Read the therapeutic touch data (as in Section 9.2.4):
myData = read.csv("TherapeuticTouchData.csv")
myData$s = factor(myData$s)



source("Jags-Ydich-XnomSsubj-MbinomBetaOmegaKappa1.R")

mcmcCoda = genMCMC(data=myData, sName="s", yName="y",
                    numSavedSteps=20000, thinSteps=1)

diagMCMC(codaObject=mcmcCoda, parName="omega")
diagMCMC(codaObject=mcmcCoda, parName="kappa")
diagMCMC(codaObject=mcmcCoda, parName="theta[1]")

smryMCMC(mcmcCoda,
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

plotMCMC(mcmcCoda, data=myData, sName="s", yName="y",
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

myData$s = factor(myData$s)



source("Jags-Ydich-XnomSsubj-MbinomBetaOmegaKappa2.R")

mcmcCoda = genMCMC(data=myData, sName="s", yName="y",
                    numSavedSteps=20000, thinSteps=10)

diagMCMC(codaObject=mcmcCoda, parName="omega")
diagMCMC(codaObject=mcmcCoda, parName="kappa")
diagMCMC(codaObject=mcmcCoda, parName="theta[1]")

smryMCMC(mcmcCoda,
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

plotMCMC(mcmcCoda, data=myData, sName="s", yName="y",
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

(Figures: results with the gamma prior having mean 1.0, and with the gamma prior having mode 1.0.)

D.)

Yes; when \(\kappa\) is increased, the amount of shrinkage increases, and the posterior distribution of \(\kappa\) skews to the right.

E.)

Exercise 9.2

A.)

myData$s = factor(myData$s)



source("Jags-Ydich-XnomSsubj-MbinomBetaOmegaKappa_z1.R")

mcmcCoda = genMCMC(data=myData, sName="s", yName="y",
                    numSavedSteps=20000, thinSteps=1)


smryMCMC(mcmcCoda,
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

plotMCMC(mcmcCoda, data=myData, sName="s", yName="y",
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

myData$s = factor(myData$s)



source("Jags-Ydich-XnomSsubj-MbinomBetaOmegaKappa_z2.R")

mcmcCoda = genMCMC(data=myData, sName="s", yName="y",
                    numSavedSteps=20000, thinSteps=1)

smryMCMC(mcmcCoda,
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

plotMCMC(mcmcCoda, data=myData, sName="s", yName="y",
          compVal=0.5, diffIdVec=c(1,14,28), compValDiff=0.0)

(Figures: results from the two runs, each with a gamma prior having mean 1.0.)

B.)

The first prior is more appropriate to use because, with the second prior, data visualization is not possible. Both have mean = 1.

Exercise 9.3: Compare Bayesian shrinkage with MLE shrinkage

A.)

ht = c(rep(1,30),rep(0,100-30),
                rep(1,40),rep(0,100-40),
                rep(1,50),rep(0,100-50),
                rep(1,60),rep(0,100-60),
                rep(1,70),rep(0,100-70))
id = factor(c(rep("group1",100),
                    rep("group2",100),
                    rep("group3",100),
                    rep("group4",100),
                    rep("group5",100)))
newdata = data.frame(y=ht, s=id)

source("jags-Ydich-XnomSsubj-MbernBetaOmegaKappa.R")

mcmcCoda = genMCMC(data = newdata, sName ="s", yName ="y", numSavedSteps = 20000, saveName = fileNameRoot, thinSteps = 10)

summar$y_i$nfo = smryMCMC(mcmcCoda, compVal=0.5,
                        diffIdVec=c(1,2,3,4,5), compValDiff=0.0,
                        saveName=fileNameRoot)

plotMCMC(mcmcCoda, data=myData, sName="s", yName="y",
          compVal=0.5, 
          diffIdVec=c(1,2,3,4,5), compValDiff=0.0, 
          saveName="9.3-Attempt", saveType="png")

(Figure: results with the gamma prior having mean 1.0.)

Exercise 9.4: Explore duration of processing by JAGS

A.)


B.)


C.)

Reducing the number of chains to a single chain had two effects. The first, and most obvious, is that the simulation took considerably longer to complete (17.97 s for 3 chains versus 43.89 s for 1 chain), presumably because a single chain must generate all of the saved steps itself rather than splitting them across parallel chains. The second effect was that the single-chain run produced an error in which the last iteration of the chain would not run, which makes sense given that only one chain was being run.