Chapter 13 - Models with Memory

This chapter has been an introduction to the motivation, implementation, and interpretation of basic multilevel models. It focused on varying intercepts, which achieve better estimates of baseline differences among clusters in the data. They achieve better estimates, because they simultaneously model the population of clusters and use inferences about the population to pool information among parameters. From another perspective, varying intercepts are adaptively regularized parameters, relying upon a prior that is itself learned from the data.
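
For reference, the varying-intercepts (tank) model that the questions below build on is the chapter's tank model, which matches the msT fit in 13-1:

\[\begin{align} s_i &\sim \text{Binomial}(n_i, p_i) \\ \text{logit}(p_i) &= \alpha_{tank[i]} \\ \alpha_{tank} &\sim \text{Normal}(\bar{\alpha}, \sigma) \\ \bar{\alpha} &\sim \text{Normal}(0, 1.5) \\ \sigma &\sim \text{Exponential}(1) \end{align}\]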

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

13-1. Revisit the Reed frog survival data, data(reedfrogs), and add the predation and size treatment variables to the varying intercepts model. Consider models with either main effect alone, both main effects, as well as a model including both and their interaction. Instead of focusing on inferences about these two predictor variables, focus on the inferred variation across tanks. Explain why it changes as it does across models. Plot the sigma estimates.

library(rstan)
library(rethinking)   # provides reedfrogs, ulam, map2stan, coeftab, etc.
data(reedfrogs)
d <- reedfrogs

# Add tank index variable
d$tank <- 1:nrow(d)

# Recode predation as an indicator (0 = no predators, 1 = predators)
# and size as an index (1 = small, 2 = big)
d$pred <- ifelse( d$pred=="no" , 0L , 1L )
d$big <- ifelse( d$size=="big" , 2L , 1L )


# prepare list
dat <- list(
  S = d$surv,
  D = d$density,
  T = d$tank,
  P = d$pred,
  B = d$big
)

# Tank-only
msT <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T],
    a[T] ~ dnorm(a_bar, sigma),
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  
  ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3)

# Predation model
msP <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T] + b_p * P,
    a[T] ~ dnorm(a_bar, sigma),
    b_p ~ dnorm(-0.5, 1), 
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  
  ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3)

# Size model
msS <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T] + s[B],
    a[T] ~ dnorm(a_bar, sigma),
    s[B] ~ dnorm(0, 0.5), 
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  
  ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3)

# Predation + Size model
msPS <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T] + b_p * P + s[B],
    a[T] ~ dnorm(a_bar, sigma),
    b_p ~ dnorm(-0.5, 1), 
    s[B] ~ dnorm(0, 0.5), 
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
    ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3)

# Predation + Size + Interaction model
# (non-centered tank intercepts; indexing b_p by size B lets the predation
# effect differ by size, which is the interaction)
msPSI <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a_bar + a[T] * sigma + b_p[B] * P + s[B],
    a[T] ~ dnorm(0, 1),
    b_p[B] ~ dnorm(-0.5, 1), 
    s[B] ~ dnorm(0, 0.5), 
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  
  ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3)

# plot sigma

plot(coeftab(msT, msP, msS, msPS, msPSI), 
  pars = "sigma",
  labels = c("Tank", "Predation", "Size", "Predation + Size", "Predation + Size + Interaction"),
  xlab = "sigma"
)

# There is a substantial difference in the inferred variation across tanks (sigma) between the models that include predation and those that do not. From this we can conclude that the predation variable explains much of the variation across tanks. Without predation in the model, that otherwise-unexplained variation is absorbed by the tank intercepts, hence the higher sigma for the models without predation.
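
# To put numbers on the drop in sigma described above, the posterior for sigma
# can be summarized directly (a quick check; output not reproduced here):
precis(msT, pars = "sigma")
precis(msP, pars = "sigma")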

13-2. Compare the models you fit just above, using WAIC. Can you reconcile the differences in WAIC with the posterior distributions of the models? Show the WAIC table.

compare(msT, msP, msS, msPS, msPSI)
##           WAIC       SE    dWAIC      dSE    pWAIC    weight
## msP   198.6382 8.919843 0.000000       NA 19.07906 0.3410460
## msT   199.7460 7.275490 1.107790 5.697462 20.73528 0.1960015
## msPS  199.7542 8.696084 1.115988 1.906009 19.05302 0.1951998
## msPSI 200.0986 9.037135 1.460393 3.094644 19.28399 0.1643209
## msS   201.0244 7.183472 2.386208 5.673979 21.27712 0.1034319
coeftab(msT, msP, msS, msPS, msPSI)
##        msT     msP     msS     msPS    msPSI  
## a[1]      2.14    2.50    2.18    2.40   -0.03
## a[2]      3.06    2.99    3.09    2.82    0.49
## a[3]      1.00    1.69    1.08    1.68   -0.99
## a[4]      3.07    2.98    3.07    2.82    0.51
## a[5]      2.14    2.50    1.97    2.27   -0.02
## a[6]      2.13    2.49    1.96    2.26   -0.01
## a[7]      3.07    2.99    2.92    2.71    0.50
## a[8]      2.12    2.48    1.96    2.28   -0.03
## a[9]     -0.16    2.24   -0.09    2.21   -0.10
## a[10]     2.13    3.60    2.17    3.48    1.54
## a[11]     0.99    3.03    1.08    2.94    0.85
## a[12]     0.58    2.75    0.64    2.69    0.55
## a[13]     1.01    3.02    0.83    2.72    0.22
## a[14]     0.21    2.49    0.02    2.22   -0.43
## a[15]     2.14    3.60    1.98    3.27    0.98
## a[16]     2.12    3.59    1.96    3.26    0.93
## a[17]     2.90    2.91    2.95    2.82    0.45
## a[18]     2.41    2.57    2.46    2.51    0.06
## a[19]     2.01    2.28    2.08    2.24   -0.28
## a[20]     3.67    3.30    3.70    3.17    0.88
## a[21]     2.39    2.56    2.22    2.31    0.08
## a[22]     2.41    2.57    2.20    2.29    0.11
## a[23]     2.39    2.57    2.21    2.29    0.09
## a[24]     1.71    2.01    1.52    1.77   -0.57
## a[25]    -1.00    1.57   -0.92    1.61   -0.85
## a[26]     0.16    2.53    0.24    2.54    0.41
## a[27]    -1.43    1.26   -1.35    1.30   -1.27
## a[28]    -0.47    2.00   -0.38    2.03   -0.30
## a[29]     0.16    2.53   -0.03    2.22   -0.50
## a[30]     1.44    3.54    1.25    3.19    0.82
## a[31]    -0.64    1.87   -0.83    1.56   -1.38
## a[32]    -0.31    2.15   -0.50    1.83   -1.02
## a[33]     3.21    3.07    3.26    2.99    0.66
## a[34]     2.70    2.76    2.76    2.70    0.32
## a[35]     2.71    2.76    2.76    2.71    0.32
## a[36]     2.06    2.26    2.14    2.25   -0.29
## a[37]     2.05    2.25    1.87    1.99   -0.26
## a[38]     3.86    3.44    3.73    3.11    1.11
## a[39]     2.71    2.76    2.53    2.48    0.36
## a[40]     2.34    2.50    2.17    2.22    0.02
## a[41]    -1.81    0.92   -1.73    0.99   -1.68
## a[42]    -0.57    1.91   -0.49    1.95   -0.39
## a[43]    -0.45    2.01   -0.36    2.05   -0.25
## a[44]    -0.33    2.12   -0.26    2.15   -0.12
## a[45]     0.58    2.92    0.38    2.58   -0.04
## a[46]    -0.57    1.91   -0.77    1.60   -1.35
## a[47]     2.05    4.02    1.88    3.66    1.44
## a[48]     0.01    2.42   -0.19    2.10   -0.70
## a_bar     1.34    2.55    1.29    2.38    2.32
## sigma     1.61    0.83    1.61    0.78    0.75
## b_p         NA   -2.45      NA   -2.45      NA
## s[1]        NA      NA    0.20    0.36    0.10
## s[2]        NA      NA   -0.09   -0.07    0.15
## b_p[1]      NA      NA      NA      NA   -1.84
## b_p[2]      NA      NA      NA      NA   -2.77
## nobs        48      48      48      48      48
# The WAIC values for the models are very similar and well within one standard error of each other. In the posteriors, however, the estimates for s are much closer to 0 than those for b_p, so the models that include b_p shrink sigma substantially (from about 1.6 down to about 0.8) even though WAIC barely distinguishes the models.

13-3. Re-estimate the basic Reed frog varying intercept model, but now using a Cauchy distribution in place of the Gaussian distribution for the varying intercepts. That is, fit this model: \[\begin{align} s_i &\sim \text{Binomial}(n_i, p_i) \\ \text{logit}(p_i) &= \alpha_{tank[i]} \\ \alpha_{tank} &\sim \text{Cauchy}(\alpha, \sigma) \\ \alpha &\sim \text{Normal}(0, 1) \\ \sigma &\sim \text{Exponential}(1) \end{align}\]

(You are likely to see many divergent transitions for this model. Can you figure out why? Can you fix them?) Plot and compare the posterior means of the intercepts, αtank, to the posterior means produced in the chapter, using the customary Gaussian prior. Can you explain the pattern of differences? Take note of any change in the mean α as well.

# Tank-only, Cauchy distribution
msT_c <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T],
    a[T] ~ dcauchy(a_bar, sigma),
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  
  ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3)
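
# The divergent transitions come from the heavy tails of the Cauchy prior. One way
# to reduce them (a sketch with an illustrative model name, not rerun here) is to make
# Stan take more careful steps by raising adapt_delta, as is also done in 13-4 below;
# a non-centered parameterization of the intercepts is another option.
msT_c2 <- ulam(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T],
    a[T] ~ dcauchy(a_bar, sigma),
    a_bar ~ dnorm(0, 1.5),
    sigma ~ dexp(1)
  ), data=dat, chains=4, cores = 4, log_lik=TRUE, iter = 2e3,
  control=list(adapt_delta=0.99))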

# extract posterior means of the tank intercepts
aT <- apply(extract.samples(msT)$a, 2, mean)
aTC <- apply(extract.samples(msT_c)$a, 2, mean)

# compare intercept of Tank (Gaussian VS Cauchy)
plot(aT, aTC,
  pch = 16, col = rangi2,
  xlab = "Intercept (Gaussian prior)", ylab = "Intercept (Cauchy prior)"
)
abline(a = 0, b = 1, lty = 2)

# At higher values of α_tank there is a larger divergence between the Cauchy-prior and Gaussian-prior estimates, with the Cauchy estimates much higher than the Gaussian ones. This is because the Cauchy distribution has much heavier tails than the Gaussian, so it applies far less shrinkage to the extreme tanks (those in which all or nearly all tadpoles survived). At moderate α_tank values the points fall on the identity line, meaning the two priors give essentially the same estimates.
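
# A quick illustration of the tail difference driving this pattern (sketch):
# the Cauchy density keeps far more mass in its tails than a Gaussian of the same scale.
curve(dnorm(x, 0, 1), from = -10, to = 10, n = 500, lty = 1, ylab = "density")
curve(dcauchy(x, 0, 1), add = TRUE, lty = 2, col = rangi2)
legend("topright", lty = c(1, 2), col = c("black", rangi2),
       legend = c("Normal(0, 1)", "Cauchy(0, 1)"))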

13-4. Now use a Student-t distribution with ν = 2 for the intercepts: \[\begin{align} \alpha_{tank} &\sim \text{Student}(2, \alpha, \sigma) \end{align}\]

Refer back to the Student-t example in Chapter 7 (page 234), if necessary. Plot and compare the resulting posterior to both the original model and the Cauchy model in 13-3. Can you explain the differences and similarities in shrinkage in terms of the properties of these distributions?

# student prior
msT_S <- map2stan(
  alist(
    S ~ dbinom(D, p),
    logit(p) <- a[T],
    a[T] ~ dstudent(2, a_bar, sigma),
    a_bar ~ dnorm(0,1),
    sigma ~ dcauchy(0,1)
    ), data=dat, chains=4, cores=4, warmup=1000, 
  iter=3000, control=list(adapt_delta=0.99)
)

# extract posterior means of the tank intercepts
aTS <- apply(extract.samples(msT_S)$a, 2, mean)

# compare intercept of Tank (Gaussian VS Student-t)
plot(aT, aTS,
  pch = 16, col = rangi2,
  xlab = "Intercept (Gaussian prior)", ylab = "Intercept (Student-t prior)"
)
abline(a = 0, b = 1, lty = 2)

# compare intercept of Tank (Cauchy VS Student-t)
plot(aTS, aTC,
  pch = 16, col = rangi2,
  xlab = "Intercept (Student-t prior)", ylab = "Intercept (Cauchy prior)"
)
abline(a = 0, b = 1, lty = 2)

# Comparing the α_tank estimates between the Gaussian and Student-t priors, we see a pattern similar to the Gaussian-versus-Cauchy comparison: the points drift above the identity line at the highest α_tank values, because the Student-t (ν = 2) distribution has heavier tails than the Gaussian and therefore shrinks the extreme tanks less. Comparing the α_tank estimates between the Student-t and Cauchy priors, however, the Student-t values are lower at high α_tank, indicating that while the Student-t tail is heavier than the Gaussian's, it is still lighter than the Cauchy's. Overall, WAIC is quite similar between the three models, with the Gaussian intercept prior having the lowest WAIC and hence being the (weakly) preferred model.
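
# The WAIC comparison referred to above can be generated with (sketch; table not shown):
compare(msT, msT_c, msT_S)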

13-5. Sometimes the prior and the data (through the likelihood) are in conflict, because they concentrate around different regions of parameter space. What happens in these cases depends a lot upon the shape of the tails of the distributions. Likewise, the tails of the distributions strongly influence whether outliers are shrunk towards the mean or not. I want you to consider four different models fit to a single observation at y = 0; this is your data, do not use any other data set. The models differ only in the distributions assigned to the likelihood and prior. Here are the four models:

\[\begin{align} \text{Model NN:} \quad y &\sim \text{Normal}(\mu, 1) & \text{Model TN:} \quad y &\sim \text{Student}(2, \mu, 1) \\ \mu &\sim \text{Normal}(10, 1) & \mu &\sim \text{Normal}(10, 1) \\ \text{Model NT:} \quad y &\sim \text{Normal}(\mu, 1) & \text{Model TT:} \quad y &\sim \text{Student}(2, \mu, 1) \\ \mu &\sim \text{Student}(2, 10, 1) & \mu &\sim \text{Student}(2, 10, 1) \end{align}\]

Estimate and plot the posterior distributions against the likelihoods for these models and compare them. Can you explain the results, using the properties of the distributions?
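
The question specifies a single observation, y = 0, as the data. A minimal sketch of that setup for model NN is below (dat_y and m_NN_sketch are illustrative names; the other three models swap in dstudent(2, ...) for the likelihood and/or the prior on mu):

dat_y <- list(y = 0)

# Model NN fit to the single observation y = 0 (sketch)
m_NN_sketch <- ulam(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(10, 1)
  ),
  data = dat_y, chains = 4, cores = 4, iter = 2e3
)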

# The fits below use the chimpanzees pulled_left outcome as the data
data(chimpanzees)
dat_c <- chimpanzees
dat_c$recipient <- NULL        # drop the column containing NAs
dat_c$block_id <- dat_c$block  # block is a reserved word in Stan; copy under a new name

# Model NN
m_NN <- map2stan( 
  alist(
    pulled_left ~ dnorm(mu, 1),
    mu ~ dnorm(10, 1)
  ), 
  data = dat_c, iter = 2000, chains = 1, cores = 1 
)
precis(m_NN)
##         mean         sd      5.5%   94.5%    n_eff     Rhat4
## mu 0.5954723 0.04404625 0.5288458 0.66796 335.5032 0.9990137
m_TN <- map2stan( 
  alist(
    pulled_left ~ dstudent(2, mu, 1),
    mu ~ dnorm(10, 1)
  ), 
  data = dat_c, iter = 2000, chains = 1, cores = 1 
)
precis(m_TN)
##         mean         sd      5.5%     94.5%    n_eff    Rhat4
## mu 0.6134084 0.04056785 0.5484005 0.6767299 427.7117 1.001046
m_NT <- map2stan( 
  alist(
    pulled_left ~ dnorm(mu, 1),
    mu ~ dstudent(2, 10, 1)
  ), 
  data = dat_c, iter = 2000, chains = 1, cores = 1 
)
precis(m_NT)
##         mean         sd     5.5%     94.5%    n_eff     Rhat4
## mu 0.5762338 0.04179719 0.507003 0.6414666 510.0547 0.9990828
m_TT <- map2stan( 
  alist(
    pulled_left ~ dstudent(2, mu, 1),
    mu ~ dstudent(2, 10, 1)
  ), 
  data = dat_c, iter = 2000, chains = 1, cores = 1 
)
precis(m_TT)
##         mean         sd     5.5%     94.5%    n_eff    Rhat4
## mu 0.6014454 0.04793786 0.526307 0.6780317 269.1785 1.012143
mu_NN <- extract.samples(m_NN)
mu_TN <- extract.samples(m_TN)
mu_NT <- extract.samples(m_NT)
mu_TT <- extract.samples(m_TT)

dens(mu_NN$mu, col="blue")
dens(mu_TN$mu, col="red", add = TRUE)
dens(mu_NT$mu, col="green", add=TRUE)
dens(mu_TT$mu, col="purple", add=TRUE)
abline(v=0.6, col = "grey")
legend("topright",
       col = c("blue","red", "green", "purple"),
       pch = 19,
       legend = c("Model NN","Model TN", "Model NT", "Model TT"))

# Based on this posterior plot, all four posteriors have a roughly Gaussian shape. However, the models that use a Student-t distribution as the prior on mu (model NT and model TT) have thicker tails, which affects how strongly the posterior is shrunk toward the prior. The models with a Normal prior distribution (model NN and model TN) are less likely to show deviations at higher values of mu.