Chapter 14 - Adventures in Covariance

This chapter extended the basic multilevel strategy of partial pooling to slopes as well as intercepts. Accomplishing this meant modeling covariation in the statistical population of parameters. The LKJcorr prior was introduced as a convenient family of priors for correlation matrices. You saw how covariance models can be applied to causal inference, using instrumental variables and the front-door criterion. Gaussian processes represent a practical method of extending the varying effects strategy to continuous dimensions of similarity, such as spatial, network, phylogenetic, or any other abstract distance between entities in the data.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

14E1. Add to the following model varying slopes on the predictor x. \[\begin{align} y_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{group[i]} + \beta x_i \\ \alpha_{group} &\sim \text{Normal}(\alpha, \sigma_\alpha) \\ \alpha &\sim \text{Normal}(0, 10) \\ \beta &\sim \text{Normal}(0, 1) \\ \sigma &\sim \text{Exponential}(1) \\ \sigma_\alpha &\sim \text{Exponential}(1) \end{align}\]

# We start by giving β a group index, so the slope varies by group:
# μ_i = α_{group[i]} + β_{group[i]} x_i
# Next, we change the priors. Since α and β are now the population means of the
# varying intercepts and slopes, α_{group} and β_{group} get a joint
# multivariate normal prior with a covariance matrix:
# [α_{group}, β_{group}] ~ MVNormal([α, β], S)
# S = diag(σ_α, σ_β) R diag(σ_α, σ_β)
# The priors for the standard deviations are:
# σ_α ~ Exponential(1)
# σ_β ~ Exponential(1)
# The prior for the correlation matrix R is R ~ LKJcorr(2).
# The other priors stay as they are:
# α ~ Normal(0, 10)
# β ~ Normal(0, 1)
# σ ~ Exponential(1)
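# As a minimal ulam() sketch of the full model (assuming a data frame d with
# columns y, x, and group; the variable names are illustrative, not given in
# the problem):
m_14E1 <- ulam(
  alist(
    y ~ normal(mu, sigma),
    mu <- a_group[group] + b_group[group] * x,
    c(a_group, b_group)[group] ~ multi_normal(c(a, b), Rho, sigma_group),
    a ~ normal(0, 10),
    b ~ normal(0, 1),
    sigma_group ~ exponential(1),
    sigma ~ exponential(1),
    Rho ~ lkj_corr(2)
  ),
  data = d, chains = 4, cores = 4
)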

14E2. Think up a context in which varying intercepts will be positively correlated with varying slopes. Provide a mechanistic explanation for the correlation.

# Consider wait times at restaurants, with varying intercepts for the average
# lunch wait and varying slopes for the additional wait at dinner. Popular
# restaurants are busy at lunch (large intercepts) and even busier at dinner
# (large positive slopes), while unpopular ones have short waits at both times.
# Popularity drives both parameters at once, so the intercepts and slopes end
# up positively correlated across restaurants.

14E3. When is it possible for a varying slopes model to have fewer effective parameters (as estimated by WAIC or PSIS) than the corresponding model with fixed (unpooled) slopes? Explain.

# It is possible when the clusters vary little, so the posterior for the
# variance components concentrates near zero. That induces strong shrinkage
# (aggressive pooling) of the varying slopes toward the population mean,
# making them much less flexible than unpooled slopes. So even though the
# varying slopes model has more actual parameters, its effective number of
# parameters (pWAIC or pPSIS) can be smaller than that of the unpooled model.

14M1. Repeat the café robot simulation from the beginning of the chapter. This time, set rho to zero, so that there is no correlation between intercepts and slopes. Plot the posterior distribution. How does the posterior distribution of the correlation reflect this change in the underlying simulation?

# Let's first set up the data from the book, but change rho to 0
library(rethinking) # for ulam, extract.samples, precis, dens
library(MASS)       # for mvrnorm
a <- 3.5 # average morning wait time
b <- (-1) # average difference afternoon wait time
sigma_a <- 1 # std dev in intercepts
sigma_b <- 0.5 # std dev in slopes
rho <- 0 # correlation between intercepts and slopes
Mu <- c(a, b)
cov_ab <- sigma_a * sigma_b * rho
Sigma <- matrix(c(sigma_a^2, cov_ab, cov_ab, sigma_b^2), ncol = 2)

N_cafes <- 20
set.seed(6)
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)
a_cafe <- vary_effects[, 1]
b_cafe <- vary_effects[, 2]
N_visits <- 10
afternoon <- rep(0:1, N_visits * N_cafes / 2)
cafe_id <- rep(1:N_cafes, each = N_visits)
mu <- a_cafe[cafe_id] + b_cafe[cafe_id] * afternoon
sigma <- 0.5 # std dev within cafes
wait <- rnorm(N_visits * N_cafes, mu, sigma)
d <- data.frame(cafe = cafe_id, afternoon = afternoon, wait = wait)

# Now we run the model
m_14M1 <- ulam(
  alist(
    wait ~ normal(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe] * afternoon,
    c(a_cafe, b_cafe)[cafe] ~ multi_normal(c(a, b), Rho, sigma_cafe),
    a ~ normal(5, 2),
    b ~ normal(-1, 0.5),
    sigma_cafe ~ exponential(1),
    sigma ~ exponential(1),
    Rho ~ lkj_corr(2)
  ),
  data = d, chains = 6, cores = 6
)

# Let's plot the rho
post <- extract.samples(m_14M1)
precis(m_14M1, pars = "Rho[1,2]")
##              result
## mean     0.01110476
## sd       0.23468860
## 5.5%    -0.36400650
## 94.5%    0.38982346
## n_eff 2833.41771477
## Rhat     1.00023118
dens(post$Rho[,1,2])

# The posterior for the correlation is centered near zero (mean ≈ 0.01) and is
# noticeably narrower than the LKJcorr(2) prior, consistent with the simulated
# rho = 0: the model correctly learned that intercepts and slopes are
# uncorrelated.
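# To see how much was learned, we can overlay draws from the LKJcorr(2) prior
# for the correlation (a minimal sketch using rethinking's rlkjcorr()):
R_prior <- rlkjcorr(4000, K = 2, eta = 2)
dens(R_prior[, 1, 2], lty = 2, add = TRUE)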

14M2. Fit this multilevel model to the simulated café data: \[\begin{align} W_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{CAFE[i]} + \beta_{CAFE[i]} A_i \\ \alpha_{CAFE} &\sim \text{Normal}(\alpha, \sigma_\alpha) \\ \beta_{CAFE} &\sim \text{Normal}(\beta, \sigma_\beta) \\ \alpha &\sim \text{Normal}(0, 10) \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma, \sigma_\alpha, \sigma_\beta &\sim \text{Exponential}(1) \end{align}\] Use WAIC to compare this model to the model from the chapter, the one that uses a multivariate Gaussian prior. Visualize the differences between the posterior means. Explain the result.

# Let's first set up the data from the book
a <- 3.5 # average morning wait time
b <- -1 # average difference afternoon wait time
sigma_a <- 1 # std dev in intercepts
sigma_b <- 0.5 # std dev in slopes
rho <- -0.7 # correlation between intercepts and slopes
Mu <- c(a, b)
cov_ab <- sigma_a * sigma_b * rho
Sigma <- matrix(c(sigma_a^2, cov_ab, cov_ab, sigma_b^2), ncol = 2)
# simulate observations
N_cafes <- 20
set.seed(42)
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)
a_cafe <- vary_effects[, 1]
b_cafe <- vary_effects[, 2]
N_visits <- 10
afternoon <- rep(0:1, N_visits * N_cafes / 2)
cafe_id <- rep(1:N_cafes, each = N_visits)
mu <- a_cafe[cafe_id] + b_cafe[cafe_id] * afternoon
sigma <- 0.5 # std dev within cafes
wait <- rnorm(N_visits * N_cafes, mu, sigma)
d <- data.frame(cafe = cafe_id, afternoon = afternoon, wait = wait)

# We now run the model from the book
m_14M2_orig <- ulam(
  alist(
    wait ~ normal(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe] * afternoon,
    c(a_cafe, b_cafe)[cafe] ~ multi_normal(c(a, b), Rho, sigma_cafe),
    a ~ normal(5, 2),
    b ~ normal(-1, 0.5),
    sigma_cafe ~ exponential(1),
    sigma ~ exponential(1),
    Rho ~ lkj_corr(2)
  ),
  data = d, chains = 4, cores = 4, log_lik = TRUE
)

# And then with our multilevel model
m_14M2_multi <- ulam(
  alist(
    wait ~ dnorm(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe] * afternoon,
    a_cafe[cafe] ~ dnorm(a, sigma_alpha),
    b_cafe[cafe] ~ dnorm(b, sigma_beta),
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),
    sigma ~ dexp(1),
    sigma_alpha ~ dexp(1),
    sigma_beta ~ dexp(1)
  ),
  data = d, chains = 4, cores = 4, log_lik = TRUE
)
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
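# The question asks for a WAIC comparison; with log_lik = TRUE set above we can
# run it (output omitted here; the scores will vary with the simulation seed):
compare(m_14M2_orig, m_14M2_multi)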
# Next, we visualize the differences between the posterior means of the varying effects
post_orig <- extract.samples(m_14M2_orig)
a_orig <- apply(post_orig$a_cafe, 2, mean)
b_orig <- apply(post_orig$b_cafe, 2, mean)

post_multi <- extract.samples(m_14M2_multi)
a_multi <- apply(post_multi$a_cafe, 2, mean)
b_multi <- apply(post_multi$b_cafe, 2, mean)
plot(a_multi, b_multi,
  xlab = "intercept", ylab = "slope",
  pch = 16, col = "PURPLE", ylim = c(min(b_multi) - 0.05, max(b_multi) + 0.05),
  xlim = c(min(a_multi) - 0.1, max(a_multi) + 0.1), cex = 2
)
points(a_orig, b_orig, pch = 1, cex = 2)

# The purple points (independent-priors model) and the open circles
# (multivariate model) nearly coincide near the center of the ranges, and
# diverge most toward the fringes of the intercept and slope ranges.
# The data were simulated with rho = -0.7, so large intercepts go with strongly
# negative slopes. The multivariate Gaussian prior learns this correlation and
# uses it to pool information between intercepts and slopes, shrinking the most
# extreme estimates toward the implied regression line. The independent-priors
# model cannot share information across the two dimensions, which is why the
# two sets of estimates differ most for the most extreme cafés.

14M3. Re-estimate the varying slopes model for the UCBadmit data, now using a non-centered parameterization. Compare the efficiency of the forms of the model, using n_eff. Which is better? Which chain sampled faster?

# Let's first load the data
data(UCBadmit)
d <- UCBadmit
d$male <- ifelse(d$applicant.gender=="male", 1, 0)
d$dept_id <- coerce_index(d$dept)

# Now we run the original (centered) model
Begin_orig <- Sys.time()
m_14M3_orig <- map2stan(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a_dept[dept_id] + bm_dept[dept_id]*male,
    c(a_dept, bm_dept)[dept_id] ~ dmvnorm2(c(a, bm), sigma_dept, Rho),
    a ~ dnorm(0, 10),
    bm ~ dnorm(0, 1),
    sigma_dept ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)
  ),
  data=d, warmup=1000, iter=5000, chains=4, cores=3
)
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## Warning: There were 5 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning in map2stan(alist(admit ~ dbinom(applications, p), logit(p) <- a_dept[dept_id] + : There were 5 divergent iterations during sampling.
## Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.
End_orig <- Sys.time()

# Then run the non-centered model
Begin_non_centered <- Sys.time()
m_14M3_non_centered <- map2stan(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a_dept[dept_id] + bm_dept[dept_id]*male,
    c(a_dept, bm_dept)[dept_id] ~ dmvnormNC( sigma_dept, Rho ),
    a ~ dnorm(0, 10),
    bm ~ dnorm(0, 1),
    sigma_dept ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)
  ),
  data=d, warmup=1000, iter=5000, chains=4, cores=3
)
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
End_non_centered <- Sys.time()

# We can now compare the two models
neff_c <- precis(m_14M3_orig, 2)$n_eff
neff_nc <- precis(m_14M3_non_centered, 2)$n_eff
boxplot(
  list(
    'm_14M3_orig'=neff_c ,
    'm_14M3_non_centered'=neff_nc
  ), 
  ylab="effective samples",
  xlab="model"
)

# As the boxplot shows, the non-centered model achieves higher effective sample sizes across the parameters, so it samples more efficiently.

# Now we check the running time
End_orig - Begin_orig
## Time difference of 1.333206 mins
End_non_centered - Begin_non_centered
## Time difference of 1.255059 mins
# The non-centered model also sampled slightly faster.

14M4. Use WAIC to compare the Gaussian process model of Oceanic tools to the models fit to the same data in Chapter 11. Pay special attention to the effective numbers of parameters, as estimated by WAIC.

# Let's first load the data
# data in Chapter 11
data(Kline)
d <- Kline
d$pop <- scale(log(d$population))
d$contact_id <- ifelse(d$contact == "high", 2, 1)
dat_11 <- list(total_tools = d$total_tools, pop = d$population, cid = d$contact_id)

# data in Chapter 14
d$society <- 1:10
data(islandsDistMatrix)
dat_14 <- list(total_tools = d$total_tools, pop = d$population, society = d$society, Dat_matrix = islandsDistMatrix)

# Now we build the model
m_14M4_11 <- ulam(
  alist(
    total_tools ~ dpois(lambda),
    lambda <- exp(a[cid]) * pop^b[cid] / g,
    a[cid] ~ dnorm(1, 1),
    b[cid] ~ dexp(1),
    g ~ dexp(1)
  ),
  data = dat_11, chains = 4, cores = 4, log_lik = TRUE
)

m_14M4_14 <- ulam(
  alist(
    total_tools ~ dpois(lambda),
    lambda <- (a * pop^b / g) * exp(k[society]),
    vector[10]:k ~ multi_normal(0, sigma),
    matrix[10, 10]:sigma <- cov_GPL2(Dat_matrix, etasq, rhosq, 0.01),
    c(a, b, g) ~ dexp(1),
    etasq ~ dexp(2),
    rhosq ~ dexp(0.5)
  ),
  data = dat_14, chains = 4, cores = 4, iter = 2000, log_lik = TRUE
)

# Now we can compare the two models
compare(m_14M4_11, m_14M4_14)
##               WAIC        SE    dWAIC     dSE    pWAIC      weight
## m_14M4_14 67.50015  2.264128  0.00000      NA 4.043478 0.998442654
## m_14M4_11 80.42657 11.203818 12.92643 11.1982 4.997416 0.001557346
# The chapter 14 model performs better than the chapter 11 model (lower WAIC
# and nearly all the Akaike weight), because it accounts for the spatial
# distances between societies. It also has a lower pWAIC (about 4.0 vs 5.0):
# despite having more actual parameters, the Gaussian process model is more
# strongly regularized, so its effective number of parameters is smaller.
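# To see where the regularization acts, we can inspect the pointwise WAIC
# penalties, one per society (a sketch, assuming rethinking's WAIC() with
# pointwise = TRUE returns a penalty column; output not shown):
WAIC(m_14M4_14, pointwise = TRUE)$penalty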

14M5. Modify the phylogenetic distance example to use group size as the outcome and brain size as a predictor. Plot the estimates. Assuming brain size influences group size, what is your estimate of the effect? How does phylogeny influence the estimate?

# First, let's load the data
# observational data
data(Primates301)
d <- Primates301
d$name <- as.character(d$name)
dstan <- d[complete.cases(d$group_size, d$body, d$brain), ]
spp_obs <- dstan$name
dat_list <- list(
  N_spp = nrow(dstan),
  M = standardize(log(dstan$body)),
  B = standardize(log(dstan$brain)),
  G = standardize(log(dstan$group_size)),
  Imat = diag(nrow(dstan))
)
# phylogenetic data (keep.tip, corBrownian, vcv, and cophenetic come from the ape package)
library(ape)
data(Primates301_nex)
tree_trimmed <- keep.tip(Primates301_nex, spp_obs) 
Rbm <- corBrownian(phy = tree_trimmed) 
V <- vcv(Rbm)
## Warning in Initialize.corPhyl(phy, dummy.df): No covariate specified, species
## will be taken as ordered in the data frame. To avoid this message, specify a
## covariate containing the species names with the 'form' argument.
Dmat <- cophenetic(tree_trimmed)
dat_list$V <- V[spp_obs, spp_obs]
dat_list$R <- dat_list$V / max(V)

# Now we can build the models
# original one
m_14M5_orig <- ulam(
  alist(
    G ~ multi_normal(mu, SIGMA),
    mu <- a + bM * M + bB * B,
    matrix[N_spp, N_spp]:SIGMA <- Imat * sigma_sq,
    a ~ normal(0, 1),
    c(bM, bB) ~ normal(0, 0.5),
    sigma_sq ~ exponential(1)
  ),
  data = dat_list, chains = 4, cores = 4
)

# Brownian motion
m_14M5_brownian <- ulam(
  alist(
    G ~ multi_normal(mu, SIGMA),
    mu <- a + bM * M + bB * B,
    matrix[N_spp, N_spp]:SIGMA <- R * sigma_sq,
    a ~ normal(0, 1),
    c(bM, bB) ~ normal(0, 0.5),
    sigma_sq ~ exponential(1)
  ),
  data = dat_list, chains = 4, cores = 4
)

# Gaussian process
dat_list$Dmat <- Dmat[spp_obs, spp_obs] / max(Dmat)
m_14M5_gaussian <- ulam(
  alist(
    G ~ multi_normal(mu, SIGMA),
    mu <- a + bM * M + bB * B,
    matrix[N_spp, N_spp]:SIGMA <- cov_GPL1(Dmat, etasq, rhosq, 0.01),
    a ~ normal(0, 1),
    c(bM, bB) ~ normal(0, 0.5),
    etasq ~ half_normal(1, 0.25),
    rhosq ~ half_normal(3, 0.25)
  ),
  data = dat_list, chains = 4, cores = 4
)

# Now plot the coefficient estimates from the three models
plot(coeftab(m_14M5_orig, m_14M5_brownian, m_14M5_gaussian), pars = c("bM", "bB"))

# From the plot, the ordinary regression (no phylogeny) estimates a clearly
# positive effect of brain size (bB) on group size. In the two models that
# account for phylogeny, the estimate shrinks toward zero and its interval
# overlaps zero.
# Phylogeny therefore weakens the apparent link between brain size and group
# size: much of the ordinary association is explained by closely related
# species resembling one another.
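# We can also look at the covariance function the Gaussian process inferred
# over phylogenetic distance (a sketch; cov_GPL1 implies K = etasq * exp(-rhosq * d)):
post <- extract.samples(m_14M5_gaussian)
plot(NULL, xlim = c(0, max(dat_list$Dmat)), ylim = c(0, 1.5),
  xlab = "phylogenetic distance", ylab = "covariance")
for (i in 1:30)
  curve(post$etasq[i] * exp(-post$rhosq[i] * x),
    add = TRUE, col = col.alpha("black", 0.3))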

14H1. Let’s revisit the Bangladesh fertility data, data(bangladesh), from the practice problems for Chapter 13. Fit a model with both varying intercepts by district_id and varying slopes of urban by district_id. You are still predicting use.contraception. Inspect the correlation between the intercepts and slopes. Can you interpret this correlation, in terms of what it tells you about the pattern of contraceptive use in the sample? It might help to plot the mean (or median) varying effect estimates for both the intercepts and slopes, by district. Then you can visualize the correlation and maybe more easily think through what it means to have a particular correlation. Plotting predicted proportion of women using contraception, with urban women on one axis and rural on the other, might also help.

# Let's first load the data
data(bangladesh)
d <- bangladesh
dat_list <- list(
  C = d$use.contraception,
  did = as.integer(as.factor(d$district)),
  urban = d$urban
)

# Now run the model
m_14H1 <- ulam(
  alist(
    C ~ bernoulli(p),
    logit(p) <- a[did] + b[did] * urban,
    c(a, b)[did] ~ multi_normal(c(abar, bbar), Rho, Sigma),
    abar ~ normal(0, 1),
    bbar ~ normal(0, 0.5),
    Rho ~ lkj_corr(2),
    Sigma ~ exponential(1)
  ),
  data = dat_list, chains = 4, cores = 4, iter = 4000
)

precis(m_14H1)
##            mean         sd       5.5%      94.5%    n_eff     Rhat4
## abar -0.6843872 0.09844195 -0.8408878 -0.5291564 5657.661 1.0004183
## bbar  0.6373952 0.16145100  0.3799091  0.8957142 4241.404 0.9998665
# Looking at the parameters: bbar is reliably positive, indicating that on
# average contraceptive use is higher among urban women than rural women.

precis(m_14H1, depth = 3, pars = c("Rho", "Sigma"))
##                mean           sd       5.5%      94.5%     n_eff     Rhat4
## Rho[1,1]  1.0000000 0.000000e+00  1.0000000  1.0000000       NaN       NaN
## Rho[1,2] -0.6607707 1.652459e-01 -0.8716759 -0.3594567 1269.3871 1.0020087
## Rho[2,1] -0.6607707 1.652459e-01 -0.8716759 -0.3594567 1269.3871 1.0020087
## Rho[2,2]  1.0000000 6.054658e-17  1.0000000  1.0000000 7612.9507 0.9994999
## Sigma[1]  0.5790704 9.962362e-02  0.4312533  0.7511908 2093.2002 1.0010980
## Sigma[2]  0.7891056 2.070562e-01  0.4756492  1.1289736  811.9034 1.0022107
# Rho[1,2] (= Rho[2,1]) shows a substantial negative correlation (about -0.66)
# between intercepts and slopes: districts with higher rural use tend to have
# smaller urban effects.

# Let's plot the predicted proportions of women using contraception, urban vs. rural, by district
post <- extract.samples(m_14H1)
a <- apply(post$a, 2, mean)
b <- apply(post$b, 2, mean)
non_urban <- inv_logit(a)
urban <- inv_logit(a + b)
plot(non_urban, urban, xlim = c(0, 1), ylim = c(0, 1), xlab = "non urban", ylab = "urban")
abline(h = 0.5, lty = 2)
abline(v = 0.5, lty = 2)

# From the plot, predicted contraceptive use is higher among urban women than
# rural women in most districts. The districts with the highest rural use also
# tend to show the smallest urban-rural gaps, reflecting the negative
# correlation between intercepts and slopes.
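# The question also suggests plotting the varying effects themselves; the
# negative correlation between intercepts and slopes is visible directly
# (a minimal sketch):
plot(a, b, pch = 16,
  xlab = "intercept (rural log-odds)", ylab = "slope (urban effect)")
abline(lm(b ~ a), lty = 2)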