Chapter 14 - Adventures in Covariance

This chapter extended the basic multilevel strategy of partial pooling to slopes as well as intercepts. Accomplishing this meant modeling covariation in the statistical population of parameters. The LKJcorr prior was introduced as a convenient family of priors for correlation matrices. You saw how covariance models can be applied to causal inference, using instrumental variables and the front-door criterion. Gaussian processes represent a practical method of extending the varying effects strategy to continuous dimensions of similarity, such as spatial, network, phylogenetic, or any other abstract distance between entities in the data.

Place each answer inside a code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Load packages:

library(MASS)
library(ellipse)
library(rethinking)
library(magrittr)
library(tidyverse)
library(brms)
library(ggplot2)
library(tidybayes)
library(rstan)
library(ape)
set.seed(5)

Questions

14E1. Add to the following model varying slopes on the predictor x. \[\begin{align} y_i &\sim Normal(\mu_i, \sigma) \\ \mu_i &= \alpha_{GROUP[i]} + \beta x_i \\ \alpha_{GROUP} &\sim Normal(\alpha, \sigma_{\alpha}) \\ \alpha &\sim Normal(0, 10) \\ \beta &\sim Normal(0, 1) \\ \sigma &\sim Exponential(1) \\ \sigma_{\alpha} &\sim Exponential(1) \end{align}\]

\[ y_i \sim Normal(\mu_i, \sigma)\\ \mu_i = \alpha_{GROUP[i]} + {\beta_{GROUP[i]}}x_i\\ { \begin{bmatrix} \alpha_{GROUP} \\ \beta_{GROUP} \end{bmatrix} } \sim Normal\left(\begin{bmatrix} \alpha \\ \beta \end{bmatrix} , S \right)\\ S = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} R \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix}\\ \alpha \sim Normal(0, 10)\\ \beta \sim Normal(0, 1)\\ \sigma \sim Exponential(1)\\ \sigma_{\alpha} \sim Exponential(1)\\ \sigma_{\beta} \sim Exponential(1)\\ R \sim LKJcorr(2) \]
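In rethinking's ulam syntax this corresponds to something like the sketch below (a hypothetical data frame dat with columns y, x, and an integer index group is assumed, so the call is shown commented out):

# minimal sketch, assuming dat has columns y, x, and integer index group
# m_14e1 <- ulam(
#   alist(
#     y ~ dnorm(mu, sigma),
#     mu <- a_group[group] + b_group[group]*x,
#     c(a_group, b_group)[group] ~ multi_normal(c(a, b), Rho, sigma_group),
#     a ~ dnorm(0, 10),
#     b ~ dnorm(0, 1),
#     sigma_group ~ dexp(1),
#     sigma ~ dexp(1),
#     Rho ~ lkj_corr(2)
#   ), data = dat, chains = 4)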

14E2. Think up a context in which varying intercepts will be positively correlated with varying slopes. Provide a mechanistic explanation for the correlation.

# Consider the cafes example: the intercept is the average morning wait and 
# the slope is the change in wait in the afternoon. Suppose popular cafes are 
# busy all day. A cafe with a long morning wait (large intercept) stays busy, 
# so its afternoon wait drops less (a larger, less negative slope). Because 
# popularity drives both parameters, the varying intercepts and varying 
# slopes end up positively correlated. The correlation measures the strength 
# and direction of the linear association between X and Y.

\[ \rho_{X, Y} = corr(X, Y) = \frac{cov(X, Y)}{\sigma_X\sigma_Y} \]
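A minimal simulation sketch of this mechanism (all numbers hypothetical): a single latent popularity variable raises both the morning wait and the afternoon wait, which induces a positive correlation between intercepts and slopes.

# hypothetical sketch: popularity drives both the intercept and the slope
set.seed(10)
popularity <- runif(50, 0, 1)                      # latent popularity per cafe
intercept <- 2 + 3*popularity + rnorm(50, 0, 0.5)  # morning wait grows with popularity
slope <- -2 + 1.5*popularity + rnorm(50, 0, 0.3)   # afternoon drop shrinks with popularity
cor(intercept, slope)                              # clearly positive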

14E3. When is it possible for a varying slopes model to have fewer effective parameters (as estimated by WAIC or PSIS) than the corresponding model with fixed (unpooled) slopes? Explain.

# When the adaptive prior shrinks the varying effects strongly toward the 
# population mean, the varying-slopes model can have fewer effective 
# parameters than the corresponding unpooled model: an aggressive 
# regularizing prior makes the posterior less flexible, and WAIC/PSIS count 
# that reduced flexibility as fewer effective parameters (Ref p.360). This 
# happens when there is little variation among clusters, so the estimated 
# standard deviations are small and shrinkage is strong.

# Reference p.407:
# "The index in each vector is the varying intercept standard deviation, 
# while the other indices are the slopes. While these are just posterior means, 
# and the amount of shrinkage averages over the entire posterior, you can get a 
# sense from the small values that shrinkage is pretty aggressive here."
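
# A sketch of how one would check this, assuming two fitted models
# m_varying (partial pooling) and m_fixed (unpooled) exist; illustration only:
# WAIC(m_varying)   # compare the penalty (pWAIC) columns --
# WAIC(m_fixed)     # strong shrinkage can give the pooled model fewer
#                   # effective parameters despite more actual parameters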

14M1. Repeat the café robot simulation from the beginning of the chapter. This time, set rho to zero, so that there is no correlation between intercepts and slopes. Plot the posterior distribution. How does the posterior distribution of the correlation reflect this change in the underlying simulation?

# define the entire population of cafes:
a <- 3.5        # average morning wait time
b <- (-1)       # average difference afternoon wait time
sigma_a <- 1    # std dev in intercepts
sigma_b <- 0.5   # std dev in slopes

# rho <- (-0.7)  # chapter's correlation between intercepts and slopes
rho <- 0         # set to zero, as the question requires
Mu <- c(a, b)   # vector of two means

sigmas <- c(sigma_a, sigma_b)
Rho <- matrix(c(1, rho, rho, 1), nrow = 2)

# now matrix multiply to get covariance matrix
Sigma <- diag(sigmas) %*% Rho %*% diag(sigmas)
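# sanity check: converting the covariance matrix back to a correlation
# matrix should recover Rho (base R's cov2cor)
cov2cor(Sigma)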
N_cafes <- 20   # number of cafes (shops) you simulated

library(MASS)
set.seed(5)

vary_effects <- mvrnorm(N_cafes, Mu, Sigma)
a_cafe <- vary_effects[,1]
b_cafe <- vary_effects[,2]

plot(a_cafe, b_cafe, col=rethinking::rangi2,
     xlab = "intercept (a_cafe)",
     ylab = "slopes (b_cafe)")

# overlay population distribution
for (l in c(0.1,0.3,0.5,0.8,0.99))
     lines(ellipse(Sigma,centre=Mu,level=l),
           col=rethinking::col.alpha("black",0.2))

# 14.1.2. Simulate observations:
N_visits <- 10
afternoon <- rep(0:1, N_visits*N_cafes/2)
cafe_id <- rep(1:N_cafes, each = N_visits)

mu <- a_cafe[cafe_id] + b_cafe[cafe_id]*afternoon
sigma <- .5   # std dev within cafes
wait <- rnorm(N_visits*N_cafes, mu, sigma)
d <- data.frame(cafe=cafe_id, afternoon=afternoon, wait=wait)
# visualize the LKJcorr(2) prior for the correlation
R <- rlkjcorr(1e4, K=2, eta=2)
dens(R[,1,2], xlab="correlation")
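# for comparison, overlay flatter and more concentrated LKJ priors
# (eta = 1 is uniform over correlations; larger eta concentrates mass near 0)
R1 <- rlkjcorr(1e4, K=2, eta=1)
R4 <- rlkjcorr(1e4, K=2, eta=4)
dens(R1[,1,2], add=TRUE, lty=2)
dens(R4[,1,2], add=TRUE, lty=3)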

# Model:
m14M1 <- map2stan(
  alist(
    wait ~ dnorm(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
    c(a_cafe, b_cafe)[cafe] ~ dmvnorm2(c(a, b), sigma_cafe, Rho),
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),
    sigma_cafe ~ dcauchy(0, 2),
    sigma ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)
  ),
  data = d,
  iter = 5000, warmup = 2000, chains = 2
)
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## 
## SAMPLING FOR MODEL 'a514cfb1c06c28bfa5702186eb846fd5' NOW (CHAIN 1).
## Chain 1:  Elapsed Time: 2.95124 seconds (Warm-up)
## Chain 1:                3.52647 seconds (Sampling)
## Chain 1:                6.47771 seconds (Total)
## 
## SAMPLING FOR MODEL 'a514cfb1c06c28bfa5702186eb846fd5' NOW (CHAIN 2).
## Chain 2:  Elapsed Time: 2.78382 seconds (Warm-up)
## Chain 2:                3.9037 seconds (Sampling)
## Chain 2:                6.68753 seconds (Total)
post <- extract.samples(m14M1)
dens(post$Rho[,1,2])
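# overlay the LKJcorr(2) prior (dashed), using the prior samples R drawn
# earlier, to see how much the data have updated the correlation
dens(R[,1,2], add=TRUE, lty=2)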

# Compared to the chapter's simulation (rho = -0.7), the posterior for the 
# correlation shifts to the right and now concentrates around zero to mildly 
# positive values. The model is no longer learning a negative correlation; 
# with rho = 0 in the simulation, any remaining lean away from zero is 
# sampling noise from only 20 cafes.

14M2. Fit this multilevel model to the simulated café data: \[\begin{align} W_i &\sim Normal(\mu_i, \sigma) \\ \mu_i &= \alpha_{CAFE[i]} + \beta_{CAFE[i]} A_i \\ \alpha_{CAFE} &\sim Normal(\alpha, \sigma_{\alpha}) \\ \beta_{CAFE} &\sim Normal(\beta, \sigma_{\beta}) \\ \alpha &\sim Normal(0, 10) \\ \beta &\sim Normal(0, 10) \\ \sigma, \sigma_{\alpha}, \sigma_{\beta} &\sim Exponential(1) \end{align}\] Use WAIC to compare this model to the model from the chapter, the one that uses a multivariate Gaussian prior. Visualize the differences between the posterior means. Explain the result.

# re-simulate the cafe data, now with the chapter's negative correlation
set.seed(5)
rho <- (-0.7)   # correlation between intercepts and slopes

Mu <- c(a, b)   # vector of two means
sigmas <- c(sigma_a, sigma_b)
Rho <- matrix(c(1, rho, rho, 1), nrow = 2)

# now matrix multiply to get covariance matrix
Sigma <- diag(sigmas) %*% Rho %*% diag(sigmas)
N_cafes <- 20   # number of cafes (shops) you simulated
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)
a_cafe <- vary_effects[,1]
b_cafe <- vary_effects[,2]

plot(a_cafe, b_cafe, col=rethinking::rangi2,
     xlab = "Intercept (a_cafe)",
     ylab = "Slopes (b_cafe)")

# overlay population distribution
for (l in c(0.1,0.3,0.5,0.8,0.99))
     lines(ellipse(Sigma,centre=Mu,level=l),
           col=rethinking::col.alpha("black",0.2))

# Simulate observations:
N_visits <- 10
afternoon <- rep(0:1, N_visits*N_cafes/2)
cafe_id <- rep(1:N_cafes, each = N_visits)
mu <- a_cafe[cafe_id] + b_cafe[cafe_id]*afternoon
sigma <- .5   # std dev within cafes
wait <- rnorm(N_visits*N_cafes, mu, sigma)
d <- data.frame(cafe=cafe_id, afternoon=afternoon, wait=wait)
R <- rlkjcorr(1e4, K=2, eta=2)
dens(R[,1,2], xlab="correlation")

# Model:
m14m1 <- map2stan(
  alist(
wait ~ dnorm(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
    c(a_cafe, b_cafe)[cafe] ~ dmvnorm2(c(a, b), sigma_cafe, Rho),
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),
    sigma_cafe ~ dcauchy(0, 2),
    sigma ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)
  ),
  data = d,
  iter = 5000, warmup = 2000, chains = 2)
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## 
## SAMPLING FOR MODEL 'f3774867bcc7a7fdd6aded82c0e8f02b' NOW (CHAIN 1).
## Chain 1:  Elapsed Time: 4.3003 seconds (Warm-up)
## Chain 1:                5.68875 seconds (Sampling)
## Chain 1:                9.98906 seconds (Total)
## 
## SAMPLING FOR MODEL 'f3774867bcc7a7fdd6aded82c0e8f02b' NOW (CHAIN 2).
## Chain 2:  Elapsed Time: 3.29972 seconds (Warm-up)
## Chain 2:                4.95001 seconds (Sampling)
## Chain 2:                8.24973 seconds (Total)
## Warning: There were 98 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## Warning in map2stan(alist(wait <- dnorm(mu, sigma), mu <- a_cafe[cafe] + : There were 98 divergent iterations during sampling.
## Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.
post <- extract.samples(m14m1)
dens(post$Rho[,1,2])

# Model
m14m2 <- map2stan(
  alist(
    wait ~ dnorm(mu , sigma ),
    mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
    a_cafe[cafe] ~ dnorm(a, sigma_a),
    b_cafe[cafe] ~ dnorm(b, sigma_b),
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 1),
    sigma_a ~ dcauchy(0, 1),
    sigma_b ~ dcauchy(0, 1)
  ),
  data=d, iter=5000, warmup=2000, chains=2)
## 
## SAMPLING FOR MODEL '710ebdd0ead2e29f7b4705ea4c71980d' NOW (CHAIN 1).
## Chain 1:  Elapsed Time: 1.12905 seconds (Warm-up)
## Chain 1:                1.19057 seconds (Sampling)
## Chain 1:                2.31961 seconds (Total)
## 
## SAMPLING FOR MODEL '710ebdd0ead2e29f7b4705ea4c71980d' NOW (CHAIN 2).
## Chain 2:  Elapsed Time: 1.37792 seconds (Warm-up)
## Chain 2:                5.47557 seconds (Sampling)
## Chain 2:                6.85349 seconds (Total)
## Warning: There were 60 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## Warning in map2stan(alist(wait ~ dnorm(mu, sigma), mu <- a_cafe[cafe] + : There were 60 divergent iterations during sampling.
## Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.
# compare models
compare(m14m1, m14m2)
##           WAIC       SE     dWAIC       dSE    pWAIC    weight
## m14m2 309.4202 20.03792 0.0000000        NA 26.70451 0.5501865
## m14m1 309.8231 20.24622 0.4028489 0.9006463 27.14013 0.4498135
posterior.samples.m14m1 <- extract.samples(m14m1)
a14M1 <- apply( X = posterior.samples.m14m1$a_cafe, MARGIN = 2, FUN = mean)
b14M1 <- apply( X = posterior.samples.m14m1$b_cafe, MARGIN = 2, FUN = mean)

posterior.samples.m14m2 <- extract.samples(m14m2)
a14M2 <- apply( X = posterior.samples.m14m2$a_cafe, MARGIN = 2, FUN = mean)
b14M2 <- apply( X = posterior.samples.m14m2$b_cafe, MARGIN = 2, FUN = mean)

plot(a14M1, b14M1, xlab="intercept", ylab="slope",
      pch=16, col=rangi2, ylim=c(min(b14M1)-0.05, max(b14M1)+0.05) ,
      xlim=c( min(a14M1)-0.1, max(a14M1)+0.1))

points(a14M2, b14M2, pch=1)
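# connect each cafe's paired posterior means to make the direction of the
# extra shrinkage from the multivariate prior visible
for (i in 1:N_cafes) {
  lines(c(a14M1[i], a14M2[i]), c(b14M1[i], b14M2[i]),
        col = col.alpha("black", 0.3))
}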

# WAIC is essentially identical for the two models, with a very slight edge 
# for m14m2, so out-of-sample prediction cannot distinguish them here. The 
# data-generating process included a negative correlation between intercepts 
# and slopes, and the multivariate prior in m14m1 exploits it: for cafes with 
# above-average intercepts (right of center on the x-axis), the blue points 
# (posterior means from the correlated model m14m1) are pulled toward smaller 
# slopes than the corresponding open points from m14m2, while to the left of 
# center the blue slopes are pulled upward. Pooling across parameters via the 
# estimated correlation produces this extra shrinkage along the negative 
# diagonal.

14M3. Re-estimate the varying slopes model for the UCBadmit data, now using a non-centered parameterization. Compare the efficiency of the forms of the model, using n_eff. Which is better? Which chain sampled faster?

data(UCBadmit)
d <- UCBadmit

d$male <- ifelse( d$applicant.gender=="male" , 1 , 0 )
d$dept_id <- coerce_index( d$dept )

start_centered_time <- Sys.time()

m13M3 <- map2stan(
  alist(
    admit ~ dbinom( applications, p ),
    logit(p) <- a_dept[dept_id] +
                bm_dept[dept_id]*male,
    c(a_dept, bm_dept)[dept_id] ~ dmvnorm2( c(a, bm), sigma_dept, Rho ),
    a ~ dnorm(0, 10),
    bm ~ dnorm(0, 1),
    sigma_dept ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)
  ),
  data=d , warmup=1000 , iter=5000 , chains=4 , cores=3 )
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## Warning: There were 26 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning in map2stan(alist(admit ~ dbinom(applications, p), logit(p) <- a_dept[dept_id] + : There were 26 divergent iterations during sampling.
## Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.
end_centered_time <- Sys.time()

start_non_centered_time <- Sys.time()
m13M3.noncentered <- map2stan(
  alist(
    admit ~ dbinom( applications, p ),
    logit(p) <- a_dept[dept_id] + bm_dept[dept_id]*male,
    c(a_dept, bm_dept)[dept_id] ~ dmvnormNC( sigma_dept, Rho ),
    a ~ dnorm(0, 10),
    bm ~ dnorm(0, 1),
    sigma_dept ~ dcauchy(0, 2),
    Rho ~ dlkjcorr(2)
  ),
  data=d , warmup=1000 , iter=5000 , chains=4 , cores=3 )
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## Warning in if (class(start[[1]][[i]]) == "matrix") {: the condition has length >
## 1 and only the first element will be used
## Warning: There were 1 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning in map2stan(alist(admit ~ dbinom(applications, p), logit(p) <- a_dept[dept_id] + : There were 1 divergent iterations during sampling.
## Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.
end_non_centered_time <- Sys.time()

# compare centered and non-centered models
compare(m13M3, m13M3.noncentered)
##                       WAIC       SE     dWAIC      dSE    pWAIC    weight
## m13M3             5190.968 57.25702 0.0000000       NA 11.09442 0.5287095
## m13M3.noncentered 5191.198 57.11395 0.2299291 1.027092 10.83960 0.4712905
# extract n_eff values for each model
df_c <- precis(m13M3, 2)
df_nc <- precis(m13M3.noncentered, 2)

neff_c <- df_c$n_eff
neff_nc <- df_nc$n_eff

# plot distributions
boxplot( list( 'm13M3'=neff_c , 'm13M3.noncentered'=neff_nc ) , ylab="Effective Samples" , xlab="Model" )

# Running time for the centered model
end_centered_time - start_centered_time
## Time difference of 1.144757 mins
# Running time for the non-centered model
end_non_centered_time - start_non_centered_time
## Time difference of 59.43793 secs
# The two parameterizations are the same model, so WAIC and the model weights 
# are essentially identical. The non-centered form, however, sampled more 
# efficiently: higher n_eff across parameters (see the boxplot), only 1 
# divergent transition versus 26, and a slightly faster run (~59 s vs ~69 s).

14M4. Use WAIC to compare the Gaussian process model of Oceanic tools to the models fit to the same data in Chapter 11. Pay special attention to the effective numbers of parameters, as estimated by WAIC.

# Load the data from Chapter 11
data(Kline)
d <- Kline

d$contact_id <- ifelse(d$contact == "high", 2, 1)
# the scientific model uses raw population, as in Chapter 11
dat_11 <- list(total_tools = d$total_tools, pop = d$population, cid = d$contact_id)

# Load the data from Chapter 14
d$society <- 1:10
data(islandsDistMatrix)
dat_14 <- list(total_tools = d$total_tools, pop = d$population, society = d$society, Dat_matrix = islandsDistMatrix)

# Build the model
m_14M4_11 <- ulam(
  alist(
    total_tools ~ dpois(lambda),
    lambda <- exp(a[cid]) * pop^b[cid] / g,
    a[cid] ~ dnorm(1, 1),
    b[cid] ~ dexp(1),
    g ~ dexp(1)
  ),
  data = dat_11, chains = 4, cores = 4, log_lik = TRUE
)

m_14M4_14 <- ulam(
  alist(
    total_tools ~ dpois(lambda),
    lambda <- (a * pop^b / g) * exp(k[society]),
    vector[10]:k ~ multi_normal(0, sigma),
    matrix[10, 10]:sigma <- cov_GPL2(Dat_matrix, etasq, rhosq, 0.01),
    c(a, b, g) ~ dexp(1),
    etasq ~ dexp(2),
    rhosq ~ dexp(0.5)
  ),
  data = dat_14, chains = 4, cores = 4, iter = 2000, log_lik = TRUE
)

# Compare the two models
compare(m_14M4_11, m_14M4_14)
##               WAIC        SE    dWAIC      dSE    pWAIC     weight
## m_14M4_14 67.40214  2.281822  0.00000       NA 4.025927 0.99818841
## m_14M4_11 80.02562 11.239259 12.62347 11.25552 4.888344 0.00181159
# The Gaussian process model from Chapter 14 has the lower WAIC and takes 
# nearly all of the model weight, largely because it accounts for the spatial 
# distances between societies. Note the effective number of parameters: pWAIC 
# is about 4.0 for the GP model versus about 4.9 for the Chapter 11 model, 
# even though the GP model has many more actual parameters (one varying 
# effect per society plus the covariance parameters). The varying effects are 
# pooled so strongly through the spatial covariance that the model ends up 
# less flexible, i.e., more heavily regularized.
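
# To see where WAIC's penalty comes from, one can inspect the pointwise
# values (one row per society):
WAIC(m_14M4_14, pointwise = TRUE)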

14M5. Modify the phylogenetic distance example to use group size as the outcome and brain size as a predictor. Plot the estimates. Assuming brain size influences group size, what is your estimate of the effect? How does phylogeny influence the estimate?

# observational data
data(Primates301)
d <- Primates301
d$name <- as.character(d$name)
dstan <- d[complete.cases(d$group_size, d$body, d$brain), ]

spp_obs <- dstan$name
dat_list <- list(
  N_spp = nrow(dstan),
  M = standardize(log(dstan$body)),
  B = standardize(log(dstan$brain)),
  G = standardize(log(dstan$group_size)),
  Imat = diag(nrow(dstan))
)

# considering only the phylogenetic data
data(Primates301_nex)
tree_trimmed <- keep.tip(Primates301_nex, spp_obs) 
Rbm <- corBrownian(phy = tree_trimmed) 
V <- vcv(Rbm)
## Warning in Initialize.corPhyl(phy, dummy.df): No covariate specified, species
## will be taken as ordered in the data frame. To avoid this message, specify a
## covariate containing the species names with the 'form' argument.
Dmat <- cophenetic(tree_trimmed)
dat_list$V <- V[spp_obs, spp_obs]
dat_list$R <- dat_list$V / max(V)

# Now we can build the models
# original one
m_14M5_orig <- ulam(
  alist(
    G ~ multi_normal(mu, SIGMA),
    mu <- a + bM * M + bB * B,
    matrix[N_spp, N_spp]:SIGMA <- Imat * sigma_sq,
    a ~ normal(0, 1),
    c(bM, bB) ~ normal(0, 0.5),
    sigma_sq ~ exponential(1)
  ),
  data = dat_list, chains = 4, cores = 4
)

# Brownian motion
m_14M5_brownian <- ulam(
  alist(
    G ~ multi_normal(mu, SIGMA),
    mu <- a + bM * M + bB * B,
    matrix[N_spp, N_spp]:SIGMA <- R * sigma_sq,
    a ~ normal(0, 1),
    c(bM, bB) ~ normal(0, 0.5),
    sigma_sq ~ exponential(1)
  ),
  data = dat_list, chains = 4, cores = 4
)

# Gaussian process
dat_list$Dmat <- Dmat[spp_obs, spp_obs] / max(Dmat)
m_14M5_gaussian <- ulam(
  alist(
    G ~ multi_normal(mu, SIGMA),
    mu <- a + bM * M + bB * B,
    matrix[N_spp, N_spp]:SIGMA <- cov_GPL1(Dmat, etasq, rhosq, 0.01),
    a ~ normal(0, 1),
    c(bM, bB) ~ normal(0, 0.5),
    etasq ~ half_normal(1, 0.25),
    rhosq ~ half_normal(3, 0.25)
  ),
  data = dat_list, chains = 4, cores = 4
)

# Now plot the models
plot(coeftab(m_14M5_orig, m_14M5_brownian, m_14M5_gaussian), pars = c("bM", "bB"))

# In the plot, the original (non-phylogenetic) model shows a clearly positive 
# effect of brain size (bB) on group size. In the two models that include 
# phylogeny (Brownian motion and the Gaussian process), the posterior for bB 
# overlaps zero. Accounting for phylogenetic covariance therefore weakens the 
# apparent link between brain size and group size, suggesting that much of 
# the raw association reflects shared ancestry.
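
# Overlaying the posterior densities for bB makes this shift visible
# (xlim is a guess; adjust as needed for your samples):
post_orig  <- extract.samples(m_14M5_orig)
post_brown <- extract.samples(m_14M5_brownian)
post_gp    <- extract.samples(m_14M5_gaussian)
dens(post_orig$bB, xlim=c(-0.5, 1), xlab="bB (effect of brain size)")
dens(post_brown$bB, add=TRUE, lty=2)
dens(post_gp$bB, add=TRUE, lty=3)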

14H1. Let’s revisit the Bangladesh fertility data, data(bangladesh), from the practice problems for Chapter 13. Fit a model with both varying intercepts by district_id and varying slopes of urban by district_id. You are still predicting use.contraception. Inspect the correlation between the intercepts and slopes. Can you interpret this correlation, in terms of what it tells you about the pattern of contraceptive use in the sample? It might help to plot the mean (or median) varying effect estimates for both the intercepts and slopes, by district. Then you can visualize the correlation and maybe more easily think through what it means to have a particular correlation. Plotting predicted proportion of women using contraception, with urban women on one axis and rural on the other, might also help.

data(bangladesh)
d <- bangladesh

dat_list <- list(
  C = d$use.contraception,
  did = as.integer(as.factor(d$district)),
  urban = d$urban
)

# Create and run the model
m_14H1 <- ulam(
  alist(
    C ~ bernoulli(p),
    logit(p) <- a[did] + b[did] * urban,
    c(a, b)[did] ~ multi_normal(c(abar, bbar), Rho, Sigma),
    abar ~ normal(0, 1),
    bbar ~ normal(0, 0.5),
    Rho ~ lkj_corr(2),
    Sigma ~ exponential(1)
  ),
  data = dat_list, chains = 4, cores = 4, iter = 4000
)

precis(m_14H1)
##            mean         sd       5.5%      94.5%    n_eff    Rhat4
## abar -0.6856697 0.09847572 -0.8414597 -0.5270433 5983.945 0.999859
## bbar  0.6442141 0.16001939  0.3902587  0.9026278 4140.402 1.000040
# inspect posterior distribution for Rho
posterior.samples <- extract.samples(m_14H1)
dens( posterior.samples$Rho[,1,2] )

# inspect estimates
precis(m_14H1, depth = 3, pars = c("Rho", "Sigma"))
##                mean           sd       5.5%      94.5%     n_eff     Rhat4
## Rho[1,1]  1.0000000 0.000000e+00  1.0000000  1.0000000       NaN       NaN
## Rho[1,2] -0.6562753 1.664164e-01 -0.8680248 -0.3580667 1543.0662 1.0011156
## Rho[2,1] -0.6562753 1.664164e-01 -0.8680248 -0.3580667 1543.0662 1.0011156
## Rho[2,2]  1.0000000 6.004824e-17  1.0000000  1.0000000 7237.3431 0.9994999
## Sigma[1]  0.5705128 9.843100e-02  0.4204388  0.7365218 1880.9021 1.0008899
## Sigma[2]  0.7676890 2.054250e-01  0.4570752  1.1105999  721.7441 1.0033373
# The posterior correlation between intercepts and slopes is strongly 
# negative (mean about -0.66). Districts with higher-than-average rural 
# contraceptive use (large intercepts) show a smaller urban effect (small 
# slopes), while districts with low rural use show a large urban effect. 
# In other words, urban living matters most exactly where rural use is 
# lowest, which compresses the spread of contraceptive use in urban areas.

post <- extract.samples(m_14H1)
a <- apply(post$a, 2, mean)
b <- apply(post$b, 2, mean)

# Plot the posterior mean intercepts and slopes by district
plot(a, b, xlab = "a (intercept)", ylab = "b (urban slope)")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)

R <- apply(post$Rho, 2:3, mean)
s <- apply(post$Sigma, 2, mean)
S <- diag(s) %*% R %*% diag(s)
ll <- c(0.5, 0.67, 0.89, 0.97)
for (l in ll) {
  el <- ellipse(S, centre = c(mean(post$abar), mean(post$bbar)), level = l)
  lines(el, col = "black", lwd = 0.5)
}

# Plot predicted proportions of contraceptive use: urban vs non-urban
non_urban <- inv_logit(a)
urban <- inv_logit(a + b)
plot(non_urban, urban, xlim = c(0, 1), ylim = c(0, 1), xlab = "non urban", ylab = "urban")
abline(h = 0.5, lty = 2)
abline(v = 0.5, lty = 2)
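# add the 45-degree line: points above it are districts where predicted
# urban use exceeds rural use
abline(a = 0, b = 1, lty = 3)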

# Most districts sit above the 45-degree line: the predicted proportion of 
# women using contraception is higher in urban areas than in rural areas of 
# the same district, and the district-to-district variation is larger on the 
# rural (non-urban) axis.