Chapter 7 - Ulysses’ Compass

This week began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions, such as compare in the rethinking package, were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, however, these tools can mislead you: models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.

# All of the information criteria are based on the log-pointwise-predictive density (lppd), defined as follows, where y is the data, Θ is the posterior distribution, S is the number of posterior samples, and i indexes the observations:
# 
# lppd(y, Θ) = Σ_i log( (1/S) Σ_s p(y_i | Θ_s) )
# 
# In words: take the log of the average probability (across posterior samples) of each observation i, and sum these logs together. Because each per-observation log-average-probability is typically negative, a larger sample size will necessarily lead to a smaller (more negative) lppd, even if the data generating process and the model are exactly the same: more observations are entered into the sum, making the total more negative, which in turn increases the information criteria. We can run a quick simulation to demonstrate. For three different sample sizes, we simulate 100 data sets from the exact same data generating process, estimate the exact same linear model, and then calculate the lppd, WAIC, and PSIS for each.
# 
# Visualizing the distribution of the lppd, WAIC, and PSIS across the simulated data sets for each sample size, we see that the lppd gets more negative as the sample size increases, even though the data generating process and the estimated model are identical. Accordingly, the WAIC and PSIS increase. Note that the WAIC and PSIS values are approximately −2 × lppd. Thus, if we fit one model to 100 observations and a second model to 1,000 observations, we might conclude from the WAIC and PSIS that the first model has much better predictive accuracy, because its WAIC and PSIS values are lower. However, this would be an artifact of the different sample sizes and may not reflect any true difference between the models.
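# 
# As a quick illustration of why the lppd scales with the number of observations, here is a minimal base-R sketch (no model fitting; the "posterior" draws are a hypothetical stand-in) that computes the lppd from a matrix of pointwise log-likelihoods:
# 
# # lppd for one simulated data set: rows = posterior draws, columns = observations
# calc_lppd_demo <- function(n, s = 1000) {
#   y <- rnorm(n)                              # data from a fixed generating process
#   mu_draws <- rnorm(s, mean = 0, sd = 0.1)   # stand-in posterior draws for the mean
#   loglik <- sapply(y, function(yi) dnorm(yi, mean = mu_draws, sd = 1, log = TRUE))
#   # log of the average (not average of the log) probability of each observation, then sum
#   sum(apply(loglik, 2, function(lp) log(mean(exp(lp)))))
# }
# 
# set.seed(1234)
# sapply(c(100, 500, 1000), calc_lppd_demo)
# # the lppd grows more negative with n, so -2 * lppd (the deviance scale of WAIC/PSIS) grows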
# 
# 
# # libraries used throughout these solutions
# library(tidyverse)
# library(brms)
# library(rethinking)  # standardize(), sim_happiness(), data(foxes)
# library(here)
# library(glue)        # prior_string() template in 7-2
# 
# gen_data <- function(n) {
#   tibble(x1 = rnorm(n = n)) %>%
#     mutate(y = rnorm(n = n, mean = 0.3 + 0.8 * x1),
#            across(everything(), standardize))
# }
# fit_model <- function(dat) {
#   suppressMessages(output <- capture.output(
#     mod <- brm(y ~ 1 + x1, data = dat, family = gaussian,
#                prior = c(prior(normal(0, 0.2), class = Intercept),
#                          prior(normal(0, 0.5), class = "b"),
#                          prior(exponential(1), class = sigma)),
#                iter = 4000, warmup = 3000, chains = 4, cores = 4, seed = 1234)
#   ))
#   
#   return(mod)
# }
# calc_info <- function(model) {
#   mod_lppd <- log_lik(model) %>% 
#     as_tibble(.name_repair = "minimal") %>% 
#     set_names(paste0("obs_", 1:ncol(.))) %>% 
#     rowid_to_column(var = "iter") %>% 
#     pivot_longer(-iter, names_to = "obs", values_to = "logprob") %>% 
#     mutate(prob = exp(logprob)) %>% 
#     group_by(obs) %>% 
#     summarize(log_prob_score = log(mean(prob))) %>% 
#     pull(log_prob_score) %>% 
#     sum()
#   
#   mod_waic <- suppressWarnings(waic(model)$estimates["waic", "Estimate"])
#   mod_psis <- suppressWarnings(loo(model)$estimates["looic", "Estimate"])
#   
#   tibble(lppd = mod_lppd, waic = mod_waic, psis = mod_psis)
# }
# 
# sample_sim <- tibble(sample_size = rep(c(100, 500, 1000), each = 100)) %>%
#   mutate(sample_data = map(sample_size, gen_data),
#          model = map(sample_data, fit_model),
#          infc = map(model, calc_info)) %>%
#   select(-sample_data, -model) %>% 
#   unnest(infc) %>% 
#   write_rds(here("fits", "chp7", "b7m3-sim.rds"), compress = "gz")
# 
# library(ggridges)
# 
# sample_sim %>%
#   pivot_longer(cols = c(lppd, waic, psis)) %>%
#   mutate(sample_size = fct_inorder(as.character(sample_size)),
#          name = str_to_upper(name),
#          name = fct_inorder(name)) %>%
#   ggplot(aes(x = value, y = sample_size)) +
#   facet_wrap(~name, nrow = 1, scales = "free_x") +
#   geom_density_ridges(bandwidth = 4) +
#   scale_y_discrete(expand = c(0, .1)) +
#   scale_x_continuous(breaks = seq(-2000, 3000, by = 750)) +
#   coord_cartesian(clip = "off") +
#   labs(x = "Value", y = "Sample Size")
# 

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations (at least 4).

# 
# The penalty term of the WAIC, p_WAIC, comes directly from the definition of the WAIC:
# 
# WAIC(y, Θ) = -2 * ( lppd - Σ_i var_θ log p(y_i | θ) )
# 
# Specifically, the penalty term is the sum, across observations, of the variance (over posterior samples) of the log probability of each observation. Smaller variances in the log probabilities result in a lower penalty. If we make the prior more concentrated, we restrict the plausible range of the parameters, and therefore the variability in the posterior distribution. As the parameter draws become more consistent, the log probability of each observation necessarily becomes more consistent as well. Thus the penalty term, or effective number of parameters, becomes smaller. A by-hand sketch of the penalty calculation is shown next, and we can again confirm the pattern with a small simulation.
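# 
# As a minimal sketch of this penalty calculation (here `model` is a hypothetical placeholder for any of the brms fits below), p_WAIC can be computed by hand from the pointwise log-likelihood matrix:
# 
# ll <- log_lik(model)                     # posterior draws x observations
# p_waic_manual <- sum(apply(ll, 2, var))  # sum of per-observation variances
# p_waic_manual                            # ~ waic(model)$estimates["p_waic", "Estimate"]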
# 
# In this simulation we keep everything constant (i.e., the data generation and the model), with the exception of the prior on the slope coefficients. We try three different priors: Normal(0, 0.1), Normal(0, 1), and Normal(0, 10). Visualizing the results, we see that the most concentrated prior does indeed result in the smallest penalty, i.e., the fewest effective parameters.
# 
# gen_data <- function(n) {
#   tibble(x1 = rnorm(n = n),
#          x2 = rnorm(n = n),
#          x3 = rnorm(n = n)) %>%
#     mutate(y = rnorm(n = n, mean = 0.3 + 0.8 * x1 +
#                        0.6 * x2 + 1.2 * x3),
#            across(everything(), standardize))
# }
# fit_model <- function(dat, p_sd) {
#   suppressMessages(output <- capture.output(
#     mod <- brm(y ~ 1 + x1 + x2 + x3, data = dat,
#                family = gaussian,
#                prior = c(prior(normal(0, 0.2), class = Intercept),
#                          prior_string(glue("normal(0, {p_sd})"), class = "b"),
#                          prior(exponential(1), class = sigma)),
#                iter = 4000, warmup = 3000, chains = 4, cores = 4, seed = 1234)
#   ))
#   
#   return(mod)
# }
# calc_info <- function(model) {
#   w <- suppressWarnings(brms::waic(model))
#   p <- suppressWarnings(brms::loo(model))
#   
#   tibble(p_waic = w$estimates["p_waic", "Estimate"],
#          p_loo = p$estimates["p_loo", "Estimate"])
# }
# 
# prior_sim <- tibble(sample_size = 20,
#                     prior_sd = rep(c(0.1, 1, 10), each = 100)) %>%
#   mutate(sample_data = map(sample_size, gen_data),
#          model = map2(sample_data, prior_sd, fit_model),
#          infc = map(model, calc_info)) %>%
#   select(-sample_data, -model) %>% 
#   unnest(infc) %>% 
#   write_rds(here("fits", "chp7", "b7m4-sim.rds"), compress = "gz")
# 
# library(ggtext)  # element_markdown() used in the theme below
# 
# prior_sim %>%
#   pivot_longer(cols = c(p_waic, p_loo)) %>%
#   mutate(prior_sd = glue("&sigma; = {prior_sd}"),
#          prior_sd = fct_inorder(prior_sd),
#          name = factor(name, levels = c("p_waic", "p_loo"),
#                        labels = c("p<sub>WAIC</sub>", "p<sub>PSIS</sub>"))) %>%
#   ggplot(aes(x = value, y = prior_sd, height = stat(density))) +
#   facet_wrap(~name, nrow = 1) +
#   geom_density_ridges(stat = "binline", bins = 20) +
#   labs(x = "Effective Parameters", y = "Prior") +
#   theme(axis.text.y = element_markdown())

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

            Species A   Species B   Species C   Species D   Species E
Island 1    0.200       0.200       0.200       0.200       0.200
Island 2    0.800       0.100       0.050       0.025       0.025
Island 3    0.050       0.150       0.700       0.050       0.050

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island's bird distribution. Interpret these entropy values. Second, use each island's bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

# These results show that when using Island 1 to predict Island 2, the KL divergence is about 0.87; when using Island 1 to predict Island 3, it is about 0.63, and so on. Overall, the divergences are smallest when Island 1 is used as the model. This is because Island 1 has the highest entropy: its relatively flat distribution leaves us less surprised by the other islands, so the divergences are shorter. In contrast, Islands 2 and 3 have very concentrated distributions, so using them to predict the other islands leads to bigger surprises and therefore greater divergences.
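# 
# These values come from the following calculation, a minimal sketch in base R using the proportions from the table above:
# 
# islands <- list(island_1 = c(0.200, 0.200, 0.200, 0.200, 0.200),
#                 island_2 = c(0.800, 0.100, 0.050, 0.025, 0.025),
#                 island_3 = c(0.050, 0.150, 0.700, 0.050, 0.050))
# 
# # Shannon entropy of each island: H(p) = -sum(p * log(p))
# sapply(islands, function(p) -sum(p * log(p)))
# 
# # KL divergence of target p from model q: D_KL(p, q) = sum(p * log(p / q))
# kl_div <- function(p, q) sum(p * log(p / q))
# 
# # all 6 ordered (model, target) pairs
# pairs <- expand.grid(model = names(islands), target = names(islands),
#                      stringsAsFactors = FALSE)
# pairs <- pairs[pairs$model != pairs$target, ]
# pairs$kl_divergence <- mapply(function(m, t) kl_div(islands[[t]], islands[[m]]),
#                               pairs$model, pairs$target)
# pairs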

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

# library(dagitty)
# library(ggdag)
# 
# hma_dag <- dagitty("dag{H -> M <- A}")
# coordinates(hma_dag) <- list(x = c(H = 1, M = 2, A = 3),
#                              y = c(H = 1, M = 1, A = 1))
# 
# ggplot(hma_dag, aes(x = x, y = y, xend = xend, yend = yend)) +
#   geom_dag_text(color = "black", size = 10) +
#   geom_dag_edges(edge_color = "black", edge_width = 2,
#                  arrow_directed = grid::arrow(length = grid::unit(15, "pt"),
#                                               type = "closed")) +
#   theme_void()


# d <- sim_happiness(seed = 1977, N_years = 1000)
# dat <- d %>%
#   filter(age > 17) %>%
#   mutate(a = (age - 18) / (65 - 18),
#          mid = factor(married + 1, labels = c("single", "married")))
# 
# b6.9 <- brm(happiness ~ 0 + mid + a, data = dat, family = gaussian,
#             prior = c(prior(normal(0, 1), class = b, coef = midmarried),
#                       prior(normal(0, 1), class = b, coef = midsingle),
#                       prior(normal(0, 2), class = b, coef = a),
#                       prior(exponential(1), class = sigma)),
#             iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#             file = here("fits", "chp7", "b7h4-6.9"))
# 
# b6.10 <- brm(happiness ~ 1 + a, data = dat, family = gaussian,
#              prior = c(prior(normal(0, 1), class = Intercept),
#                        prior(normal(0, 2), class = b, coef = a),
#                        prior(exponential(1), class = sigma)),
#              iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#              file = here("fits", "chp7", "b7h4-6.10"))


# b6.9 <- add_criterion(b6.9, criterion = "loo")
# b6.10 <- add_criterion(b6.10, criterion = "loo")
# 
# loo_compare(b6.9, b6.10)
# #>       elpd_diff se_diff
# #> b6.9     0.0       0.0 
# #> b6.10 -194.0      17.6

# PSIS shows a strong preference for b6.9, the model that includes both age and marriage status. However, b6.10 provides the correct causal inference about the influence of age on happiness: the adjustment set for the effect of age on happiness is empty, so no conditioning on marriage status is needed.
# 
# adjustmentSets(hma_dag, exposure = "A", outcome = "H")
# #>  {}

# The reason the answers disagree is that, in this model, marriage status is a collider. Adding it to the model induces a statistical (but non-causal) association between happiness and age, which genuinely improves the model's predictions. However, because the association is not causal, intervening on age (if that were possible) would not actually change happiness. It is therefore important to consider the causal implications of a model before selecting one on the basis of PSIS or WAIC alone.
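# 
# A quick check of the collider-induced association, using the `dat` tibble created above: unconditionally, age and happiness are essentially uncorrelated in this simulation, but within each level of marriage status a negative association appears.
# 
# cor(dat$a, dat$happiness)      # approximately zero overall
# 
# dat %>%
#   group_by(mid) %>%
#   summarize(cor_age_happiness = cor(a, happiness))   # negative within both groups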

7-5. Revisit the urban fox data, data(foxes), from the previous chapter's practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

(1) avgfood + groupsize + area
(2) avgfood + groupsize
(3) groupsize + area
(4) avgfood
(5) area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

# data(foxes)
# 
# fox_dat <- foxes %>%
#   as_tibble() %>%
#   select(area, avgfood, weight, groupsize) %>%
#   mutate(across(everything(), standardize))
# 
# b7h5_1 <- brm(weight ~ 1 + avgfood + groupsize + area, data = fox_dat,
#               family = gaussian,
#               prior = c(prior(normal(0, 0.2), class = Intercept),
#                         prior(normal(0, 0.5), class = b),
#                         prior(exponential(1), class = sigma)),
#               iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#               file = here("fits", "chp7", "b7h5_1"))
# 
# b7h5_2 <- brm(weight ~ 1 + avgfood + groupsize, data = fox_dat,
#               family = gaussian,
#               prior = c(prior(normal(0, 0.2), class = Intercept),
#                         prior(normal(0, 0.5), class = b),
#                         prior(exponential(1), class = sigma)),
#               iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#               file = here("fits", "chp7", "b7h5_2"))
# 
# b7h5_3 <- brm(weight ~ 1 + groupsize + area, data = fox_dat,
#               family = gaussian,
#               prior = c(prior(normal(0, 0.2), class = Intercept),
#                         prior(normal(0, 0.5), class = b),
#                         prior(exponential(1), class = sigma)),
#               iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#               file = here("fits", "chp7", "b7h5_3"))
# 
# b7h5_4 <- brm(weight ~ 1 + avgfood, data = fox_dat,
#               family = gaussian,
#               prior = c(prior(normal(0, 0.2), class = Intercept),
#                         prior(normal(0, 0.5), class = b),
#                         prior(exponential(1), class = sigma)),
#               iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#               file = here("fits", "chp7", "b7h5_4"))
# 
# b7h5_5 <- brm(weight ~ 1 + area, data = fox_dat,
#               family = gaussian,
#               prior = c(prior(normal(0, 0.2), class = Intercept),
#                         prior(normal(0, 0.5), class = b),
#                         prior(exponential(1), class = sigma)),
#               iter = 4000, warmup = 2000, chains = 4, cores = 4, seed = 1234,
#               file = here("fits", "chp7", "b7h5_5"))

# 
# b7h5_1 <- add_criterion(b7h5_1, criterion = "waic")
# b7h5_2 <- add_criterion(b7h5_2, criterion = "waic")
# b7h5_3 <- add_criterion(b7h5_3, criterion = "waic")
# b7h5_4 <- add_criterion(b7h5_4, criterion = "waic")
# b7h5_5 <- add_criterion(b7h5_5, criterion = "waic")
# 
# comp <- loo_compare(b7h5_1, b7h5_2, b7h5_3, b7h5_4, b7h5_5, criterion = "waic")
# comp
# #>        elpd_diff se_diff
# #> b7h5_1  0.0       0.0   
# #> b7h5_3 -0.4       1.4   
# #> b7h5_2 -0.4       1.7   
# #> b7h5_4 -5.2       3.4   
# #> b7h5_5 -5.3       3.4

# 
# The first three models (b7h5_1, b7h5_2, and b7h5_3) all contain groupsize plus one or both of area and avgfood, and their scores are essentially the same: the elpd differences (about 0.4) are well within their standard errors (se_diff of 1.4-1.7). The reason is that there is no back-door path from area or avgfood to weight, and area influences weight only through avgfood. In other words, the effect of area adjusting for groupsize is the same as the effect of avgfood adjusting for groupsize, because the effect of area is routed entirely through avgfood.
# 
# Similarly, the last two models (b7h5_4 and b7h5_5) are nearly identical to each other for the same reason: because the effect of area passes entirely through avgfood, including only avgfood or only area yields essentially the same predictions. These two models score somewhat worse than the first three (elpd differences of about 5, against a se_diff of 3.4, so the difference is suggestive rather than overwhelming), because they omit groupsize: without it, the positive effect of food on weight and the negative effect of larger group size largely cancel, so food (or area) alone predicts weight poorly. A sketch of the fox DAG is included below.
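# 
# For reference, a sketch of the fox DAG from the previous chapter (area -> avgfood; avgfood -> groupsize; avgfood -> weight; groupsize -> weight), drawn in the same style as the DAG in 7-4:
# 
# fox_dag <- dagitty("dag{area -> avgfood -> groupsize -> weight; avgfood -> weight}")
# coordinates(fox_dag) <- list(x = c(area = 2, avgfood = 1, groupsize = 3, weight = 2),
#                              y = c(area = 3, avgfood = 2, groupsize = 2, weight = 1))
# 
# ggplot(fox_dag, aes(x = x, y = y, xend = xend, yend = yend)) +
#   geom_dag_text(color = "black", size = 6) +
#   geom_dag_edges(edge_color = "black", edge_width = 1,
#                  arrow_directed = grid::arrow(length = grid::unit(10, "pt"),
#                                               type = "closed")) +
#   theme_void()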