Chapter 7 - Ulysses’ Compass

This week began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools can mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

# load the packages
library(rethinking)
library(dplyr)
library(janitor)
library(ggplot2)
library(tibble)
library(purrr)
library(tidyr)
library(dagitty)
library(ggdag)

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.

A:

Information criteria are based on deviance, i.e. on log-pointwise predictive densities summed over the observations. With more observations, more log-probabilities are summed, so the score grows in magnitude with the sample size regardless of how well the model fits. Comparing a model fit to many observations with one fit to fewer observations is therefore meaningless: the difference would mostly reflect the difference in sample size, not in predictive accuracy.

The simulation below fits a simple linear regression to N simulated observations and returns the WAIC estimate:

waic_sim <- function(N){
  # simulate N observations of two unrelated variables and standardize them
  dat <- tibble(x = rnorm(N), 
                y = rnorm(N)) %>% 
    mutate(across(everything(), standardize)) 
  
  # fit a simple Gaussian regression with quap and return its WAIC estimate
  alist(
    x ~ dnorm(mu, 1), 
    mu <- a + By*y,
    a ~ dnorm(0, 0.2), 
    By ~ dnorm(0, 0.5) 
  ) %>% 
    quap(data = dat) %>% 
    WAIC() %>% 
    as_tibble() %>% 
    pull(WAIC)
}

Now we define a sequence of N from 100 to 1000 observations, apply the waic_sim function to each, and plot the results:

seq(100, 1000, by = 100) %>%
  map_dbl(waic_sim) %>% 
  enframe(name = "n_obs", value = "WAIC") %>% 
  mutate(n_obs = seq(100, 1000, by = 100)) %>% 
  ggplot(aes(n_obs, WAIC)) +
  geom_line(colour = 'grey') +
  geom_point(size = 2, 
             stroke = 1, alpha = 0.9) +
  labs(x = "Number of observations", 
       caption = "Figure 1: Simulated WAIC values of the same model 
       as a function of the number of observations.") +
  theme_minimal()
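
The plot shows WAIC growing roughly linearly with the number of observations. To make the point about comparisons concrete, the sketch below (fit_n is a hypothetical helper, not part of the assignment) fits the same model to 100 and to 1000 observations; the WAIC gap between the two fits reflects sample size rather than model quality, which is why rethinking's compare() should not be given models fit to different numbers of observations:

fit_n <- function(N){
  # same simulated-data regression as waic_sim, but returning the quap fit
  dat <- tibble(x = rnorm(N), y = rnorm(N)) %>%
    mutate(across(everything(), standardize))
  alist(
    x ~ dnorm(mu, 1),
    mu <- a + By*y,
    a ~ dnorm(0, 0.2),
    By ~ dnorm(0, 0.5)
  ) %>%
    quap(data = dat)
}

m_small <- fit_n(100)
m_large <- fit_n(1000)

# the larger fit has a much larger WAIC simply because more log-probabilities
# are summed, not because it is a worse model
WAIC(m_small)
WAIC(m_large)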

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations (at least 4).

A:

The simulation below takes the prior standard deviation as an argument and returns the effective number of parameters (the PSIS penalty):

psis_sim <- function(prior){
  # simulate 10 observations of two unrelated variables and standardize them
  dat <- tibble(x = rnorm(10), 
                y = rnorm(10)) %>% 
    mutate(across(everything(), standardize)) 
  
  # fit the same regression, with `prior` as the prior standard deviation for
  # both a and By, and return the PSIS penalty (the effective number of parameters)
  suppressMessages(
    alist(
      x ~ dnorm(mu, 1), 
      mu <- a + By*y,
      a ~ dnorm(0, prior), 
      By ~ dnorm(0, prior)
    ) %>% 
      quap(data = list(x = dat$x, y = dat$y, prior = prior)) %>% 
      PSIS() %>% 
      as_tibble() %>% 
      pull(penalty)
  )
}

Then define a range of prior standard deviations, apply the psis_sim function to each value, and plot the results:

seq(1, 0.1, length.out = 20) %>%
  purrr::map_dbl(psis_sim) %>% 
  enframe(name = "b_prior", value = "n_params") %>% 
  mutate(b_prior = seq(1, 0.1, length.out = 20)) %>% 
  ggplot(aes(b_prior, n_params)) +
  geom_smooth(method = "lm") +
  geom_point(size = 1, stroke = 1, alpha = 0.6) +
  labs(x = "Prior standard deviation for a and By",
       y = "Effective number of parameters (PSIS penalty)",
       caption = "Figure 2: Simulation of the effective number of parameters 
       as a function of the prior standard deviation") +
  theme_minimal()

The effective number of parameters decreases as the prior becomes more concentrated. A prior concentrated around particular parameter values constrains the model, making it less flexible in fitting the sample; with less flexibility there is less overfitting, and the effective number of parameters (which is really an overfitting penalty) declines.
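
As a cross-check (not required by the question), the WAIC penalty pWAIC shrinks in the same way. The sketch below reuses the model from psis_sim with a wide and a tight prior and pulls the penalty column that WAIC() returns (waic_penalty is a hypothetical helper introduced just for this check):

waic_penalty <- function(prior){
  dat <- tibble(x = rnorm(10), y = rnorm(10)) %>%
    mutate(across(everything(), standardize))
  alist(
    x ~ dnorm(mu, 1),
    mu <- a + By*y,
    a ~ dnorm(0, prior),
    By ~ dnorm(0, prior)
  ) %>%
    quap(data = list(x = dat$x, y = dat$y, prior = prior)) %>%
    WAIC() %>%
    as_tibble() %>%
    pull(penalty)
}

# the wide prior should yield a noticeably larger penalty than the tight one
c(wide = waic_penalty(1), tight = waic_penalty(0.1))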

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island Species A Species B Species C Species D Species E
1 0.2 0.2 0.2 0.2 0.2
2 0.8 0.1 0.05 0.025 0.025
3 0.05 0.15 0.7 0.05 0.05

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

A:

dat_island <- tibble("Island" = c("Island 1", "Island 2", "Island 3"),
       "Species A" = c(0.2, 0.8, 0.05), 
       "Species B" = c(0.2, 0.1, 0.15), 
       "Species C" = c(0.2, 0.05, 0.7), 
       "Species D" = c(0.2, 0.025, 0.05), 
       "Species E" = c(0.2, 0.025, 0.05))

dat_island %>% 
  knitr::kable()
Island Species A Species B Species C Species D Species E
Island 1 0.20 0.20 0.20 0.200 0.200
Island 2 0.80 0.10 0.05 0.025 0.025
Island 3 0.05 0.15 0.70 0.050 0.050
dat_island_long <- dat_island %>% 
  janitor::clean_names() %>% 
  pivot_longer(cols = -island, names_to = "species", values_to = "proportion") 

The entropy of each island’s bird distribution is:

# Shannon information entropy: H(p) = -sum(p * log(p))
inf_entropy <- function(p){
  -sum(p*log(p))
}

dat_island_long %>% 
  group_by(island) %>% 
  summarise(entropy = inf_entropy(proportion)) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>% 
  knitr::kable()
island entropy
Island 1 1.61
Island 2 0.74
Island 3 0.98

The entropy value here measures how uncertain we are about which species a randomly chosen bird belongs to. Island 1 has the highest entropy because its five species are equally common, so a bird from island 1 is maximally unpredictable. Island 2 has the lowest entropy because species A makes up 80% of its birds: if we know a bird comes from island 2, it is very likely species A. Island 3 sits in between, with species C dominating, but less strongly.
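
Island 1's entropy is exactly the maximum possible for five categories, log(5) ≈ 1.61, which a minimal check confirms:

# a uniform distribution over five species attains the maximum entropy log(5)
inf_entropy(rep(0.2, 5))
log(5)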

KL Divergence of each island from the others:

# KL divergence D(p || q) = sum(p * (log(p) - log(q))):
# the extra uncertainty induced by using q to predict events with true distribution p
kl_divergence <- function(p, q) sum(p*(log(p)-log(q)))

dat_island_long %>% 
  select(-species) %>% 
  group_by(island) %>% 
  nest() %>% 
  pivot_wider(names_from = island, values_from = data) %>% 
  janitor::clean_names() %>% 
  mutate(one_vs_two = map2_dbl(island_1, island_2, kl_divergence),
         two_vs_one = map2_dbl(island_2, island_1, kl_divergence),
         one_vs_three = map2_dbl(island_1, island_3, kl_divergence),
         three_vs_one = map2_dbl(island_3, island_1, kl_divergence),
         two_vs_three = map2_dbl(island_2, island_3, kl_divergence),
         three_vs_two = map2_dbl(island_3, island_2, kl_divergence)
         ) %>% 
  select(one_vs_two:three_vs_two) %>% 
  pivot_longer(cols = everything(), names_to = "Comparison", values_to = "KL Divergence") %>% 
  knitr::kable()
Comparison KL Divergence
one_vs_two 0.9704061
two_vs_one 0.8664340
one_vs_three 0.6387604
three_vs_one 0.6258376
two_vs_three 2.0109142
three_vs_two 1.8388452

The first row, one_vs_two, is the divergence of island 1 from island 2, D_KL(island 1 || island 2) ≈ 0.97: the extra uncertainty incurred when island 2's distribution is used to predict island 1's birds. When island 1 is used as the model, both divergences are small (two_vs_one ≈ 0.87 and three_vs_one ≈ 0.63), whereas islands 2 and 3 each predict the other very poorly (two_vs_three ≈ 2.01 and three_vs_two ≈ 1.84). Island 1 therefore predicts the other islands best: its uniform, high-entropy distribution assigns substantial probability to every species, so it is never badly surprised by the birds found on the other islands.
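
As a direct check of the first row, plugging the proportions in by hand reproduces the value (island 1 as the target distribution p, island 2 as the model q):

# D_KL(island 1 || island 2), should be about 0.97
kl_divergence(p = c(0.2, 0.2, 0.2, 0.2, 0.2),
              q = c(0.8, 0.1, 0.05, 0.025, 0.025))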

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

A:

df_happy <- sim_happiness(seed = 1977, N_years = 1000)
df_happy <-
  df_happy %>% 
  filter(age > 17) %>% 
  mutate(age = (age - 18) / (65 - 18)) %>% 
  mutate(mid = married + 1)

# rerun each model
m6.9 <- 
  alist(
  happiness ~ dnorm( mu , sigma ),
  mu <- a[mid] + bA*age,
  a[mid] ~ dnorm( 0 , 1 ),
  bA ~ dnorm( 0 , 2 ),
  sigma ~ dexp(1)) %>% 
  quap(data = df_happy) 

m6.10 <- alist(
  happiness ~ dnorm( mu , sigma ),
  mu <- a + bA*age,
  a ~ dnorm( 0 , 1 ),
  bA ~ dnorm( 0 , 2 ),
  sigma ~ dexp(1)) %>% 
  quap(data = df_happy)

Then we apply WAIC to both models:

list(m6.9, m6.10) %>% 
  map_dfr(WAIC) %>% 
  add_column(model = c("m6.9", "m6.10"), .before = 1) %>% 
  knitr::kable()
model WAIC lppd penalty std_err
m6.9 2713.971 -1353.247 3.738532 37.54465
m6.10 3101.906 -1548.612 2.340445 27.74379

Model m6.9 has a lower WAIC (about 2714) than model m6.10 (about 3102), so m6.9 is expected to make better out-of-sample predictions. However, m6.10 provides the correct causal inference about the influence of age on happiness. Marriage status is a collider of age and happiness, and conditioning on it in m6.9 opens a non-causal path that creates a spurious association between age and happiness. WAIC measures expected predictive accuracy, not causal correctness, and that spurious association genuinely improves prediction; this is why the answers to the two questions disagree.
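
As the question notes, PSIS leads to the same conclusion. A minimal cross-check using the same pattern as the WAIC table above (assuming PSIS() applied to a quap fit returns a one-row summary, as WAIC() does):

list(m6.9, m6.10) %>% 
  map_dfr(PSIS) %>% 
  add_column(model = c("m6.9", "m6.10"), .before = 1) %>% 
  knitr::kable()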

7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

(1) avgfood + groupsize + area
(2) avgfood + groupsize
(3) groupsize + area
(4) avgfood
(5) area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

A:

Reload the data and standardize both the predictors and the outcome.

data("foxes")

dat_foxes <- foxes %>% 
  as_tibble() %>% 
  mutate(across(-group, standardize))

Fit the model for each set of predictor variables above:

m1 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bf*avgfood + Bg*groupsize + Ba*area, 
  a ~ dnorm(0, 0.2), 
  c(Bf, Bg, Ba) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = dat_foxes)

m2 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bf*avgfood + Bg*groupsize, 
  a ~ dnorm(0, 0.2), 
  c(Bf, Bg) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = dat_foxes)

m3 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bg*groupsize + Ba*area, 
  a ~ dnorm(0, 0.2), 
  c(Bg, Ba) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = dat_foxes)

m4 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bf*avgfood, 
  a ~ dnorm(0, 0.2), 
  Bf ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = dat_foxes)

m5 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Ba*area, 
  a ~ dnorm(0, 0.2), 
  Ba ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = dat_foxes)

Compare them using WAIC:

compare(m1, m2, m3, m4, m5) %>% 
  as_tibble(rownames = "model") %>% 
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>% 
  knitr::kable()
model WAIC SE dWAIC dSE pWAIC weight
m1 322.88 16.28 0.00 NA 4.66 0.46
m3 323.90 15.68 1.01 2.90 3.72 0.28
m2 324.13 16.14 1.24 3.60 3.86 0.25
m4 331.58 13.67 8.70 7.21 1.50 0.01
m5 333.72 13.79 10.84 7.24 2.65 0.00

We notice that the differences among m1, m2, and m3 (dWAIC of roughly 1, with dSE around 3) are small relative to their standard errors, so these three models are effectively indistinguishable. m4 and m5 form a second group, sitting roughly 9 to 11 WAIC units worse, although even that difference is only a bit larger than its dSE of about 7. Revisiting the fox DAG from the previous chapter explains this pattern:
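
A sketch of the DAG assumed here (the structure from the previous chapter’s practice problems: area acts on weight only through avgfood, which affects weight both directly and through groupsize), drawn with dagitty and ggdag, which are already loaded:

# fox DAG: area -> avgfood -> groupsize -> weight, plus avgfood -> weight
fox_dag <- dagitty("dag{
  area -> avgfood
  avgfood -> groupsize
  avgfood -> weight
  groupsize -> weight
}")

ggdag(fox_dag) +
  theme_dag()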

In this DAG, area influences weight only through avgfood, so once groupsize is in the model it makes little difference whether we add area, avgfood, or both: conditional on groupsize they carry essentially the same information, which is why m1, m2, and m3 receive nearly identical WAIC values. m4 and m5 omit groupsize, so they capture only the total effect of food (or area), in which the positive direct path is largely cancelled by the negative path through group size; with little usable association left, their predictions are worse, although the dSE values remind us that even this difference is not overwhelming.