Assignment #7

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.

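#Information criteria are built from the deviance, which is a sum (not an average) over observations. Every additional observation adds to the deviance, so a model fit to more observations gets a larger information criterion value regardless of how well it predicts. Comparing models fit to different numbers of observations would therefore favor whichever model saw fewer observations, which is why all models must be fit to exactly the same data. The simulation below fits the same model to samples of increasing size, and WAIC grows roughly in proportion to N.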

#Simulation
library(ggplot2)
library(tidyverse)
library(rethinking)

sim <- function(N){
  simulationdat <- tibble(x = rnorm(N), 
                y = rnorm(N)) %>% 
    mutate(across(everything(), standardize)) 
  
  alist(
    x ~ dnorm(mu, 1), 
    mu <- a + By*y,
    a ~ dnorm(0, 0.3), 
    By ~ dnorm(0, 0.9) 
  ) %>% 
    quap(data = simulationdat) %>% 
    WAIC() %>% 
    as_tibble() %>% 
    pull(WAIC)
}
tibble(no_of_obs = seq(100, 1000, by = 100)) %>% 
  mutate(WAIC = map_dbl(no_of_obs, sim))

## # A tibble: 10 x 2
##    no_of_obs  WAIC
##        <dbl> <dbl>
##  1       100  286.
##  2       200  570.
##  3       300  854.
##  4       400 1138.
##  5       500 1419.
##  6       600 1705.
##  7       700 1989.
##  8       800 2269.
##  9       900 2555.
## 10      1000 2840.
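
The table suggests that WAIC here is close to a constant times the number of observations. A rough check, assuming the rounded WAIC values printed above and a fitted mean near zero with sigma fixed at 1: each standardized observation should then contribute about log(2 * pi) + 1 to the deviance. The vectors below are copied by hand from the output rather than recomputed, and the names `n_obs`, `waic`, `per_obs`, and `expected_per_obs` are only illustrative.

```r
# WAIC values copied (rounded) from the simulation output above
n_obs <- seq(100, 1000, by = 100)
waic  <- c(286, 570, 854, 1138, 1419, 1705, 1989, 2269, 2555, 2840)

# WAIC per observation stays nearly constant as N grows
per_obs <- waic / n_obs

# Approximate expected deviance per observation for a standardized
# outcome under Normal(0, 1): -2 * E[log density] = log(2 * pi) + 1
# (assumption: the small penalty term is ignored)
expected_per_obs <- log(2 * pi) + 1

print(round(per_obs, 2))
print(round(expected_per_obs, 2))
```

Because the per-observation value barely moves, almost all of the difference in WAIC between rows comes from the sample size rather than from the quality of the model.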

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations.

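#The effective number of parameters decreases as the prior becomes more concentrated. The WAIC penalty is the sum over observations of the variance of the log-likelihood across posterior samples (the PSIS penalty behaves similarly). A more concentrated prior gives a narrower posterior, so the log-likelihood of each observation varies less and the penalty shrinks.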

#Simulation
sim_2 <- function(prior){
  simulationdat_2 <- tibble(x = rnorm(10), 
                y = rnorm(10)) %>% 
    mutate(across(everything(), standardize)) 
  
  suppressMessages(
    alist(
      x ~ dnorm(mu, 1), 
      mu <- a + By*y,
      a ~ dnorm(0, prior), 
      By ~ dnorm(0, prior)
    ) %>% 
      quap(data = list(x = simulationdat_2$x, y = simulationdat_2$y, prior = prior)) %>% 
      PSIS() %>% 
      as_tibble() %>% 
      pull(penalty)
  )
}

tibble(b_prior = seq(1, 0.1, length.out = 20)) %>% 
  mutate(n_params = purrr::map_dbl(b_prior, sim_2)) %>% 
  ggplot(aes(b_prior, n_params)) +
  geom_smooth(method = "lm") +
  theme_classic()
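
The mechanism can also be sketched without quap. The example below is a simplified stand-in, not the model above: it assumes a normal likelihood with known sigma = 1 and a single mean parameter with a Normal(0, prior_sd) prior, so the posterior is available in closed form. The helper `p_waic` is hypothetical and computes the WAIC penalty directly as the summed variance of the pointwise log-likelihood over posterior draws.

```r
set.seed(7)
x <- rnorm(10)

# Hypothetical helper: WAIC penalty for a normal mean with known sigma = 1
# and a Normal(0, prior_sd) prior (conjugate posterior, no quap needed)
p_waic <- function(prior_sd, x, n_samples = 1e4) {
  n <- length(x)
  post_var  <- 1 / (n + 1 / prior_sd^2)   # posterior variance of the mean
  post_mean <- post_var * sum(x)          # posterior mean (prior mean is 0)
  mu <- rnorm(n_samples, post_mean, sqrt(post_var))
  # log-likelihood matrix: rows = posterior draws, columns = observations
  ll <- sapply(x, function(xi) dnorm(xi, mu, 1, log = TRUE))
  # penalty = sum over observations of the variance across draws
  sum(apply(ll, 2, var))
}

prior_sds <- c(1, 0.5, 0.1)
penalties <- sapply(prior_sds, p_waic, x = x)
print(data.frame(prior_sd = prior_sds, penalty = round(penalties, 3)))
```

In this sketch the posterior variance is 1 / (n + 1 / prior_sd^2), so tightening the prior shrinks the spread of the posterior draws and with it the variance of each observation's log-likelihood.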

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island	Species A	Species B	Species C	Species D	Species E
1	0.2	0.2	0.2	0.2	0.2
2	0.8	0.1	0.05	0.025	0.025
3	0.05	0.15	0.7	0.05	0.05

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

island <- tibble("Island" = c("Island 1", "Island 2", "Island 3"),
       "Species A" = c(0.2, 0.8, 0.05), 
       "Species B" = c(0.2, 0.1, 0.15), 
       "Species C" = c(0.2, 0.05, 0.7), 
       "Species D" = c(0.2, 0.025, 0.05), 
       "Species E" = c(0.2, 0.025, 0.05))

island %>% 
  knitr::kable()

Island	Species A	Species B	Species C	Species D	Species E
Island 1	0.20	0.20	0.20	0.200	0.200
Island 2	0.80	0.10	0.05	0.025	0.025
Island 3	0.05	0.15	0.70	0.050	0.050

library(janitor)
formatlong <- island %>% 
  janitor::clean_names() %>% 
  pivot_longer(cols = -island, names_to = "species", values_to = "proportion") 

fun_entropy <- function(p){
  -sum(p*log(p))
}

formatlong %>% 
  group_by(island) %>% 
  summarise(entropy = fun_entropy(proportion)) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>% 
  knitr::kable()

island	entropy
Island 1	1.61
Island 2	0.74
Island 3	0.98
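
One way to read these values is against the largest entropy possible for five species, log(5), which is reached only when every species is equally common. The base R sketch below re-enters the proportions from the table by hand and expresses each entropy as a fraction of that maximum; `islands`, `entropy`, and `max_entropy` are illustrative names.

```r
# Proportions re-entered from the table in the question
islands <- rbind(
  "Island 1" = c(0.2, 0.2, 0.2, 0.2, 0.2),
  "Island 2" = c(0.8, 0.1, 0.05, 0.025, 0.025),
  "Island 3" = c(0.05, 0.15, 0.7, 0.05, 0.05)
)

# Shannon entropy (natural log) of each row
entropy <- apply(islands, 1, function(p) -sum(p * log(p)))

# Maximum entropy for five categories: the uniform distribution
max_entropy <- log(ncol(islands))

print(round(entropy, 2))
print(round(entropy / max_entropy, 2))
```

Island 1 sits at the maximum, so a randomly sampled bird there is as unpredictable as it can be. Island 2 has the lowest entropy because one species makes up 80% of its birds, and Island 3 falls in between.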

kldivergence <- function(p,q) sum(p*(log(p)-log(q)))

formatlong %>% 
  select(-species) %>% 
  group_by(island) %>% 
  nest() %>% 
  pivot_wider(names_from = island, values_from = data) %>% 
  janitor::clean_names() %>% 
  mutate(i1_vs_i2 = map2_dbl(island_1, island_2, kldivergence),
         i2_vs_i1 = map2_dbl(island_2, island_1, kldivergence),
         i1_vs_i3 = map2_dbl(island_1, island_3, kldivergence),
         i3_vs_i1 = map2_dbl(island_3, island_1, kldivergence),
         i2_vs_i3 = map2_dbl(island_2, island_3, kldivergence),
         i3_vs_i2 = map2_dbl(island_3, island_2, kldivergence)
         ) %>% 
  select(i1_vs_i2:i3_vs_i2) %>% 
  pivot_longer(cols = everything(), names_to = "Comparison", values_to = "KL Divergence") %>% 
  knitr::kable()

Comparison	KL Divergence
i1_vs_i2	0.9704061
i2_vs_i1	0.8664340
i1_vs_i3	0.6387604
i3_vs_i1	0.6258376
i2_vs_i3	2.0109142
i3_vs_i2	1.8388452

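#Here iX_vs_iY is the divergence when island Y's distribution (q) is used to predict island X's distribution (p). Island 1 predicts the others best: with Island 1 as the model, the divergences are 0.87 for Island 2 and 0.63 for Island 3, the smallest value for each of those targets. Island 1 has the highest entropy, so it expects every species equally and is least surprised by the other islands. Islands 2 and 3 each expect one dominant species and are badly surprised by each other (2.01 and 1.84).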
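
The six divergences can also be arranged as a matrix, which makes it easier to see which island works best as a model. This is a base R sketch with the proportions re-entered by hand; rows are the target distribution (p), columns are the island used as the model (q), and `kl_mat`, `avg_by_model`, and `best_model` are illustrative names.

```r
islands <- rbind(
  "Island 1" = c(0.2, 0.2, 0.2, 0.2, 0.2),
  "Island 2" = c(0.8, 0.1, 0.05, 0.025, 0.025),
  "Island 3" = c(0.05, 0.15, 0.7, 0.05, 0.05)
)

# KL divergence of model q from target p
kl <- function(p, q) sum(p * (log(p) - log(q)))

n <- nrow(islands)
kl_mat <- matrix(0, n, n,
                 dimnames = list(target = rownames(islands),
                                 model  = rownames(islands)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    kl_mat[i, j] <- kl(islands[i, ], islands[j, ])
  }
}

# Average divergence when each island is used as the model
# (the diagonal is zero, so divide by n - 1)
avg_by_model <- colSums(kl_mat) / (n - 1)
best_model <- names(which.min(avg_by_model))

print(round(kl_mat, 2))
print(round(avg_by_model, 2))
print(best_model)
```

The matrix is not symmetric: the divergence depends on which island is treated as the target and which as the model.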

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

simulationdat3 <- sim_happiness(seed=8888, N_years=1000)

simulationdat3 <- simulationdat3 %>% 
  as_tibble() %>% 
  filter(age > 17) %>% 
  mutate(age = (age - 18)/ (65 -18)) %>% 
  mutate(mid = married +1)

m6_9 <- alist(
  happiness ~ dnorm( mu , sigma ),
  mu <- a[mid] + bA*age,
  a[mid] ~ dnorm( 0 , 1 ),
  bA ~ dnorm( 0 , 2 ),
  sigma ~ dexp(1)) %>% 
  quap(data = simulationdat3) 

m6_10 <- alist(
  happiness ~ dnorm( mu , sigma ),
  mu <- a + bA*age,
  a ~ dnorm( 0 , 1 ),
  bA ~ dnorm( 0 , 2 ),
  sigma ~ dexp(1)) %>% 
  quap(data = simulationdat3)

c(m6_9, m6_10) %>% 
  map_dfr(WAIC) %>% 
  add_column(model = c("m6_9", "m6_10"), .before = 1)

##   model     WAIC     lppd  penalty  std_err
## 1  m6_9 2787.105 -1389.72 3.832274 37.53721
## 2 m6_10 3101.924 -1548.59 2.371678 27.71770

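#Model m6_9, with the lower WAIC (2787.1 versus 3101.9), is expected to make better predictions. Model m6_10 provides the correct causal inference about the influence of age on happiness. Marriage status is a collider: it is influenced by both age and happiness, so conditioning on it in m6_9 induces a spurious association between age and happiness. That association is real in the sample and helps prediction, but it does not describe a causal effect of age. WAIC measures expected predictive accuracy, not causal validity, which is why the two answers disagree.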
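
The gap between prediction and causal inference can be reproduced in a much smaller setting. The sketch below is not sim_happiness: it assumes a toy data-generating process in which age and happiness are independent and both raise the probability of being married, and it swaps quap and WAIC for `lm` and `AIC` to stay in base R. All coefficients in the `plogis` call are made up for illustration.

```r
set.seed(1)
n <- 5000

# Toy assumption: age has NO causal effect on happiness
age       <- runif(n)
happiness <- rnorm(n)

# Marriage is a collider: it depends on both age and happiness
married <- rbinom(n, 1, plogis(-2 + 3 * age + 1.5 * happiness))

with_collider    <- lm(happiness ~ married + age)  # analogous to m6_9
without_collider <- lm(happiness ~ age)            # analogous to m6_10

age_effects <- c(
  with_collider    = unname(coef(with_collider)["age"]),
  without_collider = unname(coef(without_collider)["age"])
)
aic <- c(
  with_collider    = AIC(with_collider),
  without_collider = AIC(without_collider)
)

print(round(age_effects, 3))
print(round(aic, 1))
```

Under these assumptions the model that conditions on marriage has the lower AIC and reports a negative age effect that is not in the data-generating process, while the model without marriage recovers an age effect near zero.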

7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

avgfood + groupsize + area
avgfood + groupsize
groupsize + area
avgfood
area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

data("foxes")

foxes_data <- foxes %>% 
  as_tibble() %>% 
  mutate(across(-group, standardize))

mod_1 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bf*avgfood + Bg*groupsize + Ba*area, 
  a ~ dnorm(0, 0.2), 
  c(Bf, Bg, Ba) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = foxes_data)

mod_2 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bf*avgfood + Bg*groupsize, 
  a ~ dnorm(0, 0.2), 
  c(Bf, Bg) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = foxes_data)

mod_3 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bg*groupsize + Ba*area, 
  a ~ dnorm(0, 0.2), 
  c(Bg, Ba) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = foxes_data)

mod_4 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bf*avgfood, 
  a ~ dnorm(0, 0.2), 
  Bf ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = foxes_data)

mod_5 <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Ba*area, 
  a ~ dnorm(0, 0.2), 
  Ba ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = foxes_data)

compare(mod_1, mod_2, mod_3, mod_4, mod_5) %>% 
  as_tibble(rownames = "Model") %>% 
  mutate(across(where(is.numeric), ~ round(.x, 3)))

## # A tibble: 5 x 7
##   Model  WAIC    SE  dWAIC   dSE pWAIC weight
##   <chr> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 mod_1  323.  16.4  0     NA     4.83  0.421
## 2 mod_3  324.  15.8  0.706  2.95  3.73  0.295
## 3 mod_2  324.  16.2  0.847  3.49  3.83  0.275
## 4 mod_4  332.  13.7  8.49   7.19  1.56  0.006
## 5 mod_5  334.  13.8 10.3    7.24  2.58  0.002
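
To judge whether these differences are meaningful, each dWAIC can be compared with its dSE. The sketch below copies dWAIC and dSE by hand from the table above and builds a rough 99% interval for each difference, assuming a normal approximation with a multiplier of about 2.6; `comparison` and `z` are illustrative names.

```r
# dWAIC and dSE copied from the compare() output above
comparison <- data.frame(
  model = c("mod_3", "mod_2", "mod_4", "mod_5"),
  dWAIC = c(0.706, 0.847, 8.49, 10.3),
  dSE   = c(2.95, 3.49, 7.19, 7.24)
)

# Assumption: normal approximation, z of about 2.6 for a ~99% interval
z <- 2.6
comparison$lower <- comparison$dWAIC - z * comparison$dSE
comparison$upper <- comparison$dWAIC + z * comparison$dSE
comparison$overlaps_zero <- comparison$lower < 0 & comparison$upper > 0

print(comparison)
```

Every interval includes zero, so by this rough measure none of the models is reliably distinguishable from mod_1. The table still separates into two groups: mod_1, mod_2, and mod_3 all include groupsize and are nearly tied, while mod_4 and mod_5 omit it and score about 8 to 10 points worse. Assuming the fox DAG from the previous chapter (area affects weight only through avgfood, and avgfood affects weight both directly and through groupsize), this pattern is plausible: area and avgfood carry largely the same information, so swapping one for the other changes little once groupsize is included.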

Sudhir Pattath Kumar

2022-08-08

Week 7 - Overfitting