Chapter 7 - Ulysses’ Compass

This week began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS estimate the degree of overfitting. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools alone will mislead you, so models must be designed through some other method rather than selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
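As a minimal illustration of that first claim (a sketch only, using simulated noise and base R), the snippet below fits polynomials of increasing degree to data in which y is unrelated to x; the in-sample R-squared can only increase as meaningless terms are added, which is exactly the overfitting risk WAIC and PSIS try to quantify.

set.seed(7)
d <- data.frame(x = runif(20), y = rnorm(20))   # y is pure noise, unrelated to x
# In-sample R^2 for polynomial degrees 1 through 6: non-decreasing by construction
sapply(1:6, function(k) summary(lm(y ~ poly(x, k), data = d))$r.squared)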

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.

## The information criteria are driven mainly by the deviance, and the deviance (like the log-pointwise predictive density it is built from) is a sum with one term per observation. Adding observations therefore mechanically increases the deviance and the WAIC/PSIS score, so a model fit to more observations will appear worse than a model fit to fewer observations even if it is actually the better model. Comparisons are only meaningful when every model is scored on exactly the same observations.
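# A minimal by-hand sketch of why the criterion scales with n (this follows the book's pointwise WAIC code and assumes the rethinking package is loaded; the intercept-only model below is only for illustration):

set.seed(1)
n <- 100
d <- data.frame(y = rnorm(n))
m <- quap(alist(y ~ dnorm(mu, 1), mu ~ dnorm(0, 1)), data = d)

post    <- extract.samples(m, n = 1e3)
# log-likelihood of every observation under every posterior sample (n x 1000 matrix)
logprob <- sapply(1:1e3, function(s) dnorm(d$y, post$mu[s], 1, log = TRUE))
lppd_i  <- sapply(1:n, function(i) log_sum_exp(logprob[i, ]) - log(1e3))
p_i     <- apply(logprob, 1, var)
-2 * (sum(lppd_i) - sum(p_i))   # approximates WAIC(m); both sums have one term per observation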


# Define a simulation function that simulates n observations, fits a simple regression with quap, and returns its WAIC estimate:

set.seed(12345)

Simulation_f <- function(n){
  Data_1 <- tibble(x = rnorm(n), 
                y = rnorm(n)) %>% 
    mutate(across(everything(), standardize)) 
  
  alist(
    x ~ dnorm(mu, 1), 
    mu <- a + By*y,
    a ~ dnorm(0, 0.4), 
    By ~ dnorm(0.5, 0.9) 
  ) %>% 
    quap(data = Data_1) %>% 
    WAIC() %>% 
    as_tibble() %>% 
    pull(WAIC)
}


# Apply the simulation function to a sequence of sample sizes, n = 50 to 500 observations.

# WAIC table:

seq(50, 500, by = 50) %>%
  map_dbl(Simulation_f) %>% 
  enframe(name = "Observations_number", value = "WAIC") %>% 
  mutate(Observations_number = seq(50, 500, by = 50)) 
## # A tibble: 10 × 2
##    Observations_number  WAIC
##                  <dbl> <dbl>
##  1                  50  145.
##  2                 100  283.
##  3                 150  427.
##  4                 200  570.
##  5                 250  712.
##  6                 300  852.
##  7                 350  996.
##  8                 400 1138.
##  9                 450 1280.
## 10                 500 1419.
# Plot WAIC against the number of observations:

seq(50, 500, by = 50) %>%
  map_dbl(Simulation_f) %>% 
  enframe(name = "Observations_number", value = "WAIC") %>% 
  mutate(Observations_number = seq(50, 500, by = 50)) %>%
ggplot(aes(Observations_number, WAIC)) +
  geom_smooth(method = "lm") +
  geom_point() +
  labs(x = "Number of Observations ",
       y = "WAIC value",
       caption = "WAIC values simulation based on different observations numbers pertinent to same data") +
  theme_minimal()

## Findings Inference: 
# If models were fit to different numbers of observations, their information criterion values would not be comparable. The simulation shows WAIC increasing roughly linearly with the number of observations, simply because the log-pointwise predictive density (and hence the deviance) is a sum with one term per observation. A model fit to more data therefore gets a larger WAIC regardless of how well it actually fits, so comparing it against a model fit to fewer observations would unfairly favor the smaller sample.

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations (at least 4).

# Define a simulation function that takes the prior standard deviation as its argument and returns the PSIS estimate of the effective number of parameters (the penalty):

Simulation_PSIS <- function(prior){
  Data_b <- tibble(x = rnorm(10), 
                y = rnorm(10)) %>% 
    mutate(across(everything(), standardize)) 
  
  suppressMessages(
    alist(
      x ~ dnorm(mu, 1), 
      mu <- a + By*y,
      a ~ dnorm(0, prior), 
      By ~ dnorm(0, prior)
    ) %>% 
      quap(data = list(x = Data_b$x, y = Data_b$y, prior = prior)) %>% 
      PSIS() %>% 
      as_tibble() %>% 
      pull(penalty)
  )
}

## Apply Simulation_PSIS to a sequence of prior standard deviations ranging from 1 down to 0.1:

seq(1, 0.1, length.out = 20) %>%
  purrr::map_dbl(Simulation_PSIS) %>% 
  enframe(name = "b_prior", value = "n_parameters") %>% 
  mutate(b_prior = seq(1, 0.1, length.out = 20)) 
## # A tibble: 20 × 2
##    b_prior n_parameters
##      <dbl>        <dbl>
##  1   1            1.32 
##  2   0.953        1.33 
##  3   0.905        2.02 
##  4   0.858        1.33 
##  5   0.811        1.39 
##  6   0.763        1.54 
##  7   0.716        1.43 
##  8   0.668        1.45 
##  9   0.621        1.76 
## 10   0.574        1.35 
## 11   0.526        1.32 
## 12   0.479        1.22 
## 13   0.432        1.00 
## 14   0.384        0.811
## 15   0.337        1.01 
## 16   0.289        0.584
## 17   0.242        0.593
## 18   0.195        0.357
## 19   0.147        0.493
## 20   0.1          0.157
## Plot the effective number of parameters against the prior standard deviation:

seq(1, 0.1, length.out = 20) %>%
  purrr::map_dbl(Simulation_PSIS) %>% 
  enframe(name = "b_prior", value = "n_parameters") %>% 
  mutate(b_prior = seq(1, 0.1, length.out = 20)) %>% 
ggplot(aes(b_prior, n_parameters)) +
  geom_smooth(method = "lm") +
  geom_point() +
  labs(x = "Prior",
       y = "Number of parameters",
       caption = "Effective number 
       of parameters simulation as a function of a prior") +
  theme_minimal()

## Inference: 
# As the prior becomes more concentrated (smaller standard deviation), the effective number of parameters estimated by PSIS decreases. A tighter prior constrains the parameters more strongly, so the posterior cannot move as far in response to the particular sample; the model is less flexible, overfits less, and the penalty term that measures this flexibility shrinks accordingly.
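# A hedged cross-check of the same pattern using WAIC rather than PSIS (the penalty column of WAIC plays the same role as the effective number of parameters; the helper fit_with_prior below exists only for this sketch):

set.seed(2)
Data_c <- tibble(x = rnorm(10), y = rnorm(10)) %>%
  mutate(across(everything(), standardize))

fit_with_prior <- function(prior_sd) {
  quap(alist(
    x ~ dnorm(mu, 1),
    mu <- a + By*y,
    a ~ dnorm(0, prior_sd),
    By ~ dnorm(0, prior_sd)
  ), data = list(x = Data_c$x, y = Data_c$y, prior_sd = prior_sd))
}

WAIC(fit_with_prior(1.0))$penalty   # wide prior: typically a larger penalty
WAIC(fit_with_prior(0.1))$penalty   # concentrated prior: typically a smaller penalty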

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island     Species A  Species B  Species C  Species D  Species E
Island 1        0.20       0.20       0.20      0.200      0.200
Island 2        0.80       0.10       0.05      0.025      0.025
Island 3        0.05       0.15       0.70      0.050      0.050

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

# Compute the entropy of each island's bird distribution.
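# The data frame Poly_Island_Data is not created anywhere in this chunk; a minimal construction from the table above (column names assumed to match the table headers) is:

Poly_Island_Data <- tibble::tribble(
  ~Island,    ~`Species A`, ~`Species B`, ~`Species C`, ~`Species D`, ~`Species E`,
  "Island 1",         0.20,         0.20,         0.20,        0.200,        0.200,
  "Island 2",         0.80,         0.10,         0.05,        0.025,        0.025,
  "Island 3",         0.05,         0.15,         0.70,        0.050,        0.050
)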


Poly_Island_Data_A <- Poly_Island_Data %>%
  janitor::clean_names() %>%
  pivot_longer(cols = -island, names_to = "species", values_to = "proportion")


# Define the information entropy function and apply it to each island's distribution.

inf_entropy <- function(p) {-sum(p*log(p))}

Poly_Island_Data_A %>%
  group_by(island) %>%
  summarise(entropy = inf_entropy(proportion)) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  knitr::kable()
island     entropy
Island 1      1.61
Island 2      0.74
Island 3      0.98
## Entropy Result Inference: 
# Island 1 has the highest entropy (1.61) because its birds are spread evenly across all five species, the maximum-uncertainty distribution. Island 3 is intermediate (0.98), and Island 2 has the lowest entropy (0.74) because most of its birds belong to a single species, so its distribution carries the least uncertainty.
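# Quick hand check of the Island 1 value: a uniform distribution over five species attains the maximum possible entropy, log(5).

-sum(rep(0.2, 5) * log(0.2))   # 1.609..., matching the 1.61 above
log(5)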



## Compute the Kullback-Leibler (KL) divergence of each island from the others.

KL_Divergence <- function(p, q) sum(p*(log(p) - log(q)))  # D_KL(p || q): p is the target distribution, q is the model used to predict it

Poly_Island_Data_A %>% 
  select(-species) %>% 
  group_by(island) %>% 
  nest() %>% 
  pivot_wider(names_from = island, values_from = data) %>% 
  janitor::clean_names() %>% 
  mutate(Island_1_vs_Island_2 = map2_dbl(island_1, island_2, KL_Divergence),
         Island_2_vs_Island_1 = map2_dbl(island_2, island_1, KL_Divergence),
         Island_1_vs_Island_3 = map2_dbl(island_1, island_3, KL_Divergence),
         Island_3_vs_Island_1 = map2_dbl(island_3, island_1, KL_Divergence),
         Island_2_vs_Island_3 = map2_dbl(island_2, island_3, KL_Divergence),
         Island_3_vs_Island_2 = map2_dbl(island_3, island_2, KL_Divergence)
         ) %>% 
  select(Island_1_vs_Island_2:Island_3_vs_Island_2) %>% 
  pivot_longer(cols = everything(), names_to = "Island Comparison", values_to = "KL Divergence Values") %>% 
  knitr::kable()
Island Comparison      KL Divergence Values
Island_1_vs_Island_2              0.9704061
Island_2_vs_Island_1              0.8664340
Island_1_vs_Island_3              0.6387604
Island_3_vs_Island_1              0.6258376
Island_2_vs_Island_3              2.0109142
Island_3_vs_Island_2              1.8388452
## Island's KL Divergence Result Inference: 
# To judge which island predicts the others best, compare the divergences in which that island is used as the model (q). With Island 1 as the model, the divergences are the smallest of each pair: Island_2_vs_Island_1 = 0.87 and Island_3_vs_Island_1 = 0.63, versus much larger values when Islands 2 and 3 are used to predict each other (2.01 and 1.84). Island 1 therefore predicts the others best. The reason is its high entropy: because its distribution is uniform, it assigns non-negligible probability to every species and so is the least surprised by the other islands' distributions.
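# Hand check of one entry: Island 1 as the target distribution p and Island 2 as the model q, matching the Island_1_vs_Island_2 value above.

p <- c(0.20, 0.20, 0.20, 0.20, 0.20)
q <- c(0.80, 0.10, 0.05, 0.025, 0.025)
sum(p * (log(p) - log(q)))   # ~ 0.97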

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

# Simulate the happiness data from Chapter 6.

Happiness_Data <- sim_happiness(seed = 1977, N_years = 1000)

Happiness_Data <- Happiness_Data %>% 
  as_tibble() %>% 
  filter(age > 17) %>% 
  mutate(age = (age - 18) / (65 - 18)) %>% 
  mutate(mid = married + 1)


## Re-run models m6.9 and m6.10 from Chapter 6.

m6.9 <- alist(
  happiness ~ dnorm( mu , sigma ),
  mu <- a[mid] + bA*age,
  a[mid] ~ dnorm( 0 , 1 ),
  bA ~ dnorm( 0 , 2 ),
  sigma ~ dexp(1)) %>% 
  quap(data = Happiness_Data) 

m6.10 <- alist(
  happiness ~ dnorm( mu , sigma ),
  mu <- a + bA*age,
  a ~ dnorm( 0 , 1 ),
  bA ~ dnorm( 0 , 2 ),
  sigma ~ dexp(1)) %>% 
  quap(data = Happiness_Data)


## Compare m6.9 and m6.10 using WAIC.

c(m6.9, m6.10) %>% 
  map_dfr(WAIC) %>% 
  add_column(model = c("m6.9", "m6.10"), .before = 1) %>% 
  knitr::kable()
model      WAIC       lppd       penalty   std_err
m6.9   2713.971  -1353.247      3.738532  37.54465
m6.10  3101.906  -1548.612      2.340445  27.74379
## Model's result Inference:
# WAIC clearly favors m6.9, the model that conditions on marriage status, because it has the lower WAIC (2713.97 vs 3101.91), so m6.9 is expected to make better out-of-sample predictions. But m6.10 provides the correct causal inference: in this simulation, age has no effect on happiness. Marriage status is a collider, a common consequence of age and happiness, so conditioning on it in m6.9 opens a non-causal path and induces a spurious negative association between age and happiness. That spurious association is still genuine statistical information about the sample, which is why it improves prediction even though it misleads us about causation. Predictive accuracy and causal correctness are simply different targets, which is why the two answers disagree.
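# A quick check of the coefficient estimates supports this (precis from the rethinking package; depth = 2 is needed to display the vector of intercepts in m6.9):

precis(m6.9, depth = 2)   # bA is clearly negative here: a spurious association induced by the collider
precis(m6.10)             # bA is approximately zero: the correct causal answer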

7-5. Revisit the urban fox data, data(foxes), from the previous chapter's practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

(1) avgfood + groupsize + area
(2) avgfood + groupsize
(3) groupsize + area
(4) avgfood
(5) area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

## Load and inspect the foxes data set before fitting and comparing the five models.

data("foxes")

str(foxes)
## 'data.frame':    116 obs. of  5 variables:
##  $ group    : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ avgfood  : num  0.37 0.37 0.53 0.53 0.49 0.49 0.45 0.45 0.74 0.74 ...
##  $ groupsize: int  2 2 2 2 2 2 2 2 3 3 ...
##  $ area     : num  1.09 1.09 2.05 2.05 2.12 2.12 1.29 1.29 3.78 3.78 ...
##  $ weight   : num  5.02 2.84 5.33 6.07 5.85 3.25 4.53 4.09 6.13 5.59 ...
summary(foxes)
##      group          avgfood         groupsize          area      
##  Min.   : 1.00   Min.   :0.3700   Min.   :2.000   Min.   :1.090  
##  1st Qu.:11.75   1st Qu.:0.6600   1st Qu.:3.000   1st Qu.:2.590  
##  Median :18.00   Median :0.7350   Median :4.000   Median :3.130  
##  Mean   :17.21   Mean   :0.7517   Mean   :4.345   Mean   :3.169  
##  3rd Qu.:24.00   3rd Qu.:0.8000   3rd Qu.:5.000   3rd Qu.:3.772  
##  Max.   :30.00   Max.   :1.2100   Max.   :8.000   Max.   :5.070  
##      weight     
##  Min.   :1.920  
##  1st Qu.:3.720  
##  Median :4.420  
##  Mean   :4.530  
##  3rd Qu.:5.375  
##  Max.   :7.550
## Standardize all variables except group, which is a group identifier rather than a continuous predictor.

Data_foxes <- foxes %>% 
  as_tibble() %>% 
  mutate(across(-group, standardize))

# Fit the five candidate models:

set.seed(1234)

Model1_Foxs <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bavgfood*avgfood + Bgrp*groupsize + Barea*area, 
  a ~ dnorm(0, 0.2), 
  c(Bavgfood, Bgrp, Barea) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)

Model2_Foxs <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bavgfood*avgfood + Bgrp*groupsize, 
  a ~ dnorm(0, 0.2), 
  c(Bavgfood, Bgrp) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)

Model3_Foxs <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bgrp*groupsize + Barea*area, 
  a ~ dnorm(0, 0.2), 
  c(Bgrp, Barea) ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)

Model4_Foxs <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Bavgfood*avgfood, 
  a ~ dnorm(0, 0.2), 
  Bavgfood ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)

Model5_Foxs <- alist(
  weight ~ dnorm(mu, sigma), 
  mu <- a + Barea*area, 
  a ~ dnorm(0, 0.2), 
  Barea ~ dnorm(0, 0.5), 
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)


## Compare the five models using WAIC.

compare(Model1_Foxs, Model2_Foxs, Model3_Foxs, Model4_Foxs, Model5_Foxs) %>% 
  as_tibble(rownames = "Model") %>% 
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>% 
  knitr::kable()
Model          WAIC     SE  dWAIC   dSE  pWAIC  weight
Model1_Foxs  322.95  16.28   0.00    NA   4.69    0.46
Model2_Foxs  323.64  16.05   0.69  3.55   3.62    0.32
Model3_Foxs  324.49  15.89   1.54  2.91   4.04    0.21
Model4_Foxs  331.88  13.79   8.93  7.18   1.62    0.01
Model5_Foxs  333.73  13.86  10.78  7.22   2.64    0.00
## Recall the fox DAG from the previous chapter's practice problems.


quest <- dagitty( "dag {area -> avgfood
    avgfood -> groupsize
    avgfood -> weight
    groupsize -> weight
}")

coordinates (quest) <- list(x=c(area=1,avgfood=0,groupsize=2, weight=1),
                          y=c(area=0, avgfood=1, groupsize=1, weight=2))

drawdag(quest, lwd = 2)

## Result Inference:
# Model1_Foxs (avgfood + groupsize + area) has the lowest WAIC (322.95) and Model5_Foxs (area only) the highest (333.73), but the differences among the top three models (dWAIC of 0.69 and 1.54, with dSE of 3.55 and 2.91) are small relative to their standard errors, so Models 1-3 are effectively indistinguishable and no single model should be preferred on WAIC alone. The DAG explains this grouping: area influences weight only through avgfood, and avgfood acts on weight both directly and through groupsize. Once groupsize is in the model (Models 1-3), adding avgfood and/or area carries essentially the same information, so their predictive scores cluster together. Models 4 and 5 omit groupsize; avgfood (or area, which operates only through avgfood) then mixes the positive direct path with the negative path through groupsize, and these nearly cancel. Both models therefore carry roughly the same, smaller amount of information about weight, which is why their WAIC values are nearly identical and noticeably worse.
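# A hedged check that the top models really are indistinguishable, using the book's pointwise-difference approach for Model1_Foxs versus Model3_Foxs:

set.seed(1234)
w1 <- WAIC(Model1_Foxs, pointwise = TRUE)$WAIC
w3 <- WAIC(Model3_Foxs, pointwise = TRUE)$WAIC
diff_w <- w1 - w3
c(dWAIC = sum(diff_w), dSE = sqrt(length(diff_w) * var(diff_w)))
# The WAIC difference is small relative to its standard error, so these two
# models are effectively indistinguishable by WAIC.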