Chapter 7 - Ulysses’ Compass

This week began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
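As a quick reference (these are the standard definitions, stated here for convenience and not part of the assignment text), WAIC is built from the log-pointwise-predictive-density and an effective-parameter penalty, both computed from the posterior samples:

$$\mathrm{lppd} = \sum_{i} \log\!\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta_s)\right), \qquad p_{\mathrm{WAIC}} = \sum_{i} \operatorname{var}_{\theta}\,\log p(y_i \mid \theta), \qquad \mathrm{WAIC} = -2\left(\mathrm{lppd} - p_{\mathrm{WAIC}}\right)$$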

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.

# When comparing models with an information criterion, all models must be fit to exactly the same observations because information criteria are computed from the deviance, which is a sum over observations. The more observations a model is fit to, the larger the deviance, so comparing criterion values computed on mismatched numbers of observations is not meaningful: the differences reflect sample size rather than model quality.
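
# A minimal illustration (simulated values, not the assignment data): the
# deviance is -2 times a sum of pointwise log densities, so it scales with the
# number of observations even when the model is exactly the same.
set.seed(1)
y_small <- rnorm(100)
y_large <- rnorm(1000)
-2 * sum(dnorm(y_small, mean = 0, sd = 1, log = TRUE))  # score for 100 points
-2 * sum(dnorm(y_large, mean = 0, sd = 1, log = TRUE))  # roughly 10x larger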

# Load the rethinking package (assumed installed; it provides quap, WAIC, PSIS, and compare)
library(rethinking)

# Set seed
set.seed(42)

# Function: simulate N_POINTS observations, fit a simple linear regression, and return its WAIC
waic_sim <- function(N_POINTS){
  x <- rnorm(N_POINTS)
  y <- rnorm(N_POINTS)
  df <- data.frame(x=x, y=y)
  
  # Model
  model <- quap(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + bX * x,
      a ~ dnorm(0, 0.2),
      bX ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ), data=df
  )
  waic_result <- WAIC(model)
  return(as.numeric(waic_result["WAIC"]))
}

# Simulate WAIC for different numbers of data points
sim_list <- seq(from=100, to=1000, by=100)
waic_list <- numeric(length(sim_list))
for (i in seq_along(sim_list)) waic_list[i] <- waic_sim(sim_list[i])

# Plot simulated WAIC
plot(sim_list, waic_list,
     main="Simulated WAIC",
     xlab="Number of Observations",
     ylab="WAIC",
     type="b"
)

# As can be observed from the plot, WAIC grows roughly linearly with the number of observations, so values computed on different sample sizes are not comparable: differences would reflect sample size rather than relative model performance.

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations (at least 4).

# The effective number of parameters, as measured by PSIS or WAIC, decreases as a prior becomes more concentrated. This is due to the model becoming less flexible in fitting the data as the prior concentrates around particular values: the posterior cannot move far from the prior, so the pointwise log-likelihood varies less across posterior samples. As the flexibility of the model decreases, the effective number of parameters (the penalty term) also decreases.

# Set seed
set.seed(42)

# Function: fit the same model with a given prior spread and return the PSIS penalty (effective number of parameters)
psis_sim <- function(prior){
  N_POINTS <- 10
  x <- rnorm(N_POINTS)
  y <- rnorm(N_POINTS)
  df <- data.frame(x=x, y=y)
  df$x <- scale(df$x)
  df$y <- scale(df$y)
  
  # Model
  model <- quap(
    alist(
      y ~ dnorm(mu, 1),
      mu <- a + bX * x,
      a ~ dnorm(0, prior),
      bX ~ dnorm(0, prior)
    ), data=list(x=df$x, y=df$y, prior=prior)
  )
  psis_result <- PSIS(model)
  return(as.numeric(psis_result["penalty"]))
}

# Simulate the PSIS penalty for priors with decreasing spread
sim_list <- seq(1, 0.1, length.out = 20)
psis_list <- numeric(length(sim_list))
for (i in seq_along(sim_list)) psis_list[i] <- psis_sim(sim_list[i])

psis_list <- unlist(psis_list)
lm_df <- data.frame(x=sim_list, y=psis_list)

# Plot simulated parameter count
plot(sim_list, psis_list,
     main="Simulated Parameter Count Using PSIS",
     xlab="Prior Spread",
     ylab="Parameter Count"
)
abline(lm(lm_df$y ~ lm_df$x))

# As can be observed from the plot, the effective number of parameters increases with the prior spread; equivalently, as the prior becomes more concentrated (smaller spread), the penalty shrinks.
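
# Cross-check (a minimal sketch, not required by the question): the WAIC
# penalty can be computed by hand as the sum, over observations, of the
# posterior variance of the pointwise log-likelihood. A more concentrated
# prior shrinks this posterior variance, which is why the penalty falls.
set.seed(42)
N <- 10
df <- data.frame(x = rnorm(N), y = rnorm(N))
m <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu <- a + bX * x,
    a ~ dnorm(0, 0.5),
    bX ~ dnorm(0, 0.5)
  ), data = df
)
post <- extract.samples(m, n = 2000)
ll <- sapply(1:N, function(i)
  dnorm(df$y[i], mean = post$a + post$bX * df$x[i], sd = 1, log = TRUE))
sum(apply(ll, 2, var))  # hand-computed penalty
WAIC(m)["penalty"]      # should be close, up to sampling noise and the variance estimator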

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island   Species A   Species B   Species C   Species D   Species E
1        0.2         0.2         0.2         0.2         0.2
2        0.8         0.1         0.05        0.025       0.025
3        0.05        0.15        0.7         0.05        0.05

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
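
For reference (standard textbook definitions used in the calculations below, not part of the question text), the information entropy of a distribution p and the KL divergence from a target distribution p to a model q are:

$$H(p) = -\sum_{i} p_i \log p_i, \qquad D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i} p_i \left(\log p_i - \log q_i\right)$$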

# Part 1
# Define each island's bird distribution
island1 <- c(0.2, 0.2, 0.2, 0.2, 0.2)
island2 <- c(0.8, 0.1, 0.05, 0.025, 0.025)
island3 <- c(0.05, 0.15, 0.7, 0.05, 0.05)

# Compute entropy of each island's bird distribution
entropy1 <- -sum(island1*log(island1))
cat("Entropy for Island 1's bird distribution", entropy1, "\n")
## Entropy for Island 1's bird distribution 1.609438
entropy2 <- -sum(island2*log(island2))
cat("Entropy for Island 2's bird distribution", entropy2, "\n")
## Entropy for Island 2's bird distribution 0.7430039
entropy3 <- -sum(island3*log(island3))
cat("Entropy for Island 3's bird distribution", entropy3, "\n")
## Entropy for Island 3's bird distribution 0.9836003
# The entropy values show the uncertainty contained in each probability distribution. From the computed entropies we see that Island 1 has the highest entropy, meaning there is more uncertainty about which bird species we will encounter on Island 1 than on the other two islands.

# Part 2
# Create function to evaluate DKL
kl_div <- function(p, q) sum(p*(log(p) - log(q)))

# Create empty 3x3 matrix to hold the divergences
D_mat <- matrix(NA, nrow=3, ncol=3)

# Put the island distributions into a named list
island_dict <- list(island1=island1, island2=island2, island3=island3)

# Fill the 3x3 matrix: rows index the true distribution p, columns the model q
for(p in 1:3) {
  for (q in 1:3) {
    p_val <- island_dict[[paste0("island", p)]]
    q_val <- island_dict[[paste0("island", q)]]
    D_mat[p, q] <- kl_div(p_val, q_val)
  }
}
D_mat
##           [,1]      [,2]      [,3]
## [1,] 0.0000000 0.9704061 0.6387604
## [2,] 0.8664340 0.0000000 2.0109142
## [3,] 0.6258376 1.8388452 0.0000000
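# Optional summary (a small sketch, not required by the question): summing each
# column gives the total divergence incurred when that island's distribution is
# used as the model (q) for the others; the smallest column total identifies
# the best-predicting island.
colSums(D_mat)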
# The six KL divergence values of interest exclude the diagonal, where p and q are the same and the divergence is 0. Reading down each column (each island used as the model for the others), Island 1 has the lowest divergences. Because its distribution is the most even, with the highest entropy, it is the least surprised by the other islands' distributions, so Island 1 predicts the others best.

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

# Data
d <- sim_happiness(seed=1977 , N_years=1000)
precis(d)
##                    mean        sd      5.5%     94.5%     histogram
## age        3.300000e+01 18.768883  4.000000 62.000000 ▇▇▇▇▇▇▇▇▇▇▇▇▇
## married    3.007692e-01  0.458769  0.000000  1.000000    ▇▁▁▁▁▁▁▁▁▃
## happiness -1.000070e-16  1.214421 -1.789474  1.789474      ▇▅▇▅▅▇▅▇
d2 <- d[ d$age>17 , ] # only adults
d2$A <- ( d2$age - 18 ) / ( 65 - 18 )

# Model 6.9
d2$mid <- d2$married + 1
m6.9 <- quap(
    alist(
        happiness ~ dnorm( mu , sigma ),
        mu <- a[mid] + bA*A,
        a[mid] ~ dnorm( 0 , 1 ),
        bA ~ dnorm( 0 , 2 ),
        sigma ~ dexp(1)
    ) , data=d2 )
precis(m6.9,depth=2)
##             mean         sd       5.5%      94.5%
## a[1]  -0.2350877 0.06348986 -0.3365568 -0.1336186
## a[2]   1.2585517 0.08495989  1.1227694  1.3943340
## bA    -0.7490274 0.11320112 -0.9299447 -0.5681102
## sigma  0.9897080 0.02255800  0.9536559  1.0257600
# Model 6.10
m6.10 <- quap(
    alist(
        happiness ~ dnorm( mu , sigma ),
        mu <- a + bA*A,
        a ~ dnorm( 0 , 1 ),
        bA ~ dnorm( 0 , 2 ),
        sigma ~ dexp(1)
    ) , data=d2 )
precis(m6.10)
##                mean         sd       5.5%     94.5%
## a      1.649248e-07 0.07675015 -0.1226614 0.1226617
## bA    -2.728620e-07 0.13225976 -0.2113769 0.2113764
## sigma  1.213188e+00 0.02766080  1.1689803 1.2573949
# Compare the two models using WAIC or PSIS
set.seed(42)
compare(m6.9, m6.10, func=WAIC)
##           WAIC       SE    dWAIC      dSE    pWAIC       weight
## m6.9  2713.890 37.56357   0.0000       NA 3.707085 1.000000e+00
## m6.10 3102.197 27.76303 388.3072 35.41938 2.503050 4.787993e-85
# Based on the comparison, WAIC for model m6.9 is much lower, so m6.9 is expected to make better out-of-sample predictions. However, m6.9 conditions on marriage status, which is a collider, so it does not provide the correct causal inference; m6.10 does, correctly indicating no causal influence of age on happiness. The answers disagree because WAIC measures predictive accuracy, not causal correctness: conditioning on the collider induces a real statistical association between age and happiness, and that association genuinely improves prediction in m6.9 even though it is not causal.
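# A quick check (a small sketch, not part of the required answer): within each
# marriage-status stratum, the simulated sample should show a negative
# association between age and happiness, even though age has no causal effect
# on happiness in the simulation. This collider-induced association is what
# m6.9 exploits to predict better.
by(d2, d2$mid, function(g) cor(g$A, g$happiness))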

7-5. Revisit the urban fox data, data(foxes), from the previous chapter's practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

(1) avgfood + groupsize + area
(2) avgfood + groupsize
(3) groupsize + area
(4) avgfood
(5) area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

# Load dagitty for DAG tools (assumed installed; it may already be attached in the setup chunk)
library(dagitty)

# Create dag
dag_6.4 <- dagitty("dag {
    Area -> AverageFood -> GroupSize -> Weight
    AverageFood -> Weight
}")

# Draw dag
drawdag(dag_6.4, lwd=2)
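
# Optional check (a small sketch): list the conditional independencies implied
# by this DAG. In particular, Weight should be independent of Area once
# AverageFood is known, which helps in reading the WAIC table below.
impliedConditionalIndependencies(dag_6.4)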

# Check data
data(foxes)
raw_data <- foxes
str(raw_data)
## 'data.frame':    116 obs. of  5 variables:
##  $ group    : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ avgfood  : num  0.37 0.37 0.53 0.53 0.49 0.49 0.45 0.45 0.74 0.74 ...
##  $ groupsize: int  2 2 2 2 2 2 2 2 3 3 ...
##  $ area     : num  1.09 1.09 2.05 2.05 2.12 2.12 1.29 1.29 3.78 3.78 ...
##  $ weight   : num  5.02 2.84 5.33 6.07 5.85 3.25 4.53 4.09 6.13 5.59 ...
# Scale variables
raw_data$avgfood_scaled <- scale(raw_data$avgfood)
raw_data$groupsize_scaled <- scale(raw_data$groupsize)
raw_data$area_scaled <- scale(raw_data$area)
raw_data$weight_scaled <- scale(raw_data$weight)

# Model avgfood + groupsize + area
model_1 <- quap(
  alist(
    weight_scaled ~ dnorm(mu,sigma),
    mu <- a + bF * avgfood_scaled + bS * groupsize_scaled + bA * area_scaled,
    a ~ dnorm(0, 0.2),
    c(bF, bS, bA) ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)

# Model avgfood + groupsize
model_2 <- quap(
  alist(
    weight_scaled ~ dnorm(mu,sigma),
    mu <- a + bF * avgfood_scaled + bS * groupsize_scaled,
    a ~ dnorm(0, 0.2),
    c(bF, bS) ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)

# Model groupsize + area
model_3 <- quap(
  alist(
    weight_scaled ~ dnorm(mu,sigma),
    mu <- a + bS * groupsize_scaled + bA * area_scaled,
    a ~ dnorm(0, 0.2),
    c(bS, bA) ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)

# Model avgfood
model_4 <- quap(
  alist(
    weight_scaled ~ dnorm(mu,sigma),
    mu <- a + bF * avgfood_scaled,
    a ~ dnorm(0, 0.2),
    bF ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)


# Model area
model_5 <- quap(
  alist(
    weight_scaled ~ dnorm(mu,sigma),
    mu <- a + bA * area_scaled,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)

# Compare the five models using WAIC
set.seed(42)
compare(model_1, model_2, model_3, model_4, model_5, func=WAIC)
##             WAIC       SE      dWAIC      dSE    pWAIC      weight
## model_1 323.4034 16.39459  0.0000000       NA 4.912946 0.393758348
## model_2 323.7097 16.00734  0.3063719 3.663642 3.623341 0.337832915
## model_3 324.2015 15.88974  0.7981585 2.890907 3.896006 0.264187247
## model_4 333.8267 14.00524 10.4233566 7.242566 2.600692 0.002146973
## model_5 333.8954 13.81491 10.4920180 7.324749 2.704999 0.002074517
# Based on the results, model_1, model_2, and model_3 have nearly identical WAIC values: their differences (dWAIC < 1) are much smaller than the standard errors of those differences (dSE around 3), so these three models are effectively indistinguishable. model_4 and model_5 are also similar to each other but sit about 10 WAIC units higher, and even that gap is only modestly larger than its dSE (about 7). The DAG explains this pattern: area influences weight only through average food, so once a model contains avgfood (or avgfood together with groupsize), adding area contributes no new information; likewise avgfood and area carry essentially the same information about weight, which is why model_4 and model_5 score almost identically. The first three models do better because they also include groupsize, which carries the additional, negative path from food to weight that is masked when food or area is used alone.
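
# Optional visualization (a small sketch): the rethinking package can plot the
# comparison table, showing each model's WAIC with its standard error, which
# makes the overlap among model_1-model_3 and between model_4 and model_5 easy
# to see.
plot(compare(model_1, model_2, model_3, model_4, model_5, func=WAIC))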