The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.
As information criterion is based on the deviance, which is a sum. More observations generally lead to a higher deviance as more values are summed up, rendering a comparison to a model with less observations useless.
data(Howell1)
d <- Howell1[complete.cases(Howell1), ]
d200 <- d[sample(1:nrow(d), size = 200, replace = FALSE), ]
d300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
d400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
m200 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d200,
start = list(a = mean(d200$height), b = 0, sigma = sd(d200$height))
)
m300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d300,
start = list(a = mean(d300$height), b = 0, sigma = sd(d300$height))
)
m400 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d400,
start = list(a = mean(d400$height), b = 0, sigma = sd(d400$height))
)
(model.compare <- compare(m200, m300, m400))
## WAIC SE dWAIC dSE pWAIC weight
## m200 1195.410 21.39121 0.0000 NA 3.091989 1.000000e+00
## m300 1824.224 28.01351 628.8135 31.95179 3.449534 2.850255e-137
## m400 2449.238 29.18498 1253.8277 29.80705 2.954761 5.429700e-273
From the 3 models compared with 200, 300 and 400 observations, we can see that the WAIC increases with the number of observations
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.
The effective number of parameters (or "pWAIC" in the WAIC() function output), is the penalty term of our regularisation approaches. As priors become more regularising (i.e. more concentrated on certain prior knowledge or assumptions), the effective number of parameters decreases.
In the case of WAIC, pWAIC is the variance in the log-likelihoods for each observation in the training data. More concentrated priors constrain this likelihood and subsequent measure of variance, thus reducing it.
As for PSIS, PD (effective number of parameters) tells us about the flexibility of the model. Increasingly regularised priors decrease the flexibility of the model.
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_not_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
m_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
WAIC(m_not_concentrated, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7875 55.63169 4.237921 36.4821
WAIC(m_concentrated, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.9054 55.63106 4.178351 36.3666
The pWAIC decreases with the more concentrated prior.
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
To compute the entropies,we just need a function to compute the entropy.Information entropy, as defined in lecture and the book, is simply: H(p)=−∑ipilog(pi) where p is a vector of probabilities summing to 1:
H <- function(p) -sum(p*log(p))
IB <- list()
IB[[1]] <- c( 0.2 , 0.2 , 0.2 , 0.2 , 0.2 )
IB[[2]] <- c( 0.8 , 0.1 , 0.05 , 0.025 , 0.025 )
IB[[3]] <- c( 0.05 , 0.15 , 0.7 , 0.05 , 0.05 )
sapply( IB , H )
## [1] 1.6094379 0.7430039 0.9836003
Entropy is a measure of the evenness of a distribution. The first islands has the most even distribution of birbs. This means you wouldn’t be very surprised by any particular birb. The second island, in contrast, has a very uneven distribution of birbs.
Now we need K-L distance, so let’s write a function for it:
DKL <- function(p,q) sum( p*(log(p)-log(q)) )
This is the distance from q to p, regarding p as true and q as the model. Now to use each island as a model of the others, we need to consider the different ordered pairings. I’ll just make a matrix and loop over rows and columns:
Dm <- matrix( NA , nrow=3 , ncol=3 )
for ( i in 1:3 ) for ( j in 1:3 ) Dm[i,j] <- DKL( IB[[j]] , IB[[i]])
#test <- DKL( IB[[1]] , IB[[2]])
#0.2*(log(0.2)- log(0.8)) + 0.2*(log(0.2)- log(0.1))+ 0.2*(log(0.2)- log(0.05))+0.2*(log(0.2)- log(0.025))+0.2*(log(0.2)-log(0.025))
round( Dm , 2 )
## [,1] [,2] [,3]
## [1,] 0.00 0.87 0.63
## [2,] 0.97 0.00 1.84
## [3,] 0.64 2.01 0.00
The way to read this is each row as a model and each column as a true distribution. So the first island, the first row, has the smaller distances to the other islands. This makes sense, since it has the highest entropy. Why does that give it a shorter distance to the other islands? Because it is less surprised by the other islands, due to its high entropy.
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
## R code 6.21
d <- sim_happiness(seed = 1977, N_years = 1000)
## R code 6.22
d2 <- d[d$age > 17, ] # only adults
d2$A <- (d2$age - 18) / (65 - 18)
## R code 6.23
d2$mid <- d2$married + 1
m6.9 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a[mid] + bA * A,
a[mid] ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## R code 6.24
m6.10 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## Comparison
compare(m6.9, m6.10)
## WAIC SE dWAIC dSE pWAIC weight
## m6.9 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
According to these information criteria, model m6.9 is a lot better performing (in terms of out-of-sample deviance). However, model m6.10 provides the true causal relationship between age and happiness (none). WAIC would have us favour model m6.9 simply because it does a better job at predicting the happiness of out-of-sample individuals.Because this model identifies the happiness of the different groups of people: the miserable unmarried as well as the ecstatic married ones. Conditioning on the collider added statistical association and so aids predictive accuracy. Thus, while doing better at predicting model m6.9 fails at finding causality.
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
data("foxes")
dat_foxes <- foxes %>%
as_tibble() %>%
mutate(across(-group, standardize))
m1 <- alist(
weight ~ dnorm(mu, sigma),
mu <- a + Bf*avgfood + Bg*groupsize + Ba*area,
a ~ dnorm(0, 0.2),
c(Bf, Bg, Ba) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes)
m2 <- alist(
weight ~ dnorm(mu, sigma),
mu <- a + Bf*avgfood + Bg*groupsize,
a ~ dnorm(0, 0.2),
c(Bf, Bg) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes)
m3 <- alist(
weight ~ dnorm(mu, sigma),
mu <- a + Bg*groupsize + Ba*area,
a ~ dnorm(0, 0.2),
c(Bg, Ba) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes)
m4 <- alist(
weight ~ dnorm(mu, sigma),
mu <- Bf*avgfood,
a ~ dnorm(0, 0.2),
Bf ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes)
m5 <- alist(
weight ~ dnorm(mu, sigma),
mu <- a + Ba*area,
a ~ dnorm(0, 0.2),
Ba ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes)
compare(m1, m2, m3, m4, m5) %>%
as_tibble(rownames = "model") %>%
mutate(across(is.numeric, round, 2)) %>%
knitr::kable()
| model | WAIC | SE | dWAIC | dSE | pWAIC | weight |
|---|---|---|---|---|---|---|
| m1 | 322.88 | 16.28 | 0.00 | NA | 4.66 | 0.46 |
| m3 | 323.90 | 15.68 | 1.01 | 2.90 | 3.72 | 0.28 |
| m2 | 324.13 | 16.14 | 1.24 | 3.60 | 3.86 | 0.25 |
| m4 | 331.58 | 13.67 | 8.70 | 7.21 | 1.50 | 0.01 |
| m5 | 333.72 | 13.79 | 10.84 | 7.24 | 2.65 | 0.00 |
There is no support for the preference of a particular model. But we can see that m1, m2, and m3 are grouped together based on AIC as well as m4 and m5. Following the DAG, as long as we include the groupsize path, it does not make a difference if we use area or avgfood. They encapsulate the same information, and WAIC hence returns a similar value. m4 and m5 both don’t include groupsize. But as avgfood and area contain mostly the same information, they both show similar WAIC estimates.