The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
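For reference in the problems below, the chapter defines WAIC from the log-pointwise-predictive-density (lppd) and a penalty term, both of which are sums over the observations:

$$
\text{WAIC}(y, \Theta) = -2\left( \text{lppd} - \sum_{i} \operatorname{var}_{\theta}\, \log p(y_i \mid \theta) \right),
\qquad
\text{lppd} = \sum_{i} \log \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta_s)
$$

The penalty term is the pWAIC reported by the rethinking functions and is interpreted as the effective number of parameters; the fact that both pieces are sums over observations is what makes sample size matter in problem 7-1.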
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.
library(rethinking)
library(ggplot2)
data(Howell1)
#Information criteria are built on deviance and lppd, which are sums over the observations. A model fit to more observations accumulates a larger sum, so its WAIC grows roughly in proportion to n regardless of how well it fits. Comparing models fit to different numbers of observations therefore confounds sample size with fit quality, which is why compare() warns about it below.
d <- Howell1[complete.cases(Howell1), ]
set.seed(7) # arbitrary seed added so the subsamples are reproducible
d_400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
d_200 <- d[sample(1:nrow(d), size = 200, replace = FALSE), ]
d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
# The same height ~ log(weight) regression, fit separately to the 400-, 200-, and 500-row subsamples
m_400 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_400,
start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_200 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_200,
start = list(a = mean(d_200$height), b = 0, sigma = sd(d_200$height))
)
m_500 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_500,
start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
(model.compare <- compare(m_400, m_200, m_500))
## WAIC SE dWAIC dSE pWAIC weight
## m_200 1231.962 25.92770 0.000 NA 3.476270 1.000000e+00
## m_400 2458.905 32.20075 1226.943 48.85243 3.206218 3.739387e-267
## m_500 3060.048 35.92410 1828.086 33.82293 3.340478 0.000000e+00
Warning messages:
1: In compare(m_400, m_200, m_500) :
  Different numbers of observations found for at least two models.
  Model comparison is valid only for models fit to exactly the same observations.
  Number of observations for each model:
  m_400 400
  m_200 200
  m_500 500
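To make the sample-size scaling explicit, a quick follow-up (reusing the fits above) divides each WAIC by its number of observations; the per-observation values should be roughly equal, confirming that the large gaps in the table come from n rather than from fit quality.
# Per-observation WAIC for the three fits: roughly constant across sample sizes,
# so the differences in total WAIC above mostly reflect how many terms are summed
waic_totals <- c(WAIC(m_200)$WAIC, WAIC(m_400)$WAIC, WAIC(m_500)$WAIC)
n_obs <- c(200, 400, 500)
round(waic_totals / n_obs, 2)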
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.
#The effective number of parameters is reported as pWAIC (the penalty column) in the WAIC() output.
#It is computed as the sum, over observations, of the variance of the log-likelihood across posterior samples.
#A more concentrated (regularizing) prior constrains the posterior, so the pointwise log-likelihood varies less from sample to sample and the penalty shrinks.
#In other words, the flexibility of the model (its effective number of parameters) decreases as the prior becomes more concentrated.
# Log-transform and standardize height and weight
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
# Wide, nearly flat priors on the intercept and slope
m_wide <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
# The same model with much more concentrated priors
m_narrow <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
WAIC(m_wide, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7507 55.65316 4.277825 36.46696
WAIC(m_narrow, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7127 55.69573 4.339372 36.55705
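With this simple two-parameter model and all 544 rows the prior barely matters: the penalties above are nearly identical. The shrinkage is easier to see with a more flexible model fit to less data. The sketch below is one way to show it (the names d_small, m_poly_wide, and m_poly_tight are mine); the penalty reported for the tight-prior fit should be noticeably smaller.
# Precompute polynomial terms so the formulas stay simple
d$w2 <- d$weight.log.z^2
d$w3 <- d$weight.log.z^3
set.seed(11) # arbitrary seed for a reproducible subsample
d_small <- d[sample(1:nrow(d), 30), ]
# Cubic regression with wide priors: the posterior is free to chase the 30 points
m_poly_wide <- quap(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b1 * weight.log.z + b2 * w2 + b3 * w3,
a ~ dnorm(0, 10),
c(b1, b2, b3) ~ dnorm(0, 10),
sigma ~ dexp(1)
),
data = d_small
)
# The same cubic regression with concentrated (regularizing) priors
m_poly_tight <- quap(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b1 * weight.log.z + b2 * w2 + b3 * w3,
a ~ dnorm(0, 0.2),
c(b1, b2, b3) ~ dnorm(0, 0.2),
sigma ~ dexp(1)
),
data = d_small
)
# Compare the effective number of parameters (penalty / pWAIC) of the two fits
WAIC(m_poly_wide)$penalty
WAIC(m_poly_tight)$penalty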
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

|          | Species A | Species B | Species C | Species D | Species E |
|----------|-----------|-----------|-----------|-----------|-----------|
| Island 1 | 0.200     | 0.200     | 0.200     | 0.200     | 0.200     |
| Island 2 | 0.800     | 0.100     | 0.050     | 0.025     | 0.025     |
| Island 3 | 0.050     | 0.150     | 0.700     | 0.050     | 0.050     |
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
# First Island
p1 <- c(0.2, 0.2, 0.2, 0.2, 0.2)
-sum(p1 * log(p1))
## [1] 1.609438
# Second Island
p2 <- c(0.8, 0.1, 0.05, 0.025, 0.025)
-sum(p2 * log(p2))
## [1] 0.7430039
# Third Island
p3 <- c(0.05, 0.15, 0.7, 0.05, 0.05)
-sum(p3 * log(p3))
## [1] 0.9836003
Entropy measures the uncertainty in a probability distribution: the more evenly probability is spread across the five species, the higher the entropy and the harder it is to predict which species a randomly observed bird will be. The entropies here are ordered Island 2 < Island 3 < Island 1. Island 1, where all species are equally common, is the most uncertain. Island 2 has the lowest entropy because Species A dominates (0.8) while the other species are rare, so there is comparatively little uncertainty about which bird will be seen there.
Divergence between Island 1 and Islands 2 and 3 taken together:
# Average the proportions of Island 2 and 3
(q <- apply(cbind(p2, p3), 1, mean))
## [1] 0.4250 0.1250 0.3750 0.0375 0.0375
# Divergence of Island 1 from combined Islands 2 & 3
(D_pq <- sum(p1 * log(p1 / q)))
## [1] 0.4871152
# Divergence of combined Islands 2 & 3 from Island 1
(D_qp <- sum(q * log(q / p1)))
## [1] 0.3717826
Using the averaged distribution of Islands 2 and 3 as a model for Island 1 gives a divergence of about 0.49; using Island 1 as a model for that averaged distribution gives about 0.37. KL divergence is not symmetric, so the two directions differ.
Repeating this for the other two target islands:
# Collect both directions for each target island:
#   Approximated = D(target || average of the other two): the other islands used as a model of the target
#   Approximator = D(average of the other two || target): the target island used as a model of the others
output <- data.frame(
Approximated = D_pq,
Approximator = D_qp
)
# Target: Island 2
q <- apply(cbind(p1, p3), 1, mean)
D_pq <- sum(p2 * log(p2 / q))
D_qp <- sum(q * log(q / p2))
output <- rbind(output, c(D_pq, D_qp))
# Target: Island 3
q <- apply(cbind(p1, p2), 1, mean)
D_pq <- sum(p3 * log(p3 / q))
D_qp <- sum(q * log(q / p3))
output <- rbind(output, c(D_pq, D_qp))
# output
rownames(output) <- c("Island 1", "Island 2", "Island 3")
output
## Approximated Approximator
## Island 1 0.4871152 0.3717826
## Island 2 1.2387437 1.2570061
## Island 3 1.0097143 1.1184060
Island 1 predicts the other islands best. Its uniform, high-entropy distribution spreads probability over all five species, so it is never very surprised by the skewed distributions of the other islands: its Approximator divergence (0.37) is far smaller than those of Island 2 (1.26) and Island 3 (1.12). Islands 2 and 3 concentrate most of their probability on a single species, assign little probability to birds that are common elsewhere, and therefore make poor models of the other islands.
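For completeness, the six pairwise divergences described in the prompt can also be computed directly, treating each island in turn as a model of each of the others (the helper name kl_div is mine). Reading down the Island1 column, where Island 1 is the model, gives the smallest divergences, consistent with the conclusion above.
# D(p || q): the extra surprise from using distribution q to predict data generated by p
kl_div <- function(p, q) sum(p * log(p / q))
islands <- list(Island1 = p1, Island2 = p2, Island3 = p3)
# Rows index the true distribution p, columns the island q used as the model;
# the diagonal is zero, leaving the six pairwise divergences
kl_matrix <- sapply(islands, function(q) sapply(islands, function(p) kl_div(p, q)))
round(kl_matrix, 2)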
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
Model m6.9 estimates a negative relationship between age and happiness, but this association is not causal: the model conditions on marriage status, which is a collider because marriage is influenced by both age and happiness. Conditioning on the collider opens a non-causal path between age and happiness and induces the spurious association.
## R code 6.21
d <- sim_happiness(seed = 1977, N_years = 1000)
## R code 6.22
d2 <- d[d$age > 17, ] # only adults
d2$A <- (d2$age - 18) / (65 - 18)
## R code 6.23
d2$mid <- d2$married + 1
m6.9 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a[mid] + bA * A,
a[mid] ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## R code 6.24
m6.10 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## Comparison
compare(m6.9, m6.10)
## WAIC SE dWAIC dSE pWAIC weight
## m6.9 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
According to the information criteria, model m6.9 is expected to make much better predictions. However, model m6.10 provides the correct causal inference about the influence of age on happiness (essentially none). The answers disagree because the association created by conditioning on the collider is real statistical information: within this population, marriage status and age together genuinely help predict happiness, so WAIC rewards m6.9 even though its coefficient on age cannot be read causally. WAIC measures expected predictive accuracy, not causal correctness.
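To see the causal side directly, a quick check of the posteriors (no new data required): in m6.9, which conditions on marriage status, the age coefficient bA is reliably negative, while in m6.10 it sits near zero, matching the simulation in which age has no effect on happiness.
# Posterior summaries for the two models; depth = 2 shows the vector of intercepts a[mid]
precis(m6.9, depth = 2)
precis(m6.10)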
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables: (1) avgfood + groupsize + area, (2) avgfood + groupsize, (3) groupsize + area, (4) avgfood, and (5) area.
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
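Before fitting, it helps to redraw the fox DAG from the previous chapter's problems; the sketch below uses dagitty, with coordinates chosen only for readability.
library(dagitty)
# area influences weight only through avgfood, and avgfood acts on weight both
# directly and through groupsize
fox_dag <- dagitty("dag{ area -> avgfood avgfood -> groupsize avgfood -> weight groupsize -> weight }")
coordinates(fox_dag) <- list(
x = c(area = 1, avgfood = 0, groupsize = 2, weight = 1),
y = c(area = 0, avgfood = 1, groupsize = 1, weight = 2)
)
drawdag(fox_dag)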
# data loading and prepping
data(foxes)
d <- foxes
d$area <- scale(d$area)
d$avgfood <- scale(d$avgfood)
d$weight <- scale(d$weight)
d$groupsize <- scale(d$groupsize)
## Models
# (1) `avgfood + groupsize + area`
b7h5_1 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood + bGroup * groupsize + bArea * area,
a ~ dnorm(0, .2),
c(bFood, bGroup, bArea) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = d
)
# (2) `avgfood + groupsize`
b7h5_2 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood + bGroup * groupsize,
a ~ dnorm(0, .2),
c(bFood, bGroup) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = d
)
# (3) `groupsize + area`
b7h5_3 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bGroup * groupsize + bArea * area,
a ~ dnorm(0, .2),
c(bGroup, bArea) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = d
)
# (4) `avgfood`
b7h5_4 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood,
a ~ dnorm(0, .2),
bFood ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = d
)
# (5) `area`
b7h5_5 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bArea * area,
a ~ dnorm(0, .2),
bArea ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = d
)
## Comparison
compare(b7h5_1, b7h5_2, b7h5_3, b7h5_4, b7h5_5)
## WAIC SE dWAIC dSE pWAIC weight
## b7h5_1 323.3416 16.88472 0.0000000 NA 5.187854 0.410865167
## b7h5_2 323.9809 16.81807 0.6393574 3.894043 4.130907 0.298445216
## b7h5_3 324.0666 16.18771 0.7250103 4.207249 3.945774 0.285933698
## b7h5_4 333.5084 13.79238 10.1668387 8.659625 2.454047 0.002546821
## b7h5_5 333.7929 13.79707 10.4513608 8.704578 2.684646 0.002209099
The differences in WAIC among the top three models are tiny relative to the standard error of the differences (dSE), so those models cannot be distinguished. Even the roughly 10-unit gap between them and the two single-predictor models lies within a 99% interval of the difference (dWAIC ± 2.6 · dSE crosses zero), as the comparison plot shows:
plot(compare(b7h5_1, b7h5_2, b7h5_3, b7h5_4, b7h5_5))
Models b7h5_1, b7h5_2, and b7h5_3 are essentially identical in expected predictive accuracy. The fox DAG explains why: area influences weight only through avgfood, so area and avgfood carry the same information, and any model that contains groupsize together with at least one of them makes effectively the same predictions.
Likewise, models b7h5_4 and b7h5_5 are nearly identical because each contains only area or avgfood in isolation, and all of the influence of area on weight must pass through avgfood. Without groupsize in the model, the direct positive path from food to weight and the indirect negative path through group size largely cancel, which is why these two models have the worst WAIC scores.
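A coefficient comparison across the five models makes the DAG story visible; a sketch using rethinking's coeftab (the choice of parameters to display is mine). The food and area slopes sit near zero in the single-predictor models, while including groupsize reveals the offsetting direct (positive food) and indirect (negative group size) paths.
# Posterior means and intervals for the slope parameters across all five models
plot(coeftab(b7h5_1, b7h5_2, b7h5_3, b7h5_4, b7h5_5), pars = c("bFood", "bGroup", "bArea"))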