This week began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that answers the question or completes the requested activity. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas.
Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.
set.seed(10)
#Information criteria are based on the deviance, which is a sum over all observations rather than an average. More observations therefore generally lead to a larger deviance, because more terms are summed, and comparing against a model fit to fewer observations is meaningless.
# Example
library(rethinking)
data(Howell1)
data <- Howell1[complete.cases(Howell1), ]
# Draw four samples of different sizes from the same data frame
data1 <- data[sample(1:nrow(data), size = 100, replace = TRUE), ]
data2 <- data[sample(1:nrow(data), size = 150, replace = TRUE), ]
data3 <- data[sample(1:nrow(data), size = 250, replace = TRUE), ]
data4 <- data[sample(1:nrow(data), size = 350, replace = TRUE), ]
m1 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data1,
start = list(a = mean(data1$height), b = 0, sigma = sd(data1$height))
)
m2 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data2,
start = list(a = mean(data2$height), b = 0, sigma = sd(data2$height))
)
m3 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data3,
start = list(a = mean(data3$height), b = 0, sigma = sd(data3$height))
)
m4 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data4,
start = list(a = mean(data4$height), b = 0, sigma = sd(data4$height))
)
(model.compare <- compare(m1, m2, m3,m4))
## WAIC SE dWAIC dSE pWAIC weight
## m1 163.9087 8.64472 0.0000 NA 2.346737 1.000000e+00
## m2 224.6435 12.38355 60.7348 13.72386 2.488805 6.480474e-14
## m3 355.0308 16.16059 191.1221 12.56882 2.505492 3.150396e-42
## m4 481.4305 24.31944 317.5219 15.45877 2.629179 1.124602e-69
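The dependence on sample size can also be seen without the rethinking package. The following is a minimal base-R sketch, not part of the original answer: it simulates data from one fixed linear process at several sample sizes and computes the deviance of a least-squares fit. The helper name `simulate_deviance` and the chosen sizes are illustrative assumptions.

```r
# Sketch: deviance is a sum over observations, so it grows with sample size
# even when the model and the data-generating process stay the same.
set.seed(10)

simulate_deviance <- function(n) {
  x <- rnorm(n)
  y <- 1 + 0.5 * x + rnorm(n)       # same true process for every n
  fit <- lm(y ~ x)
  as.numeric(-2 * logLik(fit))      # deviance: -2 * summed log-likelihood
}

sizes <- c(100, 200, 400, 800)
dev <- sapply(sizes, simulate_deviance)

# Total deviance rises with n, while deviance per observation stays similar
result <- data.frame(n = sizes, deviance = dev, per_obs = dev / sizes)
print(result)
```

The total deviance should rise roughly in proportion to n while the per-observation value stays about the same, which is why models fit to different numbers of observations cannot be ranked by their raw information criterion values.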
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations (at least 4).
#As a prior becomes more concentrated, the effective number of parameters decreases, because the model has less flexibility to fit the sample. A simulation is presented below:
m1<-quap(
alist(
height~dnorm(mu,sigma),
mu<-a+b*weight,
a~dnorm(0,1),
b~dnorm(1,0.1),
sigma~dexp(1)
),
data=Howell1
)
m2<-quap(
alist(
height~dnorm(mu,sigma),
mu<-a+b*weight,
a~dnorm(0,1),
b~dnorm(1,0.5),
sigma~dexp(1)
),
data=Howell1
)
m3<-quap(
alist(
height~dnorm(mu,sigma),
mu<-a+b*weight,
a~dnorm(0,1),
b~dnorm(1,1),
sigma~dexp(1)
),
data=Howell1
)
m4<-quap(
alist(
height~dnorm(mu,sigma),
mu<-a+b*weight,
a~dnorm(0,1),
b~dnorm(1,5),
sigma~dexp(1)
),
data=Howell1
)
compare(m1,m2,m3,m4)
## WAIC SE dWAIC dSE pWAIC weight
## m1 5120.888 26.74444 0.00000 NA 1.214377 9.999999e-01
## m2 5153.988 25.51358 33.10047 8.087548 1.287882 6.491181e-08
## m4 5155.891 25.33239 35.00303 8.593188 1.324225 2.507199e-08
## m3 5156.025 25.36636 35.13752 8.446855 1.337753 2.344149e-08
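The same shrinkage can be illustrated analytically in a simpler setting. The sketch below is an analogy, not a computation of pWAIC: for a linear model with a zero-centered Gaussian prior of standard deviation `prior_sd` on each coefficient and an assumed known residual standard deviation `sigma`, the posterior mean is a ridge estimate with penalty `sigma^2 / prior_sd^2`, and the trace of the resulting hat matrix is a standard measure of the effective number of parameters. The simulated predictors and all variable names are assumptions made for illustration.

```r
# Sketch: effective number of parameters under a Gaussian prior,
# measured as the trace of the ridge hat matrix (an analogy to pWAIC).
set.seed(10)
n <- 200
X <- cbind(rnorm(n), rnorm(n), rnorm(n))   # three simulated predictors
sigma <- 1                                  # assumed known residual sd

effective_params <- function(prior_sd) {
  lambda <- sigma^2 / prior_sd^2            # tighter prior -> larger penalty
  H <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))
  sum(diag(H))                              # trace of the hat matrix
}

# Priors ordered from diffuse to very concentrated
prior_sds <- c(5, 1, 0.5, 0.1, 0.01)
p_eff <- sapply(prior_sds, effective_params)
print(data.frame(prior_sd = prior_sds, p_eff = round(p_eff, 3)))
```

With a diffuse prior the trace is close to 3, the number of coefficients, and it falls toward 0 as the prior tightens, mirroring the direction of the pWAIC values in the comparison above.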
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
| Island | Species A | Species B | Species C | Species D | Species E |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 2 | 0.8 | 0.1 | 0.05 | 0.025 | 0.025 |
| 3 | 0.05 | 0.15 | 0.7 | 0.05 | 0.05 |
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
island1 <- c(0.2, 0.2, 0.2, 0.2, 0.2)
island2 <- c(0.8, 0.1, 0.05, 0.025, 0.025)
island3 <- c(0.05, 0.15, 0.7, 0.05, 0.05)
entropy1 <- -sum(island1 * log(island1))
entropy2 <- -sum(island2 * log(island2))
entropy3 <- -sum(island3 * log(island3))
entropy1
## [1] 1.609438
entropy2
## [1] 0.7430039
entropy3
## [1] 0.9836003
#Entropy measures how uncertain we are about which species we would get if we randomly observed a bird on an island. Island 1 has the highest entropy and island 2 the lowest. We are therefore most uncertain on island 1, where all five species are equally common, and least uncertain on island 2, where species A dominates.
DKL <- function(p,q) sum( p*(log(p)-log(q)) )
Dm <- matrix( NA , nrow=3 , ncol=3 )
Dm[1,1] <- DKL (island1, island1)
Dm[1,2] <- DKL (island1, island2)
Dm[1,3] <- DKL (island1, island3)
Dm[2,1] <- DKL (island2, island1)
Dm[2,2] <- DKL (island2, island2)
Dm[2,3] <- DKL (island2, island3)
Dm[3,1] <- DKL (island3, island1)
Dm[3,2] <- DKL (island3, island2)
Dm[3,3] <- DKL (island3, island3)
Dm
## [,1] [,2] [,3]
## [1,] 0.0000000 0.9704061 0.6387604
## [2,] 0.8664340 0.0000000 2.0109142
## [3,] 0.6258376 1.8388452 0.0000000
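In the matrix above, entry [i, j] is the divergence when island i is the target and island j is used as the model, so each column shows how well one island predicts the others. The following sketch, which is an addition and not part of the original answer, rebuilds the matrix with named rows and columns and averages each column over the two other islands. The names `islands`, `kl`, `mean_divergence`, and `best_model` are illustrative.

```r
# Sketch: summarize how well each island serves as a model of the others.
islands <- list(
  island1 = c(0.2, 0.2, 0.2, 0.2, 0.2),
  island2 = c(0.8, 0.1, 0.05, 0.025, 0.025),
  island3 = c(0.05, 0.15, 0.7, 0.05, 0.05)
)
DKL <- function(p, q) sum(p * (log(p) - log(q)))

# kl[i, j]: divergence with island i as the target (p) and island j as the model (q)
kl <- sapply(islands, function(q) sapply(islands, function(p) DKL(p, q)))

# Average over the two other islands; the diagonal contributes zero
mean_divergence <- colSums(kl) / (length(islands) - 1)
best_model <- names(which.min(mean_divergence))

print(round(kl, 4))
print(round(mean_divergence, 4))
print(best_model)
```

Island 1 has the smallest average divergence. Because its distribution has the highest entropy, it assigns moderate probability to every species and is never badly surprised, whereas islands 2 and 3 concentrate on a single species and are heavily penalized when that species is rare elsewhere.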
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
d <- sim_happiness(seed = 1977, N_years = 1000)
## R code 6.22
d2 <- d[d$age > 17, ] # only adults
d2$A <- (d2$age - 18) / (65 - 18)
## R code 6.23
d2$mid <- d2$married + 1
m6.9 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a[mid] + bA * A,
a[mid] ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## R code 6.24
m6.10 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## Comparison
compare(m6.9, m6.10)
## WAIC SE dWAIC dSE pWAIC weight
## m6.9 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
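The comparison above favors m6.9, the model that conditions on marriage status, even though marriage is a collider in the Chapter 6 example and conditioning on it distorts the estimated influence of age. The sketch below reproduces that tension in a deliberately simplified setting. It is not the book's sim_happiness simulation: the data-generating process, the use of lm, and the use of AIC in place of WAIC are all assumptions made to keep the example in base R.

```r
# Sketch: a collider improves prediction while biasing a causal estimate.
set.seed(1977)
n <- 1000
age <- rnorm(n)
happiness <- rnorm(n)                 # by construction, age has no effect

# Collider: both age and happiness raise the probability of being married
married <- rbinom(n, 1, plogis(2 * (age + happiness)))

fit_collider <- lm(happiness ~ age + married)   # conditions on the collider
fit_plain <- lm(happiness ~ age)                # omits the collider

# The collider model predicts better (lower AIC) ...
print(AIC(fit_collider, fit_plain))

# ... but its age coefficient is biased away from the true value of zero
print(c(collider = unname(coef(fit_collider)["age"]),
        plain = unname(coef(fit_plain)["age"])))
```

Marriage status carries real information about happiness, so including it improves predictions, but it also opens a non-causal path between age and happiness. Predictive criteria reward the first property and are blind to the second, which is why the two questions have different answers.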
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:
(1) avgfood + groupsize + area
(2) avgfood + groupsize
(3) groupsize + area
(4) avgfood
(5) area
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
data("foxes")
m1<-quap(
alist(
weight~dnorm(mu,sigma),
mu<-a+bFood*avgfood+bGroup*groupsize+bArea*area,
a~dnorm(0,0.2),
c(bFood,bGroup,bArea)~dnorm(0,5),
sigma~dexp(1)
),
data=foxes
)
m2<-quap(
alist(
weight~dnorm(mu,sigma),
mu<-a+bFood*avgfood+bGroup*groupsize,
a~dnorm(0,0.2),
c(bFood,bGroup)~dnorm(0,5),
sigma~dexp(1)
),
data=foxes
)
m3<-quap(
alist(
weight~dnorm(mu,sigma),
mu<-a+bGroup*groupsize+bArea*area,
a~dnorm(0,0.2),
c(bGroup,bArea)~dnorm(0,5),
sigma~dexp(1)
),
data=foxes
)
m4<-quap(
alist(
weight~dnorm(mu,sigma),
mu<-a+bFood*avgfood,
a~dnorm(0,0.2),
bFood~dnorm(0,5),
sigma~dexp(1)
),
data=foxes
)
m5<-quap(
alist(
weight~dnorm(mu,sigma),
mu<-a+bArea*area,
a~dnorm(0,0.2),
bArea~dnorm(0,5),
sigma~dexp(1)
),
data=foxes
)
compare(m1,m2,m3,m4,m5)
## WAIC SE dWAIC dSE pWAIC weight
## m1 414.6214 14.62098 0.000000 NA 4.116776 6.693967e-01
## m2 416.0324 14.17557 1.410972 4.652538 3.428954 3.305940e-01
## m3 438.0770 16.27384 23.455615 10.782328 3.142175 5.399600e-06
## m4 438.9707 12.55201 24.349246 10.500356 1.888798 3.453919e-06
## m5 443.1985 14.63420 28.577095 11.339185 2.016616 4.171057e-07