The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.
# Information criteria are based on deviance, which is accrued over observations without being divided by the number of observations, which denotes the sum. Hence, if everything else is equal, a model with more observations will have a higher deviance and less accuracy as per the information criteria. Thus, it would be an unfair comparison to contrast models fit to different numbers of observations.
# We may measure WAIC for model's fit to increasingly small samples of the same data as an experiment. The information criteria should decrease alongside the sample size.
# As shown below, we take an example from the howell database -
library(rethinking)
data(Howell1)
set.seed(6)
data <- Howell1[complete.cases(Howell1), ]
d_500 <- data[sample(1:nrow(data), size = 500, replace = FALSE), ]
d_400 <- data[sample(1:nrow(data), size = 400, replace = FALSE), ]
d_300 <- data[sample(1:nrow(data), size = 300, replace = FALSE), ]
m_500 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_500,
start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_400,
start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_300,
start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))
## WAIC SE dWAIC dSE pWAIC weight
## m_300 1862.175 27.91431 0.0000 NA 3.437154 1.000000e+00
## m_400 2419.639 33.94881 557.4644 46.34780 3.442395 8.874639e-122
## m_500 3054.604 35.32283 1192.4289 53.17046 3.257118 1.167790e-259
# From the results, we see that the WAIC increased from the N=300 model to the N=400 model and again to the N=500 model. The compare() function also returns a warning about the number of observations being different.
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.
# The effective number of parameters decreases as the prior becomes more concentrated. The reason is because in this case of DIC, pD is a measure of the model's flexibility; with more concentrated priors, the model becomes less flexible (and therefore less prone to overfitting). In case of WAIC, pWAIC is a measure of the variance in the log-likelihood for each observation in the training sample; with more concentrated priors, the likelihood would become more concentrated as well, reducing variance.
# We perform an experiment to demonstrate this. We calculate WAIC for two models utilizing common data with the difference only in the concentration of their priors.
data <- Howell1[complete.cases(Howell1), ]
data$height.log <- log(data$height)
data$height.log.z <- (data$height.log - mean(data$height.log)) / sd(data$height.log)
data$weight.log <- log(data$weight)
data$weight.log.z <- (data$weight.log - mean(data$weight.log)) / sd(data$weight.log)
m_wide <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = data
)
m_narrow <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = data
)
WAIC(m_wide, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7631 55.62747 4.24594 36.5361
WAIC(m_narrow, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7579 55.62061 4.241636 36.44326
# From the results, we note that the pWAIC decreases as the priors become more concentrated.
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
| Island | Species A | Species B | Species C | Species D | Species E |
| 1 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 2 | 0.8 | 0.1 | 0.05 | 0.025 | 0.025 |
| 3 | 0.05 | 0.15 | 0.7 | 0.05 | 0.05 |
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
H <- function(p) -sum(p*log(p))
IB <- list()
IB[[1]] <- c( 0.2 , 0.2 , 0.2 , 0.2 , 0.2 )
IB[[2]] <- c( 0.8 , 0.1 , 0.05 , 0.025 , 0.025 )
IB[[3]] <- c( 0.05 , 0.15 , 0.7 , 0.05 , 0.05 )
sapply( IB , H )
## [1] 1.6094379 0.7430039 0.9836003
# The first island has the highest entropy followed by the third island and then lastly, the second island. This is because birds are evenly distributed on the first island compared to the second island which has the most uneven distribution of birds.
DKL <- function(p,q) sum( p*(log(p)-log(q)) ) #K-L distance
Dm <- matrix( NA , nrow=3 , ncol=3 )
for ( i in 1:3 ) for ( j in 1:3 ) Dm[i,j] <- DKL( IB[[j]] , IB[[i]])
round( Dm , 2 )
## [,1] [,2] [,3]
## [1,] 0.00 0.87 0.63
## [2,] 0.97 0.00 1.84
## [3,] 0.64 2.01 0.00
# From the results, we note that the the first island has smaller distances to other islands because it has the highest entropy.
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
# We consider a multiple regression model aimed at inferring the influence of age on happiness, while controlling the marriage status.
#Let’s consider the slope βA first, because how we scale the predictor A will determine the intercept. We’ll focus only on the adult sample, aged 18 or above. Assuming a very strong relationship between age and happiness, such that happiness is at its maximum at age 18 and its minimum at age 65. It’ll be easier if we rescale age so that the range from 18 to 65 is one unit. Now this new variable A ranges from 0 to 1, where 0 is age 18 and 1 is age 65.
# Happiness is on an arbitrary scale, in this data, from −2 to +2. So our assumption of strongest relationship, taking happiness from maximum to minimum, has a slope with rise over run of (2 − (−2))/1 = 4. As we know, 95% of the mass of a normal distribution is contained within 2 standard deviations. So if we set the standard deviation of the prior to half of 4, we are implying that we expect 95% of plausible slopes to be less than maximally strong. This isn’t a very strong prior, but it does help bound inference to realistic ranges.
# Concerning the, intercepts, each α is the value of μi when Ai = 0. In this case, that means at age 18. So we need to allow α to cover the full range of happiness scores. Normal(0, 1) will put 95% of the mass in the −2 to +2 interval.
# We need to construct the marriage status index variable, as well.
# Approximating the posterior
d <- sim_happiness( seed=1977 , N_years=1000 )
d2 <- d[ d$age>17 , ] # only adults
d2$A <- ( d2$age - 18 ) / ( 65 - 18 )
d2$mid <- d2$married + 1
# Model m6.9 contains both marriage status and age. Model m6.10 contains only age. Model m6.9 produces a confounded inference about the relationship between age and happiness, due to opening a collider path.
m6.9 <- quap(
alist(
happiness ~ dnorm( mu , sigma ),
mu <- a[mid] + bA*A,
a[mid] ~ dnorm( 0 , 1 ),
bA ~ dnorm( 0 , 2 ),
sigma ~ dexp(1)
) , data=d2 )
precis(m6.9,depth=2)
## mean sd 5.5% 94.5%
## a[1] -0.2350877 0.06348986 -0.3365568 -0.1336186
## a[2] 1.2585517 0.08495989 1.1227694 1.3943340
## bA -0.7490274 0.11320112 -0.9299447 -0.5681102
## sigma 0.9897080 0.02255800 0.9536559 1.0257600
m6.10 <- quap(
alist(
happiness ~ dnorm( mu , sigma ),
mu <- a + bA*A,
a ~ dnorm( 0 , 1 ),
bA ~ dnorm( 0 , 2 ),
sigma ~ dexp(1)
) , data=d2 )
precis(m6.10)
## mean sd 5.5% 94.5%
## a 1.649248e-07 0.07675015 -0.1226614 0.1226617
## bA -2.728620e-07 0.13225976 -0.2113769 0.2113764
## sigma 1.213188e+00 0.02766080 1.1689803 1.2573949
#Now we comparing both models
compare( m6.9 , m6.10 )
## WAIC SE dWAIC dSE pWAIC weight
## m6.9 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
# m6.9 is expected to predict much better as it produces the invalid inference. This is because the collider path does convey actual association. We simply end up mistaken about the causal inference. We should not use WAIC (or LOO) to choose among models, unless we have some clear sense of the causal model
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
library(dagitty)
quest <- dagitty("dag{
avgfood <- area
weight <- avgfood
weight <- groupsize
groupsize <- avgfood
}")
plot(quest)
data(foxes)
d <- foxes
d$W <- standardize(d$weight)
d$A <- standardize(d$area)
d$F <- standardize(d$avgfood)
d$G <- standardize(d$groupsize)
m1 <- quap(
alist(
W ~ dnorm( mu , sigma ),
mu <- a + bF*F + bG*G + bA*A,
a ~ dnorm(0,0.2),
c(bF,bG,bA) ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data=d )
m2 <- quap(
alist(
W ~ dnorm( mu , sigma ),
mu <- a + bF*F + bG*G,
a ~ dnorm(0,0.2),
c(bF,bG) ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data=d )
m3 <- quap(
alist(
W ~ dnorm( mu , sigma ),
mu <- a + bG*G + bA*A,
a ~ dnorm(0,0.2),
c(bG,bA) ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data=d )
m4 <- quap(
alist(
W ~ dnorm( mu , sigma ),
mu <- a + bF*F,
a ~ dnorm(0,0.2),
bF ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data=d )
m5 <- quap(
alist(
W ~ dnorm( mu , sigma ),
mu <- a + bA*A,
a ~ dnorm(0,0.2),
bA ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data=d )
compare( m1 , m2 , m3 , m4 , m5 )
## WAIC SE dWAIC dSE pWAIC weight
## m1 322.8847 16.27783 0.000000 NA 4.656959 0.465363694
## m3 323.8985 15.68240 1.013749 2.899417 3.718565 0.280323674
## m2 324.1284 16.13964 1.243666 3.598475 3.859897 0.249881396
## m4 333.4444 13.78855 10.559695 7.193396 2.426279 0.002370193
## m5 333.7239 13.79447 10.839215 7.242069 2.650636 0.002061043
# From the results shown, the models m1, m2, m3 have very similar WAIC values. This is because WAIC sees these models as tied. This makes sense, given the DAG, because as long as a model has groupsize in it, we can include either avgfood or area or both and get the same inferences.
# the models m4 and m5 omit groupsize and are tied to one another, thus they have almost similar WAIC values.