The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS estimate its degree. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools alone will mislead you, so models must be designed through some other method, not merely selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting: measure it with WAIC/PSIS and reduce it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.
#The information criterion is computed from deviance summed over all observations, not averaged. A model fit to more observations therefore accumulates a larger total deviance and a larger WAIC regardless of its accuracy, so it is not valid to compare models fit to different numbers of observations.
library(rethinking)
data(Howell1)
str(Howell1)
## 'data.frame': 544 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
d <- Howell1[complete.cases(Howell1), ]
# Draw random subsamples of 100, 200, and 300 rows (unseeded here, so the
# exact WAIC values below will vary from run to run)
d100 <- d[sample(1:nrow(d), size = 100, replace = FALSE), ]
d200 <- d[sample(1:nrow(d), size = 200, replace = FALSE), ]
d300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
# Fit the same log-weight regression to each subsample. No priors are given,
# so map() (an alias for quap() in rethinking 2.x) uses flat priors, with the
# start list supplying initial values.
m100 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d100,
start = list(a = mean(d100$height), b = 0, sigma = sd(d100$height))
)
m200 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d200,
start = list(a = mean(d200$height), b = 0, sigma = sd(d200$height))
)
m300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d300,
start = list(a = mean(d300$height), b = 0, sigma = sd(d300$height))
)
set.seed(77)
compare(m100, m200, m300, func = WAIC)
## Warning in compare(m100, m200, m300, func = WAIC): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m100 100
## m200 200
## m300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## WAIC SE dWAIC dSE pWAIC weight
## m100 590.1485 13.22542 0.0000 NA 2.652893 1.000000e+00
## m200 1246.7263 21.78461 656.5778 20.71721 3.200711 2.666568e-143
## m300 1844.8772 28.70078 1254.7287 21.56401 3.238796 3.460383e-273
#Comparing the three models fit to 100, 200, and 300 observations, WAIC grows roughly in proportion to the number of observations, so the raw scores are not comparable across different sample sizes.
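As a rough follow-up sketch (assuming the m100, m200, and m300 fits above, and that WAIC() returns its score in a WAIC column as in recent rethinking versions), dividing each model's WAIC by its sample size puts the scores on a comparable per-observation scale; the values should be roughly constant.
# Hedged check: WAIC is a sum over observations, so WAIC / n is roughly
# constant across the three fits even though the raw totals differ widely.
waic_totals <- c(m100 = WAIC(m100)$WAIC, m200 = WAIC(m200)$WAIC, m300 = WAIC(m300)$WAIC)
n_obs <- c(m100 = 100, m200 = 200, m300 = 300)
waic_totals / n_obs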
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.
#When the prior becomes more concentrated, the effective number of parameters (pWAIC) decreases. pWAIC measures the variance of the pointwise log-likelihood across the posterior; a more concentrated prior constrains the posterior, so that variance, and hence the overfitting penalty, shrinks.
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
# Wide (weakly concentrated) priors: sd = 10 on a and b
not_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
# Concentrated priors: sd = 0.1 on a and b
concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
compare(concentrated, not_concentrated)
## WAIC SE dWAIC dSE pWAIC weight
## not_concentrated -102.5778 36.48738 0.0000000 NA 4.351863 0.5145623
## concentrated -102.4613 36.47684 0.1165315 0.2061048 4.444219 0.4854377
##Here the pWAIC values are nearly identical (4.35 vs 4.44): with 544 observations, even the sd = 0.1 priors are weak relative to the data. The expected drop in pWAIC only appears when the priors are concentrated enough, or the sample small enough, for the prior to actually constrain the posterior.
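To make the effect visible, here is a hedged follow-up experiment (a sketch; the subsample is random, so exact numbers will vary): refit both models on a small random subsample, where the tight priors can dominate the likelihood and the penalty term should shrink.
# Hedged experiment: with only 40 observations, the sd = 0.1 priors bind,
# and pWAIC for the tight-prior model should drop below the wide-prior model.
set.seed(77)
d40 <- d[sample(nrow(d), 40), ]
wide_small <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d40
)
tight_small <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = d40
)
compare(wide_small, tight_small)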
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
| Island | Species A | Species B | Species C | Species D | Species E |
|--------|-----------|-----------|-----------|-----------|-----------|
| 1      | 0.2       | 0.2       | 0.2       | 0.2       | 0.2       |
| 2      | 0.8       | 0.1       | 0.05      | 0.025     | 0.025     |
| 3      | 0.05      | 0.15      | 0.7       | 0.05      | 0.05      |
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
library(purrr)
# Contribution of one probability to the entropy sum, using the convention
# that 0 * log(0) = 0.
p_logp <- function(p) {
  if (p == 0) return(0)
  p * log(p)
}
# Shannon entropy: H(p) = -sum(p * log(p))
calc_entropy <- function(x) {
  -sum(map_dbl(x, p_logp))
}
# First, compute the entropy of each island's bird distribution.
library(tidyr)
islands <- tibble(island = paste("Island", 1:3),
a = c(0.2, 0.8, 0.05),
b = c(0.2, 0.1, 0.15),
c = c(0.2, 0.05, 0.7),
d = c(0.2, 0.025, 0.05),
e = c(0.2, 0.025, 0.05)) %>%
pivot_longer(-island, names_to = "species", values_to = "prop")
library(dplyr)
islands %>%
group_by(island) %>%
summarize(prop = list(prop), .groups = "drop") %>%
mutate(entropy = map_dbl(prop, calc_entropy))
## # A tibble: 3 x 3
## island prop entropy
## <chr> <list> <dbl>
## 1 Island 1 <dbl [5]> 1.61
## 2 Island 2 <dbl [5]> 0.743
## 3 Island 3 <dbl [5]> 0.984
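A quick hand check (a sketch using the proportions from the table above): a uniform distribution over five species attains the maximum possible entropy, log(5), which matches Island 1.
# Uniform over 5 species: -5 * 0.2 * log(0.2) = log(5) ≈ 1.609
calc_entropy(rep(0.2, 5))
log(5)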
# KL divergence D_KL(p || q) = sum(p * (log(p) - log(q))): the extra
# surprise from using distribution q to predict target distribution p.
d_kl <- function(p, q) {
  sum(p * (log(p) - log(q)))
}
# Now, calculate D_KL for each ordered pair of islands.
crossing(model = paste("Island", 1:3),
predicts = paste("Island", 1:3)) %>%
filter(model != predicts) %>%
left_join(islands, by = c("model" = "island")) %>%
rename(model_prop = prop) %>%
left_join(islands, by = c("predicts" = "island", "species")) %>%
rename(predict_prop = prop) %>%
group_by(model, predicts) %>%
summarize(q = list(model_prop),
p = list(predict_prop),
.groups = "drop") %>%
mutate(kl_distance = map2_dbl(p, q, d_kl))
## # A tibble: 6 x 5
## model predicts q p kl_distance
## <chr> <chr> <list> <list> <dbl>
## 1 Island 1 Island 2 <dbl [5]> <dbl [5]> 0.866
## 2 Island 1 Island 3 <dbl [5]> <dbl [5]> 0.626
## 3 Island 2 Island 1 <dbl [5]> <dbl [5]> 0.970
## 4 Island 2 Island 3 <dbl [5]> <dbl [5]> 1.84
## 5 Island 3 Island 1 <dbl [5]> <dbl [5]> 0.639
## 6 Island 3 Island 2 <dbl [5]> <dbl [5]> 2.01
The divergences obtained by using Island 1 as the model to predict Islands 2 and 3 are the smallest. Island 1 has the highest entropy: its uniform distribution spreads probability across all species, so it is the hardest to surprise and therefore predicts the other islands best. Islands 2 and 3 have concentrated distributions, so using either as a model produces a large divergence whenever the target island puts probability where they do not.
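As a quick sanity check of the d_kl helper (values copied from the table above):
# Island 1's uniform distribution as the model (q) predicting Island 2 (p);
# this should reproduce the 0.866 reported above.
p2 <- c(0.8, 0.1, 0.05, 0.025, 0.025)
q1 <- rep(0.2, 5)
d_kl(p2, q1)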
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
library(dagitty)
library(ggdag)
# Marriage (M) is a collider: both happiness (H) and age (A) cause it.
hma_dag <- dagitty("dag{H -> M <- A}")
coordinates(hma_dag) <- list(x = c(H = 1, M = 2, A = 3),
y = c(H = 1, M = 1, A = 1))
ggplot(hma_dag, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_text(color = "Blue", size = 10) +
geom_dag_edges(edge_color = "Red", edge_width = 1,
arrow_directed = grid::arrow(length = grid::unit(15, "pt"),
type = "closed")) +
theme_void()
library(dplyr)
d <- sim_happiness(seed = 1977, N_years = 1000)
dat <- d %>%
filter(age > 17) %>%
mutate(A = (age - 18) / (65 - 18),
mid = factor(married + 1, labels = c("single", "married")))
# M1 (the book's m6.9): conditions on marriage status, the collider.
M1 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a[mid] + bA * A,
a[mid] ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = dat
)
# M2 (the book's m6.10): omits marriage status.
M2 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = dat
)
compare(M1, M2)
## WAIC SE dWAIC dSE pWAIC weight
## M1 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## M2 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
# The empty adjustment set confirms that nothing needs conditioning to
# estimate the total causal effect of A on H; conditioning on M is a mistake.
adjustmentSets(hma_dag, exposure = "A", outcome = "H")
## {}
#From the WAIC comparison, M1 (which conditions on marriage status) is expected to make better predictions: its WAIC is far lower. But M2 provides the correct causal inference. In the simulation, age has no causal effect on happiness; marriage is a collider of age and happiness, so conditioning on it (as M1 does) opens a non-causal path and induces a spurious association. The two answers disagree because the collider carries real statistical information about happiness, which improves prediction even though it misleads causal interpretation.
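A hedged way to see the spurious association directly (a sketch; depth = 2 is needed to display the vector parameter a):
# bA should be strongly negative in M1 (collider bias) but near zero in M2,
# matching the simulation's true causal effect of zero.
precis(M1, depth = 2)
precis(M2)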
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables: (1) avgfood + groupsize + area; (2) avgfood + groupsize; (3) groupsize + area; (4) avgfood; (5) area.
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
library(rethinking)
data(foxes)
library(dplyr)
fox_dat <- foxes %>%
as_tibble() %>%
select(area, avgfood, weight, groupsize) %>%
mutate(across(everything(), standardize))
#Model1 avgfood + groupsize + area
model_1 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood + bGroup * groupsize + bArea * area,
a ~ dnorm(0, 0.2),
c(bFood, bGroup, bArea) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = fox_dat
)
#Model2 avgfood + groupsize
model_2 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood + bGroup * groupsize,
a ~ dnorm(0, 0.2),
c(bFood, bGroup) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = fox_dat
)
#Model3 groupsize + area
model_3 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bGroup * groupsize + bArea * area,
a ~ dnorm(0, 0.2),
c(bGroup, bArea) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = fox_dat
)
#Model 4 avgfood
model_4 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood,
a ~ dnorm(0, 0.2),
bFood ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = fox_dat
)
#Model 5 area
model_5 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bArea * area,
a ~ dnorm(0, 0.2),
bArea ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = fox_dat
)
comp <- compare(model_1, model_2, model_3, model_4, model_5)
comp
## WAIC SE dWAIC dSE pWAIC weight
## model_2 323.7798 16.73230 0.0000000 NA 4.014905 0.369720707
## model_3 324.0734 15.96182 0.2936057 7.372232 3.923116 0.319240596
## model_1 324.1593 17.02626 0.3795336 3.873874 5.628978 0.305815220
## model_4 333.4376 13.83540 9.6577502 8.432583 2.422865 0.002956106
## model_5 333.9681 13.85250 10.1882547 8.626652 2.765677 0.002267370
#Models 1, 2, and 3 all contain groupsize, and their WAIC scores are nearly identical: the differences among them (dWAIC < 0.4) are far smaller than their dSE, so the models cannot be distinguished. In the fox DAG, area influences weight only through avgfood, so once avgfood is in the model, adding area contributes nothing new (and vice versa). #Models 4 and 5 are likewise nearly identical to each other, because the effect of area flows entirely through avgfood: avgfood alone and area alone carry the same information. Both score about 10 WAIC units worse than the top three, though with dSE around 8.5 even that gap is only weakly reliable; the drop reflects omitting groupsize, which masks the positive effect of food on weight.
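A hedged way to inspect every pairwise difference (a sketch; it assumes compare() in rethinking returns a compareIC object whose dSE slot holds the pairwise matrix of difference standard errors):
# Pairwise standard errors of the WAIC differences; the gaps among models
# 1-3 (and between 4 and 5) are small relative to these dSE values.
comp@dSE
plot(comp)  # graphical summary of WAIC estimates and standard errors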