The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these predictive criteria can mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting: measure it with WAIC/PSIS and reduce it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The three criteria that motivate the definition of information entropy are as follows:
# Continuity: the measure of uncertainty should be continuous, so that a small change in any probability produces only a small change in measured uncertainty.
# Additivity: the uncertainty over combinations of independent events should equal the sum of the separate uncertainties.
# Scaling with possibilities: the measure should increase as the number of possible outcomes increases, because more possible events means more uncertainty about which one will occur.
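# As a quick numerical check (the probabilities are illustrative only), the additivity and scaling criteria can be verified with the same -sum(p * log(p)) formula used in the answers below:
H <- function(p) -sum(p * log(p))
p_coin <- c(0.7, 0.3)                       # a weighted coin
p_die  <- rep(0.25, 4)                      # a fair four-sided die
p_joint <- as.vector(outer(p_coin, p_die))  # joint distribution of the two independent events
H(p_joint)                                  # equals H(p_coin) + H(p_die): additivity
H(p_coin) + H(p_die)
sapply(2:6, function(n) H(rep(1 / n, n)))   # entropy grows with the number of equally likely outcomes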
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
The entropy of this coin is 0.61.
# From page 206, with p1 the probability of heads and p2 the probability of tails, we plug the values into H(p) = -sum(p_i * log(p_i)):
p <- c( 0.7 , 0.3 )
-sum( p*log(p) )
## [1] 0.6108643
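# As an optional sketch (not required by the question), the entropy of any two-outcome event can be plotted as a function of the probability of heads; the 0.7/0.3 coin sits below the maximum at p = 0.5:
p_heads <- seq(0.01, 0.99, by = 0.01)
H2 <- -(p_heads * log(p_heads) + (1 - p_heads) * log(1 - p_heads))
plot(p_heads, H2, type = "l", xlab = "probability of heads", ylab = "entropy")
abline(v = 0.7, lty = 2)   # the weighted coin from this question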
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
The entropy of this die is 1.38.
p <- c(0.20, 0.25, 0.25, 0.30)
-sum(p * log(p))
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
The entropy of the die is 1.10.
p <- c(1 / 3, 1 / 3, 1 / 3)
-sum(p * log(p))
## [1] 1.098612
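# Note that dropping the impossible outcome is not just a convenience: including a zero probability directly gives NaN in R, and by convention 0 * log(0) is treated as 0.
p4 <- c(1 / 3, 1 / 3, 1 / 3, 0)
-sum(p4 * log(p4))                        # NaN, because 0 * log(0) is undefined in R
-sum(ifelse(p4 == 0, 0, p4 * log(p4)))    # 1.098612, same as above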
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is defined as AIC = D_train + 2p, where D_train is the in-sample training deviance and p is the number of free parameters estimated by the model. It is reliable only when: (1) the priors are flat or overwhelmed by the likelihood, (2) the posterior distribution is approximately multivariate Gaussian, and (3) the sample size N is much greater than the number of parameters.
# WAIC is defined as WAIC = -2(lppd - pWAIC) = -2( sum_{i=1}^{N} log Pr(y_i) - sum_{i=1}^{N} V(y_i) ), where Pr(y_i) is the average likelihood of observation i in the training sample and V(y_i) is the variance in the log-likelihood of observation i across the posterior. It makes no assumption about the shape of the posterior; it requires only that the sample size N is much greater than the number of parameters.
# WAIC is the more general criterion. To move from WAIC to AIC, we must additionally assume that the posterior distribution is approximately multivariate Gaussian and that the priors are flat or overwhelmed by the likelihood.
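# As a minimal sketch of the WAIC formula above (toy data and crude posterior draws, chosen only for illustration), WAIC can be computed by hand from a matrix of pointwise log-likelihoods, with rows as posterior samples and columns as observations:
set.seed(7)
y <- rnorm(20, mean = 1, sd = 1)                                      # toy observations
mu_post <- rnorm(1000, mean = mean(y), sd = sd(y) / sqrt(length(y)))  # rough posterior draws for mu
ll <- sapply(y, function(yi) dnorm(yi, mean = mu_post, sd = 1, log = TRUE))
lppd  <- sum(log(colMeans(exp(ll))))     # log pointwise predictive density
pWAIC <- sum(apply(ll, 2, var))          # effective number of parameters (penalty)
-2 * (lppd - pWAIC)                      # WAIC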
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# Under model selection, we use an information criterion to choose the single model with the lowest WAIC or PSIS value and discard the rest. Under model comparison, we keep the whole set of models and use the differences among their criterion values (together with the standard errors of those differences and the resulting weights) to learn about relative out-of-sample accuracy, and, if desired, to average predictions across the models. Model selection throws away the information about relative model accuracy contained in the differences among the criterion values; this is especially troublesome when the selected model only slightly outperforms its alternatives.
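# As a small illustration (the WAIC values below are made up), the differences among criterion values can be converted into Akaike-style weights; these weights carry the relative-accuracy information that is discarded when we simply pick the lowest-WAIC model:
waic_vals <- c(m1 = 410.2, m2 = 411.0, m3 = 425.7)   # hypothetical WAIC values
dWAIC <- waic_vals - min(waic_vals)
w <- exp(-0.5 * dWAIC) / sum(exp(-0.5 * dWAIC))
round(w, 3)   # m1 and m2 are nearly indistinguishable; selection would hide this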
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# The information criteria are based on deviance, which is summed over observations rather than averaged, so it is never divided by the number of observations. A model fit to more observations will therefore have a larger deviance and a larger (apparently worse) WAIC simply because there are more terms in the sum, which is why it is not valid to compare models fit to different numbers of observations.
# To confirm this, we can compute WAIC for the same model fit to different numbers of observations.
library(rethinking)
data(Howell1)
str(Howell1)
## 'data.frame': 544 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
d <- Howell1[complete.cases(Howell1), ]
d100 <- d[sample(1:nrow(d), size = 100, replace = FALSE), ]
d200 <- d[sample(1:nrow(d), size = 200, replace = FALSE), ]
d300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
m100 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d100,
start = list(a = mean(d100$height), b = 0, sigma = sd(d100$height))
)
m200 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d200,
start = list(a = mean(d200$height), b = 0, sigma = sd(d200$height))
)
m300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d300,
start = list(a = mean(d300$height), b = 0, sigma = sd(d300$height))
)
set.seed(77)
compare( m100, m200, m300 , func=WAIC )
## Warning in compare(m100, m200, m300, func = WAIC): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m100 100
## m200 200
## m300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## WAIC SE dWAIC dSE pWAIC weight
## m100 609.4032 15.13361 0.0000 NA 3.045085 1.000000e+00
## m200 1217.1886 20.00248 607.7855 20.94896 2.942079 1.049700e-132
## m300 1801.0821 25.04086 1191.6790 20.77374 2.824663 1.699037e-259
# Comparing the three models fit to 100, 200, and 300 observations, we can see that WAIC grows with the number of observations even though the model is the same. compare() also issues a warning that the numbers of observations differ across models.
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# As a prior becomes more concentrated, the effective number of parameters decreases. A more concentrated prior makes the model less flexible, so the posterior is less able to chase idiosyncrasies of the sample, which is what the PSIS penalty reflects. For WAIC, pWAIC is the sum of the variances in the pointwise log-likelihood across posterior samples; as the prior becomes more concentrated, the posterior (and hence the pointwise log-likelihood) becomes more concentrated as well, so that variance shrinks.
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_not_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
m_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
WAIC(m_not_concentrated, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.4112 55.68215 4.476531 36.501
WAIC(m_concentrated, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.6269 55.63468 4.321246 36.46326
# With a more concentrated prior, the pWAIC decreases.
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors reduce overfitting by limiting how much the model learns from the sample: implausible parameter values receive low prior probability, so the model is less free to encode the noise in any particular sample into its parameter estimates.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# Overly informative priors result in underfitting because they constrain the flexibility of the model too much; they make it less likely for “correct” parameter values to be assigned high posterior probability.
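# As an optional sketch (the prior values are chosen only for illustration), an overly tight prior centered in the wrong place, b ~ dnorm(0, 0.01), pins the slope near zero, so the model underfits the strong height-weight relationship in the Howell1 data and its WAIC should be noticeably worse than the models in 7M4:
m_too_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(0, 0.01),
sigma ~ dunif(0, 1)
),
data = d
)
precis(m_too_concentrated)               # b is shrunk toward 0, far from the slope the data support
WAIC(m_too_concentrated, refresh = 0)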