The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical function compare in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, however, these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
#1) Continuity: the measure of uncertainty should be continuous, so that a small change in any of the
#probabilities produces only a small change in the measure.
#2) The measure should increase with the number of possible events: the more distinct outcomes that can
#occur, the more uncertain we are.
#3) Additivity: the measure should be additive, so that the uncertainty over combinations of independent
#events is the sum of the separate uncertainties, no matter how the possibilities are divided up.
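#As a minimal illustration of these three criteria (my own sketch, not part of the question), a small
#helper function makes each property easy to check numerically:
# Shannon entropy of a probability vector (natural log); zero-probability events contribute nothing
entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
# 1) Continuity: a small change in the probabilities produces a small change in the measure
entropy(c(0.50, 0.50))  # ~0.6931
entropy(c(0.51, 0.49))  # ~0.6929
# 2) More possible events means more uncertainty (uniform distributions over 2, 4, and 8 outcomes)
entropy(rep(1/2, 2))    # ~0.69
entropy(rep(1/4, 4))    # ~1.39
entropy(rep(1/8, 8))    # ~2.08
# 3) Additivity: the entropy of independent events combined equals the sum of their separate entropies
p_coin  <- c(0.3, 0.7)
p_die   <- rep(1/4, 4)
p_joint <- as.vector(outer(p_coin, p_die))  # all 8 coin-and-die combinations
entropy(p_joint)                            # ~1.997
entropy(p_coin) + entropy(p_die)            # ~1.997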
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
# probabilities of heads and tails
coin <- c(0.7, 0.3)
# Shannon entropy (natural log)
(H <- -sum(coin * log(coin)))
## [1] 0.6108643
#The entropy of the coin is 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
# probabilities of faces 1 through 4
die <- c(0.20, 0.25, 0.25, 0.30)
# Shannon entropy (natural log)
(H <- -sum(die * log(die)))
## [1] 1.376227
#The entropy of this die is 1.38.
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
# face "4" never occurs; since p * log(p) goes to 0 as p goes to 0, the impossible face contributes
# nothing to the entropy and only the three possible faces are kept
die <- c(1/3, 1/3, 1/3)
(H <- -sum(die * log(die)))
## [1] 1.098612
#The entropy of the die is 1.10.
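#A quick check (my own addition) of why the impossible face is dropped rather than included as a zero:
die4 <- c(1/3, 1/3, 1/3, 0)
-sum(die4 * log(die4))                      # NaN, because 0 * log(0) evaluates to 0 * -Inf in R
-sum(die4[die4 > 0] * log(die4[die4 > 0]))  # 1.098612, since p * log(p) -> 0 as p -> 0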
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
#AIC is defined as AIC = D_train + 2p, where D_train is the in-sample training deviance and p is the
#number of free parameters estimated in the model (page 189).
#DIC is defined as DIC = Dbar + pD = Dbar + (Dbar - Dhat), where Dbar is the average of the posterior
#distribution of deviance and Dhat is the deviance calculated at the posterior mean (page 190).
#WAIC is defined as WAIC = -2(lppd - pWAIC) = -2( sum_i log Pr(y_i) - sum_i V(y_i) ), where Pr(y_i) is
#the average likelihood of observation i in the training sample and V(y_i) is the variance of the
#log-likelihood of observation i across the posterior (pages 191-192).
#All three definitions involve two components: an estimate or analog of the in-sample training deviance
#(D_train for AIC, Dbar for DIC, lppd for WAIC) and an estimate or analog of the effective number of free
#parameters (p for AIC, pD for DIC, pWAIC for WAIC).
#WAIC is the most general, followed by DIC, and finally AIC (page 190). To move from WAIC to DIC, we must
#assume that the posterior distribution is approximately multivariate Gaussian. To move from DIC to AIC,
#we must further assume that the priors are flat or overwhelmed by the likelihood.
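#As a hedged, self-contained sketch (my own addition, using the cars data and a simple quap model from
#the rethinking package), the WAIC components can be computed by hand from posterior samples and checked
#against WAIC():
library(rethinking)
data(cars)
fit <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100),
    b ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ),
  data = cars
)
post <- extract.samples(fit, n = 1000)
# log-likelihood of each observation under each posterior sample (1000 x 50 matrix)
ll <- sapply(1:nrow(cars), function(i) {
  mu <- post$a + post$b * cars$speed[i]
  dnorm(cars$dist[i], mu, post$sigma, log = TRUE)
})
lppd  <- apply(ll, 2, function(x) log_sum_exp(x) - log(length(x)))  # log pointwise predictive density
pWAIC <- apply(ll, 2, var)                                          # penalty: variance of the log-likelihood
-2 * (sum(lppd) - sum(pWAIC))  # should be close to the value reported by WAIC(fit)
WAIC(fit)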
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
#Model selection means retaining (i.e., using and interpreting) only the model with the best (lowest)
#information criterion value and discarding all models with higher values (page 195). Model comparison, in
#contrast, keeps the whole set of models and uses the differences among their criterion values, together
#with domain and causal knowledge, to understand how the models differ in expected predictive accuracy.
#Model selection therefore loses exactly that information about relative accuracy, which is especially
#problematic when the selected model outperforms its alternatives only slightly. A related practice, model
#averaging, uses criterion-based weights to construct a posterior predictive distribution that carries the
#uncertainty across multiple models (page 196). Averaging does not lose information on its own, but
#combined with undisclosed data dredging (page 205) it can lead to spurious, non-generalizable findings.
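#A hedged sketch (my own addition, reusing the quap fit to the cars data from the sketch above plus an
#intercept-only alternative) of what model comparison retains and model selection discards:
fit_int <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a,
    a ~ dnorm(0, 100),
    sigma ~ dexp(1)
  ),
  data = cars
)
# compare() keeps every model and reports WAIC, the differences (dWAIC), their standard errors (dSE),
# the penalties (pWAIC), and Akaike weights; selection would keep only the top row and discard the rest.
compare(fit, fit_int)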
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
#Information criteria are based on deviance, which accrues over observations without being divided by the
#number of observations (page 182); it is a sum, not an average. So, all else being equal, a model fit to
#more observations will have a larger deviance and thus appear less accurate according to information
#criteria. Contrasting models fit to different numbers of observations is therefore an unfair comparison.
#As an experiment, we can calculate WAIC for models fit to increasingly small subsamples of the same data;
#the information criteria should shrink as the subsample gets smaller. In order to get a large sample to
#begin with, I will return to the Howell1 data from the previous chapter.
#library(rethinking)
#data(Howell1)
#d <- Howell1[complete.cases(Howell1), ]
#d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
#d_450 <- d[sample(1:nrow(d), size = 450, replace = FALSE), ]
#d_350 <- d[sample(1:nrow(d), size = 350, replace = FALSE), ]
#m_500 <- map(
#  alist(
#    height ~ dnorm(mu, sigma),
#    mu <- a + b * log(weight)
#  ),
#  data = d_500,
#  start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
#)
#m_450 <- map(
#  alist(
#    height ~ dnorm(mu, sigma),
#    mu <- a + b * log(weight)
#  ),
#  data = d_450,
#  start = list(a = mean(d_450$height), b = 0, sigma = sd(d_450$height))
#)
#m_350 <- map(
#  alist(
#    height ~ dnorm(mu, sigma),
#    mu <- a + b * log(weight)
#  ),
#  data = d_350,
#  start = list(a = mean(d_350$height), b = 0, sigma = sd(d_350$height))
#)
#(model.compare <- compare(m_500, m_450, m_350))
#Different numbers of observations found for at least two models.
#Information criteria only valid for comparing models fit to exactly same observations.
#Number of observations for each model:
#m_500 500
#m_450 450
#m_350 350
#Warning: longer object length is not a multiple of shorter object length (repeated three times)
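#A hedged follow-up sketch (my own addition): to see the size effect directly without compare(), evaluate
#WAIC() on each model separately; the totals should shrink roughly in step with the number of observations,
#even though none of the models is "better" in any meaningful sense.
#WAIC(m_500)
#WAIC(m_450)
#WAIC(m_350)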
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
#As the prior becomes more concentrated (e.g., as with a regularizing prior), the effective number of
#parameters decreases. In the case of DIC, this is because pD is a measure of how flexible the model is
#(page 191); with more concentrated priors, the model becomes less flexible and therefore less prone to
#overfitting. In the case of WAIC, this is because pWAIC is the summed variance of the log-likelihood of
#each observation in the training sample (page 191); with more concentrated priors, the posterior (and
#hence the likelihood of each observation) varies less, so this variance decreases. The penalty term
#reported by PSIS behaves the same way.
#As an experiment to demonstrate this, we can calculate WAIC for two models of the same data that differ
#only in the concentration of their priors.
#d <- Howell1[complete.cases(Howell1), ]
#d$height.log <- log(d$height)
#d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
#d$weight.log <- log(d$weight)
#d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
#m_wide <- map(
#  alist(
#    height.log.z ~ dnorm(mu, sigma),
#    mu <- a + b * weight.log.z,
#    a ~ dnorm(0, 10),
#    b ~ dnorm(1, 10),
#    sigma ~ dunif(0, 10)
#  ),
#  data = d
#)
#m_narrow <- map(
#  alist(
#    height.log.z ~ dnorm(mu, sigma),
#    mu <- a + b * weight.log.z,
#    a ~ dnorm(0, 0.10),
#    b ~ dnorm(1, 0.10),
#    sigma ~ dunif(0, 1)
#  ),
#  data = d
#)
#WAIC(m_wide, refresh = 0)
#WAIC(m_narrow, refresh = 0)
#Indeed, pWAIC decreases as the priors become more concentrated. However, note that the regularizing
#priors must be chosen carefully in order for this to occur.
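#A hedged parallel check (my own addition): PSIS() on the same two fits should show the same pattern, with
#a smaller effective-parameter penalty for the model with more concentrated priors.
#PSIS(m_wide)
#PSIS(m_narrow)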
7M5. Provide an informal explanation of why informative priors reduce overfitting.
#Informative priors reduce overfitting because they constrain the flexibility of the model: they make it
#less likely for extreme parameter values to be assigned high posterior probability. In simpler terms,
#informative priors reduce overfitting by forcing the model to learn less from the sample data, so it
#cannot memorize as much of the noise that is particular to that sample.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
#Overly informative priors result in underfitting because they constrain the flexibility of the model too
#much: they make it less likely for "correct" parameter values to be assigned high posterior probability,
#even when the data strongly support them. In simpler terms, overly informative priors result in
#underfitting by preventing the model from learning enough from the sample data.
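#A hedged illustration of 7M5 and 7M6 (my own addition): simulated data with a true slope of 1, fit once
#with a mildly skeptical prior and once with an overly tight prior on the slope.
set.seed(505)
N <- 50
x <- rnorm(N)
sim_dat <- data.frame(x = x, y = rnorm(N, mean = 1 * x, sd = 1))
m_mild <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- b * x,
    b ~ dnorm(0, 1),     # mildly skeptical: tames overfitting without hiding the true slope
    sigma ~ dexp(1)
  ),
  data = sim_dat
)
m_tight <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- b * x,
    b ~ dnorm(0, 0.01),  # overly informative: drags the slope toward zero, i.e., underfitting
    sigma ~ dexp(1)
  ),
  data = sim_dat
)
precis(m_mild)   # posterior mean of b should sit near the true value of 1
precis(m_tight)  # posterior mean of b shrinks far toward 0 despite a clear signal in the data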