The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting: measure it with WAIC/PSIS and reduce it with regularization.
Place each answer inside a code chunk (grey box). Each code chunk should contain a text response or code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html
file as: YourName_ANLY505-Year-Semester.html
then publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# (1) The measure of uncertainty should be continuous, so that small changes in the probabilities produce only small changes in the entropy.
# (2) Its value should increase with the number of possible outcomes: more distinct events means more uncertainty.
# (3) It should be additive for independent events: the uncertainty of the combined outcome is the sum of the separate uncertainties (a quick numerical check is sketched below).
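# Below is a minimal check of the additivity criterion (an illustrative sketch, not part of the
# question): the joint entropy of two independent events equals the sum of their individual entropies.
entropy <- function(p) -sum(p * log(p))    # Shannon entropy in nats
px  <- c(0.7, 0.3)                         # a biased coin
py  <- c(0.2, 0.25, 0.25, 0.3)             # a loaded four-sided die
pxy <- outer(px, py)                       # joint distribution under independence
entropy(pxy)                               # ~1.987
entropy(px) + entropy(py)                  # ~0.611 + ~1.376 = ~1.987, the same value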
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
p <- c(0.7, 1 - 0.7)     # probabilities of heads and tails
H <- -sum(p * log(p))    # Shannon entropy: -sum of p_i * log(p_i)
H
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
p2 <- c(0.2, 0.25, 0.25, 0.3)   # probabilities of faces 1 through 4
-sum( p2*log(p2) )
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
p3 <- c(1/3, 1/3, 1/3)   # the face that never shows is dropped, since p*log(p) -> 0 as p -> 0
-sum( p3*log(p3) )       # equals log(3)
## [1] 1.098612
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is a criterion for scoring model quality and estimating the risk of overfitting. It is defined
# as AIC = D_train + 2p, where D_train is the in-sample (training) deviance and p is the number of
# free parameters estimated by the model.
# WAIC is a more general criterion that makes no assumption about the shape of the posterior. It is
# defined as WAIC = -2*(lppd - pWAIC) = -2*( sum_i log Pr(y_i) - sum_i V(y_i) ), where Pr(y_i) is the
# average likelihood of observation i over the posterior and V(y_i) is the variance of the
# log-likelihood of observation i over the posterior.
# WAIC is the more general criterion of the two.
# To recover AIC from WAIC, we need to assume that the priors are flat (or overwhelmed by the
# likelihood), that the posterior distribution is approximately multivariate Gaussian, and that the
# sample size is much greater than the number of parameters. A by-hand computation of the WAIC
# formula is sketched below.
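# The sketch below computes the WAIC formula by hand for a simple Gaussian regression on the
# built-in cars data (an illustrative model in the spirit of the chapter's cars example; the model
# and seed are my own choices, not part of the question). It assumes the rethinking package.
library(rethinking)
set.seed(7)
m <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100),
    b ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ), data = cars)
post <- extract.samples(m, n = 1000)
# log-likelihood of each observation (columns) under each posterior sample (rows)
ll <- sapply(1:nrow(cars), function(i)
  dnorm(cars$dist[i], post$a + post$b * cars$speed[i], post$sigma, log = TRUE))
lppd  <- sum(log(colMeans(exp(ll))))   # sum_i log Pr(y_i): log of the average likelihood
pWAIC <- sum(apply(ll, 2, var))        # sum_i V(y_i): the effective number of parameters
-2 * (lppd - pWAIC)                    # by-hand WAIC; should be close to WAIC(m)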
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# Model selection means using an information criterion to pick the single model with the best score
# and discarding the rest. Model comparison means keeping all of the models fit to the same data and
# using their relative WAIC/PSIS values (and, if desired, their combined posterior predictions) to
# understand how they differ. Model selection throws away the information about relative model
# accuracy contained in the differences among the information criterion values, which is exactly the
# information that tells us how confident we can be in the ranking. A minimal compare() sketch follows.
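# A minimal sketch of model comparison with rethinking::compare (the two models below are
# illustrative choices on the built-in cars data, not part of the question):
library(rethinking)
m1 <- quap(alist(
    dist ~ dnorm(mu, sigma),
    mu <- a,
    a ~ dnorm(0, 100), sigma ~ dexp(1)
  ), data = cars)
m2 <- quap(alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100), b ~ dnorm(0, 10), sigma ~ dexp(1)
  ), data = cars)
compare(m1, m2)   # full table of WAIC values, differences, standard errors, and weights
# Selecting only the top row would discard the difference columns (dWAIC, dSE), i.e. the
# information about how different the models' expected predictive accuracies really are.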
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# Information criteria are based on deviance, which is summed over observations rather than averaged,
# so a model fit to more observations accrues a larger deviance (and a larger WAIC/PSIS) simply
# because there are more terms in the sum, not because it predicts any worse. Models fit to different
# numbers of observations therefore cannot be compared on these criteria. A quick experiment is
# sketched below.
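# Sketch of an experiment (assumes the rethinking package and the built-in cars data; the subset
# size is an arbitrary illustrative choice): the identical model fit to half the data gets a much
# smaller WAIC only because the deviance sums over fewer observations.
library(rethinking)
f <- alist(
  dist ~ dnorm(mu, sigma),
  mu <- a + b * speed,
  a ~ dnorm(0, 100), b ~ dnorm(0, 10), sigma ~ dexp(1))
m_full <- quap(f, data = cars)           # all 50 observations
m_half <- quap(f, data = cars[1:25, ])   # only 25 observations
WAIC(m_full)   # larger value: the sum runs over 50 points
WAIC(m_half)   # smaller value: the sum runs over 25 points -- not comparable to the one above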
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# The effective number of parameters decreases as the prior becomes more concentrated.
# For both PSIS and WAIC, the penalty term is driven by how much the pointwise log-likelihood varies
# across the posterior. A more concentrated prior constrains the posterior, so the model is less
# flexible and less able to chase idiosyncrasies of the sample; the variance of the log-likelihood
# shrinks, and with it the effective number of parameters and the risk of overfitting. A small
# experiment is sketched below.
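# Sketch of an experiment (assumes the rethinking package and the built-in cars data; the prior
# widths are illustrative choices): the same model is fit with a vague and with a tightly
# concentrated prior on the slope, and the penalty reported by WAIC() is expected to shrink for the
# concentrated version.
library(rethinking)
m_vague <- quap(alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100), b ~ dnorm(0, 10), sigma ~ dexp(1)
  ), data = cars)
m_conc <- quap(alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100), b ~ dnorm(0, 0.1), sigma ~ dexp(1)
  ), data = cars)
WAIC(m_vague)   # check the penalty (effective number of parameters) column
WAIC(m_conc)    # the concentrated prior constrains the posterior, so the penalty should be smaller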
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors reduce overfitting because they make the model less flexible: they make it
# unlikely that extreme parameter values receive high posterior probability, so the model learns less
# from the idiosyncratic, extreme features of the sample and more from its regular features.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# Overly informative priors cause underfitting because they constrain the model's flexibility too
# much: even the "correct" parameter values become unlikely to receive high posterior probability,
# which prevents the model from learning enough from the sample data.