The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The three motivating criteria that define information entropy are the following:
# 1. Continuity: the measure of uncertainty should be continuous, so that a small change in any of the probabilities produces only a small change in the measured uncertainty.
# 2. Increasing with the number of possible events: uncertainty should be larger when more distinct outcomes are possible.
# 3. Additivity: the uncertainties of independent events should add up (a numerical check of this criterion follows below).
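# A minimal numerical sketch of the additivity criterion (the probabilities below are hypothetical):
# for two independent events, the entropy of the joint distribution equals the sum of the entropies.
H <- function(p) -sum(p * log(p))
p_coin <- c(0.7, 0.3)
p_die  <- c(0.2, 0.25, 0.25, 0.3)
p_joint <- as.vector(outer(p_coin, p_die))  # joint probabilities under independence
H(p_joint)
H(p_coin) + H(p_die)                        # the two values agree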
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
#Entropy is defined on page 178 as H(p) = -sum(p_i * log(p_i)). We just need to plug in the appropriate probabilities.
p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643
#The entropy of this coin is 0.61.
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
p <- c(0.2, 0.25, 0.25, 0.3)
(H <- -sum(p * log(p)))
## [1] 1.376227
#The entropy of this die is 1.38.
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
# Since "4" never appears, it contributes nothing to the entropy (0 * log(0) is taken to be 0),
# so only the three equally likely sides enter the calculation.
p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
#The entropy of the die is 1.10.
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
#AIC: Akaike Information Criterion. It approximates the average out-of-sample deviance (and thus relative K-L divergence): AIC = D_train + 2p = -2*lppd + 2p, where p is the number of free parameters.
#WAIC: Widely Applicable Information Criterion. It makes no assumption about the shape of the posterior: WAIC = -2*(lppd - pWAIC), where pWAIC is the sum over observations of the variance in log-likelihood across posterior samples. In finite samples the two criteria can disagree; in the large-sample limit (with flat priors and a Gaussian posterior) they tend to agree.
#WAIC is the most general. To move from WAIC to DIC we must assume the posterior distribution is approximately multivariate Gaussian; to move from DIC to AIC we must additionally assume the priors are flat or overwhelmed by the likelihood.
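# A minimal sketch (base R only, hypothetical simulated data) checking the AIC definition by hand
# against the built-in AIC(); logLik() for lm() counts the residual standard deviation as a parameter.
set.seed(7)
x <- rnorm(50)
y <- 2 + 1.5 * x + rnorm(50)
m <- lm(y ~ x)
p <- length(coef(m)) + 1            # intercept, slope, and residual sd
-2 * as.numeric(logLik(m)) + 2 * p  # AIC computed from the definition
AIC(m)                              # matches the built-in value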
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
#Model selection means choosing the model with the lowest information criterion value and discarding all other models with higher values. It throws away the information about relative model accuracy contained in the differences among the information criterion values.
#Model comparison, in contrast, keeps multiple models and frames statistical inference about effects as a comparison among alternative models of the data that differ in the assumptions they make.
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# Because deviance (and hence WAIC/PSIS) is a sum over observations, it accrues with sample size. If models are fit to different numbers of observations, the model fit to more observations will tend to have a higher deviance and so appear less accurate, but that difference comes from the number of observations, not from the model itself. Comparing models fit to different numbers of observations is therefore an unfair comparison. An experiment illustrating this follows below.
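# A minimal experiment (base R only, hypothetical simulated data): the same model fit to different
# numbers of observations yields information criterion values that are not comparable.
set.seed(7)
x <- rnorm(1000)
y <- 2 + 1.5 * x + rnorm(1000)
AIC(lm(y[1:100] ~ x[1:100]))  # model fit to the first 100 observations
AIC(lm(y ~ x))                # same model fit to all 1000 observations -> much larger AIC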
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
#The effective number of parameters decreases as the prior becomes more concentrated. A more concentrated prior makes the model less flexible, so the posterior becomes more concentrated as well. For WAIC, the penalty term pWAIC is the variance in log-likelihood across posterior samples, and that variance shrinks as the posterior narrows. PSIS behaves similarly, because a less flexible model is less sensitive to any single observation. A small experiment follows below.
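# A minimal experiment (base R only, hypothetical data and prior widths): pWAIC, the sum over
# observations of the variance in log-likelihood across posterior samples, shrinks as the prior
# on the mean becomes more concentrated.
set.seed(7)
y_sim <- rnorm(20, mean = 1, sd = 1)  # simulated data; sigma treated as known (= 1)
sigma <- 1
pWAIC_for_prior <- function(prior_sd, n_samples = 1e4) {
  # conjugate Normal posterior for mu under a Normal(0, prior_sd) prior
  post_var  <- 1 / (1 / prior_sd^2 + length(y_sim) / sigma^2)
  post_mean <- post_var * sum(y_sim) / sigma^2
  mu <- rnorm(n_samples, post_mean, sqrt(post_var))                   # posterior samples
  ll <- sapply(y_sim, function(yi) dnorm(yi, mu, sigma, log = TRUE))  # samples x observations
  sum(apply(ll, 2, var))                                              # pWAIC
}
pWAIC_for_prior(prior_sd = 10)   # weak prior -> larger effective number of parameters
pWAIC_for_prior(prior_sd = 0.1)  # concentrated prior -> smaller effective number of parameters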
7M5. Provide an informal explanation of why informative priors reduce overfitting.
#This is because informative priors constrain the flexibility of the model. They make it less likely that extreme parameter values are assigned high posterior probability, which means the model learns less from the idiosyncratic, noisy features of the sample data and therefore overfits less.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
#While informative priors limit the model in a good way, overly informative priors limit the model too much and do not let it learn enough from the sample data. In that case even parameter values that are well supported by the data cannot receive high posterior probability. The model's failure to learn the regular features of the sample results in underfitting.