The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# 1. Continuity. The measure of uncertainty should be continuous: a small change in the probabilities should produce only a small change in the entropy.
# 2. Increasing with the number of events. The measure of uncertainty should increase as the number of possible events increases.
# 3. Additivity. The uncertainty of a combination of independent events should be the sum of their separate uncertainties (a small numeric check follows below).
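# A quick illustration of additivity (my own sketch, using base R only): the entropy of
# two independent events equals the sum of their separate entropies.
H <- function(p) -sum(p * log(p))
p_coin <- c(0.7, 0.3)
p_die  <- c(0.2, 0.25, 0.25, 0.3)
p_joint <- as.vector(outer(p_coin, p_die))  # joint probabilities of the independent pair
H(p_joint)            # 1.987091
H(p_coin) + H(p_die)  # 1.987091 -- identical, as additivity requires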
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
# Information entropy: H(p) = -sum(p * log(p))
p <- c(0.7, 0.3)
-sum(p * log(p))  # 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
# Information entropy: H(p) = -sum(p * log(p))
p <- c(0.2, 0.25, 0.25, 0.3)
-sum(p * log(p))  # 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
# Information entropy over the sides that can actually occur; the "4" side has
# probability 0 and contributes nothing (by convention, 0 * log(0) = 0).
p <- c(1/3, 1/3, 1/3)
-sum(p * log(p))  # 1.098612, i.e. log(3)
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is the Akaike Information Criterion: AIC = D_train + 2p, the in-sample training deviance plus twice the number of parameters p. Its assumptions are:
# 1. The priors are flat or overwhelmed by the likelihood.
# 2. The posterior distribution is approximately multivariate Gaussian.
# 3. The sample size N is much greater than the number of parameters k.
# WAIC is the Widely Applicable Information Criterion: WAIC = -2 * (lppd - sum_i var_theta log p(y_i | theta)), where lppd is the log-pointwise-predictive-density, theta denotes the posterior distribution, and the y_i are the observations. The second term (the sum over observations of the variance of the log-likelihood across the posterior) is the penalty pWAIC. WAIC makes no assumption about the shape of the posterior; the remaining assumption is that the sample size N is much greater than the number of parameters k.
# WAIC is the more general criterion because it requires fewer assumptions. To turn WAIC into AIC, additionally assume that the priors are flat (or overwhelmed by the likelihood) and that the posterior is approximately multivariate Gaussian; under those conditions the WAIC penalty reduces to the parameter count and WAIC reduces to AIC.
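# A minimal sketch (assuming the rethinking package and the built-in cars data):
# fitting a simple quap regression and calling WAIC() shows the lppd and the
# penalty (pWAIC) that appear in the formula above.
library(rethinking)
data(cars)
m <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100),
    b ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ),
  data = cars
)
WAIC(m)  # reports WAIC along with lppd, the penalty, and a standard error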
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# In model selection, we use an information criterion to pick the single best-scoring model and discard the rest. This throws away the relative differences in expected predictive accuracy between models (the differences in WAIC/PSIS and their standard errors), which tell us how confident we can be in the ranking, and it hides information about how the variables are related; it also tempts us to treat the winning model causally, even though predictive accuracy says nothing about causal structure. In model comparison, we instead keep several models, typically differing only slightly (for example, by one added variable), and use the information criteria alongside the parameter estimates to better understand what each model is doing and how much it is expected to overfit. A sketch of this follows below.
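# A minimal sketch (assuming rethinking and the cars data; the two models are
# hypothetical alternatives of my own choosing): compare() keeps the full table
# of WAIC differences and their standard errors instead of discarding all but
# the "winner".
library(rethinking)
data(cars)
m1 <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100), b ~ dnorm(0, 10), sigma ~ dexp(1)
  ), data = cars)
m2 <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a,                      # intercept-only alternative
    a ~ dnorm(0, 100), sigma ~ dexp(1)
  ), data = cars)
compare(m1, m2)  # table of WAIC, differences, penalties, weights, and SEs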
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# Information criteria are based on the deviance summed over all observations, not averaged, so a model fit to more observations accumulates a larger deviance (and a larger WAIC/PSIS) regardless of how well it fits. Comparing models fit to different numbers of observations therefore confounds sample size with model quality; the comparison is only fair when every model is fit to exactly the same observations. A quick experiment follows below.
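# A quick experiment (my own sketch, assuming rethinking and the cars data):
# the same model fit to all 50 observations versus only the first 25 gives
# very different WAIC values simply because fewer terms enter the sum.
library(rethinking)
data(cars)
waic_for <- function(d) {
  m <- quap(
    alist(
      dist ~ dnorm(mu, sigma),
      mu <- a + b * speed,
      a ~ dnorm(0, 100), b ~ dnorm(0, 10), sigma ~ dexp(1)
    ), data = d)
  WAIC(m)
}
waic_for(cars)          # fit to all 50 observations
waic_for(cars[1:25, ])  # fit to 25 observations -- WAIC is much smaller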
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# As a prior becomes more concentrated, the effective number of parameters (the penalty term) decreases.
# A more concentrated prior constrains the posterior, so the log-likelihood of each observation varies less
# across posterior samples; pWAIC is the sum of those per-observation variances, so it shrinks.
# PSIS behaves similarly: a tighter posterior makes the model less flexible, lowering its effective complexity.
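# A quick experiment (my own sketch, assuming rethinking and the cars data):
# tightening the prior on the slope should shrink the WAIC penalty (pWAIC).
library(rethinking)
data(cars)
m_vague <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100), b ~ dnorm(0, 10), sigma ~ dexp(1)   # vague slope prior
  ), data = cars)
m_tight <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100), b ~ dnorm(0, 0.1), sigma ~ dexp(1)  # concentrated slope prior
  ), data = cars)
set.seed(7)
WAIC(m_vague)  # penalty term for the vague prior
WAIC(m_tight)  # penalty is typically smaller here -- fewer effective parameters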
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors keep the model from learning too much from the sample: extreme parameter values are made implausible, so the model encodes the regular features of the data rather than its noise and therefore overfits less.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# An overly informative prior dominates the likelihood, so the data can barely move the posterior away from the prior. The model then fails to learn even the regular features that are present in the sample, producing an inaccurate posterior and underfitting. A small simulation illustrating this follows below.
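# A small simulation (my own sketch, assuming rethinking): with a true slope of 1,
# a mildly informative prior recovers it, while an overly concentrated prior at 0
# pulls the estimate toward 0 and the model underfits.
library(rethinking)
set.seed(7)
d <- data.frame(x = rnorm(100))
d$y <- rnorm(100, mean = d$x, sd = 1)  # true slope is 1
m_mild <- quap(
  alist(y ~ dnorm(mu, sigma), mu <- b * x,
        b ~ dnorm(0, 1), sigma ~ dexp(1)), data = d)
m_rigid <- quap(
  alist(y ~ dnorm(mu, sigma), mu <- b * x,
        b ~ dnorm(0, 0.01), sigma ~ dexp(1)), data = d)
precis(m_mild)   # slope estimate near 1
precis(m_rigid)  # slope estimate shrunk toward 0: the prior overwhelms the data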