The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical function compare in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
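# In information theory, information entropy is a measure of the degree of uncertainty in a probability distribution. It gives the amount of information needed to represent an event drawn from that distribution. The three motivating criteria are: 1. The measure should be continuous, so that a small change in any probability produces only a small change in uncertainty. 2. The measure should increase as the number of possible events increases, because more possible outcomes means more uncertainty about which one will occur. 3. The measure should be additive, so that the uncertainty over combinations of independent events is the sum of their separate uncertainties.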
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
# Probabilities of heads and tails
p <- c(0.7, 1 - 0.7)
# Entropy in nats (natural log)
H <- -sum(p * log(p))
H
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
# Probabilities of sides "1" through "4"
p <- c(0.2, 0.25, 0.25, 0.3)
# Entropy in nats (natural log)
H <- -sum(p * log(p))
H
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
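# Side "4" has probability 0 and is left out, because 0 * log(0) evaluates to NaN in R
p <- c(1/3, 1/3, 1/3)
# Entropy in nats (natural log)
H <- -sum(p * log(p))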
H
## [1] 1.098612
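The three entropy calculations above repeat the same formula, so it can be wrapped in a small helper. This is only an illustrative sketch: `entropy` is a hypothetical name, not a function from the rethinking package, and it adopts the usual convention that an event with probability zero contributes nothing to the sum.

```r
# Hypothetical helper: entropy (in nats) of a discrete distribution
entropy <- function(p) {
  stopifnot(all(p >= 0), isTRUE(all.equal(sum(p), 1)))
  p <- p[p > 0]   # convention: a zero-probability event contributes 0
  -sum(p * log(p))
}

entropy(c(0.7, 0.3))               # coin from 7E2
entropy(c(0.2, 0.25, 0.25, 0.3))   # die from 7E3
entropy(c(1/3, 1/3, 1/3, 0))       # die from 7E4, with side "4" included
```

With the guard in place, including the zero-probability side gives the same value as the three-sided calculation, whereas the unguarded expression `-sum(p * log(p))` would return NaN.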
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC: defined as $D_{\text{train}} + 2p$, where $D_{\text{train}}$ is the in-sample training deviance and $p$ is the number of free parameters estimated in the model.
# DIC: defined as $\bar{D} + p_D = \bar{D} + (\bar{D} - \hat{D})$, where $\bar{D}$ is the average of the posterior distribution of deviance and $\hat{D}$ is the deviance calculated at the posterior mean.
# WAIC: defined as $-2(\text{lppd} - p_{\text{WAIC}}) = -2\left(\sum_{i=1}^{N} \log \Pr(y_i) - \sum_{i=1}^{N} V(y_i)\right)$, where $\Pr(y_i)$ is the average likelihood of observation $i$ in the training sample and $V(y_i)$ is the variance in log-likelihood for observation $i$ in the training sample.
# All three definitions involve two components: 1. an estimate or analog of the in-sample training deviance ($D_{\text{train}}$ for AIC, $\bar{D}$ for DIC, and lppd for WAIC) and 2. an estimate or analog of the number of free parameters estimated in the model ($p$ in AIC, $p_D$ in DIC, and $p_{\text{WAIC}}$ in WAIC).
# WAIC is the most general, followed by DIC, and finally AIC. To go from WAIC to DIC, we must assume that the posterior distribution is approximately multivariate Gaussian. To go from DIC to AIC, we must further assume that the priors are flat or overwhelmed by the likelihood.
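As a rough check on the AIC definition, the sketch below computes it by hand for an ordinary least-squares fit and compares the result with base R's `AIC()`. The simulated data and the names `x`, `y`, and `fit` are assumptions made for illustration. Note that `lm` counts the residual standard deviation as an estimated parameter, so the parameter count is 3 here.

```r
set.seed(7)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

ll <- logLik(fit)
D_train <- -2 * as.numeric(ll)   # in-sample training deviance
p <- attr(ll, "df")              # intercept, slope, and residual sd
aic_by_hand <- D_train + 2 * p

c(by_hand = aic_by_hand, built_in = AIC(fit))
```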
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
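# Model selection means picking the single model with the lowest (best) criterion value and discarding the rest.
# Model comparison is a more general approach that uses multiple models, each with different predictor variables, to understand how those variables influence predictions and, together with a causal model, what the implied conditional independencies among the variables say about causal relationships. Model selection throws away the relative differences in criterion values among the models, which indicate how confident we can be in the top-ranked model.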
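To make the distinction concrete, the sketch below ranks three regressions with base R's `AIC()` (used here as a stand-in for WAIC/PSIS, which need posterior samples) and keeps the differences between models instead of reporting only the winner. The `mtcars` models are arbitrary choices for illustration.

```r
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

tab <- AIC(m1, m2, m3)               # data frame with columns df and AIC
tab$dAIC <- tab$AIC - min(tab$AIC)   # difference from the best-scoring model
tab <- tab[order(tab$AIC), ]         # best model first
tab
```

Model selection would report only the first row of this table; the `dAIC` column is the information it discards.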
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
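# Information criteria are based on deviance, which is a sum over observations rather than an average, so different sets of observations yield different deviances.
# If the models were fit to different numbers of observations, the model fit to more observations would usually have the higher deviance and so would look worse, even if it is not. The comparison would be meaningless.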
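A small experiment along these lines: the sketch below fits the same linear model to 100 simulated observations and to the first 50 of them, then compares the deviances. The simulated data and all object names are assumptions made for illustration.

```r
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
d <- data.frame(x = x, y = y)

fit_full <- lm(y ~ x, data = d)           # all 100 observations
fit_half <- lm(y ~ x, data = d[1:50, ])   # same model, first 50 only

# Deviance is -2 times the log-likelihood, summed over observations
dev_full <- -2 * as.numeric(logLik(fit_full))
dev_half <- -2 * as.numeric(logLik(fit_half))
c(full = dev_full, half = dev_half)
```

The model structure is identical in both fits, so the gap in deviance reflects only the number of observations, not the quality of the model.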
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
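# As the prior becomes more concentrated, the effective number of parameters decreases.
# The penalty term in WAIC is the sum, over observations, of the posterior variance of the pointwise log-likelihood. A more concentrated prior narrows the posterior, which shrinks those variances and hence the penalty. LOO-CV using PSIS (as implemented by the loo() function) is still worth running, because PSIS provides useful diagnostics as well as effective sample size and Monte Carlo error estimates.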
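One way to run such an experiment without fitting a full model is a conjugate Normal example, where the posterior for the mean is available in closed form. The sketch below is a simplified stand-in for a real WAIC computation: it assumes a known observation standard deviation, a Normal(0, prior_sd) prior on the mean, and a hypothetical helper `p_waic` that returns only the WAIC penalty term.

```r
set.seed(42)
n <- 20
sigma <- 1                              # known observation sd (assumption)
y <- rnorm(n, mean = 0, sd = sigma)

# Hypothetical helper: WAIC penalty under a Normal(0, prior_sd) prior on the mean
p_waic <- function(prior_sd, n_samples = 4000) {
  post_var  <- 1 / (1 / prior_sd^2 + n / sigma^2)
  post_mean <- post_var * sum(y) / sigma^2
  mu <- rnorm(n_samples, post_mean, sqrt(post_var))   # posterior draws
  # Rows are posterior draws, columns are observations
  ll <- sapply(y, function(yi) dnorm(yi, mu, sigma, log = TRUE))
  sum(apply(ll, 2, var))                # sum of pointwise variances
}

p_wide   <- p_waic(prior_sd = 10)
p_narrow <- p_waic(prior_sd = 0.1)
c(wide = p_wide, narrow = p_narrow)
```

The narrow prior leaves the posterior for the mean with much less spread, so the log-likelihood of each observation varies less across draws and the penalty is smaller.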
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors constrain the flexibility of the model.
# Extreme parameter values are not assigned high posterior probability unless the data strongly support them. Informative priors therefore reduce overfitting by forcing the model to learn less from the idiosyncrasies of the sample.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# Informative priors reduce overfitting because they constrain the flexibility of the model. Underfitting occurs when they constrain that flexibility too much. If the true parameter value lies outside the region the prior considers plausible, the model cannot learn it from the sample and underfitting will occur.
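The effect of prior width can be seen in a conjugate Normal example, where the posterior mean has a closed form. The sketch below is illustrative only: the true mean of 2, the known observation standard deviation, the three prior widths, and the helper name `post_mean` are all assumptions.

```r
set.seed(3)
n <- 10
sigma <- 1                            # known observation sd (assumption)
y <- rnorm(n, mean = 2, sd = sigma)   # true mean is 2 (assumption)

# Posterior mean under a Normal(0, prior_sd) prior (conjugate result)
post_mean <- function(prior_sd) {
  (sum(y) / sigma^2) / (1 / prior_sd^2 + n / sigma^2)
}

prior_sd <- c(flat = 100, regularizing = 1, too_tight = 0.05)
est <- sapply(prior_sd, post_mean)
rbind(estimate = est, sample_mean = mean(y))
```

The nearly flat prior reproduces the sample mean, noise included. The regularizing prior pulls the estimate modestly toward zero. The overly tight prior holds the estimate near zero regardless of the data, which is underfitting.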