The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate its degree. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools will mislead you, so models must be designed through some other method rather than selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The first motivating criterion: the measure of uncertainty should be continuous, so that small changes in the probabilities produce only small changes in the measured uncertainty.
# The second motivating criterion: uncertainty should increase as the number of possible outcomes increases.
# The third motivating criterion: uncertainty should be additive, so that the uncertainty over combinations of independent events is the sum of the separate uncertainties, no matter how the events are grouped.
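These criteria can be checked numerically; the sketch below is illustrative only, using a fair coin and a fair four-sided die as made-up examples.
H <- function(p) -sum(p * log(p))             # information entropy
H(rep(1/4, 4)) > H(rep(1/2, 2))               # criterion 2: more outcomes, more uncertainty
p_coin <- c(0.5, 0.5)
p_die  <- rep(1/4, 4)
p_joint <- as.vector(outer(p_coin, p_die))    # joint distribution of two independent events
all.equal(H(p_joint), H(p_coin) + H(p_die))   # criterion 3: entropies add for independent events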
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
p <- c(0.7, 0.3)
-sum( p*log(p) )
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
p2 <- c(0.2, 0.25, 0.25, 0.3)
-sum( p2*log(p2) )
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
p3 <- c(1/3, 1/3, 1/3)
-sum( p3*log(p3) )
## [1] 1.098612
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is the in-sample training deviance plus a penalty of twice the number of free parameters: AIC = D_train + 2p, where D_train is the deviance on the training sample and p is the number of free parameters.
# WAIC is the log pointwise predictive density minus a penalty term, all multiplied by -2: WAIC = -2(lppd - pWAIC). lppd is the sum over observations of the log of each observation's average likelihood over the posterior, and pWAIC is the sum over observations of the variance in log-likelihood (also an estimate of the effective number of free parameters).
# WAIC is the most general. To transform the more general criterion (WAIC) into the less general one (AIC), we must assume three things:
# 1 The priors are flat or overwhelmed by the likelihood
# 2 The posterior distribution is approximately multivariate Gaussian
# 3 The sample size N is much greater than the number of parameters k
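As a sketch of how the WAIC pieces fit together, suppose log_lik is a hypothetical S-by-N matrix of pointwise log-likelihoods (S posterior samples, N observations); the matrix below is filled with placeholder values and is not part of the original answer.
S <- 1000; N <- 50
log_lik <- matrix(rnorm(S * N, mean = -1, sd = 0.3), nrow = S)  # placeholder log-likelihood values
lppd  <- sum(log(colMeans(exp(log_lik))))   # log pointwise predictive density
pWAIC <- sum(apply(log_lik, 2, var))        # penalty: summed variance of log-likelihoods
WAIC  <- -2 * (lppd - pWAIC)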
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# Model selection means choosing the model with the lowest information criterion value and discarding all of the others. Model comparison, in contrast, uses and analyzes multiple models together when making inferences and predictions. Under model selection, the information about relative model accuracy contained in the differences among the information criterion values is lost, because only one model is kept and the rest are discarded.
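A small sketch of what gets discarded: the WAIC values below are invented, but the relative weights computed from their differences are exactly the information a pick-the-lowest-value rule throws away.
waic <- c(m1 = 361.9, m2 = 363.2, m3 = 380.5)          # hypothetical WAIC values
dWAIC  <- waic - min(waic)                             # differences from the best model
weight <- exp(-0.5 * dWAIC) / sum(exp(-0.5 * dWAIC))   # relative plausibility of each model
round(weight, 2)                                       # selection would keep only the single best model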
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# Information criteria are based on deviance, which is a sum over observations rather than an average. If models are fit to different numbers of observations, the model fit to more observations will tend to have a larger deviance simply because more terms are summed, so the comparison between models would be unfair.
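A quick experiment (a sketch, with arbitrary simulated data) shows the effect: the same Gaussian model produces roughly ten times the deviance when fit to ten times as many observations.
set.seed(7)
dev_for_n <- function(n) {
  y <- rnorm(n, mean = 1, sd = 2)
  # deviance of a simple Gaussian fit: -2 times the summed log-likelihood
  -2 * sum(dnorm(y, mean = mean(y), sd = sd(y), log = TRUE))
}
dev_for_n(100)    # on the order of 400
dev_for_n(1000)   # roughly ten times larger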
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# As the prior becomes more concentrated, the effective number of parameters decreases.
# pWAIC is the sum of the variances in the log-likelihood of each observation across the posterior. With a more concentrated prior, the posterior becomes more concentrated as well, so the log-likelihoods vary less from sample to sample and the penalty shrinks.
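One possible experiment, assuming the rethinking package's quap and WAIC functions; the simulated data and prior widths below are arbitrary choices, not from the original answer.
library(rethinking)
set.seed(7)
d <- data.frame(y = rnorm(50, mean = 1, sd = 1))
m_vague <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 10),     # weakly informative prior on mu
    sigma ~ dexp(1)
  ), data = d)
m_tight <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 0.1),    # much more concentrated prior on mu
    sigma ~ dexp(1)
  ), data = d)
WAIC(m_vague)   # penalty column gives pWAIC
WAIC(m_tight)   # penalty expected to be smaller under the tighter prior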
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors reduce overfitting because they constrain the model and reduce its flexibility. Extreme parameter values are then unlikely to be assigned high posterior probability, so the model learns less from the idiosyncratic noise in its training data.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# Overly informative priors result in underfitting because they restrict the model's flexibility too much: parameter values that are well supported by the data are assigned low posterior probability, so the model fails to capture the regular features of the sample.