Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

# 1. Continuity: uncertainty should be measured on a continuous scale, so that a small change in any of the probabilities produces only a small change in the measure of uncertainty.
# 2. Uncertainty should increase with the number of possible events: the more things that can happen, the more uncertain we are.
# 3. Additivity: uncertainty should be additive for independent events, and should not depend on how those events are grouped or partitioned.
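
# As a quick check of the additivity criterion, this sketch (base R; the
# example probabilities are made up for illustration) verifies that the
# entropy of the joint distribution of two independent events equals the
# sum of their individual entropies.
entropy <- function(p) -sum(p * log(p))

p_rain <- c(0.3, 0.7)                      # rain / no rain
p_hot  <- c(0.4, 0.6)                      # hot / cold
joint  <- as.vector(outer(p_rain, p_hot))  # joint distribution under independence

entropy(joint)                             # equals the sum below
entropy(p_rain) + entropy(p_hot)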

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- 0.7
prob <- c(p, 1 - p)            # probabilities of heads and tails
(H <- -sum(prob * log(prob)))  # information entropy, in nats (natural log)
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

prob <- c(0.20, 0.25, 0.25, 0.30)  # probabilities of faces 1 through 4
(H <- -sum(prob * log(prob)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

# Events with probability zero contribute nothing to the entropy
# (0 * log(0) -> 0 by convention), so only the three equally likely
# sides enter the sum.
prob <- c(1/3, 1/3, 1/3)
(H <- -sum(prob * log(prob)))
## [1] 1.098612

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# AIC = D_train + 2p, where
# D_train = in-sample training deviance
# p = number of free parameters

# WAIC = -2 * (lppd - pWAIC), where
# lppd = log-pointwise-predictive-density: the sum, across observations, of
#        the log of the average likelihood of each observation
# pWAIC (the penalty term) = sum, across observations, of the variance of
#        each observation's log-likelihood over posterior samples

# WAIC is the most general criterion.

# To transform WAIC into AIC, assume:
# 1. Flat priors (or priors overwhelmed by the likelihood).
# 2. An approximately multivariate Gaussian posterior.
# 3. Sample size N much greater than the number of parameters p.
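
# To make the definitions concrete, this sketch (base R; the log-likelihood
# matrix is simulated, since any S x N matrix of log-likelihoods will do)
# computes lppd, pWAIC, and WAIC with posterior samples in rows and
# observations in columns.
set.seed(7)
log_lik <- matrix(rnorm(1000 * 50, mean = -1, sd = 0.2), nrow = 1000)

# log of the average likelihood of each observation, summed over observations
log_mean_exp <- function(x) log(mean(exp(x)))
lppd <- sum(apply(log_lik, 2, log_mean_exp))

# penalty: variance of each observation's log-likelihood across samples
pWAIC <- sum(apply(log_lik, 2, var))

(WAIC <- -2 * (lppd - pWAIC))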

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection means choosing the single model with the lowest information-criterion value and discarding the rest. Model comparison instead keeps all of the models, using the differences among their criterion values, together with other knowledge such as each model's causal implications, to understand how they differ. Under model selection, the information about relative predictive accuracy contained in those differences is lost, since you pick one model and completely discard the others.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# Information criteria estimate prediction error from the total (summed) deviance over observations, not the average deviance, so they only support relative comparisons among models fit to exactly the same observations.
# If the models were fit to different numbers of observations, the model trained on more observations would generally have a larger summed deviance simply because more terms enter the sum, making the comparison meaningless.
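
# A quick experiment (base R; the data-generating process and sample sizes
# are arbitrary choices) shows that the same model fit to more observations
# gets a larger AIC even though nothing else changed.
set.seed(505)
sim_fit <- function(n) {
  x <- rnorm(n)
  y <- 2 + 0.5 * x + rnorm(n)
  lm(y ~ x)
}

AIC(sim_fit(100))
AIC(sim_fit(1000))  # larger, purely because deviance is summed over more points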

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

# The effective number of parameters decreases as the prior becomes more concentrated.
# For WAIC, pWAIC (the penalty term) is the sum over observations of the variance of the log-likelihood across posterior samples. A more concentrated prior yields a more concentrated posterior, so the log-likelihood of each observation varies less from sample to sample, reducing the variance and hence pWAIC. The PSIS penalty behaves analogously.
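
# A sketch of such an experiment, assuming the rethinking package is
# installed (quap and WAIC, as used in the chapter); the simulated data and
# prior widths are arbitrary choices.
library(rethinking)

set.seed(7)
d <- data.frame(x = rnorm(50))
d$y <- rnorm(50, mean = 0.5 * d$x, sd = 1)

# weakly informative (wide) priors
m_wide <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu <- a + b * x,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10)
  ), data = d)

# concentrated (tight) priors
m_tight <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu <- a + b * x,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.1)
  ), data = d)

# the tight-prior model should show a smaller penalty (effective parameters)
WAIC(m_wide)
WAIC(m_tight)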

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative (regularizing) priors make the model skeptical of extreme parameter values: such values receive low prior probability, so the data must provide strong evidence before the posterior concentrates on them. As a result, the model learns less from the idiosyncratic noise in its training sample and overfits less.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors restrict the model to such an extent that even the correct parameter values cannot be assigned a high posterior probability. The model then fails to learn the regular features of the sample, i.e., it underfits.