Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, however, these tools will mislead you, so causal models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that answers the question or completes the requested activity. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

# 1. Continuous. The measure of uncertainty should be continuous, so that a small change in any probability produces only a small change in uncertainty.
# 2. Increasing with the number of possible events. Uncertainty increases as the number of different possible outcomes increases.
# 3. Additive. Uncertainty is additive for independent events: the uncertainty over combinations of events is the sum of the separate uncertainties.
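The second and third criteria can be checked numerically. The following is a minimal sketch, assuming a helper named `entropy` (hypothetical, not part of the assignment) that uses natural logarithms; the probabilities are illustrative values only.

```r
# Hypothetical helper: information entropy in nats
entropy <- function(p) -sum(p * log(p))

# Increasing: more equally likely outcomes means more uncertainty
h2 <- entropy(rep(1/2, 2))
h4 <- entropy(rep(1/4, 4))

# Additive: for two independent events, the joint entropy is the sum
p_a <- c(0.7, 0.3)
p_b <- c(0.2, 0.8)
p_joint <- as.vector(outer(p_a, p_b))  # joint probabilities under independence
h_joint <- entropy(p_joint)
h_sum <- entropy(p_a) + entropy(p_b)

c(h2 = h2, h4 = h4, h_joint = h_joint, h_sum = h_sum)
```

The last two entries should agree up to floating-point error, and `h4` should exceed `h2`.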

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
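The answer above drops the impossible side from `p`. If the zero probability is kept in the vector, the same formula breaks, because R evaluates `0 * log(0)` as `NaN`. A small sketch of one way to guard against that, using the convention that p log(p) is 0 when p = 0 (its limit as p approaches 0):

```r
# Include the impossible side explicitly
p <- c(1/3, 1/3, 1/3, 0)

# Naive formula fails: 0 * log(0) evaluates to NaN in R
H_naive <- -sum(p * log(p))

# Treat p * log(p) as 0 when p = 0
H_safe <- -sum(ifelse(p > 0, p * log(p), 0))

c(H_naive = H_naive, H_safe = H_safe)
```

`H_safe` matches the entropy computed from the three possible sides alone.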

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

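# The Akaike Information Criterion (AIC) is the in-sample training deviance (Dtrain) plus twice the number of free parameters (p):
# AIC = Dtrain + 2p
#
# The Widely Applicable Information Criterion (WAIC) is -2 times the difference between the log-pointwise-predictive-density (lppd) and a penalty term, the sum over observations of the variance in log-likelihood across the posterior:
# WAIC(y, Θ) = -2(lppd - Σi varΘ log p(yi|Θ))
# Both combine a measure of in-sample fit with a penalty that estimates the effective number of free parameters.
# WAIC is the more general criterion: it makes no assumption about the shape of the posterior. AIC is reliable only when (1) the priors are flat or overwhelmed by the likelihood, (2) the posterior distribution is approximately multivariate Gaussian, and (3) the sample size is much greater than the number of parameters. Under those assumptions the WAIC penalty is approximately p and the two criteria agree.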
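To see the two definitions side by side, here is a hedged sketch in base R that computes both criteria by hand for a deliberately simple case: a Gaussian mean with known standard deviation and a flat prior, where the posterior is exactly Gaussian. The simulated data, the seed, and all variable names are assumptions made for illustration, and the rethinking package is not used.

```r
set.seed(7)
n <- 50
sigma <- 1                         # assumed known, so mu is the only parameter
y <- rnorm(n, mean = 0.3, sd = sigma)

# AIC: training deviance at the MLE plus twice the parameter count
mu_hat <- mean(y)
D_train <- -2 * sum(dnorm(y, mu_hat, sigma, log = TRUE))
aic <- D_train + 2 * 1

# Flat prior + known sigma: posterior of mu is Normal(mean(y), sigma / sqrt(n))
mu_post <- rnorm(1e4, mu_hat, sigma / sqrt(n))

# Log-likelihood matrix: rows = observations, columns = posterior samples
ll <- sapply(mu_post, function(m) dnorm(y, m, sigma, log = TRUE))

lppd <- sum(log(rowMeans(exp(ll))))    # log-pointwise-predictive-density
p_waic <- sum(apply(ll, 1, var))       # penalty: summed pointwise variances
waic <- -2 * (lppd - p_waic)

round(c(AIC = aic, WAIC = waic, p_WAIC = p_waic), 2)
```

Because this setup satisfies the AIC assumptions, the WAIC penalty should land near 1 (the number of parameters) and the two criteria should be close. With a strongly informative prior or a non-Gaussian posterior, they would be expected to diverge.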

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection means keeping the model with the lowest information criterion value and discarding all models with higher values. This practice loses the information about relative model accuracy contained in the differences among the information criterion values; this is especially problematic when the selected model outperforms its alternatives by only a small margin. Model comparison keeps all of the models and uses the differences among them, together with the uncertainty in those differences, to understand how different variables influence predictions. This practice does not lose information on its own. However, when combined with undisclosed data dredging, it can still lead to spurious findings.
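A comparison table makes the retained information concrete. The sketch below uses base R's `AIC` on `lm` fits to the built-in `mtcars` data as a stand-in for WAIC/PSIS; the choice of data set, predictors, and model names is an assumption for illustration only.

```r
# Three candidate models fit to the same built-in data set
fits <- list(
  m_wt       = lm(mpg ~ wt, data = mtcars),
  m_wt_hp    = lm(mpg ~ wt + hp, data = mtcars),
  m_wt_hp_qs = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

aic <- sapply(fits, AIC)
d_aic <- aic - min(aic)                               # difference from the best model
weight <- exp(-0.5 * d_aic) / sum(exp(-0.5 * d_aic))  # Akaike weights

comparison <- data.frame(AIC = aic, dAIC = d_aic, weight = weight)
comparison[order(comparison$AIC), ]
```

Model selection would report only the first row. The `dAIC` and `weight` columns are what gets thrown away.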

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

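# Information criteria are based on the total deviance, which is a sum over observations. A model fit to more observations therefore accumulates a larger deviance, so models fit to different numbers of observations will have different information criterion values for reasons unrelated to their accuracy, and the comparison is meaningless.
#
# For example, if one model is fit to a small subsample of the data, its WAIC will be lower simply because fewer observations contribute to the deviance, so it will look more accurate than a model fit to the full data even if it is not.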
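The question invites an experiment. A minimal one, assuming simulated data and using `AIC` on `lm` fits as a quick stand-in for WAIC (the seed, sample sizes, and coefficients are arbitrary choices):

```r
set.seed(11)
x <- rnorm(1000)
y <- 1 + 0.5 * x + rnorm(1000)
d <- data.frame(x = x, y = y)

# Same model, fit to different numbers of observations
sizes <- c(100, 500, 1000)
aic_by_n <- sapply(sizes, function(n) AIC(lm(y ~ x, data = d[1:n, ])))
names(aic_by_n) <- sizes

aic_by_n             # total deviance accumulates with n
aic_by_n / sizes     # per-observation values are roughly comparable
```

The same model looks far "better" at n = 100 than at n = 1000, which is why models being compared must be fit to exactly the same observations.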

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

# The effective number of parameters decreases as the prior becomes more concentrated. In the case of PSIS, a more concentrated prior makes the model less flexible, so leaving out any single observation changes the posterior less. In the case of WAIC, a more concentrated prior produces a more concentrated posterior, so the log-likelihood of each observation varies less across posterior samples and the variance penalty term shrinks.
# WAIC(y, Θ) = -2(lppd - Σi varΘ log p(yi|Θ))
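This can also be checked with a small experiment. The sketch below assumes a Gaussian mean with known standard deviation and a conjugate Normal(0, tau) prior, so the posterior is available in closed form; the helper `p_waic_for` and the values of `tau` are hypothetical choices for illustration.

```r
set.seed(3)
n <- 20
sigma <- 1                      # assumed known
y <- rnorm(n, 0, sigma)
z <- rnorm(1e4)                 # shared standard-normal draws for every prior

p_waic_for <- function(tau) {
  # Conjugate posterior for mu under the prior mu ~ Normal(0, tau)
  post_var <- 1 / (n / sigma^2 + 1 / tau^2)
  post_mean <- post_var * sum(y) / sigma^2
  mu <- post_mean + sqrt(post_var) * z
  ll <- sapply(mu, function(m) dnorm(y, m, sigma, log = TRUE))
  sum(apply(ll, 1, var))        # WAIC penalty term
}

taus <- c(10, 0.5, 0.05)        # wide to narrow priors
p_waic <- sapply(taus, p_waic_for)
names(p_waic) <- taus
round(p_waic, 3)
```

The penalty should shrink as `tau` shrinks, because the posterior for `mu` tightens and each observation's log-likelihood varies less across draws.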

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative priors act as regularizers, which reduces overfitting. They constrain the flexibility of the model by making it less likely that extreme parameter values receive high posterior probability, so the model learns less from the irregular features of the sample that would not recur in new data.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors result in underfitting because they constrain the flexibility of the model too much. The parameters are held so close to their prior values that the posterior barely responds to the data, and the model fails to learn even the regular features of the sample.
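Both effects can be illustrated with one simulation. The sketch below assumes a 6th-degree polynomial regression with independent Normal(0, tau) priors on every coefficient (including the intercept) and a known residual standard deviation of 1, in which case the posterior mean has a closed form. The data, the three values of `tau`, and the helper `post_mean` are hypothetical.

```r
set.seed(5)
n <- 30
x <- runif(n, -1, 1)
y <- sin(2 * x) + rnorm(n)                # noise sd of 1, matching the assumption
X <- cbind(1, poly(x, 6, raw = TRUE))     # intercept plus 6th-degree polynomial

# Posterior mean of the coefficients under b_j ~ Normal(0, tau), sigma = 1:
# (X'X + I / tau^2)^{-1} X'y
post_mean <- function(tau) {
  solve(crossprod(X) + diag(ncol(X)) / tau^2, crossprod(X, y))
}

taus <- c(flat = 100, regularizing = 1, too_tight = 0.01)
train_rss <- sapply(taus, function(tau) sum((y - X %*% post_mean(tau))^2))
coef_size <- sapply(taus, function(tau) sqrt(sum(post_mean(tau)^2)))

round(rbind(train_rss, coef_size), 3)
```

As the prior tightens, the coefficients shrink toward zero and the fit to the training sample gets worse. A moderate prior gives up a little in-sample fit in exchange for less extreme coefficients, while the very tight prior forces the coefficients so close to zero that the model cannot track the sine-shaped trend at all.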