Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

# 1) Continuity: The measure of uncertainty should be continuous, so that an arbitrarily small change in any of the probabilities cannot produce a massive change in the measured uncertainty.
# 2) Increasing with the number of events: The measure of uncertainty should increase as the number of possible events increases. When more distinct events can happen, we expect more uncertainty to be measured, because of the added dimensionality of the prediction problem.
# 3) Additivity: The measure of uncertainty should be additive. For example, if we first measure the uncertainty about 2 possible events and then measure the uncertainty about 2 other, independent events, the uncertainty over the four combinations of these events should be the sum of the separate uncertainties.
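
# A small numeric sketch of the three criteria (my own illustration; the helper
# entropy() below is just -sum(p*log(p))):
entropy <- function(p) -sum(p * log(p))

# 1) Continuity: a tiny change in the probabilities produces only a tiny change in entropy
entropy(c(0.5, 0.5)) - entropy(c(0.501, 0.499))

# 2) More possible (equally likely) events -> more measured uncertainty
entropy(rep(1/2, 2)); entropy(rep(1/4, 4)); entropy(rep(1/8, 8))

# 3) Additivity: for independent events, the entropy of the joint distribution
#    equals the sum of the separate entropies
p_rain <- c(0.3, 0.7)
p_hot  <- c(0.4, 0.6)
entropy(outer(p_rain, p_hot))      # entropy over the 4 joint combinations
entropy(p_rain) + entropy(p_hot)   # same value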

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- c(0.7, 0.3)
-sum(p*log(p))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c(0.2, 0.25, 0.25, 0.3)
-sum(p*log(p))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

p <- c(1/3, 1/3, 1/3)
-sum(p*log(p))
## [1] 1.098612

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# The definition of the AIC: 

\[ \text{AIC} = D_{\text{train}} + 2p = -2\,\text{lppd} + 2p \] where p is the number of free parameters in the posterior distribution.

# The definition of the WAIC:

\[ \text{WAIC}(y, \Theta) = -2 \left( \text{lppd} - \sum_i \operatorname{var}_\theta \log p(y_i \mid \theta) \right) \] where lppd is the log-pointwise-predictive-density and the penalty term is the variance in the log-likelihood of each observation i, taken over the posterior distribution.

Both definitions involve two components: an estimate of the fit to the training sample (D_train for AIC, lppd for WAIC) and a penalty for flexibility in fitting the sample (twice the number of parameters p for AIC; the sum over observations of the variance in the log-likelihood of each observation, taken over the posterior distribution, for WAIC).

WAIC is more general than AIC. To move from WAIC to AIC, we must assume that the priors are flat or overwhelmed by the likelihood, that the posterior distribution is approximately multivariate Gaussian, and that the sample size N is much greater than the number of parameters k.
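
# A hand-computation sketch of WAIC's two pieces (lppd and the variance penalty),
# loosely following the chapter's cars example; the result should match
# rethinking's own WAIC() up to simulation noise.
library(rethinking)
data(cars)
m_cars <- quap(
  alist(
    dist ~ dnorm( mu , sigma ) ,
    mu <- a + b * speed ,
    a ~ dnorm( 0 , 100 ) ,
    b ~ dnorm( 0 , 10 ) ,
    sigma ~ dexp( 1 )
    ) , data = cars )

set.seed(7)
post <- extract.samples( m_cars , n = 1000 )

# log-likelihood of each observation under each posterior sample
logprob <- sapply( 1:nrow(cars) , function(i)
  dnorm( cars$dist[i] ,
         mean = post$a + post$b * cars$speed[i] ,
         sd = post$sigma , log = TRUE ) )

lppd   <- sum( apply( logprob , 2 , function(lp) log_sum_exp(lp) - log(length(lp)) ) )
p_waic <- sum( apply( logprob , 2 , var ) )   # the effective-parameter penalty
-2 * ( lppd - p_waic )                        # compare to WAIC(m_cars)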

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection refers to choosing the model with the lowest (i.e., best) criterion value and discarding the other models. This approach throws away the information about relative model accuracy contained in the differences among the criterion values of the candidate models; that relative accuracy informs how confident we can be in the models. Additionally, model selection cares only about predictive accuracy and ignores causal inference, so a model may be selected that contains confounds or that would not correctly inform an intervention.

# In contrast, model comparison is a more general approach that uses multiple models to understand both how the variables influence prediction and, together with a causal model, whether the implied conditional independencies hold. Thus, we preserve information and can make more holistic judgments about our data and models.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# When one model is fit to fewer (or different) observations, it is being evaluated against a different target than the other models. If fewer observations are used, the model will usually appear to perform better simply because there is less to predict, so the deviance is smaller. Information criteria like WAIC and PSIS are sums of pointwise terms, so fewer observations means fewer terms in the sum and therefore a smaller criterion value, regardless of the model's actual quality. The small experiment below illustrates this.
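
# A quick experiment sketch (my own, reusing the cars model form from the 7M1
# sketch above): fit the same model to all 50 rows and to only the first 25,
# and note how much smaller WAIC becomes with fewer observations, even though
# the models are otherwise identical.
library(rethinking)
data(cars)

fit_cars <- function(dat) quap(
  alist(
    dist ~ dnorm( mu , sigma ) ,
    mu <- a + b * speed ,
    a ~ dnorm( 0 , 100 ) ,
    b ~ dnorm( 0 , 10 ) ,
    sigma ~ dexp( 1 )
    ) , data = dat )

set.seed(7)
m_all  <- fit_cars( cars )           # all 50 observations
m_half <- fit_cars( cars[1:25, ] )   # only the first 25 observations
WAIC(m_all)                          # WAIC is a sum of pointwise terms...
WAIC(m_half)                         # ...so fewer points gives a much smaller value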

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

library(rethinking)   # provides quap(), standardize(), compare(), and the WaffleDivorce data

set.seed(11)
data(WaffleDivorce)
d <- WaffleDivorce
d$A <- standardize( d$MedianAgeMarriage )
d$D <- standardize( d$Divorce )

# sigma ~ uniform (0,10)
m_1 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bA * A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dunif( 0,10 )
    ) , data = d )

# concentrated priors: sigma ~ uniform (0,1)
m_2 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bA * A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dunif( 0,1 )
    ) , data = d )



compare(m_1, m_2, func=WAIC)
##         WAIC       SE     dWAIC     dSE    pWAIC    weight
## m_2 126.5506 13.90576 0.0000000      NA 4.140526 0.5535415
## m_1 126.9806 14.24906 0.4299805 0.37976 4.417956 0.4464585
compare(m_1, m_2, func=PSIS)
##         PSIS       SE    dPSIS      dSE    pPSIS    weight
## m_2 126.2403 13.80840 0.000000       NA 3.979589 0.7741175
## m_1 128.7037 15.58872 2.463417 1.989058 5.256225 0.2258825
# As a prior becomes more concentrated, the effective number of parameters, as measured by PSIS or WAIC (pPSIS and pWAIC above), decreases. A prior that is more concentrated around particular parameter values (a regularizing prior) is more skeptical, so the model is less flexible in fitting the sample, and therefore the penalty term for overfitting risk (the effective number of parameters) shrinks.

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative priors reduce overfitting by reducing the sensitivity of the model to the sample. Some of the information in a sample is irregular noise rather than a recurring feature of the process of interest, and a skeptical prior prevents the model from learning too much from that noise. When tuned properly, such a prior reduces overfitting while still allowing the model to learn the regular features of the sample.
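
# A minimal base-R simulation sketch (my own construction, not code from the
# chapter): the outcome is pure noise, so the k predictors carry no real signal.
# With a Normal(0, tau^2) prior on each slope (and sigma fixed at 1), the
# posterior mean has the closed form b_hat = (X'X + I/tau^2)^(-1) X'y, so a
# tighter prior shrinks the slopes toward zero and the fit chases the noise less,
# which shows up as a lower out-of-sample deviance on average.
set.seed(7)
sim_once <- function(tau, n = 20, k = 5) {
  X     <- matrix(rnorm(n * k), n, k)   # k pure-noise predictors
  y_trn <- rnorm(n)                     # training outcomes (no real effects)
  y_tst <- rnorm(n)                     # fresh test outcomes for the same X
  b_hat <- solve(crossprod(X) + diag(1/tau^2, k), crossprod(X, y_trn))
  -2 * sum(dnorm(y_tst, X %*% b_hat, 1, log = TRUE))   # out-of-sample deviance
}
mean(replicate(2000, sim_once(tau = 10)))    # nearly flat priors: overfits more
mean(replicate(2000, sim_once(tau = 0.2)))   # informative priors: lower test deviance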

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# If a prior is overly informative (too skeptical), then even the regular features of a sample will be missed and will not be learned by the model. The result is underfitting: the model predicts poorly both within and out of sample.
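
# A companion sketch to the one in 7M5 (same assumptions): now each predictor
# has a true slope of 1, so an overly tight prior shrinks the estimates far
# below the truth and the model underfits, hurting even the in-sample fit.
set.seed(7)
fit_once <- function(tau, n = 20, k = 5) {
  X     <- matrix(rnorm(n * k), n, k)
  y     <- X %*% rep(1, k) + rnorm(n)   # true slopes are all 1
  b_hat <- solve(crossprod(X) + diag(1/tau^2, k), crossprod(X, y))
  c(mean_b_hat = mean(b_hat),
    train_dev  = -2 * sum(dnorm(y, X %*% b_hat, 1, log = TRUE)))
}
rowMeans(replicate(2000, fit_once(tau = 1)))     # sensible prior: slopes near 1
rowMeans(replicate(2000, fit_once(tau = 0.05)))  # overly tight prior: slopes near 0, much worse fit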