Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate its degree. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link in Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

#1 The measure of uncertainty should be continuous: a small change in the probabilities should produce only a small change in the measured uncertainty.
#2 The measure of uncertainty should increase as the number of possible events increases (more possible outcomes means more uncertainty).
#3 The measure of uncertainty should be additive: the uncertainty of a combination of events should equal the sum of the uncertainties of the separate events.

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

# Shannon entropy: H(p) = -sum( p_i * log(p_i) ), here in nats (natural log)
p <- c( 0.3 , 0.7 )
-sum( p*log(p) )
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c(0.20, 0.25, 0.25, 0.30)
-sum(p * log(p))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

# The "4" face has probability 0; by convention 0 * log(0) = 0 (its limiting value),
# so that event drops out and only the three equally likely outcomes remain.
p <- c(1/3, 1/3, 1/3)
-sum(p * log(p))
## [1] 1.098612

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# AIC = D_train + 2*k
#   where D_train is the in-sample (training) deviance and k is the number of free parameters
#   assumptions: flat priors, the posterior is approximately multivariate Gaussian, and the sample size N is much greater than k
#
# DIC = mean(D) + p_D = mean(D) + ( mean(D) - D_at_mean )
#   where D         - the posterior distribution of the deviance
#         p_D       - the effective number of parameters
#         D_at_mean - the deviance computed at the posterior mean
#   assumptions: the posterior is approximately multivariate Gaussian (informative priors are allowed)
#
# WAIC = -2*( lppd - p_WAIC )
#   where lppd        - the log pointwise predictive density
#                       lppd = sum( log(Pr(y_i)) ) over observations i = 1 to N
#         Pr(y_i)     - the average likelihood of observation y_i, computed by averaging its likelihood over the posterior distribution of the parameters
#         p_WAIC      - the effective number of parameters
#                       p_WAIC = sum( Var_ll(y_i) ) over observations i = 1 to N
#         Var_ll(y_i) - the variance of the log-likelihood of observation y_i, taken over the posterior distribution of the parameters
#
# Most to least general: WAIC > DIC > AIC
# Assumptions: none > Gaussian posterior > Gaussian posterior + flat priors + large N
# When the posterior is approximately multivariate Gaussian, WAIC and DIC will tend to agree, as a consequence of the Gaussian assumption behind DIC.
# When, in addition, the priors are effectively flat or overwhelmed by the amount of data, DIC and AIC will tend to agree as well.
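
To make the WAIC pieces concrete, here is a minimal by-hand sketch following the book's approach, assuming the rethinking package and the built-in cars data: compute the pointwise log-likelihood over posterior samples, then lppd and p_WAIC from it.

library(rethinking)
data(cars)

m <- quap(
    alist(
        dist ~ dnorm( mu , sigma ),
        mu <- a + b*speed,
        a ~ dnorm( 0 , 100 ),
        b ~ dnorm( 0 , 10 ),
        sigma ~ dexp( 1 )
    ) , data = cars )

set.seed(94)
n_samples <- 1000
post <- extract.samples( m , n = n_samples )

# log-likelihood of each observation (rows) under each posterior sample (columns)
logprob <- sapply( 1:n_samples , function(s) {
    mu <- post$a[s] + post$b[s]*cars$speed
    dnorm( cars$dist , mu , post$sigma[s] , log = TRUE )
} )

n_cases <- nrow(cars)
# lppd: log of the average likelihood of each observation (log_sum_exp avoids underflow)
lppd <- sapply( 1:n_cases , function(i) log_sum_exp( logprob[i, ] ) - log(n_samples) )
# p_WAIC: variance of the log-likelihood of each observation across posterior samples
pWAIC <- sapply( 1:n_cases , function(i) var( logprob[i, ] ) )

-2*( sum(lppd) - sum(pWAIC) )   # WAIC; should be close to WAIC(m)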

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection means picking the single model with the best criterion value and discarding the rest.
#   We lose the information contained in the relative differences among the criterion values: if two models m1 and m2
#   perform about equally well but differ in structure, keeping only one essentially discards part of the evidence.
#
# Model comparison (and model averaging) instead retains all the models and includes uncertainty about which is correct.
#   The information criteria are used as relative weights of plausibility, and predictions from the different models can be mixed using those weights.
#   Averaging still blurs the differences among the models, but it does not pretend the discarded models never existed (see the compare() sketch below).
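
A small sketch (assuming the rethinking package and the cars data): compare() reports exactly the relative information that selecting a single model throws away, i.e. dWAIC, dSE, and the model weights.

library(rethinking)
data(cars)

# full model: distance as a linear function of speed
m1 <- quap(
    alist(
        dist ~ dnorm( mu , sigma ),
        mu <- a + b*speed,
        a ~ dnorm( 0 , 100 ),
        b ~ dnorm( 0 , 10 ),
        sigma ~ dexp( 1 )
    ) , data = cars )

# intercept-only model for contrast (start values just help the optimizer)
m0 <- quap(
    alist(
        dist ~ dnorm( mu , sigma ),
        mu <- a,
        a ~ dnorm( 0 , 100 ),
        sigma ~ dexp( 1 )
    ) , data = cars ,
    start = list( a = mean(cars$dist) , sigma = sd(cars$dist) ) )

compare( m0 , m1 )   # WAIC, dWAIC, dSE, and weight for each model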

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# All of these information criteria involve a sum over the observations (pointwise log-likelihoods or deviances).
# If the models are fit to different numbers of observations, the criterion values are not comparable: a model fit to fewer
# observations sums over fewer terms, so its criterion value will be lower and it will appear to perform better,
# regardless of its actual predictive accuracy.
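
A quick experiment, sketched here assuming the rethinking package and the cars data: fitting the same model to half the observations makes WAIC look much better, only because the sum runs over fewer terms.

library(rethinking)
data(cars)

fit_waic <- function(d) {
    m <- quap(
        alist(
            dist ~ dnorm( mu , sigma ),
            mu <- a + b*speed,
            a ~ dnorm( 0 , 100 ),
            b ~ dnorm( 0 , 10 ),
            sigma ~ dexp( 1 )
        ) , data = d )
    WAIC( m )
}

fit_waic( cars )           # all 50 observations
fit_waic( cars[1:25, ] )   # 25 observations -- WAIC is much lower, but the two values are not comparable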

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

# The effective number of parameters decreases, because the model becomes less flexible and more rigid.
# Regularizing (concentrated) priors constrain how much each parameter can respond to the sample, and since the
# effective number of parameters measures that flexibility, a more concentrated prior reduces it.
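
A sketch of such an experiment, again assuming the rethinking package and the cars data: compare the pWAIC penalty under a weak and a concentrated prior on the slope; the penalty should shrink as the prior tightens.

library(rethinking)
data(cars)

m_weak <- quap(
    alist(
        dist ~ dnorm( mu , sigma ),
        mu <- a + b*speed,
        a ~ dnorm( 0 , 100 ),
        b ~ dnorm( 0 , 10 ),     # weak, flat-ish prior on the slope
        sigma ~ dexp( 1 )
    ) , data = cars )

m_tight <- quap(
    alist(
        dist ~ dnorm( mu , sigma ),
        mu <- a + b*speed,
        a ~ dnorm( 0 , 100 ),
        b ~ dnorm( 0 , 0.1 ),    # concentrated (skeptical) prior on the slope
        sigma ~ dexp( 1 )
    ) , data = cars )

WAIC( m_weak )    # larger penalty (pWAIC) column
WAIC( m_tight )   # smaller penalty under the more concentrated prior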

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative priors reduce overfitting because more data is required to move the parameters away from their a priori
# most probable values. A pattern that appears only rarely in the sample therefore has little influence on the posterior,
# so the model does not chase idiosyncratic features of the data.
#
# One way to prevent a model from getting too excited by the training sample is to give it a skeptical prior. By "skeptical," we mean a prior that slows the rate of learning from the sample. The most common skeptical prior is a regularizing prior, which is applied to a beta-coefficient, a "slope" in a linear model. Such a prior, when tuned properly, reduces overfitting while still allowing the model to learn the regular features of a sample. If the prior is too skeptical, however, then regular features will be missed, resulting in underfitting. So the problem is really one of tuning. Even mild skepticism can help a model do better, and doing better is all we can really hope for in the large world, where no model or prior is optimal.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors result in underfitting because the effective number of parameters shrinks and the model has
# too little freedom to fit the data. If the regularization is too strong, more and more data are required to learn even
# the regular features, so the model underfits small samples. In effect, the model is overconfident in its prior parameter
# values, and it takes a great deal of evidence to move it even slightly from that starting point.