Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools can mislead you, so models must be designed by some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting: measure it with WAIC/PSIS and reduce it with regularization.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

#(1) Continuity: the measure of uncertainty should be continuous, so that a small change in the probabilities does not produce a massive change in uncertainty.

#(2) Monotonicity: the measure of uncertainty should be larger when there are more possible events.

#(3) Additivity: the uncertainty over combinations of events should be the sum of the separate uncertainties (see the sketch below).
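
A minimal sketch of the additivity criterion, using a hypothetical entropy() helper (not part of the rethinking package): for two independent events, the entropy of the joint distribution equals the sum of the two separate entropies.

# hypothetical helper: Shannon entropy of a probability vector (natural log)
entropy <- function(p) -sum( p*log(p) )

# two independent distributions: a weighted coin and a fair three-sided spinner
p_coin <- c( 0.7 , 0.3 )
p_spin <- c( 1/3 , 1/3 , 1/3 )

# joint distribution of the two independent events (outer product, flattened)
p_joint <- as.vector( outer( p_coin , p_spin ) )

# additivity: H(joint) equals H(coin) + H(spinner), roughly 1.709 in both cases
entropy( p_joint )
entropy( p_coin ) + entropy( p_spin )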

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- c( 0.7 , 0.3 )
-sum( p*log(p) )
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c( 0.2 , 0.25 , 0.25 , 0.3 )
-sum( p*log(p) )
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

#The other three sides show equally often.
p <- c( 1/3 , 1/3 , 1/3 )
-sum( p*log(p) )
## [1] 1.098612

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

#AIC stands for Akaike Information Criterion, which provides a simple estimate of the average out-of-sample deviance. It is reliable only when (1) the priors are flat or overwhelmed by the likelihood, (2) the posterior distribution is approximately multivariate Gaussian, and (3) the sample size N is much greater than the number of parameters k.

## AIC = D_train + 2p = -2*lppd + 2p, where p is the number of free parameters

#WAIC stands for Widely Applicable Information Criterion. It makes no assumption about the shape of the posterior and provides an approximation of the out-of-sample deviance that converges to the cross-validation approximation in a large sample, although the two can disagree in finite samples.

## WAIC(y, Θ) = -2 * ( lppd - Σ_i var_θ [ log p(y_i | θ) ] )

#WAIC is the most general of these criteria. To move from WAIC back to AIC, you must assume that the priors are flat or overwhelmed by the likelihood, that the posterior is approximately multivariate Gaussian, and that the sample size is much larger than the number of parameters. DIC and WAIC tend to agree when the posterior predictive mean properly represents the posterior predictive distribution, and DIC and AIC tend to agree when priors are effectively flat or overwhelmed by the amount of data.
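
To make the WAIC formula concrete, here is a hedged sketch that computes lppd and the penalty term directly from a matrix of pointwise log-likelihoods (rows are posterior samples, columns are observations). The log-likelihood matrix below is simulated purely for illustration; in a real analysis it would come from a fitted model.

set.seed(7)

# illustrative log-likelihood matrix: 1000 posterior samples x 50 observations
n_samples <- 1000
n_obs <- 50
ll <- matrix( rnorm( n_samples*n_obs , mean=-1.2 , sd=0.3 ) ,
              nrow=n_samples , ncol=n_obs )

# log pointwise predictive density: log of the average likelihood per observation
# (log-sum-exp trick avoids underflow when averaging on the probability scale)
log_sum_exp <- function(x) max(x) + log( sum( exp( x - max(x) ) ) )
lppd <- sum( apply( ll , 2 , function(col) log_sum_exp(col) - log(n_samples) ) )

# penalty (effective number of parameters): variance of the pointwise
# log-likelihood across posterior samples, summed over observations
p_waic <- sum( apply( ll , 2 , var ) )

# WAIC on the deviance scale
-2 * ( lppd - p_waic )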

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

#Model selection is a very common use of cross-validation and information criteria: rank the candidate models by the criterion and keep only the highest-ranked model.

#Under model selection, the information about relative model accuracy contained in the differences among the CV/PSIS/WAIC values is discarded; model selection also tells us nothing about causal relationships among the variables.

#Model comparison is a more general approach that uses multiple models to understand how different variables influence predictions and, in combination with a causal model, uses implied conditional independencies among variables to help infer causal relationships.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

#All models must be fit to exactly the same observations when being compared, because deviance and the information criteria are sums over observations. When a model is fit to fewer observations, its deviance and information criterion values are mechanically smaller, so it appears to perform better even if it is not a better model.
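
A quick experiment with a plain Gaussian regression fit by lm(): because the deviance (-2 times the log-likelihood) is a sum over observations, the same model fit to fewer observations produces a smaller deviance regardless of its real predictive accuracy. This is a sketch with simulated data, so the exact numbers will vary.

set.seed(7)

# simulate a simple linear relationship
N <- 100
x <- rnorm(N)
y <- 1 + 2*x + rnorm(N)
d <- data.frame( x=x , y=y )

# same model, fit once to all 100 rows and once to only the first 50 rows
m_full <- lm( y ~ x , data=d )
m_half <- lm( y ~ x , data=d[1:50,] )

# deviance = -2 * log-likelihood; the smaller sample looks "better"
# only because fewer terms are being summed
-2 * logLik(m_full)
-2 * logLik(m_half)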

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

#When the prior becomes more concentrated, the effective number of parameters decreases, because the model becomes less flexible in fitting the sample: the posterior is constrained by the prior, so the pointwise log-likelihood varies less across posterior samples and the WAIC/PSIS penalty shrinks.
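
A sketch of the suggested experiment, assuming the rethinking package is available: fit the same linear regression under a wide and a very concentrated prior on the slope and compare the WAIC penalty term (the effective number of parameters, pWAIC).

library(rethinking)
set.seed(7)

# simulate data for a simple linear regression
N <- 50
x <- rnorm(N)
y <- 1 + 0.5*x + rnorm(N)
d <- data.frame( x=x , y=y )

# same likelihood, wide prior on the slope
m_wide <- quap(
  alist(
    y ~ dnorm( mu , sigma ) ,
    mu <- a + b*x ,
    a ~ dnorm( 0 , 10 ) ,
    b ~ dnorm( 0 , 10 ) ,
    sigma ~ dexp( 1 )
  ) , data=d )

# same likelihood, very concentrated prior on the slope
m_tight <- quap(
  alist(
    y ~ dnorm( mu , sigma ) ,
    mu <- a + b*x ,
    a ~ dnorm( 0 , 10 ) ,
    b ~ dnorm( 0 , 0.1 ) ,
    sigma ~ dexp( 1 )
  ) , data=d )

# the penalty column of WAIC() is the effective number of parameters;
# it should be smaller for the model with the more concentrated prior
WAIC(m_wide)
WAIC(m_tight)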

7M5. Provide an informal explanation of why informative priors reduce overfitting.

#Overfitting occurs when a model learns too much from the sample. Informative priors reduce overfitting because they reduce the model's sensitivity to the sample. Flat or weakly informative priors allow overfitting because irregularities in the sample are mistaken for recurring features.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

#If the prior is too skeptical (overly informative), then regular features will be missed, resulting in underfitting.