Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.


7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

# The three motivating criteria that define information entropy are:
# 1) Additivity - the uncertainty over a combination of events should equal the sum of the uncertainties of the separate events.
# 2) Sensitivity to the size of the possibility space - uncertainty should increase as the number of possible outcomes increases.
# 3) Continuity - the measure should be continuous, so that a small change in any probability produces only a small change in uncertainty.

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- c(0.7, 1 - 0.7)     # probabilities of heads and tails
E <- -sum(p * log(p))    # entropy in nats (natural logarithm)
E

# Entropy = 0.610864302

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c(0.20, 0.25, 0.25, 0.30)   # probabilities of sides 1 to 4
E <- -sum(p * log(p))            # entropy in nats
E

# Entropy = 1.37622660

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

p <- c(1/3, 1/3, 1/3)    # side 4 has probability 0 and contributes nothing, so it is left out
E <- -sum(p * log(p))    # entropy in nats
E

# Entropy = 1.098612288, which equals log(3)
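The three entropy calculations above repeat the same formula, so they can be wrapped in one small function. The sketch below is only an illustration: the name `entropy` is a hypothetical helper, not something from the rethinking package, and it assumes the usual convention that an event with probability 0 contributes nothing to the sum.

```r
# Hypothetical helper (an assumption for illustration): Shannon entropy in nats.
entropy <- function(p) {
  # Probabilities must be non-negative and sum to 1.
  stopifnot(all(p >= 0), isTRUE(all.equal(sum(p), 1)))
  p <- p[p > 0]  # drop impossible events; 0 * log(0) is treated as 0
  -sum(p * log(p))
}

entropy(c(0.7, 0.3))                 # weighted coin (7E2)
entropy(c(0.20, 0.25, 0.25, 0.30))   # loaded die (7E3)
entropy(c(1/3, 1/3, 1/3, 0))         # die that never shows "4" (7E4)
```

Passing the zero explicitly in the last call makes it visible that the impossible side does not change the result.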

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# AIC stands for Akaike Information Criterion. WAIC stands for Watanabe-Akaike Information Criterion, also called the Widely Applicable Information Criterion.

# AIC estimates the out-of-sample deviance (and so the relative K-L divergence) as AIC = D_train + 2p = -2 lppd + 2p, where p is the number of free parameters.

# WAIC is the more general criterion and makes no assumption about the shape of the posterior: WAIC(y, \Theta) = -2 ( lppd - \sum_i var_\theta \log p(y_i | \theta) ), where the sum is the penalty term (the effective number of parameters, pWAIC).

# WAIC and AIC approximately agree when (1) the priors are flat or overwhelmed by the likelihood, (2) the posterior is approximately multivariate Gaussian, and (3) the sample size N is much greater than the number of parameters.
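To make the two definitions concrete, here is a minimal base R sketch that computes WAIC and AIC by hand for a toy model. Everything in it is an assumption chosen for illustration: the data are simulated, sigma is treated as known and equal to 1, and the prior on the mean is flat so the posterior can be sampled directly without the rethinking package.

```r
set.seed(7)
n <- 50
y <- rnorm(n, mean = 0.5, sd = 1)   # simulated data (assumption)

# Assumption: sigma = 1 is known and the prior on mu is flat, so the
# posterior of mu is Normal(mean(y), 1 / sqrt(n)).
mu_samples <- rnorm(4000, mean = mean(y), sd = 1 / sqrt(n))

# Log-likelihood matrix: rows are observations, columns are posterior samples.
loglik <- sapply(mu_samples, function(mu) dnorm(y, mean = mu, sd = 1, log = TRUE))

# lppd: log of the average likelihood of each observation, summed over observations.
log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))
lppd <- sum(apply(loglik, 1, log_mean_exp))

# pWAIC: variance of the log-likelihood of each observation, summed over observations.
p_waic <- sum(apply(loglik, 1, var))

waic <- -2 * (lppd - p_waic)

# AIC uses the log-likelihood at the maximum likelihood estimate and one free parameter.
aic <- -2 * sum(dnorm(y, mean = mean(y), sd = 1, log = TRUE)) + 2 * 1

c(WAIC = waic, AIC = aic, pWAIC = p_waic)
```

With a flat prior, a Gaussian posterior, and many more observations than parameters, the two criteria should come out close to each other and pWAIC should be near the single free parameter.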

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection means choosing the single model with the lowest criterion value (for example WAIC or PSIS) and discarding the others.

# Model comparison, on the other hand, uses several models together to understand how different variables influence predictions and, in combination with a causal model, to help infer causal relationships.

# Under model selection we lose the information contained in the differences among the criterion values (and their standard errors), which indicates the relative accuracy of the models. Selecting on predictive accuracy also says nothing about causal inference.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# All models must be fit to exactly the same observations because an information criterion is built from the deviance, which is a sum over observations. If models were fit to different numbers of observations, the model fit to more observations would usually have a larger deviance and so look worse, simply because it is asked to predict more, while the model fit to fewer observations would look better.

# The differences between criterion values would then reflect the number of observations rather than model accuracy, so the comparison is only meaningful when every model is fit to the same observations.
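A quick experiment can show the effect of fitting to different numbers of observations. The sketch below is an assumption-laden stand-in: it uses base R's `lm` and `AIC` on the built-in `cars` data instead of `quap` and `WAIC` from the rethinking package, on the reasoning that both criteria are sums over observations and so should react to sample size in the same direction.

```r
# Same model formula, same data source, different numbers of observations.
data(cars)
fit_all  <- lm(dist ~ speed, data = cars)           # all 50 rows
fit_half <- lm(dist ~ speed, data = cars[1:25, ])   # only the first 25 rows

# AIC is built from a sum over observations, so the model fit to fewer
# rows gets a smaller value even though the model itself is unchanged.
aic_values <- c(all_rows = AIC(fit_all), half_rows = AIC(fit_half))
aic_values
```

The smaller value for the half-sized fit does not mean that model predicts better; it was only asked to predict fewer points.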

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

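# As a prior becomes more concentrated, the effective number of parameters (pWAIC, or the PSIS penalty) decreases. The penalty is the variance of the log-likelihood of each observation across the posterior. A more concentrated prior constrains the posterior, so the log-likelihood varies less from sample to sample and the penalty shrinks; the model becomes less flexible.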
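The effect of prior concentration on the penalty term can be checked with a small experiment. This is a sketch under stated assumptions: the data are simulated, sigma is fixed at 1, and the prior on the mean is Normal(0, tau), so the posterior is available in closed form and no model-fitting package is needed. The function name `p_waic_for_prior` is hypothetical.

```r
set.seed(11)
n <- 20
y <- rnorm(n, mean = 0, sd = 1)   # simulated data (assumption)
z <- rnorm(4000)                  # shared standard-normal draws, reused for every prior

# Hypothetical helper: pWAIC for y ~ Normal(mu, 1) with prior mu ~ Normal(0, tau).
p_waic_for_prior <- function(tau) {
  post_var  <- 1 / (n + 1 / tau^2)        # conjugate posterior variance of mu
  post_mean <- post_var * sum(y)          # conjugate posterior mean of mu
  mu <- post_mean + sqrt(post_var) * z    # posterior samples of mu
  loglik <- sapply(mu, function(m) dnorm(y, mean = m, sd = 1, log = TRUE))
  sum(apply(loglik, 1, var))              # variance of log-likelihood, summed over observations
}

prior_sd <- c(10, 1, 0.1, 0.01)
p_waic <- sapply(prior_sd, p_waic_for_prior)
data.frame(prior_sd, p_waic)
```

Reusing the same draws `z` for every prior keeps simulation noise from obscuring the trend as the prior standard deviation shrinks.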

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative priors constrain the plausible parameter values and so reduce the flexibility of the model. The model is then less able to assign high posterior probability to extreme parameter values that encode irregular features of the sample, so it learns less from noise and overfits less.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors constrain the model too much: parameter values supported by the data receive little posterior probability because the prior has effectively ruled them out. The model then fails to learn even the regular features of the sample, which is underfitting.