7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

# The measure of uncertainty should be continuous in the probabilities.
# The measure of uncertainty should increase as the number of possible events increases.
# The measure of uncertainty should be additive: the uncertainty of a combination of events should be the sum of the separate uncertainties.
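
# These three criteria are satisfied (uniquely, up to a scaling constant) by
# information entropy, H(p) = -sum(p_i * log(p_i)), the quantity computed in
# 7E2-7E4 below. For example, a fair coin:
p <- c(0.5, 0.5)
-sum(p * log(p))
## [1] 0.6931472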

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

# Probabilities of heads and tails for the weighted coin
p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

# Probabilities of the four sides of the loaded die
p <- c(0.2, 0.25, 0.25, 0.3)
(H <- -sum(p * log(p)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

# "4" never appears; an event with probability 0 contributes nothing to the
# entropy (p * log(p) -> 0), so only the three equally likely sides enter the sum.
p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# AIC = Dtrain + 2k, where Dtrain is the in-sample training deviance and k is the number of free parameters.
# WAIC = -2(lppd - pWAIC), where lppd is the log pointwise predictive density and pWAIC is the WAIC penalty term: the summed variance of the pointwise log-likelihood across posterior samples.
# Both criteria have the same two components: a measure of in-sample fit and a penalty that estimates the overfitting risk (the raw parameter count for AIC, an effective number of parameters for WAIC).
# WAIC is the more general criterion.
# WAIC reduces to AIC when the priors are flat (or overwhelmed by the likelihood), the posterior is approximately multivariate Gaussian, and the sample size N is much greater than the number of parameters k.
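
# A hedged sketch of the WAIC definition in code, assuming a hypothetical
# S x N matrix `ll` of pointwise log-likelihoods (S posterior samples, N observations):
waic_by_hand <- function(ll) {
  lppd  <- sum(log(colMeans(exp(ll))))  # log pointwise predictive density
  pWAIC <- sum(apply(ll, 2, var))       # penalty: summed variance of the log-likelihood
  -2 * (lppd - pWAIC)
}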

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection means keeping only the model with the lowest information-criterion value and discarding all models with higher values (p. 195); model comparison keeps several models and uses the differences among them.
# Under selection we lose the information about relative predictive accuracy, which matters most when the selected model outperforms its alternatives only by a small margin.
# Model comparison is not causal inference: it uses multiple competing models to understand the data and leverages all of them in prediction, so it loses no information on its own. We should therefore avoid model selection and practice model comparison.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# Information criteria are based on deviance, and deviance is a sum over observations, not an average. A model fit to more observations accumulates more terms, so its criterion value is larger regardless of its accuracy, and comparing models fit to different numbers of observations is therefore not a fair comparison.
# In general, information-criterion values increase roughly in proportion to the number of observations, so the model fit to fewer observations looks artificially better.
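
# A quick experiment (base R, illustrative only): the same linear regression fit
# to increasingly many observations of one simulated data set has a growing AIC,
# simply because deviance is a sum over observations.
set.seed(7)
x <- rnorm(1000)
y <- 2 + 1.5 * x + rnorm(1000)
sapply(c(100, 500, 1000), function(n) AIC(lm(y[1:n] ~ x[1:n])))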

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

# The effective number of parameters decreases as the prior becomes more concentrated, for both WAIC and PSIS. For WAIC, a tighter prior constrains the posterior, so the pointwise log-likelihoods vary less across posterior samples and the penalty pWAIC (the summed variance of those log-likelihoods) shrinks. PSIS behaves similarly: with a concentrated prior, no single observation can move the posterior very much, so each point is easier to "leave out" and the estimated penalty is smaller.
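
# A sketch of the experiment, assuming the rethinking package: the same Gaussian
# model fit with a diffuse versus a concentrated prior on mu; the WAIC penalty
# (effective number of parameters) should be smaller under the narrow prior.
library(rethinking)
set.seed(7)
d <- data.frame(y = rnorm(50, 1, 1))
m_wide   <- quap(alist(y ~ dnorm(mu, 1), mu ~ dnorm(0, 10)),  data = d)
m_narrow <- quap(alist(y ~ dnorm(mu, 1), mu ~ dnorm(0, 0.1)), data = d)
WAIC(m_wide)    # larger penalty term
WAIC(m_narrow)  # smaller penalty term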

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Overfitting happens when the priors are too weak: the model is then free to chase idiosyncratic noise in the sample. Informative priors reduce overfitting because they constrain the model's flexibility, making it skeptical of extreme parameter values unless the data strongly support them.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors constrain the model too much, so the "correct" parameter values are less likely to be assigned high posterior probability and the model cannot learn the regular features of the sample, which results in underfitting.