7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

#Information entropy is motivated by three criteria. A measure of uncertainty should:
#1) be continuous, so that a small change in any probability produces only a small change in uncertainty
#2) increase with the size of the possibility space, so that more possible outcomes mean more uncertainty
#3) be additive, so that the uncertainty over combinations of independent events equals the sum of their separate uncertainties, however the events are divided

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

# The "4" side has probability 0 and is dropped, since 0 * log(0) is taken to be 0
p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
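
#The three entropy calculations above repeat the same expression. A minimal sketch of a reusable helper follows; the name entropy() is an assumption for illustration and is not part of the exercises. It drops zero-probability outcomes so that a vector such as c(1/3, 1/3, 1/3, 0) does not produce NaN.

```r
# Hypothetical helper (not from the exercises): Shannon entropy in nats
entropy <- function(p) {
  # probabilities must sum to 1
  stopifnot(abs(sum(p) - 1) < 1e-8)
  # drop impossible outcomes, because 0 * log(0) is treated as 0
  p <- p[p > 0]
  -sum(p * log(p))
}

H_coin   <- entropy(c(0.7, 0.3))
H_die    <- entropy(c(0.20, 0.25, 0.25, 0.30))
H_no_four <- entropy(c(1/3, 1/3, 1/3, 0))
c(coin = H_coin, die = H_die, no_four = H_no_four)
```

#Passing the full four-sided distribution for 7E4, including the zero, gives the same value as the three-element vector used above, namely log(3).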

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

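#AIC is defined as D_train + 2p, where
#D_train is the in-sample training deviance, and
#p is the number of free parameters estimated in the model.

#WAIC is defined as -2(lppd - pWAIC), where
#lppd is the sum over observations of log Pr(y_i), with Pr(y_i) the likelihood of observation i averaged over the posterior, and
#pWAIC is the sum over observations of V(y_i), the variance in log-likelihood for observation i across the posterior.

#WAIC is the more general criterion. AIC approximates it when
#1) the priors are flat or overwhelmed by the likelihood,
#2) the posterior distribution is approximately multivariate Gaussian, and
#3) the sample size is much greater than the number of parameters.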
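
#To make the AIC definition concrete, here is a minimal sketch on simulated data (the data and variable names are assumptions for illustration). It computes the deviance plus twice the parameter count by hand and checks it against stats::AIC(). The fit uses lm(), i.e. maximum likelihood, which corresponds to the flat-prior setting in which AIC is appropriate.

```r
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# number of free parameters: intercept, slope, and residual sd
k <- attr(logLik(fit), "df")

# training deviance is -2 times the log-likelihood at the estimates
d_train <- -2 * as.numeric(logLik(fit))

aic_manual <- d_train + 2 * k
aic_builtin <- AIC(fit)
c(manual = aic_manual, builtin = aic_builtin)
```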

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

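#In model selection, we pick the one model with the best information criterion value and discard the rest.
#In model comparison, we keep several models and use the differences in their criterion values (e.g. PSIS or WAIC), together with the uncertainty in those differences, to understand how different variables influence predictions.
#Model selection loses the information about relative model accuracy contained in those differences, which tells us how confident we can be in the top-ranked model.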

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

#Information criteria are based on deviance, which is accumulated over observations without being divided by the number of observations. It is a sum, not an average. So, all else being equal, a model fit to more observations will have a higher deviance and thus appear less accurate according to an information criterion. Comparing models fit to different numbers of observations is therefore unfair: the difference in criterion values reflects sample size rather than model quality.
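
#A quick experiment along these lines, as a sketch on simulated data (the data-generating process is an assumption for illustration). The same linear model is fit to the first 100 rows and to all 200 rows. AIC is used here as a base-R stand-in for WAIC or PSIS, since all of them sum over observations.

```r
set.seed(42)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
d <- data.frame(x = x, y = y)

# identical model structure, different numbers of observations
fit_small <- lm(y ~ x, data = d[1:100, ])
fit_large <- lm(y ~ x, data = d)

aic_small <- AIC(fit_small)
aic_large <- AIC(fit_large)
c(n100 = aic_small, n200 = aic_large)
```

#The model fit to 200 observations gets the larger (worse) value even though it is the same model, which is why the observations must match before comparing.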

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

#As the prior becomes more concentrated, the effective number of parameters decreases. In the case of WAIC, this is because pWAIC is the sum of the variances in log-likelihood for each observation in the training sample, taken across the posterior. A more concentrated prior gives a more concentrated posterior, so the log-likelihood of each observation varies less across posterior samples and pWAIC shrinks.
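
#A small experiment, as a sketch under simplifying assumptions: observations are Normal(mu, 1) with known sd, and the prior is mu ~ Normal(0, prior_sd), so the posterior is available in closed form and no model-fitting package is needed. The function name p_waic() and the simulated data are assumptions for illustration.

```r
set.seed(7)
y <- rnorm(20, mean = 0.5, sd = 1)

# Hypothetical helper: pWAIC for y ~ Normal(mu, 1), mu ~ Normal(0, prior_sd)
p_waic <- function(y, prior_sd, n_samples = 1e4) {
  n <- length(y)
  # conjugate Normal posterior for mu (likelihood sd fixed at 1, prior mean 0)
  post_prec <- n + 1 / prior_sd^2
  post_mean <- sum(y) / post_prec
  mu <- rnorm(n_samples, post_mean, sqrt(1 / post_prec))
  # log-likelihood matrix: rows are observations, columns are posterior samples
  ll <- sapply(mu, function(m) dnorm(y, mean = m, sd = 1, log = TRUE))
  # variance across samples for each observation, summed over observations
  sum(apply(ll, 1, var))
}

p_wide  <- p_waic(y, prior_sd = 10)
p_tight <- p_waic(y, prior_sd = 0.1)
c(wide = p_wide, tight = p_tight)
```

#With the wide prior the penalty should come out close to 1, matching the single free parameter; with the tight prior it should be clearly smaller.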

7M5. Provide an informal explanation of why informative priors reduce overfitting.

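#Informative priors concentrate prior plausibility on a narrower range of parameter values, so the data must provide stronger evidence to move the posterior far from that range. The model therefore learns less from the idiosyncratic noise of the sample, which limits the degree to which it can overfit.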

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

#Overly informative priors constrain the flexibility of the model too much. They make it unlikely for the "correct" parameter values to be assigned high posterior probability even when the data support them, so the model fails to learn the regular features of the sample.
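
#The two explanations above can be illustrated with a sketch under simplifying assumptions: five observations from Normal(2, 1) with known sd, and a Normal(0, prior_sd) prior on the mean, so the posterior mean has a closed form. The function post_mean() and the three prior widths are assumptions for illustration.

```r
set.seed(3)
y <- rnorm(5, mean = 2, sd = 1)  # true mean is 2

# posterior mean of mu for y ~ Normal(mu, 1), mu ~ Normal(0, prior_sd)
post_mean <- function(y, prior_sd) sum(y) / (length(y) + 1 / prior_sd^2)

prior_sds <- c(flat = 100, regularizing = 1, too_tight = 0.05)
estimates <- sapply(prior_sds, post_mean, y = y)
estimates
```

#The nearly flat prior reproduces the sample mean, noise included. The regularizing prior shrinks the estimate toward zero somewhat. The overly tight prior pulls it almost all the way to zero and largely ignores the data.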