7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
#A measure of uncertainty (information entropy) should:
#1) be continuous, so that a small change in any of the probabilities produces only a small change in the measure
#2) increase with the number of possible events, so that the value scales with the size of the possibility space
#3) be additive, so that the uncertainty over combinations of independent events is the sum of the separate uncertainties, no matter how the events are divided
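#The measure satisfying all three criteria (uniquely, up to a scaling constant) is the
#information entropy H(p) = -sum(p_i * log(p_i)). A quick sketch of criterion 3: for two
#independent events, the entropy of the joint distribution equals the sum of the entropies.
H <- function(p) -sum(p * log(p))
p1 <- c(0.3, 0.7)
p2 <- c(0.4, 0.6)
joint <- as.vector(outer(p1, p2))  # the four joint probabilities p1_i * p2_j
H(joint)
## [1] 1.283876
H(p1) + H(p2)
## [1] 1.283876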
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643
#The entropy of this coin is about 0.61 nats.
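#For comparison, a fair coin attains the maximum two-outcome entropy, log(2):
p_fair <- c(0.5, 0.5)
-sum(p_fair * log(p_fair))
## [1] 0.6931472
#The weighting makes the outcome more predictable, so the entropy is lower.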
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227
#The entropy of this die is about 1.38 nats.
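#For comparison, a fair four-sided die attains the maximum entropy log(4):
p_fair <- rep(0.25, 4)
-sum(p_fair * log(p_fair))
## [1] 1.386294
#The loading is mild, so the entropy sits only slightly below the maximum.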
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
#The entropy of this die is about 1.10 nats, i.e. exactly log(3). The impossible outcome contributes nothing, because p * log(p) -> 0 as p -> 0, so this is just the entropy of a fair three-sided die.
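#Note that naively including the zero probability in R yields NaN, because 0 * log(0)
#evaluates to 0 * -Inf; the zero term has to be dropped or handled explicitly:
p4 <- c(1/3, 1/3, 1/3, 0)
-sum(p4 * log(p4))
## [1] NaN
-sum(ifelse(p4 == 0, 0, p4 * log(p4)))  # treats 0 * log(0) as its limit, 0
## [1] 1.098612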
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
#AIC is defined as AIC = D_train + 2p, where D_train is the in-sample training deviance and p is the number of free parameters estimated in the model.
#WAIC is defined as WAIC = -2(lppd - p_WAIC) = -2( sum_{i=1}^{N} log Pr(y_i) - sum_{i=1}^{N} V(y_i) ), where Pr(y_i) is the average likelihood of observation i in the training sample and V(y_i) is the variance in log-likelihood for observation i in the training sample.
#WAIC is the most general. To move from WAIC to AIC, we must assume that the priors are flat or overwhelmed by the likelihood, that the posterior distribution is approximately multivariate Gaussian, and that the sample size N is much greater than the number of parameters.
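#A minimal sketch of the WAIC pieces, assuming a toy normal model with known sigma = 1
#and a flat prior on mu, whose conjugate posterior we can sample directly (no packages):
set.seed(7)
y <- rnorm(20, mean = 1, sd = 1)
mu <- rnorm(1e4, mean(y), 1 / sqrt(20))  # posterior samples for mu
log_lik <- sapply(y, function(yi) dnorm(yi, mu, 1, log = TRUE))  # S x N matrix
log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))  # stable log-mean-exp helper
lppd <- sum(apply(log_lik, 2, function(ll) log_sum_exp(ll) - log(length(ll))))
p_waic <- sum(apply(log_lik, 2, var))  # summed variance of the pointwise log-likelihood
(WAIC <- -2 * (lppd - p_waic))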
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
#In model selection, we use an information criterion to pick the single best-scoring model and discard the rest. In model comparison, we instead retain all of the models and use the differences among their criterion values to understand their relative expected predictive accuracy. (A related practice, model averaging, uses DIC or WAIC to build a posterior predictive distribution that combines all models.) Model selection loses the information about relative model accuracy contained in the differences among information criterion values; this is especially problematic when the selected model outperforms its alternatives only by a small margin.
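#A small sketch of turning criterion differences into relative weights, assuming
#hypothetical WAIC values for three candidate models:
waic <- c(m1 = 210.3, m2 = 212.1, m3 = 225.8)  # hypothetical values
d <- waic - min(waic)                          # differences from the best model
exp(-0.5 * d) / sum(exp(-0.5 * d))             # relative weights carried by the differences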
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
#Information criteria are based on deviance, which is summed over observations rather than averaged. So, all else being equal, a model fit to more observations accrues a larger deviance and therefore looks worse according to the information criterion, regardless of its actual predictive accuracy. Contrasting models fit to different numbers of observations is therefore an unfair comparison.
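#A quick experiment, assuming an ordinary linear regression fit with lm(): the same
#true model fit to more observations accumulates a larger deviance, so AIC grows with N:
set.seed(7)
aic_at_n <- function(n) {
  x <- rnorm(n)
  y <- 2 + 3 * x + rnorm(n)
  AIC(lm(y ~ x))
}
sapply(c(50, 500, 5000), aic_at_n)  # AIC increases roughly in proportion to N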
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
#As the prior becomes more concentrated, the effective number of parameters decreases. In the case of WAIC, this is because p_WAIC is the summed variance of the log-likelihood of each observation in the training sample across the posterior; a more concentrated prior yields a more concentrated posterior, so the pointwise log-likelihoods vary less and the penalty shrinks. The PSIS penalty behaves analogously.
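#A small experiment, assuming a normal model with known sigma = 1 and a Normal(0, tau)
#prior on mu; the posterior is conjugate, so we can sample it directly and compute p_WAIC:
set.seed(7)
y <- rnorm(30, mean = 1, sd = 1)
p_waic <- function(tau) {
  post_var <- 1 / (1 / tau^2 + length(y))      # conjugate posterior variance
  post_mean <- post_var * sum(y)               # conjugate posterior mean (prior mean 0)
  mu <- rnorm(1e4, post_mean, sqrt(post_var))  # posterior samples
  log_lik <- sapply(y, function(yi) dnorm(yi, mu, 1, log = TRUE))
  sum(apply(log_lik, 2, var))
}
sapply(c(10, 1, 0.1), p_waic)  # a tighter prior (smaller tau) gives a smaller penalty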
7M5. Provide an informal explanation of why informative priors reduce overfitting.
#An informative prior restricts the range of parameter values that the posterior will treat as plausible. Because the extreme parameter values needed to encode noise in the sample receive low prior probability, the degree to which the model can overfit the data is limited.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
#Overly informative priors constrain the flexibility of the model too much: even parameter values that are well supported by the data receive so little prior probability that the posterior cannot move far enough toward them. The model then fails to capture the regular features of the sample, i.e. it underfits.
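#A small sketch, reusing the conjugate normal setup assumed in 7M4: an overly tight
#Normal(0, tau) prior on mu drags the posterior mean toward 0, away from the true value 1:
set.seed(7)
y <- rnorm(30, mean = 1, sd = 1)
post_mean <- function(tau) sum(y) / (1 / tau^2 + length(y))
sapply(c(10, 1, 0.1), post_mean)  # tau = 0.1 pulls the estimate far below 1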