7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
For the motivating criteria, see the textbook, pages 177–178.
Those motivating criteria describe a measure of uncertainty that should: (1) be continuous, so that small changes in the probabilities produce only small changes in the measure; (2) increase with the size of the possibility space, so that more possible outcomes mean more uncertainty; (3) be additive for independent events, so that it does not matter how the events are divided up.
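The additivity criterion is easy to check numerically. Below is a minimal sketch (my own illustration, not from the textbook): the entropy of the joint distribution of two independent events equals the sum of their individual entropies.

# two independent distributions (made-up probabilities for illustration)
p <- c(0.3, 0.7)
q <- c(0.25, 0.25, 0.5)
pq <- as.vector(outer(p, q))  # joint probabilities of all combinations

H <- function(x) -sum(x * log(x))
H(pq)        # entropy of the joint distribution...
H(p) + H(q)  # ...equals the sum of the two individual entropies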
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
Following page 178, the entropy of this coin is computed by plugging the appropriate probabilities into the definition:
$$H(p) = -\sum_{i=1}^{n} p_i \log(p_i) = -\big(p_H \log(p_H) + p_T \log(p_T)\big)$$
p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643
Therefore, the entropy of this coin is approximately 0.61.
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
Referring again to page 178, we plug in the probability of each face:
p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227
Therefore, the entropy of this die is 1.38.
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
By the same token, referring to pages 178–179, the die effectively has three equally likely outcomes:
p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
The entropy of the die is 1.10.
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
AIC is defined as $D_{\text{train}} + 2p$, where $D_{\text{train}}$ is the in-sample training deviance and $p$ is the number of free parameters estimated in the model.
DIC is defined as $\bar{D} + p_D = \bar{D} + (\bar{D} - \hat{D})$, where $\bar{D}$ is the average of the posterior distribution of the deviance and $\hat{D}$ is the deviance calculated at the posterior mean.
WAIC is defined as $-2(\text{lppd} - p_{\text{WAIC}}) = -2\left(\sum_{i=1}^{N} \log \Pr(y_i) - \sum_{i=1}^{N} V(y_i)\right)$, where $\Pr(y_i)$ is the average likelihood of observation $i$ in the training sample and $V(y_i)$ is the variance in the log-likelihood for observation $i$ in the training sample.
All three definitions involve two components: an estimate or analog of the in-sample training deviance ($D_{\text{train}}$ for AIC, $\bar{D}$ for DIC, and, up to the factor of $-2$, the lppd for WAIC) and an estimate or analog of the number of free parameters estimated in the model ($p$ for AIC, $p_D$ for DIC, and $p_{\text{WAIC}}$ for WAIC).
WAIC is the most general, followed by DIC, and finally AIC. To move from WAIC to DIC, we must assume that the posterior distribution is approximately multivariate Gaussian. To move from DIC to AIC, we must further assume that the priors are flat or overwhelmed by the likelihood.
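To make the WAIC pieces concrete, here is a minimal sketch (my own toy example, not from the exercises) that computes the lppd and pWAIC terms by hand for a simple Gaussian model, assuming the rethinking package and its map(), extract.samples(), and log_sum_exp() helpers:

library(rethinking)

# toy data: 50 draws from a normal distribution (made up for illustration)
set.seed(7)
y <- rnorm(50, mean = 1, sd = 2)

m <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = list(y = y)
)

# log-likelihood of every observation under every posterior sample
post <- extract.samples(m, n = 1e3)
ll <- sapply(seq_along(y), function(i) dnorm(y[i], post$mu, post$sigma, log = TRUE))

# lppd: log of the average likelihood, summed over observations
lppd <- sum(apply(ll, 2, function(x) log_sum_exp(x) - log(length(x))))
# pWAIC: variance of the log-likelihood, summed over observations
pWAIC <- sum(apply(ll, 2, var))

-2 * (lppd - pWAIC)  # should be close to WAIC(m)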
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
Model selection means retaining (for example, using and interpreting) only the model with the lowest information-criterion value and discarding all other models with higher values.
This approach loses the information about relative model accuracy contained in the differences among the information-criterion values; this is especially problematic when the selected model outperforms its alternatives by only a small margin.
Model averaging, in contrast, uses information criteria to construct a posterior predictive distribution (i.e., for forecasting) that carries forward the uncertainty across multiple models. This practice does not lose information on its own, although, when combined with undisclosed data dredging, it can still lead to spurious findings.
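As a small illustration of what model selection throws away, here is a minimal sketch (with made-up WAIC values, purely for illustration) of the Akaike weights that a comparison table reports but that selection discards:

# hypothetical WAIC values for three candidate models (made up for illustration)
waic <- c(m1 = 410.2, m2 = 412.9, m3 = 420.5)
dwaic <- waic - min(waic)

# Akaike weights: the relative support for each model that selection ignores
(w <- exp(-0.5 * dwaic) / sum(exp(-0.5 * dwaic)))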
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
As mentioned in previous questions, information criteria are based on deviance, which is accrued over observations without being divided by the number of observations. It is therefore a sum, not an average. So, all else being equal, a model fit to more observations will have a higher deviance and thus a larger (apparently worse) information-criterion value. Contrasting models fit to different numbers of observations is therefore an unfair comparison.
As an experiment, we can calculate WAIC for models fit to increasingly small subsamples of the same data. The information criteria should decrease with the sample size. To get a large sample to begin with, I return to the Howell1 data from the last chapter.
library(rethinking)
data(Howell1)
d <- Howell1[complete.cases(Howell1), ]
# random subsamples of decreasing size (no seed is set, so exact values vary by run)
d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
d_400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
d_300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
# the same log(weight) regression fit to each subsample
m_500 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_500,
start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_400,
start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_300,
start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))
## Warning in compare(m_500, m_400, m_300): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m_500 500
## m_400 400
## m_300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## WAIC SE dWAIC dSE pWAIC weight
## m_300 1862.175 27.91431 0.0000 NA 3.437154 1.000000e+00
## m_400 2419.639 33.94881 557.4644 46.34780 3.442395 8.874639e-122
## m_500 3054.604 35.32283 1192.4289 53.17046 3.257118 1.167790e-259
In fact, the WAIC increases from the N=300 model to the N=400 model and then to the N=500 model. The compare() function also returns a warning about the number of observations being different.
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
As the prior becomes more concentrated (for example, a regularizing prior), the effective number of parameters decreases. In the case of DIC, this is because pD is a measure of how flexible the model is; with more concentrated priors, the model becomes less flexible (and therefore less prone to overfitting). In the case of WAIC, this is because pWAIC is the variance in the log-likelihood for each observation in the training sample, summed over observations (see page 191 of the textbook); with more concentrated priors, the posterior, and hence the log-likelihood of each observation across posterior samples, becomes more concentrated as well, so this variance decreases.
As an experiment to demonstrate this, we can calculate WAIC for two models of the same data that only differ in the concentration of their priors.
d <- Howell1[complete.cases(Howell1), ]
# log-transform and standardize both variables
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
# vague priors on the intercept and slope
m_wide <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
# much more concentrated priors on the same parameters
m_narrow <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
WAIC(m_wide, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7631 55.62747 4.24594 36.5361
WAIC(m_narrow, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7579 55.62061 4.241636 36.44326
As shown in the penalty column, pWAIC decreases, if only slightly, as the priors become more concentrated. However, note that the regularizing priors must be chosen carefully (see question 7M6).
7M5. Provide an informal explanation of why informative priors reduce overfitting.
Informative priors reduce overfitting because they constrain the flexibility of the model; they make it less likely for extreme parameter values to be assigned high posterior probability. In simpler terms, informative priors reduce overfitting by forcing the model to learn less from the sample data.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
Overly informative priors result in underfitting because they constrain the flexibility of the model too much; they make it less likely for “correct” parameter values to be assigned high posterior probability. In simpler terms, overly informative priors result in underfitting by preventing the model from learning enough from the sample data.
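To make both explanations (7M5 and 7M6) concrete, here is a minimal sketch (my own toy simulation; the data, sample size, and prior widths are all made up for illustration), assuming the rethinking package. With a weak true effect, a tight prior on the slope shrinks the estimate toward zero and keeps the model from chasing noise in the sample; with a strong true effect, the same overly tight prior pulls the estimate well below the truth, i.e., the model underfits.

library(rethinking)
set.seed(7)

# toy data (made up for illustration): one predictor, weak vs. strong true slope
N <- 20
x <- rnorm(N)
y_weak <- rnorm(N, 0.2 * x, 1)    # weak true effect (slope = 0.2)
y_strong <- rnorm(N, 2.0 * x, 1)  # strong true effect (slope = 2.0)

# weak effect, vague prior on the slope: the estimate follows the sample closely
m_weak_vague <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = list(x = x, y = y_weak)
)

# weak effect, tight prior on the slope: the estimate is shrunk toward zero (7M5)
m_weak_tight <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 0.2),
    b ~ dnorm(0, 0.2),
    sigma ~ dunif(0, 5)
  ),
  data = list(x = x, y = y_weak)
)

# strong effect, same tight prior: the estimate sits well below the true slope (7M6)
m_strong_tight <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 0.2),
    b ~ dnorm(0, 0.2),
    sigma ~ dunif(0, 5)
  ),
  data = list(x = x, y = y_strong)
)

coef(m_weak_vague)["b"]    # close to whatever the noisy sample suggests
coef(m_weak_tight)["b"]    # shrunk toward the prior mean of 0
coef(m_strong_tight)["b"]  # noticeably smaller than the true slope of 2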