Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that answers the question or completes the activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

In my own words, information entropy is a measure of uncertainty that should be:

  1. continuous, so that a small change in any of the probabilities produces only a small change in uncertainty. There should be no sudden jumps.

  2. increasing in the number of possible events. If only one event is likely to happen and all other events are unlikely, then there is little uncertainty about what comes next. If many events are equally likely to happen, there is much more uncertainty about what will happen.

  3. additive for independent events, so that how the events are divided does not matter. The uncertainty of the distribution over combinations of independent events is the same as the sum of the uncertainties of the separate distributions.
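The three criteria (continuity, growth with the number of events, and additivity) can be checked numerically. The following is a minimal sketch in base R; the helper name `entropy` and the example probabilities are assumptions chosen for illustration.

```r
# Information entropy in nats; outcomes with p = 0 contribute nothing.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Continuity: a small change in the probabilities gives a small change in entropy.
coin <- c(0.7, 0.3)
h_shift <- entropy(c(0.701, 0.299)) - entropy(coin)

# Growth: with equally likely events, entropy rises with the number of events.
h_by_n <- sapply(2:5, function(n) entropy(rep(1 / n, n)))

# Additivity: for independent events, the joint entropy is the sum of the parts.
die <- c(0.20, 0.25, 0.25, 0.30)
joint <- as.vector(outer(coin, die))  # joint probabilities under independence
h_joint <- entropy(joint)
h_sum <- entropy(coin) + entropy(die)
```

Comparing `h_joint` with `h_sum`, and inspecting `h_by_n` and `h_shift`, shows each property for these particular inputs.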

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

Based on the probabilities: $H(p) = -\sum_{i=1}^{n} p_i \log(p_i) = -(p_H \log(p_H) + p_T \log(p_T))$

p <- c(0.7, 1 - 0.7)

(H <- -sum(p * log(p)))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

Based on the probabilities: $H(p) = -\sum_{i=1}^{4} p_i \log(p_i) = -(p_1 \log(p_1) + p_2 \log(p_2) + p_3 \log(p_3) + p_4 \log(p_4))$

p <- c(0.20, 0.25, 0.25, 0.30)

(H <- -sum(p * log(p)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
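Dropping the side that never appears relies on the convention that an impossible event contributes nothing to entropy. The sketch below, which assumes a hypothetical helper named `entropy`, makes that convention explicit so that all four sides can be passed in directly.

```r
# Entropy in nats, treating 0 * log(0) as 0 (impossible events add no uncertainty).
entropy <- function(p) {
  stopifnot(all(p >= 0), isTRUE(all.equal(sum(p), 1)))
  nonzero <- p[p > 0]
  -sum(nonzero * log(nonzero))
}

p_four <- c(1/3, 1/3, 1/3, 0)   # all four sides, with side 4 never showing
h_four <- entropy(p_four)
h_three <- entropy(c(1/3, 1/3, 1/3))

# Without the convention, R evaluates 0 * log(0) as 0 * -Inf, which is NaN.
h_naive <- -sum(p_four * log(p_four))
```

Both `h_four` and `h_three` equal `log(3)`, the entropy of three equally likely outcomes, while `h_naive` is `NaN`.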

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

AIC: defined as $D_{\text{train}} + 2p$, where $D_{\text{train}}$ is the in-sample training deviance and $p$ is the number of free parameters estimated in the model.

DIC: defined as $\bar{D} + p_D = \bar{D} + (\bar{D} - \hat{D})$, where $\bar{D}$ is the average of the posterior distribution of deviance and $\hat{D}$ is the deviance calculated at the posterior mean.

WAIC: defined as $-2(\text{lppd} - p_{\text{WAIC}}) = -2\left(\sum_{i=1}^{N} \log \Pr(y_i) - \sum_{i=1}^{N} V(y_i)\right)$, where $\Pr(y_i)$ is the average likelihood of observation $i$ in the training sample and $V(y_i)$ is the variance in log-likelihood for observation $i$ in the training sample.

All three definitions involve two components: (1) an estimate or analog of the in-sample training deviance (i.e., $D_{\text{train}}$ for AIC, $\bar{D}$ for DIC, and lppd for WAIC) and (2) an estimate or analog of the number of free parameters estimated in the model (i.e., $p$ in AIC, $p_D$ in DIC, and $p_{\text{WAIC}}$ in WAIC).

WAIC is the most general, followed by DIC, and finally AIC. To move from WAIC to DIC, we must assume that the posterior distribution is approximately multivariate Gaussian. To move from DIC to AIC, we must further assume that the priors are flat or overwhelmed by the likelihood.
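To see how the two components of WAIC are computed, and how close WAIC comes to AIC when the simplifying assumptions hold, here is a minimal sketch in base R. It assumes simulated data, a normal model with known sigma, and a nearly flat conjugate prior on the mean; none of these choices come from the exercise.

```r
set.seed(7)
n <- 50
sigma <- 1                               # treated as known to keep the sketch short
y <- rnorm(n, mean = 0.5, sd = sigma)    # simulated data (an assumption)

# Conjugate posterior for mu under a Normal(0, 10) prior
prior_sd <- 10
post_var <- 1 / (1 / prior_sd^2 + n / sigma^2)
post_mean <- post_var * sum(y) / sigma^2
mu_draws <- rnorm(4000, post_mean, sqrt(post_var))

# Log-likelihood matrix: one row per posterior draw, one column per observation
ll <- sapply(y, function(yi) dnorm(yi, mean = mu_draws, sd = sigma, log = TRUE))

lppd <- sum(log(colMeans(exp(ll))))      # log of the average likelihood, summed
p_waic <- sum(apply(ll, 2, var))         # variance in log-likelihood, summed
waic <- -2 * (lppd - p_waic)

# AIC analog: deviance at the posterior mean plus 2 * (one free parameter)
d_train <- -2 * sum(dnorm(y, post_mean, sigma, log = TRUE))
aic_like <- d_train + 2 * 1
```

With a nearly flat prior and a Gaussian posterior, `p_waic` should land near 1 (the single free parameter) and `waic` should be close to `aic_like`.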

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

Model selection is to choose the model with the lowest criterion value and then discard the others.

Model selection loses the information about relative model accuracy contained in the differences among information criterion values. These differences can be large or small, and discarding them is especially problematic when the selected model outperforms its alternatives by only a small margin.

Model comparison is a more general approach that uses multiple models to understand how different variables influence predictions. In combination with a causal model, the implied conditional independencies among variables help us infer causal relationships.
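The information that selection discards can be made concrete by turning criterion differences into relative weights. The sketch below uses hypothetical WAIC values for three hypothetical models (`m_a`, `m_b`, `m_c`) and the usual rescaling in which each weight is proportional to exp(-0.5 * dWAIC).

```r
# Hypothetical WAIC values for three models fit to the same data
waic <- c(m_a = 310.2, m_b = 311.0, m_c = 325.7)

d_waic <- waic - min(waic)          # differences from the best-scoring model
weight <- exp(-0.5 * d_waic)
weight <- weight / sum(weight)      # relative weights that sum to 1

selected <- names(which.min(waic))  # model selection keeps only this name
round(cbind(waic, d_waic, weight), 3)
```

Here `selected` is `m_a`, but the weights show that `m_b` is nearly as well supported while `m_c` is not, which is the distinction lost when only the winner is reported.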

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

Information criteria are based on deviance, which is accrued over observations without being divided by the number of observations.

In general, a model fit to more observations will have a higher deviance and thus appear less accurate according to information criteria. So it would be misleading to compare models fit to different numbers of observations.
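The accumulation can be seen without fitting any model. The following is a minimal sketch, assuming simulated standard-normal data and a deviance evaluated at the true parameters; the helper `deviance_n` is hypothetical.

```r
set.seed(11)
y <- rnorm(500)   # simulated standard-normal data (an assumption)

# Deviance of the true model, summed over the first n observations
deviance_n <- function(n) -2 * sum(dnorm(y[1:n], mean = 0, sd = 1, log = TRUE))

sizes <- c(300, 400, 500)
dev <- sapply(sizes, deviance_n)
names(dev) <- paste0("n_", sizes)

per_obs <- dev / sizes   # dividing by n removes the dependence on sample size
```

The total deviance in `dev` grows with the sample size, while `per_obs` stays roughly constant, so the growth reflects the number of observations rather than the quality of the model.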

As an experiment, we can calculate WAIC for models fit to increasingly small subsamples of the same data. The information criterion values should decrease along with the sample size. We will use the Howell1 dataset.

# load the rethinking package and the Howell1 data
library(rethinking)
data(Howell1)

d <- Howell1[complete.cases(Howell1), ]
d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
d_400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
d_300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
m_500 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_500,
  start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_400,
  start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_300,
  start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))
## Warning in compare(m_500, m_400, m_300): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m_500 500 
## m_400 400 
## m_300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length

## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length

## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
##           WAIC       SE     dWAIC      dSE    pWAIC        weight
## m_300 1846.381 27.87543    0.0000       NA 3.077676  1.000000e+00
## m_400 2473.463 32.00720  627.0819 45.71977 3.156126 6.774893e-137
## m_500 3065.341 35.40793 1218.9604 49.18951 3.208829 2.023578e-265

We can see that WAIC increased as the sample size increased from N = 300 to N = 500.

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

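As the prior becomes more concentrated, the effective number of parameters decreases. A more concentrated prior constrains the posterior distribution, so the parameters vary less across posterior samples. The penalty term in WAIC is the sum of the variances of each observation's log-likelihood across those samples, so a narrower posterior means smaller variances and a smaller effective number of parameters.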

As an experiment, we can calculate WAIC for two models of the same data that only differ in the concentration of their priors.

# use Howell1 data
data(Howell1)

d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_wide <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d
)
m_narrow <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = d
)

# WAIC for wide
WAIC(m_wide, refresh = 0)
##        WAIC     lppd  penalty  std_err
## 1 -102.6836 55.65305 4.311264 36.48541
# WAIC for narrow
WAIC(m_narrow, refresh = 0)
##        WAIC     lppd  penalty  std_err
## 1 -102.3497 55.69693 4.522059 36.67303

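From the results, the two models are nearly indistinguishable. WAIC differs by only about 0.3, far less than its standard error of about 36, and the penalty term (the effective number of parameters) is 4.31 under the wide priors and 4.52 under the narrow priors. With several hundred observations, the likelihood overwhelms both sets of priors, so this particular experiment does not show the expected decrease in the effective number of parameters.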
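A smaller, self-contained sketch can make the effect of prior concentration easier to see. It assumes simulated data with only 10 observations, a normal model with known sigma, and a conjugate normal prior centered at 0, so that the prior is not overwhelmed by the likelihood; the helper `p_waic_for_prior` is hypothetical.

```r
set.seed(42)
n <- 10
sigma <- 1
y <- rnorm(n, mean = 0, sd = sigma)   # simulated data (an assumption)

# WAIC penalty for a Normal(0, prior_sd) prior on the mean
p_waic_for_prior <- function(prior_sd, n_draws = 4000) {
  post_var <- 1 / (1 / prior_sd^2 + n / sigma^2)
  post_mean <- post_var * sum(y) / sigma^2          # prior mean is 0
  mu_draws <- rnorm(n_draws, post_mean, sqrt(post_var))
  # rows are posterior draws, columns are observations
  ll <- sapply(y, function(yi) dnorm(yi, mu_draws, sigma, log = TRUE))
  sum(apply(ll, 2, var))
}

p_wide <- p_waic_for_prior(prior_sd = 10)
p_narrow <- p_waic_for_prior(prior_sd = 0.1)
c(wide = p_wide, narrow = p_narrow)
```

In this setting the narrow prior shrinks the posterior variance by roughly a factor of ten, and the penalty term shrinks with it.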

7M5. Provide an informal explanation of why informative priors reduce overfitting.

Overfitting occurs when a model has too many parameters and they fit the sample too closely. The parameters then encode the noise in the sample as well as the signal, so the model generalizes poorly to new data. Informative priors constrain the flexibility of the model: they reduce overfitting by forcing the model to learn less from the sample data.

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

Informative priors reduce overfitting because they constrain the flexibility of the model. Underfitting occurs when they constrain it too much. If a very restrictive prior is in place, the parameter is not allowed to vary much, which reduces the flexibility of the model and its ability to learn the parameter from the data (especially when the true parameter value lies outside the range the prior considers plausible).

In summary, overly informative priors result in underfitting by preventing the model from learning enough from the sample data.
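The trade-off between the two answers above can be illustrated with a conjugate normal model, where the posterior mean has a closed form. This is a minimal sketch, assuming simulated data with a true mean of 2, known sigma, and priors centered at 0; the helper `post_mean` and the three prior widths are chosen only for illustration.

```r
# Posterior mean of a normal mean with known sigma and a Normal(0, prior_sd) prior
post_mean <- function(y, prior_sd, sigma = 1) {
  n <- length(y)
  post_var <- 1 / (1 / prior_sd^2 + n / sigma^2)
  post_var * sum(y) / sigma^2
}

set.seed(3)
y <- rnorm(5, mean = 2, sd = 1)   # small sample; the true mean of 2 is an assumption

prior_sds <- c(flat = 100, regularizing = 1, too_tight = 0.05)
estimates <- sapply(prior_sds, function(s) post_mean(y, s))
round(c(sample_mean = mean(y), estimates), 3)
```

The nearly flat prior reproduces the sample mean, noise included. The regularizing prior pulls the estimate modestly toward 0, and the overly tight prior holds it near 0 regardless of what the data say.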