Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. Practical functions, such as compare in the rethinking package, were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools can mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside a code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

#The three motivating criteria that define information entropy are:
#(1) Continuity: the measure of uncertainty should be continuous, so that a small change in any of the probabilities produces only a small change in the measured uncertainty.
#(2) Increasing with the number of possibilities: the measure should grow with the number of possible outcomes, because a larger possibility space means more uncertainty.
#(3) Additivity: the uncertainty over combinations of independent events should be the sum of the separate uncertainties, so that how we partition the events does not matter.
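#As a quick numerical sketch (base R only; the example distributions below are made up, not from the text), criteria (2) and (3) can be checked directly: entropy grows with the size of the possibility space, and the entropy of two independent events is the sum of their separate entropies.
H <- function(p) -sum(p * log(p))           # Shannon entropy in nats

H(rep(1/2, 2))                              # fair coin: log(2) ~ 0.69
H(rep(1/6, 6))                              # fair die:  log(6) ~ 1.79

p_joint <- outer(rep(1/2, 2), rep(1/6, 6))  # coin and die tossed independently: 12 equally likely outcomes
H(p_joint)                                  # log(12) ~ 2.48 = 0.69 + 1.79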

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p <- c(0.7, 1 - 0.7)   # probabilities of heads and tails
H <- -sum(p * log(p))  # Shannon entropy in nats
H
## [1] 0.6108643
#The entropy of this coin is about 0.61 nats.

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p <- c(0.20, 0.25, 0.25, 0.30)
H <- -sum(p * log(p))
H
## [1] 1.376227
#The entropy of this die is about 1.38.

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

p <- c(1/3, 1/3, 1/3)
H <- -sum(p * log(p))
H
## [1] 1.098612
#The entropy of this die is about 1.10.

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

#AIC = D_train + 2p, where D_train is the in-sample training deviance and p is the number of free parameters estimated in the model.
#WAIC = -2(lppd - pWAIC) = -2( sum_{i=1}^{N} log Pr(y_i) - sum_{i=1}^{N} V(y_i) ), where Pr(y_i) is the average likelihood of observation i in the training sample, and V(y_i) is the variance of the log-likelihood of observation i in the training sample.

#WAIC is the more general criterion; AIC is a special case of it. To transform WAIC into AIC, we must assume that the priors are flat or overwhelmed by the likelihood, that the posterior distribution is approximately multivariate Gaussian, and that the sample size is much greater than the number of parameters.
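#A minimal sketch of the WAIC formula itself. The matrix `ll` and the function name are hypothetical: `ll` is assumed to be an S-by-N matrix of pointwise log-likelihoods (S posterior samples by N observations). rethinking's WAIC() performs this calculation, using a numerically safer log-sum-exp for the lppd term.
waic_by_hand <- function(ll) {
  # lppd_i: log of the average (over posterior samples) likelihood of observation i
  lppd <- apply(ll, 2, function(x) log(mean(exp(x))))
  # pWAIC_i: variance (over posterior samples) of the log-likelihood of observation i
  pwaic <- apply(ll, 2, var)
  -2 * (sum(lppd) - sum(pwaic))
}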

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

#Model selection means retaining the model with the lowest information criterion value and discarding the models with higher values.
#Model comparison, in contrast, uses several models, together with cross-validation or information criteria, to understand how different variables influence predictions; it is not by itself causal inference, and it does not reduce to picking a single winning model.

#Under model selection, the information about relative model accuracy contained in the differences among the information criterion values (for example, how close the second-best model is to the best) is lost.
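#A small illustration with made-up WAIC values for three hypothetical models: the WAIC differences and the resulting model weights carry the relative-accuracy information that a bare "keep the lowest WAIC" decision throws away.
waic_vals <- c(m1 = 361.9, m2 = 363.2, m3 = 402.8)       # made-up WAIC values
d_waic <- waic_vals - min(waic_vals)                     # differences from the best model
weights <- exp(-0.5 * d_waic) / sum(exp(-0.5 * d_waic))  # Akaike-style model weights
round(rbind(dWAIC = d_waic, weight = weights), 2)
#Here m1 and m2 are nearly indistinguishable (weights of roughly 0.66 and 0.34), while m3 is far worse; selecting m1 alone would hide how close m2 is.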

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

#Information criteria are based on the deviance summed over all observations, not averaged over them. A model fit to more observations therefore accumulates a larger deviance and a larger (apparently worse) information criterion value regardless of its accuracy, so it is not meaningful to compare models fit to different numbers of observations.

#To confirm this, WAIC can be calculated for models fit to different numbers of observations. 
library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
## Loading required package: parallel
## rethinking (Version 2.12)
## 
## Attaching package: 'rethinking'
## The following object is masked from 'package:stats':
## 
##     rstudent
data(Howell1)
d <- Howell1[complete.cases(Howell1), ]
d200 <- d[sample(1:nrow(d), size = 200, replace = FALSE), ]
d300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
d400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]

m200 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d200,
  start = list(a = mean(d200$height), b = 0, sigma = sd(d200$height))
)

m300 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d300,
  start = list(a = mean(d300$height), b = 0, sigma = sd(d300$height))
)

m400 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d400,
  start = list(a = mean(d400$height), b = 0, sigma = sd(d400$height))
)

(model.compare <- compare(m200, m300, m400))
## Warning in compare(m200, m300, m400): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m200 200 
## m300 300 
## m400 400
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length

## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
##          WAIC       SE    dWAIC      dSE    pWAIC        weight
## m200 1234.306 22.64280    0.000       NA 3.217053  1.000000e+00
## m300 1841.625 30.10291  607.319 35.48855 3.606321 1.325424e-132
## m400 2457.244 32.44937 1222.938 32.02571 3.515183 2.769437e-266
#Comparing the three models fit to 200, 300, and 400 observations, we see that WAIC increases with the number of observations. compare() also warns that the models were fit to different numbers of observations.

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

#The effective number of parameters decreases as a prior becomes more concentrated. 
#For PSIS, the penalty reflects the flexibility of the model; a more concentrated prior leaves the posterior less free to chase individual observations, so the penalty shrinks.
#For WAIC, pWAIC is the sum over observations of the variance of the log-likelihood across posterior samples. As the prior becomes more concentrated, the posterior (and therefore the pointwise log-likelihood) becomes more concentrated as well, so the variances and the penalty decrease.

d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_not_concentrated <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d
)
m_concentrated <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = d
)
WAIC(m_not_concentrated, refresh = 0)
##        WAIC     lppd  penalty  std_err
## 1 -102.3895 55.64577 4.451023 36.46047
WAIC(m_concentrated, refresh = 0)
##        WAIC    lppd  penalty  std_err
## 1 -102.7222 55.6539 4.292785 36.40651
#The penalty (pWAIC) is slightly smaller under the more concentrated prior (about 4.29 versus 4.45).

7M5. Provide an informal explanation of why informative priors reduce overfitting.

#Informative priors reduce overfitting because they constrain the flexibility of the model: extreme parameter values are assigned low prior plausibility, so they cannot attain high posterior probability without strong evidence. In other words, informative priors keep the model from learning too much from the idiosyncrasies of the sample.
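#A small simulated check (the seed, sample size, and prior widths below are arbitrary choices, not from the text): x has no true effect on y, yet a vague prior lets the slope chase the sample noise, while a skeptical (regularizing) prior shrinks it toward zero.
set.seed(7)
n <- 20
x <- rnorm(n)
y <- rnorm(n)                        # y does not depend on x at all
sim_dat <- data.frame(x = x, y = y)

m_vague <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),                # very vague prior on the slope
    sigma ~ dunif(0, 10)
  ),
  data = sim_dat
)
m_regularized <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 0.2),
    b ~ dnorm(0, 0.2),               # skeptical, regularizing prior on the slope
    sigma ~ dunif(0, 10)
  ),
  data = sim_dat
)
precis(m_vague)                      # the slope estimate follows the sample noise more
precis(m_regularized)                # the slope is pulled toward zero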

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

#Overly informative priors result in underfitting because they constrain the flexibility of the model too much: even the correct parameter values are assigned such low prior plausibility that the data cannot push the posterior toward them. In other words, they cause underfitting by preventing the model from learning enough from the sample data.
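#The complementary sketch for underfitting (again with arbitrary simulated values, not from the text): now x has a genuinely large effect (true slope of 2), but a prior that is far too tight around zero keeps the posterior from reaching it.
set.seed(7)
n <- 20
x <- rnorm(n)
y <- rnorm(n, mean = 2 * x)          # true slope of 2
sim_dat2 <- data.frame(x = x, y = y)

m_too_tight <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 0.05),              # overly informative prior, far too skeptical for a slope of 2
    sigma ~ dunif(0, 10)
  ),
  data = sim_dat2
)
precis(m_too_tight)                  # the posterior mean of b ends up well below the true value of 2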