Chapter 7 - Ulysses’ Compass

The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

#Information entropy is motivated by the following three criteria:
#1. Continuity. The measure of uncertainty should be continuous, so that a small change in the probabilities produces only a small change in the measured uncertainty.
#2. Additivity. The uncertainty over combinations of independent events should equal the sum of the separate uncertainties (a quick numerical check of this property is sketched below).
#3. Scalability. The measure of uncertainty should increase as the number of possible outcomes increases, because more possible outcomes means more uncertainty.
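# A quick numerical check of the additivity criterion (my own sketch, using
# made-up probabilities): for two independent events, the entropy of the joint
# distribution equals the sum of the two separate entropies.
entropy <- function(p) -sum(p * log(p))
p_coin <- c(0.3, 0.7)                        # hypothetical weighted coin
p_spin <- c(0.2, 0.3, 0.5)                   # hypothetical three-outcome spinner
p_joint <- as.vector(outer(p_coin, p_spin))  # joint probabilities under independence
entropy(p_joint)                             # same value as the sum below
entropy(p_coin) + entropy(p_spin)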

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

#To compute entropy we need the probability of each outcome: here p(heads) = 0.7 and p(tails) = 0.3.

#H(p) = -sum(i = 1..n) p_i * log(p_i) = -(p_H * log(p_H) + p_T * log(p_T))

p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

#We can use the same approach as above

p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

#We can use the same approach as above (page 178). The only complication is the side that never shows: since lim p->0 of p*log(p) = 0 (by L'Hopital's rule, page 179), events with probability zero drop out of the sum, so we compute the entropy over the three sides that do appear. A quick numerical check of that limit follows the calculation.

p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612
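# A quick numerical check (my own sketch) of why the zero-probability side can
# be ignored: p * log(p) goes to 0 as p goes to 0, so a vanishingly rare fourth
# side adds essentially nothing to the entropy.
p_eps <- c(1/3, 1/3, 1/3, 1e-12)
p_eps <- p_eps / sum(p_eps)           # renormalize
-sum(p_eps * log(p_eps))              # approximately log(3) = 1.0986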

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# AIC: AIC is the oldest and best-known information criterion. It provides a simple estimate of the average out-of-sample deviance: AIC = D_train + 2p = -2*lppd + 2*p, where p is the number of free parameters in the posterior.
# WAIC: WAIC makes no assumption about the shape of the posterior. It starts from the log-pointwise-predictive density (lppd) and penalizes it by the variance, across posterior samples, of each observation's log-likelihood: WAIC(y, Theta) = -2 * ( lppd - sum_i var_theta log p(y_i | theta) ). It approximates the out-of-sample deviance and converges to the leave-one-out cross-validation approximation in a large sample.

# Thus, WAIC is the more general criterion. To turn WAIC into AIC, we must assume that (1) the priors are flat or overwhelmed by the likelihood, (2) the posterior distribution is approximately multivariate Gaussian, and (3) the sample size N is much greater than the number of parameters; under those conditions the WAIC penalty term approaches the parameter count p. A by-hand sketch of the WAIC calculation is included below.
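# A by-hand sketch of the WAIC pieces defined above (my own illustration,
# assuming the rethinking package and R's built-in cars data, which are not
# part of this assignment's data sets).
library(rethinking)
data(cars)
m_cars <- map(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b * speed,
    a ~ dnorm(0, 100),
    b ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ),
  data = cars
)
set.seed(77)
n_samples <- 1000
post <- extract.samples(m_cars, n = n_samples)
# log-likelihood of every observation under every posterior sample
logprob <- sapply(1:nrow(cars), function(i) {
  mu <- post$a + post$b * cars$speed[i]
  dnorm(cars$dist[i], mu, post$sigma, log = TRUE)
})
lppd_i  <- apply(logprob, 2, function(x) log_sum_exp(x) - log(n_samples))  # pointwise lppd
pWAIC_i <- apply(logprob, 2, var)                                          # pointwise penalty
-2 * (sum(lppd_i) - sum(pWAIC_i))   # by-hand WAIC
WAIC(m_cars)                        # should agree up to Monte Carlo error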

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model comparison is the more general approach: it uses multiple models to understand how different variables influence predictions and, in combination with a causal model and its implied conditional independencies, helps us infer causal relationships.

# Model selection, by contrast, keeps only the model with the best criterion value and discards the rest. Doing so throws away the information about relative model accuracy contained in the differences among the WAIC/PSIS values, and the selection paradigm cares only about predictive accuracy while ignoring causal reasoning entirely.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# Information criteria are based on deviance, and deviance is a sum over observations, not an average. All else being equal, a model fit to more observations accumulates a larger deviance and therefore looks less accurate according to the information criterion, even though it has not actually gotten worse. That makes comparisons across different numbers of observations meaningless, as the experiment below shows.

library(rethinking)
data(Howell1)
str(Howell1)
## 'data.frame':    544 obs. of  4 variables:
##  $ height: num  152 140 137 157 145 ...
##  $ weight: num  47.8 36.5 31.9 53 41.3 ...
##  $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
##  $ male  : int  1 0 0 1 0 1 0 1 0 1 ...
# Drop incomplete cases, then draw three subsamples of different sizes from the same data
d <- Howell1[complete.cases(Howell1), ]
d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
d_400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
d_300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
# Fit the same model (height regressed on log weight, flat priors) to each subsample
m_500 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_500,
  start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_400,
  start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_300,
  start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))
## Warning in compare(m_500, m_400, m_300): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m_500 500 
## m_400 400 
## m_300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length

## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length

## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
##           WAIC       SE     dWAIC      dSE    pWAIC        weight
## m_300 1863.247 29.21341    0.0000       NA 3.440806  1.000000e+00
## m_400 2427.545 33.01408  564.2982 48.24429 3.384544 2.912115e-123
## m_500 3072.772 35.27337 1209.5258 52.69511 3.176305 2.263697e-263

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

#In the experiments below, I fit two models of the same data that differ only in how concentrated their priors are, and compare their WAIC penalty terms (the effective number of parameters).

# Standardize log(height) and log(weight) so both models use the same scaled data
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
# Model with wide (diffuse) priors
m_wide <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d
)
# Model with narrow (concentrated) priors
m_narrow <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = d
)
WAIC(m_wide, refresh = 0)
##        WAIC     lppd  penalty std_err
## 1 -102.6821 55.67891 4.337849 36.5203
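# For completeness, the narrow-prior model as well (its output is not
# reproduced here); the penalty column, i.e. the effective number of
# parameters, should come out smaller than m_wide's because the concentrated
# priors constrain the posterior.
WAIC(m_narrow, refresh = 0)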

Comparing the penalty columns of the two models, the effective number of parameters decreases as the priors become more concentrated: a tighter prior constrains the posterior, which reduces the variance of the pointwise log-likelihood and hence the WAIC penalty.

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative (regularizing) priors constrain the model: they make it harder for extreme parameter values to be assigned high posterior probability, so the model learns less from the idiosyncratic noise in any particular sample and therefore overfits less. A small simulation sketch follows below.
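# A small simulation sketch (my own illustration with made-up data, reusing the
# map() approach from above): a predictor x2 that is pure noise picks up a
# spurious effect under diffuse priors, and a skeptical prior shrinks that
# spurious coefficient back toward zero.
set.seed(7)
N <- 20
x1 <- rnorm(N)
x2 <- rnorm(N)                    # pure noise, unrelated to the outcome
y  <- rnorm(N, mean = x1)         # only x1 truly matters (true slope = 1)
d_sim <- data.frame(y = y, x1 = x1, x2 = x2)
m_diffuse <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b1 * x1 + b2 * x2,
    a ~ dnorm(0, 10),
    b1 ~ dnorm(0, 10),
    b2 ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ),
  data = d_sim
)
m_skeptical <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b1 * x1 + b2 * x2,
    a ~ dnorm(0, 10),
    b1 ~ dnorm(0, 0.2),
    b2 ~ dnorm(0, 0.2),
    sigma ~ dexp(1)
  ),
  data = d_sim
)
precis(m_diffuse)     # b2 drifts away from zero, chasing noise
precis(m_skeptical)   # the skeptical prior pulls b2 back toward zero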

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors result in underfitting because they constrain the flexibility of the model too much; they make it less likely for “correct” parameter values to be assigned high posterior probability. In simpler terms, overly informative priors result in underfitting by preventing the model from learning enough from the sample data.
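# Continuing the sketch above (assumes d_sim from the 7M5 chunk): an overly
# tight prior on b1 shrinks a real effect (true slope = 1) almost to zero,
# which is underfitting.
m_too_tight <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b1 * x1 + b2 * x2,
    a ~ dnorm(0, 10),
    b1 ~ dnorm(0, 0.01),    # overly informative: nearly forces b1 = 0
    b2 ~ dnorm(0, 0.01),
    sigma ~ dexp(1)
  ),
  data = d_sim
)
precis(m_too_tight)   # b1 sits far below its true value of 1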