The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes or answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
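# The criteria for a measure of uncertainty:
# 1. Continuity. Uncertainty should be measured on a continuous scale, so that a small change in a probability produces only a small change in uncertainty. The scale may be bounded.
# 2. Increasing. Uncertainty should increase as the number of possible events increases.
# 3. Additivity. The uncertainty over combinations of events should equal the sum of their separate uncertainties, so redefining the event categories does not change the total amount of uncertainty.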
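The three criteria can be checked numerically. The chunk below is a minimal sketch in base R; the helper name entropy and the example probabilities are assumptions chosen for illustration, not part of the assignment.

```r
# Information entropy in nats (natural log), for strictly positive probabilities
entropy <- function(p) -sum(p * log(p))

# Continuity: a small change in the probabilities gives a small change in entropy
h_a <- entropy(c(0.70, 0.30))
h_b <- entropy(c(0.701, 0.299))

# Increasing: four equally likely events are more uncertain than two
h2 <- entropy(rep(1/2, 2))
h4 <- entropy(rep(1/4, 4))

# Additivity: for two independent events, the entropy of the four combinations
# equals the sum of the two separate entropies
coin_a <- c(0.7, 0.3)
coin_b <- c(0.5, 0.5)
joint  <- as.vector(outer(coin_a, coin_b))
h_joint <- entropy(joint)
h_sum   <- entropy(coin_a) + entropy(coin_b)
```

Comparing h_a with h_b, h2 with h4, and h_joint with h_sum shows each property in turn.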
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
# define probabilities of heads and tails
p <- c( 0.7 , 0.3 )
# compute entropy
-sum( p*log(p) )
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
# define probabilities of sides
p <- c( 0.2 , 0.25 , 0.25 , 0.3 )
# compute entropy
-sum( p*log(p) )
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
# define probabilities of the three sides that can appear; side 4 has probability 0 and contributes nothing to the entropy
p <- c( 1/3 , 1/3 , 1/3 )
# compute entropy
-sum( p*log(p) )
## [1] 1.098612
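The answer to 7E4 drops the side that never appears. A short sketch of why: in R, 0 * log(0) evaluates to NaN, so the naive formula fails if the zero is kept. The helper name entropy_safe below is an assumption for illustration; it applies the usual convention that a zero-probability event contributes zero.

```r
# Probabilities for all four sides, keeping the impossible side explicit
p4 <- c(1/3, 1/3, 1/3, 0)

# Naive formula: 0 * log(0) is NaN in R, so the whole sum is NaN
h_naive <- -sum(p4 * log(p4))

# Convention: events with probability zero contribute nothing to the entropy
entropy_safe <- function(p) -sum(ifelse(p > 0, p * log(p), 0))
h_safe <- entropy_safe(p4)

# Same value as the three-sided calculation, which equals log(3)
h_three <- -sum(rep(1/3, 3) * log(rep(1/3, 3)))
```

With the convention applied, the four-sided calculation agrees with the three-sided one.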
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
\[\begin{align} \text{AIC} = D_{\text{train}} + 2p \end{align}\]
\[\begin{align} \text{WAIC} &= -2\left(\text{lppd} - p_{\text{WAIC}}\right)\\ &= -2\left(\sum_{i=1}^{N} \log \Pr(y_i) - \sum_{i=1}^{N} V(y_i)\right) \end{align}\]
# These are the definitions of AIC and WAIC.
# p is the number of parameters in the model.
# Pr(yi) is the likelihood of observation i, averaged over the posterior distribution.
# V(yi) is the variance of the log-likelihood of observation i over the posterior distribution.
# Both definitions have a term for the fit to the training sample and a penalty term for flexibility in fitting the sample.
# AIC: fit term is Dtrain, penalty term is 2p
# WAIC: fit term is lppd, penalty term is pWAIC
# WAIC is more general than AIC.
# WAIC and AIC tend to agree when the priors are flat or overwhelmed by the likelihood, the posterior distribution is approximately multivariate Gaussian, and the sample size N is much greater than the number of parameters.
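To make the two definitions concrete, the chunk below computes both by hand for a toy model. It is a sketch under stated assumptions: the data are simulated, the model is y ~ Normal(mu, 1) with a flat prior on mu, and the posterior is sampled from its known closed form rather than from a fitted model object. All variable names are illustrative.

```r
set.seed(7)
y <- rnorm(20, mean = 1, sd = 1)   # simulated data, purely illustrative
n <- length(y)

# With a flat prior and known sd = 1, the posterior of mu is
# Normal(mean(y), 1 / sqrt(n)); draw samples from it directly
mu_post <- rnorm(5000, mean = mean(y), sd = 1 / sqrt(n))

# Log-likelihood matrix: one row per posterior draw, one column per observation
ll <- sapply(y, function(yi) dnorm(yi, mean = mu_post, sd = 1, log = TRUE))

# lppd: log of the posterior-averaged likelihood of each observation, summed
lppd <- sum(log(colMeans(exp(ll))))
# pWAIC: variance of the log-likelihood of each observation over the posterior, summed
p_waic <- sum(apply(ll, 2, var))
waic <- -2 * (lppd - p_waic)

# AIC: training deviance at the maximum likelihood estimate plus 2p, with p = 1
d_train <- -2 * sum(dnorm(y, mean = mean(y), sd = 1, log = TRUE))
aic <- d_train + 2 * 1
```

This toy case has a flat prior and a Gaussian posterior, so waic and aic should land close to each other, with the gap depending on the simulated sample.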
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
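# Model selection: rank models by an information criterion and keep only the best-ranked model (the one with the lowest criterion value).
# Model selection discards the other models and the differences in criterion values among them, which carry the information about model uncertainty.
# Model comparison: keep all the models and use the differences in their criterion values to understand how the variables influence predictions.
# Model averaging: weight every model by its relative distance from the best model and create a posterior predictive distribution comprising predictions from every model in proportion to these weights.
# Model averaging retains some of the information about model uncertainty, but still discards some of it.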
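The weights mentioned above can be computed from the differences in criterion values. The chunk below is a small sketch of Akaike-style weights; the three WAIC values are hypothetical numbers made up for illustration, not results from any fitted model.

```r
# Hypothetical WAIC values for three models (made up for illustration)
waic <- c(m1 = 361.9, m2 = 363.1, m3 = 370.4)

# Distance of each model from the best (lowest) WAIC
d_waic <- waic - min(waic)

# Weight of each model: exp(-0.5 * difference), rescaled to sum to one
weights <- exp(-0.5 * d_waic) / sum(exp(-0.5 * d_waic))
round(weights, 3)
```

Selecting only m1 would throw away the fact that m2 is nearly as good, which is exactly the information the differences preserve.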
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
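# If models are fit to different observations, their information criterion values are not comparable, because each value measures fit to a different sample.
# Deviance is a sum over observations, so a model fit to fewer observations tends to have a smaller deviance and a smaller criterion value. It will appear to perform better even when it does not. Many R regression functions silently drop incomplete cases, which reduces the number of observations. This happens often and can make a comparison misleading without any warning.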
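A quick experiment along the lines the question suggests. This is a sketch with simulated data and base R's lm and AIC standing in for the chapter's tools; the variable names and the sample sizes are assumptions.

```r
set.seed(11)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
d <- data.frame(x = x, y = y)

# Same model, fit to all 200 rows and to only the first 100 rows
fit_full <- lm(y ~ x, data = d)
fit_half <- lm(y ~ x, data = d[1:100, ])
aic_full <- AIC(fit_full)
aic_half <- AIC(fit_half)

# Missing values silently shrink the sample: lm drops incomplete rows by default
d_na <- d
d_na$x[1:50] <- NA
fit_na <- lm(y ~ x, data = d_na)
n_used <- nobs(fit_na)
```

The model fit to half the data has a much smaller AIC only because it sums over fewer observations, and nobs shows how many rows survived the missing values.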
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# When a prior becomes more concentrated around particular parameter values, the model is less flexible in fitting the sample.
# A more concentrated prior acts like more previous data, so the posterior varies less and the effective number of parameters declines.
# Some code to show this:
# Change the value of sigma in the data list. The smaller sigma, the more concentrated the prior and the smaller the effective number of parameters.
library(rethinking)
y <- rnorm(10) # execute once to get data
# repeat, changing sigma each time
m <- quap(
    alist(
        y ~ dnorm(mu, 1),
        mu ~ dnorm(0, sigma)
    ), data = list(y = y, sigma = 10))
WAIC(m)
## WAIC lppd penalty std_err
## 1 35.11963 -16.07976 1.480054 6.425264
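The same experiment can be run for several values of sigma at once. The chunk below is a sketch in base R that does not use quap: it assumes the same model (y ~ Normal(mu, 1), mu ~ Normal(0, sigma)) and samples from its closed-form conjugate posterior, then computes the WAIC penalty by hand. The function name p_waic_for and the chosen sigma values are illustrative.

```r
set.seed(3)
y <- rnorm(10)
n <- length(y)
z <- rnorm(5000)   # shared standard normal draws, reused for every prior

# WAIC penalty (effective number of parameters) for a given prior sd
p_waic_for <- function(sigma) {
  # Conjugate posterior for mu: precision = n + 1 / sigma^2
  post_var  <- 1 / (n + 1 / sigma^2)
  post_mean <- post_var * sum(y)
  mu <- post_mean + sqrt(post_var) * z
  # Log-likelihood of each observation for each posterior draw
  ll <- sapply(y, function(yi) dnorm(yi, mean = mu, sd = 1, log = TRUE))
  sum(apply(ll, 2, var))
}

sigmas <- c(10, 1, 0.1)
p_waic <- sapply(sigmas, p_waic_for)
names(p_waic) <- paste0("sigma=", sigmas)
p_waic
```

The penalty for the tightest prior should come out well below the penalty for the widest one, since the posterior for mu has much less room to vary.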
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors reduce overfitting by making the model less sensitive to the sample. Some features of a sample are irregular: they are not recurring features of the process of interest, and an informative prior makes the model skeptical of them.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# If a prior is overly informative, the model will not learn even the regular features of the sample, so it may underfit. There is usually a broad family of priors that produce nearly the same inference from the same sample.
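Both answers can be illustrated with one small experiment. The chunk below is a sketch, not the chapter's method: it uses a polynomial regression on simulated data and computes the MAP estimate under Normal(0, tau) priors on the slopes with a known noise sd, which is equivalent to ridge regression with penalty lambda = (noise sd / tau)^2. The three lambda values, the degree, and all names are assumptions chosen for illustration.

```r
set.seed(5)
make_data <- function(n) {
  x <- runif(n, -1, 1)
  data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))
}
train <- make_data(15)
test  <- make_data(200)

# Design matrix: intercept plus raw polynomial terms up to degree 5
design <- function(x) cbind(1, outer(x, 1:5, "^"))

# MAP estimate: larger lambda means a more concentrated prior on the slopes
fit_map <- function(lambda) {
  X <- design(train$x)
  P <- diag(c(0, rep(lambda, 5)))   # leave the intercept unpenalized
  solve(crossprod(X) + P, crossprod(X, train$y))
}

# Residual sum of squares of a coefficient vector on a data set
rss <- function(beta, d) sum((d$y - design(d$x) %*% beta)^2)

lambdas <- c(flat = 0, moderate = 0.1, tight = 1e4)
train_rss <- sapply(lambdas, function(l) rss(fit_map(l), train))
test_mse  <- sapply(lambdas, function(l) rss(fit_map(l), test) / nrow(test))
```

The in-sample error in train_rss can only grow as the prior tightens. Comparing it with test_mse shows the tradeoff: the flat prior fits the training sample best, while the very tight prior forces the slopes toward zero and cannot capture the regular curve in the data.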