library(rethinking)
data(Howell1)
set.seed(6)
The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS estimate the degree of overfitting. The practical compare() function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools can mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting: measure it with WAIC/PSIS and reduce it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The three motivating criteria are as follows:
## Continuity: the measure of uncertainty should be continuous, so that a small change in the probabilities produces only a small change in uncertainty.
## Scaling with the possibility space: the measure should increase as the number of possible outcomes increases; more ways for things to happen means more uncertainty.
## Additivity: the uncertainty of independent events should add up, and it should not matter how the possibility space is divided. Dividing a possibility space into four events (e.g., rain/hot, rain/cold, shine/hot, shine/cold) yields the same total uncertainty as dividing the same space into two sets of two events (rain-or-shine and hot-or-cold), as checked numerically below.
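## A quick numerical check of the additivity criterion. The rain/shine and hot/cold probabilities are arbitrary values assumed here purely for illustration; the events are taken to be independent.
p_weather <- c(rain = 0.3, shine = 0.7)            # assumed marginal probabilities
p_temp <- c(hot = 0.6, cold = 0.4)                 # assumed marginal probabilities
H <- function(p) -sum(p * log(p))                  # information entropy
p_joint <- as.vector(outer(p_weather, p_temp))     # the four joint events
H(p_joint)                                         # entropy of the four-way division
H(p_weather) + H(p_temp)                           # equals the sum of the two two-way divisions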
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
p <- c(1/3, 1/3, 1/3) # the impossible side "4" contributes nothing, since p * log(p) -> 0 as p -> 0
(H <- -sum(p * log(p)))
## [1] 1.098612
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
## AIC = D_train + 2p. AIC is the oldest and most restrictive criterion; it approximates out-of-sample deviance. D_train is the deviance on the training sample and p is the number of free parameters estimated in the model.
## DIC = D_bar + p_D = D_bar + (D_bar - D_hat), where D is the posterior distribution of the deviance, D_bar is its average, and D_hat is the deviance computed at the posterior mean.
## WAIC = -2(lppd - p_WAIC). lppd is the log-pointwise-predictive-density, the pointwise analog of deviance averaged over the posterior distribution: lppd = sum_i log Pr(y_i), where Pr(y_i) is the average likelihood of observation i over the posterior samples. p_WAIC is the penalty term: p_WAIC = sum_i V(y_i), where V(y_i) is the variance of the log-likelihood of observation i across the posterior samples.
## WAIC is the most general of these criteria. DIC follows from WAIC when the posterior distribution is approximately multivariate Gaussian. AIC follows when, in addition, the priors are flat (or overwhelmed by the likelihood) and the sample size is much greater than the number of parameters.
## All of these criteria aim to nominate the model expected to make the best predictions, judged by out-of-sample deviance. They are sometimes accused of overfitting because, with large amounts of data, they tend to favor the most complex model; but in that case the most complex model makes predictions essentially identical to the true model.
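## A minimal sketch of the WAIC pieces computed by hand on simulated data. The Gaussian model, known sigma = 2, and flat prior on mu are assumptions made here for brevity, so the posterior for mu is itself Gaussian; log_sum_exp() comes from the rethinking package.
set.seed(7)
y <- rnorm(20, mean = 1, sd = 2)                           # toy data (assumed)
n_samples <- 1e4
mu_post <- rnorm(n_samples, mean(y), 2 / sqrt(length(y)))  # exact posterior for mu under a flat prior
# log-likelihood of each observation under each posterior draw (n_samples x n_obs)
ll <- sapply(y, function(yi) dnorm(yi, mu_post, 2, log = TRUE))
lppd <- sum(apply(ll, 2, function(x) log_sum_exp(x) - log(n_samples)))  # pointwise log predictive density
p_waic <- sum(apply(ll, 2, var))                           # penalty: summed variance of the log-likelihood
(WAIC_by_hand <- -2 * (lppd - p_waic))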
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
## Model selection means retaining the model with the lowest information criterion value and discarding all other models with higher values. In doing so we lose the information about relative model accuracy contained in the differences among the criterion values (and in the model weights derived from them). This is especially problematic when the selected model outperforms its alternatives only to a small degree.
## Model comparison, in contrast, keeps all of the models and uses the differences in WAIC/PSIS to understand how and why the models differ in predictive accuracy. Retaining the uncertainty across multiple models, for example by constructing a posterior predictive distribution that averages over them (model averaging), does not lose information on its own; however, when combined with undisclosed data dredging, it can lead to spurious findings. A small sketch of the weights reported by compare() follows below.
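## A small sketch of one standard way (the Akaike-weight convention, as used by compare()) to turn WAIC differences into relative weights; the WAIC values here are made up purely for illustration.
waic_vals <- c(m1 = 2010.5, m2 = 2012.1, m3 = 2025.7)    # hypothetical WAIC values
dwaic <- waic_vals - min(waic_vals)                      # differences from the best model
(weights <- exp(-0.5 * dwaic) / sum(exp(-0.5 * dwaic)))  # the relative support discarded under pure selection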
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
## Information criteria are based on deviance, which is summed over observations. A model fit to more observations will therefore generally have a larger total deviance and a larger WAIC, making it look worse even when it is not, so the values are simply not comparable across different numbers of observations.
## To see this, compute WAIC for the same model fit to increasingly small subsamples of the same data: the information criterion shrinks along with the sample size. In the comparison below, WAIC drops from about 3055 (n = 500) to about 1862 (n = 300) even though the model is identical. For a reasonably large sample, we use the Howell1 data from an earlier chapter.
d <- Howell1[complete.cases(Howell1), ]
d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
d_400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
d_300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
m_500 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_500,
start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_400,
start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d_300,
start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))
## Warning in compare(m_500, m_400, m_300): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m_500 500
## m_400 400
## m_300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## WAIC SE dWAIC dSE pWAIC weight
## m_300 1862.175 27.91431 0.0000 NA 3.437154 1.000000e+00
## m_400 2419.639 33.94881 557.4644 46.34780 3.442395 8.874639e-122
## m_500 3054.604 35.32283 1192.4289 53.17046 3.257118 1.167790e-259
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_wide <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
m_narrow <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
WAIC(m_wide, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7631 55.62747 4.24594 36.5361
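## For comparison, the analogous call for the narrow-prior model (its output is not shown here):
WAIC(m_narrow, refresh = 0)
## As the prior becomes more concentrated, the effective number of parameters (the penalty term pWAIC) gets smaller. A tighter prior constrains the posterior, so the log-likelihood of each observation varies less from one posterior sample to the next, and the summed variance that defines pWAIC shrinks. Comparing the penalty column for m_wide and m_narrow illustrates this, although the difference can be small here because the sample is large relative to the priors.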
7M5. Provide an informal explanation of why informative priors reduce overfitting.
## Because informative (regularizing) priors constrain the flexibility of the model: they make the model skeptical of extreme parameter values, so it learns less from the idiosyncratic noise in the sample and therefore overfits less. A small demonstration follows below.
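## A minimal sketch on simulated data (the data, seed, and model here are assumptions made purely for illustration): the same model is fit with a nearly flat prior and with a regularizing prior on the slope, and the regularized estimate is pulled toward the prior mean, i.e. the model learns less from the sample.
set.seed(11)
n <- 20
x <- rnorm(n)
dsim <- data.frame(x = x, y = rnorm(n, 0.5 * x, 1))
m_flat <- map(
  alist(
    y ~ dnorm(mu, 1),
    mu <- b * x,
    b ~ dnorm(0, 10)    # nearly flat prior
  ),
  data = dsim
)
m_reg <- map(
  alist(
    y ~ dnorm(mu, 1),
    mu <- b * x,
    b ~ dnorm(0, 0.2)   # regularizing prior
  ),
  data = dsim
)
coef(m_flat)["b"]
coef(m_reg)["b"]        # shrunk toward 0 relative to the flat-prior estimate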
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
## Because overly informative priors constrain the flexibility of the model too much: the model cannot learn enough of the regular features of the sample, so it underfits and predicts poorly both in and out of sample.