The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical function compare in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The first criterion is continuity: the measure of uncertainty must be a continuous function of the probabilities, so that an arbitrarily small change in any probability produces only a small change in the measured uncertainty, never a sudden jump.
# The second criterion is that the measure of uncertainty should increase as the number of possible events increases. With more potential outcomes, there are more ways to be wrong about the outcome, so the uncertainty is larger.
# The third criterion is additivity: the measure of uncertainty should be additive. When we combine two independent sets of outcomes, the uncertainty of the combination should be the sum of the separate uncertainties.
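# As an optional sketch of the second criterion (the helper H below is my own, not required by the question): the entropy of a uniform distribution grows as the number of possible outcomes grows.
H <- function(p) -sum(p * log(p))        # information entropy
sapply(2:5, function(k) H(rep(1/k, k)))  # increases with the number of outcomes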
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
# We can compute this directly:
p <- c(0.7, 0.3)
entropy_p <- -1 * sum(p * log(p))
entropy_p
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
# Similarly, we can compute this directly:
p <- c(0.2, 0.25, 0.25, 0.3)
entropy_p <- -1 * sum(p * log(p))
entropy_p
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
# We can compute this directly, dropping the face that never shows:
p <- c(1/3, 1/3, 1/3)
entropy_p <- -1 * sum(p * log(p))
entropy_p
## [1] 1.098612
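# Note that the face that never shows ("4") is simply dropped above, following the convention that 0 * log(0) is treated as 0 (its limit as p -> 0). A quick check (p4 is my own illustrative vector) confirms that including the zero explicitly gives the same answer.
p4 <- c(1/3, 1/3, 1/3, 0)
-sum(ifelse(p4 == 0, 0, p4 * log(p4)))   # same value, log(3) ~ 1.098612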
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is defined as AIC = -2*lppd + 2p, where lppd is the log-pointwise-predictive-density of the sample and p is the number of free parameters. WAIC is defined as WAIC = -2*(lppd - pWAIC), where the penalty pWAIC is the sum, over observations, of the variance of the log-likelihood across posterior samples.
# WAIC is more general, since it makes no assumptions about the shape of the posterior. AIC is reliable only under these assumptions: the priors are flat or overwhelmed by the likelihood; the posterior distribution is approximately multivariate Gaussian; and the sample size N is much greater than the number of parameters k.
# When all of those assumptions are met, the more general criterion (WAIC) reduces to the less general one (AIC).
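# As an optional sketch (not part of the question), WAIC's two pieces can be computed by hand from posterior samples. The data y_demo and model m_demo below are made up for illustration; extract.samples and log_sum_exp come from the rethinking package.
library(rethinking)
y_demo <- rnorm(50)                       # made-up data, standard normal
m_demo <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 1)
  ),
  data = list(y = y_demo)
)
post <- extract.samples(m_demo, n = 1e4)  # posterior draws of mu
# log-likelihood of each observation under each posterior draw (rows = draws)
ll <- sapply(y_demo, function(yi) dnorm(yi, post$mu, 1, log = TRUE))
lppd  <- apply(ll, 2, function(x) log_sum_exp(x) - log(length(x)))  # pointwise lppd
pWAIC <- apply(ll, 2, var)                                          # pointwise penalty
-2 * (sum(lppd) - sum(pWAIC))             # matches WAIC(m_demo) up to simulation noise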
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# Model selection is the practice of choosing the model with the lowest criterion value and discarding the rest. Under model selection we lose the information about relative model accuracy contained in the differences among the criterion values, which tells us how confident we can be in the ranking of the models.
# Model comparison is the practice of keeping several models and comparing them, in order to understand how the variables influence predictions and, together with a causal model and its implied conditional independencies, to help infer causal relationships.
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# Information criteria are based on deviance, which accumulates the pointwise log predictive density over all observations. With more observations the sum grows in magnitude, so a model fit to more data gets a larger (apparently worse) value regardless of how good the model actually is.
# If the models were fit to different numbers of observations, the criterion values would therefore not be comparable and would not represent the true differences between the models.
# We can see this with a small experiment using two samples of different sizes.
d1 <- rnorm(100)
d2 <- rnorm(1000)
m_d1 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 1)
  ),
  data = list(y = d1)
)
m_d2 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 1)
  ),
  data = list(y = d2)
)
WAIC(m_d1, refresh = 0)
## WAIC lppd penalty std_err
## 1 279.0347 -138.5974 0.9199038 17.15144
WAIC(m_d2, refresh = 0)
## WAIC lppd penalty std_err
## 1 2780.832 -1389.477 0.9391666 42.88847
# From the WAIC values, the model fit to 100 observations appears to have much better predictive accuracy, since its value is far lower. But the two models are identical; only the number of observations being summed over differs. So the comparison is meaningless and does not reflect any real difference between the models.
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# As the prior becomes more concentrated, the effective number of parameters decreases. This is because the WAIC penalty term pWAIC is the sum, over observations, of the variance of the log-likelihood across posterior samples. A concentrated prior constrains the plausible range of the parameters, so the posterior varies less and the penalty (effective number of parameters) shrinks.
# Let's simulate this scenario with two priors of different widths.
d <- rnorm(100)
# prior standard deviations: p1 gives the narrower (more concentrated) prior
p1 <- 100
p2 <- 1000
m_p1 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, prior_sd)
  ),
  data = list(y = d, prior_sd = p1)
)
m_p2 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, prior_sd)
  ),
  data = list(y = d, prior_sd = p2)
)
WAIC(m_p1, refresh = 0)
## WAIC lppd penalty std_err
## 1 282.7351 -140.3777 0.9898131 14.1481
WAIC(m_p2, refresh = 0)
## WAIC lppd penalty std_err
## 1 282.7871 -140.3786 1.014964 14.17621
# As the results show, the model with the more concentrated (smaller standard deviation) prior has a smaller penalty, i.e. fewer effective parameters. The difference here is small because both priors are still very flat; with a truly tight prior the drop is more pronounced, as sketched below.
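# As a further check (not required by the question), we can refit with a genuinely concentrated prior (sd = 0.1) versus a flat one (sd = 10), reusing the data d from above; the penalty for the tight prior should be noticeably smaller.
m_tight <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 0.1)    # concentrated prior
  ),
  data = list(y = d)
)
m_flat <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 10)     # flat prior
  ),
  data = list(y = d)
)
WAIC(m_tight)   # penalty (effective number of parameters) should be noticeably smaller
WAIC(m_flat)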
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# As discussed in 7M4, informative priors limit the plausible values of the parameters. This makes the model less flexible, so it learns less from the idiosyncratic noise in the sample and therefore overfits less.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# When the prior becomes too informative, the parameter space is so restricted that the model cannot learn even the regular features of the sample. It then fails to capture the real structure in the data, so both in-sample fit and future predictions suffer: the model underfits.
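# To illustrate with a toy example (not required by the question; d_under and m_under are my own made-up names): an overly tight prior centered far from the data pins the posterior near the prior, and the resulting underfit shows up as a very poor lppd in WAIC.
d_under <- rnorm(100)                     # data centered near 0
m_under <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(10, 0.01)                  # overly informative prior, far from the data
  ),
  data = list(y = d_under)
)
precis(m_under)   # posterior mean for mu stays near 10, ignoring the data
WAIC(m_under)     # lppd is very poor: the model underfits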