The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical function compare in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you. So models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample. So you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The first criterion is continuity: the measure of uncertainty must be a continuous function of the probabilities, so that an arbitrarily small change in any probability produces only a small change in the measured uncertainty, never a sudden jump.
# The second criterion is that the measure of uncertainty should increase as the number of possible events increases. With more potential outcomes, there are more ways to be wrong about the outcome, so the uncertainty is larger.
# The third criterion is additivity: the measure of uncertainty should be additive. When we combine two independent sets of outcomes, the uncertainty of the combination should be the sum of the separate uncertainties.
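# As an optional sketch of the second criterion (the helper H below is my own, not required by the question): the entropy of a uniform distribution grows as the number of possible outcomes grows.
H <- function(p) -sum(p * log(p))        # information entropy
sapply(2:5, function(k) H(rep(1/k, k)))  # increases with the number of outcomes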
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
# We can compute this directly:
p <- c(0.7, 0.3)
entropy_p <- -1 * sum(p * log(p))
entropy_p
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
# Similarly, we can compute this directly:
p <- c(0.2, 0.25, 0.25, 0.3)
entropy_p <- -1 * sum(p * log(p))
entropy_p
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
# We can compute this directly, dropping the face that never shows:
p <- c(1/3, 1/3, 1/3)
entropy_p <- -1 * sum(p * log(p))
entropy_p
## [1] 1.098612
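# Note that the face that never shows ("4") is simply dropped above, following the convention that 0 * log(0) is treated as 0 (its limit as p -> 0). A quick check (p4 is my own illustrative vector) confirms that including the zero explicitly gives the same answer.
p4 <- c(1/3, 1/3, 1/3, 0)
-sum(ifelse(p4 == 0, 0, p4 * log(p4)))   # same value, log(3) ~ 1.098612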
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is defined as AIC = -2*lppd + 2p, where lppd is the log-pointwise-predictive-density of the sample and p is the number of free parameters. WAIC is defined as WAIC = -2*(lppd - pWAIC), where the penalty pWAIC is the sum, over observations, of the variance of the log-likelihood across posterior samples.
# WAIC is more general, since it makes no assumptions about the shape of the posterior. AIC is reliable only under these assumptions: the priors are flat or overwhelmed by the likelihood; the posterior distribution is approximately multivariate Gaussian; and the sample size N is much greater than the number of parameters k.
# When all of those assumptions are met, the more general criterion (WAIC) reduces to the less general one (AIC).
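# As an optional sketch (not part of the question), WAIC's two pieces can be computed by hand from posterior samples. The data y_demo and model m_demo below are made up for illustration; extract.samples and log_sum_exp come from the rethinking package.
library(rethinking)
y_demo <- rnorm(50)                       # made-up data, standard normal
m_demo <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 1)
  ),
  data = list(y = y_demo)
)
post <- extract.samples(m_demo, n = 1e4)  # posterior draws of mu
# log-likelihood of each observation under each posterior draw (rows = draws)
ll <- sapply(y_demo, function(yi) dnorm(yi, post$mu, 1, log = TRUE))
lppd  <- apply(ll, 2, function(x) log_sum_exp(x) - log(length(x)))  # pointwise lppd
pWAIC <- apply(ll, 2, var)                                          # pointwise penalty
-2 * (sum(lppd) - sum(pWAIC))             # matches WAIC(m_demo) up to simulation noise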
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# Model selection is the practice of choosing the model with the lowest criterion value and discarding the rest. Under model selection we lose the information about relative model accuracy contained in the differences among the criterion values, which tells us how confident we can be in the ranking of the models.
# Model comparison is the practice of keeping several models and comparing them, in order to understand how the variables influence predictions and, together with a causal model and its implied conditional independencies, to help infer causal relationships.
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# Information criteria are based on deviance, which accumulates the pointwise log predictive density over all observations. With more observations the sum grows in magnitude, so a model fit to more data gets a larger (apparently worse) value regardless of how good the model actually is.
# If the models were fit to different numbers of observations, the criterion values would therefore not be comparable and would not represent the true differences between the models.
# We can see this with a small experiment using two samples of different sizes.
d1 <- rnorm(100)
d2 <- rnorm(1000)
m_d1 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 1)
  ),
  data = list(y = d1)
)
m_d2 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 1)
  ),
  data = list(y = d2)
)
WAIC(m_d1, refresh = 0)
## WAIC lppd penalty std_err
## 1 279.0347 -138.5974 0.9199038 17.15144
WAIC(m_d2, refresh = 0)
## WAIC lppd penalty std_err
## 1 2780.832 -1389.477 0.9391666 42.88847
# From the WAIC values, the model fit to 100 observations appears to have much better predictive accuracy, since its value is far lower. But the two models are identical; only the number of observations being summed over differs. So the comparison is meaningless and does not reflect any real difference between the models.
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# As the prior becomes more concentrated, the effective number of parameters decreases. This is because the WAIC penalty term pWAIC is the sum, over observations, of the variance of the log-likelihood across posterior samples. A concentrated prior constrains the plausible range of the parameters, so the posterior varies less and the penalty (effective number of parameters) shrinks.
# Let's simulate this scenario with two priors of different widths.
d <- rnorm(100)
# prior standard deviations: p1 gives the narrower (more concentrated) prior
p1 <- 100
p2 <- 1000
m_p1 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, prior_sd)
  ),
  data = list(y = d, prior_sd = p1)
)
m_p2 <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, prior_sd)
  ),
  data = list(y = d, prior_sd = p2)
)
WAIC(m_p1, refresh = 0)
## WAIC lppd penalty std_err
## 1 282.7351 -140.3777 0.9898131 14.1481
WAIC(m_p2, refresh = 0)
## WAIC lppd penalty std_err
## 1 282.7871 -140.3786 1.014964 14.17621
# As the results show, the model with the more concentrated (smaller standard deviation) prior has a smaller penalty, i.e. fewer effective parameters. The difference here is small because both priors are still very flat; with a truly tight prior the drop is more pronounced, as sketched below.
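# As a further check (not required by the question), we can refit with a genuinely concentrated prior (sd = 0.1) versus a flat one (sd = 10), reusing the data d from above; the penalty for the tight prior should be noticeably smaller.
m_tight <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 0.1)    # concentrated prior
  ),
  data = list(y = d)
)
m_flat <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, 10)     # flat prior
  ),
  data = list(y = d)
)
WAIC(m_tight)   # penalty (effective number of parameters) should be noticeably smaller
WAIC(m_flat)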
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# As discussed in 7M4, informative priors limit the plausible values of the parameters. This makes the model less flexible, so it learns less from the idiosyncratic noise in the sample and therefore overfits less.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# When the prior becomes too informative, the parameter space is so restricted that the model cannot learn even the regular features of the sample. It then fails to capture the real structure in the data, so both in-sample fit and future predictions suffer: the model underfits.
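# To illustrate with a toy example (not required by the question; d_under and m_under are my own made-up names): an overly tight prior centered far from the data pins the posterior near the prior, and the resulting underfit shows up as a very poor lppd in WAIC.
d_under <- rnorm(100)                     # data centered near 0
m_under <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(10, 0.01)                  # overly informative prior, far from the data
  ),
  data = list(y = d_under)
)
precis(m_under)   # posterior mean for mu stays near 10, ignoring the data
WAIC(m_under)     # lppd is very poor: the model underfits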