7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.

# 1 Continuity: the measure of uncertainty should be continuous, so that a small change in any of the probabilities produces only a small change in the measure.
# 2 It should increase with the size of the possibility space: the more possible outcomes there are, the greater the uncertainty.
# 3 It should be additive for independent events: the uncertainty of a combination of events should equal the sum of the separate uncertainties, no matter how the events are subdivided (see the quick numerical check below).
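# As a quick check of criterion 3, the entropy of the joint distribution of two
# independent events equals the sum of their individual entropies. The
# probabilities below are hypothetical, chosen only for illustration.
H <- function(p) -sum(p * log(p))
p_rain <- c(0.3, 0.7)                        # hypothetical rain / no rain
p_coin <- c(0.5, 0.5)                        # fair coin
p_joint <- as.vector(outer(p_rain, p_coin))  # joint distribution, assuming independence
H(p_joint)                                   # equals ...
H(p_rain) + H(p_coin)                        # ... the sum of the two entropies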

7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

p = c(0.7, 1 - 0.7)
(H = -sum(p * log(p)))
## [1] 0.6108643

7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

p = c(0.2, 0.25, 0.25, 0.3)
(H = -sum(p * log(p)))
## [1] 1.376227

7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

p = c(1/3, 1/3, 1/3)
(H = -sum(p * log(p)))
## [1] 1.098612

7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

# AIC = D_train + 2p, where D_train is the in-sample deviance and p is the number of free parameters.
# AIC is an approximation that is reliable only when:
# 1 The priors are flat or overwhelmed by the likelihood.
# 2 The posterior distribution is approximately multivariate Gaussian.
# 3 The sample size N is much greater than the number of parameters k.

# WAIC = -2(lppd - pWAIC), where lppd is the log-pointwise-predictive-density and pWAIC is the effective number of parameters.
# WAIC makes no assumption about the shape of the posterior; it only relies on the sample size N being large relative to the number of parameters.
# WAIC is therefore the more general criterion. To get from WAIC to AIC, we additionally assume that the priors are flat or overwhelmed by the likelihood and that the posterior is approximately multivariate Gaussian. Under those conditions the two criteria give approximately the same estimate of out-of-sample deviance, as sketched below.
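# A minimal sketch of this equivalence, assuming the rethinking package and its
# Howell1 data (also used below). With effectively flat priors and N much larger
# than the number of parameters, the WAIC of a quadratic-approximation fit
# should land close to the AIC of the corresponding least-squares fit. The
# object names (m_bayes, m_ols, dd) are ours, not from the text.
library(rethinking)
data(Howell1)
dd <- Howell1[Howell1$age >= 18, ]
m_bayes <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * weight,
    a ~ dnorm(150, 100),   # very wide, effectively flat priors
    b ~ dnorm(0, 100),
    sigma ~ dunif(0, 50)
  ),
  data = dd
)
m_ols <- lm(height ~ weight, data = dd)
AIC(m_ols)                 # classical AIC
WAIC(m_bayes)              # should be close to the AIC value above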

7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?

# Model selection means choosing the model with the lowest criterion value and discarding the rest. What is lost are the differences between the criterion values (and the model weights built from them), which tell us how much more plausible one model's predictions are than another's. Model comparison instead keeps several models and uses those differences, together with causal reasoning, to understand how the variables influence predictions.

7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

# Information criteria are based on deviance, which is summed over observations and never divided by the number of observations. A model fit to more observations therefore accrues a larger deviance, and hence a larger criterion value, regardless of how well it predicts each observation. Models fit to different numbers of observations are consequently not on a common scale and cannot be compared.
# Next, we use the Howell1 dataset from the textbook to demonstrate this.

library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Loading required package: parallel
## Loading required package: dagitty
## rethinking (Version 2.01)
## 
## Attaching package: 'rethinking'
## The following object is masked from 'package:stats':
## 
##     rstudent
data(Howell1)
d <- Howell1
d$age <- (d$age - mean(d$age))/sd(d$age)
str(Howell1)
## 'data.frame':    544 obs. of  4 variables:
##  $ height: num  152 140 137 157 145 ...
##  $ weight: num  47.8 36.5 31.9 53 41.3 ...
##  $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
##  $ male  : int  1 0 0 1 0 1 0 1 0 1 ...
dim(Howell1)
## [1] 544   4
# Draw subsamples of different sizes (all 544 rows, 300 rows, and 100 rows),
# then fit the same height ~ log(weight) model to each.
d1 <- d[sample(1:nrow(d), size = 544, replace = FALSE), ]
d2 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
d3 <- d[sample(1:nrow(d), size = 100, replace = FALSE), ]
m1 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)),
  data = d1,
  start = list(a = mean(d1$height), b = 0, sigma = sd(d1$height))
)

m2 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)),
  data = d2,
  start = list(a = mean(d2$height), b = 0, sigma = sd(d2$height))
)

m3 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d3,
  start = list(a = mean(d3$height), b = 0, sigma = sd(d3$height))
)
(model.compare <- compare(m1, m2, m3))
## Warning in compare(m1, m2, m3): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m1 544 
## m2 300 
## m3 100
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length

## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
##         WAIC       SE    dWAIC      dSE    pWAIC        weight
## m3  625.4135 15.45509    0.000       NA 3.406130  1.000000e+00
## m2 1828.1659 25.92885 1202.752 36.34585 3.231599 6.693157e-262
## m1 3329.7860 36.97352 2704.372 52.62880 3.073955  0.000000e+00
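# As expected, compare() warns that the models were fit to different numbers of
# observations, and the WAIC values track sample size rather than predictive
# accuracy: m3 (100 rows) "wins" simply because its deviance is summed over
# fewer observations.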

7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

d$height_l <- log(d$height)
d$height_z <- (d$height_l - mean(d$height_l)) / sd(d$height_l)
d$weight_l <- log(d$weight)
d$weight_z <- (d$weight_l - mean(d$weight_l)) / sd(d$weight_l)
m_wide <- map(
  alist(
    height_z ~ dnorm(mu, sigma),
    mu <- a + b * weight_z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d
)
m_narrow <- map(
  alist(
    height_z ~ dnorm(mu, sigma),
    mu <- a + b * weight_z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = d
)
WAIC(m_wide, refresh = 0)
##        WAIC     lppd  penalty  std_err
## 1 -102.8361 55.62888 4.210814 36.47965
WAIC(m_narrow, refresh = 0)
##        WAIC     lppd  penalty  std_err
## 1 -102.4457 55.65986 4.436999 36.60965
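# As a prior becomes more concentrated, the posterior has less freedom to follow
# the sample, the pointwise log-likelihood varies less across posterior draws,
# and so the effective number of parameters (pWAIC, the "penalty" column) shrinks.
# In this particular experiment the penalty barely moves (about 4.2 vs 4.4),
# because with 544 observations the likelihood overwhelms even the narrower
# priors; with a smaller sample or far tighter priors the drop would be clearer.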

7M5. Provide an informal explanation of why informative priors reduce overfitting.

# Informative (regularizing) priors reduce overfitting because they constrain how far the parameters can move from plausible values. The model therefore cannot adopt extreme parameter values just to chase the noise in a particular sample, so it learns less from the idiosyncrasies of the training data. A small simulation is sketched below.
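# A minimal sketch of this idea on hypothetical simulated data (the data,
# object names, and prior widths are our own choices, not from the text):
# with only 20 observations, skeptical priors keep the slope estimate from
# chasing noise, and the regularized model typically shows a smaller WAIC
# penalty (pWAIC) than the flat-prior model.
library(rethinking)
set.seed(7)
N <- 20
sim <- data.frame(x = rnorm(N))
sim$y <- rnorm(N, 0.3 * sim$x, 1)   # true slope is small
m_flat <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 100),              # effectively flat priors
    b ~ dnorm(0, 100),
    sigma ~ dunif(0, 10)
  ),
  data = sim
)
m_reg <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 1),                # skeptical, regularizing priors
    b ~ dnorm(0, 0.5),
    sigma ~ dunif(0, 10)
  ),
  data = sim
)
compare(m_flat, m_reg)              # inspect the pWAIC column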

7M6. Provide an informal explanation of why overly informative priors result in underfitting.

# Overly informative priors constrain the parameters so tightly that the model cannot move them enough to capture the regular features of the sample. The prior overwhelms the likelihood, so the model learns too little from the data and underfits. The sketch below continues the 7M5 simulation to show this.
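# Continuing the hypothetical simulation from 7M5 (this reuses the sim data
# frame defined there): an extremely tight prior pins the slope near zero, so
# the model cannot track the real relationship between x and y.
m_tight <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 1),
    b ~ dnorm(0, 0.01),             # overly informative: forces b toward 0
    sigma ~ dunif(0, 10)
  ),
  data = sim
)
precis(m_tight)                     # the posterior mean of b is shrunk toward 0 (underfit)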