The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. Practical functions such as compare in the rethinking package were introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these predictive criteria can mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting: measure it with WAIC/PSIS and reduce it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
# The three criteria that motivate the definition of information entropy are as follows:
# Continuity: the measure of uncertainty should be continuous, so that a small change in any probability produces only a small change in measured uncertainty.
# Additivity: the uncertainty over combinations of independent events should equal the sum of the separate uncertainties.
# Scaling with possibilities: the measure should increase as the number of possible outcomes increases, because more possible events means more uncertainty about which one will occur.
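# As a quick numerical check (the probabilities are illustrative only), the additivity and scaling criteria can be verified with the same -sum(p * log(p)) formula used in the answers below:
H <- function(p) -sum(p * log(p))
p_coin <- c(0.7, 0.3)                       # a weighted coin
p_die  <- rep(0.25, 4)                      # a fair four-sided die
p_joint <- as.vector(outer(p_coin, p_die))  # joint distribution of the two independent events
H(p_joint)                                  # equals H(p_coin) + H(p_die): additivity
H(p_coin) + H(p_die)
sapply(2:6, function(n) H(rep(1 / n, n)))   # entropy grows with the number of equally likely outcomes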
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
The entropy of this coin is 0.61.
# From page 206, with p1 the probability of heads and p2 the probability of tails, we plug the values into H(p) = -sum(p_i * log(p_i)):
p <- c( 0.7 , 0.3 )
-sum( p*log(p) )
## [1] 0.6108643
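# As an optional sketch (not required by the question), the entropy of any two-outcome event can be plotted as a function of the probability of heads; the 0.7/0.3 coin sits below the maximum at p = 0.5:
p_heads <- seq(0.01, 0.99, by = 0.01)
H2 <- -(p_heads * log(p_heads) + (1 - p_heads) * log(1 - p_heads))
plot(p_heads, H2, type = "l", xlab = "probability of heads", ylab = "entropy")
abline(v = 0.7, lty = 2)   # the weighted coin from this question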
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
The entropy of this die is 1.38.
p <- c(0.20, 0.25, 0.25, 0.30)
-sum(p * log(p))
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
The entropy of the die is 1.10.
p <- c(1 / 3, 1 / 3, 1 / 3)
-sum(p * log(p))
## [1] 1.098612
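# Note that dropping the impossible outcome is not just a convenience: including a zero probability directly gives NaN in R, and by convention 0 * log(0) is treated as 0.
p4 <- c(1 / 3, 1 / 3, 1 / 3, 0)
-sum(p4 * log(p4))                        # NaN, because 0 * log(0) is undefined in R
-sum(ifelse(p4 == 0, 0, p4 * log(p4)))    # 1.098612, same as above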
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
# AIC is defined as AIC = D_train + 2p, where D_train is the in-sample training deviance and p is the number of free parameters estimated by the model. It is reliable only when: (1) the priors are flat or overwhelmed by the likelihood, (2) the posterior distribution is approximately multivariate Gaussian, and (3) the sample size N is much greater than the number of parameters.
# WAIC is defined as WAIC = -2(lppd - pWAIC) = -2( sum_{i=1}^{N} log Pr(y_i) - sum_{i=1}^{N} V(y_i) ), where Pr(y_i) is the average likelihood of observation i in the training sample and V(y_i) is the variance in the log-likelihood of observation i across the posterior. It makes no assumption about the shape of the posterior; it requires only that the sample size N is much greater than the number of parameters.
# WAIC is the more general criterion. To move from WAIC to AIC, we must additionally assume that the posterior distribution is approximately multivariate Gaussian and that the priors are flat or overwhelmed by the likelihood.
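# As a minimal sketch of the WAIC formula above (toy data and crude posterior draws, chosen only for illustration), WAIC can be computed by hand from a matrix of pointwise log-likelihoods, with rows as posterior samples and columns as observations:
set.seed(7)
y <- rnorm(20, mean = 1, sd = 1)                                      # toy observations
mu_post <- rnorm(1000, mean = mean(y), sd = sd(y) / sqrt(length(y)))  # rough posterior draws for mu
ll <- sapply(y, function(yi) dnorm(yi, mean = mu_post, sd = 1, log = TRUE))
lppd  <- sum(log(colMeans(exp(ll))))     # log pointwise predictive density
pWAIC <- sum(apply(ll, 2, var))          # effective number of parameters (penalty)
-2 * (lppd - pWAIC)                      # WAIC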
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
# Under model selection, we use an information criterion to choose the single model with the lowest WAIC or PSIS value and discard the rest. Under model comparison, we keep the whole set of models and use the differences among their criterion values (together with the standard errors of those differences and the resulting weights) to learn about relative out-of-sample accuracy, and, if desired, to average predictions across the models. Model selection throws away the information about relative model accuracy contained in the differences among the criterion values; this is especially troublesome when the selected model only slightly outperforms its alternatives.
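# As a small illustration (the WAIC values below are made up), the differences among criterion values can be converted into Akaike-style weights; these weights carry the relative-accuracy information that is discarded when we simply pick the lowest-WAIC model:
waic_vals <- c(m1 = 410.2, m2 = 411.0, m3 = 425.7)   # hypothetical WAIC values
dWAIC <- waic_vals - min(waic_vals)
w <- exp(-0.5 * dWAIC) / sum(exp(-0.5 * dWAIC))
round(w, 3)   # m1 and m2 are nearly indistinguishable; selection would hide this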
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
# The information criteria are based on deviance, which is summed over observations rather than averaged, so it is never divided by the number of observations. A model fit to more observations will therefore have a larger deviance and a larger (apparently worse) WAIC simply because there are more terms in the sum, which is why it is not valid to compare models fit to different numbers of observations.
# To confirm this, we can compute WAIC for the same model fit to different numbers of observations.
library(rethinking)
data(Howell1)
str(Howell1)
## 'data.frame': 544 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
d <- Howell1[complete.cases(Howell1), ]
d100 <- d[sample(1:nrow(d), size = 100, replace = FALSE), ]
d200 <- d[sample(1:nrow(d), size = 200, replace = FALSE), ]
d300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
m100 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d100,
start = list(a = mean(d100$height), b = 0, sigma = sd(d100$height))
)
m200 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d200,
start = list(a = mean(d200$height), b = 0, sigma = sd(d200$height))
)
m300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d300,
start = list(a = mean(d300$height), b = 0, sigma = sd(d300$height))
)
set.seed(77)
compare( m100, m200, m300 , func=WAIC )
## Warning in compare(m100, m200, m300, func = WAIC): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m100 100
## m200 200
## m300 300
## Warning in ic_ptw1 - ic_ptw2: longer object length is not a multiple of shorter
## object length
## WAIC SE dWAIC dSE pWAIC weight
## m100 609.4032 15.13361 0.0000 NA 3.045085 1.000000e+00
## m200 1217.1886 20.00248 607.7855 20.94896 2.942079 1.049700e-132
## m300 1801.0821 25.04086 1191.6790 20.77374 2.824663 1.699037e-259
# Comparing the three models fit to 100, 200, and 300 observations, we can see that WAIC grows with the number of observations even though the model is the same. compare() also issues a warning that the numbers of observations differ across models.
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
# As a prior becomes more concentrated, the effective number of parameters decreases. A more concentrated prior makes the model less flexible, so the posterior is less able to chase idiosyncrasies of the sample, which is what the PSIS penalty reflects. For WAIC, pWAIC is the sum of the variances in the pointwise log-likelihood across posterior samples; as the prior becomes more concentrated, the posterior (and hence the pointwise log-likelihood) becomes more concentrated as well, so that variance shrinks.
d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_not_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d
)
m_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d
)
WAIC(m_not_concentrated, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.4112 55.68215 4.476531 36.501
WAIC(m_concentrated, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.6269 55.63468 4.321246 36.46326
# With a more concentrated prior, the pWAIC decreases.
7M5. Provide an informal explanation of why informative priors reduce overfitting.
# Informative priors reduce overfitting by limiting how much the model learns from the sample: implausible parameter values receive low prior probability, so the model is less free to encode the noise in any particular sample into its parameter estimates.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
# Overly informative priors result in underfitting because they constrain the flexibility of the model too much; they make it less likely for “correct” parameter values to be assigned high posterior probability.
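# As an optional sketch (the prior values are chosen only for illustration), an overly tight prior centered in the wrong place, b ~ dnorm(0, 0.01), pins the slope near zero, so the model underfits the strong height-weight relationship in the Howell1 data and its WAIC should be noticeably worse than the models in 7M4:
m_too_concentrated <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(0, 0.01),
sigma ~ dunif(0, 1)
),
data = d
)
precis(m_too_concentrated)               # b is shrunk toward 0, far from the slope the data support
WAIC(m_too_concentrated, refresh = 0)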