The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, then these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.
# Deviance, lppd, and WAIC are all sums over observations, so a model fit to more observations will mechanically have a larger WAIC regardless of how well it fits each individual observation. Comparing models fit to different numbers of observations would therefore confound sample size with model quality.
# To confirm this, WAIC can be computed for the same model fit to different numbers of observations.
library(rethinking)
#Experiment
data(Howell1)
set.seed(6)
d1 <- Howell1[complete.cases(Howell1), ]
d200 <- d1[sample(1:nrow(d1), size = 200, replace = FALSE), ]
d300 <- d1[sample(1:nrow(d1), size = 300, replace = FALSE), ]
d400 <- d1[sample(1:nrow(d1), size = 400, replace = FALSE), ]
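# Fit the same model (height regressed on log weight; no priors are specified, so map uses flat priors with the given start values) to the 200-, 300-, and 400-observation subsets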
m200 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d200,
start = list(a = mean(d200$height), b = 0, sigma = sd(d200$height))
)
m300 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d300,
start = list(a = mean(d300$height), b = 0, sigma = sd(d300$height))
)
m400 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight)
),
data = d400,
start = list(a = mean(d400$height), b = 0, sigma = sd(d400$height))
)
(model.compare <- compare(m200, m300, m400))
## WAIC SE dWAIC dSE pWAIC weight
## m200 1245.284 24.21065 0.0000 NA 3.534788 1.000000e+00
## m300 1837.213 25.96781 591.9297 30.77817 3.014162 2.911310e-129
## m400 2398.874 33.25872 1153.5904 34.01419 3.453782 3.169806e-251
# Comparing the three models fit to 200, 300, and 400 observations shows that WAIC increases with the number of observations, even though the model itself is unchanged.
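# As a rough check (a sketch, not required by the question), dividing each model's WAIC by its sample size gives per-observation values that are roughly comparable, confirming that the differences above are driven by sample size rather than fit quality.
c(WAIC(m200)$WAIC / 200, WAIC(m300)$WAIC / 300, WAIC(m400)$WAIC / 400)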
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.
# The effective number of parameters decreases as the prior becomes more concentrated.
# For WAIC, the penalty (pWAIC) is the sum, over observations, of the variance of each observation's log-likelihood across the posterior.
# As the prior becomes more concentrated, the posterior becomes more concentrated as well, so these variances shrink.
# For PSIS, the same logic applies: a more concentrated prior makes the model less flexible, so it is less prone to overfitting and the effective number of parameters falls.
#Experiment
d2 <- Howell1[complete.cases(Howell1), ]
d2$height.log <- log(d2$height)
d2$height.log.z <- (d2$height.log - mean(d2$height.log)) / sd(d2$height.log)
d2$weight.log <- log(d2$weight)
d2$weight.log.z <- (d2$weight.log - mean(d2$weight.log)) / sd(d2$weight.log)
model_initialprior <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = d2
)
WAIC(model_initialprior, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.5922 55.64071 4.344601 36.53807
#Concentrating the prior
model_concentratedprior <- map(
alist(
height.log.z ~ dnorm(mu, sigma),
mu <- a + b * weight.log.z,
a ~ dnorm(0, 0.10),
b ~ dnorm(1, 0.10),
sigma ~ dunif(0, 1)
),
data = d2
)
WAIC(model_concentratedprior, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7406 55.59951 4.229228 36.50247
# The penalty (pWAIC) decreases with the more concentrated prior, though only slightly here, because the sample is large enough that the data dominate these priors.
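# As a cross-check (a sketch, assuming a version of the rethinking package that provides PSIS()), the PSIS penalty should show the same qualitative pattern: a smaller effective number of parameters under the tighter priors.
PSIS(model_initialprior)
PSIS(model_concentratedprior)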
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
| Island | Species A | Species B | Species C | Species D | Species E |
|--------|-----------|-----------|-----------|-----------|-----------|
| 1      | 0.2       | 0.2       | 0.2       | 0.2       | 0.2       |
| 2      | 0.8       | 0.1       | 0.05      | 0.025     | 0.025     |
| 3      | 0.05      | 0.15      | 0.7       | 0.05      | 0.05      |
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
library(rethinking)
# The more evenly the proportions are spread across species, the higher the entropy. Island 1 has equal proportions of each species, so it has the maximum possible entropy for five categories, log(5) ≈ 1.61; Island 2 has the lowest entropy, since most of its birds are species A.
# Computing the K-L divergences shows that Island 1 predicts the other islands best: because it has the highest entropy, its distribution is the least surprised by the species proportions found on the other islands.
# Species proportions for the three islands (rows) and five species (columns)
d <- matrix(
  c(0.2,  0.2,  0.2,  0.2,   0.2,
    0.8,  0.1,  0.05, 0.025, 0.025,
    0.05, 0.15, 0.7,  0.05,  0.05),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste("Island", 1:3), c("A", "B", "C", "D", "E"))
)
getIsland <- function(island) {
  # return the island's species proportions as a plain numeric vector
  as.numeric(d[island, ])
}
getIslandEntropy <- function(island) {
  # Shannon entropy: H(p) = -sum(p * log(p))
  p <- getIsland(island)
  -sum(p * log(p))
}
klDivergence <- function(islandP, islandQ) {
  # K-L divergence D(p || q): how surprised island P's birds are when predicted by island Q's distribution
  p <- getIsland(islandP)
  q <- getIsland(islandQ)
  sum(p * log(p / q))
}
print("Entropy for Island 1:")
## [1] "Entropy for Island 1:"
getIslandEntropy(1)
## [1] 1.609438
print("Entropy for Island 2:")
## [1] "Entropy for Island 2:"
getIslandEntropy(2)
## [1] 0.7430039
print("Entropy for Island 3:")
## [1] "Entropy for Island 3:"
getIslandEntropy(3)
## [1] 0.9836003
print("K-L divergence between 1 and 2:")
## [1] "K-L divergence between 1 and 2:"
klDivergence(1, 2)
## [1] 0.9704061
print("K-L divergence between 2 and 1:")
## [1] "K-L divergence between 2 and 1:"
klDivergence(2, 1)
## [1] 0.866434
print("K-L divergence between 2 and 3:")
## [1] "K-L divergence between 2 and 3:"
klDivergence(2, 3)
## [1] 2.010914
print("K-L divergence between 3 and 2:")
## [1] "K-L divergence between 3 and 2:"
klDivergence(3, 2)
## [1] 1.838845
print("K-L divergence between 1 and 3:")
## [1] "K-L divergence between 1 and 3:"
klDivergence(1, 3)
## [1] 0.6387604
print("K-L divergence between 3 and 1:")
## [1] "K-L divergence between 3 and 1:"
klDivergence(3, 1)
## [1] 0.6258376
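# As an extra summary (a sketch beyond what the question asks): the average divergence when each island is used as the model q to predict the other two. Island 1 should have the lowest average, confirming it predicts the others best.
sapply(1:3, function(q) mean(sapply(setdiff(1:3, q), function(p) klDivergence(p, q))))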
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
# m6.9, which conditions on marriage status, has the lower WAIC, so it is expected to make better out-of-sample predictions.
# m6.10, which omits marriage status, provides the correct causal inference: in the simulation, age has no causal influence on happiness.
# The answers disagree because marriage status is a collider of age and happiness. Conditioning on it opens a non-causal path and induces a statistical association between age and happiness. That association is real in the data, so it genuinely improves prediction, but it does not reflect a causal effect of age.
d = sim_happiness(seed=1977, N_years = 1000)
d2 = d[d$age > 17,]
d2$A = (d2$age - 18) / (65 - 18)
d2$mid = d2$married + 1
m6.9 = quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a[mid] + bA * A,
a[mid] ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
), data=d2
)
m6.10 = quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
), data=d2
)
WAIC(m6.9)
## WAIC lppd penalty std_err
## 1 2713.971 -1353.247 3.738532 37.54465
WAIC(m6.10)
## WAIC lppd penalty std_err
## 1 3101.906 -1548.612 2.340445 27.74379
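# As a cross-check (a sketch, not part of the original output), compare() reports the WAIC difference and its standard error directly, and precis() should make the collider bias visible: bA strongly negative in m6.9 but close to zero in m6.10.
compare(m6.9, m6.10)
precis(m6.9, depth = 2)
precis(m6.10)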
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables: (1) avgfood + groupsize + area, (2) avgfood + groupsize, (3) groupsize + area, (4) avgfood, and (5) area.
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
# According to the fox DAG from the previous chapter, area influences weight only through avgfood, and avgfood influences weight both directly and indirectly through groupsize.
# The first three models, which all include groupsize plus avgfood and/or area, have nearly identical WAIC scores: once groupsize is in the model, avgfood and area carry essentially the same information, so the WAIC differences among them are small relative to their dSE.
# The last two models (avgfood alone, area alone) do noticeably worse. The direct effect of avgfood on weight and its indirect effect through groupsize roughly cancel, so the total effect of avgfood (and therefore of area) is close to zero and these single-predictor models have little predictive power.
data(foxes)
# avgfood + groupsize + area
model_1 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bf * avgfood + bg * groupsize + ba * area,
sigma ~ dexp(1),
a ~ dnorm(4, 1),
bf ~ dnorm(2, 1),
ba ~ dnorm(0, 1),
bg ~ dnorm(0, 1)
), data=foxes
)
WAIC(model_1)
## WAIC lppd penalty std_err
## 1 361.0342 -176.1679 4.349176 16.63425
# avgfood + groupsize
model_2 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bf * avgfood + bg * groupsize,
sigma ~ dexp(1),
a ~ dnorm(4, 1),
bf ~ dnorm(2, 1),
bg ~ dnorm(0, 1)
), data=foxes
)
WAIC(model_2)
## WAIC lppd penalty std_err
## 1 362.4916 -177.8684 3.377455 15.86947
# groupsize + area
model_3 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bg * groupsize + ba * area,
sigma ~ dexp(1),
a ~ dnorm(4, 1),
ba ~ dnorm(0, 1),
bg ~ dnorm(0, 1)
), data=foxes
)
WAIC(model_3)
## WAIC lppd penalty std_err
## 1 363.1302 -177.7101 3.854931 15.97327
# avgfood
model_4 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bf * avgfood,
sigma ~ dexp(1),
a ~ dnorm(4, 1),
bf ~ dnorm(2, 1)
), data=foxes
)
WAIC(model_4)
## WAIC lppd penalty std_err
## 1 373.3557 -184.2908 2.387063 13.63095
# area
model_5 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + ba * area,
sigma ~ dexp(1),
a ~ dnorm(4, 1),
ba ~ dnorm(0, 1)
), data=foxes
)
WAIC(model_5)
## WAIC lppd penalty std_err
## 1 372.9585 -183.8168 2.662476 13.71016
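# The question asks about the standard error of the score differences (dSE), which the individual WAIC() calls above do not report. A side-by-side comparison (a sketch of the intended check) should show that the differences among the first three models are small relative to their dSE, while models 4 and 5 sit further away.
compare(model_1, model_2, model_3, model_4, model_5)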