The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, and WAIC and PSIS help estimate the degree of overfitting. The practical function compare in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, these tools will mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.
set.seed(10)
# Information criteria are based on the deviance, which is a sum of log-probability scores over all observations. Fitting models to different numbers of observations means summing different numbers of terms: more observations generally produce a larger total deviance, so the resulting criterion values cannot be compared across models fit to different samples.
# Example
library(rethinking)
data(Howell1)
data <- Howell1[complete.cases(Howell1), ]
data1 <- data[sample(1:nrow(data), size = 100, replace = TRUE), ]
data2 <- data[sample(1:nrow(data), size = 150, replace = TRUE), ]
data3 <- data[sample(1:nrow(data), size = 250, replace = TRUE), ]
data4 <- data[sample(1:nrow(data), size = 350, replace = TRUE), ]
m1 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data1,
start = list(a = mean(data1$height), b = 0, sigma = sd(data1$height))
)
m2 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data2,
start = list(a = mean(data2$height), b = 0, sigma = sd(data2$height))
)
m3 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data3,
start = list(a = mean(data3$height), b = 0, sigma = sd(data3$height))
)
m4 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight
),
data = data4,
start = list(a = mean(data4$height), b = 0, sigma = sd(data4$height))
)
(model.compare <- compare(m1, m2, m3, m4))
## WAIC SE dWAIC dSE pWAIC weight
## m1 163.9087 8.64472 0.0000 NA 2.346737 1.000000e+00
## m2 224.6435 12.38355 60.7348 13.72386 2.488805 6.480474e-14
## m3 355.0308 16.16059 191.1221 12.56882 2.505492 3.150396e-42
## m4 481.4305 24.31944 317.5219 15.45877 2.629179 1.124602e-69
# Comment: With more observations, WAIC grows: 164 -> 225 -> 355 -> 481. The totals are not comparable because each model is scored on a different number of (and different) observations.
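# As a quick sanity check (a sketch, reusing the WAIC totals printed above):
# WAIC grows roughly in proportion to the number of observations, so the
# per-observation values are far closer to one another than the raw totals.
n_obs <- c(100, 150, 250, 350)
waic_total <- c(163.9, 224.6, 355.0, 481.4)
round(waic_total / n_obs, 2) # per-observation WAIC is on a comparable scale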
7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.
# When the prior becomes more concentrated, the plausible range of parameter values shrinks, so the model is less free to fit the sample. The effective number of parameters therefore decreases as priors become more concentrated.
# PSIS: the model becomes less flexible as the prior becomes more concentrated, which shows up as a smaller penalty term.
# WAIC: the penalty is the variance of the log-likelihood across the posterior; a more concentrated prior constrains the posterior, reducing that variance and hence the penalty.
data <- Howell1[complete.cases(Howell1), ]
data$height_std <- (log(data$height) - mean (log(data$height))) / sd(log(data$height))
data$weight_std <- (log(data$weight) - mean (log(data$weight))) / sd(log(data$weight))
m1 <- map(
alist(
height_std ~ dnorm(mu, sigma),
mu <- a + b * weight_std,
a ~ dnorm(0, 10),
b ~ dnorm(1, 10),
sigma ~ dunif(0, 10)
),
data = data
)
m2 <- map(
alist(
height_std ~ dnorm(mu, sigma),
mu <- a + b * weight_std,
a ~ dnorm(0, 1),
b ~ dnorm(1, 1),
sigma ~ dunif(0, 1)
),
data = data
)
WAIC(m1, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.7642 55.62751 4.245435 36.41576
WAIC(m2, refresh = 0)
## WAIC lppd penalty std_err
## 1 -102.5433 55.6371 4.365429 36.43306
PSIS(m1)
## PSIS lppd penalty std_err
## 1 -102.5679 51.28395 4.359383 36.49007
PSIS(m2)
## PSIS lppd penalty std_err
## 1 -102.5221 51.26106 4.41025 36.61891
# The differences here are negligible: the WAIC and PSIS estimates, and their penalty terms, barely change with the more concentrated priors, because the sample is large relative to this simple model and the data dominate the posterior. The effective number of parameters would shrink more visibly with far more concentrated priors or a smaller sample.
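# To see why the penalty shrinks with tighter priors, here is a minimal sketch
# (assuming m2 as fit above) that computes the WAIC penalty by hand: it is the
# sum over observations of the variance of the log-likelihood across posterior
# samples, so a more concentrated posterior yields a smaller penalty.
post <- extract.samples(m2, n = 1000)
ll <- sapply(1:nrow(data), function(i) {
  mu_i <- post$a + post$b * data$weight_std[i]
  dnorm(data$height_std[i], mu_i, post$sigma, log = TRUE)
})
sum(apply(ll, 2, var)) # should be close to the penalty reported by WAIC(m2)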
7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:
| Island | Species A | Species B | Species C | Species D | Species E |
|--------|-----------|-----------|-----------|-----------|-----------|
| 1      | 0.2       | 0.2       | 0.2       | 0.2       | 0.2       |
| 2      | 0.8       | 0.1       | 0.05      | 0.025     | 0.025     |
| 3      | 0.05      | 0.15      | 0.7       | 0.05      | 0.05      |
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
island1 <- c(0.2, 0.2, 0.2, 0.2, 0.2)
island2 <- c(0.8, 0.1, 0.05, 0.025, 0.025)
island3 <- c(0.05, 0.15, 0.7, 0.05, 0.05)
entropy1 <- -sum(island1 * log(island1))
entropy2 <- -sum(island2 * log(island2))
entropy3 <- -sum(island3 * log(island3))
entropy1
## [1] 1.609438
entropy2
## [1] 0.7430039
entropy3
## [1] 0.9836003
# Entropy measures how uncertain we are about which species a randomly observed bird from an island will belong to. Island 1 has the highest entropy (its distribution is uniform) and island 2 has the lowest, so uncertainty about which species we would encounter is greatest on island 1 and smallest on island 2.
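# Sanity check (a small sketch): the uniform distribution over 5 species is the
# maximum-entropy distribution, and its entropy equals log(5).
log(5) # = 1.609438, matching entropy1 for island 1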
DKL <- function(p,q) sum( p*(log(p)-log(q)) )
Dm <- matrix( NA , nrow=3 , ncol=3 )
Dm[1,1] <- DKL (island1, island1)
Dm[1,2] <- DKL (island1, island2)
Dm[1,3] <- DKL (island1, island3)
Dm[2,1] <- DKL (island2, island1)
Dm[2,2] <- DKL (island2, island2)
Dm[2,3] <- DKL (island2, island3)
Dm[3,1] <- DKL (island3, island1)
Dm[3,2] <- DKL (island3, island2)
Dm[3,3] <- DKL (island3, island3)
Dm
## [,1] [,2] [,3]
## [1,] 0.0000000 0.9704061 0.6387604
## [2,] 0.8664340 0.0000000 2.0109142
## [3,] 0.6258376 1.8388452 0.0000000
# Conclusion: reading the columns of Dm (each column uses that island as the model q), island 1 produces the lowest divergences when predicting the other islands. Its distribution has the highest entropy, so it is the least surprised by the other islands' distributions, and therefore predicts the others best.
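# A short summary of the matrix (a sketch using Dm from above): averaging each
# column, i.e. the divergences obtained when that island is used as the model q
# (excluding the zero diagonal), confirms that island 1 gives the smallest
# average divergence.
colnames(Dm) <- rownames(Dm) <- paste0("island", 1:3)
round(apply(Dm, 2, function(x) mean(x[x > 0])), 3)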
7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
## R code 6.21
d <- sim_happiness(seed = 1977, N_years = 1000)
## R code 6.22
d2 <- d[d$age > 17, ] # only adults
d2$A <- (d2$age - 18) / (65 - 18)
## R code 6.23
d2$mid <- d2$married + 1
m6.9 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a[mid] + bA * A,
a[mid] ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## R code 6.24
m6.10 <- quap(
alist(
happiness ~ dnorm(mu, sigma),
mu <- a + bA * A,
a ~ dnorm(0, 1),
bA ~ dnorm(0, 2),
sigma ~ dexp(1)
),
data = d2
)
## Comparison
compare(m6.9, m6.10)
## WAIC SE dWAIC dSE pWAIC weight
## m6.9 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
# Comment: m6.9 has a much lower WAIC than m6.10 (about 2714 vs 3102), so it is expected to make better out-of-sample predictions. However, m6.10 provides the correct causal inference about the influence of age on happiness: marriage status is a collider, and conditioning on it (as m6.9 does) opens a non-causal path that induces a spurious association between age and happiness. WAIC measures predictive accuracy, not causal correctness, and that spurious association still improves prediction, which is why the two answers disagree.
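# A quick check (sketch) of the collider bias itself: compare the posterior for
# bA in the two models. In m6.9, which conditions on marriage status, bA should
# be strongly negative -- a purely spurious association -- while in m6.10 it
# should be approximately zero, the correct causal answer in this simulation.
precis(m6.9, depth = 2)
precis(m6.10)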
7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables: (1) avgfood + groupsize + area; (2) avgfood + groupsize; (3) groupsize + area; (4) avgfood; (5) area.
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
# data loading and prepping
data(foxes)
foxes$area <- scale(foxes$area)
foxes$avgfood <- scale(foxes$avgfood)
foxes$weight <- scale(foxes$weight)
foxes$groupsize <- scale(foxes$groupsize)
## Models
m1 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood + bGroup * groupsize + bArea * area,
a ~ dnorm(0, .2),
c(bFood, bGroup, bArea) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = foxes
)
m2 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood + bGroup * groupsize,
a ~ dnorm(0, .2),
c(bFood, bGroup) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = foxes
)
m3 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bGroup * groupsize + bArea * area,
a ~ dnorm(0, .2),
c(bGroup, bArea) ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = foxes
)
m4 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bFood * avgfood,
a ~ dnorm(0, .2),
bFood ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = foxes
)
m5 <- quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bArea * area,
a ~ dnorm(0, .2),
bArea ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = foxes
)
## Comparison
compare(m1, m2, m3, m4, m5)
## WAIC SE dWAIC dSE pWAIC weight
## m1 323.3416 16.88472 0.0000000 NA 5.187854 0.410865167
## m2 323.9809 16.81807 0.6393574 3.894043 4.130907 0.298445216
## m3 324.0666 16.18771 0.7250103 4.207249 3.945774 0.285933698
## m4 333.5084 13.79238 10.1668387 8.659625 2.454047 0.002546821
## m5 333.7929 13.79707 10.4513608 8.704578 2.684646 0.002209099
# Comment: The differences among models 1, 2, and 3 are small relative to their dSE, so they are effectively tied; models 4 and 5 score clearly worse (dWAIC of about 10, only slightly larger than their dSE of about 8.7). Following the DAG, area influences weight only through avgfood, and avgfood acts on weight both directly and through groupsize. As long as groupsize is included, it makes little difference whether area or avgfood (or both) is used, which is why models 1-3 are nearly indistinguishable. Models 4 and 5 omit groupsize, so the positive direct effect of food and the negative effect through group size largely cancel; and because avgfood and area carry nearly the same information, these two models also have nearly identical WAIC values.
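# Two optional follow-ups (sketches) that make the reasoning above visible:
# plotting the comparison shows the dSE intervals directly, and the coefficients
# of m2 show why dropping groupsize hurts -- the positive direct path from food
# and the negative path through group size roughly cancel when groupsize is
# omitted.
plot(compare(m1, m2, m3, m4, m5))
precis(m2) # expect bFood > 0 and bGroup < 0 of similar magnitude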