Assignment #6

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.

#Information criteria are from the deviance of all observations without being divided by the number of observations (sum and not an average). A model with a more observations will have a higher deviance and worse accuracy according to the information criterion, so it is not reasonable to compare models with different number of observations.

#To confirm this, the WAIC can be calculated for models with different number of observations. 
library(rethinking)

data(Howell1)
dt <- Howell1[complete.cases(Howell1), ]
dt250 <- dt[sample(1:nrow(dt), size = 250, replace = FALSE), ]
dt350 <- dt[sample(1:nrow(dt), size = 350, replace = FALSE), ]
dt450 <- dt[sample(1:nrow(dt), size = 450, replace = FALSE), ]

m250 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = dt250,
  start = list(a = mean(dt250$height), b = 0, sigma = sd(dt250$height))
)

m350 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = dt350,
  start = list(a = mean(dt350$height), b = 0, sigma = sd(dt350$height))
)

m450 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = dt450,
  start = list(a = mean(dt450$height), b = 0, sigma = sd(dt450$height))
)

(model.compare <- compare(m250, m350, m450))

##          WAIC       SE     dWAIC      dSE    pWAIC        weight
## m250 1511.216 22.23438    0.0000       NA 2.880546  1.000000e+00
## m350 2126.185 27.66868  614.9687 32.49309 3.172372 2.892233e-134
## m450 2753.223 31.34023 1242.0066 30.65115 3.038258 2.003047e-270

#The deviance, which is a sum, is used as the information criterion. As more values are summed together with more observations, the deviance increases, making a comparison to a model with fewer observations unusable. Using a simulation approach, we can demonstrate this.

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.

#Because a prior becomes more concentrated, the number of effective parameters reduces. 
#As the prior gets more concentrated, the model for PSIS becomes less flexible.
#The pWAIC is a measure of variance in log-likelihood for each observation in the training sample for WAIC. The likelihood will get more concentrated as the prior becomes more concentrated, and the variance will decrease.

dt <- Howell1[complete.cases(Howell1), ]
dt$height.log <- log(dt$height)
dt$height.log.z <- (dt$height.log - mean(dt$height.log)) / sd(dt$height.log)
dt$weight.log <- log(dt$weight)
dt$weight.log.z <- (dt$weight.log - mean(dt$weight.log)) / sd(dt$weight.log)
m_not_concentrated <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = dt
)
m_concentrated <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = dt
)
WAIC(m_not_concentrated, refresh = 0)

##        WAIC     lppd  penalty  std_err
## 1 -102.1036 55.73913 4.687316 36.79195

WAIC(m_concentrated, refresh = 0)

##        WAIC     lppd penalty  std_err
## 1 -102.9472 55.61117 4.13757 36.30203

#With a more concentrated prior, the pWAIC decreases.

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island	Species A	Species B	Species C	Species D	Species E
1	0.2	0.2	0.2	0.2	0.2
2	0.8	0.1	0.05	0.025	0.025
3	0.05	0.15	0.7	0.05	0.05

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

library(magrittr)
library(dplyr)
# H(p) = −∑(p_i x log(p_i))
#Compute the entropies

H <- function(p) -1 * sum(p * log(p) )
Island_1 <- c( 0.2 , 0.2 , 0.2 , 0.2 , 0.2)
Island_2 <- c( 0.8 , 0.1 , 0.05 , 0.025 , 0.025)
Island_3 <- c( 0.05 , 0.15 , 0.7 , 0.05 , 0.05)

H(Island_1)

## [1] 1.609438

H(Island_2)

## [1] 0.7430039

H(Island_3)

## [1] 0.9836003

#The uncertainty included in a probability distribution is defined as the entropy, H. The five bird species are evenly distributed on Island 1. As a result, the island is the most difficult to predict. Island 2 is, on the other hand, the easiest to predict because the large proportion of the population is species A, resulting in the lowest entropy.

D_kl <- function(p, q) sum(p * (log(p) - log(q)))
D_kl(Island_1,Island_2)

## [1] 0.9704061

D_kl(Island_1,Island_3)

## [1] 0.6387604

D_kl(Island_2,Island_1)

## [1] 0.866434

D_kl(Island_2,Island_3)

## [1] 2.010914

D_kl(Island_3,Island_1)

## [1] 0.6258376

D_kl(Island_3,Island_2)

## [1] 1.838845

#Because Island 1 has the maximum entropy, the KL distance is shorter when it is chosen as the model.

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

library(rethinking)
dt <- sim_happiness(seed = 1977, N_years = 1000)
dt2 <- dt[dt$age>17,] # only adults
dt2$A <- (dt2$age - 18) / (65 - 18)
dt2$mid <- dt2$married + 1

m6.9 <- quap(
    alist(
        happiness ~ dnorm(mu, sigma),
        mu <- a[mid] + bA*A,
        a[mid] ~ dnorm(0, 1),
        bA ~ dnorm(0, 2),
        sigma ~ dexp(1)
    ) , data=dt2)
precis(m6.9, depth = 2)

##             mean         sd       5.5%      94.5%
## a[1]  -0.2350877 0.06348986 -0.3365568 -0.1336186
## a[2]   1.2585517 0.08495989  1.1227694  1.3943340
## bA    -0.7490274 0.11320112 -0.9299447 -0.5681102
## sigma  0.9897080 0.02255800  0.9536559  1.0257600

m6.10 <- quap(
    alist(
        happiness ~ dnorm(mu, sigma),
        mu <- a + bA*A,
        a ~ dnorm(0, 1),
        bA ~ dnorm(0, 2),
        sigma ~ dexp(1)
    ) , data=dt2 )
precis(m6.10)

##                mean         sd       5.5%     94.5%
## a      1.649248e-07 0.07675015 -0.1226614 0.1226617
## bA    -2.728620e-07 0.13225976 -0.2113769 0.2113764
## sigma  1.213188e+00 0.02766080  1.1689803 1.2573949

compare(m6.9, m6.10)

##           WAIC       SE    dWAIC      dSE    pWAIC       weight
## m6.9  2713.971 37.54465   0.0000       NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85

#WAIC claims that m6.9 is better at prediction. M6.10, on the other hand, contributes to the insights by revealing information about the casual relationship.

7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

avgfood + groupsize + area
avgfood + groupsize
groupsize + area
avgfood
area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

data(foxes)

fox_dt <- foxes[,-1] %>%
           as_tibble() %>%
           mutate(across(everything(), standardize))

set.seed(1234)
N<-100
a <- rnorm(N, 0, 0.3)
b <- rnorm(N, 0, 0.5)
plot(NULL, xlim=range(fox_dt$area),ylim=c(-4,4))
xbar<-mean(fox_dt$area)
for(i in 1:N) curve(a[i] + b[i]*(x- xbar),
                   from=min(fox_dt$area), to = max(fox_dt$area), add = TRUE,
                   col = col.alpha("black",0.2)
                   )

# 1 avgFood + Groupsize + area
m7.1 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- a + b_avgfood*avgfood + b_groupsize*groupsize + b_area*area,
      a ~ dnorm(0,3),
      b_avgfood ~ dnorm(0,5),
      b_groupsize ~ dnorm(0,5),
      b_area ~ dnorm(0,5), 
     sigma ~ dexp(1)
  ), data = fox_dt
)

# 2 avgFood + Groupsize
m7.2 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- a + b_avgfood*avgfood + b_groupsize*groupsize,
      a ~ dnorm(0,3),
      b_avgfood ~ dnorm(0,5),
      b_groupsize ~ dnorm(0,5),
     sigma ~ dexp(1)
  ), data = fox_dt
)

# 3 Groupsize and area
m7.3 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- a + b_area*area + b_groupsize*groupsize,
      a ~ dnorm(0,3),
      b_area ~ dnorm(0,5),
      b_groupsize ~ dnorm(0,5),
     sigma ~ dexp(1)
  ), data = fox_dt
)

#4 avgFood
m7.4 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- a + b_avgfood*avgfood,
      a ~ dnorm(0,3),
      b_avgfood ~ dnorm(0,5),
     sigma ~ dexp(1)
  ), data = fox_dt
)

#5 Area
m7.5 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- a + b_area*area,
      a ~ dnorm(0,3),
      b_area ~ dnorm(0,5),
     sigma ~ dexp(1)
  ), data = fox_dt
)

compare(m7.1, m7.2, m7.3, m7.4, m7.5)

##          WAIC       SE      dWAIC      dSE    pWAIC      weight
## m7.1 323.7790 16.97973  0.0000000       NA 5.394304 0.395584053
## m7.2 324.0854 16.82870  0.3063355 3.890751 4.194372 0.339405493
## m7.3 324.6152 16.13735  0.8361358 4.277576 4.220335 0.260419919
## m7.4 333.9423 13.86435 10.1632843 8.712388 2.663152 0.002456460
## m7.5 334.2237 13.90559 10.4446596 8.760104 2.886085 0.002134075

plot(compare(m7.1, m7.2, m7.3, m7.4, m7.5))

# Because they all have groupsize in their models, m7.1, m7.2, and m7.3 have almost the same dWAIC. This means there is no back door route from the remainder variables.
# Because area and avgFood information travels to weight solely in one way, when group size is conditioned for, m7.4 and m7.5 have roughly the same dWAIC.

Assignment #6

Chapter 7

Aniket Muluk

2022-03-28

Chapter 7 - Ulysses’ Compass

Questions