Assignment #6

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments.

# Information criteria are based on deviance, which is accrued over observations without being divided by the number of observations, which denotes the sum. Hence, if everything else is equal, a model with more observations will have a higher deviance and less accuracy as per the information criteria. Thus, it would be an unfair comparison to contrast models fit to different numbers of observations.

# We may measure WAIC for model's fit to increasingly small samples of the same data as an experiment. The information criteria should decrease alongside the sample size. 

# As shown below, we take an example from the howell database - 

library(rethinking)
data(Howell1)
set.seed(6)
data <- Howell1[complete.cases(Howell1), ]
d_500 <- data[sample(1:nrow(data), size = 500, replace = FALSE), ]
d_400 <- data[sample(1:nrow(data), size = 400, replace = FALSE), ]
d_300 <- data[sample(1:nrow(data), size = 300, replace = FALSE), ]
m_500 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_500,
  start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_400,
  start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_300,
  start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))

##           WAIC       SE     dWAIC      dSE    pWAIC        weight
## m_300 1862.175 27.91431    0.0000       NA 3.437154  1.000000e+00
## m_400 2419.639 33.94881  557.4644 46.34780 3.442395 8.874639e-122
## m_500 3054.604 35.32283 1192.4289 53.17046 3.257118 1.167790e-259

# From the results, we see that the WAIC increased from the N=300 model to the N=400 model and again to the N=500 model. The compare() function also returns a warning about the number of observations being different.

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments.

# The effective number of parameters decreases as the prior becomes more concentrated. The reason is because in this case of DIC, pD is a measure of the model's flexibility; with more concentrated priors, the model becomes less flexible (and therefore less prone to overfitting). In case of WAIC, pWAIC is a measure of the variance in the log-likelihood for each observation in the training sample; with more concentrated priors, the likelihood would become more concentrated as well, reducing variance.

# We perform an experiment to demonstrate this. We  calculate WAIC for two models utilizing common data with the difference only in the concentration of their priors.

data <- Howell1[complete.cases(Howell1), ]
data$height.log <- log(data$height)
data$height.log.z <- (data$height.log - mean(data$height.log)) / sd(data$height.log)
data$weight.log <- log(data$weight)
data$weight.log.z <- (data$weight.log - mean(data$weight.log)) / sd(data$weight.log)
m_wide <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data
)
m_narrow <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = data
)
WAIC(m_wide, refresh = 0)

##        WAIC     lppd penalty std_err
## 1 -102.7631 55.62747 4.24594 36.5361

WAIC(m_narrow, refresh = 0)

##        WAIC     lppd  penalty  std_err
## 1 -102.7579 55.62061 4.241636 36.44326

# From the results, we note that the pWAIC decreases as the priors become more concentrated.

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island	Species A	Species B	Species C	Species D	Species E
1	0.2	0.2	0.2	0.2	0.2
2	0.8	0.1	0.05	0.025	0.025
3	0.05	0.15	0.7	0.05	0.05

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

H <- function(p) -sum(p*log(p))

IB <- list()
IB[[1]] <- c( 0.2 , 0.2 , 0.2 , 0.2 , 0.2 )
IB[[2]] <- c( 0.8 , 0.1 , 0.05 , 0.025 , 0.025 )
IB[[3]] <- c( 0.05 , 0.15 , 0.7 , 0.05 , 0.05 )
sapply( IB , H )

## [1] 1.6094379 0.7430039 0.9836003

# The first island has the highest entropy followed by the third island and then lastly, the second island. This is because birds are evenly distributed on the first island compared to the second island which has the most uneven distribution of birds.

DKL <- function(p,q) sum( p*(log(p)-log(q)) ) #K-L distance

Dm <- matrix( NA , nrow=3 , ncol=3 )

for ( i in 1:3 ) for ( j in 1:3 ) Dm[i,j] <- DKL( IB[[j]] , IB[[i]])

round( Dm , 2 )

##      [,1] [,2] [,3]
## [1,] 0.00 0.87 0.63
## [2,] 0.97 0.00 1.84
## [3,] 0.64 2.01 0.00

# From the results, we note that the the first island has smaller distances to other islands because it has the highest entropy.

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

# We consider a multiple regression model aimed at inferring the influence of age on happiness, while controlling the marriage status.

#Let’s consider the slope βA first, because how we scale the predictor A will determine the intercept. We’ll focus only on the adult sample, aged 18 or above. Assuming a very strong relationship between age and happiness, such that happiness is at its maximum at age 18 and its minimum at age 65. It’ll be easier if we rescale age so that the range from 18 to 65 is one unit. Now this new variable A ranges from 0 to 1, where 0 is age 18 and 1 is age 65.

# Happiness is on an arbitrary scale, in this data, from −2 to +2. So our assumption of strongest relationship, taking happiness from maximum to minimum, has a slope with rise over run of (2 − (−2))/1 = 4. As we know,  95% of the mass of a normal distribution is contained within 2 standard deviations. So if we set the standard deviation of the prior to half of 4, we are implying that we expect 95% of plausible slopes to be less than maximally strong. This isn’t a very strong prior, but it does help bound inference to realistic ranges.

# Concerning the, intercepts, each α is the value of μi when Ai = 0. In this case, that means at age 18. So we need to allow α to cover the full range of happiness scores. Normal(0, 1) will put 95% of the mass in the −2 to +2 interval.

# We need to construct the marriage status index variable, as well.

# Approximating the posterior
d <- sim_happiness( seed=1977 , N_years=1000 )
d2 <- d[ d$age>17 , ] # only adults
d2$A <- ( d2$age - 18 ) / ( 65 - 18 )
d2$mid <- d2$married + 1

# Model m6.9 contains both marriage status and age. Model m6.10 contains only age. Model m6.9 produces a confounded inference about the relationship between age and happiness, due to opening a collider path.

m6.9 <- quap(
    alist(
        happiness ~ dnorm( mu , sigma ),
        mu <- a[mid] + bA*A,
        a[mid] ~ dnorm( 0 , 1 ),
        bA ~ dnorm( 0 , 2 ),
        sigma ~ dexp(1)
    ) , data=d2 )
precis(m6.9,depth=2)

##             mean         sd       5.5%      94.5%
## a[1]  -0.2350877 0.06348986 -0.3365568 -0.1336186
## a[2]   1.2585517 0.08495989  1.1227694  1.3943340
## bA    -0.7490274 0.11320112 -0.9299447 -0.5681102
## sigma  0.9897080 0.02255800  0.9536559  1.0257600

m6.10 <- quap(
    alist(
        happiness ~ dnorm( mu , sigma ),
        mu <- a + bA*A,
        a ~ dnorm( 0 , 1 ),
        bA ~ dnorm( 0 , 2 ),
        sigma ~ dexp(1)
    ) , data=d2 )
precis(m6.10)

##                mean         sd       5.5%     94.5%
## a      1.649248e-07 0.07675015 -0.1226614 0.1226617
## bA    -2.728620e-07 0.13225976 -0.2113769 0.2113764
## sigma  1.213188e+00 0.02766080  1.1689803 1.2573949

#Now we comparing both models
compare( m6.9 , m6.10 )

##           WAIC       SE    dWAIC      dSE    pWAIC       weight
## m6.9  2713.971 37.54465   0.0000       NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85

# m6.9 is expected to predict much better as it produces the invalid inference. This is because the collider path does convey actual association. We simply end up mistaken about the causal inference. We should not use WAIC (or LOO) to choose among models, unless we have some clear sense of the causal model

7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

avgfood + groupsize + area
avgfood + groupsize
groupsize + area
avgfood
area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

library(dagitty)
quest <- dagitty("dag{
                  avgfood <- area
                  weight <- avgfood
                  weight <- groupsize
                  groupsize <- avgfood
                  }")

plot(quest)

data(foxes)
d <- foxes
d$W <- standardize(d$weight)
d$A <- standardize(d$area)
d$F <- standardize(d$avgfood)
d$G <- standardize(d$groupsize)



m1 <- quap(
    alist(
        W ~ dnorm( mu , sigma ),
        mu <- a + bF*F + bG*G + bA*A,
        a ~ dnorm(0,0.2),
        c(bF,bG,bA) ~ dnorm(0,0.5),
        sigma ~ dexp(1)
    ), data=d )
m2 <- quap(
    alist(
        W ~ dnorm( mu , sigma ),
        mu <- a + bF*F + bG*G,
        a ~ dnorm(0,0.2),
        c(bF,bG) ~ dnorm(0,0.5),
        sigma ~ dexp(1)
    ), data=d )
m3 <- quap(
    alist(
        W ~ dnorm( mu , sigma ),
        mu <- a + bG*G + bA*A,
        a ~ dnorm(0,0.2),
        c(bG,bA) ~ dnorm(0,0.5),
        sigma ~ dexp(1)
    ), data=d )
m4 <- quap(
    alist(
        W ~ dnorm( mu , sigma ),
        mu <- a + bF*F,
        a ~ dnorm(0,0.2),
        bF ~ dnorm(0,0.5),
        sigma ~ dexp(1)
    ), data=d )
m5 <- quap(
    alist(
        W ~ dnorm( mu , sigma ),
        mu <- a + bA*A,
        a ~ dnorm(0,0.2),
        bA ~ dnorm(0,0.5),
        sigma ~ dexp(1)
), data=d )


compare( m1 , m2 , m3 , m4 , m5 )

##        WAIC       SE     dWAIC      dSE    pWAIC      weight
## m1 322.8847 16.27783  0.000000       NA 4.656959 0.465363694
## m3 323.8985 15.68240  1.013749 2.899417 3.718565 0.280323674
## m2 324.1284 16.13964  1.243666 3.598475 3.859897 0.249881396
## m4 333.4444 13.78855 10.559695 7.193396 2.426279 0.002370193
## m5 333.7239 13.79447 10.839215 7.242069 2.650636 0.002061043

# From the results shown, the models m1, m2, m3 have very similar WAIC values. This is because WAIC sees these models as tied. This makes sense, given the DAG, because as long as a model has groupsize in it, we can include either avgfood or area or both and get the same inferences.

# the models m4 and m5 omit groupsize and are tied to one another, thus they have almost similar WAIC values.

Assignment #6

Chapter 7

Shruti Jose

2022-03-28

Chapter 7 - Ulysses’ Compass

Questions