CemSoylu_ANLY505_2022_Summer

Title: ‘Assignment #7’ Author: “Cem Soylu” Date: “2022-08-08” Output: html_document

Questions

7-1. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some simulations.

library(rethinking)

set.seed(2022)

# perform simulations
x <-  rnorm(1000, mean=1, sd=0.8)            # main predictor
x1 <- rnorm(1000, mean=0, sd=1)            # random variable to add noise
y <- rnorm(1000, mean = 3 + 2*x, sd=1)
d <- data.frame(x=x, y=y, x1=x1)

# compare two models
m1a <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d
)

m1b <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x + c*x1,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    c ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d
)

# Comparing Models
comparison1 <- compare(m1a, m1b)
comparison1

##         WAIC       SE  dWAIC       dSE    pWAIC   weight
## m1a 3020.493 46.21983 0.0000        NA 2.908110 0.640538
## m1b 3021.648 46.30764 1.1554 0.2683395 4.013427 0.359462

The second model, m1b, has a slightly higher WAIC and lower weight than the first model, m1a.

A. We can perform an experiment where two models are fit with the same linear model m1a. These models will have the same number of observations but using different data. So all models are fit to exactly the same observations but different data so we can analyze the difference in model performance.

m1a_exp1 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d[1:300,]
)

m1a_exp2 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d[301:600,]
)


m1a_exp3 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d[601:900,]
)

# Comparing Models
comparison2 <- compare(m1a_exp1, m1a_exp2, m1a_exp3)
comparison2

##              WAIC       SE    dWAIC      dSE    pWAIC       weight
## m1a_exp3 1110.498 24.80900  0.00000       NA 2.331701 1.000000e+00
## m1a_exp2 1149.817 24.13764 39.31956 34.66866 2.175391 2.896452e-09
## m1a_exp1 1179.740 22.67284 69.24267 31.96222 1.808913 9.207580e-16

Even though all models are fit to exactly the same observations and same linear model, functional form, different data causes different WAIC. We can see that WAICs are different across three models and one of them have all the weight and the other two have zero weights.

B - We can perform another experiment where all models are fit to exactly the same observations. However, two models are fit with the same linear model m1b and one model has a linear model of m1a. These models will have the same number of observations but using different data and different models. Thus we can analyze the difference in model performance due to difference in data and model even though they have the same number of observations.

m1b_exp4 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x + c*x1,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    c ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d[1:300,]
)

m1a_exp5 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d[301:600,]
)

m1b_exp6 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x + c*x1,
    a ~ dnorm(0, 0.1),
    b ~ dnorm(0, 0.5),
    c ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d[601:900,]
)

# Comparing Models
comparison3 <- compare(m1b_exp4, m1a_exp5, m1b_exp6)
comparison3

##              WAIC       SE    dWAIC      dSE    pWAIC       weight
## m1b_exp6 1112.602 24.56077  0.00000       NA 3.199753 1.000000e+00
## m1a_exp5 1148.608 24.19160 36.00597 34.54984 2.215988 1.518456e-08
## m1b_exp4 1180.829 22.61746 68.22767 31.70916 2.778199 1.529499e-15

Even though two models m1b_exp4 and m1b_exp6 both use the unnecessary variable x1 but this does not translate into WAICs. The last model m1b_exp6 has the lowest WAIC and all the weight whereas the other two models have zero weights.

When comparing models with an information criterion, we must fit all models to exactly the same observations because a model with more observations returns a higher deviance and worse accuracy according to information criteria. Generally, information criteria are based on deviance. Deviance is a sum and not a mean product of all observations. Thus a model with more observations returns a higher deviance and worse accuracy.

From those simulation experiments above, it also suggests that information criteria WAIC also depends on which data/sample we use to fit the model because same number of observations, and same model but different data/sample can still have different WAICs.

7-2. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some simulations.

set.seed(2022)
x <-  rnorm(1000, mean=1, sd=2)
x1 <- rnorm(1000, mean=0, sd=1)            # random variable 1 to add noise
x2 <- rnorm(1000, mean=x, sd=5)            # random variable 2 with spurious relationship with y
y <- rnorm(1000, mean = 3 + 2*x, sd=1)
d <- data.frame( y=y, x=x, x1=x1, x2=x2)
pairs(d)

N <- 10    # Number of experiments
dic.l <- list()
waic.l <- list()

# perform simulation for N times
set.seed(2022)

for (i in 1:N){
  # generate simulated data
  x <-  rnorm(1000, mean=1, sd=1.5)
  x1 <- rnorm(1000, mean=0, sd=1)            # random variable 1 to add noise
  x2 <- rnorm(1000, mean=x, sd=5)            # random variable 2 with spurious relationship with y
  y <- rnorm(1000, mean = 3 + 2*x, sd=1)
  d <- data.frame( y=y, x=x, x1=x1, x2=x2)

  # run the same model with different priors
  mod_100 <- map(
    alist(
      y ~ dnorm( mu, sigma) ,
      mu <- a + b*x +b1*x1 + b2*x2,
      c(b,b1,b2) ~ dnorm(0, 100),
      a ~ dnorm(0, 100),
      sigma ~ dunif(0, 100)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=10)
  )
  dic.l[["100"]] <- c(dic.l[["100"]], attr(DIC(mod_100), "pD"))
  waic.l[["100"]] <- c(waic.l[["100"]], attr(WAIC(mod_100), "WAIC"))
  
  mod_50 <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b*x + b1*x1 +b2*x2,
      c(b, b1, b2) ~ dnorm(0, 50),
      a ~ dnorm(0, 50),
      sigma ~ dunif(0, 50)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=10)
  )
  
  dic.l[["50"]] <- c(dic.l[["50"]], attr(DIC(mod_50), "pD"))
  waic.l[["50"]] <- c(waic.l[["50"]], attr(WAIC(mod_50), "WAIC"))
  
  mod_10 <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b*x + b1*x1 + b2*x2,
      c(b, b1,b2) ~ dnorm(0, 10),
      a ~ dnorm(0, 10),
      sigma ~ dunif(0, 10)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=5)
  )
  
  dic.l[["10"]] <- c(dic.l[["10"]], attr(DIC(mod_10), "pD"))
  waic.l[["10"]] <- c(waic.l[["10"]], attr(WAIC(mod_10), "WAIC"))
  
    mod_1 <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b*x + b1*x1 + b2*x2,
      c(b, b1,b2) ~ dnorm(0, 1),
      a ~ dnorm(0, 1),
      sigma ~ dunif(0, 10)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=5)
  )
  
  dic.l[["1"]] <- c(dic.l[["1"]], attr(DIC(mod_1), "pD"))
  waic.l[["1"]] <- c(waic.l[["1"]], attr(WAIC(mod_1), "WAIC"))
  
  mod_0.5 <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b*x + b1*x1 +b2*x2,
      c(b, b1,b2) ~ dnorm(0, 0.5),
      a ~ dnorm(0, 0.5),
      sigma ~ dunif(0, 15)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=5)
  )
  
  dic.l[["0.5"]] <- c(dic.l[["0.5"]], attr(DIC(mod_0.5), "pD"))
  waic.l[["0.5"]] <- c(waic.l[["0.5"]], attr(WAIC(mod_0.5), "WAIC"))
  
  mod_0.1 <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b*x + b1*x1 +b2*x2,
      c(b, b1,b2) ~ dnorm(0, 0.1),
      a ~ dnorm(0, 0.1),
      sigma ~ dunif(0, 10)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=5)
  )
  
  dic.l[["0.1"]] <- c(dic.l[["0.1"]], attr(DIC(mod_0.1), "pD"))
  waic.l[["0.1"]] <- c(waic.l[["0.1"]], attr(WAIC(mod_0.1), "WAIC"))
  
  mod_0.01 <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b*x + b1*x1 +b2*x2,
      c(b, b1,b2) ~ dnorm(0, 0.01),
      a ~ dnorm(0, 0.01),
      sigma ~ dunif(0, 10)
    ), data=d, start=list(b=0, b1=0, b2=0, a=3, sigma=5)
  )
  
  dic.l[["0.01"]] <- c(dic.l[["0.01"]], attr(DIC(mod_0.01), "pD"))
  waic.l[["0.01"]] <- c(waic.l[["0.01"]], attr(WAIC(mod_0.01), "WAIC"))
  
}

options(scipen = 3)
prior <- c(100, 50, 10, 1, 0.5, 0.1, 0.01)
prior2 <- c(110, 55, 11, 1.1, 0.55, 0.15, 0.015)
dic.mean <- sapply(dic.l, mean)
dic.sd <- sapply(dic.l, sd)
waic.mean <- sapply(waic.l, mean)
waic.sd <- sapply(waic.l, sd)
plot(prior , dic.mean, ylim=c(0,6),
       xlab="Prior", ylab="Number of parameters", log="x",
      pch=16, cex=1, col="steelblue", type="o",
     main="Number of parameters")

# Reporting WAIC
print("WAIC for models")

## [1] "WAIC for models"

prior

## [1] 100.00  50.00  10.00   1.00   0.50   0.10   0.01

waic = c(WAIC(mod_100)[1], WAIC(mod_50)[1], WAIC(mod_10)[1], WAIC(mod_1)[1], WAIC(mod_0.5)[1], WAIC(mod_0.1)[1], WAIC(mod_0.01)[1])
waic

## $WAIC
## [1] 2794.624
## 
## $WAIC
## [1] 2794.457
## 
## $WAIC
## [1] 2794.629
## 
## $WAIC
## [1] 2794.471
## 
## $WAIC
## [1] 2794.961
## 
## $WAIC
## [1] 2876.123
## 
## $WAIC
## [1] 6323.062

# Reporting PSIS
psis = c(PSIS(mod_100)[1], PSIS(mod_50)[1], PSIS(mod_10)[1], PSIS(mod_1)[1], PSIS(mod_0.5)[1], PSIS(mod_0.1)[1], PSIS(mod_0.01)[1])
psis

## $PSIS
## [1] 2794.899
## 
## $PSIS
## [1] 2794.7
## 
## $PSIS
## [1] 2794.472
## 
## $PSIS
## [1] 2794.639
## 
## $PSIS
## [1] 2794.297
## 
## $PSIS
## [1] 2877.292
## 
## $PSIS
## [1] 6323.255

The effective number of parameters is the penalty term of our regularization approaches. As priors become more regularizing or more concentrated, the effective number of parameters decreases. We can see that the number of effective parameters decreases when the prior gets more concentrated. This makes sense as model becomes less flexible in fitting the sample when a prior becomes more concentrated around particular parameter values. With a less flexible model, the effective number of parameters declines. Increasingly regularized priors decrease the flexibility of the model which we can see via measures of PSIS or WAIC.

7-3. Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

Island	Species A	Species B	Species C	Species D	Species E
1	0.2	0.2	0.2	0.2	0.2
2	0.8	0.1	0.05	0.025	0.025
3	0.05	0.15	0.7	0.05	0.05

Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?

# First Island
p1 <- c(0.2, 0.2, 0.2, 0.2, 0.2)

# formula for entropy
entropy <- function(p) - sum( p[p!=0] * log(p[p!=0]) )
entropy(p1)

## [1] 1.609438

# Second Island
p2 <- c(0.8, 0.1, 0.05, 0.025, 0.025)
entropy(p2)

## [1] 0.7430039

# Third Island
p3 <- c(0.05, 0.15, 0.7, 0.05, 0.05)
entropy(p3)

## [1] 0.9836003

Entropy is a measure of uncertainty. The higher the entropy, the more uncertain we are of the probability density distribution at hand.

An alternative way of entropy calculation:

entropy1 = -sum(p1 * log(p1))
entropy1

## [1] 1.609438

entropy2 = -sum(p2 * log(p2))
entropy2

## [1] 0.7430039

entropy3 = -sum(p3 * log(p3))
entropy3

## [1] 0.9836003

The three entropies of the islands are ordered from low to high as Island 2, Island 3, and Island 1 with 0.74-0.98-1.61. This indicates that there is a lot of certainty of the proportions assigned to each bird species at Island 2 over the other islands. In other words, there are a lot of species A in Island 2 even though other species drop drastically to almost being non-existent on Island 2. This also means that we are sure of the sole inhabitant on an island vs. other species with equal proportions on other islands.

Divergence is calculated as the average difference in log probability between target (p) and model (q). To find model (q), we need to average out the proportions on two islands and then compare these to the target island. For each of these pairings, two K-L distances will be calculated.

Calculating the divergence between Island 1 and both the Island 2 and 3 together:

# Average the proportions of Island 2 and 3
(q <- apply(cbind(p2, p3), 1, mean))

## [1] 0.4250 0.1250 0.3750 0.0375 0.0375

# Divergence of Island 1 from combined Islands 2 & 3
(D_pq <- sum(p1 * log(p1 / q)))

## [1] 0.4871152

# Divergence of combined Islands 2 & 3 from Island 1
(D_qp <- sum(q * log(q / p1)))

## [1] 0.3717826

By using both Islands 2 & 3 together vs. Island 1, we introduce an additional 0.49 of uncertainty. Due to non-directional distance, we also calculate using the knowledge of Island 1 to approximate the combination of Islands 2 and 3, an additional 0.37 of uncertainty is introduced.

Then we using both Islands 1 & 3 together, and both Islands 1 & 2 together, respectively.

output <- data.frame(
  Approximated = D_pq,
  Approximator = D_qp
)


# Target: Island 2
q <- apply(cbind(p1, p3), 1, mean)
D_pq <- sum(p2 * log(p2 / q))
D_qp <- sum(q * log(q / p2))
output <- rbind(output, c(D_pq, D_qp))

# Target: Island 3
q <- apply(cbind(p1, p2), 1, mean)
D_pq <- sum(p3 * log(p3 / q))
D_qp <- sum(q * log(q / p3))
output <- rbind(output, c(D_pq, D_qp))

# Reporting results
rownames(output) <- c("Island 1", "Island 2", "Island 3")
output

##          Approximated Approximator
## Island 1    0.4871152    0.3717826
## Island 2    1.2387437    1.2570061
## Island 3    1.0097143    1.1184060

We can see that starting at island 1 results in the overall lowest divergence values, due to the high entropy. In other words, if we start to draw birds from a bag 1 (from island 1), we will be more certain of specicies from drawing from the other bags. The results also show that the divergence between Island 1 (a “safe bet”) where all species are equally present and other systems is much smaller than when comparing heavily skewed probability density distributions.

7-4. Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?

We recall that m6.9 shows a negative relationship between age and happiness which we know to be untrue because it conditions on the collider of marriage status which itself, is influenced by age and happiness. m6.10 does not condition on said collider and thus does not find a relationship between age and happiness.

d <- sim_happiness(seed = 1977, N_years = 1000)


d2 <- d[d$age > 17, ] # need only sample of adults
d2$A <- (d2$age - 18) / (65 - 18)


d2$mid <- d2$married + 1


m6.9 <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a[mid] + bA * A,
    a[mid] ~ dnorm(0, 1),
    bA ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d2
)


m6.10 <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bA * A,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d2
)


# Reporting comparison of models
compare(m6.9, m6.10)

##           WAIC       SE    dWAIC      dSE    pWAIC       weight
## m6.9  2713.971 37.54465   0.0000       NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85

Based on WAIC, we can see that model m6.9 has better performance with lower WAIC. However, we know that model m6.10 provides the true causal relationship between age and happiness, which is no causal relationship.The right causal inference regarding the impact of age on happiness is provided by model 6.10. Since WAIC is more useful for assessing prediction ability than causal relationship. Thus, the relationship between age and happiness leads to more accurate forecasts.

By comparing these two models m6.9 and 6.10, it suggests that there might be some confounding. Since the model m6.9 identifies the happiness of the different groups of people: the miserable unmarried as well as the ecstatic married ones, its WAIC would have us favor this model because it does a better job at predicting the happiness of out-of-sample individuals. Conditioning on the collider, predictive accuracy increases. Therefore, m6.9 fails at finding causality while doing better at predicting model, which demonstrates that a model may be good at prediction but does not have correct causality.

7-5. Revisit the urban fox data, data(foxes), from the previous chapter’s practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:

avgfood + groupsize + area
avgfood + groupsize
groupsize + area
avgfood
area

Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).

# Getting data
data(foxes)

# Standardizing variables
foxes$avgfood <- (foxes$avgfood-mean(foxes$avgfood))/sd(foxes$avgfood)
foxes$groupsize <- (foxes$groupsize-mean(foxes$groupsize))/sd(foxes$groupsize)
foxes$area <- (foxes$area -mean(foxes$area ))/sd(foxes$area)
foxes$weight <- (foxes$weight-mean(foxes$weight))/sd(foxes$weight)


## Models
# Model 1: weight ~ avgfood + groupsize + area

m1 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + bFood*avgfood + bGroup*groupsize + bArea*area,
    a ~ dnorm(0, .1),
    c(bFood, bGroup, bArea) ~ dnorm(0, 3),
    sigma ~ dexp(1)
  ),
  data = foxes
)

# Model 2: weight ~ avgfood + groupsize

m2 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + bFood*avgfood + bGroup*groupsize,
    a ~ dnorm(0, .1),
    c(bFood, bGroup) ~ dnorm(0, 3),
    sigma ~ dexp(1)
  ),
  data = foxes
)

# Model 3: weight ~ groupsize + area

m3 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + bGroup*groupsize + bArea*area,
    a ~ dnorm(0, .1),
    c(bGroup, bArea) ~ dnorm(0, 3),
    sigma ~ dexp(1)
  ),
  data = foxes
)

# Model 4: weight ~ avgfood

m4 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + bFood * avgfood,
    a ~ dnorm(0, .1),
    bFood ~ dnorm(0, 3),
    sigma ~ dexp(1)
  ),
  data = foxes
)

# Model 5: weight ~ area

m5 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + bArea * area,
    a ~ dnorm(0, .1),
    bArea ~ dnorm(0, 3),
    sigma ~ dexp(1)
  ),
  data = foxes
)

# Model Comparisons
compare(m1, m2, m3, m4, m5)

##        WAIC       SE      dWAIC      dSE    pWAIC      weight
## m1 322.7394 16.82943  0.0000000       NA 4.889538 0.395629037
## m2 323.2461 16.75386  0.5067064 3.860621 3.784133 0.307084761
## m3 323.3420 15.96881  0.6026121 4.109584 3.572415 0.292706664
## m4 332.9107 13.75468 10.1713046 8.607595 2.155021 0.002446907
## m5 333.1856 13.76245 10.4462414 8.651809 2.381344 0.002132631

plot(compare(m1, m2, m3, m4, m5))

The differences of in the WAIC scores all fall well within the 99% intervals of the differences: However, we can see that models m1, m2, m3 are almost the same in terms of WAIC or out-of-sample deviance. This is because models m1, m2, m3 all use groupsize and one of/both area and/or avgfood. As a result, all of these models are quite the same in their predictive power because there are no open backdoor paths from either area or avgfood when groupsize is used in conjunction. In other words, the effect of area while adjusting for groupsize is the same as the effect of avgfood while adjusting for groupsize, because the effect of area is entirely pass through avgfood. Two models m4 and m5 are almost the same in terms of WAIC or out-of-sample deviance as well. This is due to the fact that models m4 and m5 only contain area or avgfood alone and the impact of area on weight was transmitted via avgfood.

Week 7 - Overfitting

Questions