Notes 10

Modeling Proportions with Binomial Distribution | Part 5: AIC/BIC, Case Study

Setup

Loading packages

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(GLMsData)
library(broom) # for tidy, augment
## Warning: package 'broom' was built under R version 4.3.3
library(statmod) # for qresid()
## Warning: package 'statmod' was built under R version 4.3.3

AIC/BIC

(See Section 7.8.) To compare the fit of models (even non-nested ones), the AIC and BIC can be used. The AIC and BIC are calculated as

\[AIC = -2\times LL + 2p\]

\[BIC = -2\times LL + (\log n)p\]

The \(LL\) stands for the log-likelihood calculated at the maximum likelihood estimates of the parameters. The \(p\) stands for the number of parameters. Note:

  • A higher log-likelihood means a better fit, so lower AIC and BIC values indicate a better fit.

  • The only difference between the two measures is the size of the penalty per parameter. The AIC has a smaller penalty than the BIC because \(\log(n) > 2\) for \(n>8\).

  • So the BIC will tend to pick models with fewer variables than the AIC.

log(7:8)
## [1] 1.945910 2.079442

After trying out several models for the case study, we will compare their AIC/BICs at the end of the activity.

Case study - Section 9.11

An experiment exposed batches of insects to various deposits (in mg) of insecticides (Table 9.5; data set: deposit). The proportion of insects killed after six days of exposure in each batch of size is potentially a function of the dose of insecticide and the type of insecticide.

data(deposit)
head(deposit, 5)
##   Killed Number Insecticide Deposit
## 1      3     50           A    2.00
## 2      5     49           A    2.64
## 3     19     47           A    3.48
## 4     19     38           A    4.59
## 5     24     29           A    6.06

Calculating proportion killed

deposit <- deposit |> 
  mutate(prop_killed = Killed/Number)

Plotting proportions killed

ggplot(
  data = deposit, 
  mapping = aes(x = Deposit, y = prop_killed, label = Insecticide, color = Insecticide)
) + 
  geom_text()

Interpretations:

  • Effect of deposit: As deposit used increases the proportion killed increases

  • Effect of type of insecticide: Type C has a higher proportion killed compared to B and A (which are mostly similar).

Model 1: Deposit and Insecticide (fit1)

Writing binomial model with logit link:

  • Random component: Proportion killed\(\sim Binomial(\mu, Number)\)

  • Systematic component: \(logit(\mu) = \log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 Deposit + \beta_2I(type=B) + \beta_3I(type=C)\)

Make 1 less dummy variable than the number of categories. R does’nt make a dummy variable for the first category (alphabetically).

fit1 <- glm(prop_killed ~ Deposit + Insecticide, family = binomial(link = "logit"), weights = Number, data = deposit)

Plot fitted probabilities

ggplot(
  data = augment(fit1, type.predict = "response"), 
  mapping = aes(x = Deposit, color = Insecticide)
) + 
  geom_point(aes(y = prop_killed)) + 
  geom_line(aes(y = .fitted))

How is the fit?

Plotting quantile residuals vs. fitted probabilities

augment(fit1, type.predict = "response") |> 
  mutate(qresid = qresid(fit1)) |> 
  ggplot(
    mapping = aes(x = .fitted, y = qresid, color = Insecticide)
  ) + 
  geom_point() + 
  geom_line()

Interpretation: There seems to be a violation of *centered around 0** criteria

Solution: add quadratic deposit term

Model 2: Quadratic Deposit and Insecticide (fit2)

Fit the model

fit2 <- glm(
  prop_killed ~ Deposit + I(Deposit^2) + Insecticide, 
  family = binomial(link="logit"), weights = Number, data = deposit
  )

Plot observed proportions with fitted probabilities

ggplot(
  data = augment(fit2, type.predict = "response"), 
  mapping = aes(x = Deposit, color = Insecticide)
) + 
  geom_point(aes(y = prop_killed)) + 
  geom_line(aes(y = .fitted))

Plotting quantile residuals vs. fitted probabilities

augment(fit2, type.predict = "response") |> 
  mutate(qresid = qresid(fit2)) |> 
  ggplot(
    mapping = aes(x = .fitted, y = qresid, color = Insecticide)
  ) + 
  geom_point() + 
  geom_line()

Interpretation: The u-shaped patterns are gone. This is an improvement.

Model 3: log(Deposit) and Insecticide (fit3)

The book recommends using the log of the dose, stating that it is commonly used in dose-response models.

Another argument for its use is that the Deposit values used in the experiment were designed to constant spacing in the log scale.

On the regular scale:

unique(deposit$Deposit)
## [1] 2.00 2.64 3.48 4.59 6.06 8.00
unique(deposit$Deposit) |> diff()
## [1] 0.64 0.84 1.11 1.47 1.94

On the log scale:

unique(deposit$Deposit) |> log()
## [1] 0.6931472 0.9707789 1.2470323 1.5238800 1.8017098 2.0794415
unique(deposit$Deposit) |> log() |> diff()
## [1] 0.2776317 0.2762534 0.2768477 0.2778298 0.2777317

Fit model with log(Deposit)

fit3 <- glm(
    prop_killed ~ I(log(Deposit)) + Insecticide, 
    family = binomial(link="logit"), weights = Number, data = deposit
  )

Plotting quantile residuals vs. fitted probabilities

augment(fit3, type.predict = "response") |> 
  mutate(qresid = qresid(fit3)) |> 
  ggplot(
    mapping = aes(x = .fitted, y = qresid, color = Insecticide)
  ) + 
  geom_point() + 
  geom_line()

Interpretation:

Model 4: Quadratic log(Deposit) and Insecticide (fit4)

fit4 <- glm(
    prop_killed ~ I(log(Deposit)) + I(log(Deposit)^2) + Insecticide, 
    family = binomial(link="logit"), weights = Number, data = deposit
  )

Plotting quantile residuals vs. fitted probabilities

augment(fit4, type.predict = "response") |> 
  mutate(qresid = qresid(fit4)) |> 
  ggplot(
    mapping = aes(x = .fitted, y = qresid, color = Insecticide)
  ) + 
  geom_point() + 
  geom_line()

Comparing models with AIC/BIC

The glance(<fit-obj>) function provides single number summaries of a model. For instance:

glance(fit1)
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          414.      17  -55.1  118.  122.     48.0          14    18

which includes the loglikelihood function logLik, the AIC, and the BIC.

Illustrating the calculation of the AIC:

-2 * glance(fit1)$logLik + 2 * 4
## [1] 118.2233

Collecting AIC/BIC values into a table

bind_rows(
  glance(fit1), 
  glance(fit2), 
  glance(fit3), 
  glance(fit4)
) |> 
  select(AIC, BIC) |> 
  mutate(model = c("Linear Deposit", "Quadratic Deposit", "Linear log(Deposit)", "Quadratic log(Deposit)")) |> 
  relocate(model)
## # A tibble: 4 × 3
##   model                    AIC   BIC
##   <chr>                  <dbl> <dbl>
## 1 Linear Deposit         118.  122. 
## 2 Quadratic Deposit       86.7  91.1
## 3 Linear log(Deposit)     93.6  97.1
## 4 Quadratic log(Deposit)  87.3  91.7

Based on this, the Quadratic Deposit is the best fitting since it has the lowest AIC and lowest BIC, but the “Quadratic log(Deposit)” is very close.

Interpret Insecticide coefficients as odds ratio

Using the “Quadratic Deposit” model, let’s interpret the coefficients of the dummy variables for Insecticide.

tidy(fit2)
## # A tibble: 5 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    -6.38     0.658      -9.70 2.91e-22
## 2 Deposit         2.15     0.280       7.68 1.58e-14
## 3 I(Deposit^2)   -0.153    0.0269     -5.69 1.26e- 8
## 4 InsecticideB    0.306    0.206       1.49 1.37e- 1
## 5 InsecticideC    2.93     0.264      11.1  1.67e-28

At the same mg of deposit,

Insecticide B has ______ times higher odds of killing insects than Insecticide A. Answer:

exp(.3061244)
## [1] 1.358151

or Insecticide B has 35.8%higher odds of killing insects than Insecticide A

Insecticide C has _________ times higher odds of killing insects than Insecticide A. Answer:

exp(2.9264026)
## [1] 18.66038