Notes 10
Modeling Proportions with Binomial Distribution | Part 5: AIC/BIC, Case Study
Setup
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'statmod' was built under R version 4.3.3
AIC/BIC
(See Section 7.8.) To compare the fit of models (even non-nested ones), the AIC and BIC can be used. The AIC and BIC are calculated as
\[AIC = -2\times LL + 2p\]
\[BIC = -2\times LL + (\log n)p\]
The \(LL\) stands for the log-likelihood calculated at the maximum likelihood estimates of the parameters. The \(p\) stands for the number of parameters. Note:
A higher log-likelihood means a better fit, so lower AIC and BIC values indicate a better fit.
The only difference between the two measures is the size of the penalty per parameter. The AIC has a smaller penalty than the BIC because \(\log(n) > 2\) for \(n>8\).
So the BIC will tend to pick models with fewer variables than the AIC.
## [1] 1.945910 2.079442
After trying out several models for the case study, we will compare their AIC/BICs at the end of the activity.
Case study - Section 9.11
An experiment exposed batches of insects to various deposits (in mg)
of insecticides (Table 9.5; data set: deposit
). The
proportion of insects killed after six days of exposure in each batch of
size is potentially a function of the dose of insecticide and the type
of insecticide.
## Killed Number Insecticide Deposit
## 1 3 50 A 2.00
## 2 5 49 A 2.64
## 3 19 47 A 3.48
## 4 19 38 A 4.59
## 5 24 29 A 6.06
Calculating proportion killed
Plotting proportions killed
ggplot(
data = deposit,
mapping = aes(x = Deposit, y = prop_killed, label = Insecticide, color = Insecticide)
) +
geom_text()
Interpretations:
Effect of deposit: As deposit used increases the proportion killed increases
Effect of type of insecticide: Type C has a higher proportion killed compared to B and A (which are mostly similar).
Model 1: Deposit and Insecticide (fit1
)
Writing binomial model with logit link:
Random component: Proportion killed\(\sim Binomial(\mu, Number)\)
Systematic component: \(logit(\mu) = \log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 Deposit + \beta_2I(type=B) + \beta_3I(type=C)\)
Make 1 less dummy variable than the number of categories. R does’nt make a dummy variable for the first category (alphabetically).
fit1 <- glm(prop_killed ~ Deposit + Insecticide, family = binomial(link = "logit"), weights = Number, data = deposit)
Plot fitted probabilities
ggplot(
data = augment(fit1, type.predict = "response"),
mapping = aes(x = Deposit, color = Insecticide)
) +
geom_point(aes(y = prop_killed)) +
geom_line(aes(y = .fitted))
How is the fit?
Plotting quantile residuals vs. fitted probabilities
augment(fit1, type.predict = "response") |>
mutate(qresid = qresid(fit1)) |>
ggplot(
mapping = aes(x = .fitted, y = qresid, color = Insecticide)
) +
geom_point() +
geom_line()
Interpretation: There seems to be a violation of *centered around 0** criteria
Solution: add quadratic deposit term
Model 2: Quadratic Deposit and Insecticide (fit2
)
Fit the model
fit2 <- glm(
prop_killed ~ Deposit + I(Deposit^2) + Insecticide,
family = binomial(link="logit"), weights = Number, data = deposit
)
Plot observed proportions with fitted probabilities
ggplot(
data = augment(fit2, type.predict = "response"),
mapping = aes(x = Deposit, color = Insecticide)
) +
geom_point(aes(y = prop_killed)) +
geom_line(aes(y = .fitted))
Plotting quantile residuals vs. fitted probabilities
augment(fit2, type.predict = "response") |>
mutate(qresid = qresid(fit2)) |>
ggplot(
mapping = aes(x = .fitted, y = qresid, color = Insecticide)
) +
geom_point() +
geom_line()
Interpretation: The u-shaped patterns are gone. This is an improvement.
Model 3: log(Deposit) and Insecticide (fit3
)
The book recommends using the log of the dose, stating that it is commonly used in dose-response models.
Another argument for its use is that the Deposit
values
used in the experiment were designed to constant spacing in the log
scale.
On the regular scale:
## [1] 2.00 2.64 3.48 4.59 6.06 8.00
## [1] 0.64 0.84 1.11 1.47 1.94
On the log scale:
## [1] 0.6931472 0.9707789 1.2470323 1.5238800 1.8017098 2.0794415
## [1] 0.2776317 0.2762534 0.2768477 0.2778298 0.2777317
Fit model with log(Deposit)
fit3 <- glm(
prop_killed ~ I(log(Deposit)) + Insecticide,
family = binomial(link="logit"), weights = Number, data = deposit
)
Plotting quantile residuals vs. fitted probabilities
augment(fit3, type.predict = "response") |>
mutate(qresid = qresid(fit3)) |>
ggplot(
mapping = aes(x = .fitted, y = qresid, color = Insecticide)
) +
geom_point() +
geom_line()
Interpretation:
Model 4: Quadratic log(Deposit) and Insecticide
(fit4
)
fit4 <- glm(
prop_killed ~ I(log(Deposit)) + I(log(Deposit)^2) + Insecticide,
family = binomial(link="logit"), weights = Number, data = deposit
)
Plotting quantile residuals vs. fitted probabilities
augment(fit4, type.predict = "response") |>
mutate(qresid = qresid(fit4)) |>
ggplot(
mapping = aes(x = .fitted, y = qresid, color = Insecticide)
) +
geom_point() +
geom_line()
Comparing models with AIC/BIC
The glance(<fit-obj>)
function provides single
number summaries of a model. For instance:
## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 414. 17 -55.1 118. 122. 48.0 14 18
which includes the loglikelihood function logLik
, the
AIC
, and the BIC
.
Illustrating the calculation of the AIC:
## [1] 118.2233
Collecting AIC/BIC values into a table
bind_rows(
glance(fit1),
glance(fit2),
glance(fit3),
glance(fit4)
) |>
select(AIC, BIC) |>
mutate(model = c("Linear Deposit", "Quadratic Deposit", "Linear log(Deposit)", "Quadratic log(Deposit)")) |>
relocate(model)
## # A tibble: 4 × 3
## model AIC BIC
## <chr> <dbl> <dbl>
## 1 Linear Deposit 118. 122.
## 2 Quadratic Deposit 86.7 91.1
## 3 Linear log(Deposit) 93.6 97.1
## 4 Quadratic log(Deposit) 87.3 91.7
Based on this, the Quadratic Deposit is the best fitting since it has the lowest AIC and lowest BIC, but the “Quadratic log(Deposit)” is very close.
Interpret Insecticide
coefficients as odds ratio
Using the “Quadratic Deposit” model, let’s interpret the coefficients
of the dummy variables for Insecticide
.
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.38 0.658 -9.70 2.91e-22
## 2 Deposit 2.15 0.280 7.68 1.58e-14
## 3 I(Deposit^2) -0.153 0.0269 -5.69 1.26e- 8
## 4 InsecticideB 0.306 0.206 1.49 1.37e- 1
## 5 InsecticideC 2.93 0.264 11.1 1.67e-28
At the same mg of deposit,
Insecticide B has ______ times higher odds of killing insects than Insecticide A. Answer:
## [1] 1.358151
or Insecticide B has 35.8%higher odds of killing insects than Insecticide A
Insecticide C has _________ times higher odds of killing insects than Insecticide A. Answer:
## [1] 18.66038