Tobacco budworm example in the slides

tobacco <- read.csv("https://raw.githubusercontent.com/summer072591/MTHS5051B-MTHS6046B/refs/heads/main/budworm.csv", 
                    header = TRUE) 
head(tobacco)
dim(tobacco) 
## [1] 12  4

In the dataset, there are 12 obs and 4 variables: deaths, live, dose, sex.

# Load necessary library
library(ggplot2)

# Check structure of the dataset
str(tobacco)
## 'data.frame':    12 obs. of  4 variables:
##  $ deaths: int  1 4 9 13 18 20 0 2 6 10 ...
##  $ live  : int  19 16 11 7 2 0 20 18 14 10 ...
##  $ dose  : int  1 2 4 8 16 32 1 2 4 8 ...
##  $ sex   : int  1 1 1 1 1 1 2 2 2 2 ...
# Ensure 'sex' is a factor
tobacco$sex <- as.factor(tobacco$sex)

# Plot the association between death and dose separately for males and females
ggplot(tobacco, aes(x = dose, y = deaths, color = sex, shape = sex)) +
  geom_point(alpha = 0.7) +  # Add points
  labs(title = "Association Between Death and Dose by Sex",
       x = "Dose",
       y = "Deaths") +
  theme_minimal()

We could see that there’s a nonlinear trend between deaths and dose of insecticide. We consider plotting log-odds defined as

\[ Z_i = \log (\frac{Y_i + .5}{n - Y_i + 0.5} ) \]

Z <- log((tobacco$deaths + .5)/(20 - tobacco$deaths + .5))

tobacco.extend <- data.frame(tobacco, logodds = Z)

ggplot(tobacco.extend, aes(x = log(dose), y = logodds, color = sex, shape = sex)) +
  geom_point(alpha = 0.7) +  # Add points
  labs(title = "Association Between Death and Dose by Sex",
       x = "Dose",
       y = "Deaths") +
  theme_minimal()

fitting a binomial GLM

#' dose has a wide range; the marginal distribution is asymmetric and has a long tail
#' hence, we consider log transformation 

g0 <- glm(cbind(deaths, live) ~ sex * log(dose), data = tobacco, family = binomial)
summary(g0)
## 
## Call:
## glm(formula = cbind(deaths, live) ~ sex * log(dose), family = binomial, 
##     data = tobacco)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.8186     0.5480  -5.143 2.70e-07 ***
## sex2            -0.1750     0.7783  -0.225    0.822    
## log(dose)        1.8163     0.3059   5.937 2.91e-09 ***
## sex2:log(dose)  -0.5091     0.3895  -1.307    0.191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   4.9937  on  8  degrees of freedom
## AIC: 43.104
## 
## Number of Fisher Scoring iterations: 4

From the summary, the variable and its interaction with is not significant. We next perform model diagnostics and check whether dropping it would change the residual deviance.

anova(g0, test = "Chisq")

The Chisquared tests suggest that does not make a significant contribution for explaining the deviance residuals. Hence, we accept a simpler model with two main effects.

g1 <- glm(cbind(deaths, live) ~ sex + log(dose), data = tobacco, family = binomial)


anova(g1, test = "Chisq")
#' comparing residual deviance with the saturated model
pchisq(6.757, df = 9, lower.tail = FALSE)
## [1] 0.6624024

By comparing the deviance residual with its asymptotic distribution \(\chi_9^2\) under the null, the \(p\)-value suggests that we accept the null, i.e. the fitted model explains a sufficient amount of variance of the response.

We express the final model by using the linear predictor and the link function. We are using the default logit link function, that is, \[ \log (\frac{\pi}{1 - \pi} ) = \eta = X^\top \beta . \] In this case, the estimated coefficients are

coef <- coef(g1)

Hence, the fitted model \[ \log (\frac{\pi}{1 - \pi} ) = -2.372412 -1.100743 \cdot \mbox{sex} + 1.535336 \cdot \mbox{log(dose)}. \] If we change from female to male, the log-odds ratio drops 1.100743. Further, the estimated sucess probability can be expressed as follows \[ \pi = \frac{1}{1 + \exp \{-(X^\top \beta)\}}. \] Changing from female to male and using the highest dose 32, based on the fitted model , the predicted probability of death drops.

pi_female <- 1/(1 + exp(-predict(g1, newdata = data.frame(dose = 32, sex = factor(1)))))
pi_male <- 1/(1 + exp(-predict(g1, newdata = data.frame(dose = 32, sex = factor(2)))))

pi_male - pi_female
##           1 
## -0.08631796