In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression.
Generalized Linear Model with binomial random component and canonical link (Logit Link=Log Odds) - is called Logistic Regression. In R, glm function are used for the Logistic regression with following parameters.
- formula: predictors used in formula (Systematic Component)
- family: "binomial" needs to be chosen(Random Component)
- link: "logit" (default).(Link Component)
…
Overdispersion In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given statistical model.
Data:
This example comes from Agresti 4.7.4. “Teratology is the study of abnormalities of physiological development.”
This study considered pregnant female rats on iron-deficient diets:
Treatment Group 1 (un-treated except for placebo injections)
Treatment Group 2 (iron supplement injection on days 7 and 10)
Treatment Group 3 (iron supplement on days 0 and 7)
Treatment Group 4 (weekly iron supplement).
Data set contains 58 observations on the following 4 variables:
Litter: Litter number Treatment: Treatment group N: Number of litters Y: Number of dead litters
Objective:
In this study, I compared generalized linear models from complex(saturated) the simplest(null) model to understand the relationship between iron as a supplement and dead rat.
Methods to use: - GLM Models (Saturated to Null) - Logistic Regression with binomial family (Likelihood) - Logistic Regression with quasibinomial family (Quasilikelihood) - Analyzing Pearson residuals to understand the lack of fit - Analyzing Overdispersion
Data Set
For the \(g\)th treatment group, we have litters \(i=1,2,\ldots,N_g\), where the \(i\)th litter is of size \(n_{gi}\). Initial assumption is that the number of dead fetuses for the \(i\)th litter of the \(g\)th treatment group is \(y_{gi}\sim\)Binomial\((n_{gi},\pi_g)\). Not that this treatment group is not ordered.
There are \(\sum_{g=1}^4N_g=58\) litters. In this case, we have \(N=58\) binomial observations, with different sample sizes \(n_{gi}\) for each. Read in the data and display it:
setwd("C:/Users/myavu/OneDrive/Desktop/CSU/Old/Rpub")
df <- read.csv("Teratology.csv", header = TRUE)
head(df)
## Litter Treatment N Y
## 1 1 1 10 1
## 2 2 1 11 4
## 3 3 1 12 9
## 4 4 1 4 4
## 5 5 1 10 10
## 6 6 1 11 9
plot(df$Treatment, df$Y / df$N, pch = 16, col = "blue", xaxt = "n",
xlab = "Treatment Group",
ylab = "Proportion of Dead Rat Fetuses")
axis(side = 1, at = 1:4, labels = c("Placebo", "Day 7&10", "Day 0&7", "Weekly"))
In this case, we have \(N=58\) binomial observations, with different sample sizes \(n_{gi}\) for each. The saturated model has 58 parameters and gives \[
\hat\theta_{gi}=\frac{\mbox{number dead in litter}}{\mbox{number in litter}}
\] for \(i=1,2,\ldots,58\). We can fit the saturated model and see that it fits perfectly, but first we put the data into one of the standard input forms for binomial glm: a matrix with one column of successes and one column of failures. We are dropping the intercept there (with -1) so that every litter has its own parameter:
sat <-
glm(cbind(Y, N - Y) ~ -1 + factor(Litter),# -1 remove he intercept
family = binomial(link = logit),
data = df)
summary(sat)
##
## Call:
## glm(formula = cbind(Y, N - Y) ~ -1 + factor(Litter), family = binomial(link = logit),
## data = df)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [51] 0 0 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## factor(Litter)1 -2.197e+00 1.054e+00 -2.084 0.03712 *
## factor(Litter)2 -5.596e-01 6.268e-01 -0.893 0.37194
## factor(Litter)3 1.099e+00 6.667e-01 1.648 0.09937 .
## factor(Litter)4 2.737e+01 2.655e+05 0.000 0.99992
## factor(Litter)5 2.812e+01 2.447e+05 0.000 0.99991
## factor(Litter)6 1.504e+00 7.817e-01 1.924 0.05435 .
## factor(Litter)7 2.803e+01 2.463e+05 0.000 0.99991
## factor(Litter)8 2.820e+01 2.434e+05 0.000 0.99991
## factor(Litter)9 2.812e+01 2.447e+05 0.000 0.99991
## factor(Litter)10 8.473e-01 6.901e-01 1.228 0.21950
## factor(Litter)11 2.828e+01 2.423e+05 0.000 0.99991
## factor(Litter)12 2.197e+00 1.054e+00 2.084 0.03712 *
## factor(Litter)13 2.792e+01 2.483e+05 0.000 0.99991
## factor(Litter)14 1.504e+00 7.817e-01 1.924 0.05435 .
## factor(Litter)15 6.931e-01 8.660e-01 0.800 0.42349
## factor(Litter)16 1.253e+00 8.018e-01 1.562 0.11818
## factor(Litter)17 2.842e+01 2.406e+05 0.000 0.99991
## factor(Litter)18 3.365e-01 5.855e-01 0.575 0.56554
## factor(Litter)19 1.504e+00 7.817e-01 1.924 0.05435 .
## factor(Litter)20 4.700e-01 5.701e-01 0.824 0.40969
## factor(Litter)21 -5.878e-01 5.578e-01 -1.054 0.29197
## factor(Litter)22 2.812e+01 2.447e+05 0.000 0.99991
## factor(Litter)23 1.609e+00 7.746e-01 2.078 0.03773 *
## factor(Litter)24 4.700e-01 5.701e-01 0.824 0.40969
## factor(Litter)25 2.812e+01 2.447e+05 0.000 0.99991
## factor(Litter)26 -1.299e+00 6.513e-01 -1.995 0.04607 *
## factor(Litter)27 2.835e+01 2.414e+05 0.000 0.99991
## factor(Litter)28 1.099e+00 1.155e+00 0.951 0.34139
## factor(Litter)29 2.792e+01 2.483e+05 0.000 0.99991
## factor(Litter)30 -4.700e-01 5.701e-01 -0.824 0.40969
## factor(Litter)31 2.828e+01 2.423e+05 0.000 0.99991
## factor(Litter)32 -2.197e+00 1.054e+00 -2.084 0.03712 *
## factor(Litter)33 -6.931e-01 1.225e+00 -0.566 0.57143
## factor(Litter)34 -2.485e+00 1.041e+00 -2.387 0.01697 *
## factor(Litter)35 -2.828e+01 2.423e+05 0.000 0.99991
## factor(Litter)36 -9.163e-01 5.916e-01 -1.549 0.12143
## factor(Litter)37 -1.253e+00 8.018e-01 -1.562 0.11818
## factor(Litter)38 -1.705e+00 7.687e-01 -2.218 0.02658 *
## factor(Litter)39 -2.708e+00 1.033e+00 -2.622 0.00874 **
## factor(Litter)40 -2.820e+01 2.434e+05 0.000 0.99991
## factor(Litter)41 -2.737e+01 2.655e+05 0.000 0.99992
## factor(Litter)42 -2.657e+01 3.561e+05 0.000 0.99994
## factor(Litter)43 -2.828e+01 2.423e+05 0.000 0.99991
## factor(Litter)44 -2.792e+01 2.483e+05 0.000 0.99991
## factor(Litter)45 -2.303e+00 1.049e+00 -2.195 0.02813 *
## factor(Litter)46 -2.842e+01 2.406e+05 0.000 0.99991
## factor(Litter)47 -2.565e+00 1.038e+00 -2.472 0.01345 *
## factor(Litter)48 -2.820e+01 2.434e+05 0.000 0.99991
## factor(Litter)49 -2.716e+01 2.766e+05 0.000 0.99992
## factor(Litter)50 -2.835e+01 2.414e+05 0.000 0.99991
## factor(Litter)51 -1.253e+00 8.018e-01 -1.562 0.11818
## factor(Litter)52 -2.015e+00 7.528e-01 -2.677 0.00744 **
## factor(Litter)53 -2.848e+01 2.399e+05 0.000 0.99991
## factor(Litter)54 -2.690e+01 2.979e+05 0.000 0.99993
## factor(Litter)55 -2.565e+00 1.038e+00 -2.472 0.01345 *
## factor(Litter)56 -2.792e+01 2.483e+05 0.000 0.99991
## factor(Litter)57 -2.768e+01 2.541e+05 0.000 0.99991
## factor(Litter)58 -2.860e+01 2.388e+05 0.000 0.99990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5.1823e+02 on 58 degrees of freedom
## Residual deviance: 3.2873e-10 on 0 degrees of freedom
## AIC: 187.47
##
## Number of Fisher Scoring iterations: 25
Null Model has huge deviance value (5.1823e+02), so Null model(all treatements has the same success probability) doest fit at all, we can notice from the graph that each treatment has different mean.
On the other hand, the residual deviance is zero, so that the saturated models fits perfectly.
And the saturated model reproduces the simple proportions:
plot(df$Treatment, (df$Y / df$N), pch = 16, col = "blue", xaxt = "n",
xlab = "Treatment Group",
ylab = "Proportion of Dead Rat Fetuses" )
axis(side = 1, at = 1:4, labels = c("Placebo", "Day 7&10", "Day 0&7", "Weekly"))
points(df$Treatment, predict.glm(sat, type = "response"), pch = 5, col = "magenta")
We can see that the saturated models fits perfectly.
Simpler Model
Now we fit a simpler GLM than the saturated model, by letting each treatment have its own effect instead of each litter. This means there are only four parameters instead of 58.Specifically, the formula gives y = dead, n-y = alive as a function of iron supplement treatment, a categorical variable with four levels.
The model is expressed with an intercept and three dummy variables.
Fisher Scoring is the iterative numerical optimization method used to get the parameter estimates.
out <- glm(cbind(Y, N - Y) ~ factor(Treatment), family = binomial(link = logit),
data = df)
summary(out)
##
## Call:
## glm(formula = cbind(Y, N - Y) ~ factor(Treatment), family = binomial(link = logit),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4295 -0.9750 -0.0285 1.4024 2.7826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1440 0.1292 8.855 < 2e-16 ***
## factor(Treatment)2 -3.3225 0.3308 -10.043 < 2e-16 ***
## factor(Treatment)3 -4.4762 0.7311 -6.122 9.22e-10 ***
## factor(Treatment)4 -4.1297 0.4762 -8.672 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.43 on 57 degrees of freedom
## Residual deviance: 173.45 on 54 degrees of freedom
## AIC: 252.92
##
## Number of Fisher Scoring iterations: 5
Interpretation Note that, if we look at the coefficients closely, we can notice that the Null group(Treatment 1 which did not get any iron) has a positive coefficient, but all of the iron supplement groups have negative coefficients.
In our case, the success probability is the dead rat.So, If we provide an iron to the rats, it lowers the success probaility that means that reduces the number of dead rats.
Residual deviance is huge, so this model doesnt fit.
Plot the probabilities from the GLM fit along with the litter-level estimated probabilities:
plot(df$Treatment, df$Y / df$N, pch = 16, col = "blue", xlab = "Treatment Group",
ylab = "Proportion of Dead Rat Fetuses", xaxt = "n")
axis(side = 1, at = 1:4, labels = c("Placebo", "Day 7&10", "Day 0&7", "Weekly"))
fitted_prob <- predict.glm(out, type = "response")
points(df$Treatment, fitted_prob, pch = 17, col = "red")
The fitted probabilities (red triangles) track the empirical proportions (blue dots) quite well.
Lack of Fit Test
pchisq(out$deviance, df = 54, lower.tail = F)
## [1] 1.87631e-14
It is zero, so there is no way that this models fit. The null hypothesis of adequate model fit with four parameters is strongly rejected.
Let’s analyze the Pearson residuals to understand the lack-of-fit.
# Residual diagnostic plot
e <- residuals(out, "pearson")
range(e) # Compare to deviance residuals
## [1] -4.864124 2.442102
e_by_hand <- (df$Y - df$N * fitted_prob) / sqrt(df$N * fitted_prob * (1 - fitted_prob))
range(e - e_by_hand)
## [1] -8.881784e-16 8.881784e-16
m <- max(c(abs(range(e)), 3))
plot(predict(out), e, pch = 16, col = "blue", ylim = c(-m, m),
xlab = "Predicted probability under GLM", ylab = "Pearson residual")
abline(h = c(-2, 2), col = "red", lty = 2, lwd = 2)
abline(h = c(-3, 3), col = "red", lty = 1, lwd = 2)
There is clear evidence of overdispersion here. There are observations are out of range, there are even observations out of [-4,+4] bands.
Let’s compute the Pearson statistic for model lack-of-fit and estimate the dispersion parameter, remembering that it would be close to one in the absence of overdispersion:
obs_dead <- df$Y # observed dead
exp_dead <- df$N*fitted_prob # expected dead
obs_live <- df$N - df$Y
exp_live <- df$N * (1 - fitted_prob)
X2 <- sum((obs_dead - exp_dead)^2 / exp_dead + (obs_live - exp_live)^2 / exp_live)
X2 # Pearson statistic
## [1] 154.707
df2 <- length(df$Litter) - 4
pchisq(X2, df = df2, lower.tail = F) # p-value for Pearson test of null that GLM fits
## [1] 1.187222e-11
phi_hat <- X2 / df2 # estimated overdispersion parameter
phi_hat
## [1] 2.864945
The pearson statistics is 2.86 is way bigger than 1 so definitelty the overdispersion is the case.
Fit model with quasilikelihood
Now fit the model using quasi-likelihood instead of likelihood:
out_quasi <- glm(cbind(Y, N - Y) ~ factor(Treatment), family = quasibinomial(link = logit),
data = df)
summary(out_quasi)
##
## Call:
## glm(formula = cbind(Y, N - Y) ~ factor(Treatment), family = quasibinomial(link = logit),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4295 -0.9750 -0.0285 1.4024 2.7826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1440 0.2187 5.231 2.81e-06 ***
## factor(Treatment)2 -3.3225 0.5600 -5.933 2.18e-07 ***
## factor(Treatment)3 -4.4762 1.2375 -3.617 0.000656 ***
## factor(Treatment)4 -4.1297 0.8061 -5.123 4.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 2.864945)
##
## Null deviance: 509.43 on 57 degrees of freedom
## Residual deviance: 173.45 on 54 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
Notice that Dispersion parameter for quasibinomial family is exactly what we computed “by hand” above: the estimated dispersion parameter \(\widehat\phi=X^2/(N-4)\), which is used to account for overdispersion by modifying the standard errors of the original binomial fit:
summary(out)$coefficients[, 2] # standard errors from original fit
## (Intercept) factor(Treatment)2 factor(Treatment)3 factor(Treatment)4
## 0.1291917 0.3308440 0.7311275 0.4762261
summary(out)$coefficients[, 2] * sqrt(phi_hat) # multiplied by sqrt(phi_hat)
## (Intercept) factor(Treatment)2 factor(Treatment)3 factor(Treatment)4
## 0.2186717 0.5599915 1.2375173 0.8060675
summary(out_quasi)$coefficients[, 2] # standard errors from quasi-likelihood fit
## (Intercept) factor(Treatment)2 factor(Treatment)3 factor(Treatment)4
## 0.2186717 0.5599916 1.2375173 0.8060675
The quasibinomial adjust the standard errors accordingly, it chnages the dispersion parameter to 2.86.None of the estimaes has chnaged, but now we get the bigger confidence interval.
The standard errors are increased—nearly doubled—to account for the overdispersion. Similarly, the t-statistics of the original binomial fit are modified:
summary(out)$coefficients[, 3] # t-statistics from original fit
## (Intercept) factor(Treatment)2 factor(Treatment)3 factor(Treatment)4
## 8.854913 -10.042536 -6.122305 -8.671642
summary(out)$coefficients[, 3] / sqrt(phi_hat) # divided by sqrt(phi_hat)
## (Intercept) factor(Treatment)2 factor(Treatment)3 factor(Treatment)4
## 5.231499 -5.933149 -3.617069 -5.123222
#summary(out_quasi)$coefficients[, 3] # t-statistics from quasi-likelihood fit
The t-statistics are now still significant, but are not as large as those that failed to account properly for the overdispersion.
Quasi binomial takes sqrt(overdispersion) parameter into effect, and adjust standard errors accordingly. We still get the exact point estimate but any confidence interval we compute out of it, will be bigger by the sqrt(overdispersion=2.86).
So, why do we get overdispersion for the rat litters?
Because there are dependecy here, the survival probability from the same mother should be taken into account because it is highly correlated.
References