We’ve learned that a Linear Model is of the form:
\[\begin{array}{rcl} y_i & = & \mu_i + \epsilon_i \\ & = & X_i\beta + \epsilon_i \end{array}\]
that is, the observation \(y_i\) of a response variable is explained in part by the expected value \(\mu_i\), which consists of a linear combination of the predictors in \(X_i\), where \(\beta\) are their associated coefficients. For example, we could have \(X_i\beta = \beta_0 + \beta_1x_i\) for a simple linear regression.
The part of \(y_i\) not captured by the predictors falls into the residual term \(\epsilon_i\). But the main point is that the expected value \(\mu_i\) is related to the predictors in \(X_i\) through the identity function,
\[\mu_i = X_i\beta\]
This is a restriction, and it can be relaxed by writing
\[g(\mu_i) = X_i\beta\]
where \(g(\cdot)\) is called the link function. For example, we could define \(\log{\mu_i} = X_i\beta\), or \(\sqrt{\mu_i} = X_i\beta\), or even \(\frac{1}{\mu_i} = X_i\beta\), among many other functional forms. So we still model the expected value \(\mu_i\) using a linear predictor \(X_i\beta\), but we are no longer restricted to the identity link (\(\mu=X\beta\)). And that is what the GLM is about – generalization. In fact, with a LM we also assume normality and constant variance, whereas with a GLM other distributions can be assumed, such as the Gamma and inverse Gaussian (for positive continuous data) or the Poisson, binomial, and multinomial (for counts and proportions). Hence, linear models are but special cases of GLMs.
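To make that connection concrete, here is a minimal sketch (with simulated x and y, not part of the cucumber example below) showing that a LM is just a GLM with the Gaussian family and the identity link:
# simulated data: a simple linear relationship plus Gaussian noise
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
# a LM is a GLM with family gaussian and the identity link
fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))
cbind(lm = coef(fit_lm), glm = coef(fit_glm))  # identical estimates
Changing the family and link arguments (e.g. Gamma(link = "inverse")) is all it takes to move from a LM to the GLMs discussed below.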
Download this data. It consists of the number of insects (column ‘insects’) counted on cucumber plants subjected to five insecticide treatments (‘treat’, levels A-E). Each level of ‘treat’ was replicated six times.
cucumber <- read.csv('http://arsilva.weebly.com/uploads/2/1/0/0/21008856/pulgao.csv')
colnames(cucumber) <- c('treat', 'replication', 'insects')
str(cucumber)
'data.frame': 30 obs. of 3 variables:
$ treat : chr "A" "B" "C" "D" ...
$ replication: int 1 1 1 1 1 2 2 2 2 2 ...
$ insects : int 2370 1282 562 173 193 1687 1527 321 127 71 ...
boxplot(insects ~ treat, data = cucumber)
The boxplot shows the effect of the different insecticides on the number of insects per plant, but it also shows an interesting pattern in the length of each box. Have you noticed that the higher the values, the larger the variability? It seems, then, that assuming equality of variances is not quite reasonable.
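We can make that impression concrete by looking at the sample mean and standard deviation of each treatment (a quick sketch):
# sample mean and standard deviation per treatment: the spread broadly
# tracks the mean, a hint that the variance is not constant
with(cucumber, tapply(insects, treat, mean))
with(cucumber, tapply(insects, treat, sd))
Keeping that in mind, let’s start by fitting a LM: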
model_n <- lm(insects ~ treat, data = cucumber)
summary(model_n)
Call:
lm(formula = insects ~ treat, data = cucumber)
Residuals:
Min 1Q Median 3Q Max
-790.00 -92.75 -22.33 93.92 543.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2477.0 108.6 22.81 < 2e-16 ***
treatB -1402.0 153.6 -9.13 1.95e-09 ***
treatC -1949.8 153.6 -12.70 2.12e-12 ***
treatD -2320.7 153.6 -15.11 4.43e-14 ***
treatE -2385.7 153.6 -15.54 2.37e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 266 on 25 degrees of freedom
Multiple R-squared: 0.929, Adjusted R-squared: 0.9177
F-statistic: 81.79 on 4 and 25 DF, p-value: 5.505e-14
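A residual plot for this LM (a quick extra check, not shown in the original output) makes the issue visible: if the constant-variance assumption were reasonable, the vertical spread of the residuals would look roughly the same across the fitted values.
# residuals vs fitted values for the LM; under constant variance the
# spread should be similar along the x-axis
plot(fitted(model_n), residuals(model_n),
     xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0, lty = 2)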
Now let’s estimate the expected means and their standard errors with the package emmeans.
library(emmeans)
means_n <- emmeans(model_n, specs = 'treat', type = 'response')
means_n
 treat emmean SE df lower.CL upper.CL
A 2477.0 109 25 2253.4 2701
B 1075.0 109 25 851.4 1299
C 527.2 109 25 303.5 751
D 156.3 109 25 -67.3 380
E 91.3 109 25 -132.3 315
Confidence level used: 0.95
Now we have the expected mean of each level of treat. But check the SE (standard errors) as estimated by the fitted model: they are all the same (109). Well, let’s go back to the data and compute the sample standard errors:
# a home-brewed function for sample mean and S.E.
mean_and_se <- function(x) {
m <- mean(x)
se <- sd(x)/sqrt(length(x))
return(c(Mean = m, SE = se))
}
aggregate(insects ~ treat, data = cucumber, FUN = mean_and_se)
 treat insects.Mean insects.SE
1 A 2477.00000 197.37849
2 B 1075.00000 112.21913
3 C 527.16667 81.77833
4 D 156.33333 15.82333
5 E 91.33333 21.57107
The estimates of the expected means (emmean) agree with the sample means (insects.Mean). And for level B the model gives a relatively good estimate (109) of the standard error (the sample SE is 112.2). But consider, for example, levels A and E: their sample standard errors (197.4 and 21.6) are very different from what the model estimates (109).
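That identical SE is a direct consequence of the LM assumptions: a single pooled residual variance is divided by the (common) group size. A quick sketch shows where the 109 comes from:
# the LM pools one residual variance across all treatments, so every
# level gets the same standard error: residual SE / sqrt(n per group)
sigma(model_n) / sqrt(6)   # approximately 108.6, rounded to 109 above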
Instead, consider fitting a GLM with a Gamma distribution, where
\[g(\mu_i) = \frac{1}{\mu_i}\]
Unlike the previous linear model, where \(Var(y_i) = \phi\) is constant, this Gamma GLM assumes that the variances are not constant but related to the expected mean through \(Var(y_i) = \phi\mu_i^2\). That is, the variance is also modeled, using a constant \(\phi\), called the dispersion parameter, and the expected mean \(\mu_i\). We simply pass the family argument:
model_g <- glm(insects ~ treat, data = cucumber,
family = Gamma)
summary(model_g)
Call:
glm(formula = insects ~ treat, family = Gamma, data = cucumber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.65127 -0.20269 -0.06386 0.14992 0.85436
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.037e-04 5.915e-05 6.825 3.74e-07 ***
treatB 5.265e-04 1.486e-04 3.544 0.00158 **
treatC 1.493e-03 2.842e-04 5.255 1.93e-05 ***
treatD 5.993e-03 9.391e-04 6.382 1.11e-06 ***
treatE 1.055e-02 1.605e-03 6.569 6.99e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.128805)
Null deviance: 41.1228 on 29 degrees of freedom
Residual deviance: 2.8804 on 25 degrees of freedom
AIC: 391.98
Number of Fisher Scoring iterations: 5
Before interpreting the output, let’s just quickly check the expected means and S.E.:
means_g <- emmeans(model_g, specs = 'treat', type = 'response')
means_g
 treat response SE df asymp.LCL asymp.UCL
A 2477.0 362.9 Inf 1924 3475
B 1075.0 157.5 Inf 835 1508
C 527.2 77.2 Inf 410 740
D 156.3 22.9 Inf 121 219
E 91.3 13.4 Inf 71 128
Confidence level used: 0.95
Intervals are back-transformed from the inverse scale
The means are the same as the sample means. Good! But now the SEs describe the variability of each level of the factor treat much better. Notice how the SE changes along with the expected mean (column response).
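A rough sketch of where these SEs come from: since the Gamma GLM assumes \(Var(y_i) = \phi\mu_i^2\), the (delta-method) standard error of each treatment mean is approximately \(\mu_i\sqrt{\phi/n}\).
# SE of each treatment mean under the Gamma model: mu * sqrt(phi / n)
phi <- summary(model_g)$dispersion
mu  <- with(cucumber, tapply(insects, treat, mean))
mu * sqrt(phi / 6)   # close to the SE column reported by emmeans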
The AIC - Akaike’s Information Criterion can be used to compare the goodness-of-fit of two or more models. It is based on the likelihood function (\(L\)) and the number of parameters (\(k\)) fitted by the models:
\[AIC = -2\log{L} + 2k\] While \(-2\log{L}\) accounts for how well the model fits the data, the term \(2k\) penalizes the model for the number of parameters, so that a good model should be as parsimonious as possible. The smaller the AIC, the better the fit.
AIC(model_n, model_g)
 df AIC
model_n 6 426.6721
model_g 6 391.9849
Both the normal model (LM) and the Gamma GLM estimate \(k = 6\) parameters (five \(\beta\)’s and one \(\phi\)), so their complexity is the same. But the Gamma GLM fits the data better, because it has the smaller AIC.
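As a quick sanity check, the AIC of the Gamma model can be reproduced by hand from its log-likelihood and \(k = 6\):
# AIC = -2 log L + 2k, with k = 6 (five betas plus the dispersion phi)
-2 * as.numeric(logLik(model_g)) + 2 * 6   # matches AIC(model_g)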
The difference between the means of levels C and D is almost 400 insects. Using Tukey’s method, see how model_n fails to detect that difference as significant (\(p < 0.05\)), whereas model_g does detect it:
pairs(means_n)
 contrast estimate SE df t.ratio p.value
A - B 1402 154 25 9.130 <.0001
A - C 1950 154 25 12.697 <.0001
A - D 2321 154 25 15.112 <.0001
A - E 2386 154 25 15.535 <.0001
B - C 548 154 25 3.567 0.0118
B - D 919 154 25 5.982 <.0001
B - E 984 154 25 6.406 <.0001
C - D 371 154 25 2.415 0.1443
C - E 436 154 25 2.838 0.0618
D - E 65 154 25 0.423 0.9929
P value adjustment: tukey method for comparing a family of 5 estimates
pairs(means_g)
 contrast estimate SE df z.ratio p.value
A - B -0.000527 0.000149 Inf -3.544 0.0036
A - C -0.001493 0.000284 Inf -5.255 <.0001
A - D -0.005993 0.000939 Inf -6.382 <.0001
A - E -0.010545 0.001605 Inf -6.569 <.0001
B - C -0.000967 0.000310 Inf -3.123 0.0154
B - D -0.005466 0.000947 Inf -5.772 <.0001
B - E -0.010019 0.001610 Inf -6.223 <.0001
C - D -0.004500 0.000978 Inf -4.603 <.0001
C - E -0.009052 0.001628 Inf -5.560 <.0001
D - E -0.004552 0.001858 Inf -2.450 0.1023
Note: contrasts are still on the inverse scale
P value adjustment: tukey method for comparing a family of 5 estimates
The fact that one model is better than another does not mean it fits the data well. A check of the model’s residuals is always a good call. With GLMs we do this using Pearson residuals, a standardized form of the residuals. For example, for model_g,
res_g <- residuals(model_g, type = 'pearson')
plot(res_g ~ fitted(model_g))
Ideally, the residuals should have no trend, being randomly distributed around zero.
Also, a model may suffer from overdispersion when it fails to capture the variability of the data. A simple way to check for this problem is through the Pearson chi-square statistic:
X2 <- sum(res_g^2)
X2
[1] 3.220108
When X2 is considerably greater than the residual degrees of freedom (25 for model_g), it is indicative of overdispersion.
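The ratio behind that rule of thumb can be computed directly (a quick sketch):
# Pearson chi-square divided by the residual degrees of freedom;
# ratios much larger than 1 would suggest overdispersion
X2 / df.residual(model_g)   # about 0.13 here, well below 1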
The following data are counts of dead and alive insects subjected to increasing doses of an insecticide, where each experimental unit consists of n = 20 insects. With a plot, let’s check what the proportion of dead insects looks like.
n <- 20
dead <- c(0, 1, 5, 17, 19, 19)
alive <- n - dead
dose <- c(0, 0.25, 0.5, 1, 1.5, 2)
plot(dead/n ~ dose)
Notice that the response variable dead (or alive) is a binomial response, which can be described using a logit (logistic) model for \(p\), the expected proportion of dead insects, as a function of \(x\), the dose of the insecticide. The logit model is of the form:
\[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x\]
In this case, our linear predictor is a first-degree polynomial. But notice that the relationship between \(p\) and \(x\) is not a straight line; inverting the link gives
\[p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}\] which is rather a sigmoidal curve.
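As a quick numerical illustration of that inversion (plogis() is base R’s built-in inverse logit):
# the inverse logit maps any linear-predictor value into (0, 1)
eta <- seq(-4, 4, by = 2)
cbind(manual = exp(eta) / (1 + exp(eta)),
      plogis = plogis(eta))   # the two columns are identical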
model_b <- glm(cbind(dead, alive) ~ dose, family = binomial)
summary(model_b)
Call:
glm(formula = cbind(dead, alive) ~ dose, family = binomial)
Deviance Residuals:
1 2 3 4 5 6
-1.1304 -0.6788 0.1897 1.2123 -0.2964 -1.7918
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4277 0.6150 -5.573 2.50e-08 ***
dose 4.4606 0.7639 5.839 5.25e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 103.0984 on 5 degrees of freedom
Residual deviance: 6.5429 on 4 degrees of freedom
AIC: 22.417
Number of Fisher Scoring iterations: 5
A significant (p-value = \(5.25 \times 10^{-9}\)) effect is observed for dose.
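One common summary of such a dose-response fit (not part of the original analysis, so take it as a sketch) is the dose expected to kill 50% of the insects, i.e. the dose at which the linear predictor equals zero:
# LD50: log(p / (1 - p)) = 0 implies p = 0.5, so dose = -beta0 / beta1
ld50 <- -coef(model_b)[1] / coef(model_b)[2]
ld50   # roughly 0.77 with the estimates above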
The fitted curve is
betas <- coef(model_b)
curve(exp(betas[1] + betas[2]*x)/(1 + exp(betas[1] + betas[2]*x)),
from = 0, to = 2,
ylab = 'Proportion of dead insects',
xlab = 'Dose')
The \(\rho^2\) of McFadden (1974), analogous to the classical \(R^2\), can be used to measure the goodness of fit of a logit model.
\[\rho^2 = 1 - \frac{\log L}{\log L_{null}}\] where \(\log L_{null}\) is the log-likelihood of the ‘null’ model, containing only the intercept (~ 1).
model_null <- glm(cbind(dead, alive) ~ 1, family = binomial)
c( 1 - logLik(model_b)/logLik(model_null) )
[1] 0.839815
See help(family) for the distributions and link functions that can be used to fit GLMs.
Exercise: fit GLMs with the distributions gaussian (normal), poisson, Gamma, and inverse.gaussian to describe y, the number of tomato fruits damaged by Tuta absoluta in plots cultivated with three different varieties. After choosing the best-fitting model, test for differences (\(p < 0.05\)) among the expected means of the varieties.
y <- c(0, 0, 0, 0, 1, 0, 8, 5, 0, 3, 1, 6, 10, 4, 0, NA, 12, 7) + 1
variety <- gl(3, 6, labels = LETTERS[1:3])
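A possible starting point for the exercise (a sketch only, not a full solution; depending on the data, some families may need a different link to converge):
# fit the four candidate families and compare them by AIC; the NA in y
# is dropped automatically by glm()
fits <- list(
  gaussian  = glm(y ~ variety, family = gaussian),
  poisson   = glm(y ~ variety, family = poisson),
  Gamma     = glm(y ~ variety, family = Gamma),
  inv.gauss = glm(y ~ variety, family = inverse.gaussian)
)
sapply(fits, AIC)
# then, for the chosen model, compare varieties with, e.g.,
# pairs(emmeans(<chosen model>, specs = 'variety', type = 'response'))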