We’ve learned that a Linear Model is of the form:
\[\begin{array}{rcl} y_i & = & \mu_i + \epsilon_i \\ & = & X_i\beta + \epsilon_i \end{array}\]
that is, the observation \(y_i\) of a response variable is explained in part by the expected value \(\mu_i\), which consists of a linear combination of the predictors in \(X_i\), where \(\beta\) are their associated coefficients. For example, we could have \(X_i\beta = \beta_0 + \beta_1x_i\) for a simple linear regression.
The part of \(y_i\) not captured by the predictors falls into the residual term \(\epsilon_i\). But the main point is that the expected value \(\mu_i\) is related to the predictors in \(X_i\) through the identity function,
\[\mu_i = X_i\beta\]
This is a restriction, and it can be relaxed by writing
\[g(\mu_i) = X_i\beta\]
where \(g(\cdot)\) is called the link function. For example, we could define \(\log{\mu_i} = X_i\beta\), or \(\sqrt{\mu_i} = X_i\beta\), or even \(\frac{1}{\mu_i} = X_i\beta\), among many other functional forms. So we still model the expected value \(\mu_i\) using a linear predictor \(X_i\beta\), but we are no longer restricted to the identity link (\(\mu=X\beta\)). And that is what the GLM is about – generalization. In fact, with a LM we also assume normality and constant variance, whereas with a GLM other distributions can be assumed, such as the Gamma and inverse Gaussian (for positive continuous data) or the Poisson, binomial, and multinomial (for counts and proportions). Hence, linear models are but special cases of GLMs.
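To make that connection concrete, here is a minimal sketch (with simulated x and y, not part of the cucumber example below) showing that a LM is just a GLM with the Gaussian family and the identity link:
# simulated data: a simple linear relationship plus Gaussian noise
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
# a LM is a GLM with family gaussian and the identity link
fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))
cbind(lm = coef(fit_lm), glm = coef(fit_glm))  # identical estimates
Changing the family and link arguments (e.g. Gamma(link = "inverse")) is all it takes to move from a LM to the GLMs discussed below.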
Download this data. It consists of the number of insects (column ‘insects’) counted on cucumber plants subjected to five insecticide treatments (‘treat’, levels A-E). Each level of ‘treat’ was replicated six times.
cucumber <- read.csv('http://arsilva.weebly.com/uploads/2/1/0/0/21008856/pulgao.csv')
colnames(cucumber) <- c('treat', 'replication', 'insects')
str(cucumber)
'data.frame': 30 obs. of 3 variables:
$ treat : chr "A" "B" "C" "D" ...
$ replication: int 1 1 1 1 1 2 2 2 2 2 ...
$ insects : int 2370 1282 562 173 193 1687 1527 321 127 71 ...
boxplot(insects ~ treat, data = cucumber)
The boxplot shows the effect of the different insecticides on the number of insects per plant, but it also shows an interesting pattern in the length of each box. Have you noticed that the higher the values, the larger the variability? It seems, then, that assuming equality of variances is not quite reasonable.
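We can make that impression concrete by looking at the sample mean and standard deviation of each treatment (a quick sketch):
# sample mean and standard deviation per treatment: the spread broadly
# tracks the mean, a hint that the variance is not constant
with(cucumber, tapply(insects, treat, mean))
with(cucumber, tapply(insects, treat, sd))
Keeping that in mind, let’s start by fitting a LM: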
model_n <- lm(insects ~ treat, data = cucumber)
summary(model_n)
Call:
lm(formula = insects ~ treat, data = cucumber)
Residuals:
Min 1Q Median 3Q Max
-790.00 -92.75 -22.33 93.92 543.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2477.0 108.6 22.81 < 2e-16 ***
treatB -1402.0 153.6 -9.13 1.95e-09 ***
treatC -1949.8 153.6 -12.70 2.12e-12 ***
treatD -2320.7 153.6 -15.11 4.43e-14 ***
treatE -2385.7 153.6 -15.54 2.37e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 266 on 25 degrees of freedom
Multiple R-squared: 0.929, Adjusted R-squared: 0.9177
F-statistic: 81.79 on 4 and 25 DF, p-value: 5.505e-14
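A residual plot for this LM (a quick extra check, not shown in the original output) makes the issue visible: if the constant-variance assumption were reasonable, the vertical spread of the residuals would look roughly the same across the fitted values.
# residuals vs fitted values for the LM; under constant variance the
# spread should be similar along the x-axis
plot(fitted(model_n), residuals(model_n),
     xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0, lty = 2)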
Now let’s estimate the expected means and their standard errors with the package emmeans.
library(emmeans)
means_n <- emmeans(model_n, specs = 'treat', type = 'response')
means_n
 treat emmean SE df lower.CL upper.CL
A 2477.0 109 25 2253.4 2701
B 1075.0 109 25 851.4 1299
C 527.2 109 25 303.5 751
D 156.3 109 25 -67.3 380
E 91.3 109 25 -132.3 315
Confidence level used: 0.95
Now we have the expected mean of each level of treat. But check the SE (standard errors) as estimated by the fitted model: they are all the same (109). Well, let’s go back to the data and compute the sample standard errors:
# a home-brewed function for sample mean and S.E.
mean_and_se <- function(x) {
m <- mean(x)
se <- sd(x)/sqrt(length(x))
return(c(Mean = m, SE = se))
}
aggregate(insects ~ treat, data = cucumber, FUN = mean_and_se)
 treat insects.Mean insects.SE
1 A 2477.00000 197.37849
2 B 1075.00000 112.21913
3 C 527.16667 81.77833
4 D 156.33333 15.82333
5 E 91.33333 21.57107
The estimates of the expected means (emmean) agree with the sample means (insects.Mean). And for level B the model gives a relatively good estimate (109) of the standard error (the sample SE is 112.2). But consider, for example, levels A and E: their sample standard errors (197.4 and 21.6) are very different from what the model estimates (109).
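That identical SE is a direct consequence of the LM assumptions: a single pooled residual variance is divided by the (common) group size. A quick sketch shows where the 109 comes from:
# the LM pools one residual variance across all treatments, so every
# level gets the same standard error: residual SE / sqrt(n per group)
sigma(model_n) / sqrt(6)   # approximately 108.6, rounded to 109 above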
Instead, consider fitting a GLM with a Gamma distribution, where
\[g(\mu_i) = \frac{1}{\mu_i}\]
Unlike the previous linear model, where \(Var(y_i) = \phi\) is constant, this Gamma GLM assumes that the variances are not constant but related to the expected mean through \(Var(y_i) = \phi\mu_i^2\). That is, the variance is also modeled, using a constant \(\phi\), called the dispersion parameter, and the expected mean \(\mu_i\). We simply pass the family argument:
model_g <- glm(insects ~ treat, data = cucumber,
family = Gamma)
summary(model_g)
Call:
glm(formula = insects ~ treat, family = Gamma, data = cucumber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.65127 -0.20269 -0.06386 0.14992 0.85436
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.037e-04 5.915e-05 6.825 3.74e-07 ***
treatB 5.265e-04 1.486e-04 3.544 0.00158 **
treatC 1.493e-03 2.842e-04 5.255 1.93e-05 ***
treatD 5.993e-03 9.391e-04 6.382 1.11e-06 ***
treatE 1.055e-02 1.605e-03 6.569 6.99e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.128805)
Null deviance: 41.1228 on 29 degrees of freedom
Residual deviance: 2.8804 on 25 degrees of freedom
AIC: 391.98
Number of Fisher Scoring iterations: 5
Before interpreting the output, let’s just quickly check the expected means and S.E.:
means_g <- emmeans(model_g, specs = 'treat', type = 'response')
means_g
 treat response SE df asymp.LCL asymp.UCL
A 2477.0 362.9 Inf 1924 3475
B 1075.0 157.5 Inf 835 1508
C 527.2 77.2 Inf 410 740
D 156.3 22.9 Inf 121 219
E 91.3 13.4 Inf 71 128
Confidence level used: 0.95
Intervals are back-transformed from the inverse scale
The means are the same as the sample means. Good! But now the SEs describe the variability of each level of the factor treat much better. Notice how the SE changes along with the expected mean (column response).
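A rough sketch of where these SEs come from: since the Gamma GLM assumes \(Var(y_i) = \phi\mu_i^2\), the (delta-method) standard error of each treatment mean is approximately \(\mu_i\sqrt{\phi/n}\).
# SE of each treatment mean under the Gamma model: mu * sqrt(phi / n)
phi <- summary(model_g)$dispersion
mu  <- with(cucumber, tapply(insects, treat, mean))
mu * sqrt(phi / 6)   # close to the SE column reported by emmeans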
The AIC - Akaike’s Information Criterion can be used to compare the goodness-of-fit of two or more models. It is based on the likelihood function (\(L\)) and the number of parameters (\(k\)) fitted by the models:
\[AIC = -2\log{L} + 2k\] While \(-2\log{L}\) accounts for how well the model fits the data, the term \(2k\) penalizes the model for the number of parameters, so that a good model should be as parsimonious as possible. The smaller the AIC, the better the fit.
AIC(model_n, model_g)
 df AIC
model_n 6 426.6721
model_g 6 391.9849
Both the normal model (LM) and the Gamma GLM estimate \(k = 6\) parameters (five \(\beta\)’s and one \(\phi\)), so their complexity is the same. But the Gamma GLM fits the data better, because it has the smaller AIC.
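As a quick sanity check, the AIC of the Gamma model can be reproduced by hand from its log-likelihood and \(k = 6\):
# AIC = -2 log L + 2k, with k = 6 (five betas plus the dispersion phi)
-2 * as.numeric(logLik(model_g)) + 2 * 6   # matches AIC(model_g)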
The difference between the means of levels C and D is almost 400 insects. Using Tukey’s method, see how model_n fails to detect that difference as significant (\(p < 0.05\)), whereas model_g does detect it:
pairs(means_n)
 contrast estimate SE df t.ratio p.value
A - B 1402 154 25 9.130 <.0001
A - C 1950 154 25 12.697 <.0001
A - D 2321 154 25 15.112 <.0001
A - E 2386 154 25 15.535 <.0001
B - C 548 154 25 3.567 0.0118
B - D 919 154 25 5.982 <.0001
B - E 984 154 25 6.406 <.0001
C - D 371 154 25 2.415 0.1443
C - E 436 154 25 2.838 0.0618
D - E 65 154 25 0.423 0.9929
P value adjustment: tukey method for comparing a family of 5 estimates
pairs(means_g)
 contrast estimate SE df z.ratio p.value
A - B -0.000527 0.000149 Inf -3.544 0.0036
A - C -0.001493 0.000284 Inf -5.255 <.0001
A - D -0.005993 0.000939 Inf -6.382 <.0001
A - E -0.010545 0.001605 Inf -6.569 <.0001
B - C -0.000967 0.000310 Inf -3.123 0.0154
B - D -0.005466 0.000947 Inf -5.772 <.0001
B - E -0.010019 0.001610 Inf -6.223 <.0001
C - D -0.004500 0.000978 Inf -4.603 <.0001
C - E -0.009052 0.001628 Inf -5.560 <.0001
D - E -0.004552 0.001858 Inf -2.450 0.1023
Note: contrasts are still on the inverse scale
P value adjustment: tukey method for comparing a family of 5 estimates
The fact that one model is better than another does not mean it fits the data well. A check of the model’s residuals is always a good call. With GLMs we do this using Pearson residuals, a standardized form of the residuals. For example, for model_g,
res_g <- residuals(model_g, type = 'pearson')
plot(res_g ~ fitted(model_g))
Ideally, the residuals should have no trend, being randomly distributed around zero.
Also, a model may suffer from overdispersion when it fails to capture the variability of the data. A simple way to check for this problem is through the Pearson chi-square statistic:
X2 <- sum(res_g^2)
X2
[1] 3.220108
When X2 is considerably greater than the residual degrees of freedom (25 for model_g), it is indicative of overdispersion.
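The ratio behind that rule of thumb can be computed directly (a quick sketch):
# Pearson chi-square divided by the residual degrees of freedom;
# ratios much larger than 1 would suggest overdispersion
X2 / df.residual(model_g)   # about 0.13 here, well below 1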
The following data are counts of dead and alive insects subjected to increasing doses of an insecticide, where each experimental unit consists of n = 20 insects. With a plot, let’s check what the proportion of dead insects looks like.
n <- 20
dead <- c(0, 1, 5, 17, 19, 19)
alive <- n - dead
dose <- c(0, 0.25, 0.5, 1, 1.5, 2)
plot(dead/n ~ dose)
Notice that the response variable dead (or alive) is a binomial response, which can be described using a logit (logistic) model for \(p\), the expected proportion of dead insects, as a function of \(x\), the dose of the insecticide. The logit model is of the form:
\[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x\]
In this case, our linear predictor is a first-degree polynomial. But notice that the relationship between \(p\) and \(x\) is not a straight line; inverting the link gives
\[p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}\] which is rather a sigmoidal curve.
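As a quick numerical illustration of that inversion (plogis() is base R’s built-in inverse logit):
# the inverse logit maps any linear-predictor value into (0, 1)
eta <- seq(-4, 4, by = 2)
cbind(manual = exp(eta) / (1 + exp(eta)),
      plogis = plogis(eta))   # the two columns are identical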
model_b <- glm(cbind(dead, alive) ~ dose, family = binomial)
summary(model_b)
Call:
glm(formula = cbind(dead, alive) ~ dose, family = binomial)
Deviance Residuals:
1 2 3 4 5 6
-1.1304 -0.6788 0.1897 1.2123 -0.2964 -1.7918
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4277 0.6150 -5.573 2.50e-08 ***
dose 4.4606 0.7639 5.839 5.25e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 103.0984 on 5 degrees of freedom
Residual deviance: 6.5429 on 4 degrees of freedom
AIC: 22.417
Number of Fisher Scoring iterations: 5
A significant (p-value = \(5.25 \times 10^{-9}\)) effect is observed for dose.
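One common summary of such a dose-response fit (not part of the original analysis, so take it as a sketch) is the dose expected to kill 50% of the insects, i.e. the dose at which the linear predictor equals zero:
# LD50: log(p / (1 - p)) = 0 implies p = 0.5, so dose = -beta0 / beta1
ld50 <- -coef(model_b)[1] / coef(model_b)[2]
ld50   # roughly 0.77 with the estimates above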
The fitted curve is
betas <- coef(model_b)
curve(exp(betas[1] + betas[2]*x)/(1 + exp(betas[1] + betas[2]*x)),
from = 0, to = 2,
ylab = 'Proportion of dead insects',
xlab = 'Dose')
The \(\rho^2\) of McFadden (1974), analogous to the classical \(R^2\), can be used to measure the goodness of fit of a logit model.
\[\rho^2 = 1 - \frac{\log L}{\log L_{null}}\] where \(\log L_{null}\) is the log-likelihood of the ‘null’ model, containing only the intercept (~ 1).
model_null <- glm(cbind(dead, alive) ~ 1, family = binomial)
c( 1 - logLik(model_b)/logLik(model_null) )
[1] 0.839815
See help(family) for the distributions and link functions that can be used to fit GLMs.
Exercise: fit GLMs with the distributions gaussian (normal), poisson, Gamma, and inverse.gaussian to describe y, the number of tomato fruits damaged by Tuta absoluta in plots cultivated with three different varieties. After choosing the best-fitting model, test for differences (\(p < 0.05\)) among the expected means of the varieties.
y <- c(0, 0, 0, 0, 1, 0, 8, 5, 0, 3, 1, 6, 10, 4, 0, NA, 12, 7) + 1
variety <- gl(3, 6, labels = LETTERS[1:3])
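A possible starting point for the exercise (a sketch only, not a full solution; depending on the data, some families may need a different link to converge):
# fit the four candidate families and compare them by AIC; the NA in y
# is dropped automatically by glm()
fits <- list(
  gaussian  = glm(y ~ variety, family = gaussian),
  poisson   = glm(y ~ variety, family = poisson),
  Gamma     = glm(y ~ variety, family = Gamma),
  inv.gauss = glm(y ~ variety, family = inverse.gaussian)
)
sapply(fits, AIC)
# then, for the chosen model, compare varieties with, e.g.,
# pairs(emmeans(<chosen model>, specs = 'variety', type = 'response'))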