The last few blogs covered the theory and practice of logistic and Poisson regression, where the response variable is binomial or Poisson distributed. Linear regression can be generalized to many other kinds of non-normal responses, under the umbrella of generalized linear models.
Since I have already covered two of these models, I thought it would be a good idea to cover their residuals and diagnostics in more detail. That is the focus of this blog.
The Problem
To accurately capture the phenomena of interest, a GLM requires that:
- There are no outliers
- The link function is correct
- All important independent variables are included, and each enters the model linearly
- The correct variance function \(V(\mu)\) is used (GLMs posit that the variance is a function of the mean rather than constant)
- The dispersion parameter is constant
- The responses \(y_i\) are independent of each other
- The response variable comes from the specified distribution
How can we tell if our fitted GLM is consistent with these assumptions, and fits the data at hand adequately? We can employ the methods enumerated below.
Simulate Some Data
First, I’ll simulate a negative binomial distribution for our response variable, and a few different types of independent variables. Note that I specify a quadratic relationship. I then fit two models: \(M_0\) does not include the quadratic term, while \(M_1\) does. I will compare their residuals later.
n <- 1000
set.seed(1806)
X <- matrix(0, nrow=n, ncol=3)
X[, 1] <- rnorm(n, mean=0, sd=.5) # normally distributed IV
X[, 2] <- sample(0:1, size=n, replace=TRUE, prob=c(0.5, 0.5)) # dichotomous IV (balanced)
X[, 3] <- rpois(n, lambda=.5) # Poisson distributed
beta_1 <- log(2.5); beta_2 <- log(4); beta_3 <- -log(1.4); beta_4 <- log(1.76)
eta <- beta_1*X[,1] + beta_2*X[,2] + beta_3*X[,3] + beta_4*(X[,1]^2)
g <- exp(eta)
y <- rnbinom(n, size=g, prob=0.20)
df <- data.frame(y=y, x=X)
Graphically, our simulated dataset looks like:
Models
Without quadratic term:
##
## Call:
## glm.nb(formula = y ~ x.1 + x.2 + x.3, data = df, init.theta = 2.261447404,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9301 -0.9969 -0.3005 0.4106 3.4936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.45237 0.04158 34.927 <2e-16 ***
## x.1 0.98175 0.04959 19.798 <2e-16 ***
## x.2 1.47603 0.05000 29.522 <2e-16 ***
## x.3 -0.32741 0.03437 -9.527 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.2614) family taken to be 1)
##
## Null deviance: 2831.3 on 999 degrees of freedom
## Residual deviance: 1212.2 on 996 degrees of freedom
## AIC: 6084
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.261
## Std. Err.: 0.150
##
## 2 x log-likelihood: -6073.960
With the quadratic term:
##
## Call:
## glm.nb(formula = y ~ poly(x.1, 2) + x.2 + x.3, data = df, init.theta = 2.648412977,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0113 -0.9514 -0.2772 0.4651 3.4977
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.42981 0.03969 36.027 <2e-16 ***
## poly(x.1, 2)1 14.72014 0.73065 20.147 <2e-16 ***
## poly(x.1, 2)2 6.30277 0.70962 8.882 <2e-16 ***
## x.2 1.46824 0.04742 30.964 <2e-16 ***
## x.3 -0.32697 0.03254 -10.049 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.6484) family taken to be 1)
##
## Null deviance: 3162.5 on 999 degrees of freedom
## Residual deviance: 1233.1 on 995 degrees of freedom
## AIC: 5999.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.648
## Std. Err.: 0.187
##
## 2 x log-likelihood: -5987.510
Residuals
Standard ‘raw’ residuals aren’t used in GLM modeling because the variance of the response changes with the mean, so they are not on a common scale. But for the sake of completeness, they are defined as:
\(r_i = y_i - \hat{y}_i\)
Below, the black line shows the residuals of \(M_0\) (without the quadratic term), while the red line shows those of the correct model \(M_1\):
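A sketch of how such a comparison can be drawn (resid() with type='response' returns the raw residuals):
plot(density(resid(m0, type='response')))
lines(density(resid(m1, type='response')), col='red')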
Pearson residuals
A better alternative is the Pearson residual, \(r^P_i = (y_i - \hat{\mu}_i)/\sqrt{V(\hat{\mu}_i)}\), which scales the raw residual by the square root of the variance function. These also come in a standardized variant, which additionally divides by \(\sqrt{1 - h_i}\) (where \(h_i\) is the leverage), useful to ensure the residuals have a constant variance.
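A density comparison of the (unstandardized) Pearson residuals can be drawn the same way; a minimal sketch:
plot(density(residuals(m0, type='pearson')))
lines(density(residuals(m1, type='pearson')), col='red')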
Quite a bit more contained, though with a very long tail.
With standardization:
plot(density(rstandard(m0, type='pearson')))
lines(density(rstandard(m1, type='pearson')), col='red')
In the case at hand, the standardized residuals do not differ much from the unstandardized.
Deviance residuals
Deviance residuals are another class of residuals, defined as \(r^D_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\), where \(d_i\) is the \(i\)-th unit deviance. In fact, the course textbook indicates they are usually preferable to Pearson residuals. Deviance residuals can also be standardized.
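A sketch of the density comparison discussed below, using residuals() with type='deviance':
plot(density(residuals(m0, type='deviance')))
lines(density(residuals(m1, type='deviance')), col='red')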
Interestingly, these residuals have a pattern not present in the plots of the other classes of residuals. There is a ‘hump’ around -2. Were we doing this analysis for real, that should prompt an investigation.
Standardized deviance residuals:
plot(density(rstandard(m0, type='deviance')))
lines(density(rstandard(m1, type='deviance')), col='red')
As before, there is not much of a difference between standardized and unstandardized residuals.
Quantile residuals
Dunn and Smyth (2018) introduce quantile residuals for discrete response variables. Their primary benefit is that they do not show the artificial patterns that arise from the discreteness of the response. They are available in R via statmod::qresid().
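A minimal sketch of the comparison, using the same black/red convention as above (statmod must be loaded for qresid()):
library(statmod)
plot(density(qresid(m0)))
lines(density(qresid(m1)), col='red')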
In this case, the range of the residuals is around the same as the deviance residuals. The distribution, however, is a bit smoother and more closely approximates a normal distribution.
Diagnostics
Spot check densities of \(y\) and \(\hat{y}\)
The fitted values from a well-fitting model should have a distribution similar to that of \(y\). Below, we see the distributions are about the same for each model.
par(mfrow=c(1,2))
plot(density(df$y), xlim=c(0, 160), ylim=c(0, .08), main='M_0 y_hat')
lines(density(predict(m0, type='response')), col='red')
plot(density(df$y), xlim=c(0, 160), ylim=c(0, .08), main='M_1 y_hat')
lines(density(predict(m1, type='response')), col='red')
Checking independence of observations
Independence is especially important when dealing with data collected across time or space. A simple check is to plot the residuals in observation order; ideally there will be no pattern.
par(mfrow=c(1,2))
scatter.smooth(1:1000, rstandard(m0, type='deviance'), col='gray')
scatter.smooth(1:1000, rstandard(m1, type='deviance'), col='gray')
Neither plot shows a pattern, consistent with the observations being independent.
Plot residuals against fitted values \(\hat{y_i}\)
In plots of this type, a well-fit model will display no pattern. We see on the left that the residuals from \(M_0\) (without the quadratic term) increase with \(\hat{y}_i\). This is directly attributable to the missing term. The correct model \(M_1\) shows no pattern.
par(mfrow=c(1,2))
scatter.smooth(predict(m0, type='response'), rstandard(m0, type='deviance'), col='gray')
scatter.smooth(predict(m1, type='response'), rstandard(m1, type='deviance'), col='gray')
Dunn and Smyth (2018, pp. 306–7) suggest that trends in these plots can be easier to detect by applying a variance-stabilizing transformation to \(\hat{y}\). For Poisson (and presumably negative binomial) responses they recommend the square root. Let’s see if it helps at all:
par(mfrow=c(1,2))
scatter.smooth(sqrt(predict(m0, type='response')), qresid(m0), col='gray')
scatter.smooth(sqrt(predict(m1, type='response')), qresid(m1), col='gray')
It is easy to see this transformation spreads the points out along the horizontal axis. The present dataset is clean enough that it’s not strictly necessary, but it is a good tool to keep in the toolbox, especially in conjunction with quantile residuals.
Quantile residuals
Quantile residuals show us the same information as the deviance residuals above.
par(mfrow=c(1,2))
scatter.smooth(predict(m0, type='response'), qresid(m0), col='gray')
scatter.smooth(predict(m1, type='response'), qresid(m1), col='gray')
Plot residuals against predictors
Here, we see a problem with \(x_1\) in \(M_0\). The quadratic pattern appears because of the poor functional form:
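A sketch of how these plots can be drawn, again using standardized deviance residuals (\(M_0\) on the left, \(M_1\) on the right):
par(mfrow=c(1,2))
scatter.smooth(df$x.1, rstandard(m0, type='deviance'), col='gray')
scatter.smooth(df$x.1, rstandard(m1, type='deviance'), col='gray')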
The second variable is fine:
The third variable should also be fine in both models, although we see a slight downward trend. Strange.
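The corresponding plots for the other predictors can be produced the same way; for example, for \(M_0\):
par(mfrow=c(1,2))
scatter.smooth(df$x.2, rstandard(m0, type='deviance'), col='gray')
scatter.smooth(df$x.3, rstandard(m0, type='deviance'), col='gray')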
Quantile residuals
Since our response variable \(y\) is discrete, let’s take a look at quantile residuals, but only for \(x_1\), where we saw the problematic patterns.
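A sketch, plotting qresid() against \(x_1\) for both models:
par(mfrow=c(1,2))
scatter.smooth(df$x.1, qresid(m0), col='gray')
scatter.smooth(df$x.1, qresid(m1), col='gray')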
Compared to above, the graph is cleaner—no annoying patterns, but the essential information still comes through.
Checking the link function
Plot the working residuals against the linear predictor. If no pattern is discernible, the chosen link function is likely adequate.
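A sketch of this check, plotting the working residuals against the linear predictor \(\hat{\eta}\):
par(mfrow=c(1,2))
scatter.smooth(predict(m0, type='link'), resid(m0, type='working'), col='gray')
scatter.smooth(predict(m1, type='link'), resid(m1, type='working'), col='gray')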
Interestingly, both models appear to show a slight logarithmic relationship.
Checking if choice of distribution for \(y\) is appropriate
Use Q-Q plots of the quantile residuals to determine whether our chosen distribution, the negative binomial, makes sense.
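A sketch, assuming qqnorm() and qqline() applied to the quantile residuals:
par(mfrow=c(1,2))
qqnorm(qresid(m0)); qqline(qresid(m0))
qqnorm(qresid(m1)); qqline(qresid(m1))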
These look good. There are outliers in both models, but it is worse for \(M_0\) as it fails to account for the quadratic nature of \(x_1\).
Outliers and influential observations
R provides a convenient function to plot Cook’s distances, which help us find observations with high leverage:
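Presumably this is the Cook’s distance panel of plot(); a minimal sketch:
par(mfrow=c(1,2))
plot(m0, which=4)
plot(m1, which=4)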
The scale of the \(y\)-axis is much lower in \(M_1\) (on the right) than \(M_0\).
We can identify the specific points with high leverage using a benchmark of twice the mean Cook’s distance:
cooksd_m0 <- cooks.distance(m0)
cooksd_m1 <- cooks.distance(m1)
length(cooksd_m0[cooksd_m0 > mean(cooksd_m0) * 2])
## [1] 65
length(cooksd_m1[cooksd_m1 > mean(cooksd_m1) * 2])
## [1] 96
Interestingly, by this (relative) benchmark the correct model \(M_1\) flags more high-leverage points than \(M_0\).
Let’s examine an observation with high leverage in both models:
## y x.1 x.2 x.3
## 73 80 0.7473986 1 0
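One way to pull out observations flagged in both models (a sketch; not necessarily how the row above was selected):
flagged_both <- which(cooksd_m0 > 2 * mean(cooksd_m0) & cooksd_m1 > 2 * mean(cooksd_m1))
df[head(flagged_both, 1), ]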
Influence
We can also look for influential observations; neither model has any. The simulated data is clean enough to avoid them.
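One common rule of thumb (a sketch, not necessarily the check used to produce the output below) is to flag observations with a Cook’s distance above 1:
which(cooks.distance(m0) > 1)
which(cooks.distance(m1) > 1)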
## named integer(0)
## named integer(0)
References
- Dunn, Peter K. and Gordon K. Smyth. 2018. Generalized Linear Models With Examples in R (Springer).