The last few blogs covered the theory and practice of logistic and Poisson regression, where the response variable is binomial or Poisson distributed. Linear regression can be generalized to many other kinds of non-normal responses, under the umbrella of generalized linear models.
Since I have already covered two of these models, I thought it would be a good idea to cover their residuals and diagnostics in more detail. That is the focus of this blog.
The Problem
To accurately capture the phenomena of interest, a GLM requires that:
- There are no outliers
- The link function is correct
- All important independent variables are included, and each enters the model linearly
- The correct variance function \(V(\mu)\) is used (GLMs posit that the variance is a function of the mean rather than constant)
- The dispersion parameter is constant
- The responses \(y_i\) are independent of each other
- The response variable comes from the specified distribution
How can we tell if our fitted GLM is consistent with these assumptions, and fits the data at hand adequately? We can employ the methods enumerated below.
Simulate Some Data
First, I’ll simulate a negative binomial distribution for our response variable, and a few different types of independent variables. Note that I specify a quadratic relationship. I then fit two models: \(M_0\) does not include the quadratic term, while \(M_1\) does. I will compare their residuals later.
n <- 1000
set.seed(1806)
X <- matrix(0, nrow=n, ncol=3)
X[, 1] <- rnorm(n, mean=0, sd=.5) # normally distributed IV
X[, 2] <- sample(0:1, size=n, replace=TRUE, prob=c(0.5, 0.5)) # dichotomous IV (balanced)
X[, 3] <- rpois(n, lambda=.5) # Poisson distributed
beta_1 <- log(2.5); beta_2 <- log(4); beta_3 <- -log(1.4); beta_4 <- log(1.76)
eta <- beta_1*X[,1] + beta_2*X[,2] + beta_3*X[,3] + beta_4*(X[,1]^2)
g <- exp(eta)
y <- rnbinom(n, size=g, prob=0.20)
df <- data.frame(y=y, x=X)
Graphically, our simulated dataset looks like:
Models
Without quadratic term:
##
## Call:
## glm.nb(formula = y ~ x.1 + x.2 + x.3, data = df, init.theta = 2.261447404,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9301 -0.9969 -0.3005 0.4106 3.4936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.45237 0.04158 34.927 <2e-16 ***
## x.1 0.98175 0.04959 19.798 <2e-16 ***
## x.2 1.47603 0.05000 29.522 <2e-16 ***
## x.3 -0.32741 0.03437 -9.527 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.2614) family taken to be 1)
##
## Null deviance: 2831.3 on 999 degrees of freedom
## Residual deviance: 1212.2 on 996 degrees of freedom
## AIC: 6084
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.261
## Std. Err.: 0.150
##
## 2 x log-likelihood: -6073.960
With the quadratic term:
##
## Call:
## glm.nb(formula = y ~ poly(x.1, 2) + x.2 + x.3, data = df, init.theta = 2.648412977,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0113 -0.9514 -0.2772 0.4651 3.4977
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.42981 0.03969 36.027 <2e-16 ***
## poly(x.1, 2)1 14.72014 0.73065 20.147 <2e-16 ***
## poly(x.1, 2)2 6.30277 0.70962 8.882 <2e-16 ***
## x.2 1.46824 0.04742 30.964 <2e-16 ***
## x.3 -0.32697 0.03254 -10.049 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.6484) family taken to be 1)
##
## Null deviance: 3162.5 on 999 degrees of freedom
## Residual deviance: 1233.1 on 995 degrees of freedom
## AIC: 5999.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.648
## Std. Err.: 0.187
##
## 2 x log-likelihood: -5987.510
Residuals
Standard ‘raw’ residuals aren’t used in GLM modeling because the variance of the response changes with the mean, so they are not on a common scale. But for the sake of completeness, they are defined as:
\(r_i = y_i - \hat{y}_i\)
Below, the black line shows the residuals of \(M_0\) (without the quadratic term), while the red line shows those of the correct model \(M_1\):
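A sketch of how such a comparison can be drawn (resid() with type='response' returns the raw residuals):
plot(density(resid(m0, type='response')))
lines(density(resid(m1, type='response')), col='red')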
Pearson residuals
A better alternative is the Pearson residual, \(r^P_i = (y_i - \hat{\mu}_i)/\sqrt{V(\hat{\mu}_i)}\), which scales the raw residual by the square root of the variance function. These also come in a standardized variant, which additionally divides by \(\sqrt{1 - h_i}\) (where \(h_i\) is the leverage), useful to ensure the residuals have a constant variance.
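A density comparison of the (unstandardized) Pearson residuals can be drawn the same way; a minimal sketch:
plot(density(residuals(m0, type='pearson')))
lines(density(residuals(m1, type='pearson')), col='red')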
Quite a bit more contained, though with a very long tail.
With standardization:
plot(density(rstandard(m0, type='pearson')))
lines(density(rstandard(m1, type='pearson')), col='red')
In the case at hand, the standardized residuals do not differ much from the unstandardized.
Deviance residuals
Deviance residuals are another class of residuals, defined as \(r^D_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\), where \(d_i\) is the \(i\)-th unit deviance. In fact, the course textbook indicates they are usually preferable to Pearson residuals. Deviance residuals can also be standardized.
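A sketch of the density comparison discussed below, using residuals() with type='deviance':
plot(density(residuals(m0, type='deviance')))
lines(density(residuals(m1, type='deviance')), col='red')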
Interestingly, these residuals have a pattern not present in the plots of the other classes of residuals. There is a ‘hump’ around -2. Were we doing this analysis for real, that should prompt an investigation.
Standardized deviance residuals:
plot(density(rstandard(m0, type='deviance')))
lines(density(rstandard(m1, type='deviance')), col='red')
As before, there is not much of a difference between standardized and unstandardized residuals.
Quantile residuals
Dunn and Smyth (2018) introduce quantile residuals for discrete response variables. Their primary benefit is that they do not show the artificial patterns that arise from the discreteness of the response. They are available in R via statmod::qresid().
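A minimal sketch of the comparison, using the same black/red convention as above (statmod must be loaded for qresid()):
library(statmod)
plot(density(qresid(m0)))
lines(density(qresid(m1)), col='red')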
In this case, the range of the residuals is around the same as the deviance residuals. The distribution, however, is a bit smoother and more closely approximates a normal distribution.
Diagnostics
Spot check densities of \(y\) and \(\hat{y}\)
The fitted values from a well-fitting model should have a distribution similar to that of \(y\). Below, we see the distributions are about the same for each model.
par(mfrow=c(1,2))
plot(density(df$y), xlim=c(0, 160), ylim=c(0, .08), main='M_0 y_hat')
lines(density(predict(m0, type='response')), col='red')
plot(density(df$y), xlim=c(0, 160), ylim=c(0, .08), main='M_1 y_hat')
lines(density(predict(m1, type='response')), col='red')
Checking independence of observations
Independence is especially important when dealing with data collected across time or space. A simple check is to plot the residuals in observation order; ideally there will be no pattern.
par(mfrow=c(1,2))
scatter.smooth(1:1000, rstandard(m0, type='deviance'), col='gray')
scatter.smooth(1:1000, rstandard(m1, type='deviance'), col='gray')
Neither plot shows a pattern, consistent with the observations being independent.
Plot residuals against fitted values \(\hat{y_i}\)
In plots of this type, a well-fit model will display no pattern. We see on the left that the residuals from \(M_0\) (without the quadratic term) increase with \(\hat{y}_i\). This is directly attributable to the missing term. The correct model \(M_1\) shows no pattern.
par(mfrow=c(1,2))
scatter.smooth(predict(m0, type='response'), rstandard(m0, type='deviance'), col='gray')
scatter.smooth(predict(m1, type='response'), rstandard(m1, type='deviance'), col='gray')
Dunn and Smyth (2018, pp. 306–7) suggest that trends in these plots can be easier to detect by applying a variance-stabilizing transformation to \(\hat{y}\). For Poisson (and presumably negative binomial) responses they recommend the square root. Let’s see if it helps at all:
par(mfrow=c(1,2))
scatter.smooth(sqrt(predict(m0, type='response')), qresid(m0), col='gray')
scatter.smooth(sqrt(predict(m1, type='response')), qresid(m1), col='gray')
It is easy to see this transformation spreads the points out along the horizontal axis. The present dataset is clean enough that it’s not strictly necessary, but it is a good tool to keep in the toolbox, especially in conjunction with quantile residuals.
Quantile residuals
Quantile residuals show us the same information as the deviance residuals above.
par(mfrow=c(1,2))
scatter.smooth(predict(m0, type='response'), qresid(m0), col='gray')
scatter.smooth(predict(m1, type='response'), qresid(m1), col='gray')
Plot residuals against predictors
Here, we see a problem with \(x_1\) in \(M_0\). The quadratic pattern appears because of the poor functional form:
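A sketch of how these plots can be drawn, again using standardized deviance residuals (\(M_0\) on the left, \(M_1\) on the right):
par(mfrow=c(1,2))
scatter.smooth(df$x.1, rstandard(m0, type='deviance'), col='gray')
scatter.smooth(df$x.1, rstandard(m1, type='deviance'), col='gray')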
The second variable is fine:
The third variable should also be fine in both models, although we see a slight downward trend. Strange.
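The corresponding plots for the other predictors can be produced the same way; for example, for \(M_0\):
par(mfrow=c(1,2))
scatter.smooth(df$x.2, rstandard(m0, type='deviance'), col='gray')
scatter.smooth(df$x.3, rstandard(m0, type='deviance'), col='gray')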
Quantile residuals
Since our response variable \(y\) is discrete, let’s take a look at quantile residuals, but only for \(x_1\), where we saw the problematic patterns.
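A sketch, plotting qresid() against \(x_1\) for both models:
par(mfrow=c(1,2))
scatter.smooth(df$x.1, qresid(m0), col='gray')
scatter.smooth(df$x.1, qresid(m1), col='gray')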
Compared to above, the graph is cleaner—no annoying patterns, but the essential information still comes through.
Checking the link function
Plot the working residuals against the linear predictor. If no pattern is discernible, the chosen link function is likely adequate.
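A sketch of this check, plotting the working residuals against the linear predictor \(\hat{\eta}\):
par(mfrow=c(1,2))
scatter.smooth(predict(m0, type='link'), resid(m0, type='working'), col='gray')
scatter.smooth(predict(m1, type='link'), resid(m1, type='working'), col='gray')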
Interestingly, both models appear to show a slight logarithmic relationship.
Checking if choice of distribution for \(y\) is appropriate
Use Q-Q plots of the quantile residuals to determine whether our chosen distribution, the negative binomial, makes sense.
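A sketch, assuming qqnorm() and qqline() applied to the quantile residuals:
par(mfrow=c(1,2))
qqnorm(qresid(m0)); qqline(qresid(m0))
qqnorm(qresid(m1)); qqline(qresid(m1))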
These look good. There are outliers in both models, but it is worse for \(M_0\) as it fails to account for the quadratic nature of \(x_1\).
Outliers and influential observations
R provides a convenient function to plot Cook’s distances, which help us find observations with high leverage:
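Presumably this is the Cook’s distance panel of plot(); a minimal sketch:
par(mfrow=c(1,2))
plot(m0, which=4)
plot(m1, which=4)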
The scale of the \(y\)-axis is much lower in \(M_1\) (on the right) than \(M_0\).
We can identify the specific points with high leverage using a benchmark of twice the mean Cook’s distance:
cooksd_m0 <- cooks.distance(m0)
cooksd_m1 <- cooks.distance(m1)
length(cooksd_m0[cooksd_m0 > mean(cooksd_m0) * 2])
## [1] 65
length(cooksd_m1[cooksd_m1 > mean(cooksd_m1) * 2])
## [1] 96
Interestingly, by this (relative) benchmark the correct model \(M_1\) flags more high-leverage points than \(M_0\).
Let’s examine an observation with high leverage in both models:
## y x.1 x.2 x.3
## 73 80 0.7473986 1 0
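One way to pull out observations flagged in both models (a sketch; not necessarily how the row above was selected):
flagged_both <- which(cooksd_m0 > 2 * mean(cooksd_m0) & cooksd_m1 > 2 * mean(cooksd_m1))
df[head(flagged_both, 1), ]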
Influence
We can also look for influential observations; neither model has any. The simulated data is clean enough to avoid them.
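One common rule of thumb (a sketch, not necessarily the check used to produce the output below) is to flag observations with a Cook’s distance above 1:
which(cooks.distance(m0) > 1)
which(cooks.distance(m1) > 1)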
## named integer(0)
## named integer(0)
References
- Dunn, Peter K. and Gordon K. Smyth. 2018. Generalized Linear Models With Examples in R (Springer).