Residual analysis in linear regression (BIO213 version)

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
library(knitr)
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
    echo = TRUE, fig.width = 7, fig.height = 7)
options(width = 116, scipen = 10)

setwd("~/Documents/HSPH/useR.at.HSPH/")
library(car)

References

Fox J, Weisberg S. An R Companion to Applied Regression. Sage.
Kleinbaum DG, Kupper LL, Nizam A, Muller KE. Applied Regression Analysis and Other Multivariable Methods. Cengage Learning.

Residual analysis

Residual analysis is performed to check the validity of the model assumptions and to detect outliers and influential points.

Assumptions in linear regression

The assumptions checked below are independence, normality, homoscedasticity (equal variance), and linearity. If the normality and/or homoscedasticity assumption is violated, consider transformation of the outcome variable. If the linearity assumption is violated, consider transformation of the predictor variable (see the sketch below).
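
As a sketch of what such transformations can look like in practice (this uses the lbw data and the lm.full model fitted below, and is not part of the original handout):

## Outcome transformation (for normality/homoscedasticity problems), e.g. log:
lm.log <- lm(log(bwt) ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
## car can also suggest a Box-Cox power transformation for the outcome
summary(car::powerTransform(lm.full))
## Predictor transformation (for linearity problems), e.g. replace lwt with log(lwt):
lm.loglwt <- update(lm.full, . ~ . - lwt + log(lwt))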

Examples using R's default functions and the car package are demonstrated here.

The most useful functions are: plot() on an lm object (base R); qqPlot(), residualPlots(), avPlots(), crPlots(), spreadLevelPlot(), ncvTest(), influenceIndexPlot(), and influencePlot() from car; and rstandard(), rstudent(), hatvalues(), dfbeta(), dfbetas(), and dffits() from base R.

Load data

library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")

## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 need a fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)

## Recoding
lbw <- within(lbw, {
    ## race relabeling
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## ftv (number of physician visits during the first trimester) relabeling
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## ptl: dichotomize the number of previous premature labors
    preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("=0",">=1"))

    ## Categorize smoke ht ui
    smoke    <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht       <- factor(ht, levels = 0:1, labels = c("No","Yes"))
    ui       <- factor(ui, levels = 0:1, labels = c("No","Yes"))
})

Create a linear model

## Full
lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
## summary
summary(lm.full)

Call:
lm(formula = bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + 
    preterm, data = lbw)

Residuals:
    Min      1Q  Median      3Q     Max 
-1896.7  -443.3    53.2   466.1  1654.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2947.32     320.48    9.20  < 2e-16 ***
age              -2.91       9.67   -0.30  0.76354    
lwt               4.22       1.72    2.46  0.01488 *  
smokeYes       -307.34     109.13   -2.82  0.00541 ** 
htYes          -568.63     200.88   -2.83  0.00518 ** 
uiYes          -494.11     137.23   -3.60  0.00041 ***
ftv.catNone     -56.50     105.36   -0.54  0.59245    
ftv.catMany    -185.86     203.19   -0.91  0.36158    
race.catBlack  -467.30     149.78   -3.12  0.00211 ** 
race.catOther  -322.81     117.40   -2.75  0.00658 ** 
preterm>=1     -207.91     136.35   -1.52  0.12907    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 647 on 178 degrees of freedom
Multiple R-squared: 0.255,  Adjusted R-squared: 0.213 
F-statistic: 6.08 on 10 and 178 DF,  p-value: 0.0000000609 

Independence

Independence cannot be assessed from graphs; think about the study design (e.g., clustered or repeated measurements).
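
If the observations have a natural ordering (e.g., collected over time), car's Durbin-Watson test can check for serially correlated errors. A supplementary sketch, not in the original handout, and only meaningful when the row order matters:

## Durbin-Watson test for autocorrelation of the residuals
car::durbinWatsonTest(lm.full)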

Normality

Simple: Normality of the outcome (\( Y_i \))

car::qqPlot(lbw$bwt)

(Figure: normal QQ plot of bwt)

Intermediate: Normality of the residual (\( r_i \))

layout(matrix(1:2, ncol = 2))
plot(lm.full, which = 2)
car::qqPlot(residuals(lm.full), id.n = 5)

(Figure: normal QQ plots of the residuals)

132 130 136  94 155 
  1 189   2 188   3 
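
As a supplement to the QQ plots (not in the original handout), base R's Shapiro-Wilk test provides a formal test of residual normality:

## Shapiro-Wilk test; small p-values suggest non-normality,
## though the QQ plot is usually more informative
shapiro.test(residuals(lm.full))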

Homoscedasticity

Plot the residual \( r_i \) versus the predicted outcome \( \hat{Y}_i \)

layout(matrix(1:2, ncol = 2))
plot(lm.full, which = 1)
## Raw residual
car::residualPlots(lm.full, terms = ~ 1, fitted = TRUE, id.n = 5, smoother = loessLine)

(Figure: residuals vs fitted values)

           Test stat Pr(>|t|)
Tukey test    -0.611    0.541
## Kleinbaum's jackknifed residuals (R calls these studentized residuals)
car::residualPlots(lm.full, terms = ~ 1, id.n = 5, type = "rstudent", quadratic = F, smoother = loessLine)

(Figure: studentized residuals vs fitted values)

           Test stat Pr(>|t|)
Tukey test    -0.611    0.541

The spread-level plot conveys essentially the same information. The non-constant variance score test (ncvTest) formally tests for heteroscedasticity.

car::spreadLevelPlot(lm.full, id.n = 5)

(Figure: spread-level plot)


Suggested power transformation:  0.7456 
car::ncvTest(lm.full)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.004127    Df = 1     p = 0.9488 
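
As a supplement, the Breusch-Pagan test from the lmtest package (an extra package, not used in the original handout) tests the same null hypothesis of constant variance:

library(lmtest)
## Breusch-Pagan test; the null hypothesis is homoscedasticity
bptest(lm.full)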

Linearity of continuous covariates

Simple: Plot the outcome \( Y_i \) versus the predictor \( X_i \)

layout(matrix(1:2, ncol = 2))
plot(bwt ~ age, lbw)
plot(bwt ~ lwt, lbw)

(Figure: scatter plots of bwt vs age and bwt vs lwt)

Intermediate: Plot the residual \( r_i \) versus the predictor \( X_i \)

car::residualPlots(lm.full, terms = ~ age + lwt, fitted = FALSE, id.n = 5, smoother = loessLine)

(Figure: residuals vs age and residuals vs lwt)

    Test stat Pr(>|t|)
age     2.338    0.021
lwt     0.307    0.759

Hard: Partial regression plot (Added-variable plot)

What Dr. Orav calls a “partial regression residual plot” is this one. Check whether the red line fits the data points well and whether there is any nonlinear pattern.

car::avPlots(lm.full, terms = ~ age + lwt, id.n = 5)

(Figure: added-variable plots for age and lwt)

Partial residual plot (Component+Residual plot)

This is similar but differs in that the X-axis is the raw predictor. The red line is the partial fit, assuming linearity in the partial relationship between \( y \) and \( x_j \). The green line is a lowess smooth; if it is very different from the red line, it suggests that the predictor needs transformation.

car::crPlots(lm.full, terms = ~ age + lwt, id.n = 5)

(Figure: component+residual plots for age and lwt)

Outliers and influential points

Simple: Scatter plots for continuous covariates

layout(matrix(1:2, ncol = 2))
plot(bwt ~ age, lbw)
plot(bwt ~ lwt, lbw)

(Figure: scatter plots of bwt vs age and bwt vs lwt)

Intermediate: Standardized residuals (Kleinbaum definition)

\( z_{i} = \frac {\hat{E}_i} {S} \)

where \( \hat{E}_i \) is the raw residual and \( S \) is the square root of the mean squared error (MSE).

Absolute values larger than 2 are considered large residuals.

There is probably no built-in function that calculates this in R, but it is easy to compute by hand, as sketched below.
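
A minimal sketch of computing them by hand; summary(lm.full)$sigma is \( S \), the root MSE:

## Kleinbaum's standardized residuals: raw residuals divided by the root MSE
z <- residuals(lm.full) / summary(lm.full)$sigma
## Observations with large residuals (|z| > 2)
which(abs(z) > 2)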

Hard: Studentized residuals (Kleinbaum definition)

\( r_{i} = \frac {\hat{E}_i} {S \sqrt{1 - h_i}} \)

where \( h_i \) is the leverage of the \( i \)th observation.

These are standardized residuals that incorporate leverage (hat-values). Absolute values larger than 2 are considered large residuals.

R uses different terminology (see Fox & Weisberg): Kleinbaum's studentized residual is R's standardized residual, computed by rstandard().

plot(rstandard(lm.full))

(Figure: standardized residuals by observation index)

car::residualPlots(lm.full, terms = ~ 1, id.n = 5, type = "rstandard", quadratic = F)

(Figure: standardized residuals vs fitted values)

           Test stat Pr(>|t|)
Tukey test    -0.611    0.541

Leverage (hat-values)

Leverage values (hat-values) are considered high if greater than 2 * (number of predictors + 1) / (number of data points). Here there are 10 predictors and 189 data points, so the cutoff is 2 * 11 / 189 = 0.1164021. These values indicate how far each data point lies from the center of the regressor space.
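
A minimal sketch (not in the original handout) of computing this cutoff and flagging high-leverage cases directly:

## Leverage cutoff: 2 * (number of coefficients) / (number of observations)
h <- hatvalues(lm.full)
cutoff <- 2 * length(coef(lm.full)) / nrow(lm.full$model)  ## = 0.1164021 here
which(h > cutoff)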

The influencePlot() function (shown later) is more useful for hat-values, as it combines them with jackknifed residuals and Cook's distances and adds the cutoff value automatically. With influenceIndexPlot(), the cutoff line has to be added manually:

car::influenceIndexPlot(lm.full, vars = "hat", id.n = 20)
abline(h = 2 * length(coef(lm.full)) / nrow(lm.full$model))

(Figure: index plot of hat-values)

Hard: Jackknifed residuals (Kleinbaum definition)

\( r_{(-i)} = \frac {\hat{E}_i} {S_{(-i)} \sqrt{1 - h_i}} \)

where \( S_{(-i)} \) is the square root of the MSE calculated excluding observation \( i \).

R uses different terminology (see Fox & Weisberg): Kleinbaum's jackknifed residual is R's studentized residual, computed by rstudent().

car::residualPlots(lm.full, terms = ~ 1, id.n = 5, type = "rstudent", quadratic = F)

(Figure: studentized residuals vs fitted values)

           Test stat Pr(>|t|)
Tukey test    -0.611    0.541
car::influenceIndexPlot(lm.full, vars = "Studentized", id.n = 5)

(Figure: index plot of studentized residuals)
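
As a supplement (not in the original handout), car's outlierTest() formalizes this by testing the largest absolute studentized residual with a Bonferroni correction:

## Bonferroni-adjusted test of the most extreme studentized residual
car::outlierTest(lm.full)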

Cook's distance

\( d_i = \frac{(\hat{\beta} - \hat{\beta}_{(-i)})^{\top} X^{\top} X (\hat{\beta} - \hat{\beta}_{(-i)})}{(k+1) S^2} \)

Cook's distance measures the extent to which the estimates of the regression coefficients \( \hat{\beta}_j \), \( j = 0, \ldots, k \), change when observation \( i \) is deleted from the analysis.
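
Base R computes Cook's distances directly. A sketch, using the common 4/n rule of thumb as a flag (a convention, not from the original handout):

d <- cooks.distance(lm.full)
## Largest Cook's distances
head(sort(d, decreasing = TRUE))
## Flag observations exceeding 4/n
which(d > 4 / nrow(lm.full$model))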

car::influenceIndexPlot(lm.full, vars = "Cook", id.n = 5)

(Figure: index plot of Cook's distances)

Combination of jackknifed residuals (Kleinbaum's definition), hat-values (leverage), and Cook's distance

Jackknifed residuals (Kleinbaum definition) are on the Y-axis, hat-values (leverage) on the X-axis, and Cook's distances determine the circle sizes.

An observation that is both outlying and high-leverage exerts influence on the regression coefficients and thus has a large Cook's distance.

car::influencePlot(lm.full, id.n = 5)
    StudRes     Hat   CookD
22   0.3970 0.16159 0.05267
68   0.3463 0.22651 0.05665
93   1.1371 0.15114 0.14455
94   2.3185 0.07208 0.19248
106  1.6674 0.14865 0.20903
130  2.7618 0.11042 0.28806
131 -2.0750 0.08114 0.18422
132 -3.1015 0.06288 0.23657
133 -1.8795 0.16491 0.25006
136 -2.4445 0.03153 0.13116
189  0.2073 0.18064 0.02943
title("Y: jackknifed residuals, X: leverage, Size: Cook D")

(Figure: influence plot)

dfbeta

This is a non-summarized version of Cook's distance, i.e., the changes in individual coefficients when each observation is deleted.

\( \mathrm{dfbeta}_{ij} = b_{(-i)j} - b_{j} \) for \( j = 0, \ldots, k \)

where \( b_j \) is the coefficient computed using all the data and \( b_{(-i)j} \) is the same coefficient computed with case \( i \) omitted. \( \mathrm{dfbetas}_{ij} \) is the standardized version of \( \mathrm{dfbeta}_{ij} \).

## dfbeta for first 10 observations
dfbeta(lm.full)[1:10,]
   (Intercept)      age       lwt smokeYes    htYes  uiYes ftv.catNone ftv.catMany race.catBlack race.catOther
1       4.1285  0.06162 -0.040169   1.5818  1.77997 -5.449     -1.4157    -0.33133       -4.1889        0.6775
2      31.5126 -0.89067 -0.063748   1.5680  5.39399  0.378     -1.3863   -41.55109       -0.9061       -9.4662
3     -25.0841  0.41187  0.066758  -7.6297 -0.08118  3.551      8.3429     7.41421        3.1864        1.9774
4       1.6985 -0.02605 -0.004506   0.8550  0.33112  2.164     -1.0242    -0.77335       -0.3296       -0.2545
5       2.9867 -0.07558 -0.009088   1.0008  0.21583  4.265      0.8387     0.17631       -1.1392       -1.4148
6      -0.7135  0.06693 -0.009364   2.2187  3.42553  3.133     -3.7675    -0.06969        0.8473       -4.8949
7     -40.1601  0.52170  0.074151   9.1164  0.56609  4.105      7.9833     6.71370        8.9521       11.3928
8     -21.6675  0.53326  0.031435   0.3232  0.53376  3.087      7.4300     5.36974        0.3177       -5.9116
9       1.8222 -0.43739  0.034003  -7.6687  0.66096  2.044      5.5879     6.94256        0.8800       -0.0994
10      0.6666 -0.30998  0.047935  -3.7820  1.30704  2.401     -4.1182     0.29882        1.6926        2.7312
   preterm>=1
1      0.8375
2      3.9622
3      4.5351
4     -0.9888
5     -1.6422
6      1.4513
7      1.4624
8      1.9072
9      5.3374
10     3.2578
## dfbetas (standardized) for first 10 observations
dfbetas(lm.full)[1:10,]
   (Intercept)       age       lwt  smokeYes      htYes     uiYes ftv.catNone ftv.catMany race.catBlack
1     0.012848  0.006354 -0.023329  0.014455  0.0088370 -0.039604   -0.013401  -0.0016262     -0.027892
2     0.098203 -0.091959 -0.037075  0.014349  0.0268171  0.002751   -0.013141  -0.2042290     -0.006042
3    -0.078171  0.042525  0.038826 -0.069822 -0.0004036  0.025841    0.079083   0.0364423      0.021247
4     0.005285 -0.002685 -0.002617  0.007813  0.0016438  0.015726   -0.009694  -0.0037954     -0.002195
5     0.009294 -0.007792 -0.005278  0.009145  0.0010715  0.030996    0.007939   0.0008654     -0.007586
6    -0.002223  0.006908 -0.005444  0.020296  0.0170243  0.022795   -0.035699  -0.0003424      0.005648
7    -0.125442  0.053990  0.043225  0.083619  0.0028210  0.029943    0.075849   0.0330754      0.059831
8    -0.067486  0.055028  0.018272  0.002956  0.0026522  0.022458    0.070391   0.0263787      0.002117
9     0.005677 -0.045144  0.019769 -0.070154  0.0032850  0.014869    0.052950   0.0341122      0.005866
10    0.002076 -0.031980  0.027857 -0.034583  0.0064932  0.017459   -0.039006   0.0014676      0.011277
   race.catOther preterm>=1
1      0.0057553   0.006126
2     -0.0805302   0.029022
3      0.0168223   0.033219
4     -0.0021617  -0.007232
5     -0.0120184  -0.012011
6     -0.0416261   0.010627
7      0.0971458   0.010737
8     -0.0502633   0.013962
9     -0.0008454   0.039082
10     0.0232173   0.023845
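
A sketch flagging cases whose dfbetas exceed the common 2 / sqrt(n) cutoff (a rule-of-thumb convention, not from the original handout):

## Flag cases where any coefficient's standardized change exceeds 2/sqrt(n)
big <- abs(dfbetas(lm.full)) > 2 / sqrt(nrow(lm.full$model))
which(apply(big, 1, any))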

dffits

dffits measures the standardized change in the predicted value \( \hat{Y}_i \) when observation \( i \) is deleted.

dffits(lm.full)[1:10]
       1        2        3        4        5        6        7        8        9       10 
-0.06420 -0.26596 -0.14219  0.02232  0.04353 -0.09531 -0.19103 -0.11884 -0.12071 -0.08937
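
A sketch applying the common rule-of-thumb cutoff of 2 * sqrt((k + 1) / n) for |dffits| (a convention, not from the original handout):

dff <- dffits(lm.full)
## Rule-of-thumb cutoff: 2 * sqrt((k + 1) / n)
cutoff <- 2 * sqrt(length(coef(lm.full)) / nrow(lm.full$model))
which(abs(dff) > cutoff)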