Residual analysis in linear regression (abbreviated BIO213 version)

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
library(knitr)
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE, 
    echo = TRUE, fig.width = 7, fig.height = 7)
options(width = 116, scipen = 10)

setwd("~/Documents/HSPH/useR.at.HSPH/")
library(car)

References

For fuller examples, see http://rpubs.com/kaz_yos/residuals-bio213

Residual analysis

Residual analysis is performed to check the validity of the model assumptions and to detect outliers and influential points.

Assumptions in linear regression

The key assumptions are independence of observations, normality of the residuals, homoscedasticity (constant residual variance), and linearity in the continuous covariates; each is examined below. If the normality and/or homoscedasticity assumptions are violated, consider transforming the outcome variable. If the linearity assumption is violated, consider transforming the predictor variable.
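
For example, a skewed outcome is often log-transformed, and curvature in a continuous predictor can be absorbed by a polynomial or log term. A minimal sketch using the lbw data loaded below (the specific transformations are illustrative, not part of the original analysis):

## Outcome transformation for non-normality/heteroscedasticity (illustrative)
lm(log(bwt) ~ age + lwt, data = lbw)
## Predictor transformation (quadratic term) for non-linearity (illustrative)
lm(bwt ~ age + lwt + I(lwt^2), data = lbw)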

Examples using base R functions and the car package are demonstrated here.

The most useful functions demonstrated here are car::qqPlot(), car::residualPlots(), car::avPlots(), and car::influencePlot().

Load data

library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")

## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 need fixing to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)

## Recoding
lbw <- within(lbw, {
    ## race relabeling
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## ftv (number of physician visits during the first trimester): categorize
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## ptl (number of previous premature labors): dichotomize
    preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("=0",">=1"))
})
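
A quick check that the recoding behaved as intended (a sketch, not in the original):

## Cross-tabulate the original and recoded variables
with(lbw, table(race, race.cat))
with(lbw, table(ftv, ftv.cat))
with(lbw, table(ptl, preterm))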

Create a linear model

## Full
linear.model.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
## summary
summary(linear.model.full)

Call:
lm(formula = bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + 
    preterm, data = lbw)

Residuals:
    Min      1Q  Median      3Q     Max 
-1896.7  -443.3    53.2   466.1  1654.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2947.32     320.48    9.20  < 2e-16 ***
age              -2.91       9.67   -0.30  0.76354    
lwt               4.22       1.72    2.46  0.01488 *  
smoke          -307.34     109.13   -2.82  0.00541 ** 
ht             -568.63     200.88   -2.83  0.00518 ** 
ui             -494.11     137.23   -3.60  0.00041 ***
ftv.catNone     -56.50     105.36   -0.54  0.59245    
ftv.catMany    -185.86     203.19   -0.91  0.36158    
race.catBlack  -467.30     149.78   -3.12  0.00211 ** 
race.catOther  -322.81     117.40   -2.75  0.00658 ** 
preterm>=1     -207.91     136.35   -1.52  0.12907    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 647 on 178 degrees of freedom
Multiple R-squared: 0.255,  Adjusted R-squared: 0.213 
F-statistic: 6.08 on 10 and 178 DF,  p-value: 0.0000000609 

Independence

Independence is not interpretable from graphs; think about the study design (e.g., clustered sampling or repeated measurements would violate it).
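
If the observations have a natural ordering (e.g., collected over time), serial correlation can also be tested numerically; a sketch using the car package (not part of the original analysis):

car::durbinWatsonTest(linear.model.full)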

Normality

Intermediate: Normality of the residuals (\( r_i \))

car::qqPlot(residuals(linear.model.full), id.n = 5)

(Figure: normal QQ plot of the residuals, with the five most extreme observations labeled)

132 130 136  94 155 
  1 189   2 188   3 
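
A base-R equivalent without automatic point identification (a sketch):

qqnorm(residuals(linear.model.full))
qqline(residuals(linear.model.full))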

Homoscedasticity

Plot the residuals \( r_i \) versus the predicted outcomes \( \hat{Y}_i \); under homoscedasticity the vertical spread should be roughly constant across the range of fitted values.

## Raw residuals versus fitted values
car::residualPlots(linear.model.full, terms = ~ 1, fitted = TRUE, id.n = 5, smoother = loessLine)

(Figure: residuals versus fitted values with a loess smoother)

           Test stat Pr(>|t|)
Tukey test    -0.611    0.541

The non-significant Tukey test (p = 0.54) gives no evidence of curvature in the residuals-versus-fitted plot.
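
A base-R version of the same plot, plus the car package's score test for non-constant error variance (the test is an addition, not in the original):

plot(fitted(linear.model.full), residuals(linear.model.full),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
car::ncvTest(linear.model.full)  ## Breusch-Pagan-type score test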

Linearity of continuous covariates

Hard: Partial regression plot (Added-variable plot)

What Dr. Orav calls a “partial regression residual plot” is this one. See whether the red line fits the data points well and whether there is any non-linear pattern.

car::avPlots(linear.model.full, terms = ~ age + lwt, id.n = 5)

(Figure: added-variable plots for age and lwt)
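
What avPlots() draws for age can be reproduced by hand, which clarifies the construction: regress the outcome and the covariate of interest on all remaining predictors, then plot the two sets of residuals against each other (a sketch; the slope of the fitted line equals the age coefficient in the full model):

## Added-variable plot for age, by hand
r.y <- residuals(lm(bwt ~ lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw))
r.x <- residuals(lm(age ~ lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw))
plot(r.x, r.y, xlab = "age | others", ylab = "bwt | others")
abline(lm(r.y ~ r.x))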

Outliers and influential points

Examine a combination of jackknifed residuals (Kleinbaum's definition), hat-values (leverage), and Cook's distances. The influence plot displays:

- Y-axis: jackknifed residuals (Kleinbaum's definition)
- X-axis: hat-values (leverage)
- Point size: Cook's distance

An observation that is both outlying and high-leverage exerts influence on the regression coefficients, and therefore has a large Cook's distance.

car::influencePlot(linear.model.full, id.n = 5)
    StudRes     Hat   CookD
22   0.3970 0.16159 0.05267
68   0.3463 0.22651 0.05665
93   1.1371 0.15114 0.14455
94   2.3185 0.07208 0.19248
106  1.6674 0.14865 0.20903
130  2.7618 0.11042 0.28806
131 -2.0750 0.08114 0.18422
132 -3.1015 0.06288 0.23657
133 -1.8795 0.16491 0.25006
136 -2.4445 0.03153 0.13116
189  0.2073 0.18064 0.02943
title("Y: jackknifed residuals, X: leverage, Size: Cook D")

(Figure: influence plot; Y: jackknifed residuals, X: leverage, size: Cook's distance)
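
The quantities shown by influencePlot() can also be computed directly in base R; a sketch with common rules of thumb for flagging observations (the cutoffs are conventional, not from the original):

stud <- rstudent(linear.model.full)      ## jackknifed (studentized deleted) residuals
lev  <- hatvalues(linear.model.full)     ## leverage
cd   <- cooks.distance(linear.model.full)
p <- length(coef(linear.model.full))
n <- nobs(linear.model.full)
## Flag observations with |studentized residual| > 2, leverage > 2p/n, or Cook's D > 4/n
which(abs(stud) > 2 | lev > 2 * p / n | cd > 4 / n)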