Residual plots in linear regression

References

Prepare data

## Load dataset directly from the internet
lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", head = T, skip = 4)

## Change to lower case variable names
names(lbw) <- tolower(names(lbw))

## Recoding using within function
lbw <- within(lbw, {

    ## Relabel race: 1, 2, 3 -> White, Black, Other
    race <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Dichotomize ptl
    preterm  <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))
    ## You can alse use cut
    ## preterm  <- cut(ptl, breaks = c(-Inf, 0, Inf), labels = c("0","1+"))

    ## Change 0,1 binary to No,Yes binary
    ## smoke    <- factor(smoke, levels  = c(0,1), labels = c("No","Yes"))
    ## ht       <- factor(ht   , levels  = c(0,1), labels = c("No","Yes"))
    ## ui       <- factor(ui   , levels  = c(0,1), labels = c("No","Yes"))
    ## low      <- factor(low  , levels  = c(0,1), labels = c("No","Yes"))

    ## Categorize ftv (frequency of visit): 0   1   2   3   4   6 -> None(0), Normal(1-2), Many (>= 3)
    ftv.cat  <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))

    ## Make Normal the reference level
    ftv.cat  <- relevel(ftv.cat, ref = "Normal")
})

## Categorize smoke ht ui
lbw[,c("smoke","ht","ui","low")] <- lapply(lbw[,c("smoke","ht","ui")],
                                     function(var) {
                                         var <- factor(ifelse(var == 1, "Yes", "No"))
                                     })
## Summary
summary(lbw)
       id       low           age            lwt         race    smoke          ptl          ht        ui     
 Min.   :  4   No :115   Min.   :14.0   Min.   : 80   White:96   No :115   Min.   :0.000   No :177   No :161  
 1st Qu.: 68   Yes: 74   1st Qu.:19.0   1st Qu.:110   Black:26   Yes: 74   1st Qu.:0.000   Yes: 12   Yes: 28  
 Median :123             Median :23.0   Median :121   Other:67             Median :0.000                      
 Mean   :121             Mean   :23.2   Mean   :130                        Mean   :0.196                      
 3rd Qu.:176             3rd Qu.:26.0   3rd Qu.:140                        3rd Qu.:0.000                      
 Max.   :226             Max.   :45.0   Max.   :250                        Max.   :3.000                      
      ftv             bwt         ftv.cat    preterm 
 Min.   :0.000   Min.   : 709   Normal: 77   0 :159  
 1st Qu.:0.000   1st Qu.:2414   None  :100   1+: 30  
 Median :0.000   Median :2977   Many  : 12           
 Mean   :0.794   Mean   :2945                        
 3rd Qu.:1.000   3rd Qu.:3475                        
 Max.   :6.000   Max.   :4990                        

## Check the outcome variable
library(lattice)
densityplot(lbw$bwt)

plot of chunk unnamed-chunk-2

Fit a linear model

## No interaction. Squared terms for continuous variables.
lm.full <- lm(bwt ~ age + I(age^2) + lwt + I(lwt^2) + smoke + race + preterm,
              data = lbw)
summary(lm.full)

Call:
lm(formula = bwt ~ age + I(age^2) + lwt + I(lwt^2) + smoke + 
    race + preterm, data = lbw)

Residuals:
    Min      1Q  Median      3Q     Max 
-2259.2  -390.3    12.5   488.1  1423.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4166.2320  1100.9490    3.78  0.00021 ***
age         -146.1344    62.3205   -2.34  0.02012 *  
I(age^2)       2.9233     1.2229    2.39  0.01786 *  
lwt            9.2296    10.7900    0.86  0.39348    
I(lwt^2)      -0.0181     0.0350   -0.52  0.60549    
smokeYes    -326.4692   112.2103   -2.91  0.00408 ** 
raceBlack   -496.7971   155.0977   -3.20  0.00161 ** 
raceOther   -341.7900   120.2940   -2.84  0.00501 ** 
preterm1+   -279.1634   138.8263   -2.01  0.04583 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 668 on 180 degrees of freedom
Multiple R-squared:  0.196, Adjusted R-squared:  0.16 
F-statistic: 5.47 on 8 and 180 DF,  p-value: 0.00000345

library(epicalc)
regress.display(lm.full)

                 Coeff lower095ci upper095ci    Pr>|t|
(Intercept) 4166.23199 1993.80557 6338.65841 0.0002098
age         -146.13440 -269.10716  -23.16164 0.0201231
I(age^2)       2.92325    0.51019    5.33631 0.0178605
lwt            9.22958  -12.06153   30.52068 0.3934754
I(lwt^2)      -0.01813   -0.08727    0.05101 0.6054888
smokeYes    -326.46917 -547.88604 -105.05231 0.0040775
raceBlack   -496.79705 -802.84069 -190.75341 0.0016079
raceOther   -341.78998 -579.15773 -104.42223 0.0050114
preterm1+   -279.16340 -553.09981   -5.22699 0.0458302

Various plotting functions from the car package

Most useful functions are:

## Load the car (Companion to Applied Regression) package
library(car)

Quantile-Comparison Plots

Plots empirical quantiles of a variable, or of studentized residuals from a linear model, against theoretical quantiles of a comparison distribution.

## Residual normality
qqPlot(lm.full)

plot of chunk unnamed-chunk-5

Added-Variable Plots

These functions construct added-variable (also called partial-regression) plots for linear and generalized linear models.

## Added-variable plots: bwt|others ~ age|others (Partial relationship between response and regressor)
avPlots(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-6

## avPlots(model = lm.full, terms =  ~ age + I(age^2), id.n = 5)

Ceres Plots

These functions draw Ceres plots for linear and generalized linear models.

## This does not work?
## ceresPlots(model = lm.full, terms =  ~ age, ylim = c(-4000,4000))

Component+Residual (Partial Residual) Plots

These functions construct component+residual plots (also called partial-residual plots) for linear and generalized linear models.

## Component + Residual plots: (Predicted + Residual(bwt)) ~ age
crPlots(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-8

## crPlots(model = lm.full, terms =  ~ age + I(age^2), id.n = 5)

dfbeta and dfbetas Index Plots

These functions display index plots of dfbeta (effect on coefficients of deleting each observation in turn) and dfbetas (effect on coefficients of deleting each observation in turn, standardized by a deleted estimate of the coefficient standard error). In the plot of dfbeta, horizontal lines are drawn at 0 and +/- one standard error; in the plot of dfbetas, horizontal lines are drawn and 0 and +/- 1.

## dfbeta and dfbetas Index Plots
dfbetasPlots(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-9

## Easier to see if the y-axis is fixed
dfbetasPlots(model = lm.full, id.n = 5, ylim = c(-1,1)*1.5)

plot of chunk unnamed-chunk-9

Influence Index Plot

Provides index plots of Cook's distances, leverages, Studentized residuals, and outlier significance levels for a regression object.

influenceIndexPlot(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-10

Regression Influence Plot

This function creates a “bubble” plot of Studentized residuals by hat values, with the areas of the circles representing the observations proportional to Cook's distances. Vertical reference lines are drawn at twice and three times the average hat value, horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale.

influencePlot(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-11

    StudRes     Hat   CookD
68   0.7712 0.30289 0.16965
76   0.4098 0.14951 0.05740
93   0.5646 0.17724 0.08751
106  1.3175 0.22056 0.23314
129  2.1800 0.02508 0.11535
130  1.3655 0.53713 0.48916
131 -2.4353 0.06020 0.20270
132 -3.5297 0.02404 0.17904
133 -2.7495 0.10151 0.30259
136 -2.1492 0.03276 0.13053

Inverse Response Plots to Transform the Response

For a ‘lm’ model, draws an inverse.response plot with the response Y on the vertical axis and the fitted values Yhat on the horizontal axis. Uses ‘nls’ to estimate lambda in the function Yhat = b0 + b1(Y)lambda. Adds the fitted curve to the plot. ‘invResPlot’ is an alias for ‘inverseResponsePlot’.

inverseResponsePlot(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-12

  lambda      RSS
1  4.723 14885493
2 -1.000 17295001
3  0.000 16366354
4  1.000 15724560

Regression Leverage Plots

These functions display a generalization, due to Sall (1990) and Cook and Weisberg (1991), of added-variable plots to multiple-df terms in a linear model. When a term has just 1 df, the leverage plot is a rescaled version of the usual added-variable (partial-regression) plot.

leveragePlots(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-13

Marginal Model Plotting

For a regression object, plots the response on the vertical axis versus a linear combination u of terms in the mean function on the horizontal axis. Added to the plot are a ‘loess’ smooth for the graph, along with a loess smooth from the plot of the fitted values on u. ‘mmps’ is an alias for ‘marginalModelPlots’, and ‘mmp’ is an alias for ‘marginalModelPlot’.

marginalModelPlots(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-14

Residual Plots and Curvature Tests for Linear Model Fits

Plots the residuals versus each term in a mean function and versus fitted values. Also computes a curvature test for each of the plots by adding a quadratic term and testing the quadratic to be zero. This is Tukey's test for nonadditivity when plotting against fitted values.

residualPlots(model = lm.full, id.n = 5)

plot of chunk unnamed-chunk-15

           Test stat Pr(>|t|)
age            0.473    0.637
I(age^2)       1.330    0.185
lwt           -0.484    0.629
I(lwt^2)       2.312    0.022
smoke             NA       NA
race              NA       NA
preterm           NA       NA
Tukey test     0.788    0.431

Spread-Level Plots

Creates plots for examining the possible dependence of spread on level, or an extension of these plots to the studentized residuals from linear models.

spreadLevelPlot(x = lm.full)

plot of chunk unnamed-chunk-16


Suggested power transformation:  0.5817 

effects package

## Load effects
library(effects)
## Plot effect of each term
allEffects.lm.full <- allEffects(lm.full)
plot(allEffects.lm.full)

plot of chunk unnamed-chunk-17