## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
echo = TRUE, fig.width = 7, fig.height = 7)
options(width = 116, scipen = 10)
setwd("~/Documents/HSPH/useR.at.HSPH/")
library(car)
For more fuller examples, see: http://rpubs.com/kaz_yos/residuals-bio213
Performed to check for validity of assumptions and presence of outliers/influential points.
Assumptions in linear regression
If normality and/or homoscedasticity assumptions are violated, consider transformation of the outcome variable. If linearity assumption is violated, consider transformation of the predictor variable.
Examples using R default functions and car package are demonstrated here.
Most useful functions are:
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")
## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)
## Recoding
lbw <- within(lbw, {
## race relabeling
race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
## ftv (frequency of visit) relabeling
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
ftv.cat <- relevel(ftv.cat, ref = "Normal")
## ptl
preterm <- factor(ptl >= 1, levels = c(F,T), labels = c("=0",">=1"))
})
## Full
linear.model.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
## summary
summary(linear.model.full)
Call:
lm(formula = bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat +
preterm, data = lbw)
Residuals:
Min 1Q Median 3Q Max
-1896.7 -443.3 53.2 466.1 1654.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2947.32 320.48 9.20 < 2e-16 ***
age -2.91 9.67 -0.30 0.76354
lwt 4.22 1.72 2.46 0.01488 *
smoke -307.34 109.13 -2.82 0.00541 **
ht -568.63 200.88 -2.83 0.00518 **
ui -494.11 137.23 -3.60 0.00041 ***
ftv.catNone -56.50 105.36 -0.54 0.59245
ftv.catMany -185.86 203.19 -0.91 0.36158
race.catBlack -467.30 149.78 -3.12 0.00211 **
race.catOther -322.81 117.40 -2.75 0.00658 **
preterm>=1 -207.91 136.35 -1.52 0.12907
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 647 on 178 degrees of freedom
Multiple R-squared: 0.255, Adjusted R-squared: 0.213
F-statistic: 6.08 on 10 and 178 DF, p-value: 0.0000000609
Not interpretable from graphs, think about the design.
Intermediate: Normality of the residual (\( r_i \))
car::qqPlot(residuals(linear.model.full), id.n = 5)
132 130 136 94 155
1 189 2 188 3
Plot the residual \( r_i \) versus the predicted outcome \( \hat{Y}_i \)
## Raw residual
car::residualPlots(linear.model.full, terms = ~ 1, fitted = T, id.n = 5, smoother = loessLine)
Test stat Pr(>|t|)
Tukey test -0.611 0.541
Hard: Partial regression plot (Added-variable plot)
What Dr Orav calls “partial regression residual plot” is this one. See if the red line fits the data points well and if there is any non-linear pattern.
car::avPlots(linear.model.full, terms = ~ age + lwt, id.n = 5)
Combination of jackknifed residuals (Kleinbaum's definition), hat-values (leverage), and Cook's distance
Jackknifed residuals (Kleinbaum definition) on Y-axis Hat-values (leverage) on X-axis Cook's distances as size
An observation that is both outlying and has high leverage exerts influence on the regression coefficients, thus high Cook's distance.
car::influencePlot(linear.model.full, id.n = 5)
StudRes Hat CookD
22 0.3970 0.16159 0.05267
68 0.3463 0.22651 0.05665
93 1.1371 0.15114 0.14455
94 2.3185 0.07208 0.19248
106 1.6674 0.14865 0.20903
130 2.7618 0.11042 0.28806
131 -2.0750 0.08114 0.18422
132 -3.1015 0.06288 0.23657
133 -1.8795 0.16491 0.25006
136 -2.4445 0.03153 0.13116
189 0.2073 0.18064 0.02943
title("Y: jackknifed residuals, X: leverage, Size: Cook D")