## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
echo = TRUE, fig.width = 7, fig.height = 7)
options(width = 116, scipen = 10)
setwd("~/Documents/HSPH/useR.at.HSPH/")
library(car)
Regression diagnostics are performed to check the validity of assumptions and the presence of outliers/influential points.
Assumptions in linear regression
If the normality and/or homoscedasticity assumptions are violated, consider transforming the outcome variable. If the linearity assumption is violated, consider transforming the predictor variable.
Examples using R's default functions and the car package are demonstrated here.
The most useful functions demonstrated below are: car::qqPlot(), car::residualPlots(), car::spreadLevelPlot(), car::ncvTest(), car::avPlots(), car::crPlots(), car::influenceIndexPlot(), car::influencePlot(), dfbeta()/dfbetas(), and dffits().
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")
## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 need a fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)
## Recoding
lbw <- within(lbw, {
## race relabeling
race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
## ftv (frequency of visit) relabeling
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
ftv.cat <- relevel(ftv.cat, ref = "Normal")
## ptl
preterm <- factor(ptl >= 1, levels = c(F,T), labels = c("=0",">=1"))
## Categorize smoke ht ui
smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
ht <- factor(ht, levels = 0:1, labels = c("No","Yes"))
ui <- factor(ui, levels = 0:1, labels = c("No","Yes"))
})
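A quick sanity check of the recoding (a minimal sketch; not part of the original output) can cross-tabulate each new variable against its source:
## Verify the recoded variables against the originals
table(lbw$ftv, lbw$ftv.cat)
table(lbw$ptl, lbw$preterm)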
## Full
lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
## summary
summary(lm.full)
Call:
lm(formula = bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat +
preterm, data = lbw)
Residuals:
Min 1Q Median 3Q Max
-1896.7 -443.3 53.2 466.1 1654.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2947.32 320.48 9.20 < 2e-16 ***
age -2.91 9.67 -0.30 0.76354
lwt 4.22 1.72 2.46 0.01488 *
smokeYes -307.34 109.13 -2.82 0.00541 **
htYes -568.63 200.88 -2.83 0.00518 **
uiYes -494.11 137.23 -3.60 0.00041 ***
ftv.catNone -56.50 105.36 -0.54 0.59245
ftv.catMany -185.86 203.19 -0.91 0.36158
race.catBlack -467.30 149.78 -3.12 0.00211 **
race.catOther -322.81 117.40 -2.75 0.00658 **
preterm>=1 -207.91 136.35 -1.52 0.12907
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 647 on 178 degrees of freedom
Multiple R-squared: 0.255, Adjusted R-squared: 0.213
F-statistic: 6.08 on 10 and 178 DF, p-value: 0.0000000609
Independence: not interpretable from graphs; think about the study design.
Simple: Normality of the outcome (\( Y_i \))
car::qqPlot(lbw$bwt)
Intermediate: Normality of the residual (\( r_i \))
layout(matrix(1:2, ncol = 2))
plot(lm.full, which = 2)
car::qqPlot(residuals(lm.full), id.n = 5)
132 130 136  94 155
  1 189   2 188   3
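As a numerical complement to the Q-Q plots (a sketch, not part of the original output), a formal normality test can be run on the residuals, bearing in mind its sensitivity to sample size:
## Shapiro-Wilk test of residual normality; interpret alongside the Q-Q plot
shapiro.test(residuals(lm.full))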
Plot the residual \( r_i \) versus the predicted outcome \( \hat{Y}_i \)
layout(matrix(1:2, ncol = 2))
plot(lm.full, which = 1)
## Raw residual
car::residualPlots(lm.full, terms = ~ 1, fitted = T, id.n = 5, smoother = loessLine)
Test stat Pr(>|t|)
Tukey test -0.611 0.541
## Kleinbaum's jackknifed residuals (R calls these studentized residuals)
car::residualPlots(lm.full, terms = ~ 1, id.n = 5, type = "rstudent", quadratic = F, smoother = loessLine)
Test stat Pr(>|t|)
Tukey test -0.611 0.541
The spread-level plot does essentially the same thing, and the non-constant variance score test tests for heteroscedasticity formally.
car::spreadLevelPlot(lm.full, id.n = 5)
Suggested power transformation: 0.7456
car::ncvTest(lm.full)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.004127 Df = 1 p = 0.9488
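As an illustrative follow-up (an assumption, not part of the original analysis), the suggested power from spreadLevelPlot could be rounded to 0.75, the outcome transformed, and the heteroscedasticity check repeated:
## Refit with the (rounded) suggested power transformation of the outcome
lm.trans <- update(lm.full, I(bwt^0.75) ~ .)
car::ncvTest(lm.trans)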
Simple: Plot the outcome \( Y_i \) versus the predictor \( X_i \)
layout(matrix(1:2, ncol = 2))
plot(bwt ~ age, lbw)
plot(bwt ~ lwt, lbw)
Intermediate: Plot the residual \( r_i \) versus the predictor \( X_i \)
car::residualPlots(lm.full, terms = ~ age + lwt, fitted = F, id.n = 5, smoother = loessLine)
Test stat Pr(>|t|)
age 2.338 0.021
lwt 0.307 0.759
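Because the curvature test flags age (p = 0.021), one hedged follow-up (not in the original analysis) is to add a quadratic term for age and test whether it improves the fit:
## Add a quadratic age term and compare the nested models
lm.age2 <- update(lm.full, . ~ . + I(age^2))
anova(lm.full, lm.age2)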
Hard: Partial regression plot (Added-variable plot)
What Dr. Orav calls a “partial regression residual plot” is this one. Check whether the red line fits the data points well and whether there is any non-linear pattern.
car::avPlots(lm.full, terms = ~ age + lwt, id.n = 5)
Partial residual plot (Component+Residual plot)
This is similar but different in that the X-axis is the raw predictor. The red line is the partial fit, assuming linearity in the partial relationship between \( y \) and \( x_j \). The green line is a lowess smooth; if it is very different from the red line, it suggests the need for a predictor transformation.
car::crPlots(lm.full, terms = ~ age + lwt, id.n = 5)
Simple: Scatter plots for continuous covariates
layout(matrix(1:2, ncol = 2))
plot(bwt ~ age, lbw)
plot(bwt ~ lwt, lbw)
Intermediate: Standardized residuals (Kleinbaum definition)
\( z_{i} = \frac {\hat{E}_i} {S} \)
where \( S \) is the square root of the mean squared error (MSE).
Absolute values larger than 2 are considered large residuals.
There is probably no built-in function that calculates this in R, but it is easy to compute by hand, as sketched below.
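A minimal sketch of the computation (object names here are illustrative):
## Kleinbaum's standardized residuals: raw residuals divided by the root MSE
z <- residuals(lm.full) / summary(lm.full)$sigma
## Flag large standardized residuals
which(abs(z) > 2)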
Hard: Studentized residuals (Kleinbaum definition)
\( r_{i} = \frac {\hat{E}_i} {S \sqrt{1 - h_i}} \)
where \( h_i \) is the leverage of the \( i \)th observation.
These are standardized residuals that incorporate leverage (hat-values). Absolute values larger than 2 are considered large residuals.
R uses different terminology (see Fox & Weisberg): Kleinbaum's studentized residual is R's standardized residual.
plot(rstandard(lm.full))
car::residualPlots(lm.full, terms = ~ 1, id.n = 5, type = "rstandard", quadratic = F)
Test stat Pr(>|t|)
Tukey test -0.611 0.541
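To list the flagged observations rather than only plot them, the same cutoff of 2 can be applied directly (a minimal sketch):
## Observations with |standardized residual| > 2 (R's rstandard = Kleinbaum's studentized)
which(abs(rstandard(lm.full)) > 2)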
Leverage (hat-values)
Leverage or hat-values are considered high if greater than 2 × (number of predictors + 1) / (number of data points). Here there are 10 predictors and 189 data points, so the cutoff is 2 × 11 / 189 ≈ 0.1164. These values indicate how far the data points are from the center of the regressor space.
The influencePlot() function is more useful for hat-values (combined with jackknifed residuals and Cook's distances); the cutoff value is added automatically.
car::influenceIndexPlot(lm.full, vars = "hat", id.n = 20)
abline(h = 2 * length(coef(lm.full)) / nrow(lm.full$model))
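The hat-values can also be extracted and flagged against the same cutoff directly (a minimal sketch; hv is an illustrative name):
## Flag observations above the 2 * (k + 1) / n leverage cutoff
hv <- hatvalues(lm.full)
which(hv > 2 * length(coef(lm.full)) / nrow(lm.full$model))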
Hard: Jackknifed residuals (Kleinbaum definition)
\( r_{(-i)} = \frac {\hat{E}_i} {S_{(-i)} \sqrt{1 - h_i}} \)
where \( S_{(-i)} \) is the square root of the MSE calculated excluding observation \( i \).
R uses different terminology (see Fox & Weisberg): Kleinbaum's jackknifed residual is R's studentized residual.
car::residualPlots(lm.full, terms = ~ 1, id.n = 5, type = "rstudent", quadratic = F)
Test stat Pr(>|t|)
Tukey test -0.611 0.541
car::influenceIndexPlot(lm.full, vars = "Studentized", id.n = 5)
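As a complement (not part of the original output), car also offers a formal Bonferroni-corrected test of the largest absolute studentized residual:
## Bonferroni outlier test based on the largest |studentized residual|
car::outlierTest(lm.full)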
Cook's distance
\( \sum_{j=0}^{k}[\hat{\beta}_j - \hat{\beta}_{j(-i)}]^2 \)
Cook's distance measures the extent to which the estimates of the regression coefficients change when an observation is deleted from the analysis; conceptually it aggregates the squared changes above across all coefficients \( j = 0, \ldots, k \) (with appropriate scaling).
car::influenceIndexPlot(lm.full, vars = "Cook", id.n = 5)
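Cook's distances can also be extracted directly; one common rule of thumb (an assumption here, not from the original) flags values above 4 / (n - k - 1), as sketched below:
## Flag observations with large Cook's distance (rule-of-thumb cutoff)
cd <- cooks.distance(lm.full)
which(cd > 4 / (nrow(lm.full$model) - length(coef(lm.full))))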
Combination of jackknifed residuals (Kleinbaum's definition), hat-values (leverage), and Cook's distance
Jackknifed residuals (Kleinbaum definition) on the Y-axis, hat-values (leverage) on the X-axis, and Cook's distances as the point size.
An observation that is both outlying and has high leverage exerts influence on the regression coefficients, thus high Cook's distance.
car::influencePlot(lm.full, id.n = 5)
StudRes Hat CookD
22 0.3970 0.16159 0.05267
68 0.3463 0.22651 0.05665
93 1.1371 0.15114 0.14455
94 2.3185 0.07208 0.19248
106 1.6674 0.14865 0.20903
130 2.7618 0.11042 0.28806
131 -2.0750 0.08114 0.18422
132 -3.1015 0.06288 0.23657
133 -1.8795 0.16491 0.25006
136 -2.4445 0.03153 0.13116
189 0.2073 0.18064 0.02943
title("Y: jackknifed residuals, X: leverage, Size: Cook D")
dfbeta
This is a non-summarized version of Cook's distance, i.e., the changes in the coefficients when each observation is deleted.
\( \mathrm{dfbeta}_{ij} = b_{(-i)j} - b_{j} \) for \( j = 0, \ldots, k \)
where \( b_j \) is the coefficient computed using all the data and \( b_{(-i)j} \) is the same coefficient computed with case \( i \) omitted. \( \mathrm{dfbetas}_{ij} \) is the standardized version of \( \mathrm{dfbeta}_{ij} \).
## dfbeta for first 10 observations
dfbeta(lm.full)[1:10,]
(Intercept) age lwt smokeYes htYes uiYes ftv.catNone ftv.catMany race.catBlack race.catOther
1 4.1285 0.06162 -0.040169 1.5818 1.77997 -5.449 -1.4157 -0.33133 -4.1889 0.6775
2 31.5126 -0.89067 -0.063748 1.5680 5.39399 0.378 -1.3863 -41.55109 -0.9061 -9.4662
3 -25.0841 0.41187 0.066758 -7.6297 -0.08118 3.551 8.3429 7.41421 3.1864 1.9774
4 1.6985 -0.02605 -0.004506 0.8550 0.33112 2.164 -1.0242 -0.77335 -0.3296 -0.2545
5 2.9867 -0.07558 -0.009088 1.0008 0.21583 4.265 0.8387 0.17631 -1.1392 -1.4148
6 -0.7135 0.06693 -0.009364 2.2187 3.42553 3.133 -3.7675 -0.06969 0.8473 -4.8949
7 -40.1601 0.52170 0.074151 9.1164 0.56609 4.105 7.9833 6.71370 8.9521 11.3928
8 -21.6675 0.53326 0.031435 0.3232 0.53376 3.087 7.4300 5.36974 0.3177 -5.9116
9 1.8222 -0.43739 0.034003 -7.6687 0.66096 2.044 5.5879 6.94256 0.8800 -0.0994
10 0.6666 -0.30998 0.047935 -3.7820 1.30704 2.401 -4.1182 0.29882 1.6926 2.7312
preterm>=1
1 0.8375
2 3.9622
3 4.5351
4 -0.9888
5 -1.6422
6 1.4513
7 1.4624
8 1.9072
9 5.3374
10 3.2578
## dfbetas (standardized) for first 10 observations
dfbetas(lm.full)[1:10,]
(Intercept) age lwt smokeYes htYes uiYes ftv.catNone ftv.catMany race.catBlack
1 0.012848 0.006354 -0.023329 0.014455 0.0088370 -0.039604 -0.013401 -0.0016262 -0.027892
2 0.098203 -0.091959 -0.037075 0.014349 0.0268171 0.002751 -0.013141 -0.2042290 -0.006042
3 -0.078171 0.042525 0.038826 -0.069822 -0.0004036 0.025841 0.079083 0.0364423 0.021247
4 0.005285 -0.002685 -0.002617 0.007813 0.0016438 0.015726 -0.009694 -0.0037954 -0.002195
5 0.009294 -0.007792 -0.005278 0.009145 0.0010715 0.030996 0.007939 0.0008654 -0.007586
6 -0.002223 0.006908 -0.005444 0.020296 0.0170243 0.022795 -0.035699 -0.0003424 0.005648
7 -0.125442 0.053990 0.043225 0.083619 0.0028210 0.029943 0.075849 0.0330754 0.059831
8 -0.067486 0.055028 0.018272 0.002956 0.0026522 0.022458 0.070391 0.0263787 0.002117
9 0.005677 -0.045144 0.019769 -0.070154 0.0032850 0.014869 0.052950 0.0341122 0.005866
10 0.002076 -0.031980 0.027857 -0.034583 0.0064932 0.017459 -0.039006 0.0014676 0.011277
race.catOther preterm>=1
1 0.0057553 0.006126
2 -0.0805302 0.029022
3 0.0168223 0.033219
4 -0.0021617 -0.007232
5 -0.0120184 -0.012011
6 -0.0416261 0.010627
7 0.0971458 0.010737
8 -0.0502633 0.013962
9 -0.0008454 0.039082
10 0.0232173 0.023845
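To flag large values rather than scanning the matrix by eye, one common rule of thumb (an assumption here, not from the original) compares |dfbetas| to 2 / sqrt(n); a minimal sketch:
## Flag observation/coefficient pairs with large standardized changes
which(abs(dfbetas(lm.full)) > 2 / sqrt(nrow(lbw)), arr.ind = TRUE)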
dffits
Standardized changes in the predicted value when each observation is deleted.
dffits(lm.full)[1:10]
1 2 3 4 5 6 7 8 9 10
-0.06420 -0.26596 -0.14219 0.02232 0.04353 -0.09531 -0.19103 -0.11884 -0.12071 -0.08937
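A matching rule of thumb for dffits (again an assumption, one of several in use) is 2 * sqrt((k + 1) / n); a minimal sketch:
## Flag observations with large |dffits|
cutoff <- 2 * sqrt(length(coef(lm.full)) / nrow(lbw))
which(abs(dffits(lm.full)) > cutoff)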