Chapter 10 – Building the Regression Model II: Diagnostics

Load the data sets

env <- new.env()
load("../data.rda", envir = env)

Input Life Insurance Data

df <- get("CH10TA01", envir = env)
names(df) <- c("X1", "X2", "Y")
fit <- lm(Y ~ ., df)

TABLE 10.1 (p 387)

Basic Data–Life Insurance Example

cbind(
  "Avg Income"     = df$X1,
  "Risk Aversion"  = df$X2,
  "Life Insurance" = df$Y)

##       Avg Income Risk Aversion Life Insurance
##  [1,]      45.01             6             91
##  [2,]      57.20             4            162
##  [3,]      26.85             5             11
##  [4,]      66.29             7            240
##  [5,]      40.96             5             73
##  [6,]      73.00            10            311
##  [7,]      79.38             1            316
##  [8,]      52.77             8            154
##  [9,]      55.92             6            164
## [10,]      38.12             4             54
## [11,]      35.84             6             53
## [12,]      75.80             9            326
## [13,]      37.41             5             55
## [14,]      54.38             2            130
## [15,]      46.19             7            112
## [16,]      46.13             4             91
## [17,]      30.37             3             14
## [18,]      39.06             5             63

FIGURE 10.3 (p 387)

Residual Plot and Added-Variable Plot–Life Insurance Example

The car package contains avPlots for this, among other useful diagnostic functions. See below. At this time, Quick-R contains some examples at

http://www.statmethods.net/stats/rdiagnostics.html

A <- resid(lm(Y ~ X2, df))
B <- resid(lm(X1 ~ X2, df))
par(mfrow = c(1, 2), pch = 19)

plot(resid(fit) ~ X1, df, xlab = "X1", ylab = "Residual")
title("(a) Residual Plot against X1")
abline(0, 0, lty = 2)

plot(A ~ B, xlab = "e(X1|X2)", ylab = "e(Y|X2)")
title("(b) Added-Variable Plot for X1")
abline(lm(A ~ B))
abline(0, 0, lty = 2)

plot of chunk avplot

Input the Body Fat Data

df <- get("CH07TA01", envir = env)
names(df) <- c("X1", "X2", "X3", "Y")
fit <- lm(Y ~ X1 + X2, df)

FIGURE 10.4 (p 389)

Residual Plots and Added-Variable Plots–Body Fat Example with Two Predictor Variables

The above manual approach can be extended to this example, but no more pedagogical benefit derives. The reader should be able to match the output to that in the book.

library(car)
avPlots(fit)

plot of chunk resid-plot

residualPlots(fit, fitted = FALSE, quadratic = FALSE, test = FALSE)

plot of chunk resid-plot

TABLE 10.2 (p 393)

Illustration of Hat Matrix

See Chapter 5 tutorial on how to do the matrix algebra. This will all be done using the R utilities.

Note, you can use model.matrix(fit) to obtain the X matrix for calculating the hat matrix and error variance accordingly. For example

X %*% solve(t(X) %*% X) %*% t(X)

fit <- lm(Y ~ X1 + X2, data =
  data.frame(
    'Y' = c(301, 327, 246, 187),
    'X1' = c(14, 19, 12, 11),
    'X2' = c(25, 32, 22, 15)))

cbind(
  'X1'   = model.frame(fit)$X1,
  'X2'   = model.frame(fit)$X2,
  'Y'    = model.frame(fit)$Y,
  'Yhat' = fitted(fit), 
  'e'    = resid(fit),
  'h'    = hatvalues(fit),
  's'    = deviance(fit) * (1 - hatvalues(fit)))

##   X1 X2   Y  Yhat        e      h        s
## 1 14 25 301 282.2  18.7621 0.3877 352.0155
## 2 19 32 327 332.3  -5.2919 0.9513  28.0039
## 3 12 22 246 260.0 -13.9513 0.6614 194.6384
## 4 11 15 187 186.5   0.4811 0.9996   0.2314

FIGURE 10.6 (p 398)

Illustration of Leverage Values as Distance Measures–Table 10.2 Example

plot(X2 ~ X1, model.frame(fit), pch = 19, xlim = c(10, 20), ylim = c(10, 37),
     xlab = "X1", ylab = "X2", main = "FIGURE 10.6")
text(X2 ~ X1, model.frame(fit), labels = round(hatvalues(fit), 4), pos = 3)

points(mean(X2) ~ mean(X1), model.frame(fit), pch = 22)
text(mean(X2) ~ mean(X1), model.frame(fit), 
     cex = 0.75, pos = 4, labels = "(X1bar, X2bar)")

plot of chunk unnamed-chunk-6

TABLE 10.3 (p 397)

Residuals, Diagonal Elements of the Hat Matrix, and Studentized Deleted Residuals–Body Fat Example with Two Predictor Variables

fit <- lm(Y ~ X1 + X2, df)
cbind(
  'e' = round(resid(fit), 3),
  'h' = round(hatvalues(fit), 3),
  't' = round(rstudent(fit), 3))

##         e     h      t
## 1  -1.683 0.201 -0.730
## 2   3.643 0.059  1.534
## 3  -3.176 0.372 -1.654
## 4  -3.158 0.111 -1.348
## 5   0.000 0.248  0.000
## 6  -0.361 0.129 -0.148
## 7   0.716 0.156  0.298
## 8   4.015 0.096  1.760
## 9   2.655 0.115  1.118
## 10 -2.475 0.110 -1.034
## 11  0.336 0.120  0.137
## 12  2.226 0.109  0.923
## 13 -3.947 0.178 -1.826
## 14  3.447 0.148  1.525
## 15  0.571 0.333  0.267
## 16  0.642 0.095  0.258
## 17 -0.851 0.106 -0.345
## 18 -0.783 0.197 -0.334
## 19 -2.857 0.067 -1.176
## 20  1.040 0.050  0.409

FIGURE 10.7 (p 399)

Scatter Plot of Thigh Circumference against Triceps Skinfold Thickness–Body Fat Example with Two Predictor Variables

plot(X2 ~ X1, df, type = 'n', xlim = c(13, 33), ylim = c(40, 60),
     xlab = "Triceps Skinfold Thickness", ylab = "Thigh Circumference")
text(X2 ~ X1, df, labels = seq(nrow(df)), cex=.75)

plot of chunk unnamed-chunk-8

TABLE 10.4 (p 402)

DFFITS, Cook's Distances, and DFBETAS–Body Fat Example with Two Predictor Variables

Compare with the R function influence.measures(fit)

cbind(
  "DFFITS"  = round(dffits(fit), 4),
  "D"       = round(cooks.distance(fit), 4),
  "DFBETA0" = round(dfbetas(fit)[,1], 4),
  "DFBETA1" = round(dfbetas(fit)[,2], 4),
  "DFBETA2" = round(dfbetas(fit)[,3], 4))

##     DFFITS      D DFBETA0 DFBETA1 DFBETA2
## 1  -0.3661 0.0460 -0.3052 -0.1315  0.2320
## 2   0.3838 0.0455  0.1726  0.1150 -0.1426
## 3  -1.2731 0.4902 -0.8471 -1.1825  1.0669
## 4  -0.4763 0.0722 -0.1016 -0.2935  0.1961
## 5  -0.0001 0.0000 -0.0001  0.0000  0.0001
## 6  -0.0567 0.0011  0.0397  0.0401 -0.0443
## 7   0.1279 0.0058 -0.0775 -0.0156  0.0543
## 8   0.5745 0.0979  0.2614  0.3911 -0.3325
## 9   0.4022 0.0531 -0.1514 -0.2947  0.2469
## 10 -0.3639 0.0440  0.2377  0.2446 -0.2688
## 11  0.0505 0.0009 -0.0090  0.0171 -0.0025
## 12  0.3233 0.0352 -0.1305  0.0225  0.0700
## 13 -0.8508 0.2122  0.1194  0.5924 -0.3895
## 14  0.6355 0.1249  0.4517  0.1132 -0.2977
## 15  0.1889 0.0126 -0.0030 -0.1248  0.0688
## 16  0.0838 0.0025  0.0093  0.0431 -0.0251
## 17 -0.1184 0.0049  0.0795  0.0550 -0.0761
## 18 -0.1655 0.0096  0.1321  0.0753 -0.1161
## 19 -0.3151 0.0324 -0.1296 -0.0041  0.0644
## 20  0.0940 0.0031  0.0102  0.0023 -0.0033

FIGURE 10.8 (p 404)

Proportional Influence Plot (Points Proportional in Size to Cook's Distance Measure) and Index Influence Plot–Body Fat Example with Two Predictor Variables

par(mfrow = c(1, 2), pch = 19)
plot(resid(fit) ~ fitted(fit), cex = cooks.distance(fit)*10,
     xlim = c(10, 30), ylim = c(-4.5, 4.5),
     xlab = "YHAT", ylab = "Residual")
title("(a) Proportional Influence Plot")
plot(seq(nrow(df)), cooks.distance(fit), type = "o", lwd = 2,
     xlab = "Case Index Number", ylab = "Cook's Distance D")
title("(b) Index Influence Plot")

plot of chunk influence

TABLE 10.5 (p 409)

Variance Inflation Factors–Body Fat Example with Three Predictor Variables

R has vif (car) for this. It takes in an lm object. To get the standardized coefficients we update the model by generically scaling the values. The maximum VIF is clear, and an average can easily be computed. I leave it to the interested student to perform that simple task.

fit <- lm(scale(Y) ~ scale(X1) + scale(X2) + scale(X3), df)
cbind(
  "Beta*" = round(as.vector(coef(fit)[-1]), 4),
  "VIF"   = round(vif(lm(Y ~ ., df)), 2))

##     Beta*   VIF
## X1  4.264 708.8
## X2 -2.929 564.3
## X3 -1.561 104.6

Input the Surgical Unit Data

Note: R > 2.15 has paste0. No need for sep = "" anymore.

df <- get("CH09TA01", envir = env)
names(df) <- c(paste("X", seq(8), sep = ""), "Y" , "lnY")
fit <- lm(lnY ~ X1 + X2 + X3 + X8, df)
vif(fit)  # For page 412

##    X1    X2    X3    X8 
## 1.103 1.020 1.049 1.092

FIGURE 10.9 (p 411)

Residual and Added-Variable Plots for Surgical Unit Example–Regression Model (10.45)

The car functions used earlier can be used to explore the unshown information discussed on page 410. This example, however, requires relating added-values from variables not within the model. Thus, it cannot be used with residualPlots or avPlots. It may be possible to fit a full model and use these functions with appropriately selected 'terms' parameters, though.

par(mfrow = c(2, 2), pch = 19)

plot(resid(fit) ~ fitted(fit), ylim = c(-0.7, 0.7),
     xlab = "Predicted Value", ylab = "Residual")
title("(a) Residual Plot against Predicted")

plot(resid(fit) ~ df$X5, ylim = c(-0.7, 0.7),
     xlab = "X5", ylab = "Residual")
title("(b) Residual Plot against X5")

plot(resid(fit) ~ resid(update(fit, X5 ~ .)), ylim = c(-0.7, 0.7),
     xlab = "e(X5|X1238)", ylab = "e(Y'|X1238)")
title("(c) Added-Variable Plot for X5")
abline(lm(resid(fit) ~ resid(update(fit, X5 ~ .))) )

qqnorm(resid(fit), xlab = "Expected Value", ylab = "Residual", main = "")
qqline(resid(fit))
title("(d) Normal Probability Plot")

plot of chunk unnamed-chunk-12

FIGURE 10.10 (p 413)

Diagnostic Plots for Surgical Unit Example

par(mfrow = c(2, 2), pch = 19)

plot(seq(nrow(df)), rstudent(fit), type = "o",
     main = "(a) Studentized Deleted Residuals",
     xlab = "Case Index", ylab = "t")
plot(seq(nrow(df)), hatvalues(fit), type = "o",
     main = "(b) Leverage Values",
     xlab = "Case Index", ylab = "h")
plot(seq(nrow(df)), cooks.distance(fit), type = "o",
     main = "(c) Cook's Distance",
     xlab = "Case Index", ylab = "D")
plot(seq(nrow(df)), dffits(fit), type = "o",
     main = "(d) DFFITS values",
     xlab = "Case Index", ylab = "DFFITS")

plot of chunk unnamed-chunk-13

TABLE 10.6 (p 413)

Various Diagnostics for Outlying Cases–Surgical Unit Example

cases <- c(17, 23, 28, 32, 38, 42, 52)
cbind(
  'e'      = round(resid(fit), 4),
  't'      = round(rstudent(fit), 4),
  'h'      = round(hatvalues(fit), 4),
  'D'      = round(cooks.distance(fit), 4),
  'DFFITS' = round(dffits(fit), 4))[cases, ]

##          e       t      h      D  DFFITS
## 17  0.5952  3.3696 0.1499 0.3306  1.4151
## 23  0.2788  1.4854 0.1885 0.1001  0.7160
## 28  0.0876  0.4896 0.2914 0.0200  0.3140
## 32 -0.2861 -1.5585 0.2202 0.1333 -0.8283
## 38 -0.2271 -1.3016 0.3059 0.1472 -0.8641
## 42 -0.0303 -0.1620 0.2262 0.0016 -0.0876
## 52 -0.1375 -0.7358 0.2221 0.0312 -0.3931