LMR, 11.1, 11.2, 11.3
7.1 The generated data set in this question is taken from Mantel (1970).
Case <- c(1:5)
Y <- c(5,6,8,9,11)
X1 <- c(1,200,-50,909,506)
X2 <- c(1004,806,1058,100,505)
X3 <- c(6,7.3,11,13,13.1)
mantel <- cbind(Case, Y, X1, X2, X3)
mantel
## Case Y X1 X2 X3
## [1,] 1 5 1 1004 6.0
## [2,] 2 6 200 806 7.3
## [3,] 3 8 -50 1058 11.0
## [4,] 4 9 909 100 13.0
## [5,] 5 11 506 505 13.1
Interest centers on using variable selection to choose a subset of the predictors to model Y. The data were generated such that the full model:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + e\]
is a valid model for the data.
Output from R associated with different variable selection procedures based on model 7.8 appears below.
cor(mantel[,3:5])
## X1 X2 X3
## X1 1.0000000 -0.9999887 0.6858141
## X2 -0.9999887 1.0000000 -0.6826107
## X3 0.6858141 -0.6826107 1.0000000
mantel <- as.data.frame(mantel)
lm1 <- lm(Y ~ X3, mantel)
lm2 <- lm(Y ~ X1 + X2, mantel)
lm3 <- lm(Y ~ X1 + X2 + X3, mantel)
sum_lm1 <- summary(lm1)
sum_lm2 <- summary(lm2)
sum_lm3 <- summary(lm3)
adjr2 <- c(sum_lm1$adj.r.squared, sum_lm2$adj.r.squared, sum_lm3$adj.r.squared)
x <- seq(1,3,by=1)
plot(adjr2 ~ x, xlab = "Subset Size", ylab = "Statistic: adjr2")
Table 7.4 gives the values of \(R^2_{adj}\), AIC and BIC for the best subset of each size.
Of note: for some reason the table below produces results that differ from the table in the book.
Predictors <- c("X3", "X1, X2", "X1, X2, X3")
AIC_col <- c(AIC(lm1, k=2), AIC(lm2, k=2), AIC(lm3, k=2))
BIC_col <- c(AIC(lm1, k=log(nrow(mantel))), AIC(lm2, k=log(nrow(mantel))), AIC(lm3, k=log(nrow(mantel))))
allsubsets <- cbind(x, Predictors, adjr2, AIC_col, BIC_col)
allsubsets
## x Predictors adjr2 AIC_col
## [1,] "1" "X3" "0.876484701848241" "15.8806367928623"
## [2,] "2" "X1, X2" "1" "-271.559992520695"
## [3,] "3" "X1, X2, X3" "1" "-269.578999741786"
## BIC_col
## [1,] "14.7089505301646"
## [2,] "-273.122240870959"
## [3,] "-271.531810179616"
In theory, the size-2 subset (X1, X2) should be best, since it has the highest adjusted R-squared and the lowest AIC and BIC.
Reference: http://pages.pomona.edu/~jsh04747/courses/math158/multlin4.pdf
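As a cross-check, the best subset of each size could also be found exhaustively with regsubsets from the leaps package (used again for the prostate data later). A minimal sketch — bs_man is just a name introduced here, and with only five cases and nearly collinear predictors the fit statistics are extreme, but the ranking of subsets can still be compared:
library(leaps)
# Exhaustive search over the subsets of X1, X2, X3
bs_man <- regsubsets(Y ~ X1 + X2 + X3, data = mantel)
summary(bs_man)$which   # best subset of each size
summary(bs_man)$adjr2   # adjusted R-squared per size
summary(bs_man)$bic     # BIC (relative to the null model) per size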
attach(mantel)
## The following objects are masked _by_ .GlobalEnv:
##
## Case, X1, X2, X3, Y
slm3 <- step(lm(Y ~ 1), Y ~ X1 + X2 + X3, direction="forward")
## Start: AIC=9.59
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + X3 1 20.6879 2.1121 -0.3087
## + X1 1 8.6112 14.1888 9.2151
## + X2 1 8.5064 14.2936 9.2519
## <none> 22.8000 9.5866
##
## Step: AIC=-0.31
## Y ~ X3
##
## Df Sum of Sq RSS AIC
## <none> 2.1121 -0.30875
## + X2 1 0.066328 2.0458 1.53172
## + X1 1 0.064522 2.0476 1.53613
slm3
##
## Call:
## lm(formula = Y ~ X3)
##
## Coefficients:
## (Intercept) X3
## 0.7975 0.6947
BIC
slm4 <- step(lm(Y ~ 1), Y ~ X1 + X2 + X3, direction="forward", k = log(nrow(mantel)))
## Start: AIC=9.2
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + X3 1 20.6879 2.1121 -1.0899
## + X1 1 8.6112 14.1888 8.4339
## + X2 1 8.5064 14.2936 8.4707
## <none> 22.8000 9.1961
##
## Step: AIC=-1.09
## Y ~ X3
##
## Df Sum of Sq RSS AIC
## <none> 2.1121 -1.08987
## + X2 1 0.066328 2.0458 0.36003
## + X1 1 0.064522 2.0476 0.36444
slm4
##
## Call:
## lm(formula = Y ~ X3)
##
## Coefficients:
## (Intercept) X3
## 0.7975 0.6947
The forward searches settle on Y ~ X3, while the all-subsets comparison above favours X1 + X2. I suspect that, because of the "one at a time" nature of adding or dropping variables, stepwise selection can miss the optimal model. Reference: http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf
Stepwise variable selection tends to pick models that are smaller than desirable for prediction purposes. To give a simple example, consider the simple regression with just one predictor variable. Suppose that the slope for this predictor is not quite statistically significant. We might not have enough evidence to say that it is related to y but it still might be better to use it for predictive purposes.
It is unclear to me which of the two models I would support over the other.
The real data in this question first appeared in Hald (1952). The data are given in Table 7.5.
Y <- c(78.5,74.3,104.3,87.6,95.9,109.2,102.7,72.5,93.1,115.9,83.8,113.3,109.4)
x1 <- c(7,1,11,11,7,11,3,1,2,21,1,11,10)
x2 <- c(26,29,56,31,52,55,71,31,54,47,40,66,68)
x3 <- c(6,15,8,8,6,9,17,22,18,4,23,9,8)
x4 <- c(60,52,20,47,33,22,6,44,22,26,34,12,12)
hald <- cbind(Y, x1, x2, x3, x4)
hald
## Y x1 x2 x3 x4
## [1,] 78.5 7 26 6 60
## [2,] 74.3 1 29 15 52
## [3,] 104.3 11 56 8 20
## [4,] 87.6 11 31 8 47
## [5,] 95.9 7 52 6 33
## [6,] 109.2 11 55 9 22
## [7,] 102.7 3 71 17 6
## [8,] 72.5 1 31 22 44
## [9,] 93.1 2 54 18 22
## [10,] 115.9 21 47 4 26
## [11,] 83.8 1 40 23 34
## [12,] 113.3 11 66 9 12
## [13,] 109.4 10 68 8 12
Interest centers on using variable selection to choose a subset of the predictors to model Y. Throughout this question we shall assume that the full model below is a valid model for the data.
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + e\]
Output from R associated with different variable selection procedures based on model (7.9) appears on the following pages:
cor(hald[,c(2:5)])
## x1 x2 x3 x4
## x1 1.0000000 0.2285795 -0.8241338 -0.2454451
## x2 0.2285795 1.0000000 -0.1392424 -0.9729550
## x3 -0.8241338 -0.1392424 1.0000000 0.0295370
## x4 -0.2454451 -0.9729550 0.0295370 1.0000000
hald <- as.data.frame(hald)
lm1 <- lm(Y ~ x4, hald)
lm2 <- lm(Y ~ x1 + x2, hald)
lm3 <- lm(Y ~ x1 + x2 + x4, hald)
lm4 <- lm(Y ~ x1 + x2 + x3 + x4, hald)
sum_lm1 <- summary(lm1)
sum_lm2 <- summary(lm2)
sum_lm3 <- summary(lm3)
sum_lm4 <- summary(lm4)
adjr2 <- c(sum_lm1$adj.r.squared, sum_lm2$adj.r.squared, sum_lm3$adj.r.squared, sum_lm4$adj.r.squared)
x <- seq(1,4,by=1)
plot(adjr2 ~ x, xlab = "Subset Size", ylab = "Statistic: adjr2")
Predictors <- c("X4", "X1, X2", "X1, X2, X4", "X1, X2, X3, X4")
AIC_col <- c(AIC(lm1, k=2), AIC(lm2, k=2), AIC(lm3, k=2), AIC(lm4, k=2))
BIC_col <- c(AIC(lm1, k=log(nrow(hald))), AIC(lm2, k=log(nrow(hald))), AIC(lm3, k=log(nrow(hald))), AIC(lm4, k=log(nrow(hald))))
allsubsets <- cbind(x, Predictors, adjr2, AIC_col, BIC_col)
allsubsets
## x Predictors adjr2 AIC_col
## [1,] "1" "X4" "0.644954869961756" "97.7440447788562"
## [2,] "2" "X1, X2" "0.974414049442758" "64.3123927621906"
## [3,] "3" "X1, X2, X4" "0.976447268267236" "63.8662854718626"
## [4,] "4" "X1, X2, X3, X4" "0.97356343061152" "65.8366897916517"
## BIC_col
## [1,] "96.5723585161585"
## [2,] "62.750144411927"
## [3,] "61.9134750340331"
## [4,] "63.4933172662563"
The size-2 or size-3 subset would be quite reasonable here.
attach(hald)
## The following objects are masked _by_ .GlobalEnv:
##
## x1, x2, x3, x4, Y
## The following object is masked from mantel:
##
## Y
slm3 <- step(lm(Y ~ 1), Y ~ x1 + x2 + x3 + x4, direction="forward")
## Start: AIC=71.44
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 58.852
## + x2 1 1809.43 906.34 59.178
## + x1 1 1450.08 1265.69 63.519
## + x3 1 776.36 1939.40 69.067
## <none> 2715.76 71.444
##
## Step: AIC=58.85
## Y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 28.742
## + x3 1 708.13 175.74 39.853
## <none> 883.87 58.852
## + x2 1 14.99 868.88 60.629
##
## Step: AIC=28.74
## Y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 24.974
## + x3 1 23.926 50.836 25.728
## <none> 74.762 28.742
##
## Step: AIC=24.97
## Y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 24.974
## + x3 1 0.10909 47.864 26.944
slm3
##
## Call:
## lm(formula = Y ~ x4 + x1 + x2)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
BIC
slm4 <- step(lm(Y ~ 1), Y ~ x1 + x2 + x3 + x4, direction="forward", k = log(nrow(hald)))
## Start: AIC=72.01
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 59.982
## + x2 1 1809.43 906.34 60.308
## + x1 1 1450.08 1265.69 64.649
## + x3 1 776.36 1939.40 70.197
## <none> 2715.76 72.009
##
## Step: AIC=59.98
## Y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 30.437
## + x3 1 708.13 175.74 41.547
## <none> 883.87 59.982
## + x2 1 14.99 868.88 62.324
##
## Step: AIC=30.44
## Y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 27.234
## + x3 1 23.926 50.836 27.987
## <none> 74.762 30.437
##
## Step: AIC=27.23
## Y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 27.234
## + x3 1 0.10909 47.864 29.769
slm4
##
## Call:
## lm(formula = Y ~ x4 + x1 + x2)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
It looks like the three-predictor model (x1, x2, x4) works best here.
AIC
slm5 <- step(lm(Y ~ x1 + x2 + x3 + x4), Y ~ x1 + x2 + x3 + x4, direction="backward")
## Start: AIC=26.94
## Y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 24.974
## - x4 1 0.2470 48.111 25.011
## - x2 1 2.9725 50.836 25.728
## <none> 47.864 26.944
## - x1 1 25.9509 73.815 30.576
##
## Step: AIC=24.97
## Y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
slm5
##
## Call:
## lm(formula = Y ~ x1 + x2 + x4)
##
## Coefficients:
## (Intercept) x1 x2 x4
## 71.6483 1.4519 0.4161 -0.2365
slm6 <- step(lm(Y ~ x1 + x2 + x3 + x4), Y ~ x1 + x2 + x3 + x4, direction="backward", k = log(nrow(hald)))
## Start: AIC=29.77
## Y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 27.234
## - x4 1 0.2470 48.111 27.271
## - x2 1 2.9725 50.836 27.987
## <none> 47.864 29.769
## - x1 1 25.9509 73.815 32.836
##
## Step: AIC=27.23
## Y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 9.93 57.90 27.115
## <none> 47.97 27.234
## - x2 1 26.79 74.76 30.437
## - x1 1 820.91 868.88 62.324
##
## Step: AIC=27.11
## Y ~ x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 57.90 27.115
## - x1 1 848.43 906.34 60.308
## - x2 1 1207.78 1265.69 64.649
slm6
##
## Call:
## lm(formula = Y ~ x1 + x2)
##
## Coefficients:
## (Intercept) x1 x2
## 52.5773 1.4683 0.6623
The two- or three-predictor model would do well here.
Use the prostate data with lpsa as the response and the other variables as predictors. Implement the following variable selection methods to determine the "best" model:
require(faraway)
## Loading required package: faraway
data(prostate)
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
## 4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
## 5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
## 6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
b <- colnames(prostate[,-(9)])
a <- lm(lpsa ~ ., data=prostate)
attach(prostate)
backward_lm <- step(a, lpsa ~ ., direction="backward")
## Start: AIC=-58.32
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 1 0.0412 44.204 -60.231
## - pgg45 1 0.5258 44.689 -59.174
## - lcp 1 0.6740 44.837 -58.853
## <none> 44.163 -58.322
## - age 1 1.5503 45.713 -56.975
## - lbph 1 1.6835 45.847 -56.693
## - lweight 1 3.5861 47.749 -52.749
## - svi 1 4.9355 49.099 -50.046
## - lcavol 1 22.3721 66.535 -20.567
##
## Step: AIC=-60.23
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6623 44.867 -60.789
## <none> 44.204 -60.231
## - pgg45 1 1.1920 45.396 -59.650
## - age 1 1.5166 45.721 -58.959
## - lbph 1 1.7053 45.910 -58.560
## - lweight 1 3.5462 47.750 -54.746
## - svi 1 4.8984 49.103 -52.037
## - lcavol 1 23.5039 67.708 -20.872
##
## Step: AIC=-60.79
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6590 45.526 -61.374
## <none> 44.867 -60.789
## - age 1 1.2649 46.131 -60.092
## - lbph 1 1.6465 46.513 -59.293
## - lweight 1 3.5647 48.431 -55.373
## - svi 1 4.2503 49.117 -54.009
## - lcavol 1 25.4189 70.285 -19.248
##
## Step: AIC=-61.37
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 45.526 -61.374
## - age 1 0.9592 46.485 -61.352
## - lbph 1 1.8568 47.382 -59.497
## - lweight 1 3.2251 48.751 -56.735
## - svi 1 5.9517 51.477 -51.456
## - lcavol 1 28.7665 74.292 -15.871
backward_lm
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 0.95100 0.56561 0.42369 -0.01489 0.11184
## svi
## 0.72095
Final model from backward step elimination via AIC: lpsa ~ lcavol + lweight + age + lbph + svi
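A BIC-penalised backward search could be run the same way. A sketch only — backward_bic is a name introduced here and the output is not shown:
# Hypothetical BIC analogue of the backward elimination above
backward_bic <- step(a, lpsa ~ ., direction="backward", k = log(nrow(prostate)), trace = 0)
backward_bic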
require(leaps)
## Loading required package: leaps
b <- regsubsets(lpsa ~ ., data=prostate)
rs <- summary(b)
rs$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
\[AIC = -2L(\theta) + 2p\] \[-2L(\theta) = n \log\left(\frac{RSS}{n}\right)\]
AIC <- nrow(prostate)*log(rs$rss/nrow(prostate)) + (2:9)*2
plot(AIC ~ I(1:8), ylab="AIC", xlab="Number of Predictors")
AIC
## [1] -44.36608 -52.69043 -60.67621 -61.35179 -61.37439 -60.78867 -60.23130
## [8] -58.32184
Five predictors gives the lowest AIC, corresponding to the subset lcavol, lweight, age, lbph, svi (plus the intercept).
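The same RSS-based formula gives BIC by swapping the 2p penalty for p log(n). A sketch — BIC_best is a name introduced here:
# BIC for the best subset of each size, penalty log(n) per parameter
BIC_best <- nrow(prostate)*log(rs$rss/nrow(prostate)) + (2:9)*log(nrow(prostate))
which.min(BIC_best)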
plot(2:9, rs$adjr2, xlab="No. of Parameters", ylab="Adjusted R-square")
which.max(rs$adjr2)
## [1] 7
Adjusted R-squared instead favours the seven-predictor model. Another option is Mallow’s Cp statistic. A good model should predict well, so the average mean square error of prediction might be a good criterion:
\[\frac{1}{\sigma^2}\sum_{i} E\left(\hat{y}_i - E y_i\right)^2\]
which can be estimated by the Cp statistic:
\[C_p = \frac{RSS_p}{\hat \sigma^2} + 2p - n\]
where \(\hat \sigma^2\) is from the model with all predictors and \(RSS_p\) indicates the RSS from a model with p parameters. For the full model, Cp = p exactly. A model with a bad fit will have a \(C_p\) much bigger than p. It is usual to plot \(C_p\) against p. We desire models with small p and \(C_p\) around or less than p. \(C_p\), \(R^2_a\), and AIC all trade-off fit in terms of RSS against complexity (p).
plot(2:9, rs$cp, xlab="No. of Parameters", ylab="Cp Statistic")
abline(0,1)
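The Cp values can also be compared with p numerically; a quick sketch, where p counts the intercept to match the plot:
# Tabulate Cp against the parameter count p
cbind(p = 2:9, Cp = rs$cp)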
The six-predictor model appears to fit best here.
Use the seatpos data with hipcenter as the response.
data(seatpos,package="faraway")
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150
seatpos_lm <- lm(hipcenter ~ ., data=seatpos)
summary(seatpos_lm)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
It seems that none of the individual coefficients is statistically significant. The Leg coefficient suggests that each one-unit increase in leg length decreases hipcenter by about 6.44 units.
predict(seatpos_lm, interval = "confidence")
## fit lwr upr
## 1 -230.82470 -263.2205 -198.42890
## 2 -158.22231 -199.0855 -117.35914
## 3 -96.85463 -131.7243 -61.98494
## 4 -255.78273 -294.7597 -216.80574
## 5 -188.59572 -233.3630 -143.82845
## 6 -186.02614 -214.1290 -157.92325
## 7 -153.98285 -189.6664 -118.29935
## 8 -244.79086 -286.3104 -203.27128
## 9 -139.71030 -173.4665 -105.95404
## 10 -112.98566 -148.1345 -77.83680
## 11 -163.72509 -186.4322 -141.01801
## 12 -89.14799 -120.0194 -58.27658
## 13 -194.10261 -249.8912 -138.31401
## 14 -128.43355 -157.7773 -99.08975
## 15 -186.44972 -226.2569 -146.64258
## 16 -177.90902 -217.5495 -138.26853
## 17 -201.58090 -250.4299 -152.73188
## 18 -98.43069 -141.8463 -55.01511
## 19 -145.80244 -174.2207 -117.38415
## 20 -167.75364 -199.5743 -135.93300
## 21 -178.41491 -214.6476 -142.18225
## 22 -279.07627 -336.5054 -221.64716
## 23 -245.56763 -285.6346 -205.50071
## 24 -81.55529 -114.4871 -48.62343
## 25 -141.13605 -167.5849 -114.68722
## 26 -222.49965 -247.7190 -197.28026
## 27 -156.83929 -184.1675 -129.51112
## 28 -128.68145 -170.6894 -86.67351
## 29 -193.00256 -225.2335 -160.77163
## 30 -93.20235 -125.8015 -60.60319
## 31 -102.96051 -160.7042 -45.21677
## 32 -182.39983 -222.0483 -142.75134
## 33 -166.93549 -205.3431 -128.52790
## 34 -102.63962 -131.3436 -73.93562
## 35 -194.49288 -227.4769 -161.50888
## 36 -142.50545 -185.0056 -100.00534
## 37 -178.52201 -207.8089 -149.23515
## 38 -154.08219 -186.4553 -121.70905
c <- regsubsets(hipcenter ~ ., data=seatpos)
rsc <- summary(c)
rsc$which
## (Intercept) Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
AIC_hip <- nrow(seatpos)*log(rsc$rss/nrow(seatpos)) + (2:9)*2
AIC_hip
## [1] 275.0667 274.7798 274.2418 275.8291 277.6712 279.6389 281.6286 283.6240
The AIC is smallest with three predictors: Age, Ht and Leg. A smaller two-predictor model, Age + Ht, is fitted below for comparison.
new_lm <- lm(hipcenter ~ Age + Ht, data=seatpos)
summary(new_lm)
##
## Call:
## lm(formula = hipcenter ~ Age + Ht, data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.534 -23.028 2.131 24.994 53.939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 526.9589 92.2479 5.712 1.85e-06 ***
## Age 0.5211 0.3862 1.349 0.186
## Ht -4.2004 0.5313 -7.906 2.69e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.96 on 35 degrees of freedom
## Multiple R-squared: 0.6562, Adjusted R-squared: 0.6365
## F-statistic: 33.4 on 2 and 35 DF, p-value: 7.694e-09
The adjusted R-squared value is improved with this smaller model compared to the full model.
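For completeness, a sketch of the AIC-chosen three-predictor fit (Age, Ht and Leg), to compare against the two-predictor model above; lm_aic3 is a name introduced here:
# Adjusted R-squared of the AIC-selected three-predictor model
lm_aic3 <- lm(hipcenter ~ Age + Ht + Leg, data = seatpos)
summary(lm_aic3)$adj.r.squared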
Using the seatpos data, perform a PCR analysis with hipcenter as the response and HtShoes, Ht, Seated, Arm, Thigh, and Leg as predictors. Select an appropriate number of components and give an interpretation to those you choose. Add Age and Weight as predictors and repeat the analysis. Use both models to predict the response for predictors taking these values:
Principal component analysis (PCA) is a popular method for finding low-dimensional linear structure in higher dimensional data.
For observational data, many predictors are often substantially correlated. It would be nice if we could transform these predictors to orthogonality, as it could make interpretation easier.
par(mfrow=c(3,2))
plot(hipcenter ~ HtShoes,seatpos)
plot(hipcenter ~ Ht,seatpos)
plot(hipcenter ~ Seated,seatpos)
plot(hipcenter ~ Arm,seatpos)
plot(hipcenter ~ Thigh,seatpos)
plot(hipcenter ~ Leg,seatpos)
cor(seatpos[,c(3,4,5,6,7,8)])
## HtShoes Ht Seated Arm Thigh Leg
## HtShoes 1.0000000 0.9981475 0.9296751 0.7519530 0.7248622 0.9084334
## Ht 0.9981475 1.0000000 0.9282281 0.7521416 0.7349604 0.9097524
## Seated 0.9296751 0.9282281 1.0000000 0.6251964 0.6070907 0.8119143
## Arm 0.7519530 0.7521416 0.6251964 1.0000000 0.6710985 0.7538140
## Thigh 0.7248622 0.7349604 0.6070907 0.6710985 1.0000000 0.6495412
## Leg 0.9084334 0.9097524 0.8119143 0.7538140 0.6495412 1.0000000
There are some strong correlations (not surprisingly) in this correlation matrix.
cseat <- seatpos[,c(3,4,5,6,7,8)]
pr_seat <- prcomp(cseat)
summary(pr_seat)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 17.1573 2.89689 2.11907 1.56412 1.22502 0.46218
## Proportion of Variance 0.9453 0.02695 0.01442 0.00786 0.00482 0.00069
## Cumulative Proportion 0.9453 0.97222 0.98664 0.99450 0.99931 1.00000
Most of the variation is in PC1: its proportion of variance is 0.9453, so one dimension alone may give a sufficient model.
round(pr_seat$rotation[,1],2)
## HtShoes Ht Seated Arm Thigh Leg
## -0.65 -0.65 -0.27 -0.15 -0.17 -0.18
One interpretation: all six loadings are negative and of broadly similar sign (dominated by HtShoes and Ht), so the first PC acts as an overall body-size measure; smaller shoes, height, seated height, arms, thighs and legs correspond to a smaller horizontal distance of the midpoint of the hips from a fixed location in the car (in mm).
Like variances, principal components analysis can be very sensitive to outliers so it is essential to check for these. In addition to the usual graphical checks of the data, it is worth checking for outliers in higher dimensions.
Mahalanobis distance is a measure of the distance of a point from the mean that adjusts for the correlation in the data.
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.3
robseat <- cov.rob(cseat)
md <- mahalanobis(cseat, center=robseat$center, cov=robseat$cov)
n <- nrow(cseat);p <- ncol(cseat)
plot(qchisq(1:n/(n+1),p), sort(md), xlab=expression(paste(chi^2," quantiles")), ylab="Sorted Mahalanobis distances")
abline(0,1)
Let’s see how we can use PCA for regression. We might have a model y ~ X. We replace this by y ~ Z where typically we use only the first few columns of Z. This is known as principal components regression or PCR. The technique is used in two distinct ways for explanation and prediction purposes.
When the goal of the regression is to find simple, well-fitting and understandable models for the response, PCR may help. The PCs are linear combinations of the predictors.
lmodpcr <- lm(seatpos$hipcenter ~ pr_seat$x[,1])
summary(lmodpcr)
##
## Call:
## lm(formula = seatpos$hipcenter ~ pr_seat$x[, 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.009 -29.349 3.694 19.930 73.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -164.8849 5.9017 -27.939 < 2e-16 ***
## pr_seat$x[, 1] 2.7770 0.3486 7.966 1.85e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.38 on 36 degrees of freedom
## Multiple R-squared: 0.638, Adjusted R-squared: 0.628
## F-statistic: 63.46 on 1 and 36 DF, p-value: 1.853e-09
The PCR here has allowed us a meaningful explanation whereas the full predictor regression was opaque.
One objection to the previous analysis is that the PC still uses all the predictors, so there has been no saving in the number of variables needed to model the response. Furthermore, we must rely on our subjective interpretation of the meaning of the PCs. To answer this, one idea is to take a few representative predictors based on the largest coefficients seen in the PCs.
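A minimal sketch of that idea, keeping only the single predictor with the largest absolute loading on the first PC; top1 is a name introduced here:
# Pick the predictor with the largest |loading| on PC1 and regress on it alone
top1 <- names(which.max(abs(pr_seat$rotation[, 1])))
top1
summary(lm(reformulate(top1, response = "hipcenter"), data = seatpos))$adj.r.squared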
The linear rotations (or loadings) can be found in the rotation matrix. We plot these vectors against the predictor number.
matplot(1:6, pr_seat$rotation, type="l")
(Eigenvectors for the PCA of the data.) The solid black line corresponds to the first PC.
These vectors represent the linear combinations of the predictors that generate the PCs.
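As a concrete check of that statement, the PC scores can be reproduced by multiplying the centred predictors by the loadings; sc is a name introduced here:
# PC scores = centred predictor matrix %*% rotation (prcomp centres but does not scale by default)
sc <- scale(as.matrix(cseat), scale = FALSE) %*% pr_seat$rotation
all.equal(sc[1, 1], pr_seat$x[1, 1])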
Let’s try using the pls package, which contains several features that make the computation more straightforward.
require(pls)
## Loading required package: pls
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
trainseat <- seatpos[1:30,]
testseat <- seatpos[31:38,]
pcrmod <- pcr(hipcenter ~ ., data=trainseat, ncomp=8)
rmse <- function(x,y) sqrt(mean((x-y)^2))
rmse(predict(pcrmod), trainseat$hipcenter)
## [1] 29.07376
Not a particularly good fit. What is the optimal number of principal components?
plot(pr_seat$sdev, type="l", ylab="SD of PC", xlab="PC number")
We now determine how many PCs give the best results on the test set.
pcrmse <- RMSEP(pcrmod, newdata=testseat)
plot(pcrmse,main="")
which.min(pcrmse$val)
## [1] 6
pcrmse$val[6]
## [1] 58.22981
Fit a PLS model to the seatpos data with hipcenter as the response and all other variables as predictors. Take care to select an appropriate number of components. Use the model to predict the response at the values of the predictors specified in the first question.
Partial least squares (PLS) is a method for relating a set of input variables \(X_1, ..., X_m\) and outputs \(Y_1, ..., Y_l\). PLS regression is comparable to PCR in that both predict the response using some number of linear combinations of the predictors. The difference is that while PCR ignores Y in determining the linear combinations, PLS regression explicitly chooses them to predict Y as well as possible.
One criticism of PLS is that it solves no well-defined modeling problem, which makes it difficult to distinguish between the competing algorithms on theoretical rather than empirical grounds.
As with PCR, we must choose the number of components carefully, and cross-validation can be helpful in doing this. We compute the PLS fit for every possible number of components, setting the random generator seed to ensure reproducibility given the random division of the validation samples.
set.seed(123)
plsmod <- plsr(hipcenter ~ ., data=trainseat, ncomp=8, validation="CV")
coefplot(plsmod, ncomp=4, xlab="Frequency")
plsCV <- RMSEP(plsmod, estimate="CV")
plot(plsCV, main="")
The effective linear combination of the predictors looks broadly similar to the PCR loadings. Now we determine the performance on the training set for the 3-component model.
ypred <- predict(plsmod, ncomp=3)
rmse(ypred,trainseat$hipcenter)
## [1] 27.9832
ytpred <- predict(plsmod, testseat, ncomp=3)
rmse(ytpred,testseat$hipcenter)
PLS tends to have an advantage over PCR for prediction problems because PLS constructs its linear combinations explicitly to predict the response. On the other hand, PCR is better suited to developing insights by forming linear combinations that have interesting interpretations. Although both methods use the idea of dimension reduction, there is usually no reduction in the number of variables used, since every predictor contributes to the linear combinations.
Fit a ridge regression model to the seatpos data with hipcenter as the response and all other variables as predictors. Take care to select an appropriate amount of shrinkage. Use the model to predict the response at the values of the predictors specified in the first question.
Ridge regression makes the assumption that the regression coefficients (after normalization) should not be very large. This is a reasonable assumption in applications where you have a large number of predictors and you believe that many of them have some effect on the response. Hence shrinkage is embedded in the method. Ridge regression is particularly effective when the model matrix is collinear.
Ridge regression is an example of a penalized regression because of the presence of \(\lambda\).
require(MASS)
rgmod <- lm.ridge(hipcenter ~., trainseat, lambda = seq(0, 1, len=21))
matplot(rgmod$lambda, coef(rgmod), type="l", xlab=expression(lambda), ylab=expression(hat(beta)),col=1)
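One way to pick the amount of shrinkage is generalised cross-validation, which lm.ridge reports; a sketch (select() also prints the HKB and L-W estimates):
# Lambda minimising generalised cross-validation
rgmod$lambda[which.min(rgmod$GCV)]
MASS::select(rgmod)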