LMR, 11.1, 11.2, 11.3
7.1 The generated data set in this question is taken from Mantel (1970).
Case <- c(1:5)
Y <- c(5,6,8,9,11)
X1 <- c(1,200,-50,909,506)
X2 <- c(1004,806,1058,100,505)
X3 <- c(6,7.3,11,13,13.1)
mantel <- cbind(Case, Y, X1, X2, X3)
mantel
## Case Y X1 X2 X3
## [1,] 1 5 1 1004 6.0
## [2,] 2 6 200 806 7.3
## [3,] 3 8 -50 1058 11.0
## [4,] 4 9 909 100 13.0
## [5,] 5 11 506 505 13.1
Interest centers on using variable selection to choose a subset of the predictors to model Y. The data were generated such that the full model:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + e\]
is a valid model for the data.
Output from R associated with different variable selection procedures based on model 7.8 appears below.
cor(mantel[,3:5])
## X1 X2 X3
## X1 1.0000000 -0.9999887 0.6858141
## X2 -0.9999887 1.0000000 -0.6826107
## X3 0.6858141 -0.6826107 1.0000000
mantel <- as.data.frame(mantel)
lm1 <- lm(Y ~ X3, mantel)
lm2 <- lm(Y ~ X1 + X2, mantel)
lm3 <- lm(Y ~ X1 + X2 + X3, mantel)
sum_lm1 <- summary(lm1)
sum_lm2 <- summary(lm2)
sum_lm3 <- summary(lm3)
adjr2 <- c(sum_lm1$adj.r.squared, sum_lm2$adj.r.squared, sum_lm3$adj.r.squared)
x <- seq(1,3,by=1)
plot(adjr2 ~ x, xlab = "Subset Size", ylab = "Statistic: adjr2")
Table 7.4 gives the values of \(R^2_{adj}\), AIC and BIC for the best subset of each size.
Of note: for some reason the table below produces results that differ from the table in the book.
Predictors <- c("X3", "X1, X2", "X1, X2, X3")
AIC_col <- c(AIC(lm1, k=2), AIC(lm2, k=2), AIC(lm3, k=2))
BIC_col <- c(AIC(lm1, k=log(nrow(mantel))), AIC(lm2, k=log(nrow(mantel))), AIC(lm3, k=log(nrow(mantel))))
allsubsets <- cbind(x, Predictors, adjr2, AIC_col, BIC_col)
allsubsets
## x Predictors adjr2 AIC_col
## [1,] "1" "X3" "0.876484701848241" "15.8806367928623"
## [2,] "2" "X1, X2" "1" "-271.559992520695"
## [3,] "3" "X1, X2, X3" "1" "-269.578999741786"
## BIC_col
## [1,] "14.7089505301646"
## [2,] "-273.122240870959"
## [3,] "-271.531810179616"
In theory, the size-2 subset (X1, X2) should be best, since it has the highest adjusted R-squared and the lowest AIC and BIC.
Reference: http://pages.pomona.edu/~jsh04747/courses/math158/multlin4.pdf
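As a cross-check, the best subset of each size could also be found exhaustively with regsubsets from the leaps package (used again for the prostate data later). A minimal sketch — bs_man is just a name introduced here, and with only five cases and nearly collinear predictors the fit statistics are extreme, but the ranking of subsets can still be compared:
library(leaps)
# Exhaustive search over the subsets of X1, X2, X3
bs_man <- regsubsets(Y ~ X1 + X2 + X3, data = mantel)
summary(bs_man)$which   # best subset of each size
summary(bs_man)$adjr2   # adjusted R-squared per size
summary(bs_man)$bic     # BIC (relative to the null model) per size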
attach(mantel)
## The following objects are masked _by_ .GlobalEnv:
##
## Case, X1, X2, X3, Y
slm3 <- step(lm(Y ~ 1), Y ~ X1 + X2 + X3, direction="forward")
## Start: AIC=9.59
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + X3 1 20.6879 2.1121 -0.3087
## + X1 1 8.6112 14.1888 9.2151
## + X2 1 8.5064 14.2936 9.2519
## <none> 22.8000 9.5866
##
## Step: AIC=-0.31
## Y ~ X3
##
## Df Sum of Sq RSS AIC
## <none> 2.1121 -0.30875
## + X2 1 0.066328 2.0458 1.53172
## + X1 1 0.064522 2.0476 1.53613
slm3
##
## Call:
## lm(formula = Y ~ X3)
##
## Coefficients:
## (Intercept) X3
## 0.7975 0.6947
BIC
slm4 <- step(lm(Y ~ 1), Y ~ X1 + X2 + X3, direction="forward", k = log(nrow(mantel)))
## Start: AIC=9.2
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + X3 1 20.6879 2.1121 -1.0899
## + X1 1 8.6112 14.1888 8.4339
## + X2 1 8.5064 14.2936 8.4707
## <none> 22.8000 9.1961
##
## Step: AIC=-1.09
## Y ~ X3
##
## Df Sum of Sq RSS AIC
## <none> 2.1121 -1.08987
## + X2 1 0.066328 2.0458 0.36003
## + X1 1 0.064522 2.0476 0.36444
slm4
##
## Call:
## lm(formula = Y ~ X3)
##
## Coefficients:
## (Intercept) X3
## 0.7975 0.6947
The forward searches settle on Y ~ X3, while the all-subsets comparison above favours X1 + X2. I suspect that, because of the "one at a time" nature of adding or dropping variables, stepwise selection can miss the optimal model. Reference: http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf
Stepwise variable selection tends to pick models that are smaller than desirable for prediction purposes. To give a simple example, consider the simple regression with just one predictor variable. Suppose that the slope for this predictor is not quite statistically significant. We might not have enough evidence to say that it is related to y but it still might be better to use it for predictive purposes.
It is unclear to me which of the two models I would support over the other.
The real data in this question first appeared in Hald (1952). The data are given in Table 7.5.
Y <- c(78.5,74.3,104.3,87.6,95.9,109.2,102.7,72.5,93.1,115.9,83.8,113.3,109.4)
x1 <- c(7,1,11,11,7,11,3,1,2,21,1,11,10)
x2 <- c(26,29,56,31,52,55,71,31,54,47,40,66,68)
x3 <- c(6,15,8,8,6,9,17,22,18,4,23,9,8)
x4 <- c(60,52,20,47,33,22,6,44,22,26,34,12,12)
hald <- cbind(Y, x1, x2, x3, x4)
hald
## Y x1 x2 x3 x4
## [1,] 78.5 7 26 6 60
## [2,] 74.3 1 29 15 52
## [3,] 104.3 11 56 8 20
## [4,] 87.6 11 31 8 47
## [5,] 95.9 7 52 6 33
## [6,] 109.2 11 55 9 22
## [7,] 102.7 3 71 17 6
## [8,] 72.5 1 31 22 44
## [9,] 93.1 2 54 18 22
## [10,] 115.9 21 47 4 26
## [11,] 83.8 1 40 23 34
## [12,] 113.3 11 66 9 12
## [13,] 109.4 10 68 8 12
Interest centers on using variable selection to choose a subset of the predictors to model Y. Throughout this question we shall assume that the full model below is a valid model for the data.
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + e\]
Output from R associated with different variable selection procedures based on model (7.9) appears on the following pages:
cor(hald[,c(2:5)])
## x1 x2 x3 x4
## x1 1.0000000 0.2285795 -0.8241338 -0.2454451
## x2 0.2285795 1.0000000 -0.1392424 -0.9729550
## x3 -0.8241338 -0.1392424 1.0000000 0.0295370
## x4 -0.2454451 -0.9729550 0.0295370 1.0000000
hald <- as.data.frame(hald)
lm1 <- lm(Y ~ x4, hald)
lm2 <- lm(Y ~ x1 + x2, hald)
lm3 <- lm(Y ~ x1 + x2 + x4, hald)
lm4 <- lm(Y ~ x1 + x2 + x3 + x4, hald)
sum_lm1 <- summary(lm1)
sum_lm2 <- summary(lm2)
sum_lm3 <- summary(lm3)
sum_lm4 <- summary(lm4)
adjr2 <- c(sum_lm1$adj.r.squared, sum_lm2$adj.r.squared, sum_lm3$adj.r.squared, sum_lm4$adj.r.squared)
x <- seq(1,4,by=1)
plot(adjr2 ~ x, xlab = "Subset Size", ylab = "Statistic: adjr2")
Predictors <- c("X4", "X1, X2", "X1, X2, X4", "X1, X2, X3, X4")
AIC_col <- c(AIC(lm1, k=2), AIC(lm2, k=2), AIC(lm3, k=2), AIC(lm4, k=2))
BIC_col <- c(AIC(lm1, k=log(nrow(hald))), AIC(lm2, k=log(nrow(hald))), AIC(lm3, k=log(nrow(hald))), AIC(lm4, k=log(nrow(hald))))
allsubsets <- cbind(x, Predictors, adjr2, AIC_col, BIC_col)
allsubsets
## x Predictors adjr2 AIC_col
## [1,] "1" "X4" "0.644954869961756" "97.7440447788562"
## [2,] "2" "X1, X2" "0.974414049442758" "64.3123927621906"
## [3,] "3" "X1, X2, X4" "0.976447268267236" "63.8662854718626"
## [4,] "4" "X1, X2, X3, X4" "0.97356343061152" "65.8366897916517"
## BIC_col
## [1,] "96.5723585161585"
## [2,] "62.750144411927"
## [3,] "61.9134750340331"
## [4,] "63.4933172662563"
The size-2 or size-3 subset would be quite reasonable here.
attach(hald)
## The following objects are masked _by_ .GlobalEnv:
##
## x1, x2, x3, x4, Y
## The following object is masked from mantel:
##
## Y
slm3 <- step(lm(Y ~ 1), Y ~ x1 + x2 + x3 + x4, direction="forward")
## Start: AIC=71.44
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 58.852
## + x2 1 1809.43 906.34 59.178
## + x1 1 1450.08 1265.69 63.519
## + x3 1 776.36 1939.40 69.067
## <none> 2715.76 71.444
##
## Step: AIC=58.85
## Y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 28.742
## + x3 1 708.13 175.74 39.853
## <none> 883.87 58.852
## + x2 1 14.99 868.88 60.629
##
## Step: AIC=28.74
## Y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 24.974
## + x3 1 23.926 50.836 25.728
## <none> 74.762 28.742
##
## Step: AIC=24.97
## Y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 24.974
## + x3 1 0.10909 47.864 26.944
slm3
##
## Call:
## lm(formula = Y ~ x4 + x1 + x2)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
BIC
slm4 <- step(lm(Y ~ 1), Y ~ x1 + x2 + x3 + x4, direction="forward", k = log(nrow(hald)))
## Start: AIC=72.01
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 59.982
## + x2 1 1809.43 906.34 60.308
## + x1 1 1450.08 1265.69 64.649
## + x3 1 776.36 1939.40 70.197
## <none> 2715.76 72.009
##
## Step: AIC=59.98
## Y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 30.437
## + x3 1 708.13 175.74 41.547
## <none> 883.87 59.982
## + x2 1 14.99 868.88 62.324
##
## Step: AIC=30.44
## Y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 27.234
## + x3 1 23.926 50.836 27.987
## <none> 74.762 30.437
##
## Step: AIC=27.23
## Y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 27.234
## + x3 1 0.10909 47.864 29.769
slm4
##
## Call:
## lm(formula = Y ~ x4 + x1 + x2)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
It looks like the three-predictor model (x1, x2, x4) works best here.
AIC
slm5 <- step(lm(Y ~ x1 + x2 + x3 + x4), Y ~ x1 + x2 + x3 + x4, direction="backward")
## Start: AIC=26.94
## Y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 24.974
## - x4 1 0.2470 48.111 25.011
## - x2 1 2.9725 50.836 25.728
## <none> 47.864 26.944
## - x1 1 25.9509 73.815 30.576
##
## Step: AIC=24.97
## Y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
slm5
##
## Call:
## lm(formula = Y ~ x1 + x2 + x4)
##
## Coefficients:
## (Intercept) x1 x2 x4
## 71.6483 1.4519 0.4161 -0.2365
slm6 <- step(lm(Y ~ x1 + x2 + x3 + x4), Y ~ x1 + x2 + x3 + x4, direction="backward", k = log(nrow(hald)))
## Start: AIC=29.77
## Y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 27.234
## - x4 1 0.2470 48.111 27.271
## - x2 1 2.9725 50.836 27.987
## <none> 47.864 29.769
## - x1 1 25.9509 73.815 32.836
##
## Step: AIC=27.23
## Y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 9.93 57.90 27.115
## <none> 47.97 27.234
## - x2 1 26.79 74.76 30.437
## - x1 1 820.91 868.88 62.324
##
## Step: AIC=27.11
## Y ~ x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 57.90 27.115
## - x1 1 848.43 906.34 60.308
## - x2 1 1207.78 1265.69 64.649
slm6
##
## Call:
## lm(formula = Y ~ x1 + x2)
##
## Coefficients:
## (Intercept) x1 x2
## 52.5773 1.4683 0.6623
The two- or three-predictor model would do well here.
Use the prostate data with lpsa as the response and the other variables as predictors. Implement the following variable selection methods to determine the "best" model:
require(faraway)
## Loading required package: faraway
data(prostate)
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
## 4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
## 5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
## 6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
b <- colnames(prostate[,-(9)])
a <- lm(lpsa ~ ., data=prostate)
attach(prostate)
backward_lm <- step(a, lpsa ~ ., direction="backward")
## Start: AIC=-58.32
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 1 0.0412 44.204 -60.231
## - pgg45 1 0.5258 44.689 -59.174
## - lcp 1 0.6740 44.837 -58.853
## <none> 44.163 -58.322
## - age 1 1.5503 45.713 -56.975
## - lbph 1 1.6835 45.847 -56.693
## - lweight 1 3.5861 47.749 -52.749
## - svi 1 4.9355 49.099 -50.046
## - lcavol 1 22.3721 66.535 -20.567
##
## Step: AIC=-60.23
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6623 44.867 -60.789
## <none> 44.204 -60.231
## - pgg45 1 1.1920 45.396 -59.650
## - age 1 1.5166 45.721 -58.959
## - lbph 1 1.7053 45.910 -58.560
## - lweight 1 3.5462 47.750 -54.746
## - svi 1 4.8984 49.103 -52.037
## - lcavol 1 23.5039 67.708 -20.872
##
## Step: AIC=-60.79
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6590 45.526 -61.374
## <none> 44.867 -60.789
## - age 1 1.2649 46.131 -60.092
## - lbph 1 1.6465 46.513 -59.293
## - lweight 1 3.5647 48.431 -55.373
## - svi 1 4.2503 49.117 -54.009
## - lcavol 1 25.4189 70.285 -19.248
##
## Step: AIC=-61.37
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 45.526 -61.374
## - age 1 0.9592 46.485 -61.352
## - lbph 1 1.8568 47.382 -59.497
## - lweight 1 3.2251 48.751 -56.735
## - svi 1 5.9517 51.477 -51.456
## - lcavol 1 28.7665 74.292 -15.871
backward_lm
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 0.95100 0.56561 0.42369 -0.01489 0.11184
## svi
## 0.72095
Final model from backward step elimination via AIC: lpsa ~ lcavol + lweight + age + lbph + svi
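A BIC-penalised backward search could be run the same way. A sketch only — backward_bic is a name introduced here and the output is not shown:
# Hypothetical BIC analogue of the backward elimination above
backward_bic <- step(a, lpsa ~ ., direction="backward", k = log(nrow(prostate)), trace = 0)
backward_bic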
require(leaps)
## Loading required package: leaps
b <- regsubsets(lpsa ~ ., data=prostate)
rs <- summary(b)
rs$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
\[AIC = -2L(\theta) + 2p\] \[-2L(\theta) = n \log\left(\frac{RSS}{n}\right)\]
AIC <- nrow(prostate)*log(rs$rss/nrow(prostate)) + (2:9)*2
plot(AIC ~ I(1:8), ylab="AIC", xlab="Number of Predictors")
AIC
## [1] -44.36608 -52.69043 -60.67621 -61.35179 -61.37439 -60.78867 -60.23130
## [8] -58.32184
Five predictors gives the lowest AIC, corresponding to the subset lcavol, lweight, age, lbph, svi (plus the intercept).
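The same RSS-based formula gives BIC by swapping the 2p penalty for p log(n). A sketch — BIC_best is a name introduced here:
# BIC for the best subset of each size, penalty log(n) per parameter
BIC_best <- nrow(prostate)*log(rs$rss/nrow(prostate)) + (2:9)*log(nrow(prostate))
which.min(BIC_best)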
plot(2:9, rs$adjr2, xlab="No. of Parameters", ylab="Adjusted R-square")
which.max(rs$adjr2)
## [1] 7
Adjusted R-squared instead favours the seven-predictor model. Another option is Mallow’s Cp statistic. A good model should predict well, so the average mean square error of prediction might be a good criterion:
\[\frac{1}{\sigma^2}\sum_{i} E\left(\hat{y}_i - E y_i\right)^2\]
which can be estimated by the Cp statistic:
\[C_p = \frac{RSS_p}{\hat \sigma^2} + 2p - n\]
where \(\hat \sigma^2\) is from the model with all predictors and \(RSS_p\) indicates the RSS from a model with p parameters. For the full model, Cp = p exactly. A model with a bad fit will have a \(C_p\) much bigger than p. It is usual to plot \(C_p\) against p. We desire models with small p and \(C_p\) around or less than p. \(C_p\), \(R^2_a\), and AIC all trade-off fit in terms of RSS against complexity (p).
plot(2:9, rs$cp, xlab="No. of Parameters", ylab="Cp Statistic")
abline(0,1)
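The Cp values can also be compared with p numerically; a quick sketch, where p counts the intercept to match the plot:
# Tabulate Cp against the parameter count p
cbind(p = 2:9, Cp = rs$cp)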
The six-predictor model appears to fit best here.
Use the seatpos data with hipcenter as the response.
data(seatpos,package="faraway")
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150
seatpos_lm <- lm(hipcenter ~ ., data=seatpos)
summary(seatpos_lm)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
It seems that none of the individual coefficients is statistically significant. The Leg coefficient suggests that each one-unit increase in leg length decreases hipcenter by about 6.44 units.
predict(seatpos_lm, interval = "confidence")
## fit lwr upr
## 1 -230.82470 -263.2205 -198.42890
## 2 -158.22231 -199.0855 -117.35914
## 3 -96.85463 -131.7243 -61.98494
## 4 -255.78273 -294.7597 -216.80574
## 5 -188.59572 -233.3630 -143.82845
## 6 -186.02614 -214.1290 -157.92325
## 7 -153.98285 -189.6664 -118.29935
## 8 -244.79086 -286.3104 -203.27128
## 9 -139.71030 -173.4665 -105.95404
## 10 -112.98566 -148.1345 -77.83680
## 11 -163.72509 -186.4322 -141.01801
## 12 -89.14799 -120.0194 -58.27658
## 13 -194.10261 -249.8912 -138.31401
## 14 -128.43355 -157.7773 -99.08975
## 15 -186.44972 -226.2569 -146.64258
## 16 -177.90902 -217.5495 -138.26853
## 17 -201.58090 -250.4299 -152.73188
## 18 -98.43069 -141.8463 -55.01511
## 19 -145.80244 -174.2207 -117.38415
## 20 -167.75364 -199.5743 -135.93300
## 21 -178.41491 -214.6476 -142.18225
## 22 -279.07627 -336.5054 -221.64716
## 23 -245.56763 -285.6346 -205.50071
## 24 -81.55529 -114.4871 -48.62343
## 25 -141.13605 -167.5849 -114.68722
## 26 -222.49965 -247.7190 -197.28026
## 27 -156.83929 -184.1675 -129.51112
## 28 -128.68145 -170.6894 -86.67351
## 29 -193.00256 -225.2335 -160.77163
## 30 -93.20235 -125.8015 -60.60319
## 31 -102.96051 -160.7042 -45.21677
## 32 -182.39983 -222.0483 -142.75134
## 33 -166.93549 -205.3431 -128.52790
## 34 -102.63962 -131.3436 -73.93562
## 35 -194.49288 -227.4769 -161.50888
## 36 -142.50545 -185.0056 -100.00534
## 37 -178.52201 -207.8089 -149.23515
## 38 -154.08219 -186.4553 -121.70905
c <- regsubsets(hipcenter ~ ., data=seatpos)
rsc <- summary(c)
rsc$which
## (Intercept) Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
AIC_hip <- nrow(seatpos)*log(rsc$rss/nrow(seatpos)) + (2:9)*2
AIC_hip
## [1] 275.0667 274.7798 274.2418 275.8291 277.6712 279.6389 281.6286 283.6240
The AIC is smallest with three predictors: Age, Ht and Leg. A smaller two-predictor model, Age + Ht, is fitted below for comparison.
new_lm <- lm(hipcenter ~ Age + Ht, data=seatpos)
summary(new_lm)
##
## Call:
## lm(formula = hipcenter ~ Age + Ht, data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.534 -23.028 2.131 24.994 53.939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 526.9589 92.2479 5.712 1.85e-06 ***
## Age 0.5211 0.3862 1.349 0.186
## Ht -4.2004 0.5313 -7.906 2.69e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.96 on 35 degrees of freedom
## Multiple R-squared: 0.6562, Adjusted R-squared: 0.6365
## F-statistic: 33.4 on 2 and 35 DF, p-value: 7.694e-09
The adjusted R-squared value is improved with this smaller model compared to the full model.
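For completeness, a sketch of the AIC-chosen three-predictor fit (Age, Ht and Leg), to compare against the two-predictor model above; lm_aic3 is a name introduced here:
# Adjusted R-squared of the AIC-selected three-predictor model
lm_aic3 <- lm(hipcenter ~ Age + Ht + Leg, data = seatpos)
summary(lm_aic3)$adj.r.squared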
Using the seatpos data, perform a PCR analysis with hipcenter as the response and HtShoes, Ht, Seated, Arm, Thigh, and Leg as predictors. Select an appropriate number of components and give an interpretation to those you choose. Add Age and Weight as predictors and repeat the analysis. Use both models to predict the response for predictors taking these values:
Principal component analysis (PCA) is a popular method for finding low-dimensional linear structure in higher dimensional data.
For observational data, many predictors are often substantially correlated. It would be nice if we could transform these predictors to orthogonality, as it could make interpretation easier.
par(mfrow=c(3,2))
plot(hipcenter ~ HtShoes,seatpos)
plot(hipcenter ~ Ht,seatpos)
plot(hipcenter ~ Seated,seatpos)
plot(hipcenter ~ Arm,seatpos)
plot(hipcenter ~ Thigh,seatpos)
plot(hipcenter ~ Leg,seatpos)
cor(seatpos[,c(3,4,5,6,7,8)])
## HtShoes Ht Seated Arm Thigh Leg
## HtShoes 1.0000000 0.9981475 0.9296751 0.7519530 0.7248622 0.9084334
## Ht 0.9981475 1.0000000 0.9282281 0.7521416 0.7349604 0.9097524
## Seated 0.9296751 0.9282281 1.0000000 0.6251964 0.6070907 0.8119143
## Arm 0.7519530 0.7521416 0.6251964 1.0000000 0.6710985 0.7538140
## Thigh 0.7248622 0.7349604 0.6070907 0.6710985 1.0000000 0.6495412
## Leg 0.9084334 0.9097524 0.8119143 0.7538140 0.6495412 1.0000000
There are some strong correlations (not surprisingly) in this correlation matrix.
cseat <- seatpos[,c(3,4,5,6,7,8)]
pr_seat <- prcomp(cseat)
summary(pr_seat)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 17.1573 2.89689 2.11907 1.56412 1.22502 0.46218
## Proportion of Variance 0.9453 0.02695 0.01442 0.00786 0.00482 0.00069
## Cumulative Proportion 0.9453 0.97222 0.98664 0.99450 0.99931 1.00000
Most of the variation is in PC1: its proportion of variance is 0.9453, so one dimension alone may give a sufficient model.
round(pr_seat$rotation[,1],2)
## HtShoes Ht Seated Arm Thigh Leg
## -0.65 -0.65 -0.27 -0.15 -0.17 -0.18
One interpretation: all six loadings are negative and of broadly similar sign (dominated by HtShoes and Ht), so the first PC acts as an overall body-size measure; smaller shoes, height, seated height, arms, thighs and legs correspond to a smaller horizontal distance of the midpoint of the hips from a fixed location in the car (in mm).
Like variances, principal components analysis can be very sensitive to outliers so it is essential to check for these. In addition to the usual graphical checks of the data, it is worth checking for outliers in higher dimensions.
Mahalanobis distance is a measure of the distance of a point from the mean that adjusts for the correlation in the data.
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.3
robseat <- cov.rob(cseat)
md <- mahalanobis(cseat, center=robseat$center, cov=robseat$cov)
n <- nrow(cseat);p <- ncol(cseat)
plot(qchisq(1:n/(n+1),p), sort(md), xlab=expression(paste(chi^2," quantiles")), ylab="Sorted Mahalanobis distances")
abline(0,1)
Let’s see how we can use PCA for regression. We might have a model y ~ X. We replace this by y ~ Z where typically we use only the first few columns of Z. This is known as principal components regression or PCR. The technique is used in two distinct ways for explanation and prediction purposes.
When the goal of the regression is to find simple, well-fitting and understandable models for the response, PCR may help. The PCs are linear combinations of the predictors.
lmodpcr <- lm(seatpos$hipcenter ~ pr_seat$x[,1])
summary(lmodpcr)
##
## Call:
## lm(formula = seatpos$hipcenter ~ pr_seat$x[, 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.009 -29.349 3.694 19.930 73.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -164.8849 5.9017 -27.939 < 2e-16 ***
## pr_seat$x[, 1] 2.7770 0.3486 7.966 1.85e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.38 on 36 degrees of freedom
## Multiple R-squared: 0.638, Adjusted R-squared: 0.628
## F-statistic: 63.46 on 1 and 36 DF, p-value: 1.853e-09
The PCR here has allowed us a meaningful explanation whereas the full predictor regression was opaque.
One objection to the previous analysis is that the PC still uses all the predictors, so there has been no saving in the number of variables needed to model the response. Furthermore, we must rely on our subjective interpretation of the meaning of the PCs. To answer this, one idea is to take a few representative predictors based on the largest coefficients seen in the PCs.
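A minimal sketch of that idea, keeping only the single predictor with the largest absolute loading on the first PC; top1 is a name introduced here:
# Pick the predictor with the largest |loading| on PC1 and regress on it alone
top1 <- names(which.max(abs(pr_seat$rotation[, 1])))
top1
summary(lm(reformulate(top1, response = "hipcenter"), data = seatpos))$adj.r.squared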
The linear rotations (or loadings) can be found in the rotation matrix. We plot these vectors against the predictor number.
matplot(1:6, pr_seat$rotation, type="l")
(Eigenvectors for the PCA of the data.) The solid black line corresponds to the first PC.
These vectors represent the linear combinations of the predictors that generate the PCs.
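As a concrete check of that statement, the PC scores can be reproduced by multiplying the centred predictors by the loadings; sc is a name introduced here:
# PC scores = centred predictor matrix %*% rotation (prcomp centres but does not scale by default)
sc <- scale(as.matrix(cseat), scale = FALSE) %*% pr_seat$rotation
all.equal(sc[1, 1], pr_seat$x[1, 1])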
Let’s try using the pls package, which contains several features that make the computation more straightforward.
require(pls)
## Loading required package: pls
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
trainseat <- seatpos[1:30,]
testseat <- seatpos[31:38,]
pcrmod <- pcr(hipcenter ~ ., data=trainseat, ncomp=8)
rmse <- function(x,y) sqrt(mean((x-y)^2))
rmse(predict(pcrmod), trainseat$hipcenter)
## [1] 29.07376
Not a particularly good fit. What is the optimal number of principal components?
plot(pr_seat$sdev, type="l", ylab="SD of PC", xlab="PC number")
We now determine how many PCs give the best results on the test set.
pcrmse <- RMSEP(pcrmod, newdata=testseat)
plot(pcrmse,main="")
which.min(pcrmse$val)
## [1] 6
pcrmse$val[6]
## [1] 58.22981
Fit a PLS model to the seatpos data with hipcenter as the response and all other variables as predictors. Take care to select an appropriate number of components. Use the model to predict the response at the values of the predictors specified in the first question.
Partial least squares (PLS) is a method for relating a set of input variables \(X_1, ..., X_m\) and outputs \(Y_1, ..., Y_l\). PLS regression is comparable to PCR in that both predict the response using some number of linear combinations of the predictors. The difference is that while PCR ignores Y in determining the linear combinations, PLS regression explicitly chooses them to predict Y as well as possible.
One criticism of PLS is that it solves no well-defined modeling problem, which makes it difficult to distinguish between the competing algorithms on theoretical rather than empirical grounds.
As with PCR, we must choose the number of components carefully, and cross-validation can be helpful in doing this. We compute the PLS fit for every possible number of components, setting the random generator seed to ensure reproducibility given the random division of the validation samples.
set.seed(123)
plsmod <- plsr(hipcenter ~ ., data=trainseat, ncomp=8, validation="CV")
coefplot(plsmod, ncomp=4, xlab="Frequency")
plsCV <- RMSEP(plsmod, estimate="CV")
plot(plsCV, main="")
The effective linear combination of the predictors looks broadly similar to the PCR loadings. Now we determine the performance on the training set for the 3-component model.
ypred <- predict(plsmod, ncomp=3)
rmse(ypred,trainseat$hipcenter)
## [1] 27.9832
ytpred <- predict(plsmod, testseat, ncomp=3)
rmse(ytpred,testseat$hipcenter)
PLS tends to have an advantage over PCR for prediction problems because PLS constructs its linear combinations explicitly to predict the response. On the other hand, PCR is better suited to developing insights by forming linear combinations that have interesting interpretations. Although both methods use the idea of dimension reduction, there is usually no reduction in the number of variables used, since every predictor contributes to the linear combinations.
Fit a ridge regression model to the seatpos data with hipcenter as the response and all other variables as predictors. Take care to select an appropriate amount of shrinkage. Use the model to predict the response at the values of the predictors specified in the first question.
Ridge regression makes the assumption that the regression coefficients (after normalization) should not be very large. This is a reasonable assumption in applications where you have a large number of predictors and you believe that many of them have some effect on the response. Hence shrinkage is embedded in the method. Ridge regression is particularly effective when the model matrix is collinear.
Ridge regression is an example of a penalized regression because of the presence of \(\lambda\).
require(MASS)
rgmod <- lm.ridge(hipcenter ~., trainseat, lambda = seq(0, 1, len=21))
matplot(rgmod$lambda, coef(rgmod), type="l", xlab=expression(lambda), ylab=expression(hat(beta)),col=1)
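One way to pick the amount of shrinkage is generalised cross-validation, which lm.ridge reports; a sketch (select() also prints the HKB and L-W estimates):
# Lambda minimising generalised cross-validation
rgmod$lambda[which.min(rgmod$GCV)]
MASS::select(rgmod)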