6.5.1 Best Subset Selection

Data preprocessing:

library(ISLR)
str(Hitters)
'data.frame':   322 obs. of  20 variables:
 $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
 $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
 $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
 $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
 $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
 $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
 $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
 $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
 $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
 $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
 $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
 $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
 $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
 $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
 $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
 $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
 $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
 $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
 $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
 $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
dim(Hitters)
[1] 322  20
sum(is.na(Hitters$Salary))
[1] 59
Hitters <- na.omit(Hitters)
dim(Hitters)
[1] 263  20

Step (2) of Algorithm 6.1:

library(leaps)
regfit.full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters)
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN
1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*"  " "    " "    
7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " "  " "    " "    
8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " "  "*"    " "    
         DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 ) " "       " "     " "     " "    " "       
2  ( 1 ) " "       " "     " "     " "    " "       
3  ( 1 ) " "       "*"     " "     " "    " "       
4  ( 1 ) "*"       "*"     " "     " "    " "       
5  ( 1 ) "*"       "*"     " "     " "    " "       
6  ( 1 ) "*"       "*"     " "     " "    " "       
7  ( 1 ) "*"       "*"     " "     " "    " "       
8  ( 1 ) "*"       "*"     " "     " "    " "       

Step (3) of Algorithm 6.1:

regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters))
reg.summary <- summary(regfit.full)
names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"   
reg.summary$rsq
 [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227 0.5285569 0.5346124
[10] 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164 0.5454692 0.5457656 0.5459518 0.5460945
[19] 0.5461159

Here we take \(R^2\) as the criterion for deciding which of \(M_0, \dots, M_p\) is the best model.

Plot the result:

par(mfrow=c(2,2))
plot(reg.summary$rss, xlab = 'Number of variables', ylab = 'RSS', type = 'l')
plot(reg.summary$adjr2, xlab = 'Number of variables', ylab = 'Adjusted Rsq', type = 'l')
rsq.max.idx <- which.max(reg.summary$adjr2)
points(rsq.max.idx, reg.summary$adjr2[rsq.max.idx], col = 'red', cex = 2, pch = 20)
plot(reg.summary$cp, xlab = 'Number of variables', ylab = 'Cp', type = 'l')
cp.min.idx <- which.min(reg.summary$cp)
points(cp.min.idx, reg.summary$cp[cp.min.idx], col = 'red', cex = 2, pch = 20)
bic.min.idx <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = 'Number of variables', ylab = 'BIC', type = 'l')
points(bic.min.idx, reg.summary$bic[bic.min.idx], col = 'red', cex = 2, pch = 20)

Plot with regsubsets()’s built-in function:

plot(regfit.full, scale = 'r2')

plot(regfit.full, scale = 'adjr2')

plot(regfit.full, scale = 'Cp')

plot(regfit.full, scale = 'bic')

These plots show which predictors are selected in the best model under each criterion.

For example, in the top row of the second plot (adjusted \(R^2\)) above, where adjr2 reaches its maximum value of about 0.52, the variables marked in black include AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts, the same variables marked in the 6th row of the summary(regfit.full) output.

Get the coefficients (\(\beta_i, i \in [1..p]\)) of equation (6.1):

coef(regfit.full, 6)
 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW      PutOuts 
  91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338    0.2643076 

6.5.2 Forward and Backward Stepwise Selection

regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), method = 'forward')
summary(regfit.fwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), 
    method = "forward")
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"  " "    " "    
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*"  " "    " "    
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*"  "*"    " "    
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   "*"  "*"    " "    
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    " "    
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    " "    
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"    
16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"    
17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"    
18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*"  "*"    "*"    
19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*"  "*"    "*"    
          DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "       " "     " "     " "    " "       
2  ( 1 )  " "       " "     " "     " "    " "       
3  ( 1 )  " "       "*"     " "     " "    " "       
4  ( 1 )  "*"       "*"     " "     " "    " "       
5  ( 1 )  "*"       "*"     " "     " "    " "       
6  ( 1 )  "*"       "*"     " "     " "    " "       
7  ( 1 )  "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"       "*"     "*"     " "    " "       
12  ( 1 ) "*"       "*"     "*"     " "    " "       
13  ( 1 ) "*"       "*"     "*"     "*"    " "       
14  ( 1 ) "*"       "*"     "*"     "*"    " "       
15  ( 1 ) "*"       "*"     "*"     "*"    " "       
16  ( 1 ) "*"       "*"     "*"     "*"    " "       
17  ( 1 ) "*"       "*"     "*"     "*"    "*"       
18  ( 1 ) "*"       "*"     "*"     "*"    "*"       
19  ( 1 ) "*"       "*"     "*"     "*"    "*"       
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), method = 'backward')
summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), 
    method = "backward")
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"   " "  " "    " "    
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " "  " "    " "    
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " "  " "    " "    
4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " "  " "    " "    
5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " "  " "    " "    
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " "  " "    " "    
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " "  "*"    " "    
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   "*"  "*"    " "    
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    " "    
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    " "    
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"  "*"    "*"    
15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"    
16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"    
17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*"  "*"    "*"    
18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*"  "*"    "*"    
19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*"  "*"    "*"    
          DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "       " "     " "     " "    " "       
2  ( 1 )  " "       " "     " "     " "    " "       
3  ( 1 )  " "       "*"     " "     " "    " "       
4  ( 1 )  " "       "*"     " "     " "    " "       
5  ( 1 )  " "       "*"     " "     " "    " "       
6  ( 1 )  "*"       "*"     " "     " "    " "       
7  ( 1 )  "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"       "*"     "*"     " "    " "       
12  ( 1 ) "*"       "*"     "*"     " "    " "       
13  ( 1 ) "*"       "*"     "*"     "*"    " "       
14  ( 1 ) "*"       "*"     "*"     "*"    " "       
15  ( 1 ) "*"       "*"     "*"     "*"    " "       
16  ( 1 ) "*"       "*"     "*"     "*"    " "       
17  ( 1 ) "*"       "*"     "*"     "*"    "*"       
18  ( 1 ) "*"       "*"     "*"     "*"    "*"       
19  ( 1 ) "*"       "*"     "*"     "*"    "*"       

Which predictors are in \(M_k\) (\(k=7\) here), and what are their coefficients \(\beta_i\)?

coef(regfit.full, 7)
 (Intercept)         Hits        Walks       CAtBat        CHits       CHmRun    DivisionW 
  79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073    1.4420538 -129.9866432 
     PutOuts 
   0.2366813 
coef(regfit.fwd, 7)
 (Intercept)        AtBat         Hits        Walks         CRBI       CWalks    DivisionW 
 109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622   -0.3053070 -127.1223928 
     PutOuts 
   0.2533404 
coef(regfit.bwd, 7)
 (Intercept)        AtBat         Hits        Walks        CRuns       CWalks    DivisionW 
 105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095   -0.7163346 -116.1692169 
     PutOuts 
   0.3028847 

6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test <- !train
regfit.best <- regsubsets(Salary ~ ., data = Hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary ~ ., data = Hitters[test,])
val.errors <- rep(NA, 19)
for (i in 1:19) {
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[,names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test] - pred) ^ 2)
}
val.errors
 [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1 157909.3 154055.7 148162.1
[11] 151156.4 151742.5 152214.5 157358.7 158541.4 158743.3 159972.7 159859.8 160105.6
which.min(val.errors)
[1] 10
coef(regfit.best, 10)
(Intercept)       AtBat        Hits       Walks      CAtBat       CHits      CHmRun 
-80.2751499  -1.4683816   7.1625314   3.6430345  -0.1855698   1.1053238   1.3844863 
     CWalks     LeagueN   DivisionW     PutOuts 
 -0.7483170  84.5576103 -53.0289658   0.2381662 

Write a predict() method for regsubsets objects:

predict.regsubsets <- function(object, newdata, id, ...) {
  print(paste('dimension of test set:', paste(dim(newdata), collapse = " ")))
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 10)
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns         CRBI 
 162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490    0.7743122 
      CWalks    DivisionW      PutOuts      Assists 
  -0.8308264 -112.3800575    0.2973726    0.2831680 
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(Hitters), replace = TRUE)
for (i in 1:k) {
  print(paste('There are', sum(folds == i), 'test observations in fold', i))
}
[1] "There are 13 test observations in fold 1"
[1] "There are 25 test observations in fold 2"
[1] "There are 31 test observations in fold 3"
[1] "There are 32 test observations in fold 4"
[1] "There are 33 test observations in fold 5"
[1] "There are 27 test observations in fold 6"
[1] "There are 26 test observations in fold 7"
[1] "There are 30 test observations in fold 8"
[1] "There are 22 test observations in fold 9"
[1] "There are 24 test observations in fold 10"
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
  best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j,], nvmax = 19)
  print(paste('====== The no.', j, 'round: ======'))
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    # print(paste('i =', i, ', j =', j, 'dim(pred):', paste(dim(pred), collapse = " ")))
    cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred) ^ 2)
  }
  print(paste('dim(pred):', paste(dim(pred), collapse = " ")))
}
[1] "====== The no. 1 round: ======"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dim(pred): 13 1"
[1] "====== The no. 2 round: ======"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dim(pred): 25 1"
[1] "====== The no. 3 round: ======"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dim(pred): 31 1"
[1] "====== The no. 4 round: ======"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dim(pred): 32 1"
[1] "====== The no. 5 round: ======"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dim(pred): 33 1"
[1] "====== The no. 6 round: ======"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dim(pred): 27 1"
[1] "====== The no. 7 round: ======"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dim(pred): 26 1"
[1] "====== The no. 8 round: ======"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dim(pred): 30 1"
[1] "====== The no. 9 round: ======"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dim(pred): 22 1"
[1] "====== The no. 10 round: ======"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dim(pred): 24 1"
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
       1        2        3        4        5        6        7        8        9       10 
160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2 130207.7 129459.6 125334.7 
      11       12       13       14       15       16       17       18       19 
125153.8 128273.5 133461.0 133974.6 131825.7 131882.8 132750.9 133096.2 132804.7 
plot(mean.cv.errors, type = 'b')

regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 11)
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns         CRBI 
 135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310    0.7852528 
      CWalks      LeagueN    DivisionW      PutOuts      Assists 
  -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277 

How the Code Above Works

Background

Design Matrix: usually written \(X\), with \(n\) rows (the number of observations) and \(p\) columns (the number of features). The design matrix satisfies equation (6.1), or in matrix form \(y = X\beta\), where \(y\) is the response vector of length \(n\) and \(\beta\) is the vector of regression coefficients of length \(p\).

In the code test.mat <- model.matrix(Salary ~ ., data = Hitters[test,]): Salary is \(y\), and the 19 features of Hitters[test,] other than Salary form the matrix \(X\). So in data.frame(test.mat[, -1]) the three factor features League, Division and NewLeague are renamed (for example, League becomes LeagueN) and replaced by the numeric codes of the corresponding columns in within(Hitters[test,], rm(Salary)), while the values of the other 16 numeric features are unchanged.

%*% denotes matrix multiplication:

x <- 1:4
x %*% x
     [,1]
[1,]   30

See Matrix Multiplication

In R, Salary ~ . is a formula object:

class(Salary ~ .)
[1] "formula"

Function Explanations

From the call pred <- predict(best.fit, Hitters[folds == j, ], id = i) in the cross-validation loop above, we can see that the parameters of the function predict.regsubsets are:

  • object: a fitted regsubsets object. For example, when regfit.best is used as object in predict.regsubsets, we can see:
regfit.best$call
regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
class(regfit.best$call)
[1] "call"
regfit.best$call[[2]]
Salary ~ .
class(regfit.best$call[[2]])
[1] "call"
as.formula(regfit.best$call[[2]])
Salary ~ .
class(as.formula(regfit.best$call[[2]]))
[1] "formula"
  • newdata: used to pass the test data set to the predict function;

  • id: in the j-th round of cross-validation (the observations labelled j form the test data, the rest form the training data), id sets the number of predictors; for example, with \(id = 6\), coef(regfit.best, id = 6) returns the coefficient vector \(\beta\) of the best 6-predictor model.

best.fit is the best-subsets fit obtained on the training data, and coefi are the coefficients taken from that fit. So in the value \(\hat y\) returned by predict.regsubsets, the coefficients \(\beta\) come from the training set while \(X\) comes from the test set, and the length of \(\hat y\) equals the size of the test fold. Because the whole data set is split into 10 folds at random by sample(), the folds are not all the same size: the output above shows the first test fold has 13 observations, the second 25, …, and the tenth 24, i.e. \(v = c(13, 25, 31, 32, 33, 27, 26, 30, 22, 24)\).

In cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred) ^ 2), pred holds the predictions \(\hat y_{ji}\) of the best i-predictor subset in the j-th round of cross-validation, and the observed values \(y_{ji}\) are Hitters$Salary[folds == j]; both have the length of this round's test fold, \(v_j\), so the error term is: \[CVerror_{ji} = \frac{1}{v_j}\sum^{v_j}_{u=1}(y_{jiu} - \hat y_{jiu}) ^ 2\]

6.6.1 Ridge Regression

Define the input data:

x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary

Run ridge regression:

library(glmnet)
grid <- 10 ^ seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.mod))
[1]  20 100

Examine the calculated coefficients:

ridge.mod$lambda[50]
[1] 11497.57
coef(ridge.mod)[, 50]
  (Intercept)         AtBat          Hits         HmRun          Runs           RBI 
407.356050200   0.036957182   0.138180344   0.524629976   0.230701523   0.239841459 
        Walks         Years        CAtBat         CHits        CHmRun         CRuns 
  0.289618741   1.107702929   0.003131815   0.011653637   0.087545670   0.023379882 
         CRBI        CWalks       LeagueN     DivisionW       PutOuts       Assists 
  0.024138320   0.025015421   0.085028114  -6.215440973   0.016482577   0.002612988 
       Errors    NewLeagueN 
 -0.020502690   0.301433531 
sqrt(sum(coef(ridge.mod)[-1, 50] ^ 2))
[1] 6.360612
ridge.mod$lambda[60]
[1] 705.4802
coef(ridge.mod)[, 60]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI        Walks 
 54.32519950   0.11211115   0.65622409   1.17980910   0.93769713   0.84718546   1.31987948 
       Years       CAtBat        CHits       CHmRun        CRuns         CRBI       CWalks 
  2.59640425   0.01083413   0.04674557   0.33777318   0.09355528   0.09780402   0.07189612 
     LeagueN    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
 13.68370191 -54.65877750   0.11852289   0.01606037  -0.70358655   8.61181213 
sqrt(sum(coef(ridge.mod)[-1, 60] ^ 2))
[1] 57.11001

Get the coefficients when \(\lambda = 50\):

predict(ridge.mod, s = 50, type = 'coefficients')
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  4.876610e+01
AtBat       -3.580999e-01
Hits         1.969359e+00
HmRun       -1.278248e+00
Runs         1.145892e+00
RBI          8.038292e-01
Walks        2.716186e+00
Years       -6.218319e+00
CAtBat       5.447837e-03
CHits        1.064895e-01
CHmRun       6.244860e-01
CRuns        2.214985e-01
CRBI         2.186914e-01
CWalks      -1.500245e-01
LeagueN      4.592589e+01
DivisionW   -1.182011e+02
PutOuts      2.502322e-01
Assists      1.215665e-01
Errors      -3.278600e+00
NewLeagueN  -9.496680e+00

Fit the ridge regression model on the training set:

set.seed(1)
train <- sample(1: nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)

Calculate the MSE when \(\lambda = 4\):

ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
mean((ridge.pred - y.test) ^ 2)
[1] 101036.8

Calculate the MSE when \(\lambda \to \infty\) (\(\beta \to 0\)) on test set:

mean((mean(y[train]) - y.test) ^ 2)
[1] 193253.1
ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
mean((ridge.pred - y.test) ^ 2)
[1] 193253.1

Calculate the MSE with the least squares fit (\(\lambda = 0\)):

ridge.pred <- predict(ridge.mod, s = 0.01, newx = x[test,], exact = TRUE)
mean((ridge.pred - y.test) ^ 2)
[1] 114723.6

When s < 0.01, the following error arises:

Error: used coef.glmnet() or predict.glmnet() with exact=TRUE so must in addition supply original argument(s) x and y in order to safely rerun glmnet

So here I use s = 0.01 instead of s = 0 as a workaround.

According to trying to use exact=TRUE feature in R glmnet, the parameter penalty.factor must be provided when s = 0 and exact = TRUE are used together in predict.glmnet(). But I don’t know what this parameter means, so I can’t set its value.

Compare the coefficients from lm() and from the ridge fit at s = 0.01 (effectively \(\lambda = 0\)):

lm(y ~ x, subset = train)

Call:
lm(formula = y ~ x, subset = train)

Coefficients:
(Intercept)       xAtBat        xHits       xHmRun        xRuns         xRBI       xWalks  
  299.42849     -2.54027      8.36682     11.64512     -9.09923      2.44105      9.23440  
     xYears      xCAtBat       xCHits      xCHmRun       xCRuns        xCRBI      xCWalks  
  -22.93673     -0.18154     -0.11598     -1.33888      3.32838      0.07536     -1.07841  
   xLeagueN   xDivisionW     xPutOuts     xAssists      xErrors  xNewLeagueN  
   59.76065    -98.86233      0.34087      0.34165     -0.64207     -0.67442  
predict(ridge.mod, s = 0.01, exact = TRUE, type = 'coefficients')[1:20,]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI        Walks 
299.44467220  -2.53538355   8.33585019  11.59830815  -9.05971371   2.45326546   9.21776006 
       Years       CAtBat        CHits       CHmRun        CRuns         CRBI       CWalks 
-22.98239583  -0.18191651  -0.10565688  -1.31721358   3.31152519   0.06590689  -1.07244477 
     LeagueN    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
 59.75587273 -98.94393005   0.34083276   0.34155445  -0.65312471  -0.65882930 

They are almost the same.

Choose the \(\lambda\) with CV:

set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)
best.lambda <- cv.out$lambda.min
abline(v = log(best.lambda), lty = 2, lwd = 2, col = 'blue')

Test MSE associated with the best \(\lambda\):

ridge.pred <- predict(ridge.mod, s = best.lambda, newx = x[test, ])
mean((ridge.pred - y.test) ^ 2)
[1] 96015.51

Fit ridge regression model on the full data set, with the \(\lambda\) chosen by cross-validation:

out <- glmnet(x, y, alpha = 0)
predict(out, s = best.lambda, type = 'coefficients')[1:20,]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI        Walks 
  9.88487157   0.03143991   1.00882875   0.13927624   1.11320781   0.87318990   1.80410229 
       Years       CAtBat        CHits       CHmRun        CRuns         CRBI       CWalks 
  0.13074383   0.01113978   0.06489843   0.45158546   0.12900049   0.13737712   0.02908572 
     LeagueN    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
 27.18227527 -91.63411282   0.19149252   0.04254536  -1.81244470   7.21208394 

All 19 coefficients are non-zero. So no predictors are excluded by ridge regression.

6.6.2 Lasso

Fit the lasso model on the training data:

lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)

Choose the \(\lambda\) of Lasso with CV:

set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)

lbl <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = lbl, newx = x[test, ])
mean((lasso.pred - y.test) ^ 2)
[1] 100743.4

The coefficients of the Lasso model:

out <- glmnet(x, y, alpha = 1, lambda = lbl)
predict(out, type = 'coefficients', s = lbl)[1:20,]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI        Walks 
  19.5052238    0.0000000    1.8702513    0.0000000    0.0000000    0.0000000    2.2185101 
       Years       CAtBat        CHits       CHmRun        CRuns         CRBI       CWalks 
   0.0000000    0.0000000    0.0000000    0.0000000    0.2075887    0.4125063    0.0000000 
     LeagueN    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
   1.7600992 -103.4996975    0.2207019    0.0000000    0.0000000    0.0000000 

So 12 of 19 predictors are excluded by Lasso.

6.7.1 Principal Components Regression

Predict Salary of Hitters with PCR:

library(pls)
Hitters <- na.omit(Hitters)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = 'CV')
summary(pcr.fit)
Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
CV             452    348.9    352.2    353.5    352.8    350.1    349.1    349.6
adjCV          452    348.7    351.8    352.9    352.1    349.3    348.0    348.5
       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
CV       350.9    352.9     353.8     355.0     356.2     363.5     355.2     357.4
adjCV    349.8    351.6     352.3     353.4     354.5     361.6     352.8     355.2
       16 comps  17 comps  18 comps  19 comps
CV        347.6     350.1     349.2     352.6
adjCV     345.5     347.6     346.7     349.8

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96    96.28
Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75    46.86
        10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
X          97.26     97.98     98.65     99.15     99.47     99.75     99.89     99.97
Salary     47.76     47.82     47.85     48.10     50.40     50.55     53.01     53.85
        18 comps  19 comps
X          99.99    100.00
Salary     54.61     54.61

Plot the result:

validationplot(pcr.fit, val.type = 'MSEP')

Perform PCR on the training set and evaluate the test performance:

set.seed(1)
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
Hitters <- na.omit(Hitters)
train <- sample(1: nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
pcr.fit <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
summary(pcr.fit)
Data:   X dimension: 131 19 
    Y dimension: 131 1
Fit method: svdpc
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
CV           464.6    396.2    395.5    394.0    393.8    393.0    384.4    381.3
adjCV        464.6    395.8    394.8    393.3    392.9    392.5    381.5    380.0
       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
CV       385.5    387.4     401.2     403.5     409.6     405.6     406.7     409.3
adjCV    383.9    385.6     398.7     400.8     406.6     402.4     403.4     405.8
       16 comps  17 comps  18 comps  19 comps
CV        407.8     402.5     398.6     403.8
adjCV     404.2     398.4     394.5     399.4

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
X         38.89    60.25    70.85    79.06    84.01    88.51    92.61    95.20    96.78
Salary    28.44    31.33    32.53    33.69    36.64    40.28    40.41    41.07    41.25
        10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
X          97.63     98.27     98.89     99.27     99.56     99.78     99.91     99.97
Salary     41.27     41.41     41.44     43.20     44.24     44.30     45.50     49.66
        18 comps  19 comps
X         100.00    100.00
Salary     51.13     51.18
validationplot(pcr.fit, val.type = 'MSEP')

pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 7)
mean((pcr.pred - y.test) ^ 2)
[1] 96556.22

Fit PCR on the full data set, using M = 7, the number of components identified by cross-validation:

pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 7)
summary(pcr.fit)
Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
   1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X    38.31    60.16    70.84    79.03    84.29    88.63    92.26
y    40.63    41.58    42.17    43.22    44.90    46.48    46.69

6.7.2 Partial Least Squares

Perform PLS on the training set of Hitters:

pls.fit <- plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
summary(pls.fit)
Data:   X dimension: 131 19 
    Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
CV           464.6    393.3    392.8    395.0    396.2    408.2    416.1    414.8
adjCV        464.6    392.6    391.4    392.8    393.9    405.0    411.5    409.7
       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
CV       412.5    403.2     396.9     400.9     399.1     398.8     395.5     397.8
adjCV    407.8    399.3     392.4     396.4     394.8     394.6     391.6     393.7
       16 comps  17 comps  18 comps  19 comps
CV        399.0     400.5     398.2     407.6
adjCV     394.7     396.1     394.2     402.9

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
X         38.12    53.46    66.05    74.49    79.33    84.56    87.09    90.74    92.55
Salary    33.58    38.96    41.57    42.43    44.04    45.59    47.05    47.53    48.42
        10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
X          93.94     97.23     97.88     98.35     98.85     99.11     99.43     99.78
Salary     49.68     50.04     50.54     50.78     50.92     51.04     51.11     51.15
        18 comps  19 comps
X          99.99    100.00
Salary     51.16     51.18
validationplot(pls.fit, val.type = 'MSEP')

Evaluate test set MSE:

pls.pred <- predict(pls.fit, x[test, ], ncomp = 2)
mean((pls.pred - y.test) ^ 2)
[1] 101417.5

Perform PLS on full data set using \(M = 2\):

pls.fit <- plsr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 2)
summary(pls.fit)
Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
        1 comps  2 comps
X         38.08    51.03
Salary    43.05    46.40

Comparing with PCR (\(M=7\)): PLS explains 46.40% of the variance in Salary using only 2 components (\(M=2\)), while PCR needs 7 components to explain 46.69%. PCR with \(M=7\) does explain 92.26% of the variance in the predictors (\(X\)), versus 51.03% for PLS, but explaining the predictors is not our goal.

---
title: "Lab of Chapter 6"
output: html_notebook
---

# 6.5.1 Best Subset Selection

Data preprocessing:
```{r}
library(ISLR)
str(Hitters)
dim(Hitters)
sum(is.na(Hitters$Salary))
Hitters <- na.omit(Hitters)
dim(Hitters)
```

Step (2) of *Algorithm 6.1*:
```{r}
library(leaps)
regfit.full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
```

Step (3) of *Algorithm 6.1*:
```{r}
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters))
reg.summary <- summary(regfit.full)
names(reg.summary)
reg.summary$rsq
```
Here we take $R^2$ as the criterion for deciding which of $M_0, \dots, M_p$ is the best model.
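
One caveat, illustrated in the small chunk below (it only reuses `reg.summary` from the chunk above): the training $R^2$ never decreases as predictors are added, so on its own it always favors the largest model; the adjusted $R^2$, $C_p$ and BIC plotted next account for model size.
```{r}
# Training R^2 is non-decreasing in the number of predictors,
# so which.max() simply points at the largest model fitted.
which.max(reg.summary$rsq)
```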


Plot the result:
```{r}
par(mfrow=c(2,2))
plot(reg.summary$rss, xlab = 'Number of variables', ylab = 'RSS', type = 'l')
plot(reg.summary$adjr2, xlab = 'Number of variables', ylab = 'Adjusted Rsq', type = 'l')
rsq.max.idx <- which.max(reg.summary$adjr2)
points(rsq.max.idx, reg.summary$adjr2[rsq.max.idx], col = 'red', cex = 2, pch = 20)
plot(reg.summary$cp, xlab = 'Number of variables', ylab = 'Cp', type = 'l')
cp.min.idx <- which.min(reg.summary$cp)
points(cp.min.idx, reg.summary$cp[cp.min.idx], col = 'red', cex = 2, pch = 20)
bic.min.idx <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = 'Number of variables', ylab = 'BIC', type = 'l')
points(bic.min.idx, reg.summary$bic[bic.min.idx], col = 'red', cex = 2, pch = 20)
```

Plot with `regsubsets()`'s built-in function:
```{r}
plot(regfit.full, scale = 'r2')
plot(regfit.full, scale = 'adjr2')
plot(regfit.full, scale = 'Cp')
plot(regfit.full, scale = 'bic')
```
These plots show which predictors are selected in the *best* model under each criterion.

For example, in the top row of the second plot (adjusted $R^2$) above, where *adjr2* reaches its maximum value of about 0.52, the variables marked in black include *AtBat*, *Hits*, *Walks*, *CRBI*, *DivisionW*, and *PutOuts*, the same variables marked in the 6th row of the `summary(regfit.full)` output.
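
Rather than reading the memberships off the plot, they can also be pulled out programmatically; a small sketch using the `which` matrix stored in `reg.summary` (row 6 corresponds to the 6-variable model):
```{r}
# Logical membership matrix: one row per model size, one column per term
reg.summary$which[6, ]
# Names of the terms in the 6-variable model (the intercept is always included)
names(which(reg.summary$which[6, ]))
```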

Get the coefficients ($\beta_i, i \in [1..p]$) of equation (6.1):
```{r}
coef(regfit.full, 6)
```

# 6.5.2 Forward and Backward Stepwise Selection
```{r}
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), method = 'forward')
summary(regfit.fwd)
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), method = 'backward')
summary(regfit.bwd)
```

Which predictors are in $M_k$ ($k=7$ here), and what are their coefficients $\beta_i$?
```{r}
coef(regfit.full, 7)
coef(regfit.fwd, 7)
coef(regfit.bwd, 7)
```

# 6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation

```{r}
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test <- !train
regfit.best <- regsubsets(Salary ~ ., data = Hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary ~ ., data = Hitters[test,])
val.errors <- rep(NA, 19)
for (i in 1:19) {
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[,names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test] - pred) ^ 2)
}
val.errors
which.min(val.errors)
coef(regfit.best, 10)
```

Write a `predict()` method for `regsubsets` objects:
```{r}
predict.regsubsets <- function(object, newdata, id, ...) {
  print(paste('dimension of test set:', paste(dim(newdata), collapse = " ")))
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
```
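
Because the function is named `predict.regsubsets`, R's S3 method dispatch makes the plain `predict()` calls in the cross-validation loop below resolve to it automatically for `regsubsets` objects.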

```{r}
regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 10)
```

```{r}
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(Hitters), replace = TRUE)
for (i in 1:k) {
  print(paste('There are', sum(folds == i), 'test observations in fold', i))
}
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
  best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j,], nvmax = 19)
  print(paste('====== The no.', j, 'round: ======'))
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    # print(paste('i =', i, ', j =', j, 'dim(pred):', paste(dim(pred), collapse = " ")))
    cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred) ^ 2)
  }
  print(paste('dim(pred):', paste(dim(pred), collapse = " ")))
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
plot(mean.cv.errors, type = 'b')
regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 11)
```

## How the Code Above Works

### Background

[Design Matrix](https://en.wikipedia.org/wiki/Design_matrix): usually written $X$, with $n$ rows (the number of observations) and $p$ columns (the number of features). The design matrix satisfies equation (6.1), or in matrix form $y = X\beta$, where $y$ is the response vector of length $n$ and $\beta$ is the vector of regression coefficients of length $p$.

In the code `test.mat <- model.matrix(Salary ~ ., data = Hitters[test,])`:
`Salary` is $y$, and the 19 features of `Hitters[test,]` other than `Salary` form the matrix $X$.
So in `data.frame(test.mat[, -1])` the three factor features *League*, *Division* and *NewLeague* are renamed (for example, *League* becomes *LeagueN*) and replaced by the numeric codes of the corresponding columns in `within(Hitters[test,], rm(Salary))`, while the values of the other 16 numeric features are unchanged.
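
As a small illustration of this factor-to-dummy conversion (an illustrative chunk, not part of the original lab), `model.matrix()` turns `League` into a 0/1 column `LeagueN` while numeric columns pass through unchanged:
```{r}
# Treatment coding: LeagueN is 1 for League == "N" and 0 for League == "A"
head(model.matrix(Salary ~ League + Hits, data = Hitters))
head(Hitters[, c('League', 'Hits')])
```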


`%*%` denotes matrix multiplication:
```{r}
x <- 1:4
x %*% x
```

See [Matrix Multiplication](http://astrostatistics.psu.edu/su07/R/html/base/html/matmult.html)

In R, `Salary ~ .` is a *formula* object:
```{r}
class(Salary ~ .)
```

### Function Explanations

From the call `pred <- predict(best.fit, Hitters[folds == j, ], id = i)` in the cross-validation loop above, we can see that the parameters of the function `predict.regsubsets` are:

* object: a fitted `regsubsets` object. For example, when `regfit.best` is used as `object` in `predict.regsubsets`, we can see:
```{r}
regfit.best$call
class(regfit.best$call)
regfit.best$call[[2]]
class(regfit.best$call[[2]])
as.formula(regfit.best$call[[2]])
class(as.formula(regfit.best$call[[2]]))
```


* newdata: used to pass the test data set to the predict function;

* id: in the *j*-th round of cross-validation (the observations labelled *j* form the test data, the rest form the training data), *id* sets the number of predictors; for example, with $id = 6$, `coef(regfit.best, id = 6)` returns the coefficient vector $\beta$ of the best 6-predictor model.

`best.fit` is the best-subsets fit obtained on the training data, and `coefi` are the coefficients taken from that fit. So in the value $\hat y$ returned by `predict.regsubsets`, the coefficients $\beta$ come from the training set while $X$ comes from the test set, and the length of $\hat y$ equals the size of the test fold.
Because the whole data set is split into 10 folds at random by `sample()`, the folds are not all the same size: the output above shows the first test fold has 13 observations, the second 25, ..., and the tenth 24, i.e. $v = c(13, 25, 31, 32, 33, 27, 26, 30, 22, 24)$.
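
(The fold sizes can also be read off in a single call with `table(folds)`, instead of the printing loop used above.)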

In `cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred) ^ 2)`, `pred` holds the predictions $\hat y_{ji}$ of the best *i*-predictor subset in the *j*-th round of cross-validation, and the observed values $y_{ji}$ are `Hitters$Salary[folds == j]`; both have the length of this round's test fold, $v_j$, so the error term is:
$$CVerror_{ji} = \frac{1}{v_j}\sum^{v_j}_{u=1}(y_{jiu} - \hat y_{jiu}) ^ 2$$
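
`apply(cv.errors, 2, mean)` then averages these fold errors over the $k = 10$ folds, giving one cross-validation estimate per model size $i$:
$$\overline{CV}_i = \frac{1}{k}\sum_{j=1}^{k} CVerror_{ji}$$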

# 6.6.1 Ridge Regression
Define the input data:
```{r}
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
```
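
Unlike `lm()`, `glmnet()` does not take a formula: it expects a numeric predictor matrix `x` and a response vector `y`. `model.matrix()` expands the factors into dummy columns for us, and the `[, -1]` drops the intercept column it adds, because `glmnet()` fits its own intercept.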

Run ridge regression:
```{r}
library(glmnet)
grid <- 10 ^ seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.mod))
```
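
So `coef(ridge.mod)` is a $20 \times 100$ matrix: one row per coefficient (the intercept plus 19 predictors) and one column for each value of $\lambda$ in `grid`.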

Examine the calculated coefficients:
```{r}
ridge.mod$lambda[50]
coef(ridge.mod)[, 50]
sqrt(sum(coef(ridge.mod)[-1, 50] ^ 2))

ridge.mod$lambda[60]
coef(ridge.mod)[, 60]
sqrt(sum(coef(ridge.mod)[-1, 60] ^ 2))
```

Get the coefficients when $\lambda = 50$:
```{r}
predict(ridge.mod, s = 50, type = 'coefficients')
```

Fit the ridge regression model on the training set:
```{r}
set.seed(1)
train <- sample(1: nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
```

Calculate the MSE when $\lambda = 4$:
```{r}
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
mean((ridge.pred - y.test) ^ 2)
```

Calculate the MSE when $\lambda \to \infty$ ($\beta \to 0$) on test set:
```{r}
mean((mean(y[train]) - y.test) ^ 2)
ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
mean((ridge.pred - y.test) ^ 2)
```
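
The first line is the test MSE of the null model that predicts the training-set mean for every test observation; with a huge $\lambda$ such as `1e10` the ridge coefficients are shrunk essentially to zero, so its predictions collapse to the same intercept-only fit and the two MSEs agree.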

Calculate the MSE with the least squares fit ($\lambda = 0$):
```{r}
ridge.pred <- predict(ridge.mod, s = 0.01, newx = x[test,], exact = TRUE)
mean((ridge.pred - y.test) ^ 2)
```

When `s < 0.01`, the following error arises:

> Error: used coef.glmnet() or predict.glmnet() with `exact=TRUE` so must in addition supply original argument(s) x and y in order to safely rerun glmnet 

So here I use `s = 0.01` instead of `s = 0` as a workaround.

According to [trying to use exact=TRUE feature in R glmnet](https://stackoverflow.com/questions/49804793/trying-to-use-exact-true-feature-in-r-glmnet), the parameter *penalty.factor* must be provided when `s = 0` and `exact = TRUE` are used together in `predict.glmnet()`.
But I don't know what this parameter means, so I can't set its value.
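
The error message itself points to another way out: with `exact = TRUE`, newer versions of glmnet re-fit the model at the requested `s` and therefore need the original training data passed along. A sketch of that call (based on the error message and the glmnet documentation, not verified in this notebook), reusing the `x`, `y` and `train` objects defined above:
```{r}
# Re-fit exactly at lambda = 0 by supplying the original training data
ridge.pred0 <- predict(ridge.mod, s = 0, exact = TRUE,
                       x = x[train, ], y = y[train], newx = x[test, ])
mean((ridge.pred0 - y.test) ^ 2)
```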

Compare the coefficients from `lm()` and from the ridge fit at `s = 0.01` (effectively $\lambda = 0$):
```{r}
lm(y ~ x, subset = train)
predict(ridge.mod, s = 0.01, exact = TRUE, type = 'coefficients')[1:20,]
```

They are almost the same.

Choose the $\lambda$ with CV:
```{r}
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)
best.lambda <- cv.out$lambda.min
abline(v = log(best.lambda), lty = 2, lwd = 2, col = 'blue')
```

Test MSE associated with the best $\lambda$:
```{r}
ridge.pred <- predict(ridge.mod, s = best.lambda, newx = x[test, ])
mean((ridge.pred - y.test) ^ 2)
```

Fit ridge regression model on the full data set, with the $\lambda$ chosen by cross-validation:
```{r}
out <- glmnet(x, y, alpha = 0)
predict(out, s = best.lambda, type = 'coefficients')[1:20,]
```

All 19 coefficients are non-zero. So no predictors are excluded by ridge regression.

# 6.6.2 Lasso

Fit the lasso model on the training data:
```{r}
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)
```

Choose the $\lambda$ of Lasso with CV:
```{r}
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
lbl <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = lbl, newx = x[test, ])
mean((lasso.pred - y.test) ^ 2)
```

The coefficients of the Lasso model:
```{r}
out <- glmnet(x, y, alpha = 1, lambda = lbl)
predict(out, type = 'coefficients', s = lbl)[1:20,]
```

So 12 of 19 predictors are excluded by Lasso.
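
The zero/non-zero split can also be counted programmatically; a small sketch reusing `out` and `lbl` from the chunk above:
```{r}
lasso.coef <- predict(out, type = 'coefficients', s = lbl)[1:20, ]
# Predictors driven exactly to zero by the lasso (intercept excluded)
sum(lasso.coef[-1] == 0)
# Predictors kept in the model
names(lasso.coef[-1])[lasso.coef[-1] != 0]
```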

# 6.7.1 Principal Components Regression

Predict *Salary* of *Hitters* with PCR:
```{r}
library(pls)
Hitters <- na.omit(Hitters)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = 'CV')
summary(pcr.fit)
```

Plot the result:
```{r}
validationplot(pcr.fit, val.type = 'MSEP')
```

Perform PCR on the training set and evaluate the test performance:
```{r}
set.seed(1)
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
Hitters <- na.omit(Hitters)
train <- sample(1: nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
pcr.fit <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
summary(pcr.fit)
validationplot(pcr.fit, val.type = 'MSEP')
pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 7)
mean((pcr.pred - y.test) ^ 2)
```

Fit PCR on the full data set, using M = 7, the number of components identified by cross-validation:
```{r}
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 7)
summary(pcr.fit)
```

# 6.7.2 Partial Least Squares

Perform PLS on the training set of *Hitters*:
```{r}
pls.fit <- plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
summary(pls.fit)
validationplot(pls.fit, val.type = 'MSEP')
```

Evaluate test set MSE:
```{r}
pls.pred <- predict(pls.fit, x[test, ], ncomp = 2)
mean((pls.pred - y.test) ^ 2)
```

Perform PLS on full data set using $M = 2$:
```{r}
pls.fit <- plsr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 2)
summary(pls.fit)
```

Comparing with PCR ($M=7$): PLS explains 46.40% of the variance in *Salary* using only 2 components ($M=2$), while PCR needs 7 components to explain 46.69%.
PCR with $M=7$ does explain 92.26% of the variance in the predictors ($X$), versus 51.03% for PLS,
but explaining the predictors is not our goal.