Data preprocessing:
library(ISLR)
str(Hitters)
'data.frame': 322 obs. of 20 variables:
$ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
$ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
$ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
$ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
$ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
$ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
$ Years : int 1 14 3 11 2 11 2 3 2 13 ...
$ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
$ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
$ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
$ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
$ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
$ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
$ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
$ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
$ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
$ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
$ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
$ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
$ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
dim(Hitters)
[1] 322 20
sum(is.na(Hitters$Salary))
[1] 59
Hitters <- na.omit(Hitters)
dim(Hitters)
[1] 263 20
Step (2) of Algorithm 6.1:
library(leaps)
regfit.full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" " " " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " " " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " " " "
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " " " "
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*" " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" " " " "
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " " " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " " "*" " "
DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " " " " "
Step (3) of Algorithm 6.1:
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters))
reg.summary <- summary(regfit.full)
names(reg.summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$rsq
[1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227 0.5285569 0.5346124
[10] 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164 0.5454692 0.5457656 0.5459518 0.5460945
[19] 0.5461159
Here \(R^2\) is used as the criterion for judging which of \(M_0, \dots, M_p\) is the best model.
Plot the result:
par(mfrow=c(2,2))
plot(reg.summary$rss, xlab = 'Number of variables', ylab = 'RSS', type = 'l')
plot(reg.summary$adjr2, xlab = 'Number of variables', ylab = 'Adjusted Rsq', type = 'l')
adjr2.max.idx <- which.max(reg.summary$adjr2)
points(adjr2.max.idx, reg.summary$adjr2[adjr2.max.idx], col = 'red', cex = 2, pch = 20)
plot(reg.summary$cp, xlab = 'Number of variables', ylab = 'Cp', type = 'l')
cp.min.idx <- which.min(reg.summary$cp)
points(cp.min.idx, reg.summary$cp[cp.min.idx], col = 'red', cex = 2, pch = 20)
bic.min.idx <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = 'Number of variables', ylab = 'BIC', type = 'l')
points(bic.min.idx, reg.summary$bic[bic.min.idx], col = 'red', cex = 2, pch = 20)
Plot with the built-in plot() method for regsubsets objects:
plot(regfit.full, scale = 'r2')
plot(regfit.full, scale = 'adjr2')
plot(regfit.full, scale = 'Cp')
plot(regfit.full, scale = 'bic')
With these plots we can see which predictors are selected in the best model under a given criterion.
For example, in the top row of the second plot above (adjusted \(R^2\)), where the adjusted \(R^2\) reaches its maximum of about 0.52, some of the variables marked in black are AtBat, Hits, Walks, CRBI, DivisionW and PutOuts, the same six variables starred in the 6-variable row of the summary(regfit.full) output.
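The selected variables can also be read off programmatically; a small sketch using the reg.summary object computed above:
best.adjr2 <- which.max(reg.summary$adjr2)       # size of the model with the largest adjusted R^2
names(which(reg.summary$which[best.adjr2, ]))    # intercept and predictors included in that model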
Get the coefficients \(\beta_i\) of equation (6.1) for the best 6-variable model:
coef(regfit.full, 6)
(Intercept) AtBat Hits Walks CRBI DivisionW PutOuts
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338 0.2643076
Forward stepwise selection:
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), method = 'forward')
summary(regfit.fwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = ncol(Hitters),
method = "forward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" " " " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " " " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " " " "
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " " " "
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*" " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" " " " "
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*" "*" " "
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*" " "
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*" " "
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*" "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " " " " "
9 ( 1 ) "*" "*" " " " " " "
10 ( 1 ) "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*"
Backward stepwise selection:
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = ncol(Hitters), method = 'backward')
summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = ncol(Hitters),
method = "backward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " " " " " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " " " " " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " " "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*" "*" " "
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*" " "
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*" " "
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*" "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) " " "*" " " " " " "
5 ( 1 ) " " "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " " " " "
9 ( 1 ) "*" "*" " " " " " "
10 ( 1 ) "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*"
Which predictors are in \(M_7\) for each of the three methods, and what are their coefficients \(\beta_i\)?
coef(regfit.full, 7)
(Intercept) Hits Walks CAtBat CHits CHmRun DivisionW
79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538 -129.9866432
PutOuts
0.2366813
coef(regfit.fwd, 7)
(Intercept) AtBat Hits Walks CRBI CWalks DivisionW
109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070 -127.1223928
PutOuts
0.2533404
coef(regfit.bwd, 7)
(Intercept) AtBat Hits Walks CRuns CWalks DivisionW
105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346 -116.1692169
PutOuts
0.3028847
Choose among models using the validation set approach:
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test <- !train
regfit.best <- regsubsets(Salary ~ ., data = Hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary ~ ., data = Hitters[test,])
val.errors <- rep(NA, 19)
for (i in 1:19) {
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test] - pred) ^ 2)
}
val.errors
[1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1 157909.3 154055.7 148162.1
[11] 151156.4 151742.5 152214.5 157358.7 158541.4 158743.3 159972.7 159859.8 160105.6
which.min(val.errors)
[1] 10
coef(regfit.best, 10)
(Intercept) AtBat Hits Walks CAtBat CHits CHmRun
-80.2751499 -1.4683816 7.1625314 3.6430345 -0.1855698 1.1053238 1.3844863
CWalks LeagueN DivisionW PutOuts
-0.7483170 84.5576103 -53.0289658 0.2381662
Write a predict() method for regsubsets objects:
predict.regsubsets <- function(object, newdata, id, ...) {
  print(paste('dimension of test set:', paste(dim(newdata), collapse = " ")))  # debug output
  form <- as.formula(object$call[[2]])  # recover the formula from the original regsubsets() call
  mat <- model.matrix(form, newdata)    # design matrix of the new data
  coefi <- coef(object, id = id)        # coefficients of the best model with `id` predictors
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                # X %*% beta
}
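With this method defined, S3 dispatch lets us call predict() directly on a regsubsets fit. For instance, the 10-variable validation error computed manually above should be reproducible as follows (a sketch reusing regfit.best, train and test from the validation-set code):
mean((Hitters$Salary[test] - predict(regfit.best, Hitters[test, ], id = 10)) ^ 2)   # should match val.errors[10]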
Finally, perform best subset selection on the full data set and extract its best 10-variable model:
regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 10)
(Intercept) AtBat Hits Walks CAtBat CRuns CRBI
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490 0.7743122
CWalks DivisionW PutOuts Assists
-0.8308264 -112.3800575 0.2973726 0.2831680
Choose among models using 10-fold cross-validation:
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(Hitters), replace = TRUE)
for (i in 1:k) {
  print(paste('There are', sum(folds == i), 'test observations in fold', i))
}
[1] "There are 13 test observations in fold 1"
[1] "There are 25 test observations in fold 2"
[1] "There are 31 test observations in fold 3"
[1] "There are 32 test observations in fold 4"
[1] "There are 33 test observations in fold 5"
[1] "There are 27 test observations in fold 6"
[1] "There are 26 test observations in fold 7"
[1] "There are 30 test observations in fold 8"
[1] "There are 22 test observations in fold 9"
[1] "There are 24 test observations in fold 10"
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
  best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19)
  print(paste('====== The no.', j, 'round: ======'))
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    # print(paste('i =', i, ', j =', j, 'dim(pred):', paste(dim(pred), collapse = " ")))
    cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred) ^ 2)
  }
  print(paste('dim(pred):', paste(dim(pred), collapse = " ")))
}
[1] "====== The no. 1 round: ======"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dimenson of test set: 13 20"
[1] "dim(pred): 13 1"
[1] "====== The no. 2 round: ======"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dimenson of test set: 25 20"
[1] "dim(pred): 25 1"
[1] "====== The no. 3 round: ======"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dimenson of test set: 31 20"
[1] "dim(pred): 31 1"
[1] "====== The no. 4 round: ======"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dimenson of test set: 32 20"
[1] "dim(pred): 32 1"
[1] "====== The no. 5 round: ======"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dimenson of test set: 33 20"
[1] "dim(pred): 33 1"
[1] "====== The no. 6 round: ======"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dimenson of test set: 27 20"
[1] "dim(pred): 27 1"
[1] "====== The no. 7 round: ======"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dimenson of test set: 26 20"
[1] "dim(pred): 26 1"
[1] "====== The no. 8 round: ======"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dimenson of test set: 30 20"
[1] "dim(pred): 30 1"
[1] "====== The no. 9 round: ======"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dimenson of test set: 22 20"
[1] "dim(pred): 22 1"
[1] "====== The no. 10 round: ======"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dimenson of test set: 24 20"
[1] "dim(pred): 24 1"
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
1 2 3 4 5 6 7 8 9 10
160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2 130207.7 129459.6 125334.7
11 12 13 14 15 16 17 18 19
125153.8 128273.5 133461.0 133974.6 131825.7 131882.8 132750.9 133096.2 132804.7
plot(mean.cv.errors, type = 'b')
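The model size with the lowest mean CV error can be picked out directly (a small check based on the values printed above):
which.min(mean.cv.errors)   # 11: the 11-variable model has the smallest cross-validation error
This is why the 11-variable model is extracted from the full-data fit below.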
regfit.best <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.best, 11)
(Intercept) AtBat Hits Walks CAtBat CRuns CRBI
135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310 0.7852528
CWalks LeagueN DivisionW PutOuts Assists
-0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277
Design matrix: usually denoted \(X\), with \(n\) rows (the number of observations) and \(p\) columns (the number of features). The design matrix satisfies equation (6.1), which in matrix form is \(y = X\beta\), where \(y\) is the response vector of length \(n\) and \(\beta\) is the vector of regression coefficients of length \(p\).
In the code test.mat <- model.matrix(Salary ~ ., data = Hitters[test,]): Salary is \(y\), and the other 19 features of Hitters[test,] (everything except Salary) form the matrix \(X\). Compared with within(Hitters[test,], rm(Salary)), the three factor features League, Division and NewLeague in data.frame(test.mat[, -1]) are renamed (e.g. League becomes LeagueN) and recoded as numeric dummy variables, while the values of the other 16 numeric features are exactly the same.
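A minimal sketch of this dummy coding, reusing Hitters, test and test.mat from above:
head(Hitters[test, 'League'])    # factor with levels "A" and "N"
head(test.mat[, 'LeagueN'])      # the corresponding 0/1 dummy column of the design matrix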
%*% denotes matrix multiplication:
x <- 1:4
x %*% x
[,1]
[1,] 30
In R, Salary ~ . is a formula object:
class(Salary ~ .)
[1] "formula"
From the call pred <- predict(best.fit, Hitters[folds == j, ], id = i) in the cross-validation loop above, we can see the parameters of predict.regsubsets:
object: a fitted regsubsets model. For example, when regfit.best is used as object in predict.regsubsets, we can see:
regfit.best$call
regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
class(regfit.best$call)
[1] "call"
regfit.best$call[[2]]
Salary ~ .
class(regfit.best$call[[2]])
[1] "call"
as.formula(regfit.best$call[[2]])
Salary ~ .
class(as.formula(regfit.best$call[[2]]))
[1] "formula"
newdata: used here to pass the test data set to the predict function;
id: in round j of the cross-validation (the observations labeled j are the test data, the rest are training data), id determines the number of predictors; for example, with \(id = 6\), coef(regfit.best, id = 6) returns the coefficient vector \(\beta\) of the best model with 6 predictors.
best.fit is the best-subsets model fitted on the training set, and coefi are the coefficients obtained from that model. So in the value \(\hat y\) returned by predict.regsubsets, the coefficients \(\beta\) come from the training set while \(X\) comes from the test set, and the length of \(\hat y\) equals the size of the test set. Because the whole data set is randomly split into 10 folds by the sample function, the folds are not all the same size: from the output above, the test set has 13 observations in round 1, 25 in round 2, ..., and 24 in round 10, written as \(v = c(13, 25, 31, 32, 33, 27, 26, 30, 22, 24)\).
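The fold sizes can also be obtained in one call (a sketch; folds is the vector created with sample() above):
table(folds)   # number of test observations in each of the 10 folds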
In cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred) ^ 2), pred holds the predictions \(\hat y_{ji}\) of the best subset with i predictors in round j of the cross-validation, the actual values \(y_{ji}\) are Hitters$Salary[folds == j], and both have the length of this round's test set, \(v_j\). So the error term is \[CVerror_{ji} = \frac{1}{v_j}\sum^{v_j}_{u=1}(y_{ji,u} - \hat y_{ji,u}) ^ 2\]
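A minimal sketch of one such cell computed by hand (the fold j0 and model size i0 are illustrative values, not taken from the text above):
j0 <- 1; i0 <- 6                                                             # example fold and model size
fit.j0 <- regsubsets(Salary ~ ., data = Hitters[folds != j0, ], nvmax = 19)  # fit on the other 9 folds
yhat   <- predict(fit.j0, Hitters[folds == j0, ], id = i0)                   # predict on fold j0
ytrue  <- Hitters$Salary[folds == j0]
sum((ytrue - yhat) ^ 2) / length(ytrue)                                      # same value as cv.errors[1, 6]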
Define the input data for ridge regression and the lasso:
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
Run ridge regression:
library(glmnet)
grid <- 10 ^ seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.mod))
[1] 20 100
Study the calculated coefficients:
ridge.mod$lambda[50]
[1] 11497.57
coef(ridge.mod)[, 50]
(Intercept) AtBat Hits HmRun Runs RBI
407.356050200 0.036957182 0.138180344 0.524629976 0.230701523 0.239841459
Walks Years CAtBat CHits CHmRun CRuns
0.289618741 1.107702929 0.003131815 0.011653637 0.087545670 0.023379882
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.024138320 0.025015421 0.085028114 -6.215440973 0.016482577 0.002612988
Errors NewLeagueN
-0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1, 50] ^ 2))
[1] 6.360612
ridge.mod$lambda[60]
[1] 705.4802
coef(ridge.mod)[, 60]
(Intercept) AtBat Hits HmRun Runs RBI Walks
54.32519950 0.11211115 0.65622409 1.17980910 0.93769713 0.84718546 1.31987948
Years CAtBat CHits CHmRun CRuns CRBI CWalks
2.59640425 0.01083413 0.04674557 0.33777318 0.09355528 0.09780402 0.07189612
LeagueN DivisionW PutOuts Assists Errors NewLeagueN
13.68370191 -54.65877750 0.11852289 0.01606037 -0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1, 60] ^ 2))
[1] 57.11001
Get the coefficients when \(\lambda = 50\):
predict(ridge.mod, s = 50, type = 'coefficients')
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.876610e+01
AtBat -3.580999e-01
Hits 1.969359e+00
HmRun -1.278248e+00
Runs 1.145892e+00
RBI 8.038292e-01
Walks 2.716186e+00
Years -6.218319e+00
CAtBat 5.447837e-03
CHits 1.064895e-01
CHmRun 6.244860e-01
CRuns 2.214985e-01
CRBI 2.186914e-01
CWalks -1.500245e-01
LeagueN 4.592589e+01
DivisionW -1.182011e+02
PutOuts 2.502322e-01
Assists 1.215665e-01
Errors -3.278600e+00
NewLeagueN -9.496680e+00
Fit the ridge regression model on training set:
set.seed(1)
train <- sample(1: nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
Calculate the MSE when \(\lambda = 4\):
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
mean((ridge.pred - y.test) ^ 2)
[1] 101036.8
Calculate the MSE when \(\lambda \to \infty\) (\(\beta \to 0\)) on test set:
mean((mean(y[train]) - y.test) ^ 2)
[1] 193253.1
ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
mean((ridge.pred - y.test) ^ 2)
[1] 193253.1
Calculate the MSE with the least squares fit (\(\lambda = 0\)):
ridge.pred <- predict(ridge.mod, s = 0.01, newx = x[test,], exact = TRUE)
mean((ridge.pred - y.test) ^ 2)
[1] 114723.6
When s < 0.01, the following error arises:
Error: used coef.glmnet() or predict.glmnet() with exact=TRUE so must in addition supply original argument(s) x and y in order to safely rerun glmnet
So here I use s = 0.01 instead of s = 0 as a workaround.
According to the discussion in trying to use exact=TRUE feature in R glmnet, the parameter penalty.factor must be provided when both s = 0 and exact = TRUE are passed to predict.glmnet(). But I don't know what this parameter means, so I can't set its value.
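The error message itself points to another workaround: newer versions of glmnet accept the original x and y as extra arguments when exact = TRUE. A sketch (not run here; s = 0 is assumed to be accepted in this form):
ridge.pred0 <- predict(ridge.mod, s = 0, newx = x[test, ], exact = TRUE,
                       x = x[train, ], y = y[train])   # refits exactly at s = 0 using the original data
mean((ridge.pred0 - y.test) ^ 2)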
Compare the coefficients created by lm() and by glmnet with s = 0.01:
lm(y ~ x, subset = train)
Call:
lm(formula = y ~ x, subset = train)
Coefficients:
(Intercept) xAtBat xHits xHmRun xRuns xRBI xWalks
299.42849 -2.54027 8.36682 11.64512 -9.09923 2.44105 9.23440
xYears xCAtBat xCHits xCHmRun xCRuns xCRBI xCWalks
-22.93673 -0.18154 -0.11598 -1.33888 3.32838 0.07536 -1.07841
xLeagueN xDivisionW xPutOuts xAssists xErrors xNewLeagueN
59.76065 -98.86233 0.34087 0.34165 -0.64207 -0.67442
predict(ridge.mod, s = 0.01, exact = TRUE, type = 'coefficients')[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI Walks
299.44467220 -2.53538355 8.33585019 11.59830815 -9.05971371 2.45326546 9.21776006
Years CAtBat CHits CHmRun CRuns CRBI CWalks
-22.98239583 -0.18191651 -0.10565688 -1.31721358 3.31152519 0.06590689 -1.07244477
LeagueN DivisionW PutOuts Assists Errors NewLeagueN
59.75587273 -98.94393005 0.34083276 0.34155445 -0.65312471 -0.65882930
They are almost the same.
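A quick numeric check of this claim (a sketch; both coefficient vectors are in the same column order, so element-wise subtraction lines up):
lm.coef     <- coef(lm(y ~ x, subset = train))
ridge.coef0 <- predict(ridge.mod, s = 0.01, exact = TRUE, type = 'coefficients')[1:20, ]
max(abs(lm.coef - ridge.coef0))   # small, confirming the two fits nearly agree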
Choose the \(\lambda\) with CV:
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)
best.lambda <- cv.out$lambda.min
abline(v = log(best.lambda), lty = 2, lwd = 2, col = 'blue')
Test MSE associated with the best \(\lambda\):
ridge.pred <- predict(ridge.mod, s = best.lambda, newx = x[test, ])
mean((ridge.pred - y.test) ^ 2)
[1] 96015.51
Fit ridge regression model on the full data set, with the \(\lambda\) chosen by cross-validation:
out <- glmnet(x, y, alpha = 0)
predict(out, s = best.lambda, type = 'coefficients')[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI Walks
9.88487157 0.03143991 1.00882875 0.13927624 1.11320781 0.87318990 1.80410229
Years CAtBat CHits CHmRun CRuns CRBI CWalks
0.13074383 0.01113978 0.06489843 0.45158546 0.12900049 0.13737712 0.02908572
LeagueN DivisionW PutOuts Assists Errors NewLeagueN
27.18227527 -91.63411282 0.19149252 0.04254536 -1.81244470 7.21208394
All 19 coefficients are non-zero. So no predictors are excluded by ridge regression.
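As a programmatic check of this statement (a small sketch reusing out and best.lambda from above):
ridge.coef <- predict(out, s = best.lambda, type = 'coefficients')[1:20, ]
sum(ridge.coef[-1] == 0)   # 0: ridge shrinks coefficients but never sets them exactly to zero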
Fit the lasso model on the training data:
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)
Choose the \(\lambda\) of Lasso with CV:
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
lbl <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = lbl, newx = x[test, ])
mean((lasso.pred - y.test) ^ 2)
[1] 100743.4
The coefficients of the Lasso model:
out <- glmnet(x, y, alpha = 1, lambda = lbl)
predict(out, type = 'coefficients', s = lbl)[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI Walks
19.5052238 0.0000000 1.8702513 0.0000000 0.0000000 0.0000000 2.2185101
Years CAtBat CHits CHmRun CRuns CRBI CWalks
0.0000000 0.0000000 0.0000000 0.0000000 0.2075887 0.4125063 0.0000000
LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1.7600992 -103.4996975 0.2207019 0.0000000 0.0000000 0.0000000
So 12 of 19 predictors are excluded by Lasso.
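The same count can be verified in code (a sketch reusing out and lbl from above):
lasso.coef <- predict(out, type = 'coefficients', s = lbl)[1:20, ]
sum(lasso.coef[-1] == 0)             # 12 predictors have a coefficient of exactly 0
names(lasso.coef)[lasso.coef != 0]   # the intercept plus the 7 remaining predictors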
Predict Salary of Hitters with PCR:
library(pls)
Hitters <- na.omit(Hitters)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = 'CV')
summary(pcr.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
CV 452 348.9 352.2 353.5 352.8 350.1 349.1 349.6
adjCV 452 348.7 351.8 352.9 352.1 349.3 348.0 348.5
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
CV 350.9 352.9 353.8 355.0 356.2 363.5 355.2 357.4
adjCV 349.8 351.6 352.3 353.4 354.5 361.6 352.8 355.2
16 comps 17 comps 18 comps 19 comps
CV 347.6 350.1 349.2 352.6
adjCV 345.5 347.6 346.7 349.8
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps
X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96 96.28
Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75 46.86
10 comps 11 comps 12 comps 13 comps 14 comps 15 comps 16 comps 17 comps
X 97.26 97.98 98.65 99.15 99.47 99.75 99.89 99.97
Salary 47.76 47.82 47.85 48.10 50.40 50.55 53.01 53.85
18 comps 19 comps
X 99.99 100.00
Salary 54.61 54.61
Plot the result:
validationplot(pcr.fit, val.type = 'MSEP')
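The number of components with the lowest cross-validated error can also be extracted numerically; a sketch assuming the layout of the val array returned by pls::RMSEP() (its first entry corresponds to the 0-component, intercept-only model):
cv.rmsep <- RMSEP(pcr.fit, estimate = 'CV')
which.min(cv.rmsep$val[1, 1, ]) - 1   # subtract 1 because index 1 is the 0-component model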
Perform PCR on the training set and evaluate the test performance:
set.seed(1)
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
Hitters <- na.omit(Hitters)
train <- sample(1: nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
pcr.fit <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
summary(pcr.fit)
Data: X dimension: 131 19
Y dimension: 131 1
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
CV 464.6 396.2 395.5 394.0 393.8 393.0 384.4 381.3
adjCV 464.6 395.8 394.8 393.3 392.9 392.5 381.5 380.0
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
CV 385.5 387.4 401.2 403.5 409.6 405.6 406.7 409.3
adjCV 383.9 385.6 398.7 400.8 406.6 402.4 403.4 405.8
16 comps 17 comps 18 comps 19 comps
CV 407.8 402.5 398.6 403.8
adjCV 404.2 398.4 394.5 399.4
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps
X 38.89 60.25 70.85 79.06 84.01 88.51 92.61 95.20 96.78
Salary 28.44 31.33 32.53 33.69 36.64 40.28 40.41 41.07 41.25
10 comps 11 comps 12 comps 13 comps 14 comps 15 comps 16 comps 17 comps
X 97.63 98.27 98.89 99.27 99.56 99.78 99.91 99.97
Salary 41.27 41.41 41.44 43.20 44.24 44.30 45.50 49.66
18 comps 19 comps
X 100.00 100.00
Salary 51.13 51.18
validationplot(pcr.fit, val.type = 'MSEP')
pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 7)
mean((pcr.pred - y.test) ^ 2)
[1] 96556.22
Fit PCR on the full data set, using M = 7, the number of components identified by cross-validation:
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 7)
summary(pcr.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
y 40.63 41.58 42.17 43.22 44.90 46.48 46.69
Perform PLS on the training set of Hitters:
pls.fit <- plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = 'CV')
summary(pls.fit)
Data: X dimension: 131 19
Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
CV 464.6 393.3 392.8 395.0 396.2 408.2 416.1 414.8
adjCV 464.6 392.6 391.4 392.8 393.9 405.0 411.5 409.7
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
CV 412.5 403.2 396.9 400.9 399.1 398.8 395.5 397.8
adjCV 407.8 399.3 392.4 396.4 394.8 394.6 391.6 393.7
16 comps 17 comps 18 comps 19 comps
CV 399.0 400.5 398.2 407.6
adjCV 394.7 396.1 394.2 402.9
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps
X 38.12 53.46 66.05 74.49 79.33 84.56 87.09 90.74 92.55
Salary 33.58 38.96 41.57 42.43 44.04 45.59 47.05 47.53 48.42
10 comps 11 comps 12 comps 13 comps 14 comps 15 comps 16 comps 17 comps
X 93.94 97.23 97.88 98.35 98.85 99.11 99.43 99.78
Salary 49.68 50.04 50.54 50.78 50.92 51.04 51.11 51.15
18 comps 19 comps
X 99.99 100.00
Salary 51.16 51.18
validationplot(pls.fit, val.type = 'MSEP')
Evaluate test set MSE:
pls.pred <- predict(pls.fit, x[test, ], ncomp = 2)
mean((pls.pred - y.test) ^ 2)
[1] 101417.5
Perform PLS on full data set using \(M = 2\):
pls.fit <- plsr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 2)
summary(pls.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
1 comps 2 comps
X 38.08 51.03
Salary 43.05 46.40
Comparing with PCR (\(M=7\)): PLS explains 46.40% of the variance in Salary using only 2 components (\(M=2\)), while PCR needs 7 components to explain 46.69%. Although PCR with \(M=7\) explains 92.26% of the variance in the predictors (\(X\)), versus 51.03% for PLS, explaining the predictors is not what we are interested in here.