Q1. We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain \(p + 1\) models containing \(0, 1, 2, \ldots, p\) predictors. Explain your answers:
When performing best subset selection, the model with \(k\) predictors is the one with the smallest RSS among all \(\binom{p}{k}\) models containing \(k\) predictors. When performing forward stepwise selection, the model with \(k\) predictors is the one with the smallest RSS among the \(p - k + 1\) models that augment the predictors in \(\mathcal{M}_{k - 1}\) with one additional predictor. When performing backward stepwise selection, the model with \(k\) predictors is the one with the smallest RSS among the \(k + 1\) models that contain all but one of the predictors in \(\mathcal{M}_{k + 1}\). So the model with \(k\) predictors that has the smallest training RSS is the one obtained from best subset selection, since it is selected from among all models with \(k\) predictors.
Difficult to answer: best subset selection may find the model with the smallest test RSS because it considers more models than the other methods, but by the same token it is more prone to overfitting the training data, and forward or backward stepwise selection may happen to pick a model with a smaller test RSS.
True. In forward stepwise selection, the model with \((k + 1)\) predictors is obtained by augmenting the predictors in the model with \(k\) predictors with one additional predictor, so the predictors of the smaller model are a subset of those of the larger one.
True. In backward stepwise selection, the model with \(k\) predictors is obtained by removing one predictor from the model with \((k + 1)\) predictors.
False. There is no direct link between the models obtained from forward and backward selection.
False. There is no direct link between the models obtained from forward and backward selection.
False. The model with \((k + 1)\) predictors is obtained by selecting among all possible models with \((k + 1)\) predictors, and so does not necessarily contain all the predictors selected for the \(k\)-variable model.
Q2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The lasso is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
As with the lasso, ridge regression is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Non-linear methods are more flexible and will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
Q3. Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^n\Biggl(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\Biggr)^2\text{ subject to }\sum_{j=1}^p|\beta_j|\le s\] for a particular value of \(s\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
Steadily decrease. As we increase \(s\) from 0, the constraint on the \(\beta_j\) coefficients is relaxed (they move toward their least squares estimates), so the model becomes more and more flexible, which causes a steady decrease in the training RSS.
Decrease initially, and then eventually start increasing in a U shape. As \(s\) increases from 0, the model becomes more and more flexible, so the test RSS first decreases and then, once the model begins to overfit, increases again, tracing the typical U shape.
Steadily increase. As \(s\) increases from 0, the model becomes more and more flexible, which causes a steady increase in variance.
Steadily decrease. As \(s\) increases from 0, the model becomes more and more flexible, which causes a steady decrease in bias.
Remain constant. By definition, the irreducible error is independent of the model, and consequently independent of the value of \(s\).
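As a quick empirical illustration of part (a) on simulated data (this sketch is ours, not part of the exercise; xsim, ysim and fit.sim are made-up names, and in glmnet a smaller \(\lambda\) corresponds to a larger budget \(s\)):
library(glmnet)  # also loaded later, in Q8
set.seed(42)
xsim <- matrix(rnorm(100 * 5), 100, 5)
ysim <- drop(xsim %*% c(2, -1, 0.5, 0, 0)) + rnorm(100)
fit.sim <- glmnet(xsim, ysim, alpha = 1)  # lasso path over a decreasing grid of lambda values
train.mse <- colMeans((predict(fit.sim, newx = xsim) - ysim)^2)
plot(log(fit.sim$lambda), train.mse, type = "l", xlab = "log(lambda)", ylab = "Training MSE")
# The training MSE increases with lambda, i.e. decreases steadily as the budget s grows.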
Q4. Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^n\Biggl(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\Biggr)^2 + \lambda\sum_{j=1}^p\beta_j^2\] for a particular value of \(\lambda\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
Steadily increase. As we increase \(\lambda\) from 0, the \(\beta_j\) coefficients are shrunk more and more heavily away from their least squares estimates, so the model becomes less and less flexible, which causes a steady increase in the training RSS.
Decrease initially, and then eventually start increasing in a U shape. As \(\lambda\) increases from 0, the model becomes less and less flexible, so the test RSS first decreases (as variance falls) and then increases again (as bias dominates), tracing the typical U shape.
Steadily decrease. As \(\lambda\) increases from 0, the model becomes less and less flexible, which causes a steady decrease in variance.
Steadily increase. As \(\lambda\) increases from 0, the model becomes less and less flexible, which causes a steady increase in bias.
Remain constant. By definition, the irreducible error is independent of the model, and consequently independent of the value of \(\lambda\).
Q5. It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give different coefficient values to correlated variables. We will now explore this property in a very simple setting.
Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1 + y_2 = 0\) and \(x_{11} + x_{21} = 0\) and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat{\beta}_0 = 0\).
According to this setting (\(x_{11} = x_{12} = x_1\) and \(x_{21} = x_{22} = x_2\)), the ridge regression problem seeks to minimize \[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2 + \lambda(\hat{\beta}_1^2 + \hat{\beta}_2^2).\]
By taking the derivatives of the above expression with respect to \(\hat{\beta}_1\) and \(\hat{\beta}_2\) and setting them equal to \(0\), we obtain respectively \[\hat{\beta}_1(x_1^2 + x_2^2 + \lambda) + \hat{\beta}_2(x_1^2 + x_2^2) = y_1x_1 + y_2x_2\] and \[\hat{\beta}_1(x_1^2 + x_2^2) + \hat{\beta}_2(x_1^2 + x_2^2 + \lambda) = y_1x_1 + y_2x_2.\] By subtracting the two expressions above we get \(\hat{\beta}_1 = \hat{\beta}_2\).
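As a sanity check, we can solve this \(2\times 2\) system numerically for arbitrary values satisfying the constraints of the exercise (the values below are illustrative, not part of the problem):
x1 <- 2; x2 <- -2          # x11 = x12 = x1, x21 = x22 = x2, with x1 + x2 = 0
y1 <- 3; y2 <- -y1         # y1 + y2 = 0
lambda <- 1.5
a <- x1^2 + x2^2
A <- matrix(c(a + lambda, a, a, a + lambda), 2, 2)  # coefficients of the two normal equations
solve(A, rep(y1 * x1 + y2 * x2, 2))                 # both entries are equal: beta1hat = beta2hat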
According to this setting (\(x_{11} = x_{12} = x_1\) and \(x_{21} = x_{22} = x_2\)), the lasso optimization problem seeks to minimize \[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2 + \lambda(|\hat{\beta}_1| + |\hat{\beta}_2|).\]
We will use the alternative form of the lasso optimization problem \[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2\text{ subject to }|\hat{\beta}_1| + |\hat{\beta}_2|\le s.\] Geometrically, the lasso constraint takes the form of a diamond centered at the origin of the \((\hat{\beta}_1,\hat{\beta}_2)\) plane, intersecting the axes at a distance \(s\) from the origin. Using the setting of this problem (\(x_{11} = x_{12} = x_1\), \(x_{21} = x_{22} = x_2\), \(x_1 + x_2 = 0\) and \(y_1 + y_2 = 0\)), the objective reduces to \[2\bigl[y_1 - (\hat{\beta}_1 + \hat{\beta}_2)x_1\bigr]^2\ge 0.\] The unconstrained minimizer satisfies \(\hat{\beta}_1 + \hat{\beta}_2 = y_1/x_1\), which geometrically is a line parallel to one pair of edges of the constraint diamond. The contours of the objective \(\bigl[y_1 - (\hat{\beta}_1 + \hat{\beta}_2)x_1\bigr]^2\) are also lines of this form, so when the constraint is binding the objective is minimized along an entire edge of the diamond: the edge \(\hat{\beta}_1 + \hat{\beta}_2 = s\) (or the edge \(\hat{\beta}_1 + \hat{\beta}_2 = -s\)). Thus the lasso optimization problem has a whole set of solutions instead of a unique one: \[\{(\hat{\beta}_1,\hat{\beta}_2) : \hat{\beta}_1 + \hat{\beta}_2 = s,\ \hat{\beta}_1,\hat{\beta}_2\ge 0\}\quad\text{or}\quad\{(\hat{\beta}_1,\hat{\beta}_2) : \hat{\beta}_1 + \hat{\beta}_2 = -s,\ \hat{\beta}_1,\hat{\beta}_2\le 0\},\] depending on the sign of \(y_1/x_1\).
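A small numerical illustration of this non-uniqueness (again with arbitrary values satisfying the constraints of the problem; the helper lasso.obj is ours): every pair on the edge \(\hat{\beta}_1 + \hat{\beta}_2 = s\) with both coordinates non-negative gives the same value of the penalized lasso criterion.
x1 <- 2; x2 <- -2; y1 <- 3; y2 <- -3; lambda <- 1.5
lasso.obj <- function(b1, b2)
  (y1 - b1 * x1 - b2 * x1)^2 + (y2 - b1 * x2 - b2 * x2)^2 + lambda * (abs(b1) + abs(b2))
s <- 1
b1.grid <- seq(0, s, 0.25)                           # points along the edge b1 + b2 = s, b1, b2 >= 0
sapply(b1.grid, function(b1) lasso.obj(b1, s - b1))  # identical values along the whole edge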
Q6. We will now explore (6.12) and (6.13) further.
y <- 3
lambda <- 2
beta <- seq(-10, 10, 0.1)
# Ridge criterion for a single observation y and a single coefficient beta
plot(beta, (y - beta)^2 + lambda * beta^2, pch = 20, xlab = "beta", ylab = "Ridge optimization")
# Closed-form minimizer: beta = y / (1 + lambda)
beta.est <- y / (1 + lambda)
points(beta.est, (y - beta.est)^2 + lambda * beta.est^2, col = "red", pch = 4, lwd = 5)
We may see that the function is minimized at \(\beta = y / (1 + \lambda)\).
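This agrees with the closed-form minimizer: setting the derivative of the one-parameter ridge criterion to zero gives \[\frac{d}{d\beta}\Bigl[(y - \beta)^2 + \lambda\beta^2\Bigr] = -2(y - \beta) + 2\lambda\beta = 0 \quad\Longrightarrow\quad \hat{\beta} = \frac{y}{1 + \lambda}.\]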
y <- 3
lambda <- 2
beta <- seq(-10, 10, 0.1)
# Lasso criterion for a single observation y and a single coefficient beta
plot(beta, (y - beta)^2 + lambda * abs(beta), pch = 20, xlab = "beta", ylab = "Lasso optimization")
# Closed-form minimizer when y > lambda / 2: beta = y - lambda / 2
beta.est <- y - lambda / 2
points(beta.est, (y - beta.est)^2 + lambda * abs(beta.est), col = "red", pch = 4, lwd = 5)
We may see that the function is minimized at \(\beta = y - \lambda/2\) as \(y > \lambda / 2\).
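This also agrees with the closed-form solution: for \(\beta > 0\) the criterion is \((y - \beta)^2 + \lambda\beta\), and setting its derivative to zero gives \[-2(y - \beta) + \lambda = 0 \quad\Longrightarrow\quad \hat{\beta} = y - \frac{\lambda}{2},\] which is indeed positive here because \(y > \lambda/2\). (The remaining cases of the soft-thresholding solution are \(\hat{\beta} = y + \lambda/2\) when \(y < -\lambda/2\) and \(\hat{\beta} = 0\) when \(|y| \le \lambda/2\).)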
Q8. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100)
b0 <- 2
b1 <- 3
b2 <- -1
b3 <- 0.5
y <- b0 + b1 * x + b2 * x^2 + b3 * x^3 + eps
library(leaps)
data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)
par(mfrow = c(2, 2))
plot(reg.summary$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
We find that \(C_p\), BIC and adjusted \(R^2\) all select the 3-variable model.
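The selected model sizes can also be confirmed numerically; these calls simply report the locations of the optima highlighted in the plots above.
which.min(reg.summary$cp)
which.min(reg.summary$bic)
which.max(reg.summary$adjr2)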
coef(regfit.full, which.max(reg.summary$adjr2))
## (Intercept) x I(x^2) I(x^5)
## 2.07219472 3.44514720 -1.15676236 0.09022577
We begin with forward stepwise selection.
regfit.fwd <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10, method = "forward")
reg.summary.fwd <- summary(regfit.fwd)
par(mfrow = c(2, 2))
plot(reg.summary.fwd$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary.fwd$cp), reg.summary.fwd$cp[which.min(reg.summary.fwd$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary.fwd$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary.fwd$bic), reg.summary.fwd$bic[which.min(reg.summary.fwd$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary.fwd$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary.fwd$adjr2), reg.summary.fwd$adjr2[which.max(reg.summary.fwd$adjr2)], col = "red", cex = 2, pch = 20)
mtext("Plots of C_p, BIC and adjusted R^2 for forward stepwise selection", side = 3, line = -2, outer = TRUE)
We find that, for forward stepwise selection, \(C_p\), BIC and adjusted \(R^2\) all select the 3-variable model.
coef(regfit.fwd, which.max(reg.summary.fwd$adjr2))
## (Intercept) x I(x^2) I(x^5)
## 2.07219472 3.44514720 -1.15676236 0.09022577
Next we proceed with backward stepwise selection.
regfit.bwd <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10, method = "backward")
reg.summary.bwd <- summary(regfit.bwd)
par(mfrow = c(2, 2))
plot(reg.summary.bwd$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary.bwd$cp), reg.summary.bwd$cp[which.min(reg.summary.bwd$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary.bwd$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary.bwd$bic), reg.summary.bwd$bic[which.min(reg.summary.bwd$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary.bwd$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary.bwd$adjr2), reg.summary.bwd$adjr2[which.max(reg.summary.bwd$adjr2)], col = "red", cex = 2, pch = 20)
mtext("Plots of C_p, BIC and adjusted R^2 for backward stepwise selection", side = 3, line = -2, outer = TRUE)
We find that, for backward stepwise selection, \(C_p\), BIC and adjusted \(R^2\) all select the 3-variable model.
coef(regfit.bwd, which.max(reg.summary.bwd$adjr2))
## (Intercept) x I(x^2) I(x^5)
## 2.07219472 3.44514720 -1.15676236 0.09022577
So forward stepwise, backward stepwise and best subset selection all pick the same three-variable model, built from \(X\), \(X^2\) and \(X^5\); note that the true model was generated with \(X^3\) rather than \(X^5\).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-8
xmat <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[, -1]
cv.lasso <- cv.glmnet(xmat, y, alpha = 1)
plot(cv.lasso)
bestlam <- cv.lasso$lambda.min
bestlam
## [1] 0.03795616
Now we refit our lasso model using the value \(\lambda =\) 0.0379562 chosen by cross-validation.
fit.lasso <- glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = bestlam, type = "coefficients")[1:11, ]
## (Intercept) x I(x^2) I(x^3) I(x^4) I(x^5)
## 2.04091434 3.28373245 -1.10646862 0.14042235 0.00000000 0.06399454
## I(x^6) I(x^7) I(x^8) I(x^9) I(x^10)
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
The lasso picks \(X\), \(X^2\), \(X^3\) and \(X^5\) as variables for the model, although the coefficients on \(X^3\) and \(X^5\) are small.
We begin with best subset selection.
b7 <- 7
y <- b0 + b7 * x^7 + eps
data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)
par(mfrow = c(2, 2))
plot(reg.summary$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
We find that with \(C_p\) we pick the 2-variable model, with BIC the 1-variable model, and with adjusted \(R^2\) the 4-variable model.
coef(regfit.full, 1)
## (Intercept) I(x^7)
## 1.95894 7.00077
coef(regfit.full, 2)
## (Intercept) I(x^2) I(x^7)
## 2.0704904 -0.1417084 7.0015552
coef(regfit.full, 4)
## (Intercept) x I(x^2) I(x^3) I(x^7)
## 2.0762524 0.2914016 -0.1617671 -0.2526527 7.0091338
Here best subset selection with BIC picks the 1-variable model containing only \(X^7\), with coefficient estimates close to the true values \(\beta_0 = 2\) and \(\beta_7 = 7\).
Now we proceed with the lasso.
xmat <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[, -1]
cv.lasso <- cv.glmnet(xmat, y, alpha = 1)
bestlam <- cv.lasso$lambda.min
bestlam
## [1] 12.36884
fit.lasso <- glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = bestlam, type = "coefficients")[1:11, ]
## (Intercept) x I(x^2) I(x^3) I(x^4) I(x^5)
## 2.820215 0.000000 0.000000 0.000000 0.000000 0.000000
## I(x^6) I(x^7) I(x^8) I(x^9) I(x^10)
## 0.000000 6.796694 0.000000 0.000000 0.000000
Here the lasso also selects the 1-variable model containing only \(X^7\), but its intercept estimate (2.82) is noticeably further from the true value of 2.
Q9. In this exercise, we will predict the number of applications received using the other variables in the “College” data set.
library(ISLR)
data(College)
set.seed(11)
train <- sample(1:nrow(College), nrow(College) / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
fit.lm <- lm(Apps ~ ., data = College.train)
pred.lm <- predict(fit.lm, College.test)
mean((pred.lm - College.test$Apps)^2)
## [1] 1538442
The test MSE for least squares is about \(1.54 \times 10^6\).
train.mat <- model.matrix(Apps ~ ., data = College.train)
test.mat <- model.matrix(Apps ~ ., data = College.test)
grid <- 10 ^ seq(4, -2, length = 100)
fit.ridge <- glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge <- cv.ridge$lambda.min
bestlam.ridge
## [1] 18.73817
pred.ridge <- predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
mean((pred.ridge - College.test$Apps)^2)
## [1] 1608859
The test MSE is higher for ridge regression than for least squares.
fit.lasso <- glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso <- cv.lasso$lambda.min
bestlam.lasso
## [1] 21.54435
pred.lasso <- predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
mean((pred.lasso - College.test$Apps)^2)
## [1] 1635280
The test MSE is also higher for the lasso than for least squares.
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -836.50402310
## (Intercept) .
## PrivateYes -385.73749394
## Accept 1.17935134
## Enroll .
## Top10perc 22.70211938
## Top25perc .
## F.Undergrad 0.07062149
## P.Undergrad 0.01366763
## Outstate -0.03424677
## Room.Board 0.01281659
## Books -0.02167770
## Personal .
## PhD -1.46396964
## Terminal -5.17281004
## S.F.Ratio 5.70969524
## perc.alumni -9.95007567
## Expend 0.14852541
## Grad.Rate 5.79789861
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")
pred.pcr <- predict(fit.pcr, College.test, ncomp = 10)
mean((pred.pcr - College.test$Apps)^2)
## [1] 3014496
The test MSE is also higher for PCR than for least squares.
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
pred.pls <- predict(fit.pls, College.test, ncomp = 10)
mean((pred.pls - College.test$Apps)^2)
## [1] 1508987
Here, the test MSE is lower for PLS than for least squares.
To compare the results obtained above, we have to compute the test \(R^2\) for all models.
test.avg <- mean(College.test$Apps)
lm.r2 <- 1 - mean((pred.lm - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2 <- 1 - mean((pred.ridge - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2 <- 1 - mean((pred.lasso - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2 <- 1 - mean((pred.pcr - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2 <- 1 - mean((pred.pls - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
So the test \(R^2\) is 0.9044 for least squares, 0.9001 for ridge, 0.8984 for the lasso, 0.8127 for PCR and 0.9063 for PLS. All models except PCR predict the number of college applications with high and comparable accuracy.
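For a quick visual comparison, we can also plot the five test \(R^2\) values side by side (a simple base-R bar plot):
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2),
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
        ylab = "Test R-squared", ylim = c(0, 1))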
Q10. We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
set.seed(1)
x <- matrix(rnorm(1000 * 20), 1000, 20)
b <- rnorm(20)
b[3] <- 0
b[4] <- 0
b[9] <- 0
b[19] <- 0
b[10] <- 0
eps <- rnorm(1000)
y <- x %*% b + eps
train <- sample(seq(1000), 100, replace = FALSE)
test <- -train
x.train <- x[train, ]
x.test <- x[test, ]
y.train <- y[train]
y.test <- y[test]
data.train <- data.frame(y = y.train, x = x.train)
regfit.full <- regsubsets(y ~ ., data = data.train, nvmax = 20)
train.mat <- model.matrix(y ~ ., data = data.train)
val.errors <- rep(NA, 20)
for (i in 1:20) {
  coefi <- coef(regfit.full, id = i)
  pred <- train.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((pred - y.train)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Training MSE", pch = 19, type = "b")
data.test <- data.frame(y = y.test, x = x.test)
test.mat <- model.matrix(y ~ ., data = data.test)
val.errors <- rep(NA, 20)
for (i in 1:20) {
  coefi <- coef(regfit.full, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((pred - y.test)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b")
which.min(val.errors)
## [1] 14
The 14-variable model has the smallest test MSE.
coef(regfit.full, which.min(val.errors))
## (Intercept) x.1 x.2 x.5 x.7 x.8
## 0.1630514 0.3707851 0.3176740 1.0424379 -1.2895997 0.8308835
## x.11 x.12 x.13 x.14 x.15 x.16
## 0.6919104 0.5638802 -0.3641320 -0.8346409 -0.5667810 -0.1959694
## x.17 x.18 x.20
## 0.3128194 1.5567459 -0.7831598
The best model excludes every predictor whose true coefficient was set to zero (x.3, x.4, x.9, x.10, x.19), although it also drops x.6, whose true coefficient was not zeroed out.
val.errors <- rep(NA, 20)
x_cols <- colnames(x, do.NULL = FALSE, prefix = "x.")
for (i in 1:20) {
  coefi <- coef(regfit.full, id = i)
  # Euclidean distance between the true coefficients and the estimates,
  # treating predictors excluded from the model as having coefficient zero
  val.errors[i] <- sqrt(sum((b[x_cols %in% names(coefi)] - coefi[names(coefi) %in% x_cols])^2) +
                          sum(b[!(x_cols %in% names(coefi))]^2))
}
plot(val.errors, xlab = "Number of coefficients", ylab = "Error between estimated and true coefficients", pch = 19, type = "b")
We may see that the error between the estimated and true coefficients is minimized by a much smaller model than the 14-variable model that minimizes the test MSE. So a closer fit to the true coefficients does not necessarily mean a lower test MSE.
Q11. We will now try to predict per capita crime rate in the “Boston” data set.
We begin with best subset selection.
library(MASS)
data(Boston)
set.seed(1)
predict.regsubsets <- function(object, newdata, id, ...) {
  # Build the model matrix for the new data and multiply it by the
  # coefficients of the best model of size `id`.
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
k <- 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
  for (i in 1:13) {
    pred <- predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
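The size of the selected model and its cross-validation error can also be read off directly:
which.min(mean.cv.errors)
mean.cv.errors[which.min(mean.cv.errors)]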
We see that cross-validation selects the 12-variable model, with a CV estimate of the test MSE of about 41.03.
Next we proceed with the lasso.
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
cv.out <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out)
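The selected \(\lambda\) and the corresponding cross-validation estimate of the MSE can be extracted from the cv.glmnet object (the same calls apply to the ridge fit below):
cv.out$lambda.min
cv.out$cvm[cv.out$lambda == cv.out$lambda.min]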
Here cross-validation selects \(\lambda \approx 0.047\), with a CV estimate of the test MSE of about 42.13.
Next we proceed with ridge regression.
cv.out <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out)
Here cross-validation selects \(\lambda \approx 0.54\), with a CV estimate of the test MSE of about 42.98.
Finally, we proceed with PCR.
pcr.fit <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.170 7.163 6.733 6.728 6.740 6.765
## adjCV 8.61 7.169 7.162 6.730 6.723 6.737 6.760
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.760 6.634 6.652 6.642 6.652 6.607 6.546
## adjCV 6.754 6.628 6.644 6.635 6.643 6.598 6.536
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.45 95.40 97.04 98.46 99.52 100.0
## crim 42.47 42.55 42.78 43.04 44.13 45.4
validationplot(pcr.fit, val.type = "MSEP")
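The cross-validation error on the MSE scale (the summary above reports RMSEP) can be obtained by squaring the CV RMSEP or, equivalently, with pls::MSEP:
MSEP(pcr.fit, estimate = "CV")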
Here cross-validation selects \(M = 13\), i.e. all components, so no dimension reduction is achieved. The corresponding CV estimate of the test MSE is about \(6.546^2 \approx 42.8\).
As computed above, the model with the lowest cross-validation error is the one chosen by best subset selection.
No. The model chosen by best subset selection uses only 12 of the 13 predictors.