Chapter 6: 2, 9, 11
a) The lasso, relative to least squares, is:
iii - less flexible, decrease in variance, increase in bias
b) Repeat a) for ridge regression relative to least squares.
iii - less flexible, decrease in variance, increase in bias
c) Repeat a) for non-linear methods relative to least squares.
ii - more flexible, increase in variance, decrease in bias
a) Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
set.seed(11)
data(College)
train = sample(c(TRUE, FALSE), length(College$Apps), rep=TRUE)
test = !train
college.train = College[train,]
college.test = College[test,]
b) Fit a linear model using least squares on the training set, and report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Loaded glmnet 4.1-2
lm.fit = lm(Apps~., data = college.train)
lm.pred = predict(lm.fit, college.test)
err.lm = mean((lm.pred - college.test$Apps)^2)
err.lm
## [1] 1512942
c) Fit a ridge regression model on the training set, with a lambda chosen by cross-validation. Report the test error obtained
train.mat = model.matrix(Apps~., data = college.train)
test.mat = model.matrix(Apps~., data = college.test)
grid = 10^seq(4,-2,length=100)
ridge.fit = glmnet(train.mat, college.train$Apps, alpha=0, lambda=grid, thresh= 1e-12)
cv.ridge = cv.glmnet(train.mat, college.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
ridge.bestlam = cv.ridge$lambda.min
ridge.bestlam
## [1] 65.79332
ridge.pred = predict(ridge.fit, s = ridge.bestlam, newx=test.mat)
err.ridge = mean((ridge.pred - college.test$Apps)^2)
err.ridge
## [1] 1762771
d) Fit a lasso model on the training set, with lambda chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.fit = glmnet(train.mat, college.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso = cv.glmnet(train.mat, college.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso = cv.lasso$lambda.min
bestlam.lasso
## [1] 16.29751
lasso.pred = predict(lasso.fit, s=bestlam.lasso, newx = test.mat)
err.lasso = mean((lasso.pred- college.test$Apps)^2)
err.lasso
## [1] 1606104
lasso.coef = predict(lasso.fit, s= bestlam.lasso, type = "coefficients")
length(lasso.coef[lasso.coef!= 0])
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] 16
e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.0.5
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit = pcr(Apps~., data = college.train, sclae = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit)
## Data: X dimension: 390 17
## Y dimension: 390 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3599 3657 1458 1413 1382 1123 1098
## adjCV 3599 3667 1456 1411 1385 1117 1094
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1091 1099 1107 1071 1047 1049 1050
## adjCV 1087 1095 1102 1066 1042 1044 1045
## 14 comps 15 comps 16 comps 17 comps
## CV 1053 1031 1032 1024
## adjCV 1048 1026 1027 1019
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.987 88.60 96.41 97.72 98.77 99.48 99.93 99.97
## Apps 0.899 84.26 85.30 86.23 91.33 91.78 91.87 91.93
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 91.94 92.45 92.78 92.78 92.82 92.89 93.15
## 16 comps 17 comps
## X 100.00 100.00
## Apps 93.19 93.34
pcr.pred = predict(pcr.fit, college.test, ncomp = 5)
err.pcr = mean((pcr.pred-college.test$Apps)^2)
err.pcr
## [1] 2085998
f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along the the value of M selected by cross-validation.
library(pls)
set.seed(11)
pls.fit = plsr(Apps~., data=college.train, scale=T, validation="CV")
validationplot(pls.fit, val.type="MSEP")
summary(pls.fit)
## Data: X dimension: 390 17
## Y dimension: 390 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3599 1503 1191 1150 1136 1123 1081
## adjCV 3599 1501 1181 1151 1130 1115 1071
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1070 1056 1048 1050 1058 1060 1054
## adjCV 1062 1050 1042 1043 1050 1052 1047
## 14 comps 15 comps 16 comps 17 comps
## CV 1054 1054 1055 1055
## adjCV 1047 1047 1047 1047
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.02 34.05 62.16 66.16 70.27 72.65 76.94 80.87
## Apps 83.31 90.04 90.74 91.68 92.36 93.11 93.17 93.19
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.40 85.4 87.24 91.11 93.48 96.07 97.43
## Apps 93.25 93.3 93.33 93.33 93.34 93.34 93.34
## 16 comps 17 comps
## X 98.41 100.00
## Apps 93.34 93.34
pls.pred = predict(pls.fit, college.test, ncomp=9)
err.pls = mean((college.test$Apps - pls.pred)^2)
err.pls
## [1] 1523609
g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting form these five approaches?
err.all = c(err.lm, err.ridge, err.lasso, err.pcr, err.pls)
barplot(err.all, xlab="Models", ylab="Test MSE", names=c("lm", "ridge", "lasso", "pcr", "pls"))
test.avg <- mean(college.test$Apps)
lm.r2 <- 1 - mean((lm.pred - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)
ridge.r2 <- 1 - mean((ridge.pred - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)
lasso.r2 <- 1 - mean((lasso.pred - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)
pcr.r2 <- 1 - mean((pcr.pred - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)
pls.r2 <- 1 - mean((pls.pred - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)
lm.r2
## [1] 0.9112304
ridge.r2
## [1] 0.8965721
lasso.r2
## [1] 0.9057643
pcr.r2
## [1] 0.8776072
pls.r2
## [1] 0.9106046
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), xlab="Models", ylab="R2",
names=c("lm", "ridge", "lasso", "pcr", "pls"))
a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
## Best subset
library(MASS)
library("leaps")
## Warning: package 'leaps' was built under R version 4.0.5
library(glmnet)
data(Boston)
set.seed(0)
regfit.full=regsubsets(crim~., data=Boston, nvmax=13)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id= id)
xvars = names(coefi)
mat[, xvars] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(1:k, nrow(Boston), replace=TRUE)
cv.errors = matrix(NA, k, p, dimnames = list(NULL, paste(1:p)))
for (i in 1:k) {
best.fit = regsubsets(crim~., data=Boston[folds != i, ], nvmax =p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i,j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
mean.cv.errors = apply(cv.errors, 2, mean)
min =which.min(mean.cv.errors)
err.subset = mean.cv.errors[min]
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
points(which.min(mean.cv.errors), mean.cv.errors[which.min(mean.cv.errors)], col ="red", cex =2, pch = 20)
err.subset
## 12
## 42.77213
n = dim(Boston)[1]
p = dim(Boston)[2]
train = sample(c(TRUE, FALSE), n, rep = TRUE)
test = (!train)
boston.train = Boston[train, ]
boston.test = Boston[test, ]
#Ridge Regression
y = boston.train$crim
mat = model.matrix(crim~., data = boston.train)
cv.out = cv.glmnet(mat, y, alpha = 0)
plot(cv.out)
bestlam.ridge = cv.out$lambda.min
ridge.mod = glmnet(mat, y, alpha = 0)
y.hat = predict(ridge.mod, s = bestlam.ridge, newx = model.matrix(crim~., data = boston.test))
err.ridge = mean((boston.test$crim - y.hat)^2)
err.ridge
## [1] 53.75521
set.seed(0)
cv.out = cv.glmnet(mat, y, alpha =1)
plot(cv.out)
bestlam.lasso = cv.out$lambda.min
lasso.mod = glmnet(mat, y, alpha =1)
y.hat = predict(lasso.mod, s=bestlam.lasso, newx = model.matrix(crim~., data = boston.test))
err.lasso = mean((boston.test$crim - y.hat)^2)
err.lasso
## [1] 53.73129
predict(lasso.mod, type = "coefficients", s = bestlam.lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 15.0702060387
## (Intercept) .
## zn 0.0214023242
## indus -0.0278209284
## chas -0.5299396698
## nox -5.4014855676
## rm .
## age 0.0006941891
## dis -0.5187805371
## rad 0.4386875085
## tax .
## ptratio -0.0425804107
## black -0.0211709776
## lstat .
## medv -0.1148156984
# PCR
pcr.mod = pcr(crim~., data = boston.train, scale = TRUE, validation = "CV")
validationplot(pcr.mod, val.type = "MSEP")
summary(pcr.mod)
## Data: X dimension: 245 13
## Y dimension: 245 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 7.856 6.561 6.552 6.093 6.005 6.013 6.031
## adjCV 7.856 6.555 6.548 6.081 5.993 6.004 6.017
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.044 5.977 5.980 5.969 5.988 5.943 5.850
## adjCV 6.030 5.955 5.967 5.955 5.974 5.925 5.832
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 45.49 58.25 67.99 75.57 82.12 87.16 90.74 93.02
## crim 31.84 32.07 42.82 44.38 44.53 45.57 45.57 47.10
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.03 96.91 98.35 99.56 100.00
## crim 47.24 47.49 47.49 48.78 50.44
ncomp = 4
y.hat = predict(pcr.mod, boston.test, ncomp = ncomp)
err.pcr = mean((boston.test$crim - y.hat)^2)
err.pcr
## [1] 55.5631
b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
err.all = c(err.subset, err.ridge, err.lasso, err.pcr)
barplot(err.all, xlab="Models", ylab="Test MSE", names=c("subset", "ridge", "lasso", "pcr"))
c) Does your chosen model involve all of the features in the data set? Why or why not?