library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
library(pls)
## Warning: package 'pls' was built under R version 4.0.4
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(miscTools)
## Warning: package 'miscTools' was built under R version 4.0.4
library(MASS)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.4
QUESTION 2: For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Reason: Lasso will limit predictors’ coefficients to 0 which reduces variance, and a small increase in bias. We get a better error rate when increase in bias is less relative to the decrease in variance. (b) Repeat (a) for ridge regression relative to least squares. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Reason: Similar to (a), Ridge regression will limit the correfficients by shrinking them (as opposed to 0 in Lasso). This will reduce variance but have a high bias. It is less flexible and best in situations where OLS estimates have a high variance (c) Repeat (a) for non-linear methods relative to least squares. ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Reason: This is non linear and hence, more flexible. We will have increased accuracy when increase in variance is less relative to decrease in bias.
QUESTION 9: In this exercise, we will predict the number of applications received using the other variables in the College data set.
data("College")
set.seed(1)
train <- sample(1:nrow(College), nrow(College)/2)
test <- (-train)
coll.train <- College[train,]
coll.test <- College[test,]
coll.lm <- lm(Apps ~ ., coll.train)
lm.pred <- predict(coll.lm, coll.test)
mean((coll.test$Apps - lm.pred)^2)
## [1] 1135758
Test error is 1135758
set.seed(1)
mat.train <- model.matrix(Apps ~ ., coll.train)
mat.test <- model.matrix(Apps ~ ., coll.test)
grid <- 10^seq(10, -2, length = 100)
coll.cvridge <- cv.glmnet(mat.train, coll.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
coll.bestlamridge <- coll.cvridge$lambda.min
coll.bestlamridge
## [1] 0.01
ridge.pred = predict(coll.cvridge, mat.test, s = coll.bestlamridge)
mean((ridge.pred - coll.test$Apps)^2)
## [1] 1135714
Test error is 1135714 which is slightly better than LM test error of 1135758
set.seed(1)
mat.train <- model.matrix(Apps ~ ., coll.train)
mat.test <- model.matrix(Apps ~ ., coll.test)
grid <- 10^seq(10, -2, length = 100)
coll.lasso = cv.glmnet(mat.train, coll.train$Apps, alpha=1, lambda = grid, thresh=1e-12)
coll.lassobestlam = coll.lasso$lambda.min
coll.lassobestlam
## [1] 0.01
lasso.pred <- predict(coll.lasso, mat.test, s = coll.lassobestlam)
mean((lasso.pred - coll.test$Apps)^2)
## [1] 1135659
The test error is 1135659 which is lower and better compared to both ridge and LM models
predict(coll.lasso, s = coll.lassobestlam, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -7.900392e+02
## (Intercept) .
## PrivateYes -3.070090e+02
## Accept 1.779328e+00
## Enroll -1.469512e+00
## Top10perc 6.672205e+01
## Top25perc -2.230441e+01
## F.Undergrad 9.259049e-02
## P.Undergrad 9.408441e-03
## Outstate -1.083495e-01
## Room.Board 2.115144e-01
## Books 2.912138e-01
## Personal 6.120206e-03
## PhD -1.547169e+01
## Terminal 6.409278e+00
## S.F.Ratio 2.282635e+01
## perc.alumni 1.130513e+00
## Expend 4.856698e-02
## Grad.Rate 7.488072e+00
set.seed(1)
coll.pcr = pcr(Apps ~ ., data = coll.train, scale = TRUE, validation = 'CV')
validationplot(coll.pcr, val.type = 'MSEP')
summary(coll.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4006 2373 2372 2069 1961 1919
## adjCV 4288 4007 2368 2369 1999 1948 1911
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1919 1921 1876 1832 1832 1836 1837
## adjCV 1912 1915 1868 1821 1823 1827 1827
## 14 comps 15 comps 16 comps 17 comps
## CV 1853 1759 1341 1270
## adjCV 1850 1733 1326 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
Lowest CV error at M = 10 hence use ncomp = 10
pcr.pred <- predict(coll.pcr, coll.test, ncomp = 10)
mean((pcr.pred - coll.test$Apps)^2)
## [1] 1723100
The test error rate 1723100 is much higher compared to all the models above
set.seed(1)
coll.pls <- plsr(Apps ~ ., data = coll.train, scale=TRUE, validation='CV')
validationplot(coll.pls)
pls.pred <- predict(coll.pls, coll.test, ncomp = 10)
mean((pls.pred - coll.test$Apps)^2)
## [1] 1131661
The test error is 1131661 which is better than 1723100 from pcr above
LM 1135758 Ridge 1135714 Lasso 1135659 PCR 1723100 PLS 1131661
PCR has the worst MSE PLS has the best MSE in comparison, the MSEs of PLS, Lasso, Ridge, and LM are similar and not much different, however, PLS has the best MSE of all followed by Lasso
rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - lm.pred))
## [,1]
## [1,] 0.9015413
rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - ridge.pred))
## 1
## 1 0.9015452
rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - lasso.pred))
## 1
## 1 0.9015499
rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - pcr.pred))
## [,1]
## [1,] 0.8506248
rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - pls.pred))
## [,1]
## [1,] 0.9018965
LM 0.9015413 Ridge 0.9015452 Lasso 0.9015499 PCR 0.8506248 PLS 0.9018965
PCR has the worst R squared and also the highest MSE making PCR PCR has an R squared of 85.06 which means it only explains 85.06% of variance in apps
While the R-squared for the rest of the models is similar, PLS has the best R-squared nd it also has the lowest MSE.
QUESTION 11: We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
data("Boston")
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- (-train)
bos.train <- Boston[train,]
bos.test <- Boston[test,]
Best Subset Selection
set.seed(1)
predict.regsubsets <- function(object, newdata, id, ...)
{
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:p)))
for (j in 1:k)
{
best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = p)
for (i in 1:p)
{
pred <- predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
The lowest CV error is 12, and the MSE at 12 is 42.46014
LM
bos.lm <- lm(crim ~ ., bos.train)
boslm.pred <- predict(bos.lm, bos.test)
mean((bos.test$crim - boslm.pred)^2)
## [1] 41.54639
MSE for LM is 41.54639
Lasso
set.seed(1)
bosmat.train <- model.matrix(crim ~ ., bos.train)
bosmat.test <- model.matrix(crim ~ ., bos.test)
grid <- 10^seq(10, -2, length = 100)
bos.lasso = cv.glmnet(bosmat.train, bos.train$crim, alpha=1, lambda = grid, thresh=1e-12)
bos.lassobestlam = bos.lasso$lambda.min
bos.lassobestlam
## [1] 0.05336699
boslasso.pred <- predict(bos.lasso, bosmat.test, s = bos.lassobestlam)
mean((boslasso.pred - bos.test$crim)^2)
## [1] 40.97987
MSE for Lasso is 40.97987
Ridge
set.seed(1)
bosmat.train <- model.matrix(crim ~ ., bos.train)
bosmat.test <- model.matrix(crim ~ ., bos.test)
grid <- 10^seq(10, -2, length = 100)
bos.cvridge <- cv.glmnet(bosmat.train, bos.train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
bos.bestlamridge <- bos.cvridge$lambda.min
bos.bestlamridge
## [1] 0.2848036
bosridge.pred = predict(bos.cvridge, bosmat.test, s = bos.bestlamridge)
mean((bosridge.pred - bos.test$crim)^2)
## [1] 41.10444
MSE for Ridge is 41.10444
PCR
set.seed(1)
bos.pcr = pcr(crim ~ ., data = bos.train, scale = TRUE, validation = 'CV')
validationplot(bos.pcr, val.type = 'MSEP')
summary(bos.pcr)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.555 7.549 7.093 6.926 6.932 6.977
## adjCV 9.275 7.550 7.544 7.088 6.920 6.926 6.968
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.976 6.871 6.845 6.848 6.797 6.764 6.700
## adjCV 6.967 6.859 6.843 6.842 6.786 6.751 6.686
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.51 60.4 69.86 77.08 82.80 87.68 91.24 93.56
## crim 34.94 35.2 42.83 45.47 45.57 45.58 45.75 47.59
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.47 97.08 98.48 99.54 100.00
## crim 47.68 48.75 49.31 50.14 51.37
bospcr.pred <- predict(bos.pcr, bos.test, ncomp = 13)
mean((bospcr.pred - bos.test$crim)^2)
## [1] 41.54639
The MSE for PCR with M = 13 is 41.54639
bospcr.pred <- predict(bos.pcr, bos.test, ncomp = 4)
mean((bospcr.pred - bos.test$crim)^2)
## [1] 43.91107
The MSE for PCR with M = 4 is 43.91107
MSE summary Best Subset 42.46014 LM 41.54639 Lasso 40.97987 Ridge 41.10444 PCR 41.54639
Lasso has the lowest MSE Best Subset model has the highest MSE
rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - boslm.pred))
## [,1]
## [1,] 0.3297973
rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - bosridge.pred))
## 1
## 1 0.3369265
rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - boslasso.pred))
## 1
## 1 0.3389361
rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - bospcr.pred))
## [,1]
## [1,] 0.2916516
The best model out of all is lasso in terms of MSE since it has the lowest MSE The best model in terms of R-squared is also Lasso with the highest R-squared of 33.89361
as.matrix(coef(bos.lasso, bos.lasso$lambda.min))
## 1
## (Intercept) 18.7499830804
## (Intercept) 0.0000000000
## zn 0.0364732124
## indus -0.1248293847
## chas -0.4697623428
## nox -8.1724091087
## rm 0.1123257531
## age -0.0005500647
## dis -0.8323606739
## rad 0.5306677665
## tax 0.0000000000
## ptratio -0.3791108368
## black -0.0129846301
## lstat 0.2578373987
## medv -0.1591511818
Lasso does not use all features. Lasso specializes in feature reduction. Tax has a coefficient of 0 which means 12 out of 13 features are used and tax is dropped out of the model