For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares,
is:
- i. More flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
- ii. More flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
- iii. Less flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
- iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
Statement iii is correct. Relative to least squares, the lasso is less flexible: by shrinking the coefficient estimates (and setting some of them exactly to zero), it reduces the model's complexity, which typically increases bias but decreases variance. The lasso will therefore give improved prediction accuracy when its increase in bias is less than its decrease in variance.
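For reference, the lasso coefficient estimates minimize the residual sum of squares plus an \(\ell_1\) penalty, so larger values of \(\lambda\) shrink the coefficients more aggressively (standard notation, with \(p\) predictors and \(n\) observations):
\[
\hat{\beta}^{L} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\]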
(b) Repeat (a) for ridge regression relative to least squares.
Statement iii is correct. Like the lasso, ridge regression is a shrinkage method and is therefore less flexible than least squares. So ridge regression will yield better prediction accuracy when its increase in bias is less than its decrease in variance.
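Ridge regression uses an \(\ell_2\) penalty instead, which shrinks coefficients toward zero but does not set them exactly to zero:
\[
\hat{\beta}^{R} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\]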
(c) Repeat (a) for non-linear methods relative to least squares.
Statement ii is correct. Non-linear methods are generally more flexible than least squares because they do not assume a linear relationship and can therefore capture more complex patterns in the data. For this reason, non-linear methods will in most cases give improved prediction accuracy when their increase in variance is less than their decrease in bias. Other factors, such as the specific method used, the data set, and the tuning of hyperparameters, can of course also influence accuracy.
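The bias-variance trade-off referred to in (a) through (c) follows from the decomposition of the expected test MSE at a test observation \((x_0, y_0)\):
\[
E\big[(y_0 - \hat{f}(x_0))^2\big] = \operatorname{Var}\big(\hat{f}(x_0)\big) + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \operatorname{Var}(\varepsilon)
\]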
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
# Loading the required packages and data
library(ISLR2)   # College and Boston data sets
library(glmnet)  # ridge regression and the lasso
library(pls)     # PCR and PLS
library(leaps)   # best subset selection
College <- ISLR2::College
# Splitting the data into a training set and a test set
set.seed(13)
train_indices <- sample(1:nrow(College), nrow(College)/2)
train_df <- College[train_indices, ]
test_indices <- setdiff(1:nrow(College), train_indices)
test_df <- College[test_indices, ]
y.test <- test_df$Apps
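As a quick sanity check on the 50/50 split, the dimensions of the two sets can be inspected (a sketch; this output is not part of the original):
# Hedged sketch: verify the sizes of the training and test sets
dim(train_df)
dim(test_df)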
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
# Fitting linear model
lm.fit <- lm(Apps ~ ., data = train_df)
# Predicting on the test set and calculating the test MSE:
lm.preds <- predict(lm.fit, test_df)
lm.error <- mean((lm.preds - y.test)^2)
lm.error
## [1] 1062514
The test error (MSE) is 1,062,514.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
x <- model.matrix(Apps ~ ., data = College)[, -1]
x.train <- x[train_indices,]
y.train <- College$Apps[train_indices]
x.test <- x[test_indices, ]
# Fitting ridge regression model and finding best lambda using CV:
set.seed(13)
ridge.fit <- glmnet(x.train, y.train, alpha = 0)
ridge.cv <- cv.glmnet(x.train, y.train, alpha = 0)
ridge.best.lambda <- ridge.cv$lambda.min
# Predicting on the test set and calculating the test MSE:
ridge.preds <- predict(ridge.fit, s = ridge.best.lambda, newx = x.test)
ridge.error <- mean((ridge.preds - y.test)^2)
ridge.error
## [1] 1173593
The test error for the ridge regression is 1,173,593, somewhat higher than the MSE of the linear model.
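For completeness, one can inspect the cross-validation curve and the selected \(\lambda\); a minimal sketch (its output is not shown here):
# Hedged sketch: inspect the CV error curve and the lambda chosen by cv.glmnet
plot(ridge.cv)
ridge.best.lambda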
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# Fitting lasso model and finding best lambda using 10-fold CV:
set.seed(13)
lasso.fit <- glmnet(x.train, y.train, alpha = 1)
lasso.cv <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.best.lambda <- lasso.cv$lambda.min
# Predicting on the test set and calculating the test MSE:
lasso.preds <- predict(lasso.fit, s = lasso.best.lambda, newx = x.test)
lasso.error <- mean((lasso.preds - y.test)^2)
lasso.error
## [1] 1055879
The test error of the lasso model is 1,055,879. This is just marginally better than the linear model.
# Finding number of non-zero coefficients:
lasso.coef <- predict(lasso.fit, type = "coefficients", s = lasso.best.lambda)[2:18,]
lasso.coef
## PrivateYes Accept Enroll Top10perc Top25perc
## -354.19610642 1.56542628 -0.48622156 30.62693312 0.00000000
## F.Undergrad P.Undergrad Outstate Room.Board Books
## 0.00000000 0.05092959 -0.07080014 0.11875892 0.00000000
## Personal PhD Terminal S.F.Ratio perc.alumni
## 0.04352157 -7.07398124 -2.29263444 25.78675291 0.00000000
## Expend Grad.Rate
## 0.08482118 5.55241821
length(lasso.coef[lasso.coef != 0])
## [1] 13
Examining the output above, we find four coefficients equal to zero, leaving 13 non-zero coefficients (not including the intercept).
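As an aside, cv.glmnet also reports lambda.1se, the largest \(\lambda\) within one standard error of the minimum; using it would typically give a sparser model. A minimal sketch (output not shown in the original):
# Hedged sketch: coefficients under the one-standard-error rule (usually sparser)
lasso.coef.1se <- predict(lasso.fit, type = "coefficients", s = lasso.cv$lambda.1se)[2:18, ]
sum(lasso.coef.1se != 0)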
(e) Fit a PCR model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
set.seed(13)
pcr.fit <- pcr(Apps ~ ., data = train_df, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4290 4143 2285 2290 1938 1858 1827
## adjCV 4290 4145 2279 2286 1876 1838 1817
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1827 1775 1747 1737 1746 1742 1740
## adjCV 1819 1755 1740 1726 1738 1734 1727
## 14 comps 15 comps 16 comps 17 comps
## CV 1742 1729 1398 1331
## adjCV 1734 1714 1378 1314
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.815 56.30 63.57 69.42 74.94 80.01 84.00 87.65
## Apps 9.051 73.44 73.60 83.11 83.88 84.02 84.03 85.64
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.78 92.95 95.05 96.76 97.81 98.79 99.40
## Apps 85.72 86.03 86.17 86.32 86.87 86.99 90.83
## 16 comps 17 comps
## X 99.83 100.00
## Apps 93.69 94.06
validationplot(pcr.fit, val.type = "MSEP")
From the summary and the plot above, we see that cross-validation selects \(M=17\). However, because this results in no dimension reduction at all, I will consider \(M=4\) instead. When disregarding component counts that yield little or no dimension reduction, the fourth component is the last to produce a substantial drop in the root mean squared error of prediction (RMSEP) as well as a substantial increase (~10%) in the % of variance explained.
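As a check on this choice, one could also compute the test MSE for every number of components; a sketch (not part of the original analysis):
# Hedged sketch: test MSE as a function of the number of PCR components
pcr.mse.by.m <- sapply(1:17, function(m) mean((predict(pcr.fit, x.test, ncomp = m) - y.test)^2))
which.min(pcr.mse.by.m)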
pcr.preds <- predict(pcr.fit, x.test, ncomp = 4)
pcr.error <- mean((pcr.preds - y.test)^2)
pcr.error
## [1] 2102455
The principal component regression results in an MSE of 2,102,455, making it the poorest performing model among those tested so far.
(f) Fit a PLS model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
set.seed(13)
pls.fit <- plsr(Apps ~ ., data = train_df, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4290 2112 1887 1676 1665 1567 1461
## adjCV 4290 2104 1885 1666 1633 1532 1436
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1363 1352 1337 1334 1335 1335 1334
## adjCV 1344 1334 1320 1317 1318 1318 1316
## 14 comps 15 comps 16 comps 17 comps
## CV 1332 1332 1331 1331
## adjCV 1314 1314 1314 1314
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.73 49.37 61.82 64.43 67.98 73.18 75.79 79.75
## Apps 78.00 83.25 88.08 91.45 93.18 93.64 93.86 93.93
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 81.98 84.98 87.93 90.50 92.21 94.32 96.91
## Apps 93.99 94.02 94.04 94.05 94.05 94.06 94.06
## 16 comps 17 comps
## X 98.74 100.00
## Apps 94.06 94.06
validationplot(pls.fit, val.type = "MSEP")
Similar to PCR, the partial least squares (PLS) fit shows the lowest RMSEP at 17 components. Once again, weighing the trade-off between dimension reduction and RMSEP, \(M=7\) appears to be an appropriate choice.
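As with PCR, the test MSE across all component counts can be examined directly; a sketch (output not shown):
# Hedged sketch: test MSE for each number of PLS components
pls.mse.by.m <- sapply(1:17, function(m) mean((predict(pls.fit, x.test, ncomp = m) - y.test)^2))
which.min(pls.mse.by.m)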
pls.preds <- predict(pls.fit, x.test, ncomp = 7)
pls.error <- mean((pls.preds - y.test)^2)
pls.error
## [1] 1090075
The PLS model achieves an MSE of 1,090,075, which demonstrates performance comparable to both the linear model and the lasso model, although slightly worse.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
errors <- c(lm.error, ridge.error, lasso.error, pcr.error, pls.error)
names <- c("Linear:", "Ridge:", "Lasso:", "PCR:", "PLS:")
sorted_errors <- sort(errors)
sorted_names <- names[order(errors)]
result <- paste(sorted_names, round(sorted_errors))
print(result)
## [1] "Lasso: 1055879" "Linear: 1062514" "PLS: 1090075" "Ridge: 1173593"
## [5] "PCR: 2102455"
The lasso model exhibited the best performance, closely followed by linear regression and PLS. The ridge model performed somewhat worse, while PCR yielded by far the highest MSE. Per the calculations below, the lasso model explains roughly 90.9% of the variance in Apps, so the number of college applications received can be predicted reasonably accurately.
# Calculating TSS, RSS, and R-square:
y_mean <- mean(y.test)
TSS <- sum((y.test - y_mean)^2)
RSS <- sum((lasso.preds - y.test)^2)
R_squared <- 1 - (RSS / TSS)
R_squared
## [1] 0.9088928
We will now try to predict the per capita crime rate (crim) in the Boston data set.
(a) Try out some of the regression methods explored in
this chapter, such as best subset selection, the lasso, ridge
regression, and PCR. Present and discuss results for the approaches that
you consider.
I will test the regression methods mentioned, starting with best subset selection. First, I split the data into a training set and a test set.
boston <- ISLR2::Boston
set.seed(13)
train_indices <- sample(1:nrow(boston), nrow(boston)/2)
train_df <- boston[train_indices, ]
test_indices <- setdiff(1:nrow(boston), train_indices)
test_df <- boston[test_indices, ]
y.test <- test_df$crim
# Fitting full model
regfit.full = regsubsets(crim ~ ., data = train_df, nvmax=13)
reg.summary = summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train_df, nvmax = 13)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" "*" " " "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" "*" " " "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# Plotting performance statistics to select best subset
par(mfrow=c(2,2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
p = which.max(reg.summary$adjr2)
points(p, reg.summary$adjr2[p], col = "blue", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = 'l')
p = which.min(reg.summary$cp)
points(p, reg.summary$cp[p], col = "blue", cex = 2, pch = 20)
plot(reg.summary$bic, xlab="Number of Variables", ylab = "BIC", type = 'l')
p = which.min(reg.summary$bic)
points(p, reg.summary$bic[p], col = "blue", cex = 2, pch = 20)
We notice from the plots that the Adjusted R-Square, Cp, and BIC suggest different subset sizes. Considering all four plots, five predictors appears to be a reasonable compromise across the measures. I will now fit a linear model using the five predictors identified in the summary above the plots.
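To confirm which predictors the five-variable subset contains, the coefficients can be extracted directly from the regsubsets object; a quick sketch (its output is not shown in the original):
# Hedged sketch: coefficients of the best five-variable subset
coef(regfit.full, 5)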
lm.fit = lm(crim ~ zn+indus+dis+rad+medv, data = train_df)
lm.preds = predict(lm.fit, test_df, type = "response")
lm.subset.error <- mean((lm.preds - test_df$crim)^2)
lm.subset.error
## [1] 35.5139
The model using the best subset approach results in an MSE of 35.51. Compared with the MSE of the full model shown below (34.82), this is marginally worse; however, we excluded 7 predictors while achieving a comparable result.
full.lm.fit = lm(crim ~ ., data = train_df)
full.lm.preds = predict(full.lm.fit, test_df, type = "response")
mean((full.lm.preds - test_df$crim)^2)
## [1] 34.82423
The next regression method I will use to predict crim is
the lasso.
x <- model.matrix(crim ~ ., data = boston)[, -1]
x.train <- x[train_indices,]
y.train <- boston$crim[train_indices]
x.test <- x[test_indices, ]
# Fitting lasso model and finding best lambda using 10-fold CV:
set.seed(13)
lasso.fit <- glmnet(x.train, y.train, alpha = 1)
lasso.cv <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.best.lambda <- lasso.cv$lambda.min
# Predicting on the test set and calculating the test MSE:
lasso.preds <- predict(lasso.fit, s = lasso.best.lambda, newx = x.test)
lasso.error <- mean((lasso.preds - y.test)^2)
lasso.error
## [1] 32.88646
The MSE of the lasso model is 32.89. Next, I will assess ridge regression.
# Fitting ridge regression model and finding best lambda using CV:
set.seed(13)
ridge.fit <- glmnet(x.train, y.train, alpha = 0)
ridge.cv <- cv.glmnet(x.train, y.train, alpha = 0)
ridge.best.lambda <- ridge.cv$lambda.min
# Predicting on the test set and calculating the test MSE:
ridge.preds <- predict(ridge.fit, s = ridge.best.lambda, newx = x.test)
ridge.error <- mean((ridge.preds - y.test)^2)
ridge.error
## [1] 33.09034
Ridge regression results in a test error of 33.09. Finally, I will employ PCR to predict the per capita crime rate.
set.seed(13)
pcr.fit <- pcr(crim ~ ., data = train_df, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 253 12
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.744 8.191 8.212 7.703 7.602 7.588 7.579
## adjCV 9.744 8.186 8.205 7.694 7.582 7.581 7.572
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 7.415 7.444 7.477 7.472 7.502 7.395
## adjCV 7.407 7.433 7.465 7.458 7.483 7.374
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 50.70 63.6 73.62 80.96 87.82 90.91 93.60 95.54
## crim 30.42 30.6 38.87 40.63 40.99 41.22 43.78 44.21
## 9 comps 10 comps 11 comps 12 comps
## X 97.12 98.48 99.50 100.00
## crim 44.22 44.81 45.88 47.59
validationplot(pcr.fit, val.type = "MSEP")
Looking at the plot, we notice a local minimum at 7 components, while the lowest RMSEP occurs at 12 components. I will predict on the test set using \(M=7\) because of its better trade-off between RMSEP and dimension reduction.
pcr.preds <- predict(pcr.fit, x.test, ncomp = 7)
pcr.error <- mean((pcr.preds - y.test)^2)
pcr.error
## [1] 33.87554
The principal components regression yields an MSE of 33.88.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
rsq.lm <- 1 - (sum((lm.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
rsq.ridge <- 1 - (sum((ridge.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
rsq.lasso <- 1 - (sum((lasso.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
rsq.pcr <- 1 - (sum((pcr.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
errors <- c(lm.subset.error, ridge.error, lasso.error, pcr.error)
names <- c("Best Subset", "Ridge", "Lasso", "PCR")
rsqs <- c(rsq.lm, rsq.ridge, rsq.lasso, rsq.pcr)
sorted_errors <- sort(errors)
sorted_names <- names[order(errors)]
sorted_rsqs <- rsqs[order(errors)]
result <- paste(sorted_names, round(sorted_errors, 2), round(sorted_rsqs, 2), sep = " | ")
print("MODEL | MSE | R-SQUARE")
## [1] "MODEL | MSE | R-SQUARE"
print(result)
## [1] "Lasso | 32.89 | 0.38" "Ridge | 33.09 | 0.37"
## [3] "PCR | 33.88 | 0.36" "Best Subset | 35.51 | 0.33"
Upon examining the test errors of the evaluated models, we notice only a slight difference between the top performers, lasso and ridge, closely followed by PCR. The linear model using the best subset of predictors showed the weakest performance, though not dramatically worse than the others. The R-squared values give a similar ranking, with the lasso performing best (0.38) and the linear model worst (0.33). Overall, none of the models excel at predicting crim, but the lasso stood out with the best performance in both test error and percent of variance explained. Hence, I suggest the lasso, with \(\lambda\) chosen by cross-validation, as the preferred approach for prediction purposes; a sketch of such a final refit follows.
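Having selected the lasso, one would typically refit it on the full Boston data set with a cross-validated \(\lambda\) before using it for prediction; a minimal sketch of that final step (not part of the original analysis):
# Hedged sketch: refit the lasso on the full data with a CV-chosen lambda
x.full <- model.matrix(crim ~ ., data = boston)[, -1]
y.full <- boston$crim
set.seed(13)
final.cv <- cv.glmnet(x.full, y.full, alpha = 1)
final.lasso <- glmnet(x.full, y.full, alpha = 1, lambda = final.cv$lambda.min)
coef(final.lasso)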
(c) Does your chosen model involve all of the features in
the data set? Why or why not?
My chosen model does not involve all of the features. Although the lasso is fit on all of the predictors, it performs variable selection by shrinking some coefficients exactly to zero. As illustrated below, the model retains 7 non-zero coefficients (not including the intercept), so the remaining 5 predictors are effectively excluded from the fitted model.
# Finding number of non-zero coefficients:
lasso.coef <- predict(lasso.fit, type = "coefficients", s = lasso.best.lambda)[2:13,]
lasso.coef
## zn indus chas nox rm age
## 0.04929413 -0.11619299 -0.17398359 0.00000000 0.00000000 0.00000000
## dis rad tax ptratio lstat medv
## -0.80414243 0.52577506 0.00000000 0.00000000 0.13745178 -0.20155505
length(lasso.coef[lasso.coef != 0])
## [1] 7