Exercise 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
- i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
- ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
- iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
- iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Statement iii is correct. In comparison to least squares, the lasso is less flexible. By shrinking coefficient estimates (some exactly to zero), the lasso reduces the model's complexity, which typically increases bias but decreases variance. Therefore, the lasso, being less flexible, gives improved prediction accuracy when its increase in bias is less than its decrease in variance.
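
To make the shrinkage concrete, here is a minimal sketch on simulated data (variable names are hypothetical; the glmnet package is assumed) showing how lasso estimates are pulled toward zero relative to least squares:

# Simulated illustration (hypothetical data): lasso estimates are
# shrunk toward zero relative to the unconstrained least squares fit
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))                # only two true non-zero effects
y <- X %*% beta + rnorm(n)

ols <- coef(lm(y ~ X))                          # least squares: p + 1 free coefficients
las <- coef(glmnet(X, y, alpha = 1), s = 0.5)   # lasso at a fixed lambda
cbind(OLS = ols, Lasso = as.numeric(las))       # lasso coefficients are shrunk/zeroed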

(b) Repeat (a) for ridge regression relative to least squares.

Statement iii is correct. Like the lasso, ridge regression is a shrinkage method and is therefore less flexible than least squares. So, ridge regression will yield better prediction accuracy when its increase in bias is less than its decrease in variance.

(c) Repeat (a) for non-linear methods relative to least squares.

Statement ii is correct. Non-linear methods are generally more flexible than least squares because they can capture complex relationships without assuming a linear form. For this reason, non-linear methods will in most cases give improved prediction accuracy when their increase in variance is less than their decrease in bias. It is important to note that other factors can also influence accuracy, such as the specific method used, the data set, and the tuning of hyperparameters.
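
A small simulated example (hypothetical data; base R's smooth.spline stands in for a generic non-linear method) illustrates the point: when the true relationship is non-linear, the flexible fit wins on test error:

# When the truth is non-linear, a flexible method beats least squares
set.seed(1)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.3)
train <- 1:100

fit.lm <- lm(y ~ x, subset = train)          # rigid linear fit
fit.ss <- smooth.spline(x[train], y[train])  # flexible non-linear fit

mean((y[-train] - predict(fit.lm, data.frame(x = x[-train])))^2)  # linear test MSE
mean((y[-train] - predict(fit.ss, x[-train])$y)^2)                # spline test MSE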

Exercise 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

# Loading required packages and the College data
library(ISLR2)   # College and Boston data sets
library(glmnet)  # ridge regression and the lasso
library(pls)     # PCR and PLS
library(leaps)   # best subset selection (Exercise 11)

College <- ISLR2::College
set.seed(13)
train_indices <- sample(1:nrow(College), nrow(College)/2)
train_df <- College[train_indices, ]
test_indices <- setdiff(1:nrow(College), train_indices)
test_df <- College[test_indices, ]
y.test <- test_df$Apps

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

# Fitting linear model
lm.fit <- lm(Apps ~ ., data = train_df)

# Predicting on the test set and calculating the test MSE:
lm.preds <- predict(lm.fit, test_df)
lm.error <- mean((lm.preds - y.test)^2)
lm.error
## [1] 1062514

The test error (MSE) is 1,062,514.
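
Since the squared scale is hard to interpret, the root of the test MSE puts the error back on the scale of Apps (about 1,031 applications here):

# Test RMSE on the original scale of the response
sqrt(lm.error)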

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

x <- model.matrix(Apps ~ ., data = College)[, -1]
x.train <- x[train_indices,]
y.train <- College$Apps[train_indices]
x.test <- x[test_indices, ]

# Fitting ridge regression model and finding best lambda using CV:
set.seed(13)
ridge.fit <- glmnet(x.train, y.train, alpha = 0)
ridge.cv <- cv.glmnet(x.train, y.train, alpha = 0)
ridge.best.lambda <- ridge.cv$lambda.min

# Predicting on the test set and calculating the test MSE:
ridge.preds <- predict(ridge.fit, s = ridge.best.lambda, newx = x.test) 
ridge.error <- mean((ridge.preds - y.test)^2)
ridge.error
## [1] 1173593

The test error for the ridge regression is 1,173,593, roughly a 10% increase over the MSE of the linear model.
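
To see how the cross-validated error varies with λ and where the selected value falls, the cv.glmnet object can be plotted; this is a diagnostic step, not required by the exercise:

# CV error across the lambda path; the dashed lines mark lambda.min
# and lambda.1se
plot(ridge.cv)
log(ridge.best.lambda)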

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# Fitting lasso model and finding best lambda using 10-fold CV:
set.seed(13)
lasso.fit <- glmnet(x.train, y.train, alpha = 1)
lasso.cv <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.best.lambda <- lasso.cv$lambda.min

# Predicting on the test set and calculating the test MSE:
lasso.preds <- predict(lasso.fit, s = lasso.best.lambda, newx = x.test) 
lasso.error <- mean((lasso.preds - y.test)^2)
lasso.error
## [1] 1055879

The test error of the lasso model is 1,055,879. This is just marginally better than the linear model.

# Finding number of non-zero coefficients:
lasso.coef <- predict(lasso.fit, type = "coefficients", s = lasso.best.lambda)[2:18,]
lasso.coef
##    PrivateYes        Accept        Enroll     Top10perc     Top25perc 
## -354.19610642    1.56542628   -0.48622156   30.62693312    0.00000000 
##   F.Undergrad   P.Undergrad      Outstate    Room.Board         Books 
##    0.00000000    0.05092959   -0.07080014    0.11875892    0.00000000 
##      Personal           PhD      Terminal     S.F.Ratio   perc.alumni 
##    0.04352157   -7.07398124   -2.29263444   25.78675291    0.00000000 
##        Expend     Grad.Rate 
##    0.08482118    5.55241821
length(lasso.coef[lasso.coef != 0])
## [1] 13

When examining the output above, we find four coefficients equal to zero, leaving 13 non-zero coefficients (not including the intercept).
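
As an aside, cv.glmnet also reports lambda.1se, the largest λ whose CV error is within one standard error of the minimum. A quick sketch of how this typically yields a sparser model (the exact count depends on the CV folds):

# The one-standard-error rule selects a larger lambda and hence
# usually a sparser model than lambda.min
lasso.coef.1se <- predict(lasso.fit, type = "coefficients",
                          s = lasso.cv$lambda.1se)[2:18, ]
sum(lasso.coef.1se != 0)    # number of surviving predictors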

(e) Fit a PCR model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

set.seed(13)
pcr.fit <- pcr(Apps ~ ., data = train_df, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4290     4143     2285     2290     1938     1858     1827
## adjCV         4290     4145     2279     2286     1876     1838     1817
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1827     1775     1747      1737      1746      1742      1740
## adjCV     1819     1755     1740      1726      1738      1734      1727
##        14 comps  15 comps  16 comps  17 comps
## CV         1742      1729      1398      1331
## adjCV      1734      1714      1378      1314
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.815    56.30    63.57    69.42    74.94    80.01    84.00    87.65
## Apps    9.051    73.44    73.60    83.11    83.88    84.02    84.03    85.64
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.78     92.95     95.05     96.76     97.81     98.79     99.40
## Apps    85.72     86.03     86.17     86.32     86.87     86.99     90.83
##       16 comps  17 comps
## X        99.83    100.00
## Apps     93.69     94.06
validationplot(pcr.fit, val.type = "MSEP")

From the summary and the plot above, we see that cross-validation favors the full 17 components. However, because using all 17 components amounts to no dimension reduction, I will consider \(M=4\) instead. When disregarding components that lead to little or no dimension reduction, the fourth component is the last to yield a substantial drop in the root mean squared error of prediction (RMSEP) as well as a substantial increase (~10%) in the percentage of variance explained.
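
Rather than eyeballing the plot, the pls package also offers selectNcomp() as a more formal heuristic; a sketch below (its suggestion may well differ from my manual choice of \(M=4\)):

# One-sigma heuristic: the smallest M whose CV error is within one
# standard error of the overall minimum
selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)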

pcr.preds <- predict(pcr.fit, x.test, ncomp = 4) 
pcr.error <- mean((pcr.preds - y.test)^2)
pcr.error
## [1] 2102455

The principal component regression results in an MSE of 2,102,455, making it the poorest performing model among those tested so far.

(f) Fit a PLS model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

set.seed(13)
pls.fit <- plsr(Apps ~ ., data = train_df, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4290     2112     1887     1676     1665     1567     1461
## adjCV         4290     2104     1885     1666     1633     1532     1436
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1363     1352     1337      1334      1335      1335      1334
## adjCV     1344     1334     1320      1317      1318      1318      1316
##        14 comps  15 comps  16 comps  17 comps
## CV         1332      1332      1331      1331
## adjCV      1314      1314      1314      1314
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.73    49.37    61.82    64.43    67.98    73.18    75.79    79.75
## Apps    78.00    83.25    88.08    91.45    93.18    93.64    93.86    93.93
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       81.98     84.98     87.93     90.50     92.21     94.32     96.91
## Apps    93.99     94.02     94.04     94.05     94.05     94.06     94.06
##       16 comps  17 comps
## X        98.74    100.00
## Apps     94.06     94.06
validationplot(pls.fit, val.type = "MSEP")

Similar to PCR, partial least squares (PLS) attains its lowest RMSEP at 17 components. Once again, weighing dimension reduction against RMSEP, \(M=7\) appears to be an appropriate selection.

pls.preds <- predict(pls.fit, x.test, ncomp = 7) 
pls.error <- mean((pls.preds - y.test)^2)
pls.error
## [1] 1090075

The PLS model achieves an MSE of 1,090,075, comparable to the linear model and the lasso, though slightly worse.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

errors <- c(lm.error, ridge.error, lasso.error, pcr.error, pls.error)
names <- c("Linear:", "Ridge:", "Lasso:", "PCR:", "PLS:")
sorted_errors <- sort(errors)
sorted_names <- names[order(errors)]
result <- paste(sorted_names, round(sorted_errors))
print(result)
## [1] "Lasso: 1055879"  "Linear: 1062514" "PLS: 1090075"    "Ridge: 1173593" 
## [5] "PCR: 2102455"

The lasso model exhibited the best performance, closely trailed by linear regression and PLS. The ridge model performed somewhat worse, while PCR yielded by far the highest MSE. Per the calculation below, the lasso model explains 90.89% of the variance in Apps, making it a reasonably accurate predictor of the number of college applications received.

# Calculating TSS, RSS, and R-square:
y_mean <- mean(y.test)
TSS <- sum((y.test - y_mean)^2)
RSS <- sum((lasso.preds - y.test)^2)
R_squared <- 1 - (RSS / TSS)
R_squared
## [1] 0.9088928
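
For completeness, the same calculation can be wrapped in a small helper, test_r2 (a name I am introducing), and applied to all five sets of predictions:

# Test-set R-squared for each model's predictions
test_r2 <- function(preds, y) 1 - sum((y - preds)^2) / sum((y - mean(y))^2)
sapply(list(Linear = lm.preds, Ridge = ridge.preds, Lasso = lasso.preds,
            PCR = pcr.preds, PLS = pls.preds),
       test_r2, y = y.test)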

Exercise 11

We will now try to predict per capita crime rate (crim) in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
I will test the regression methods mentioned, starting with best subset selection. First, I split the data into a training set and a test set.

boston <- ISLR2::Boston

set.seed(13)
train_indices <- sample(1:nrow(boston), nrow(boston)/2)
train_df <- boston[train_indices, ]
test_indices <- setdiff(1:nrow(boston), train_indices)
test_df <- boston[test_indices, ]
y.test <- test_df$crim


# Fitting full model
regfit.full = regsubsets(crim ~ ., data = train_df, nvmax=13)
reg.summary = summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train_df, nvmax = 13)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   "*" 
## 7  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" "*" " "     "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " "*" "*" "*" "*" " "     "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" " " "*" "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
# Plotting performance statistics to select best subset
par(mfrow=c(2,2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")

plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
p = which.max(reg.summary$adjr2)
points(p, reg.summary$adjr2[p], col = "blue", cex = 2, pch = 20)

plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = 'l')
p = which.min(reg.summary$cp)
points(p, reg.summary$cp[p], col = "blue", cex = 2, pch = 20)

plot(reg.summary$bic, xlab="Number of Variables", ylab = "BIC", type = 'l')
p = which.min(reg.summary$bic)
points(p, reg.summary$bic[p], col = "blue", cex = 2, pch = 20)

We notice from the plots that the adjusted R-square, Cp, and BIC suggest different subset sizes. Considering all four plots, a subset of 5 predictors appears reasonable across all measures. I will now fit a linear model using the five predictors identified in the summary above the plots.

lm.fit = lm(crim ~ zn+indus+dis+rad+medv, data = train_df)
lm.preds = predict(lm.fit, test_df, type = "response")
lm.subset.error <- mean((lm.preds - test_df$crim)^2)
lm.subset.error
## [1] 35.5139

The model using the best-subset approach results in an MSE of 35.51. Compared to the MSE of the full model shown below (34.82), this is actually marginally worse. However, we excluded 7 predictors while achieving a similar result.

full.lm.fit = lm(crim ~ ., data = train_df)
full.lm.preds = predict(full.lm.fit, test_df, type = "response")
mean((full.lm.preds - test_df$crim)^2)
## [1] 34.82423
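
As a cross-check on the five-variable choice, the validation-set MSE can be computed for every subset size. Since leaps does not provide a predict method for regsubsets objects, the sketch below defines the usual ISLR-style helper (predict.regsubsets is a custom function, not part of the package):

# Custom predict method: builds the test model matrix and multiplies
# by the coefficients of the size-id model
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}

# Validation MSE for each subset size; which.min gives the best size
val.errors <- sapply(1:12, function(i)
  mean((test_df$crim - predict(regfit.full, test_df, id = i))^2))
which.min(val.errors)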

The next regression method I will use to predict crim is the lasso.

x <- model.matrix(crim ~ ., data = boston)[, -1]
x.train <- x[train_indices,]
y.train <- boston$crim[train_indices]
x.test <- x[test_indices, ]

# Fitting lasso model and finding best lambda using 10-fold CV:
set.seed(13)
lasso.fit <- glmnet(x.train, y.train, alpha = 1)
lasso.cv <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.best.lambda <- lasso.cv$lambda.min

# Predicting on the test set and calculating the test MSE:
lasso.preds <- predict(lasso.fit, s = lasso.best.lambda, newx = x.test) 
lasso.error <- mean((lasso.preds - y.test)^2)
lasso.error
## [1] 32.88646

The MSE of the lasso model is 32.89. Next, I will assess ridge regression.

# Fitting ridge regression model and finding best lambda using CV:
set.seed(13)
ridge.fit <- glmnet(x.train, y.train, alpha = 0)
ridge.cv <- cv.glmnet(x.train, y.train, alpha = 0)
ridge.best.lambda <- ridge.cv$lambda.min

# Predicting on the test set and calculating the test MSE:
ridge.preds <- predict(ridge.fit, s = ridge.best.lambda, newx = x.test) 
ridge.error <- mean((ridge.preds - y.test)^2)
ridge.error
## [1] 33.09034

Ridge regression results in a test error of 33.09. Finally, I will employ PCR to predict the per capita crime rate.

set.seed(13)
pcr.fit <- pcr(crim ~ ., data = train_df, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 253 12 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.744    8.191    8.212    7.703    7.602    7.588    7.579
## adjCV        9.744    8.186    8.205    7.694    7.582    7.581    7.572
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       7.415    7.444    7.477     7.472     7.502     7.395
## adjCV    7.407    7.433    7.465     7.458     7.483     7.374
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.70     63.6    73.62    80.96    87.82    90.91    93.60    95.54
## crim    30.42     30.6    38.87    40.63    40.99    41.22    43.78    44.21
##       9 comps  10 comps  11 comps  12 comps
## X       97.12     98.48     99.50    100.00
## crim    44.22     44.81     45.88     47.59
validationplot(pcr.fit, val.type = "MSEP")

Looking at the plot, we notice a local minimum at 7 components, while the lowest RMSEP occurs at 12 components. I will predict on the test set using \(M=7\) due to its better trade-off between RMSEP and dimension reduction.
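
A quick programmatic check of the CV curve (dropping the intercept-only entry from the RMSEP array) confirms where the minima lie:

# Cross-validated RMSEP by number of components; which.min locates
# the global minimum seen in the plot
cv.rmsep <- RMSEP(pcr.fit)$val["CV", 1, -1]
which.min(cv.rmsep)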

pcr.preds <- predict(pcr.fit, x.test, ncomp = 7) 
pcr.error <- mean((pcr.preds - y.test)^2)
pcr.error
## [1] 33.87554

The principal components regression yields an MSE of 33.88.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

rsq.lm <- 1 - (sum((lm.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
rsq.ridge <- 1 - (sum((ridge.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
rsq.lasso <- 1 - (sum((lasso.preds - y.test)^2) / sum((y.test - mean(y.test))^2))
rsq.pcr <- 1 - (sum((pcr.preds - y.test)^2) / sum((y.test - mean(y.test))^2))

errors <- c(lm.subset.error, ridge.error, lasso.error, pcr.error)
names <- c("Best Subset", "Ridge", "Lasso", "PCR")
rsqs <- c(rsq.lm, rsq.ridge, rsq.lasso, rsq.pcr)

sorted_errors <- sort(errors)
sorted_names <- names[order(errors)]
sorted_rsqs <- rsqs[order(errors)]

result <- paste(sorted_names, round(sorted_errors, 2), round(sorted_rsqs, 2), sep = " | ")
print("MODEL |  MSE  | R-SQUARE")
## [1] "MODEL |  MSE  | R-SQUARE"
print(result)
## [1] "Lasso | 32.89 | 0.38"       "Ridge | 33.09 | 0.37"      
## [3] "PCR | 33.88 | 0.36"         "Best Subset | 35.51 | 0.33"

Upon examining the test errors of the evaluated models, we notice only a slight difference between the top performers (lasso and ridge), closely followed by PCR. The linear model built on the best subset of predictors showed the weakest performance, though not dramatically worse than the others. The R-squared values give a similar ranking, with the lasso performing best (0.38) and the subset linear model worst (0.33). Overall, none of the models excels at predicting crim, but the lasso stood out with the best performance in both test error and percentage of variance explained. Hence, I suggest the lasso, with λ chosen by cross-validation, as the preferred approach for prediction.
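
If the goal were prediction on new observations, a natural final step would be to refit the lasso, with λ chosen by cross-validation, on the full Boston data rather than only the training half; a sketch (the selected λ and coefficients will differ somewhat from the split above):

# Refitting the chosen approach on all observations
x.full <- model.matrix(crim ~ ., data = boston)[, -1]
set.seed(13)
final.cv <- cv.glmnet(x.full, boston$crim, alpha = 1)
final.fit <- glmnet(x.full, boston$crim, alpha = 1,
                    lambda = final.cv$lambda.min)
coef(final.fit)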

(c) Does your chosen model involve all of the features in the data set? Why or why not?
My chosen model does not involve all of the features. Although all 12 predictors enter the lasso fit, the lasso performs variable selection by shrinking some coefficients exactly to zero. As illustrated below, the model retains 7 non-zero coefficients (not including the intercept), meaning 5 predictors are effectively excluded from the final model.

# Finding number of non-zero coefficients:
lasso.coef <- predict(lasso.fit, type = "coefficients", s = lasso.best.lambda)[2:13,]
lasso.coef
##          zn       indus        chas         nox          rm         age 
##  0.04929413 -0.11619299 -0.17398359  0.00000000  0.00000000  0.00000000 
##         dis         rad         tax     ptratio       lstat        medv 
## -0.80414243  0.52577506  0.00000000  0.00000000  0.13745178 -0.20155505
length(lasso.coef[lasso.coef != 0])
## [1] 7