library(ISLR)
library(ISLR2)
library(tidyverse)
library(glmnet)
library(pls)
library(leaps)
library(caret)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
a. The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
b. Repeat (a) for ridge regression relative to least squares.
c. Repeat (a) for non-linear methods relative to least squares.
a) Correct answer: iii. The lasso is less flexible than least squares because it penalizes large coefficients and can shrink some exactly to zero, effectively performing variable selection. This introduces bias but reduces variance, so prediction accuracy improves when the increase in bias is less than the decrease in variance.
b) Correct answer: iii. Ridge regression is also less flexible than least squares because its L2 penalty shrinks coefficients toward zero without eliminating any of them. This adds bias and reduces variance, and prediction improves when the increase in bias is smaller than the decrease in variance.
c) Correct answer: ii. Non-linear methods are generally more flexible than least squares, which lets them fit complex patterns and reduce bias, but at the cost of higher variance. They improve prediction accuracy when the increase in variance is less than the decrease in bias.
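To make the bias-variance trade-off concrete, here is a minimal simulation sketch (added for illustration, not part of the exercise): with many weak predictors and modest n, the lasso's added bias is usually outweighed by its variance reduction, so its test error tends to fall below that of least squares.
set.seed(1)
n <- 100; p <- 50
X_sim <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(1, 5), rep(0, p - 5))                  # only 5 of 50 predictors matter
y_sim <- as.vector(X_sim %*% beta_true + rnorm(n, sd = 3))
X_new <- matrix(rnorm(n * p), n, p)                       # independent test set
y_new <- as.vector(X_new %*% beta_true + rnorm(n, sd = 3))
ols_fit  <- lm(y_sim ~ X_sim)                             # least squares: low bias, high variance
ols_pred <- cbind(1, X_new) %*% coef(ols_fit)
lasso_fit  <- cv.glmnet(X_sim, y_sim, alpha = 1)          # lasso: some bias, much less variance
lasso_pred <- predict(lasso_fit, newx = X_new, s = "lambda.min")
c(OLS = mean((y_new - ols_pred)^2), Lasso = mean((y_new - lasso_pred)^2))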
data("College")
College_df <- College
College_df$CollegeName <- rownames(College_df)
rownames(College_df) <- NULL
College_df$CollegeName <- NULL
set.seed(123)
# 70/30 train/test split on the response using caret::createDataPartition
indices <- createDataPartition(College_df$Apps, p = 0.7, list = FALSE)
train_data <- College_df[indices, ]
test_data <- College_df[-indices, ]
# Align factor levels in the test set with those seen in training so predict() does not fail
for (col in names(train_data)) {
  if (is.factor(train_data[[col]])) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
  }
}
Fit a full linear regression model and assess the mean squared prediction error on the test data.
lm_model <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_model, newdata = test_data)
mse <- mean((test_data$Apps - lm_pred)^2)
mse
## [1] 1882074
Ridge regression is fit with glmnet, setting alpha = 0 (which corresponds to the ridge penalty); the tuning parameter lambda is chosen by cross-validation.
# Build design matrices for glmnet; drop the intercept column that model.matrix adds
X_train <- model.matrix(Apps ~ ., data = train_data)[, -1]
X_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
Y_train <- train_data$Apps
Y_test <- test_data$Apps
ridge_cv <- cv.glmnet(X_train, Y_train, alpha = 0)
ridge_best_lambda <- ridge_cv$lambda.min
ridge_preds <- predict(ridge_cv, s = ridge_best_lambda, newx = X_test)
mse_ridge <- mean((Y_test - ridge_preds)^2)
mse_ridge
## [1] 3265646
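The ridge error is noticeably higher than the others. As a sanity check (a sketch under the assumption that glmnet's default lambda sequence may not reach values small enough for this data), ridge can be refit over an explicit grid extending toward zero, since ridge regression approaches the least squares fit as lambda goes to 0:
lambda_grid <- 10^seq(5, -3, length.out = 100)            # explicit grid reaching near zero
ridge_cv_grid <- cv.glmnet(X_train, Y_train, alpha = 0, lambda = lambda_grid)
ridge_pred_grid <- predict(ridge_cv_grid, s = ridge_cv_grid$lambda.min, newx = X_test)
mean((Y_test - ridge_pred_grid)^2)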
set.seed(123)
cv_lasso <- cv.glmnet(X_train, Y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = X_test)
mean((Y_test - lasso.pred)^2)
## [1] 1936048
lasso_coef <- predict(cv_lasso, s = best_lasso, type = "coefficients")
sum(lasso_coef != 0)
## [1] 17
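At lambda.min the lasso keeps 17 of the 18 coefficients (intercept plus 17 predictors), so it drops at most one variable. A hedged alternative, sketched below, is to use the built-in one-standard-error choice lambda.1se, which typically yields a sparser model at a small cost in cross-validated error:
lasso_coef_1se <- predict(cv_lasso, s = cv_lasso$lambda.1se, type = "coefficients")
sum(lasso_coef_1se != 0)                                   # usually fewer nonzero coefficients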
pcr.fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3389 3395 1657 1670 1409 1266 1254
## adjCV 3389 3394 1654 1672 1389 1243 1250
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1252 1198 1170 1169 1168 1175 1177
## adjCV 1254 1191 1168 1168 1166 1172 1174
## 14 comps 15 comps 16 comps 17 comps
## CV 1173 1174 1011 1007
## adjCV 1170 1172 1008 1004
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.892 57.49 64.60 70.26 75.69 80.70 84.36 87.81
## Apps 1.358 76.76 76.85 84.55 87.14 87.14 87.28 88.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.88 93.17 95.25 97.13 98.20 98.95 99.47
## Apps 88.69 88.77 88.90 88.98 89.04 89.09 89.37
## 16 comps 17 comps
## X 99.86 100.00
## Apps 92.08 92.22
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, newdata = test_data, ncomp = 17)
mean((test_data$Apps - pcr.pred)^2)
## [1] 1882074
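Because ncomp = 17 uses every component, this PCR fit is just the least squares fit in rotated coordinates, which is why its test error matches the linear model exactly. The sketch below makes the choice of components data-driven by reading the cross-validation minimum off the RMSEP object (in this run the CV minimum is still at 17 components, so it reproduces the hard-coded choice):
pcr_cv_rmsep <- RMSEP(pcr.fit)$val["CV", 1, -1]            # CV RMSEP by number of components
pcr_ncomp <- which.min(pcr_cv_rmsep)
pcr_ncomp
pcr.pred.cv <- predict(pcr.fit, newdata = test_data, ncomp = pcr_ncomp)
mean((test_data$Apps - pcr.pred.cv)^2)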
pls.fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3389 1490 1198 1134 1100 1045 1016
## adjCV 3389 1488 1199 1132 1095 1042 1012
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1002.1 1000.5 997.8 996.3 996.2 995.4 995.4
## adjCV 998.9 997.5 994.9 993.5 993.2 992.6 992.6
## 14 comps 15 comps 16 comps 17 comps
## CV 995.1 995.0 995.1 995.1
## adjCV 992.4 992.3 992.4 992.4
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.63 38.41 62.86 66.01 69.22 73.83 76.90 81.19
## Apps 81.30 87.78 89.41 90.52 91.58 92.03 92.14 92.17
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.07 85.38 87.23 90.36 93.14 95.35 97.20
## Apps 92.19 92.20 92.21 92.21 92.22 92.22 92.22
## 16 comps 17 comps
## X 98.56 100.00
## Apps 92.22 92.22
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, newdata = test_data, ncomp = 16)
mean((test_data$Apps - pls.pred)^2)
## [1] 1882170
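The PLS cross-validation curve is essentially flat beyond about 10 components, so a much smaller model should do just as well. A sketch using pls::selectNcomp with its one-standard-error heuristic (an assumption about which stopping rule to use, not part of the original analysis) picks a more parsimonious number of components and checks the test error:
pls_ncomp <- selectNcomp(pls.fit, method = "onesigma")     # one-standard-error heuristic
pls_ncomp
mean((test_data$Apps - predict(pls.fit, newdata = test_data, ncomp = pls_ncomp))^2)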
Five methods were compared for predicting the number of applications received by each college. Linear regression, PCR, and PLS gave essentially identical test errors near 1,882,000, which is expected: PCR was fit with all 17 components and PLS with 16 of 17, so both are (nearly) the full least squares fit rather than genuinely reduced models. The lasso performed almost as well, with a test error of 1,936,048; note, however, that 17 of its 18 coefficients (intercept plus 17 predictors) are nonzero, so it dropped at most one predictor and is only marginally simpler than the full model. Ridge regression was clearly the weakest here, with a test error of 3,265,646. Overall, all of the methods predict Apps reasonably well, and on this split the least-squares-style fits (linear regression, PCR, and PLS) are the most accurate.
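For reference, the sketch below gathers the College test errors into a single table; it reuses the prediction objects already created above (the lasso, PCR, and PLS errors are recomputed here only because they were printed rather than stored):
tibble(
  Method  = c("Linear regression", "Ridge", "Lasso", "PCR (17 comps)", "PLS (16 comps)"),
  TestMSE = c(mse,
              mse_ridge,
              mean((Y_test - lasso.pred)^2),
              mean((test_data$Apps - pcr.pred)^2),
              mean((test_data$Apps - pls.pred)^2))
)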
Apply best subset selection, ridge regression, lasso regression, and PCR to the Boston dataset.
Setup and Data Split
data("Boston")
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Boston), rep = TRUE)
test <- !train
1. Best Subset Selection
reg.fit.best <- regsubsets(crim ~ ., data = Boston[train, ], nvmax = 13)
test_matrix <- model.matrix(crim ~ ., data = Boston[test, ])
nv <- nrow(summary(reg.fit.best)$which)   # number of model sizes actually fit
val_errors <- rep(NA, nv)
for (i in 1:nv) {
  coefs <- coef(reg.fit.best, id = i)
  pred <- test_matrix[, names(coefs)] %*% coefs
  val_errors[i] <- mean((Boston$crim[test] - pred)^2)
}
min(val_errors)
## [1] 49.15767
which.min(val_errors)
## [1] 12
plot(val_errors, type = "b", xlab = "Number of Variables", ylab = "Validation MSE")
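To see exactly what the selected model contains, the sketch below prints the coefficients of the model size with the lowest validation MSE and lists any column of the test design matrix it leaves out (an empty result means the full model was selected):
best_coef <- coef(reg.fit.best, id = which.min(val_errors))
best_coef
setdiff(colnames(test_matrix), names(best_coef))           # predictors excluded, if any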
2. Ridge Regression
x_train <- model.matrix(crim ~ ., Boston[train, ])[, -1]
x_test <- model.matrix(crim ~ ., Boston[test, ])[, -1]
y_train <- Boston$crim[train]
y_test <- Boston$crim[test]
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- cv_ridge$lambda.min
ridge.pred <- predict(cv_ridge, s = best_ridge, newx = x_test)
mean((y_test - ridge.pred)^2)
## [1] 49.91218
best_ridge
## [1] 0.5240686
plot(cv_ridge)
title("Ridge Regression: Cross-Validation Curve", line = 2.5)
3. Lasso Regression
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = x_test)
mean((y_test - lasso.pred)^2)
## [1] 49.37916
best_lasso
## [1] 0.04558081
plot(cv_lasso)
title("Lasso Regression: Cross-Validation Curve", line = 2.5)
4. Principal Components Regression (PCR)
pcr_model <- pcr(crim ~ ., data = Boston[train, ], scale = TRUE, validation = "CV")
best_ncomp <- which.min(pcr_model$validation$PRESS)
pcr.pred <- predict(pcr_model, newdata = Boston[test, ], ncomp = best_ncomp)
pcr.pred <- as.vector(pcr.pred)
mean((Boston$crim[test] - pcr.pred)^2)
## [1] 49.15767
best_ncomp
## [1] 12
validationplot(pcr_model, val.type = "MSEP")
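Since the CV-chosen number of components equals the total number of predictors, this PCR model is equivalent to the full least squares fit on the training split. The sketch below checks this by fitting an ordinary linear model on the same split; its test MSE should match the PCR error above:
lm_boston <- lm(crim ~ ., data = Boston[train, ])
mean((Boston$crim[test] - predict(lm_boston, newdata = Boston[test, ]))^2)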
Four approaches were compared for predicting per-capita crime rate (crim) in Boston. Best subset selection and PCR tied for the lowest test error, about 49.16, with the lasso close behind at 49.38 and ridge regression at 49.91. The tie is not a coincidence: the validation error for best subset selection is minimized at 12 variables, PCR's cross-validation error is minimized at 12 components, and the ISLR2 version of Boston has exactly 12 predictors for crim, so both "winners" are simply the full least squares fit expressed in different ways. Between the two, the subset-selection model is the easier to report, because its coefficients refer directly to the measured variables rather than to principal components, so it is the model proposed here.
The proposed model does involve all of the features: the best-performing subset contains all 12 available predictors, so nothing was dropped on this split. If a sparser model were wanted, the lasso is the natural alternative, since its penalty can set unhelpful coefficients exactly to zero, but with a cross-validated lambda of about 0.046 the shrinkage here is mild and the test error is only slightly higher.