Chapter 6

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
(iii). Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Justification: The lasso adds an L1 penalty to the least squares loss, shrinking some coefficients exactly to zero and thereby performing variable selection. This makes the lasso less flexible than least squares. In doing so, the lasso reduces variance (less overfitting) but introduces bias (coefficients are shrunk toward zero). The trade-off improves prediction accuracy when the reduction in variance more than compensates for the increase in bias.
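
As a rough illustration (a minimal sketch on simulated data, not part of the exercise), increasing the penalty in glmnet with alpha = 1 drives more and more coefficients exactly to zero:

# Sketch: lasso shrinkage on simulated data (illustrative only)
library(glmnet)
set.seed(1)
x_sim <- matrix(rnorm(100 * 10), 100, 10)                  # 10 predictors, only 3 matter
y_sim <- x_sim %*% c(3, -2, 1.5, rep(0, 7)) + rnorm(100)
lasso_sim <- glmnet(x_sim, y_sim, alpha = 1)
coef(lasso_sim, s = 0.01)   # light penalty: close to least squares, few exact zeros
coef(lasso_sim, s = 1)      # heavy penalty: most coefficients exactly zero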

(b) Repeat (a) for ridge regression relative to least squares
(iii) Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Justification: Ridge regression adds an L2 penalty (the squared magnitude of the coefficients), shrinking them toward zero but never exactly to zero. This again makes it less flexible than least squares. It introduces bias but reduces variance, which is especially helpful when the predictors are numerous or highly correlated. Like the lasso, ridge improves prediction accuracy when the bias-variance trade-off is favorable.

The other options are incorrect. (i) and (ii): incorrect because ridge regression is less flexible than least squares, not more flexible.

(iv): Incorrect because a less flexible method increases bias and decreases variance, so the relevant condition is the one stated in (iii), not the reverse.
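
For contrast with the lasso sketch above, a minimal simulated-data sketch (again, not part of the exercise) showing that the L2 penalty shrinks coefficients toward zero but leaves none exactly at zero:

# Sketch: ridge shrinkage on simulated data (illustrative only)
library(glmnet)
set.seed(1)
x_sim <- matrix(rnorm(100 * 10), 100, 10)
y_sim <- x_sim %*% c(3, -2, 1.5, rep(0, 7)) + rnorm(100)
ridge_sim <- glmnet(x_sim, y_sim, alpha = 0)
coef(ridge_sim, s = 1)     # all coefficients shrunk, none exactly zero
coef(ridge_sim, s = 100)   # much heavier shrinkage, still no exact zeros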

(c) Repeat (a) for non-linear methods relative to least squares.
(ii). More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Justification: Non-linear methods are more flexible than least squares, so they can capture complex patterns in the data; this flexibility lowers bias but raises variance. Prediction accuracy improves when the decrease in bias outweighs the increase in variance.

The other options are incorrect. (i): incorrect because a more flexible method decreases bias and increases variance, so the improvement hinges on the bias reduction, not on a variance reduction.

(iii) and (iv): Incorrect because non-linear methods are more flexible than least squares, not less flexible.
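
A quick simulated-data sketch (illustrative only, not part of the exercise) of the bias-variance point in (c): on a curved relationship, a more flexible fit lowers bias relative to a straight line and wins on test error as long as its extra variance stays small:

# Sketch: linear vs. more flexible fit on a non-linear relationship
set.seed(1)
x_sim <- runif(200, -2, 2)
y_sim <- sin(2 * x_sim) + rnorm(200, sd = 0.3)
train <- sample(200, 100)
fit_lin  <- lm(y_sim ~ x_sim, subset = train)               # rigid, high bias here
fit_poly <- lm(y_sim ~ poly(x_sim, 5), subset = train)      # more flexible
mean((y_sim[-train] - predict(fit_lin,  data.frame(x_sim = x_sim[-train])))^2)
mean((y_sim[-train] - predict(fit_poly, data.frame(x_sim = x_sim[-train])))^2)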

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
attach(College)
college<-College

#Check for missing values
sum(is.na(college))
## [1] 0
str(college)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

(a) Split the data set into a training set and a test set.

#Split the data into training and test sets
set.seed(123)
# 70% training, 30% test
train_indices <- sample(1:nrow(college), nrow(college) * 0.7)
train_data <- college[train_indices, ]
test_data <- college[-train_indices, ]

cat("Training set size:", nrow(train_data), "\n")
## Training set size: 543
cat("Test set size:", nrow(test_data), "\n")
## Test set size: 234

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

set.seed(123)
# Fit linear model
model_lm <- lm(Apps ~ ., data = train_data)
summary(model_lm)
## 
## Call:
## lm(formula = Apps ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3097.8  -455.8   -46.5   343.8  6452.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -310.17331  481.30075  -0.644 0.519566    
## PrivateYes  -681.96465  164.08211  -4.156 3.78e-05 ***
## Accept         1.22130    0.05921  20.626  < 2e-16 ***
## Enroll         0.08046    0.21794   0.369 0.712155    
## Top10perc     49.33503    6.18296   7.979 9.31e-15 ***
## Top25perc    -16.11744    5.02717  -3.206 0.001428 ** 
## F.Undergrad    0.02284    0.03985   0.573 0.566831    
## P.Undergrad    0.03541    0.03529   1.003 0.316139    
## Outstate      -0.05446    0.02132  -2.555 0.010910 *  
## Room.Board     0.18967    0.05275   3.596 0.000354 ***
## Books          0.21366    0.28099   0.760 0.447381    
## Personal      -0.03685    0.07279  -0.506 0.612876    
## PhD           -6.00401    5.34580  -1.123 0.261897    
## Terminal      -5.01712    5.77787  -0.868 0.385609    
## S.F.Ratio     -2.18927   14.83898  -0.148 0.882766    
## perc.alumni   -8.01836    4.67330  -1.716 0.086792 .  
## Expend         0.07614    0.01340   5.681 2.23e-08 ***
## Grad.Rate     10.63461    3.38228   3.144 0.001760 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 992.3 on 525 degrees of freedom
## Multiple R-squared:  0.9175, Adjusted R-squared:  0.9148 
## F-statistic: 343.2 on 17 and 525 DF,  p-value: < 2.2e-16
# Predict on test set
predict_lm <- predict(model_lm, newdata = test_data)

# Calculate test error (MSE)
lm_test_mse <- mean((predict_lm - test_data$Apps)^2)
cat("Test MSE for Linear Regression:", lm_test_mse, "\n")
## Test MSE for Linear Regression: 1734841

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# (c) Ridge regression with cross-validation
set.seed(123)
# Create x and y matrices for the entire dataset
x <- model.matrix(Apps ~ ., data = college)[, -1]  # Remove intercept column
y <- college$Apps

# Fit Ridge model with cross-validation on training data only
set.seed(123)
cv.out <- cv.glmnet(x[train_indices, ], y[train_indices], alpha = 0)

# Get the best lambda
best_lambda <- cv.out$lambda.min
cat("Best lambda for Ridge:", best_lambda, "\n")
## Best lambda for Ridge: 314.2524
# Predict on the test set using the best lambda
ridge_preds <- predict(cv.out, s = best_lambda, newx = x[-train_indices, ])

# Compute test MSE
ridge_test_mse <- mean((ridge_preds - y[-train_indices])^2)
cat("Test MSE for Ridge Regression:", ridge_test_mse, "\n")
## Test MSE for Ridge Regression: 2976365
# Plot the cross-validation curve
plot(cv.out)
title("Ridge Regression Cross-Validation",line=2.5)

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# Fit lasso regression with cross-validation to choose lambda

set.seed(123)
lasso_cv <- cv.glmnet(x[train_indices, ], y[train_indices], alpha = 1)

# Predict and evaluate
best_lambda_lasso <- lasso_cv$lambda.min
lasso_pred <- predict(lasso_cv, s = best_lambda_lasso, newx = x[-train_indices, ])
lasso_mse <- mean((lasso_pred - y[-train_indices])^2)
lasso_rmse <- sqrt(lasso_mse)

# Non-zero coefficients
lasso_coef <- predict(lasso_cv, s = best_lambda_lasso, type = "coefficients")
nonzero <- sum(lasso_coef != 0) - 1

cat("Lasso Test MSE:", round(lasso_mse, 2), "\n")
## Lasso Test MSE: 1730913
cat("Lasso Test RMSE:", round(lasso_rmse, 2), "\n")
## Lasso Test RMSE: 1315.64
cat("Non-zero Coefficients:", nonzero, "\n")
## Non-zero Coefficients: 16

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

#PCR model using cross validation

library(pls)
## Warning: package 'pls' was built under R version 4.4.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(123)
pcr.fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3402     3378     1683     1661     1335     1307     1274
## adjCV         3402     3379     1681     1660     1324     1302     1272
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1238     1221     1177      1181      1182      1185      1187
## adjCV     1230     1218     1175      1178      1180      1182      1184
##        14 comps  15 comps  16 comps  17 comps
## CV         1189      1199      1050      1052
## adjCV      1186      1196      1046      1048
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.797    57.68    64.58     70.2    75.49    80.41    84.00    87.43
## Apps    3.037    76.20    77.49     85.1    86.15    86.83    87.92    88.01
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.62     93.05     95.12     96.96     98.04     98.89     99.42
## Apps    88.85     88.87     88.94     88.98     89.01     89.01     89.03
##       16 comps  17 comps
## X        99.83    100.00
## Apps     91.68     91.75
validationplot(pcr.fit, val.type="MSEP")

set.seed(123)
# Cross-validated RMSEP for each number of components, dropping the intercept (0 components)
cv_rmse <- RMSEP(pcr.fit)$val[1, 1, -1]
best_M <- which.min(cv_rmse)          # index now runs from 1 to 17 components

cat("Best number of components (M):", best_M, "\n")
## Best number of components (M): 16
# Predict using best_M components
pcr_pred <- predict(pcr.fit, newdata = test_data, ncomp = best_M)

# Calculate test MSE and RMSE
pcr_test_mse <- mean((test_data$Apps - pcr_pred)^2)
pcr_test_rmse <- sqrt(pcr_test_mse)

cat("PCR Test MSE:", round(pcr_test_mse, 2), "\n")
## PCR Test MSE: 1853635
cat("PCR Test RMSE:", round(pcr_test_rmse, 2), "\n")
## PCR Test RMSE: 1361.48

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

#PLS Model using cross validation
set.seed(123)
pls.fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3402     1520     1261     1159     1132     1104     1067
## adjCV         3402     1517     1262     1157     1129     1100     1061
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1057     1051     1052      1050      1049      1050      1051
## adjCV     1052     1047     1048      1046      1045      1046      1047
##        14 comps  15 comps  16 comps  17 comps
## CV         1051      1052      1052      1052
## adjCV      1047      1048      1048      1048
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.09    41.97    63.14    67.44    71.36    74.05    77.72    80.98
## Apps    80.83    86.94    89.29    90.10    90.94    91.65    91.71    91.73
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.77     86.46     89.83     91.07     93.08     95.14     97.06
## Apps    91.73     91.74     91.74     91.74     91.75     91.75     91.75
##       16 comps  17 comps
## X        99.09    100.00
## Apps     91.75     91.75
validationplot(pls.fit,val.type="MSEP")

set.seed(123)
# Cross-validated RMSEP for each number of components, dropping the intercept (0 components)
cv_rmse <- RMSEP(pls.fit)$val[1, 1, -1]
best_M <- which.min(cv_rmse)          # index now runs from 1 to 17 components

cat("Best number of components (M):", best_M, "\n")
## Best number of components (M): 11
# Predict using best_M components
pls_pred <- predict(pls.fit, newdata = test_data, ncomp = best_M)

# Calculate test MSE and RMSE
pls_test_mse <- mean((test_data$Apps - pls_pred)^2)
pls_test_rmse <- sqrt(pls_test_mse)

cat("PLS Test MSE:", round(pls_test_mse, 2), "\n")
## PLS Test MSE: 1756685
cat("PLS Test RMSE:", round(pls_test_rmse, 2), "\n")
## PLS Test RMSE: 1325.4

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
1. Least squares, lasso, PCR, and PLS give very similar test errors (MSEs of roughly 1.73-1.85 million); ridge is the outlier, with a noticeably higher test MSE (about 2.98 million) at its CV-selected lambda.
2. The lasso retains almost all predictors (16 of 17 non-zero coefficients), so it performs little variable selection here, suggesting most predictors carry useful information.
3. PCR and PLS likewise select many components (M = 16 and M = 11, respectively).
4. There is no clear winner; plain least squares is a reasonable choice for its simplicity, and the exact ranking can shift with a different train/test split.

Overall: the number of applications can be predicted with reasonable accuracy (test RMSEs of roughly 1,300-1,750 applications), and the choice among these five methods has only a modest impact on performance for this data.
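
A small summary sketch (assuming the objects computed in parts (b) through (f) are still in the workspace) that collects the test errors side by side:

# Gather the test MSEs/RMSEs computed above into one comparison table
results <- data.frame(
  Method   = c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
  Test_MSE = c(lm_test_mse, ridge_test_mse, lasso_mse, pcr_test_mse, pls_test_mse)
)
results$Test_RMSE <- sqrt(results$Test_MSE)
results[order(results$Test_MSE), ]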

11. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)       # For Boston dataset
library(glmnet)     # For lasso and ridge regression
library(pls)        # For PCR
library(caret)      # For train/test split and CV
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
## 
##     R2
set.seed(123)
data(Boston)
X <- as.matrix(Boston[, -which(names(Boston) == "crim")])
y <- Boston$crim

train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]
set.seed(123)
#Ridge Regression (alpha = 0)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min

ridge_model <- glmnet(X_train, y_train, alpha = 0, lambda = best_lambda_ridge)
ridge_pred <- predict(ridge_model, X_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 54.31016
set.seed(123)
#Lasso Regression (alpha = 1)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min

lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = best_lambda_lasso)
lasso_pred <- predict(lasso_model, X_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_mse
## [1] 55.02971
# Which coefficients are zero?
coef(lasso_model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  6.360624218
## zn           0.025322574
## indus       -0.022491968
## chas         .          
## nox          .          
## rm           .          
## age          .          
## dis         -0.433027162
## rad          0.456305248
## tax          .          
## ptratio     -0.029577721
## black       -0.008358225
## lstat        0.051396218
## medv        -0.122330104
library(pls)

#Principal Components Regression (PCR)
set.seed(123)
pcr_model <- pcr(crim ~ ., data = Boston[train_index, ], scale = TRUE, validation = "CV")
summary(pcr_model)
## Data:    X dimension: 356 13 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.162    6.719    6.708    6.356    6.329    6.341    6.398
## adjCV        8.162    6.717    6.706    6.351    6.323    6.336    6.389
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.402    6.261    6.283     6.298     6.304     6.248     6.190
## adjCV    6.393    6.251    6.273     6.287     6.292     6.234     6.176
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.09    60.33    69.78    76.64    82.79    87.75    91.19    93.56
## crim    32.62    32.98    40.74    41.32    41.32    41.46    41.66    44.25
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.56     97.06     98.48     99.52    100.00
## crim    44.30     44.30     44.45     45.87     47.15
# Choose number of components with lowest CV error
validationplot(pcr_model, val.type = "MSEP")

# Predict on the test set using the number of components with the lowest CV (PRESS) error
ncomp_optimal <- which.min(pcr_model$validation$PRESS)
pcr_pred <- predict(pcr_model, Boston[-train_index, ], ncomp = ncomp_optimal)
pcr_mse <- mean((pcr_pred - y_test)^2)
pcr_mse
## [1] 53.60105

All three methods perform similarly, with test MSEs of roughly 53-55. PCR has the lowest test MSE (53.60, vs. 54.31 for ridge and 55.03 for the lasso), so it slightly outperforms the other two on this test set.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

cat("Test MSE for Ridge:", ridge_mse, "\n")
## Test MSE for Ridge: 54.31016
cat("Test MSE for Lasso:", lasso_mse, "\n")
## Test MSE for Lasso: 55.02971
cat("Test MSE for PCR:", pcr_mse, "\n")
## Test MSE for PCR: 53.60105

Proposed model: PCR. Justification: it has the lowest test MSE of the three approaches considered, indicating the best generalization performance on the held-out set. By regressing on orthogonal principal components rather than the raw, correlated predictors, PCR can also help control overfitting and dampen noise, although the three test errors are close.

(c) Does your chosen model involve all of the features in the dataset? Why or why not?
No, not directly. PCR projects all 13 original predictors onto a set of orthogonal principal components and regresses the response on those components rather than on the raw features. In this fit, cross-validation selected the full set of 13 components (the lowest PRESS value), and each component is a linear combination of all 13 standardized predictors, so every original feature still contributes to the model even though none of them enters it directly.
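
As a quick check of this (a sketch, assuming pcr_model from part (a) is still available), the loading matrix shows that every principal component is built from all 13 predictors, so no original feature is dropped outright:

# Each column of the loading matrix mixes all 13 standardized predictors
round(pcr_model$loadings[, 1:3], 3)   # loadings of the first three components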