2. For parts (a) through (c), indicate which of i. through
iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
(iii). Less flexible and hence will give improved prediction accuracy
when its increase in bias is less than its decrease in variance.
Justification: The lasso adds an L1 penalty to the least squares loss, shrinking some coefficients exactly to zero and thereby performing variable selection. This makes the lasso less flexible than least squares. In doing so, the lasso reduces variance (less overfitting) but introduces bias (the coefficients are shrunk toward zero). The trade-off improves prediction accuracy when the reduction in variance more than compensates for the increase in bias.
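A minimal illustrative sketch (simulated data, not the exercise data) of how the L1 penalty sets some coefficients exactly to zero while least squares keeps all of them non-zero:
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X_sim <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))        # only 3 truly non-zero coefficients
y_sim <- drop(X_sim %*% beta_true + rnorm(n))
ols_fit <- lm(y_sim ~ X_sim)
lasso_fit <- cv.glmnet(X_sim, y_sim, alpha = 1)   # lambda chosen by cross-validation
sum(coef(ols_fit)[-1] != 0)                       # least squares: all 20 non-zero
sum(coef(lasso_fit, s = "lambda.min")[-1] != 0)   # lasso: typically far fewer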
(b) Repeat (a) for ridge regression relative to least
squares
(iii) Less flexible and hence will give improved prediction accuracy
when its increase in bias is less than its decrease in variance.
Justification: Ridge regression adds an L2 penalty (the sum of squared coefficients), shrinking the coefficients toward zero but never exactly to zero. This again makes it less flexible than least squares. It introduces bias but reduces variance, which is especially helpful when there is multicollinearity or a large number of predictors. Like the lasso, ridge improves prediction accuracy when the bias-variance trade-off is favorable.
The other options are incorrect: (i) and (ii) because ridge regression is less flexible than least squares, and (iv) because the relevant trade-off is an increase in bias against a decrease in variance.
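Continuing the same simulated sketch (reusing X_sim, y_sim, and lasso_fit from above), ridge shrinks every coefficient toward zero but leaves all of them non-zero, in contrast to the lasso:
ridge_fit <- cv.glmnet(X_sim, y_sim, alpha = 0)   # alpha = 0 gives the L2 (ridge) penalty
sum(coef(ridge_fit, s = "lambda.min")[-1] != 0)   # ridge: all 20 coefficients remain non-zero
sum(coef(lasso_fit, s = "lambda.min")[-1] != 0)   # lasso: many are exactly zero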
(c) Repeat (a) for non-linear methods relative to least
squares.
(ii). More flexible and hence will give improved prediction accuracy
when its increase in variance is less than its decrease in bias.
Justification: Non-linear methods are more flexible than least squares, so they can capture complex patterns; this flexibility lowers bias but raises variance. Prediction accuracy improves when the reduction in bias outweighs the accompanying increase in variance.
Option (i) is incorrect because, for more flexible methods, the improvement comes from reducing bias, not from reducing variance.
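A small simulated sketch (with a degree-4 polynomial standing in for a "non-linear method") of how extra flexibility reduces bias when the true relationship is non-linear:
set.seed(2)
x_nl <- runif(200, -2, 2)
y_nl <- sin(2 * x_nl) + rnorm(200, sd = 0.3)      # the true relationship is non-linear
train_nl <- sample(200, 100)
lin_fit <- lm(y_nl ~ x_nl, subset = train_nl)              # rigid, biased fit
flex_fit <- lm(y_nl ~ poly(x_nl, 4), subset = train_nl)    # more flexible fit
# Held-out MSE: the flexible fit should be markedly lower here
mean((y_nl[-train_nl] - predict(lin_fit, data.frame(x_nl = x_nl[-train_nl])))^2)
mean((y_nl[-train_nl] - predict(flex_fit, data.frame(x_nl = x_nl[-train_nl])))^2)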
9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
attach(College)
college<-College
#Check for missing values
sum(is.na(college))
## [1] 0
str(college)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
(a) Split the data set into a training set and a test set.
#Split the data into training and test sets
set.seed(123)
# 70% training, 30% test
train_indices <- sample(1:nrow(college), nrow(college) * 0.7)
train_data <- college[train_indices, ]
test_data <- college[-train_indices, ]
cat("Training set size:", nrow(train_data), "\n")
## Training set size: 543
cat("Test set size:", nrow(test_data), "\n")
## Test set size: 234
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
set.seed(123)
# Fit linear model
model_lm <- lm(Apps ~ ., data = train_data)
summary(model_lm)
##
## Call:
## lm(formula = Apps ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3097.8 -455.8 -46.5 343.8 6452.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -310.17331 481.30075 -0.644 0.519566
## PrivateYes -681.96465 164.08211 -4.156 3.78e-05 ***
## Accept 1.22130 0.05921 20.626 < 2e-16 ***
## Enroll 0.08046 0.21794 0.369 0.712155
## Top10perc 49.33503 6.18296 7.979 9.31e-15 ***
## Top25perc -16.11744 5.02717 -3.206 0.001428 **
## F.Undergrad 0.02284 0.03985 0.573 0.566831
## P.Undergrad 0.03541 0.03529 1.003 0.316139
## Outstate -0.05446 0.02132 -2.555 0.010910 *
## Room.Board 0.18967 0.05275 3.596 0.000354 ***
## Books 0.21366 0.28099 0.760 0.447381
## Personal -0.03685 0.07279 -0.506 0.612876
## PhD -6.00401 5.34580 -1.123 0.261897
## Terminal -5.01712 5.77787 -0.868 0.385609
## S.F.Ratio -2.18927 14.83898 -0.148 0.882766
## perc.alumni -8.01836 4.67330 -1.716 0.086792 .
## Expend 0.07614 0.01340 5.681 2.23e-08 ***
## Grad.Rate 10.63461 3.38228 3.144 0.001760 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 992.3 on 525 degrees of freedom
## Multiple R-squared: 0.9175, Adjusted R-squared: 0.9148
## F-statistic: 343.2 on 17 and 525 DF, p-value: < 2.2e-16
# Predict on test set
predict_lm <- predict(model_lm, newdata = test_data)
# Calculate test error (MSE)
lm_test_mse <- mean((predict_lm - test_data$Apps)^2)
cat("Test MSE for Linear Regression:", lm_test_mse, "\n")
## Test MSE for Linear Regression: 1734841
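For comparability with the RMSE values reported for the later models, the corresponding test RMSE could also be computed (roughly 1317 given the MSE above):
lm_test_rmse <- sqrt(lm_test_mse)
cat("Test RMSE for Linear Regression:", round(lm_test_rmse, 2), "\n")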
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# (c) Ridge regression with cross-validation
set.seed(123)
# Create x and y matrices for the entire dataset
x <- model.matrix(Apps ~ ., data = college)[, -1] # Remove intercept column
y <- college$Apps
# Fit Ridge model with cross-validation on training data only
set.seed(123)
cv.out <- cv.glmnet(x[train_indices, ], y[train_indices], alpha = 0)
# Get the best lambda
best_lambda <- cv.out$lambda.min
cat("Best lambda for Ridge:", best_lambda, "\n")
## Best lambda for Ridge: 314.2524
# Predict on the test set using the best lambda
ridge_preds <- predict(cv.out, s = best_lambda, newx = x[-train_indices, ])
# Compute test MSE
ridge_test_mse <- mean((ridge_preds - y[-train_indices])^2)
cat("Test MSE for Ridge Regression:", ridge_test_mse, "\n")
## Test MSE for Ridge Regression: 2976365
# Plot the cross-validation curve
plot(cv.out)
title("Ridge Regression Cross-Validation",line=2.5)
(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# Fit lasso regression with cross-validation to choose lambda
set.seed(123)
lasso_cv <- cv.glmnet(x[train_indices, ], y[train_indices], alpha = 1)
# Predict and evaluate
best_lambda_lasso <- lasso_cv$lambda.min
lasso_pred <- predict(lasso_cv, s = best_lambda_lasso, newx = x[-train_indices, ])
lasso_mse <- mean((lasso_pred - y[-train_indices])^2)
lasso_rmse <- sqrt(lasso_mse)
# Non-zero coefficients
lasso_coef <- predict(lasso_cv, s = best_lambda_lasso, type = "coefficients")
nonzero <- sum(lasso_coef != 0) - 1
cat("Lasso Test MSE:", round(lasso_mse, 2), "\n")
## Lasso Test MSE: 1730913
cat("Lasso Test RMSE:", round(lasso_rmse, 2), "\n")
## Lasso Test RMSE: 1315.64
cat("Non-zero Coefficients:", nonzero, "\n")
## Non-zero Coefficients: 16
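Analogous to the ridge plot above, the lasso cross-validation curve can also be inspected:
plot(lasso_cv)
title("Lasso Regression Cross-Validation", line = 2.5)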
(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
#PCR model using cross validation
library(pls)
## Warning: package 'pls' was built under R version 4.4.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(123)
pcr.fit=pcr(Apps~., data=train_data, scale=TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3402 3378 1683 1661 1335 1307 1274
## adjCV 3402 3379 1681 1660 1324 1302 1272
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1238 1221 1177 1181 1182 1185 1187
## adjCV 1230 1218 1175 1178 1180 1182 1184
## 14 comps 15 comps 16 comps 17 comps
## CV 1189 1199 1050 1052
## adjCV 1186 1196 1046 1048
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.797 57.68 64.58 70.2 75.49 80.41 84.00 87.43
## Apps 3.037 76.20 77.49 85.1 86.15 86.83 87.92 88.01
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 93.05 95.12 96.96 98.04 98.89 99.42
## Apps 88.85 88.87 88.94 88.98 89.01 89.01 89.03
## 16 comps 17 comps
## X 99.83 100.00
## Apps 91.68 91.75
validationplot(pcr.fit, val.type="MSEP")
set.seed(123)
# Cross-validated RMSEP values, excluding the intercept (which is at index 1)
cv_rmsep <- RMSEP(pcr.fit)$val[1, 1, -1]   # drop the intercept column
best_M <- which.min(cv_rmsep)              # component count from 1 to 17
cat("Best number of components (M):", best_M, "\n")
## Best number of components (M): 16
# Predict using best_M components
pcr_pred <- predict(pcr.fit, newdata = test_data, ncomp = best_M)
# Calculate test MSE and RMSE
pcr_test_mse <- mean((test_data$Apps - pcr_pred)^2)
pcr_test_rmse <- sqrt(pcr_test_mse)
cat("PCR Test MSE:", round(pcr_test_mse, 2), "\n")
## PCR Test MSE: 1853635
cat("PCR Test RMSE:", round(pcr_test_rmse, 2), "\n")
## PCR Test RMSE: 1361.48
(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
#PLS Model using cross validation
set.seed(123)
pls.fit=plsr(Apps~., data=train_data, scale=TRUE, validation ="CV")
summary(pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3402 1520 1261 1159 1132 1104 1067
## adjCV 3402 1517 1262 1157 1129 1100 1061
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1057 1051 1052 1050 1049 1050 1051
## adjCV 1052 1047 1048 1046 1045 1046 1047
## 14 comps 15 comps 16 comps 17 comps
## CV 1051 1052 1052 1052
## adjCV 1047 1048 1048 1048
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.09 41.97 63.14 67.44 71.36 74.05 77.72 80.98
## Apps 80.83 86.94 89.29 90.10 90.94 91.65 91.71 91.73
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.77 86.46 89.83 91.07 93.08 95.14 97.06
## Apps 91.73 91.74 91.74 91.74 91.75 91.75 91.75
## 16 comps 17 comps
## X 99.09 100.00
## Apps 91.75 91.75
validationplot(pls.fit,val.type="MSEP")
set.seed(123)
# Cross-validated RMSEP values, excluding the intercept (which is at index 1)
cv_rmsep <- RMSEP(pls.fit)$val[1, 1, -1]   # drop the intercept column
best_M <- which.min(cv_rmsep)              # component count from 1 to 17
cat("Best number of components (M):", best_M, "\n")
## Best number of components (M): 11
# Predict using best_M components
pls_pred <- predict(pls.fit, newdata = test_data, ncomp = best_M)
# Calculate test MSE and RMSE
pls_test_mse <- mean((test_data$Apps - pls_pred)^2)
pls_test_rmse <- sqrt(pls_test_mse)
cat("PLS Test MSE:", round(pls_test_mse, 2), "\n")
## PLS Test MSE: 1756685
cat("PLS Test RMSE:", round(pls_test_rmse, 2), "\n")
## PLS Test RMSE: 1325.4
(g) Comment on the results obtained. How accurately can we
predict the number of college applications received? Is there much
difference among the test errors resulting from these five
approaches?
1. Least squares, the lasso, PCR, and PLS perform similarly on this split (test RMSEs of roughly 1,315-1,360); ridge regression does noticeably worse here (test RMSE of about 1,725).
2. The lasso retains 16 of the 17 predictors, so it performs almost no variable selection on this data.
3. The exact numbers can vary across train/test splits, so the ranking should not be over-interpreted.
4. There is no clear winner; plain least squares is attractive for its simplicity.
Overall: the number of applications can be predicted with reasonable accuracy (test RMSE on the order of 1,300-1,400 for the better models, versus about 3,400 for predicting with the mean alone, per the intercept-only CV error above), and the choice among these methods has only a modest impact on performance for this data set.
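To make this comparison concrete, the five test errors computed above can be collected into a single table (a short sketch reusing the objects created in parts (b) through (f)):
results <- data.frame(
  Method = c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
  Test.MSE = c(lm_test_mse, ridge_test_mse, lasso_mse, pcr_test_mse, pls_test_mse)
)
results$Test.RMSE <- sqrt(results$Test.MSE)
results[order(results$Test.MSE), ]   # sorted from best to worst test MSE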
11. We will now try to predict per capita crime rate in the
Boston data set.
(a) Try out some of the regression methods explored in this
chapter, such as best subset selection, the lasso, ridge regression, and
PCR. Present and discuss results for the approaches that you
consider.
library(MASS) # For Boston dataset
library(glmnet) # For lasso and ridge regression
library(pls) # For PCR
library(caret) # For train/test split and CV
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
##
## R2
set.seed(123)
data(Boston)
X <- as.matrix(Boston[, -which(names(Boston) == "crim")])
y <- Boston$crim
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]
set.seed(123)
#Ridge Regression (alpha = 0)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min
ridge_model <- glmnet(X_train, y_train, alpha = 0, lambda = best_lambda_ridge)
ridge_pred <- predict(ridge_model, X_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 54.31016
set.seed(123)
#Lasso Regression (alpha = 1)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = best_lambda_lasso)
lasso_pred <- predict(lasso_model, X_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_mse
## [1] 55.02971
# Which coefficients are zero?
coef(lasso_model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 6.360624218
## zn 0.025322574
## indus -0.022491968
## chas .
## nox .
## rm .
## age .
## dis -0.433027162
## rad 0.456305248
## tax .
## ptratio -0.029577721
## black -0.008358225
## lstat 0.051396218
## medv -0.122330104
library(pls)
#Principal Components Regression (PCR)
set.seed(123)
pcr_model <- pcr(crim ~ ., data = Boston[train_index, ], scale = TRUE, validation = "CV")
summary(pcr_model)
## Data: X dimension: 356 13
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.162 6.719 6.708 6.356 6.329 6.341 6.398
## adjCV 8.162 6.717 6.706 6.351 6.323 6.336 6.389
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.402 6.261 6.283 6.298 6.304 6.248 6.190
## adjCV 6.393 6.251 6.273 6.287 6.292 6.234 6.176
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.09 60.33 69.78 76.64 82.79 87.75 91.19 93.56
## crim 32.62 32.98 40.74 41.32 41.32 41.46 41.66 44.25
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.56 97.06 98.48 99.52 100.00
## crim 44.30 44.30 44.45 45.87 47.15
# Choose number of components with lowest CV error
validationplot(pcr_model, val.type = "MSEP")
# Predict on the test set using the number of components selected by cross-validation
ncomp_optimal <- which.min(pcr_model$validation$PRESS)
pcr_pred <- predict(pcr_model, Boston[-train_index, ], ncomp = ncomp_optimal)
pcr_mse <- mean((pcr_pred - y_test)^2)
pcr_mse
## [1] 53.60105
All three methods perform similarly, with test MSEs in the 53-55 range. PCR has the lowest test MSE (53.60, versus 54.31 for ridge and 55.03 for the lasso), so it slightly outperforms the other two on this test set.
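The question also lists best subset selection; a sketch of how it could be added on the same split (using the leaps package; this was not run above, so no results are reported) is:
library(leaps)
# Best subset selection on the training data, up to all 13 predictors
best_fit <- regsubsets(crim ~ ., data = Boston[train_index, ], nvmax = 13)
# Validation-set MSE for each model size
test_mat <- model.matrix(crim ~ ., data = Boston[-train_index, ])
val_errors <- sapply(1:13, function(k) {
  coef_k <- coef(best_fit, id = k)
  pred_k <- test_mat[, names(coef_k)] %*% coef_k
  mean((y_test - pred_k)^2)
})
which.min(val_errors)   # model size with the lowest validation MSE
min(val_errors)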
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.
cat("Test MSE for Ridge:", ridge_mse, "\n")
## Test MSE for Ridge: 54.31016
cat("Test MSE for Lasso:", lasso_mse, "\n")
## Test MSE for Lasso: 55.02971
cat("Test MSE for PCR:", pcr_mse, "\n")
## Test MSE for PCR: 53.60105
Proposed model: PCR. Justification: it has the lowest test MSE of the three fitted models, indicating the best generalization on the held-out set, and it guards against overfitting by reducing dimensionality and averaging out noise in correlated predictors. All comparisons here use validation-set (test) error rather than training error.
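Because a single train/test split can be noisy, one way to check that this ranking is stable is to repeat the split several times; a rough sketch (reusing X, y, and the packages loaded above, with a small number of repeats) is:
set.seed(1)
mse_reps <- replicate(5, {
  idx <- createDataPartition(y, p = 0.7, list = FALSE)
  cv_r <- cv.glmnet(X[idx, ], y[idx], alpha = 0)
  cv_l <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
  pcr_r <- pcr(crim ~ ., data = Boston[idx, ], scale = TRUE, validation = "CV")
  m <- which.min(pcr_r$validation$PRESS)
  c(ridge = mean((predict(cv_r, newx = X[-idx, ], s = "lambda.min") - y[-idx])^2),
    lasso = mean((predict(cv_l, newx = X[-idx, ], s = "lambda.min") - y[-idx])^2),
    pcr = mean((predict(pcr_r, Boston[-idx, ], ncomp = m) - y[-idx])^2))
})
rowMeans(mse_reps)   # average test MSE per method over the repeated splits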
(c) Does your chosen model involve all of the features in the
dataset? Why or why not?
Not directly. PCR does not keep or drop individual features; it projects the 13 original predictors onto orthogonal principal components and regresses on those. The model uses the number of components selected by cross-validation, and each component is a linear combination of all 13 predictors, so every feature still contributes to the predictions indirectly even though the model is not expressed in terms of the original features.
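One way to see that every original feature still enters the model indirectly is to inspect the component loadings: each column weights all 13 predictors.
# Loadings of the first three principal components (each uses every predictor)
round(pcr_model$loadings[, 1:3], 3)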