1. We perform best subset, forward stepwise, and backward
stepwise selection on a single data set. For each approach, we obtain p
+ 1 models, containing 0, 1, 2, . . . , p predictors. Explain your
answers:
(a) Which of the three models with k predictors has the smallest
training RSS?
(b) Which of the three models with k predictors has the smallest
test RSS?
(c) True or False:
i. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k+1)-variable model
identified by forward stepwise selection.
ii. The predictors in the k-variable model identified by
backward stepwise are a subset of the predictors in the (k + 1)-
variable model identified by backward stepwise selection.
iii. The predictors in the k-variable model identified by
backward stepwise are a subset of the predictors in the (k + 1)-
variable model identified by forward stepwise selection.
iv. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k+1)-variable model
identified by backward stepwise selection.
v. The predictors in the k-variable model identified by best
subset are a subset of the predictors in the (k + 1)-variable model
identified by best subset selection.
2. For parts (a) through (c), indicate which of i. through iv. is
correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
ii. More flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
iii. Less flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
Statements i and iv can be ruled out immediately because they are internally
inconsistent with the bias-variance tradeoff: making a method more flexible
lowers bias and raises variance, while making it less flexible does the
opposite, so the conditions those statements describe cannot occur. That
leaves statements ii and iii. Answer for (a): iii. Less flexible and hence
will give improved prediction accuracy when its increase in bias is less than
its decrease in variance.
The lasso is less flexible than least squares because the L1 penalty
constrains the coefficient estimates, shrinking some of them exactly to zero
and simplifying the model. This reduces variance at the cost of some added
bias, so prediction accuracy improves whenever the increase in bias is
smaller than the decrease in variance.
(b) Repeat (a) for ridge regression relative to least
squares.
Answer for (b): iii. Less flexible and hence will give improved
prediction accuracy when its increase in bias is less than its decrease
in variance.
Ridge is also less flexible, shrinking coefficients toward zero but not
exactly zero. Like lasso, it trades a little bias for less variance.
Prediction improves when the bias increase is less than variance
reduction.
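To make the contrast concrete, here is a minimal glmnet sketch on simulated toy data (the data-generating setup is an assumption chosen purely for illustration): at the same penalty strength, the lasso sets several coefficients exactly to zero while ridge only shrinks them toward zero.
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))                 # only three truly non-zero coefficients
y <- drop(X %*% beta + rnorm(n))
lasso_fit <- glmnet(X, y, alpha = 1, lambda = 0.5)   # L1 penalty
ridge_fit <- glmnet(X, y, alpha = 0, lambda = 0.5)   # L2 penalty
coef(lasso_fit)   # several coefficients are exactly zero (variable selection)
coef(ridge_fit)   # all coefficients shrunk toward zero, none exactly zero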
(c) Repeat (a) for non-linear methods relative to least
squares.
Answer for (c): ii. More flexible and hence will give improved
prediction accuracy when its increase in variance is less than its
decrease in bias.
Non-linear methods are more flexible than least squares regression: they can
fit complex patterns, which lowers bias but raises variance. Prediction
accuracy therefore improves only when the increase in variance is smaller
than the decrease in bias.
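As a small illustration (the sinusoidal toy data below are an assumption for this sketch), a flexible smoother tracks a curved truth far more closely than a straight-line least squares fit, i.e. it has much lower bias:
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)            # the true relationship is non-linear
linear_fit <- lm(y ~ x)                       # inflexible fit: large bias here
spline_fit <- smooth.spline(x, y)             # flexible fit: small bias, more variance
mean((fitted(linear_fit) - sin(x))^2)         # average squared distance from the true curve
mean((predict(spline_fit, x)$y - sin(x))^2)   # much smaller for the flexible fit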
9. In this exercise, we will predict the number of
applications received using the other variables in the College data
set.
(a) Split the data set into a training set and a test
set.
library(ISLR)    # College data set
library(caret)   # createDataPartition
library(glmnet)  # cv.glmnet for ridge and lasso
library(pls)     # pcr and plsr
data(College)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
set.seed(123)
# 75% / 25% train/test split; createDataPartition balances the split on the response
trainIndex <- createDataPartition(College$Apps, p = 0.75, list = FALSE)
trainData <- College[trainIndex, ]
testData <- College[-trainIndex, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm_model <- lm(Apps ~ ., data = trainData)        # least squares fit on the training set
lm_pred <- predict(lm_model, newdata = testData)  # predictions for the held-out test set
lm_test_mse <- mean((testData$Apps - lm_pred)^2)  # test mean squared error
cat("Linear Model Test MSE:", lm_test_mse, "\n")
## Linear Model Test MSE: 1213202
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
# Build the design matrices for glmnet (drop the intercept column added by model.matrix)
x_train <- model.matrix(Apps ~ ., data = trainData)[, -1]
y_train <- trainData$Apps
x_test <- model.matrix(Apps ~ ., data = testData)[, -1]
y_test <- testData$Apps
ridge_model <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)  # alpha = 0: ridge (L2) penalty
best_lambda <- ridge_model$lambda.min
cat("Ridge Best Lambda:", best_lambda, "\n")
## Ridge Best Lambda: 381.899
ridge_pred <- predict(ridge_model, s = best_lambda, newx = x_test)
ridge_test_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Regression Test MSE:", ridge_test_mse, "\n")
## Ridge Regression Test MSE: 1176853
plot(ridge_model)
plot(ridge_model$glmnet.fit, xvar = "lambda", label = TRUE)
The cross-validation plot shows that the optimal regularization parameter occurs at log(λ) ≈ 5.9 (λ ≈ 382), where the cross-validated MSE is minimized. The coefficient path plot shows that, at this level of regularization, predictors such as Accept, Enroll, and Room.Board still carry relatively large coefficients, indicating their strong influence on the predicted number of applications. As λ increases further, all coefficients shrink toward zero, illustrating how ridge regression reduces model complexity while keeping every predictor in the model.
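As a quick check on this reading of the path plot, the ridge coefficients at the CV-selected λ can be inspected directly (a sketch; the exact values depend on the seed and fold assignment):
ridge_coefs <- as.matrix(coef(ridge_model, s = "lambda.min"))
# predictors ordered by the absolute size of their ridge coefficient
head(ridge_coefs[order(abs(ridge_coefs[, 1]), decreasing = TRUE), , drop = FALSE], 8)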
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)  # alpha = 1: lasso (L1) penalty
best_lambda_lasso <- lasso_model$lambda.min
cat("Lasso Best Lambda:", best_lambda_lasso, "\n")
## Lasso Best Lambda: 17.31865
lasso_pred <- predict(lasso_model, s = best_lambda_lasso, newx = x_test)
lasso_test_mse <- mean((y_test - lasso_pred)^2)
cat("Lasso Test MSE:", lasso_test_mse, "\n")
## Lasso Test MSE: 1207359
lasso_coefs <- coef(lasso_model, s = best_lambda_lasso)
non_zero_coefs <- sum(lasso_coefs != 0) - 1  # minus 1: do not count the intercept
cat("Number of non-zero coefficients:", non_zero_coefs, "\n")
## Number of non-zero coefficients: 15
plot(lasso_model)
plot(lasso_model$glmnet.fit, xvar = "lambda", label = TRUE)
The two lasso plots together illustrate the trade-off between model complexity and prediction error as the regularization parameter λ changes. In the cross-validation curve, the CV MSE is minimized near log(λ) ≈ 2.9 (λ ≈ 17) and remains relatively flat for somewhat larger values before rising sharply, at which point the model begins to underfit. The coefficient path plot tracks each predictor: as λ increases, the L1 penalty drives more and more coefficients exactly to zero, until for very large λ only the intercept remains. This reflects the lasso's ability to perform variable selection and highlights the importance of choosing λ carefully to balance predictive accuracy and model simplicity.
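To see exactly which predictors the lasso keeps at the CV-selected λ, the non-zero entries of lasso_coefs computed above can be listed by name (a small sketch):
lasso_mat <- as.matrix(lasso_coefs)
rownames(lasso_mat)[lasso_mat[, 1] != 0 & rownames(lasso_mat) != "(Intercept)"]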
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pcr_model <- pcr(Apps ~ ., data = trainData, validation = "CV", segments = 10)
cv_results <- RMSEP(pcr_model)
optimal_M <- which.min(cv_results$val["CV", , ]) - 1  # minus 1: the first entry is the intercept-only model
cat("PCR Optimal M:", optimal_M, "\n")
## PCR Optimal M: 17
pcr_pred <- predict(pcr_model, newdata = testData, ncomp = optimal_M)
pcr_test_mse <- mean((testData$Apps - pcr_pred)^2)
cat("PCR Test MSE:", pcr_test_mse, "\n")
## PCR Test MSE: 1213202
summary(pcr_model)
## Data: X dimension: 585 17
## Y dimension: 585 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4027 4076 1965 1918 1785 1285 1266
## adjCV 4027 4087 1962 1914 1792 1275 1259
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1268 1255 1256 1210 1184 1184 1197
## adjCV 1261 1247 1248 1203 1176 1177 1189
## 14 comps 15 comps 16 comps 17 comps
## CV 1192 1173 1174 1172
## adjCV 1184 1164 1165 1163
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.1672 87.58 95.85 97.42 98.71 99.47 99.91 99.96
## Apps 0.6963 77.68 78.82 85.73 91.88 91.91 91.92 92.19
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 92.21 92.76 93.05 93.05 93.06 93.25 93.52
## 16 comps 17 comps
## X 100.00 100.00
## Apps 93.54 93.64
plot(pcr_model, plottype = "validation", val.type = "R2", main = "Cross-validated R²")
plot(pcr_model, plottype = "validation", val.type = "MSEP", main = "Cross-validated MSEP")
In the cross-validated R² plot, R² starts near 0 with 0 components, rises sharply to roughly 0.75 by 2 components, and plateaus around 0.9 from about 5 components onward. Adding components beyond 5 therefore yields diminishing returns in explanatory power.
In the cross-validated MSEP plot, MSEP starts high (about 1.6e+07 with 0 components), drops steeply to roughly 3.7e+06 by 2-3 components, and then levels off with only minor fluctuations. The curve is essentially flat from about 10 components onward, with the numerical minimum at the full 17 components (CV RMSEP ≈ 1172, i.e. MSEP ≈ 1.4e+06).
The elbow or stabilization point in both plots, around 5 to 10 components, suggests this range balances model complexity and predictive accuracy: beyond roughly 10 components the additional gains in R² and MSEP are marginal, so a 5-10 component model is a reasonable, more parsimonious choice even though cross-validation formally selects M = 17.
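If one prefers the more parsimonious fit suggested by the elbow, the test error of, say, a 10-component PCR model can be checked directly (10 is an illustrative choice, not the CV-selected value):
pcr_pred_10 <- predict(pcr_model, newdata = testData, ncomp = 10)  # ncomp = 10 is an assumption
mean((testData$Apps - pcr_pred_10)^2)                              # compare against the M = 17 test MSE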
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls_model <- plsr(Apps ~ ., data = trainData, validation = "CV", segments = 10)
pls_cv_results <- RMSEP(pls_model)
optimal_M_pls <- which.min(pls_cv_results$val["CV", , ]) - 1
cat("PLS Optimal M:", optimal_M_pls, "\n")
## PLS Optimal M: 17
pls_pred <- predict(pls_model, newdata = testData, ncomp = optimal_M_pls)
pls_test_mse <- mean((testData$Apps - pls_pred)^2)
cat("PLS Test MSE:", pls_test_mse, "\n")
## PLS Test MSE: 1213202
summary(pls_model)
## Data: X dimension: 585 17
## Y dimension: 585 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4027 1891 1814 1642 1276 1237 1238
## adjCV 4027 1881 1793 1641 1267 1231 1232
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1229 1219 1228 1158 1147 1143 1137
## adjCV 1222 1213 1224 1152 1141 1136 1130
## 14 comps 15 comps 16 comps 17 comps
## CV 1136 1136 1136 1130
## adjCV 1130 1129 1129 1123
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 39.50 67.13 92.78 97.31 98.68 99.13 99.52 99.96
## Apps 79.01 83.10 87.33 91.86 91.96 92.09 92.22 92.23
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 99.99 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 92.35 93.04 93.28 93.53 93.54 93.54 93.54
## 16 comps 17 comps
## X 100.00 100.00
## Apps 93.55 93.64
plot(pls_model, plottype = "validation", val.type = "R2", main = "Cross-validated R²")
plot(pls_model, plottype = "validation", val.type = "MSEP", main = "Cross-validated MSEP")
In the cross-validated R² plot for PLS, R² starts near 0 with 0 components, increases sharply to about 0.78 with a single component, and plateaus around 0.9 from roughly 4-5 components onward. Adding components beyond that yields diminishing returns in explanatory power.
In the cross-validated MSEP plot, MSEP begins high (about 1.6e+07 with 0 components), decreases steeply to roughly 3e+06 by 2-3 components, and then levels off with only slight improvements. The curve is nearly flat from about 10 components onward, with the numerical minimum at the full 17 components (CV RMSEP ≈ 1130, i.e. MSEP ≈ 1.3e+06).
The stabilization point in both plots, around 5 to 10 components, again suggests this range balances model complexity and predictive accuracy: beyond roughly 10 components the additional gains are marginal, so a 5-10 component PLS model is a reasonable, more parsimonious alternative even though cross-validation formally selects M = 17.
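As with PCR, the test error of a more parsimonious PLS fit (here 10 components, again an illustrative choice) can be compared against the full M = 17 model:
pls_pred_10 <- predict(pls_model, newdata = testData, ncomp = 10)  # ncomp = 10 is an assumption
mean((testData$Apps - pls_pred_10)^2)                              # compare against the M = 17 test MSE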
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
I employed five modeling approaches (linear regression, ridge regression, lasso, PCR, and PLS) to predict the number of college applications, with test MSEs as follows: linear regression (1.21M), ridge regression (1.17M), lasso (1.21M with 15 non-zero predictors), PCR (1.21M with M = 17), and PLS (1.21M with M = 17). The MSE values are quite similar, indicating that no single model substantially outperforms the others, though I found ridge regression has a slight edge with the lowest MSE (1.17M), suggesting marginally better generalization through its shrinkage of the coefficients. In absolute terms, a test MSE of about 1.17M corresponds to a root-mean-squared prediction error of roughly 1,080 applications against a mean of about 3,000 applications, so the predictions are useful but far from exact. The minimal differences among methods suggest to me that while ridge, lasso, and the dimension-reduction methods (PCR, PLS) offer slight advantages, the data's inherent variability, possibly driven by outliers such as the very largest schools, limits how much any of these linear approaches can improve. Both PCR and PLS ended up using all 17 components, implying most of the predictor information was needed, while the lasso's slightly sparser model (15 predictors) improves interpretability. The stabilization of the cross-validated metrics (for example, 5-10 components for PCR/PLS, and optimal log(λ) ≈ 5.9 for ridge and ≈ 2.9 for lasso) supports a balanced level of model complexity, yet I believe overall accuracy remains constrained.
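For reference, the five test MSEs computed above can be gathered into a single comparison table (a sketch that reuses the objects created in parts (b) through (f)):
results <- data.frame(
  Model    = c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
  Test_MSE = c(lm_test_mse, ridge_test_mse, lasso_test_mse, pcr_test_mse, pls_test_mse)
)
results[order(results$Test_MSE), ]   # sorted from best to worst test MSE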
11. We will now try to predict per capita crime rate in the
Boston data set.
(a) Try out some of the regression methods explored in this
chapter, such as best subset selection, the lasso, ridge regression, and
PCR. Present and discuss results for the approaches that you
consider.
library(MASS)  # Boston data set (this version includes the black column)
data(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
set.seed(123)
trainIndex <- createDataPartition(Boston$crim, p = 0.75, list = FALSE)
trainData <- Boston[trainIndex, ]
testData <- Boston[-trainIndex, ]
lm_model <- lm(crim ~ ., data = trainData)
lm_pred <- predict(lm_model, newdata = testData)
lm_test_mse <- mean((testData$crim - lm_pred)^2)
cat("Linear Model Test MSE:", lm_test_mse, "\n")
## Linear Model Test MSE: 68.67029
x_train <- model.matrix(crim ~ ., data = trainData)[, -1]
y_train <- trainData$crim
x_test <- model.matrix(crim ~ ., data = testData)[, -1]
y_test <- testData$crim
ridge_model <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
best_lambda <- ridge_model$lambda.min
cat("Ridge Best Lambda:", best_lambda, "\n")
## Ridge Best Lambda: 0.5258691
ridge_pred <- predict(ridge_model, s = best_lambda, newx = x_test)
ridge_test_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Regression Test MSE:", ridge_test_mse, "\n")
## Ridge Regression Test MSE: 69.72666
plot(ridge_model)
plot(ridge_model$glmnet.fit, xvar = "lambda", label = TRUE)
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
best_lambda_lasso <- lasso_model$lambda.min
cat("Lasso Best Lambda:", best_lambda_lasso, "\n")
## Lasso Best Lambda: 0.00648316
lasso_pred <- predict(lasso_model, s = best_lambda_lasso, newx = x_test)
lasso_test_mse <- mean((y_test - lasso_pred)^2)
cat("Lasso Test MSE:", lasso_test_mse, "\n")
## Lasso Test MSE: 68.70042
lasso_coefs <- coef(lasso_model, s = best_lambda_lasso)
non_zero_coefs <- sum(lasso_coefs != 0) - 1
cat("Number of non-zero coefficients:", non_zero_coefs, "\n")
## Number of non-zero coefficients: 13
plot(lasso_model)
plot(lasso_model$glmnet.fit, xvar = "lambda", label = TRUE)
pcr_model <- pcr(crim ~ ., data = trainData, validation = "CV", segments = 10)
cv_results <- RMSEP(pcr_model)
optimal_M <- which.min(cv_results$val["CV", , ]) - 1
cat("PCR Optimal M:", optimal_M, "\n")
## PCR Optimal M: 10
pcr_pred <- predict(pcr_model, newdata = testData, ncomp = optimal_M)
pcr_test_mse <- mean((testData$crim - pcr_pred)^2)
cat("PCR Test MSE:", pcr_test_mse, "\n")
## PCR Test MSE: 69.34457
summary(pcr_model)
## Data: X dimension: 382 13
## Y dimension: 382 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.04 6.245 6.371 6.365 6.355 6.272 6.081
## adjCV 8.04 6.243 6.360 6.354 6.344 6.260 6.068
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.016 5.949 5.943 5.899 5.912 5.909 5.900
## adjCV 6.003 5.936 5.930 5.885 5.898 5.894 5.885
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 81.69 96.82 99.02 99.69 99.89 99.93 99.97 99.99
## crim 40.28 40.88 41.06 41.27 43.17 46.59 48.32 49.50
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 100.00 100.00 100.00 100.0 100.00
## crim 49.75 50.63 50.64 50.7 50.89
plot(pcr_model, plottype = "validation", val.type = "R2", main = "Cross-validated R²")
plot(pcr_model, plottype = "validation", val.type = "MSEP", main = "Cross-validated MSEP")
pls_model <- plsr(crim ~ ., data = trainData, validation = "CV", segments = 10)
pls_cv_results <- RMSEP(pls_model)
optimal_M_pls <- which.min(pls_cv_results$val["CV", , ]) - 1
cat("PLS Optimal M:", optimal_M_pls, "\n")
## PLS Optimal M: 9
pls_pred <- predict(pls_model, newdata = testData, ncomp = optimal_M_pls)
pls_test_mse <- mean((testData$crim - pls_pred)^2)
cat("PLS Test MSE:", pls_test_mse, "\n")
## PLS Test MSE: 69.46011
summary(pls_model)
## Data: X dimension: 382 13
## Y dimension: 382 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.04 6.272 6.293 6.267 6.117 6.023 5.888
## adjCV 8.04 6.268 6.289 6.258 6.109 6.015 5.878
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.874 5.865 5.849 5.851 5.874 5.862 5.858
## adjCV 5.864 5.854 5.838 5.840 5.861 5.849 5.845
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 81.66 96.67 98.64 99.37 99.84 99.93 99.95 99.98
## crim 40.51 41.07 42.36 45.05 46.78 49.57 50.03 50.36
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 99.99 100.00 100.0 100.00 100.00
## crim 50.58 50.64 50.7 50.73 50.89
plot(pls_model, plottype = "validation", val.type = "R2", main = "Cross-validated R²")
plot(pls_model, plottype = "validation", val.type = "MSEP", main = "Cross-validated MSEP")
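Before choosing a model in part (b), it helps to line up the Boston test errors computed above in one place (a sketch reusing the objects from part (a)):
boston_results <- data.frame(
  Model    = c("Least squares", "Ridge", "Lasso", "PCR (M = 10)", "PLS (M = 9)"),
  Test_MSE = c(lm_test_mse, ridge_test_mse, lasso_test_mse, pcr_test_mse, pls_test_mse)
)
boston_results[order(boston_results$Test_MSE), ]   # sorted from best to worst test MSE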
(b) Propose a model (or set of models) that seem to perform
well on this data set, and justify your answer. Make sure that you are
evaluating model performance using validation set error,
cross-validation, or some other reasonable alternative, as opposed to
using training error.
I propose the linear model as the best-performing approach for
predicting per capita crime rate in the Boston data set, based on its
test MSE of 68.67, the lowest among the models I tested. I evaluated
performance using test set error after splitting the data into 75%
training and 25% test sets, ensuring the assessment reflects
generalization to unseen data rather than training fit. Compared to
ridge regression (69.73), lasso (68.70), PCR (69.34 with 10 components),
and PLS (69.46 with 9 components), the linear model’s slight edge
suggests it captures the relationship effectively without needing
regularization or dimensionality reduction. The cross-validation results
for PCR and PLS (e.g., RMSEP stabilizing around 5.9-6.0) and the minimal
MSE differences (all within ~1 unit) support that the linear model’s
simplicity and direct use of all predictors align well with the data’s
structure, though the range of crim (0.006 to 88.98) indicates some
residual variability remains unmodeled.
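To back this conclusion up with something other than a single split, a quick 10-fold cross-validation of the full linear model can be run with caret on the complete Boston data (a sketch; squaring the reported RMSE gives a quantity comparable to the test MSEs above):
cv_lm <- train(crim ~ ., data = Boston, method = "lm",
               trControl = trainControl(method = "cv", number = 10))
cv_lm$results$RMSE^2   # cross-validated MSE estimate for the linear model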
(c) Does your chosen model involve all of the features in the
data set? Why or why not?
Yes, my chosen linear model involves all 13 features in the Boston data
set (zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat,
and medv; crim itself is the response). I opted for this approach because the data's 13 predictors,
including correlated ones like nox and dis, contribute to explaining
crim’s wide range, and the low test MSE (68.67) suggests no single
feature is redundant. Unlike lasso, which retained all 13 with minimal
shrinkage (λ = 0.006), or ridge and PCR/PLS, which adjust feature
impact, the linear model uses all predictors without regularization,
reflecting their collective importance. This choice avoids overfitting
risks seen in more complex models and leverages the data’s full
information, given the modest sample size (506 observations).
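One quick way to probe the claim that no single feature is redundant is to look at the fitted coefficients and their p-values in the full training-set model (a sketch; predictors with consistently large p-values would be candidates for removal):
round(summary(lm_model)$coefficients, 3)   # estimate, std. error, t value, p-value per predictor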