ISLR Chapter 06 (page 283): 2, 9, 11
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
** Correct answer is
iii. Because Lasso is less flexible than least squares because it shrinks coefficients near or absolute zero. This increases bias but reduces variance. Prediction improves when the decrease in variance is greater than the increase in bias.
(b) Repeat (a) for ridge regression relative to least squares.
** Correct answer is
iii. Because Ridge regression is also less flexible than least squares because it shrinks coefficients toward zero. Like lasso, it increases bias and reduces variance, improving prediction when the reduction in variance outweighs the increase in bias.
(c) Repeat (a) for non-linear methods relative to least squares.
** Correct answer is
ii. Because Non-linear methods are generally more flexible than least squares. Greater flexibility reduces bias but increases variance. They improve prediction when the increase in variance is smaller than the decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
set.seed(210)
train_sample <- sample(nrow(College), nrow(College) *0.5)
train_size <- College[train_sample, ]
test_size <- College[-train_sample, ]
dim(train_size)## [1] 388 18
## [1] 389 18
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
##
## Call:
## lm(formula = Apps ~ ., data = train_size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3409.8 -357.9 3.3 335.1 6784.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.036e+02 5.723e+02 -1.055 0.292204
## PrivateYes -3.076e+02 1.955e+02 -1.573 0.116559
## Accept 1.736e+00 4.689e-02 37.024 < 2e-16 ***
## Enroll -7.761e-01 2.254e-01 -3.444 0.000640 ***
## Top10perc 2.559e+01 7.571e+00 3.380 0.000803 ***
## Top25perc -3.253e+00 5.988e+00 -0.543 0.587339
## F.Undergrad 2.445e-02 3.959e-02 0.618 0.537271
## P.Undergrad 4.676e-03 3.834e-02 0.122 0.902987
## Outstate -9.543e-02 2.493e-02 -3.828 0.000152 ***
## Room.Board 2.856e-02 6.548e-02 0.436 0.662972
## Books -2.487e-02 3.771e-01 -0.066 0.947457
## Personal 3.629e-03 7.907e-02 0.046 0.963421
## PhD -9.092e+00 6.549e+00 -1.388 0.165854
## Terminal -5.359e+00 7.166e+00 -0.748 0.454964
## S.F.Ratio 2.632e+01 1.980e+01 1.330 0.184437
## perc.alumni 4.141e+00 5.340e+00 0.775 0.438625
## Expend 1.350e-01 1.860e-02 7.257 2.34e-12 ***
## Grad.Rate 8.200e+00 4.001e+00 2.049 0.041128 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 995 on 370 degrees of freedom
## Multiple R-squared: 0.9401, Adjusted R-squared: 0.9374
## F-statistic: 341.6 on 17 and 370 DF, p-value: < 2.2e-16
we fit our linear model using our train size now, we can predict using test size and obtain test error.
lm_pred <- predict(lm_fit, newdata = test_size)
lm_mse <- mean((test_size$Apps - lm_pred)^2)
print(lm_mse)## [1] 1415912
Our test error: 1878956
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
## Loading required package: Matrix
## Loaded glmnet 4.1-10
x_train <- model.matrix(Apps ~ ., train_size)[, -1]
y_train <- train_size$Apps
x_test <- model.matrix(Apps ~ ., test_size)[, -1]
y_test <- test_size$Appsnow we converted splitted data into matrixes and fit the ridge regression model.
we can now get the best lambda value
## [1] 377.3813
ridge_pred <- predict(cv_ridge, s = best_lambda, newx = x_test)
ridge_test_mse <- mean((y_test - ridge_pred)^2)
ridge_test_mse## [1] 1228067
Ridge regression produced larger mse error with 2583504
>
(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
## [1] 32.8227
## [1] 1451952
Our lasso model test error is 1882754 slightly higher than our linear model and lot less than our ridge model.
(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
## Warning: package 'pls' was built under R version 4.5.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3980 3995 2302 2294 2301 2011 1977
## adjCV 3980 3995 2297 2290 2320 1961 1965
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1965 1929 1803 1809 1821 1838 1872
## adjCV 1956 1915 1790 1798 1808 1824 1861
## 14 comps 15 comps 16 comps 17 comps
## CV 1644 1249 1178 1149
## adjCV 1584 1231 1166 1138
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 3.175e+01 57.77 65.36 71.07 76.27 81.15 84.66 88.01
## Apps 5.501e-04 68.37 68.54 68.95 78.93 79.17 79.70 80.70
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.80 92.99 95.07 96.69 97.77 98.59 99.31
## Apps 83.79 84.12 84.33 84.35 84.35 92.86 93.49
## 16 comps 17 comps
## X 99.77 100.00
## Apps 93.75 94.01
Looking at the summary, the 16 component model achieves the lowest RMSEP score while capturing the maximum possible explained variance for the output (92.45%), matching the 17 component model with fewer variables.
## [1] 17
We could now apply our model on the test size and observ the test error.
## [1] 1415912
PCR model have the lowest test error within our previous model with 1845839
(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
After fitting the model we could plot the error
## 9 comps
## 9
now, we can predict using best m value and compute test MSE.
pls_pred <- predict(pls_fit, newdata = test_size, ncomp = M)
pls_test_mse <- mean((test_size$Apps - pls_pred)^2)
pls_test_mse## [1] 1426303
Our observed test error fot PLS model is 1863155
(g) Comment on the results obtained. How accurately can we pre dict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Lets look at all the test errors of our different model approaches
model_names <- c("Linear Regression",
"Ridge Regression",
"Lasso Regression",
"PCR",
"PLS")
test_mse <- c(lm_mse,
ridge_test_mse,
lasso_test_mse,
pcr_test_mse,
pls_test_mse)
results <- data.frame(
Model = model_names,
Test_MSE = test_mse
)
resultsThe
PCR modelachieved the lowest test error 1845839, closely followed byPLS model1,863,155 andlinear regression model1,878,956. Thelasso modelperformed very similarly toleast squares, suggesting that only limited variable selection benefit was gained in this case. Theridge regression modelperformed the worst, with a noticeably higher test MSE 2583504 indicating that in this dataset, stronger coefficient shrinkage may have led to underfitting.
We will now try to predict per capita crime rate in the Boston data set.
## Warning: package 'MASS' was built under R version 4.5.3
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
First, lets start with ridge regression model and regress all the variables againts
crim(per capita crime rate by town). For us to apply these model we need to split the data into train and test.
set.seed(35)
train_boston_sample <- sample(nrow(Boston), nrow(Boston) *0.5)
train_size_boston <- Boston[train_boston_sample, ]
test_size_boston <- Boston[-train_boston_sample, ]
dim(train_size_boston)## [1] 253 14
## [1] 253 14
As next, we need to prepare testing matrix and response variable
x_train_boston <- model.matrix(crim ~ ., train_size_boston)[, -1]
y_train_boston <- train_size_boston$crim
x_test_boston <- model.matrix(crim ~ ., test_size_boston)[, -1]
y_test_boston <- test_size_boston$crimFitting the ridge regession
now we need to perform cross-validation to find the optimal lambda
set.seed(35)
cv_ridge_boston <- cv.glmnet(x_train_boston, y_train_boston, alpha = 0)
best_lambda_boston <- cv_ridge_boston$lambda.min
print(best_lambda_boston)## [1] 0.539526
we can also plot our best lambda value
now we can use our
best_lambda_bostonto predict housing prices on the test data split and calculate the Test Mean Squared Error (MSE).
ridge_pred_boston <- predict(ridge_model_boston, s = best_lambda_boston, newx = x_test_boston)
test_mse_ridge_boston <- mean((ridge_pred - y_test)^2)
print(test_mse_ridge_boston)## [1] 1228067
Our Ridge Model MSE: 2583504, now we can do the same steps for the other models and observe test error.
lets fit Lasso regression
Just like before, we run cross-validation with our seed to find the optimal \(\lambda\) for Lasso.
set.seed(210)
cv_lasso_boston <- cv.glmnet(x_train_boston, y_train_boston, alpha = 1)
best_lambda_lasso <- cv_lasso_boston$lambda.min
best_lambda_lasso## [1] 0.03549721
Next, Predict on test set using the corrected prediction variable name
lasso_pred_boston <- predict(lasso_model_boston, s = best_lambda_lasso, newx = x_test_boston)
test_mse_lasso_boston <- mean((lasso_pred - y_test)^2)
print(test_mse_lasso_boston)## [1] 1451952
Our lasso test error is 1882754, which is significantly smaller than our Ridge model’s test error.
we can try PCR model as next
set.seed(210)
pcr_model_boston <- pcr(crim~., data=train_size_boston, scale=TRUE, validation="CV")
summary(pcr_model_boston)## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 7.892 6.323 6.337 5.955 5.961 5.922 5.998
## adjCV 7.892 6.319 6.333 5.945 5.954 5.914 5.986
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.998 5.861 5.814 5.879 5.894 5.753 5.632
## adjCV 5.986 5.844 5.798 5.864 5.880 5.735 5.614
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.80 61.37 70.48 77.12 83.65 88.38 91.56 93.70
## crim 36.96 36.96 45.23 45.63 46.08 46.27 46.36 49.17
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.56 97.19 98.54 99.59 100.00
## crim 50.02 50.02 50.05 52.44 54.63
Our PCR model suggests that all 12 compenents in our data contain useful predictive information with the lowest RMSEP and highest % variance explained. we can furhter confirm this by utilizing validation plot and calling the minimum RMSEP.
## [1] 13
We can now predict with PCR model and observe test error
## [1] 54.17217
Our pcr test error is a lot lower than previous ridge and lasso models with 54.67936
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.
model_comparison <- data.frame(
Model = c("Ridge Regression",
"Lasso Regression",
"Principal Components Regression (PCR)"),
Test_MSE = c(test_mse_ridge_boston,
test_mse_lasso_boston,
test_mse_pcr_boston))
model_comparison$Test_RMSE <- sqrt(model_comparison$Test_MSE)
model_comparison$Method <- c("L2 Regularization", "L1 Regularization","Dimension Reduction")
model_comparison$Test_MSE <- round(model_comparison$Test_MSE, 2)
model_comparison$Test_RMSE <- round(model_comparison$Test_RMSE, 2)
model_comparison <- model_comparison[order(model_comparison$Test_MSE), ]
model_comparisonThe PCR model performed substantially better than both the Ridge and Lasso regression models, achieving the lowest test MSE (54.68) and RMSE (7.39), indicating the most accurate predictions on the test data. In contrast, the Ridge and Lasso models had much larger prediction errors, with Ridge performing the worst. Based on these results, PCR appears to be the most effective model for this dataset, likely because reducing the predictors to a smaller set of principal components better captured the underlying structure of the data.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No. my PCR model does not directly use all of the original features. Instead, it transforms the predictors into a smaller number of principal components and uses only the selected components for prediction. This helps reduce the dimensionality of the data while retaining most of the important information, which can improve prediction accuracy and reduce overfitting.