Assignment 5

ISLR Chapter 06 (page 283): 2, 9, 11


Q2. Bias–variance trade off for lasso, ridge, and non-linear regression methods compared with least squares.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

    i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

    ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

    iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

    iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

** Correct answer is iii. Because Lasso is less flexible than least squares because it shrinks coefficients near or absolute zero. This increases bias but reduces variance. Prediction improves when the decrease in variance is greater than the increase in bias.

(b) Repeat (a) for ridge regression relative to least squares.

** Correct answer is iii. Because Ridge regression is also less flexible than least squares because it shrinks coefficients toward zero. Like lasso, it increases bias and reduces variance, improving prediction when the reduction in variance outweighs the increase in bias.

(c) Repeat (a) for non-linear methods relative to least squares.

** Correct answer is ii. Because Non-linear methods are generally more flexible than least squares. Greater flexibility reduces bias but increases variance. They improve prediction when the increase in variance is smaller than the decrease in bias.

Q9.Comparison of predictive models (Linear Regression, Ridge, Lasso, PCR, and PLS) for predicting college applications using test MSE

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
data(College)
head(College)

(a) Split the data set into a training set and a test set.

set.seed(210)

train_sample <- sample(nrow(College), nrow(College) *0.5)

train_size <- College[train_sample, ]
test_size <- College[-train_sample, ]

dim(train_size)
## [1] 388  18
dim(test_size)
## [1] 389  18

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm_fit <- lm(Apps ~., data = train_size)
summary(lm_fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train_size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3409.8  -357.9     3.3   335.1  6784.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.036e+02  5.723e+02  -1.055 0.292204    
## PrivateYes  -3.076e+02  1.955e+02  -1.573 0.116559    
## Accept       1.736e+00  4.689e-02  37.024  < 2e-16 ***
## Enroll      -7.761e-01  2.254e-01  -3.444 0.000640 ***
## Top10perc    2.559e+01  7.571e+00   3.380 0.000803 ***
## Top25perc   -3.253e+00  5.988e+00  -0.543 0.587339    
## F.Undergrad  2.445e-02  3.959e-02   0.618 0.537271    
## P.Undergrad  4.676e-03  3.834e-02   0.122 0.902987    
## Outstate    -9.543e-02  2.493e-02  -3.828 0.000152 ***
## Room.Board   2.856e-02  6.548e-02   0.436 0.662972    
## Books       -2.487e-02  3.771e-01  -0.066 0.947457    
## Personal     3.629e-03  7.907e-02   0.046 0.963421    
## PhD         -9.092e+00  6.549e+00  -1.388 0.165854    
## Terminal    -5.359e+00  7.166e+00  -0.748 0.454964    
## S.F.Ratio    2.632e+01  1.980e+01   1.330 0.184437    
## perc.alumni  4.141e+00  5.340e+00   0.775 0.438625    
## Expend       1.350e-01  1.860e-02   7.257 2.34e-12 ***
## Grad.Rate    8.200e+00  4.001e+00   2.049 0.041128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 995 on 370 degrees of freedom
## Multiple R-squared:  0.9401, Adjusted R-squared:  0.9374 
## F-statistic: 341.6 on 17 and 370 DF,  p-value: < 2.2e-16

we fit our linear model using our train size now, we can predict using test size and obtain test error.

lm_pred <- predict(lm_fit, newdata = test_size)

lm_mse <- mean((test_size$Apps - lm_pred)^2)
print(lm_mse)
## [1] 1415912

Our test error: 1878956

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-10
x_train <- model.matrix(Apps ~ ., train_size)[, -1]
y_train <- train_size$Apps

x_test <- model.matrix(Apps ~ ., test_size)[, -1]
y_test <- test_size$Apps

now we converted splitted data into matrixes and fit the ridge regression model.

set.seed(210)

cv_ridge <- cv.glmnet(x_train,  y_train,  alpha = 0)

we can now get the best lambda value

best_lambda <- cv_ridge$lambda.min
best_lambda
## [1] 377.3813
ridge_pred <- predict(cv_ridge, s = best_lambda, newx = x_test)
ridge_test_mse <- mean((y_test - ridge_pred)^2)
ridge_test_mse
## [1] 1228067

Ridge regression produced larger mse error with 2583504

plot(cv_ridge)

>

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(210)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda <- cv_lasso$lambda.min
best_lambda
## [1] 32.8227
lasso_pred <- predict(cv_lasso, s = best_lambda,newx = x_test)
lasso_test_mse <- mean((y_test - lasso_pred)^2)
lasso_test_mse
## [1] 1451952

Our lasso model test error is 1882754 slightly higher than our linear model and lot less than our ridge model.

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## Warning: package 'pls' was built under R version 4.5.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(210)

pcr_fit <- pcr(Apps~., data=train_size, scale=T, validation="CV")
summary(pcr_fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3980     3995     2302     2294     2301     2011     1977
## adjCV         3980     3995     2297     2290     2320     1961     1965
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1965     1929     1803      1809      1821      1838      1872
## adjCV     1956     1915     1790      1798      1808      1824      1861
##        14 comps  15 comps  16 comps  17 comps
## CV         1644      1249      1178      1149
## adjCV      1584      1231      1166      1138
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     3.175e+01    57.77    65.36    71.07    76.27    81.15    84.66    88.01
## Apps  5.501e-04    68.37    68.54    68.95    78.93    79.17    79.70    80.70
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.80     92.99     95.07     96.69     97.77     98.59     99.31
## Apps    83.79     84.12     84.33     84.35     84.35     92.86     93.49
##       16 comps  17 comps
## X        99.77    100.00
## Apps     93.75     94.01

Looking at the summary, the 16 component model achieves the lowest RMSEP score while capturing the maximum possible explained variance for the output (92.45%), matching the 17 component model with fewer variables.

validationplot(pcr_fit, val.type = "MSEP")

M <- which.min(pcr_fit$validation$PRESS)
M
## [1] 17

We could now apply our model on the test size and observ the test error.

pcr_pred <- predict(pcr_fit, newdata = test_size, ncomp = M)
pcr_test_mse <- mean((test_size$Apps - pcr_pred)^2)
pcr_test_mse
## [1] 1415912

PCR model have the lowest test error within our previous model with 1845839

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(210)

pls_fit <- plsr(  Apps ~ .,  data = train_size,  scale = TRUE,  validation = "CV")

After fitting the model we could plot the error

validationplot(pls_fit, val.type = "MSEP")

M <- which.min(RMSEP(pls_fit)$val[1,, -1])
M
## 9 comps 
##       9

now, we can predict using best m value and compute test MSE.

pls_pred <- predict(pls_fit, newdata = test_size, ncomp = M)

pls_test_mse <- mean((test_size$Apps - pls_pred)^2)
pls_test_mse
## [1] 1426303

Our observed test error fot PLS model is 1863155

(g) Comment on the results obtained. How accurately can we pre dict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Lets look at all the test errors of our different model approaches

model_names <- c("Linear Regression",
                 "Ridge Regression",
                 "Lasso Regression",
                 "PCR",
                 "PLS")

test_mse <- c(lm_mse,
              ridge_test_mse,
              lasso_test_mse,
              pcr_test_mse,
              pls_test_mse)

results <- data.frame(
  Model = model_names,
  Test_MSE = test_mse
)

results

The PCR model achieved the lowest test error 1845839, closely followed by PLS model 1,863,155 and linear regression model 1,878,956. The lasso model performed very similarly to least squares, suggesting that only limited variable selection benefit was gained in this case. The ridge regression model performed the worst, with a noticeably higher test MSE 2583504 indicating that in this dataset, stronger coefficient shrinkage may have led to underfitting.

Q11. Comparison of Ridge, Lasso, and PCR models for predicting per capita crime rate in the Boston dataset based on test set performance.

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
## Warning: package 'MASS' was built under R version 4.5.3
data("Boston")
head(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

First, lets start with ridge regression model and regress all the variables againts crim (per capita crime rate by town). For us to apply these model we need to split the data into train and test.

set.seed(35)

train_boston_sample <- sample(nrow(Boston), nrow(Boston) *0.5)

train_size_boston <- Boston[train_boston_sample, ]
test_size_boston <- Boston[-train_boston_sample, ]

dim(train_size_boston)
## [1] 253  14
dim(test_size_boston)
## [1] 253  14

As next, we need to prepare testing matrix and response variable

x_train_boston <- model.matrix(crim ~ ., train_size_boston)[, -1]
y_train_boston <- train_size_boston$crim

x_test_boston <- model.matrix(crim ~ ., test_size_boston)[, -1]
y_test_boston <- test_size_boston$crim

Fitting the ridge regession

ridge_model_boston <- glmnet(x_train_boston, y_train_boston, alpha = 0)

now we need to perform cross-validation to find the optimal lambda

set.seed(35)

cv_ridge_boston <- cv.glmnet(x_train_boston, y_train_boston, alpha = 0)

best_lambda_boston <- cv_ridge_boston$lambda.min

print(best_lambda_boston)
## [1] 0.539526

we can also plot our best lambda value

plot(cv_ridge_boston)

now we can use our best_lambda_boston to predict housing prices on the test data split and calculate the Test Mean Squared Error (MSE).

ridge_pred_boston <- predict(ridge_model_boston, s = best_lambda_boston, newx = x_test_boston)

test_mse_ridge_boston <- mean((ridge_pred - y_test)^2)
print(test_mse_ridge_boston)
## [1] 1228067

Our Ridge Model MSE: 2583504, now we can do the same steps for the other models and observe test error.

lets fit Lasso regression

lasso_model_boston <- glmnet(x_train_boston, y_train_boston, alpha = 1)

Just like before, we run cross-validation with our seed to find the optimal \(\lambda\) for Lasso.

set.seed(210)
cv_lasso_boston <- cv.glmnet(x_train_boston, y_train_boston, alpha = 1)

best_lambda_lasso <- cv_lasso_boston$lambda.min
best_lambda_lasso
## [1] 0.03549721

Next, Predict on test set using the corrected prediction variable name

lasso_pred_boston <- predict(lasso_model_boston, s = best_lambda_lasso, newx = x_test_boston)

test_mse_lasso_boston <- mean((lasso_pred - y_test)^2)
print(test_mse_lasso_boston)
## [1] 1451952

Our lasso test error is 1882754, which is significantly smaller than our Ridge model’s test error.

we can try PCR model as next

set.seed(210)

pcr_model_boston <- pcr(crim~., data=train_size_boston, scale=TRUE, validation="CV")
summary(pcr_model_boston)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           7.892    6.323    6.337    5.955    5.961    5.922    5.998
## adjCV        7.892    6.319    6.333    5.945    5.954    5.914    5.986
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       5.998    5.861    5.814     5.879     5.894     5.753     5.632
## adjCV    5.986    5.844    5.798     5.864     5.880     5.735     5.614
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.80    61.37    70.48    77.12    83.65    88.38    91.56    93.70
## crim    36.96    36.96    45.23    45.63    46.08    46.27    46.36    49.17
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.56     97.19     98.54     99.59    100.00
## crim    50.02     50.02     50.05     52.44     54.63

Our PCR model suggests that all 12 compenents in our data contain useful predictive information with the lowest RMSEP and highest % variance explained. we can furhter confirm this by utilizing validation plot and calling the minimum RMSEP.

validationplot(pcr_model_boston, val.type = "MSEP")

M_pcr_boston <- which.min(pcr_model_boston$validation$PRESS)
M_pcr_boston
## [1] 13

We can now predict with PCR model and observe test error

pcr_pred_boston <- predict(pcr_model_boston, newdata = test_size_boston, ncomp = M_pcr_boston)
test_mse_pcr_boston <- mean((test_size_boston$crim - pcr_pred_boston)^2)
test_mse_pcr_boston
## [1] 54.17217

Our pcr test error is a lot lower than previous ridge and lasso models with 54.67936

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

model_comparison <- data.frame(
  Model = c("Ridge Regression",
            "Lasso Regression",
            "Principal Components Regression (PCR)"),
  Test_MSE = c(test_mse_ridge_boston,
               test_mse_lasso_boston,
               test_mse_pcr_boston))

model_comparison$Test_RMSE <- sqrt(model_comparison$Test_MSE)


model_comparison$Method <- c("L2 Regularization", "L1 Regularization","Dimension Reduction")
model_comparison$Test_MSE <- round(model_comparison$Test_MSE, 2)
model_comparison$Test_RMSE <- round(model_comparison$Test_RMSE, 2)


model_comparison <- model_comparison[order(model_comparison$Test_MSE), ]

model_comparison

The PCR model performed substantially better than both the Ridge and Lasso regression models, achieving the lowest test MSE (54.68) and RMSE (7.39), indicating the most accurate predictions on the test data. In contrast, the Ridge and Lasso models had much larger prediction errors, with Ridge performing the worst. Based on these results, PCR appears to be the most effective model for this dataset, likely because reducing the predictors to a smaller set of principal components better captured the underlying structure of the data.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No. my PCR model does not directly use all of the original features. Instead, it transforms the predictors into a smaller number of principal components and uses only the selected components for prediction. This helps reduce the dimensionality of the data while retaining most of the important information, which can improve prediction accuracy and reduce overfitting.