Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(b) Repeat (a) for ridge regression relative to least squares.

(c) Repeat (a) for non-linear methods relative to least squares.

Answer

(a) The lasso relative to least squares is:

i. Incorrect

ii. Incorrect

iii. Correct

iv. Incorrect

The lasso shrinks the coefficients (and can set some exactly to zero), so it is less flexible than least squares; it improves prediction accuracy when the resulting increase in bias is smaller than the decrease in variance.

(b) Ridge regression relative to least squares is:

i. Incorrect

ii. Incorrect

iii. Correct

iv. Incorrect

Like the lasso, ridge regression shrinks the coefficients toward zero and is therefore less flexible than least squares, so the same reasoning as in (a) applies.

(c) Non-linear methods relative to least squares are:

i. Incorrect

ii. Correct

iii. Incorrect

iv. Incorrect

Non-linear methods are more flexible than least squares; the added flexibility lowers bias but raises variance, so they improve prediction accuracy when the increase in variance is smaller than the decrease in bias.
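As an illustration of the reasoning above, here is a minimal simulation sketch of our own (not part of the exercise; the sample sizes and coefficients are arbitrary choices) in which most predictors are irrelevant, so the lasso's small increase in bias buys a large decrease in variance:

# Sketch: lasso vs. least squares when only a few predictors matter.
# (Our own illustrative simulation; settings are arbitrary.)
library(glmnet)

set.seed(1)
n <- 100; p <- 50
beta <- c(rep(2, 5), rep(0, p - 5))              # only the first 5 predictors are relevant
x_tr <- matrix(rnorm(n * p), n, p)
x_te <- matrix(rnorm(n * p), n, p)
y_tr <- drop(x_tr %*% beta + rnorm(n, sd = 3))
y_te <- drop(x_te %*% beta + rnorm(n, sd = 3))

ols_fit    <- lm(y_tr ~ x_tr)
ols_pred   <- cbind(1, x_te) %*% coef(ols_fit)   # predict by hand to keep column matching simple
cv_fit     <- cv.glmnet(x_tr, y_tr, alpha = 1)   # alpha = 1 is the lasso
lasso_pred <- predict(cv_fit, newx = x_te, s = "lambda.min")

c(ols_mse = mean((ols_pred - y_te)^2),
  lasso_mse = mean((lasso_pred - y_te)^2))

With many noise predictors, the lasso's test MSE is typically lower than that of least squares, consistent with answer iii. in part (a).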

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings

(a) Split the data set into a training set and a test set.

set.seed(42)
train_index <- createDataPartition(College$Apps, p = 0.7, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm_model <- lm(Apps ~ ., data = train_data)
lm_preds <- predict(lm_model, newdata = test_data, type = 'response')
lm_test_error <- mean((lm_preds - test_data$Apps)^2)
lm_test_error
## [1] 1390507

MSE of linear model using least squares = 1390507
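To put this number on the scale of the response (a small sketch of our own, not required by the problem), the test RMSE can be compared with the typical number of applications:

# Test RMSE in units of applications, alongside the scale of the response
sqrt(lm_test_error)      # root mean squared test error
mean(test_data$Apps)     # average number of applications in the test set
sd(test_data$Apps)       # spread of the response, for reference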

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

x_train <- model.matrix(Apps ~ ., data = train_data)[, -1] #removes intercept
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]

#ridge regression
grid <- 10^seq(10, -2, length = 100)

cv_ridge <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 0) #alpha = 0 is for ridge
best_lambda_ridge <- cv_ridge$lambda.min
ridge_preds <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
ridge_test_error <- mean((ridge_preds - test_data$Apps)^2)
ridge_test_error
## [1] 1390359

MSE of ridge regression = 1390359

Its test MSE is slightly lower than that of least squares, so ridge regression performs marginally better here.
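One way to see why ridge barely differs from least squares here is to inspect the selected penalty; a very small λ means almost no shrinkage. A quick sketch of our own (exact values depend on the seed and CV folds):

# A near-zero lambda.min means the ridge fit is close to plain least squares
best_lambda_ridge   # penalty selected by cross-validation above
plot(cv_ridge)      # CV error as a function of log(lambda)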

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

#lasso regression
cv_lasso <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 1) #alpha = 1 makes it lasso
best_lambda_lasso <- cv_lasso$lambda.min
lasso_preds <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
lasso_test_error <- mean((lasso_preds - test_data$Apps)^2)
lasso_test_error
## [1] 1390191

MSE of lasso regression = 1390191

It is slightly lower than both previous models, so the lasso performs marginally best so far.

lasso_coef <- predict(cv_lasso, type = 'coefficient', s = best_lambda_lasso)
lasso_coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -596.78495261
## PrivateYes  -339.69792355
## Accept         1.65416184
## Enroll        -1.00517762
## Top10perc     44.69859446
## Top25perc    -13.37564877
## F.Undergrad    0.06111737
## P.Undergrad    0.02135864
## Outstate      -0.08342593
## Room.Board     0.08516271
## Books          0.19320953
## Personal       0.04506171
## PhD           -9.88092135
## Terminal      -0.07124284
## S.F.Ratio     18.75160346
## perc.alumni    1.73925789
## Expend         0.09175698
## Grad.Rate      6.11088222

All 17 coefficient estimates (excluding the intercept) are non-zero, so at the selected λ the lasso does not drop any predictors.
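Rather than counting entries by eye, the number of non-zero coefficients can be computed directly (a small sketch of our own):

# Count non-zero coefficient estimates, excluding the intercept
sum(as.vector(lasso_coef) != 0) - 1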

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = 'CV')
summary(pcr_model)
## Data:    X dimension: 545 17 
##  Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3925     3946     2147     2128     2110     1771     1747
## adjCV         3925     3947     2145     2127     2179     1763     1742
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1743     1750     1655      1620      1610      1615      1622
## adjCV     1737     1745     1647      1613      1605      1610      1617
##        14 comps  15 comps  16 comps  17 comps
## CV         1646      1375      1174      1119
## adjCV      1662      1355      1164      1111
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     31.22181    57.32    64.52    70.43    76.09    80.77    84.43    87.74
## Apps   0.02335    70.85    71.39    71.69    81.41    81.71    81.84    81.86
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.51     92.89     95.07     96.72     97.81     98.63     99.35
## Apps    84.05     85.05     85.22     85.24     85.30     85.40     92.19
##       16 comps  17 comps
## X        99.83    100.00
## Apps     93.45     93.88
validationplot(pcr_model, val.type = 'MSEP')

best_M_pcr <- which.min(pcr_model$validation$PRESS)
pcr_preds <- predict(pcr_model, newdata = test_data, ncomp = best_M_pcr)
pcr_test_error <- mean((pcr_preds - test_data$Apps)^2)
list(pcr_test_error, best_M_pcr)
## [[1]]
## [1] 1390507
## 
## [[2]]
## [1] 17

MSE of PCR = 1390507

optimal M = 17

The validation plot shows the lowest MSEP when all 17 components are retained. With M = 17, PCR uses every principal component and is therefore equivalent to least squares, which is why its test MSE matches the linear model exactly.
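As a quick check (our own sketch), the full-component PCR predictions should coincide with the least-squares predictions up to numerical precision:

# With M = 17 (all components) PCR is equivalent to least squares,
# so the two sets of test predictions should agree almost exactly.
max(abs(pcr_preds - lm_preds))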

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls_model <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = 'CV')
summary(pls_model)
## Data:    X dimension: 545 17 
##  Y dimension: 545 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3925     1983     1565     1520     1384     1259     1143
## adjCV         3925     1978     1543     1469     1366     1247     1133
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1133     1124     1119      1117      1118      1116      1112
## adjCV     1124     1115     1110      1108      1110      1107      1103
##        14 comps  15 comps  16 comps  17 comps
## CV         1112      1112      1112      1112
## adjCV      1103      1103      1103      1103
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.94    31.61    53.18    65.41    70.15    73.61    77.40    79.96
## Apps    76.28    86.90    88.60    91.29    92.76    93.60    93.67    93.76
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.89     86.76     89.51     91.14     92.60     94.80     97.15
## Apps    93.82     93.84     93.85     93.87     93.88     93.88     93.88
##       16 comps  17 comps
## X        99.02    100.00
## Apps     93.88     93.88
validationplot(pls_model, val.type = 'MSEP')

best_M_pls <- which.min(pls_model$validation$PRESS)
pls_preds <- predict(pls_model, newdata = test_data, ncomp = 9)
pls_test_error <- mean((pls_preds - test_data$Apps)^2)

list(pls_test_error, best_M_pls)
## [[1]]
## [1] 1378551
## 
## [[2]]
## [1] 13

MSE of pls = 1378551

Optimal M = 13

We report the test error at M = 9 rather than the CV-selected M = 13: the cross-validated error is essentially flat beyond 9 components and the variance explained barely changes, so the simpler 9-component model is preferred.
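To make that trade-off explicit (a sketch of our own), the test error at M = 9 can be compared with the error at the CV-selected M = 13:

# If the two test MSEs are close, the extra four components add little
pls_preds_13 <- predict(pls_model, newdata = test_data, ncomp = best_M_pls)
c(M9  = mean((pls_preds - test_data$Apps)^2),
  M13 = mean((pls_preds_13 - test_data$Apps)^2))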

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The results indicate that all five models produce similar test errors, suggesting that our ability to predict the number of college applications received is somewhat limited.

We have a small difference in MSE across most models, meaning that no single approach drastically outperforms the others.
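For reference, the five test errors can be gathered side by side (a small sketch of our own):

# Collect the five test MSEs computed above and sort them
test_mse <- c(OLS   = lm_test_error,
              Ridge = ridge_test_error,
              Lasso = lasso_test_error,
              PCR   = pcr_test_error,
              PLS   = pls_test_error)
sort(test_mse)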

In conclusion, PLS was our best model, offering the lowest test MSE while using fewer components, but even the best model does not predict the number of applications with high accuracy.

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
data(Boston)
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

x <- Boston[, names(Boston) != "crim"] #drop the response column to keep only the predictors
y <- Boston$crim

set.seed(42)
train_index2 <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
train_data2 <- Boston[train_index2, ]
test_data2 <- Boston[-train_index2, ]

Best subset selection

library(leaps)
best_subset <- regsubsets(crim ~ ., data = train_data2, nvmax = ncol(train_data2) - 1)
best_subset_summary <- summary(best_subset)

par(mfrow = c(2, 2))

plot(best_subset_summary$rss, type = 'l', pch = 19, xlab = 'Number of predictors', 
     ylab = 'RSS')

plot(best_subset_summary$adjr2, type = 'l', pch = 19, xlab = 'Number of predictors', 
     ylab = 'AdjR2')
p = which.max(best_subset_summary$adjr2)
points(p, best_subset_summary$adjr2[p], col = 'red', cex = 2, pch = 20)

plot(best_subset_summary$cp, type = 'l', pch = 19, xlab = 'Number of predictors', 
     ylab = 'CP')
p = which.min(best_subset_summary$cp)
points(p, best_subset_summary$cp[p], col = 'red', cex = 2, pch = 20)

plot(best_subset_summary$bic, type = 'l', pch = 19, xlab = 'Number of predictors', 
     ylab = 'BIC')
p = which.min(best_subset_summary$bic)
points(p, best_subset_summary$bic[p], col = 'red', cex = 2, pch = 20)

Based on the adjusted R-squared, Cp, and BIC plots, models with roughly 3 and 6 predictors are reasonable choices; adding further predictors improves these criteria only marginally.

best_subset_summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train_data2, nvmax = ncol(train_data2) - 
##     1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " "   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  " " " "   " "  "*" " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" " " " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" " "   "*"  "*" " " "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

The 3-predictor model uses rad, black, and lstat.

The 6-predictor model uses zn, nox, dis, rad, black, and medv.

best_subset_model <- lm(crim ~ black + rad + lstat, data = train_data2)
best_subset_model_larger <- lm(crim ~ zn + nox + dis + rad + black + medv, data = train_data2)

bs_preds <- predict(best_subset_model, newdata = test_data2)
bs_larger_preds <- predict(best_subset_model_larger, newdata = test_data2)

mean((bs_preds - test_data2$crim)^2)
## [1] 45.90462
mean((bs_larger_preds - test_data2$crim)^2)
## [1] 44.41283

The larger model is indeed slightly better, but not by much; adding three more variables is hardly worth such a small improvement in test MSE.
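Since regsubsets() has no built-in predict() method, a small helper (our own addition, following the usual ISLR lab pattern) makes it easy to compute the validation MSE for every subset size, not just 3 and 6:

# Helper: predictions from a regsubsets fit for a chosen model size
# (our own addition, in the style of the ISLR labs)
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])   # the formula used in the original fit
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi), drop = FALSE] %*% coefi
}

# Validation-set MSE for every subset size from 1 to 13
val_mse <- sapply(1:13, function(i)
  mean((predict(best_subset, test_data2, id = i) - test_data2$crim)^2))
round(val_mse, 2)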

Lasso

x_train2 <- model.matrix(crim ~ ., data = train_data2)[, -1]
y_train2 <- train_data2$crim
x_test2 <- model.matrix(crim ~ ., data = test_data2)[, -1]

cv_lasso2 <- cv.glmnet(x_train2, y_train2, lambda = grid, alpha = 1) #alpha = 1 for lasso; reuses the lambda grid from Problem 9

#optimal lambda
best_lambda_lasso2 <- cv_lasso2$lambda.min

lasso_model <- glmnet(x_train2, y_train2, alpha = 1, lambda = best_lambda_lasso2)
lasso_pred <- predict(lasso_model, s = best_lambda_lasso2, newx = x_test2)

#mse
lasso_mse <- mean((lasso_pred - test_data2$crim)^2)
lasso_mse
## [1] 43.91328
predict(cv_lasso2, type = 'coefficient', s = best_lambda_lasso2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 12.490672439
## zn           0.032243142
## indus       -0.048258168
## chas        -0.989551968
## nox         -8.040554865
## rm           0.263696046
## age          0.008175934
## dis         -0.621009082
## rad          0.504244683
## tax          .          
## ptratio     -0.155705969
## black       -0.012575303
## lstat        0.099700619
## medv        -0.123537799

PCR

pcr_model2 <- pcr(crim ~ ., data = train_data2, scale = TRUE, validation = 'CV')
summary(pcr_model2)
## Data:    X dimension: 356 13 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.181    7.192    6.659    6.698    6.704    6.731
## adjCV        8.444    7.177    7.189    6.652    6.691    6.697    6.722
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.730    6.628    6.711     6.715     6.727     6.713     6.619
## adjCV    6.721    6.618    6.697     6.698     6.711     6.694     6.600
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.12    59.96    68.89    75.77    82.48    87.54    90.78    93.34
## crim    28.69    28.72    39.09    39.11    39.11    39.14    39.45    41.56
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.24     96.98     98.41     99.47    100.00
## crim    41.82     42.07     42.07     43.12     44.56
validationplot(pcr_model2, val.type = 'MSEP')

pcr_preds2 <- predict(pcr_model2, newdata = test_data2, ncomp = 8)
mean((pcr_preds2 - test_data2$crim)^2)
## [1] 45.95997
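Instead of reading M off the validation plot, the component count minimizing the cross-validated PRESS can be extracted directly, as in Problem 9 (a small sketch of our own):

# Component count minimizing cross-validated PRESS, and its test MSE
best_M_pcr2 <- which.min(pcr_model2$validation$PRESS)
pcr_preds2_best <- predict(pcr_model2, newdata = test_data2, ncomp = best_M_pcr2)
c(M = best_M_pcr2, test_mse = mean((pcr_preds2_best - test_data2$crim)^2))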

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

Lasso regression appears to be the best-performing of the three models we tested on this data set, although it outperforms best subset selection only by a small margin.

Even so, the model we would propose is the best subset model: its validation error is only slightly worse, and it uses far fewer predictors. As we saw, the lasso keeps almost every predictor (dropping only tax), which is surprising given that best subset selection and PCR worked with roughly 3 to 8.
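For reference (a small sketch of our own), the validation-set errors of the approaches considered here can be placed side by side:

# Validation-set MSE for the models considered on Boston$crim
c(subset_3 = mean((bs_preds - test_data2$crim)^2),
  subset_6 = mean((bs_larger_preds - test_data2$crim)^2),
  lasso    = lasso_mse,
  pcr_8    = mean((pcr_preds2 - test_data2$crim)^2))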

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Our chosen model does not involve all of the features in the data set. Once the first few predictors are included, adding more variables yields very little improvement in validation performance, so the extra complexity is not justified.