library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## Warning: package 'pls' was built under R version 4.4.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(leaps)
## Warning: package 'leaps' was built under R version 4.4.3
attach(College)
attach(Boston)

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer: iii

The lasso is less flexible than least squares: the ℓ1 penalty constrains the coefficient estimates and shrinks some of them exactly to zero. It gives improved prediction accuracy when the resulting decrease in variance outweighs the increase in bias.

(b) Repeat (a) for ridge regression relative to least squares.

Answer: iii

Ridge regression is also less flexible than least squares because it shrinks all coefficients toward zero. Like lasso, ridge regression performs better when its decrease in variance outweighs its increase in bias.
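To see the difference between the two penalties concretely, here is a minimal simulated sketch (hypothetical data, not part of the exercise): at the same λ, the lasso sets some coefficients exactly to zero, while ridge only shrinks them toward zero.

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))  # only the first 3 coefficients are truly non-zero
y <- X %*% beta + rnorm(n)

# Fit both penalties at the same lambda: lasso (alpha = 1) zeroes out the
# null coefficients, ridge (alpha = 0) keeps them small but non-zero
lam <- 0.5
cbind(lasso = as.numeric(coef(glmnet(X, y, alpha = 1, lambda = lam))),
      ridge = as.numeric(coef(glmnet(X, y, alpha = 0, lambda = lam))))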

(c) Repeat (a) for non-linear methods relative to least squares.

Answer: ii

Non-linear methods are more flexible than least squares because they can model curved relationships. When non-linear methods work better, it’s because their decrease in bias exceeds their increase in variance.
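As a quick illustration (simulated data, not part of the exercise), a flexible polynomial fit beats the least-squares line on a curved relationship because the drop in bias outweighs the added variance:

set.seed(2)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.3)  # truly non-linear relationship
train <- sample(200, 100)

fit_lin  <- lm(y ~ x, subset = train)           # inflexible: high bias
fit_poly <- lm(y ~ poly(x, 5), subset = train)  # flexible: lower bias

# Test MSEs: the non-linear fit should win here
mean((y[-train] - predict(fit_lin,  data.frame(x = x[-train])))^2)
mean((y[-train] - predict(fit_poly, data.frame(x = x[-train])))^2)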

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

set.seed(20)
train_index <- createDataPartition(College$Apps, p = 0.7, list = FALSE)
col_train <- College[train_index, ]
col_test <- College[-train_index, ]
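A quick sanity check on the split sizes (the exact counts depend on how createDataPartition groups the outcome):

# Roughly a 70/30 split of the 777 colleges
c(train = nrow(col_train), test = nrow(col_test))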

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

col_fit <- lm(Apps ~ ., data = col_train)
summary(col_fit)
## 
## Call:
## lm(formula = Apps ~ ., data = col_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3621.7  -451.5     1.4   351.2  7033.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -332.20811  495.74137  -0.670 0.503072    
## PrivateYes  -344.96641  164.64642  -2.095 0.036630 *  
## Accept         1.63836    0.04609  35.548  < 2e-16 ***
## Enroll        -0.85727    0.24060  -3.563 0.000400 ***
## Top10perc     57.69812    6.82745   8.451 2.84e-16 ***
## Top25perc    -22.00444    5.58659  -3.939 9.29e-05 ***
## F.Undergrad    0.06140    0.04132   1.486 0.137862    
## P.Undergrad    0.05219    0.03639   1.434 0.152084    
## Outstate      -0.08962    0.02374  -3.776 0.000178 ***
## Room.Board     0.14151    0.05789   2.445 0.014831 *  
## Books         -0.17329    0.27292  -0.635 0.525732    
## Personal       0.02192    0.07426   0.295 0.767957    
## PhD           -7.56687    5.55053  -1.363 0.173380    
## Terminal      -3.38850    6.10177  -0.555 0.578904    
## S.F.Ratio     15.49485   15.90679   0.974 0.330453    
## perc.alumni    2.19067    5.04900   0.434 0.664552    
## Expend         0.07599    0.01433   5.304 1.67e-07 ***
## Grad.Rate      8.33272    3.64965   2.283 0.022819 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1067 on 527 degrees of freedom
## Multiple R-squared:  0.9307, Adjusted R-squared:  0.9284 
## F-statistic: 416.1 on 17 and 527 DF,  p-value: < 2.2e-16
col_preds <- predict(col_fit, newdata = col_test)
col_er <- mean((col_preds - col_test$Apps)^2)
col_er
## [1] 1043245

After fitting a linear model using least squares on the training set, the test error was 1043245.
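An MSE above a million is easier to interpret on the original scale; the RMSE below (about 1,021) is the typical size of a prediction error in applications received.

# RMSE: typical prediction error in units of applications
sqrt(col_er)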

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

x_train <- model.matrix(Apps ~ ., data = col_train)[, -1]
y_train <- col_train$Apps
x_test <- model.matrix(Apps ~ ., data = col_test)[, -1]

grid <- 10^seq(10, -2, length = 100)

col_grid <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 0)
grid_lambda <- col_grid$lambda.min
col_grid
## 
## Call:  cv.glmnet(x = x_train, y = y_train, lambda = grid, alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure     SE Nonzero
## min   0.01   100 1361534 356253      17
## 1se 231.01    64 1680177 628505      17
col_grid_pred <- predict(col_grid, s = grid_lambda, newx = x_test)
col_grid_er <- mean((col_grid_pred - col_test$Apps)^2)
col_grid_er
## [1] 1043219

After fitting a ridge regression model on the training set with λ chosen by cross-validation, the test error was 1043219. Note that the selected λ (0.01) is the smallest value on the grid, so the fit is barely penalized, which is why the error is nearly identical to least squares.
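Plotting the cv.glmnet object shows the CV error across the λ grid and makes it clear why the selection lands at the bottom of the grid:

# CV estimate of MSE as a function of log(lambda); the dashed line marks
# the lambda.min selected above
plot(col_grid)
abline(v = log(grid_lambda), lty = 2)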

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

col_lasso <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 1)
lasso_lambda <- col_lasso$lambda.min
col_lasso
## 
## Call:  cv.glmnet(x = x_train, y = y_train, lambda = grid, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure     SE Nonzero
## min   0.01   100 1382267 318738      17
## 1se 305.39    63 1610045 401474       3
col_lasso_pred <- predict(col_lasso, s = lasso_lambda, newx = x_test)
col_lasso_er <- mean((col_lasso_pred - col_test$Apps)^2)
col_lasso_er
## [1] 1043249

After fitting a lasso model on the training set with λ chosen by cross-validation, the test error was 1043249. Again the selected λ (0.01) sits at the bottom of the grid, so the penalty is mild; as the coefficient table below shows, all 17 predictors have non-zero estimates at this λ.

col_lasso_coef <- predict(col_lasso, type = "coefficients", s = lasso_lambda)
col_lasso_coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -334.98659413
## PrivateYes  -345.34671278
## Accept         1.63683324
## Enroll        -0.84628954
## Top10perc     57.60008684
## Top25perc    -21.92675371
## F.Undergrad    0.06009396
## P.Undergrad    0.05220000
## Outstate      -0.08944182
## Room.Board     0.14177374
## Books         -0.17319275
## Personal       0.02180997
## PhD           -7.55776847
## Terminal      -3.40610008
## S.F.Ratio     15.50406762
## perc.alumni    2.15679557
## Expend         0.07597679
## Grad.Rate      8.32502809
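The exercise also asks for the number of non-zero coefficient estimates; rather than counting rows in the table, it can be computed directly from the sparse coefficient matrix (here it is 17, matching the Nonzero column in the cv.glmnet output above):

# Count non-zero coefficient estimates, excluding the intercept
sum(col_lasso_coef[-1, 1] != 0)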

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

col_pcr <- pcr(Apps ~ ., data = col_train, scale = TRUE, validation = 'CV')
summary(col_pcr)
## Data:    X dimension: 545 17 
##  Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3992     3839     2155     2152     1986     1746     1737
## adjCV         3992     3839     2153     2151     2006     1726     1733
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1725     1650     1612      1621      1633      1629      1636
## adjCV     1750     1635     1607      1617      1629      1624      1631
##        14 comps  15 comps  16 comps  17 comps
## CV         1636      1401      1210      1172
## adjCV      1632      1381      1202      1165
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.924    57.01    64.07    69.75    75.14    80.15    83.75    87.29
## Apps    7.847    71.42    71.64    75.66    82.28    82.52    82.62    84.75
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.34     92.70     94.94     96.71     97.79     98.66     99.38
## Apps    85.07     85.09     85.09     85.19     85.33     85.36     91.46
##       16 comps  17 comps
## X        99.84    100.00
## Apps     92.72     93.07
validationplot(col_pcr, val.type = 'MSEP')

col_pcr_m <- which.min(col_pcr$validation$PRESS)
col_pcr_pred <- predict(col_pcr, newdata = col_test, ncomp = col_pcr_m)
col_pcr_er <- mean((col_pcr_pred - col_test$Apps)^2)
list(col_pcr_m, col_pcr_er)
## [[1]]
## [1] 17
## 
## [[2]]
## [1] 1043245

After fitting a PCR model on the training set with M chosen by cross-validation, the test error was 1043245 with M = 17 components selected. Since M = 17 retains all of the components, this PCR fit is equivalent to the full least-squares model, which is why the two test errors are identical.
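Because which.min(PRESS) selects the full model here, one alternative worth considering is the one-sigma rule implemented in pls::selectNcomp (assuming a pls version that provides it), which picks the smallest M whose CV error is within one standard error of the minimum:

# More parsimonious M: smallest number of components whose CV error is
# within one standard error of the minimum
selectNcomp(col_pcr, method = "onesigma")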

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

col_pls <- plsr(Apps ~ ., data = col_train, scale = TRUE, validation = 'CV')
summary(col_pls)
## Data:    X dimension: 545 17 
##  Y dimension: 545 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3992     2004     1773     1537     1417     1274     1200
## adjCV         3992     2000     1771     1530     1395     1257     1192
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1187     1177     1176      1176      1175      1173      1172
## adjCV     1179     1170     1169      1169      1168      1166      1165
##        14 comps  15 comps  16 comps  17 comps
## CV         1172      1172      1172      1172
## adjCV      1165      1164      1164      1164
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.06    48.38    62.18    64.92    68.60    73.63    77.16    80.29
## Apps    76.27    82.28    87.49    90.90    92.47    92.80    92.88    92.96
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.47     85.19     88.71     90.23     91.79     93.65     96.86
## Apps    93.01     93.03     93.04     93.06     93.07     93.07     93.07
##       16 comps  17 comps
## X        97.90    100.00
## Apps     93.07     93.07
validationplot(col_pls, val.type = 'MSEP')

col_pls_m <- which.min(col_pls$validation$PRESS)
col_pls_pred <- predict(col_pls, newdata = col_test, ncomp = col_pls_m)
col_pls_er <- mean((col_pls_pred - col_test$Apps)^2)
list(col_pls_m, col_pls_er)
## [[1]]
## [1] 16
## 
## [[2]]
## [1] 1043226

After fitting a PLS model on the training set with M chosen by cross-validation, the test error was 1043226 with M = 16 components selected. The CV error is essentially flat beyond about 8 components, so a much smaller M would perform nearly as well.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

We tried five approaches to predicting the number of college applications. All five test MSEs fall in a very narrow band, from 1,043,219 (ridge) to 1,043,249 (lasso), corresponding to an RMSE of roughly 1,021 applications, so the models predict reasonably well but far from perfectly. PCR selected M = 17, which reproduces the least-squares fit exactly, so its test error matches the linear model's. Ridge and PLS came out marginally ahead, but the differences among the five approaches are negligible.
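For reference, the five test errors computed above can be gathered into a single table:

# Collect the test MSEs from parts (b)-(f)
data.frame(model = c("Least squares", "Ridge", "Lasso", "PCR (M = 17)", "PLS (M = 16)"),
           test_MSE = c(col_er, col_grid_er, col_lasso_er, col_pcr_er, col_pls_er))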

11. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

set.seed(28)
Bostontrain_index <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
Boston_train <- Boston[Bostontrain_index, ]
Boston_test <- Boston[-Bostontrain_index, ]

Best Subset Selection

bos_regfit <- regsubsets(crim ~ ., data = Boston_train, nvmax=13)
summary(bos_regfit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston_train, nvmax = 13)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " "*"     " "   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  " " " "   " "  " " "*" " " " " "*" " " "*"     "*"   "*" 
## 6  ( 1 )  "*" " "   " "  " " "*" " " "*" "*" " " " "     "*"   "*" 
## 7  ( 1 )  "*" "*"   " "  " " "*" " " "*" "*" " " " "     "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
bos_summary <- summary(bos_regfit)
par(mfrow = c(2, 2))

plot(bos_summary$rss, type = "l", xlab = "Number of Variables", ylab = "RSS")
points(which.min(bos_summary$rss), min(bos_summary$rss), cex = 2, pch = 20)

plot(bos_summary$adjr2, type = "l", xlab = "Number of Variables", ylab = "Adjusted R²")
points(which.max(bos_summary$adjr2), max(bos_summary$adjr2), cex = 2, pch = 20)

plot(bos_summary$cp, type = "l", xlab = "Number of Variables", ylab = "Cp")
points(which.min(bos_summary$cp), min(bos_summary$cp), cex = 2, pch = 20)

plot(bos_summary$bic, type = "l", xlab = "Number of Variables", ylab = "BIC")
points(which.min(bos_summary$bic), min(bos_summary$bic), cex = 2, pch = 20)

According to BIC the best model uses only 2 variables, suggesting that a much simpler model is preferred. Adjusted R² peaks around 10 variables, indicating that adding predictors helps up to that point, while Cp is minimized at about 6 variables. RSS necessarily decreases as variables are added (it measures training error), so it is not useful for choosing the model size on its own.
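Rather than eyeballing the plots, the optimal sizes can be read directly off the summary object:

# Model size chosen by each criterion
c(adjr2 = which.max(bos_summary$adjr2),
  cp    = which.min(bos_summary$cp),
  bic   = which.min(bos_summary$bic))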

# Get Coefficients of Best Model by BIC
best_model_bic <- which.min(bos_summary$bic)
coef(bos_regfit, id = best_model_bic)
## (Intercept)         rad       lstat 
##  -4.3605627   0.4966580   0.2497854

The BIC-optimal model includes only rad and lstat as predictors.

# Get coefficients of the best 8-variable model
coef(bos_regfit, id = 8)
##  (Intercept)           zn          nox           rm          dis          rad 
##   8.89846618   0.04237313 -11.43416265   1.56209304  -0.79124903   0.52990145 
##      ptratio        lstat         medv 
##  -0.41637837   0.17496592  -0.26327815
# Linear model 1: the 2-variable model selected by BIC
Bos_lm1 <- lm(crim ~ rad + lstat, data = Boston_train)
summary(Bos_lm1)
## 
## Call:
## lm(formula = crim ~ rad + lstat, data = Boston_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.513 -1.908 -0.264  0.945 77.118 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.36056    0.74545  -5.850 1.12e-08 ***
## rad          0.49666    0.04898  10.140  < 2e-16 ***
## lstat        0.24979    0.05901   4.233 2.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.931 on 353 degrees of freedom
## Multiple R-squared:  0.389,  Adjusted R-squared:  0.3856 
## F-statistic: 112.4 on 2 and 353 DF,  p-value: < 2.2e-16
# Linear model 2: the 8-variable model for comparison
Bos_lm2 <- lm(crim ~ zn + nox + rm + dis + rad + ptratio + lstat + medv, data = Boston_train)
summary(Bos_lm2)
## 
## Call:
## lm(formula = crim ~ zn + nox + rm + dis + rad + ptratio + lstat + 
##     medv, data = Boston_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.889 -2.230 -0.303  1.018 73.406 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.89847    8.96450   0.993 0.321580    
## zn            0.04237    0.02337   1.813 0.070652 .  
## nox         -11.43416    5.91213  -1.934 0.053924 .  
## rm            1.56209    0.79141   1.974 0.049195 *  
## dis          -0.79125    0.32705  -2.419 0.016062 *  
## rad           0.52990    0.06304   8.406 1.11e-15 ***
## ptratio      -0.41638    0.23500  -1.772 0.077300 .  
## lstat         0.17497    0.09231   1.895 0.058876 .  
## medv         -0.26328    0.07413  -3.552 0.000436 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.814 on 347 degrees of freedom
## Multiple R-squared:  0.4196, Adjusted R-squared:  0.4062 
## F-statistic: 31.36 on 8 and 347 DF,  p-value: < 2.2e-16
Boston_pred1 <- predict(Bos_lm1, newdata = Boston_test)
Boston_pred2 <- predict(Bos_lm2, newdata = Boston_test)

mean((Boston_test$crim - Boston_pred1)^2)
## [1] 31.3732
mean((Boston_test$crim - Boston_pred2)^2)
## [1] 31.39418

The Mean Squared Errors for both models are nearly identical (31.37 vs 31.39), meaning the simpler model with just rad and lstat performs just as well as the more complex one. This suggests that adding more variables didn’t meaningfully improve predictive accuracy.

Ridge regression

x_train1 <- model.matrix(crim ~ ., data = Boston_train)[, -1]
y_train1 <- Boston_train$crim
x_test1 <- model.matrix(crim ~ ., data = Boston_test)[, -1]

grid1 <- 10^seq(10, -2, length = 100)

boston_ridge <- cv.glmnet(x_train1, y_train1, alpha = 0, lambda = grid1)
grid1_lambda <- boston_ridge$lambda.min
boston_ridge
## 
## Call:  cv.glmnet(x = x_train1, y = y_train1, lambda = grid1, alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min    0.1    91   48.36 26.38      12
## 1se  403.7    62   74.10 34.88      12
boston_ridge_pred <- predict(boston_ridge, s = grid1_lambda, newx = x_test1)
boston_ridge_er <- mean((boston_ridge_pred - Boston_test$crim)^2)
boston_ridge_er
## [1] 31.35174

Lasso

Boston_lasso <- cv.glmnet(x_train1, y_train1, lambda = grid1, alpha = 1)
Boston_lasso_lambda <- Boston_lasso$lambda.min
Boston_lasso
## 
## Call:  cv.glmnet(x = x_train1, y = y_train1, lambda = grid1, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min  0.017    98   47.71 19.15      12
## 1se  3.511    79   63.71 25.63       1
Boston_lasso_pred <- predict(Boston_lasso, s = Boston_lasso_lambda, newx = x_test1)
Boston_lasso_er <- mean((Boston_lasso_pred - Boston_test$crim)^2)
Boston_lasso_er
## [1] 31.26229
Boston_lasso_coef <- predict(Boston_lasso, type = "coefficients", s = Boston_lasso_lambda)
Boston_lasso_coef
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  7.898131720
## zn           0.041751269
## indus       -0.054305063
## chas        -0.323709225
## nox         -7.494961620
## rm           1.445043223
## age         -0.006782595
## dis         -0.825629960
## rad          0.572939199
## tax         -0.003426776
## ptratio     -0.338110920
## lstat        0.186432250
## medv        -0.255356964

PCR

Boston_pcr <- pcr(crim ~ ., data = Boston_train, scale = TRUE, validation = 'CV')
summary(Boston_pcr)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.855    7.592    7.598    7.176    7.183    7.126    7.096
## adjCV        8.855    7.591    7.596    7.171    7.182    7.123    7.093
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.968    7.004    6.969     6.913     6.945     6.880
## adjCV    6.964    7.001    6.964     6.905     6.936     6.871
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.56    64.38    73.73    80.93     87.3    90.59    93.09    95.18
## crim    26.52    26.54    34.79    34.79     35.9    36.49    38.55    38.55
##       9 comps  10 comps  11 comps  12 comps
## X       96.88     98.32     99.43    100.00
## crim    39.24     40.39     40.72     42.25
validationplot(Boston_pcr, val.type = "MSEP")

Boston_pcr_m <- which.min(Boston_pcr$validation$PRESS)
Boston_pcr_pred <- predict(Boston_pcr, newdata = Boston_test, ncomp = Boston_pcr_m)
Boston_pcr_er <- mean((Boston_pcr_pred - Boston_test$crim)^2)
list(Boston_pcr_m, Boston_pcr_er)
## [[1]]
## [1] 12
## 
## [[2]]
## [1] 31.42049

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

Lasso regression performs the best on this data set, with the lowest validation error of 31.26. This means it makes the most accurate predictions on new, unseen data. Unlike ordinary least squares, the lasso can also shrink the coefficients of less important features exactly to zero, which helps prevent overfitting and keeps the model simpler; here all 12 predictors survive at the cross-validated λ, though the 1-SE λ retains only one. Ridge (31.35) and PCR (31.42) had higher errors, so they didn't perform as well. That's why the lasso is the best choice here, although the differences are small.
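For comparison, the validation errors for the Boston fits can be gathered into one table (using the objects computed in part (a)):

# Validation-set MSEs for all Boston models fit above
data.frame(model = c("Best subset (2 vars)", "Best subset (8 vars)",
                     "Ridge", "Lasso", "PCR (M = 12)"),
           test_MSE = c(mean((Boston_test$crim - Boston_pred1)^2),
                        mean((Boston_test$crim - Boston_pred2)^2),
                        boston_ridge_er, Boston_lasso_er, Boston_pcr_er))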

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No, the chosen model does not need to involve all of the features in the data set.

Although the lasso fit at the cross-validated λ happens to keep all 12 predictors, sparser alternatives do almost as well: the 1-SE lasso keeps a single predictor, and best subset selection by BIC keeps only two, rad and lstat.

This is because BIC prefers simpler models that still explain the data well: it penalizes models with too many variables to avoid overfitting.