1. We will now try to predict per capita crime rate in the Boston data set.
  1. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

As shown below, the lasso arrived at two predictors, rad and lstat, with coefficients of 0.29841807 and 0.05947548, respectively. Ridge, by contrast, keeps all thirteen predictors; its intercept grows in magnitude to 2.076815743, and nox has by far the largest coefficient. For PCR, the 13-component fit gave the lowest CV and adjCV error.

library(ISLR)
library(MASS)
library(glmnet)
data(Boston)
set.seed(1)
# Applying the lasso: build the predictor matrix (no intercept column) and the
# response, split the observations 50/50 into training and test sets, then
# cross-validate the lasso (alpha = 1) on the training set
x_boston <- model.matrix(crim ~ . - 1, data = Boston)
y_boston <- Boston$crim
train_index <- sample(1:nrow(x_boston), nrow(x_boston) / 2)
l_cv_boston <- cv.glmnet(x_boston[train_index, ], y_boston[train_index], alpha = 1)
l_lambda_min_boston <- l_cv_boston$lambda.min
l_lambda_min_boston
## [1] 0.06805595
plot(l_cv_boston)

coefficients(l_cv_boston)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 0.33915371
## zn          .         
## indus       .         
## chas        .         
## nox         .         
## rm          .         
## age         .         
## dis         .         
## rad         0.29841807
## tax         .         
## ptratio     .         
## black       .         
## lstat       0.05947548
## medv        .
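Note that coefficients() on a cv.glmnet object reports the fit at s = "lambda.1se" by default, not at lambda.min. A minimal sketch of inspecting the coefficients at the CV-minimizing lambda instead, reusing the objects fitted above (l_coef_min is an illustrative name):

# Coefficients at lambda.min (the printout above corresponds to lambda.1se)
l_coef_min <- coef(l_cv_boston, s = "lambda.min")
# Predictors with nonzero coefficients at lambda.min
rownames(l_coef_min)[as.vector(l_coef_min) != 0]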
# Applying ridge regression (alpha = 0), cross-validated on the same training set
r_cv_boston <- cv.glmnet(x_boston[train_index, ], y_boston[train_index], alpha = 0)
r_lambda_min_boston <- r_cv_boston$lambda.min
r_lambda_min_boston
## [1] 0.5919159
plot(r_cv_boston)

coefficients(r_cv_boston)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  2.076815743
## zn          -0.003267336
## indus        0.032273775
## chas        -0.218547068
## nox          2.280066310
## rm          -0.205041439
## age          0.007227813
## dis         -0.115751650
## rad          0.051354659
## tax          0.002342923
## ptratio      0.078169359
## black       -0.003578856
## lstat        0.047379241
## medv        -0.031124023
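Since ridge regression retains every predictor, a coefficient-path plot can make the shrinkage easier to see. A minimal sketch, assuming the training matrices defined above (r_path is an illustrative name):

# Refit ridge over glmnet's default lambda sequence and plot the coefficient paths
r_path <- glmnet(x_boston[train_index, ], y_boston[train_index], alpha = 0)
plot(r_path, xvar = "lambda", label = TRUE)
abline(v = log(r_lambda_min_boston), lty = 2)  # CV-selected lambda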
# Applying PCR: principal components regression on standardized predictors, with 10-fold CV
library(pls)
pcr_boston <- pcr(crim ~ ., data = Boston[train_index, ], scale = TRUE, validation = "CV")
summary(pcr_boston)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.515    7.523    7.179    7.030    7.046    7.116
## adjCV        9.275    7.511    7.521    7.166    7.018    7.034    7.101
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.127    7.014    7.018     6.934     6.959     6.955     6.918
## adjCV    7.110    6.995    7.002     6.914     6.941     6.932     6.892
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.51     60.4    69.86    77.08    82.80    87.68    91.24    93.56
## crim    34.94     35.2    42.83    45.47    45.57    45.58    45.75    47.59
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.47     97.08     98.48     99.54    100.00
## crim    47.68     48.75     49.31     50.14     51.37
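The cross-validation results above are easier to compare visually. A minimal sketch using pls's validationplot() on the fit above:

# Plot cross-validated MSEP against the number of components
validationplot(pcr_boston, val.type = "MSEP")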
  1. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

As shown below, the lasso gave the lowest test MSE of 40.89875 while using only two predictors. The proposed model is therefore \(\text{crim} = 0.33915371 + 0.29841807\,\text{rad} + 0.05947548\,\text{lstat}\).

# Lasso test MSE
l_grid <- 10^seq(10, -2, length = 100)
l_fit <- glmnet(x_boston[train_index, ], y_boston[train_index], lambda = l_grid, alpha = 1)
l_predict <- predict(l_fit, s = l_lambda_min_boston, newx = x_boston[-train_index, ])
l_mse <- mean((l_predict - y_boston[-train_index])^2)
l_mse
## [1] 40.89875
# Ridge test MSE
r_fit <- glmnet(x_boston[train_index, ], y_boston[train_index], lambda = l_grid, alpha = 0)
r_predict <- predict(r_fit, s = r_lambda_min_boston, newx = x_boston[-train_index, ])
r_mse <- mean((r_predict - y_boston[-train_index])^2)
r_mse
## [1] 40.92395
# PCR test MSE (predicting with 7 components)
pcr_predict <- predict(pcr_boston, x_boston[-train_index, ], ncomp = 7)
pcr_mse <- mean((pcr_predict - y_boston[-train_index])^2)
pcr_mse
## [1] 44.21932
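For context, the three test MSEs can also be compared against a null model that always predicts the training mean of crim. A minimal sketch (null_mse is an illustrative name; its value was not part of the original run):

# Baseline: test MSE of always predicting the training-set mean of crim
null_mse <- mean((mean(y_boston[train_index]) - y_boston[-train_index])^2)
c(lasso = l_mse, ridge = r_mse, pcr = pcr_mse, null = null_mse)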
  1. Does your chosen model involve all of the features in the data set? Why or why not?

As elaborated in part (b), the chosen model uses only 2 of the 13 features. Unlike ridge regression, the lasso performs variable selection by shrinking some coefficients exactly to zero; here it retained only rad and lstat. A possible final step is sketched below.
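If this model were reported as the final one, a natural last step would be to refit the lasso on the full Boston data at the lambda selected in part (a), so that the reported coefficients use all 506 observations. A minimal sketch (final_lasso is an illustrative name, and its coefficients may differ slightly from those quoted above):

# Refit the lasso on all observations at the CV-selected lambda
final_lasso <- glmnet(x_boston, y_boston, alpha = 1, lambda = l_lambda_min_boston)
coef(final_lasso)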