\(\textbf{Chapter 6 Conceptual Problems}\)
\(\textbf{Answer:}\) the lasso can shrink coefficient estimates all the way to zero, and so can also perform variable selection. This shrinkage makes the model less flexible: relative to least squares, the resulting fit has lower variance at the cost of higher bias, and its prediction accuracy improves whenever the reduction in variance outweighs the increase in bias.
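To make the shrinkage concrete, here is a minimal sketch on simulated data (the object names x_sim, y_sim, and lasso_path are hypothetical) showing the number of nonzero lasso coefficients falling toward zero as lambda grows:
library(glmnet)
set.seed(1)
# 50 observations, 10 predictors, only the first 3 truly matter
x_sim <- matrix(rnorm(50 * 10), nrow = 50)
y_sim <- drop(x_sim[, 1:3] %*% c(3, -2, 1)) + rnorm(50)
lasso_path <- glmnet(x_sim, y_sim, alpha = 1)
# df is the number of nonzero coefficients at each lambda; it drops
# to 0 as lambda grows, i.e., the lasso performs variable selection
cbind(lambda = round(lasso_path$lambda, 3), nonzero = lasso_path$df)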
\(\textbf{Answer:}\) similarly to the lasso above, ridge regression shrinks coefficient estimates, but it never sets them exactly to zero and so performs no variable selection. Ridge therefore yields a model with lower variance and higher bias relative to least squares; relative to the lasso, because ridge retains every predictor, it tends toward somewhat higher variance and lower bias at a comparable amount of shrinkage.
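A quick contrast, reusing x_sim and y_sim from the sketch above: even at its largest lambda, ridge leaves every coefficient small but nonzero.
ridge_path <- glmnet(x_sim, y_sim, alpha = 0)
# Ridge shrinks coefficients toward zero but never exactly to zero
round(coef(ridge_path, s = max(ridge_path$lambda))[-1], 4)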
\(\textbf{Answer:}\) unlike the methods in (a) and (b), non-linear methods are more flexible than least squares. They therefore carry lower bias and higher variance, and they improve prediction accuracy when the true relationship is non-linear, i.e., when the drop in bias outweighs the rise in variance.
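A small illustration on simulated data (hypothetical names): when the truth is non-linear, a flexible fit trades a little variance for a large drop in bias.
set.seed(2)
x <- runif(100, -2, 2)
y <- x^3 + rnorm(100)            # the true relationship is cubic
linear_fit <- lm(y ~ x)          # rigid fit: high bias on this data
cubic_fit <- lm(y ~ poly(x, 3))  # flexible fit: much lower bias
c(linear = mean(resid(linear_fit)^2), cubic = mean(resid(cubic_fit)^2))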
\(\textbf{Chapter 6 Applied Problems}\)
library(ISLR)
data(College)
set.seed(4)
# Split the College data into equal-sized training and test sets
data_index = sample(1:nrow(College), nrow(College) / 2)
training_dataset = College[data_index,]
testing_dataset = College[-data_index,]
# Least squares fit and its test MSE
lm_model = lm(Apps ~ ., data = training_dataset)
lm_prediction = predict(lm_model, testing_dataset)
mean((testing_dataset$Apps - lm_prediction)^2)
## [1] 1343376
\(\textbf{Answer:}\) comparing the least squares output above with the ridge output below, ridge regression gave a higher test MSE (1,560,722) than least squares (1,343,376).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
set.seed(4)
# Design matrices for glmnet (note: model.matrix() includes an intercept column)
x_training_data = model.matrix(Apps ~ ., data = training_dataset)
x_testing_data = model.matrix(Apps ~ ., data = testing_dataset)
# Grid of lambda values from 10^4 down to 10^-2
lam = 10 ^ seq(4, -2, length = 100)
# Ridge regression (alpha = 0), with lambda chosen by cross-validation
ridge_reg_model = glmnet(x_training_data, training_dataset$Apps, alpha = 0, lambda = lam, thresh = 1e-12)
cv_ridge_reg_model <- cv.glmnet(x_training_data, training_dataset$Apps, alpha = 0, lambda = lam, thresh = 1e-12)
lam_res = cv_ridge_reg_model$lambda.min
ridge_reg_prediction = predict(ridge_reg_model, newx = x_testing_data, s = lam_res)
mean((testing_dataset$Apps - ridge_reg_prediction)^2)
## [1] 1560722
set.seed(4)
# Lasso (alpha = 1) over the same lambda grid, again tuned by cross-validation
lasso_model = glmnet(x_training_data, training_dataset$Apps, alpha = 1, lambda = lam, thresh = 1e-12)
cv_lasso_model = cv.glmnet(x_training_data, training_dataset$Apps, alpha = 1, lambda = lam, thresh = 1e-12)
lam_res_lasso = cv_lasso_model$lambda.min
lasso_prediction = predict(lasso_model, newx = x_testing_data, s = lam_res_lasso)
mean((testing_dataset$Apps - lasso_prediction)^2)
## [1] 1385943
# Coefficient estimates at the CV-selected lambda
predict(lasso_model, s = lam_res_lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.067307e+03
## (Intercept) .
## PrivateYes -4.445851e+02
## Accept 1.155582e+00
## Enroll .
## Top10perc 4.073365e+01
## Top25perc .
## F.Undergrad 9.844581e-02
## P.Undergrad .
## Outstate -1.865699e-02
## Room.Board 1.327248e-01
## Books .
## Personal .
## PhD -5.146250e+00
## Terminal -2.566682e+00
## S.F.Ratio .
## perc.alumni -1.165190e+01
## Expend 5.736533e-02
## Grad.Rate 8.567708e+00
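Note the two (Intercept) rows in the table above: model.matrix() includes its own intercept column, which glmnet then carries as a redundant predictor and shrinks to zero. A common fix, not applied here so the printed output stays reproducible, is to drop that column when building the design matrices:
# Drop model.matrix's intercept column; glmnet fits its own intercept
x_training_data = model.matrix(Apps ~ ., data = training_dataset)[, -1]
x_testing_data = model.matrix(Apps ~ ., data = testing_dataset)[, -1]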
\(\textbf{Answer:}\) as shown, the lasso also gave a higher test MSE (1,385,943) than least squares (1,343,376), though lower than ridge, and it set several coefficients exactly to zero along the way, yielding a sparser model.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Principal components regression with standardized predictors and 10-fold CV
pcr_model <- pcr(Apps ~ ., data = training_dataset, scale = TRUE, validation = "CV")
validationplot(pcr_model, val.type = "MSEP")
summary(pcr_model)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3761 3462 1805 1772 1666 1399 1340
## adjCV 3761 3467 1803 1771 1752 1395 1336
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1328 1323 1253 1224 1239 1247 1246
## adjCV 1327 1322 1249 1220 1235 1243 1242
## 14 comps 15 comps 16 comps 17 comps
## CV 1247 1254 1191 1177
## adjCV 1243 1249 1185 1171
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.08 58.77 65.66 71.27 76.78 81.34 84.88 88.15
## Apps 16.75 77.64 78.52 79.31 87.22 88.01 88.15 88.36
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.02 93.34 95.29 97.07 98.15 98.94 99.48
## Apps 89.58 90.15 90.15 90.16 90.23 90.23 90.25
## 16 comps 17 comps
## X 99.86 100.00
## Apps 91.64 92.11
# Test MSE using 10 components (chosen from the validation plot)
pcr_prediction = predict(pcr_model, testing_dataset, ncomp = 10)
mean((testing_dataset$Apps - pcr_prediction)^2)
## [1] 2824548
\(\textbf{Answer:}\) as shown, PCR with 10 components gave a test MSE of 2,824,548, substantially higher than least squares (1,343,376) and the worst of the methods considered so far.
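The choice ncomp = 10 reads the flattening of the curve off the validation plot; the exact CV minimum can be pulled out programmatically (a sketch using the pls accessor MSEP), though here it sits at the full 17-component fit, which is why a smaller, flatter choice like 10 is used instead:
# Cross-validated MSE for 0..17 components; the first entry is the
# intercept-only model, so subtract 1 to get the component count
cv_errors <- MSEP(pcr_model)$val["CV", , ]
which.min(cv_errors) - 1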
# Partial least squares with standardized predictors and 10-fold CV
pls_model <- plsr(Apps ~ ., data = training_dataset, scale = TRUE, validation = "CV")
validationplot(pls_model, val.type = "MSEP")
summary(pls_model)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3761 1653 1429 1236 1221 1211 1199
## adjCV 3761 1651 1429 1234 1217 1207 1191
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1185 1177 1173 1170 1177 1174 1173
## adjCV 1178 1171 1167 1164 1171 1168 1167
## 14 comps 15 comps 16 comps 17 comps
## CV 1173 1174 1174 1174
## adjCV 1167 1167 1167 1167
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 28.09 50.62 64.14 68.05 72.87 74.92 78.36 81.1
## Apps 81.36 86.23 90.12 90.81 91.18 91.81 91.95 92.0
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.41 87.21 89.69 91.60 93.90 96.04 97.35
## Apps 92.04 92.06 92.09 92.11 92.11 92.11 92.11
## 16 comps 17 comps
## X 98.48 100.00
## Apps 92.11 92.11
# Test MSE using 10 components (the CV error flattens out around there)
pls_prediction = predict(pls_model, testing_dataset, ncomp = 10)
mean((testing_dataset$Apps - pls_prediction)^2)
## [1] 1325931
\(\textbf{Answer:}\) as given by the output above, PLS with 10 components gave a test MSE of 1,325,931, slightly lower than the least squares error of 1,343,376.
\(\textbf{Answer:}\) All of the models predicted the number of applications with broadly similar accuracy, except PCR, which was clearly outperformed by the rest. Two points stand out: plain least squares held its own against the regularized and dimension-reduction methods, and PLS ultimately gave the lowest test error.
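Since raw MSEs (in squared numbers of applications) are hard to read, one way to put the five fits on a common footing is test \(R^2\), a sketch reusing the prediction objects computed above:
# Test R^2 = 1 - RSS/TSS on the held-out set, for each model
tss <- mean((testing_dataset$Apps - mean(testing_dataset$Apps))^2)
sapply(list(lm = lm_prediction, ridge = ridge_reg_prediction,
            lasso = lasso_prediction, pcr = pcr_prediction,
            pls = pls_prediction),
       function(pred) 1 - mean((testing_dataset$Apps - pred)^2) / tss)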
\(\textbf{Answer:}\) as shown below, the lasso retained only two predictors, rad and lstat, with coefficients of 0.29841807 and 0.05947548, respectively. Ridge, by contrast, kept every predictor; its intercept grew to 2.076815743 and nox carried by far the largest coefficient in magnitude (2.280066310). For PCR, the 13-component fit (i.e., the full model) gave the lowest CV and adjCV errors.
library(ISLR)
library(MASS)
library(glmnet)
data(Boston)
set.seed(1)
#Applying lasso
# The -1 drops model.matrix's intercept column, since glmnet fits its own;
# then split the observations into equal-sized training and test halves
x_boston <- model.matrix(crim ~ . -1, data = Boston)
y_boston <- Boston$crim
train_index <- sample(1:nrow(x_boston), nrow(x_boston) / 2)
# Lasso (alpha = 1) with lambda tuned by cross-validation
l_cv_boston = cv.glmnet(x_boston[train_index,], y_boston[train_index], alpha = 1)
l_lambda_min_boston = l_cv_boston$lambda.min
l_lambda_min_boston
## [1] 0.06805595
plot(l_cv_boston)
# Lasso coefficients at the default s = "lambda.1se" (see the note after the table)
coefficients(l_cv_boston)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.33915371
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.29841807
## tax .
## ptratio .
## black .
## lstat 0.05947548
## medv .
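One caveat about the table above: coef() on a cv.glmnet object defaults to s = "lambda.1se", the most regularized fit within one standard error of the CV minimum, which is why only two predictors survive. The denser fit at the CV-minimizing lambda can be inspected directly:
# Coefficients at lambda.min rather than the default lambda.1se
coef(l_cv_boston, s = "lambda.min")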
#Applying ridge
# Ridge (alpha = 0) with lambda tuned by cross-validation
r_cv_boston = cv.glmnet(x_boston[train_index,], y_boston[train_index], alpha = 0)
r_lambda_min_boston = r_cv_boston$lambda.min
r_lambda_min_boston
## [1] 0.5919159
plot(r_cv_boston)
# Ridge coefficients at the default s = "lambda.1se"; all remain nonzero
coefficients(r_cv_boston)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.076815743
## zn -0.003267336
## indus 0.032273775
## chas -0.218547068
## nox 2.280066310
## rm -0.205041439
## age 0.007227813
## dis -0.115751650
## rad 0.051354659
## tax 0.002342923
## ptratio 0.078169359
## black -0.003578856
## lstat 0.047379241
## medv -0.031124023
#Applying PCR
library(pls)
# PCR on the training half, standardized predictors, 10-fold CV
pcr_boston <- pcr(crim ~ . , data = Boston[train_index,], scale = TRUE, validation = "CV")
summary(pcr_boston)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.515 7.523 7.179 7.030 7.046 7.116
## adjCV 9.275 7.511 7.521 7.166 7.018 7.034 7.101
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.127 7.014 7.018 6.934 6.959 6.955 6.918
## adjCV 7.110 6.995 7.002 6.914 6.941 6.932 6.892
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.51 60.4 69.86 77.08 82.80 87.68 91.24 93.56
## crim 34.94 35.2 42.83 45.47 45.57 45.58 45.75 47.59
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.47 97.08 98.48 99.54 100.00
## crim 47.68 48.75 49.31 50.14 51.37
\(\textbf{Answer:}\) as shown below, the lasso gave the lowest test MSE of 40.89875 (evaluated at lambda.min). Using the sparser lambda.1se coefficients reported above, which keep only two variables, the proposed model is \(crim = 0.33915371 + 0.29841807\,rad + 0.05947548\,lstat\).
#Lasso test MSE
# Refit over a wide lambda grid, then evaluate on the held-out half
# at the CV-selected lambda
l_grid <- 10^seq(10,-2,length=100)
l_fit <- glmnet(x_boston[train_index,], y_boston[train_index], lambda=l_grid, alpha=1)
l_predict <- predict(l_fit,s=l_lambda_min_boston,newx=x_boston[-train_index,])
l_mse <- mean((l_predict - y_boston[-train_index])^2)
l_mse
## [1] 40.89875
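Refitting over l_grid is not strictly required; the cv.glmnet object can predict directly at the selected lambda, though the result may differ slightly from the grid-based fit above:
# Equivalent prediction straight from the CV object
l_predict_cv <- predict(l_cv_boston, s = "lambda.min", newx = x_boston[-train_index,])
mean((l_predict_cv - y_boston[-train_index])^2)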
#Ridge test MSE
# Same grid with alpha = 0, evaluated at ridge's CV-selected lambda
r_fit <- glmnet(x_boston[train_index,], y_boston[train_index], lambda=l_grid, alpha=0)
r_predict <- predict(r_fit,s=r_lambda_min_boston,newx=x_boston[-train_index,])
r_mse <- mean((r_predict - y_boston[-train_index])^2)
r_mse
## [1] 40.92395
#PCR test MSE
# Held-out predictions with 7 components (a more parsimonious choice
# than the 13-component CV minimum)
pcr_predict <- predict(pcr_boston, x_boston[-train_index,], ncomp = 7)
pcr_mse <- mean((pcr_predict - y_boston[-train_index])^2)
pcr_mse
## [1] 44.21932
\(\textbf{Answer:}\) as elaborated upon in part (b), the model chosen via the lasso involves only 2 features. Unlike ridge, the lasso performs feature selection, and here the cross-validated fit kept just rad and lstat while discarding the remaining eleven predictors.
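For completeness, the feature count can be checked programmatically against the coefficient table in part (a):
# Number of nonzero lasso coefficients, excluding the intercept
# (2 here: rad and lstat, at the default s = "lambda.1se")
sum(coef(l_cv_boston)[-1] != 0)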