For parts (a) through (c), indicate which of i through iv is correct. Justify your answer.
a) The lasso, relative to least squares, is: i) More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. ii) More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias iii) Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. iv) Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for the lasso relative to least squares is iii. According to Figure 2.7, the lasso has lower flexibility but higher interpretability than least squares (pg 25). The lasso is also a shrinkage method, and shrinkage can "substantially reduce the variance at the cost of a negligible increase in bias" (pg 204).
b) Repeat (a) for ridge regression relative to least squares.
The correct answer for ridge regression relative to least squares is iii. Once again referring to Figure 2.7, ridge regression has lower flexibility but higher interpretability, while least squares has higher flexibility but lower interpretability (pg 25). Ridge regression is also a shrinkage method, which can "substantially reduce the variance at the cost of a negligible increase in bias" (pg 204). It is also stated that "as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero" (pg 215), effectively reducing variance while increasing bias.
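To make the shrinkage behavior concrete, the illustrative sketch below (simulated data and hypothetical object names such as x_sim and ridge_path, not part of the original solution) fits a ridge coefficient path with glmnet and plots how the estimates move toward zero as λ grows.
set.seed(1)
x_sim <- matrix(rnorm(100 * 5), 100, 5)                 # 100 observations, 5 predictors
y_sim <- drop(x_sim %*% c(3, -2, 1, 0, 0) + rnorm(100)) # only three predictors truly matter
ridge_path <- glmnet::glmnet(x_sim, y_sim, alpha = 0)   # alpha = 0 gives the ridge penalty
plot(ridge_path, xvar = "lambda", label = TRUE)         # coefficients approach 0 as log(lambda) increases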
c) Repeat (a) for non-linear methods relative to least squares.
The correct answer for non-linear methods is ii. Non-linear methods include bagging, boosting, and support vector machines; the first two appear in Figure 2.7 with high flexibility and low interpretability compared to least squares (pg 25). Considering the bias-variance trade-off as in the two previous parts, a more flexible method gives improved prediction accuracy when its increase in variance is less than its decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
a) Split the data set into a training set and a test set.
college <- ISLR::College
training_sample_college <- sample(nrow(college), 0.6*nrow(college))  # 60/40 train/test split (a seed should be set first for a reproducible split)
training_data <- college[training_sample_college, ]
test_data <- college[-training_sample_college, ]
b) Fit a linear model using least squares on the training set, and report the test error obtained.
least_squares_college <- lm(Apps ~ ., data = training_data)
predict_class_college <- predict(least_squares_college, test_data)
college_error <- mean((test_data$Apps - predict_class_college)^2)
college_error
## [1] 1530961
The test error for the least squares model is the test MSE: the average of the squared differences between the observed and predicted responses, mean((observed - predicted)^2), over the test set. Using all of the variables in the model, the test error shown above is about 1,530,961.
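Since the same calculation is repeated for every model below, a small helper function (hypothetical name test_mse, not used in the original code) could wrap it:
test_mse <- function(observed, predicted) {
  mean((observed - predicted)^2)  # average squared prediction error on the test set
}
# e.g. test_mse(test_data$Apps, predict_class_college) gives the same value as college_error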
c) Fit a ridge regression model on the training set, with λ (lambda) chosen by cross-validation. Report the test error obtained.
x_train_college <- model.matrix(Apps ~ ., data = training_data)[, -1]  # drop the intercept column; Private enters as a dummy variable
y_train_college <- training_data$Apps
x_test_college <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test_college <- test_data$Apps
set.seed(1)
ridge_college <- glmnet::cv.glmnet(x_train_college, y_train_college, alpha = 0)
ridge_lambda_college <- ridge_college$lambda.min
ridge_prediction <- predict(ridge_college, s = ridge_lambda_college, newx = x_test_college)
ridge_error <- mean((ridge_prediction - y_test_college)^2)
ridge_error
## [1] 2573989
We use lambda.min, the value of λ with the lowest cross-validated error. The resulting test error, about 2,573,989, is actually higher than the least squares test error on this particular split; ridge regression tends to beat least squares when the least squares estimates have high variance, but it is not guaranteed to do so on any single train/test split. Also, because ridge regression keeps all of the variables and only shrinks their coefficients toward zero, it is less helpful for interpreting the relationship between the response and the individual predictors.
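As a quick check of that last point (a sketch, not run in the original analysis; hypothetical object name ridge_coefficients), the coefficients at lambda.min can be extracted directly from the cross-validated fit; none of them are exactly zero, they are only shrunk:
ridge_coefficients <- predict(ridge_college, s = ridge_lambda_college, type = "coefficients")
ridge_coefficients  # every predictor keeps a (shrunken) non-zero coefficient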
d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates
set.seed(1)
lasso_college <- glmnet::cv.glmnet(x_train_college, y_train_college, alpha = 1)
lasso_lambda_college <- lasso_college$lambda.min
lasso_lambda_college
## [1] 10.00359
lasso_prediction <- predict(lasso_college, s = lasso_lambda_college, newx = x_test_college)
lasso_error <- mean((lasso_prediction - y_test_college)^2)
lasso_error
## [1] 1528678
Just like with the ridge regression, we use lambda.min, the λ with the lowest cross-validated error. The resulting test error, about 1,528,678, is slightly lower than the least squares test error and well below the ridge test error on this split. Unlike ridge regression, the lasso can set some coefficients exactly to zero; part (d) also asks for the number of non-zero coefficient estimates, which can be read off the fitted model as sketched below.
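A minimal sketch (not run above; hypothetical object name lasso_coefficients) for counting the non-zero coefficients at lambda.min:
lasso_coefficients <- predict(lasso_college, s = lasso_lambda_college, type = "coefficients")
sum(lasso_coefficients != 0) - 1  # number of non-zero coefficients, excluding the intercept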
e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)  # provides pcr(), plsr(), and validationplot()
pcr_college <- pcr(Apps ~ ., data = training_data, scale = TRUE, validation = "CV")
validationplot(pcr_college, val.type = "MSEP")
summary(pcr_college)
## Data: X dimension: 466 17
## Y dimension: 466 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3440 3428 1716 1720 1521 1267 1262
## adjCV 3440 3428 1714 1721 1516 1262 1259
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1261 1206 1188 1186 1187 1191 1199
## adjCV 1263 1196 1186 1184 1185 1189 1196
## 14 comps 15 comps 16 comps 17 comps
## CV 1200 1200 1045 1041
## adjCV 1197 1197 1042 1038
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.011 57.72 64.88 70.97 76.36 80.99 84.66 88.11
## Apps 1.634 75.47 75.55 82.81 87.29 87.30 87.47 88.64
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.21 93.48 95.41 97.04 98.09 98.95 99.45
## Apps 88.74 88.82 88.82 88.93 88.95 88.96 89.05
## 16 comps 17 comps
## X 99.85 100.0
## Apps 91.70 91.8
pcr_prediction <- predict(pcr_college, test_data, ncomp = 10)
pcr_error <- mean((pcr_prediction - test_data$Apps)^2)
pcr_error
## [1] 3348213
Looking at the validation plot and the CV table above, the cross-validation error drops sharply through about 5 components and then levels off; the overall minimum is at 16 to 17 components (essentially the full least squares fit), but by 10 components the CV error has nearly flattened while about 93% of the variance in the predictors is explained, so we use M = 10. Using more principal components increases flexibility, which decreases bias but increases variance (per the bias-variance trade-off). The test error obtained with M = 10 is about 3,348,213, the highest of all the models so far.
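The number of components with the lowest cross-validated error can also be pulled out programmatically rather than read off the plot; a sketch (hypothetical object name pcr_msep), assuming the MSEP() helper from the pls package, which returns the CV estimates indexed by number of components. The same check applies to the PLS fit in part (f).
pcr_msep <- MSEP(pcr_college)            # cross-validated MSEP for 0 through 17 components
which.min(pcr_msep$val["CV", 1, ]) - 1   # subtract 1 because the first entry is the intercept-only model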
f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls_college <- plsr(Apps ~ ., data = training_data, scale = TRUE, validation = "CV")
validationplot(pls_college, val.type = "MSEP")
summary(pls_college)
## Data: X dimension: 466 17
## Y dimension: 466 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3440 1540 1247 1165 1137 1103 1075
## adjCV 3440 1538 1246 1162 1133 1098 1069
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1063 1055 1054 1053 1052 1053 1054
## adjCV 1059 1052 1050 1049 1048 1049 1050
## 14 comps 15 comps 16 comps 17 comps
## CV 1054 1054 1054 1054
## adjCV 1050 1050 1050 1050
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.75 38.84 63.35 67.02 70.07 73.82 78.95 82.03
## Apps 80.65 87.56 89.26 90.16 91.18 91.66 91.74 91.77
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 84.22 86.47 88.80 91.58 93.53 96.31 97.25
## Apps 91.78 91.79 91.79 91.80 91.80 91.80 91.80
## 16 comps 17 comps
## X 98.82 100.0
## Apps 91.80 91.8
pls_predict <- predict(pls_college, test_data, ncomp = 10)
pls_error <- mean((pls_predict - test_data$Apps)^2)
pls_error
## [1] 1562788
Looking at the validation plot and the CV table above, the cross-validation error flattens out after roughly 8 to 11 components, with the minimum around 11 components and essentially no change beyond that. Choosing M = 10 keeps the CV error at nearly its minimum while explaining about 86% of the variance in the predictors and about 91.8% of the variance in Apps. As with PCR, adding more components decreases bias but increases variance. The test error obtained with M = 10 is about 1,562,788, which is well below the PCR test error (as expected, since PLS constructs its components using the response) but slightly above the least squares and lasso errors.
g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Judging by the test errors above and the test-set R^2 values below, the lasso and least squares models perform best on this split (with PLS close behind), while ridge regression and PCR lag somewhat; the differences among least squares, lasso, and PLS are small. To quantify how accurately each model predicts the number of applications received, we compute the test-set R^2 for each model: R^2 = 1 - sum((observed - predicted)^2) / sum((observed - mean of observed)^2), where the mean is the average number of applications in the test data.
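Since this formula is reused five times below, a small helper (hypothetical name test_r2, not part of the original code) could compute it:
test_r2 <- function(observed, predicted) {
  1 - mean((observed - predicted)^2) / mean((observed - mean(observed))^2)
}
# e.g. test_r2(test_data$Apps, predict_class_college) matches least_squares_r2 below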
average_applications <- mean(test_data$Apps)
least_squares_r2 <- 1 - mean((predict_class_college - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
least_squares_r2
## [1] 0.9222907
ridge_r2 <- 1 - mean((ridge_prediction - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
ridge_r2
## [1] 0.8693482
lasso_r2 <- 1 - mean((lasso_prediction - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
lasso_r2
## [1] 0.9224066
pcr_r2 <- 1 - mean((pcr_prediction - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
pcr_r2
## [1] 0.8300497
pls_r2 <- 1 - mean((pls_predict - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
pls_r2
## [1] 0.9206752
As we can see, all of the models have a fairly high test R^2, meaning each of them explains most of the variance in the number of applications using the given predictors; least squares, the lasso, and PLS are all around 92%, while ridge regression (about 87%) and PCR (about 83%) are somewhat lower.
We will now try to predict per capita crime rate in the Boston data set.
a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(janitor)   # provides clean_names()
library(magrittr)  # provides the %>% pipe
boston <- MASS::Boston %>%
  clean_names()
training_sample_boston <- sample(nrow(boston), 0.6*nrow(boston))  # 60/40 train/test split (a seed should be set first for a reproducible split)
training_data_boston <- boston[training_sample_boston, ]
test_data_boston <- boston[-training_sample_boston, ]
least_squares_boston <- lm(crim ~ ., data = training_data_boston)
predict_class_boston <- predict(least_squares_boston, test_data_boston)
boston_error <- mean((test_data_boston$crim - predict_class_boston)^2)
boston_error
## [1] 41.7718
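Part (a) also lists best subset selection among the candidate methods; it was not fit here, but a sketch using the leaps package (hypothetical object names such as best_subset_boston) could look like the following, choosing the subset size by BIC and computing the validation-set MSE by hand since regsubsets() has no predict() method.
library(leaps)
best_subset_boston <- regsubsets(crim ~ ., data = training_data_boston, nvmax = 13)
best_size <- which.min(summary(best_subset_boston)$bic)   # subset size with the lowest BIC
best_coefs <- coef(best_subset_boston, id = best_size)    # coefficients of that subset
test_matrix_boston <- model.matrix(crim ~ ., data = test_data_boston)
best_subset_prediction <- test_matrix_boston[, names(best_coefs)] %*% best_coefs
mean((test_data_boston$crim - best_subset_prediction)^2)  # validation-set MSE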
x_train_boston <- model.matrix(crim ~ ., data = training_data_boston)[, -1]  # drop the intercept column
y_train_boston <- training_data_boston$crim
x_test_boston <- model.matrix(crim ~ ., data = test_data_boston)[, -1]
y_test_boston <- test_data_boston$crim
set.seed(1)
ridge_boston <- glmnet::cv.glmnet(x_train_boston, y_train_boston, alpha = 0)
ridge_lambda_boston <- ridge_boston$lambda.min
ridge_prediction_boston <- predict(ridge_boston, s = ridge_lambda_boston, newx = x_test_boston)
ridge_error_boston <- mean((ridge_prediction_boston - y_test_boston)^2)
ridge_error_boston
## [1] 42.13274
set.seed(1)
lasso_boston <- glmnet::cv.glmnet(x_train_boston, y_train_boston, alpha = 1)
lasso_lambda_boston <- lasso_boston$lambda.min
lasso_lambda_boston
## [1] 0.09397746
lasso_prediction_boston <- predict(lasso_boston, s = lasso_lambda_boston, newx = x_test_boston)
lasso_error_boston <- mean((lasso_prediction_boston - y_test_boston)^2)
lasso_error_boston
## [1] 41.87298
pcr_boston <- pcr(training_data_boston$crim ~ ., data = training_data_boston, scale = TRUE, validation = "CV")
validationplot(pcr_boston, val.type = "MSEP")
summary(pcr_boston)
## Data: X dimension: 303 13
## Y dimension: 303 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.184 7.198 6.810 6.816 6.834 6.834
## adjCV 8.444 7.181 7.196 6.803 6.809 6.827 6.825
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.83 6.684 6.744 6.755 6.763 6.742 6.695
## adjCV 6.82 6.672 6.732 6.743 6.748 6.724 6.677
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 48.13 60.44 69.81 76.98 83.62 88.96
## training_data_boston$crim 28.29 28.38 36.42 36.58 36.58 37.56
## 7 comps 8 comps 9 comps 10 comps 11 comps
## X 91.83 94.02 95.82 97.28 98.53
## training_data_boston$crim 37.85 40.13 40.42 40.64 40.90
## 12 comps 13 comps
## X 99.60 100.00
## training_data_boston$crim 42.08 43.08
pcr_prediction_boston <- predict(pcr_boston, test_data_boston, ncomp = 4)
pcr_error_boston <- mean((pcr_prediction_boston - test_data_boston$crim)^2)
pcr_error_boston
## [1] 44.60879
pls_boston <- plsr(training_data_boston$crim ~ ., data = training_data_boston, scale = TRUE, validation = "CV")
validationplot(pls_boston, val.type = "MSEP")
summary(pls_boston)
## Data: X dimension: 303 13
## Y dimension: 303 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.039 6.775 6.781 6.745 6.760 6.771
## adjCV 8.444 7.035 6.765 6.759 6.731 6.741 6.748
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.743 6.743 6.731 6.733 6.735 6.735 6.735
## adjCV 6.723 6.722 6.711 6.713 6.715 6.715 6.715
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 47.68 57.13 61.19 69.54 76.6 79.51
## training_data_boston$crim 31.84 39.42 41.77 42.31 42.6 42.89
## 7 comps 8 comps 9 comps 10 comps 11 comps
## X 85.04 86.96 89.11 94.92 97.09
## training_data_boston$crim 42.97 43.06 43.08 43.08 43.08
## 12 comps 13 comps
## X 98.62 100.00
## training_data_boston$crim 43.08 43.08
pls_predict_boston <- predict(pls_boston, test_data_boston, ncomp = 6)
pls_error_boston <- mean((pls_predict_boston - test_data_boston$crim)^2)
pls_error_boston
## [1] 41.94278
Comparing the validation-set MSEs above, the methods perform very similarly: roughly 41.8 for least squares, 42.1 for ridge regression, 41.9 for the lasso, 44.6 for PCR with 4 components, and 41.9 for PLS with 6 components; only PCR is noticeably worse.
b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
average_crim <- mean(test_data_boston$crim)
least_squares__boston_r2 <- 1 - mean((predict_class_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
least_squares__boston_r2
## [1] 0.465481
ridge_boston_r2 <- 1 - mean((ridge_prediction_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
ridge_boston_r2
## [1] 0.4608624
lasso__boston_r2 <- 1 - mean((lasso_prediction_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
lasso__boston_r2
## [1] 0.4641862
pcr_boston_r2 <- 1 - mean((pcr_prediction_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
pcr_boston_r2
## [1] 0.4291784
pls_boston_r2 <- 1 - mean((pls_predict_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
pls_boston_r2
## [1] 0.4632931
By this measure the models are all fairly close: least squares and the lasso have the highest test-set R^2 here (about 46.5% and 46.4%), with PLS (about 46.3%) and ridge regression (about 46.1%) right behind and PCR (about 42.9%) trailing. I would still propose the ridge model: its validation-set R^2 is essentially tied with the best, and shrinkage keeps the coefficient estimates stable.
c) Does your chosen model involve all of the features in the data set? Why or why not?
As our chosen model is the ridge model, we know that this procedure does not discard variables; it shrinks their coefficients toward zero by an amount controlled by the tuning parameter λ. Therefore, the model uses all of the features in the data set.
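As a quick sanity check (a sketch, not run above), the coefficients at lambda.min can be printed to confirm that every predictor retains a non-zero, shrunken coefficient:
predict(ridge_boston, s = ridge_lambda_boston, type = "coefficients")  # all predictors remain, with shrunken coefficients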