Question 2

For parts (a) through (c), indicate which of i through iv is correct. Justify your answer.

a) The lasso, relative to least squares, is: i) More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. ii) More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias iii) Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. iv) Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for the lasso relative to least squares is iii. According to Figure 2.7, the lasso is less flexible but more interpretable than least squares (pg 25). The lasso is also a shrinkage procedure, which can "substantially reduce the variance at the cost of a negligible increase in bias" (pg 204).

b) Repeat (a) for ridge regression relative to least squares.

The correct answer for ridge regression relative to least squares is iii. Once again referring to Figure 2.7, ridge regression is less flexible but more interpretable than least squares, which is more flexible but less interpretable (pg 25). Ridge regression is also a shrinkage procedure, which can "substantially reduce the variance at the cost of a negligible increase in bias" (pg 204). It is also stated that "as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero" (pg 215), effectively minimizing variance while increasing bias.

c) Repeat (a) for non-linear methods relative to least squares.

The correct answer for non-linear methods is ii. Non-linear methods include bagging, boosting, and support vector machines, the first two of which appear in Figure 2.7 with high flexibility and low interpretability compared to least squares (pg 25). Applying the bias-variance trade-off as in the two previous parts, a more flexible method gives improved prediction accuracy when its increase in variance is less than its decrease in bias.

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

a) Split the data set into a training set and a test set.

college <- ISLR::College
training_sample_college <- sample(nrow(college), 0.6*nrow(college))
training_data <- college[training_sample_college, ]
test_data <- college[-training_sample_college, ]
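The split above is random and no seed is fixed beforehand, so the exact test errors reported below will change from run to run. A minimal tweak for reproducibility (not applied to the outputs shown; the seed value 1 is purely illustrative):

set.seed(1)  # fix the random number generator so the train/test split is reproducible
training_sample_college <- sample(nrow(college), 0.6 * nrow(college))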

b) Fit a linear model using least squares on the training set, and report the test error obtained.

least_squares_college <- lm(Apps ~ ., data = training_data)
predict_class_college <- predict(least_squares_college, test_data)
college_error <- mean((test_data$Apps - predict_class_college)^2)
college_error
## [1] 1530961

The test error of the least squares model is the mean of the squared differences between the observed and predicted numbers of applications over the test set. Using all of the variables in the model, the test error comes out to 1,530,961.
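In symbols, the quantity computed above is the test-set mean squared error, where y_i is an observed application count in the test set, ŷ_i the corresponding prediction, and n the number of test observations:

MSE_test = (1/n) Σ (y_i − ŷ_i)^2

The same calculation is reused for every model that follows.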

c) Fit a ridge regression model on the training set, with λ (lambda) chosen by cross-validation. Report the test error obtained.

# Note: training_data[, -1] drops the first column (Private), so the penalized
# fits below use one fewer predictor than the least squares fit above.
x_train_college <- model.matrix(Apps ~ ., data = training_data[, -1])
y_train_college <- training_data$Apps
x_test_college <- model.matrix(Apps ~ ., data = test_data[, -1])
y_test_college <- test_data$Apps
set.seed(1)
ridge_college <- glmnet::cv.glmnet(x_train_college, y_train_college, alpha = 0)
ridge_lambda_college <- ridge_college$lambda.min
ridge_prediction <- predict(ridge_college, s = ridge_lambda_college, newx = x_test_college)
ridge_error <- mean((ridge_prediction - y_test_college)^2)
ridge_error
## [1] 2573989

We use lambda.min, the value of λ that minimizes the cross-validation error. The resulting test error is 2,573,989, which is actually higher than the least squares test error on this split; ridge regression does not always outperform least squares. Moreover, because ridge regression keeps all of the variables and merely penalizes and shrinks their coefficients toward zero, it does little to aid interpretation of the relationship between the dependent and independent variables.
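For completeness, one could also report the selected λ itself (the code above stores it but never prints it) and compare it with the more conservative lambda.1se value that cv.glmnet also returns; a small sketch (output not shown):

ridge_lambda_college        # λ that minimizes the cross-validation error
ridge_college$lambda.1se    # largest λ within one standard error of that minimum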

d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates

set.seed(1)
lasso_college <- glmnet::cv.glmnet(x_train_college, y_train_college, alpha = 1)
lasso_lambda_college <- lasso_college$lambda.min
lasso_lambda_college
## [1] 10.00359
lasso_prediction <- predict(lasso_college, s = lasso_lambda_college, newx = x_test_college)
lasso_error <- mean((lasso_prediction - y_test_college)^2)
lasso_error
## [1] 1528678

Just like with ridge regression, we use lambda.min, the λ that minimizes the cross-validation error. The resulting test error is 1,528,678, which is slightly lower than the least squares test error and well below the ridge test error. The exercise also asks for the number of non-zero coefficient estimates; a sketch for extracting that count from the fitted model follows.
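A minimal sketch for counting the non-zero coefficients at the selected λ (coef on a cv.glmnet object returns a sparse matrix that includes the intercept; output not shown):

lasso_coefs <- coef(lasso_college, s = lasso_lambda_college)  # sparse coefficient matrix
sum(as.vector(lasso_coefs) != 0) - 1  # non-zero predictors, subtracting 1 for the intercept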

e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pcr_college <- pcr(Apps ~ ., data = training_data, scale = TRUE, validation = "CV")
validationplot(pcr_college, val.type = "MSEP")

summary(pcr_college)
## Data:    X dimension: 466 17 
##  Y dimension: 466 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3440     3428     1716     1720     1521     1267     1262
## adjCV         3440     3428     1714     1721     1516     1262     1259
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1261     1206     1188      1186      1187      1191      1199
## adjCV     1263     1196     1186      1184      1185      1189      1196
##        14 comps  15 comps  16 comps  17 comps
## CV         1200      1200      1045      1041
## adjCV      1197      1197      1042      1038
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.011    57.72    64.88    70.97    76.36    80.99    84.66    88.11
## Apps    1.634    75.47    75.55    82.81    87.29    87.30    87.47    88.64
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.21     93.48     95.41     97.04     98.09     98.95     99.45
## Apps    88.74     88.82     88.82     88.93     88.95     88.96     89.05
##       16 comps  17 comps
## X        99.85     100.0
## Apps     91.70      91.8
pcr_prediction <- predict(pcr_college, test_data, ncomp = 10)
pcr_error <- mean((pcr_prediction - test_data$Apps)^2)
pcr_error
## [1] 3348213

When we look at the model's graph and the summary, the cross-validation error levels off at around 10 principal components (it only drops appreciably again when nearly all 17 components are used, at which point PCR is essentially least squares). Ten components explain about 93% of the variance in the predictors for the training data. Using more principal components makes the model more flexible, decreasing the bias but increasing the variance (per the bias-variance trade-off). The test error obtained with 10 components is 3,348,213, which is the highest test error of all of our models so far.
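Rather than reading M off the validation plot, the pls package can also choose it automatically; a sketch using the one-sigma heuristic (this may select a different M than the visual choice above, and the same call works for the PLS fit in part f):

m_pcr <- selectNcomp(pcr_college, method = "onesigma", plot = FALSE)  # smallest M within one SE of the CV minimum
m_pcr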

f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls_college <- plsr(Apps ~ ., data = training_data, scale = TRUE, validation = "CV")
validationplot(pls_college, val.type = "MSEP")

summary(pls_college)
## Data:    X dimension: 466 17 
##  Y dimension: 466 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3440     1540     1247     1165     1137     1103     1075
## adjCV         3440     1538     1246     1162     1133     1098     1069
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1063     1055     1054      1053      1052      1053      1054
## adjCV     1059     1052     1050      1049      1048      1049      1050
##        14 comps  15 comps  16 comps  17 comps
## CV         1054      1054      1054      1054
## adjCV      1050      1050      1050      1050
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.75    38.84    63.35    67.02    70.07    73.82    78.95    82.03
## Apps    80.65    87.56    89.26    90.16    91.18    91.66    91.74    91.77
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       84.22     86.47     88.80     91.58     93.53     96.31     97.25
## Apps    91.78     91.79     91.79     91.80     91.80     91.80     91.80
##       16 comps  17 comps
## X        98.82     100.0
## Apps     91.80      91.8
pls_predict <- predict(pls_college, test_data, ncomp = 10)
pls_error <- mean((pls_predict - test_data$Apps)^2)
pls_error
## [1] 1562788

When we look at the model's graph and the summary, the cross-validation error levels off at around 9 to 11 components. I will choose 10 components, which explain about 92% of the variance in Apps for the training data. Using more components would make the model more flexible, decreasing the bias but increasing the variance (per the bias-variance trade-off). The test error obtained by this model is 1,562,788, which is lower than the test error for our PCR model (PLS constructs its components using the response, which can help), but still slightly higher than the least squares and lasso errors.

g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The lasso had the lowest test error of all the models we ran, with least squares and PLS close behind; ridge and especially PCR did noticeably worse on this split. To gauge how accurately each model predicts the number of applications received, we compute a test-set R^2 for each one, comparing the model's squared prediction errors against those of simply predicting the average number of applications in the test data.
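In symbols, with y_i the observed test-set application counts, ŷ_i a model's predictions, and ȳ the average number of applications in the test set:

R^2 = 1 − Σ (y_i − ŷ_i)^2 / Σ (y_i − ȳ)^2

This is the quantity computed for each of the five models below.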

average_applications <- mean(test_data$Apps)
least_squares_r2 <- 1 - mean((predict_class_college - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
least_squares_r2
## [1] 0.9222907
ridge_r2 <- 1 - mean((ridge_prediction - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
ridge_r2
## [1] 0.8693482
lasso_r2 <- 1 - mean((lasso_prediction - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
lasso_r2
## [1] 0.9224066
pcr_r2 <- 1 - mean((pcr_prediction - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
pcr_r2
## [1] 0.8300497
pls_r2 <- 1 - mean((pls_predict - test_data$Apps)^2) / mean((average_applications - test_data$Apps)^2)
pls_r2
## [1] 0.9206752

As we can see, all of the models have fairly high R^2 values, meaning each one explains most of the variance in the number of applications using the given predictors: least squares, the lasso, and PLS all sit around 92%, while ridge (about 87%) and PCR (about 83%) lag somewhat behind.

Question 11

We will now try to predict per capita crime rate in the Boston data set.

a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

boston <- MASS::Boston %>%
          clean_names()  # clean_names() is from the janitor package
training_sample_boston <- sample(nrow(boston), 0.6*nrow(boston))
training_data_boston <- boston[training_sample_boston, ]
test_data_boston <- boston[-training_sample_boston, ]
least_squares_boston <- lm(crim ~ ., data = training_data_boston)
predict_class_boston <- predict(least_squares_boston, test_data_boston)
boston_error <- mean((test_data_boston$crim - predict_class_boston)^2)
boston_error
## [1] 41.7718
# crim is the first column of Boston, so dropping it leaves only the predictors;
# the response is supplied on the left-hand side of the formula.
x_train_boston <- model.matrix(training_data_boston$crim ~ ., data = training_data_boston[, -1])
y_train_boston <- training_data_boston$crim
x_test_boston <- model.matrix(test_data_boston$crim ~ ., data = test_data_boston[, -1])
y_test_boston <- test_data_boston$crim
set.seed(1)

ridge_boston <- glmnet::cv.glmnet(x_train_boston, y_train_boston, alpha = 0)
ridge_lambda_boston <- ridge_boston$lambda.min
ridge_prediction_boston <- predict(ridge_boston, s = ridge_lambda_boston, newx = x_test_boston)
ridge_error_boston <- mean((ridge_prediction_boston - y_test_boston)^2)
ridge_error_boston
## [1] 42.13274
set.seed(1)
lasso_boston <- glmnet::cv.glmnet(x_train_boston, y_train_boston, alpha = 1)
lasso_lambda_boston <- lasso_boston$lambda.min
lasso_lambda_boston
## [1] 0.09397746
lasso_prediction_boston <- predict(lasso_boston, s = lasso_lambda_boston, newx = x_test_boston)
lasso_error_boston <- mean((lasso_prediction_boston - y_test_boston)^2)
lasso_error_boston
## [1] 41.87298
pcr_boston <- pcr(training_data_boston$crim ~ ., data = training_data_boston, scale = TRUE, validation = "CV")
validationplot(pcr_boston, val.type = "MSEP")

summary(pcr_boston)
## Data:    X dimension: 303 13 
##  Y dimension: 303 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.184    7.198    6.810    6.816    6.834    6.834
## adjCV        8.444    7.181    7.196    6.803    6.809    6.827    6.825
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        6.83    6.684    6.744     6.755     6.763     6.742     6.695
## adjCV     6.82    6.672    6.732     6.743     6.748     6.724     6.677
## 
## TRAINING: % variance explained
##                            1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                            48.13    60.44    69.81    76.98    83.62    88.96
## training_data_boston$crim    28.29    28.38    36.42    36.58    36.58    37.56
##                            7 comps  8 comps  9 comps  10 comps  11 comps
## X                            91.83    94.02    95.82     97.28     98.53
## training_data_boston$crim    37.85    40.13    40.42     40.64     40.90
##                            12 comps  13 comps
## X                             99.60    100.00
## training_data_boston$crim     42.08     43.08
pcr_prediction_boston <- predict(pcr_boston, test_data_boston, ncomp = 4)
pcr_error_boston <- mean((pcr_prediction_boston - test_data_boston$crim)^2)
pcr_error_boston
## [1] 44.60879
pls_boston <- plsr(training_data_boston$crim ~ ., data = training_data_boston, scale = TRUE, validation = "CV")
validationplot(pls_boston, val.type = "MSEP")

summary(pls_boston)
## Data:    X dimension: 303 13 
##  Y dimension: 303 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.039    6.775    6.781    6.745    6.760    6.771
## adjCV        8.444    7.035    6.765    6.759    6.731    6.741    6.748
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.743    6.743    6.731     6.733     6.735     6.735     6.735
## adjCV    6.723    6.722    6.711     6.713     6.715     6.715     6.715
## 
## TRAINING: % variance explained
##                            1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                            47.68    57.13    61.19    69.54     76.6    79.51
## training_data_boston$crim    31.84    39.42    41.77    42.31     42.6    42.89
##                            7 comps  8 comps  9 comps  10 comps  11 comps
## X                            85.04    86.96    89.11     94.92     97.09
## training_data_boston$crim    42.97    43.06    43.08     43.08     43.08
##                            12 comps  13 comps
## X                             98.62    100.00
## training_data_boston$crim     43.08     43.08
pls_predict_boston <- predict(pls_boston, test_data_boston, ncomp = 6)
pls_error_boston <- mean((pls_predict_boston - test_data_boston$crim)^2)
pls_error_boston
## [1] 41.94278
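On this validation split the five fits land close together: the test MSEs are roughly 41.77 for least squares, 41.87 for the lasso, 41.94 for PLS (6 components), 42.13 for ridge, and 44.61 for PCR (4 components), so the penalized and dimension-reduction methods offer little improvement over least squares here. The exercise also lists best subset selection, which was not run above; a minimal sketch using the leaps package (illustrative only, output not shown) could look like this:

best_subset_boston <- leaps::regsubsets(crim ~ ., data = training_data_boston, nvmax = 13)
best_subset_summary <- summary(best_subset_boston)
which.min(best_subset_summary$bic)  # number of predictors in the BIC-preferred model
coef(best_subset_boston, id = which.min(best_subset_summary$bic))  # its coefficient estimates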

b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

average_crim <- mean(test_data_boston$crim)
least_squares__boston_r2 <- 1 - mean((predict_class_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
least_squares__boston_r2
## [1] 0.465481
ridge_boston_r2 <- 1 - mean((ridge_prediction_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
ridge_boston_r2
## [1] 0.4608624
lasso__boston_r2 <- 1 - mean((lasso_prediction_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
lasso__boston_r2
## [1] 0.4641862
pcr_boston_r2 <- 1 - mean((pcr_prediction_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
pcr_boston_r2
## [1] 0.4291784
pls_boston_r2 <- 1 - mean((pls_predict_boston - test_data_boston$crim)^2) / mean((average_crim - test_data_boston$crim)^2)
pls_boston_r2
## [1] 0.4632931

All five models perform similarly on this validation split, with test-set R^2 values between roughly 43% and 47%; least squares and the lasso explain the most variance in the per capita crime rate, with PLS and ridge essentially tied with them and PCR slightly behind. Given how close the fits are, I would propose the ridge model: it predicts essentially as well as least squares on held-out data while shrinking the coefficient estimates, which helps guard against overfitting.

c) Does your chosen model involve all of the features in the data set? Why or why not?

As our chosen model is the ridge model, we know that this procedure does not throw variables away, but rather shrinks them toward zero based on the size of the tuning parameter λ; therefore, the chosen model involves all of the features in the data set.
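By contrast, the lasso fit from part (a) can set coefficients exactly to zero. A minimal sketch for checking which features it keeps at its cross-validated λ (coef returns a sparse matrix whose zero entries correspond to dropped predictors; output not shown):

lasso_coefs_boston <- coef(lasso_boston, s = lasso_lambda_boston)  # sparse coefficients, including the intercept
rownames(lasso_coefs_boston)[as.vector(lasso_coefs_boston) != 0]   # predictors retained by the lasso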