For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(b) Repeat (a) for ridge regression relative to least squares.

(c) Repeat (a) for non-linear methods relative to least squares.
(a)
i. Incorrect
ii. Incorrect
iii. Correct
iv. Incorrect
The lasso constrains the coefficients (shrinking some exactly to zero), so it is less flexible than least squares; this raises bias but lowers variance, and prediction accuracy improves when the increase in bias is less than the decrease in variance.
(b)
i. Incorrect
ii. Incorrect
iii. Correct
iv. Incorrect
Like the lasso, ridge regression shrinks the coefficients and is therefore less flexible than least squares; the same bias-variance reasoning applies.
(c)
i. Incorrect
ii. Correct
iii. Incorrect
iv. Incorrect
Non-linear methods are more flexible than least squares; the added flexibility lowers bias but raises variance, so prediction accuracy improves when the increase in variance is less than the decrease in bias, which is option ii.
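All three answers rest on the standard decomposition of the expected test error at a point $x_0$, where $\operatorname{Var}(\varepsilon)$ is the irreducible noise:

$$
E\left[\big(y_0 - \hat{f}(x_0)\big)^2\right] = \operatorname{Var}\big(\hat{f}(x_0)\big) + \Big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\Big]^2 + \operatorname{Var}(\varepsilon)
$$

A change in flexibility improves expected test error exactly when the reduction in one of the first two terms exceeds the increase in the other.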
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
set.seed(42)
train_index <- createDataPartition(College$Apps, p = 0.7, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
lm_model <- lm(Apps ~ ., data = train_data)
lm_preds <- predict(lm_model, newdata = test_data, type = 'response')
lm_test_error <- mean((lm_preds - test_data$Apps)^2)
lm_test_error
## [1] 1390507
Test MSE of the linear model fit by least squares = 1390507.
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1] #removes intercept
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
#ridge regression
grid <- 10^seq(10, -2, length = 100)
cv_ridge <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 0) #alpha = 0 is for ridge
best_lambda_ridge <- cv_ridge$lambda.min
ridge_preds <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
ridge_test_error <- mean((ridge_preds - test_data$Apps)^2)
ridge_test_error
## [1] 1390359
Test MSE of ridge regression = 1390359, slightly lower than the least squares test MSE, so ridge performs marginally better here.
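As a quick check (a sketch reusing the objects fit above), ridge shrinks the coefficients relative to least squares but never sets any exactly to zero:

# Sketch: compare unpenalized and ridge coefficient estimates side by side;
# ridge shrinks toward zero but does not zero anything out
cbind(ols = coef(lm_model),
      ridge = as.numeric(predict(cv_ridge, type = 'coefficients',
                                 s = best_lambda_ridge)))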
#lasso regression
cv_lasso <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 1) #alpha = 1 makes it lasso
best_lambda_lasso <- cv_lasso$lambda.min
lasso_preds <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
lasso_test_error <- mean((lasso_preds - test_data$Apps)^2)
lasso_test_error
## [1] 1390191
Test MSE of the lasso = 1390191. This is the lowest of the three models so far, slightly better than both least squares and ridge.
lasso_coef <- predict(cv_lasso, type = 'coefficient', s = best_lambda_lasso)
lasso_coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -596.78495261
## PrivateYes -339.69792355
## Accept 1.65416184
## Enroll -1.00517762
## Top10perc 44.69859446
## Top25perc -13.37564877
## F.Undergrad 0.06111737
## P.Undergrad 0.02135864
## Outstate -0.08342593
## Room.Board 0.08516271
## Books 0.19320953
## Personal 0.04506171
## PhD -9.88092135
## Terminal -0.07124284
## S.F.Ratio 18.75160346
## perc.alumni 1.73925789
## Expend 0.09175698
## Grad.Rate 6.11088222
At the selected lambda, all 17 predictors have non-zero coefficient estimates, so the lasso performed no variable selection on this data set.
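To see why nothing was dropped, we can plot the lasso coefficient paths (a sketch reusing the training matrices from above); the CV-chosen lambda sits well before any coefficient reaches zero:

# Sketch: lasso coefficient paths as a function of log(lambda); the dashed
# line marks the CV-selected lambda, where no coefficient has hit zero yet
lasso_path <- glmnet(x_train, y_train, alpha = 1)
plot(lasso_path, xvar = 'lambda', label = TRUE)
abline(v = log(best_lambda_lasso), lty = 2)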
pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = 'CV')
summary(pcr_model)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3925 3946 2147 2128 2110 1771 1747
## adjCV 3925 3947 2145 2127 2179 1763 1742
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1743 1750 1655 1620 1610 1615 1622
## adjCV 1737 1745 1647 1613 1605 1610 1617
## 14 comps 15 comps 16 comps 17 comps
## CV 1646 1375 1174 1119
## adjCV 1662 1355 1164 1111
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.22181 57.32 64.52 70.43 76.09 80.77 84.43 87.74
## Apps 0.02335 70.85 71.39 71.69 81.41 81.71 81.84 81.86
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.51 92.89 95.07 96.72 97.81 98.63 99.35
## Apps 84.05 85.05 85.22 85.24 85.30 85.40 92.19
## 16 comps 17 comps
## X 99.83 100.00
## Apps 93.45 93.88
validationplot(pcr_model, val.type = 'MSEP')
best_M_pcr <- which.min(pcr_model$validation$PRESS)
pcr_preds <- predict(pcr_model, newdata = test_data, ncomp = best_M_pcr)
pcr_test_error <- mean((pcr_preds - test_data$Apps)^2)
list(pcr_test_error, best_M_pcr)
## [[1]]
## [1] 1390507
##
## [[2]]
## [1] 17
Test MSE of PCR = 1390507, with optimal M = 17. The validation plot shows CV MSEP decreasing steadily and reaching its minimum only when all 17 components are used; with M = 17, PCR spans the full predictor space and is equivalent to least squares, which is why its test MSE matches the linear model exactly.
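A quick sanity check of that equivalence (a sketch using the predictions computed above):

# Sketch: with all 17 components, PCR reproduces the least squares fit,
# so the two prediction vectors should agree up to numerical tolerance
all.equal(as.numeric(pcr_preds), as.numeric(lm_preds))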
pls_model <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = 'CV')
summary(pls_model)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3925 1983 1565 1520 1384 1259 1143
## adjCV 3925 1978 1543 1469 1366 1247 1133
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1133 1124 1119 1117 1118 1116 1112
## adjCV 1124 1115 1110 1108 1110 1107 1103
## 14 comps 15 comps 16 comps 17 comps
## CV 1112 1112 1112 1112
## adjCV 1103 1103 1103 1103
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.94 31.61 53.18 65.41 70.15 73.61 77.40 79.96
## Apps 76.28 86.90 88.60 91.29 92.76 93.60 93.67 93.76
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.89 86.76 89.51 91.14 92.60 94.80 97.15
## Apps 93.82 93.84 93.85 93.87 93.88 93.88 93.88
## 16 comps 17 comps
## X 99.02 100.00
## Apps 93.88 93.88
validationplot(pls_model, val.type = 'MSEP')
best_M_pls <- which.min(pls_model$validation$PRESS)
pls_preds <- predict(pls_model, newdata = test_data, ncomp = 9) # using M = 9 rather than the CV-optimal M; see note below
pls_test_error <- mean((pls_preds - test_data$Apps)^2)
list(pls_test_error, best_M_pls)
## [[1]]
## [1] 1378551
##
## [[2]]
## [1] 13
Test MSE of PLS = 1378551; CV-optimal M = 13.
Here we deliberately use M = 9 rather than the CV-optimal 13: beyond about 9 components the CV error is essentially flat (1119 vs 1112) and the percentage of variance in Apps explained barely changes, so the simpler model is preferable.
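For comparison, the test error at the CV-optimal M (a sketch; we did not rely on this in the write-up):

# Sketch: predictions at the CV-optimal number of components (13),
# to confirm the choice of M = 9 costs little on the test set
pls_preds_13 <- predict(pls_model, newdata = test_data, ncomp = best_M_pls)
mean((pls_preds_13 - test_data$Apps)^2)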
The results indicate that all five models produce similar test errors, suggesting that our ability to predict the number of applications received is somewhat limited with these predictors. The differences in test MSE across models are small, so no single approach drastically outperforms the others. In conclusion, PLS was our best model, offering the lowest test MSE while balancing complexity, but even it leaves substantial error relative to the scale of Apps.
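One way to quantify "somewhat limited" is the test-set R-squared of the PLS fit (a sketch using objects defined above):

# Sketch: test-set R^2 = 1 - RSS/TSS on the held-out data, putting the
# test MSE on an interpretable 0-1 scale
1 - sum((pls_preds - test_data$Apps)^2) /
    sum((test_data$Apps - mean(test_data$Apps))^2)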
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
data(Boston)
x <- Boston[, names(Boston) != 'crim'] # predictors only; drop the response column
y <- Boston$crim
set.seed(42)
train_index2 <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
train_data2 <- Boston[train_index2, ]
test_data2 <- Boston[-train_index2, ]
library(leaps)
best_subset <- regsubsets(crim ~ ., data = train_data2, nvmax = ncol(train_data2) - 1)
best_subset_summary <- summary(best_subset)
par(mfrow = c(2, 2))
plot(best_subset_summary$rss, type = 'l', pch = 19, xlab = 'Number of predictors',
ylab = 'RSS')
plot(best_subset_summary$adjr2, type = 'l', pch = 19, xlab = 'Number of predictors',
ylab = 'AdjR2')
p = which.max(best_subset_summary$adjr2)
points(p, best_subset_summary$adjr2[p], col = 'red', cex = 2, pch = 20)
plot(best_subset_summary$cp, type = 'l', pch = 19, xlab = 'Number of predictors',
ylab = 'CP')
p = which.min(best_subset_summary$cp)
points(p, best_subset_summary$cp[p], col = 'red', cex = 2, pch = 20)
plot(best_subset_summary$bic, type = 'l', pch = 19, xlab = 'Number of predictors',
ylab = 'BIC')
p = which.min(best_subset_summary$bic)
points(p, best_subset_summary$bic[p], col = 'red', cex = 2, pch = 20)
Two candidate model sizes stand out: 3 and 6 predictors. These come close to the highest adjusted R-squared, and model performance improves very little as further predictors are added.
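The chosen sizes can also be read directly from the summary object rather than off the plots (a sketch):

# Sketch: the model size each criterion selects
c(adjr2 = which.max(best_subset_summary$adjr2),
  cp    = which.min(best_subset_summary$cp),
  bic   = which.min(best_subset_summary$bic))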
best_subset_summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train_data2, nvmax = ncol(train_data2) -
## 1)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" " " " " "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" " " " " "*" "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
From the selection table, the 3-predictor model uses rad, black, and lstat; the 6-predictor model uses zn, nox, dis, rad, black, and medv.
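The same coefficients can be pulled out programmatically (a sketch using the coef method from leaps):

# Sketch: extract the fitted coefficients for the 3- and 6-predictor models
coef(best_subset, id = 3)
coef(best_subset, id = 6)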
best_subset_model <- lm(crim ~ black + rad + lstat, data = train_data2)
best_subset_model_larger <- lm(crim ~ zn + nox + dis + rad + black + medv, data = train_data2)
bs_preds <- predict(best_subset_model, newdata = test_data2)
bs_larger_preds <- predict(best_subset_model_larger, newdata = test_data2)
mean((bs_preds - test_data2$crim)^2)
## [1] 45.90462
mean((bs_larger_preds - test_data2$crim)^2)
## [1] 44.41283
The larger model does perform better, but not by much (44.41 vs 45.90); adding three more variables is hard to justify for such a small improvement.
x_train2 <- model.matrix(crim ~ ., data = train_data2)[, -1]
y_train2 <- train_data2$crim
x_test2 <- model.matrix(crim ~ ., data = test_data2)[, -1]
cv_lasso2 <- cv.glmnet(x_train2, y_train2, lambda = grid, alpha = 1)
#optimal lambda
best_lambda_lasso2 <- cv_lasso2$lambda.min
lasso_model <- glmnet(x_train2, y_train2, alpha = 1, lambda = best_lambda_lasso2)
lasso_pred <- predict(lasso_model, s = best_lambda_lasso2, newx = x_test2)
#mse
lasso_mse <- mean((lasso_pred - test_data2$crim)^2)
lasso_mse
## [1] 43.91328
predict(cv_lasso2, type = 'coefficient', s = best_lambda_lasso2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 12.490672439
## zn 0.032243142
## indus -0.048258168
## chas -0.989551968
## nox -8.040554865
## rm 0.263696046
## age 0.008175934
## dis -0.621009082
## rad 0.504244683
## tax .
## ptratio -0.155705969
## black -0.012575303
## lstat 0.099700619
## medv -0.123537799
pcr_model2 <- pcr(crim ~ ., data = train_data2, scale = TRUE, validation = 'CV')
summary(pcr_model2)
## Data: X dimension: 356 13
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.181 7.192 6.659 6.698 6.704 6.731
## adjCV 8.444 7.177 7.189 6.652 6.691 6.697 6.722
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.730 6.628 6.711 6.715 6.727 6.713 6.619
## adjCV 6.721 6.618 6.697 6.698 6.711 6.694 6.600
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.12 59.96 68.89 75.77 82.48 87.54 90.78 93.34
## crim 28.69 28.72 39.09 39.11 39.11 39.14 39.45 41.56
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.24 96.98 98.41 99.47 100.00
## crim 41.82 42.07 42.07 43.12 44.56
validationplot(pcr_model2, val.type = 'MSEP')
pcr_preds2 <- predict(pcr_model2, newdata = test_data2, ncomp = 8) # CV RMSEP is nearly minimal at 8 components (6.628 vs 6.619 at 13)
mean((pcr_preds2 - test_data2$crim)^2)
## [1] 45.95997
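For reference, the strictly CV-optimal M (a sketch mirroring the which.min approach used earlier); the CV error at 8 components is nearly identical, so we prefer the smaller model:

# Sketch: the component count with the lowest cross-validated PRESS
which.min(pcr_model2$validation$PRESS)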
Lasso regression appears to be the best-performing of the three models we tested on this data set (test MSE 43.91, vs 44.41 for the 6-predictor subset model and 45.96 for PCR). It could be argued that best subset performs comparably, since the lasso outperforms it only by a small margin.
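Collecting the three test MSEs in one place (a sketch using objects computed above):

# Sketch: side-by-side test MSEs for the Boston models
c(best_subset_6 = mean((bs_larger_preds - test_data2$crim)^2),
  lasso         = lasso_mse,
  pcr_8         = mean((pcr_preds2 - test_data2$crim)^2))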
The model we would propose is best subset selection: its test error is only slightly worse than the lasso's, and it uses far fewer predictors. Notably, the lasso keeps almost every predictor (only tax is dropped), which is surprising given that both best subset and PCR did well with roughly 3 to 8.
Our chosen model does not involve all of the features in the data set: after the first three predictors are included, each additional variable yields only a marginal improvement in performance, so the trade-off no longer favors adding them.