For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for part (a) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: The lasso's advantage over least squares is rooted in the bias-variance trade-off. When the least squares estimates have excessively high variance, the lasso can trade a small increase in bias for a substantial reduction in variance, which in turn can yield more accurate predictions. In addition, the lasso performs variable selection, which makes its models easier to interpret than those from methods such as ridge regression.
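As a quick illustration of the variable-selection point (a hypothetical simulation; names such as x.sim and lasso.sim are illustrative and not part of the exercise):
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x.sim <- matrix(rnorm(n * p), n, p)
beta.true <- c(3, -2, 1.5, rep(0, p - 3))      # only 3 predictors truly matter
y.sim <- as.numeric(x.sim %*% beta.true + rnorm(n))
lasso.sim <- glmnet(x.sim, y.sim, alpha = 1)   # alpha = 1 fits the lasso
sum(coef(lasso.sim, s = 0.5)[-1] == 0)         # at a moderate lambda, most noise coefficients are exactly 0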
(b) Repeat (a) for ridge regression relative to least squares.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for part (b) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Ridge regression's advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge fit decreases, leading to decreased variance but increased bias. When the relationship between the response and the predictors is close to linear, the least squares estimates have low bias but may have high variance: a small change in the training data can cause a large change in the coefficient estimates, especially when the number of predictors \(p\) is almost as large as the number of observations \(n\). Ridge regression can still perform well in such situations by trading a small increase in bias for a large decrease in variance. Hence, ridge regression works best when the least squares estimates have high variance.
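A small hypothetical simulation (names like sim.coef.sd are illustrative) makes the variance point concrete: with \(p\) close to \(n\), the OLS coefficient estimates swing far more across training samples than the ridge estimates do:
library(glmnet)
set.seed(2)
n <- 60; p <- 50
beta.true <- rep(0.5, p)
sim.coef.sd <- function(fit.fun) {
  ests <- replicate(50, {
    x <- matrix(rnorm(n * p), n, p)
    y <- as.numeric(x %*% beta.true + rnorm(n))
    fit.fun(x, y)
  })
  mean(apply(ests, 1, sd))                   # average sampling sd per coefficient
}
sim.coef.sd(function(x, y) coef(lm(y ~ x))[-1])   # OLS: large
sim.coef.sd(function(x, y)                        # ridge at a single fixed lambda: much smaller
  as.numeric(coef(glmnet(x, y, alpha = 0, lambda = 10))[-1]))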
(c) Repeat (a) for non-linear methods relative to least squares.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for part (c) is: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Explanation: Non-linear methods relax the linearity assumption and are therefore more flexible than least squares. They give better predictions when the true relationship is substantially non-linear, so that the reduction in bias outweighs the added variance.
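As a hypothetical illustration (names like fit.poly are illustrative): when the truth is quadratic, a more flexible fit wins because the bias reduction outweighs the added variance:
set.seed(3)
x.nl <- runif(200, -2, 2)
y.nl <- x.nl^2 + rnorm(200, sd = 0.3)                    # true relationship is non-linear
train.nl <- sample(200, 100)
fit.lin <- lm(y.nl ~ x.nl, subset = train.nl)            # rigid: high bias
fit.poly <- lm(y.nl ~ poly(x.nl, 2), subset = train.nl)  # flexible: low bias
mean((y.nl[-train.nl] - predict(fit.lin, data.frame(x.nl = x.nl[-train.nl])))^2)
mean((y.nl[-train.nl] - predict(fit.poly, data.frame(x.nl = x.nl[-train.nl])))^2)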
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
pairs(College)
set.seed(1)
trainingIndex <- sample(nrow(College), 0.75*nrow(College))
College.train <- College[trainingIndex,]
College.test <- College[-trainingIndex,]
dim(College)
## [1] 777 18
dim(College.train)
## [1] 582 18
dim(College.test)
## [1] 195 18
College.lm <- lm(Apps~., data = College.train)
summary(College.lm)
##
## Call:
## lm(formula = Apps ~ ., data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
College.lm.pred <- predict(College.lm, newdata = College.test)
College.lm.test.err <- mean((College.test$Apps - College.lm.pred)^2)
College.lm.test.err
## [1] 1384604
After fitting a linear regression model to predict the number of applications, we obtain a test error (MSE) of 1,384,604.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
grid = 10^seq(4, -2, length = 100)  # candidate lambda grid (not passed below; cv.glmnet builds its own sequence)
# Training data for X & Y variables; [, -1] drops the model-matrix
# intercept column, since glmnet supplies its own intercept
College.x.train <- model.matrix(Apps ~ ., data = College.train)[, -1]
College.y.train <- College.train$Apps
# Testing data for X & Y variables
College.x.test <- model.matrix(Apps ~ ., data = College.test)[, -1]
College.y.test <- College.test$Apps
# College Ridge Regression Model
College.ridge <- cv.glmnet(College.x.train, College.y.train, alpha = 0)
plot(College.ridge)
College.ridge.best.lambda <- College.ridge$lambda.min
College.ridge.best.lambda
## [1] 364.6228
Using cross-validation, we obtain a tuning value of \(\lambda = 364.6228\), the value that yields the smallest cross-validation error.
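cv.glmnet also records a more conservative choice, lambda.1se, the largest \(\lambda\) whose CV error is within one standard error of the minimum; it can be read off directly:
College.ridge$lambda.1se   # heavier-regularization alternative to lambda.min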
College.ridge.pred <- predict(College.ridge, s = College.ridge.best.lambda, newx = College.x.test)
College.ridge.err <- mean((College.ridge.pred - College.y.test)^2)
College.ridge.err
## [1] 1260111
The test MSE produced by the cross-validated ridge regression is 1,260,111, noticeably smaller than that of OLS. While the predictions are more accurate, shrinking the coefficients toward zero makes it harder to interpret the relationship between the predictors and the response, Apps.
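To see the shrinkage, we can inspect the ridge coefficients at the chosen \(\lambda\) (a quick sketch using the objects above; College.ridge.coeff is an illustrative name, output not shown):
College.ridge.coeff <- predict(College.ridge, type = "coefficients",
                               s = College.ridge.best.lambda)
College.ridge.coeff   # every predictor keeps a nonzero, shrunken coefficient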
College.lasso <- cv.glmnet(College.x.train, College.y.train, alpha = 1)
plot(College.lasso)
College.lasso.best.lambda <- College.lasso$lambda.min
College.lasso.best.lambda
## [1] 1.945882
Using cross-validation, we obtain \(\lambda = 1.945882\) for the lasso, the value that yields the smallest cross-validation error.
College.lasso.pred <- predict(College.lasso, s = College.lasso.best.lambda, newx = College.x.test)
College.lasso.err <- mean((College.lasso.pred - College.y.test)^2)
College.lasso.err
## [1] 1394834
The lasso does not outperform ridge regression in this case. The test error for the lasso model is 1,394,834, slightly worse than OLS as well.
College.lasso.coeff <- predict(College.lasso, type = "coefficients", s = College.lasso.best.lambda)[1:18,]
College.lasso.coeff
## (Intercept) (Intercept) Accept Enroll Top10perc
## -1.030320e+03 0.000000e+00 1.693638e+00 -1.100616e+00 5.104012e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.369969e+01 7.851307e-02 6.036558e-02 -9.861002e-02 1.475536e-01
## Books Personal PhD Terminal S.F.Ratio
## 2.185146e-01 0.000000e+00 -8.390364e+00 1.349648e+00 2.457249e+01
## perc.alumni Expend Grad.Rate
## 7.684156e-01 5.858007e-02 5.324356e+00
The lasso coefficients at the chosen \(\lambda\) are shown above. While the lasso model may be simpler and more interpretable than ridge regression, it did not perform as well in prediction accuracy.
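A quick way to quantify the sparsity (a sketch using College.lasso.coeff from above):
sum(College.lasso.coeff == 0)   # coefficients set exactly to zero
sum(College.lasso.coeff != 0)   # coefficients retained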
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
College.pcr <- pcr(Apps~., data = College.train, scale = TRUE, validation = "CV")
summary(College.pcr)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3767 2088 2097 1815 1682 1675
## adjCV 3862 3766 2085 2096 1807 1669 1670
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1671 1620 1574 1568 1575 1579 1580
## adjCV 1666 1610 1568 1562 1570 1574 1574
## 14 comps 15 comps 16 comps 17 comps
## CV 1575 1502 1204 1134
## adjCV 1571 1486 1193 1126
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
validationplot(College.pcr, val.type = "MSEP")
Looking at the PCR model's validation plot and output, the cross-validated error reaches a local minimum around M = 10 components (CV RMSEP of 1568); the global minimum at M = 17 is just the full least squares fit and offers no dimension reduction. With 10 components, the model explains a reasonably large share of the variance in both the predictors (92.81%) and the response (85.22%).
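If we prefer to locate the minimum programmatically rather than by eye, the MSEP() accessor from pls can be queried (a sketch; the indexing assumes MSEP()'s default array layout, in which the first entry is the 0-component model):
cv.msep <- MSEP(College.pcr, estimate = "CV")$val
which.min(cv.msep) - 1   # subtract 1 for the 0-component entry; the global minimum here is M = 17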
College.pcr.pred <- predict(College.pcr, College.test, ncomp = 10)
College.pcr.err <- mean((College.pcr.pred - College.test$Apps)^2)
College.pcr.err
## [1] 1952693
The PCR model with M = 10 principal components does not outperform any of the previous models, with a test error of 1,952,693. This may have occurred because M = 10 was chosen for the sake of dimension reduction even though the CV error continues to fall all the way to M = 17; the discarded components evidently carry predictive information.
set.seed(1)
College.pls <- plsr(Apps~., data = College.train, scale = TRUE, validation = "CV")
summary(College.pls)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1921 1712 1511 1429 1255 1175
## adjCV 3862 1916 1710 1503 1410 1229 1164
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1153 1147 1139 1138 1138 1136 1134
## adjCV 1143 1138 1131 1130 1129 1128 1126
## 14 comps 15 comps 16 comps 17 comps
## CV 1134 1134 1134 1134
## adjCV 1126 1126 1126 1126
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
validationplot(College.pls, val.type = "MSEP")
Reviewing the validation plot, the MSE drops to its lowest somewhere between 6 and 10 components. The summary of the PLS model shows the CV error bottoming out from about 13 components onward; however, that is too many components for a method that is supposed to reduce dimension. Weighing the percentage of variance explained in the predictors and the response, I will proceed with 9 components: at M = 9, 82.71% of the variance in the predictors and 93.30% of the variance in the response (Apps) is explained.
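As a cross-check on the hand-picked M, recent versions of pls provide selectNcomp(), which chooses the smallest M whose CV error is within one standard error of the minimum (a sketch, output not shown):
selectNcomp(College.pls, method = "onesigma", plot = FALSE)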
College.pls.pred <- predict(College.pls, College.test, ncomp = 9)
College.pls.err <- mean((College.pls.pred - College.test$Apps)^2)
College.pls.err
## [1] 1381335
The PLS model with M = 9 components compares competitively with the lasso and ridge regression models: its test error of 1,381,335 is slightly lower than those of OLS and the lasso, but higher than ridge's.
Below are the results for each of the models fit in this exercise. As the results show, the model that most accurately predicts the number of college applications received is ridge regression. The models are otherwise fairly similar; the clear outlier is the PCR model, whose test error is almost 600,000 higher than the rest.
College.lm.test.err
## [1] 1384604
College.ridge.err
## [1] 1260111
College.lasso.err
## [1] 1394834
College.pcr.err
## [1] 1952693
College.pls.err
## [1] 1381335
barplot(c(College.lm.test.err, College.ridge.err, College.lasso.err,
          College.pcr.err, College.pls.err),
        col = "darkslategray1", xlab = "Regression Methods", ylab = "Test Error",
        main = "Test Errors for All Methods",
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Train/test split (note: the analyses below cross-validate on the full
# Boston data, so this split is not used again)
training <- sample(nrow(Boston), 0.75*nrow(Boston))
Boston.train <- Boston[training,]
Boston.test <- Boston[-training,]
Best Subset Selection
library(leaps)
set.seed(12)
# leaps provides no predict() method for regsubsets objects, so define one
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the model formula from the call
  mat = model.matrix(form, newdata)     # design matrix for the new data
  coefi = coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi         # predictions
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
Based on the cross-validated RMSE plot, the best subset model appears to be the one with 8 variables.
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" "*" " "
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" "*" " "
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " " " "*" "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(rmse.cv)
## [1] 8
Boston.bestsubset.err <- (rmse.cv[which.min(rmse.cv)]^2)
Boston.bestsubset.err
## [1] 44.89955
With cross-validation, the best subset model with 8 variables is selected. At 8 variables, the estimated test error is 44.89955.
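Note that best.fit above comes from the last cross-validation fold only; for a final model it is standard to refit best subset selection on the full data and read off the 8-variable fit (a sketch, output not shown):
best.full <- regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = 8)   # coefficients of the chosen 8-variable model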
Lasso
library(glmnet)
grid = 10^seq(4, -2, length = 100)  # candidate lambda grid (again not passed to cv.glmnet below)
set.seed(1)
# Full-data X & Y variables (cv.glmnet performs its own internal splitting);
# [, -1] drops the model-matrix intercept column, since glmnet adds its own
Boston.x <- model.matrix(crim ~ ., data = Boston)[, -1]
Boston.y <- Boston$crim
Boston.lasso <- cv.glmnet(Boston.x, Boston.y, alpha = 1, type.measure = "mse")
plot(Boston.lasso)
coef(Boston.lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## (Intercept) .
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
Boston.lasso.err <- Boston.lasso$cvm[Boston.lasso$lambda==Boston.lasso$lambda.1se]
Boston.lasso.err
## [1] 55.3338
The lasso does not appear to outperform best subset selection. Keep in mind that the lasso is a variable selection method, and that coef() on a cv.glmnet object reports the fit at lambda.1se by default. The cross-validated error produced is 55.3338, and the only variable retained at this \(\lambda\) is rad.
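For comparison, the fit at lambda.min, the \(\lambda\) that minimizes the CV error outright, typically retains more variables; it can be inspected with (output not shown):
coef(Boston.lasso, s = "lambda.min")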
Ridge Regression
Boston.ridge = cv.glmnet(Boston.x, Boston.y, alpha = 0, type.measure = "mse")
plot(Boston.ridge)
Unlike the lasso, ridge regression keeps all the variables but shrinks their coefficients toward zero when they contribute little to the relationship with the response.
coef(Boston.ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.190805375
## (Intercept) .
## zn -0.002514410
## indus 0.021067968
## chas -0.102372822
## nox 1.323474252
## rm -0.106109537
## age 0.004452629
## dis -0.066139399
## rad 0.029778399
## tax 0.001383946
## ptratio 0.049720162
## black -0.001713827
## lstat 0.024367321
## medv -0.016059212
Boston.ridge.err <- Boston.ridge$cvm[Boston.ridge$lambda==Boston.ridge$lambda.1se]
Boston.ridge.err
## [1] 63.17158
Ridge regression performs somewhat worse than the lasso here, with a cross-validated MSE of 63.17158, and best subset selection still outperforms both.
Principal Components Regression
library(pls)
Boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(Boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.200 7.197 6.776 6.770 6.770 6.783
## adjCV 8.61 7.198 7.194 6.771 6.761 6.765 6.777
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.772 6.644 6.646 6.641 6.629 6.600 6.539
## adjCV 6.766 6.637 6.639 6.633 6.621 6.591 6.529
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
The PCR output indicates that the lowest cross-validated RMSEP occurs at 13 components, i.e., with no dimension reduction at all.
As computed above, the model with the lowest cross-validation error is the one chosen by best subset selection, with an estimated test MSE of 44.89955.
The chosen best subset model does not include all the variables in the data set, only 8 of them: zn, nox, dis, rad, ptratio, black, lstat, and medv. Including the discarded features would capture slightly more of the variation in the response on the training data, but at the cost of extra variance; for this problem, we want the model with the best prediction accuracy.