Best subset selection.
Best subset selection is the most likely to find the model with the smallest test RSS, since it considers every candidate model; this is not guaranteed, though, because its exhaustive search also makes it more prone to overfitting than the stepwise methods.
Lasso - Less flexible than least squares, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Ridge regression - Also less flexible than least squares, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (a simulation sketch after this list illustrates the tradeoff).
Non-linear methods - More flexible than least squares, and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
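As a quick illustration of the shrinkage tradeoff (my own sketch, not part of the exercise), the simulation below fits least squares, the lasso, and ridge regression to data where most predictors are irrelevant; the shrinkage methods' extra bias is typically outweighed by their reduced variance, giving a lower test MSE. The sample size, dimension, and sparsity are arbitrary choices.

library(glmnet)
set.seed(1)
n = 100; p = 50
x = matrix(rnorm(n * p), n, p)
beta = c(rep(1, 5), rep(0, p - 5))  # only 5 of the 50 predictors matter
y = drop(x %*% beta + rnorm(n))
x.test = matrix(rnorm(n * p), n, p)
y.test = drop(x.test %*% beta + rnorm(n))
ols = lm(y ~ x)
mean((y.test - cbind(1, x.test) %*% coef(ols))^2)               # least squares test MSE
mean((y.test - predict(cv.glmnet(x, y), x.test))^2)             # lasso test MSE
mean((y.test - predict(cv.glmnet(x, y, alpha = 0), x.test))^2)  # ridge test MSE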
Steadily decrease.
Decrease initially, and then eventually start increasing in a U shape.
Steadily increase.
Steadily decrease.
Steadily decrease.
Decrease initially, and then eventually start increasing in a U shape.
Steadily decrease.
Steadily increase.
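To see the training/test behaviour along the regularization path concretely, here is a small simulation sketch of my own (not part of the original answers): as lambda grows, ridge training MSE rises monotonically, while test MSE typically traces the U shape described above (the dip can be shallow when n is much larger than p).

library(glmnet)
set.seed(1)
n = 100; p = 20
x = matrix(rnorm(n * p), n, p)
beta = rnorm(p, sd = 0.5)
y = drop(x %*% beta + rnorm(n))
x.test = matrix(rnorm(n * p), n, p)
y.test = drop(x.test %*% beta + rnorm(n))
fit = glmnet(x, y, alpha = 0, lambda = 10^seq(3, -3, length = 60))
train.mse = colMeans((y - predict(fit, x))^2)       # one column of fits per lambda
test.mse = colMeans((y.test - predict(fit, x.test))^2)
matplot(log10(fit$lambda), cbind(train.mse, test.mse), type = "l", lty = 1,
        xlab = "log10(lambda)", ylab = "MSE")
legend("topleft", c("train", "test"), col = 1:2, lty = 1)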
set.seed(1)
library(MASS)
library(pls)
library(glmnet)
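# Lasso (glmnet's default alpha = 1): 10-fold CV over the lambda path, using MSE loss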
cv.lasso = cv.glmnet(model.matrix(crim ~ . - 1, data = Boston),
Boston$crim, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
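# CV estimate of RMSE at lambda.1se (the default s used by coef() on a cv.glmnet fit)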
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.438669
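# Ridge regression (alpha = 0): shrinks all coefficients but sets none exactly to zero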
cv.ridge = cv.glmnet(model.matrix(crim ~ . - 1, data = Boston), Boston$crim, type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.190805375
## zn -0.002514410
## indus 0.021067968
## chas -0.102372822
## nox 1.323474252
## rm -0.106109537
## age 0.004452629
## dis -0.066139399
## rad 0.029778399
## tax 0.001383946
## ptratio 0.049720162
## black -0.001713827
## lstat 0.024367321
## medv -0.016059212
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.948055
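# PCR on standardized predictors (scale = TRUE) with 10-fold cross-validation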
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.200 7.197 6.776 6.770 6.770 6.783
## adjCV 8.61 7.198 7.194 6.771 6.761 6.765 6.777
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.772 6.644 6.646 6.641 6.629 6.600 6.539
## adjCV 6.766 6.637 6.639 6.633 6.621 6.591 6.529
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
I propose the 13-component PCR model, since it has the lowest cross-validated root mean squared error (about 6.54). Note that PCR with all 13 components performs no dimension reduction, so it is equivalent to least squares on the full predictor set. Next I would pick the lasso model (CV RMSE about 7.44), and finally ridge regression, which had the worst results (CV RMSE about 7.95).
An ensemble of these models might yield even better results; a sketch follows below.
The chosen model involves all the features in the data set, since the 13-component PCR fit uses every principal component and hence every predictor.
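As a minimal sketch of such an ensemble (my own illustration, reusing the cv.lasso, cv.ridge, and pcr.fit objects from above), we can simply average the three models' predictions. Note this computes in-sample RMSE, which understates test error; an honest comparison would hold out data.

x = model.matrix(crim ~ . - 1, data = Boston)
pred.lasso = drop(predict(cv.lasso, x))        # predictions at lambda.1se by default
pred.ridge = drop(predict(cv.ridge, x))
pred.pcr = drop(predict(pcr.fit, Boston, ncomp = 13))
pred.ens = (pred.lasso + pred.ridge + pred.pcr) / 3
sqrt(mean((Boston$crim - pred.ens)^2))         # in-sample RMSE of the ensemble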