i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Generally, the more flexible a method is, the more variance it has. When the least squares estimates have excessively high variance, the lasso behaves qualitatively like ridge regression: as λ increases, the variance decreases and the bias increases.
(b) Repeat (a) for ridge regression relative to least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Generally, the more flexible a method is, the more variance it has. The least squares coefficient estimates correspond to ridge regression with λ = 0: the variance is high but there is no bias. As λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias.
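As an illustration (a minimal sketch on simulated data, not the College data used below), plotting the ridge coefficient paths shows every coefficient shrinking toward zero as λ grows, which is the source of the variance reduction:
library(glmnet)
set.seed(1)
x = matrix(rnorm(100 * 10), 100, 10)    # 100 observations, 10 predictors
y = drop(x %*% rnorm(10)) + rnorm(100)  # linear signal plus noise
fit = glmnet(x, y, alpha = 0)           # alpha = 0 fits ridge regression
plot(fit, xvar = "lambda")              # every path shrinks toward 0 as log(lambda) grows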
(c) Repeat (a) for non-linear methods relative to least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Explanation: While OLS can model curves (for example, through polynomial terms), it is relatively restricted in the shapes it can fit, and sometimes it cannot fit the specific curve in the data. Non-linear methods are far more flexible in the shapes of the curves they can fit.
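A minimal sketch on simulated data makes the point: a straight-line OLS fit cannot track a curved relationship, while a more flexible smoothing spline can:
set.seed(1)
x = sort(runif(100, 0, 10))
y = sin(x) + rnorm(100, sd = 0.3)         # a clearly non-linear relationship
plot(x, y, col = "grey")
abline(lm(y ~ x), col = "red")            # OLS straight line: badly biased here
lines(smooth.spline(x, y), col = "blue")  # smoothing spline follows the curve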
(a) Split the data set into a training set and a test set.
library(ISLR)
attach(College)
set.seed(7)
# 75/25 train/test split
subset = sample(nrow(College), nrow(College) * 0.75)
train = College[subset, ]
test = College[-subset, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
fit.lm9B = lm(Apps~., data = train)
pred.fit = predict(fit.lm9B,test)
te9B = mean((pred.fit - test$Apps)^2)
te9B
## [1] 784381.4
The test mean squared error (MSE) is 784381.4.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
train.mat = model.matrix(Apps~., data = train)
test.mat = model.matrix(Apps~., data = test)
grid = 10^seq(4, -2, length = 100)
ridge = glmnet(train.mat, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge = cv.glmnet(train.mat, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge = cv.ridge$lambda.min
pred.ridge = predict(ridge,s = bestlam.ridge, newx = test.mat)
te9C = mean((pred.ridge - test$Apps)^2)
te9C
## [1] 784367.1
The test MSE via ridge regression is 784367.1, slightly less than the OLS MSE of 784381.4.
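The near-identical MSEs are worth a sanity check: if the cross-validated λ sits at the small end of the grid, ridge has barely shrunk the coefficients and is essentially reproducing OLS. A quick check using the objects above:
bestlam.ridge   # if this is at or near min(grid) = 0.01, the ridge fit is close to OLS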
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
fit.lasso = glmnet(train.mat, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso = cv.glmnet(train.mat, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso = cv.lasso$lambda.min
pred.lasso = predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
te9D = mean((pred.lasso - test$Apps)^2)
te9D
## [1] 781272.7
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -599.62722787
## (Intercept) .
## PrivateYes -424.49706194
## Accept 1.48427724
## Enroll -0.23908656
## Top10perc 30.66864732
## Top25perc .
## F.Undergrad .
## P.Undergrad .
## Outstate -0.04569949
## Room.Board 0.10671006
## Books .
## Personal 0.02434163
## PhD -5.20456803
## Terminal -3.54311550
## S.F.Ratio .
## perc.alumni -0.86758825
## Expend 0.06367211
## Grad.Rate 4.30134480
The test MSE via the lasso is 781272.7, which is lower than both the ridge regression MSE and the OLS MSE. Only 12 of the coefficient estimates remain non-zero.
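Rather than counting the non-zero entries by eye, they can be tallied directly from the sparse coefficient matrix (a quick check; the intercept is excluded from the count):
lasso.coef = predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
sum(lasso.coef != 0) - 1   # subtract 1 for the intercept; returns 12 here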
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcrmodel = pcr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(pcrmodel, val.type = "MSEP")
summary(pcrmodel)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4081 4065 2197 2194 1959 1811 1778
## adjCV 4081 4065 2193 2190 1944 1799 1772
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1765 1726 1666 1677 1684 1690 1700
## adjCV 1759 1717 1660 1670 1677 1684 1693
## 14 comps 15 comps 16 comps 17 comps
## CV 1700 1601 1302 1264
## adjCV 1694 1573 1291 1253
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.153 57.36 64.31 70.06 75.61 80.56 84.16 87.66
## Apps 1.529 72.25 72.54 78.43 82.58 83.06 83.83 84.40
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.83 94.97 96.78 97.87 98.71 99.33
## Apps 85.28 85.31 85.31 85.31 85.32 85.36 90.82
## 16 comps 17 comps
## X 99.81 100.00
## Apps 92.47 93.03
pred.pcr = predict(pcrmodel, test, ncomp = 16)
te9E = mean((pred.pcr - test$Apps)^2)
te9E
## [1] 774105.7
The PCR test MSE is 774105.7. Short of recreating the OLS model by picking M = 17, the value M = 16 produces the lowest cross-validated error.
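The cross-validated choice of M can also be read off programmatically rather than from the table. A sketch using pls::MSEP (the first entry of the CV curve is the intercept-only model; here the minimum lands at M = 17, the full OLS fit, which is why M = 16 was used above as the nearly-as-good smaller model):
msep = MSEP(pcrmodel, estimate = "CV")
which.min(msep$val[1, 1, ]) - 1   # number of components minimizing CV error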
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
plsmodel = plsr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(plsmodel, val.type = "MSEP")
summary(plsmodel)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4081 2013 1724 1608 1525 1349 1295
## adjCV 4081 2008 1719 1600 1506 1334 1283
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1282 1269 1263 1265 1267 1264 1263
## adjCV 1271 1259 1252 1255 1256 1253 1252
## 14 comps 15 comps 16 comps 17 comps
## CV 1264 1264 1264 1264
## adjCV 1253 1253 1253 1253
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.28 39.73 62.68 65.48 68.91 73.73 77.69 80.69
## Apps 77.38 84.65 87.18 90.44 92.33 92.78 92.84 92.92
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.97 87.00 89.36 90.79 93.32 95.13 96.76
## Apps 92.99 93.01 93.02 93.03 93.03 93.03 93.03
## 16 comps 17 comps
## X 98.17 100.00
## Apps 93.03 93.03
pred.pls = predict(plsmodel, test, ncomp = 9)
te9F = mean((pred.pls - test$Apps)^2)
te9F
## [1] 777977.1
The PLS test MSE is 777977.1, and the value M = 9 produces the lowest cross-validated error.
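As a check on M = 9, recent versions of the pls package provide selectNcomp(), which applies a one-standard-error rule to the CV curve (the exact answer depends on the random CV folds):
selectNcomp(plsmodel, method = "onesigma")   # smallest M within one SE of the CV minimum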
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Test_Error_Values = c(te9B, te9C, te9D, te9E, te9F)
names(Test_Error_Values) = c("lm", "ridge", "lasso", "pcr", "pls")
test.avg = mean(test$Apps)
lm.r2 = 1 - mean((pred.fit - test$Apps)^2) / mean((test.avg - test$Apps)^2)
ridge.r2 = 1 - mean((pred.ridge - test$Apps)^2) / mean((test.avg - test$Apps)^2)
lasso.r2 = 1 - mean((pred.lasso - test$Apps)^2) / mean((test.avg - test$Apps)^2)
pcr.r2 = 1 - mean((pred.pcr - test$Apps)^2) / mean((test.avg - test$Apps)^2)
pls.r2 = 1 - mean((pred.pls - test$Apps)^2) / mean((test.avg - test$Apps)^2)
RSquared_Chart = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2)
names(RSquared_Chart) = c("lm", "ridge", "lasso", "pcr", "pls")
Test Error Values
Test_Error_Values
## lm ridge lasso pcr pls
## 784381.4 784367.1 781272.7 774105.7 777977.1
R-Squared Values
RSquared_Chart
## lm ridge lasso pcr pls
## 0.9208070 0.9208084 0.9211209 0.9218445 0.9214536
Looking at how accurately we predicted the number of college applications received, all of the R-squared values are above 0.92, which is quite good.
After running the OLS, ridge regression, lasso, PCR, and PLS models, the results show that the PCR model performed best, with a test MSE of 774105.7 and an R-squared of 0.922. Still, PCR did not outperform the rest of the models by a large margin; in fact, the differences appear to be fairly small.
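A quick plot makes the comparison visual (same numbers as the tables above):
barplot(Test_Error_Values, ylab = "Test MSE", main = "Test MSE by method")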
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
I’ll begin by exploring the Boston dataset’s basic structure and summary statistics. I’ll then do a 75/25 train/test split on the dataset. After that, I will try out several regression methods and then collect, analyze, and discuss the results in terms of the best method based on test MSE. These methods will include best subset selection, ridge regression, the lasso, PCR, and PLS.
library(MASS)
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
set.seed(2244422)
subset = sample(nrow(Boston), nrow(Boston)*0.75)
btrain = Boston[subset,]
btest = Boston[-subset,]
Best Subset Selection
library(leaps)
fit.bestsub = regsubsets(crim ~ ., data = btrain, nvmax = 13)
btest.mat = model.matrix(crim ~ ., data = btest)
val.errors = rep(NA, 13)
for (i in 1:13) {
  coefi = coef(fit.bestsub, id = i)              # coefficients of the best i-variable model
  pred = btest.mat[, names(coefi)] %*% coefi     # test-set predictions
  val.errors[i] = mean((pred - btest$crim)^2)    # test MSE for each model size
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")
which.min(val.errors)
## [1] 12
Bestsub_MSE = val.errors[12]
Bestsub_MSE
## [1] 38.88763
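The validation loop above can also be wrapped into a reusable predict method for regsubsets objects, the usual ISLR idiom; this sketch reproduces val.errors[12]:
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])    # recover the formula from the original call
  mat = model.matrix(form, newdata)      # test design matrix
  coefi = coef(object, id = id)          # coefficients of the best id-variable model
  mat[, names(coefi)] %*% coefi
}
pred12 = predict(fit.bestsub, btest, id = 12)
mean((pred12 - btest$crim)^2)            # matches val.errors[12] above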
Ridge Regression
set.seed(2321)
train.mat = model.matrix(crim~., data = btrain)
test.mat = model.matrix(crim~., data = btest)
grid = 10^seq(4, -2, length = 100)
ridge = glmnet(train.mat, btrain$crim, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge = cv.glmnet(train.mat, btrain$crim, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge = cv.ridge$lambda.min
pred.ridge = predict(ridge, s = bestlam.ridge, newx = test.mat)
Ridge_MSE = mean((pred.ridge - btest$crim)^2)
Ridge_MSE
## [1] 38.66517
Lasso
fit.lasso = glmnet(train.mat, btrain$crim, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso = cv.glmnet(train.mat, btrain$crim, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso = cv.lasso$lambda.min
pred.lasso = predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
Lasso_MSE = mean((pred.lasso - btest$crim)^2)
Lasso_MSE
## [1] 39.03718
PCR
set.seed(123)
pcrmodel = pcr(crim~., data = btrain, scale = TRUE, validation = "CV")
validationplot(pcrmodel, val.type = "MSEP")
summary(pcrmodel)
## Data: X dimension: 379 13
## Y dimension: 379 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.776 7.356 7.331 6.916 6.930 6.927 6.920
## adjCV 8.776 7.351 7.326 6.910 6.925 6.922 6.913
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.928 6.770 6.838 6.825 6.838 6.815 6.734
## adjCV 6.921 6.759 6.825 6.812 6.825 6.798 6.717
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.72 60.18 69.28 76.11 82.78 87.90 90.90 93.32
## crim 31.17 31.66 39.08 39.13 39.24 40.08 40.28 43.21
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.45 97.07 98.41 99.49 100.00
## crim 43.24 43.65 43.67 44.98 46.37
pred.pcr = predict(pcrmodel, btest, ncomp = 8)
PCR_MSE = mean((pred.pcr - btest$crim)^2)
PCR_MSE
## [1] 40.21201
PLS
set.seed(3211)
plsmodel = plsr(crim~., data = btrain, scale = TRUE, validation = "CV")
validationplot(plsmodel, val.type = "MSEP")
summary(plsmodel)
## Data: X dimension: 379 13
## Y dimension: 379 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.776 7.126 6.771 6.707 6.677 6.653 6.645
## adjCV 8.776 7.124 6.765 6.695 6.666 6.641 6.632
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.632 6.631 6.625 6.627 6.627 6.627 6.627
## adjCV 6.620 6.619 6.614 6.616 6.615 6.615 6.615
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.32 56.66 62.55 71.27 75.80 78.52 83.85 85.70
## crim 34.67 42.35 44.63 45.43 45.89 46.23 46.29 46.36
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 88.56 94.09 96.53 98.54 100.00
## crim 46.37 46.37 46.37 46.37 46.37
pred.pls = predict(plsmodel, btest, ncomp = 9)
PLS_MSE = mean((pred.pls - btest$crim)^2)
PLS_MSE
## [1] 38.90602
Now that I have run through the different models, here are their respective test MSE values.
Test_Error_Values11 = c(Bestsub_MSE, Ridge_MSE, Lasso_MSE, PCR_MSE, PLS_MSE)
names(Test_Error_Values11) = c("Best SS", "ridge", "lasso", "pcr", "pls")
Test_Error_Values11
## Best SS ridge lasso pcr pls
## 38.88763 38.66517 39.03718 40.21201 38.90602
Best Subset MSE = 38.88763
Ridge MSE = 38.66517
Lasso MSE = 39.03718
PCR MSE = 40.21201
PLS MSE = 38.90602
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
There was not a huge difference among the test error values. Still, if I had to make a recommendation, I would choose the ridge model, since it has the lowest test MSE at 38.66517.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
My chosen model does involve all of the features in the data set. Since I picked the model with the lowest test MSE, I chose the ridge model, which retains all of the features by design: ridge regression shrinks coefficients toward zero but never sets them exactly to zero.
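This can be verified directly from the fitted object: at the chosen λ, ridge shrinks every coefficient but sets none of the predictors exactly to zero (the redundant intercept column created by model.matrix aside):
ridge.coef = predict(ridge, s = bestlam.ridge, type = "coefficients")
sum(ridge.coef != 0)   # the intercept plus all 13 predictors remain non-zero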