library(ISLR2)
library(MASS)
library(class)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The solution is iii. The lasso is less flexible than least squares. In the bias-variance trade-off, its shrinkage can substantially reduce variance at the cost of only a small increase in bias, whereas least squares may overfit.
(b) Repeat (a) for ridge regression relative to least
squares.
The solution is iii. The reasoning is the same as in (a), the primary difference being that ridge regression does not shrink coefficients all the way to zero the way the lasso can, but the same trade-off still applies.
(c) Repeat (a) for non-linear methods relative to least
squares.
The solution is ii. Non-linear methods are more flexible. When the true function is best approximated by non-linear relationships, there is a large decrease in bias with only a smaller increase in variance.
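To make the trade-off in (a) concrete, here is a small illustrative simulation; the dimensions, coefficients, and seed are invented for this sketch. With many irrelevant predictors and modest n, the lasso's small increase in bias is repaid by a large reduction in variance, so its test MSE is typically lower than that of least squares.
library(glmnet)
set.seed(42)
n=100; p=50
X=matrix(rnorm(n*p),n,p)
beta=c(rep(1,5),rep(0,p-5))              # only 5 predictors truly matter
y=drop(X%*%beta)+rnorm(n,sd=2)
Xnew=matrix(rnorm(n*p),n,p)              # an independent test set
ynew=drop(Xnew%*%beta)+rnorm(n,sd=2)
ols=lm(y~X)
ols.pred=drop(cbind(1,Xnew)%*%coef(ols))
cv.lasso=cv.glmnet(X,y,alpha=1)          # lambda chosen by cross-validation
lasso.pred=predict(cv.lasso,s="lambda.min",newx=Xnew)
mean((ols.pred-ynew)^2)                  # least squares: low bias, high variance
mean((lasso.pred-ynew)^2)                # lasso: a little bias, much less variance; typically lower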
attach(College)
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Before we begin, let's check for missing values.
dim(College)
## [1] 777 18
sum(is.na(Apps))
## [1] 0
sum(is.na(College))
## [1] 0
library(leaps)
(a) Split the data set into a training set and a test set.
Let's split it 50/50.
set.seed(1)
train=sample(1:nrow(College), round(nrow(College)/2))
test=-train
College.train = College[train, ]
College.test = College[-train, ]
nrow(College.train) / nrow(College)
## [1] 0.4993565
nrow(College.test) / nrow(College)
## [1] 0.5006435
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps ~ ., data = College.train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5741.2 -479.5 15.3 359.6 7258.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.902e+02 6.381e+02 -1.238 0.216410
## PrivateYes -3.070e+02 2.006e+02 -1.531 0.126736
## Accept 1.779e+00 5.420e-02 32.830 < 2e-16 ***
## Enroll -1.470e+00 3.115e-01 -4.720 3.35e-06 ***
## Top10perc 6.673e+01 8.310e+00 8.030 1.31e-14 ***
## Top25perc -2.231e+01 6.533e+00 -3.415 0.000708 ***
## F.Undergrad 9.269e-02 5.529e-02 1.676 0.094538 .
## P.Undergrad 9.397e-03 5.493e-02 0.171 0.864275
## Outstate -1.084e-01 2.700e-02 -4.014 7.22e-05 ***
## Room.Board 2.115e-01 7.224e-02 2.928 0.003622 **
## Books 2.912e-01 3.985e-01 0.731 0.465399
## Personal 6.133e-03 8.803e-02 0.070 0.944497
## PhD -1.548e+01 6.681e+00 -2.316 0.021082 *
## Terminal 6.415e+00 7.290e+00 0.880 0.379470
## S.F.Ratio 2.283e+01 2.047e+01 1.115 0.265526
## perc.alumni 1.134e+00 6.083e+00 0.186 0.852274
## Expend 4.857e-02 1.619e-02 2.999 0.002890 **
## Grad.Rate 7.490e+00 4.397e+00 1.703 0.089324 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1083 on 370 degrees of freedom
## Multiple R-squared: 0.9389, Adjusted R-squared: 0.9361
## F-statistic: 334.3 on 17 and 370 DF, p-value: < 2.2e-16
The test MSE:
ols_pred = predict(lm.fit, College.test)
ols_mse = mean((ols_pred - College.test$Apps)^2)   # compare to the test-set response, not the attached full-length Apps
ols_mse
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
x=model.matrix(Apps~.,College)
y=College$Apps
y.test=y[test]
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],lambda = grid, alpha = 0,thresh=1e-12)
cv.out=cv.glmnet(x[train,],y[train],alpha=0, lambda=grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.01
ridge.pred=predict(ridge.mod, s=bestlam, newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 1135714
Admittedly, the code above got a little messy. After reviewing other analysts' solutions, I reran the analysis with separate train and test model matrices and obtained the same result. (Note that the selected lambda sits at the lower boundary of the grid, so ridge is applying almost no shrinkage here.)
train.mat=model.matrix(Apps~.,data=College.train)
test.mat=model.matrix(Apps~.,data=College.test)
grid=10^seq(4,-2,length=100)
ridge.mod = glmnet(train.mat,College.train$Apps,alpha=0,lambda=grid,thresh = 1e-12)
cv.out = cv.glmnet(train.mat,College.train$Apps,alpha=0,lambda=grid,thresh=1e-12)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01
pred.newridge = predict(ridge.mod,s=bestlam,newx=test.mat)
mean((College.test$Apps-pred.newridge)^2)
## [1] 1135714
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod=glmnet(x[train,], y[train], alpha=1, lambda=grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1116174
The lasso gave slightly better performance than ridge regression.
out=glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef=predict(out, type='coefficients', s=bestlam)[1:19,]
round(lasso.coef, 3)
## (Intercept) (Intercept) PrivateYes Accept Enroll Top10perc
## -469.506 0.000 -491.366 1.571 -0.768 48.256
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books
## -12.943 0.043 0.044 -0.083 0.150 0.015
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 0.029 -8.428 -3.265 14.640 -0.029 0.077
## Grad.Rate
## 8.315
Based on these coefficients, none of the 17 predictors were shrunk to zero, so at this lambda the lasso keeps them all.
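Part (d) also asks for the number of non-zero coefficient estimates; a quick count over the predictor rows (skipping the two intercept entries) confirms this:
sum(lasso.coef[-c(1,2)]!=0)   # 17: every predictor is retained at this lambda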
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
set.seed(2)
pcr.fit=pcr(Apps~.,data=College,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3873 3831 2030 2037 1720 1595 1594
## adjCV 3873 3831 2028 2037 1688 1586 1590
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1583 1548 1507 1508 1516 1516 1520
## adjCV 1584 1543 1503 1505 1512 1512 1516
## 14 comps 15 comps 16 comps 17 comps
## CV 1523 1416 1169 1146
## adjCV 1519 1403 1163 1139
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.670 57.30 64.30 69.90 75.39 80.38 83.99 87.40
## Apps 2.316 73.06 73.07 82.08 84.08 84.11 84.32 85.18
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.91 95.01 96.81 97.9 98.75 99.36
## Apps 85.88 86.06 86.06 86.10 86.1 86.13 90.32
## 16 comps 17 comps
## X 99.84 100.00
## Apps 92.52 92.92
1507*1507    # squared CV RMSEP at 9 components
## [1] 2271049
1146*1146    # squared CV RMSEP at 17 components
## [1] 1313316
validationplot(pcr.fit,val.type = "MSEP")
"9 comps" has the lowest CV RMSEP (1507) before the curve flattens, and 9 components explain ~86% of the variance in Apps. The CV error stays relatively flat from 9 components onward, dropping again only at 16-17 components. On the MSE scale, the full 17-component fit gives 1146^2 = 1313316, which is close to our previous options.
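Rather than squaring the RMSEP values by hand, the pls package can report cross-validated error directly on the MSE scale. A minimal sketch using pls::MSEP (the first entry of the array is the intercept-only model):
msep.cv=MSEP(pcr.fit, estimate="CV")$val[1,1,]
round(msep.cv[c("9 comps","17 comps")])   # approximately 1507^2 and 1146^2 from the table above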
We can run the same thing on the test data as well, using 9 for ncomp.
set.seed(1)
pcr.fit=pcr(Apps~., data=College[train,], scale=TRUE,validation='CV')
validationplot(pcr.fit,val.type="MSEP")
pcr.pred=predict(pcr.fit,College[test,],ncomp=9)
mean((pcr.pred-College$Apps[test])^2)
## [1] 1583520
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit=plsr(Apps~.,data=College[train,],scale=TRUE,validation='CV')
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2217 2019 1761 1630 1533 1347
## adjCV 4288 2211 2012 1749 1605 1510 1331
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1309 1303 1286 1283 1283 1277 1271
## adjCV 1296 1289 1273 1270 1270 1264 1258
## 14 comps 15 comps 16 comps 17 comps
## CV 1270 1270 1270 1270
## adjCV 1258 1257 1257 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20 78.62 80.81
## Apps 75.39 81.24 86.97 91.14 92.62 93.43 93.56 93.68
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.29 87.17 89.15 91.37 92.58 94.42 96.98
## Apps 93.76 93.79 93.83 93.86 93.88 93.89 93.89
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.89 93.89
The output is close to the PCR results. The lowest cross-validation error again occurs at the largest number of components, with the curve flattening out around 13 components, which explain ~94% of the variance in Apps.
1270*1270    # squared CV RMSEP around 13-14 components
## [1] 1612900
This squared CV error for PLS (1612900) is slightly higher than the PCR test MSE.
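Note that 1612900 is a squared cross-validation RMSEP rather than a test error; the PLS test MSE at 13 components can be computed the same way as for PCR (a sketch, so the exact value depends on the split):
pls.pred=predict(pls.fit, College[test,], ncomp=13)
mean((pls.pred-College$Apps[test])^2)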
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
While there are differences in the test errors, most are close to one another, and each model explains roughly the upper 80s to low 90s percent of the variance in Apps. We should take the approach with the lowest test MSE: the lasso provided the lowest. Unfortunately, the dimension-reduction methods only became competitive when using nearly all of the components, so little actual dimension reduction was achieved.
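One way to quantify "how accurately" is the test-set R^2 for each method, i.e. one minus the test MSE divided by the variance of Apps around the test-set mean. A sketch using the objects defined above:
test.avg=mean(College.test$Apps)
test.tss=mean((College.test$Apps-test.avg)^2)
1-mean((ridge.pred-y.test)^2)/test.tss    # ridge test R^2
1-mean((lasso.pred-y.test)^2)/test.tss    # lasso test R^2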
detach(College)
We will now try to predict per capita crime rate in the Boston data set. (Note: the models fitted below actually use medv, the median home value, as the response rather than crim.)
attach(Boston)
dim(Boston)
## [1] 506 14
set.seed(1)
train = sample(nrow(Boston), nrow(Boston)*.7)
Boston.train = Boston[train,]
Boston.test=Boston[-train,]
nrow(Boston.train)/nrow(Boston)
## [1] 0.6996047
nrow(Boston.test)/nrow(Boston)
## [1] 0.3003953
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
First, \(\textbf{Full Model}\)
lr.full=lm(medv~.,data=Boston.train)
lr.predict=predict(lr.full,Boston.test)
MSE.lr=mean((lr.predict-Boston.test$medv)^2)
MSE.lr
## [1] 27.31196
\(\textbf{Best Subset}\)
best.subset=regsubsets(medv~., data=Boston.train,nbest=1,nvmax=13)
summary(best.subset)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston.train, nbest = 1,
## nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " "*" " " "*" " " " " " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) " " " " " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) " " " " " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) " " "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Using best subset selection with nvmax of 13 (all variables except the response), rm is used in every model and lstat in all but the one-variable model. With one exception, once a variable is included it stays in each subsequent model; the exception is chas, which is used in the four-variable model but not the five-variable one.
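We can confirm this pattern programmatically by counting how many of the 13 models include each variable:
reg.summary=summary(best.subset)
colSums(reg.summary$which)    # rm appears in all 13 models, lstat in 12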
Boston.test.mat = model.matrix(medv ~ ., data = Boston.test)
val.errors = rep(NA, 13)
for (i in 1:13) {
  coefi = coef(best.subset, id = i)                    # coefficients of the best i-variable model
  pred = Boston.test.mat[, names(coefi)] %*% coefi     # manual prediction, since regsubsets has no predict method
  val.errors[i] = mean((pred - Boston.test$medv)^2)    # test MSE for each model size
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")
I liked a method I found from Swapnil Sharma (2017) that uses a for loop to step through all the model sizes, and I modified his code to compute the test errors shown above. The plot suggests the best number of predictors is 11, which we can verify easily enough:
which.min(val.errors)
## [1] 11
coef(best.subset, 11)
## (Intercept) crim zn chas nox rm
## 28.87260476 -0.07192626 0.03661785 3.38187968 -15.21333970 4.72714761
## dis rad tax ptratio black lstat
## -1.36283626 0.30319961 -0.01317327 -0.95549564 0.01047972 -0.49806806
To make comparing all the values easier, we'll store each MSE in its own variable.
MSE.bsm = val.errors[11]
MSE.bsm
## [1] 26.97848
\(\textbf{Ridge Regression}\)
bostontrain.mat<-model.matrix(medv~.,data=Boston.train)
bostontest.mat<-model.matrix(medv~.,data=Boston.test)
grid=10^seq(4,-2,length=500)
ridge.model=glmnet(bostontrain.mat,Boston.train$medv,alpha=0,lambda=grid)
boston.cv.ridge<-cv.glmnet(bostontrain.mat,Boston.train$medv,alpha=0,lambda=grid)
plot(boston.cv.ridge)
Now we can reuse code from Problem 9 to determine the best lambda, refit glmnet with it, and compute the test MSE.
boston.bestlam.ridge=boston.cv.ridge$lambda.min
boston.bestlam.ridge
## [1] 0.1731717
ridge.model.train=glmnet(bostontrain.mat,Boston.train$medv,alpha=0,lambda=boston.bestlam.ridge)
coef(ridge.model.train)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 25.856548970
## (Intercept) .
## crim -0.065719240
## zn 0.032170536
## indus 0.004017301
## chas 3.527697622
## nox -13.073981370
## rm 4.898809469
## age -0.014867741
## dis -1.316054846
## rad 0.246930945
## tax -0.010916150
## ptratio -0.915227320
## black 0.010545924
## lstat -0.466744445
boston.pred.newridge=predict(ridge.model,s=boston.bestlam.ridge,newx=bostontest.mat)
MSE.ridge=mean((Boston.test$medv-boston.pred.newridge)^2)
MSE.ridge
## [1] 27.2719
\(\textbf{Lasso}\)
For the lasso, we just need to switch alpha from 0 to 1.
lasso.model=glmnet(bostontrain.mat,Boston.train$medv,alpha=1,lambda=grid)
boston.cv.lasso=cv.glmnet(bostontrain.mat,Boston.train$medv,alpha=1,lambda=grid)
plot(boston.cv.lasso)
boston.bestlam.lasso=boston.cv.lasso$lambda.min
boston.bestlam.lasso
## [1] 0.2414157
Continuing to reuse code at this point, we refit using the best lambda the same way.
lasso.model.train=glmnet(bostontrain.mat,Boston.train$medv,alpha=1,lambda=boston.bestlam.lasso)
coef(lasso.model.train)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 25.118872062
## (Intercept) .
## crim -0.064301884
## zn 0.030954936
## indus -0.003100209
## chas 3.555879937
## nox -12.567302633
## rm 4.906872607
## age -0.014904396
## dis -1.281172956
## rad 0.230383197
## tax -0.010128972
## ptratio -0.906531148
## black 0.010488861
## lstat -0.462156393
boston.pred.newlasso=predict(lasso.model,s=boston.bestlam.lasso,newx=bostontest.mat)
MSE.lasso=mean((Boston.test$medv-boston.pred.newlasso)^2)
MSE.lasso
## [1] 27.26207
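As a quick check on how sparse the lasso solution is at the selected lambda, we can count the coefficients directly (a one-line sketch):
sum(coef(lasso.model.train)!=0)    # non-zero coefficient estimates (the count includes the intercept)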
And finally, \(\textbf{PCR}\):
pcr.model=pcr(medv~.,data=Boston.train,scale=TRUE,validation="CV")
summary(pcr.model)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.575 7.456 6.962 5.538 5.281 4.953 4.942
## adjCV 9.575 7.454 6.958 5.538 5.284 4.942 4.935
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 4.945 4.928 4.943 5.013 4.946 4.862 4.783
## adjCV 4.937 4.919 4.933 5.004 4.933 4.848 4.768
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.82 59.12 68.54 75.10 81.25 86.12 90.03 93.14
## medv 39.44 48.26 67.33 71.41 74.40 74.65 74.87 75.28
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.17 96.79 98.21 99.48 100.00
## medv 75.30 75.30 76.07 76.92 77.78
validationplot(pcr.model,val.type="MSEP")
From the validation plot, the CV error levels off at about 5 components; the absolute minimum is at 13, but 5 components give nearly the same error with far fewer dimensions, so 5 is the number we will use.
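We can check where the CV curve actually bottoms out (a sketch using pls::MSEP; the first array entry is the 0-component model):
msep.boston=MSEP(pcr.model, estimate="CV")$val[1,1,]
which.min(msep.boston)    # "13 comps", though the curve is nearly flat beyond 5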
pcr.model.5comp=pcr(medv~.,data=Boston.train,scale=TRUE,ncomp=5)
summary(pcr.model.5comp)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 47.82 59.12 68.54 75.10 81.25
## medv 39.44 48.26 67.33 71.41 74.40
boston.predict.pcr=predict(pcr.model,Boston.test,ncomp=5)
MSE.pcr=mean((Boston.test$medv-boston.predict.pcr)^2)
MSE.pcr
## [1] 31.26027
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Let's look at each of the models. Full:
MSE.lr
## [1] 27.31196
Best Subset:
MSE.bsm
## [1] 26.97848
Ridge:
MSE.ridge
## [1] 27.2719
Lasso:
MSE.lasso
## [1] 27.26207
PCR:
MSE.pcr
## [1] 31.26027
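Collecting the five test MSEs into one sorted vector makes the comparison immediate:
sort(c(Full=MSE.lr, BestSubset=MSE.bsm, Ridge=MSE.ridge, Lasso=MSE.lasso, PCR=MSE.pcr))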
I propose we use best subset. All of the models had very similar test MSEs, but best subset's was the lowest. Best subset selection is also feasible here because the data set is not very large (506 observations, 13 predictors).
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No. When using best subset selection, the lowest test MSE occurred with 11 of the 13 predictors. Scrolling back up to the selection table, the two variables NOT included at size 11 are indus and age.
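A quick programmatic check of the excluded variables, using the fitted regsubsets object from above:
setdiff(names(Boston), c(names(coef(best.subset, 11)), "medv"))   # should return "indus" and "age"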