(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
ii. More flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
iii. Less flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
iii is the correct answer here. Lasso has less variance with the trade-off of having a slightly higher bias whereas least squares has a high variance. Lasso shrinks coefficient estimates so it can remove variables that aren’t important to the model for less variance but higher bias.
(b) Repeat (a) for ridge regression relative to least squares.
iii is also the correct answer here. Ridge regression also shrinks coefficient estimates, having a similar effect as the lasso with decreasing variance but having a higher bias.
(c) Repeat (a) for non-linear methods relative to least squares.
The answer here would be ii. Non-linear models are more flexible and will have less bias than least squares. So when the increase in variance is less than the decrease in bias we could expect better results.
(a) Split the data set into a training set and a test set.
library(ISLR)
attach(College)
x=model.matrix(Apps~.,College[,-1])
y=College$Apps
set.seed(5)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
college.train=College[train,]
college.test=College[test,]
y.test=y[test]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit=lm(Apps~.,data=College,subset=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5285.7 -370.1 -24.5 320.6 7113.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.700e+02 5.882e+02 -1.649 0.099959 .
## PrivateYes -5.556e+02 1.928e+02 -2.882 0.004177 **
## Accept 1.760e+00 5.268e-02 33.412 < 2e-16 ***
## Enroll -1.703e+00 2.636e-01 -6.459 3.32e-10 ***
## Top10perc 4.956e+01 7.893e+00 6.278 9.59e-10 ***
## Top25perc -1.470e+01 6.463e+00 -2.274 0.023517 *
## F.Undergrad 1.342e-01 4.301e-02 3.121 0.001946 **
## P.Undergrad -1.592e-03 5.789e-02 -0.027 0.978081
## Outstate -9.801e-02 2.807e-02 -3.491 0.000539 ***
## Room.Board 1.418e-01 6.855e-02 2.069 0.039242 *
## Books -2.290e-01 3.113e-01 -0.736 0.462438
## Personal 2.066e-01 9.323e-02 2.216 0.027277 *
## PhD -1.528e+01 7.015e+00 -2.178 0.030023 *
## Terminal 2.974e+00 7.325e+00 0.406 0.685005
## S.F.Ratio 4.376e+01 2.104e+01 2.080 0.038195 *
## perc.alumni 7.724e-01 5.947e+00 0.130 0.896732
## Expend 9.610e-02 2.209e-02 4.351 1.75e-05 ***
## Grad.Rate 1.054e+01 4.204e+00 2.507 0.012605 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1064 on 370 degrees of freedom
## Multiple R-squared: 0.9344, Adjusted R-squared: 0.9314
## F-statistic: 310.1 on 17 and 370 DF, p-value: < 2.2e-16
pred.apps=predict(lm.fit,college.test)
mean((college.test$Apps-pred.apps)^2)
## [1] 1166539
The test error obtained here is 1,166,539.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.college.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 384.3366
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 1077128
The test error obtained here is 1,077,128. Best lambda is 384.34.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.college.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 2.051089
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1153439
Here the bestlam was 2.05, and the test error for this model is 1,153,439.
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) Accept Enroll Top10perc Top25perc
## -921.13191496 1.57619429 -0.77470826 47.93717208 -13.04948741
## F.Undergrad P.Undergrad Outstate Room.Board Personal
## 0.05804280 0.04945985 -0.10661855 0.13342723 0.02873374
## PhD Terminal S.F.Ratio perc.alumni Expend
## -6.48796215 -1.35545596 22.11969735 -1.35217989 0.08114511
## Grad.Rate
## 7.68185047
The number of non-zero coefficient estimates is 15.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.college=pcr(Apps~.,data=college.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4068 4057 2217 2217 2156 2011 1885
## adjCV 4068 4058 2211 2212 2139 1985 1876
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1889 1871 1822 1810 1853 1853 1880
## adjCV 1880 1839 1812 1799 1844 1844 1871
## 14 comps 15 comps 16 comps 17 comps
## CV 1878 1526 1426 1272
## adjCV 1903 1501 1408 1258
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 33.325 57.73 65.26 70.81 76.08 81.02 84.91 87.77
## Apps 1.007 72.25 72.60 75.39 79.89 81.19 81.23 82.43
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.61 92.97 95.08 96.70 97.72 98.54 99.31
## Apps 82.76 83.18 83.21 83.27 83.75 83.79 91.27
## 16 comps 17 comps
## X 99.79 100.00
## Apps 92.11 93.44
validationplot(pcr.college,val.type="MSEP")
pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)
## [1] 1647713
The value of M selected by cross-validation was 10, and the test error for this model is 1,647,713.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.college=plsr(Apps~.,data=college.train,scale=TRUE,validation="CV")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4068 2039 1910 1769 1680 1553 1362
## adjCV 4068 2033 1904 1752 1645 1523 1344
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1321 1295 1289 1293 1290 1284 1283
## adjCV 1306 1280 1274 1278 1275 1269 1269
## 14 comps 15 comps 16 comps 17 comps
## CV 1283 1283 1283 1283
## adjCV 1269 1269 1269 1269
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.51 42.39 62.71 65.69 70.93 74.26 78.53 80.36
## Apps 76.66 82.87 86.43 90.72 92.18 92.86 92.99 93.26
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.07 86.56 88.83 91.66 93.11 94.39 96.61
## Apps 93.37 93.41 93.43 93.44 93.44 93.44 93.44
## 16 comps 17 comps
## X 99.13 100.00
## Apps 93.44 93.44
validationplot(pls.college,val.type="MSEP")
pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)
## [1] 1127857
The value of M selected by cross-validation here was 9, and the test error is 1,127,857.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
The errors I got from the five approaches are as follows, Least squares- 1,166,539 Ridge regression- 1,077,128 Lasso- 1,153,439 PCR- 1,647,713 PLS- 1,127,857
The ridge regression model with a test error of 1,077,128 is the lowest, so perhaps this is the one to go with out of the five. The R squared for the least squares model is about 93%, and its error isn’t far off from the others so I would say yes, these could be used to predict the number of college applications received.
detach(College)
library(leaps)
library(MASS)
set.seed(5)
attach(Boston)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
#Best Subset Selection
set.seed(5)
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(-train)
regfit.Boston=regsubsets(crim~.,data=Boston[train,],nvmax=13)
test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
coefi=coef(regfit.Boston,id=i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
## [1] 45.61825 43.73050 43.81836 43.11088 42.24088 42.40160 42.02448 42.04104
## [9] 42.51405 42.50313 42.47811 42.41412 42.20707
which.min(val.errors)
## [1] 7
coef(regfit.Boston,7)
## (Intercept) zn indus rm dis rad
## -2.30177611 0.04639411 -0.19424726 1.49536825 -0.82113681 0.58240665
## lstat medv
## 0.14640159 -0.25988540
So here I used best subset selection on the Boston dataset, and using the validation set approach it seems that a 7-variable model is what is best here, as it has the lowest error at 42.02. The variables it chose are zn, indus, rm, dis, rad, lstat, and medv.
#Ridge Regression
set.seed(5)
x=model.matrix(crim~.,Boston[,-1])
y=Boston$crim
grid=10^seq(10,-2,length=100)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
cv.out.Boston=cv.glmnet(x[train,],y[train],alpha=0)
bestlam.Boston=cv.out.Boston$lambda.min
bestlam.Boston
## [1] 0.600269
ridge.Boston=glmnet(x[train,],y[train],alpha=0,lambda=grid,thresh=1e-12)
ridge.pred.Boston=predict(ridge.Boston,s=bestlam.Boston,newx=x[test,])
mean((ridge.pred.Boston-y.test)^2)
## [1] 20.87974
We can see here that the best lambda selected by cross-validation for the training set is .6, and the error we get is 20.88. Now we will run a ridge regression on the whole dataset.
#Ridge Regression
realridge.Boston=glmnet(x,y,alpha=0)
realridge.pred=predict(realridge.Boston,type='coefficients',s=bestlam,)[1:13,]
realridge.pred
## (Intercept) (Intercept) zn indus chas nox
## 2.655874373 0.000000000 0.022581911 -0.044047690 -0.775627763 -0.780512240
## rm age dis rad tax ptratio
## 0.179615519 0.004589786 -0.414468222 0.294548154 0.006484046 -0.001051838
## black
## -0.008999011
Now we can see the coefficients for each of the variables for the ridge regression of the whole dataset with our best lambda of .6.
#The Lasso
set.seed(5)
lasso.Boston=glmnet(x[train,],y[train],alpha=1,lambda=grid)
cv.out.lasso=cv.glmnet(x[train,],y[train],alpha=1)
bestlam.lasso=cv.out.lasso$lambda.min
bestlam.lasso
## [1] 0.1452728
lasso.pred.Boston=predict(lasso.Boston,s=bestlam,newx=x[test,])
mean((lasso.pred.Boston-y.test)^2)
## [1] 18.75912
Here we can see that the best lambda selected by cross-validation for the training set is .14, and the error we get is lower than ridge regression at 18.76. Now we will run the lasso on the whole data set.
#The Lasso
reallasso.Boston=glmnet(x,y,alpha=1,lambda=grid)
reallasso.pred=predict(reallasso.Boston,type='coefficients',s=bestlam)[1:13,]
reallasso.pred
## (Intercept) (Intercept) zn indus chas nox
## -0.4325120 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## rm age dis rad tax ptratio
## 0.0000000 0.0000000 0.0000000 0.3641029 0.0000000 0.0000000
## black
## 0.0000000
Now here is the coefficients for the variables using the lasso for the whole dataset. Nearly all variables were zeroed out by this lasso model.
#PCR
set.seed(5)
pcr.Boston=pcr(crim~.,data=Boston,scale=TRUE,validation="CV")
summary(pcr.Boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.211 7.204 6.748 6.739 6.743 6.790
## adjCV 8.61 7.208 7.201 6.744 6.733 6.739 6.784
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.792 6.654 6.676 6.671 6.663 6.606 6.527
## adjCV 6.785 6.647 6.668 6.662 6.654 6.597 6.518
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
The number of components with the most variance explained is 13 with 45.4% explained. We will fit the PCR on the training set and plot a validation plot and try and see if it is any different.
#Fitting PCR on the training set
set.seed(5)
pcr.Boston.again=pcr(crim~.,data=Boston[train,],scale=TRUE,validation="CV")
validationplot(pcr.Boston.again,val.type="MSEP")
pcr.pred.Boston=predict(pcr.Boston.again,Boston[test,],ncomp=13)
mean((pcr.pred.Boston-Boston$crim[test])^2)
## [1] 23.06438
It seemed that on the training set that 13 variables was the best here, and when using 13 variables the error we get is 23.06. I will try again with 8 variables just to see how it does.
#Training PCR with ncomp=8
pcr.pred.Boston=predict(pcr.Boston.again,Boston[test,],ncomp=8)
mean((pcr.pred.Boston-Boston$crim[test])^2)
## [1] 21.30832
At ncomp=8 the error is actually lower at 21.31!
(b) Propose a model (or set of models) that seem to perform well on this dataset, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
I would suggest the lasso model as it has the lowest error at 18.76. I am not too sold because it zeroed out all variables except for rad, but perhaps that is because rad is a good predictor for per capita crime rate. It even has the highest positive coefficient for ridge regression. If I had to choose a second-best, I would suggest the ridge model because it has the next-lowest error of the models I tested at 20.88. PCR with ncomp=8 also gets an honorable mention as its error was 21.31.
(c) Does your chosen model involve all of the features in the dataset? Why or why not?
No it does not because the lasso model put almost all of the coefficients of the different variables to 0, effectively taking these variables out of the model.