For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
The correct answer for part (a) is iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the lasso fit decreases, leading to decreased variance but increased bias. Prediction accuracy improves as long as the drop in variance outweighs the rise in bias.
(b) Repeat (a) for ridge regression relative to least squares.
The correct answer for part (b) is iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
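Explanation: Ridge regression, like the lasso, shrinks the coefficient estimates, so the same bias-variance trade-off applies: as λ increases, variance falls and bias rises. The lasso is easier to interpret, however, because it can set coefficients exactly to zero, whereas ridge regression keeps every predictor in the model.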
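For reference, the two penalized criteria and the decomposition behind answers (a) and (b) can be written out as below. The notation is the standard textbook one and is an assumption here, since the answers above do not define symbols: $x_{ij}$ are the predictors, $\beta_j$ the coefficients, $\lambda \ge 0$ the tuning parameter, and $\hat f$ the fitted model.

```latex
\begin{aligned}
\hat\beta^{\text{lasso}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \\
\hat\beta^{\text{ridge}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^{2} \\
E\bigl[(y_0-\hat f(x_0))^2\bigr] &= \operatorname{Var}\bigl(\hat f(x_0)\bigr) + \bigl[\operatorname{Bias}\bigl(\hat f(x_0)\bigr)\bigr]^2 + \operatorname{Var}(\varepsilon)
\end{aligned}
```

With λ = 0 both criteria reduce to least squares. Increasing λ lowers the variance term and raises the bias term, which is why answer iii applies to both methods.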
(c) Repeat (a) for non-linear methods relative to least squares.
The correct answer for part (c) is ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
attach(College)
(a) Split the data set into a training set and a test set.
set.seed(1)
training=sample(nrow(College),0.75*nrow(College))
train=College[training,]
test=College[-training,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit=lm(Apps~.,data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
lm.pred=predict(lm.fit,newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
# Note: train[,-1] drops the first column (Private), so ridge and lasso are fit without it
xtrain=model.matrix(Apps~.,data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~.,data=test[,-1])
ytest=test$Apps
ridge.fit=cv.glmnet(xtrain,ytrain,alpha=0)
plot(ridge.fit)
ridge.lambda=ridge.fit$lambda.min
ridge.lambda
## [1] 364.6228
ridge.pred=predict(ridge.fit,s=ridge.lambda,newx=xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1260111
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.fit=cv.glmnet(xtrain,ytrain,alpha=1)
plot(lasso.fit)
lasso.lambda=lasso.fit$lambda.min
lasso.lambda
## [1] 1.945882
lasso.pred=predict(lasso.fit,s=lasso.lambda,newx=xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1394834
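Part (d) also asks for the number of non-zero coefficient estimates, which the output above does not show. A minimal sketch of one way to count them follows. It uses simulated stand-in data (`x`, `y` and `cv.fit` are hypothetical names, not objects from this solution) so that it runs on its own.

```r
library(glmnet)

# Simulated stand-in data: only the first two predictors matter
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(100)

cv.fit <- cv.glmnet(x, y, alpha = 1)

# coef() returns a sparse one-column matrix whose first row is the intercept
cf <- coef(cv.fit, s = "lambda.min")
n.nonzero <- sum(cf[-1, 1] != 0)   # drop the intercept row before counting
n.nonzero
```

The same two lines applied to `lasso.fit` with `s=lasso.lambda` should give the count for the College fit. No count is quoted here because it was not computed in the original run.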
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit=pcr(Apps~.,data=train,scale=TRUE,validation="CV")
validationplot(pcr.fit,val.type="MSEP")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3761 2078 2076 1809 1709 1662
## adjCV 3862 3761 2075 2076 1790 1689 1657
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1654 1615 1573 1578 1584 1587 1586
## adjCV 1649 1605 1568 1572 1579 1582 1580
## 14 comps 15 comps 16 comps 17 comps
## CV 1588 1528 1193 1133
## adjCV 1583 1511 1183 1124
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
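According to the summary, the cross-validated RMSEP is actually lowest at M = 17 (1133), but using all 17 components is equivalent to least squares and gives no dimension reduction. The CV error levels off at roughly 1573 to 1588 between 9 and 14 components before dropping again at 15, so we use M = 10 (CV RMSEP 1578) as a reduced model. These 10 components explain 92.81% of the variance of the predictors and 85.22% of the variance of Apps.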
pcr.pred=predict(pcr.fit,test,ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1952693
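Rather than reading M off the summary table, the component count with the smallest cross-validated RMSEP can be extracted directly. The sketch below repeats the split from part (a) so that it runs on its own. `idx`, `fit`, `cv.rmsep` and `best.m` are illustrative names, and because the CV segments are random the selected value need not match the table above.

```r
library(ISLR)   # College data
library(pls)    # pcr(), RMSEP()

set.seed(1)
idx <- sample(nrow(College), 0.75 * nrow(College))
train <- College[idx, ]

fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")

# RMSEP()$val is indexed as [estimate, response, model]; the first model is
# the intercept-only fit, so position k corresponds to k - 1 components.
cv.rmsep <- RMSEP(fit, estimate = "CV")$val[1, 1, ]
best.m <- unname(which.min(cv.rmsep)) - 1
best.m
```

If the minimum lands on the full set of components, PCR coincides with least squares, and a smaller M chosen where the curve flattens is a judgment call rather than the CV optimum.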
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit=plsr(Apps~.,data=train,scale=TRUE,validation="CV")
validationplot(pls.fit,val.type="MSEP")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1933 1688 1489 1424 1250 1167
## adjCV 3862 1927 1687 1482 1404 1227 1157
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1149 1144 1138 1134 1135 1132 1132
## adjCV 1140 1136 1130 1126 1127 1124 1123
## 14 comps 15 comps 16 comps 17 comps
## CV 1130 1131 1130 1130
## adjCV 1122 1122 1122 1122
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
# CV RMSEP is nearly flat beyond 7 components (1149 vs. a minimum of 1130), so use M=7
pls.pred=predict(pls.fit,test,ncomp=7)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1333314
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
barplot(c(lm.err,ridge.err,lasso.err,pcr.err,pls.err),xlab="Regression Models",ylab="Test Error",main="Test Error Comparisons",names.arg=c("OLS","Ridge","Lasso","PCR","PLS"))
- PCR (with M = 10) stands out with the highest test error. Ridge has
the lowest test error. Let’s also look at the test R-squared of these
models to see how much of the variance in Apps each one explains.
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lm.r2
## [1] 0.9086432
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2
## [1] 0.9168573
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2
## [1] 0.9079682
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2
## [1] 0.8711605
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2
## [1] 0.9120273
barplot(c(lm.r2,ridge.r2,lasso.r2,pcr.r2,pls.r2),xlab="Regression Methods",ylab="Test R-Squared",names.arg=c("OLS","Ridge","Lasso","PCR","PLS"))
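- PCR is the least accurate of the five models, with a test R-squared of
about 0.87, while the other 4 models all exceed 0.90. Ridge regression
has the highest test R-squared (0.917), which is equivalent to having
the lowest test MSE because every R-squared here uses the same baseline.
Ridge is therefore the best of these models for predicting Apps, although
the differences among OLS, ridge, lasso and PLS are small.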
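To answer "how accurately" in the units of the response, it can help to convert the test MSEs to RMSEs, that is, a typical prediction error in number of applications. The sketch below copies the five test MSEs from the outputs reported above. `test.mse`, `test.rmse` and `pct.above.best` are illustrative names.

```r
# Test MSEs copied from the outputs reported above
test.mse <- c(OLS = 1384604, Ridge = 1260111, Lasso = 1394834,
              PCR = 1952693, PLS = 1333314)

# Root mean squared error, in applications
test.rmse <- sqrt(test.mse)
round(test.rmse)

# Percentage by which each test MSE exceeds the smallest one
pct.above.best <- round(100 * (test.mse / min(test.mse) - 1), 1)
pct.above.best
```

On this scale the gap between PCR and the other four methods is easier to judge than from the raw MSE values.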
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
detach(College)
attach(Boston)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(leaps)
set.seed(1)
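predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]])  # formula used in the regsubsets() call
  mat=model.matrix(form,newdata)     # design matrix for the new data
  coefi=coef(object,id=id)           # coefficients of the best id-variable model
  xvars=names(coefi)
  mat[,xvars]%*%coefi                # predictions
}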
k=10
p=ncol(Boston)-1
folds=sample(rep(1:k,length=nrow(Boston)))
cv.errors=matrix(NA,k,p)
for(i in 1:k){
  best.fit=regsubsets(crim~.,data=Boston[folds!=i,],nvmax=p)
  for(j in 1:p){
    pred=predict(best.fit,Boston[folds==i,],id=j)
    cv.errors[i,j]=mean((Boston$crim[folds==i]-pred)^2)
  }
}
rmse.cv=sqrt(apply(cv.errors,2,mean))
plot(rmse.cv,pch=19,type="b")
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*" "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(rmse.cv)
## [1] 9
boston.bsm.err=(rmse.cv[which.min(rmse.cv)]^2)
boston.bsm.err
## [1] 42.81453
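The `summary(best.fit)` output above describes the fit from the last cross-validation fold only. A common follow-up, sketched here under the assumption that `Boston` is the MASS version with 13 predictors, is to refit best subset selection on all observations and read off the 9-variable model chosen by cross-validation. `reg.full` and `coef9` are illustrative names.

```r
library(MASS)    # Boston data
library(leaps)   # regsubsets()

# Refit on the full data; the fold-wise fits were only used to choose the size
reg.full <- regsubsets(crim ~ ., data = Boston, nvmax = 13)

# Intercept plus the 9 predictors in the best 9-variable model
coef9 <- coef(reg.full, id = 9)
names(coef9)
```

The names returned show which features the selected model keeps, which is the information part (c) asks about.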
##Lasso
boston.x=model.matrix(crim~.,data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1,type.measure="mse")
plot(boston.lasso)
coef(boston.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.176491
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.150484
## tax .
## ptratio .
## black .
## lstat .
## medv .
boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err
## [1] 62.74783
##Ridge Regression
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)
coef(boston.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.523899542
## zn -0.002949852
## indus 0.029276741
## chas -0.166526007
## nox 1.874769665
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err
## [1] 58.8156
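Both glmnet errors above are read at `lambda.1se`, the more heavily regularized choice, whereas the best subset figure is the minimum CV error. A sketch of how the CV error at `lambda.min` could be extracted alongside it, so that the methods are compared on the same footing, is given below. `cv.error` and `res` are hypothetical names, and the numbers will vary with the random folds.

```r
library(MASS)     # Boston data
library(glmnet)

set.seed(1)
x <- model.matrix(crim ~ ., data = Boston)[, -1]
y <- Boston$crim

# CV mean squared error at lambda.min and lambda.1se for a given alpha
cv.error <- function(alpha) {
  fit <- cv.glmnet(x, y, alpha = alpha, type.measure = "mse")
  c(at.lambda.min = min(fit$cvm),
    at.lambda.1se = fit$cvm[fit$lambda == fit$lambda.1se])
}

res <- rbind(lasso = cv.error(1), ridge = cv.error(0))
res
```

By construction the `lambda.1se` column is never smaller than the `lambda.min` column, so quoting only the former makes the lasso and ridge look worse relative to methods tuned to the CV minimum.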
##PCR
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
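Cross-validated MSE estimates: Subset = 42.81453, Lasso = 62.74783, Ridge = 58.8156, PCR = 44.38224. The lasso and ridge values are taken at lambda.1se. Among these, best subset selection with 9 predictors has the lowest cross-validated error.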
(c) Does your chosen model involve all of the features in the data set? Why or why not?