Chapter 6 (page 259): Exercises 2, 9, 11
#install.packages("leaps")
(a) The lasso, relative to least squares, (iii) is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(b) Ridge regression, relative to least squares, (iii) is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(c) Non-linear methods, relative to least squares, (ii) are more flexible and will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
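To make the flexibility argument concrete, here is a minimal simulated sketch (the data and penalty values are illustrative, not part of the exercise): as the ridge penalty λ grows, the coefficient estimates shrink toward zero, so the fit becomes less flexible, trading higher bias for lower variance.
library(glmnet)
set.seed(1)
n <- 100; p <- 10
x.sim <- matrix(rnorm(n * p), n, p)            # simulated predictors
y.sim <- drop(x.sim %*% rnorm(p) + rnorm(n))   # simulated response
ridge.sim <- glmnet(x.sim, y.sim, alpha = 0, lambda = c(100, 0.01))
# Coefficients at a large vs. a small penalty: the large-lambda fit is
# heavily shrunk toward zero (less flexible), while the small-lambda fit
# stays close to the least squares solution.
round(cbind(lambda_100  = as.matrix(coef(ridge.sim, s = 100))[, 1],
            lambda_0.01 = as.matrix(coef(ridge.sim, s = 0.01))[, 1]), 3)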
library(ISLR)
library(leaps)
(a) Split the data set into a training set and a test set.
attach(College)
set.seed(1)
train <- sample(1:nrow(College), nrow(College) * 0.7)
Ctrain <- College[train, ]
Ctest <- College[-train, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
The test mean squared error is 1413322
Col.lm <- lm(Apps~., data = Ctrain)
pred.lm <- predict(Col.lm, Ctest)
mean((Ctest$Apps - pred.lm)^2)
## [1] 1413322
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
The test mean squared error is 976268.9.
#install.packages("glmnet")
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
grid=10^seq(10,-2,length=100)
Cridge.mod=glmnet(x,y,alpha=0,lambda=grid)
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
Cridge.mod=glmnet(x[train,], y[train], alpha = 0, lambda = grid)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.out$lambda.min
bestlam
## [1] 405.8404
Cridge.pred=predict(Cridge.mod,s=bestlam,newx=x[test,])
mean((Cridge.pred-y.test)^2)
## [1] 976268.9
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
The test mean squared error is 1116252, and all 18 coefficient estimates (the intercept plus the 17 predictors) are non-zero, so the lasso does not drop any variables at this λ.
Classo.mod=glmnet(x[train,], y[train], alpha = 1, lambda = grid)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 1.97344
Classo.pred=predict(Classo.mod,s=bestlam,newx=x[test,])
mean((Classo.pred-y.test)^2)
## [1] 1116252
predict(cv.out,type="coefficients",s=bestlam)[1:18,]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -7.688896e+02 -3.127034e+02 1.762718e+00 -1.318195e+00 6.482356e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -2.081406e+01 7.119149e-02 1.246161e-02 -1.049091e-01 2.088305e-01
## Books Personal PhD Terminal S.F.Ratio
## 2.926466e-01 3.955068e-03 -1.455463e+01 5.395858e+00 2.171398e+01
## perc.alumni Expend Grad.Rate
## 5.088260e-01 4.824455e-02 7.036148e+00
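As a quick programmatic check (a small sketch reusing the cv.out and bestlam objects from the chunk above), the non-zero coefficients can be counted directly:
lasso.coef <- predict(cv.out, type = "coefficients", s = bestlam)
sum(lasso.coef != 0)   # 18: the intercept plus all 17 predictors are retained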
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The test mean squared error is 1322362 with M = 10 components.
#install.packages("pls")
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
train=sample(1:nrow(College), nrow(College)/2)
test=(-train)
Colpcr=pcr(Apps~., data=College, scale=TRUE, validation="CV")
summary(Colpcr)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3873 3842 2033 2033 1725 1617 1597
## adjCV 3873 3844 2031 2033 1684 1604 1593
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1592 1549 1510 1508 1514 1515 1525
## adjCV 1592 1543 1507 1505 1511 1511 1522
## 14 comps 15 comps 16 comps 17 comps
## CV 1526 1453 1163 1140
## adjCV 1522 1435 1157 1134
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.670 57.30 64.30 69.90 75.39 80.38 83.99 87.40
## Apps 2.316 73.06 73.07 82.08 84.08 84.11 84.32 85.18
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.91 95.01 96.81 97.9 98.75 99.36
## Apps 85.88 86.06 86.06 86.10 86.1 86.13 90.32
## 16 comps 17 comps
## X 99.84 100.00
## Apps 92.52 92.92
validationplot(Colpcr, val.type = "MSEP")
set.seed(1)
Colpcr=pcr(Apps~., data=College[train, ], scale=TRUE, validation="CV")
#validationplot(Colpcr, val.type = "MSEP")
Colpcr.pred=predict(Colpcr, College[test,], ncomp=10)
mean((Colpcr.pred-College$Apps[test])^2)
## [1] 1322362
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The test mean squared error is 1243875 with M = 11 components.
set.seed(1)
Colpls=plsr(Apps~., data=College[train,], scale=TRUE, validation="CV")
summary(Colpls)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4377 2187 1789 1784 1615 1453 1298
## adjCV 4377 2182 1773 1800 1588 1414 1282
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1273 1244 1241 1235 1220 1221 1220
## adjCV 1260 1233 1229 1223 1210 1210 1208
## 14 comps 15 comps 16 comps 17 comps
## CV 1219 1220 1220 1220
## adjCV 1208 1209 1209 1209
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.36 33.68 61.84 64.33 66.96 72.94 76.83 79.79
## Apps 76.52 86.14 87.25 91.87 93.75 93.96 94.04 94.15
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.64 86.12 89.36 90.92 92.85 95.55 96.98
## Apps 94.26 94.31 94.36 94.40 94.41 94.41 94.41
## 16 comps 17 comps
## X 98.86 100.00
## Apps 94.41 94.41
Colpls.pred=predict(Colpls, College[test,], ncomp=11)
mean((Colpls.pred-College$Apps[test])^2)
## [1] 1243875
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Yes, there is a noticeable spread in the test errors. Ridge regression gives the lowest test MSE and plain least squares the highest (roughly 45% larger), with the lasso, PLS, and PCR in between. Even the best model leaves a root mean squared error of roughly 990 applications, so the predictions are useful but far from exact.
Ridge MSE: 976268.9
Lasso MSE: 1116252
PLS MSE: 1243875
PCR MSE: 1322362
Linear MSE: 1413322
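The five errors above can also be collected into one table. This is a small sketch that assumes the prediction objects from the earlier chunks are still in the workspace; note that the linear model used the 70/30 split, ridge and lasso one 50/50 split, and PCR/PLS a second 50/50 split, so the numbers are not all computed on the same test set.
data.frame(method = c("Ridge", "Lasso", "PLS", "PCR", "Least squares"),
           test.MSE = c(mean((Cridge.pred - y.test)^2),
                        mean((Classo.pred - y.test)^2),
                        mean((Colpls.pred - College$Apps[test])^2),
                        mean((Colpcr.pred - College$Apps[test])^2),
                        mean((Ctest$Apps - pred.lm)^2)))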
library(MASS)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
I tried best subset selection, the lasso, and ridge regression. Ridge regression with λ chosen by cross-validation had the lowest test MSE, with the lasso close behind; best subset selection (evaluated by 10-fold cross-validation) had the highest error of the three.
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
Btrain <- Boston[train, ]
Btest <- Boston[-train, ]
Best Subset
bestBost <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
summary(bestBost)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
reg.summary=summary(bestBost)
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type="l")
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type="l")
which.max(reg.summary$adjr2)
## [1] 9
which.min(reg.summary$cp)
## [1] 8
which.min(reg.summary$bic)
## [1] 3
set.seed(1)
train=sample(c(TRUE,FALSE), nrow(Boston),rep=TRUE)
test=(!train)
bestBost <- regsubsets(crim ~ ., data = Boston[train, ], nvmax = 13)
test.mat=model.matrix(crim~.,data=Boston[test,])
# Helper to predict from a regsubsets object (leaps provides no predict
# method): rebuild the model matrix from the original formula and multiply
# by the coefficients of the id-variable model.
predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  xvars=names(coefi)
  mat[,xvars]%*%coefi
}
k=10
set.seed(1)
folds=sample(1:k,nrow(Boston),replace=TRUE)   # assign each observation to one of k folds
cv.errors=matrix(NA,k,13, dimnames=list(NULL, paste(1:13)))
# For each fold, fit best subsets on the remaining k-1 folds and record the
# held-out MSE for every model size from 1 to 13 variables.
for(j in 1:k){
  best.fit=regsubsets(crim~.,data=Boston[folds!=j,],nvmax=13)
  for(i in 1:13){
    pred=predict(best.fit,Boston[folds==j,],id=i)
    cv.errors[j,i]=mean((Boston$crim[folds==j]-pred)^2)
  }
}
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
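The model size is then chosen where the cross-validated error is smallest; a quick check of the table above before refitting on the full data:
which.min(mean.cv.errors)   # the 12-variable model
min(mean.cv.errors)         # about 42.46, from the output above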
reg.best=regsubsets(crim~.,data=Boston, nvmax=13)
coef(reg.best, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
Lasso Regression
x=model.matrix(crim~.,Boston)[,-1]
y=Boston$crim
grid=10^seq(10,-2,length=100)
lasso.mod=glmnet(x,y,alpha=1,lambda=grid)
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
lasso.mod=glmnet(x[train,], y[train], alpha = 1, lambda = grid)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.05148183
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 40.99066
Ridge Regression
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.5919159
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 40.92395
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Ridge regression with λ chosen by cross-validation had the lowest validation error, and the lasso was very close behind; best subset selection, evaluated by 10-fold cross-validation, was somewhat worse than both. Based on these estimates I would propose the ridge regression model, with the lasso as a nearly equivalent alternative.
Lasso test MSE: 40.99066
Ridge test MSE: 40.92395
Best subset 10-fold CV MSE (12 variables): 42.46014
mean((lasso.pred-y.test)^2)
## [1] 40.99066
mean((ridge.pred-y.test)^2)
## [1] 40.92395
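For a side-by-side view, the three validation estimates can be gathered into one vector (a sketch reusing the objects from part (a); the best-subset figure is a 10-fold CV estimate on the full data rather than a hold-out test MSE):
c(lasso = mean((lasso.pred - y.test)^2),
  ridge = mean((ridge.pred - y.test)^2),
  best.subset.cv = min(mean.cv.errors))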
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Yes. The chosen ridge regression model involves all 13 features, because ridge regression shrinks coefficients toward zero but never sets them exactly to zero; it does not perform variable selection.
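A quick check (a sketch reusing the ridge cv.out and bestlam from part (a)) confirms that none of the coefficients is exactly zero:
ridge.coef <- predict(cv.out, type = "coefficients", s = bestlam)
sum(ridge.coef != 0)   # expect 14: the intercept plus all 13 predictors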