(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Of i through iv, iii is the most accurate. The bias-variance trade-off is what gives the lasso its advantage over least squares. The lasso's penalty term shrinks the coefficient estimates, which reduces variance relative to least squares at the cost of a small increase in bias. This makes the model less flexible, and it gives improved prediction accuracy when its increase in bias is less than its decrease in variance.
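For reference, the standard decomposition of the expected test MSE that underlies this trade-off (for a test point x0, with fitted function f̂ and irreducible error ε) is

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε),

so a shrinkage method beats least squares whenever the variance it removes exceeds the squared bias it introduces.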
(b) Repeat (a) for ridge regression relative to least squares.
Ridge regression behaves like the lasso in this respect, so iii is again the most accurate. Just like the lasso, ridge regression exploits the bias-variance trade-off: its penalty term reduces variance while increasing bias, so the model is less flexible and is more accurate when its increase in bias is less than its decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods tend to be more flexible and have less bias than least squares. This gives non-linear methods better accuracy when their increase in variance is less than their decrease in bias, making ii the most accurate answer.
(a) Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
attach(College)
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]
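(Optional) A quick sanity check on the split, using the objects just created; the training half should contain 388 of the 777 colleges, consistent with the dimensions reported by the PCR fit later on.
dim(College.train)
dim(College.test)
length(y.test)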
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
ls.fit<-lm(Apps~., data=College, subset=train)
ls.pred = predict(ls.fit, College.test, type="response")
test.error= mean((College.test$Apps-ls.pred)^2)
test.error
## [1] 1020100
The test error from the fitted linear model with all the variables is 1,020,100.
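Since the MSE is on a squared scale, its square root is easier to interpret: √1,020,100 ≈ 1,010, so the least-squares predictions are off by roughly a thousand applications on average. If desired, this can be computed from the object above:
sqrt(test.error)   #RMSE, in units of applications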
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Loaded glmnet 4.1-3
set.seed(1)
#Set up matrices
grid=10^seq(10,-2,length=100)
#Fit a ridge regression
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Choose lambda using cross-validation
cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 411.3927
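(Optional) The cross-validation curve can also be inspected visually before predicting; plot() on a cv.glmnet object shows the CV MSE against log(λ), with dotted vertical lines at lambda.min and lambda.1se:
plot(cv.college.out)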
#Make predictions
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
#Calculate test error
mean((ridge.pred-y.test)^2)
## [1] 985020.1
The test MSE for the ridge model is 985,020.1. This is lower than the MSE of the linear model using least squares.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#fit lasso model
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Choose lambda using cross-validation
set.seed(2)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 27.0669
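(Aside) cv.glmnet also stores lambda.1se, the largest λ whose CV error is within one standard error of the minimum; it gives a sparser, more heavily regularized fit and can be inspected with:
cv.out$lambda.1se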
#Make predictions and calculate test error
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1006648
The test error of the lasso model is 1,006,648.
#non-zero coefficient estimates
out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -647.69543453 -396.82408556 1.42411713 -0.09893105 30.63271978
## Top25perc P.Undergrad Outstate Room.Board PhD
## -0.57340223 0.01055132 -0.05004306 0.11718682 -4.81091793
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## -3.24227399 1.89312526 -1.02549443 0.06707521 4.38907054
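From the output above, 14 of the 17 predictors have non-zero coefficient estimates at the chosen λ (F.Undergrad, Books, and Personal are dropped). The count can also be obtained directly from the coefficient vector created above:
sum(lasso.coef!=0)-1   #number of non-zero coefficients, excluding the intercept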
(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.0.5
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
#fit a PCR model with M chosen by CV
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 4335 2390 2401 2112 1954 1914
## adjCV 4347 4335 2386 2401 2085 1949 1905
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1910 1879 1871 1867 1867 1875 1894
## adjCV 1902 1862 1863 1860 1859 1867 1887
## 14 comps 15 comps 16 comps 17 comps
## CV 1853 1634 1323 1286
## adjCV 1934 1586 1310 1273
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.6794 56.94 64.38 70.61 76.27 80.97 84.48 87.54
## Apps 0.9148 71.17 71.36 79.85 81.49 82.73 82.79 83.70
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.89 94.96 96.81 97.97 98.73 99.39
## Apps 83.86 84.08 84.11 84.11 84.16 84.28 93.08
## 16 comps 17 comps
## X 99.86 100.00
## Apps 93.71 93.95
validationplot(pcr.college, val.type="MSEP")
pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)
## [1] 1422699
The test error of the PCR model with M = 10 is 1,422,699. Looking at the summary and the validation plot, the CV error is essentially flat from about M = 8 through M = 14 and only drops meaningfully for M ≥ 15, where almost no dimension reduction is achieved. M = 10 was therefore used for the predictions: it sits in that flat region, explains a high proportion of the variance, and still provides a meaningful dimension reduction.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
#Fit PLS model and visualize
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 2154 1836 1732 1620 1422 1314
## adjCV 4347 2148 1832 1724 1591 1397 1298
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1276 1264 1260 1266 1265 1262 1262
## adjCV 1264 1253 1250 1254 1253 1251 1250
## 14 comps 15 comps 16 comps 17 comps
## CV 1263 1263 1263 1263
## adjCV 1251 1251 1251 1252
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.27 38.72 62.64 65.26 69.01 73.96 78.86 82.18
## Apps 76.96 84.31 86.80 91.48 93.37 93.75 93.81 93.84
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 85.35 87.42 89.18 91.41 92.70 94.58 97.16
## Apps 93.88 93.91 93.93 93.94 93.95 93.95 93.95
## 16 comps 17 comps
## X 98.15 100.00
## Apps 93.95 93.95
Looking at the summary of the model, M = 9 looks to be the best option, with the lowest CV RMSEP (1260).
set.seed(1)
pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)
## [1] 1049868
The test error for the PLS model with M = 9 is 1,049,868.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Based on test error, the model performance from best to worst is: ridge regression (985,020.1), lasso (1,006,648), least-squares linear model (1,020,100), PLS (1,049,868), and PCR (1,422,699). The test errors of PLS, the linear model, and the lasso are fairly similar to one another; PCR performs noticeably worse than the other models, and ridge performs a little better than the others. To see how accurately we can predict the number of applications received, let's compute the test R-squared of each model.
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - ls.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
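(Optional) The same test R-squared values can also be printed numerically alongside the barplot:
round(c(OLS=lm.test.r2, Ridge=ridge.test.r2, Lasso=lasso.test.r2, PCR=pcr.test.r2, PLS=pls.test.r2), 3)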
Comparing the test R-squared of each model shows that PCR has the lowest accuracy, while ridge has the highest. Since ridge has the highest test R-squared and the lowest test MSE, it is the best of the five models here.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
detach(College)
library(MASS)
attach(Boston)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.5
set.seed(1)
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(!train)
Boston.train = Boston[train,]
Boston.test = Boston[test,]
Best Subset Selection
regfit.full = regsubsets(crim~.,data=Boston.train,nvmax=15)
reg.summary = summary(regfit.full)
#Make plots of performance statistics to determine best subset
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
p = which.max(reg.summary$adjr2)
points(p,reg.summary$adjr2[p], col="red",cex=2,pch=20)
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
p = which.min(reg.summary$cp)
points(p,reg.summary$cp[p],col="red",cex=2,pch=20)
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
p = which.min(reg.summary$bic)
points(p,reg.summary$bic[p],col="red",cex=2,pch=20)
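(Optional) The subset sizes chosen by each criterion can also be printed directly rather than read off the plots:
which.max(reg.summary$adjr2)
which.min(reg.summary$cp)
which.min(reg.summary$bic)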
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston.train, nvmax = 15)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Looking at the graphs above, adjusted R-squared and Cp suggest a seven-variable model, while BIC suggests a three-variable model. We will fit both and see which of the two is the better model.
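The variables entering the three- and seven-variable subsets (used in the fits below) can be read off the selection table above or extracted directly:
coef(regfit.full, id=3)   #rad, black, lstat (plus intercept)
coef(regfit.full, id=7)   #zn, indus, dis, rad, black, lstat, medv (plus intercept)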
# p=3
lm.fit = lm(crim~rad+black+lstat, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 52.66064
#p=7
lm.fit = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 51.04084
The test error of the linear model fit with the 3 best predictors is 52.66, while the p=7 linear model has a test error of 51.04. The seven-predictor model's error is only 1.62 lower than the three-variable model's, so one could argue that the improvement is not worth the added complexity of four more variables.
Lasso
set.seed(1)
#Set up matrices needed for the glmnet functions
train.mat = model.matrix(crim~., data = Boston.train)
test.mat = model.matrix(crim~., data = Boston.test)
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=1)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01973085
#Fit lasso model
lasso.mod = glmnet(train.mat,Boston.train$crim,alpha=1)
#Make predictions
lasso.pred = predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred - Boston.test$crim)^2)
## [1] 50.7379
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[1:15,]
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox rm
## 15.96344351 0.04355602 -0.10751503 -0.14897948 -6.30730424 -0.24512633
## age dis rad ptratio black lstat
## 0.01130326 -0.74018280 0.47687429 -0.15621440 -0.01470752 0.11414439
## medv
## -0.11377842
Ridge Regression
#Choose lambda using cross-validation
cv.ridge = cv.glmnet(train.mat,Boston.train$crim,alpha=0)
bestlam = cv.ridge$lambda.min
bestlam
## [1] 0.5240686
#Fit a ridge regression
ridge.mod = glmnet(train.mat,Boston.train$crim,alpha = 0)
#Make predictions
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
#Calculate test error
mean((ridge.pred - Boston.test$crim)^2)
## [1] 51.46284
coef(cv.ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.518428366
## (Intercept) .
## zn -0.003319432
## indus 0.032707247
## chas -0.202043154
## nox 2.120359324
## rm -0.161829416
## age 0.007019656
## dis -0.110482087
## rad 0.052144674
## tax 0.002404790
## ptratio 0.076371881
## black -0.003655338
## lstat 0.037206684
## medv -0.023652398
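Note that coef() on a cv.glmnet object defaults to s = "lambda.1se", so the table above is not evaluated at the lambda.min used for the predictions; the coefficients at that λ can be requested explicitly:
coef(cv.ridge, s="lambda.min")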
Principal Components Regression
#Fit a PCR model on the full Boston data set, with M chosen by CV
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.249 7.244 6.813 6.820 6.824 6.867
## adjCV 8.61 7.244 7.239 6.806 6.809 6.816 6.856
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.858 6.721 6.773 6.743 6.755 6.734 6.669
## adjCV 6.847 6.710 6.760 6.731 6.741 6.717 6.652
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
The lasso and ridge regression give similar results, but of the two the lasso has the lower test MSE (50.74 vs. 51.46); it is also the lowest of all the models fit here, and at the chosen λ it keeps 12 of the 13 predictors (only tax is dropped). For the PCR model, the CV error is essentially flat after about M = 8 and is only minimized at the full model (M = 13), so PCR provides little benefit on this data. Note also that the PCR model was fit and cross-validated on the full Boston data set, so it has no test error directly comparable to the other models; a sketch of how one could be obtained follows.
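A minimal sketch of how a comparable validation-set error could be computed, refitting PCR on the training half only (the object names pcr.boston and pcr.boston.pred are ours, and the choice M = 8 is an assumption that should be checked against the CV output of the refit):
#Refit PCR on the training set and estimate a test error comparable to the other models
pcr.boston = pcr(crim~., data=Boston.train, scale=TRUE, validation="CV")
pcr.boston.pred = predict(pcr.boston, Boston.test, ncomp=8)   #M=8 is an assumed choice
mean((pcr.boston.pred - Boston.test$crim)^2)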
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error.
Based on the information above, I would say that the best models to use are the best-subset linear models with p=3 or p=7. Even though the lasso has the smallest test MSE, the difference from the three-predictor linear model (about 1.9) is small, and the lasso uses nine more predictors, adding complexity for little gain. If we do want a lower MSE, the p=7 linear model is only about 0.30 above the lasso and uses just seven predictors.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
The model chosen is the best-subset model with p=7. It has the second-lowest test MSE at 51.04, only about 0.30 above the lowest (the lasso, at 50.74). It includes only 7 of the 13 available predictors, which reduces the dimensionality and makes the model easier to interpret while still achieving a low test error.