For each of parts (a)–(c) below, indicate which of i. through iv. is correct, and justify the answer:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is:
(b) Repeat (a) for ridge regression relative to least squares.
(c) Repeat (a) for non-linear methods relative to least squares.
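To make the bias–variance trade-off behind these choices concrete, here is a small illustrative simulation (not part of the exercise; n, p, and the sparse coefficient vector are made up for illustration). With many predictors and a modest sample size, least squares is unbiased but high-variance, so a less flexible method such as the lasso can win on test MSE.
library(glmnet)
set.seed(3)
n=100; p=50
x=matrix(rnorm(n*p), n, p)
beta=c(rep(1, 5), rep(0, p-5))   #only the first 5 predictors matter
y=x%*%beta+rnorm(n, sd=3)
xtest=matrix(rnorm(1000*p), 1000, p)
ytest=xtest%*%beta+rnorm(1000, sd=3)
#Least squares: low bias, but 51 coefficients estimated from 100 observations
ls.fit=lm(y~x)
ls.pred=cbind(1, xtest)%*%coef(ls.fit)
#Lasso: adds bias through shrinkage, but reduces variance
cv.fit=cv.glmnet(x, y, alpha=1)
lasso.pred=predict(cv.fit, s="lambda.min", newx=xtest)
c(LS=mean((ytest-ls.pred)^2), Lasso=mean((ytest-lasso.pred)^2))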
(a) Split the data set into a training set and a test set.
library(ISLR)
set.seed(1)
#Split the College data in half: the sampled rows form the test set, the rest the training set
trainid=sample(1:nrow(College), nrow(College)/2)
test=College[trainid,]
train=College[-trainid,]
#dim(College)
names(College)
## [1] "Private" "Apps" "Accept" "Enroll" "Top10perc"
## [6] "Top25perc" "F.Undergrad" "P.Undergrad" "Outstate" "Room.Board"
## [11] "Books" "Personal" "PhD" "Terminal" "S.F.Ratio"
## [16] "perc.alumni" "Expend" "Grad.Rate"
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
#Fit the model with everything
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2397.9 -400.0 -37.9 263.8 6330.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -274.36837 496.00827 -0.553 0.580492
## PrivateYes -812.55253 181.80228 -4.469 1.04e-05 ***
## Accept 1.20602 0.06300 19.143 < 2e-16 ***
## Enroll -0.29820 0.22026 -1.354 0.176612
## Top10perc 34.64854 7.04843 4.916 1.33e-06 ***
## Top25perc -8.28312 5.76933 -1.436 0.151926
## F.Undergrad 0.09624 0.03831 2.512 0.012418 *
## P.Undergrad 0.01715 0.03718 0.461 0.644851
## Outstate -0.04682 0.02587 -1.810 0.071124 .
## Room.Board 0.09182 0.06131 1.497 0.135117
## Books -0.11015 0.27284 -0.404 0.686649
## Personal 0.03498 0.08627 0.406 0.685312
## PhD -1.98865 6.04406 -0.329 0.742322
## Terminal -12.36690 6.70691 -1.844 0.065994 .
## S.F.Ratio 8.73527 15.70896 0.556 0.578499
## perc.alumni -5.87603 5.29707 -1.109 0.268020
## Expend 0.12014 0.01867 6.436 3.79e-10 ***
## Grad.Rate 12.70646 3.71635 3.419 0.000698 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 923.7 on 371 degrees of freedom
## Multiple R-squared: 0.9295, Adjusted R-squared: 0.9262
## F-statistic: 287.6 on 17 and 371 DF, p-value: < 2.2e-16
#Predict Apps for the test set
lm.pred=predict(lm.fit, test)
#Compute the test mean squared error
lm.error=mean((test$Apps-lm.pred)^2)
lm.error
## [1] 1653835
The test MSE for the least squares model is 1653835.
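As a supplementary sanity check (not asked for in the exercise), the test MSE can be put in context by comparing it to a naive baseline that always predicts the training-set mean number of applications:
#Baseline: always predict the training-set mean of Apps
baseline.error=mean((test$Apps-mean(train$Apps))^2)
baseline.error
#Rough test R^2: fraction of test-set variability explained by the linear model
1-lm.error/baseline.error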
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
attach(College)
#Build the predictor matrix (note: dropping column 17 removes Expend from the predictors)
x=model.matrix(Apps~., College[,-17])
y=College$Apps
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
#head(train)
test=(-train)
y.test=y[test]
grid=10^seq(10,-2,length=100)
cv.out=cv.glmnet(x[train,],y[train],lambda = grid, alpha = 0, thresh=1e-12)
plot(cv.out)
bestlam=cv.out$lambda.min
ridge.pred=predict(cv.out, s=bestlam, newx = x[test,])
mean((ridge.pred-y.test)^2)
## [1] 1209311
The test MSE for ridge regression is 1209311.
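As an optional check (mirroring the coefficient step used for the lasso below), ridge regression can be refit on the full data and its coefficients inspected at the cross-validated λ; unlike the lasso, none of them should be exactly zero:
out.ridge=glmnet(x, y, alpha=0, lambda=grid)
predict(out.ridge, type='coefficients', s=bestlam)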
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod=glmnet(x[train,], y[train], alpha=1, lambda = grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1190120
out=glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef=predict(out, type='coefficient', s=bestlam)
lasso.coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 62.58919214
## (Intercept) .
## PrivateYes -566.41435509
## Accept 1.57833046
## Enroll -0.74995089
## Top10perc 61.56356704
## Top25perc -18.74197086
## F.Undergrad 0.04042129
## P.Undergrad 0.04940370
## Outstate -0.04934853
## Room.Board 0.18023292
## Books 0.07241693
## Personal 0.04577494
## PhD -7.87188542
## Terminal -1.82551823
## S.F.Ratio -13.76410979
## perc.alumni 0.07433295
## Grad.Rate 6.80183612
The test MSE is 1190120. Surprisingly, the lasso does not set any coefficient to zero at the cross-validated λ: all 16 predictor coefficients are non-zero. (Note that Expend was dropped when the model matrix was built, so only 16 of the 17 predictors entered the model.)
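A quick programmatic count of the non-zero coefficients, as a supplementary check on the statement above (lasso.coef is the sparse coefficient matrix printed above):
#Drop the two intercept rows, then count coefficients that are not exactly zero
sum(lasso.coef[-c(1,2),]!=0)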
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
set.seed(2)
pcr.fit=pcr(Apps~., data=College, scale=TRUE, validation="CV", subset=train)
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4032 2359 2362 2055 1924 1881
## adjCV 4288 4035 2355 2358 1971 1911 1875
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1882 1890 1834 1799 1798 1799 1794
## adjCV 1877 1887 1829 1790 1791 1792 1787
## 14 comps 15 comps 16 comps 17 comps
## CV 1795 1764 1348 1278
## adjCV 1790 1737 1332 1265
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
validationplot(pcr.fit, val.type = 'MSEP')
#pcr.fit$ncomp is simply the number of components fitted, not the CV-selected M
pcr.fit$ncomp
## [1] 17
pcr.pred=predict(pcr.fit, College[test,], ncomp = 15)
mean((pcr.pred-College$Apps[test])^2)
## [1] 1268629
The cross-validation error reaches its minimum at M = 17 (which is simply least squares), so M = 15 was used as a compromise that retains most of the accuracy while still reducing the dimension. The resulting test MSE is 1268629.
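Rather than eyeballing the validation plot, the cross-validated error can also be read off directly. This sketch assumes the pls package's MSEP() accessor, whose value array is indexed by estimate type, response, and number of components:
#First slice of the array is the CV estimate; the first entry is the intercept-only
#model, hence the "- 1" to convert the index into a number of components
cv.msep=MSEP(pcr.fit)$val[1,1,]
which.min(cv.msep)-1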
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit=plsr(Apps~., data=College, scale=TRUE, validation="CV", subset=train)
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2217 2019 1761 1630 1533 1347
## adjCV 4288 2211 2012 1749 1605 1510 1331
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1309 1303 1286 1283 1283 1277 1271
## adjCV 1296 1289 1273 1270 1270 1264 1258
## 14 comps 15 comps 16 comps 17 comps
## CV 1270 1270 1270 1270
## adjCV 1258 1257 1257 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20 78.62 80.81
## Apps 75.39 81.24 86.97 91.14 92.62 93.43 93.56 93.68
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.29 87.17 89.15 91.37 92.58 94.42 96.98
## Apps 93.76 93.79 93.83 93.86 93.88 93.89 93.89
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.89 93.89
validationplot(pls.fit, val.type = "MSEP")
pls.pred=predict(pls.fit, College[test,], ncomp = 4)
mean((pls.pred-College$Apps[test])^2)
## [1] 1476030
#Perform PLS on full data set
pls.fit=plsr(Apps~., data=College, scale=TRUE, ncomp=4)
summary(pls.fit)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 4
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps
## X 25.76 40.33 62.59 64.97
## Apps 78.01 85.14 87.67 90.73
M = 4 was selected: the cross-validation error drops sharply up to about 4 components and improves only gradually after that (its minimum is not reached until roughly 13 components). With M = 4 the test MSE is 1476030.
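As a supplementary sketch, the training-set PLS fit can be recreated (the pls.fit object above was just overwritten with the 4-component full-data fit) and the test MSE computed for every M, to see how much accuracy is given up by stopping at M = 4:
pls.train=plsr(Apps~., data=College, scale=TRUE, subset=train)
pls.mse=sapply(1:17, function(m)
  mean((predict(pls.train, College[test,], ncomp=m)-College$Apps[test])^2))
round(pls.mse)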
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Based on the test errors, the lasso is the best model to use here. The linear model achieves a 92.62% adjusted R², and PLS and PCR also explain a large share of the variance, but judged purely on test MSE the lasso is preferred.
Test errors:
Linear (least squares): 1653835
Ridge: 1209311
Lasso: 1190120
PLS (M = 4): 1476030, explaining 64.97% of the variance in X and 90.73% of the variance in Apps (full-data fit)
PCR (M = 15): 1268629, explaining 99.42% of the variance in X and 90.55% of the variance in Apps
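To answer "how accurately can we predict", the test MSEs can be placed on a common scale as the fraction of held-out variance in Apps that each model explains (a rough supplementary sketch; the least squares error above was computed on the complementary half of the data, so its figure is only approximately comparable):
test.errors=c(Linear=1653835, Ridge=1209311, Lasso=1190120, PLS=1476030, PCR=1268629)
#Approximate test R^2: fraction of held-out variance in Apps explained
round(1-test.errors/mean((y.test-mean(y.test))^2), 3)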
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
attach(Boston)
Best Subset Selection:
library(leaps)
regfit.full<-regsubsets(crim~., Boston, nvmax = 14)
reg.summary=summary(regfit.full)
reg.summary$rsq
## [1] 0.3912567 0.4207965 0.4286123 0.4334892 0.4392738 0.4440173 0.4476594
## [8] 0.4504606 0.4524408 0.4530572 0.4535605 0.4540031 0.4540104
#Find which model size optimizes each criterion: minimum RSS, maximum adjusted R^2, minimum Cp, minimum BIC
which.min(reg.summary$rss)
## [1] 13
which.max(reg.summary$adjr2)
## [1] 9
which.min(reg.summary$cp)
## [1] 8
which.min(reg.summary$bic)
## [1] 3
par(mfrow=c(2,2))
plot(reg.summary$rss, xlab="Number of Variables", ylab = "RSS", type ="l")
points(13, reg.summary$rss[13], col='red', cex=2, pch=20)
plot(reg.summary$cp, xlab="Number of Variables", ylab = "CP", type ="l")
points(8, reg.summary$cp[8], col='red', cex=2, pch=20)
plot(reg.summary$adjr2, xlab="Number of Variables", ylab = "adjusted r square", type ="l")
points(9, reg.summary$adjr2[9], col='red', cex=2, pch=20)
plot(reg.summary$bic, xlab="Number of Variables", ylab = "BIC", type ="l")
points(3, reg.summary$bic[3], col='red', cex=2, pch=20)
#preview the coefficients for the best subset model
coef(regfit.full, 8)
## (Intercept) zn nox dis rad
## 19.683127801 0.043293393 -12.753707757 -0.918318253 0.532616533
## ptratio black lstat medv
## -0.310540942 -0.007922426 0.110173124 -0.174207166
#Show the adjusted R^2 for the 8-variable model chosen
reg.summary$adjr2[8]
## [1] 0.4416149
Judging by Cp, best subset selection suggests an 8-variable model (adjusted R² favors 9 variables and BIC favors only 3). The 8 variables are:
zn
nox
dis
rad
ptratio
black
lstat
medv
and the resulting model has a 44.16% adjusted R².
Ridge Regression:
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
#Build the predictor matrix (note: dropping column 14 removes medv from the predictors)
x=model.matrix(crim~., Boston[,-14])
y=Boston$crim
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
#head(train)
test=(-train)
y.test=y[test]
grid=10^seq(10,-2,length=100)
cv.out=cv.glmnet(x[train,],y[train],lambda = grid, alpha = 0, thresh=1e-12)
plot(cv.out)
bestlam=cv.out$lambda.min
ridge.pred=predict(cv.out, s=bestlam, newx = x[test,])
mean((ridge.pred-y.test)^2)
## [1] 41.89459
The ridge regression model has a test MSE of 41.89 and retains every predictor, since ridge regression shrinks coefficients toward zero but never sets them exactly to zero.
LASSO
lasso.mod=glmnet(x[train,], y[train], alpha=1, lambda = grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 41.94497
out=glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef=predict(out, type='coefficient', s=bestlam)
lasso.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6.478907123
## (Intercept) .
## zn 0.031635956
## indus -0.063492351
## chas -1.162574668
## nox -4.920418670
## rm -0.183476785
## age .
## dis -0.594671142
## rad 0.504869664
## tax .
## ptratio -0.040790700
## black -0.009212781
## lstat 0.230728259
The lasso has a test MSE of 41.94 and effectively removes age and tax from the model by setting their coefficients to 0, leaving a total of 10 non-zero predictor coefficients.
PCR
library(pls)
set.seed(2)
pcr.fit=pcr(crim~., data=Boston, scale=TRUE, validation="CV", subset=train)
summary(pcr.fit)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.617 7.605 7.206 7.014 7.050 7.219
## adjCV 9.275 7.608 7.597 7.190 7.001 7.037 7.199
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.212 7.130 7.135 7.036 7.006 6.976 6.924
## adjCV 7.191 7.105 7.113 7.012 6.983 6.952 6.898
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.51 60.4 69.86 77.08 82.80 87.68 91.24 93.56
## crim 34.94 35.2 42.83 45.47 45.57 45.58 45.75 47.59
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.47 97.08 98.48 99.54 100.00
## crim 47.68 48.75 49.31 50.14 51.37
validationplot(pcr.fit, val.type = 'MSEP')
#pcr.fit$ncomp is simply the number of components fitted, not the CV-selected M
pcr.fit$ncomp
## [1] 13
pcr.pred=predict(pcr.fit, Boston[test,], ncomp = 4)
mean((pcr.pred-Boston$crim[test])^2)
## [1] 43.91107
The cross-validation error levels off after about 4 components (its printed minimum is at 13 components, which is essentially least squares), so M = 4 was used; the resulting test MSE is 43.91.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
The lasso, ridge regression, and PCR models were all tuned by cross-validation and evaluated on held-out test error. The best subset model, by contrast, was selected using the highest training-set adjusted R².
Best Subset adjusted R²: 44.16% (a training-set criterion, not a test error)
Ridge Test Error: 41.89459
Lasso Test Error: 41.94497
PCR Test Error: 43.91107
Given the modest number of observations in this data set, the best subset model is a reasonable choice, since it reduces the model to 8 predictors. Ridge and the lasso have nearly identical test errors, so the lasso is preferred for its interpretability (it drops age and tax entirely). PCR has the highest test error of the cross-validated methods and can be omitted. A validation-set check of best subset selection is sketched below.
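Since best subset selection was judged above only on a training-set criterion, here is a sketch of how it could be put on the same footing as the other methods by computing its validation-set error for each model size. regsubsets() has no predict method, so a small helper (an assumption of this sketch, not part of the original analysis) is defined for that purpose:
predict.regsubsets=function(object, newdata, id, ...){
  form=as.formula(object$call[[2]])
  mat=model.matrix(form, newdata)
  coefi=coef(object, id=id)
  mat[,names(coefi)]%*%coefi
}
#Refit best subset on the training rows only, then score each model size on the test rows
regfit.train=regsubsets(crim~., data=Boston[train,], nvmax=13)
val.errors=sapply(1:13, function(i)
  mean((Boston$crim[test]-predict(regfit.train, Boston[test,], id=i))^2))
round(val.errors, 2)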
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Neither of the preferred models uses all of the features. Each method searches for a model that predicts well while limiting the number of features (or dimensions) retained, and the predictors that were dropped contributed too little to the prediction of crim to justify keeping them.