i. More flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
ii. More flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
iii. Less flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
(a) The lasso, relative to least squares, is:
The lasso, relative to least squares, is less flexible and hence will give
improved prediction accuracy when its increase in bias is less than its
decrease in variance (iii).
The lasso avoids overfitting by using a less flexible fitting approach than
least squares. The lasso solution can yield a reduction in variance at
the expense of a small increase in bias, and consequently can generate
more accurate predictions.
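For reference, the lasso coefficient estimates minimize the penalized residual sum of squares

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,$$

so increasing λ reduces flexibility, shrinking some coefficients exactly to zero and lowering variance at the cost of some bias.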
(b) For ridge regression relative to least
squares.
Ridge regression, relative to least squares, is less flexible and
hence will give improved prediction accuracy when its increase in bias
is less than its decrease in variance (iii).
Similar to the lasso, ridge regression also avoids overfitting by using a less
flexible fitting approach than least squares. Ridge regression can
perform well by trading off a small increase in bias for a large
decrease in variance; hence, it works best in situations
where the least squares estimates have high variance.
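Ridge regression solves the analogous problem with an ℓ2 penalty, minimizing

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$$

which shrinks the coefficients toward zero but, unlike the lasso, does not set any of them exactly to zero.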
(c) For non-linear methods relative to least
squares.
Non-linear methods, relative to least squares, are more flexible and
hence will give improved prediction accuracy when their increase in
variance is less than their decrease in bias (ii).
### Problem 9. In this exercise, we will predict the number of
applications received using the other variables in the
College data set.
library(ISLR)
attach(College)
(a) Split the data set into a training set and a test set.
set.seed(1)
train=sample(c(TRUE,FALSE),nrow(College),rep=TRUE)
test=(!train)
College.train = College[train,]
College.test = College[test,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps~., data=College.train)
lm.pred = predict(lm.fit, College.test, type="response")
mean((lm.pred-College.test$Apps)^2)
## [1] 984743.1
The test error of the linear model fit is 984743.1.
(c) Fit a ridge regression model on the training set, with λ
chosen by cross-validation. Report the test error obtained.
library(glmnet)
set.seed(1)
train.mat = model.matrix(Apps~., data = College.train)
test.mat = model.matrix(Apps~., data = College.test)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(train.mat,College.train$Apps,alpha=0,lambda=grid)
cv.out = cv.glmnet(train.mat,College.train$Apps,alpha=0)
bestlam = cv.out$lambda.min
bestlam
## [1] 394.2365
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1800 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
mean((ridge.pred - College.test$Apps)^2)
## [1] 940753.5
The test error of the ridge regression fit with a lambda selected by cross-validation is 940753.5, which is less than the test error of the linear model.
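As a side check (a sketch; output not shown here), the ridge coefficient estimates at the CV-selected λ can be inspected in the same way the lasso coefficients are extracted below. Unlike the lasso, ridge regression keeps all predictors, so none of these coefficients will be exactly zero.
ridge.coef = predict(ridge.mod, type="coefficients", s=bestlam)  # coefficients at the chosen lambda
ridge.coef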
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod=glmnet(train.mat,College.train$Apps,alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1800 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.outl2=cv.glmnet(train.mat,College.train$Apps,alpha=1)
bestlam2=cv.outl2$lambda.min
bestlam2
## [1] 2.309051
lasso.pred=predict(lasso.mod,s=bestlam2,newx = test.mat)
mean((lasso.pred-College.test$Apps)^2)
## [1] 978962.6
The test error of the lasso model fit with a lambda chosen by cross-validation is 978962.6, which is lower than the linear model test error but higher than the ridge regression test error.
out=glmnet(train.mat,College.train$Apps,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam2)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -173.89023228 -649.94882654 1.67083516 -0.75305706 57.37667424
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -20.72078908 0.01792667 0.03670117 -0.08378483 0.10937820
## Books Personal PhD Terminal S.F.Ratio
## 0.05923882 -0.01840355 -9.09196597 -6.37371755 20.24530218
## perc.alumni Expend
## 8.95013487 0.08288697
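Part (d) also asks for the number of non-zero coefficient estimates; a quick way to count them (excluding the intercept) is shown below. From the output above, it appears that 16 of the 17 predictors have non-zero coefficients at this λ, with only Grad.Rate shrunk to zero.
sum(lasso.coef[-1] != 0)  # number of non-zero coefficients, excluding the intercept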
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 393 17
## Y dimension: 393 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4189 4147 2344 2340 2099 2048 1978
## adjCV 4189 4146 2338 2335 2092 2073 1968
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1972 1949 1889 1893 1895 1906 1913
## adjCV 1964 1933 1875 1881 1884 1892 1900
## 14 comps 15 comps 16 comps 17 comps
## CV 1944 1853 1396 1404
## adjCV 1960 1812 1379 1386
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.858 57.44 64.20 69.91 75.10 80.17 83.82 87.30
## Apps 4.353 70.99 71.18 76.84 78.34 81.03 81.59 82.21
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.26 92.74 94.79 96.70 97.76 98.67 99.37
## Apps 83.31 83.97 83.97 84.34 84.58 84.70 91.28
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.83 93.02
validationplot(pcr.college, val.type="MSEP")
pcr.pred=predict(pcr.college,College.test,ncomp=10)
mean((pcr.pred-College.test$Apps)^2)
## [1] 1682909
The test error of the PCR model fit with M = 10 components is 1682909,
which is worse than all of the previous models.
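Note that the cross-validated RMSEP in the summary above is actually smallest at around M = 16; as a quick check (a sketch, output not shown), the test error at that choice of M can be computed in the same way:
pcr.pred16 = predict(pcr.college, College.test, ncomp=16)  # PCR predictions with 16 components
mean((pcr.pred16 - College.test$Apps)^2)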
(f) Fit a PLS model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value of
M selected by cross-validation.
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 393 17
## Y dimension: 393 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4189 2180 1909 1755 1696 1455 1332
## adjCV 4189 2171 1906 1741 1665 1431 1318
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1310 1301 1298 1298 1297 1298 1297
## adjCV 1298 1289 1287 1286 1285 1286 1285
## 14 comps 15 comps 16 comps 17 comps
## CV 1297 1296 1296 1296
## adjCV 1285 1285 1285 1285
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.01 44.96 62.49 65.22 68.52 72.89 77.13 80.46
## Apps 75.74 82.40 86.74 90.58 92.34 92.79 92.88 92.93
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.45 84.76 88.08 90.76 92.80 94.45 97.02
## Apps 92.98 93.00 93.01 93.01 93.02 93.02 93.02
## 16 comps 17 comps
## X 98.03 100.00
## Apps 93.02 93.02
The validation plot suggests the MSEP for the PLS fit levels off at around M = 8, while the summary shows M = 9 with a slightly lower CV error, so I'll try both and assess which performs better on the test set.
pls.pred=predict(pls.college,College.test,ncomp=8)
mean((pls.pred-College.test$Apps)^2)
## [1] 978534.3
pls.pred=predict(pls.college,College.test,ncomp=9)
mean((pls.pred-College.test$Apps)^2)
## [1] 1007163
M = 8 gives the better-performing PLS model, with a test error of 978534.3.
(g) Comment on the results obtained. How accurately can we
predict the number of college applications received? Is there much
difference among the test errors resulting from these five
approaches?
Ranked by test error, from best to worst, the models are:
Ridge regression (940753.5)
PLS (978534.3)
Lasso (978962.6)
Linear model (984743.1)
PCR (1682909)
The test errors of PLS, the linear model, and the lasso are fairly similar to
one another, while PCR performs significantly worse and ridge performs a
little better. To see how accurately we can predict the number of
applications received, let's compute the R-squared value of the ridge
regression fit on the test set.
sum.squares = sum((mean(College.test$Apps) - College.test$Apps)^2)
sum.residuals = sum((ridge.pred - College.test$Apps)^2)
1 - (sum.residuals)/(sum.squares)
## [1] 0.9241129
The R-squared value of the ridge regression fit tells us that this model
explains about 92.4% of the variance in Apps on the test set.
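To put the other fits on the same scale, a small helper (a sketch; its values would mirror the test-error ranking above) can compute the test-set R-squared for any vector of predictions:
test.r2 = function(pred) 1 - sum((pred - College.test$Apps)^2) / sum((mean(College.test$Apps) - College.test$Apps)^2)
test.r2(lm.pred)    # least squares
test.r2(lasso.pred) # lasso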
detach(College)
### Problem 11. We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
library(leaps)
attach(Boston)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
set.seed(1)
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(!train)
Boston.train = Boston[train,]
Boston.test = Boston[test,]
Best Subset Selection:
regfit.full = regsubsets(crim~.,data=Boston.train,nvmax=13)
reg.summary = summary(regfit.full)
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
p = which.max(reg.summary$adjr2)
points(p,reg.summary$adjr2[p], col="red",cex=2,pch=20)
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
p = which.min(reg.summary$cp)
points(p,reg.summary$cp[p],col="red",cex=2,pch=20)
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
p = which.min(reg.summary$bic)
points(p,reg.summary$bic[p],col="red",cex=2,pch=20)
It appears that both p = 3 and p = 7 could be good options, so I will fit models using both suggested subsets and evaluate them on the test set; the variables selected for each size can be read from the summary below.
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston.train, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
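The coefficient estimates for each subset can also be extracted directly with coef(); a minimal sketch (output omitted):
coef(regfit.full, 3)  # best 3-variable subset: rad, black, lstat
coef(regfit.full, 7)  # best 7-variable subset: zn, indus, dis, rad, black, lstat, medv
For reference, the test MSE obtained by simply predicting the mean of crim in the test set is: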
mean((mean(Boston.test$crim)-Boston.test$crim)^2)
## [1] 83.74891
lm.fit = lm(crim~rad+black+lstat, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 52.66064
The test error of the linear model fit with the 3 best predictors is 52.66.
lm.fit = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 51.04084
The test error of the linear model fit with the top 7 predictors is 51.04,
which, in my opinion, is not enough of an improvement over the
three-predictor model to justify the added complexity of four extra
variables.
Lasso:
set.seed(1)
train.mat = model.matrix(crim~., data = Boston.train)
test.mat = model.matrix(crim~., data = Boston.test)
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=1)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01973085
grid=10^seq(10,-2,length=100)
lasso.mod = glmnet(train.mat,Boston.train$crim,alpha=1,lambda = grid)
lasso.pred = predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred - Boston.test$crim)^2)
## [1] 50.73601
The test error of the lasso model fit with a lambda chosen by cross-validation is 50.74, which is lower than the test errors of the linear models from best subset selection. We can look at which variables were not shrunk to zero.
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[1:15,]
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox rm
## 15.97323390 0.04355463 -0.10777115 -0.14788387 -6.31552576 -0.24437358
## age dis rad ptratio black lstat
## 0.01121919 -0.74123953 0.47693745 -0.15627085 -0.01470577 0.11441311
## medv
## -0.11381509
Ridge Regression:
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=0)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.5240686
ridge.mod = glmnet(train.mat,Boston.train$crim,alpha = 0, lambda = grid)
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
mean((ridge.pred - Boston.test$crim)^2)
## [1] 51.45778
The test error of the ridge regression fit with a lambda chosen by
cross-validation is 51.46, which is higher than the lasso test error.
PCR:
pcr.boston=pcr(crim~., data=Boston.train,scale=TRUE,validation="CV")
summary(pcr.boston)
## Data: X dimension: 262 13
## Y dimension: 262 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.065 6.594 6.607 6.076 6.027 6.048 6.104
## adjCV 8.065 6.591 6.603 6.070 6.017 6.042 6.094
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.092 6.010 5.968 6.018 6.042 6.005 5.958
## adjCV 6.081 5.996 5.949 6.003 6.027 5.988 5.941
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 44.58 58.02 68.58 75.64 82.03 87.37 90.66 93.02
## crim 33.90 34.29 44.32 44.91 45.57 45.62 46.13 48.13
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.01 96.87 98.4 99.54 100.00
## crim 48.80 48.96 49.0 49.91 50.71
validationplot(pcr.boston, val.type = "MSEP")
pcr.pred=predict(pcr.boston,Boston.test,ncomp=8)
mean((pcr.pred-Boston.test$crim)^2)
## [1] 53.17176
The test error for the PCR model with M = 8 components is 53.17, which is
higher than that of all the other regression methods.
(b) Propose a model (or set of models) that seem to perform well
on this data set, and justify your answer. Make sure that you are
evaluating model performance using validation set error,
cross-validation, or some other reasonable alternative, as opposed to
using training error.
Based on the results in part (a) above, I would suggest the linear model
with the best subset of 3 predictors. Although the lasso and ridge models
shown above achieve slightly lower test errors than the three-variable
linear model, they still use 12 and 13 predictors, respectively; I find it
difficult to justify that modest reduction in test error given the 9-10
extra variables.
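As an additional check (a sketch assuming the boot package is available; output not shown), the three-predictor model could also be assessed with 10-fold cross-validation on the full Boston data rather than a single validation split:
library(boot)
set.seed(1)
glm.fit = glm(crim ~ rad + black + lstat, data = Boston)  # same model as above, fit with glm for cv.glm
cv.glm(Boston, glm.fit, K = 10)$delta[1]  # 10-fold CV estimate of the test MSE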
(c) Does your chosen model involve all of the features in the
data set? Why or why not?
No, not all of the features are involved. The chosen model comes from best
subset selection and contains only three of the 13 potential predictors, as
mentioned above.
detach(Boston)