For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The answer is iii. The lasso is less flexible than least squares, and hence it will give improved prediction accuracy when its increase in bias is less than its decrease in variance. By shrinking the coefficient estimates (and setting some exactly to zero), the lasso reduces variance at the cost of a small increase in bias, and for a given λ we only need to fit one model, so the computations are relatively simple. The method therefore works best when the least squares fit has excessive variance and the shrinkage does not force the bias to increase too much.
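All three parts rest on the bias-variance decomposition of the expected test MSE at a point \(x_0\):
\[
E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)
\]
A less flexible method trades an increase in the bias term for a decrease in the variance term, and it wins whenever the variance it removes exceeds the squared bias it adds.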
(b) Repeat (a) for ridge regression relative to least squares.
The answer is iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. As with the lasso, ridge regression shrinks the coefficient estimates toward zero, reducing variance at the cost of some bias, and for a given λ we only need to fit one model. It works best when the least squares fit has excessive variance and the shrinkage does not force the bias to increase too much.
(c) Repeat (a) for non-linear methods relative to least squares.
The answer is ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods are more flexible than least squares: they reduce bias at the cost of added variance. They are the best choice when the true relationship between the predictors and the response is substantially non-linear, so that the decrease in bias outweighs the increase in variance.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
attach(College)
(a) Split the data set into a training set and a test set.
set.seed(1)
train=sample(1:nrow(College),nrow(College)/2)
test=(-train)
Apps.test=Apps[test]
training.set=College[train,]
testing.set=College[test,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps~., data=training.set)
lm.pred = predict(lm.fit, testing.set)
mean((lm.pred-Apps.test)^2)
## [1] 1135758
The test MSE of the least squares linear model is \(1135758\).
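To put this on the scale of the response, we can take the square root of the test MSE (a quick follow-up computation, not part of the exercise):
sqrt(mean((lm.pred-Apps.test)^2)) # RMSE, in numbers of applications
This is about 1066, so the least squares predictions are off by roughly a thousand applications on average.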
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet) # load the glmnet package
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))
## [1] 18 100
The coefficient matrix is \(18 \times 100\): one row for each coefficient (the intercept plus 17 predictors) and one column for each of the 100 values of λ in the grid.
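Each column corresponds to one value of λ; for instance, we can inspect the 50th grid value and its coefficient vector:
ridge.mod$lambda[50] # the 50th λ in the grid
coef(ridge.mod)[,50] # ridge coefficients at that λ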
set.seed(1)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 405.8404
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred -y.test)^2)
## [1] 906029.4
For the λ chosen by cross-validation (\(\lambda = 405.84\)), the test MSE is 906029.4. We then refit the ridge regression model on the full data set and examine the coefficients at this λ.
out=glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:18,]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -1.521486e+03 -5.295317e+02 9.743995e-01 4.712708e-01 2.486088e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## 1.124189e+00 7.721190e-02 2.453206e-02 -2.100391e-02 1.999142e-01
## Books Personal PhD Terminal S.F.Ratio
## 1.362569e-01 -9.049362e-03 -3.734037e+00 -4.696652e+00 1.280442e+01
## perc.alumni Expend Grad.Rate
## -8.869487e+00 7.518826e-02 1.138097e+01
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
Using the x, y, train, test, and grid variables from (c):
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred - y.test)^2)
## [1] 1116252
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef
## (Intercept) PrivateYes Accept Enroll Top10perc
## -468.96036214 -491.47351867 1.57117027 -0.76858284 48.25732969
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -12.94716203 0.04293538 0.04406960 -0.08340724 0.14963683
## Books Personal PhD Terminal S.F.Ratio
## 0.01549548 0.02915128 -8.42538012 -3.26671319 14.61167079
## perc.alumni Expend Grad.Rate
## -0.02612797 0.07718333 8.31579869
At this λ, none of the coefficient estimates are zero, so the lasso retains all 17 predictors. The test MSE using the lasso model with λ chosen by cross-validation is 1116252.
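The number of non-zero estimates can also be checked directly:
sum(lasso.coef!=0) # counts the intercept plus the 17 retained predictors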
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
library(pls)
pcr.fit = pcr(Apps~., data=training.set, scale=TRUE, validation='CV')
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4171 2237 2162 2018 1444 1440
## adjCV 4288 4168 2231 2157 2016 1429 1427
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1425 1404 1407 1359 1326 1327 1339
## adjCV 1413 1391 1394 1347 1314 1314 1326
## 14 comps 15 comps 16 comps 17 comps
## CV 1335 1269 1269 1270
## adjCV 1322 1257 1256 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 51.812 87.28 95.83 97.47 98.79 99.45 99.94 99.98
## Apps 5.672 75.44 77.29 85.81 91.84 91.84 91.86 92.16
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 92.18 92.68 93.05 93.06 93.07 93.15 93.82
## 16 comps 17 comps
## X 100.00 100.00
## Apps 93.85 93.89
validationplot(pcr.fit,val.type="MSEP")
pcr.pred=predict(pcr.fit,testing.set,ncomp=10)
mean((pcr.pred-Apps.test)^2)
## [1] 1192180
Using M = 10 components, the test MSE for the PCR model is 1192180. (The cross-validated RMSEP is actually lowest around M = 15 or 16, but the curve is fairly flat beyond M = 10.)
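To read off the CV-selected M programmatically instead of from the validation plot, one option (a sketch using the pls package's RMSEP accessor; the indexing assumes a single response) is:
cv.rmsep=RMSEP(pcr.fit)$val["CV",1,] # CV RMSEP for 0, 1, ..., 17 components
which.min(cv.rmsep)-1 # subtract 1 because the first entry is the intercept-only model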
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
library(pls)
pls.fit = plsr(Apps~., data=training.set, scale=TRUE, validation='CV')
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2203 2141 1939 1440 1424 1430
## adjCV 4288 2193 2141 1925 1427 1411 1416
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1406 1396 1390 1322 1306 1283 1271
## adjCV 1392 1384 1380 1310 1293 1269 1258
## 14 comps 15 comps 16 comps 17 comps
## CV 1268 1269 1269 1270
## adjCV 1256 1257 1256 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 37.84 85.43 93.39 97.36 98.78 99.17 99.32 99.97
## Apps 76.33 78.83 86.23 91.77 91.89 92.00 92.20 92.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 92.32 93.09 93.41 93.82 93.83 93.84 93.84
## 16 comps 17 comps
## X 100.00 100.00
## Apps 93.85 93.89
validationplot(pls.fit,val.type="MSEP")
pls.pred=predict(pls.fit,testing.set,ncomp=10)
mean((pls.pred-Apps.test)^2)
## [1] 1144822
Using M = 10 components, the test MSE for the PLS model is 1144822.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Ridge regression gives the lowest test MSE (about 906029), while the other four approaches all fall between roughly 1.1 and 1.2 million. Taking square roots, our predictions are typically off by about 950 to 1100 applications, so apart from ridge regression's modest edge there is not much difference among the five approaches.
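Collecting the five test errors in one place makes the comparison explicit (this assumes the prediction vectors from parts (b) through (f) are still in memory):
errs=c(OLS=mean((lm.pred-Apps.test)^2),
       Ridge=mean((ridge.pred-y.test)^2),
       Lasso=mean((lasso.pred-y.test)^2),
       PCR=mean((pcr.pred-Apps.test)^2),
       PLS=mean((pls.pred-Apps.test)^2))
sort(sqrt(errs)) # RMSE of each method, in numbers of applications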
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
library(leaps)
library(glmnet)
library(pls)
attach(Boston)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
set.seed(1)
train=sample(1:nrow(Boston), nrow(Boston)/2)
training.set=Boston[train,]
test=(-train)
testing.set=Boston[test,]
crim.test=Boston$crim[test]
lm.full=lm(crim~.,data=training.set)
lm.predict=predict(lm.full,testing.set)
lm.MSE=mean((lm.predict-crim.test)^2)
lm.MSE
## [1] 41.54639
Using the least squares method, the test MSE is 41.55.
regfit.best=regsubsets(crim~.,data=Boston[train,],nvmax=13)
test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
coefi=coef(regfit.best,id=i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
## [1] 40.14557 41.87706 42.40901 42.45745 43.03836 42.13258 41.25016 41.28957
## [9] 41.49271 41.50577 41.48839 41.46692 41.54639
which.min(val.errors)
## [1] 1
The validation-set error is minimized by the one-variable model, so best subset selection gives a test MSE of 40.15.
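The loop above computes predictions by hand because the leaps package does not provide a predict method for regsubsets objects. A reusable helper (the usual ISLR-style sketch, not part of leaps itself) tidies this up:
predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]]) # formula from the original regsubsets call
  mat=model.matrix(form,newdata) # design matrix for the new data
  coefi=coef(object,id=id) # coefficients of the best id-variable model
  mat[,names(coefi)]%*%coefi # predictions
}
With this defined, the loop body reduces to pred=predict(regfit.best,Boston[test,],id=i).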
x=model.matrix(crim~.,Boston)[,-1]
y=Boston$crim
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
set.seed(1)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.5919159
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred -y.test)^2)
## [1] 38.01174
Ridge regression gives a test MSE of 38.01.
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred - y.test)^2)
## [1] 40.99066
The lasso model gives a test MSE of 40.99.
set.seed(1)
library(pls)
pcr.fit = pcr(crim~., data=training.set, scale=TRUE, validation='CV')
summary(pcr.fit)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.305 7.267 7.254 7.234 7.130 7.115
## adjCV 9.275 7.300 7.259 7.246 7.226 7.121 7.107
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.795 6.763 6.743 6.694 6.694 6.696 6.700
## adjCV 6.782 6.753 6.732 6.682 6.682 6.684 6.686
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 81.27 97.18 99.08 99.72 99.89 99.93 99.97 99.99
## crim 39.26 40.61 40.87 41.25 43.24 43.95 48.88 49.34
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 100.00 100.00 100.00 100.00 100.00
## crim 49.94 50.86 50.93 50.98 51.37
validationplot(pcr.fit,val.type="MSEP")
pcr.pred=predict(pcr.fit,testing.set,ncomp=10)
mean((pcr.pred-crim.test)^2)
## [1] 41.97147
The PCR model with M = 10 components gives a test MSE of 41.97.
set.seed(1)
library(pls)
pls.fit = plsr(crim~., data=training.set, scale=TRUE, validation='CV')
summary(pls.fit)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.279 7.249 7.181 7.049 6.902 6.737
## adjCV 9.275 7.274 7.242 7.174 7.040 6.894 6.728
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.735 6.706 6.700 6.693 6.693 6.694 6.700
## adjCV 6.724 6.696 6.688 6.682 6.681 6.682 6.686
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 81.18 97.11 98.75 99.48 99.86 99.92 99.95 99.98
## crim 39.77 40.80 42.23 44.69 46.96 49.67 50.15 50.62
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 99.99 100.00 100.00 100.00 100.00
## crim 50.80 50.91 50.98 51.05 51.37
validationplot(pls.fit,val.type="MSEP")
pls.pred=predict(pls.fit,testing.set,ncomp=10)
mean((pls.pred-crim.test)^2)
## [1] 41.94719
The PLS model with M = 10 components gives a test MSE of 41.95.
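Collecting the validation-set MSEs from the approaches above (values copied from the outputs; best subset selection uses the one-variable model):
boston.mse=c(OLS=41.55,BestSubset=40.15,Ridge=38.01,Lasso=40.99,PCR=41.97,PLS=41.95)
sort(boston.mse) # ridge regression has the lowest validation-set MSE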
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.
The lasso model seems to perform well: its test MSE (40.99) is the lowest of the methods considered except for ridge regression (38.01), and unlike ridge regression, which never forces any coefficient to be exactly zero, the lasso yields a sparser, more interpretable model.
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:14,]
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox
## 1.269174e+01 3.628905e-02 -7.012417e-02 -5.842484e-01 -6.989138e+00
## rm dis rad tax ptratio
## 2.308650e-01 -7.886241e-01 5.146154e-01 -2.235567e-05 -1.884305e-01
## black lstat medv
## -7.544153e-03 1.248260e-01 -1.582852e-01
The lasso model gives a fit of:
\[
\begin{aligned}
\widehat{\text{crim}} = {} & 12.69 + 0.0363\,\text{zn} - 0.0701\,\text{indus} - 0.584\,\text{chas} - 6.99\,\text{nox} + 0.231\,\text{rm} \\
& - 0.789\,\text{dis} + 0.515\,\text{rad} - 2.24\times 10^{-5}\,\text{tax} - 0.188\,\text{ptratio} \\
& - 0.00754\,\text{black} + 0.125\,\text{lstat} - 0.158\,\text{medv}
\end{aligned}
\]
(c) Does your chosen model involve all of the features in the dataset? Why or why not?
The chosen model does not involve all of the features in the data set: the lasso sets the coefficient of the age variable to exactly zero, as the full coefficient printout shows.
lasso.coef
## (Intercept) zn indus chas nox
## 1.269174e+01 3.628905e-02 -7.012417e-02 -5.842484e-01 -6.989138e+00
## rm age dis rad tax
## 2.308650e-01 0.000000e+00 -7.886241e-01 5.146154e-01 -2.235567e-05
## ptratio black lstat medv
## -1.884305e-01 -7.544153e-03 1.248260e-01 -1.582852e-01
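If an even sparser model were preferred, cv.glmnet also reports lambda.1se, the largest λ whose cross-validation error is within one standard error of the minimum. A sketch reusing the cv.out object from the lasso fit above:
lasso.1se=predict(out,type="coefficients",s=cv.out$lambda.1se)[1:14,]
lasso.1se[lasso.1se!=0] # typically fewer non-zero coefficients than at lambda.min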