Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The answer is iii.: less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Relative to least squares, the lasso constrains the coefficient estimates (and can shrink some exactly to zero), so it is less flexible. The shrinkage reduces variance at the cost of a small increase in bias, so the lasso works best when the least squares fit has excessively high variance and shrinking the coefficients does not increase the bias too much. A practical advantage of shrinkage methods like the lasso and ridge regression is that only one model needs to be fit for each value of λ, and the computations are relatively simple.

(b) Repeat (a) for ridge regression relative to least squares.

The answer is again iii.: less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, ridge regression shrinks the coefficient estimates toward zero, trading a small increase in bias for a reduction in variance, and it is likewise cheap to fit. It therefore improves on least squares when the least squares estimates have high variance, for example when the number of predictors is large relative to the number of observations or the predictors are highly correlated.

(c) Repeat (a) for non-linear methods relative to least squares.

The answer is ii.: more flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods are more flexible than least squares: they can capture departures from linearity, which decreases bias at the cost of increased variance. They are therefore the better choice when the true relationship is substantially non-linear, so that the reduction in bias outweighs the increase in variance. A simulation sketch illustrating the tradeoff in parts (a) and (b) follows.
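As an illustration of parts (a) and (b), here is a minimal simulation sketch (not part of the original problem; the sample sizes and coefficients are arbitrary choices). With many noisy predictors and few observations, least squares has high variance, so the lasso's shrinkage lowers the test MSE even though it introduces some bias.

library(glmnet)
set.seed(1)
n=50; p=45
x.sim=matrix(rnorm(n*p),n,p)
beta=c(rep(1,5),rep(0,p-5))          # only 5 of the 45 predictors matter
y.sim=drop(x.sim%*%beta+rnorm(n))
x.new=matrix(rnorm(1000*p),1000,p)   # large test set
y.new=drop(x.new%*%beta+rnorm(1000))
ols.fit=lm(y.sim~x.sim)
ols.pred=cbind(1,x.new)%*%coef(ols.fit)   # least squares test predictions
lasso.cv=cv.glmnet(x.sim,y.sim,alpha=1)
lasso.pred=predict(lasso.cv,s="lambda.min",newx=x.new)
mean((ols.pred-y.new)^2)    # high-variance least squares fit
mean((lasso.pred-y.new)^2)  # typically far smaller: the less flexible method wins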

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)

(a) Split the data set into a training set and a test set.

set.seed(1)
train=sample(1:nrow(College),nrow(College)/2)
test=(-train)
Apps.test=Apps[test]
training.set=College[train,]
testing.set=College[test,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit = lm(Apps~., data=training.set)
lm.pred = predict(lm.fit, testing.set)
mean((lm.pred-Apps.test)^2)
## [1] 1135758

For the linear model fit by least squares, the test MSE is \(1135758\).

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet) # load the glmnet package
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))
## [1]  18 100

The coefficient matrix is \(18 \times 100\): one row for each coefficient (the intercept plus the 17 predictors) and one column for each of the 100 values of λ in the grid.
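To see what those columns mean, here is a quick sketch mirroring the ISLR lab (the column indices 50 and 90 are arbitrary): each column of coef(ridge.mod) holds the coefficient estimates for one value of λ, and larger λ means heavier shrinkage.

ridge.mod$lambda[50]   # a large lambda from the grid
coef(ridge.mod)[,50]   # coefficients heavily shrunken toward zero
ridge.mod$lambda[90]   # a much smaller lambda
coef(ridge.mod)[,90]   # coefficients much closer to the least squares fit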

set.seed(1)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 405.8404
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,]) # note: ridge.mod was fit on the full data above
mean((ridge.pred-y.test)^2)
## [1] 906029.4

With λ = 405.84 chosen by cross-validation, the test MSE is 906029.4, an improvement over least squares. We now refit the ridge regression model on the full data set and examine the coefficient estimates.

out=glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:18,]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -1.521486e+03 -5.295317e+02  9.743995e-01  4.712708e-01  2.486088e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  1.124189e+00  7.721190e-02  2.453206e-02 -2.100391e-02  1.999142e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  1.362569e-01 -9.049362e-03 -3.734037e+00 -4.696652e+00  1.280442e+01 
##   perc.alumni        Expend     Grad.Rate 
## -8.869487e+00  7.518826e-02  1.138097e+01

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

Using the model matrix x, response y, λ grid, and train/test split from (c):

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)

set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1116252
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -468.96036214 -491.47351867    1.57117027   -0.76858284   48.25732969 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -12.94716203    0.04293538    0.04406960   -0.08340724    0.14963683 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.01549548    0.02915128   -8.42538012   -3.26671319   14.61167079 
##   perc.alumni        Expend     Grad.Rate 
##   -0.02612797    0.07718333    8.31579869

At this λ, none of the coefficient estimates are zero, so the lasso retains all 17 predictors. The test MSE with λ chosen by cross-validation is 1116252, slightly better than the least squares test error. The non-zero count can also be checked programmatically, as sketched below.
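Rather than reading the 18 estimates off by eye, the number of non-zero coefficients can be counted directly:

sum(lasso.coef[-1]!=0)   # predictors retained by the lasso (17 here, per the output above)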

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
library(pls)
pcr.fit = pcr(Apps~., data=training.set, scale=TRUE, validation="CV")
summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     4171     2237     2162     2018     1444     1440
## adjCV         4288     4168     2231     2157     2016     1429     1427
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1425     1404     1407      1359      1326      1327      1339
## adjCV     1413     1391     1394      1347      1314      1314      1326
##        14 comps  15 comps  16 comps  17 comps
## CV         1335      1269      1269      1270
## adjCV      1322      1257      1256      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      51.812    87.28    95.83    97.47    98.79    99.45    99.94    99.98
## Apps    5.672    75.44    77.29    85.81    91.84    91.84    91.86    92.16
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X      100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Apps    92.18     92.68     93.05     93.06     93.07     93.15     93.82
##       16 comps  17 comps
## X       100.00    100.00
## Apps     93.85     93.89
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,testing.set,ncomp=10)
mean((pcr.pred-Apps.test)^2)
## [1] 1192180

The cross-validated RMSEP reaches its minimum at \(M = 15\) to \(16\) components (CV RMSEP 1269); with \(M = 10\) components, the PCR test MSE is 1192180.
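The CV-selected M can also be extracted programmatically instead of being read off the summary table (a sketch; the first entry of the RMSEP array is the intercept-only model with 0 components):

cv.rmsep=RMSEP(pcr.fit)$val["CV",1,]   # cross-validated RMSEP for 0 to 17 components
which.min(cv.rmsep)-1                  # the M with the lowest CV error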

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
library(pls)
pls.fit = plsr(Apps~., data=training.set, scale=TRUE, validation="CV")
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     2203     2141     1939     1440     1424     1430
## adjCV         4288     2193     2141     1925     1427     1411     1416
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1406     1396     1390      1322      1306      1283      1271
## adjCV     1392     1384     1380      1310      1293      1269      1258
##        14 comps  15 comps  16 comps  17 comps
## CV         1268      1269      1269      1270
## adjCV      1256      1257      1256      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       37.84    85.43    93.39    97.36    98.78    99.17    99.32    99.97
## Apps    76.33    78.83    86.23    91.77    91.89    92.00    92.20    92.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X      100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Apps    92.32     93.09     93.41     93.82     93.83     93.84     93.84
##       16 comps  17 comps
## X       100.00    100.00
## Apps     93.85     93.89
validationplot(pls.fit,val.type="MSEP")

pls.pred=predict(pls.fit,testing.set,ncomp=10)
mean((pls.pred-Apps.test)^2)
## [1] 1144822

The cross-validated RMSEP reaches its minimum at \(M = 14\) components (CV RMSEP 1268); with \(M = 10\) components, the PLS test MSE is 1144822.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Ridge regression gives the lowest test MSE (906029.4), while the other four methods all fall between roughly 1.1 and 1.2 million, so there is not a dramatic difference among the five approaches. On the scale of the response, the test RMSE (the square root of the test MSE) implies that predictions of the number of applications received are typically off by roughly 950 to 1100 applications. The test \(R^2\) values, sketched below, make the comparison more interpretable.
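A sketch for computing the test \(R^2\), the fraction of variance in the held-out responses that each model explains (note that Apps.test and y.test contain the same observations, since both splits were drawn with set.seed(1)):

r2=function(pred,actual) 1-mean((pred-actual)^2)/mean((mean(actual)-actual)^2)
c(ols=r2(lm.pred,Apps.test),
  ridge=r2(ridge.pred,y.test),
  lasso=r2(lasso.pred,y.test),
  pcr=r2(pcr.pred,Apps.test),
  pls=r2(pls.pred,Apps.test))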

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
library(leaps)
library(glmnet)
library(pls)
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

set.seed(1)
train=sample(1:nrow(Boston), nrow(Boston)/2)
training.set=Boston[train,]
test=(-train)
testing.set=Boston[test,]
crim.test=Boston$crim[test]
lm.full=lm(crim~.,data=training.set)
lm.predict=predict(lm.full,testing.set)
lm.MSE=mean((lm.predict-crim.test)^2)
lm.MSE
## [1] 41.54639

Using the least squares method, I get a test MSE of 41.54639.

regfit.best=regsubsets(crim~.,data=Boston[train,],nvmax=13)
test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
   coefi=coef(regfit.best,id=i)
   pred=test.mat[,names(coefi)]%*%coefi
   val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
##  [1] 40.14557 41.87706 42.40901 42.45745 43.03836 42.13258 41.25016 41.28957
##  [9] 41.49271 41.50577 41.48839 41.46692 41.54639
which.min(val.errors)
## [1] 1

The validation-set error is minimized by the one-variable model, so best subset selection gives a test MSE of 40.14557. Since regsubsets() has no predict() method, the loop above builds the test predictions by hand; a reusable helper is sketched below.
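For reuse, the manual prediction step can be wrapped in a predict() method for regsubsets objects, as in the ISLR lab:

predict.regsubsets=function(object,newdata,id,...){
   form=as.formula(object$call[[2]])   # recover the formula from the regsubsets call
   mat=model.matrix(form,newdata)
   coefi=coef(object,id=id)
   mat[,names(coefi)]%*%coefi
}
# usage: predict(regfit.best,Boston[test,],id=which.min(val.errors))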

x=model.matrix(crim~.,Boston)[,-1]
y=Boston$crim
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
set.seed(1)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 0.5919159
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,]) # note: ridge.mod was fit on the full data above
mean((ridge.pred-y.test)^2)
## [1] 38.01174

I get a test MSE of 38.01174 using ridge regression, with λ = 0.592 chosen by cross-validation.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)

set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 40.99066

The lasso model gives a test MSE of 40.99066.

set.seed(1)
library(pls)
pcr.fit = pcr(crim~., data=training.set, scale=TRUE, validation="CV")
summary(pcr.fit)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.305    7.267    7.254    7.234    7.130    7.115
## adjCV        9.275    7.300    7.259    7.246    7.226    7.121    7.107
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.795    6.763    6.743     6.694     6.694     6.696     6.700
## adjCV    6.782    6.753    6.732     6.682     6.682     6.684     6.686
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       81.27    97.18    99.08    99.72    99.89    99.93    99.97    99.99
## crim    39.26    40.61    40.87    41.25    43.24    43.95    48.88    49.34
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X      100.00    100.00    100.00    100.00    100.00
## crim    49.94     50.86     50.93     50.98     51.37
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,testing.set,ncomp=10)
mean((pcr.pred-crim.test)^2)
## [1] 41.97147

With \(M = 10\) components, the PCR model gives a test MSE of 41.97147.

set.seed(1)
library(pls)
pls.fit = plsr(crim~., data=training.set, scale=TRUE, validation="CV")
summary(pls.fit)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.279    7.249    7.181    7.049    6.902    6.737
## adjCV        9.275    7.274    7.242    7.174    7.040    6.894    6.728
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.735    6.706    6.700     6.693     6.693     6.694     6.700
## adjCV    6.724    6.696    6.688     6.682     6.681     6.682     6.686
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       81.18    97.11    98.75    99.48    99.86    99.92    99.95    99.98
## crim    39.77    40.80    42.23    44.69    46.96    49.67    50.15    50.62
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       99.99    100.00    100.00    100.00    100.00
## crim    50.80     50.91     50.98     51.05     51.37
validationplot(pls.fit,val.type="MSEP")

pls.pred=predict(pls.fit,testing.set,ncomp=10)
mean((pls.pred-crim.test)^2)
## [1] 41.94719

With \(M = 10\) components, the PLS model gives a test MSE of 41.94719. A side-by-side comparison of all six validation-set errors follows.
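Collecting the validation-set MSEs reported above into one table for comparison (values copied from the output above, rounded):

data.frame(model=c("least squares","best subset (1 var)","ridge","lasso","PCR (M=10)","PLS (M=10)"),
           test.MSE=c(41.55,40.15,38.01,40.99,41.97,41.95))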

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

The lasso is a good model to propose: its validation-set MSE (40.99) is bettered only by ridge regression (38.01) and the one-variable best subset model (40.15), and unlike ridge regression the lasso performs variable selection, since ridge never forces any coefficient to be exactly zero. Refitting the lasso on the full data set:

out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:14,]
lasso.coef[lasso.coef!=0]
##   (Intercept)            zn         indus          chas           nox 
##  1.269174e+01  3.628905e-02 -7.012417e-02 -5.842484e-01 -6.989138e+00 
##            rm           dis           rad           tax       ptratio 
##  2.308650e-01 -7.886241e-01  5.146154e-01 -2.235567e-05 -1.884305e-01 
##         black         lstat          medv 
## -7.544153e-03  1.248260e-01 -1.582852e-01

The lasso model gives a fit of:

\[
\begin{aligned}
\widehat{\text{crim}} = {} & 12.69 + 0.0363\,\text{zn} - 0.0701\,\text{indus} - 0.584\,\text{chas} - 6.99\,\text{nox} + 0.231\,\text{rm} \\
& - 0.789\,\text{dis} + 0.515\,\text{rad} - 0.0000224\,\text{tax} - 0.188\,\text{ptratio} \\
& - 0.00754\,\text{black} + 0.125\,\text{lstat} - 0.158\,\text{medv}
\end{aligned}
\]

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No: the chosen model does not involve all of the features in the data set, because the lasso shrank the coefficient of the age variable exactly to zero:

lasso.coef
##   (Intercept)            zn         indus          chas           nox 
##  1.269174e+01  3.628905e-02 -7.012417e-02 -5.842484e-01 -6.989138e+00 
##            rm           age           dis           rad           tax 
##  2.308650e-01  0.000000e+00 -7.886241e-01  5.146154e-01 -2.235567e-05 
##       ptratio         black         lstat          medv 
## -1.884305e-01 -7.544153e-03  1.248260e-01 -1.582852e-01