2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. is correct. The lasso is less flexible than least squares: by shrinking coefficient estimates, and setting some exactly to zero, it removes variables that contribute little to the model, reducing variance at the cost of a modest increase in bias. Prediction accuracy therefore improves whenever the increase in bias is smaller than the decrease in variance.
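
As a quick illustration of this shrinkage (a sketch on simulated data, not part of the assigned analysis; it assumes the glmnet package used later in this assignment):

library(glmnet)
set.seed(1)
# Simulated data: 10 predictors, only the first two truly matter
sim.x=matrix(rnorm(100*10),100,10)
sim.y=3*sim.x[,1]-2*sim.x[,2]+rnorm(100)
sim.fit=glmnet(sim.x,sim.y,alpha=1)
plot(sim.fit,xvar="lambda",label=TRUE)  # coefficients shrink toward 0 as lambda grows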

(b) Repeat (a) for ridge regression relative to least squares.

iii. is also the correct answer here. Ridge regression likewise shrinks coefficient estimates, trading a small increase in bias for a reduction in variance, although unlike the lasso it never sets coefficients exactly to zero.

(c) Repeat (a) for non-linear methods relative to least squares.

The answer here is ii. Non-linear methods are more flexible than least squares, so they have lower bias but higher variance. We can expect better prediction accuracy when the increase in variance is smaller than the decrease in bias.

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
attach(College)
# Build the predictor matrix and response; note that College[,-1] drops the
# Private column, so the glmnet fits below do not use that predictor
x=model.matrix(Apps~.,College[,-1])
y=College$Apps
set.seed(5)
train=sample(1:nrow(x),nrow(x)/2)  # half the row indices for training
test=(-train)                      # negative indices select the remaining rows
college.train=College[train,]
college.test=College[test,]
y.test=y[test]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit=lm(Apps~.,data=College,subset=train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5285.7  -370.1   -24.5   320.6  7113.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.700e+02  5.882e+02  -1.649 0.099959 .  
## PrivateYes  -5.556e+02  1.928e+02  -2.882 0.004177 ** 
## Accept       1.760e+00  5.268e-02  33.412  < 2e-16 ***
## Enroll      -1.703e+00  2.636e-01  -6.459 3.32e-10 ***
## Top10perc    4.956e+01  7.893e+00   6.278 9.59e-10 ***
## Top25perc   -1.470e+01  6.463e+00  -2.274 0.023517 *  
## F.Undergrad  1.342e-01  4.301e-02   3.121 0.001946 ** 
## P.Undergrad -1.592e-03  5.789e-02  -0.027 0.978081    
## Outstate    -9.801e-02  2.807e-02  -3.491 0.000539 ***
## Room.Board   1.418e-01  6.855e-02   2.069 0.039242 *  
## Books       -2.290e-01  3.113e-01  -0.736 0.462438    
## Personal     2.066e-01  9.323e-02   2.216 0.027277 *  
## PhD         -1.528e+01  7.015e+00  -2.178 0.030023 *  
## Terminal     2.974e+00  7.325e+00   0.406 0.685005    
## S.F.Ratio    4.376e+01  2.104e+01   2.080 0.038195 *  
## perc.alumni  7.724e-01  5.947e+00   0.130 0.896732    
## Expend       9.610e-02  2.209e-02   4.351 1.75e-05 ***
## Grad.Rate    1.054e+01  4.204e+00   2.507 0.012605 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1064 on 370 degrees of freedom
## Multiple R-squared:  0.9344, Adjusted R-squared:  0.9314 
## F-statistic: 310.1 on 17 and 370 DF,  p-value: < 2.2e-16
pred.apps=predict(lm.fit,college.test)
mean((college.test$Apps-pred.apps)^2)
## [1] 1166539

The test error obtained here is 1,166,539.
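
As a complementary accuracy measure (an addition not required by the exercise, computed from the same held-out predictions), the test-set R squared is:

# Fraction of variance in the held-out Apps values explained by the linear model
1-mean((college.test$Apps-pred.apps)^2)/mean((college.test$Apps-mean(college.test$Apps))^2)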

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
grid=10^seq(10,-2,length=100)  # grid of lambda values from 10^10 down to 10^-2
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)  # alpha=0 fits ridge regression
summary(ridge.mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv.college.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 384.3366
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 1077128

The test error obtained here is 1,077,128, a modest improvement over least squares; the best λ chosen by cross-validation is 384.34.
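
To see how λ was chosen, the cross-validation curve can be plotted. cv.glmnet also stores lambda.1se, the largest λ within one standard error of the minimum, a common more conservative choice; a quick sketch:

plot(cv.college.out)       # CV MSE as a function of log(lambda)
cv.college.out$lambda.1se  # more heavily regularized alternative to lambda.min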

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)  # alpha=1 fits the lasso
summary(lasso.mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv.college.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 2.051089
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1153439

Here the best λ chosen by cross-validation was 2.05, and the test error for this model is 1,153,439.

out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
##   (Intercept)        Accept        Enroll     Top10perc     Top25perc 
## -921.13191496    1.57619429   -0.77470826   47.93717208  -13.04948741 
##   F.Undergrad   P.Undergrad      Outstate    Room.Board      Personal 
##    0.05804280    0.04945985   -0.10661855    0.13342723    0.02873374 
##           PhD      Terminal     S.F.Ratio   perc.alumni        Expend 
##   -6.48796215   -1.35545596   22.11969735   -1.35217989    0.08114511 
##     Grad.Rate 
##    7.68185047

The number of non-zero coefficient estimates is 15, not counting the intercept; Books is the only predictor dropped.
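
Rather than counting entries by eye, the non-zero estimates can be tallied directly from the vector printed above:

sum(lasso.coef!=0)-1  # subtract 1 for the intercept; gives 15 here, since only Books is dropped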

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr.college=pcr(Apps~.,data=college.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4068     4057     2217     2217     2156     2011     1885
## adjCV         4068     4058     2211     2212     2139     1985     1876
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1889     1871     1822      1810      1853      1853      1880
## adjCV     1880     1839     1812      1799      1844      1844      1871
##        14 comps  15 comps  16 comps  17 comps
## CV         1878      1526      1426      1272
## adjCV      1903      1501      1408      1258
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      33.325    57.73    65.26    70.81    76.08    81.02    84.91    87.77
## Apps    1.007    72.25    72.60    75.39    79.89    81.19    81.23    82.43
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.61     92.97     95.08     96.70     97.72     98.54     99.31
## Apps    82.76     83.18     83.21     83.27     83.75     83.79     91.27
##       16 comps  17 comps
## X        99.79    100.00
## Apps     92.11     93.44
validationplot(pcr.college,val.type="MSEP")

pcr.pred=predict(pcr.college,college.test,ncomp=10)  # predict on the held-out data frame, which has the columns the model was fit with
mean((pcr.pred-y.test)^2)
## [1] 1647713

The cross-validation error is actually smallest at M = 17, where PCR is equivalent to least squares, but it has a clear local minimum at M = 10, which is the value I used; the test error with M = 10 is 1,647,713.
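
The pls package also provides selectNcomp() for choosing M programmatically. A sketch using the one-sigma heuristic, which may pick a smaller M than either the raw CV minimum or the eyeballed local minimum above:

# Smallest M whose CV error is within one standard error of the overall minimum
selectNcomp(pcr.college,method="onesigma",plot=TRUE)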

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.college=plsr(Apps~.,data=college.train,scale=TRUE,validation="CV")
summary(pls.college)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4068     2039     1910     1769     1680     1553     1362
## adjCV         4068     2033     1904     1752     1645     1523     1344
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1321     1295     1289      1293      1290      1284      1283
## adjCV     1306     1280     1274      1278      1275      1269      1269
##        14 comps  15 comps  16 comps  17 comps
## CV         1283      1283      1283      1283
## adjCV      1269      1269      1269      1269
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.51    42.39    62.71    65.69    70.93    74.26    78.53    80.36
## Apps    76.66    82.87    86.43    90.72    92.18    92.86    92.99    93.26
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.07     86.56     88.83     91.66     93.11     94.39     96.61
## Apps    93.37     93.41     93.43     93.44     93.44     93.44     93.44
##       16 comps  17 comps
## X        99.13    100.00
## Apps     93.44     93.44
validationplot(pls.college,val.type="MSEP")

pls.pred=predict(pls.college,college.test,ncomp=9)  # predict on the held-out data frame
mean((pls.pred-y.test)^2)
## [1] 1127857

The CV error flattens out from about M = 9 onward, so I selected M = 9 as the most parsimonious choice; the test error is 1,127,857.
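
The raw CV minimum can also be read off programmatically from the RMSEP values, though it may disagree with the elbow chosen by eye:

# Entry 1 is the intercept-only model, so subtract 1 to convert the index to M
cv.rmsep=RMSEP(pls.college,estimate="CV")$val[1,1,]
which.min(cv.rmsep)-1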

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The test errors I got from the five approaches are as follows:

Least squares: 1,166,539
Ridge regression: 1,077,128
Lasso: 1,153,439
PCR: 1,647,713
PLS: 1,127,857

The ridge regression model has the lowest test error at 1,077,128, so of the five it is perhaps the one to go with. Apart from PCR, the test errors are fairly close to one another, and the least squares model already achieves a training R squared of about 93%, so I would say yes, these models can predict the number of college applications received reasonably accurately.
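
For a side-by-side view, the five reported test errors can be collected and sorted (numbers copied from the results above):

test.errors=c(OLS=1166539,Ridge=1077128,Lasso=1153439,PCR=1647713,PLS=1127857)
sort(test.errors)  # ridge lowest, PCR highest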

11. We will now try to predict per capita crime rate in the Boston dataset.

detach(College)
library(leaps)
library(MASS)
set.seed(5)
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

#Best Subset Selection
set.seed(5)
# train is a logical vector here, so the test set is its logical complement
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(!train)
regfit.Boston=regsubsets(crim~.,data=Boston[train,],nvmax=13)
test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
  coefi=coef(regfit.Boston,id=i)
  pred=test.mat[,names(coefi)]%*%coefi  # manual prediction: X_test %*% beta
  val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
##  [1] 45.61825 43.73050 43.81836 43.11088 42.24088 42.40160 42.02448 42.04104
##  [9] 42.51405 42.50313 42.47811 42.41412 42.20707
which.min(val.errors)
## [1] 7
coef(regfit.Boston,7)
## (Intercept)          zn       indus          rm         dis         rad 
## -2.30177611  0.04639411 -0.19424726  1.49536825 -0.82113681  0.58240665 
##       lstat        medv 
##  0.14640159 -0.25988540

Here I used best subset selection on the Boston data with the validation set approach. The 7-variable model is best, with the lowest validation error at 42.02; the variables it chose are zn, indus, rm, dis, rad, lstat, and medv.
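
The manual matrix multiplication in the loop above is needed because regsubsets() has no predict() method. A reusable helper along the lines of the one in ISLR could replace it; a sketch:

# Generic predict() for regsubsets objects
predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]])  # recover the formula from the original call
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  mat[,names(coefi)]%*%coefi
}
pred7=predict(regfit.Boston,Boston[test,],id=7)  # reproduces the id=7 pass above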

#Ridge Regression
set.seed(5)
# crim is found via attach(Boston); as with College above, the intercept
# column from model.matrix is left in x
x=model.matrix(crim~.,Boston[,-1])
y=Boston$crim
grid=10^seq(10,-2,length=100)
train=sample(1:nrow(x),nrow(x)/2)  # back to an index-based split
test=(-train)
y.test=y[test]
cv.out.Boston=cv.glmnet(x[train,],y[train],alpha=0)
bestlam.Boston=cv.out.Boston$lambda.min
bestlam.Boston
## [1] 0.600269
ridge.Boston=glmnet(x[train,],y[train],alpha=0,lambda=grid,thresh=1e-12)
ridge.pred.Boston=predict(ridge.Boston,s=bestlam.Boston,newx=x[test,])
mean((ridge.pred.Boston-y.test)^2)
## [1] 20.87974

We can see here that the best λ selected by cross-validation on the training set is 0.6, and the test error is 20.88. Now we will run a ridge regression on the whole data set.

#Ridge Regression on the whole data set
realridge.Boston=glmnet(x,y,alpha=0)
# Use the Boston-specific lambda (0.6); note that the [1:13,] index stops
# short of the last two rows, so the lstat and medv estimates are not printed
realridge.pred=predict(realridge.Boston,type='coefficients',s=bestlam.Boston)[1:13,]
realridge.pred
##  (Intercept)  (Intercept)           zn        indus         chas          nox 
##  2.655874373  0.000000000  0.022581911 -0.044047690 -0.775627763 -0.780512240 
##           rm          age          dis          rad          tax      ptratio 
##  0.179615519  0.004589786 -0.414468222  0.294548154  0.006484046 -0.001051838 
##        black 
## -0.008999011

Now we can see the coefficient estimates from fitting ridge regression on the whole data set with our best λ of 0.6. The second "(Intercept)" row, with a coefficient of zero, is the intercept column that model.matrix left in x.

#The Lasso
set.seed(5)
lasso.Boston=glmnet(x[train,],y[train],alpha=1,lambda=grid)
cv.out.lasso=cv.glmnet(x[train,],y[train],alpha=1)
bestlam.lasso=cv.out.lasso$lambda.min
bestlam.lasso
## [1] 0.1452728
lasso.pred.Boston=predict(lasso.Boston,s=bestlam.lasso,newx=x[test,])
mean((lasso.pred.Boston-y.test)^2)
## [1] 18.75912

Here we can see that the best λ selected by cross-validation on the training set is 0.145, and the test error, 18.76, is lower than ridge regression's. Now we will run the lasso on the whole data set.

#The Lasso 
reallasso.Boston=glmnet(x,y,alpha=1,lambda=grid)
reallasso.pred=predict(reallasso.Boston,type='coefficients',s=bestlam.lasso)[1:13,]
reallasso.pred
## (Intercept) (Intercept)          zn       indus        chas         nox 
##  -0.4325120   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##          rm         age         dis         rad         tax     ptratio 
##   0.0000000   0.0000000   0.0000000   0.3641029   0.0000000   0.0000000 
##       black 
##   0.0000000

Here are the coefficient estimates for the lasso fit to the whole data set. Nearly all of the variables were zeroed out by this lasso model; only rad survives (and, as before, the [1:13,] index hides the lstat and medv rows).
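
The count of surviving coefficients can be confirmed directly from the printed vector:

sum(reallasso.pred!=0)-1  # subtract 1 for the intercept; only rad remains here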

#PCR
set.seed(5)
pcr.Boston=pcr(crim~.,data=Boston,scale=TRUE,validation="CV")
summary(pcr.Boston)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.211    7.204    6.748    6.739    6.743    6.790
## adjCV         8.61    7.208    7.201    6.744    6.733    6.739    6.784
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.792    6.654    6.676     6.671     6.663     6.606     6.527
## adjCV    6.785    6.647    6.668     6.662     6.654     6.597     6.518
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

Using all 13 components explains 45.4% of the variance in crim, and the CV error is also lowest at 13 components, where PCR is equivalent to least squares. We will fit the PCR on the training set, draw a validation plot, and see whether anything changes.

#Fitting PCR on the training set
set.seed(5)
pcr.Boston.again=pcr(crim~.,data=Boston[train,],scale=TRUE,validation="CV")
validationplot(pcr.Boston.again,val.type="MSEP")

pcr.pred.Boston=predict(pcr.Boston.again,Boston[test,],ncomp=13)
mean((pcr.pred.Boston-Boston$crim[test])^2)
## [1] 23.06438

On the training set, 13 components again appeared best, and with ncomp=13 the test error is 23.06. I will try again with 8 components just to see how it does.

#Training PCR with ncomp=8
pcr.pred.Boston=predict(pcr.Boston.again,Boston[test,],ncomp=8)
mean((pcr.pred.Boston-Boston$crim[test])^2)
## [1] 21.30832

At ncomp=8 the test error is actually lower, at 21.31.
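
Rather than spot-checking individual values of M, the test error can be computed for every M in one pass; a sketch:

# Test MSE for each number of components, to see where it bottoms out
pcr.errors=rep(NA,13)
for(i in 1:13){
  pred=predict(pcr.Boston.again,Boston[test,],ncomp=i)
  pcr.errors[i]=mean((pred-Boston$crim[test])^2)
}
which.min(pcr.errors)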

(b) Propose a model (or set of models) that seem to perform well on this dataset, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

I would suggest the lasso model, as it has the lowest test error at 18.76. I am not entirely sold on it, because it zeroed out every variable except rad, but perhaps rad really is a strong predictor of per capita crime rate; it also has the largest positive coefficient in the ridge fit. As a second choice I would suggest the ridge model, which has the next-lowest test error at 20.88. PCR with ncomp=8 gets an honorable mention, with a test error of 21.31.

(c) Does your chosen model involve all of the features in the dataset? Why or why not?

No, it does not: the lasso model set almost all of the coefficient estimates to 0, effectively removing those variables from the model.