1. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
  1. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Lasso limits the number of predictors, thus it reduces the inherent variance at the cost of an increase in bias
  1. Repeat (a) for ridge regression relative to least squares. The same as part a, because it is less flexible hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  2. Repeat (a) for non-linear methods relative to least squares.

  1. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. We choose ii because it is more flexible and will give improved prediction accuracy when its increase in variance is less than its decrease in bias
  1. In this exercise, we will predict the number of applications received using the other variables in the College data set.
  1. Split the data set into a training set and a test set.
attach(College)
set.seed(1)
index = sample(1:nrow(College), 0.7*nrow(College))
college.train = College[index,]
college.test = College[-index,]
  1. Fit a linear model using least squares on the training set, and report the test error obtained.

We find the test error rate to be 1261630

lm.apps = lm(Apps ~ ., college.train)
predict.apps = predict(lm.apps, college.test)
test.error = mean((college.test$Apps-predict.apps)^2)
test.error
## [1] 1261630
  1. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

Here we get a test error of 1261271

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
matrix.train = model.matrix(Apps~.,data=college.train)
matrix.test = model.matrix(Apps~.,data=college.test)
grid.college = 10^seq(2,-2,length=100)
ridge.college = glmnet(matrix.train,college.train$Apps,alpha=0,lambda=grid.college,thresh = 1e-12)

college.cv.ridge = cv.glmnet(matrix.train,college.train$Apps,alpha=0,lambda=grid.college, thresh = 1e-12)

college.bestlam.ridge = college.cv.ridge$lambda.min
college.bestlam.ridge
## [1] 0.01
college.pred.newridge = predict(ridge.college,s = college.bestlam.ridge,newx = matrix.test)

ridge.mse.college = mean((college.test$Apps-college.pred.newridge)^2)
ridge.mse.college
## [1] 1261598
  1. Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

Test error rate for cross validation comes out to 1261591

lasso = glmnet(matrix.train,college.train$Apps,alpha=1,lambda=grid.college,thresh = 1e-12)

cv.lasso = cv.glmnet(matrix.train,college.train$Apps,alpha=1,lambda=grid.college,thresh=1e-12)

best.lasso = cv.lasso$lambda.min
best.lasso
## [1] 0.01
pred.lasso = predict(lasso,s=best.lasso,newx =matrix.test)
mean((college.test$Apps-pred.lasso)^2)
## [1] 1261591
predict(lasso,s=best.lasso,type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -5.377440e+02
## (Intercept)  .           
## PrivateYes  -5.044661e+02
## Accept       1.722328e+00
## Enroll      -1.054309e+00
## Top10perc    5.357483e+01
## Top25perc   -1.613811e+01
## F.Undergrad  2.961764e-02
## P.Undergrad  7.161170e-02
## Outstate    -8.840082e-02
## Room.Board   1.630175e-01
## Books        2.725806e-01
## Personal    -7.289853e-03
## PhD         -9.674379e+00
## Terminal    -3.774252e-01
## S.F.Ratio    1.626145e+01
## perc.alumni  2.354532e+00
## Expend       5.986089e-02
## Grad.Rate    7.157181e+00
  1. Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

With PCR model on the training set we return a test error of 2067220

library(pls)
## Warning: package 'pls' was built under R version 4.0.4
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcrmodel = pcr(Apps~.,data=college.train,scale=TRUE,validation="CV")
summary(pcrmodel)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     3793     2133     2152     1835     1705     1719
## adjCV         3895     3794     2129     2153     1808     1695     1712
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1719     1670     1646      1639      1642      1651      1646
## adjCV     1713     1658     1639      1632      1635      1643      1639
##        14 comps  15 comps  16 comps  17 comps
## CV         1647      1621      1216      1197
## adjCV      1640      1603      1205      1185
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.051    57.00    64.42    70.27    75.65    80.65    84.26    87.61
## Apps    5.788    71.69    71.70    80.97    82.60    82.60    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.58     92.84     94.93     96.74     97.82     98.72     99.39
## Apps    84.55     84.82     84.86     84.86     85.01     85.05     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.03     93.32
predict.pcr = predict(pcrmodel,college.test,ncomp=6)
mean((college.test$Apps-predict.pcr)^2)
## [1] 2067220
  1. Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

Fitting a PLS model on a training set we return a test error of 1283996

plsr.model = plsr(Apps~.,data=college.train,scale=TRUE,validation="CV")

predict.plsr<-predict(plsr.model,college.test,ncomp=6)
mean((college.test$Apps-predict.plsr)^2)
## [1] 1283996
  1. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

There isn’t much different between any of the five approaches. Overall we see that we return the highest accuracy with ridge regression.

test.avg <- mean(college.test$Apps)

lm <- 1 - mean((predict.apps - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)

ridge <- 1 - mean((college.pred.newridge - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)

lasso <- 1 - mean((pred.lasso - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)

pcr <- 1 - mean((predict.pcr - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)

pls <- 1 - mean((predict.plsr - college.test$Apps)^2) / mean((test.avg - college.test$Apps)^2)

lm
## [1] 0.9134458
ridge
## [1] 0.913448
lasso
## [1] 0.9134485
pcr
## [1] 0.8581783
pls
## [1] 0.9119114
  1. We will now try to predict per-capita crime rate in the Boston data set.
  1. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
attach(Boston)
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
index = sample(1:nrow(Boston), 0.75*nrow(Boston))
boston.train = Boston[index,]
boston.test = Boston[-index,]

We’ll first start with Ridge Regression. Looking at the mse we return a value of 67.96, which at fist glance might seem high, but as it is our first model examining we’ll need to compare it with the others.

boston.x = model.matrix(crim~.,data=boston.train)
boston.y = model.matrix(crim~.,data=boston.test)
grid = 10^seq(4,-2,length=100)
ridge.model = glmnet(boston.x,boston.train$crim,alpha=0,lambda=grid)

bos.cv.ridge<-cv.glmnet(boston.x,boston.train$crim,alpha=0,lambda=grid)

boston.bestlam.ridge<-bos.cv.ridge$lambda.min
boston.bestlam.ridge
## [1] 0.1873817
ridge.model.bos = glmnet(boston.x,boston.train$crim,alpha=0,lambda=boston.bestlam.ridge)
coef(ridge.model.bos)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) 11.2629302279
## (Intercept)  .           
## zn           0.0305370585
## indus       -0.0824127219
## chas        -0.5553656195
## nox         -5.2788768616
## rm           0.1767109007
## age         -0.0009530483
## dis         -0.6099655904
## rad          0.4452240286
## tax          0.0008107787
## ptratio     -0.2203982049
## black       -0.0108054534
## lstat        0.2037272188
## medv        -0.0992378183
boston.pred.newridge = predict(ridge.model,s=boston.bestlam.ridge,newx=boston.y)

ridge.mse = mean((boston.test$crim-boston.pred.newridge)^2)
ridge.mse
## [1] 67.96956

Next we’ll look at Lasso. After computing the mse we return a value of 68.05, which in comparison to our Ridge regression model is only slightly lower.

lasso.model = glmnet(boston.x,boston.train$crim,alpha=0,lambda=grid)

bos.cv.lasso = cv.glmnet(boston.x,boston.train$crim,alpha=0,lambda=grid)

bos.bestlam.lasso<-bos.cv.lasso$lambda.min
bos.bestlam.lasso
## [1] 0.2154435
lasso.model.bos = glmnet(boston.x,boston.train$crim,alpha=0,lambda=bos.bestlam.lasso)
coef(lasso.model.bos)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) 10.9109334208
## (Intercept)  .           
## zn           0.0301534426
## indus       -0.0827857455
## chas        -0.5538777338
## nox         -5.0631981876
## rm           0.1715961065
## age         -0.0008422032
## dis         -0.5978626881
## rad          0.4374621932
## tax          0.0011360955
## ptratio     -0.2135348396
## black       -0.0108410516
## lstat        0.2034722538
## medv        -0.0970580327
boston.pred.las = predict(lasso.model,s=bos.bestlam.lasso,newx=boston.y)

lasso.mse<-mean((boston.test$crim-boston.pred.las)^2)
lasso.mse
## [1] 68.05862

For our next model we’ll look at PCR. We return a mse of 73.48, slightly higher than our Lasso and Ridge Regression model.

pcr.model.bos = pcr(crim~.,data=boston.train,scale=TRUE,validation="CV")
summary(pcr.model.bos)
## Data:    X dimension: 379 13 
##  Y dimension: 379 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           7.897    6.531    6.543    6.085    6.040    6.063    6.099
## adjCV        7.897    6.528    6.539    6.079    6.031    6.055    6.090
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.079    5.943    5.930     5.942     5.951     5.953     5.909
## adjCV    6.070    5.932    5.918     5.930     5.939     5.940     5.895
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.38    61.00    70.43    77.12    83.29    88.16    91.41    93.66
## crim    32.67    32.85    42.57    43.46    43.46    43.55    43.96    46.49
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.45     97.10     98.44     99.50    100.00
## crim    46.91     47.19     47.21     47.63     48.67
predict.pcr.bos = predict(pcr.model.bos,boston.test,ncomp=5)
mean((boston.test$crim-predict.pcr.bos)^2)
## [1] 73.48524
pcr.fit.bos = pcr(crim ~ .,data = boston.train,scale=TRUE,ncomp=5)
summary(pcr.fit.bos)
## Data:    X dimension: 379 13 
##  Y dimension: 379 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       48.38    61.00    70.43    77.12    83.29
## crim    32.67    32.85    42.57    43.46    43.46

Lastly, we’ll be looking at PLS. For PLS we return a 67.37 mse which is our lowest out of the other models. So it would ultimately appear that the PLS model preforms the best with the data.

pls.model.bos = plsr(crim~.,data=boston.train,scale=TRUE,validation="CV")

predict.pls.bos = predict(pls.model.bos,boston.test,ncomp=5)
mean((boston.test$crim-predict.pls.bos)^2)
## [1] 68.37759
pls.fit.bos = plsr(crim ~ ., data = boston.train, scale = TRUE, ncomp = 5)
summary(pls.fit.bos)
## Data:    X dimension: 379 13 
##  Y dimension: 379 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       47.89    57.54    62.26    71.53    78.23
## crim    36.89    45.51    47.55    48.04    48.27
  1. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

As mentioned previously after performing Ridge Regression, Lasso, Principle Component Regression, and Partial Least Squares that we return the lowest validation set error of 67.37 from the PLS model, which would be our proposal for the best model to run on the data.

  1. Does your chosen model involve all of the features in the data set? Why or why not?

We ended up using the entire model but decided on 5 components as they accounted for a little over 80% of the model variation in the PCR model.