2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is:
(iii) The lasso is less flexible than least squares, because its penalty shrinks the coefficient estimates toward zero and can set some of them exactly to zero. It therefore gives improved prediction accuracy when its increase in bias is less than its decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
(iii) Like the lasso, ridge regression is less flexible than least squares, because its penalty shrinks the coefficient estimates toward zero. It therefore gives improved prediction accuracy when its increase in bias is less than its decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
(ii) Non-linear methods are more flexible than least squares, because they make fewer assumptions about the form of the relationship. They therefore give improved prediction accuracy when their increase in variance is less than their decrease in bias.
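The bias-variance trade-off behind answers (a) and (b) can be illustrated with a small base-R simulation. This is a minimal sketch and not part of the exercise: the data-generating model, the sample sizes, and the penalty value 5 are arbitrary assumptions, and ridge is fitted through its closed form without an intercept rather than with glmnet.

```r
# Assumed setup: 20 noisy training observations, 10 predictors, no intercept.
set.seed(42)
n <- 20
p <- 10
beta_true <- rep(0.5, p)

ridge_coef <- function(X, y, lambda) {
  # Closed-form ridge solution; lambda = 0 gives ordinary least squares.
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

one_rep <- function(lambda) {
  # Draw a fresh training set and a large test set from the same model.
  X <- matrix(rnorm(n * p), n, p)
  y <- X %*% beta_true + rnorm(n, sd = 3)
  X_new <- matrix(rnorm(1000 * p), 1000, p)
  y_new <- X_new %*% beta_true + rnorm(1000, sd = 3)
  b <- ridge_coef(X, y, lambda)
  mean((y_new - X_new %*% b)^2)
}

# Average test MSE over repeated training sets for each method.
mse_ols   <- mean(replicate(200, one_rep(0)))
mse_ridge <- mean(replicate(200, one_rep(5)))
c(least_squares = mse_ols, ridge = mse_ridge)
```

Whether the penalized fit comes out ahead depends on the assumed noise level and penalty; with a weak penalty or very little noise the added bias can outweigh the variance reduction.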
9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
set.seed(1)
(a) Split the data set into a training set and a test set.
# Use half of the observations (rounded down) for training and the rest for testing
train.x = floor(nrow(College) / 2)
train = sample(1:nrow(College), train.x)
test = -train
train.College = College[train, ]
test.College = College[test, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit= lm(Apps~., data = train.College)
lm.pred = predict(lm.fit, test.College)
mean((test.College[,"Apps"]- lm.pred)^2)
## [1] 1135758
The test MSE for the least squares fit is 1135758.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
train.mat=model.matrix(Apps~., data = train.College)
test.mat=model.matrix(Apps~., data = test.College)
grid = 10^ seq(4,-2, length=100)
ridge.mod = cv.glmnet(train.mat, train.College[,"Apps"],alpha=0, lambda = grid, thresh = 1e-12)
lambda.best = ridge.mod$lambda.min
ridge.pred = predict(ridge.mod, newx = test.mat, s=lambda.best)
mean((test.College[,"Apps"]- ridge.pred)^2)
## [1] 1135714
The test MSE for ridge regression is 1135714, which is barely smaller than that of the least squares fit (1135758).
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod = cv.glmnet(train.mat, train.College[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = lasso.mod$lambda.min
lasso.pred=predict(lasso.mod, newx = test.mat, s=lambda.best)
mean((test.College[ , "Apps"]- lasso.pred)^2)
## [1] 1135660
The test MSE for the lasso is 1135660, slightly lower than both ridge regression (1135714) and least squares (1135758).
lasso.mod = glmnet(model.matrix(Apps~., data = College), College[ , "Apps"], alpha = 1)
predict(lasso.mod, s=lambda.best, type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -471.39372069
## (Intercept) .
## PrivateYes -491.04485135
## Accept 1.57033288
## Enroll -0.75961467
## Top10perc 48.14698891
## Top25perc -12.84690694
## F.Undergrad 0.04149116
## P.Undergrad 0.04438973
## Outstate -0.08328388
## Room.Board 0.14943472
## Books 0.01532293
## Personal 0.02909954
## PhD -8.39597537
## Terminal -3.26800340
## S.F.Ratio 14.59298267
## perc.alumni -0.04404771
## Expend 0.07712632
## Grad.Rate 8.28950241
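At this value of λ, with the lasso refit on the full data set, all 17 predictors have non-zero coefficient estimates. The only zero entry is the second (Intercept) row, which is the constant column added by model.matrix. Several estimates, such as those for Books and perc.alumni, are close to zero on the raw scale of the predictors.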
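The count of non-zero estimates asked for in (d) can also be computed rather than read off the printout. The helper below is a sketch: count_nonzero is a hypothetical name, and the toy matrix is made-up data with the same layout as the output above (one column, row names, a duplicated intercept row), not the College fit.

```r
# Count non-zero slope estimates in a one-column coefficient matrix,
# ignoring every row named "(Intercept)".
count_nonzero <- function(coef_mat) {
  est <- as.matrix(coef_mat)[, 1]
  slopes <- est[names(est) != "(Intercept)"]
  sum(slopes != 0)
}

# Hypothetical toy coefficients with a duplicated intercept row.
toy <- matrix(c(2, 0, 1.5, 0, -0.3), ncol = 1,
              dimnames = list(c("(Intercept)", "(Intercept)",
                                "x1", "x2", "x3"), "1"))
count_nonzero(toy)  # 2
```

With the fit above, and assuming the Matrix package is attached (glmnet loads it) so that as.matrix accepts the sparse result, the call would be count_nonzero(predict(lasso.mod, s = lambda.best, type = "coefficients")).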
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.0.4
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit = pcr(Apps~. , data = train.College , scale=T, validation="CV")
validationplot(pcr.fit, val.type= "MSEP")
Based on the cross-validation plot above, we keep M = 5 components; the CV error changes little for larger values of M.
pcr.pred = predict(pcr.fit, test.College, ncomp = 5)
mean((test.College[,"Apps"] -c(pcr.pred))^2)
## [1] 1983650
The test MSE for PCR with M = 5 is 1983650.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps~., data = train.College, scale=T, validation="CV")
validationplot(pls.fit, val.type = "MSEP")
For PLS, the cross-validation plot suggests about M = 6 components.
pls.pred=predict(pls.fit, test.College, ncomp = 6)
# c() flattens the prediction array to a numeric vector, as in (e);
# wrapping it in data.frame() instead makes mean() warn and return NA
mean((test.College[, "Apps"] - c(pls.pred))^2)
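The PLS predictions need to be a numeric vector before averaging. mean() has no method for data frames, so it warns and returns NA when given one. A minimal base-R sketch with hypothetical stand-in values (pred and actual are made up, not the College predictions):

```r
# Hypothetical stand-ins: `pred` mimics the n x 1 x 1 array that predict()
# returns for a pls fit; `actual` plays the role of the observed response.
pred   <- array(c(1, 2, 3), dim = c(3, 1, 1))
actual <- c(1, 2, 4)

# Wrapping the predictions in data.frame() makes the difference a data frame,
# and mean() then warns and returns NA.
bad <- suppressWarnings(mean((actual - data.frame(pred))^2))

# Flattening the array to a plain numeric vector gives the usual MSE.
good <- mean((actual - c(pred))^2)

c(bad = bad, good = good)
```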
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
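Least squares, ridge regression, and the lasso give nearly identical test MSEs (1135758, 1135714, and 1135660), which corresponds to a typical prediction error of about 1,066 applications. PCR with 5 components is clearly worse, with a test MSE of 1983650 (about 1,408 applications). No PLS test error is reported above, so PLS cannot be compared here.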
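To put the test errors on the scale of the response, the reported MSEs can be converted to RMSE, and a test-set R^2 can be computed from any vector of predictions. This is a sketch: test_r2 is a hypothetical helper name, and the MSE values are copied from the results above.

```r
# Test-set R^2: share of the variance in `actual` explained by `predicted`.
test_r2 <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Test MSEs reported above, converted to RMSE (in applications).
test_mse <- c(least_squares = 1135758, ridge = 1135714,
              lasso = 1135660, pcr = 1983650)
round(sqrt(test_mse))
```

With the objects from (b), the R^2 of the least squares fit would be test_r2(test.College[, "Apps"], lm.pred).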
11. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.4
library(glmnet)
library(pls)
Lasso
x = model.matrix(crim ~ . -1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x , y, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.4186415
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2298449
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.569011
Ridge regression
x = model.matrix(crim ~ .-1, data = Boston)
y= Boston$crim
cv.ridge = cv.glmnet(x,y,type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.089904137
## zn -0.002618038
## indus 0.022395980
## chas -0.111271217
## nox 1.410367149
## rm -0.112325278
## age 0.004734543
## dis -0.070552811
## rad 0.032103662
## tax 0.001487374
## ptratio 0.053070897
## black -0.001844108
## lstat 0.026086260
## medv -0.017184827
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.882159
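Both RMSE values above are read at lambda.1se, which by construction has a higher CV error than lambda.min. A small helper makes the choice explicit. This is a sketch: cv_rmse is a hypothetical name, and toy_cv is a made-up list that only borrows the field names (lambda, cvm, lambda.min, lambda.1se) used above.

```r
# CV RMSE at a chosen lambda from a cv.glmnet-style object.
cv_rmse <- function(cvfit, s = c("lambda.1se", "lambda.min")) {
  s <- match.arg(s)
  # Nearest match avoids relying on exact floating-point equality.
  idx <- which.min(abs(cvfit$lambda - cvfit[[s]]))
  sqrt(cvfit$cvm[idx])
}

# Hypothetical stand-in with the same field names as a cv.glmnet fit.
toy_cv <- list(lambda = c(1, 0.5, 0.1),
               cvm = c(64, 49, 36),
               lambda.min = 0.1,
               lambda.1se = 0.5)
cv_rmse(toy_cv, "lambda.1se")  # 7
cv_rmse(toy_cv, "lambda.min")  # 6
```

Applied to the fits above, cv_rmse(cv.lasso, "lambda.min") and cv_rmse(cv.ridge, "lambda.min") would give figures that are more directly comparable with the minimum CV error of other methods.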
PCR
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE , validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.190    7.182    6.778    6.775    6.771    6.787
## adjCV         8.61    7.188    7.181    6.773    6.767    6.766    6.781
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.781    6.637    6.672     6.675     6.658     6.620     6.551
## adjCV    6.774    6.629    6.664     6.666     6.650     6.609     6.541
##
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
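The number of components can be picked from the CV row of the summary above instead of by eye. The sketch below copies those CV RMSEP values. The 2% tolerance used to look for a smaller model is an arbitrary assumption, not a rule from the pls package.

```r
# CV RMSEP from the summary above; the first entry is the intercept-only model.
cv_rmsep <- c(8.61, 7.190, 7.182, 6.778, 6.775, 6.771, 6.787,
              6.781, 6.637, 6.672, 6.675, 6.658, 6.620, 6.551)
n_comp <- seq_along(cv_rmsep) - 1

# M with the lowest CV error.
m_min <- n_comp[which.min(cv_rmsep)]

# Smallest M whose CV error is within an assumed 2% of that minimum.
tol <- 0.02
m_small <- n_comp[which(cv_rmsep <= min(cv_rmsep) * (1 + tol))[1]]

c(m_min = m_min, m_small = m_small)  # 13 and 8
```

The minimum is at M = 13, while M = 8 gets within the assumed tolerance using fewer components.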
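(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
All three approaches in (a) were compared by 10-fold cross-validated RMSE rather than training error. The lasso gives 7.569 and ridge regression 7.882, both at lambda.1se; PCR gives 6.551 with all 13 components and 6.637 with 8. On these numbers PCR performs best, although the lasso and ridge figures are taken at lambda.1se rather than at the error-minimizing lambda, so the comparison is not entirely like for like.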
(c) Does your chosen model involve all of the features in the data set? Why or why not?
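PCR with M = 13 components gives the lowest cross-validated error, and it involves all of the features: every principal component is a linear combination of all 13 predictors, and with M = 13 the fit is equivalent to least squares on the full set. If interpretability matters more, the lasso at lambda.1se is far sparser (it keeps only rad), at the cost of a higher CV RMSE (7.569 versus 6.551).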
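A PCR fit that keeps every component spans the same column space as the original predictors, so it reproduces the least squares fit. This can be checked in base R. The sketch uses the built-in mtcars data as a stand-in for Boston, and uses prcomp with lm in place of pcr, so it illustrates the principle and does not rerun the analysis above.

```r
# Built-in mtcars data used as a stand-in; mpg is the response.
X <- scale(as.matrix(mtcars[, -1]))
y <- mtcars$mpg

# Scores on all principal components of the standardized predictors.
pcs <- prcomp(X)$x

fit_ols <- lm(y ~ X)    # least squares on the original predictors
fit_pcr <- lm(y ~ pcs)  # regression on every principal component

# Same column space, so the fitted values coincide up to rounding.
all.equal(unname(fitted(fit_ols)), unname(fitted(fit_pcr)))
```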