For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso is less flexible than the least squares method and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii)
(b) Like the lasso, ridge regression is less flexible than the least squares method and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii)
(c) In contrast to the two previous methods, non-linear methods are more flexible than the least squares method and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias. (ii)
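These answers all follow from the bias-variance decomposition of the expected test MSE at a point \(x_0\):
\[
E\big(y_0 - \hat{f}(x_0)\big)^2 = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon).
\]
Shrinking or constraining the fit (lasso, ridge) trades a small increase in bias for a larger decrease in variance, while a more flexible non-linear fit makes the opposite trade.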
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)
attach(College)
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]
The data were split roughly in half, so the training set and the test set each contain about 50% of the observations.
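A quick check of the resulting split sizes (College has 777 rows, so this sketch should report 388 training and 389 test observations):
# Sketch: confirm the sizes of the training and test sets.
dim(College.train)
dim(College.test)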
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
LS.fit<-lm(Apps~., data=College, subset=train)
summary(LS.fit)
##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5387.1 -468.0 -13.5 394.4 6645.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -676.07070 657.29073 -1.029 0.30435
## PrivateYes -577.50016 218.45328 -2.644 0.00855 **
## Accept 1.67767 0.05299 31.660 < 2e-16 ***
## Enroll -0.55248 0.29906 -1.847 0.06549 .
## Top10perc 66.28284 8.79947 7.533 3.85e-13 ***
## Top25perc -23.12334 7.10800 -3.253 0.00125 **
## F.Undergrad -0.05959 0.05733 -1.039 0.29928
## P.Undergrad 0.02930 0.05744 0.510 0.61025
## Outstate -0.11885 0.02990 -3.974 8.49e-05 ***
## Room.Board 0.22169 0.07652 2.897 0.00399 **
## Books -0.05290 0.38960 -0.136 0.89207
## Personal 0.09498 0.09145 1.039 0.29967
## PhD -13.88234 8.02488 -1.730 0.08448 .
## Terminal 0.95034 8.07090 0.118 0.90633
## S.F.Ratio 21.41748 21.54419 0.994 0.32081
## perc.alumni 8.82698 6.39055 1.381 0.16803
## Expend 0.06876 0.01630 4.217 3.11e-05 ***
## Grad.Rate 11.24407 4.37044 2.573 0.01048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1133 on 370 degrees of freedom
## Multiple R-squared: 0.93, Adjusted R-squared: 0.9268
## F-statistic: 289.1 on 17 and 370 DF, p-value: < 2.2e-16
pred.app<-predict(LS.fit, College.test)
test.error<-mean((College.test$Apps-pred.app)^2)
test.error
## [1] 1016996
(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.5.3
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.5.3
## Loaded glmnet 2.0-16
#Ridge Model
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Best Lambda
cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 429.864
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 906938.4
The test MSE for the ridge regression model is 906,938.4, an improvement over the least squares test MSE of 1,016,996.
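The ridge coefficient estimates at the selected \(\lambda\) can be inspected the same way as for the lasso in part (d); a sketch (output not shown):
# Sketch: ridge coefficients at the CV-selected lambda. All 17 predictors
# remain non-zero, since ridge shrinks coefficients but never zeroes them.
predict(ridge.mod, type = "coefficients", s = bestlam)[1:18, ]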
(d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#Lasso Model
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Best Lambda
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 23.48036
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 930235
# Lasso Coefficients
out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -6.216869e+02 -4.143091e+02 1.443950e+00 -1.638136e-01 3.230157e+01
## Top25perc P.Undergrad Outstate Room.Board Personal
## -1.464598e+00 1.701264e-02 -5.512815e-02 1.221390e-01 5.212088e-04
## PhD Terminal S.F.Ratio perc.alumni Expend
## -5.297377e+00 -3.347805e+00 3.356347e+00 -1.006902e+00 6.885043e-02
## Grad.Rate
## 4.875837e+00
The test MSE for the lasso model is 930,235. At the selected \(\lambda\), the lasso zeroes out the coefficients for F.Undergrad and Books, leaving 15 non-zero coefficient estimates (plus the intercept).
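A one-line sketch to confirm the count programmatically:
# Sketch: count the non-zero coefficient estimates, excluding the intercept.
sum(lasso.coef != 0) - 1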
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4193 4171 2422 2321 2341 2054 1987
## adjCV 4193 4172 2418 2315 2342 2042 1979
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1987 1959 1844 1848 1872 1878 1898
## adjCV 1982 1953 1835 1840 1863 1869 1888
## 14 comps 15 comps 16 comps 17 comps
## CV 1883 1666 1377 1380
## adjCV 1898 1637 1361 1364
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.510 56.98 64.18 70.42 76.06 81.04 84.66
## Apps 1.913 68.37 71.21 71.45 78.19 80.06 80.06
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.95 90.78 93.13 95.25 96.86 97.95 98.74
## Apps 80.79 83.34 83.35 83.37 83.37 83.64 84.66
## 15 comps 16 comps 17 comps
## X 99.42 99.87 100
## Apps 91.09 92.95 93
validationplot(pcr.college, val.type="MSEP")
# Test error with M = 10 components
pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)
## [1] 1371446
The number of components selected for this model is M = 10. The CV error is actually lowest at 16 components (CV = 1377), but it plateaus around 9-10 components (CV of roughly 1844-1848), and 10 components already explain 83.35% of the variance in Apps, so M = 10 gives a more parsimonious model. The test MSE for the PCR model is 1,371,446.
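A sketch of how the CV curve can be interrogated programmatically (the pls package also provides selectNcomp() for a principled choice; output not shown):
# Sketch: CV RMSEP by number of components; the first entry is the
# 0-component (intercept-only) model, hence the "- 1".
cv.rmsep = RMSEP(pcr.college)$val["CV", 1, ]
which.min(cv.rmsep) - 1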
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
#PLS Model
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4193 2221 1901 1768 1697 1573 1398
## adjCV 4193 2213 1893 1763 1677 1547 1380
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1370 1359 1358 1363 1364 1364 1361
## adjCV 1355 1345 1344 1348 1349 1349 1346
## 14 comps 15 comps 16 comps 17 comps
## CV 1359 1359 1358 1358
## adjCV 1345 1344 1344 1343
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 25.52 38.49 62.37 66.41 70.45 73.47 77.99
## Apps 74.33 83.18 85.86 89.52 91.82 92.82 92.88
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.21 84.21 87.98 89.49 91.24 92.24 95.13
## Apps 92.95 92.98 92.98 92.99 92.99 93.00 93.00
## 15 comps 16 comps 17 comps
## X 96.53 98.31 100
## Apps 93.00 93.00 93
# Test error with M = 9 components
pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)
## [1] 1046329
In contrast to the PCR model, the PLS model uses M = 9 components: the CV error essentially bottoms out there (CV = 1358) and barely changes with further components, while 9 components already explain 92.98% of the variance in Apps, giving a more parsimonious model. The test MSE for the PLS model is 1,046,329.
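The same sketch applies to the PLS fit (output not shown):
# Sketch: CV RMSEP for the PLS fit; the curve flattens after about 9 components.
RMSEP(pls.college)$val["CV", 1, ]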
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col="yellow", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
detach(College)
Comparing the test \(R^2\) values shows that all five models predict the number of college applications fairly accurately. PCR has the lowest test \(R^2\) (about 0.85), followed by PLS (about 0.87). Comparing the test MSEs (table below) shows that ridge regression has the lowest test error, so it may give the best model here; a sketch after the table converts these errors to the scale of the response.
| Model | Test error |
|---|---|
| OLS | 1,016,996 |
| Ridge | 906,938.4 |
| Lasso | 930,235 |
| PCR | 1,371,446 |
| PLS | 1,046,329 |
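Taking square roots of the test MSEs gives typical prediction errors of roughly 950 to 1,170 applications (a quick sketch):
# Sketch: test RMSEs on the scale of Apps, from the MSEs in the table above.
test.mse = c(OLS = 1016996, Ridge = 906938.4, Lasso = 930235, PCR = 1371446, PLS = 1046329)
round(sqrt(test.mse))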
We will now try to predict the per capita crime rate in the Boston data set.
(a) Try out some of the regression approaches explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(leaps)
library(MASS)
set.seed(1)
attach(Boston)
predict.regsubsets = function(object, newdata, id, ...) {
    # Rebuild the formula used in the original regsubsets() call
    form = as.formula(object$call[[2]])
    # Construct the design matrix for the new data
    mat = model.matrix(form, newdata)
    # Coefficients of the model with `id` variables, applied to matching columns
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
# 10-fold CV: for each fold, fit best subsets on the other folds and
# compute the held-out MSE for every model size
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
which.min(mean.cv.errors)
## [1] 9
mean.cv.errors[which.min(mean.cv.errors)]
## [1] 43.47287
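For comparability with the RMSE figures below, note that \(\sqrt{43.47} \approx 6.59\). A sketch of refitting best subset selection on the full data to inspect the chosen 9-variable model (output not shown):
# Sketch: refit best subset selection on all of Boston and extract the
# coefficients of the 9-variable model chosen by cross-validation.
reg.best = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.best, 9)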
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.405176
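Note that coef() on a cv.glmnet object reports coefficients at lambda.1se by default, which is why only rad survives above. For comparison, the CV RMSE at lambda.min can be extracted the same way (a sketch; value not shown):
# Sketch: CV RMSE at lambda.min rather than lambda.1se.
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.min])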
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.017516870
## zn -0.002805664
## indus 0.034405928
## chas -0.225250601
## nox 2.249887495
## rm -0.162546004
## age 0.007343331
## dis -0.114928730
## rad 0.059813843
## tax 0.002659110
## ptratio 0.086423005
## black -0.003342067
## lstat 0.044495212
## medv -0.029124577
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.456946
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.170 7.163 6.733 6.728 6.740 6.765
## adjCV 8.61 7.169 7.162 6.730 6.723 6.737 6.760
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.760 6.634 6.652 6.642 6.652 6.607 6.546
## adjCV 6.754 6.628 6.644 6.635 6.643 6.598 6.536
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.45 95.40 97.04 98.46 99.52 100.0
## crim 42.47 42.55 42.78 43.04 44.13 45.4
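As with the College models, the CV curve can be visualized (a sketch):
# Sketch: CV MSEP curve for the Boston PCR fit.
validationplot(pcr.crime, val.type = "MSEP")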
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Judging by cross-validation, best subset selection achieves a CV MSE of about 43.5 with 9 variables (RMSE of about 6.59). In RMSE terms, the lasso gives 7.41 and ridge 7.46 (both at lambda.1se), while PCR reaches its lowest CV RMSE of 6.546 using all 13 components. Best subset selection and PCR therefore perform best on this data set, with the penalized methods close behind.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No. The model I would select is the best subset model: it has a low cross-validated error and uses only 9 of the 13 predictors. The ridge model retains all 13 predictors (shrunken, but none exactly zero), while the lasso at lambda.1se keeps only rad. The PCR model attains the lowest CV error, but it uses all 13 components, so it offers no reduction toward a more parsimonious model.