library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readxl)
library(tinytex)
library(ISLR)
library(ISLR2)
##
## Attaching package: 'ISLR2'
##
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:ISLR2':
##
## Boston
##
## The following object is masked from 'package:dplyr':
##
## select
library(class)
library(e1071)
library(boot)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
##
## The following object is masked from 'package:boot':
##
## melanoma
##
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings
library(leaps)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
- i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
- ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
- iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
- iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The lasso, relative to least squares, corresponds to statement iii. The book states that “the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large…the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret.” The authors go on to explain that when the least squares estimates have high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. In other words, the lasso is less flexible than least squares, and when its increase in bias is smaller than its decrease in variance it produces more accurate predictions.
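For reference, the lasso coefficient estimates minimize the residual sum of squares plus an L1 penalty on the coefficients; this is the standard form of the criterion from the text, written out here only to make the role of \(\lambda\) explicit:

\[
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \;=\; \text{RSS} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
\]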
Ridge regression relative to least squares also corresponds to statement iii. The book states that ridge’s “advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.” With linear data, least squares will have low bias but can have much higher variance, which means small changes in the training data can produce large changes in the least squares coefficient estimates. Ridge regression, however, can still perform well by trading a small increase in bias for a large decrease in variance. Ridge therefore behaves much like the lasso in this regard, but the constraint it places on the coefficients is different.
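The ridge criterion differs only in the penalty term, squaring the coefficients rather than taking their absolute values, which shrinks them towards zero but does not set any of them exactly to zero:

\[
\text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^{2}
\]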
For non-linear methods relative to least squares, the correct statement is ii. Non-linear methods are less constrained, which helps reduce bias when fitting the data, but the added flexibility can bring a significant amount of variance. If the true relationship is non-linear, the increase in variance is outweighed by the decrease in bias, and the non-linear method will give more accurate predictions.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
- Split the data set into a training set and a test set.
The College data set in ISLR2 is described as statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
A data frame with 777 observations on the following 18 variables.
- Private: A factor with levels No and Yes indicating private or public university
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Pct. new students from top 10% of H.S. class
- Top25perc: Pct. new students from top 25% of H.S. class
- F.Undergrad: Number of fulltime undergraduates
- P.Undergrad: Number of parttime undergraduates
- Outstate: Out-of-state tuition
- Room.Board: Room and board costs
- Books: Estimated book costs
- Personal: Estimated personal spending
- PhD: Pct. of faculty with Ph.D.’s
- Terminal: Pct. of faculty with terminal degree
- S.F.Ratio: Student/faculty ratio
- perc.alumni: Pct. alumni who donate
- Expend: Instructional expenditure per student
- Grad.Rate: Graduation rate
# Split the data 60/40 into training and test sets
set.seed(1)
sample1 <- sample(c(TRUE, FALSE), nrow(College), replace=TRUE, prob = c(.6,.4))
train <- College[sample1, ]
test <- College[!sample1, ]
dim(train)
## [1] 472 18
dim(test)
## [1] 305 18
- Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit.college <- lm(Apps ~ ., data = train)
summary(lm.fit.college)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2573.3 -423.5 -51.1 307.6 6762.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.471e+02 4.923e+02 -0.908 0.364325
## PrivateYes -6.778e+02 1.762e+02 -3.847 0.000136 ***
## Accept 1.191e+00 6.572e-02 18.116 < 2e-16 ***
## Enroll -1.281e-01 2.272e-01 -0.564 0.573054
## Top10perc 5.197e+01 7.163e+00 7.255 1.75e-12 ***
## Top25perc -1.713e+01 5.675e+00 -3.019 0.002683 **
## F.Undergrad 7.921e-02 3.890e-02 2.036 0.042314 *
## P.Undergrad -5.039e-03 4.706e-02 -0.107 0.914772
## Outstate -9.835e-03 2.430e-02 -0.405 0.685853
## Room.Board 1.073e-01 6.230e-02 1.722 0.085775 .
## Books 1.339e-01 3.250e-01 0.412 0.680506
## Personal 3.000e-02 7.844e-02 0.383 0.702247
## PhD -1.036e+01 6.303e+00 -1.644 0.100868
## Terminal 1.455e+00 6.830e+00 0.213 0.831348
## S.F.Ratio 3.405e+00 1.538e+01 0.221 0.824845
## perc.alumni -7.875e+00 5.093e+00 -1.546 0.122735
## Expend 5.961e-02 1.465e-02 4.069 5.57e-05 ***
## Grad.Rate 8.848e+00 3.598e+00 2.459 0.014291 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1005 on 454 degrees of freedom
## Multiple R-squared: 0.9253, Adjusted R-squared: 0.9225
## F-statistic: 330.7 on 17 and 454 DF, p-value: < 2.2e-16
lm.pred <- predict(lm.fit.college, test)
lm.MSE <- mean((lm.pred - test$Apps)^2)
lm.MSE
## [1] 1622925
The test MSE for the least squares model is 1,622,925.
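Because the later parts report a test \(R^2\) rather than a raw MSE, it may help to express the least squares fit on the same scale. A minimal follow-up sketch using the objects already created above (lm.rmse and lm.rsq are names introduced just for this sketch):

lm.rmse <- sqrt(lm.MSE)  # test RMSE of the least squares model
lm.rsq <- 1 - sum((lm.pred - test$Apps)^2) / sum((test$Apps - mean(test$Apps))^2)  # test R-squared
lm.rmse
lm.rsq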
- Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
train.rm <- model.matrix(Apps~., data=train)
test.rm <- model.matrix(Apps ~., data=test)
grid = 10^seq(10, -2, length=100)
par(mfrow = c(1, 2))
ridgemodel <- glmnet(train.rm, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridgemodel)
cv.ridge <- cv.glmnet(train.rm, train[, "Apps"], alpha=0)
plot(cv.ridge)
lambest <- cv.ridge$lambda.min
lambest
## [1] 337.6929
pred <- predict(ridgemodel,test.rm,s=lambest)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.8458411
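The exercise asks for the test error itself, so the test MSE at the CV-chosen \(\lambda\) can also be reported directly. A one-line follow-up on the pred object computed above (ridge.mse is a name introduced here):

ridge.mse <- mean((pred - test$Apps)^2)  # ridge test MSE at lambda.min
ridge.mse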
- Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lassomod <- cv.glmnet(train.rm, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
lassbest <- lassomod$lambda.min
plot(lassomod)
### TEST ERROR
pred <- predict(lassomod,test.rm,s=lassbest)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.9066181
sum(coef(lassomod)[,1]==0)
## [1] 14
names(coef(lassomod)[, 1][coef(lassomod)[, 1] == 0])
## [1] "(Intercept)" "PrivateYes" "Enroll" "Top25perc" "P.Undergrad"
## [6] "Outstate" "Room.Board" "Books" "Personal" "PhD"
## [11] "Terminal" "S.F.Ratio" "perc.alumni" "Grad.Rate"
- Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pcr.fit <- pcr(Apps~.,data=train, scale=TRUE, validation="CV")
summary(pcr.fit)
## Data: X dimension: 472 17
## Y dimension: 472 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3614 3542 1721 1714 1265 1266 1255
## adjCV 3614 3544 1719 1716 1241 1259 1252
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1255 1254 1175 1165 1170 1176 1180
## adjCV 1253 1254 1172 1162 1167 1173 1177
## 14 comps 15 comps 16 comps 17 comps
## CV 1180 1188 1074 1068
## adjCV 1176 1185 1070 1064
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.695 58.25 65.13 70.86 76.21 81.16 84.76 88.00
## Apps 5.324 77.87 78.23 88.62 88.65 88.65 88.76 88.76
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.01 93.38 95.5 97.25 98.30 98.98 99.47
## Apps 90.16 90.29 90.3 90.34 90.34 90.37 90.38
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.21 92.53
The cross-validated RMSEP is lowest at M = 17 components, so all 17 components are used for prediction; with M = 17, PCR is equivalent to the full least squares fit.
pred <- predict(pcr.fit,test,ncomp=17)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.9096891
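As with ridge, the test MSE can be reported alongside the \(R^2\), and a validation plot makes the cross-validation curve easier to read. A brief follow-up on the objects above (pcr.mse is a name introduced here):

pcr.mse <- mean((pred - test$Apps)^2)  # PCR test MSE with M = 17 components
pcr.mse
validationplot(pcr.fit, val.type = "MSEP")  # CV curve used to choose M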
- Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit <- plsr(Apps~.,data=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 472 17
## Y dimension: 472 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3614 1555 1273 1163 1141 1110 1086
## adjCV 3614 1552 1277 1160 1137 1111 1079
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1073 1064 1066 1066 1068 1068 1068
## adjCV 1068 1060 1062 1062 1064 1064 1064
## 14 comps 15 comps 16 comps 17 comps
## CV 1068 1068 1068 1068
## adjCV 1064 1064 1064 1064
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.90 43.71 63.83 67.12 71.61 74.29 76.87 79.98
## Apps 82.26 88.02 90.43 91.15 91.72 92.41 92.47 92.49
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.16 85.64 87.28 88.41 92.52 94.25 96.82
## Apps 92.50 92.52 92.52 92.53 92.53 92.53 92.53
## 16 comps 17 comps
## X 98.87 100.00
## Apps 92.53 92.53
The cross-validated RMSEP reaches its minimum at M = 8 components and is essentially flat beyond that, so M = 8 is used for prediction.
pred <- predict(pls.fit,test,ncomp=8)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.9067511
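The corresponding test MSE for the PLS fit (pls.mse is a name introduced here):

pls.mse <- mean((pred - test$Apps)^2)  # PLS test MSE with M = 8 components
pls.mse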
- Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Ridge regression gave the weakest test-set fit (test \(R^2\) of about 0.85), while the lasso, PCR, PLS, and least squares all performed similarly (test \(R^2\) of roughly 0.91). So the number of applications can be predicted fairly accurately, and apart from ridge there is little difference among these approaches. That said, I’m not sure my training/test split was well balanced, because the test errors still look substantial for every model.
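For a quick side-by-side look, the test \(R^2\) values reported above can be collected into one vector (the numbers are simply copied from the ridge, lasso, PCR, and PLS outputs; this is only a convenience summary):

test.r2 <- c(ridge = 0.846, lasso = 0.907, pcr = 0.910, pls = 0.907)
sort(test.r2, decreasing = TRUE)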
We will now try to predict per capita crime rate in the Boston data set.
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Boston <- na.omit(Boston)
set.seed(1)
pcr.fit <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.250 7.253 6.833 6.815 6.826 6.847
## adjCV 8.61 7.245 7.247 6.825 6.803 6.818 6.838
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.837 6.710 6.735 6.723 6.714 6.696 6.624
## adjCV 6.827 6.698 6.724 6.710 6.702 6.682 6.609
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pcr.fit,val.type="MSEP")
set.seed(1)
sample2 <- sample(c(TRUE, FALSE), nrow(Boston), replace=TRUE, prob = c(.6,.4))
train <- Boston[sample2, ]
test <- Boston[!sample2, ]
dim(train)
## [1] 314 14
dim(test)
## [1] 192 14
regfit.full=regsubsets(crim~.,Boston)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., Boston)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
reg.summary=summary(regfit.full)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
set.seed(1)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])  # recover the formula from the regsubsets call
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the size-id model
  mat[, names(coefi)] %*% coefi        # predictions
}
k = 10                        # number of cross-validation folds
p = ncol(Boston) - 1          # number of candidate predictors
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)  # rows = folds, columns = subset sizes
for (i in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
A subset of 9 predictors gives the lowest cross-validated RMSE, roughly 6.55.
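A quick way to confirm which subset size minimizes the cross-validated RMSE and to inspect the corresponding coefficients refit on the full data (best.size is a name introduced here):

best.size <- which.min(rmse.cv)  # subset size with the lowest CV RMSE
best.size
rmse.cv[best.size]
coef(regsubsets(crim ~ ., data = Boston, nvmax = p), id = best.size)  # coefficients of that model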
set.seed(1)
x <- model.matrix(crim ~ . - 1, data = Boston)
y <- Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")  # alpha defaults to 1, i.e. the lasso penalty
plot(cv.lasso)
besLasso = cv.lasso$lambda.min
besLasso
## [1] 0.05630926
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.438669
This cross-validated RMSE for the lasso (about 7.44, at lambda.1se) is higher than the roughly 6.55 RMSE from best subset selection, so in this comparison the lasso does not improve on the subset model.
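Since the one-standard-error rule is deliberately conservative, it may also be worth checking the cross-validated RMSE at lambda.min before concluding that the lasso is worse than best subset selection:

sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.min])  # CV RMSE at lambda.min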