The lasso's decrease in variance comes at the cost of only a slight increase in bias (p. 223). Because the lasso drives some coefficients exactly to zero, it is less flexible than least squares, and it improves prediction accuracy when its decrease in variance outweighs its increase in bias.
Similar to above: ridge regression's variance drops sharply before its bias rises to meet it (p. 218). Unlike the lasso, ridge shrinks coefficients toward zero without setting any of them exactly to zero, but the shrinkage still makes it less flexible than least squares.
Non-linear methods are more flexible than least squares, which is linear. A more flexible method wins when the reduction in bias it buys outweighs the increase in variance it incurs; in other words, we want the bias to drop off quickly before the variance catches up.
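To make the lasso point concrete, here is a minimal simulation sketch (the data and variable names are made up for illustration) showing the lasso zeroing out coefficients that least squares keeps:
library(glmnet)
set.seed(1)
n <- 100; p <- 10
x.sim <- matrix(rnorm(n * p), n, p)              # 10 predictors; only the first 2 matter
y.sim <- 2 * x.sim[, 1] - 1.5 * x.sim[, 2] + rnorm(n)
coef(glmnet(x.sim, y.sim, alpha = 1), s = 0.5)   # lasso: most coefficients exactly 0
coef(lm(y.sim ~ x.sim))                          # least squares: all 10 estimated nonzero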
College data set. Going with a 50/50 train/test split.
library(ISLR)     # College data set
library(glmnet)   # ridge regression and the lasso
library(pls)      # pcr() and plsr()
library(leaps)    # regsubsets()
library(MASS)     # Boston data set
set.seed(2021)
college.data <- College
train <- sample(1:nrow(college.data), 0.5 * nrow(college.data))
test <- -train
college.train <- college.data[train, ]
college.test <- college.data[test, ]
lm9B <- lm(Apps ~ . , college.train)
lm9B.preds <- predict(lm9B, college.test)
mean((college.test$Apps - lm9B.preds)^2)
## [1] 1408299
The test MSE of the linear model is 1408299.
set.seed(2021)
grid <- 10 ^ seq(10, -2, length=100)
train.matrix <- model.matrix(Apps ~ ., college.train)
test.matrix <- model.matrix(Apps ~ ., college.test)
# train the ridge model, with lambda chosen over the grid by 10-fold CV
ridge.model <- cv.glmnet(train.matrix, college.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- ridge.model$lambda.min
bestlam
## [1] 4.641589
ridge.model.pred <- predict(ridge.model, test.matrix, s=bestlam)
mean((ridge.model.pred - college.test$Apps)^2)
## [1] 1412127
The ridge regression test MSE is 1412127, slightly higher than the LM's 1408299.
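As a quick sanity check (an aside, reusing the fit above), ridge should have shrunk the coefficients without zeroing any of them out, in contrast to the lasso coefficients shown later:
# ridge coefficients at the CV-chosen lambda; aside from the redundant
# intercept column that model.matrix adds, none should be exactly zero
predict(ridge.model, type = 'coefficients', s = bestlam)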
set.seed(2021)
grid <- 10 ^ seq(10, -2, length=100)
train.matrix <- model.matrix(Apps ~ ., college.train)
test.matrix <- model.matrix(Apps ~ ., college.test)
# train the lasso model with lambda chosen by 10-fold CV
# (note: cv.glmnet's own default lambda sequence is used here, not the grid above)
lasso.model <- cv.glmnet(train.matrix, college.train$Apps, alpha = 1)
bestlam <- lasso.model$lambda.min
bestlam
## [1] 9.082179
lasso.pred <- predict(lasso.model, test.matrix, s=bestlam)
mean((lasso.pred - college.test$Apps)^2)
## [1] 1406953
The lasso test MSE is 1406953, a little better than the LM's 1408299 and ridge regression's 1412127.
Incorporating the instructor's demonstration to check the coefficients.
x <- model.matrix(Apps ~ ., college.data)
y <- college.data$Apps
out <- glmnet(x,y, alpha=1, lambda = grid)
lasso.coef <- predict(out, type = 'coefficients', s = bestlam)
lasso.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.246442e+02
## (Intercept) .
## PrivateYes -4.805574e+02
## Accept 1.529824e+00
## Enroll -4.473636e-01
## Top10perc 4.294125e+01
## Top25perc -8.861363e+00
## F.Undergrad 1.669031e-03
## P.Undergrad 4.195095e-02
## Outstate -7.527460e-02
## Room.Board 1.420197e-01
## Books .
## Personal 2.152610e-02
## PhD -7.503658e+00
## Terminal -3.038887e+00
## S.F.Ratio 1.142353e+01
## perc.alumni -5.145633e-01
## Expend 7.437734e-02
## Grad.Rate 7.074171e+00
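So the lasso dropped Books (and the redundant second intercept column) entirely. A quick count of what survived:
sum(lasso.coef != 0)   # number of nonzero coefficients, intercept included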
Next: PCR on the training set, with the value of M selected by cross-validation.
set.seed(2021)
pcr.fit <- pcr(Apps ~ ., data = college.train, scale=TRUE, validation = 'CV')
validationplot(pcr.fit, val.type = 'MSEP')
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3402 3414 1637 1661 1618 1335 1320
## adjCV 3402 3414 1634 1664 1621 1325 1318
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1307 1300 1237 1241 1204 1211 1215
## adjCV 1305 1307 1233 1239 1200 1208 1212
## 14 comps 15 comps 16 comps 17 comps
## CV 1215 1221 1009 962.0
## adjCV 1212 1219 1005 957.5
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.927 57.28 64.66 70.99 76.57 81.55 84.95 88.08
## Apps 1.057 77.34 77.34 78.53 85.67 85.79 86.04 86.12
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.04 93.39 95.35 97.01 98.09 98.85 99.39
## Apps 87.71 87.81 88.43 88.43 88.48 88.54 88.87
## 16 comps 17 comps
## X 99.81 100.0
## Apps 92.35 93.3
Going with 10 components: the CV error curve levels off around there, and the later drop at 16 or 17 components would just take us back toward full least squares.
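As an aside, rather than eyeballing the validation plot, newer releases of the pls package provide selectNcomp() to pick the number of components automatically; a sketch, assuming a pls version recent enough to include it:
# 'onesigma' picks the smallest ncomp whose CV error is within one SE of the minimum
selectNcomp(pcr.fit, method = 'onesigma', plot = TRUE)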
set.seed(2021)
pcr.pred <- predict(pcr.fit, college.test, ncomp = 10)
mean((pcr.pred - college.test$Apps)^2)
## [1] 2868755
The test error obtained from PCR is 2868755. This is much higher than the LM error of 1408299.
Just for giggles, let's demonstrate that PCR with all of the components is equivalent to least squares regression.
set.seed(2021)
pcr.predX <- predict(pcr.fit, college.test, ncomp = 17)
mean((pcr.predX - college.test$Apps)^2)
## [1] 1408299
How ’bout that!
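If the matching MSEs aren't convincing enough, here's a sketch comparing the predictions directly:
# full-rank PCR and OLS span the same column space, so predictions should agree
all.equal(as.numeric(pcr.predX), as.numeric(predict(lm9B, college.test)))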
Now for PLS, again with the value of M selected by cross-validation.
set.seed(2021) #because I like being thorough
pls.fit <- plsr(Apps ~ ., data=college.train, scale=TRUE, validation='CV')
validationplot(pls.fit)
Looks like 10 components should work again.
pls.pred <- predict(pls.fit, college.test, ncomp = 10)
mean((pls.pred - college.test$Apps)^2)
## [1] 1402090
Partial least squares looking good with an error of 1402090, better than least squares regression’s 1408299.
Boston data set. Best subset selection:
boston.data <- Boston
set.seed(2021)
bos.train <- sample(1:nrow(boston.data), 0.5*nrow(boston.data))
boston.train <- boston.data[bos.train, ]
boston.test <- boston.data[-bos.train, ]
boston.bss.full <- regsubsets(crim ~ ., boston.data, nvmax = 13)
bss.summary <- summary(boston.bss.full)
bss.summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., boston.data, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
names(bss.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
par(mfrow = c(1,1))
plot(bss.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")  # the per-size RSS lives in the summary object
Looks like we could work with anywhere from 9 to 12 variables. Which variables?
#trying out Dr. Campbell's cool code from the labs
coef(boston.bss.full, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
# validation-set errors for each model size, following the ISLR lab's predict-by-hand approach
test.mat <- model.matrix(crim ~ ., data = boston.test)
val.errors <- rep(NA, 13)
for (i in 1:13) {
  coefi <- coef(boston.bss.full, id = i)              # coefficients of the best i-variable model
  pred <- test.mat[, names(coefi)] %*% coefi          # manual prediction: X_test %*% beta
  val.errors[i] <- mean((boston.test$crim - pred)^2)  # test MSE for size i
}
val.errors
## [1] 55.50955 52.12001 53.05658 51.51829 52.26508 51.80218 51.31701 50.85214
## [9] 50.62286 50.29730 50.25833 50.23052 50.23926
which.min(val.errors)
## [1] 12
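Twelve variables wins. Rather than typing all twelve names out by hand, the formula can be built from the selected coefficients (a small convenience sketch):
best.vars <- names(coef(boston.bss.full, id = 12))[-1]   # drop the intercept name
as.formula(paste('crim ~', paste(best.vars, collapse = ' + ')))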
Fit the model.
boston.lm <- lm(crim ~ zn + indus + chas + nox + rm + dis + rad + tax + ptratio + black + lstat + medv, boston.train)
summary(boston.lm)
##
## Call:
## lm(formula = crim ~ zn + indus + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat + medv, data = boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.817 -1.806 -0.197 0.891 55.110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.602975 8.842401 3.122 0.00202 **
## zn 0.044322 0.023994 1.847 0.06594 .
## indus -0.057781 0.093441 -0.618 0.53692
## chas -0.494034 1.441064 -0.343 0.73203
## nox -8.625412 5.785930 -1.491 0.13734
## rm -1.095898 0.759810 -1.442 0.15051
## dis -0.932314 0.337034 -2.766 0.00611 **
## rad 0.534305 0.104734 5.102 6.85e-07 ***
## tax -0.003495 0.005919 -0.590 0.55547
## ptratio -0.179945 0.233382 -0.771 0.44145
## black -0.020328 0.004339 -4.685 4.69e-06 ***
## lstat 0.009001 0.086933 0.104 0.91762
## medv -0.088597 0.070880 -1.250 0.21253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.498 on 240 degrees of freedom
## Multiple R-squared: 0.5132, Adjusted R-squared: 0.4889
## F-statistic: 21.09 on 12 and 240 DF, p-value: < 2.2e-16
Let's estimate the test error of this LM by predicting on the test set.
boston.lm.pred <- predict(boston.lm, boston.test)
mean((boston.lm.pred - boston.test$crim)^2)
## [1] 55.4715
The LM test MSE comes out to 55.47.
Lasso method is next:
set.seed(2021)
grid<-10 ^ seq(10, -2, length = 100)
train.matrix <- model.matrix(crim ~ . , boston.train)
test.matrix <- model.matrix(crim ~ ., boston.test)
boston.lasso <- cv.glmnet(train.matrix, boston.train$crim, alpha = 1)
bestlam.boston <- boston.lasso$lambda.min
bestlam.boston
## [1] 0.02055351
plot(boston.lasso)
boston.lasso.pred <- predict(boston.lasso, test.matrix, s=bestlam.boston)
mean((boston.lasso.pred - boston.test$crim)^2)
## [1] 55.62816
The error of Lasso is 55.63.
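Out of curiosity, we can peek at which predictors the lasso kept (with lambda this small, expect only mild shrinkage):
predict(boston.lasso, type = 'coefficients', s = bestlam.boston)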
Ridge regression is next:
set.seed(2021)
grid<-10 ^ seq(10, -2, length = 100)
train.matrix <- model.matrix(crim ~ . , boston.train)
test.matrix <- model.matrix(crim ~ . , boston.test)
boston.ridge <- cv.glmnet(train.matrix, boston.train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.boston <- boston.ridge$lambda.min
bestlam.boston
## [1] 0.07054802
plot(boston.ridge)
boston.ridge.pred <- predict(boston.ridge, test.matrix, s=bestlam.boston)
mean((boston.ridge.pred - boston.test$crim)^2)
## [1] 55.57771
Error of Ridge Model is 55.58.
Doing PCR now:
set.seed(2021)
boston.pcr <- pcr(crim ~ ., data=boston.train, scale = TRUE, validation = 'CV')
validationplot(boston.pcr, val.type = 'MSEP')
summary(boston.pcr)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 7.706 6.400 6.404 5.814 5.781 5.755 5.809
## adjCV 7.706 6.397 6.403 5.805 5.776 5.748 5.799
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.824 5.763 5.753 5.803 5.793 5.732 5.638
## adjCV 5.815 5.755 5.729 5.788 5.780 5.717 5.623
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 45.37 58.18 68.58 76.36 83.14 88.24 91.31 93.56
## crim 31.91 31.91 44.41 45.35 45.96 46.14 46.16 47.73
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.38 97.00 98.38 99.47 100.00
## crim 48.69 48.69 48.83 49.78 51.39
Going with all 13 components, since that gives the lowest CV/adjCV RMSEP. (As shown with the College data, though, PCR with every component is just least squares on all the predictors, so there's no dimension reduction happening here.)
set.seed(2021)
boston.pcr.pred <- predict(boston.pcr, boston.test, ncomp = 13)
mean((boston.pcr.pred - boston.test$crim)^2)
## [1] 55.56365
Error of PCR is 55.56.
Here's the summary (in ascending order of test MSE) from the various methods.
* Error of LM is 55.47.
* Error of PCR is 55.56.
* Error of Ridge Model is 55.58.
* Error of Lasso is 55.63.
The LM built on the 12 variables chosen by best subset selection has the lowest error, as evaluated on the held-out test set.
The best subset selection model does not include all of the features because the exhaustive search keeps only the subset with the greatest predictive power; the variables left out did not improve the validation error. As a bonus, the best subset model is a little simpler than the PCR model (which used all 13 components).
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Let's compare the two models with the lowest test errors: PLS and OLS.
The PLS model explains 93.26% of the variance in Apps, which is very good.
Least squares regression explains the variation in Apps with an adjusted R^2 of 92.99%.
The lasso's test MSE of 1406953 and ridge regression's 1412127 should both put their R^2 in the low 90s (see the sketch below).
PCR came in last with a test MSE of 2868755; with 10 components it explained only about 87.8% of the training variance in Apps.
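To firm up those "low 90s" estimates, the test R^2 can be computed directly from the predictions above (a minimal sketch reusing objects already in the workspace):
tss <- sum((college.test$Apps - mean(college.test$Apps))^2)   # total SS on the test set
1 - sum((lasso.pred - college.test$Apps)^2) / tss             # lasso test R^2
1 - sum((ridge.model.pred - college.test$Apps)^2) / tss       # ridge test R^2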