For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is: i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Lovaas answer ex 2a: Lasso is less flexible than least squares, meaning either iii or iv is correct. Lasso will have less variance than least squares but slightly more bias (as it removes some variables from the model). Thus, iii is the true statement.
Lovaas answer ex 2b: Ridge regression is very similar to lasso, so the answer from 2a applies. Ridge regression is also less flexible than least squares and will have slightly more bias (since many betas are “shrunk” towards 0)
Lovaas answer ex 2c: Non linear methods are generally more flexible than least squares; as least squares is linear. However, they will only give improved prediction accuracy when the increase in variance is less than the decrease in bias. This corresponds to ii above.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
# ?College
sum(is.na(College$Apps)) # no need to fix data
## [1] 0
attach(College)
train=sample(c(TRUE,FALSE), nrow(College),rep=TRUE)
test=(!train)
college.train = College[train,]
college.test = College[test,]
college.ols = lm(Apps~., data = college.train)
college.pred = predict(college.ols, college.test)
x = mean((college.test[,"Apps"] - college.pred)^2)
x
## [1] 1248602
Lovaas answer ex 9b: Test RSS using this split of test and train data is 1,598,447.
set.seed(42)
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Loaded glmnet 4.1-1
train.mat = model.matrix(Apps~., data=college.train) #test and train need to be in matrices
test.mat = model.matrix(Apps~., data=college.test)
grid=10^seq(10,-2,length=100) #set grid
mod.ridge = cv.glmnet(train.mat, college.train[, "Apps"], alpha=0, lambda=grid, thresh=1e-12)
lambda.best = mod.ridge$lambda.min
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
mean((college.test[, "Apps"] - ridge.pred)^2)
## [1] 1248538
lambda.best
## [1] 0.01
Lovaas answer ex 9c: Test error is 1,746,436. Best lambda is 32.75
lasso.mod=glmnet(train.mat,college.train[, "Apps"],alpha=1,lambda=grid)
set.seed(1)
cv.out=cv.glmnet(train.mat,college.train[, "Apps"],alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred-college.test[, "Apps"])^2)
## [1] 1222732
x=(model.matrix(Apps~.,College)[,-1])
out=glmnet(x,College$Apps,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef
## (Intercept) PrivateYes Accept Enroll Top10perc
## -471.06667555 -491.14165926 1.56962287 -0.75610852 48.05994347
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -12.79467114 0.04128853 0.04401976 -0.08310752 0.14936232
## Books Personal PhD Terminal S.F.Ratio
## 0.01482470 0.02889000 -8.39203241 -3.25657190 14.49559720
## perc.alumni Expend Grad.Rate
## -0.04454733 0.07707805 8.26923799
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -471.06667555 -491.14165926 1.56962287 -0.75610852 48.05994347
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -12.79467114 0.04128853 0.04401976 -0.08310752 0.14936232
## Books Personal PhD Terminal S.F.Ratio
## 0.01482470 0.02889000 -8.39203241 -3.25657190 14.49559720
## perc.alumni Expend Grad.Rate
## -0.04454733 0.07707805 8.26923799
bestlam
## [1] 2.238926
Lovaas answer ex 9d: MSE is 1,673,137. This is a little bit better than a ridge regression. The coefficients on F.Undergrad and Books are 0, so there are only 16 non-zero coefficient estimates.
library(pls)
## Warning: package 'pls' was built under R version 4.0.5
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(42)
pcr.fit=pcr(Apps~., data=college.train,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data: X dimension: 380 17
## Y dimension: 380 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4401 4192 2249 2245 2262 1856 1849
## adjCV 4401 4194 2246 2243 2272 1834 1842
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1854 1837 1831 1808 1817 1822 1821
## adjCV 1846 1823 1825 1801 1809 1814 1813
## 14 comps 15 comps 16 comps 17 comps
## CV 1820 1851 1488 1271
## adjCV 1812 1828 1464 1255
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.71 56.81 64.64 70.64 76.13 80.99 84.70 87.91
## Apps 10.05 74.57 75.19 75.20 84.14 84.22 84.29 84.72
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.94 93.35 95.34 97.07 98.07 98.81 99.39
## Apps 84.94 85.38 85.63 85.63 85.85 85.93 90.88
## 16 comps 17 comps
## X 99.84 100.0
## Apps 93.54 94.7
pcr.pred=predict(pcr.fit,x[test,],ncomp=7)
mean((pcr.pred-college.test[, "Apps"])^2)
## [1] 2145484
Lovaas answer ex 9e: MSE is 3,235,916. This is worse than the other models we’ve created so far. M = 16; that is the point at which the CV is the lowest.
set.seed(42)
pls.fit=plsr(Apps~., data=college.train,scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 380 17
## Y dimension: 380 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4401 2117 1945 1758 1701 1600 1412
## adjCV 4401 2112 1941 1747 1680 1572 1388
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1363 1335 1295 1286 1279 1271 1270
## adjCV 1341 1314 1277 1270 1263 1256 1255
## 14 comps 15 comps 16 comps 17 comps
## CV 1269 1270 1270 1271
## adjCV 1254 1254 1255 1255
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.28 49.90 62.23 66.09 70.92 73.57 76.30 79.48
## Apps 78.39 83.06 87.59 90.94 93.07 94.11 94.28 94.42
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.23 86.05 88.81 91.30 93.53 94.73 97.51
## Apps 94.56 94.61 94.66 94.69 94.70 94.70 94.70
## 16 comps 17 comps
## X 98.53 100.0
## Apps 94.70 94.7
pls.pred=predict(pls.fit,x[test,],ncomp=2)
mean((pls.pred-college.test[, "Apps"])^2)
## [1] 2651064
Lovaas answer ex 9f: MSE is 3,017,032. This is worse than other models we’ve created so far, but better than PCR. M = 8; that is the point at which the CV is the lowest.
test.avg = mean(college.test[, "Apps"])
lm.test.r2 = 1 - mean((college.test[, "Apps"] - college.pred)^2) / mean((college.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((college.test[, "Apps"] - ridge.pred)^2) / mean((college.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((college.test[, "Apps"] - lasso.pred)^2) / mean((college.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((college.test[, "Apps"] - (pcr.pred))^2) / mean((college.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((college.test[, "Apps"] - (pls.pred))^2) / mean((college.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col="red", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
Lovaas answer ex 9g: I stole this code from the online answer sheet because it’s just amazing! Using this we can see that the R-squared values are around 90% for the OLS, Ridge, and Lasso techniques, and slightly lower for PCR and PLS. Depending on the task at hand, these are really good r-squared values. PCR and PLS both had higher MSE’s as well, so they’re definitely not my favorite models of the 5. OLS, Ridge, and Lasso all had fairly similar results. The MSE was lowest for OLS, so it’s my favorite model of the group.
We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
# ?Boston
sum(is.na(Boston$crim)) # no need to fix data
## [1] 0
attach(Boston)
train=sample(c(TRUE,FALSE), nrow(Boston),rep=TRUE)
test=(!train)
boston.train = Boston[train,]
boston.test = Boston[test,]
# OLS
boston.ols = lm(crim~., data = boston.train)
boston.pred = predict(boston.ols, boston.test)
x = mean((boston.test[,"crim"] - boston.pred)^2)
print("OLS R-squared:")
## [1] "OLS R-squared:"
x
## [1] 21.47189
#Ridge
set.seed(42)
train.mat = model.matrix(crim~., data=boston.train) #test and train need to be in matrices
test.mat = model.matrix(crim~., data=boston.test)
grid=10^seq(10,-2,length=100) #set grid
mod.ridge = cv.glmnet(train.mat, boston.train[, "crim"], alpha=0, lambda=grid, thresh=1e-12)
lambda.best = mod.ridge$lambda.min
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
print("Ridge R-squared:")
## [1] "Ridge R-squared:"
mean((boston.test[, "crim"] - ridge.pred)^2)
## [1] 20.94954
lambda.best
## [1] 0.1232847
#Lasso
lasso.mod=glmnet(train.mat,boston.train[, "crim"],alpha=1,lambda=grid)
set.seed(1)
cv.out=cv.glmnet(train.mat,boston.train[, "crim"],alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=test.mat)
print("Lasso R-squared:")
## [1] "Lasso R-squared:"
mean((lasso.pred-boston.test[, "crim"])^2)
## [1] 20.42005
x=(model.matrix(crim~.,Boston)[,-1])
out=glmnet(x,Boston$crim,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:12,]
lasso.coef
## (Intercept) zn indus chas nox rm
## 10.298685476 0.032772646 -0.056588035 -0.533258828 -4.668242943 0.084006702
## age dis rad tax ptratio black
## 0.000000000 -0.654165652 0.500428269 0.000000000 -0.130458060 -0.007562849
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox rm
## 10.298685476 0.032772646 -0.056588035 -0.533258828 -4.668242943 0.084006702
## dis rad ptratio black
## -0.654165652 0.500428269 -0.130458060 -0.007562849
bestlam
## [1] 0.08611276
#PCR
set.seed(42)
pcr.fit=pcr(crim~., data=boston.train,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data: X dimension: 250 13
## Y dimension: 250 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 11 9.500 9.498 8.919 8.934 8.941 9.053
## adjCV 11 9.488 9.499 8.899 8.910 8.922 9.028
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 9.046 8.919 8.972 8.988 9.007 8.875 8.806
## adjCV 9.018 8.888 8.941 8.957 8.975 8.840 8.771
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.41 60.56 70.23 76.82 83.22 88.01 91.15 93.56
## crim 28.35 29.03 38.43 38.71 38.75 38.88 39.60 41.55
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.57 97.11 98.46 99.51 100.00
## crim 41.70 41.71 41.75 43.90 44.97
pcr.pred=predict(pcr.fit,x[test,],ncomp=7)
print("PCR R-squared:")
## [1] "PCR R-squared:"
mean((pcr.pred-boston.test[, "crim"])^2)
## [1] 22.00876
#PLS
set.seed(42)
pls.fit=plsr(crim~., data=boston.train,scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 250 13
## Y dimension: 250 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 11 9.274 8.877 8.963 8.886 8.810 8.826
## adjCV 11 9.261 8.856 8.918 8.849 8.778 8.790
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 8.816 8.807 8.800 8.805 8.806 8.806 8.806
## adjCV 8.780 8.772 8.765 8.770 8.771 8.771 8.771
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.84 58.91 62.67 71.03 76.12 80.29 83.84 86.49
## crim 32.56 41.00 43.61 44.15 44.60 44.84 44.90 44.95
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 88.99 94.81 96.94 98.51 100.00
## crim 44.97 44.97 44.97 44.97 44.97
pls.pred=predict(pls.fit,x[test,],ncomp=2)
print("PLS R-squared:")
## [1] "PLS R-squared:"
mean((pls.pred-boston.test[, "crim"])^2)
## [1] 21.5677
#Chart
test.avg = mean(boston.test[, "crim"])
lm.test.r2 = 1 - mean((boston.test[, "crim"] - boston.pred)^2) / mean((boston.test[, "crim"] - test.avg)^2)
ridge.test.r2 = 1 - mean((boston.test[, "crim"] - ridge.pred)^2) / mean((boston.test[, "crim"] - test.avg)^2)
lasso.test.r2 = 1 - mean((boston.test[, "crim"] - lasso.pred)^2) / mean((boston.test[, "crim"] - test.avg)^2)
pcr.test.r2 = 1 - mean((boston.test[, "crim"] - (pcr.pred))^2) / mean((boston.test[, "crim"] - test.avg)^2)
pls.test.r2 = 1 - mean((boston.test[, "crim"] - (pls.pred))^2) / mean((boston.test[, "crim"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col="blue", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
Lovaas answer ex 11a: I decided to try everything from exercise 9 on the Boston dataset. The r-squared metrics are lower for this problem, around 50%. The model with the best r-squared was Ridge, then PCR, then PLS, then Lasso, then OLS. Interestingly, that’s about the opposite of how the models on the College dataset went. Ridge, Lasso, and OLS all had lower MSE’s than PCR and PLS. Ridge was slightly lower than the others.
Lovaas answer ex 11b: Based on just r-squared and MSE, I would guess that Ridge will be the best model.
lambda.best
## [1] 0.1232847
summary(mod.ridge)
## Length Class Mode
## lambda 100 -none- numeric
## cvm 100 -none- numeric
## cvsd 100 -none- numeric
## cvup 100 -none- numeric
## cvlo 100 -none- numeric
## nzero 100 -none- numeric
## call 6 -none- call
## name 1 -none- character
## glmnet.fit 12 elnet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
## index 2 -none- numeric
Lovaas answer ex 11c: My chosen model does involve all of the variables of the data set. All models in this chapter run very quickly regardless of the number of variables included, so I saw no harm in including all.