Chapter 06 (page 259): 2, 9, 11

Chapter 6 ex 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is: i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Lovaas answer ex 2a: Lasso is less flexible than least squares, meaning either iii or iv is correct. Lasso will have less variance than least squares but slightly more bias (as it removes some variables from the model). Thus, iii is the true statement.

Lovaas answer ex 2b: Ridge regression is very similar to lasso, so the answer from 2a applies. Ridge regression is also less flexible than least squares and will have slightly more bias (since many betas are “shrunk” towards 0)

Lovaas answer ex 2c: Non linear methods are generally more flexible than least squares; as least squares is linear. However, they will only give improved prediction accuracy when the increase in variance is less than the decrease in bias. This corresponds to ii above.

Chapter 6 ex 9

  1. In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
# ?College
sum(is.na(College$Apps)) # no need to fix data
## [1] 0
attach(College)
train=sample(c(TRUE,FALSE), nrow(College),rep=TRUE)
test=(!train)
college.train = College[train,]
college.test = College[test,]
college.ols = lm(Apps~., data = college.train)
college.pred = predict(college.ols, college.test)
x = mean((college.test[,"Apps"] - college.pred)^2)
x
## [1] 1248602

Lovaas answer ex 9b: Test RSS using this split of test and train data is 1,598,447.

set.seed(42)
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Loaded glmnet 4.1-1
train.mat = model.matrix(Apps~., data=college.train) #test and train need to be in matrices
test.mat = model.matrix(Apps~., data=college.test)
grid=10^seq(10,-2,length=100) #set grid
mod.ridge = cv.glmnet(train.mat, college.train[, "Apps"], alpha=0, lambda=grid, thresh=1e-12)
lambda.best = mod.ridge$lambda.min
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
mean((college.test[, "Apps"] - ridge.pred)^2)
## [1] 1248538
lambda.best
## [1] 0.01

Lovaas answer ex 9c: Test error is 1,746,436. Best lambda is 32.75

lasso.mod=glmnet(train.mat,college.train[, "Apps"],alpha=1,lambda=grid)
set.seed(1)
cv.out=cv.glmnet(train.mat,college.train[, "Apps"],alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred-college.test[, "Apps"])^2)
## [1] 1222732
x=(model.matrix(Apps~.,College)[,-1])
out=glmnet(x,College$Apps,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -471.06667555 -491.14165926    1.56962287   -0.75610852   48.05994347 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -12.79467114    0.04128853    0.04401976   -0.08310752    0.14936232 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.01482470    0.02889000   -8.39203241   -3.25657190   14.49559720 
##   perc.alumni        Expend     Grad.Rate 
##   -0.04454733    0.07707805    8.26923799
lasso.coef[lasso.coef!=0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -471.06667555 -491.14165926    1.56962287   -0.75610852   48.05994347 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -12.79467114    0.04128853    0.04401976   -0.08310752    0.14936232 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.01482470    0.02889000   -8.39203241   -3.25657190   14.49559720 
##   perc.alumni        Expend     Grad.Rate 
##   -0.04454733    0.07707805    8.26923799
bestlam
## [1] 2.238926

Lovaas answer ex 9d: MSE is 1,673,137. This is a little bit better than a ridge regression. The coefficients on F.Undergrad and Books are 0, so there are only 16 non-zero coefficient estimates.

library(pls)
## Warning: package 'pls' was built under R version 4.0.5
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(42)
pcr.fit=pcr(Apps~., data=college.train,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data:    X dimension: 380 17 
##  Y dimension: 380 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4401     4192     2249     2245     2262     1856     1849
## adjCV         4401     4194     2246     2243     2272     1834     1842
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1854     1837     1831      1808      1817      1822      1821
## adjCV     1846     1823     1825      1801      1809      1814      1813
##        14 comps  15 comps  16 comps  17 comps
## CV         1820      1851      1488      1271
## adjCV      1812      1828      1464      1255
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       31.71    56.81    64.64    70.64    76.13    80.99    84.70    87.91
## Apps    10.05    74.57    75.19    75.20    84.14    84.22    84.29    84.72
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.94     93.35     95.34     97.07     98.07     98.81     99.39
## Apps    84.94     85.38     85.63     85.63     85.85     85.93     90.88
##       16 comps  17 comps
## X        99.84     100.0
## Apps     93.54      94.7
pcr.pred=predict(pcr.fit,x[test,],ncomp=7)
mean((pcr.pred-college.test[, "Apps"])^2)
## [1] 2145484

Lovaas answer ex 9e: MSE is 3,235,916. This is worse than the other models we’ve created so far. M = 16; that is the point at which the CV is the lowest.

set.seed(42)
pls.fit=plsr(Apps~., data=college.train,scale=TRUE, validation="CV")
summary(pls.fit)
## Data:    X dimension: 380 17 
##  Y dimension: 380 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4401     2117     1945     1758     1701     1600     1412
## adjCV         4401     2112     1941     1747     1680     1572     1388
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1363     1335     1295      1286      1279      1271      1270
## adjCV     1341     1314     1277      1270      1263      1256      1255
##        14 comps  15 comps  16 comps  17 comps
## CV         1269      1270      1270      1271
## adjCV      1254      1254      1255      1255
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.28    49.90    62.23    66.09    70.92    73.57    76.30    79.48
## Apps    78.39    83.06    87.59    90.94    93.07    94.11    94.28    94.42
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.23     86.05     88.81     91.30     93.53     94.73     97.51
## Apps    94.56     94.61     94.66     94.69     94.70     94.70     94.70
##       16 comps  17 comps
## X        98.53     100.0
## Apps     94.70      94.7
pls.pred=predict(pls.fit,x[test,],ncomp=2)
mean((pls.pred-college.test[, "Apps"])^2)
## [1] 2651064

Lovaas answer ex 9f: MSE is 3,017,032. This is worse than other models we’ve created so far, but better than PCR. M = 8; that is the point at which the CV is the lowest.

test.avg = mean(college.test[, "Apps"])
lm.test.r2 = 1 - mean((college.test[, "Apps"] - college.pred)^2) / mean((college.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((college.test[, "Apps"] - ridge.pred)^2) / mean((college.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((college.test[, "Apps"] - lasso.pred)^2) / mean((college.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((college.test[, "Apps"] - (pcr.pred))^2) / mean((college.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((college.test[, "Apps"] - (pls.pred))^2) / mean((college.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col="red", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")

Lovaas answer ex 9g: I stole this code from the online answer sheet because it’s just amazing! Using this we can see that the R-squared values are around 90% for the OLS, Ridge, and Lasso techniques, and slightly lower for PCR and PLS. Depending on the task at hand, these are really good r-squared values. PCR and PLS both had higher MSE’s as well, so they’re definitely not my favorite models of the 5. OLS, Ridge, and Lasso all had fairly similar results. The MSE was lowest for OLS, so it’s my favorite model of the group.

Chapter 6 ex 11

We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
# ?Boston
sum(is.na(Boston$crim)) # no need to fix data
## [1] 0
attach(Boston)
train=sample(c(TRUE,FALSE), nrow(Boston),rep=TRUE)
test=(!train)
boston.train = Boston[train,]
boston.test = Boston[test,]
# OLS
boston.ols = lm(crim~., data = boston.train)
boston.pred = predict(boston.ols, boston.test)
x = mean((boston.test[,"crim"] - boston.pred)^2)
print("OLS R-squared:")
## [1] "OLS R-squared:"
x
## [1] 21.47189
#Ridge 
set.seed(42)
train.mat = model.matrix(crim~., data=boston.train) #test and train need to be in matrices
test.mat = model.matrix(crim~., data=boston.test)
grid=10^seq(10,-2,length=100) #set grid
mod.ridge = cv.glmnet(train.mat, boston.train[, "crim"], alpha=0, lambda=grid, thresh=1e-12)
lambda.best = mod.ridge$lambda.min
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
print("Ridge R-squared:")
## [1] "Ridge R-squared:"
mean((boston.test[, "crim"] - ridge.pred)^2)
## [1] 20.94954
lambda.best
## [1] 0.1232847
#Lasso
lasso.mod=glmnet(train.mat,boston.train[, "crim"],alpha=1,lambda=grid)
set.seed(1)
cv.out=cv.glmnet(train.mat,boston.train[, "crim"],alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=test.mat)
print("Lasso R-squared:")
## [1] "Lasso R-squared:"
mean((lasso.pred-boston.test[, "crim"])^2)
## [1] 20.42005
x=(model.matrix(crim~.,Boston)[,-1])
out=glmnet(x,Boston$crim,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:12,]
lasso.coef
##  (Intercept)           zn        indus         chas          nox           rm 
## 10.298685476  0.032772646 -0.056588035 -0.533258828 -4.668242943  0.084006702 
##          age          dis          rad          tax      ptratio        black 
##  0.000000000 -0.654165652  0.500428269  0.000000000 -0.130458060 -0.007562849
lasso.coef[lasso.coef!=0]
##  (Intercept)           zn        indus         chas          nox           rm 
## 10.298685476  0.032772646 -0.056588035 -0.533258828 -4.668242943  0.084006702 
##          dis          rad      ptratio        black 
## -0.654165652  0.500428269 -0.130458060 -0.007562849
bestlam
## [1] 0.08611276
#PCR
set.seed(42)
pcr.fit=pcr(crim~., data=boston.train,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data:    X dimension: 250 13 
##  Y dimension: 250 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV              11    9.500    9.498    8.919    8.934    8.941    9.053
## adjCV           11    9.488    9.499    8.899    8.910    8.922    9.028
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       9.046    8.919    8.972     8.988     9.007     8.875     8.806
## adjCV    9.018    8.888    8.941     8.957     8.975     8.840     8.771
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.41    60.56    70.23    76.82    83.22    88.01    91.15    93.56
## crim    28.35    29.03    38.43    38.71    38.75    38.88    39.60    41.55
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.57     97.11     98.46     99.51    100.00
## crim    41.70     41.71     41.75     43.90     44.97
pcr.pred=predict(pcr.fit,x[test,],ncomp=7)
print("PCR R-squared:")
## [1] "PCR R-squared:"
mean((pcr.pred-boston.test[, "crim"])^2)
## [1] 22.00876
#PLS
set.seed(42)
pls.fit=plsr(crim~., data=boston.train,scale=TRUE, validation="CV")
summary(pls.fit)
## Data:    X dimension: 250 13 
##  Y dimension: 250 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV              11    9.274    8.877    8.963    8.886    8.810    8.826
## adjCV           11    9.261    8.856    8.918    8.849    8.778    8.790
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       8.816    8.807    8.800     8.805     8.806     8.806     8.806
## adjCV    8.780    8.772    8.765     8.770     8.771     8.771     8.771
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.84    58.91    62.67    71.03    76.12    80.29    83.84    86.49
## crim    32.56    41.00    43.61    44.15    44.60    44.84    44.90    44.95
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       88.99     94.81     96.94     98.51    100.00
## crim    44.97     44.97     44.97     44.97     44.97
pls.pred=predict(pls.fit,x[test,],ncomp=2)
print("PLS R-squared:")
## [1] "PLS R-squared:"
mean((pls.pred-boston.test[, "crim"])^2)
## [1] 21.5677
#Chart
test.avg = mean(boston.test[, "crim"])
lm.test.r2 = 1 - mean((boston.test[, "crim"] - boston.pred)^2) / mean((boston.test[, "crim"] - test.avg)^2)
ridge.test.r2 = 1 - mean((boston.test[, "crim"] - ridge.pred)^2) / mean((boston.test[, "crim"] - test.avg)^2)
lasso.test.r2 = 1 - mean((boston.test[, "crim"] - lasso.pred)^2) / mean((boston.test[, "crim"] - test.avg)^2)
pcr.test.r2 = 1 - mean((boston.test[, "crim"] - (pcr.pred))^2) / mean((boston.test[, "crim"] - test.avg)^2)
pls.test.r2 = 1 - mean((boston.test[, "crim"] - (pls.pred))^2) / mean((boston.test[, "crim"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col="blue", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")

Lovaas answer ex 11a: I decided to try everything from exercise 9 on the Boston dataset. The r-squared metrics are lower for this problem, around 50%. The model with the best r-squared was Ridge, then PCR, then PLS, then Lasso, then OLS. Interestingly, that’s about the opposite of how the models on the College dataset went. Ridge, Lasso, and OLS all had lower MSE’s than PCR and PLS. Ridge was slightly lower than the others.

  1. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error,crossvalidation,or some other reasonable alternative, as opposed to using training error.

Lovaas answer ex 11b: Based on just r-squared and MSE, I would guess that Ridge will be the best model.

lambda.best 
## [1] 0.1232847
summary(mod.ridge)
##            Length Class  Mode     
## lambda     100    -none- numeric  
## cvm        100    -none- numeric  
## cvsd       100    -none- numeric  
## cvup       100    -none- numeric  
## cvlo       100    -none- numeric  
## nzero      100    -none- numeric  
## call         6    -none- call     
## name         1    -none- character
## glmnet.fit  12    elnet  list     
## lambda.min   1    -none- numeric  
## lambda.1se   1    -none- numeric  
## index        2    -none- numeric

Lovaas answer ex 11c: My chosen model does involve all of the variables of the data set. All models in this chapter run very quickly regardless of the number of variables included, so I saw no harm in including all.