(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Of i through iv, iii is the most accurate. The bias-variance trade-off is what gives the lasso its advantage over least squares. The lasso's penalty term shrinks the coefficient estimates, which reduces variance relative to least squares at the cost of a small increase in bias. This makes the model less flexible, and it gives improved prediction accuracy when its increase in bias is less than its decrease in variance.
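For reference, the standard decomposition of the expected test MSE that underlies this trade-off (for a test point x0, with fitted function f̂ and irreducible error ε) is

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε),

so a shrinkage method beats least squares whenever the variance it removes exceeds the squared bias it introduces.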
(b) Repeat (a) for ridge regression relative to least squares.
Ridge regression behaves like the lasso in this respect, so iii is again the most accurate. Just like the lasso, ridge regression exploits the bias-variance trade-off: its penalty term reduces variance while increasing bias, so the model is less flexible and is more accurate when its increase in bias is less than its decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods tend to be more flexible and have less bias than least squares. This gives non-linear methods better accuracy when their increase in variance is less than their decrease in bias, making ii the most accurate answer.
(a) Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
attach(College)
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]
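(Optional) A quick sanity check on the split, using the objects just created; the training half should contain 388 of the 777 colleges, consistent with the dimensions reported by the PCR fit later on.
dim(College.train)
dim(College.test)
length(y.test)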
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
ls.fit<-lm(Apps~., data=College, subset=train)
ls.pred = predict(ls.fit, College.test, type="response")
test.error= mean((College.test$Apps-ls.pred)^2)
test.error
## [1] 1020100
The test error from the fitted linear model with all the variables is 1,020,100.
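Since the MSE is on a squared scale, its square root is easier to interpret: √1,020,100 ≈ 1,010, so the least-squares predictions are off by roughly a thousand applications on average. If desired, this can be computed from the object above:
sqrt(test.error)   #RMSE, in units of applications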
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Loaded glmnet 4.1-3
set.seed(1)
#Set up matrices
grid=10^seq(10,-2,length=100)
#Fit a ridge regression
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Choose lambda using cross-validation
cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 411.3927
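(Optional) The cross-validation curve can also be inspected visually before predicting; plot() on a cv.glmnet object shows the CV MSE against log(λ), with dotted vertical lines at lambda.min and lambda.1se:
plot(cv.college.out)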
#Make predictions
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
#Calculate test error
mean((ridge.pred-y.test)^2)
## [1] 985020.1
The test MSE for the ridge model is 985,020.1. This is lower than the MSE of the linear model using least squares.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#fit lasso model
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Choose lambda using cross-validation
set.seed(2)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 27.0669
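(Aside) cv.glmnet also stores lambda.1se, the largest λ whose CV error is within one standard error of the minimum; it gives a sparser, more heavily regularized fit and can be inspected with:
cv.out$lambda.1se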
#Make predictions and calculate test error
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1006648
The test error of the lasso model is 1,006,648.
#non-zero coefficient estimates
out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -647.69543453 -396.82408556 1.42411713 -0.09893105 30.63271978
## Top25perc P.Undergrad Outstate Room.Board PhD
## -0.57340223 0.01055132 -0.05004306 0.11718682 -4.81091793
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## -3.24227399 1.89312526 -1.02549443 0.06707521 4.38907054
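From the output above, 14 of the 17 predictors have non-zero coefficient estimates at the chosen λ (F.Undergrad, Books, and Personal are dropped). The count can also be obtained directly from the coefficient vector created above:
sum(lasso.coef!=0)-1   #number of non-zero coefficients, excluding the intercept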
(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.0.5
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
#fit a PCR model with M chosen by CV
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 4335 2390 2401 2112 1954 1914
## adjCV 4347 4335 2386 2401 2085 1949 1905
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1910 1879 1871 1867 1867 1875 1894
## adjCV 1902 1862 1863 1860 1859 1867 1887
## 14 comps 15 comps 16 comps 17 comps
## CV 1853 1634 1323 1286
## adjCV 1934 1586 1310 1273
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.6794 56.94 64.38 70.61 76.27 80.97 84.48 87.54
## Apps 0.9148 71.17 71.36 79.85 81.49 82.73 82.79 83.70
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.89 94.96 96.81 97.97 98.73 99.39
## Apps 83.86 84.08 84.11 84.11 84.16 84.28 93.08
## 16 comps 17 comps
## X 99.86 100.00
## Apps 93.71 93.95
validationplot(pcr.college, val.type="MSEP")
pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)
## [1] 1422699
The test error of the PCR model with M = 10 is 1,422,699. Looking at the summary and the validation plot, the CV error is essentially flat from about M = 8 through M = 14 and only drops meaningfully for M ≥ 15, where almost no dimension reduction is achieved. M = 10 was therefore used for the predictions: it sits in that flat region, explains a high proportion of the variance, and still provides a meaningful dimension reduction.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
#Fit PLS model and visualize
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 2154 1836 1732 1620 1422 1314
## adjCV 4347 2148 1832 1724 1591 1397 1298
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1276 1264 1260 1266 1265 1262 1262
## adjCV 1264 1253 1250 1254 1253 1251 1250
## 14 comps 15 comps 16 comps 17 comps
## CV 1263 1263 1263 1263
## adjCV 1251 1251 1251 1252
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.27 38.72 62.64 65.26 69.01 73.96 78.86 82.18
## Apps 76.96 84.31 86.80 91.48 93.37 93.75 93.81 93.84
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 85.35 87.42 89.18 91.41 92.70 94.58 97.16
## Apps 93.88 93.91 93.93 93.94 93.95 93.95 93.95
## 16 comps 17 comps
## X 98.15 100.00
## Apps 93.95 93.95
Looking at the summary of the model, M = 9 looks to be the best option, with the lowest CV RMSEP (1260).
set.seed(1)
pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)
## [1] 1049868
The test error for the PLS model with M = 9 is 1,049,868.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Based on test error, the model performance from best to worst is: ridge regression (985,020.1), lasso (1,006,648), least-squares linear model (1,020,100), PLS (1,049,868), and PCR (1,422,699). The test errors of PLS, the linear model, and the lasso are fairly similar to one another; PCR performs noticeably worse than the other models, and ridge performs a little better than the others. To see how accurately we can predict the number of applications received, let's compute the test R-squared of each model.
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - ls.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
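(Optional) The same test R-squared values can also be printed numerically alongside the barplot:
round(c(OLS=lm.test.r2, Ridge=ridge.test.r2, Lasso=lasso.test.r2, PCR=pcr.test.r2, PLS=pls.test.r2), 3)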
Comparing the test R-squared of each model shows that PCR has the lowest accuracy, while ridge has the highest. Since ridge has the highest test R-squared and the lowest test MSE, it is the best of the five models here.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
detach(College)
library(MASS)
attach(Boston)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.5
set.seed(1)
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(!train)
Boston.train = Boston[train,]
Boston.test = Boston[test,]
Best Subset Selection
regfit.full = regsubsets(crim~.,data=Boston.train,nvmax=15)
reg.summary = summary(regfit.full)
#Make plots of performance statistics to determine best subset
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
p = which.max(reg.summary$adjr2)
points(p,reg.summary$adjr2[p], col="red",cex=2,pch=20)
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
p = which.min(reg.summary$cp)
points(p,reg.summary$cp[p],col="red",cex=2,pch=20)
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
p = which.min(reg.summary$bic)
points(p,reg.summary$bic[p],col="red",cex=2,pch=20)
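(Optional) The subset sizes chosen by each criterion can also be printed directly rather than read off the plots:
which.max(reg.summary$adjr2)
which.min(reg.summary$cp)
which.min(reg.summary$bic)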
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston.train, nvmax = 15)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Looking at the graphs above, adjusted R-squared and Cp suggest a seven-variable model, while BIC suggests a three-variable model. We will fit both and see which of the two is the better model.
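The variables entering the three- and seven-variable subsets (used in the fits below) can be read off the selection table above or extracted directly:
coef(regfit.full, id=3)   #rad, black, lstat (plus intercept)
coef(regfit.full, id=7)   #zn, indus, dis, rad, black, lstat, medv (plus intercept)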
# p=3
lm.fit = lm(crim~rad+black+lstat, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 52.66064
#p=7
lm.fit = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 51.04084
The test error of the linear model fit with the 3 best predictors is 52.66, while the p=7 linear model has a test error of 51.04. The seven-predictor model's error is only 1.62 lower than the three-variable model's, so one could argue that the improvement is not worth the added complexity of four more variables.
Lasso
set.seed(1)
#Set up matrices needed for the glmnet functions
train.mat = model.matrix(crim~., data = Boston.train)
test.mat = model.matrix(crim~., data = Boston.test)
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=1)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01973085
#Fit lasso model
lasso.mod = glmnet(train.mat,Boston.train$crim,alpha=1)
#Make predictions
lasso.pred = predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred - Boston.test$crim)^2)
## [1] 50.7379
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[1:15,]
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox rm
## 15.96344351 0.04355602 -0.10751503 -0.14897948 -6.30730424 -0.24512633
## age dis rad ptratio black lstat
## 0.01130326 -0.74018280 0.47687429 -0.15621440 -0.01470752 0.11414439
## medv
## -0.11377842
Ridge Regression
#Choose lambda using cross-validation
cv.ridge = cv.glmnet(train.mat,Boston.train$crim,alpha=0)
bestlam = cv.ridge$lambda.min
bestlam
## [1] 0.5240686
#Fit a ridge regression
ridge.mod = glmnet(train.mat,Boston.train$crim,alpha = 0)
#Make predictions
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
#Calculate test error
mean((ridge.pred - Boston.test$crim)^2)
## [1] 51.46284
coef(cv.ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.518428366
## (Intercept) .
## zn -0.003319432
## indus 0.032707247
## chas -0.202043154
## nox 2.120359324
## rm -0.161829416
## age 0.007019656
## dis -0.110482087
## rad 0.052144674
## tax 0.002404790
## ptratio 0.076371881
## black -0.003655338
## lstat 0.037206684
## medv -0.023652398
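Note that coef() on a cv.glmnet object defaults to s = "lambda.1se", so the table above is not evaluated at the lambda.min used for the predictions; the coefficients at that λ can be requested explicitly:
coef(cv.ridge, s="lambda.min")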
Principal Components Regression
#Fit a PCR model on the full Boston data set, with M chosen by CV
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.249 7.244 6.813 6.820 6.824 6.867
## adjCV 8.61 7.244 7.239 6.806 6.809 6.816 6.856
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.858 6.721 6.773 6.743 6.755 6.734 6.669
## adjCV 6.847 6.710 6.760 6.731 6.741 6.717 6.652
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
The lasso and ridge regression give similar results, but of the two the lasso has the lower test MSE (50.74 vs. 51.46); it is also the lowest of all the models fit here, and at the chosen λ it keeps 12 of the 13 predictors (only tax is dropped). For the PCR model, the CV error is essentially flat after about M = 8 and is only minimized at the full model (M = 13), so PCR provides little benefit on this data. Note also that the PCR model was fit and cross-validated on the full Boston data set, so it has no test error directly comparable to the other models; a sketch of how one could be obtained follows.
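A minimal sketch of how a comparable validation-set error could be computed, refitting PCR on the training half only (the object names pcr.boston and pcr.boston.pred are ours, and the choice M = 8 is an assumption that should be checked against the CV output of the refit):
#Refit PCR on the training set and estimate a test error comparable to the other models
pcr.boston = pcr(crim~., data=Boston.train, scale=TRUE, validation="CV")
pcr.boston.pred = predict(pcr.boston, Boston.test, ncomp=8)   #M=8 is an assumed choice
mean((pcr.boston.pred - Boston.test$crim)^2)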
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error.
Based on the information above, I would say that the best models to use are the best-subset linear models with p=3 or p=7. Even though the lasso has the smallest test MSE, the difference from the three-predictor linear model (about 1.9) is small, and the lasso uses nine more predictors, adding complexity for little gain. If we do want a lower MSE, the p=7 linear model is only about 0.30 above the lasso and uses just seven predictors.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
The model chosen is the best-subset model with p=7. It has the second-lowest test MSE at 51.04, only about 0.30 above the lowest (the lasso, at 50.74). It includes only 7 of the 13 available predictors, which reduces the dimensionality and makes the model easier to interpret while still achieving a low test error.