Assignment 5/Chapter 6

Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (a) is iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

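Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. Its penalty shrinks the coefficient estimates, and as λ increases the lasso becomes less flexible, so variance decreases while bias increases. Prediction accuracy improves when the drop in variance outweighs the rise in bias.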

(b) Repeat (a) for ridge regression relative to least squares.

The correct answer for part (b) is iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

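Explanation: Ridge regression, like the lasso, trades bias for variance: its penalty shrinks the coefficient estimates toward zero, which lowers variance at the cost of some bias. Unlike the lasso, ridge does not set any coefficients exactly to zero, so the lasso is usually easier to interpret.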

(c) Repeat (a) for non-linear methods relative to least squares.

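The correct answer for part (c) is ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Explanation: Non-linear methods make fewer assumptions about the form of the relationship between the predictors and the response, so they are more flexible than least squares. They lower bias at the cost of higher variance.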
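The trade-off behind all three answers can be written with the standard decomposition of the expected test error at a point $x_0$. The notation is an assumption made here for illustration: $\hat f$ is the fitted model and $\epsilon$ is the irreducible noise.

```latex
E\left[\left(y_0 - \hat f(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat f(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat f(x_0)\right)\right]^2
  + \mathrm{Var}(\epsilon)
```

A less flexible method such as the lasso or ridge regression improves on least squares when the variance term falls by more than the squared-bias term rises. A more flexible method improves when the opposite holds.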

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)

(a) Split the data set into a training set and a test set.

set.seed(1)
training=sample(nrow(College),0.75*nrow(College))
train=College[training,]
test=College[-training,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit=lm(Apps~.,data=train)
summary(lm.fit)

## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16

lm.pred=predict(lm.fit,newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err

## [1] 1384604

The training R-squared for the linear model is 93.36%. The test error (MSE) for this model is 1384604.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-4

xtrain=model.matrix(Apps~.,data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~.,data=test[,-1])
ytest=test$Apps
ridge.fit=cv.glmnet(xtrain,ytrain,alpha=0)
plot(ridge.fit)

ridge.lambda=ridge.fit$lambda.min
ridge.lambda

## [1] 364.6228

ridge.pred=predict(ridge.fit,s=ridge.lambda,newx=xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err

## [1] 1260111

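Cross-validation selects λ = 364.6228. With this value, ridge regression gives a test error of 1260111, which is an improvement over least squares. Note that train[,-1] drops the Private column before the model matrix is built, so this fit (and the lasso fit in (d)) does not use Private, unlike the least squares model.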

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.fit=cv.glmnet(xtrain,ytrain,alpha=1)
plot(lasso.fit)

lasso.lambda=lasso.fit$lambda.min
lasso.lambda

## [1] 1.945882

lasso.pred=predict(lasso.fit,s=lasso.lambda,newx=xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err

## [1] 1394834

Cross-validation selects λ = 1.945882, giving a lasso test error (MSE) of 1394834. This is higher than the ridge test error (1260111) and slightly higher than least squares (1384604).
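Part (d) also asks for the number of non-zero coefficient estimates. A minimal sketch of how that count could be obtained is below. It runs on simulated data so that it is self-contained; the names x, y, cv_fit, b and nonzero are hypothetical. With the objects from this problem, lasso.fit and lasso.lambda would take the place of cv_fit and "lambda.min".

```r
library(glmnet)

set.seed(1)
# Simulated data (an assumption for illustration): only x1 and x2 affect y
x <- matrix(rnorm(100 * 10), 100, 10)
colnames(x) <- paste0("x", 1:10)
y <- 3 * x[, "x1"] - 2 * x[, "x2"] + rnorm(100)

# Lasso with lambda chosen by cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the CV-selected lambda, converted to an ordinary matrix
b <- as.matrix(coef(cv_fit, s = "lambda.min"))

# Predictors with non-zero estimates, excluding the intercept
nonzero <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
length(nonzero)
```

The count depends on the random cross-validation folds, so it can change slightly between runs unless the seed is fixed.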

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

pcr.fit=pcr(Apps~.,data=train,scale=TRUE,validation="CV")
validationplot(pcr.fit,val.type="MSEP")

summary(pcr.fit)

## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3761     2078     2076     1809     1709     1662
## adjCV         3862     3761     2075     2076     1790     1689     1657
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1654     1615     1573      1578      1584      1587      1586
## adjCV     1649     1605     1568      1572      1579      1582      1580
##        14 comps  15 comps  16 comps  17 comps
## CV         1588      1528      1193      1133
## adjCV      1583      1511      1183      1124
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36

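According to the summary, the CV error is lowest at M = 17, which uses every component and is equivalent to least squares. The curve flattens around M = 10 (CV RMSEP of 1578), so we use M = 10 to get some dimension reduction. Ten components explain 92.81% of the variance in the predictors and 85.22% of the variance in Apps.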
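Reading M off the summary table can also be done in code. The sketch below uses the built-in mtcars data rather than College so that it runs on its own; fit, cv_rmsep and best_m are hypothetical names. It extracts the cross-validated RMSEP for each number of components and reports the M with the smallest value.

```r
library(pls)

set.seed(1)
# PCR on a built-in data set (an assumption for illustration, not College)
fit <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")

# CV RMSEP for 0, 1, ..., 10 components (the first entry is intercept-only)
cv_rmsep <- RMSEP(fit, estimate = "CV")$val[1, 1, ]

# Subtract 1 because position 1 corresponds to zero components
best_m <- unname(which.min(cv_rmsep)) - 1
best_m
```

The minimum is not always the best choice: when the curve is flat, a smaller M with nearly the same CV error gives a simpler model.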

pcr.pred=predict(pcr.fit,test,ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err

## [1] 1952693

The test error for PCR with M = 10 is 1952693. This is the highest test error so far.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.fit=plsr(Apps~.,data=train,scale=TRUE,validation="CV")
validationplot(pls.fit,val.type="MSEP")

summary(pls.fit)

## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1933     1688     1489     1424     1250     1167
## adjCV         3862     1927     1687     1482     1404     1227     1157
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1149     1144     1138      1134      1135      1132      1132
## adjCV     1140     1136     1130      1126      1127      1124      1123
##        14 comps  15 comps  16 comps  17 comps
## CV         1130      1131      1130      1130
## adjCV      1122      1122      1122      1122
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36

The CV error drops steeply up to about 5 to 7 components and is nearly flat afterward (CV RMSEP of 1149 at M = 7 versus a minimum of 1130), so we use M = 7 for prediction.

pls.pred=predict(pls.fit,test,ncomp=7)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err

## [1] 1333314

The PLS model with M = 7 gives a test error of 1333314, which is much lower than PCR and in line with the least squares, ridge, and lasso fits.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

First, let’s see how well all test errors compare side by side.

barplot(c(lm.err,ridge.err,lasso.err,pcr.err,pls.err),xlab="Regression Models",ylab="Test Error",main="Test Error Comparisons",names.arg=c("OLS","Ridge","Lasso","PCR","PLS"))

- As noted above, PCR stands out with the highest test error, and ridge has the lowest. Let’s also compare the test R-squared of these models to see how much of the variability in Apps each one explains.

avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lm.r2

## [1] 0.9086432

ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2

## [1] 0.9168573

lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2

## [1] 0.9079682

pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2

## [1] 0.8711605

pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2

## [1] 0.9120273

barplot(c(lm.r2,ridge.r2,lasso.r2,pcr.r2,pls.r2),xlab="Regression Methods",ylab="Test R-Squared",names.arg=c("OLS","Ridge","Lasso","PCR","PLS"))

- As expected, PCR has the lowest test R-squared (about 0.87), while the other four models all exceed 0.90. Ridge regression has both the highest test R-squared (about 0.917) and the lowest test error, so it is the best of these models for predicting Apps, although the differences among least squares, ridge, lasso, and PLS are small.
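The test R-squared above is computed with the same expression five times. A small helper, sketched below with hypothetical names (test_r2, actual, pred), would do the same calculation once. It assumes pred and actual are numeric vectors of equal length.

```r
# Test R-squared: 1 - (MSE of the model) / (MSE of predicting the mean)
test_r2 <- function(pred, actual) {
  1 - mean((actual - pred)^2) / mean((actual - mean(actual))^2)
}

# Toy values (an assumption for illustration)
actual <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)
test_r2(pred, actual)
```

With the objects from this problem, a call such as test_r2(lm.pred, test$Apps) would compute the same quantity as lm.r2.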

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
detach(College)
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Thank you to Selena Romero for the subset code!

##Subset Selection

library(leaps)
set.seed(1)
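# predict() method for regsubsets objects (leaps does not provide one)
predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]])  # formula used in the regsubsets() call
  mat=model.matrix(form,newdata)     # design matrix for the new data
  coefi=coef(object,id=id)           # coefficients of the best id-variable model
  xvars=names(coefi)
  mat[,xvars]%*%coefi
}
k=10
p=ncol(Boston)-1
folds=sample(rep(1:k,length=nrow(Boston)))
cv.errors=matrix(NA,k,p)
# For each fold, fit on the other folds and record the held-out MSE for each model size
for(i in 1:k){
  best.fit=regsubsets(crim~.,data=Boston[folds!=i,],nvmax=p)
  for(j in 1:p){
    pred=predict(best.fit,Boston[folds==i,],id=j)
    cv.errors[i,j]=mean((Boston$crim[folds==i]-pred)^2)
  }
}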
rmse.cv=sqrt(apply(cv.errors,2,mean))
plot(rmse.cv,pch=19,type="b")

summary(best.fit)

## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*"   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*"   "*" 
## 7  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

which.min(rmse.cv)

## [1] 9

boston.bsm.err=(rmse.cv[which.min(rmse.cv)]^2)
boston.bsm.err

## [1] 42.81453

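For best subset selection, cross-validation selects the 9-variable model, with a CV estimate of the MSE of 42.81453. Note that summary(best.fit) above describes the fit from the last cross-validation fold rather than the full data set. In that fit, the 9-variable model contains zn, nox, rm, dis, rad, ptratio, black, lstat, and medv.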
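Cross-validation only chooses the model size. To see which variables the chosen size contains, one common follow-up is to refit best subset selection on the full data set and read off the coefficients for that size. A minimal sketch, assuming the 9-variable size chosen above (full.fit and coef9 are hypothetical names):

```r
library(leaps)
library(MASS)

# Best subset selection on all of Boston, for model sizes 1 through 13
full.fit <- regsubsets(crim ~ ., data = Boston, nvmax = 13)

# Intercept plus the nine coefficients of the best 9-variable model
coef9 <- coef(full.fit, id = 9)
names(coef9)
```

The variables selected on the full data can differ from those selected within a single fold.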

##Lasso

boston.x=model.matrix(crim~.,data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1,type.measure="mse")
plot(boston.lasso)

coef(boston.lasso)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 2.176491
## zn          .       
## indus       .       
## chas        .       
## nox         .       
## rm          .       
## age         .       
## dis         .       
## rad         0.150484
## tax         .       
## ptratio     .       
## black       .       
## lstat       .       
## medv        .

boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err

## [1] 62.74783

At lambda.1se, which is the default that coef() uses for a cv.glmnet object, the lasso keeps only the rad variable. The CV MSE at that λ is 62.74783.

##Ridge Regression

boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)

coef(boston.ridge)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.523899542
## zn          -0.002949852
## indus        0.029276741
## chas        -0.166526007
## nox          1.874769665
## rm          -0.142852604
## age          0.006207995
## dis         -0.094547258
## rad          0.045932737
## tax          0.002086668
## ptratio      0.071258052
## black       -0.002605281
## lstat        0.035745604
## medv        -0.023480540

boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err

## [1] 58.8156

Ridge regression’s CV MSE at lambda.1se is 58.8156. Ridge shrinks the coefficients but keeps all 13 predictors in the model.
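The lasso and ridge errors above are read at lambda.1se, which by construction has a CV error at least as large as the one at lambda.min. The sketch below shows how both values could be extracted from a cv.glmnet object. It uses simulated data and hypothetical names (x, y, cv_fit, mse_min, mse_1se) so that it runs on its own.

```r
library(glmnet)

set.seed(1)
# Simulated data (an assumption for illustration)
x <- matrix(rnorm(200 * 8), 200, 8)
y <- 2 * x[, 1] - x[, 2] + rnorm(200)

# Ridge regression with 10-fold CV on the MSE
cv_fit <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")

# CV MSE at the lambda that minimises it, and at the one-standard-error lambda
mse_min <- cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min]
mse_1se <- cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.1se]
c(lambda.min = mse_min, lambda.1se = mse_1se)
```

When methods are compared on their best CV error, the lambda.min value is the one that matches the minimum-RMSE choice used for best subset selection.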

##PCR

boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)

## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.175    7.180    6.724    6.731    6.727    6.727
## adjCV         8.61    7.174    7.179    6.721    6.725    6.724    6.724
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.722    6.614    6.618     6.607     6.598     6.553     6.488
## adjCV    6.718    6.609    6.613     6.602     6.592     6.546     6.481
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

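The CV error decreases only slowly after 8 components (CV RMSEP of 6.614 at M = 8 versus 6.488 at M = 13), so I use 8 components. These explain 93.45% of the variance in the predictors and 42.47% of the variance in crim. The corresponding CV MSE is 44.38224 (the folds are random, so the value varies slightly between runs).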

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

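Cross-validated MSE estimates: Subset = 42.81453, Lasso = 62.74783, Ridge = 58.8156, PCR = 44.38224. The lasso and ridge values are taken at lambda.1se rather than lambda.min, so they are not the minimum CV errors for those methods.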

After reviewing the above models and their cross-validated errors, the best subset selection model (with 9 variables) appears to give the most accurate predictions of the crime rate in the Boston data.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

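No. Based on the subset table in (a), the 9-variable subset selection model drops indus, chas, age, and tax and keeps zn, nox, rm, dis, rad, ptratio, black, lstat, and medv. Cross-validation indicates that adding the remaining predictors does not lower the prediction error.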

Tyler Crawford

2022-10-21
