Assignment 5 Chapter 06 (page 259): 2, 9, 11

  1. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
  1. The lasso, relative to least squares, is:
  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  1. The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso shrinks the coefficient estimates towards zero. The lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than other methods like ridge regression. The least squares estimates have excessively high variance and the lasso solution can yield a reduction in variance at the expense of an increase in bias. This can generate more accurate predictions.

  1. Repeat (a) for ridge regression relative to least squares. For ridge regression (relative to least squares) is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression has the effect of “shrinking” large values of β’s towards zero which should improve the fit, because shrinking the coefficients can significantly reduce their variance. The penalty term makes the ridge regression estimates biased, but can also substantially reduce variance.

  2. Repeat (a) for non-linear methods relative to least squares. For non-linear methods relative to least squares, it is more flexible and hence will give improved accuracy when its increase in variance is less than its decrease in bias.

  1. In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
data(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
  1. Split the data set into a training set and a test set.
set.seed(1) 
trainingindex<-sample(nrow(College), 0.75*nrow(College))

train<-College[trainingindex, ]
test<-College[-trainingindex, ]
  1. Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit <- lm(Apps~.,data=train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16

Fit model to the test set and checking accuracy. Below is Mean Squared Error.

lm.pred <- predict(lm.fit, newdata=test)
lm.error <- mean((test$Apps-lm.pred)^2)
lm.error
## [1] 1384604

The test error obtained from the linear model fit on all variables is 1384604. The r2 for this model is 93.36%. This means that 93.36% of the variance is described by this model.

  1. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
trainX <- model.matrix(Apps~., data = train[,-1])
trainY <- train$Apps
testX <- model.matrix(Apps~., data=test[,-1])
testY <- test$Apps
set.seed(1)
grid <- 10^seq(10,-2,length=100)
ridge.fit <- cv.glmnet(trainX, trainY, alpha=0)#alpha =  0 for ridge regression
plot(ridge.fit)

By viewing the plot above we see that our mean squared error increases as log λ increases. Lets find the λ that most reduces the error.

best.lambda <- ridge.fit$lambda.min
best.lambda
## [1] 364.6228

After getting the best λ from cross-validation, I will predict with the ridge regression using the best λ value.

ridge.pred <- predict(ridge.fit, s=best.lambda, newx = testX)
ridge.err <- mean((ridge.pred-testY)^2)
ridge.err
## [1] 1260111

The ridge regression for this dataset has a test error of 1260111 which is better than the OLS. Ridge regression includes all variables in the model, but penalizes some variables and shrinks the coefficients towards zero. By doing so do not create a problem for prediction accuracy, but makes it challenging to interpret the relationship between the variables and the response variable.

  1. Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(1)
lasso.fit <- cv.glmnet(trainX, trainY, alpha=1)#alpha =  1 for lasso model
plot(lasso.fit)

best.lamb <- lasso.fit$lambda.min
best.lamb
## [1] 1.945882

After getting the best λ from cross-validation, I will predict with the ridge regression using the best λ value.

lasso.pred <- predict(lasso.fit, s=best.lamb, newx = testX)
lasso.err <- mean((lasso.pred-testY)^2)
lasso.err
## [1] 1394834

The Lasso model produces a test error higher than ridge regression which indicates that it has less predicting power than ridge regression.

  1. Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr.fit <- pcr(Apps~.,data = train,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3767     2088     2097     1815     1682     1675
## adjCV         3862     3766     2085     2096     1807     1669     1670
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1671     1620     1574      1568      1575      1579      1580
## adjCV     1666     1610     1568      1562      1570      1574      1574
##        14 comps  15 comps  16 comps  17 comps
## CV         1575      1502      1204      1134
## adjCV      1571      1486      1193      1126
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36
validationplot(pcr.fit, val.type = "MSEP")

Viewing the plot above we can see the MSEP decrease as the number of components increases.

Reviewing the summary of our PCR model we see that the smallest CV score, 1139, is produced with 17 of components.

pcr.pred <- predict(pcr.fit, test, ncomp=17)
pcr.err <- mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1384604

Our PCR has a test error of 1384604 which does not out perfom our ridge regression model. The nature of PCR model explains that as more principal components are used in the regression model, the bias decreases, but the variance increases.

  1. Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")

summary(pls.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1922     1704     1510     1425     1215     1158
## adjCV         3862     1918     1701     1503     1405     1196     1148
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1141     1136     1132      1129      1129      1125      1124
## adjCV     1133     1129     1124      1121      1121      1117      1116
##        14 comps  15 comps  16 comps  17 comps
## CV         1123      1123      1123      1123
## adjCV      1115      1116      1115      1115
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36

The lowest CV occurs around 14 components.

pls.pred <- predict(pls.fit, test, ncomp = 14)
pls.err <- mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1385374

The PLS model produces a test error of 1385374 a little greater than our PCR model. PLS tries to reduce dimensions, it reduces the bias and therefore increase the variance of the model.

  1. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
barplot(c(lm.error, ridge.err, lasso.err, pcr.err, pls.err), col="red", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

The best model was the Ridge Regression model. From the barplot above the test errors are drastically different, but we can see that the ridge model produces the smallest error.

  1. We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
attach(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
  1. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
set.seed(1) 
train<-sample(1:nrow(Boston), nrow(Boston)/2)
train.boston<-Boston[train,]
test<-(-train)
test.boston<-Boston[test,]
crim.test<-Boston$crim[test]
x.mat.train<-model.matrix(crim~., data=train.boston)[,-crim]
x.mat.train<-model.matrix(crim~., data=test.boston)[,-crim]

Best Subset Selection

library(leaps)
regfit.best<-regsubsets(crim~.,data=Boston[train,],nvmax=13)
test.mat<-model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
   coefi=coef(regfit.best,id=i)
   pred=test.mat[,names(coefi)]%*%coefi
   val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
##  [1] 40.14557 41.87706 42.40901 42.45745 43.03836 42.13258 41.25016 41.28957
##  [9] 41.49271 41.50577 41.48839 41.46692 41.54639
which.min(val.errors)
## [1] 1
sub.mse<-val.errors[1]
sub.mse
## [1] 40.14557

Best Subset selection selects a 1-variable model based on the Test MSE. At 1-variables, the CV estimate for the test MSE is 40.14557- the lowest MSE reported.

Ridge Regression:

library(glmnet)
set.seed(1)
x<-model.matrix(crim~.,Boston)[,-1]
y<-Boston$crim

train <-sample(nrow(Boston), 0.75*nrow(Boston))
y.test<-y[-train]
grid<-10^seq(10,-2,length=100)
ridge.mod <- glmnet(x,y,lambda=grid, alpha=0)
cv.out<-cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)

bestlam<-cv.out$lambda.min
ridge.pred<-predict(ridge.mod,s=bestlam,newx=x[test,])
ridge.mse<-mean((ridge.pred-y.test)^2) 
## Warning in ridge.pred - y.test: longer object length is not a multiple of
## shorter object length
ridge.mse
## [1] 120.8606

Ridge Regression MSE 120.8606

Lasso Model:

grid<-10^seq(10,-2,length=100)
lasso.mod<-glmnet(x,y,lambda=grid, alpha=1)
cv.out<-cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam<-cv.out$lambda.min
lasso.pred<-predict(lasso.mod,s=bestlam,newx=x[test,])
lasso.mse<-mean((lasso.pred-y.test)^2)
## Warning in lasso.pred - y.test: longer object length is not a multiple of
## shorter object length
lasso.mse
## [1] 122.0856

The MSE for this method is 122.0856.

PCR

set.seed(1)
library(pls)
pcr.fit<-pcr(crim~., data=train.boston, Scale=TRUE, validation='CV')
summary(pcr.fit)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.305    7.267    7.254    7.234    7.130    7.115
## adjCV        9.275    7.300    7.259    7.246    7.226    7.121    7.107
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.795    6.763    6.743     6.694     6.694     6.696     6.700
## adjCV    6.782    6.753    6.732     6.682     6.682     6.684     6.686
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       81.27    97.18    99.08    99.72    99.89    99.93    99.97    99.99
## crim    39.26    40.61    40.87    41.25    43.24    43.95    48.88    49.34
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X      100.00    100.00    100.00    100.00    100.00
## crim    49.94     50.86     50.93     50.98     51.37
validationplot(pcr.fit,val.type="MSEP")

pcr.pred<-predict(pcr.fit,test.boston,ncomp = 11)
pcr.mse<-mean((pcr.pred-y.test)^2)
## Warning in pcr.pred - y.test: longer object length is not a multiple of shorter
## object length
pcr.mse
## [1] 131.4856

Reviewing the summary of our PCR model we see that the smallest CV score, 6.694 is produced with 10 and 11 components.Using 10 components our MSR produced is 131.4856

  1. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

As computed above, the model that has the lowest cross-validation error is the one chosen by the Ridge Regression method. This model had an MSE of 40.14557 with one variable.

  1. Does your chosen model involve all of the features in the data set? Why or why not?

This model does not include all of the features in the dataset.