## Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii is the correct answer. The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso reduces overfitting, and hence variance in the predictions, by constraining (shrinking) the coefficient estimates. As long as the bias does not increase too much, it will predict better than ordinary least squares, which can assign too much weight to spurious predictors.
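
Concretely, the lasso coefficient estimates minimize the penalized residual sum of squares

\[\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|,\]

so larger values of \(\lambda\) shrink more coefficients, some exactly to zero, trading a small increase in bias for a larger reduction in variance.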

(b) Repeat (a) for ridge regression relative to least squares.

iii is the correct answer. Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, it reduces overfitting, and hence variance in the predictions, by constraining the coefficient estimates. As long as the bias does not increase too much, it will predict better than ordinary least squares, which can assign too much weight to spurious predictors.
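
Ridge regression instead penalizes the squared magnitudes of the coefficients, minimizing

\[\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,\]

which shrinks the coefficients toward zero but, unlike the lasso, never sets them exactly to zero.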

(c) Repeat (a) for non-linear methods relative to least squares.

ii is the correct answer. Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias. These methods perform better when the linearity assumption is not appropriate, but they generally have higher variance because they are more sensitive to the particular training data.

## Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
library(leaps)
library(glmnet)
attach(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

(a) Split the data set into a training set and a test set.

set.seed(1)
train<-sample(nrow(College),0.75*nrow(College))
College_train<-College[train,]
College_test<-College[-train,]
dim(College)
## [1] 777  18
dim(College_test)
## [1] 195  18
dim(College_train)
## [1] 582  18

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm_fit<-lm(Apps~.,data=College_train)
summary(lm_fit)
## 
## Call:
## lm(formula = Apps ~ ., data = College_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16

The linear model yields a test error of 1,384,604.
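
Here and below, the test error is the mean squared error over the \(n = 195\) held-out observations:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.\]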

lm_pred<-predict(lm_fit,College_test)
lm_error<-mean((lm_pred-College_test$Apps)^2)
lm_error
## [1] 1384604

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

set.seed(1)
train_mat<-model.matrix(Apps~.,College_train)[,-1]
test_mat<-model.matrix(Apps~.,data=College_test)[,-1]

Cross-validation identifies 364.6228 as the best lambda.

cv_out<-cv.glmnet(train_mat,College_train$Apps,alpha=0)
best_lambda<-cv_out$lambda.min
best_lambda
## [1] 364.6228

The test error yielded by the ridge regression model is 1,206,346, an improvement over the original linear regression model.

ridge_mod<-glmnet(train_mat,College_train$Apps,alpha=0)
ridge_pred<-predict(ridge_mod,s=best_lambda,newx=test_mat)
mean((ridge_pred-College_test$Apps)^2)
## [1] 1206346

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(1)
cv_out2<-cv.glmnet(train_mat,College_train$Apps,alpha=1)
best_lambda2<-cv_out2$lambda.min
best_lambda2
## [1] 2.135603

The lasso model, with lambda chosen by cross-validation, yields a test error of 1,370,403. While this improves on the ordinary linear regression model, it is not as good as the ridge regression model.

lasso_mod<-glmnet(train_mat,College_train$Apps,alpha=1)
lasso_pred<-predict(lasso_mod,s=best_lambda2,newx=test_mat)
mean((lasso_pred-College_test$Apps)^2)
## [1] 1370403
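
Part (d) also asks for the number of non-zero coefficient estimates. These can be pulled from the fitted model at the chosen lambda; the following is a sketch using the objects fit above (the resulting count is not reproduced here):

lasso_coefs<-predict(lasso_mod,type="coefficients",s=best_lambda2)[,1]
sum(lasso_coefs!=0)-1   # number of non-zero coefficients, excluding the intercept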

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(1)
pcr_fit<-pcr(Apps~.,data=College_train,scale=TRUE,validation="CV")
validationplot(pcr_fit,val.type="MSEP")

The CV error drops sharply up to about \(M=10\) and is then essentially flat until the final few components. The summary below shows a local minimum in the cross-validated RMSEP of 1568 at 10 components; the error does fall further toward the full 17-component model, but that model is equivalent to least squares, so we choose \(M=10\).

summary(pcr_fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3767     2088     2097     1815     1682     1675
## adjCV         3862     3766     2085     2096     1807     1669     1670
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1671     1620     1574      1568      1575      1579      1580
## adjCV     1666     1610     1568      1562      1570      1574      1574
##        14 comps  15 comps  16 comps  17 comps
## CV         1575      1502      1204      1134
## adjCV      1571      1486      1193      1126
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36
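
As a sanity check, the cross-validated RMSEP can be extracted from the fit object rather than read off the plot. This is a sketch using the pls package's RMSEP() accessor; the same lines work for the PLS fit in part (f):

cv_rmsep<-RMSEP(pcr_fit)$val["CV","Apps",]
which.min(cv_rmsep)-1   # subtract 1: the first entry is the intercept-only model

Note that the raw minimum sits at the full 17-component model, which is just least squares; this is why we select the local minimum at \(M=10\) instead.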

Fitting a PCR model with \(M=10\) yields a test error of 1,952,693.

pcr_pred<-predict(pcr_fit,College_test,ncomp=10)
mean((pcr_pred-College_test$Apps)^2)
## [1] 1952693

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls_fit<-plsr(Apps~.,data=College_train,scale=TRUE,validation="CV")
validationplot(pls_fit,val.type="MSEP")

From the plot above, the CV error with PLS appears to level off after about 6 components, though this is not entirely clear. The summary below shows that the lowest CV RMSEP, 1134, is first reached at 13 components and stays flat thereafter.

summary(pls_fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1921     1712     1511     1429     1255     1175
## adjCV         3862     1916     1710     1503     1410     1229     1164
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1153     1147     1139      1138      1138      1136      1134
## adjCV     1143     1138     1131      1130      1129      1128      1126
##        14 comps  15 comps  16 comps  17 comps
## CV         1134      1134      1134      1134
## adjCV      1126      1126      1126      1126
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36

Fitting a PLS model with \(M=13\) yields a test error of 1,380,520.

pls_pred<-predict(pls_fit,College_test,ncomp=13)
mean((pls_pred-College_test$Apps)^2)
## [1] 1380520

Fitting a PLS model with \(M=6\) yields an even lower test error of 1,346,477.

pls_pred2<-predict(pls_fit,College_test,ncomp=6)
mean((pls_pred2-College_test$Apps)^2)
## [1] 1346477

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The ridge regression model appears to perform best, with a test error of 1,206,346. The other models, ranked by test error, are: the PLS model with 6 components (1,346,477), the lasso model (1,370,403), the PLS model with 13 components (1,380,520), the linear model (1,384,604), and the PCR model with 10 components (1,952,693). The lasso, PLS (with either 6 or 13 components), and linear models all yield similar test errors, but the PCR model is markedly worse.

To quantify prediction accuracy, we compute the test R-squared for the ridge regression model. Based on this value, the ridge model explains about 92.04% of the variance in Apps on the test set.
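
The test \(R^2\) computed below is

\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},\]

one minus the ratio of the residual sum of squares to the total sum of squares on the test set.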

Sumofsqrs<-sum((mean(College_test$Apps)-College_test$Apps)^2)
Sumofresd<-sum((ridge_pred-College_test$Apps)^2)
1-(Sumofresd)/(Sumofsqrs)
## [1] 0.9204048
detach(College)

## Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
attach(Boston)
library(leaps)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection

We use cross-validation to choose among models of different sizes. This approach performs best subset selection within each of the k training folds; the for loop below carries out the cross-validation, using the predict method for regsubsets objects defined first.

set.seed(3)
# Prediction method for regsubsets objects, which leaps does not provide
predict.regsubsets <- function(object, newdata, id, ...) {
    form <- as.formula(object$call[[2]])  # recover the formula used in the fit
    mat <- model.matrix(form, newdata)    # build the design matrix for newdata
    coefi <- coef(object, id = id)        # coefficients of the id-variable model
    xvars <- names(coefi)
    mat[, xvars] %*% coefi                # predictions
}

k = 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)  # assign each observation to a fold
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
    # best subset selection on all folds except fold j
    best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
    for (i in 1:13) {
        # evaluate the i-variable model on the held-out fold
        pred <- predict(best.fit, Boston[folds == j, ], id = i)
        cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
    }
}

The cross-validation approach suggests that the best model includes 9 variables, yielding an estimated test MSE of 39.33337.

mean_cv_errors <- apply(cv.errors, 2, mean)
plot(mean_cv_errors, type = "b", xlab = "Number of variables", ylab = "CV error")

mean_cv_errors
##        1        2        3        4        5        6        7        8 
## 41.80889 40.34632 40.88195 41.11506 40.05517 40.07341 39.81669 39.75006 
##        9       10       11       12       13 
## 39.33337 39.40103 39.46477 39.34757 39.44623
which.min(mean_cv_errors)
## 9 
## 9

We now perform best subset selection on the full data set in order to obtain the coefficients of the 9-variable model. From the output below, this model includes the following variables (with their coefficients): zn (0.04279), indus (-0.09939), nox (-10.46649), dis (-1.00260), rad (0.53950), ptratio (-0.27084), black (-0.00800), lstat (0.11781), and medv (-0.18059).

reg_best<-regsubsets(crim~.,data=Boston,nvmax=13)
coef(reg_best,9)
##   (Intercept)            zn         indus           nox           dis 
##  19.124636156   0.042788127  -0.099385948 -10.466490364  -1.002597606 
##           rad       ptratio         black         lstat          medv 
##   0.539503547  -0.270835584  -0.008003761   0.117805932  -0.180593877

Ridge Regression

library(glmnet)
x<-model.matrix(crim~.,Boston[,-1])
y<-Boston$crim
grid<-10^seq(10,-2,length=100)
ridge_mod=glmnet(x,y,alpha=0,lambda=grid)
set.seed(4)
train<-sample(1:nrow(x),nrow(x)/2)
test<-(-train)
y_test<-y[test]
cv_out<-cv.glmnet(x[train,],y[train],alpha=0)
plot(cv_out)

bestlam<-cv_out$lambda.min
bestlam
## [1] 0.5974583

A ridge regression model using a lambda of 0.59746 yields a test mean squared error of 23.7693.

ridge_pred<-predict(ridge_mod,s=bestlam,newx=x[test,])
mean((ridge_pred-y_test)^2)
## [1] 23.7693

The summary below provides the coefficients for the variables in the ridge regression model. (The second, zero-valued (Intercept) row appears because the intercept column was not dropped when the model matrix x was built.)

out<-glmnet(x,y,alpha=0)
predict(out,type='coefficients',s=bestlam)[1:15,]
##  (Intercept)  (Intercept)           zn        indus         chas          nox 
##  8.558405146  0.000000000  0.032264880 -0.081038069 -0.740321915 -5.055549457 
##           rm          age          dis          rad          tax      ptratio 
##  0.327086045  0.002091213 -0.681302675  0.413098467  0.003733519 -0.126490776 
##        black        lstat         medv 
## -0.008539294  0.142709077 -0.135868175

Lasso

lasso_mod<-glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso_mod)

set.seed(3)
cv_out2<-cv.glmnet(x[train,],y[train],alpha=1)
plot(cv_out2)

bestlam2<-cv_out2$lambda.min
bestlam2
## [1] 0.1200433

A lasso model using a lambda of 0.1200433 yields a test mean squared error of 26.98846. While this improves on the estimated test MSE of the best subset model, it does not appear to have an advantage over the ridge regression model.

lasso_pred<-predict(lasso_mod,s=bestlam2,newx=x[test,])
mean((lasso_pred-y_test)^2)
## [1] 26.98846

The summary below provides the coefficients for the variables in the lasso model.

out2<-glmnet(x,y,alpha=1,lambda=grid)
lasso_coef<-predict(out2,type='coefficients',s=bestlam2)[1:15,]
lasso_coef
## (Intercept) (Intercept)          zn       indus        chas         nox 
##  -0.2842671   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##          rm         age         dis         rad         tax     ptratio 
##   0.0000000   0.0000000   0.0000000   0.3569035   0.0000000   0.0000000 
##       black       lstat        medv 
##   0.0000000   0.0386921   0.0000000

Principal Components Regression (PCR)

library(pls)

Apart from the full 13-component model (CV RMSEP 6.573), which is equivalent to least squares, the CV error reaches a local minimum of 6.654 at 8 components and improves only marginally beyond that. This is apparent from the validation plot as well. A model with 8 components has a CV RMSEP of 6.654, which corresponds to an MSE of \(6.654^2 \approx 44.28\).

set.seed(2)
pcr_fit<-pcr(crim~.,data=Boston,scale=TRUE,validation="CV")
summary(pcr_fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.229    7.227    6.814    6.799    6.795    6.794
## adjCV         8.61    7.225    7.222    6.807    6.789    6.788    6.787
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.787    6.654    6.664     6.673     6.676     6.651     6.573
## adjCV    6.780    6.645    6.656     6.664     6.666     6.639     6.562
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr_fit,val.type="MSEP")

When we perform PCR on the training set alone, the CV error levels off at about 3-4 components; with 4 components the CV RMSEP is 8.007, which corresponds to an MSE of \(8.007^2 \approx 64.11\).

set.seed(1)
pcr_fit2<-pcr(crim~.,data=Boston[train,],scale=TRUE,validation="CV")
summary(pcr_fit2)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.916    8.328    8.331    8.006    8.007    8.042    8.027
## adjCV        9.916    8.326    8.329    8.001    8.003    8.038    8.019
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       8.048    8.023    8.007     8.085     8.043     7.899     7.784
## adjCV    8.039    7.986    7.992     8.069     8.026     7.877     7.763
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.54    60.71    69.92    76.99    82.88    87.59    91.15    93.56
## crim    29.46    29.54    35.16    35.31    35.38    36.46    36.65    38.53
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.86     97.43     98.72     99.60    100.00
## crim    38.54     38.57     39.05     42.36     43.81
validationplot(pcr_fit2,val.type="MSEP")

The test MSE computed for the PCR model with 4 components is 27.02963.

pcr_pred<-predict(pcr_fit2,Boston[test,],ncomp=4)
mean((pcr_pred-Boston$crim[test])^2)
## [1] 27.02963

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

Cross-validation was used to select each model's tuning parameters, and performance was compared using test (validation set) MSE rather than training error. The ridge regression model appears to perform best on this data set, yielding the lowest test MSE of 23.7693.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

The ridge regression model does include all of the predictor variables. Ridge regression constrains (shrinks) the coefficient estimates toward zero in order to reduce variance, but unlike the lasso it never sets a coefficient exactly to zero, so no features are dropped from the model.
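
As a quick confirmation, we can list the non-zero coefficients at the chosen lambda using the objects fit above (a sketch):

ridge_coefs<-predict(out,type='coefficients',s=bestlam)[,1]
names(ridge_coefs)[ridge_coefs!=0]   # every one of the 13 predictors appears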