## Problem 2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii is the correct answer. The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. By constraining the coefficient estimates, the lasso reduces overfitting and the variance of the predictions. As long as bias is not increased too much, the lasso will outperform ordinary least squares, which may place too much weight on spurious predictors.
(b) Repeat (a) for ridge regression relative to least squares.
iii is the correct answer. Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, ridge regression reduces overfitting and the variance of the predictions by constraining the coefficient estimates. As long as bias is not increased too much, it will outperform ordinary least squares, which may place too much weight on spurious predictors.
(c) Repeat (a) for non-linear methods relative to least squares.
ii is the correct answer. Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias. These methods perform better when the linearity assumption is not appropriate, but they generally have higher variance because they are more sensitive to the particulars of the underlying data.
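To make the bias-variance tradeoff concrete, here is a minimal simulation sketch (not part of the assignment; the seed, dimensions, and coefficients are arbitrary choices) in which most predictors are pure noise, so the shrinkage methods trade a small increase in bias for a large reduction in variance relative to least squares:
library(glmnet)
set.seed(10)
n <- 60; p <- 50                        # few observations relative to predictors
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))     # only the first 5 predictors matter
y <- as.vector(x %*% beta + rnorm(n))
x_new <- matrix(rnorm(1000 * p), 1000, p)
y_new <- as.vector(x_new %*% beta + rnorm(1000))
ols <- lm(y ~ x)                        # least squares: low bias, high variance here
ols_pred <- cbind(1, x_new) %*% coef(ols)
ridge <- cv.glmnet(x, y, alpha = 0)     # lambda chosen by cross-validation
lasso <- cv.glmnet(x, y, alpha = 1)
c(OLS   = mean((ols_pred - y_new)^2),
  Ridge = mean((predict(ridge, x_new, s = "lambda.min") - y_new)^2),
  Lasso = mean((predict(lasso, x_new, s = "lambda.min") - y_new)^2))
In a run like this, the two penalized fits typically show a much lower test MSE than least squares, mirroring answer iii above.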
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
library(leaps)
library(glmnet)
attach(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
(a) Split the data set into a training set and a test set.
set.seed(1)
train<-sample(nrow(College),0.75*nrow(College))
College_train<-College[train,]
College_test<-College[-train,]
dim(College)
## [1] 777 18
dim(College_test)
## [1] 195 18
dim(College_train)
## [1] 582 18
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm_fit<-lm(Apps~.,data=College_train)
summary(lm_fit)
##
## Call:
## lm(formula = Apps ~ ., data = College_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
The linear model yields a test error of 1,384,604.
lm_pred<-predict(lm_fit,College_test)
lm_error<-mean((lm_pred-College_test$Apps)^2)
lm_error
## [1] 1384604
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
set.seed(1)
train_mat<-model.matrix(Apps~.,College_train)[,-1]
test_mat<-model.matrix(Apps~.,data=College_test)[,-1]
Cross-validation identifies 364.6228 as the best lambda.
cv_out<-cv.glmnet(train_mat,College_train$Apps,alpha=0)
best_lambda<-cv_out$lambda.min
best_lambda
## [1] 364.6228
The test error yielded by the ridge regression model is 1,206,346, an improvement over the original linear regression model.
ridge_mod<-glmnet(train_mat,College_train$Apps,alpha=0)
ridge_pred<-predict(ridge_mod,s=best_lambda,newx=test_mat)
mean((ridge_pred-College_test$Apps)^2)
## [1] 1206346
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(1)
cv_out2<-cv.glmnet(train_mat,College_train$Apps,alpha=1)
best_lambda2<-cv_out2$lambda.min
best_lambda2
## [1] 2.135603
The test error yielded by the lasso model, fit with the lambda chosen by cross-validation, is 1,370,403. While this is an improvement on the ordinary linear regression model, it is not as good as the ridge regression model.
lasso_mod<-glmnet(train_mat,College_train$Apps,alpha=1)
lasso_pred<-predict(lasso_mod,s=best_lambda2,newx=test_mat)
mean((lasso_pred-College_test$Apps)^2)
## [1] 1370403
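The exercise also asks for the number of non-zero coefficient estimates. A quick way to obtain it from the fitted model (output not reproduced here):
# count the non-zero coefficient estimates at the CV-chosen lambda
lasso_coef_college <- predict(lasso_mod, type = "coefficients", s = best_lambda2)
sum(lasso_coef_college != 0) - 1   # subtract 1 for the intercept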
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
set.seed(1)
pcr_fit<-pcr(Apps~.,data=College_train,scale=TRUE,validation="CV")
validationplot(pcr_fit,val.type="MSEP")
From the validation plot and the summary below, the CV error reaches a local minimum at M=10, where the root mean squared error of prediction is 1568, and it barely changes through M=14. The global minimum occurs at M=17, but using all 17 components performs no dimension reduction at all (it is equivalent to least squares), so we select M=10.
summary(pcr_fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3767 2088 2097 1815 1682 1675
## adjCV 3862 3766 2085 2096 1807 1669 1670
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1671 1620 1574 1568 1575 1579 1580
## adjCV 1666 1610 1568 1562 1570 1574 1574
## 14 comps 15 comps 16 comps 17 comps
## CV 1575 1502 1204 1134
## adjCV 1571 1486 1193 1126
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
Fitting a PCR model with \(M=10\) yields a test error of 1,952,693.
pcr_pred<-predict(pcr_fit,College_test,ncomp=10)
mean((pcr_pred-College_test$Apps)^2)
## [1] 1952693
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls_fit<-plsr(Apps~.,data=College_train,scale=TRUE,validation="CV")
validationplot(pls_fit,val.type="MSEP")
From the plot above, it appears that the MSEP with PLS dimension reduction levels off after about 6 components, though the minimum is not entirely clear. In the summary below, we can see that the lowest CV RMSEP of 1134 is first reached at 13 components.
summary(pls_fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1921 1712 1511 1429 1255 1175
## adjCV 3862 1916 1710 1503 1410 1229 1164
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1153 1147 1139 1138 1138 1136 1134
## adjCV 1143 1138 1131 1130 1129 1128 1126
## 14 comps 15 comps 16 comps 17 comps
## CV 1134 1134 1134 1134
## adjCV 1126 1126 1126 1126
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
Fitting a PLS model with \(M=13\) yields a test error of 1,380,520.
pls_pred<-predict(pls_fit,College_test,ncomp=13)
mean((pls_pred-College_test$Apps)^2)
## [1] 1380520
Fitting a PLS model with \(M=6\) yields an even lower test error of 1,346,477.
pls_pred2<-predict(pls_fit,College_test,ncomp=6)
mean((pls_pred2-College_test$Apps)^2)
## [1] 1346477
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
The ridge regression model appears to perform best, with a test error of 1,206,346. The other models rank as follows by test error: PLS with 6 components (1,346,477), lasso (1,370,403), PLS with 13 components (1,380,520), linear model (1,384,604), and PCR with 10 components (1,952,693). The lasso, PLS (with either 6 or 13 components), and linear models all yield similar test errors, but the test error of the PCR model is noticeably worse.
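For a side-by-side view, the test errors already computed above can be collected into one sorted vector; this is just a convenience sketch reusing the prediction objects from parts (b) through (f):
# gather the College test MSEs computed earlier in this session
test_mse <- c(Linear = mean((lm_pred - College_test$Apps)^2),
              Ridge  = mean((ridge_pred - College_test$Apps)^2),
              Lasso  = mean((lasso_pred - College_test$Apps)^2),
              PCR_10 = mean((pcr_pred - College_test$Apps)^2),
              PLS_13 = mean((pls_pred - College_test$Apps)^2),
              PLS_6  = mean((pls_pred2 - College_test$Apps)^2))
sort(test_mse)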
We can quantify the prediction accuracy of the ridge regression model with its R-squared on the test set, computed as one minus the ratio of the residual sum of squares to the total sum of squares. From the value calculated below, the ridge regression model explains 92.04% of the variance in Apps.
Sumofsqrs<-sum((mean(College_test$Apps)-College_test$Apps)^2)
Sumofresd<-sum((ridge_pred-College_test$Apps)^2)
1-(Sumofresd)/(Sumofsqrs)
## [1] 0.9204048
detach(College)
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
attach(Boston)
library(leaps)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Best Subset Selection
I use cross-validation to choose among models of different sizes. This approach involves performing best subset selection within each of the k training folds. Because leaps does not supply a predict method for regsubsets objects, we first define one, and then run the cross-validation in a for loop.
set.seed(3)
# prediction helper for regsubsets objects (leaps does not provide one):
# rebuild the model matrix from the original formula, then multiply by the
# coefficients of the id-variable model
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
k <- 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
  for (i in 1:13) {
    pred <- predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}
The cross-validation approach suggests that the best model includes 9 variables, with an estimated test MSE of 39.33337.
mean_cv_errors <- apply(cv.errors, 2, mean)
plot(mean_cv_errors, type = "b", xlab = "Number of variables", ylab = "CV error")
mean_cv_errors
## 1 2 3 4 5 6 7 8
## 41.80889 40.34632 40.88195 41.11506 40.05517 40.07341 39.81669 39.75006
## 9 10 11 12 13
## 39.33337 39.40103 39.46477 39.34757 39.44623
which.min(mean_cv_errors)
## 9
## 9
We now perform best subset selection on the full data set in order to obtain the coefficients for the 9-variable model. From the output below, we can see that this model includes the following variables, with their corresponding coefficients: zn (0.0428), indus (-0.0994), nox (-10.4665), dis (-1.0026), rad (0.5395), ptratio (-0.2708), black (-0.0080), lstat (0.1178), and medv (-0.1806).
reg_best<-regsubsets(crim~.,data=Boston,nvmax=13)
coef(reg_best,9)
## (Intercept) zn indus nox dis
## 19.124636156 0.042788127 -0.099385948 -10.466490364 -1.002597606
## rad ptratio black lstat medv
## 0.539503547 -0.270835584 -0.008003761 0.117805932 -0.180593877
Ridge Regression
library(glmnet)
# note: Boston[,-1] drops crim (the response is found via attach()), and the
# model matrix keeps its intercept column, which is why a second "(Intercept)"
# row with a zero coefficient appears in the printouts below
x<-model.matrix(crim~.,Boston[,-1])
y<-Boston$crim
grid<-10^seq(10,-2,length=100)
ridge_mod=glmnet(x,y,alpha=0,lambda=grid)
set.seed(4)
train<-sample(1:nrow(x),nrow(x)/2)
test<-(-train)
y_test<-y[test]
cv_out<-cv.glmnet(x[train,],y[train],alpha=0)
plot(cv_out)
bestlam<-cv_out$lambda.min
bestlam
## [1] 0.5974583
A ridge regression model using a lambda of 0.59746 yields a test mean squared error of 23.7693.
ridge_pred<-predict(ridge_mod,s=bestlam,newx=x[test,])
mean((ridge_pred-y_test)^2)
## [1] 23.7693
The summary below provides the coefficients for the variables in the ridge regression model.
out<-glmnet(x,y,alpha=0)
predict(out,type='coefficients',s=bestlam)[1:15,]
## (Intercept) (Intercept) zn indus chas nox
## 8.558405146 0.000000000 0.032264880 -0.081038069 -0.740321915 -5.055549457
## rm age dis rad tax ptratio
## 0.327086045 0.002091213 -0.681302675 0.413098467 0.003733519 -0.126490776
## black lstat medv
## -0.008539294 0.142709077 -0.135868175
Lasso
lasso_mod<-glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso_mod)
set.seed(3)
cv_out2<-cv.glmnet(x[train,],y[train],alpha=1)
plot(cv_out2)
bestlam2<-cv_out2$lambda.min
bestlam2
## [1] 0.1200433
A lasso model using a lambda of 0.1200433 yields a test mean squared error of 26.98846. While this improves on the estimated test MSE of the best subset selection model, it does not appear to have an advantage over the ridge regression model.
lasso_pred<-predict(lasso_mod,s=bestlam2,newx=x[test,])
mean((lasso_pred-y_test)^2)
## [1] 26.98846
The output below provides the coefficients for the lasso model. Note that the value passed to s here is best_lambda2 (2.1356), the lambda chosen by cross-validation in the College problem above, rather than bestlam2; at that larger penalty nearly every coefficient is shrunk exactly to zero.
out2<-glmnet(x,y,alpha=1,lambda=grid)
lasso_coef<-predict(out2,type='coefficients',s=best_lambda2)[1:15,]
lasso_coef
## (Intercept) (Intercept) zn indus chas nox
## -0.2842671 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## rm age dis rad tax ptratio
## 0.0000000 0.0000000 0.0000000 0.3569035 0.0000000 0.0000000
## black lstat medv
## 0.0000000 0.0386921 0.0000000
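To inspect the coefficients at the lambda actually chosen by cross-validation for this problem, we would instead pass bestlam2; a brief sketch (output not shown; the smaller penalty would leave more coefficients non-zero):
# coefficients at the CV-chosen lambda (about 0.12) for the Boston lasso fit
predict(out2, type = "coefficients", s = bestlam2)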
Principal Components Regression (PCR)
library(pls)
From the summary and validation plot below, the lowest CV RMSEP of 6.573 is produced by the model with all 13 components, which amounts to no dimension reduction at all. Among the reduced models, 8 components gives a CV RMSEP of 6.654, which corresponds to an MSE of \(6.654^2 \approx 44.28\).
set.seed(2)
pcr_fit<-pcr(crim~.,data=Boston,scale=TRUE,validation="CV")
summary(pcr_fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.229 7.227 6.814 6.799 6.795 6.794
## adjCV 8.61 7.225 7.222 6.807 6.789 6.788 6.787
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.787 6.654 6.664 6.673 6.676 6.651 6.573
## adjCV 6.780 6.645 6.656 6.664 6.666 6.639 6.562
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pcr_fit,val.type="MSEP")
When we instead perform PCR on the training set and evaluate test-set performance, the CV error flattens out at roughly 8.01 after 3 or 4 components (a CV RMSEP of 8.007 at 4 components corresponds to an MSE of \(8.007^2 \approx 64.11\)); only the full 13-component model does appreciably better, so we use \(M=4\).
set.seed(1)
pcr_fit2<-pcr(crim~.,data=Boston[train,],scale=TRUE,validation="CV")
summary(pcr_fit2)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.916 8.328 8.331 8.006 8.007 8.042 8.027
## adjCV 9.916 8.326 8.329 8.001 8.003 8.038 8.019
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 8.048 8.023 8.007 8.085 8.043 7.899 7.784
## adjCV 8.039 7.986 7.992 8.069 8.026 7.877 7.763
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.54 60.71 69.92 76.99 82.88 87.59 91.15 93.56
## crim 29.46 29.54 35.16 35.31 35.38 36.46 36.65 38.53
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.86 97.43 98.72 99.60 100.00
## crim 38.54 38.57 39.05 42.36 43.81
validationplot(pcr_fit2,val.type="MSEP")
The test MSE computed for the PCR model with 4 components is 27.02963.
pcr_pred<-predict(pcr_fit2,Boston[test,],ncomp=4)
mean((pcr_pred-Boston$crim[test])^2)
## [1] 27.02963
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Held-out data was used to evaluate each model: best subset selection was assessed by 10-fold cross-validation on the full data set, while ridge, lasso, and PCR were tuned by cross-validation and evaluated on a 50% validation split. The ridge regression model appears to perform best on this data set, yielding the lowest validation-set MSE of 23.7693.
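For a side-by-side look at the three models evaluated on the same split, the validation-set MSEs computed above can be gathered into one sorted vector (a convenience sketch reusing the Boston prediction objects from part (a)):
# validation-set MSEs for ridge, lasso, and 4-component PCR, all evaluated
# on the same 50% hold-out split of the Boston data
boston_mse <- c(Ridge = mean((ridge_pred - y_test)^2),
                Lasso = mean((lasso_pred - y_test)^2),
                PCR_4 = mean((pcr_pred - Boston$crim[test])^2))
sort(boston_mse)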
(c) Does your chosen model involve all of the features in the data set? Why or why not?
The ridge regression model selected does include all of the predictor variables. Its L2 penalty shrinks every coefficient toward zero in order to reduce variance, but it never sets a coefficient exactly to zero, so no predictors are dropped from the model.
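This behavior is easy to verify on the fits from above; the following sketch counts the non-zero coefficients in the ridge and lasso models at their CV-chosen lambdas (reusing the out and out2 objects defined earlier):
# count non-zero coefficients, dropping the intercept row(s); ridge keeps
# every predictor, while the lasso's L1 penalty can zero some out entirely
drop_intercept <- function(m) m[rownames(m) != "(Intercept)", ]
ridge_coefs <- drop_intercept(predict(out, type = "coefficients", s = bestlam))
lasso_coefs <- drop_intercept(predict(out2, type = "coefficients", s = bestlam2))
c(ridge_nonzero = sum(ridge_coefs != 0), lasso_nonzero = sum(lasso_coefs != 0))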