For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The correct answer for part (a) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. When the least squares estimates have excessively high variance, the lasso can reduce that variance at the cost of a small increase in bias, which can yield more accurate predictions. In addition, the lasso performs variable selection, which makes the resulting model easier to interpret than one produced by, say, ridge regression.
The correct answer for part (b) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Like the lasso, ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge fit decreases, lowering variance but raising bias. When a small change in the training data produces large swings in the least squares coefficients (i.e., the least squares fit has high variance), ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Hence ridge regression works best in situations where the least squares estimates have high variance. The main difference from ridge is that the lasso also performs variable selection, which makes its models easier to interpret.
The correct answer for part (c) is: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Explanation: Non-linear methods are more flexible than least squares, so they reduce bias at the cost of added variance; they predict better when the decrease in bias outweighs the increase in variance.
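All three answers follow from the decomposition of the expected test error into variance, squared bias, and irreducible error, as presented in ISLR:
\[
E\!\left[\big(y_0 - \hat{f}(x_0)\big)^2\right] \;=\; \mathrm{Var}\!\big(\hat{f}(x_0)\big) \;+\; \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 \;+\; \mathrm{Var}(\varepsilon).
\]
A less flexible method (parts (a) and (b)) improves on least squares when the drop in the variance term exceeds the rise in the squared-bias term; a more flexible method (part (c)) improves when the reverse holds.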
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
pairs(College)
set.seed(1)
trainingindex<-sample(nrow(College), 0.75*nrow(College))
head(trainingindex)
## [1] 207 289 444 703 156 694
train<-College[trainingindex, ]
test<-College[-trainingindex, ]
dim(College)
## [1] 777 18
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5098.3 -430.6 -10.4 297.1 7309.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -251.47922 469.10371 -0.536 0.592112
## PrivateYes -516.87524 158.90204 -3.253 0.001211 **
## Accept 1.62758 0.04355 37.373 < 2e-16 ***
## Enroll -0.87037 0.20440 -4.258 2.41e-05 ***
## Top10perc 49.52847 6.61170 7.491 2.65e-13 ***
## Top25perc -14.31333 5.22516 -2.739 0.006352 **
## F.Undergrad 0.04218 0.03759 1.122 0.262229
## P.Undergrad -0.03403 0.04725 -0.720 0.471675
## Outstate -0.08314 0.02218 -3.748 0.000197 ***
## Room.Board 0.16414 0.05508 2.980 0.003007 **
## Books 0.02537 0.30798 0.082 0.934377
## Personal 0.05613 0.07492 0.749 0.454068
## PhD -5.98926 5.71897 -1.047 0.295428
## Terminal -6.08900 6.35359 -0.958 0.338293
## S.F.Ratio 12.95405 15.08968 0.858 0.390997
## perc.alumni -2.51887 4.79885 -0.525 0.599866
## Expend 0.07217 0.01468 4.915 1.16e-06 ***
## Grad.Rate 7.45308 3.45763 2.156 0.031541 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1040 on 564 degrees of freedom
## Multiple R-squared: 0.9363, Adjusted R-squared: 0.9343
## F-statistic: 487.4 on 17 and 564 DF, p-value: < 2.2e-16
lm.pred=predict(lm.fit, newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1147256
The test error obtained from the linear model fit on all variables is 1147256. The training \(r^2\) for this model is 93.63%, meaning the model explains 93.63% of the variance in Apps on the training data. To see whether the test error can be reduced, I will next fit shrinkage methods, ridge regression and the lasso, on the same predictors.
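For reference, every test error reported in this write-up is the mean squared prediction error on the held-out set:
\[
\text{Test MSE} \;=\; \frac{1}{n_{\text{test}}} \sum_{i \in \text{test}} \big(y_i - \hat{y}_i\big)^2 .
\]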
# build predictor matrices and response vectors for glmnet
# note: train[,-1] drops the first column of the data frame (Private), so the
# shrinkage fits below exclude that predictor; model.matrix() also keeps an
# intercept column of ones, which shows up with a zero coefficient in the lasso output below
xtrain=model.matrix(Apps~., data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~., data=test[,-1])
ytest=test$Apps
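For reference, a more conventional construction, not used in this analysis (switching to it would change the numbers reported below), keeps Private as a dummy variable and drops only the intercept column of the design matrix:
# not run: keep Private and drop only the intercept column
# xtrain <- model.matrix(Apps ~ ., data = train)[, -1]
# xtest  <- model.matrix(Apps ~ ., data = test)[, -1]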
set.seed(1)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
ridge.fit=cv.glmnet(xtrain, ytrain, alpha=0)  # alpha=0 gives the ridge penalty
plot(ridge.fit)
The plot shows the cross-validated MSE as a function of \(\log(\lambda)\) for ridge regression; next I find the \(\lambda\) that minimizes this error.
ridge.lambda=ridge.fit$lambda.min
ridge.lambda
## [1] 423.3014
Having obtained the best \(\lambda\) from cross-validation, I will now predict on the test set using ridge regression at that value.
ridge.pred=predict(ridge.fit, s=ridge.lambda, newx = xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1146729
On this split, ridge regression achieves a slightly lower test error than OLS. As explained in the previous question, ridge regression typically predicts better than OLS when the OLS fit has high variance. One drawback is interpretability: ridge regression keeps all of the variables in the model and merely shrinks their coefficients toward zero. That does not hurt prediction accuracy, but it makes it harder to interpret the relationship between the predictors and the response.
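To see how much shrinkage the chosen \(\lambda\) applies, the ridge coefficients could be extracted the same way the lasso coefficients are extracted further below (a sketch, not run here):
# not run: ridge coefficients at the cross-validated lambda
# ridge.coeff = predict(ridge.fit, type = "coefficients", s = ridge.lambda)[1:18, ]
# ridge.coeff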
set.seed(1)
lasso.fit=cv.glmnet(xtrain, ytrain, alpha=1)  # alpha=1 gives the lasso penalty
plot(lasso.fit)
The lasso validation plot again shows the cross-validated MSE as a function of \(\log(\lambda)\); I will find the \(\lambda\) that minimizes this error and then evaluate the model on the hold-out data.
lasso.lambda=lasso.fit$lambda.min
lasso.lambda
## [1] 3.277469
I will use this \(\lambda\) to predict on the test data with the fitted lasso model.
lasso.pred=predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1152340
On this data the lasso predicts slightly worse than ridge regression.
lasso.coeff=predict(lasso.fit, type="coefficients", s=lasso.lambda)[1:18,]
lasso.coeff
## (Intercept) (Intercept) Accept Enroll Top10perc
## -7.529479e+02 0.000000e+00 1.614586e+00 -7.036607e-01 4.646142e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.230245e+01 3.107013e-02 -1.251723e-02 -1.039411e-01 1.431761e-01
## Books Personal PhD Terminal S.F.Ratio
## -4.900052e-03 5.167434e-02 -3.444074e+00 -4.154387e+00 1.911491e+01
## perc.alumni Expend Grad.Rate
## -3.723708e+00 7.466592e-02 6.279007e+00
As seen above, at this \(\lambda\) the lasso shrinks the coefficients but does not force any of them exactly to zero, so it performs little variable selection here. Although the lasso generally yields simpler, more interpretable models than ridge regression, it did not predict quite as well on this data. Ridge regression’s slight edge is consistent with the response depending on many of the predictors, which is the setting in which ridge tends to outperform the lasso.
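To make the variable-selection point concrete, one could count how many predictors the lasso actually keeps at this \(\lambda\) (a sketch, not run here):
# not run: number of predictors with nonzero lasso coefficients
# sum(lasso.coeff[-(1:2)] != 0)   # drop the two intercept entries shown above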
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit=pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4064 4049 2196 2185 2171 1708 1693
## adjCV 4064 4052 2193 2184 2205 1694 1688
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1694 1708 1597 1594 1600 1602 1611
## adjCV 1689 1704 1590 1590 1595 1598 1606
## 14 comps 15 comps 16 comps 17 comps
## CV 1602 1563 1213 1189
## adjCV 1597 1545 1203 1180
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.144 57.84 64.56 70.34 75.87 80.88 84.37
## Apps 1.528 71.98 72.43 72.46 83.85 84.06 84.13
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.72 90.57 92.97 95.03 96.84 97.96 98.74
## Apps 84.13 85.89 85.98 86.04 86.04 86.06 86.19
## 15 comps 16 comps 17 comps
## X 99.37 99.84 100.00
## Apps 91.01 93.35 93.63
Judging from the validation plot, the CV error levels off and reaches a local minimum around 10 components. The summary bears this out: among fits with up to 14 components, the lowest CV error (1594) occurs at 10 components, which explain 92.97% of the variance in the predictors and 85.98% of the variance in the response, Apps. The error only drops further as the model approaches the full 17-component fit, which is equivalent to least squares. Based on this, I will predict with the 10-component model.
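Rather than reading the component count off the plot, the cross-validated errors can be pulled from the fitted object with the pls accessor `RMSEP()` (a sketch, not run here). Note that the global minimum is at 17 components, which is just least squares, so I still settle on the local minimum at 10.
# not run: cross-validated RMSEP for each number of components
# cv.rmsep = RMSEP(pcr.fit)$val["CV", 1, ]
# which.min(cv.rmsep[-1])   # smallest CV error, ignoring the intercept-only fit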
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1525762
The PCR model does not outperform any of the previous models. One possible reason is the relatively large number of components used: in PCR, adding components lowers bias but raises variance. Even so, using fewer components would not have given a better outcome, since the CV error rises for models below 10 components. Looking at the CV errors themselves, they trace a rough U-shape, and 10 components sits near its bottom (ignoring the 16- and 17-component fits, which essentially reproduce least squares), so 10 was the most sensible choice.
pls.fit=plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4064 1989 1601 1545 1443 1325 1210
## adjCV 4064 1986 1589 1551 1427 1310 1200
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1206 1201 1200 1199 1199 1198 1197
## adjCV 1195 1191 1190 1189 1189 1188 1187
## 14 comps 15 comps 16 comps 17 comps
## CV 1196 1196 1196 1196
## adjCV 1186 1186 1186 1186
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.66 35.76 63.00 65.64 69.53 73.50 76.77
## Apps 77.22 86.50 87.83 91.25 92.84 93.44 93.50
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 79.45 81.88 85.1 88.72 90.87 92.38 93.94
## Apps 93.55 93.59 93.6 93.61 93.62 93.63 93.63
## 15 comps 16 comps 17 comps
## X 95.16 97.12 100.00
## Apps 93.63 93.63 93.63
From the validation plot, the CV error drops steeply up to about 6 or 7 components and is essentially flat after that. At 7 components, 76.77% of the variance in the predictors and 93.50% of the variance in the response are explained, and the CV error is barely different from that of the larger fits. Since the point of PLS is to use as few components as possible, I will use 7 components for prediction.
pls.pred=predict(pls.fit, test, ncomp = 7)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1160474
The PLS model comes very close to matching ridge regression but does not beat it. As discussed in the book, dimension-reduction methods like PLS do not generally outperform ridge regression: by compressing the predictors into a few directions, PLS reduces bias but can increase the variance of the fit.
Overall, the best model for this problem was ridge regression, although, as the barplot below shows, the test errors of the five methods are not far apart.
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err), col="darkslategray1", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
To judge how accurately we can predict the number of college applications received, I will compute a test \(r^2\) for each method, measuring the fraction of the variability of the response on the test data that each model explains.
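Concretely, for each model
\[
r^2_{\text{test}} \;=\; 1 - \frac{\sum_{i \in \text{test}} \big(y_i - \hat{y}_i\big)^2}{\sum_{i \in \text{test}} \big(y_i - \bar{y}_{\text{test}}\big)^2},
\]
where \(\bar{y}_{\text{test}}\) is the mean number of applications in the test set (avg.apps in the code below).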
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lm.r2
## [1] 0.8901095
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2
## [1] 0.89016
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2
## [1] 0.8896226
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2
## [1] 0.8538542
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2
## [1] 0.8888434
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), col="darkslategray2", xlab="Regression Methods", ylab="Test R-Squared", main="Test R-Squared for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
Based on the test \(r^2\) values, these models are reasonably accurate: all of them, with the exception of PCR, explain roughly 89% of the variance in the number of applications on the test set.
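As a compact summary, the test errors and test \(r^2\) values computed above could be collected into a single table (a sketch, not run here):
# not run: side-by-side comparison of the five methods
# results = data.frame(
#   method   = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
#   test.err = c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
#   test.r2  = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2)
# )
# results[order(results$test.err), ]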
We will now try to predict per capita crime rate in the Boston data set.
set.seed(1)
library(MASS)
data(Boston)
attach(Boston)
library(leaps)
# prediction method for regsubsets objects (not provided by the leaps package)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the formula used in the fit
  mat = model.matrix(form, newdata)     # build the design matrix for newdata
  coefi = coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi         # predictions
}
k = 10
p = ncol(Boston) - 1                    # number of predictors (13)
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
  # best subset selection on all folds except fold i
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    # test MSE of the j-variable model on the held-out fold
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))  # CV root-MSE for each model size
plot(rmse.cv, pch = 19, type = "b")
Best subset selection fits the best model of each size from 1 to 13 predictors (one fewer than the number of columns in Boston); next I find the model size that minimizes the cross-validated error.
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(rmse.cv)
## [1] 9
boston.bsm.err=(rmse.cv[which.min(rmse.cv)])^2
boston.bsm.err
## [1] 43.47287
Cross-validation selects the 9-variable model: its CV estimate of the test MSE, 43.47287, is the lowest across the 13 model sizes.
The variables that are included in this model are zn, indus, nox, dis, rad, ptratio, black, lstat, and medv.
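To see the estimated coefficients of the selected model, best subset selection could be refit on the full Boston data and the 9-variable fit extracted (a sketch, not run here):
# not run: coefficients of the 9-variable best subset model refit on all of Boston
# reg.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
# coef(reg.full, id = which.min(rmse.cv))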
Next I will fit a Lasso model.
boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)
The plot above shows the cross-validated MSE as a function of \(\log(\lambda)\). Since the lasso here is fit by cross-validation on the full Boston data set, I will report the CV error at the selected \(\lambda\) rather than a separate hold-out error.
coef(boston.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err
## [1] 54.83663
It is important to keep in mind that the lasso is a variable selection method. At the \(\lambda\) used here the lasso keeps only one variable, rad, and its cross-validated MSE is 54.83663.
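Note that `coef()` on a cv.glmnet object defaults to `lambda.1se`, the largest \(\lambda\) whose CV error is within one standard error of the minimum; the less heavily penalized `lambda.min` fit keeps more variables and could be examined the same way (a sketch, not run here):
# not run: CV error and coefficients at lambda.min for comparison
# boston.lasso$cvm[boston.lasso$lambda == boston.lasso$lambda.min]
# coef(boston.lasso, s = "lambda.min")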
Ridge regression is fit with almost identical code; the only change is the argument alpha = 0.
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)
Unlike the lasso, ridge regression keeps all of the variables but shrinks the coefficients of the less useful ones toward (though not exactly to) zero.
coef(boston.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.017516870
## zn -0.002805664
## indus 0.034405928
## chas -0.225250601
## nox 2.249887495
## rm -0.162546004
## age 0.007343331
## dis -0.114928730
## rad 0.059813843
## tax 0.002659110
## ptratio 0.086423005
## black -0.003342067
## lstat 0.044495212
## medv -0.029124577
boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err
## [1] 55.60604
Compared with the other two methods, ridge regression performs the worst here, with a CV MSE of 55.60604.
Now I will explore the PCR method.
library(pls)
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.170 7.163 6.733 6.728 6.740 6.765
## adjCV 8.61 7.169 7.162 6.730 6.723 6.737 6.760
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.760 6.634 6.652 6.642 6.652 6.607 6.546
## adjCV 6.754 6.628 6.644 6.635 6.643 6.598 6.536
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.45 95.40 97.04 98.46 99.52 100.0
## crim 42.47 42.55 42.78 43.04 44.13 45.4
Based on the CV errors as well as the variances explained, a reasonable PCR model uses only 8 components. With 8 components, the model captures 93.45% of the variance in the predictors and explains 42.47% of the variance in the response. Additionally, the cross-validated MSE at 8 components is 44.38224. Aside from the best subset model, this model predicts fairly well.
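The cross-validated MSE at each number of components can also be read directly from the fitted object using the pls accessor `MSEP()`, instead of squaring the rounded RMSEP values in the table above (a sketch, not run here):
# not run: cross-validated MSE for each number of components
# MSEP(boston.pcr)$val["CV", 1, ]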
As computed above, the model that has the lowest cross-validation error is the one chosen by the best subset selection method. This model had an MSE of 43.47287.
The model chosen by best subset selection includes only 9 variables: zn, indus, nox, dis, rad, ptratio, black, lstat, and medv. Adding back the discarded features would increase the variance of the fit without lowering the cross-validated error, and for this problem the goal is a model with good prediction accuracy, that is, low variance and low test MSE.