The lasso, relative to least squares, is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Ridge regression, relative to least squares, is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Non-linear methods, relative to least squares, are more flexible and will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Split the data set into a training set and a test set.
library(ISLR)   # College data set
library(glmnet) # ridge regression and the lasso
library(pls)    # PCR and PLS
dat = College
set.seed(123)
smp_sze = floor(.70*nrow(dat))
samp = sample(seq_len(nrow(dat)), size = smp_sze)
train = dat[samp, ]
test = dat[-samp, ]
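As a quick sanity check on the 70/30 split (a small sketch; the expected counts assume the full 777-row College data set):
dim(train) # expect 543 rows (70%)
dim(test)  # expect 234 rows (30%)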
Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps ~ ., train)
lm.pred = predict(lm.fit, test)
mse = mean((lm.pred - test$Apps)^2)
mse
## [1] 1734841
The test MSE is 1.734841 × 10^6.
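For a sense of scale, the square root of the test MSE puts the error in the original units:
sqrt(mse) # RMSE of roughly 1,317 applications per college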
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
set.seed(123) # set seed for reproducibility
train.mat = model.matrix(Apps ~., train) # glmnet needs a numeric design matrix; model.matrix keeps an intercept column, hence the duplicate "(Intercept)" row in the coefficient printouts below
test.mat = model.matrix(Apps ~., test)
grid = 10^seq(10, -2, length = 100) # lambda grid spanning the null model (huge lambda) down to essentially the least squares fit
ridge.fit = cv.glmnet(train.mat, train$Apps, alpha = 0, lambda = grid) # 10-fold cross-validation to find the best lambda
bestlam.ridge = ridge.fit$lambda.min
ridge.pred = predict(ridge.fit, s = bestlam.ridge, newx = test.mat)
mse = mean((ridge.pred - test$Apps)^2)
The test MSE increased relative to the least squares model and is now 1.939282 × 10^6.
out = glmnet(train.mat, train$Apps, alpha = 0)
predict(out, type="coefficients", s = bestlam.ridge) # ridge regression does not perform variable selection
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.108384e+03
## (Intercept) .
## PrivateYes -6.344812e+02
## Accept 7.799468e-01
## Enroll 7.025220e-01
## Top10perc 2.888869e+01
## Top25perc -2.480330e+00
## F.Undergrad 1.000467e-01
## P.Undergrad 7.509154e-03
## Outstate -1.062903e-02
## Room.Board 2.160271e-01
## Books 2.845690e-01
## Personal -6.359640e-02
## PhD -2.824524e+00
## Terminal -5.297962e+00
## S.F.Ratio -2.570072e+00
## perc.alumni -1.259409e+01
## Expend 7.644084e-02
## Grad.Rate 1.090898e+01
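To visualize how ridge regression shrinks every coefficient toward zero without setting any exactly to zero, we can plot the coefficient paths using glmnet's built-in plot method (a quick sketch, reusing the out object fitted above):
plot(out, xvar = "lambda", label = TRUE) # each curve is one coefficient as a function of log(lambda)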
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(123)
lasso.fit = glmnet(train.mat, train$Apps, alpha = 1, lambda = grid)
cv.lasso = cv.glmnet(train.mat, train$Apps, alpha = 1, lambda = grid)
bestlam.lasso = cv.lasso$lambda.min
bestlam.lasso
## [1] 6.135907
lasso.pred = predict(lasso.fit, s = bestlam.lasso, newx = test.mat)
mse = mean((lasso.pred-test$Apps)^2)
The test MSE decreased relative to both the least squares and ridge models; at 1.729545 × 10^6, this is the lowest test MSE so far.
predict(lasso.fit, type="coefficients", s = bestlam.lasso)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -437.25898480
## (Intercept) .
## PrivateYes -660.98080564
## Accept 1.21931066
## Enroll 0.08323063
## Top10perc 45.77913227
## Top25perc -13.13933555
## F.Undergrad 0.02274101
## P.Undergrad 0.02964089
## Outstate -0.04845178
## Room.Board 0.17834817
## Books 0.16979458
## Personal -0.01898038
## PhD -5.30560631
## Terminal -4.85092509
## S.F.Ratio .
## perc.alumni -7.57316944
## Expend 0.07559669
## Grad.Rate 9.69692642
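Since the question asks for the number of non-zero coefficient estimates, we can count them directly instead of reading them off the printout (a sketch; the first two rows of the sparse matrix are the intercept entries):
lasso.coef = predict(lasso.fit, type = "coefficients", s = bestlam.lasso)
sum(lasso.coef[-(1:2), ] != 0) # 16 of the 17 predictors are non-zero; only S.F.Ratio is dropped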
Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(123)
pcr.fit = pcr(Apps ~., data = train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, test, ncomp = 17)
mean((pcr.pred - test$Apps)^2)
## [1] 1734841
The value of M selected by cross-validation is 17, which gives the lowest test MSE at 1.734841 × 10^6. With all 17 components retained, PCR performs no dimension reduction and is equivalent to the least squares fit, which explains the identical test MSE.
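Rather than reading M off the validation plot, the CV-minimizing number of components can be extracted programmatically (a sketch assuming the structure of the pls package's MSEP object, whose first entry is the zero-component intercept-only model):
cv.errors = MSEP(pcr.fit)$val["CV", 1, ]
which.min(cv.errors) - 1 # subtract 1 because indexing starts at the 0-component model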
Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps ~., data = train, scale = T, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, test, ncomp = 17)
mean((pls.pred - test$Apps)^2)
## [1] 1734841
Using all 17 components, as in the PCR model, the PLS fit also reduces to the least squares solution, so its test MSE of 1.734841 × 10^6 matches the least squares model exactly.
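If a more parsimonious model were desired, the pls package provides selectNcomp() as a principled way to choose M (a sketch; the "onesigma" heuristic picks the smallest M whose CV error is within one standard error of the minimum):
selectNcomp(pls.fit, method = "onesigma", plot = TRUE)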
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
avg.actual = mean(test$Apps)
lm.r2 = 1-mean((lm.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # least squares model
ridge.r2 = 1-mean((ridge.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # ridge model
lasso.r2 = 1-mean((lasso.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # lasso model
pcr.r2 = 1-mean((pcr.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # pcr model
pls.r2 = 1-mean((pls.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # pls model
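Collecting the five values into one named vector makes the comparison easier to read (a quick summary of the quantities just computed):
round(c(lm = lm.r2, ridge = ridge.r2, lasso = lasso.r2, pcr = pcr.r2, pls = pls.r2), 4)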
To compare the models we can calculate R^2, i.e. one minus the residual sum of squares divided by the total sum of squares. The lasso model explained the most variance, at around 92.43%. The ridge model explained the least, at about 91%. The least squares, PCR, and PLS models explained essentially identical amounts of variance, about 92.40%, which is expected because the PCR and PLS fits with all 17 components reduce to the least squares fit. All five approaches predict the number of applications with fairly high accuracy, and there is little difference among their test errors; the lasso performs best and ridge worst of the five.
We will now try to predict per capita crime rate in the Boston data set.
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)    # Boston data set
library(leaps)   # best subset selection (regsubsets)
library(caret)   # unified cross-validation interface (train)
library(dplyr)   # data manipulation pipelines
library(ggplot2) # plotting
b = Boston
Here we use caret to get a consistent model object and resampling scheme across all of the models below.
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10) # 10-fold CV, repeated 10 times
Best Subset Model
model = regsubsets(crim ~.,
data = b,
nvmax = ncol(b) - 1,
method = "exhaustive")
# repeated cross-validation to compare the best model of each subset size
CV_MSE = c()
set.seed(123)
for (i in 1:(ncol(b)-1)) {
b_temp = b[ ,c("crim", names(coef(model, id=i)[-1]))]
model_temp = train(crim ~.,
data = b_temp,
method = "lm",
trControl = ctrl)
CV_MSE[i] = model_temp$results$RMSE # caret reports RMSE; despite its name, CV_MSE stores RMSE values
}
Plotting Best Subset
data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>%
mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
ggplot(aes(x = subset_size, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Best Subset Selection",
subtitle = "Selecting parameter subset size with cross-validation",
x = "Subset Size",
y = "CV MSE")
Minimum CV RMSE:
min(CV_MSE)
## [1] 5.777091
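The subset size that achieves this minimum can be read off programmatically:
which.min(CV_MSE) # the 9-predictor model, matching the final model reported below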
Lasso Model
set.seed(123)
model_lasso = train(crim ~.,
data = b,
method = "glmnet",
preProcess = c("center", "scale"),
metric = "RMSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0, 1, length = 100)))
Plotting Lasso Model
mlr = model_lasso$results %>%
rename(CV_MSE = RMSE) %>%
mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>%
as.data.frame()
ggplot(mlr, aes(x = lambda, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Lasso Regression",
subtitle = "Selecting shrinkage parameter with cross-validation",
y = "CV MSE")
Minimum CV RMSE:
min(mlr$CV_MSE)
## [1] 5.831888
Ridge Model
set.seed(123)
model_ridge = train(crim~.,
data = b,
method = "glmnet",
preProcess = c("center", "scale"),
metric = "RMSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 1, length = 100)))
mrr = model_ridge$results %>%
rename(CV_MSE = RMSE) %>%
mutate(min_CV_MSE = as.numeric(lambda == model_ridge$bestTune$lambda)) %>%
as.data.frame()
Plotting Ridge Model
ggplot(mrr, aes(x = lambda, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Ridge Regression",
subtitle = "Selecting shrinkage parameter with cross-validation",
y = "CV MSE")
Minimum CV RMSE:
min(mrr$CV_MSE)
## [1] 6.376497
Principal Components Regression Model
set.seed(123)
model_pcr = train(crim~.,
data = b,
method = "pcr",
preProcess = c("center", "scale"),
metric = "RMSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))
mpr = model_pcr$results %>%
rename(CV_MSE = RMSE) %>%
mutate(min_CV_MSE = as.numeric(ncomp == model_pcr$bestTune$ncomp)) %>%
as.data.frame()
Plotting PCR Model
ggplot(mpr,aes(x = ncomp, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Principal Components Regression",
subtitle = "Selecting number of principal components with cross-validation",
x = "Principal Components",
y = "CV MSE")
Minimum CV RMSE:
min(mpr$CV_MSE)
## [1] 5.869671
Partial Least Squares Model
set.seed(123)
model_pls = train(crim~.,
data = b,
method = "pls",
preProcess = c("center", "scale"),
metric = "RMSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))
mplr = model_pls$results %>%
rename(CV_MSE = RMSE) %>%
mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>%
as.data.frame()
Plotting PLS Model
ggplot(mplr, aes(x = ncomp, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Partial Least Squares",
subtitle = "Selecting number of principal components with cross-validation",
x = "Principal Components",
y = "CV MSE")
Minimum CV RMSE:
min(mplr$CV_MSE)
## [1] 5.869343
Q: Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
A:
Using cross-validated RMSE as the standard, the best performing model is the one chosen by best subset selection, with RMSE = 5.777091. Since this metric identifies the model with the least prediction error, best subset selection is the best choice among these approaches.
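For reference, the minimum cross-validated RMSE of each approach can be gathered and sorted in one step (a summary sketch reusing the result objects from the sections above):
sort(c(best_subset = min(CV_MSE),
       lasso = min(mlr$CV_MSE),
       pcr = min(mpr$CV_MSE),
       pls = min(mplr$CV_MSE),
       ridge = min(mrr$CV_MSE)))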
Q: Does your chosen model involve all of the features in the data set? Why or why not?
A:
Our final model does not include all of the features. It uses only 9 of the 13 predictors, as shown below:
coef(model, id=9)
## (Intercept) zn indus nox dis
## 19.124636156 0.042788127 -0.099385948 -10.466490364 -1.002597606
## rad ptratio black lstat medv
## 0.539503547 -0.270835584 -0.008003761 0.117805932 -0.180593877
Including any additional features would increase the cross-validated RMSE above this optimal value.
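For completeness, the chosen model can be refit directly with lm() on the nine selected predictors (a sketch; the formula simply lists the variables from the coefficient output above):
final.fit = lm(crim ~ zn + indus + nox + dis + rad + ptratio + black + lstat + medv, data = b)
summary(final.fit) # coefficient estimates should match coef(model, id = 9)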