Regularization is a set of methods that manage the bias-variance trade-off problem in linear regression.
Whereas OLS estimates the linear model coefficients by minimizing the loss function
\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2\]
ridge regression minimizes the loss function
\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2 + \lambda \sum_{j=1}^k \hat{\beta}_j^2,\]
lasso regression minimizes the loss function
\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2 + \lambda \sum_{j=1}^k \left| \hat{\beta}_j \right|,\]
and elastic net minimizes the loss function
\[L = \frac{\sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2}{2n} + \lambda \frac{1 - \alpha}{2}\sum_{j=1}^k \hat{\beta}_j^2 + \lambda \alpha \sum_{j=1}^k \left| \hat{\beta}_j \right|.\]
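Here alpha is the ridge/lasso mixing parameter and lambda is the overall penalty weight, matching the alpha and lambda arguments of glmnet(). A minimal sketch of that mapping, where x (a numeric predictor matrix) and y (a binary response) are hypothetical stand-ins, not objects from this analysis:

library(glmnet)

# alpha controls the penalty mix; lambda scales the total penalty
fit_ridge <- glmnet(x, y, family = "binomial", alpha = 0)   # pure L2 penalty
fit_lasso <- glmnet(x, y, family = "binomial", alpha = 1)   # pure L1 penalty
fit_elnet <- glmnet(x, y, family = "binomial", alpha = 0.5) # an even mix

# cv.glmnet() selects lambda by cross-validation along the regularization path
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
cv_fit$lambda.min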
The best model was the elastic net. Refit to the full training data set, it had a cross-validated accuracy of 0.8305.
After fitting to the full training data set, the accuracy on the Kaggle testing data set was 0.78468 - about half a percentage point worse than the straight logistic regression's 0.78947.
The initial data management created the data set full, with training rows indexed by train_index. I added more predictor variables in the exploratory analysis and split the data into training and testing, then split training 80:20 into training_80 for training and training_20 for comparing models.
I’ll use 10-fold CV.
train_control <- trainControl(
method = "cv", number = 10,
savePredictions = "final",
classProbs = TRUE
)

I’ll try three models: ridge, lasso, and elastic net.
My model data set variables are PassengerId, Survived, and 13 predictors.
## [1] "PassengerId" "Survived" "Pclass" "Sex"
## [5] "Age" "SibSp" "Parch" "Embarked"
## [9] "TicketN" "FarePerPass" "Employee" "Deck"
## [13] "AgeCohort" "TicketNCohort" "NetSurv"
I’ll use the recipe method to train. From the exploratory analysis section I’ve decided to create interactions Pclass*Sex, Embarked*Sex, TicketN:TicketNCohort, and Age*AgeCohort*Sex. My recipe does not center and scale the predictors because the glmnet() algorithm will do it automatically.
rcpe <- recipe(Survived ~ ., data = training_80[, mdl_vars]) %>%
update_role(PassengerId, new_role = "id variable") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
# model will center/scale - don't do it here
# step_center(Age, SibSp, Parch, TicketN, FarePerPass, NetSurv) %>%
# step_scale(Age, SibSp, Parch, TicketN, FarePerPass, NetSurv) %>%
step_interact(terms = ~
starts_with("Sex_"):starts_with("Pclass_") +
starts_with("Sex_"):starts_with("Embarked_") +
TicketN:starts_with("TicketNCohort_") +
Age:starts_with("AgeCohort_"):starts_with("Sex_"))
prep(rcpe, training = training_80)

## Data Recipe
##
## Inputs:
##
## role #variables
## id variable 1
## outcome 1
## predictor 13
##
## Training data contained 714 data points and no missing data.
##
## Operations:
##
## Dummy variables from Pclass, Sex, Embarked, Employee, Deck, ... [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]
The ridge model had a holdout set accuracy of 0.8249, sensitivity of 0.6324, specificity of 0.9450, and AUC of 0.8768.
Fit the ridge model by setting the tuning parameter alpha = 0 (the ridge/lasso mixing percentage: 0% lasso, 100% ridge). I set up the tuning grid for lambda with a little trial and error. Here is the fit summary.
set.seed(1970)
mdl_ridge <- train(
rcpe,
training_80[, mdl_vars],
method = "glmnet",
family = "binomial",
tuneGrid = expand.grid(
.alpha = 0, # optimize a ridge regression
.lambda = seq(0, 2, length.out = 101)
),
trControl = train_control,
metric = "Accuracy"
)
mdl_ridge$bestTune

## alpha lambda
## 3 0 0.04
The best tuning parameter is lambda = 0.04, so almost no penalty on the coefficient estimator sizes. You can see the relationship between accuracy and lambda in the plot.
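If the plot needs to be regenerated, caret provides a ggplot method for the train object; a minimal sketch:

# cross-validated accuracy across the lambda grid
ggplot(mdl_ridge) +
labs(title = "Ridge: CV Accuracy vs. Lambda")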
varImp() ranks the predictors by the absolute value of the coefficients in the tuned model. The most important variable here was AgeCohort. In the straight logistic regression model, it was NetSurv, followed by Pclass.
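The ranking below presumably comes from a call like this (the original call is not shown):

varImp(mdl_ridge)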
## glmnet variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## AgeCohort_gt10 100.000
## Sex_female 97.300
## Sex_female_x_Pclass_X2 92.823
## Sex_female_x_Embarked_Q 89.908
## NetSurv 68.207
## Pclass_X3 57.600
## TicketNCohort_gt4 54.158
## Employee_X1 47.141
## Sex_female_x_Pclass_X3 39.001
## Pclass_X2 35.388
## Embarked_S 34.401
## Embarked_Q 27.500
## Sex_female_x_Embarked_S 21.611
## Deck_C 17.252
## Deck_D 16.094
## SibSp 13.750
## Deck_B 11.116
## TicketN 9.646
## Deck_E 4.226
## Deck_F 4.147
The accuracy from the confusion matrix is 0.8389.
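This cross-validated confusion matrix comes from calling caret's confusionMatrix() directly on the train object (a sketch; the original call is not shown):

confusionMatrix(mdl_ridge)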
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 57.3 11.8
## Yes 4.3 26.6
##
## Accuracy (average) : 0.8389
Here is the model performance on the holdout data set.
predicted_classes <- predict(mdl_ridge, newdata = training_20)
predicted_probs <- predict(mdl_ridge, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 103 25
## Yes 6 43
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6093
##
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.6324
## Specificity : 0.9450
## Pos Pred Value : 0.8776
## Neg Pred Value : 0.8047
## Prevalence : 0.3842
## Detection Rate : 0.2429
## Detection Prevalence : 0.2768
## Balanced Accuracy : 0.7887
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6324 and the specificity is 0.9450, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8249. precrec::evalmod() calculates the ROC and precision-recall curves and their AUCs from the predicted probabilities on the holdout data set. The ROC AUC on the holdout set is 0.8768.
mdl_ridge_preds <- predict(mdl_ridge, newdata = training_20, type = "prob")
(mdl_ridge_eval <- evalmod(
scores = mdl_ridge_preds$Yes,
labels = training_20$Survived
))

##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 m1 1 ROC 0.8768214
## 2 m1 1 PRC 0.8490097
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 m1 1 109 68
options(yardstick.event_first = FALSE) # set the second level as success
data.frame(
pred = mdl_ridge_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::roc_curve(obs, pred) %>%
autoplot() +
labs(
title = "Ridge Model ROC Curve, Test Data",
subtitle = "AUC = 0.8783"
)The gain curve plots the cumulative summed true outcome versus the fraction of items seen when sorted by the predicted value. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicted the response variable well. The gray area is the “gain” over a random prediction.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered the 68th survivor on the 163rd observation (92%).
The bottom of the gray area is the outcome of a random model: only half the survivors would be observed within 50% of the observations. The top of the gray area is the outcome of the perfect model, the “wizard” curve: all 68 survivors would be observed within the first 68/177 = 38% of the observations.
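These counts can be checked by sorting the holdout predictions directly; a minimal sketch, assuming dplyr is loaded:

data.frame(pred = mdl_ridge_preds$Yes, obs = training_20$Survived) %>%
arrange(desc(pred)) %>%                          # sort by predicted probability
mutate(cum_survivors = cumsum(obs == "Yes")) %>% # running count of survivors
summarise(
obs_to_half = which(cum_survivors >= 34)[1],     # ~37th observation
obs_to_all = which(cum_survivors >= 68)[1]       # ~163rd observation
)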
data.frame(
pred = mdl_ridge_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::gain_curve(obs, pred) %>%
autoplot() +
labs(
title = "Ridge Model Gain Curve on Holdout Set"
)

The lasso model had a holdout set accuracy of 0.8305 - a little better than ridge’s 0.8249 - with sensitivity of 0.6618, specificity of 0.9358, and AUC of 0.8651.
Fit the lasso model by setting the tuning parameter alpha = 1 (the ridge/lasso mixing percentage: 100% lasso). I set up the tuning grid for lambda with a little trial and error. Here is the fit summary.
set.seed(1970)
mdl_lasso <- train(
rcpe,
training_80[, mdl_vars],
method = "glmnet",
family = "binomial",
tuneGrid = expand.grid(
.alpha = 1, # optimize a lasso regression
.lambda = seq(0, 1, length.out = 101)
),
trControl = train_control,
metric = "Accuracy"
)
mdl_lasso$bestTune

## alpha lambda
## 3 1 0.02
The best tuning parameter is lambda = 0.02, so again almost no penalty on the coefficient estimator sizes. You can see the relationship between accuracy and lambda in the plot.
The most important variable was Sex, followed by NetSurv; relative importance drops quickly after that. In the ridge model, AgeCohort ranked first, then Sex. Lasso looks more like the straight logistic regression model.
## glmnet variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## Sex_female 100.0000
## NetSurv 61.5683
## AgeCohort_gt10 52.8553
## Sex_female_x_Pclass_X2 38.7697
## Pclass_X3 38.1470
## TicketNCohort_gt4 16.3332
## Embarked_S 14.2456
## Sex_female_x_Pclass_X3 13.5522
## Sex_female_x_Embarked_Q 11.2780
## Sex_female_x_Age_x_AgeCohort_gt10 2.0565
## SibSp 1.1662
## FarePerPass 0.9601
## Age 0.8307
## Deck_E 0.0000
## Deck_C 0.0000
## Pclass_X2 0.0000
## Sex_female_x_Embarked_S 0.0000
## Employee_X1 0.0000
## Deck_B 0.0000
## Deck_F 0.0000
The accuracy from the confusion matrix is 0.8375 - a little worse than ridge’s 0.8389.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 56.2 10.8
## Yes 5.5 27.6
##
## Accuracy (average) : 0.8375
Here is the model performance on the holdout data set.
predicted_classes <- predict(mdl_lasso, newdata = training_20)
predicted_probs <- predict(mdl_lasso, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 102 23
## Yes 7 45
##
## Accuracy : 0.8305
## 95% CI : (0.767, 0.8826)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 4.325e-10
##
## Kappa : 0.6252
##
## Mcnemar's Test P-Value : 0.00617
##
## Sensitivity : 0.6618
## Specificity : 0.9358
## Pos Pred Value : 0.8654
## Neg Pred Value : 0.8160
## Prevalence : 0.3842
## Detection Rate : 0.2542
## Detection Prevalence : 0.2938
## Balanced Accuracy : 0.7988
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6618 and the specificity is 0.9358. The accuracy is 0.8305 - a little better than ridge’s 0.8249. The AUC on the holdout set is 0.8651.
mdl_lasso_preds <- predict(mdl_lasso, newdata = training_20, type = "prob")
(mdl_lasso_eval <- evalmod(
scores = mdl_lasso_preds$Yes,
labels = training_20$Survived
))

##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 m1 1 ROC 0.8651511
## 2 m1 1 PRC 0.8394165
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 m1 1 109 68
options(yardstick.event_first = FALSE) # set the second level as success
data.frame(
pred = mdl_lasso_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::roc_curve(obs, pred) %>%
autoplot() +
labs(
title = "Lasso Model ROC Curve, Test Data",
subtitle = "AUC = 0.8522"
)Here is the gain curve.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered the 68th survivor on the 159th observation (89%).
data.frame(
pred = mdl_lasso_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::gain_curve(obs, pred) %>%
autoplot() +
labs(
title = "Lasso Model Gain Curve on Holdout Set"
)

The elastic net model had a holdout set accuracy of 0.8249 - the same as ridge - with sensitivity of 0.6324, specificity of 0.9450, and AUC of 0.8783.
Fit the elastic net model by tuning both lambda and alpha. I set up the tuning grid by trial and error. Here is the fit summary.
set.seed(1970)
mdl_elnet <- train(
rcpe,
training_80[, mdl_vars],
method = "glmnet",
family = "binomial",
tuneGrid = expand.grid(
.alpha = seq(0, 1, length.out = 11), # optimize an elastic net regression
.lambda = seq(0, 1, length.out = 101)
),
trControl = train_control,
metric = "Accuracy"
)
mdl_elnet$bestTune

## alpha lambda
## 4 0 0.03
The best tuning parameter is alpha = 0 and lambda = 0.03, so no mixing at all (100% ridge) and almost no penalty on the coefficient estimator sizes.
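The penalized coefficients at the selected lambda can be pulled from the underlying glmnet fit; a minimal sketch:

# coefficients at the cross-validated best lambda
coef(mdl_elnet$finalModel, s = mdl_elnet$bestTune$lambda)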
The level plot below may be a better way of looking at the tuning results. Accuracy is maximized with very little regularization, and as regularization increases, the ridge end of the mixing range holds up better than the lasso end.
glmnPlot <- plot(mdl_elnet,
plotType = "level",
cuts = 15,
scales = list(x = list(rot = 90, cex = .65)))
update(glmnPlot,
ylab = "Mixing Percentage\nRidge <---------> Lasso",
sub = "",
main = "Area Under the ROC Curve",
xlab = "Amount of Regularization")The most important variable was AgeCohort again.
## glmnet variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## AgeCohort_gt10 100.000
## Sex_female 94.977
## Sex_female_x_Embarked_Q 88.167
## Sex_female_x_Pclass_X2 87.773
## NetSurv 65.668
## Pclass_X3 56.825
## TicketNCohort_gt4 56.814
## Employee_X1 45.839
## Sex_female_x_Pclass_X3 43.680
## Pclass_X2 40.108
## Embarked_S 31.662
## Embarked_Q 29.810
## Deck_C 18.212
## Sex_female_x_Embarked_S 16.876
## Deck_D 15.659
## SibSp 14.677
## TicketN 10.504
## Deck_B 9.586
## Deck_E 4.770
## Deck_F 3.770
The accuracy from the confusion matrix is 0.8403 - a little better than ridge.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 57.3 11.6
## Yes 4.3 26.8
##
## Accuracy (average) : 0.8403
Here is the model performance on the holdout data set.
predicted_classes <- predict(mdl_elnet, newdata = training_20)
predicted_probs <- predict(mdl_elnet, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 103 25
## Yes 6 43
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6093
##
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.6324
## Specificity : 0.9450
## Pos Pred Value : 0.8776
## Neg Pred Value : 0.8047
## Prevalence : 0.3842
## Detection Rate : 0.2429
## Detection Prevalence : 0.2768
## Balanced Accuracy : 0.7887
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6324 and the specificity is 0.9450. The accuracy is 0.8249 - same as ridge. The AUC on the holdout set is 0.8783.
mdl_elnet_preds <- predict(mdl_elnet, newdata = training_20, type = "prob")
(mdl_elnet_eval <- evalmod(
scores = mdl_elnet_preds$Yes,
labels = training_20$Survived
))

##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 m1 1 ROC 0.8783055
## 2 m1 1 PRC 0.8510103
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 m1 1 109 68
options(yardstick.event_first = FALSE) # set the second level as success
data.frame(
pred = mdl_elnet_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::roc_curve(obs, pred) %>%
autoplot() +
labs(
title = "Elastic Net Model ROC Curve, Test Data",
subtitle = "AUC = 0.8733"
)Here is the gain curve.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered the 68th survivor on the 163rd observation (92%).
data.frame(
pred = mdl_elnet_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::gain_curve(obs, pred) %>%
autoplot() +
labs(
title = "Elastic Net Model Gain Curve on Holdout Set"
)

Compare the models with evalmod().
scores_list <- join_scores(
predict(mdl_ridge, newdata = training_20, type = "prob")$Yes,
predict(mdl_lasso, newdata = training_20, type = "prob")$Yes,
predict(mdl_elnet, newdata = training_20, type = "prob")$Yes
)
labels_list <- join_labels(
training_20$Survived,
training_20$Survived,
training_20$Survived
)
pe <- evalmod(
scores = scores_list,
labels = labels_list,
modnames = c("Ridge", "Lasso", "Elastic Net"),
posclass = "Yes")
autoplot(pe, "ROC")##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 Ridge 1 ROC 0.8768214
## 2 Ridge 1 PRC 0.8490097
## 3 Lasso 1 ROC 0.8651511
## 4 Lasso 1 PRC 0.8394165
## 5 Elastic Net 1 ROC 0.8783055
## 6 Elastic Net 1 PRC 0.8510103
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 Ridge 1 109 68
## 2 Lasso 1 109 68
## 3 Elastic Net 1 109 68
The highest AUC was with elastic net. Elastic net also had the best median accuracy.
resamps <- resamples(list('Ridge' = mdl_ridge,
'Lasso' = mdl_lasso,
'Elastic Net' = mdl_elnet))
summary(resamps)

##
## Call:
## summary.resamples(object = resamps)
##
## Models: Ridge, Lasso, Elastic Net
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Ridge 0.8028169 0.8175372 0.8321596 0.8390063 0.8561718 0.9014085 0
## Lasso 0.7746479 0.8175372 0.8380282 0.8375587 0.8561718 0.9154930 0
## Elastic Net 0.8028169 0.8175372 0.8391041 0.8403756 0.8591549 0.9014085 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Ridge 0.556250 0.5937978 0.6360042 0.6454318 0.6917002 0.7830642 0
## Lasso 0.507799 0.5937978 0.6437060 0.6460703 0.6941293 0.8181042 0
## Elastic Net 0.556250 0.5937978 0.6385230 0.6482558 0.7000939 0.7830642 0
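caret’s lattice methods plot the resampling distributions, which makes this comparison easier to see; a minimal sketch:

# box-and-whisker plot of accuracy across the 10 CV folds
bwplot(resamps, metric = "Accuracy")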
I’ll do a final fit of the elastic net model to the entire training set and use it to predict on testing.
Here is the fit summary.
set.seed(1970)
mdl_final <- train(
rcpe,
training[, mdl_vars],
method = "glmnet",
family = "binomial",
tuneGrid = expand.grid(
.alpha = 0.0,
.lambda = 0.03
),
trControl = train_control,
metric = "Accuracy"
)
mdl_final$bestTune

## alpha lambda
## 1 0 0.03
AgeCohort continues to be the most important predictor.
## glmnet variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## AgeCohort_gt10 100.000
## Sex_female 97.260
## Sex_female_x_Embarked_Q 96.029
## Sex_female_x_Pclass_X2 87.733
## NetSurv 67.090
## Pclass_X3 65.482
## TicketNCohort_gt4 59.663
## Employee_X1 52.858
## Pclass_X2 48.294
## Sex_female_x_Pclass_X3 40.090
## Embarked_Q 35.460
## Deck_C 25.340
## Embarked_S 24.992
## Sex_female_x_Embarked_S 15.785
## SibSp 9.609
## Deck_E 7.486
## Deck_B 7.096
## TicketN 6.968
## Parch 4.726
## Deck_D 3.875
The accuracy from the confusion matrix below is 0.8305.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 56.1 11.4
## Yes 5.5 26.9
##
## Accuracy (average) : 0.8305
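From here, the Kaggle submission would come from predicting on testing; a minimal sketch (the output file name is hypothetical):

final_preds <- predict(mdl_final, newdata = testing)
data.frame(
PassengerId = testing$PassengerId,
Survived = ifelse(final_preds == "Yes", 1, 0) # Kaggle expects 0/1
) %>%
write.csv("elnet_submission.csv", row.names = FALSE)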