Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. The algorithm constructs B regression trees using B bootstrapped training sets, and averages the resulting predictions. These trees are grown deep, and are not pruned. Hence each individual tree has high variance, but low bias. Averaging the B trees reduces the variance. The predicted value for an observation is the mode (classification) or mean (regression) of the trees. B usually equals ~25.

Random forests improve bagged trees by way of a small tweak that de-correlates the trees. As in bagging, the algorithm builds a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of mtry predictors is chosen as split candidates from the full set of p predictors. A fresh sample of mtry predictors is taken at each split. Typically \(mtry \sim \sqrt{p}\). Bagged trees are thus a special case of random forests where mtry = p.

Gradient boosting machine (GBM) is an additive modeling algorithm that gradually builds a composite model by iteratively adding M weak sub-models based on the performance of the prior iteration’s composite. The idea is to fit a weak model, then replace the response values with the residuals from that model, and fit another model. Adding the residual prediction model to the original response prediction model produces a more accurate model. GBM repeats this process over and over, running new models to predict the residuals of the previous composite models, and adding the results to produce new composites. With each iteration, the model becomes stronger and stronger.

Extreme Gradient Boosting (XGBoost) is a boosting algorithm based on GBM. XGboost applies regularization to reduce overfitting. I don’t know anything about how it works, but it does well in Kaggle competitions, and you have to learn it somehow, so I’ll give it a try.

The best model was XGBoost. The model run on the full training data set had a holdout set accuracy of 0.8644.

After fitting to the full training data set, the performance on the testing data set on kaggle was 0.79186 accuracy.

Setup

library(tidyverse)
library(caret)
library(recipes)
## Warning: package 'recipes' was built under R version 4.0.2
library(plotROC)
library(precrec)

The initial data management created the data set full, with training rows indexed by train_index. I added another predictor variables in the exploratory analysis and split the data into training and testing, then 80:20 split training into training_80 for training and training_20 to compare models.

load("./titanic_02.RData")

I’ll use 10-fold CV.

train_control <- trainControl(
  method = "cv", number = 10,
  savePredictions = "final",
  summaryFunction = twoClassSummary,
  classProbs = TRUE
)

I’ll try four models: bagging, random forests, gbm, and xgboost.

Model

My model data set variables are PassengerID, Survived, and 13 predictors.

mdl_vars <- c("PassengerId", mdl_vars)
mdl_vars
##  [1] "PassengerId"   "Survived"      "Pclass"        "Sex"          
##  [5] "Age"           "SibSp"         "Parch"         "Embarked"     
##  [9] "TicketN"       "FarePerPass"   "Employee"      "Deck"         
## [13] "AgeCohort"     "TicketNCohort" "NetSurv"

I’ll use the recipe method to train. From the exploratory analysis section I’ve decided to create interactions Pclass*Sex, Embarked*Sex, TicketN:TicketNCohort, and Age*AgeCohort*Sex.

rcpe <- recipe(Survived ~ ., data = training_80[, mdl_vars]) %>%
  update_role(PassengerId, new_role = "id variable") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # I don't think centering/scaling helps with tree models
  # step_center(Age, SibSp, Parch, TicketN, FarePerPass, NetSurv) %>%
  # step_scale(Age, SibSp, Parch, TicketN, FarePerPass, NetSurv) %>%
  step_interact(terms = ~
                  starts_with("Sex_"):starts_with("Pclass_") +
                  starts_with("Sex_"):starts_with("Embarked_") +
                  TicketN:starts_with("TicketNCohort_") +
                  Age:starts_with("AgeCohort_"):starts_with("Sex_")) 

prep(rcpe, training = training_80)
## Data Recipe
## 
## Inputs:
## 
##         role #variables
##  id variable          1
##      outcome          1
##    predictor         13
## 
## Training data contained 714 data points and no missing data.
## 
## Operations:
## 
## Dummy variables from Pclass, Sex, Embarked, Employee, Deck, ... [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]

Bagging

The bagging model had a holdout set accuracy of 0.8531, sensitivity of 0.7206, specificity of 0.9358, and AUC of 0.8831.

Caret has no hyperparameters to tune with this model. Here is the fit summary.

set.seed(1970)
mdl_bag <- train(
  rcpe,
  training_80[, mdl_vars],
  method = "treebag",
  trControl = train_control,
  metric = "ROC"
)
mdl_bag
## Bagged CART 
## 
## 714 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8848785  0.8863636  0.7664021

varImp() ranks the predictors by the sum of the reduction in the loss function attributed to each variable at each split. The most important variable here was FarePerPass. In the straight logistic regression model, it was NetSurv, followed by Pclass.

plot(varImp(mdl_bag), main = "Boosting Variable Importance")

Resampling Performance

The accuracy from the confusion matrix is 0.8403.

confusionMatrix(mdl_bag)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  54.6  9.0
##        Yes  7.0 29.4
##                             
##  Accuracy (average) : 0.8403

Holdout Performance

Here is the model performance on the holdout data set.

preds_bag <- bind_cols(
  predict(mdl_bag, newdata = training_20, type = "prob"),
  Predicted = predict(mdl_bag, newdata = training_20, type = "raw"),
  Actual = training_20$Survived
)

confusionMatrix(preds_bag$Predicted, reference = preds_bag$Actual, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  102  19
##        Yes   7  49
##                                           
##                Accuracy : 0.8531          
##                  95% CI : (0.7922, 0.9017)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 3.506e-12       
##                                           
##                   Kappa : 0.6789          
##                                           
##  Mcnemar's Test P-Value : 0.03098         
##                                           
##             Sensitivity : 0.7206          
##             Specificity : 0.9358          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 0.8430          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2768          
##    Detection Prevalence : 0.3164          
##       Balanced Accuracy : 0.8282          
##                                           
##        'Positive' Class : Yes             
## 

The sensitivity is 0.7206 and the specificity is 0.9358, so the model is more prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8531. precrec::evalmod() will calculate the confusion matrix values from the model using the holdout data set. The AUC on the holdout set is 0.8831.

mdl_auc <- Metrics::auc(actual = preds_bag$Actual == "Yes", preds_bag$Yes)
yardstick::roc_curve(preds_bag, Actual, Yes) %>%
  autoplot() +
  labs(
    title = "Bagging Model ROC Curve, Test Data",
    subtitle = paste0("AUC = ", round(mdl_auc, 4))
  )

The gain curve plots the cumulative summed true outcome versus the fraction of items seen when sorted by the predicted value. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicted the response variable well. The gray area is the “gain” over a random prediction.

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).

  • It encountered the 68th survivor on the 177th observation (100%).

options(yardstick.event_first = FALSE)  # set the second level as success
yardstick::gain_curve(preds_bag, Actual, Yes) %>%
  autoplot() +
  labs(title = "Bagging Model Gain Curve on Test Data")

Random Forest

The random forest model had a holdout set accuracy of 0.8475 - a little worse than bagging’s 0.8531, sensitivity of 0.6765, specificity of 0.9541, and AUC of 0.8732 - worse than bagging’s 0.8831.

The sensitivity is (worse than bagging) and the specificity is (better than bagging). The accuracy is - (worse than bagging). The AUC on the holdout set is .

Each time a split in a tree is considered, a random forest model takes a random sample of mtry predictors as split candidates from the full set of p predictors. Hyperparameter mtry can take any value from 1 to 13 (the number of predictors) and I expect the best value to be near \(\sqrt{13} \sim 4\). Here is the fit summary.

set.seed(1970)
mdl_rf <- train(
  rcpe,
  training_80[, mdl_vars],
  method = "rf",
  metric = "ROC",
  tuneGrid = expand.grid(mtry = 1:13), # searching around mtry=4,
  trControl = train_control
)
## Loading required namespace: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
mdl_rf$bestTune
##   mtry
## 7    7

The best tuning parameter is mtry = 7. You can see the relationship between AUC and mtry in the plot.

ggplot(mdl_rf) +
  labs(title = "Random Forest Parameter Tuning", x = "mtry")

The most important variable was the interaction Sex:Age:AgeCohort and FarePerPass second. Bagging had the same top two, but with flipped order.

ggplot(varImp(mdl_rf))

Resampling Performance

The accuracy from the confusion matrix is 0.8487 - a little better than bagging’s 0.8403.

confusionMatrix(mdl_rf)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  56.6 10.1
##        Yes  5.0 28.3
##                             
##  Accuracy (average) : 0.8487

Holdout Performance

Here is the model performance on the holdout data set.

preds_rf <- bind_cols(
  predict(mdl_rf, newdata = training_20, type = "prob"),
  Predicted = predict(mdl_rf, newdata = training_20, type = "raw"),
  Actual = training_20$Survived
)

confusionMatrix(preds_rf$Predicted, reference = preds_rf$Actual, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  104  22
##        Yes   5  46
##                                          
##                Accuracy : 0.8475         
##                  95% CI : (0.7859, 0.897)
##     No Information Rate : 0.6158         
##     P-Value [Acc > NIR] : 1.244e-11      
##                                          
##                   Kappa : 0.6617         
##                                          
##  Mcnemar's Test P-Value : 0.002076       
##                                          
##             Sensitivity : 0.6765         
##             Specificity : 0.9541         
##          Pos Pred Value : 0.9020         
##          Neg Pred Value : 0.8254         
##              Prevalence : 0.3842         
##          Detection Rate : 0.2599         
##    Detection Prevalence : 0.2881         
##       Balanced Accuracy : 0.8153         
##                                          
##        'Positive' Class : Yes            
## 

The sensitivity is 0.6765 (worse than bagging) and the specificity is 0.9541 (better than bagging). The accuracy is 0.8475 - (worse than bagging). The AUC on the holdout set is 0.8732.

mdl_auc <- Metrics::auc(actual = preds_rf$Actual == "Yes", preds_rf$Yes)
yardstick::roc_curve(preds_rf, Actual, Yes) %>%
  autoplot() +
  labs(
    title = "Random Forest Model ROC Curve, Test Data",
    subtitle = paste0("AUC = ", round(mdl_auc, 4))
  )

Here is the gain curve.

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 36 observations (21%).

  • It encountered the 68th survivor on the 177th observation (100%).

options(yardstick.event_first = FALSE)  # set the second level as success
yardstick::gain_curve(preds_rf, Actual, Yes) %>%
  autoplot() +
  labs(title = "Random Forest Model Gain Curve on Test Data")

Gradient Boosting

The GBM model had a holdout set accuracy of 0.8475 - tied with random forest but worse than bagging (0.8531), sensitivity of 0.6765, specificity of 0.9541, and AUC of 0.8772.

gbm has the following tuneable hyperparameters (see modelLookup("gbm")).

set.seed(1970)
garbage <- capture.output(
mdl_gbm <- train(
  rcpe,
  training_80[, mdl_vars],
  method = "gbm",
  metric = "ROC",
  tuneLength = 5,
  trControl = train_control
))
## Loading required namespace: gbm
## Loaded gbm 2.1.5
mdl_gbm
## Stochastic Gradient Boosting 
## 
## 714 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.8760853  0.8681818  0.7300265
##   1                  100      0.8838143  0.8886364  0.7300265
##   1                  150      0.8871708  0.9022727  0.7263228
##   1                  200      0.8870265  0.8977273  0.7046296
##   1                  250      0.8846305  0.9000000  0.6936508
##   2                   50      0.8898268  0.9113636  0.7148148
##   2                  100      0.8873256  0.9136364  0.7082011
##   2                  150      0.8884695  0.8909091  0.7152116
##   2                  200      0.8921702  0.8954545  0.7226190
##   2                  250      0.8882651  0.9113636  0.7301587
##   3                   50      0.8902026  0.8931818  0.7006614
##   3                  100      0.8867995  0.9068182  0.7191799
##   3                  150      0.8885567  0.9045455  0.7228836
##   3                  200      0.8877811  0.9090909  0.7298942
##   3                  250      0.8817054  0.9068182  0.7227513
##   4                   50      0.8873076  0.9022727  0.6862434
##   4                  100      0.8835934  0.8977273  0.7265873
##   4                  150      0.8844321  0.9000000  0.7374339
##   4                  200      0.8832296  0.9045455  0.7519841
##   4                  250      0.8824330  0.9000000  0.7412698
##   5                   50      0.8864313  0.8931818  0.7005291
##   5                  100      0.8828598  0.8977273  0.7373016
##   5                  150      0.8838549  0.8886364  0.7519841
##   5                  200      0.8857368  0.8909091  0.7482804
##   5                  250      0.8830102  0.8931818  0.7447090
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.
mdl_gbm$bestTune
##   n.trees interaction.depth shrinkage n.minobsinnode
## 9     200                 2       0.1             10

train() held constant shrinkage = 0.1 () and n.minobsinnode = 10, and tuned for optimal values n.trees = 200 ($M) and interaction.depth = 2. You can see the relationship between AUC and mtry in the plot.

ggplot(mdl_gbm) +
  labs(title = "Gradient Boosting Parameter Tuning")

The most important variable was Sex, then the interaction Sex:Age:AgeCohort and FarePerPass.

ggplot(varImp(mdl_gbm), "Gradient Boosting Variable Importance")

Resampling Performance

The accuracy from the confusion matrix is 0.8291 - the worst of the three models considered so far.

confusionMatrix(mdl_gbm)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  55.2 10.6
##        Yes  6.4 27.7
##                             
##  Accuracy (average) : 0.8291

Holdout Performance

Here is the model performance on the holdout data set.

preds_gbm <- bind_cols(
  predict(mdl_gbm, newdata = training_20, type = "prob"),
  Predicted = predict(mdl_gbm, newdata = training_20, type = "raw"),
  Actual = training_20$Survived
)

confusionMatrix(preds_gbm$Predicted, reference = preds_gbm$Actual, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  104  22
##        Yes   5  46
##                                          
##                Accuracy : 0.8475         
##                  95% CI : (0.7859, 0.897)
##     No Information Rate : 0.6158         
##     P-Value [Acc > NIR] : 1.244e-11      
##                                          
##                   Kappa : 0.6617         
##                                          
##  Mcnemar's Test P-Value : 0.002076       
##                                          
##             Sensitivity : 0.6765         
##             Specificity : 0.9541         
##          Pos Pred Value : 0.9020         
##          Neg Pred Value : 0.8254         
##              Prevalence : 0.3842         
##          Detection Rate : 0.2599         
##    Detection Prevalence : 0.2881         
##       Balanced Accuracy : 0.8153         
##                                          
##        'Positive' Class : Yes            
## 

The sensitivity is 0.6765 and the specificity is 0.9541 (same as random forest). The accuracy is 0.8475 - (same as random forest, worse than bagging). The AUC on the holdout set is 0.8772.

mdl_auc <- Metrics::auc(actual = preds_gbm$Actual == "Yes", preds_gbm$Yes)
yardstick::roc_curve(preds_gbm, Actual, Yes) %>%
  autoplot() +
  labs(
    title = "Gradient Boosting Model ROC Curve, Test Data",
    subtitle = paste0("AUC = ", round(mdl_auc, 4))
  )

Here is the gain curve.

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).

  • It encountered the 68th survivor on the 152nd observation (86%).

options(yardstick.event_first = FALSE)  # set the second level as success
yardstick::gain_curve(preds_gbm, Actual, Yes) %>%
  autoplot() +
  labs(title = "Gradient Boosting Model Gain Curve on Test Data")

XGBoost

The XGBoost model had a holdout set accuracy of 0.8644 (best of all the models), sensitivity of 0.7353, specificity of 0.9450, and AUC of 0.8859.

xgbTree has the following tuneable hyperparameters (see modelLookup("xgbTree")). The first three are the same as xgb.

set.seed(1970)
garbage <- capture.output(
mdl_xgb <- train(
  rcpe,
  training_80[, mdl_vars],
  method = "xgbTree",
  metric = "ROC",
  tuneLength = 5,
  trControl = train_control
))
## Loading required namespace: xgboost
## Warning: package 'xgboost' was built under R version 4.0.2
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
#mdl_xgb
mdl_xgb$bestTune
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 52     100         2 0.3     0              0.6                1       0.5

train() held constant gamma = 0, min_child_weight = 1, and tuned for optimal values nrounds = 100, max_depth = 2, eta = 0.3, colsample_bytree = 0.6, and subsample = 0.5. There are too many tuneable parameters to see their relationships with AUC.

#ggplot(mdl_xgb) +
#  labs(title = "XGBoost Parameter Tuning", x = "mtry")

The most important variable was the interaction Sex:Age:AgeCohort and Age second.

ggplot(varImp(mdl_xgb))

Resampling Performance

The accuracy from the confusion matrix is 0.8207 - worst of the four models.

confusionMatrix(mdl_xgb)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  55.0 11.3
##        Yes  6.6 27.0
##                             
##  Accuracy (average) : 0.8207

Holdout Performance

Here is the model performance on the holdout data set.

preds_xgb <- bind_cols(
  predict(mdl_xgb, newdata = training_20, type = "prob"),
  Predicted = predict(mdl_xgb, newdata = training_20, type = "raw"),
  Actual = training_20$Survived
)

confusionMatrix(preds_xgb$Predicted, reference = preds_xgb$Actual, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  103  18
##        Yes   6  50
##                                          
##                Accuracy : 0.8644         
##                  95% CI : (0.805, 0.9112)
##     No Information Rate : 0.6158         
##     P-Value [Acc > NIR] : 2.439e-13      
##                                          
##                   Kappa : 0.7036         
##                                          
##  Mcnemar's Test P-Value : 0.02474        
##                                          
##             Sensitivity : 0.7353         
##             Specificity : 0.9450         
##          Pos Pred Value : 0.8929         
##          Neg Pred Value : 0.8512         
##              Prevalence : 0.3842         
##          Detection Rate : 0.2825         
##    Detection Prevalence : 0.3164         
##       Balanced Accuracy : 0.8401         
##                                          
##        'Positive' Class : Yes            
## 

The sensitivity is 0.7353 and the specificity is 0.9450. The accuracy is 0.8644 (best of the bunch). The AUC on the holdout set is 0.8859.

mdl_auc <- Metrics::auc(actual = preds_xgb$Actual == "Yes", preds_xgb$Yes)
yardstick::roc_curve(preds_xgb, Actual, Yes) %>%
  autoplot() +
  labs(
    title = "XGBoost Model ROC Curve, Test Data",
    subtitle = paste0("AUC = ", round(mdl_auc, 4))
  )

Here is the gain curve.

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 36 observations (21%).

  • It encountered the 68th survivor on the 162th observation (92%).

options(yardstick.event_first = FALSE)  # set the second level as success
yardstick::gain_curve(preds_xgb, Actual, Yes) %>%
  autoplot() +
  labs(title = "XGBoost Model Gain Curve on Test Data")

Conclusions

Compare the models with evalmod().

scores_list <- join_scores(
  predict(mdl_bag, newdata = training_20, type = "prob")$Yes,
  predict(mdl_rf, newdata = training_20, type = "prob")$Yes,
  predict(mdl_gbm, newdata = training_20, type = "prob")$Yes,
  predict(mdl_xgb, newdata = training_20, type = "prob")$Yes
)
labels_list <- join_labels(
  training_20$Survived,
  training_20$Survived,
  training_20$Survived,
  training_20$Survived
)

pe <- evalmod(
  scores = scores_list, 
  labels = labels_list,
  modnames = c("Bagging", "Random Forest", "Gradient Boosting", "XGBoost"),
  posclass = "Yes")

autoplot(pe, "ROC")

pe
## 
##     === AUCs ===
## 
##             Model name Dataset ID Curve type       AUC
##    1           Bagging          1        ROC 0.8830950
##    2           Bagging          1        PRC 0.8581268
##    3     Random Forest          1        ROC 0.8732461
##    4     Random Forest          1        PRC 0.8575061
##    5 Gradient Boosting          1        ROC 0.8771587
##    6 Gradient Boosting          1        PRC 0.8628425
##    7           XGBoost          1        ROC 0.8859282
##    8           XGBoost          1        PRC 0.8698934
## 
## 
##     === Input data ===
## 
##             Model name Dataset ID # of negatives # of positives
##    1           Bagging          1            109             68
##    2     Random Forest          1            109             68
##    3 Gradient Boosting          1            109             68
##    4           XGBoost          1            109             68

The highest AUC was with XGBoost.

resamps <- resamples(list('Bagging' = mdl_bag, 
                          'Random Forest' = mdl_rf,
                          'Gradient Boosting' = mdl_gbm,
                          'XGBoost' = mdl_xgb))
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: Bagging, Random Forest, Gradient Boosting, XGBoost 
## Number of resamples: 10 
## 
## ROC 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Bagging           0.8457792 0.8692130 0.8824330 0.8848785 0.9013836 0.9297138
## Random Forest     0.8400673 0.8693182 0.8886785 0.8856166 0.9007711 0.9241071
## Gradient Boosting 0.8316498 0.8756163 0.8998316 0.8921702 0.9179744 0.9326299
## XGBoost           0.8627946 0.8741395 0.8949916 0.8933366 0.9029356 0.9358766
##                   NA's
## Bagging              0
## Random Forest        0
## Gradient Boosting    0
## XGBoost              0
## 
## Sens 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Bagging           0.7727273 0.8693182 0.8977273 0.8863636 0.9261364 0.9545455
## Random Forest     0.8636364 0.8920455 0.9318182 0.9181818 0.9318182 0.9772727
## Gradient Boosting 0.7954545 0.8863636 0.8863636 0.8954545 0.9261364 0.9545455
## XGBoost           0.8409091 0.8636364 0.9090909 0.8931818 0.9090909 0.9545455
##                   NA's
## Bagging              0
## Random Forest        0
## Gradient Boosting    0
## XGBoost              0
## 
## Spec 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Bagging           0.6296296 0.6964286 0.7638889 0.7664021 0.8055556 0.9259259
## Random Forest     0.5925926 0.6580688 0.7460317 0.7375661 0.7777778 0.8888889
## Gradient Boosting 0.5714286 0.6388889 0.7142857 0.7226190 0.8148148 0.8928571
## XGBoost           0.5714286 0.6296296 0.6785714 0.7044974 0.7962963 0.8571429
##                   NA's
## Bagging              0
## Random Forest        0
## Gradient Boosting    0
## XGBoost              0
bwplot(resamps, layout = c(3, 1))

Refit Final Model

I’ll do a final fit with the elastic net model to the entire training set to predict on testing.

train() held constant gamma = 0, min_child_weight = 1, and tuned for optimal values nrounds = 100, max_depth = 2, eta = 0.3, colsample_bytree = 0.6, and subsample = 0.5. There are too many tuneable parameters to see their relationships with AUC.

Here is the fit summary.

set.seed(1970)
mdl_final <- train(
  rcpe,
  training[, mdl_vars],
  method = "xgbTree",
  tuneGrid = expand.grid(
    gamma = 0.0,
    min_child_weight = 1,
    nrounds = 100,
    max_depth = 2,
    eta = 0.3, 
    colsample_bytree = 0.6,
    subsample = 0.5
  ),
  trControl = train_control,
  metric = "ROC"
)
mdl_final
## eXtreme Gradient Boosting 
## 
## 891 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 801, 801, 802, 802, 802, ... 
## Resampling results:
## 
##   ROC        Sens      Spec     
##   0.8845126  0.899798  0.7630252
## 
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
##  held constant at a value of 1
## Tuning parameter 'subsample' was held
##  constant at a value of 0.5

FarePerPass ws the most important predictor this time!

varImp(mdl_final)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 25)
## 
##                                   Overall
## Sex_female_x_Age_x_AgeCohort_gt10 100.000
## FarePerPass                        94.510
## Age                                72.384
## Pclass_X3                          39.166
## NetSurv                            37.008
## TicketN                            34.460
## Sex_female_x_Pclass_X3             17.287
## Sex_female                         15.372
## Deck_C                             11.215
## SibSp                              11.001
## TicketNCohort_gt4                   7.082
## Deck_E                              6.820
## Embarked_S                          6.749
## Sex_female_x_Pclass_X2              6.190
## Deck_F                              5.600
## Sex_female_x_Embarked_Q             5.487
## Deck_D                              3.846
## Embarked_Q                          3.515
## Sex_female_x_Embarked_S             2.014
## Deck_B                              1.909

Resampling Performance

The accuracy from the confusion matrix below is 0.8361.

confusionMatrix(mdl_final)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  55.4  9.1
##        Yes  6.2 29.3
##                             
##  Accuracy (average) : 0.8474

Create Submission File

preds <- predict(mdl_final, newdata = testing) %>% {ifelse(. == "Yes", 1, 0)}
sub_file <- data.frame(PassengerId = testing$PassengerId, Survived = preds)
write.csv(sub_file, file = "./titanic_05_cart.csv", row.names = FALSE)