Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. The algorithm constructs B regression or classification trees from B bootstrapped training sets and averages the resulting predictions. The trees are grown deep and are not pruned, so each individual tree has high variance but low bias; averaging the B trees reduces the variance. The predicted value for an observation is the majority vote (classification) or mean (regression) across the B trees. B does not need to be very large; performance typically levels off by around 25 trees.
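As a toy illustration of the idea (caret's "treebag" method used below does all of this internally), a hand-rolled version might look like the following sketch, assuming Survived is a factor with levels No/Yes:

library(rpart)

# Hand-rolled bagging sketch: B bootstrap samples, one deep unpruned tree
# each, class probabilities averaged across trees. Illustrative only.
bag_sketch <- function(formula, data, newdata, B = 25) {
  probs <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap sample
    tree <- rpart(formula, data = boot,
                  control = rpart.control(cp = 0, minsplit = 2))  # deep, unpruned
    predict(tree, newdata = newdata, type = "prob")[, "Yes"]  # P(Survived = "Yes")
  })
  rowMeans(probs)  # average over the B trees
}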
Random forests improve bagged trees by way of a small tweak that de-correlates the trees. As in bagging, the algorithm builds a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of mtry predictors is chosen as split candidates from the full set of p predictors. A fresh sample of mtry predictors is taken at each split. Typically \(mtry \sim \sqrt{p}\). Bagged trees are thus a special case of random forests where mtry = p.
Gradient boosting machine (GBM) is an additive modeling algorithm that gradually builds a composite model by iteratively adding M weak sub-models, each based on the performance of the prior iteration's composite. The idea is to fit a weak model, replace the response values with the residuals from that model, and fit another model to those residuals. Adding the residual prediction model to the original response prediction model produces a more accurate composite. GBM repeats this process, fitting each new model to the residuals of the current composite and adding it in, so the composite grows stronger with each iteration.
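For squared-error loss the residual-fitting loop is easy to sketch (a conceptual illustration, not the gbm package's implementation):

library(rpart)

# Toy boosting loop for regression: fit a shallow tree to the current
# residuals, add a shrunken copy to the composite, repeat M times.
boost_sketch <- function(x, y, M = 100, eta = 0.1) {
  pred <- rep(mean(y), length(y))                       # start from the mean
  for (m in seq_len(M)) {
    fit <- rpart(r ~ ., data = cbind(x, r = y - pred),  # weak model on residuals
                 control = rpart.control(maxdepth = 2))
    pred <- pred + eta * predict(fit, x)                # shrunken update
  }
  pred
}

# e.g., boost_sketch(mtcars[, -1], mtcars$mpg)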
Extreme Gradient Boosting (XGBoost) is a boosting algorithm based on GBM. XGBoost adds regularization to the objective to reduce overfitting. I don't know much about its internals yet, but it does well in Kaggle competitions, and you have to learn it somehow, so I'll give it a try.
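For what it's worth, the regularization shows up in the objective from the XGBoost paper (Chen & Guestrin, 2016): each tree \(f_k\) is penalized by its number of leaves \(T\) and its leaf weights \(w\),

\[
\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,
\]

so larger \(\gamma\) and \(\lambda\) favor smaller, smoother trees.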
The best model was XGBoost, with a holdout set accuracy of 0.8644. After refitting to the full training data set, the accuracy on the Kaggle testing data set was 0.79186.
The initial data management created the data set full, with training rows indexed by train_index. I added more predictor variables in the exploratory analysis and split the data into training and testing, then split training 80:20 into training_80 for training and training_20 to compare models.
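The split code itself is not repeated here; a minimal sketch, assuming the objects named above and caret's createDataPartition(), would be:

library(caret)

# Assumed reconstruction of the split described above (not the original code).
training <- full[train_index, ]    # labeled rows
testing  <- full[-train_index, ]   # Kaggle test rows (labels unknown)

set.seed(1970)
idx_80 <- createDataPartition(training$Survived, p = 0.80, list = FALSE)
training_80 <- training[idx_80, ]  # used to train the candidate models
training_20 <- training[-idx_80, ] # holdout used to compare models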
I’ll use 10-fold CV.
train_control <- trainControl(
method = "cv", number = 10,
savePredictions = "final",
summaryFunction = twoClassSummary,
classProbs = TRUE
)
I’ll try four models: bagging, random forests, gbm, and xgboost.
My model data set variables are PassengerId, Survived, and 13 predictors.
## [1] "PassengerId" "Survived" "Pclass" "Sex"
## [5] "Age" "SibSp" "Parch" "Embarked"
## [9] "TicketN" "FarePerPass" "Employee" "Deck"
## [13] "AgeCohort" "TicketNCohort" "NetSurv"
I’ll use the recipe method to train. From the exploratory analysis section I’ve decided to create the interactions Pclass*Sex, Embarked*Sex, TicketN*TicketNCohort, and Age*AgeCohort*Sex.
rcpe <- recipe(Survived ~ ., data = training_80[, mdl_vars]) %>%
update_role(PassengerId, new_role = "id variable") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
# I don't think centering/scaling helps with tree models
# step_center(Age, SibSp, Parch, TicketN, FarePerPass, NetSurv) %>%
# step_scale(Age, SibSp, Parch, TicketN, FarePerPass, NetSurv) %>%
step_interact(terms = ~
starts_with("Sex_"):starts_with("Pclass_") +
starts_with("Sex_"):starts_with("Embarked_") +
TicketN:starts_with("TicketNCohort_") +
Age:starts_with("AgeCohort_"):starts_with("Sex_"))
prep(rcpe, training = training_80)
## Data Recipe
##
## Inputs:
##
## role #variables
## id variable 1
## outcome 1
## predictor 13
##
## Training data contained 714 data points and no missing data.
##
## Operations:
##
## Dummy variables from Pclass, Sex, Embarked, Employee, Deck, ... [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]
The bagging model had a holdout set accuracy of 0.8531, sensitivity of 0.7206, specificity of 0.9358, and AUC of 0.8831.
caret has no tuneable hyperparameters for this model. Here is the fit summary.
set.seed(1970)
mdl_bag <- train(
rcpe,
training_80[, mdl_vars],
method = "treebag",
trControl = train_control,
metric = "ROC"
)
mdl_bag
## Bagged CART
##
## 714 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8848785 0.8863636 0.7664021
varImp() ranks the predictors by the sum of the reduction in the loss function attributed to each variable at each split. The most important variable here was FarePerPass. In the straight logistic regression model, it was NetSurv, followed by Pclass.
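The ranking presumably comes from caret's varImp() method for train objects; for example:

plot(varImp(mdl_bag), top = 10)  # loss-reduction importance, top 10 predictors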
The accuracy from the confusion matrix is 0.8403.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 54.6 9.0
## Yes 7.0 29.4
##
## Accuracy (average) : 0.8403
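This is caret's cross-validated confusion matrix for a train object, i.e. something along the lines of:

confusionMatrix(mdl_bag)  # cell percentages averaged over the 10 CV resamples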
Here is the model performance on the holdout data set.
preds_bag <- bind_cols(
predict(mdl_bag, newdata = training_20, type = "prob"),
Predicted = predict(mdl_bag, newdata = training_20, type = "raw"),
Actual = training_20$Survived
)
confusionMatrix(preds_bag$Predicted, reference = preds_bag$Actual, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 102 19
## Yes 7 49
##
## Accuracy : 0.8531
## 95% CI : (0.7922, 0.9017)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 3.506e-12
##
## Kappa : 0.6789
##
## Mcnemar's Test P-Value : 0.03098
##
## Sensitivity : 0.7206
## Specificity : 0.9358
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.8430
## Prevalence : 0.3842
## Detection Rate : 0.2768
## Detection Prevalence : 0.3164
## Balanced Accuracy : 0.8282
##
## 'Positive' Class : Yes
##
The sensitivity is 0.7206 and the specificity is 0.9358, so the model is more prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8531. precrec::evalmod() (used in the model comparison at the end) computes ROC and precision-recall curves from the holdout predictions. The AUC on the holdout set is 0.8831.
mdl_auc <- Metrics::auc(actual = preds_bag$Actual == "Yes", preds_bag$Yes)
yardstick::roc_curve(preds_bag, Actual, Yes) %>%
autoplot() +
labs(
title = "Bagging Model ROC Curve, Test Data",
subtitle = paste0("AUC = ", round(mdl_auc, 4))
)
The gain curve plots the cumulative summed true outcome versus the fraction of items seen when sorted by the predicted value. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicted the response variable well. The gray area is the “gain” over a random prediction.
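Concretely, the curve sorts the holdout observations by predicted probability and accumulates the true survivors; a minimal sketch of the x/y coordinates:

# Hand computation of the gain curve points for the bagging model:
# sort by predicted P(Yes) descending, then accumulate actual survivors.
gain <- preds_bag %>%
  arrange(desc(Yes)) %>%
  mutate(
    frac_seen  = row_number() / n(),                             # x-axis
    frac_found = cumsum(Actual == "Yes") / sum(Actual == "Yes")  # y-axis
  )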
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered the 68th survivor on the 177th observation (100%).
options(yardstick.event_first = FALSE) # set the second level as success
yardstick::gain_curve(preds_bag, Actual, Yes) %>%
autoplot() +
labs(title = "Bagging Model Gain Curve on Test Data")
The random forest model had a holdout set accuracy of 0.8475 (a little worse than bagging’s 0.8531), sensitivity of 0.6765 (worse than bagging), specificity of 0.9541 (better than bagging), and AUC of 0.8732 (worse than bagging’s 0.8831).
Each time a split in a tree is considered, a random forest model takes a random sample of mtry predictors as split candidates from the full set of p predictors. Hyperparameter mtry can take any value from 1 to 13 (the number of predictors), and I expect the best value to be near \(\sqrt{13} \approx 4\). Here is the fit summary.
set.seed(1970)
mdl_rf <- train(
rcpe,
training_80[, mdl_vars],
method = "rf",
metric = "ROC",
tuneGrid = expand.grid(mtry = 1:13), # searching around mtry=4,
trControl = train_control
)
## mtry
## 7 7
The best tuning parameter is mtry = 7. You can see the relationship between AUC and mtry in the plot below.
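(Plot code along these lines, using caret's ggplot method for train objects:)

ggplot(mdl_rf) +
  labs(title = "Random Forest Tuning: ROC vs. mtry")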
The most important variable was the interaction Sex:Age:AgeCohort, with FarePerPass second. Bagging had the same top two, but in flipped order.
The accuracy from the confusion matrix is 0.8487, a little better than bagging’s 0.8403.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 56.6 10.1
## Yes 5.0 28.3
##
## Accuracy (average) : 0.8487
Here is the model performance on the holdout data set.
preds_rf <- bind_cols(
predict(mdl_rf, newdata = training_20, type = "prob"),
Predicted = predict(mdl_rf, newdata = training_20, type = "raw"),
Actual = training_20$Survived
)
confusionMatrix(preds_rf$Predicted, reference = preds_rf$Actual, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 104 22
## Yes 5 46
##
## Accuracy : 0.8475
## 95% CI : (0.7859, 0.897)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.244e-11
##
## Kappa : 0.6617
##
## Mcnemar's Test P-Value : 0.002076
##
## Sensitivity : 0.6765
## Specificity : 0.9541
## Pos Pred Value : 0.9020
## Neg Pred Value : 0.8254
## Prevalence : 0.3842
## Detection Rate : 0.2599
## Detection Prevalence : 0.2881
## Balanced Accuracy : 0.8153
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6765 (worse than bagging) and the specificity is 0.9541 (better than bagging). The accuracy is 0.8475 (worse than bagging). The AUC on the holdout set is 0.8732.
mdl_auc <- Metrics::auc(actual = preds_rf$Actual == "Yes", preds_rf$Yes)
yardstick::roc_curve(preds_rf, Actual, Yes) %>%
autoplot() +
labs(
title = "Random Forest Model ROC Curve, Test Data",
subtitle = paste0("AUC = ", round(mdl_auc, 4))
)
Here is the gain curve.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 36 observations (21%).
It encountered the 68th survivor on the 177th observation (100%).
options(yardstick.event_first = FALSE) # set the second level as success
yardstick::gain_curve(preds_rf, Actual, Yes) %>%
autoplot() +
labs(title = "Random Forest Model Gain Curve on Test Data")
The GBM model had a holdout set accuracy of 0.8475 (tied with random forest but worse than bagging’s 0.8531), sensitivity of 0.6765, specificity of 0.9541, and AUC of 0.8772.
gbm has the following tuneable hyperparameters (see modelLookup("gbm")):

- n.trees: number of boosting iterations, \(M\)
- interaction.depth: maximum tree depth
- shrinkage: shrinkage, \(\eta\)
- n.minobsinnode: minimum terminal node size

set.seed(1970)
garbage <- capture.output(
mdl_gbm <- train(
rcpe,
training_80[, mdl_vars],
method = "gbm",
metric = "ROC",
tuneLength = 5,
trControl = train_control
))
## Stochastic Gradient Boosting
##
## 714 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees ROC Sens Spec
## 1 50 0.8760853 0.8681818 0.7300265
## 1 100 0.8838143 0.8886364 0.7300265
## 1 150 0.8871708 0.9022727 0.7263228
## 1 200 0.8870265 0.8977273 0.7046296
## 1 250 0.8846305 0.9000000 0.6936508
## 2 50 0.8898268 0.9113636 0.7148148
## 2 100 0.8873256 0.9136364 0.7082011
## 2 150 0.8884695 0.8909091 0.7152116
## 2 200 0.8921702 0.8954545 0.7226190
## 2 250 0.8882651 0.9113636 0.7301587
## 3 50 0.8902026 0.8931818 0.7006614
## 3 100 0.8867995 0.9068182 0.7191799
## 3 150 0.8885567 0.9045455 0.7228836
## 3 200 0.8877811 0.9090909 0.7298942
## 3 250 0.8817054 0.9068182 0.7227513
## 4 50 0.8873076 0.9022727 0.6862434
## 4 100 0.8835934 0.8977273 0.7265873
## 4 150 0.8844321 0.9000000 0.7374339
## 4 200 0.8832296 0.9045455 0.7519841
## 4 250 0.8824330 0.9000000 0.7412698
## 5 50 0.8864313 0.8931818 0.7005291
## 5 100 0.8828598 0.8977273 0.7373016
## 5 150 0.8838549 0.8886364 0.7519841
## 5 200 0.8857368 0.8909091 0.7482804
## 5 250 0.8830102 0.8931818 0.7447090
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
## n.trees interaction.depth shrinkage n.minobsinnode
## 9 200 2 0.1 10
train() held shrinkage = 0.1 and n.minobsinnode = 10 constant, and tuned to the optimal values n.trees = 200 (\(M\)) and interaction.depth = 2. You can see the relationship between AUC and the tuning parameters in the plot below.
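(Plot code along these lines:)

ggplot(mdl_gbm) +
  labs(title = "GBM Tuning: ROC vs. Boosting Iterations and Tree Depth")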
The most important variable was Sex, then the interaction Sex:Age:AgeCohort and FarePerPass.
The accuracy from the confusion matrix is 0.8291, the worst of the three models considered so far.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 55.2 10.6
## Yes 6.4 27.7
##
## Accuracy (average) : 0.8291
Here is the model performance on the holdout data set.
preds_gbm <- bind_cols(
predict(mdl_gbm, newdata = training_20, type = "prob"),
Predicted = predict(mdl_gbm, newdata = training_20, type = "raw"),
Actual = training_20$Survived
)
confusionMatrix(preds_gbm$Predicted, reference = preds_gbm$Actual, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 104 22
## Yes 5 46
##
## Accuracy : 0.8475
## 95% CI : (0.7859, 0.897)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.244e-11
##
## Kappa : 0.6617
##
## Mcnemar's Test P-Value : 0.002076
##
## Sensitivity : 0.6765
## Specificity : 0.9541
## Pos Pred Value : 0.9020
## Neg Pred Value : 0.8254
## Prevalence : 0.3842
## Detection Rate : 0.2599
## Detection Prevalence : 0.2881
## Balanced Accuracy : 0.8153
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6765 and the specificity is 0.9541 (both the same as random forest). The accuracy is 0.8475 (same as random forest, worse than bagging). The AUC on the holdout set is 0.8772.
mdl_auc <- Metrics::auc(actual = preds_gbm$Actual == "Yes", preds_gbm$Yes)
yardstick::roc_curve(preds_gbm, Actual, Yes) %>%
autoplot() +
labs(
title = "Gradient Boosting Model ROC Curve, Test Data",
subtitle = paste0("AUC = ", round(mdl_auc, 4))
)
Here is the gain curve.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered the 68th survivor on the 152nd observation (86%).
options(yardstick.event_first = FALSE) # set the second level as success
yardstick::gain_curve(preds_gbm, Actual, Yes) %>%
autoplot() +
labs(title = "Gradient Boosting Model Gain Curve on Test Data")
The XGBoost model had a holdout set accuracy of 0.8644 (best of all the models), sensitivity of 0.7353, specificity of 0.9450, and AUC of 0.8859.
xgbTree has the following tuneable hyperparameters (see modelLookup("xgbTree")). The first three correspond to gbm's n.trees, interaction.depth, and shrinkage.

- nrounds: number of boosting iterations, \(M\)
- max_depth: maximum tree depth
- eta: shrinkage, \(\eta\)
- gamma: minimum loss reduction
- colsample_bytree: subsample ratio of columns
- min_child_weight: minimum sum of instance weights
- subsample: subsample percentage

set.seed(1970)
garbage <- capture.output(
mdl_xgb <- train(
rcpe,
training_80[, mdl_vars],
method = "xgbTree",
metric = "ROC",
tuneLength = 5,
trControl = train_control
))
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 52 100 2 0.3 0 0.6 1 0.5
train() held gamma = 0 and min_child_weight = 1 constant, and tuned to the optimal values nrounds = 100, max_depth = 2, eta = 0.3, colsample_bytree = 0.6, and subsample = 0.5. There are too many tuneable parameters to visualize their relationships with AUC.
The most important variable was the interaction Sex:Age:AgeCohort, with Age second.
The accuracy from the confusion matrix is 0.8207, the worst of the four models.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 55.0 11.3
## Yes 6.6 27.0
##
## Accuracy (average) : 0.8207
Here is the model performance on the holdout data set.
preds_xgb <- bind_cols(
predict(mdl_xgb, newdata = training_20, type = "prob"),
Predicted = predict(mdl_xgb, newdata = training_20, type = "raw"),
Actual = training_20$Survived
)
confusionMatrix(preds_xgb$Predicted, reference = preds_xgb$Actual, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 103 18
## Yes 6 50
##
## Accuracy : 0.8644
## 95% CI : (0.805, 0.9112)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 2.439e-13
##
## Kappa : 0.7036
##
## Mcnemar's Test P-Value : 0.02474
##
## Sensitivity : 0.7353
## Specificity : 0.9450
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.8512
## Prevalence : 0.3842
## Detection Rate : 0.2825
## Detection Prevalence : 0.3164
## Balanced Accuracy : 0.8401
##
## 'Positive' Class : Yes
##
The sensitivity is 0.7353 and the specificity is 0.9450. The accuracy is 0.8644 (best of the bunch). The AUC on the holdout set is 0.8859.
mdl_auc <- Metrics::auc(actual = preds_xgb$Actual == "Yes", preds_xgb$Yes)
yardstick::roc_curve(preds_xgb, Actual, Yes) %>%
autoplot() +
labs(
title = "XGBoost Model ROC Curve, Test Data",
subtitle = paste0("AUC = ", round(mdl_auc, 4))
)
Here is the gain curve.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 36 observations (21%).
It encountered the 68th survivor on the 162nd observation (92%).
options(yardstick.event_first = FALSE) # set the second level as success
yardstick::gain_curve(preds_xgb, Actual, Yes) %>%
autoplot() +
labs(title = "XGBoost Model Gain Curve on Test Data")
Compare the models with evalmod().
scores_list <- join_scores(
predict(mdl_bag, newdata = training_20, type = "prob")$Yes,
predict(mdl_rf, newdata = training_20, type = "prob")$Yes,
predict(mdl_gbm, newdata = training_20, type = "prob")$Yes,
predict(mdl_xgb, newdata = training_20, type = "prob")$Yes
)
labels_list <- join_labels(
training_20$Survived,
training_20$Survived,
training_20$Survived,
training_20$Survived
)
pe <- evalmod(
scores = scores_list,
labels = labels_list,
modnames = c("Bagging", "Random Forest", "Gradient Boosting", "XGBoost"),
posclass = "Yes")
autoplot(pe, "ROC")
##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 Bagging 1 ROC 0.8830950
## 2 Bagging 1 PRC 0.8581268
## 3 Random Forest 1 ROC 0.8732461
## 4 Random Forest 1 PRC 0.8575061
## 5 Gradient Boosting 1 ROC 0.8771587
## 6 Gradient Boosting 1 PRC 0.8628425
## 7 XGBoost 1 ROC 0.8859282
## 8 XGBoost 1 PRC 0.8698934
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 Bagging 1 109 68
## 2 Random Forest 1 109 68
## 3 Gradient Boosting 1 109 68
## 4 XGBoost 1 109 68
The highest AUC was with XGBoost.
resamps <- resamples(list('Bagging' = mdl_bag,
'Random Forest' = mdl_rf,
'Gradient Boosting' = mdl_gbm,
'XGBoost' = mdl_xgb))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: Bagging, Random Forest, Gradient Boosting, XGBoost
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Bagging 0.8457792 0.8692130 0.8824330 0.8848785 0.9013836 0.9297138
## Random Forest 0.8400673 0.8693182 0.8886785 0.8856166 0.9007711 0.9241071
## Gradient Boosting 0.8316498 0.8756163 0.8998316 0.8921702 0.9179744 0.9326299
## XGBoost 0.8627946 0.8741395 0.8949916 0.8933366 0.9029356 0.9358766
## NA's
## Bagging 0
## Random Forest 0
## Gradient Boosting 0
## XGBoost 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Bagging 0.7727273 0.8693182 0.8977273 0.8863636 0.9261364 0.9545455
## Random Forest 0.8636364 0.8920455 0.9318182 0.9181818 0.9318182 0.9772727
## Gradient Boosting 0.7954545 0.8863636 0.8863636 0.8954545 0.9261364 0.9545455
## XGBoost 0.8409091 0.8636364 0.9090909 0.8931818 0.9090909 0.9545455
## NA's
## Bagging 0
## Random Forest 0
## Gradient Boosting 0
## XGBoost 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Bagging 0.6296296 0.6964286 0.7638889 0.7664021 0.8055556 0.9259259
## Random Forest 0.5925926 0.6580688 0.7460317 0.7375661 0.7777778 0.8888889
## Gradient Boosting 0.5714286 0.6388889 0.7142857 0.7226190 0.8148148 0.8928571
## XGBoost 0.5714286 0.6296296 0.6785714 0.7044974 0.7962963 0.8571429
## NA's
## Bagging 0
## Random Forest 0
## Gradient Boosting 0
## XGBoost 0
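A quick visual comparison of the resampling distributions, using caret's lattice method for resamples objects:

bwplot(resamps, metric = "ROC")  # box plots of CV ROC across the four models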
I’ll do a final fit of the XGBoost model to the entire training set to predict on testing.
For the final fit I fixed the hyperparameters at the values tuned above: gamma = 0, min_child_weight = 1, nrounds = 100, max_depth = 2, eta = 0.3, colsample_bytree = 0.6, and subsample = 0.5.
Here is the fit summary.
set.seed(1970)
mdl_final <- train(
rcpe,
training[, mdl_vars],
method = "xgbTree",
tuneGrid = expand.grid(
gamma = 0.0,
min_child_weight = 1,
nrounds = 100,
max_depth = 2,
eta = 0.3,
colsample_bytree = 0.6,
subsample = 0.5
),
trControl = train_control,
metric = "ROC"
)
mdl_final
## eXtreme Gradient Boosting
##
## 891 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 802, 801, 801, 802, 802, 802, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8845126 0.899798 0.7630252
##
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.5
The interaction Sex:Age:AgeCohort was again the most important predictor, with FarePerPass a close second.
## xgbTree variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## Sex_female_x_Age_x_AgeCohort_gt10 100.000
## FarePerPass 94.510
## Age 72.384
## Pclass_X3 39.166
## NetSurv 37.008
## TicketN 34.460
## Sex_female_x_Pclass_X3 17.287
## Sex_female 15.372
## Deck_C 11.215
## SibSp 11.001
## TicketNCohort_gt4 7.082
## Deck_E 6.820
## Embarked_S 6.749
## Sex_female_x_Pclass_X2 6.190
## Deck_F 5.600
## Sex_female_x_Embarked_Q 5.487
## Deck_D 3.846
## Embarked_Q 3.515
## Sex_female_x_Embarked_S 2.014
## Deck_B 1.909
The accuracy from the confusion matrix below is 0.8474.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 55.4 9.1
## Yes 6.2 29.3
##
## Accuracy (average) : 0.8474
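The Kaggle submission (accuracy 0.79186, per the summary above) would then be produced along these lines; the file and column names below follow the competition's required format, and this sketch is assumed rather than taken from the original analysis:

# Assumed submission code (not shown in the original analysis).
submission <- data.frame(
  PassengerId = testing$PassengerId,
  Survived = ifelse(predict(mdl_final, newdata = testing) == "Yes", 1, 0)
)
write.csv(submission, "titanic_submission.csv", row.names = FALSE)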