The logistic regression model is simple and can be used for inference. The binary logistic regression model is
\[\mathrm{logit}(\pi) = \ln \left( \frac{\pi}{1 - \pi} \right) = X \beta\]
where \(\pi\) is the event probability, so the model predicts the log odds of the response variable. The model is fit with maximum likelihood estimation; there is no closed-form solution, so glm() estimates the coefficients with iteratively reweighted least squares (IRLS).
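A minimal sketch of the IRLS idea (illustrative only - glm() does this internally; it assumes X is a numeric model matrix with an intercept column and y is a 0/1 response):
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                    # start at the zero vector
  for (i in seq_len(max_iter)) {
    eta <- drop(X %*% beta)                  # linear predictor
    p <- 1 / (1 + exp(-eta))                 # inverse logit
    w <- p * (1 - p)                         # working weights
    z <- eta + (y - p) / w                   # working response
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * z))
    if (max(abs(beta_new - beta)) < tol) return(drop(beta_new))
    beta <- drop(beta_new)
  }
  drop(beta)
}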
The logistic model run with the final set of predictors had a holdout set accuracy of 0.8305, sensitivity of 0.6029, specificity of 0.9725, and AUC of 0.8774.
After fitting to the full training data set, the performance on the Kaggle testing data set was 0.78947 accuracy.
The initial data management created the data set full, with training rows indexed by train_index. I added more predictor variables in the exploratory analysis and split the data into training and testing, then split training 80:20 into training_80 for model training and training_20 as a holdout for comparing models.
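For reference, here is a minimal sketch of how that stratified 80:20 split might look in caret (assumed approach; the actual seed and call may differ):
library(caret)

set.seed(1970)
idx_80 <- createDataPartition(training$Survived, p = 0.8, list = FALSE)
training_80 <- training[idx_80, ]   # 714 rows used to fit the candidate models
training_20 <- training[-idx_80, ]  # 177-row holdout used to compare them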
I’ll use 10-fold CV.
train_control <- trainControl(
method = "cv", number = 10,
savePredictions = "final",
classProbs = TRUE
)
I’ll try two models: a “full” model with (almost) all predictors, and a parsimonious model using glmStepAIC.
By “all predictors”, I mean all non-character predictors - I’m throwing out Surname, Name, Ticket, and Cabin. I’m also throwing out Fare since it is redundant to FarePerPass. That leaves me with PassengerId, Survived, and 13 predictors.
## [1] "PassengerId" "Survived" "Pclass" "Sex"
## [5] "Age" "SibSp" "Parch" "Embarked"
## [9] "TicketN" "FarePerPass" "Employee" "Deck"
## [13] "AgeCohort" "TicketNCohort" "NetSurv"
I’ll use the recipe method to train. From the exploratory analysis section I’ve decided to create interactions Pclass*Sex, Embarked*Sex, TicketN:TicketNCohort, and Age*AgeCohort*Sex.
rcpe <- recipe(Survived ~ ., data = training_80[, mdl_vars]) %>%
update_role(PassengerId, new_role = "id variable") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_interact(
terms = ~
starts_with("Sex_"):starts_with("Pclass_") +
starts_with("Sex_"):starts_with("Embarked_") +
TicketN:starts_with("TicketNCohort_") +
Age:starts_with("AgeCohort_"):starts_with("Sex_")
)
prep(rcpe, training = training_80)
## Data Recipe
##
## Inputs:
##
## role #variables
## id variable 1
## outcome 1
## predictor 13
##
## Training data contained 714 data points and no missing data.
##
## Operations:
##
## Dummy variables from Pclass, Sex, Embarked, Employee, Deck, ... [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]
The logistic model run with the full set of predictors had a holdout set accuracy of 0.8192, sensitivity of 0.6029, specificity of 0.9541, and AUC of 0.8821.
Here is the fit summary.
set.seed(1970)
mdl_full <- train(
rcpe,
training_80[, mdl_vars],
method = "glm",
family = "binomial",
trControl = train_control,
metric = "Accuracy"
)
summary(mdl_full)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9365 -0.5326 -0.3689 0.3093 2.5718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6230749 0.9989503 2.626 0.008644 **
## Age -0.0312234 0.0126808 -2.462 0.013806 *
## SibSp -0.4904314 0.1665775 -2.944 0.003238 **
## Parch -0.1981775 0.2081470 -0.952 0.341045
## TicketN 0.3155560 0.1872779 1.685 0.091996 .
## FarePerPass -0.0000135 0.0150963 -0.001 0.999287
## NetSurv 1.1151445 0.2175755 5.125 2.97e-07 ***
## Pclass_X2 -1.6116251 0.5821576 -2.768 0.005634 **
## Pclass_X3 -1.5097748 0.5449751 -2.770 0.005600 **
## Sex_female 2.8959607 1.1132743 2.601 0.009287 **
## Embarked_Q -0.7941796 0.7512128 -1.057 0.290422
## Embarked_S -0.3668720 0.3474722 -1.056 0.291045
## Employee_X1 -1.0949792 1.1265330 -0.972 0.331055
## Deck_B -0.0100641 0.5094084 -0.020 0.984238
## Deck_C -0.5491667 0.4670041 -1.176 0.239621
## Deck_D 0.2197890 0.4848076 0.453 0.650295
## Deck_E -0.2271941 0.4678783 -0.486 0.627262
## Deck_F -0.0774744 0.4394544 -0.176 0.860061
## AgeCohort_gt10 -2.2100287 0.6057894 -3.648 0.000264 ***
## TicketNCohort_gt4 -4.5056335 2.1038403 -2.142 0.032224 *
## Sex_female_x_Pclass_X2 0.6605795 1.0220719 0.646 0.518076
## Sex_female_x_Pclass_X3 -2.0983916 0.8672194 -2.420 0.015534 *
## Sex_female_x_Embarked_Q 1.4893787 1.0208529 1.459 0.144577
## Sex_female_x_Embarked_S -0.3158003 0.6711569 -0.471 0.637975
## TicketN_x_TicketNCohort_gt4 0.4095766 0.3450534 1.187 0.235230
## Sex_female_x_Age_x_AgeCohort_gt10 0.0547660 0.0198725 2.756 0.005854 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 512.03 on 688 degrees of freedom
## AIC: 564.03
##
## Number of Fisher Scoring iterations: 6
Parch, FarePerPass, Embarked, and Employee fail the significance test. varImp() ranks the coefficients by the absolute value of their test statistic (the z value in the glm summary). NetSurv is most important.
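The ranking below is presumably the output of caret’s varImp() on the fitted model:
varImp(mdl_full)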
## glm variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## NetSurv 100.000
## AgeCohort_gt10 71.174
## SibSp 57.436
## Pclass_X3 54.044
## Pclass_X2 54.005
## Sex_female_x_Age_x_AgeCohort_gt10 53.762
## Sex_female 50.745
## Age 48.032
## Sex_female_x_Pclass_X3 47.201
## TicketNCohort_gt4 41.775
## TicketN 32.864
## Sex_female_x_Embarked_Q 28.453
## TicketN_x_TicketNCohort_gt4 23.146
## Deck_C 22.930
## Embarked_Q 20.613
## Embarked_S 20.586
## Employee_X1 18.950
## Parch 18.562
## Sex_female_x_Pclass_X2 12.595
## Deck_E 9.458
The model performance on the 10-fold resampling was:
## Generalized Linear Model
##
## 714 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8346831 0.6366391
The model accuracy is 0.8347. More detail below.
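The cross-validated confusion matrix below presumably comes from calling confusionMatrix() directly on the train object, which averages the cell percentages across the resamples:
confusionMatrix(mdl_full)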
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 56.6 11.5
## Yes 5.0 26.9
##
## Accuracy (average) : 0.8347
Here is the model performance on the holdout data set.
predicted_classes <- predict(mdl_full, newdata = training_20)
predicted_probs <- predict(mdl_full, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 104 27
## Yes 5 41
##
## Accuracy : 0.8192
## 95% CI : (0.7545, 0.8729)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 3.784e-09
##
## Kappa : 0.5932
##
## Mcnemar's Test P-Value : 0.0002054
##
## Sensitivity : 0.6029
## Specificity : 0.9541
## Pos Pred Value : 0.8913
## Neg Pred Value : 0.7939
## Prevalence : 0.3842
## Detection Rate : 0.2316
## Detection Prevalence : 0.2599
## Balanced Accuracy : 0.7785
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6029 and the specificity is 0.9541, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8192. precrec::evalmod() calculates the ROC and precision-recall curves from the predicted probabilities on the holdout data set. The AUC on the holdout set is 0.8821. pROC::plot.roc() and plotROC::geom_roc() are plotting options, but I like the way yardstick::roc_curve() looks.
mdl_full_preds <- predict(mdl_full, newdata = training_20, type = "prob")
(mdl_full_eval <- evalmod(
scores = mdl_full_preds$Yes,
labels = training_20$Survived
))
##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 m1 1 ROC 0.8820831
## 2 m1 1 PRC 0.8620615
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 m1 1 109 68
options(yardstick.event_first = FALSE) # set the second level as success
data.frame(
pred = mdl_full_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::roc_curve(obs, pred) %>%
autoplot() +
labs(
title = "Full Model ROC Curve, Test Data",
subtitle = "AUC = 0.8821"
)
The gain curve plots the cumulative summed true outcome versus the fraction of items seen when sorted by the predicted value. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicted the response variable well. The gray area is the “gain” over a random prediction.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered all 68 survivors on the 157th observation (89%).
The bottom of the gray area is the outcome of a random model: only half the survivors would be observed within 50% of the observations. The top of the gray area is the outcome of the perfect model, the “wizard curve”: all of the survivors would be observed within 68/177 = 38% of the observations.
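A quick check of those reference points (68 of the 177 holdout passengers survived):
n_obs <- 177; n_pos <- 68
ceiling(n_pos / 2) / n_obs  # wizard curve reaches half the survivors by obs 34: ~0.19
n_pos / n_obs               # and all survivors by obs 68: ~0.38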
data.frame(
pred = mdl_full_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::gain_curve(obs, pred) %>%
autoplot() +
labs(
title = "Full Model Gain Curve on Holdout Set"
)
Let’s try this again with stepwise regression to see how a more parsimonious model might perform.
set.seed(1970)
mdl_step <- train(
rcpe,
training_80[, mdl_vars],
method = "glmStepAIC",
family = "binomial",
trace = FALSE, # suppress the iteration summaries
trControl = train_control,
metric = "Accuracy"
)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1163 -0.5358 -0.3811 0.3408 2.5254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.17080 0.81318 2.670 0.007596 **
## Age -0.03195 0.01234 -2.590 0.009609 **
## SibSp -0.49676 0.16592 -2.994 0.002754 **
## TicketN 0.27315 0.15512 1.761 0.078267 .
## NetSurv 1.16906 0.21195 5.516 3.48e-08 ***
## Pclass_X2 -1.57434 0.42726 -3.685 0.000229 ***
## Pclass_X3 -1.57409 0.37197 -4.232 2.32e-05 ***
## Sex_female 3.16761 0.78294 4.046 5.22e-05 ***
## Deck_C -0.52073 0.30800 -1.691 0.090897 .
## AgeCohort_gt10 -2.03549 0.58690 -3.468 0.000524 ***
## TicketNCohort_gt4 -5.23705 1.99014 -2.631 0.008501 **
## Sex_female_x_Pclass_X3 -2.51905 0.61582 -4.091 4.30e-05 ***
## Sex_female_x_Embarked_Q 1.28978 0.51091 2.524 0.011586 *
## TicketN_x_TicketNCohort_gt4 0.50332 0.31469 1.599 0.109729
## Sex_female_x_Age_x_AgeCohort_gt10 0.05130 0.01953 2.626 0.008628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 518.76 on 699 degrees of freedom
## AIC: 548.76
##
## Number of Fisher Scoring iterations: 6
glmStepAIC dropped Parch, FarePerPass, Embarked, Employee, most of the Deck dummies (although it kept Deck_C), and the Sex interactions with one level of Pclass and one level of Embarked.
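Because log-odds coefficients are hard to read directly, one option is to exponentiate them into odds ratios (a sketch; caret stores the fitted glm in mdl_step$finalModel):
# exp() of a log-odds coefficient is a multiplicative effect on the odds,
# e.g. each additional SibSp multiplies the survival odds by exp(-0.497) ~ 0.61.
exp(coef(mdl_step$finalModel))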
The model performance on the 10-fold resampling was:
## Generalized Linear Model with Stepwise Feature Selection
##
## 714 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8276995 0.6213608
The accuracy is 0.8277 - down from 0.8347 in the full model.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 56.4 12.0
## Yes 5.2 26.3
##
## Accuracy (average) : 0.8277
Here is the model performance on the holdout data set.
predicted_classes <- predict(mdl_step, newdata = training_20)
predicted_probs <- predict(mdl_step, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 105 27
## Yes 4 41
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6047
##
## Mcnemar's Test P-Value : 7.772e-05
##
## Sensitivity : 0.6029
## Specificity : 0.9633
## Pos Pred Value : 0.9111
## Neg Pred Value : 0.7955
## Prevalence : 0.3842
## Detection Rate : 0.2316
## Detection Prevalence : 0.2542
## Balanced Accuracy : 0.7831
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6029 and the specificity is 0.9633, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8249 - higher than the 0.8192 in the full model! precrec::evalmod() again calculates the ROC and precision-recall curves on the holdout data set. The AUC on the holdout set is 0.8848.
mdl_step_preds <- predict(mdl_step, newdata = training_20, type = "prob")
(mdl_step_eval <- evalmod(
scores = mdl_step_preds$Yes,
labels = training_20$Survived
))
##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 m1 1 ROC 0.8847814
## 2 m1 1 PRC 0.8650199
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 m1 1 109 68
options(yardstick.event_first = FALSE) # set the second level as success
data.frame(
pred = mdl_step_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::roc_curve(obs, pred) %>%
autoplot() +
labs(
title = "glmStepAIC Model ROC Curve, Test Data",
subtitle = "AUC = 0.8848"
)
Here is the gain curve again.
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 36 observations (20%).
It encountered all 68 survivors on the 150th observation (85%).
data.frame(
pred = mdl_step_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::gain_curve(obs, pred) %>%
autoplot() +
labs(
title = "glmStepAIC Model Gain Curve on Holdout Set"
)
The stepwise regression actually performed better than the full model on the holdout set (surprising - perhaps multicollinearity among the full model’s extra predictors hurts its predictive accuracy). I don’t want to just use the stepwise model, though, because it keeps individual factor levels, and if I’m going to keep one level, I want the entire factor variable. So I’ll fit the model one more time, dropping only the variables where glmStepAIC dropped the entire variable.
First I’ll fit it with the training_80 set, just so I can compare. Then I’ll do a final fit to the entire training set to predict on testing.
mdl_vars_final <- subset(mdl_vars, !mdl_vars %in% c("Parch", "FarePerPass", "Employee"))
rcpe_final <- recipe(Survived ~ ., data = training_80[, mdl_vars_final]) %>%
update_role(PassengerId, new_role = "id variable") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_interact(terms = ~
starts_with("Sex_"):starts_with("Pclass_") +
starts_with("Sex_"):starts_with("Embarked_") +
TicketN:starts_with("TicketNCohort_") +
Age:starts_with("AgeCohort_"):starts_with("Sex_"))
prep(rcpe_final, training = training_80)
## Data Recipe
##
## Inputs:
##
## role #variables
## id variable 1
## outcome 1
## predictor 10
##
## Training data contained 714 data points and no missing data.
##
## Operations:
##
## Dummy variables from Pclass, Sex, Embarked, Deck, AgeCohort, TicketNCohort [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]
The logistic model run with the final set of predictors had a holdout set accuracy of 0.8305, sensitivity of 0.6029, specificity of 0.9725, and AUC of 0.8774.
Here is the fit summary.
set.seed(1970)
mdl_final_train <- train(
rcpe_final,
training_80[, mdl_vars_final],
method = "glm",
family = "binomial",
trControl = train_control,
metric = "Accuracy"
)
summary(mdl_final_train)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9913 -0.5318 -0.3617 0.3108 2.5461
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.60328 0.89802 2.899 0.003745 **
## Age -0.03105 0.01261 -2.463 0.013777 *
## SibSp -0.48844 0.16559 -2.950 0.003181 **
## TicketN 0.22812 0.16320 1.398 0.162184
## NetSurv 1.15699 0.21512 5.378 7.52e-08 ***
## Pclass_X2 -1.58330 0.49573 -3.194 0.001404 **
## Pclass_X3 -1.45530 0.40810 -3.566 0.000362 ***
## Sex_female 3.07658 1.10426 2.786 0.005335 **
## Embarked_Q -0.78420 0.74058 -1.059 0.289647
## Embarked_S -0.39803 0.34057 -1.169 0.242517
## Deck_B -0.02944 0.50447 -0.058 0.953461
## Deck_C -0.53058 0.46316 -1.146 0.251973
## Deck_D 0.20294 0.48237 0.421 0.673971
## Deck_E -0.22790 0.46427 -0.491 0.623517
## Deck_F -0.08433 0.43721 -0.193 0.847045
## AgeCohort_gt10 -2.12640 0.59065 -3.600 0.000318 ***
## TicketNCohort_gt4 -5.07595 2.01230 -2.522 0.011654 *
## Sex_female_x_Pclass_X2 0.58767 1.01673 0.578 0.563266
## Sex_female_x_Pclass_X3 -2.23216 0.85970 -2.596 0.009420 **
## Sex_female_x_Embarked_Q 1.43878 1.01147 1.422 0.154891
## Sex_female_x_Embarked_S -0.33703 0.66776 -0.505 0.613754
## TicketN_x_TicketNCohort_gt4 0.52484 0.32083 1.636 0.101861
## Sex_female_x_Age_x_AgeCohort_gt10 0.05230 0.01970 2.654 0.007954 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 514.10 on 691 degrees of freedom
## AIC: 560.1
##
## Number of Fisher Scoring iterations: 6
NetSurv continues to be the most important predictor.
## glm variable importance
##
## only 20 most important variables shown (out of 22)
##
## Overall
## NetSurv 100.000
## AgeCohort_gt10 66.575
## Pclass_X3 65.936
## Pclass_X2 58.940
## SibSp 54.349
## Sex_female 51.274
## Sex_female_x_Age_x_AgeCohort_gt10 48.791
## Sex_female_x_Pclass_X3 47.709
## TicketNCohort_gt4 46.319
## Age 45.201
## TicketN_x_TicketNCohort_gt4 29.654
## Sex_female_x_Embarked_Q 25.642
## TicketN 25.177
## Embarked_S 20.872
## Deck_C 20.437
## Embarked_Q 18.807
## Sex_female_x_Pclass_X2 9.768
## Sex_female_x_Embarked_S 8.390
## Deck_E 8.130
## Deck_D 6.811
The model performance on the 10-fold resampling was:
## Generalized Linear Model
##
## 714 samples
## 11 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8388889 0.645208
The accuracy is 0.8389 - up from 0.8347 in the full model.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 57.0 11.5
## Yes 4.6 26.9
##
## Accuracy (average) : 0.8389
Here is the model performance on the holdout data set.
predicted_classes <- predict(mdl_final_train, newdata = training_20)
predicted_probs <- predict(mdl_final_train, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 106 27
## Yes 3 41
##
## Accuracy : 0.8305
## 95% CI : (0.767, 0.8826)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 4.325e-10
##
## Kappa : 0.6163
##
## Mcnemar's Test P-Value : 2.679e-05
##
## Sensitivity : 0.6029
## Specificity : 0.9725
## Pos Pred Value : 0.9318
## Neg Pred Value : 0.7970
## Prevalence : 0.3842
## Detection Rate : 0.2316
## Detection Prevalence : 0.2486
## Balanced Accuracy : 0.7877
##
## 'Positive' Class : Yes
##
The sensitivity is 0.6029 and the specificity is 0.9725, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8305. The AUC on the holdout set is 0.8774.
mdl_final_train_preds <- predict(mdl_final_train, newdata = training_20, type = "prob")
(mdl_final_train_eval <- evalmod(
scores = mdl_final_train_preds$Yes,
labels = training_20$Survived
))
##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 m1 1 ROC 0.8774285
## 2 m1 1 PRC 0.8565620
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 m1 1 109 68
options(yardstick.event_first = FALSE) # set the second level as success
data.frame(
pred = mdl_final_train_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::roc_curve(obs, pred) %>%
autoplot() +
labs(
title = "Final Model (Training) ROC Curve, Test Data",
subtitle = "AUC = 0.8774"
)
68 of the 177 passengers in the holdout set survived.
The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).
It encountered all 68 survivors on the 156th observation (88%).
data.frame(
pred = mdl_final_train_preds$Yes,
obs = training_20$Survived
) %>%
yardstick::gain_curve(obs, pred) %>%
autoplot() +
labs(
title = "Final Model (Training) Gain Curve on Holdout Set"
)
Compare the models with evalmod().
scores_list <- join_scores(
predict(mdl_full, newdata = training_20, type = "prob")$Yes,
predict(mdl_step, newdata = training_20, type = "prob")$Yes,
predict(mdl_final_train, newdata = training_20, type = "prob")$Yes
)
labels_list <- join_labels(
training_20$Survived,
training_20$Survived,
training_20$Survived
)
pe <- evalmod(
scores = scores_list,
labels = labels_list,
modnames = c("Full", "glmStepAIC", "Final (Training)"),
posclass = "Yes")
autoplot(pe, "ROC")
##
## === AUCs ===
##
## Model name Dataset ID Curve type AUC
## 1 Full 1 ROC 0.8820831
## 2 Full 1 PRC 0.8620615
## 3 glmStepAIC 1 ROC 0.8847814
## 4 glmStepAIC 1 PRC 0.8650199
## 5 Final (Training) 1 ROC 0.8774285
## 6 Final (Training) 1 PRC 0.8565620
##
##
## === Input data ===
##
## Model name Dataset ID # of negatives # of positives
## 1 Full 1 109 68
## 2 glmStepAIC 1 109 68
## 3 Final (Training) 1 109 68
The highest AUC was with the stepwise selection model, 0.8848.
resamps <- resamples(list('Full' = mdl_full,
'Reduced' = mdl_step,
'Final' = mdl_final_train))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: Full, Reduced, Final
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Full 0.8028169 0.8090278 0.8309859 0.8346831 0.8431631 0.9027778 0
## Reduced 0.8028169 0.8035016 0.8181729 0.8276995 0.8561718 0.8611111 0
## Final 0.8028169 0.8125000 0.8450704 0.8388889 0.8466843 0.9027778 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Full 0.5494107 0.5726738 0.6280895 0.6366391 0.6578025 0.8018868 0
## Reduced 0.5494107 0.5628848 0.6025258 0.6213608 0.6884466 0.7115385 0
## Final 0.5494107 0.5796978 0.6586154 0.6452080 0.6677171 0.8018868 0
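Because the three models were trained with the same seed and therefore the same folds, the resamples are paired, and caret can test the pairwise differences directly:
# Paired comparisons of the fold-level Accuracy and Kappa values;
# summary() reports mean differences and p-values for each model pair.
summary(diff(resamps))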
I’ll do a final fit to the entire training set to predict on testing.
Here is the fit summary.
set.seed(1970)
mdl_final <- train(
rcpe_final,
training[, mdl_vars_final],
method = "glm",
family = "binomial",
trControl = train_control,
metric = "Accuracy"
)
summary(mdl_final)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0262 -0.5630 -0.3375 0.3383 2.6004
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.34463 0.81257 4.116 3.85e-05 ***
## Age -0.04185 0.01161 -3.605 0.000312 ***
## SibSp -0.32984 0.14810 -2.227 0.025933 *
## TicketN 0.10369 0.15020 0.690 0.490002
## NetSurv 1.23194 0.19687 6.258 3.91e-10 ***
## Pclass_X2 -1.95622 0.42335 -4.621 3.82e-06 ***
## Pclass_X3 -1.87988 0.35725 -5.262 1.42e-07 ***
## Sex_female 3.06475 1.03120 2.972 0.002958 **
## Embarked_Q -0.92885 0.70751 -1.313 0.189236
## Embarked_S -0.23839 0.30029 -0.794 0.427279
## Deck_B -0.49324 0.45002 -1.096 0.273065
## Deck_C -0.77666 0.40464 -1.919 0.054938 .
## Deck_D -0.06908 0.42182 -0.164 0.869915
## Deck_E -0.03050 0.38973 -0.078 0.937631
## Deck_F -0.16565 0.37656 -0.440 0.660012
## AgeCohort_gt10 -2.00553 0.54306 -3.693 0.000222 ***
## TicketNCohort_gt4 -5.23109 1.91020 -2.739 0.006172 **
## Sex_female_x_Pclass_X2 0.50806 0.92254 0.551 0.581824
## Sex_female_x_Pclass_X3 -2.20667 0.81674 -2.702 0.006897 **
## Sex_female_x_Embarked_Q 1.56985 0.95106 1.651 0.098814 .
## Sex_female_x_Embarked_S -0.40996 0.59294 -0.691 0.489317
## TicketN_x_TicketNCohort_gt4 0.56246 0.30306 1.856 0.063467 .
## Sex_female_x_Age_x_AgeCohort_gt10 0.05546 0.01815 3.056 0.002245 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 649.47 on 868 degrees of freedom
## AIC: 695.47
##
## Number of Fisher Scoring iterations: 6
NetSurv continues to be the most important predictor.
## glm variable importance
##
## only 20 most important variables shown (out of 22)
##
## Overall
## NetSurv 100.000
## Pclass_X3 83.890
## Pclass_X2 73.512
## AgeCohort_gt10 58.498
## Age 57.074
## Sex_female_x_Age_x_AgeCohort_gt10 48.185
## Sex_female 46.830
## TicketNCohort_gt4 43.051
## Sex_female_x_Pclass_X3 42.457
## SibSp 34.777
## Deck_C 29.795
## TicketN_x_TicketNCohort_gt4 28.768
## Sex_female_x_Embarked_Q 25.446
## Embarked_Q 19.979
## Deck_B 16.471
## Embarked_S 11.581
## Sex_female_x_Embarked_S 9.923
## TicketN 9.905
## Sex_female_x_Pclass_X2 7.646
## Deck_F 5.853
The model performance on the 10-fold resampling was:
## Generalized Linear Model
##
## 891 samples
## 11 predictor
## 2 classes: 'No', 'Yes'
##
## Recipe steps: dummy, interact
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 802, 801, 801, 802, 802, 802, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8248127 0.6191291
The accuracy from the confusion matrix below is 0.8249.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 55.7 11.6
## Yes 5.9 26.8
##
## Accuracy (average) : 0.8249
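The only step left is the submission itself. A sketch, assuming testing holds the Kaggle test rows and the submission file needs PassengerId plus a 0/1 Survived column (this is the fit that scored 0.78947, per the summary at the top):
# Hypothetical submission step - column names follow Kaggle's Titanic spec.
submission <- data.frame(
  PassengerId = testing$PassengerId,
  Survived = ifelse(predict(mdl_final, newdata = testing) == "Yes", 1L, 0L)
)
write.csv(submission, "submission.csv", row.names = FALSE)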