The logistic regression model is simple and can be used for inference. The binary logistic regression model is

\[y = logit(\pi) = \ln \left( \frac{\pi}{1 - \pi} \right) = X \beta\]

where \(\pi\) is the event probability. The model predicts the log odds of the response variable. The model is fit with maximum likelihood estimation. There is no closed-form solution, so GLM estimates coefficients with interatively reweighted least squares.

The logistic model run with the final set of predictors had a holdout set accuracy of 0.8305, sensitivity of 0.6029, specificity of 0.9725, and AUC of 0.8821.

After fitting to the full training data set, the performance on the testing data set on kaggle was 0.78947 accuracy.

Setup

library(tidyverse)
library(caret)
library(recipes)
library(plotROC)
library(precrec)

The initial data management created the data set full, with training rows indexed by train_index. I added another predictor variables in the exploratory analysis and split the data into training and testing, then 80:20 split training into training_80 for training and training_20 to compare models.

load("./titanic_02.RData")

I’ll use 10-fold CV.

train_control <- trainControl(
  method = "cv", number = 10,
  savePredictions = "final",
  classProbs = TRUE
)

I’ll try two models: a “full” model with (almost) all predictors, and a parsimonious model using glmStepAIC.

Model

By “all predictors”, I mean all non-character predictors - I’m throwing out Surname, Name, Ticket, and Cabin. I’m also throwing out Fare since it is redundant to FarePerPass, and TicketN since I binned it into TktSize. That leaves me with PassengerID, Survived, and 13 predictors.

mdl_vars <- c("PassengerId", mdl_vars)
mdl_vars
##  [1] "PassengerId"   "Survived"      "Pclass"        "Sex"          
##  [5] "Age"           "SibSp"         "Parch"         "Embarked"     
##  [9] "TicketN"       "FarePerPass"   "Employee"      "Deck"         
## [13] "AgeCohort"     "TicketNCohort" "NetSurv"

I’ll use the recipe method to train. From the exploratory analysis section I’ve decided to create interactions Pclass*Sex, Embarked*Sex, TicketN:TicketNCohort, and Age*AgeCohort*Sex.

rcpe <- recipe(Survived ~ ., data = training_80[, mdl_vars]) %>%
  update_role(PassengerId, new_role = "id variable") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_interact(
    terms = ~
    starts_with("Sex_"):starts_with("Pclass_") +
    starts_with("Sex_"):starts_with("Embarked_") +
    TicketN:starts_with("TicketNCohort_") +
    Age:starts_with("AgeCohort_"):starts_with("Sex_")
  ) 

prep(rcpe, training = training_80)
## Data Recipe
## 
## Inputs:
## 
##         role #variables
##  id variable          1
##      outcome          1
##    predictor         13
## 
## Training data contained 714 data points and no missing data.
## 
## Operations:
## 
## Dummy variables from Pclass, Sex, Embarked, Employee, Deck, ... [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]

Full Model

The logistic model run with the full set of predictors had a holdout set accuracy of 0.8192, sensitivity of 0.6029, specificity of 0.9541, and AUC of 0.8821.

Here is the fit summary.

set.seed(1970)
mdl_full <- train(
  rcpe,
  training_80[, mdl_vars],
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "Accuracy"
)
summary(mdl_full)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9365  -0.5326  -0.3689   0.3093   2.5718  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        2.6230749  0.9989503   2.626 0.008644 ** 
## Age                               -0.0312234  0.0126808  -2.462 0.013806 *  
## SibSp                             -0.4904314  0.1665775  -2.944 0.003238 ** 
## Parch                             -0.1981775  0.2081470  -0.952 0.341045    
## TicketN                            0.3155560  0.1872779   1.685 0.091996 .  
## FarePerPass                       -0.0000135  0.0150963  -0.001 0.999287    
## NetSurv                            1.1151445  0.2175755   5.125 2.97e-07 ***
## Pclass_X2                         -1.6116251  0.5821576  -2.768 0.005634 ** 
## Pclass_X3                         -1.5097748  0.5449751  -2.770 0.005600 ** 
## Sex_female                         2.8959607  1.1132743   2.601 0.009287 ** 
## Embarked_Q                        -0.7941796  0.7512128  -1.057 0.290422    
## Embarked_S                        -0.3668720  0.3474722  -1.056 0.291045    
## Employee_X1                       -1.0949792  1.1265330  -0.972 0.331055    
## Deck_B                            -0.0100641  0.5094084  -0.020 0.984238    
## Deck_C                            -0.5491667  0.4670041  -1.176 0.239621    
## Deck_D                             0.2197890  0.4848076   0.453 0.650295    
## Deck_E                            -0.2271941  0.4678783  -0.486 0.627262    
## Deck_F                            -0.0774744  0.4394544  -0.176 0.860061    
## AgeCohort_gt10                    -2.2100287  0.6057894  -3.648 0.000264 ***
## TicketNCohort_gt4                 -4.5056335  2.1038403  -2.142 0.032224 *  
## Sex_female_x_Pclass_X2             0.6605795  1.0220719   0.646 0.518076    
## Sex_female_x_Pclass_X3            -2.0983916  0.8672194  -2.420 0.015534 *  
## Sex_female_x_Embarked_Q            1.4893787  1.0208529   1.459 0.144577    
## Sex_female_x_Embarked_S           -0.3158003  0.6711569  -0.471 0.637975    
## TicketN_x_TicketNCohort_gt4        0.4095766  0.3450534   1.187 0.235230    
## Sex_female_x_Age_x_AgeCohort_gt10  0.0547660  0.0198725   2.756 0.005854 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 512.03  on 688  degrees of freedom
## AIC: 564.03
## 
## Number of Fisher Scoring iterations: 6

Parch, FarePerPass, Embarked, and Employee fail the significance test. varImp() ranks the coefficients by the absolute value of the t-statistic. NetSurv is most important.

varImp(mdl_full)
## glm variable importance
## 
##   only 20 most important variables shown (out of 25)
## 
##                                   Overall
## NetSurv                           100.000
## AgeCohort_gt10                     71.174
## SibSp                              57.436
## Pclass_X3                          54.044
## Pclass_X2                          54.005
## Sex_female_x_Age_x_AgeCohort_gt10  53.762
## Sex_female                         50.745
## Age                                48.032
## Sex_female_x_Pclass_X3             47.201
## TicketNCohort_gt4                  41.775
## TicketN                            32.864
## Sex_female_x_Embarked_Q            28.453
## TicketN_x_TicketNCohort_gt4        23.146
## Deck_C                             22.930
## Embarked_Q                         20.613
## Embarked_S                         20.586
## Employee_X1                        18.950
## Parch                              18.562
## Sex_female_x_Pclass_X2             12.595
## Deck_E                              9.458

Resampling Performance

The model performance on the 10-fold resampling was:

mdl_full
## Generalized Linear Model 
## 
## 714 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8346831  0.6366391

The model accuracy is 0.8347. More detail below.

confusionMatrix(mdl_full)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  56.6 11.5
##        Yes  5.0 26.9
##                             
##  Accuracy (average) : 0.8347

Holdout Performance

Here is the model performance on the holdout data set.

predicted_classes <- predict(mdl_full, newdata = training_20) 
predicted_probs <- predict(mdl_full, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  104  27
##        Yes   5  41
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7545, 0.8729)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 3.784e-09       
##                                           
##                   Kappa : 0.5932          
##                                           
##  Mcnemar's Test P-Value : 0.0002054       
##                                           
##             Sensitivity : 0.6029          
##             Specificity : 0.9541          
##          Pos Pred Value : 0.8913          
##          Neg Pred Value : 0.7939          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2316          
##    Detection Prevalence : 0.2599          
##       Balanced Accuracy : 0.7785          
##                                           
##        'Positive' Class : Yes             
## 

The sensitivity is 0.6029 and the specificity is 0.9541, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8192. precrec::evalmod() will calculate the confusion matrix values from the model using the holdout data set. The AUC on the holdout set is 0.8821. pRoc::plot.roc() and plotROC::geom_roc() are options, but I like the way yardstick::roc_curve() looks.

mdl_full_preds <- predict(mdl_full, newdata = training_20, type = "prob")
(mdl_full_eval <- evalmod(
  scores = mdl_full_preds$Yes,
  labels = training_20$Survived
))
## 
##     === AUCs ===
## 
##      Model name Dataset ID Curve type       AUC
##    1         m1          1        ROC 0.8820831
##    2         m1          1        PRC 0.8620615
## 
## 
##     === Input data ===
## 
##      Model name Dataset ID # of negatives # of positives
##    1         m1          1            109             68
options(yardstick.event_first = FALSE)  # set the second level as success
data.frame(
  pred = mdl_full_preds$Yes, 
  obs = training_20$Survived
) %>%
  yardstick::roc_curve(obs, pred) %>%
  autoplot() +
  labs(
    title = "Full Model ROC Curve, Test Data",
    subtitle = "AUC = 0.8821"
  )

The gain curve plots the cumulative summed true outcome versus the fraction of items seen when sorted by the predicted value. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicted the response variable well. The gray area is the “gain” over a random prediction.

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).

  • It encountered all 68 survivors on the 157th observation (89%).

  • The bottom of the gray area is the outcome of a random model. Only half the survivors would be observed within 50% of the observations. The top of the gray area is the outcome of the perfect model, the “wizard curve”. Half the survivors would be observed in 68/177=38% of the observations.

data.frame(
  pred = mdl_full_preds$Yes, 
  obs = training_20$Survived
) %>%
  yardstick::gain_curve(obs, pred) %>%
  autoplot() +
  labs(
    title = "Full Model Gain Curve on Holdout Set"
  )

glmStepAIC

Let’s try this again with stepwise regression to see how a more parsimonious model might perform.

set.seed(1970)
mdl_step <- train(
  rcpe,
  training_80[, mdl_vars],
  method = "glmStepAIC",
  family = "binomial",
  trace = FALSE,  # suppress the iteration summaries
  trControl = train_control,
  metric = "Accuracy"
)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
summary(mdl_step)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1163  -0.5358  -0.3811   0.3408   2.5254  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        2.17080    0.81318   2.670 0.007596 ** 
## Age                               -0.03195    0.01234  -2.590 0.009609 ** 
## SibSp                             -0.49676    0.16592  -2.994 0.002754 ** 
## TicketN                            0.27315    0.15512   1.761 0.078267 .  
## NetSurv                            1.16906    0.21195   5.516 3.48e-08 ***
## Pclass_X2                         -1.57434    0.42726  -3.685 0.000229 ***
## Pclass_X3                         -1.57409    0.37197  -4.232 2.32e-05 ***
## Sex_female                         3.16761    0.78294   4.046 5.22e-05 ***
## Deck_C                            -0.52073    0.30800  -1.691 0.090897 .  
## AgeCohort_gt10                    -2.03549    0.58690  -3.468 0.000524 ***
## TicketNCohort_gt4                 -5.23705    1.99014  -2.631 0.008501 ** 
## Sex_female_x_Pclass_X3            -2.51905    0.61582  -4.091 4.30e-05 ***
## Sex_female_x_Embarked_Q            1.28978    0.51091   2.524 0.011586 *  
## TicketN_x_TicketNCohort_gt4        0.50332    0.31469   1.599 0.109729    
## Sex_female_x_Age_x_AgeCohort_gt10  0.05130    0.01953   2.626 0.008628 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 518.76  on 699  degrees of freedom
## AIC: 548.76
## 
## Number of Fisher Scoring iterations: 6

glmStepAIC dropped Parch, FarePerPass, Embarked, Employee, most of the Deck dummies (although it kept DeckC), and the sex interaction with one level of Pclass and one level of Embarked.

Resampling Performance

The model performance on the 10-fold resampling was:

mdl_step
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 714 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8276995  0.6213608

The accuracy is 0.8277 - down from 0.8347 in the full model.

confusionMatrix(mdl_step)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  56.4 12.0
##        Yes  5.2 26.3
##                             
##  Accuracy (average) : 0.8277

Holdout Performance

Here is the model performance on the holdout data set.

predicted_classes <- predict(mdl_step, newdata = training_20) 
predicted_probs <- predict(mdl_step, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  105  27
##        Yes   4  41
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6047          
##                                           
##  Mcnemar's Test P-Value : 7.772e-05       
##                                           
##             Sensitivity : 0.6029          
##             Specificity : 0.9633          
##          Pos Pred Value : 0.9111          
##          Neg Pred Value : 0.7955          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2316          
##    Detection Prevalence : 0.2542          
##       Balanced Accuracy : 0.7831          
##                                           
##        'Positive' Class : Yes             
## 

The sensitivity is 0.6029 and the specificity is 0.9633, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8249 - higher than the 0.8192 in the full model! precrec::evalmod() will calculate the confusion matrix values from the model using the holdout data set. The AUC on the holdout set is 0.8848.

mdl_step_preds <- predict(mdl_step, newdata = training_20, type = "prob")
(mdl_step_eval <- evalmod(
  scores = mdl_step_preds$Yes,
  labels = training_20$Survived
))
## 
##     === AUCs ===
## 
##      Model name Dataset ID Curve type       AUC
##    1         m1          1        ROC 0.8847814
##    2         m1          1        PRC 0.8650199
## 
## 
##     === Input data ===
## 
##      Model name Dataset ID # of negatives # of positives
##    1         m1          1            109             68
options(yardstick.event_first = FALSE)  # set the second level as success
data.frame(
  pred = mdl_step_preds$Yes, 
  obs = training_20$Survived
) %>%
  yardstick::roc_curve(obs, pred) %>%
  autoplot() +
  labs(
    title = "glmStepAIC Model ROC Curve, Test Data",
    subtitle = "AUC = 0.8848"
  )

Here is the gain curve again.

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 36 observations (20%).

  • It encountered all 68 survivors on the 150th observation (85%).

data.frame(
  pred = mdl_step_preds$Yes, 
  obs = training_20$Survived
) %>%
  yardstick::gain_curve(obs, pred) %>%
  autoplot() +
  labs(
    title = "glmStepAIC Model Gain Curve on Holdout Set"
  )

Final Model

The stepwise regression actually performed better than the full model (surprising - can multicollinearity reduce predictive accuracy on the holdout?). I don’t want to just use the stepwise model though, because it keeps selected factor variable levels, and if I’m going to keep one level, I want the entire factor variable. So I’ll fit this one more time, just dropping the variables where glmStepAIC dropped the entire variable.

First I’ll fit it with the training_80 set, just so I can compare. Then I’ll do a final fit to the entire training set to predict on testing.

mdl_vars_final <- subset(mdl_vars, !mdl_vars %in% c("Parch", "FarePerPass", "Employee"))
rcpe_final <- recipe(Survived ~ ., data = training_80[, mdl_vars_final]) %>%
  update_role(PassengerId, new_role = "id variable") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_interact(terms = ~
                  starts_with("Sex_"):starts_with("Pclass_") +
                  starts_with("Sex_"):starts_with("Embarked_") +
                  TicketN:starts_with("TicketNCohort_") +
                  Age:starts_with("AgeCohort_"):starts_with("Sex_")) 

prep(rcpe_final, training = training_80)
## Data Recipe
## 
## Inputs:
## 
##         role #variables
##  id variable          1
##      outcome          1
##    predictor         10
## 
## Training data contained 714 data points and no missing data.
## 
## Operations:
## 
## Dummy variables from Pclass, Sex, Embarked, Deck, AgeCohort, TicketNCohort [trained]
## Interactions with Sex_female:(Pclass_X2 + Pclass_X3) + Sex_female:(Embarked_Q + Embarked_S) + TicketN:TicketNCohort_gt4 + Age:AgeCohort_gt10:Sex_female [trained]

The logistic model run with the final set of predictors had a holdout set accuracy of 0.8305, sensitivity of 0.6029, specificity of 0.9725, and AUC of 0.8821.

Here is the fit summary.

set.seed(1970)
mdl_final_train <- train(
  rcpe_final,
  training_80[, mdl_vars_final],
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "Accuracy"
)
summary(mdl_final_train)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9913  -0.5318  -0.3617   0.3108   2.5461  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        2.60328    0.89802   2.899 0.003745 ** 
## Age                               -0.03105    0.01261  -2.463 0.013777 *  
## SibSp                             -0.48844    0.16559  -2.950 0.003181 ** 
## TicketN                            0.22812    0.16320   1.398 0.162184    
## NetSurv                            1.15699    0.21512   5.378 7.52e-08 ***
## Pclass_X2                         -1.58330    0.49573  -3.194 0.001404 ** 
## Pclass_X3                         -1.45530    0.40810  -3.566 0.000362 ***
## Sex_female                         3.07658    1.10426   2.786 0.005335 ** 
## Embarked_Q                        -0.78420    0.74058  -1.059 0.289647    
## Embarked_S                        -0.39803    0.34057  -1.169 0.242517    
## Deck_B                            -0.02944    0.50447  -0.058 0.953461    
## Deck_C                            -0.53058    0.46316  -1.146 0.251973    
## Deck_D                             0.20294    0.48237   0.421 0.673971    
## Deck_E                            -0.22790    0.46427  -0.491 0.623517    
## Deck_F                            -0.08433    0.43721  -0.193 0.847045    
## AgeCohort_gt10                    -2.12640    0.59065  -3.600 0.000318 ***
## TicketNCohort_gt4                 -5.07595    2.01230  -2.522 0.011654 *  
## Sex_female_x_Pclass_X2             0.58767    1.01673   0.578 0.563266    
## Sex_female_x_Pclass_X3            -2.23216    0.85970  -2.596 0.009420 ** 
## Sex_female_x_Embarked_Q            1.43878    1.01147   1.422 0.154891    
## Sex_female_x_Embarked_S           -0.33703    0.66776  -0.505 0.613754    
## TicketN_x_TicketNCohort_gt4        0.52484    0.32083   1.636 0.101861    
## Sex_female_x_Age_x_AgeCohort_gt10  0.05230    0.01970   2.654 0.007954 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 514.10  on 691  degrees of freedom
## AIC: 560.1
## 
## Number of Fisher Scoring iterations: 6

NetSurv continues to be the most important predictor.

varImp(mdl_final_train)
## glm variable importance
## 
##   only 20 most important variables shown (out of 22)
## 
##                                   Overall
## NetSurv                           100.000
## AgeCohort_gt10                     66.575
## Pclass_X3                          65.936
## Pclass_X2                          58.940
## SibSp                              54.349
## Sex_female                         51.274
## Sex_female_x_Age_x_AgeCohort_gt10  48.791
## Sex_female_x_Pclass_X3             47.709
## TicketNCohort_gt4                  46.319
## Age                                45.201
## TicketN_x_TicketNCohort_gt4        29.654
## Sex_female_x_Embarked_Q            25.642
## TicketN                            25.177
## Embarked_S                         20.872
## Deck_C                             20.437
## Embarked_Q                         18.807
## Sex_female_x_Pclass_X2              9.768
## Sex_female_x_Embarked_S             8.390
## Deck_E                              8.130
## Deck_D                              6.811

Resampling Performance

The model performance on the 10-fold resampling was:

mdl_final_train
## Generalized Linear Model 
## 
## 714 samples
##  11 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 643, 643, 643, 643, 642, 643, ... 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.8388889  0.645208

The accuracy is 0.8389.

confusionMatrix(mdl_final_train)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  57.0 11.5
##        Yes  4.6 26.9
##                             
##  Accuracy (average) : 0.8389

Holdout Performance

Here is the model performance on the holdout data set.

predicted_classes <- predict(mdl_final_train, newdata = training_20) 
predicted_probs <- predict(mdl_final_train, newdata = training_20, type = "prob")
confusionMatrix(predicted_classes, training_20$Survived, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  106  27
##        Yes   3  41
##                                          
##                Accuracy : 0.8305         
##                  95% CI : (0.767, 0.8826)
##     No Information Rate : 0.6158         
##     P-Value [Acc > NIR] : 4.325e-10      
##                                          
##                   Kappa : 0.6163         
##                                          
##  Mcnemar's Test P-Value : 2.679e-05      
##                                          
##             Sensitivity : 0.6029         
##             Specificity : 0.9725         
##          Pos Pred Value : 0.9318         
##          Neg Pred Value : 0.7970         
##              Prevalence : 0.3842         
##          Detection Rate : 0.2316         
##    Detection Prevalence : 0.2486         
##       Balanced Accuracy : 0.7877         
##                                          
##        'Positive' Class : Yes            
## 

The sensitivity is 0.6029 and the specificity is 0.9725, so the model is prone to under-predicting survivors. The accuracy from the confusion matrix is 0.8305. The AUC on the holdout set is 0.8774.

mdl_final_train_preds <- predict(mdl_final_train, newdata = training_20, type = "prob")
(mdl_final_train_eval <- evalmod(
  scores = mdl_final_train_preds$Yes,
  labels = training_20$Survived
))
## 
##     === AUCs ===
## 
##      Model name Dataset ID Curve type       AUC
##    1         m1          1        ROC 0.8774285
##    2         m1          1        PRC 0.8565620
## 
## 
##     === Input data ===
## 
##      Model name Dataset ID # of negatives # of positives
##    1         m1          1            109             68
options(yardstick.event_first = FALSE)  # set the second level as success
data.frame(
  pred = mdl_final_train_preds$Yes, 
  obs = training_20$Survived
) %>%
  yardstick::roc_curve(obs, pred) %>%
  autoplot() +
  labs(
    title = "Final Model (Training) ROC Curve, Test Data",
    subtitle = "AUC = 0.8774"
  )

68 of the 177 passengers in the holdout set survived.

  • The gain curve encountered 34 survivors (50%) within the first 37 observations (21%).

  • It encountered all 68 survivors on the 156th observation (88%).

data.frame(
  pred = mdl_final_train_preds$Yes, 
  obs = training_20$Survived
) %>%
  yardstick::gain_curve(obs, pred) %>%
  autoplot() +
  labs(
    title = "Final Model (Training) Gain Curve on Holdout Set"
  )

Conclusions

Compare the models with evalmod().

scores_list <- join_scores(
  predict(mdl_full, newdata = training_20, type = "prob")$Yes,
  predict(mdl_step, newdata = training_20, type = "prob")$Yes,
  predict(mdl_final_train, newdata = training_20, type = "prob")$Yes
)
labels_list <- join_labels(
  training_20$Survived,
  training_20$Survived,
  training_20$Survived
)

pe <- evalmod(
  scores = scores_list, 
  labels = labels_list,
  modnames = c("Full", "glmStepAIC", "Final (Training)"),
  posclass = "Yes")

autoplot(pe, "ROC")

pe
## 
##     === AUCs ===
## 
##            Model name Dataset ID Curve type       AUC
##    1             Full          1        ROC 0.8820831
##    2             Full          1        PRC 0.8620615
##    3       glmStepAIC          1        ROC 0.8847814
##    4       glmStepAIC          1        PRC 0.8650199
##    5 Final (Training)          1        ROC 0.8774285
##    6 Final (Training)          1        PRC 0.8565620
## 
## 
##     === Input data ===
## 
##            Model name Dataset ID # of negatives # of positives
##    1             Full          1            109             68
##    2       glmStepAIC          1            109             68
##    3 Final (Training)          1            109             68

The highest AUC was with the step-wise selection model, 0.8847.

resamps <- resamples(list('Full' = mdl_full, 
                          'Reduced' = mdl_step,
                          'Final' = mdl_final_train))
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: Full, Reduced, Final 
## Number of resamples: 10 
## 
## Accuracy 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Full    0.8028169 0.8090278 0.8309859 0.8346831 0.8431631 0.9027778    0
## Reduced 0.8028169 0.8035016 0.8181729 0.8276995 0.8561718 0.8611111    0
## Final   0.8028169 0.8125000 0.8450704 0.8388889 0.8466843 0.9027778    0
## 
## Kappa 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Full    0.5494107 0.5726738 0.6280895 0.6366391 0.6578025 0.8018868    0
## Reduced 0.5494107 0.5628848 0.6025258 0.6213608 0.6884466 0.7115385    0
## Final   0.5494107 0.5796978 0.6586154 0.6452080 0.6677171 0.8018868    0
bwplot(resamps, layout = c(3, 1))

Refit Final Model

I’ll do a final fit to the entire training set to predict on testing.

Here is the fit summary.

set.seed(1970)
mdl_final <- train(
  rcpe_final,
  training[, mdl_vars_final],
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "Accuracy"
)
summary(mdl_final)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0262  -0.5630  -0.3375   0.3383   2.6004  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        3.34463    0.81257   4.116 3.85e-05 ***
## Age                               -0.04185    0.01161  -3.605 0.000312 ***
## SibSp                             -0.32984    0.14810  -2.227 0.025933 *  
## TicketN                            0.10369    0.15020   0.690 0.490002    
## NetSurv                            1.23194    0.19687   6.258 3.91e-10 ***
## Pclass_X2                         -1.95622    0.42335  -4.621 3.82e-06 ***
## Pclass_X3                         -1.87988    0.35725  -5.262 1.42e-07 ***
## Sex_female                         3.06475    1.03120   2.972 0.002958 ** 
## Embarked_Q                        -0.92885    0.70751  -1.313 0.189236    
## Embarked_S                        -0.23839    0.30029  -0.794 0.427279    
## Deck_B                            -0.49324    0.45002  -1.096 0.273065    
## Deck_C                            -0.77666    0.40464  -1.919 0.054938 .  
## Deck_D                            -0.06908    0.42182  -0.164 0.869915    
## Deck_E                            -0.03050    0.38973  -0.078 0.937631    
## Deck_F                            -0.16565    0.37656  -0.440 0.660012    
## AgeCohort_gt10                    -2.00553    0.54306  -3.693 0.000222 ***
## TicketNCohort_gt4                 -5.23109    1.91020  -2.739 0.006172 ** 
## Sex_female_x_Pclass_X2             0.50806    0.92254   0.551 0.581824    
## Sex_female_x_Pclass_X3            -2.20667    0.81674  -2.702 0.006897 ** 
## Sex_female_x_Embarked_Q            1.56985    0.95106   1.651 0.098814 .  
## Sex_female_x_Embarked_S           -0.40996    0.59294  -0.691 0.489317    
## TicketN_x_TicketNCohort_gt4        0.56246    0.30306   1.856 0.063467 .  
## Sex_female_x_Age_x_AgeCohort_gt10  0.05546    0.01815   3.056 0.002245 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  649.47  on 868  degrees of freedom
## AIC: 695.47
## 
## Number of Fisher Scoring iterations: 6

NetSurv continues to be the most important predictor.

varImp(mdl_final)
## glm variable importance
## 
##   only 20 most important variables shown (out of 22)
## 
##                                   Overall
## NetSurv                           100.000
## Pclass_X3                          83.890
## Pclass_X2                          73.512
## AgeCohort_gt10                     58.498
## Age                                57.074
## Sex_female_x_Age_x_AgeCohort_gt10  48.185
## Sex_female                         46.830
## TicketNCohort_gt4                  43.051
## Sex_female_x_Pclass_X3             42.457
## SibSp                              34.777
## Deck_C                             29.795
## TicketN_x_TicketNCohort_gt4        28.768
## Sex_female_x_Embarked_Q            25.446
## Embarked_Q                         19.979
## Deck_B                             16.471
## Embarked_S                         11.581
## Sex_female_x_Embarked_S             9.923
## TicketN                             9.905
## Sex_female_x_Pclass_X2              7.646
## Deck_F                              5.853

Resampling Performance

The model performance on the 10-fold resampling was:

mdl_final
## Generalized Linear Model 
## 
## 891 samples
##  11 predictor
##   2 classes: 'No', 'Yes' 
## 
## Recipe steps: dummy, interact 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 801, 801, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8248127  0.6191291

The accuracy from the confusion matrix below is 0.8249.

confusionMatrix(mdl_final)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  55.7 11.6
##        Yes  5.9 26.8
##                             
##  Accuracy (average) : 0.8249

Create Submission File

preds <- predict(mdl_final, newdata = testing) %>% {ifelse(. == "Yes", 1, 0)}
sub_file <- data.frame(PassengerId = testing$PassengerId, Survived = preds)
write.csv(sub_file, file = "./titanic_03_glm.csv", row.names = FALSE)