Binary Logistic Regression: Titanic

Mihir

4/4/2022

1 Abstract

This study demonstrates the development of a binary logistic regression model to describe patterns of survival in passengers on the Titanic, based on passenger age, sex, ticket class, and the number of family members accompanying each passenger.

2 Loading Data and introduction

# loading required libraries
library(rms)
library(tinytex)

# get dataset from web site
getHdata(titanic3) 

# rows x columns
dim(titanic3)
## [1] 1309   14

Let us look at the data dictionary for Titanic3 data.

Column Description
pclass passanger class: levels 1st, 2nd, and 3rd
survived Survival (0 = No; 1 = Yes)
name name of the passenger
sex gender of the passenger: levels female and male
age age of the passenger in years
sibsp Number of Siblings/Spouses Aboard for a passenger
parch Number of Parents/Children Aboard for a passenger
ticket Ticket Number
fare Passenger Fare
cabin allotted cabin
embarked a factor with levels Cherbourg, Queenstown, and Southampton
boat Lifeboat
body Body Identification Number
home.dest Home/Destination

3 Descriptive statistics

# List of names of variables to analyze
v <- c('pclass', 'survived' , 'age' , 'sex' , 'sibsp' , 'parch')
t3 <- titanic3[, v]
units(t3$age) <- 'years'

# univariate summaries
describe(t3)
## t3 
## 
##  6  Variables      1309  Observations
## --------------------------------------------------------------------------------
## pclass 
##        n  missing distinct 
##     1309        0        3 
##                             
## Value        1st   2nd   3rd
## Frequency    323   277   709
## Proportion 0.247 0.212 0.542
## --------------------------------------------------------------------------------
## survived : Survived 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1309        0        2    0.708      500    0.382   0.4725 
## 
## --------------------------------------------------------------------------------
## age : Age [years] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1046      263       98    0.999    29.88    16.06        5       14 
##      .25      .50      .75      .90      .95 
##       21       28       39       50       57 
## 
## lowest :  0.1667  0.3333  0.4167  0.6667  0.7500
## highest: 70.5000 71.0000 74.0000 76.0000 80.0000
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##     1309        0        2 
##                         
## Value      female   male
## Frequency     466    843
## Proportion  0.356  0.644
## --------------------------------------------------------------------------------
## sibsp : Number of Siblings/Spouses Aboard 
##        n  missing distinct     Info     Mean      Gmd 
##     1309        0        7     0.67   0.4989    0.777 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 8
##                                                     
## Value          0     1     2     3     4     5     8
## Frequency    891   319    42    20    22     6     9
## Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007
## --------------------------------------------------------------------------------
## parch : Number of Parents/Children Aboard 
##        n  missing distinct     Info     Mean      Gmd 
##     1309        0        8    0.549    0.385   0.6375 
## 
## lowest : 0 1 2 3 4, highest: 3 4 5 6 9
##                                                           
## Value          0     1     2     3     4     5     6     9
## Frequency   1002   170   113     8     6     6     2     2
## Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002
## --------------------------------------------------------------------------------

Next, we obtain access to the needed variables and observations, and save data distribution characteristics for plotting and for computing predictor effects.

There are not many passengers having more than 3 siblings or spouses or more than 3 children, so we truncate two variables at 3 for the purpose of estimating stratified survival probabilities.

# data distribution
dd <- datadist(t3)
# describe distributions of variables to rms
options(datadist = 'dd')
s <- summary(survived ~ age + sex + pclass + cut2(sibsp ,0:3) + cut2 (parch ,0:3), data = t3)
plot(s, main = '', subtitles = FALSE)
Univariable summaries of Titanic survival

Figure 3.1: Univariable summaries of Titanic survival

OBSERVATIONS:

  1. Note the large number of missing ages.

  2. Also note the strong effects of sex and passenger class on the probability of surviving.

  3. The age effect does not appear to be very strong.

  4. The effects of the last two variables are unclear as the estimated proportions are not monotonic in the values of these descriptors.

Although some of the cell sizes are small, we can show four-way empirical relationships with the fraction of surviving passengers by creating four cells for sibsp × parch combinations and by creating two age groups.

tn <- transform(t3 ,
                 agec = ifelse(age < 21, 'child' , 'adult'),
                 sibsp = ifelse(sibsp == 0, 'no sib/sp' , 'sib/sp' ),
                 parch = ifelse(parch == 0, 'no par/child' , 'par/child'))

# suppress proportions based on fewer than 25 passengers in a cell
g <- function(y) if (length(y) < 25) NA else mean(y) 

#llist , summarize in Hmisc package
s <- with(tn , summarize(survived ,llist(agec , sex , pclass , sibsp , parch), g))

#plot
ggplot(subset(s, agec != 'NA' ), aes(x = survived , y = pclass , shape = sex)) +
  geom_point() + facet_grid(agec ~ sibsp * parch) +
  xlab('Proportion Surviving') + ylab('Passenger Class') +
  scale_x_continuous(breaks = c(0, .5 , 1))
Multi-way summary of Titanic survival

Figure 3.2: Multi-way summary of Titanic survival

Note that none of the effects of sibsp or parch for common passenger groups appear strong on an absolute risk scale.

5 Binary Logistic Model With Casewise Deletion of Missing Values

What follows is the standard analysis based on eliminating observations having any missing data.

We develop an initial somewhat saturated logistic model, allowing for a flexible nonlinear age effect that can differ in shape for all six sex ×classstrata.

The meaning of these variables does depend on the passenger’s age, so we consider only age interactions involving sibsp and parch.

f1 <- lrm(survived ~ sex*pclass * rcs(age ,5) + rcs(age ,5)*(sibsp + parch ), data = t3) 
anova(f1)
## Warning in anova.rms(f1): tests of nonlinear interaction with respect to single component 
## variables ignore 3-way interactions
##                 Wald Statistics          Response: survived 
## 
##  Factor                                            Chi-Square d.f. P     
##  sex  (Factor+Higher Order Factors)                187.15     15   <.0001
##   All Interactions                                  59.74     14   <.0001
##  pclass  (Factor+Higher Order Factors)             100.10     20   <.0001
##   All Interactions                                  46.51     18   0.0003
##  age  (Factor+Higher Order Factors)                 56.20     32   0.0052
##   All Interactions                                  34.57     28   0.1826
##   Nonlinear (Factor+Higher Order Factors)           28.66     24   0.2331
##  sibsp  (Factor+Higher Order Factors)               19.67      5   0.0014
##   All Interactions                                  12.13      4   0.0164
##  parch  (Factor+Higher Order Factors)                3.51      5   0.6217
##   All Interactions                                   3.51      4   0.4761
##  sex * pclass  (Factor+Higher Order Factors)        42.43     10   <.0001
##  sex * age  (Factor+Higher Order Factors)           15.89     12   0.1962
##   Nonlinear (Factor+Higher Order Factors)           14.47      9   0.1066
##   Nonlinear Interaction : f(A,B) vs. AB              4.17      3   0.2441
##  pclass * age  (Factor+Higher Order Factors)        13.47     16   0.6385
##   Nonlinear (Factor+Higher Order Factors)           12.92     12   0.3749
##   Nonlinear Interaction : f(A,B) vs. AB              6.88      6   0.3324
##  age * sibsp  (Factor+Higher Order Factors)         12.13      4   0.0164
##   Nonlinear                                          1.76      3   0.6235
##   Nonlinear Interaction : f(A,B) vs. AB              1.76      3   0.6235
##  age * parch  (Factor+Higher Order Factors)          3.51      4   0.4761
##   Nonlinear                                          1.80      3   0.6147
##   Nonlinear Interaction : f(A,B) vs. AB              1.80      3   0.6147
##  sex * pclass * age  (Factor+Higher Order Factors)   8.34      8   0.4006
##   Nonlinear                                          7.74      6   0.2581
##  TOTAL NONLINEAR                                    28.66     24   0.2331
##  TOTAL INTERACTION                                  75.61     30   <.0001
##  TOTAL NONLINEAR + INTERACTION                      79.49     33   <.0001
##  TOTAL                                             241.93     39   <.0001
  1. Three-way interactions are clearly insignificant (P = 0.4).

  2. So is parch (P = 0.6 for testing the combined main effect + interaction effects for parch, i.e., whether parch is important for any age).

The model not containing those terms is fitted below. The ^2 in the model formula means to expand the terms in parentheses to include all main effects and second-order interactions.

f <- lrm(survived ~ (sex + pclass + rcs(age ,5))^2 + rcs(age ,5) *sibsp , data=t3)
# immportant stats
f$stats
##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
## 1.046000e+03 5.613527e-06 5.538700e+02 2.600000e+01 0.000000e+00 8.780158e-01 
##          Dxy        Gamma        Tau-a           R2        Brier            g 
## 7.560317e-01 7.575105e-01 3.656289e-01 5.545145e-01 1.302728e-01 2.426971e+00 
##           gr           gp 
## 1.132453e+01 3.654654e-01
# Wald Statistics
anova(f)
##                 Wald Statistics          Response: survived 
## 
##  Factor                                      Chi-Square d.f. P     
##  sex  (Factor+Higher Order Factors)          199.42      7   <.0001
##   All Interactions                            56.14      6   <.0001
##  pclass  (Factor+Higher Order Factors)       108.73     12   <.0001
##   All Interactions                            42.83     10   <.0001
##  age  (Factor+Higher Order Factors)           47.04     20   0.0006
##   All Interactions                            24.51     16   0.0789
##   Nonlinear (Factor+Higher Order Factors)     22.72     15   0.0902
##  sibsp  (Factor+Higher Order Factors)         19.95      5   0.0013
##   All Interactions                            10.99      4   0.0267
##  sex * pclass  (Factor+Higher Order Factors)  35.40      2   <.0001
##  sex * age  (Factor+Higher Order Factors)     10.08      4   0.0391
##   Nonlinear                                    8.17      3   0.0426
##   Nonlinear Interaction : f(A,B) vs. AB        8.17      3   0.0426
##  pclass * age  (Factor+Higher Order Factors)   6.86      8   0.5516
##   Nonlinear                                    6.11      6   0.4113
##   Nonlinear Interaction : f(A,B) vs. AB        6.11      6   0.4113
##  age * sibsp  (Factor+Higher Order Factors)   10.99      4   0.0267
##   Nonlinear                                    1.81      3   0.6134
##   Nonlinear Interaction : f(A,B) vs. AB        1.81      3   0.6134
##  TOTAL NONLINEAR                              22.72     15   0.0902
##  TOTAL INTERACTION                            67.58     18   <.0001
##  TOTAL NONLINEAR + INTERACTION                70.68     21   <.0001
##  TOTAL                                       253.18     26   <.0001
  1. This is a very powerful model (ROC area = c = 0.88); the survival patterns are easy to detect.

  2. The Wald ANOVA in the table above indicates especially strong sex and pclass effects (χ2 = 199 and 109, respectively).

  3. There is a very strong sex x pclass interaction and a strong age x sibsp interaction, considering the strength of sibsp overall.

6 Predictor effects

Let us examine the shapes of predictor effects. With so many interactions in the model we need to obtain predicted values at least for all combinations of sex and pclass. For sibsp we consider only two of its possible values.

p <- Predict(f, age , sex , pclass , sibsp = 0, fun = plogis)
ggplot(p)
Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses

Figure 6.1: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses

Note the agreement between the lower right-hand panel of Figure 2.1 with Figure 3.1 (above). This results from our use of similar flexibility in the parametric and nonparametric approaches (and similar effective degrees of freedom).

The estimated effect of sibsp as a function of age is shown below:

ggplot(Predict(f, sibsp , age = c(10 ,15 ,20,50), conf.int = FALSE))

Note that children having many siblings apparently had lower survival. Married adults had slightly higher survival than unmarried ones.

7 Model Validation

There will never be another Titanic, so we do not need to validate the model for prospective use. But we use the bootstrap to validate the model anyway, in an effort to detect whether it is overfitting the data.

f <- update(f, x = TRUE , y = TRUE)
# x=TRUE , y=TRUE adds raw data to fit object so can bootstrap

set.seed(131) # so can replicate re-samples

validate(f, B = 200) # default = bootstrap
##           index.orig training    test optimism index.corrected   n
## Dxy           0.7560   0.7721  0.7407   0.0314          0.7246 200
## R2            0.5545   0.5778  0.5262   0.0516          0.5029 200
## Intercept     0.0000   0.0000 -0.0788   0.0788         -0.0788 200
## Slope         1.0000   1.0000  0.8649   0.1351          0.8649 200
## Emax          0.0000   0.0000  0.0460   0.0460          0.0460 200
## D             0.5286   0.5585  0.4936   0.0649          0.4637 200
## U            -0.0019  -0.0019  0.0061  -0.0080          0.0061 200
## Q             0.5305   0.5604  0.4876   0.0729          0.4576 200
## B             0.1303   0.1256  0.1342  -0.0086          0.1389 200
## g             2.4270   2.7789  2.3963   0.3826          2.0444 200
## gp            0.3655   0.3725  0.3536   0.0189          0.3466 200
cal <- calibrate(f, B = 200) 
plot(cal, subtitles = FALSE )
Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model

Figure 7.1: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model

## 
## n=1046   Mean absolute error=0.011   Mean squared error=0.00016
## 0.9 Quantile of absolute error=0.018

The output of validate indicates minor overfitting. Overfitting would have been worse had the risk factors not been so strong. The closeness of the calibration curve to the 45◦ line in Figure 12.7 demonstrates excellent validation on an absolute probability scale. But the extent of missing data casts some doubt on the validity of this model, and on the efficiency of its parameter estimates.

8 Survival probability for madeup cases

phat <- predict(f,
                combos <- expand.grid(age = c(2,21,50) , sex = levels(t3$sex), 
                                      pclass = levels(t3$pclass),sibsp = 0), 
                type = 'fitted')

data.frame(combos , phat)
##    age    sex pclass sibsp       phat
## 1    2 female    1st     0 0.97024806
## 2   21 female    1st     0 0.98280142
## 3   50 female    1st     0 0.96479314
## 4    2   male    1st     0 0.86973741
## 5   21   male    1st     0 0.48473392
## 6   50   male    1st     0 0.26915109
## 7    2 female    2nd     0 0.99989125
## 8   21 female    2nd     0 0.90394112
## 9   50 female    2nd     0 0.82809992
## 10   2   male    2nd     0 0.99885604
## 11  21   male    2nd     0 0.06703969
## 12  50   male    2nd     0 0.02915321
## 13   2 female    3rd     0 0.81969146
## 14  21 female    3rd     0 0.55337465
## 15  50 female    3rd     0 0.31174432
## 16   2   male    3rd     0 0.88896760
## 17  21   male    3rd     0 0.14926854
## 18  50   male    3rd     0 0.04975613