Binary Logistic Regression: Titanic

Mihir

4/4/2022

1 Abstract

This study demonstrates the development of a binary logistic regression model to describe patterns of survival in passengers on the Titanic, based on passenger age, sex, ticket class, and the number of family members accompanying each passenger.

2 Loading Data and introduction

# loading required libraries
library(rms)
library(tinytex)

# get dataset from web site
getHdata(titanic3) 

# rows x columns
dim(titanic3)

## [1] 1309   14

Let us look at the data dictionary for Titanic3 data.

Column	Description
`pclass`	passanger class: levels 1st, 2nd, and 3rd
`survived`	Survival (0 = No; 1 = Yes)
`name`	name of the passenger
`sex`	gender of the passenger: levels female and male
`age`	age of the passenger in years
`sibsp`	Number of Siblings/Spouses Aboard for a passenger
`parch`	Number of Parents/Children Aboard for a passenger
`ticket`	Ticket Number
`fare`	Passenger Fare
`cabin`	allotted cabin
`embarked`	a factor with levels Cherbourg, Queenstown, and Southampton
`boat`	Lifeboat
`body`	Body Identification Number
`home.dest`	Home/Destination

3 Descriptive statistics

# List of names of variables to analyze
v <- c('pclass', 'survived' , 'age' , 'sex' , 'sibsp' , 'parch')
t3 <- titanic3[, v]
units(t3$age) <- 'years'

# univariate summaries
describe(t3)

## t3 
## 
##  6  Variables      1309  Observations
## --------------------------------------------------------------------------------
## pclass 
##        n  missing distinct 
##     1309        0        3 
##                             
## Value        1st   2nd   3rd
## Frequency    323   277   709
## Proportion 0.247 0.212 0.542
## --------------------------------------------------------------------------------
## survived : Survived 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1309        0        2    0.708      500    0.382   0.4725 
## 
## --------------------------------------------------------------------------------
## age : Age [years] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1046      263       98    0.999    29.88    16.06        5       14 
##      .25      .50      .75      .90      .95 
##       21       28       39       50       57 
## 
## lowest :  0.1667  0.3333  0.4167  0.6667  0.7500
## highest: 70.5000 71.0000 74.0000 76.0000 80.0000
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##     1309        0        2 
##                         
## Value      female   male
## Frequency     466    843
## Proportion  0.356  0.644
## --------------------------------------------------------------------------------
## sibsp : Number of Siblings/Spouses Aboard 
##        n  missing distinct     Info     Mean      Gmd 
##     1309        0        7     0.67   0.4989    0.777 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 8
##                                                     
## Value          0     1     2     3     4     5     8
## Frequency    891   319    42    20    22     6     9
## Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007
## --------------------------------------------------------------------------------
## parch : Number of Parents/Children Aboard 
##        n  missing distinct     Info     Mean      Gmd 
##     1309        0        8    0.549    0.385   0.6375 
## 
## lowest : 0 1 2 3 4, highest: 3 4 5 6 9
##                                                           
## Value          0     1     2     3     4     5     6     9
## Frequency   1002   170   113     8     6     6     2     2
## Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002
## --------------------------------------------------------------------------------

Next, we obtain access to the needed variables and observations, and save data distribution characteristics for plotting and for computing predictor effects.

There are not many passengers having more than 3 siblings or spouses or more than 3 children, so we truncate two variables at 3 for the purpose of estimating stratified survival probabilities.

# data distribution
dd <- datadist(t3)
# describe distributions of variables to rms
options(datadist = 'dd')
s <- summary(survived ~ age + sex + pclass + cut2(sibsp ,0:3) + cut2 (parch ,0:3), data = t3)
plot(s, main = '', subtitles = FALSE)

Figure 3.1: Univariable summaries of Titanic survival

OBSERVATIONS:

Note the large number of missing ages.
Also note the strong effects of sex and passenger class on the probability of surviving.
The age effect does not appear to be very strong.
The effects of the last two variables are unclear as the estimated proportions are not monotonic in the values of these descriptors.

Although some of the cell sizes are small, we can show four-way empirical relationships with the fraction of surviving passengers by creating four cells for sibsp Ã parch combinations and by creating two age groups.

tn <- transform(t3 ,
                 agec = ifelse(age < 21, 'child' , 'adult'),
                 sibsp = ifelse(sibsp == 0, 'no sib/sp' , 'sib/sp' ),
                 parch = ifelse(parch == 0, 'no par/child' , 'par/child'))

# suppress proportions based on fewer than 25 passengers in a cell
g <- function(y) if (length(y) < 25) NA else mean(y) 

#llist , summarize in Hmisc package
s <- with(tn , summarize(survived ,llist(agec , sex , pclass , sibsp , parch), g))

#plot
ggplot(subset(s, agec != 'NA' ), aes(x = survived , y = pclass , shape = sex)) +
  geom_point() + facet_grid(agec ~ sibsp * parch) +
  xlab('Proportion Surviving') + ylab('Passenger Class') +
  scale_x_continuous(breaks = c(0, .5 , 1))

Figure 3.2: Multi-way summary of Titanic survival

Note that none of the effects of sibsp or parch for common passenger groups appear strong on an absolute risk scale.

4 Exploring Trends with Nonparametric Regression

The loess smoother has excellent performance when the response is binary, as long as outlier detection is turned off. Here we use a ggplot2 add-on function histSpikeg in the Hmisc package to obtain and plot the loess fit and age distribution.

library(ggplot2)
b <- scale_size_discrete (range =c(.1, .85))

yl <- ylab(NULL)

p1 <- ggplot(t3, aes(x = age , y = survived)) +
  histSpikeg(survived ~ age , lowess = TRUE , data = t3) +
  ylim(0,1) + yl

p2 <- ggplot(t3 , aes(x = age, y = survived, color = sex)) +
  histSpikeg (survived ~ age + sex, lowess = TRUE, data = t3) + 
  ylim(0,1) + yl

p3 <- ggplot(t3, aes(x = age, y = survived , size = pclass)) +
  histSpikeg(survived ~ age + pclass, lowess =TRUE, data = t3) +
  b + ylim(0,1) + yl

p4 <- ggplot(t3, aes(x = age, y = survived, color = sex, size = pclass)) +
  histSpikeg(survived ~ age + sex + pclass, lowess = TRUE, data = t3) +
  b + ylim (0,1) + yl

# combine 4
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)

Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic. The top left panel shows unstratified estimates of the probability ofsurvival. Other panels show nonparametric estimates by various stratifications.

Figure 4.1: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic. The top left panel shows unstratified estimates of the probability ofsurvival. Other panels show nonparametric estimates by various stratifications.

The above plots shows much of the story of passenger survival patterns.

"Women and children firstâseems to be true except for women in third class.
It is interesting that there is no real cutoff for who is considered a child.
For men,the younger the greater chance of surviving.
The interpretation of the effects of the ânumber of relativesâ-type variables will be more difficult, as their definitions are a function of age. Let us try to plot that below:

top <- theme(legend.position = 'top')

p1 <- ggplot(t3 , aes(x = age , y = survived , color = cut2(sibsp ,0:2))) +
  stat_plsmo() + b + ylim(0,1) + yl + 
  top + scale_color_discrete(name = ' siblings /spouses')

p2 <- ggplot(t3, aes(x = age, y = survived, color = cut2(parch,0:2))) + 
  stat_plsmo() + b + ylim(0,1) + yl + top + scale_color_discrete( name = 'parents/children')

gridExtra::grid.arrange(p1, p2, ncol = 2)

Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the number of parents or children of the passenger on board (right panel).

Figure 4.2: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the number of parents or children of the passenger on board (right panel).

5 Binary Logistic Model With Casewise Deletion of Missing Values

What follows is the standard analysis based on eliminating observations having any missing data.

We develop an initial somewhat saturated logistic model, allowing for a flexible nonlinear age effect that can differ in shape for all six sex Ãclassstrata.

The meaning of these variables does depend on the passenger’s age, so we consider only age interactions involving sibsp and parch.

f1 <- lrm(survived ~ sex*pclass * rcs(age ,5) + rcs(age ,5)*(sibsp + parch ), data = t3) 
anova(f1)

## Warning in anova.rms(f1): tests of nonlinear interaction with respect to single component 
## variables ignore 3-way interactions

##                 Wald Statistics          Response: survived 
## 
##  Factor                                            Chi-Square d.f. P     
##  sex  (Factor+Higher Order Factors)                187.15     15   <.0001
##   All Interactions                                  59.74     14   <.0001
##  pclass  (Factor+Higher Order Factors)             100.10     20   <.0001
##   All Interactions                                  46.51     18   0.0003
##  age  (Factor+Higher Order Factors)                 56.20     32   0.0052
##   All Interactions                                  34.57     28   0.1826
##   Nonlinear (Factor+Higher Order Factors)           28.66     24   0.2331
##  sibsp  (Factor+Higher Order Factors)               19.67      5   0.0014
##   All Interactions                                  12.13      4   0.0164
##  parch  (Factor+Higher Order Factors)                3.51      5   0.6217
##   All Interactions                                   3.51      4   0.4761
##  sex * pclass  (Factor+Higher Order Factors)        42.43     10   <.0001
##  sex * age  (Factor+Higher Order Factors)           15.89     12   0.1962
##   Nonlinear (Factor+Higher Order Factors)           14.47      9   0.1066
##   Nonlinear Interaction : f(A,B) vs. AB              4.17      3   0.2441
##  pclass * age  (Factor+Higher Order Factors)        13.47     16   0.6385
##   Nonlinear (Factor+Higher Order Factors)           12.92     12   0.3749
##   Nonlinear Interaction : f(A,B) vs. AB              6.88      6   0.3324
##  age * sibsp  (Factor+Higher Order Factors)         12.13      4   0.0164
##   Nonlinear                                          1.76      3   0.6235
##   Nonlinear Interaction : f(A,B) vs. AB              1.76      3   0.6235
##  age * parch  (Factor+Higher Order Factors)          3.51      4   0.4761
##   Nonlinear                                          1.80      3   0.6147
##   Nonlinear Interaction : f(A,B) vs. AB              1.80      3   0.6147
##  sex * pclass * age  (Factor+Higher Order Factors)   8.34      8   0.4006
##   Nonlinear                                          7.74      6   0.2581
##  TOTAL NONLINEAR                                    28.66     24   0.2331
##  TOTAL INTERACTION                                  75.61     30   <.0001
##  TOTAL NONLINEAR + INTERACTION                      79.49     33   <.0001
##  TOTAL                                             241.93     39   <.0001

Three-way interactions are clearly insignificant (P = 0.4).
So is parch (P = 0.6 for testing the combined main effect + interaction effects for parch, i.e., whether parch is important for any age).

The model not containing those terms is fitted below. The ^2 in the model formula means to expand the terms in parentheses to include all main effects and second-order interactions.

f <- lrm(survived ~ (sex + pclass + rcs(age ,5))^2 + rcs(age ,5) *sibsp , data=t3)
# immportant stats
f$stats

##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
## 1.046000e+03 5.613527e-06 5.538700e+02 2.600000e+01 0.000000e+00 8.780158e-01 
##          Dxy        Gamma        Tau-a           R2        Brier            g 
## 7.560317e-01 7.575105e-01 3.656289e-01 5.545145e-01 1.302728e-01 2.426971e+00 
##           gr           gp 
## 1.132453e+01 3.654654e-01

# Wald Statistics
anova(f)

##                 Wald Statistics          Response: survived 
## 
##  Factor                                      Chi-Square d.f. P     
##  sex  (Factor+Higher Order Factors)          199.42      7   <.0001
##   All Interactions                            56.14      6   <.0001
##  pclass  (Factor+Higher Order Factors)       108.73     12   <.0001
##   All Interactions                            42.83     10   <.0001
##  age  (Factor+Higher Order Factors)           47.04     20   0.0006
##   All Interactions                            24.51     16   0.0789
##   Nonlinear (Factor+Higher Order Factors)     22.72     15   0.0902
##  sibsp  (Factor+Higher Order Factors)         19.95      5   0.0013
##   All Interactions                            10.99      4   0.0267
##  sex * pclass  (Factor+Higher Order Factors)  35.40      2   <.0001
##  sex * age  (Factor+Higher Order Factors)     10.08      4   0.0391
##   Nonlinear                                    8.17      3   0.0426
##   Nonlinear Interaction : f(A,B) vs. AB        8.17      3   0.0426
##  pclass * age  (Factor+Higher Order Factors)   6.86      8   0.5516
##   Nonlinear                                    6.11      6   0.4113
##   Nonlinear Interaction : f(A,B) vs. AB        6.11      6   0.4113
##  age * sibsp  (Factor+Higher Order Factors)   10.99      4   0.0267
##   Nonlinear                                    1.81      3   0.6134
##   Nonlinear Interaction : f(A,B) vs. AB        1.81      3   0.6134
##  TOTAL NONLINEAR                              22.72     15   0.0902
##  TOTAL INTERACTION                            67.58     18   <.0001
##  TOTAL NONLINEAR + INTERACTION                70.68     21   <.0001
##  TOTAL                                       253.18     26   <.0001

This is a very powerful model (ROC area = c = 0.88); the survival patterns are easy to detect.
The Wald ANOVA in the table above indicates especially strong sex and pclass effects (Ï2 = 199 and 109, respectively).
There is a very strong sex x pclass interaction and a strong age x sibsp interaction, considering the strength of sibsp overall.

6 Predictor effects

Let us examine the shapes of predictor effects. With so many interactions in the model we need to obtain predicted values at least for all combinations of sex and pclass. For sibsp we consider only two of its possible values.

p <- Predict(f, age , sex , pclass , sibsp = 0, fun = plogis)
ggplot(p)

Figure 6.1: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses

Note the agreement between the lower right-hand panel of Figure 2.1 with Figure 3.1 (above). This results from our use of similar flexibility in the parametric and nonparametric approaches (and similar effective degrees of freedom).

The estimated effect of sibsp as a function of age is shown below:

ggplot(Predict(f, sibsp , age = c(10 ,15 ,20,50), conf.int = FALSE))

Note that children having many siblings apparently had lower survival. Married adults had slightly higher survival than unmarried ones.

7 Model Validation

There will never be another Titanic, so we do not need to validate the model for prospective use. But we use the bootstrap to validate the model anyway, in an effort to detect whether it is overfitting the data.

f <- update(f, x = TRUE , y = TRUE)
# x=TRUE , y=TRUE adds raw data to fit object so can bootstrap

set.seed(131) # so can replicate re-samples

validate(f, B = 200) # default = bootstrap

##           index.orig training    test optimism index.corrected   n
## Dxy           0.7560   0.7721  0.7407   0.0314          0.7246 200
## R2            0.5545   0.5778  0.5262   0.0516          0.5029 200
## Intercept     0.0000   0.0000 -0.0788   0.0788         -0.0788 200
## Slope         1.0000   1.0000  0.8649   0.1351          0.8649 200
## Emax          0.0000   0.0000  0.0460   0.0460          0.0460 200
## D             0.5286   0.5585  0.4936   0.0649          0.4637 200
## U            -0.0019  -0.0019  0.0061  -0.0080          0.0061 200
## Q             0.5305   0.5604  0.4876   0.0729          0.4576 200
## B             0.1303   0.1256  0.1342  -0.0086          0.1389 200
## g             2.4270   2.7789  2.3963   0.3826          2.0444 200
## gp            0.3655   0.3725  0.3536   0.0189          0.3466 200

cal <- calibrate(f, B = 200) 
plot(cal, subtitles = FALSE )

Figure 7.1: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model

## 
## n=1046   Mean absolute error=0.011   Mean squared error=0.00016
## 0.9 Quantile of absolute error=0.018

The output of validate indicates minor overfitting. Overfitting would have been worse had the risk factors not been so strong. The closeness of the calibration curve to the 45â¦ line in Figure 12.7 demonstrates excellent validation on an absolute probability scale. But the extent of missing data casts some doubt on the validity of this model, and on the efficiency of its parameter estimates.

8 Survival probability for madeup cases

phat <- predict(f,
                combos <- expand.grid(age = c(2,21,50) , sex = levels(t3$sex), 
                                      pclass = levels(t3$pclass),sibsp = 0), 
                type = 'fitted')

data.frame(combos , phat)

##    age    sex pclass sibsp       phat
## 1    2 female    1st     0 0.97024806
## 2   21 female    1st     0 0.98280142
## 3   50 female    1st     0 0.96479314
## 4    2   male    1st     0 0.86973741
## 5   21   male    1st     0 0.48473392
## 6   50   male    1st     0 0.26915109
## 7    2 female    2nd     0 0.99989125
## 8   21 female    2nd     0 0.90394112
## 9   50 female    2nd     0 0.82809992
## 10   2   male    2nd     0 0.99885604
## 11  21   male    2nd     0 0.06703969
## 12  50   male    2nd     0 0.02915321
## 13   2 female    3rd     0 0.81969146
## 14  21 female    3rd     0 0.55337465
## 15  50 female    3rd     0 0.31174432
## 16   2   male    3rd     0 0.88896760
## 17  21   male    3rd     0 0.14926854
## 18  50   male    3rd     0 0.04975613