1 Abstract
This study demonstrates the development of a binary logistic regression model to describe patterns of survival in passengers on the Titanic, based on passenger age, sex, ticket class, and the number of family members accompanying each passenger.
2 Loading Data and introduction
# loading required libraries
library(rms)
library(tinytex)
# get dataset from web site
getHdata(titanic3)
# rows x columns
dim(titanic3)
## [1] 1309 14
Let us look at the data dictionary for Titanic3 data.
| Column | Description |
|---|---|
pclass |
passanger class: levels 1st, 2nd, and 3rd |
survived |
Survival (0 = No; 1 = Yes) |
name |
name of the passenger |
sex |
gender of the passenger: levels female and male |
age |
age of the passenger in years |
sibsp |
Number of Siblings/Spouses Aboard for a passenger |
parch |
Number of Parents/Children Aboard for a passenger |
ticket |
Ticket Number |
fare |
Passenger Fare |
cabin |
allotted cabin |
embarked |
a factor with levels Cherbourg, Queenstown, and Southampton |
boat |
Lifeboat |
body |
Body Identification Number |
home.dest |
Home/Destination |
3 Descriptive statistics
# List of names of variables to analyze
v <- c('pclass', 'survived' , 'age' , 'sex' , 'sibsp' , 'parch')
t3 <- titanic3[, v]
units(t3$age) <- 'years'
# univariate summaries
describe(t3)
## t3
##
## 6 Variables 1309 Observations
## --------------------------------------------------------------------------------
## pclass
## n missing distinct
## 1309 0 3
##
## Value 1st 2nd 3rd
## Frequency 323 277 709
## Proportion 0.247 0.212 0.542
## --------------------------------------------------------------------------------
## survived : Survived
## n missing distinct Info Sum Mean Gmd
## 1309 0 2 0.708 500 0.382 0.4725
##
## --------------------------------------------------------------------------------
## age : Age [years]
## n missing distinct Info Mean Gmd .05 .10
## 1046 263 98 0.999 29.88 16.06 5 14
## .25 .50 .75 .90 .95
## 21 28 39 50 57
##
## lowest : 0.1667 0.3333 0.4167 0.6667 0.7500
## highest: 70.5000 71.0000 74.0000 76.0000 80.0000
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 1309 0 2
##
## Value female male
## Frequency 466 843
## Proportion 0.356 0.644
## --------------------------------------------------------------------------------
## sibsp : Number of Siblings/Spouses Aboard
## n missing distinct Info Mean Gmd
## 1309 0 7 0.67 0.4989 0.777
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 8
##
## Value 0 1 2 3 4 5 8
## Frequency 891 319 42 20 22 6 9
## Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007
## --------------------------------------------------------------------------------
## parch : Number of Parents/Children Aboard
## n missing distinct Info Mean Gmd
## 1309 0 8 0.549 0.385 0.6375
##
## lowest : 0 1 2 3 4, highest: 3 4 5 6 9
##
## Value 0 1 2 3 4 5 6 9
## Frequency 1002 170 113 8 6 6 2 2
## Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002
## --------------------------------------------------------------------------------
Next, we obtain access to the needed variables and observations, and save data distribution characteristics for plotting and for computing predictor effects.
There are not many passengers having more than 3 siblings or spouses or more than 3 children, so we truncate two variables at 3 for the purpose of estimating stratified survival probabilities.
# data distribution
dd <- datadist(t3)
# describe distributions of variables to rms
options(datadist = 'dd')
s <- summary(survived ~ age + sex + pclass + cut2(sibsp ,0:3) + cut2 (parch ,0:3), data = t3)
plot(s, main = '', subtitles = FALSE)
Figure 3.1: Univariable summaries of Titanic survival
OBSERVATIONS:
Note the large number of missing ages.
Also note the strong effects of sex and passenger class on the probability of surviving.
The age effect does not appear to be very strong.
The effects of the last two variables are unclear as the estimated proportions are not monotonic in the values of these descriptors.
Although some of the cell sizes are small, we can show four-way empirical relationships with the fraction of surviving passengers by creating four cells for sibsp à parch combinations and by creating two age groups.
tn <- transform(t3 ,
agec = ifelse(age < 21, 'child' , 'adult'),
sibsp = ifelse(sibsp == 0, 'no sib/sp' , 'sib/sp' ),
parch = ifelse(parch == 0, 'no par/child' , 'par/child'))
# suppress proportions based on fewer than 25 passengers in a cell
g <- function(y) if (length(y) < 25) NA else mean(y)
#llist , summarize in Hmisc package
s <- with(tn , summarize(survived ,llist(agec , sex , pclass , sibsp , parch), g))
#plot
ggplot(subset(s, agec != 'NA' ), aes(x = survived , y = pclass , shape = sex)) +
geom_point() + facet_grid(agec ~ sibsp * parch) +
xlab('Proportion Surviving') + ylab('Passenger Class') +
scale_x_continuous(breaks = c(0, .5 , 1))
Figure 3.2: Multi-way summary of Titanic survival
Note that none of the effects of sibsp or parch for common passenger groups appear strong on an absolute risk scale.
4 Exploring Trends with Nonparametric Regression
The loess smoother has excellent performance when the response is binary, as long as outlier detection is turned off. Here we use a ggplot2 add-on function histSpikeg in the Hmisc package to obtain and plot the loess fit and age distribution.
library(ggplot2)
b <- scale_size_discrete (range =c(.1, .85))
yl <- ylab(NULL)
p1 <- ggplot(t3, aes(x = age , y = survived)) +
histSpikeg(survived ~ age , lowess = TRUE , data = t3) +
ylim(0,1) + yl
p2 <- ggplot(t3 , aes(x = age, y = survived, color = sex)) +
histSpikeg (survived ~ age + sex, lowess = TRUE, data = t3) +
ylim(0,1) + yl
p3 <- ggplot(t3, aes(x = age, y = survived , size = pclass)) +
histSpikeg(survived ~ age + pclass, lowess =TRUE, data = t3) +
b + ylim(0,1) + yl
p4 <- ggplot(t3, aes(x = age, y = survived, color = sex, size = pclass)) +
histSpikeg(survived ~ age + sex + pclass, lowess = TRUE, data = t3) +
b + ylim (0,1) + yl
# combine 4
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
Figure 4.1: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic. The top left panel shows unstratified estimates of the probability ofsurvival. Other panels show nonparametric estimates by various stratifications.
The above plots shows much of the story of passenger survival patterns.
"Women and children firstâseems to be true except for women in third class.
It is interesting that there is no real cutoff for who is considered a child.
For men,the younger the greater chance of surviving.
The interpretation of the effects of the ânumber of relativesâ-type variables will be more difficult, as their definitions are a function of age. Let us try to plot that below:
top <- theme(legend.position = 'top')
p1 <- ggplot(t3 , aes(x = age , y = survived , color = cut2(sibsp ,0:2))) +
stat_plsmo() + b + ylim(0,1) + yl +
top + scale_color_discrete(name = ' siblings /spouses')
p2 <- ggplot(t3, aes(x = age, y = survived, color = cut2(parch,0:2))) +
stat_plsmo() + b + ylim(0,1) + yl + top + scale_color_discrete( name = 'parents/children')
gridExtra::grid.arrange(p1, p2, ncol = 2)
Figure 4.2: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the number of parents or children of the passenger on board (right panel).
5 Binary Logistic Model With Casewise Deletion of Missing Values
What follows is the standard analysis based on eliminating observations having any missing data.
We develop an initial somewhat saturated logistic model, allowing for a flexible nonlinear age effect that can differ in shape for all six sex Ãclassstrata.
The meaning of these variables does depend on the passenger’s age, so we consider only age interactions involving sibsp and parch.
f1 <- lrm(survived ~ sex*pclass * rcs(age ,5) + rcs(age ,5)*(sibsp + parch ), data = t3)
anova(f1)
## Warning in anova.rms(f1): tests of nonlinear interaction with respect to single component
## variables ignore 3-way interactions
## Wald Statistics Response: survived
##
## Factor Chi-Square d.f. P
## sex (Factor+Higher Order Factors) 187.15 15 <.0001
## All Interactions 59.74 14 <.0001
## pclass (Factor+Higher Order Factors) 100.10 20 <.0001
## All Interactions 46.51 18 0.0003
## age (Factor+Higher Order Factors) 56.20 32 0.0052
## All Interactions 34.57 28 0.1826
## Nonlinear (Factor+Higher Order Factors) 28.66 24 0.2331
## sibsp (Factor+Higher Order Factors) 19.67 5 0.0014
## All Interactions 12.13 4 0.0164
## parch (Factor+Higher Order Factors) 3.51 5 0.6217
## All Interactions 3.51 4 0.4761
## sex * pclass (Factor+Higher Order Factors) 42.43 10 <.0001
## sex * age (Factor+Higher Order Factors) 15.89 12 0.1962
## Nonlinear (Factor+Higher Order Factors) 14.47 9 0.1066
## Nonlinear Interaction : f(A,B) vs. AB 4.17 3 0.2441
## pclass * age (Factor+Higher Order Factors) 13.47 16 0.6385
## Nonlinear (Factor+Higher Order Factors) 12.92 12 0.3749
## Nonlinear Interaction : f(A,B) vs. AB 6.88 6 0.3324
## age * sibsp (Factor+Higher Order Factors) 12.13 4 0.0164
## Nonlinear 1.76 3 0.6235
## Nonlinear Interaction : f(A,B) vs. AB 1.76 3 0.6235
## age * parch (Factor+Higher Order Factors) 3.51 4 0.4761
## Nonlinear 1.80 3 0.6147
## Nonlinear Interaction : f(A,B) vs. AB 1.80 3 0.6147
## sex * pclass * age (Factor+Higher Order Factors) 8.34 8 0.4006
## Nonlinear 7.74 6 0.2581
## TOTAL NONLINEAR 28.66 24 0.2331
## TOTAL INTERACTION 75.61 30 <.0001
## TOTAL NONLINEAR + INTERACTION 79.49 33 <.0001
## TOTAL 241.93 39 <.0001
Three-way interactions are clearly insignificant (P = 0.4).
So is
parch(P = 0.6 for testing the combined main effect + interaction effects for parch, i.e., whether parch is important for any age).
The model not containing those terms is fitted below. The ^2 in the model formula means to expand the terms in parentheses to include all main effects and second-order interactions.
f <- lrm(survived ~ (sex + pclass + rcs(age ,5))^2 + rcs(age ,5) *sibsp , data=t3)
# immportant stats
f$stats
## Obs Max Deriv Model L.R. d.f. P C
## 1.046000e+03 5.613527e-06 5.538700e+02 2.600000e+01 0.000000e+00 8.780158e-01
## Dxy Gamma Tau-a R2 Brier g
## 7.560317e-01 7.575105e-01 3.656289e-01 5.545145e-01 1.302728e-01 2.426971e+00
## gr gp
## 1.132453e+01 3.654654e-01
# Wald Statistics
anova(f)
## Wald Statistics Response: survived
##
## Factor Chi-Square d.f. P
## sex (Factor+Higher Order Factors) 199.42 7 <.0001
## All Interactions 56.14 6 <.0001
## pclass (Factor+Higher Order Factors) 108.73 12 <.0001
## All Interactions 42.83 10 <.0001
## age (Factor+Higher Order Factors) 47.04 20 0.0006
## All Interactions 24.51 16 0.0789
## Nonlinear (Factor+Higher Order Factors) 22.72 15 0.0902
## sibsp (Factor+Higher Order Factors) 19.95 5 0.0013
## All Interactions 10.99 4 0.0267
## sex * pclass (Factor+Higher Order Factors) 35.40 2 <.0001
## sex * age (Factor+Higher Order Factors) 10.08 4 0.0391
## Nonlinear 8.17 3 0.0426
## Nonlinear Interaction : f(A,B) vs. AB 8.17 3 0.0426
## pclass * age (Factor+Higher Order Factors) 6.86 8 0.5516
## Nonlinear 6.11 6 0.4113
## Nonlinear Interaction : f(A,B) vs. AB 6.11 6 0.4113
## age * sibsp (Factor+Higher Order Factors) 10.99 4 0.0267
## Nonlinear 1.81 3 0.6134
## Nonlinear Interaction : f(A,B) vs. AB 1.81 3 0.6134
## TOTAL NONLINEAR 22.72 15 0.0902
## TOTAL INTERACTION 67.58 18 <.0001
## TOTAL NONLINEAR + INTERACTION 70.68 21 <.0001
## TOTAL 253.18 26 <.0001
This is a very powerful model (ROC area = c = 0.88); the survival patterns are easy to detect.
The Wald ANOVA in the table above indicates especially strong
sexandpclasseffects (Ï2 = 199 and 109, respectively).There is a very strong
sexxpclassinteraction and a strongagexsibspinteraction, considering the strength ofsibspoverall.
6 Predictor effects
Let us examine the shapes of predictor effects. With so many interactions in the model we need to obtain predicted values at least for all combinations of sex and pclass. For sibsp we consider only two of its possible values.
p <- Predict(f, age , sex , pclass , sibsp = 0, fun = plogis)
ggplot(p)
Figure 6.1: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses
Note the agreement between the lower right-hand panel of Figure 2.1 with Figure 3.1 (above). This results from our use of similar flexibility in the parametric and nonparametric approaches (and similar effective degrees of freedom).
The estimated effect of sibsp as a function of age is shown below:
ggplot(Predict(f, sibsp , age = c(10 ,15 ,20,50), conf.int = FALSE))
Note that children having many siblings apparently had lower survival. Married adults had slightly higher survival than unmarried ones.
7 Model Validation
There will never be another Titanic, so we do not need to validate the model for prospective use. But we use the bootstrap to validate the model anyway, in an effort to detect whether it is overfitting the data.
f <- update(f, x = TRUE , y = TRUE)
# x=TRUE , y=TRUE adds raw data to fit object so can bootstrap
set.seed(131) # so can replicate re-samples
validate(f, B = 200) # default = bootstrap
## index.orig training test optimism index.corrected n
## Dxy 0.7560 0.7721 0.7407 0.0314 0.7246 200
## R2 0.5545 0.5778 0.5262 0.0516 0.5029 200
## Intercept 0.0000 0.0000 -0.0788 0.0788 -0.0788 200
## Slope 1.0000 1.0000 0.8649 0.1351 0.8649 200
## Emax 0.0000 0.0000 0.0460 0.0460 0.0460 200
## D 0.5286 0.5585 0.4936 0.0649 0.4637 200
## U -0.0019 -0.0019 0.0061 -0.0080 0.0061 200
## Q 0.5305 0.5604 0.4876 0.0729 0.4576 200
## B 0.1303 0.1256 0.1342 -0.0086 0.1389 200
## g 2.4270 2.7789 2.3963 0.3826 2.0444 200
## gp 0.3655 0.3725 0.3536 0.0189 0.3466 200
cal <- calibrate(f, B = 200)
plot(cal, subtitles = FALSE )
Figure 7.1: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model
##
## n=1046 Mean absolute error=0.011 Mean squared error=0.00016
## 0.9 Quantile of absolute error=0.018
The output of validate indicates minor overfitting. Overfitting would have been worse had the risk factors not been so strong. The closeness of the calibration curve to the 45⦠line in Figure 12.7 demonstrates excellent validation on an absolute probability scale. But the extent of missing data casts some doubt on the validity of this model, and on the efficiency of its parameter estimates.
8 Survival probability for madeup cases
phat <- predict(f,
combos <- expand.grid(age = c(2,21,50) , sex = levels(t3$sex),
pclass = levels(t3$pclass),sibsp = 0),
type = 'fitted')
data.frame(combos , phat)
## age sex pclass sibsp phat
## 1 2 female 1st 0 0.97024806
## 2 21 female 1st 0 0.98280142
## 3 50 female 1st 0 0.96479314
## 4 2 male 1st 0 0.86973741
## 5 21 male 1st 0 0.48473392
## 6 50 male 1st 0 0.26915109
## 7 2 female 2nd 0 0.99989125
## 8 21 female 2nd 0 0.90394112
## 9 50 female 2nd 0 0.82809992
## 10 2 male 2nd 0 0.99885604
## 11 21 male 2nd 0 0.06703969
## 12 50 male 2nd 0 0.02915321
## 13 2 female 3rd 0 0.81969146
## 14 21 female 3rd 0 0.55337465
## 15 50 female 3rd 0 0.31174432
## 16 2 male 3rd 0 0.88896760
## 17 21 male 3rd 0 0.14926854
## 18 50 male 3rd 0 0.04975613