The sinking of the RMS Titanic on April 15, 1912, remains one of the most infamous maritime disasters in history. The tragedy claimed over 1,500 lives and has since become a subject of extensive historical and statistical analysis. Understanding the factors that influenced passenger survival can provide insights into the social dynamics and emergency response protocols of the era.
1.2 Objective
This analysis employs logistic regression modeling to identify and quantify the factors associated with passenger survival on the Titanic. Specifically, we aim to:
Examine the individual (univariable) associations between demographic and socioeconomic variables and survival outcomes
Develop a multivariable model to assess the adjusted effects of these predictors while controlling for potential confounding
Demonstrate a systematic approach to model selection in logistic regression
1.3 Dataset
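The analysis uses the Titanic dataset that ships with R, a four-way contingency table covering the 2,201 people on board (passengers and crew), cross-classified by survival status, sex, age (child or adult), and class (1st, 2nd, 3rd, or crew). The modeling code in this analysis assumes the table has been expanded to a data frame, df, with one row per person. This dataset has been widely used in statistical education to illustrate classification and regression techniques.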
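The built-in `Titanic` object is a table of counts rather than a data frame, so it has to be expanded before a model can be fitted to individual records. The following is a minimal sketch of one way to do that. The name `df` matches the object used in the modeling code, but the construction itself is an assumption, since the original does not show how `df` was created.

```r
# Assumption: `df` is derived from the built-in `Titanic` table of counts.
counts <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Repeat each row of the count table Freq times to get one row per person
df <- counts[rep(seq_len(nrow(counts)), counts$Freq),
             c("Survived", "Sex", "Age", "Class")]
rownames(df) <- NULL

nrow(df)            # number of people on board
table(df$Survived)  # counts of deaths ("No") and survivors ("Yes")
```

Because the factor levels keep their original order, `glm()` treats "No" as the reference outcome and Male, Child, and 1st as the reference categories, which matches the reference rows in the tables of this analysis.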
1.4 Analytical Approach
We follow a two-stage modeling strategy commonly used in epidemiological and clinical research:
Univariable screening: Each predictor is examined individually to identify variables with sufficient evidence of association (p < 0.2) for inclusion in the multivariable model
Multivariable modeling: Variables that pass the screening threshold are included in a comprehensive model to estimate adjusted associations
This approach keeps the multivariable model focused on predictors with at least some evidence of association, while the adjusted estimates account for confounding among the variables that are included.
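The screening rule described above can be sketched in base R as follows. This is an illustrative sketch, not the code used for the tables: it applies one likelihood ratio test per predictor, whereas the gtsummary tables report a p-value per category, and the object names (`predictors`, `screen_p`, `selected`) are hypothetical.

```r
# Assumption: one row per person, expanded from the built-in `Titanic` table
counts <- as.data.frame(Titanic)
df <- counts[rep(seq_len(nrow(counts)), counts$Freq),
             c("Survived", "Sex", "Age", "Class")]

predictors <- c("Sex", "Age", "Class")

# Intercept-only model used as the comparison for every predictor
null_model <- glm(Survived ~ 1, data = df, family = binomial)

# Stage 1: likelihood ratio test of each single-predictor model vs. the null
screen_p <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Survived"),
             data = df, family = binomial)
  anova(null_model, fit, test = "Chisq")[2, "Pr(>Chi)"]
})

# Stage 2 candidates: predictors that pass the p < 0.2 threshold
selected <- names(screen_p)[screen_p < 0.2]
selected
```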
2 Univariable Model
The univariable analysis examines the association between each predictor and survival independently. This screening step identifies variables with p-values < 0.2, which are then candidates for inclusion in the multivariable logistic regression model.
library(dplyr)
library(gtsummary)
library(gt)

# df is assumed to hold one row per person with the columns
# Survived, Sex, Age, and Class
df %>%
  select(Survived, Sex, Age, Class) %>%
  # Fit one logistic regression per predictor and report odds ratios
  tbl_uvregression(
    method = glm,
    y = Survived,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    label = list(
      Age ~ "Age (Years)",
      Sex ~ "Sex of the Passengers"
    )
  ) |>
  bold_labels() |>
  bold_p() |>
  # Use as_gt() instead of gt()
  as_gt() |>
  tab_header(
    title = md("**Univariable Logistic Regression Model**"),
    subtitle = md("Fate of passengers on the fatal maiden voyage of the ocean liner 'Titanic'")
  ) |>
  tab_source_note(
    source_note = md("Dawson, Robert J. MacG. (1995), The ‘Unusual Episode’ Data Revisited. Journal of Statistics Education, 3. *doi:10.1080/10691898.1995.11910499.*")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent"
  )
Univariable Logistic Regression Model
Fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’

| Characteristic | N | OR | 95% CI | p-value |
|---|---|---|---|---|
| Sex of the Passengers | 2,201 | | | |
| Male | | — | — | |
| Female | | 10.1 | 8.05, 12.9 | <0.001 |
| Age (Years) | 2,201 | | | |
| Child | | — | — | |
| Adult | | 0.41 | 0.28, 0.61 | <0.001 |
| Class | 2,201 | | | |
| 1st | | — | — | |
| 2nd | | 0.42 | 0.31, 0.59 | <0.001 |
| 3rd | | 0.20 | 0.15, 0.27 | <0.001 |
| Crew | | 0.19 | 0.14, 0.25 | <0.001 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Dawson, Robert J. MacG. (1995), The ‘Unusual Episode’ Data Revisited. Journal of Statistics Education, 3. doi:10.1080/10691898.1995.11910499.
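For a single binary predictor, the crude odds ratio from a univariable logistic regression equals the odds ratio computed directly from the 2×2 table of counts. As a worked check, the sketch below recomputes the odds ratio for sex from the built-in `Titanic` table; the object names are hypothetical and not part of the original analysis.

```r
# Collapse the four-way table to Sex x Survived counts
# (dimension 2 of `Titanic` is Sex, dimension 4 is Survived)
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))

# Odds of survival within each sex: survivors divided by deaths
odds_survival <- sex_by_survival[, "Yes"] / sex_by_survival[, "No"]

# Crude odds ratio for Female relative to Male (the reference category)
crude_or_female <- unname(odds_survival["Female"] / odds_survival["Male"])
round(crude_or_female, 1)
```

The result rounds to 10.1, which agrees with the odds ratio reported for Female in the univariable table.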
4 Fit Initial Multivariable Model
After identifying significant predictors in the univariable analysis, we fit a multivariable logistic regression model to examine the adjusted associations while controlling for confounding variables.
# Fit the multivariable model with all three screened predictors
full_model <- glm(
  Survived ~ Sex + Age + Class,
  data = df,
  family = binomial
)

tbl_regression(
  full_model,
  exponentiate = TRUE,
  label = list(
    Age ~ "Age (Years)",
    Sex ~ "Sex of the Passengers"
  )
) |>
  bold_labels() |>
  # Relabel the estimate column as adjusted odds ratios
  modify_header(estimate ~ "**AOR**") |>
  bold_p() |>
  # Use as_gt() instead of gt()
  as_gt() |>
  tab_header(
    title = md("**Multivariable Logistic Regression Model**"),
    subtitle = md("Fate of passengers on the fatal maiden voyage of the ocean liner 'Titanic'")
  ) |>
  tab_source_note(
    source_note = md("Dawson, Robert J. MacG. (1995), The ‘Unusual Episode’ Data Revisited. Journal of Statistics Education, 3. *doi:10.1080/10691898.1995.11910499.*")
  ) |>
  opt_align_table_header(align = "left") |>
  tab_options(
    table_body.hlines.color = "transparent",
    table_body.vlines.color = "transparent"
  )
Multivariable Logistic Regression Model
Fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’

| Characteristic | AOR | 95% CI | p-value |
|---|---|---|---|
| Sex of the Passengers | | | |
| Male | — | — | |
| Female | 11.2 | 8.57, 14.9 | <0.001 |
| Age (Years) | | | |
| Child | — | — | |
| Adult | 0.35 | 0.21, 0.56 | <0.001 |
| Class | | | |
| 1st | — | — | |
| 2nd | 0.36 | 0.25, 0.53 | <0.001 |
| 3rd | 0.17 | 0.12, 0.24 | <0.001 |
| Crew | 0.42 | 0.31, 0.58 | <0.001 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Dawson, Robert J. MacG. (1995), The ‘Unusual Episode’ Data Revisited. Journal of Statistics Education, 3. doi:10.1080/10691898.1995.11910499.
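The adjusted odds ratios in the table are the exponentiated coefficients of the fitted model. The sketch below shows how they could be pulled out with base R alone. It rebuilds `df` from the built-in `Titanic` table (an assumption about how the data were prepared) and uses Wald intervals from `confint.default()`, so the interval limits may differ slightly from those in the table.

```r
# Assumption: one row per person, expanded from the built-in `Titanic` table
counts <- as.data.frame(Titanic)
df <- counts[rep(seq_len(nrow(counts)), counts$Freq),
             c("Survived", "Sex", "Age", "Class")]

full_model <- glm(Survived ~ Sex + Age + Class,
                  data = df, family = binomial)

# Coefficients are log odds ratios; exponentiate to get adjusted odds ratios
aor <- exp(coef(full_model))

# Wald confidence limits, exponentiated onto the odds ratio scale
wald_ci <- exp(confint.default(full_model))

round(cbind(AOR = aor, wald_ci), 2)
```

The first row is the exponentiated intercept, that is, the odds of survival in the reference group (male, child, 1st class), not an odds ratio.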
4.1 Interpretation
When interpreting the multivariable model, compare the adjusted odds ratios (AOR) to the crude odds ratios from the univariable analysis. Key considerations include:
Confounding: Variables whose effect estimates change substantially (typically >10-20%) when adjusted for other variables may indicate confounding.
Effect modification: Assess whether associations differ across subgroups.
Statistical significance: When a variable is significant in the univariable analysis but non-significant in the multivariable model, its crude association may be explained by confounding with the other predictors.
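The first two considerations can be checked numerically. The sketch below first computes the percent change from crude to adjusted odds ratio using the rounded values reported in the two tables, then tests one possible effect modification (sex by class) with a likelihood ratio test. The choice of the Sex × Class interaction and all object names are illustrative assumptions, not part of the original analysis.

```r
# Odds ratios as reported (rounded) in the univariable and multivariable tables
crude    <- c(Female = 10.1, Adult = 0.41, `2nd` = 0.42, `3rd` = 0.20, Crew = 0.19)
adjusted <- c(Female = 11.2, Adult = 0.35, `2nd` = 0.36, `3rd` = 0.17, Crew = 0.42)

# Percent change from crude to adjusted estimate
pct_change <- 100 * (adjusted - crude) / crude
round(pct_change, 1)

# Assumption: one row per person, expanded from the built-in `Titanic` table
counts <- as.data.frame(Titanic)
df <- counts[rep(seq_len(nrow(counts)), counts$Freq),
             c("Survived", "Sex", "Age", "Class")]

# Effect modification: does the association with sex differ across classes?
main_model <- glm(Survived ~ Sex + Age + Class, data = df, family = binomial)
int_model  <- glm(Survived ~ Sex * Class + Age, data = df, family = binomial)
interaction_test <- anova(main_model, int_model, test = "Chisq")
interaction_test
```

With the reported values, the estimate for Crew more than doubles after adjustment (0.19 to 0.42), far beyond the 10-20% rule of thumb, which suggests that the crude crew estimate is confounded by the other predictors.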
4.2 Model Selection Decision
If all variables in the multivariable model remain statistically significant and are theoretically justified, no further variable selection is necessary. The model adequately represents the relationships between predictors and the outcome. However, if some variables become non-significant, consider:
Retaining variables that are clinically or theoretically important, regardless of statistical significance
Using backward elimination or other selection methods to arrive at a parsimonious model
Assessing model fit using criteria such as AIC, BIC, or likelihood ratio tests
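The options listed above map onto a few base R functions. The following is a hedged sketch rather than the procedure used in this analysis: `drop1()` gives a likelihood ratio test for removing each term, and `step()` performs backward elimination, though it selects by AIC rather than by p-values. The data preparation step and the name `reduced_model` are assumptions.

```r
# Assumption: one row per person, expanded from the built-in `Titanic` table
counts <- as.data.frame(Titanic)
df <- counts[rep(seq_len(nrow(counts)), counts$Freq),
             c("Survived", "Sex", "Age", "Class")]

full_model <- glm(Survived ~ Sex + Age + Class,
                  data = df, family = binomial)

# Likelihood ratio test for dropping each term from the full model
drop1(full_model, test = "Chisq")

# Backward elimination driven by AIC (trace = 0 suppresses the step log)
reduced_model <- step(full_model, direction = "backward", trace = 0)
formula(reduced_model)

# Information criteria for the full and selected models
AIC(full_model, reduced_model)
BIC(full_model, reduced_model)
```

If `step()` returns the same formula as the full model, no term could be removed without worsening AIC, which corresponds to the case where no further variable selection is needed.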