This project explores which factors influenced survival on the Titanic, using real passenger data. If you’ve seen the movie, you probably remember “women and children first”, but was that actually true? Using logistic regression, I test which factors really mattered, including gender, age, and passenger class.
The main research question for this analysis is: what factors significantly influence the probability of survival?
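This dataset is available in R through the titanic package. The key variables used in my analysis are Survived (the outcome, coded 0 or 1), Sex, Age, Pclass (passenger class, 1 to 3), and Fare.
First, I loaded the data and inspected it using glimpse().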
## Load packages ##
library(titanic)
library(tidyverse)
## Inspect data ##
titanic = titanic_train %>%
  as_tibble()
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
I noticed a missing value (NA) in the Age variable, so I checked how many values are missing in each column.
## Missing Values ##
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
The Age variable has 177 missing values.
To address this, I applied multiple imputation with the “mice” package, using predictive mean matching (method = "pmm"). This approach fills in each missing value using patterns in the other variables, which tends to give more realistic estimates than dropping incomplete rows.
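The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration on made-up data, not what mice does internally: the variables x and y and the helper pmm_draw are hypothetical, and mice additionally draws the regression parameters at random so that the imputations reflect model uncertainty.

```r
set.seed(1)

# Made-up data: x is fully observed, y has 10 missing values
x <- rnorm(50)
y <- 2 * x + rnorm(50)
y[sample(50, 10)] <- NA
obs <- !is.na(y)

# Step 1: regress y on x using the observed cases, then predict for everyone
fit  <- lm(y ~ x)                      # lm() drops rows with NA by default
pred <- predict(fit, newdata = data.frame(x = x))

# Step 2: for each missing case, find the k observed cases whose predictions
# are closest, and borrow the observed y of one of them at random
pmm_draw <- function(i, k = 5) {
  nearest <- order(abs(pred[obs] - pred[i]))[seq_len(k)]
  y[obs][nearest[sample.int(k, 1)]]
}

y_imputed <- y
y_imputed[!obs] <- vapply(which(!obs), pmm_draw, numeric(1))
```

Because every imputed value is borrowed from an observed case, the filled-in values always stay within the range of real data. For Age, this means an imputed age is always an age that some other passenger actually had.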
library(mice)
imputed_data <- mice(
  titanic,
  m = 5,
  method = "pmm",
  seed = 123
)
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
complete_data <- complete(imputed_data, 1)
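This gives a complete dataset with no missing values (NA). Note that complete(imputed_data, 1) keeps only the first of the five imputed datasets, so the models in this analysis are fitted to a single imputation rather than pooled across all five.
Then, I visualized the outcome, “Survived”, using a histogram. Because the outcome is binary (each passenger either survived or did not), I used logistic regression, that is, a generalized linear model with a binomial distribution, for the rest of the analysis.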
hist(complete_data$Survived)
Next, I estimated a baseline model without including any predictors. This model assumes that all passengers have the same probability of survival, regardless of their characteristics.
## Fit Baseline Model ##
fit.mean = glm(
  Survived ~ 1,
  data = complete_data,
  family = binomial
)
summary(fit.mean)
##
## Call:
## glm(formula = Survived ~ 1, family = binomial, data = complete_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.47329 0.06889 -6.87 6.4e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 1186.7 on 890 degrees of freedom
## AIC: 1188.7
##
## Number of Fisher Scoring iterations: 4
coef(fit.mean)
## (Intercept)
## -0.4732877
The estimate is on the logit (log-odds) scale, which is hard to interpret directly, so I transformed it into a probability using the logistic function.
plogis(coef(fit.mean))
## (Intercept)
## 0.3838384
After the transformation, the baseline probability of survival is approximately 38%, meaning that about 38% of passengers survived overall.
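To make the transformation concrete, here is a small sketch of what plogis() does, using the intercept printed above. The helper inv_logit is a hypothetical name introduced only for this illustration, and the count of 342 survivors is implied by the output (0.3838384 × 891) rather than printed directly.

```r
# Intercept from the baseline model output above (log-odds of survival)
b0 <- -0.4732877

# The logistic function written out; plogis() computes the same thing
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- inv_logit(b0)
p                          # about 0.384
all.equal(p, plogis(b0))   # TRUE

# Going the other way: a proportion of 342 / 891 has these log-odds
qlogis(342 / 891)          # about -0.473
```

In an intercept-only logistic regression, the transformed intercept is simply the overall proportion of survivors in the data.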
Next, I added gender as a predictor to examine whether survival differed between males and females.
## Fit Single Predictor Model ##
fit.sex = glm(
  Survived ~ Sex,
  data = complete_data,
  family = binomial
)
summary(fit.sex)
##
## Call:
## glm(formula = Survived ~ Sex, family = binomial, data = complete_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0566 0.1290 8.191 2.58e-16 ***
## Sexmale -2.5137 0.1672 -15.036 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 917.8 on 889 degrees of freedom
## AIC: 921.8
##
## Number of Fisher Scoring iterations: 4
coef(fit.sex)
## (Intercept) Sexmale
## 1.056589 -2.513710
The results are again presented in logit form, so I transformed them into probabilities to make them easier to interpret.
plogis(coef(fit.sex))
## (Intercept) Sexmale
## 0.74203822 0.07490265
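After transformation, the intercept shows that female passengers (the reference group) had a survival probability of about 74%. The transformed Sexmale value (0.075) is not the male survival probability, because that coefficient is a difference in log-odds relative to females. Adding it to the intercept before transforming gives plogis(1.0566 - 2.5137) ≈ 0.19, so male passengers had a survival probability of about 19%. Gender is therefore a very strong predictor of survival, which is consistent with the idea that women were prioritized during evacuation.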
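The group probabilities can be reproduced from the two coefficients printed above. This is a minimal sketch that copies the estimates by hand; the names b, p_female, and p_male are introduced here only for illustration.

```r
# Coefficients copied from the fit.sex output above (log-odds scale)
b <- c("(Intercept)" = 1.056589, "Sexmale" = -2.513710)

# Females are the reference group, so only the intercept applies
p_female <- plogis(b[["(Intercept)"]])

# For males, add the coefficient to the intercept BEFORE transforming
p_male <- plogis(b[["(Intercept)"]] + b[["Sexmale"]])

round(c(female = p_female, male = p_male), 3)
# female 0.742, male 0.189
```

With the fitted model in memory, predict(fit.sex, newdata = data.frame(Sex = c("female", "male")), type = "response") should return the same two probabilities without any manual arithmetic.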
To better illustrate this difference, I also visualized survival by gender.
## Visualization ##
complete_data %>%
  ggplot(aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Survival by Gender",
    x = "Gender",
    y = "Count",
    fill = "Survived"
  ) +
  theme_minimal()
After fitting the model with gender, I evaluated how well it performs. To do this, I used the performance package to calculate Tjur’s R², which is the difference between the average predicted survival probability of passengers who survived and that of passengers who did not. The larger this difference, the better the model separates the two groups.
library(performance)
r2(fit.sex)
## # R2 for Logistic Regression
## Tjur's R2: 0.295
Tjur’s R² is 0.295, indicating a moderate ability to separate survivors from non-survivors.
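For a model with gender as the only predictor, Tjur’s R² can be worked out by hand, which shows what the number means. The four counts below are an assumption in the sense that they are not printed in the output above; they are chosen to match the fitted probabilities (233/314 ≈ 0.742 for females, 109/577 ≈ 0.189 for males) and the total of 891 passengers, and can be checked with table(complete_data$Sex, complete_data$Survived).

```r
# Passenger counts by sex and outcome (assumed, see the note above)
n <- c(female_surv = 233, female_died = 81, male_surv = 109, male_died = 468)

# In a sex-only logistic model, the fitted probability for each group
# equals that group's observed survival proportion
p_female <- n[["female_surv"]] / (n[["female_surv"]] + n[["female_died"]])
p_male   <- n[["male_surv"]]   / (n[["male_surv"]]   + n[["male_died"]])

# Average fitted probability among survivors and among non-survivors
mean_surv <- (n[["female_surv"]] * p_female + n[["male_surv"]] * p_male) /
  (n[["female_surv"]] + n[["male_surv"]])
mean_died <- (n[["female_died"]] * p_female + n[["male_died"]] * p_male) /
  (n[["female_died"]] + n[["male_died"]])

# Tjur's R2 is the difference between the two averages
tjur_r2 <- mean_surv - mean_died
round(tjur_r2, 3)   # 0.295
```

Survivors receive an average predicted probability of about 0.57 and non-survivors about 0.27, and the gap between them is the reported value.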
I also checked the model diagnostics using the DHARMa package.
library(DHARMa)
sim_res <- simulateResiduals(fit.sex)
plot(sim_res)
The diagnostic plots show no major issues, which suggests that the model fits the data adequately.
Next, I extended the model by including multiple predictors: gender, age, passenger class, and fare.
fit.mlr <- glm(
  Survived ~ Sex + Age + Pclass + Fare,
  data = complete_data,
  family = binomial
)
summary(fit.mlr)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fare, family = binomial,
## data = complete_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.5205972 0.4876536 9.270 < 2e-16 ***
## Sexmale -2.5828983 0.1869374 -13.817 < 2e-16 ***
## Age -0.0291686 0.0064845 -4.498 6.85e-06 ***
## Pclass -1.1449558 0.1349913 -8.482 < 2e-16 ***
## Fare 0.0003465 0.0020262 0.171 0.864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 805.41 on 886 degrees of freedom
## AIC: 815.41
##
## Number of Fisher Scoring iterations: 5
The coefficients are again on the logit scale, so I exponentiated them to obtain odds ratios, which are easier to interpret.
exp(coef(fit.mlr))
## (Intercept) Sexmale Age Pclass Fare
## 91.89045829 0.07555471 0.97125270 0.31823798 1.00034660
The results show that gender, age, and passenger class are significant predictors of survival. Holding the other predictors constant, being male reduces the odds of survival by approximately 92%, which confirms that gender has a very strong effect. Age also has a significant effect: each additional year decreases the odds of survival by about 3%. Passenger class plays a major role as well: each step down in class (for example, from first to second) reduces the odds of survival by about 68%. In contrast, fare was not a significant predictor (p = 0.864), possibly because its effect is already captured by passenger class.
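The percentages follow from the odds ratios as (odds ratio - 1) × 100. A short sketch with the coefficients copied from the output above; the names b, odds_ratio, and pct_change are introduced only for this illustration.

```r
# Coefficients copied from the fit.mlr output above (log-odds scale)
b <- c(Sexmale = -2.5828983, Age = -0.0291686,
       Pclass = -1.1449558, Fare = 0.0003465)

odds_ratio <- exp(b)

# Percent change in the odds of survival per one-unit increase in the predictor
pct_change <- (odds_ratio - 1) * 100

round(data.frame(odds_ratio, pct_change), 2)

# Effects multiply on the odds scale: ten additional years of age
exp(10 * b[["Age"]])   # about 0.75, i.e. roughly 25% lower odds
```

These are changes in odds, not in probability. A 92% reduction in odds does not mean that the survival probability drops by 92 percentage points.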
After fitting the full model, I evaluated how well it performs using Tjur’s R².
r2(fit.mlr)
## # R2 for Logistic Regression
## Tjur's R2: 0.390
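Tjur’s R² is 0.390, up from 0.295 for the gender-only model, which indicates that the additional predictors improve the model’s ability to separate survivors from non-survivors.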
I also assessed the model diagnostics.
sim_res_full = simulateResiduals(fit.mlr)
plot(sim_res_full)
The diagnostic results show no significant issues. This suggests that the model fits the data adequately and that the results are reasonably reliable.
Finally, I used predicted probabilities to better understand how survival changes across different variables. Instead of interpreting log-odds, this approach allows us to see results directly in terms of probabilities, which are easier to interpret.
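The models in this analysis are fitted to the first imputed dataset only, via complete(imputed_data, 1). One possible way to use all five imputations is to fit the model in each of them and pool the estimates with the with() and pool() functions from mice. The sketch below is a variation on the analysis, not a reproduction of it: restricting the data to the five modelling variables and converting Sex to a factor are my assumptions, so the imputed ages (and therefore the estimates) may differ slightly from those above.

```r
library(titanic)
library(mice)

# Keep only the modelling variables (an assumption made for this sketch;
# the analysis above passes the full data set to mice())
dat <- titanic_train[, c("Survived", "Sex", "Age", "Pclass", "Fare")]
dat$Sex <- factor(dat$Sex)

# Same imputation settings as above, with the iteration log switched off
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same logistic regression in each of the five completed data sets
fits <- with(imp, glm(Survived ~ Sex + Age + Pclass + Fare, family = binomial))

# Combine the five sets of estimates using Rubin's rules
pooled <- pool(fits)
summary(pooled)
```

The pooled standard errors include the extra uncertainty that comes from not knowing the missing ages, which a single completed dataset ignores.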
For this step, I used the ggeffects package.
library(ggeffects)
pred <- predict_response(fit.mlr, terms = c("Age", "Sex"))
plot(pred)
The visualization shows that survival probability decreases as age increases for both males and females. However, females consistently have a much higher probability of survival across all age groups.
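The same kind of predicted probability can be computed by hand from the coefficients, which makes clear what the plot is showing. This is a sketch under stated assumptions: predict_survival is a hypothetical helper, and second class with a fare of 20 are arbitrary values I picked for illustration. ggeffects chooses its own values for the predictors that are not plotted, so these numbers need not match the plot exactly.

```r
# Coefficients copied from the fit.mlr output above (log-odds scale)
b <- c(intercept = 4.5205972, male = -2.5828983, age = -0.0291686,
       pclass = -1.1449558, fare = 0.0003465)

# Hypothetical helper: predicted survival probability for a passenger profile
predict_survival <- function(male, age, pclass, fare) {
  eta <- b[["intercept"]] + b[["male"]] * male + b[["age"]] * age +
    b[["pclass"]] * pclass + b[["fare"]] * fare   # linear predictor (log-odds)
  plogis(eta)                                     # convert to probability
}

ages <- c(10, 30, 50)

# Second class and a fare of 20 are arbitrary illustrative values
res <- data.frame(
  age    = ages,
  female = predict_survival(male = 0, age = ages, pclass = 2, fare = 20),
  male   = predict_survival(male = 1, age = ages, pclass = 2, fare = 20)
)
res
```

Within each column the probability falls as age rises, and in every row the female probability is higher than the male one, which is the pattern described above.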
Overall, the analysis shows that survival on the Titanic was strongly influenced by gender, age, and passenger class. Females and higher-class passengers had significantly higher chances of survival, while older passengers were less likely to survive.