Titanic Survival Analysis

Introduction

This project explores which factors influenced survival on the Titanic, using real passenger data. If you’ve seen the movie, you probably remember “women and children first”, but was that actually true? Using logistic regression, I test which factors really mattered, including gender, age, and passenger class.

The main research question for this analysis is: what factors significantly influence the probability of survival?

The dataset is available in R through the titanic package (as titanic_train). The key variables used in my analysis are the following.

Survived - indicates whether the passenger survived (1) or not (0)
Sex - the gender of the passenger (male or female)
Age - the age of the passenger in years
Pclass - passenger class (1st, 2nd or 3rd class), which reflects socioeconomic status
Fare - the price of the ticket paid by the passenger

First, I loaded the data and inspected it using glimpse().

## Load packages ##

library(titanic)
library(tidyverse)

## Inspect data ##

titanic <- titanic_train %>%
  as_tibble()

glimpse(titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

The glimpse() output shows a missing value (NA) in the Age variable, so I counted the missing values in every column.

## Missing Values ##

colSums(is.na(titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

The Age variable has 177 missing values out of 891 rows; no other column contains NA.
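One caveat: is.na() only counts values stored as NA. The glimpse() output shows that Cabin contains empty strings (""), and these are not counted as missing. A minimal sketch of the difference, using only the first ten Age and Cabin values printed by glimpse() above (sample_rows is a name introduced here for illustration, not the full dataset):

```r
# First ten Age and Cabin values, copied from the glimpse() output
sample_rows <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14),
  Cabin = c("", "C85", "", "C123", "", "", "E46", "", "", ""),
  stringsAsFactors = FALSE
)

# NA counts per column: empty strings are not flagged
n_na <- colSums(is.na(sample_rows))

# Empty strings have to be counted explicitly
n_empty <- sum(sample_rows$Cabin == "")

n_na
n_empty
```

Cabin is not used in the models in this analysis, so this does not affect the results, but it is worth knowing before using that column.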

To address this, I applied multiple imputation with the “mice” package, using predictive mean matching (method = "pmm"). This approach fills in each missing value using patterns from the other variables, which makes the imputed values more realistic than replacing every NA with a single constant such as the mean.

library(mice)


imputed_data <- mice(
  titanic,
  m = 5,
  method = "pmm",
  seed = 123
)

## 
##  iter imp variable
##   1   1  Age
##   1   2  Age
##   1   3  Age
##   1   4  Age
##   1   5  Age
##   2   1  Age
##   2   2  Age
##   2   3  Age
##   2   4  Age
##   2   5  Age
##   3   1  Age
##   3   2  Age
##   3   3  Age
##   3   4  Age
##   3   5  Age
##   4   1  Age
##   4   2  Age
##   4   3  Age
##   4   4  Age
##   4   5  Age
##   5   1  Age
##   5   2  Age
##   5   3  Age
##   5   4  Age
##   5   5  Age

complete_data <- complete(imputed_data, 1)

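complete(imputed_data, 1) returns the first of the five imputed datasets, which I use as the complete data for the rest of the analysis. Age now has no missing values (NA).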
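Using a single completed dataset ignores the uncertainty between the five imputations. A common alternative is to fit the model on every imputed dataset and pool the estimates. The sketch below is an illustration of that idea and not the approach used in this analysis; it assumes the mice and titanic packages are installed, and the names dat, imp, fits and pooled are introduced here. It also restricts the data to the model variables and converts Sex to a factor, which differs from the call above.

```r
library(mice)
library(titanic)

# Keep only the variables used in the model (an assumption made for this sketch)
dat <- titanic_train[, c("Survived", "Pclass", "Sex", "Age", "Fare")]
dat$Sex <- factor(dat$Sex)

# Five imputations with predictive mean matching, as in the analysis
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same logistic regression on each imputed dataset
fits <- with(imp, glm(Survived ~ Sex + Age + Pclass + Fare, family = binomial))

# Combine the five sets of estimates into pooled estimates
pooled <- pool(fits)
summary(pooled)
```

The pooled standard errors account for the variation between imputations, so they can be somewhat larger than those from a single completed dataset.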

Visualisation

Next, I visualized the outcome variable, Survived, using a histogram. Because the outcome is binary (each passenger either survived or did not), I used a logistic regression model with a binomial distribution for the rest of the analysis.

hist(complete_data$Survived)
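For a 0/1 variable, a frequency table shows the same information as the histogram in numbers. A small sketch using only the first 20 Survived values printed in the glimpse() output above (survived_head is a name introduced here; this is not the full dataset):

```r
# First 20 Survived values, copied from the glimpse() output
survived_head <- c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1,
                   1, 1, 0, 0, 0, 1, 0, 1, 0, 1)

# Counts and proportions of each outcome
counts <- table(survived_head)
shares <- prop.table(counts)

counts
shares
```

The same two calls applied to complete_data$Survived would give the counts for all 891 passengers.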

Baseline Model

Next, I estimated a baseline model without including any predictors. This model assumes that all passengers have the same probability of survival, regardless of their characteristics.

## Fit Baseline Model ##

fit.mean = glm(
    Survived ~ 1,
    data = complete_data,
    family = binomial
)

summary(fit.mean)

## 
## Call:
## glm(formula = Survived ~ 1, family = binomial, data = complete_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.47329    0.06889   -6.87  6.4e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance: 1186.7  on 890  degrees of freedom
## AIC: 1188.7
## 
## Number of Fisher Scoring iterations: 4

coef(fit.mean)

## (Intercept) 
##  -0.4732877

The model result is on the logit (log-odds) scale, which is hard to interpret directly, so I transformed it into a probability using the logistic function.

plogis(coef(fit.mean))

## (Intercept) 
##   0.3838384

After the transformation, the baseline probability of survival is approximately 38%, meaning that about 38% of passengers survived overall.
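To make the transformation explicit, the sketch below applies the logistic function by hand to the intercept reported above and compares it with plogis(). The variable names are introduced here for illustration.

```r
# Intercept of the baseline model, copied from the output above
b0 <- -0.4732877

# Logistic function written out: p = 1 / (1 + exp(-logit))
p_manual <- 1 / (1 + exp(-b0))

# Built-in equivalent
p_builtin <- plogis(b0)

# The same intercept expressed as odds of survival
odds <- exp(b0)

c(manual = p_manual, builtin = p_builtin, odds = odds)
```

Both routes give the same probability, and the odds are below 1 because fewer than half of the passengers survived.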

Single Predictor

Next, I added gender as a predictor to examine whether survival differed between males and females.

## Fit Single Predictor Model ##

fit.sex = glm(
     Survived ~ Sex,
     data = complete_data,
     family = binomial
)

summary(fit.sex)

## 
## Call:
## glm(formula = Survived ~ Sex, family = binomial, data = complete_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0566     0.1290   8.191 2.58e-16 ***
## Sexmale      -2.5137     0.1672 -15.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  917.8  on 889  degrees of freedom
## AIC: 921.8
## 
## Number of Fisher Scoring iterations: 4

coef(fit.sex)

## (Intercept)     Sexmale 
##    1.056589   -2.513710

The results are again presented in logit form, so I transformed them into probabilities to make them easier to interpret.

plogis(coef(fit.sex))

## (Intercept)     Sexmale 
##  0.74203822  0.07490265

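After transformation, the intercept gives the survival probability for female passengers (the reference category): about 74%. The second value in the output, 0.075, is not the survival probability for male passengers, because the Sexmale coefficient is a difference in log-odds and cannot be transformed on its own. The male probability comes from adding the coefficient to the intercept before transforming: plogis(1.056589 - 2.513710), which is about 19%. This large gap indicates that gender is a very strong predictor of survival and is consistent with the idea that women were prioritized during evacuation.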
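The calculation for both groups can be sketched as follows, using the coefficients printed by coef(fit.sex) above. The vector b and the other names are introduced here for illustration.

```r
# Coefficients of the gender-only model, copied from the output above
b <- c(intercept = 1.056589, sexmale = -2.513710)

# Female is the reference category: intercept only
p_female <- plogis(b[["intercept"]])

# Male: add the coefficient to the intercept, then transform
p_male <- plogis(b[["intercept"]] + b[["sexmale"]])

# The coefficient on its own is best read as an odds ratio
odds_ratio_male <- exp(b[["sexmale"]])

round(c(female = p_female, male = p_male, odds_ratio = odds_ratio_male), 3)
```

The odds ratio says that the odds of survival for males are a small fraction of the odds for females, which is a different quantity from the male survival probability.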

To better illustrate this difference, I also visualized survival by gender.

## Visualization ##


complete_data %>%
  ggplot(aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Survival by Gender",
    x = "Gender",
    y = "Count",
    fill = "Survived"
  ) +
  theme_minimal()

Model Fit and Diagnostics (Single Predictor Model)

After fitting the model with gender, I evaluated how well it performs. To do this, I used the performance package to calculate Tjur’s R², which is the difference between the average predicted survival probability of passengers who survived and that of passengers who did not.

library(performance)

r2(fit.sex)

## # R2 for Logistic Regression
##   Tjur's R2: 0.295

Tjur’s R² is 0.295, indicating a moderate ability to separate survivors from non-survivors.
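Tjur’s R² can also be computed by hand as the difference in mean fitted probability between the two outcome groups. The sketch below shows this on the built-in mtcars data as a stand-in (am ~ wt is an arbitrary example model chosen here, not part of the Titanic analysis), so that it runs without extra packages.

```r
# Stand-in logistic regression on a built-in dataset
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Fitted probabilities for every observation
p_hat <- fitted(fit)

# Tjur's R2: mean fitted probability among the 1s minus that among the 0s
tjur <- mean(p_hat[mtcars$am == 1]) - mean(p_hat[mtcars$am == 0])

tjur
```

Applying the same two lines to fit.sex and complete_data$Survived should reproduce the value reported by r2().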

I also checked the model diagnostics using the DHARMa package.

library(DHARMa)

sim_res <- simulateResiduals(fit.sex)
plot(sim_res)

The diagnostic plots show no major issues, which suggests that the model fits the data adequately.

Multiple Predictors Model

Next, I extended the model by including multiple predictors: gender, age, passenger class, and fare.

fit.mlr <- glm(
  Survived ~ Sex + Age + Pclass + Fare,
  data = complete_data,
  family = binomial
)

summary(fit.mlr)

## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fare, family = binomial, 
##     data = complete_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.5205972  0.4876536   9.270  < 2e-16 ***
## Sexmale     -2.5828983  0.1869374 -13.817  < 2e-16 ***
## Age         -0.0291686  0.0064845  -4.498 6.85e-06 ***
## Pclass      -1.1449558  0.1349913  -8.482  < 2e-16 ***
## Fare         0.0003465  0.0020262   0.171    0.864    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  805.41  on 886  degrees of freedom
## AIC: 815.41
## 
## Number of Fisher Scoring iterations: 5

The coefficients are again on the logit scale, so I exponentiated them to obtain odds ratios, which are easier to interpret.

exp(coef(fit.mlr))

## (Intercept)     Sexmale         Age      Pclass        Fare 
## 91.89045829  0.07555471  0.97125270  0.31823798  1.00034660

The results show that gender, age, and passenger class are significant predictors of survival. Holding the other variables constant, being male reduces the odds of survival by approximately 92%, which is consistent with gender being the strongest predictor. Age also has a significant effect: each additional year decreases the odds of survival by about 3%. Passenger class plays a major role as well: each step down in class (for example, from 1st to 2nd) reduces the odds of survival by about 68%. In contrast, fare was not a significant predictor (p = 0.864), possibly because its effect is already captured by passenger class.
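The percentages above come from the odds ratios as (odds ratio - 1) × 100. A short sketch of that arithmetic, using the coefficients from summary(fit.mlr) above; the names b, odds_ratio, pct_change and pct_change_age10 are introduced here for illustration.

```r
# Slope coefficients of the full model, copied from the output above
b <- c(Sexmale = -2.5828983, Age = -0.0291686,
       Pclass = -1.1449558, Fare = 0.0003465)

# Odds ratios and the implied percentage change in the odds
odds_ratio <- exp(b)
pct_change <- (odds_ratio - 1) * 100

round(pct_change, 1)

# Effects multiply on the odds scale: a ten-year age difference
pct_change_age10 <- (exp(10 * b[["Age"]]) - 1) * 100
round(pct_change_age10, 1)
```

Because the effects multiply on the odds scale, the change for ten years of age is not simply ten times the one-year change.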

Model Fit and Diagnostics (Multiple Predictors Model)

After fitting the full model, I evaluated how well it performs using Tjur’s R².

r2(fit.mlr)

## # R2 for Logistic Regression
##   Tjur's R2: 0.390

Tjur’s R² is 0.390, higher than the 0.295 of the gender-only model, indicating that the full model separates survivors from non-survivors better.

I also assessed the model diagnostics.

sim_res_full = simulateResiduals(fit.mlr)
plot(sim_res_full)

The diagnostic results show no significant issues. This suggests that the model provides an adequate fit to the data and supports the interpretation of the results above.

Predicted Probabilities and Visualisation

Finally, I used predicted probabilities to better understand how survival changes across different variables. Instead of interpreting coefficients on the logit scale, this approach shows the results directly as probabilities, which are easier to interpret.

For this step, I used the ggeffects package.

library(ggeffects)

pred <- predict_response(fit.mlr, terms = c("Age", "Sex"))
plot(pred)

The visualization shows that survival probability decreases as age increases for both males and females. However, females consistently have a much higher probability of survival across all age groups.
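A predicted probability is the logistic function applied to the linear predictor. The sketch below computes a few by hand from the fit.mlr coefficients reported above. The passengers are hypothetical (2nd class, fare of 20, ages 10, 30 and 50), and predict_survival() is a helper written for this illustration, not a function from ggeffects.

```r
# Coefficients of the full model, copied from the output above
b <- c(Intercept = 4.5205972, Sexmale = -2.5828983, Age = -0.0291686,
       Pclass = -1.1449558, Fare = 0.0003465)

# Hypothetical helper: linear predictor -> probability
predict_survival <- function(male, age, pclass, fare) {
  eta <- b[["Intercept"]] + b[["Sexmale"]] * male + b[["Age"]] * age +
    b[["Pclass"]] * pclass + b[["Fare"]] * fare
  plogis(eta)
}

# Hypothetical passengers: 2nd class, fare 20, three ages
ages <- c(10, 30, 50)
p_female <- predict_survival(male = 0, age = ages, pclass = 2, fare = 20)
p_male   <- predict_survival(male = 1, age = ages, pclass = 2, fare = 20)

round(data.frame(age = ages, female = p_female, male = p_male), 3)
```

Because the model has no interaction between Age and Sex, the two curves differ by a constant amount on the logit scale, which is why the female curve stays above the male curve at every age.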

Conclusion

Overall, the analysis shows that survival on the Titanic was strongly influenced by gender, age, and passenger class. Females and higher-class passengers had significantly higher chances of survival, while older passengers were less likely to survive.