Titanic Data Analysis using R

Author

Md Ahsanul Islam

Published

October 6, 2024

1 Project Setup

Loading required packages:

Code
library(dplyr)
library(ggplot2)
library(RColorBrewer) # for color palette

Setting some default behaviors:

Code
theme_set(
    # globally set ggplot2 theme to bw
    theme_bw() +   
    theme(
        axis.text = element_text(size = 12),     # axis text size
        axis.title = element_text(size = 14),    # axis title size
        legend.title = element_text(size = 14),  # legend title
        legend.text = element_text(size = 12),   # legend text
        strip.text = element_text(size = 14),    # facet title
        plot.title = element_text(size = 16, hjust = 0.5)  # plot title
        )
    )  
options(ggplot2.discrete.fill = brewer.pal(12, "Paired")) # globally set color palette

1.1 Data Import

Data description
  • Survival : 0 = No, 1 = Yes
  • Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex : Sex
  • Age : Age
  • SibSp : # of siblings / spouses onboard the Titanic
  • Parch : # of parents / children onboard the Titanic (Some children travelled only with a nanny, therefore parch=0 for them)
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Code
titanic <- read.csv("titanic.csv")

Inspecting first few rows:

Code
head(titanic)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 Q

Inspecting the structure of the dataset:

Code
str(titanic)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

Inspect summary of the datset:

Code
summary(titanic)
  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
                                                                    
     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare           Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33                                        
                                                                          
Warning

Age has missing values.

1.2 Data Cleaning

Code
df <- titanic %>%
  mutate(Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
         Pclass = factor(Pclass, levels = c(1, 2, 3), labels = c("1st", "2nd", "3rd")),
         Embarked = factor(Embarked, levels = c("C", "Q", "S"), 
                           labels = c("Cherbourg", "Queenstown", "Southampton")),
         Sex = factor(Sex, levels = c("male", "female"), labels = c("Male", "Female")),
         FamilySize = SibSp + Parch + 1   # including person himself/herself
         ) %>%
  na.omit()   # omitting rows with missing values in any column

I have omitted rows with missing values in any of the columns for the sake of this analysis. But this is not recommended since the missing values themselves could have valuable information such as age information from specific class may not be collected, etc.

2 Survival Rate by Biological Gender

Problem statement: I need to check whether there is any difference in the survival rate solely based on gender of the passengers.

Code
survival_rate_sex <- df %>% 
  group_by(Sex, Survived) %>% 
  summarise(Freq = n(), .groups = "drop") %>%
  group_by(Sex) %>% 
  mutate(Total = sum(Freq),
         SurvivalRate = Freq/Total)
survival_rate_sex
Sex Survived Freq Total SurvivalRate
Male No 360 453 0.7947020
Male Yes 93 453 0.2052980
Female No 64 259 0.2471042
Female Yes 195 259 0.7528958

2.1 Visualization

Code
survival_rate_sex %>% 
  ggplot(aes(x = Sex, y = SurvivalRate, fill = Survived)) +
  geom_bar(stat = "identity", position = "fill") +     # Stack and scale to 100%
  geom_text(aes(label = scales::percent(SurvivalRate)),  # Display percentage labels 
  position = position_fill(vjust = 0.5)) +
  labs(x = NULL, y = "Survival Rate", title = "Survival Rate by Sex") + 
  scale_y_continuous(labels = scales::percent)    # Y-axis in percentages

Code
survival_rate_sex %>% 
  filter(Survived == "Yes") %>%
  ggplot(aes(x = Sex, y = SurvivalRate)) +
  geom_point(size = 4) + 
  geom_segment(aes(xend = Sex, yend = 0), 
               linewidth = 1, alpha = 0.4, color = "black") +
  geom_text(aes(label = paste0(round(SurvivalRate*100, 2),"%")), 
            nudge_y = 0.04, size = 4, fontface = "bold") +
  labs(x = NULL, y = "Survival Rate", title = "Survival Rate by Sex") + 
  scale_y_continuous(labels = scales::percent) +
  guides(fill = "none")

2.2 Proportion Test

Code
prop.test(x = survival_rate_sex$Freq[survival_rate_sex$Survived == "Yes"], 
n = survival_rate_sex$Total[survival_rate_sex$Survived == "Yes"], 
alternative = "less")

    2-sample test for equality of proportions with continuity correction

data:  survival_rate_sex$Freq[survival_rate_sex$Survived == "Yes"] out of survival_rate_sex$Total[survival_rate_sex$Survived == "Yes"]
X-squared = 202.87, df = 1, p-value < 2.2e-16
alternative hypothesis: less
95 percent confidence interval:
 -1.0000000 -0.4905463
sample estimates:
   prop 1    prop 2 
0.2052980 0.7528958 
Findings

The survival rate for males is lower than females.

2.3 Association between Gender and Survival

Problem Statement: Perform a chi-square test to determine if there is a significant association between gender (Sex) and survival.

Code
table(df$Sex, df$Survived)
        
          No Yes
  Male   360  93
  Female  64 195
Code
chisq.test(df$Sex, df$Survived, correct = FALSE)

    Pearson's Chi-squared test

data:  df$Sex and df$Survived
X-squared = 205.14, df = 1, p-value < 2.2e-16
Code
# Alternatively
chisq.test(table(df$Sex, df$Survived), correct = FALSE)

    Pearson's Chi-squared test

data:  table(df$Sex, df$Survived)
X-squared = 205.14, df = 1, p-value < 2.2e-16
Findings

SInce the p-value is less than 0.05, it can concluded that there is a significant association between gender and survival.

3 Survival Rate by Passenger Class

Problem Statement: Investigate the relationship between passenger class (Pclass) and survival.

Code
survival_rate_pclass <- df %>% 
  group_by(Pclass, Survived) %>% 
  summarise(Freq = n(), .groups = "drop") %>% 
  group_by(Pclass) %>% 
  mutate(Total = sum(Freq),
         SurvivalRate = Freq/Total) 
survival_rate_pclass
Pclass Survived Freq Total SurvivalRate
1st No 64 184 0.3478261
1st Yes 120 184 0.6521739
2nd No 90 173 0.5202312
2nd Yes 83 173 0.4797688
3rd No 270 355 0.7605634
3rd Yes 85 355 0.2394366

3.1 Visualization

Code
survival_rate_pclass %>% 
  ggplot(aes(x = Pclass, y = Freq, fill = Survived)) +
  geom_bar(stat = "identity", position = "fill") +  
  geom_text(aes(label = scales::percent(SurvivalRate)), 
            position = position_fill(vjust = 0.5)) + 
  scale_y_continuous(labels = scales::percent) +  
  labs(x = "Passenger Class", y = "Percentage", fill = "Survived", 
       title = "Survival Rate by Passenger Class") + 
  coord_flip()

3.2 Pairwise Proportion Test with Bonferroni Correction

Code
pairwise.prop.test(x = survival_rate_pclass$Freq[survival_rate_pclass$Survived == "Yes"], 
                   n = survival_rate_pclass$Total[survival_rate_pclass$Survived == "Yes"], 
                   p.adjust.method = "bonferroni")

    Pairwise comparisons using Pairwise comparison of proportions 

data:  survival_rate_pclass$Freq[survival_rate_pclass$Survived == "Yes"] out of survival_rate_pclass$Total[survival_rate_pclass$Survived == "Yes"] 

  1       2      
2 0.0044  -      
3 < 2e-16 1.4e-07

P value adjustment method: bonferroni 
Findings
  • The first row of first column represents the comparison between Pclass 1 and Pclass 2. The p-value is 0.0044, indicating that the difference in survival rates between these two classes is statistically significant.

  • The second row of first column represents the comparison between Pclass 2 and Pclass 3. The p-value is 1.4e-07, indicating that the difference in survival rates between these two classes is also highly statistically significant.

  • The second row of second column represents the comparison between Pclass 1 and Pclass 3. The p-value is extremely small (< 2e-16), indicating that the difference in survival rates between these two classes is highly statistically significant.

4 Survival Rate by Age

Problem statement: Analyze the age distribution of passengers to see average age of survivors and deaths.

Code
df %>%
  group_by(Survived) %>%
  summarise(`Average Age` = mean(Age, na.rm = TRUE))
Survived Average Age
No 30.62618
Yes 28.19330
Note

Average age of survived passengers is 28.2 years and average age of passengers who did not survive is 30.6 years.0

4.1 Visualization

Code
df %>% 
  ggplot(aes(x = Survived, y = Age)) + 
  geom_boxplot() + 
  labs(x = "Survived", y = "Age", title = "Boxplot of Age by Survival")

Code
df %>% 
  ggplot(aes(fill = Survived, x = Age)) + 
  geom_density(alpha = 0.5) + 
  labs(x = "Age", y = "Density", fill = "Survived",
title = "Age Distribution by Survival")

4.2 Two-sample t-test

Code
t.test(Age ~ Survived, data = df)

    Welch Two Sample t-test

data:  Age by Survived
t = 2.1845, df = 596.68, p-value = 0.02931
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 0.2456156 4.6201457
sample estimates:
 mean in group No mean in group Yes 
         30.62618          28.19330 
Findings

The average age of survived passengers is higher than that of passengers who did not survive. The p-value is 0.02931, indicating that the difference in means is statistically significant.

5 Survival Rate by Family Size

Problem Statement: Visually analyze how family size affects the likelihood of survival.

Code
survival_rate_family_size <- df %>% 
  group_by(FamilySize) %>% 
  summarise(Survived = sum(Survived=="Yes"),
            Total = n()) %>%
  mutate(SurvivalRate = Survived/Total)
survival_rate_family_size
FamilySize Survived Total SurvivalRate
1 128 402 0.3184080
2 76 139 0.5467626
3 53 93 0.5698925
4 21 27 0.7777778
5 3 11 0.2727273
6 3 22 0.1363636
7 4 12 0.3333333
8 0 6 0.0000000

5.1 Visualization

Code
survival_rate_family_size %>% 
  ggplot(aes(x = as.factor(FamilySize), y = SurvivalRate, fill = SurvivalRate)) + 
  geom_col() + 
  labs(x = "Family Size", y = "Survival Rate", fill = "Survived",
title = "Family Size Distribution by Survival") +
  scale_fill_gradient(low = "red2", high = "darkgreen") +
  scale_y_continuous(labels = scales::percent)

5.2 Logistic Regression

Code
fit <- glm(Survived ~ as.factor(FamilySize), data = df, family = binomial(link = "logit"))
broom::tidy(fit, exponentiate = TRUE) 
term estimate std.error statistic p.value
(Intercept) 0.4671533 0.1070614 -7.1089872 0.0000000
as.factor(FamilySize)2 2.5823413 0.2012288 4.7145154 0.0000024
as.factor(FamilySize)3 2.8363281 0.2352234 4.4320004 0.0000093
as.factor(FamilySize)4 7.4921875 0.4751293 4.2385532 0.0000225
as.factor(FamilySize)5 0.8027344 0.6854163 -0.3205810 0.7485280
as.factor(FamilySize)6 0.3379934 0.6304182 -1.7206498 0.0853144
as.factor(FamilySize)7 1.0703125 0.6216608 0.1093050 0.9129605
as.factor(FamilySize)8 0.0000004 594.1635642 -0.0249173 0.9801209
Findings

Family sizes of 2, 3, and 4 were associated with a positive and significant effect on survival (p < 0.001), with log-odds of 2.58, 2.84, and 7.49, respectively. But, family sizes of 5, 6, 7, and 8 showed no statistically significant effect (p > 0.05). This suggests that smaller family sizes (2–4) are associated with higher chance to survive compared to single person family size, while larger family sizes do not have a meaningful impact compared to single person family size.

6 Visually Inspecting Age Distribution by Pclass and Gender

Code
df %>% 
  ggplot(aes(x = Age, fill = Sex)) + 
  geom_histogram(alpha = 0.45, bins = 30) + 
  facet_grid(Pclass~., labeller = label_both, scales = "free_y") +
  scale_fill_manual(values = c("blue", "red")) +
  labs(x = "Age", y = "Freq.", fill = "Gender", 
       title = "Age Distribution by Class and Gender")

7 Survival Rate by Age Group

Problem Statement: I’ve already investigated if there is any relationship between age and survival rate. But this time I have grouped the passengers by age and see the difference in survival rate.

Code
survival_rate_age_group <- df %>%
  mutate(AgeGroup = ifelse(Age < 18, "Children", ifelse(Age < 65, "Adults", "Elderly"))) %>%
  group_by(AgeGroup) %>%
  summarise(Survived = sum(Survived=="Yes"),
            Total = n(),
            SurvivalRate = Survived/Total) 
survival_rate_age_group
AgeGroup Survived Total SurvivalRate
Adults 226 588 0.3843537
Children 61 113 0.5398230
Elderly 1 11 0.0909091

7.1 Visualization

Code
survival_rate_age_group %>%
  ggplot(aes(x = AgeGroup, y = SurvivalRate, fill = SurvivalRate)) + 
  geom_col() +
  geom_text(aes(label = paste0(scales::percent(SurvivalRate), " (", Survived,"/",Total,")")), 
            nudge_y = 0.03, size = 5) +
  labs(x = "Age Group", y = "Survival Rate", fill = "Age Group",
       title = "Survival Rate by Age Group") +
  scale_fill_gradient(low = "#d7504b", high = "#548220") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  scale_x_discrete(limits = c("Children", "Adults", "Elderly")) +    # manually reposition the bars
  guides(fill = "none")

Code
survival_rate_age_group %>%
  ggplot(aes(x = reorder(AgeGroup, -SurvivalRate),    # arrange the bars from highest to lowest
             y = SurvivalRate, fill = SurvivalRate)) + 
  geom_col() +
  geom_text(aes(label = paste0(scales::percent(SurvivalRate), " (", Survived,"/",Total,")")), 
            nudge_y = 0.03, size = 5) +
  labs(x = "Age Group", y = "Survival Rate", fill = "Age Group",
       title = "Survival Rate by Age Group") +
  scale_fill_gradient(low = "#d7504b", high = "#548220") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  guides(fill = "none")

Findings

The survival rate of children is higher than that of adults and elderlies.