Titanic Data Analysis using R
1 Project Setup
Loading required packages:
Setting some default behaviors:
Code
theme_set(
# globally set ggplot2 theme to bw
theme_bw() +
theme(
axis.text = element_text(size = 12), # axis text size
axis.title = element_text(size = 14), # axis title size
legend.title = element_text(size = 14), # legend title
legend.text = element_text(size = 12), # legend text
strip.text = element_text(size = 14), # facet title
plot.title = element_text(size = 16, hjust = 0.5) # plot title
)
)
options(ggplot2.discrete.fill = brewer.pal(12, "Paired")) # globally set color palette1.1 Data Import
- Survival : 0 = No, 1 = Yes
- Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex : Sex
- Age : Age
- SibSp : # of siblings / spouses onboard the Titanic
- Parch : # of parents / children onboard the Titanic (Some children travelled only with a nanny, therefore parch=0 for them)
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Code
titanic <- read.csv("titanic.csv")Inspecting first few rows:
Code
head(titanic)| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q |
Inspecting the structure of the dataset:
Code
str(titanic)'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
Inspect summary of the datset:
Code
summary(titanic) PassengerId Survived Pclass Name
Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
Median :446.0 Median :0.0000 Median :3.000 Mode :character
Mean :446.0 Mean :0.3838 Mean :2.309
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :891.0 Max. :1.0000 Max. :3.000
Sex Age SibSp Parch
Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
Mode :character Median :28.00 Median :0.000 Median :0.0000
Mean :29.70 Mean :0.523 Mean :0.3816
3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :80.00 Max. :8.000 Max. :6.0000
NA's :177
Ticket Fare Cabin Embarked
Length:891 Min. : 0.00 Length:891 Length:891
Class :character 1st Qu.: 7.91 Class :character Class :character
Mode :character Median : 14.45 Mode :character Mode :character
Mean : 32.20
3rd Qu.: 31.00
Max. :512.33
Age has missing values.
1.2 Data Cleaning
Code
df <- titanic %>%
mutate(Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
Pclass = factor(Pclass, levels = c(1, 2, 3), labels = c("1st", "2nd", "3rd")),
Embarked = factor(Embarked, levels = c("C", "Q", "S"),
labels = c("Cherbourg", "Queenstown", "Southampton")),
Sex = factor(Sex, levels = c("male", "female"), labels = c("Male", "Female")),
FamilySize = SibSp + Parch + 1 # including person himself/herself
) %>%
na.omit() # omitting rows with missing values in any column2 Survival Rate by Biological Gender
Problem statement: I need to check whether there is any difference in the survival rate solely based on gender of the passengers.
Code
| Sex | Survived | Freq | Total | SurvivalRate |
|---|---|---|---|---|
| Male | No | 360 | 453 | 0.7947020 |
| Male | Yes | 93 | 453 | 0.2052980 |
| Female | No | 64 | 259 | 0.2471042 |
| Female | Yes | 195 | 259 | 0.7528958 |
2.1 Visualization
Code
survival_rate_sex %>%
ggplot(aes(x = Sex, y = SurvivalRate, fill = Survived)) +
geom_bar(stat = "identity", position = "fill") + # Stack and scale to 100%
geom_text(aes(label = scales::percent(SurvivalRate)), # Display percentage labels
position = position_fill(vjust = 0.5)) +
labs(x = NULL, y = "Survival Rate", title = "Survival Rate by Sex") +
scale_y_continuous(labels = scales::percent) # Y-axis in percentagesCode
survival_rate_sex %>%
filter(Survived == "Yes") %>%
ggplot(aes(x = Sex, y = SurvivalRate)) +
geom_point(size = 4) +
geom_segment(aes(xend = Sex, yend = 0),
linewidth = 1, alpha = 0.4, color = "black") +
geom_text(aes(label = paste0(round(SurvivalRate*100, 2),"%")),
nudge_y = 0.04, size = 4, fontface = "bold") +
labs(x = NULL, y = "Survival Rate", title = "Survival Rate by Sex") +
scale_y_continuous(labels = scales::percent) +
guides(fill = "none")2.2 Proportion Test
Code
prop.test(x = survival_rate_sex$Freq[survival_rate_sex$Survived == "Yes"],
n = survival_rate_sex$Total[survival_rate_sex$Survived == "Yes"],
alternative = "less")
2-sample test for equality of proportions with continuity correction
data: survival_rate_sex$Freq[survival_rate_sex$Survived == "Yes"] out of survival_rate_sex$Total[survival_rate_sex$Survived == "Yes"]
X-squared = 202.87, df = 1, p-value < 2.2e-16
alternative hypothesis: less
95 percent confidence interval:
-1.0000000 -0.4905463
sample estimates:
prop 1 prop 2
0.2052980 0.7528958
The survival rate for males is lower than females.
2.3 Association between Gender and Survival
Problem Statement: Perform a chi-square test to determine if there is a significant association between gender (Sex) and survival.
Code
table(df$Sex, df$Survived)
No Yes
Male 360 93
Female 64 195
Code
chisq.test(df$Sex, df$Survived, correct = FALSE)
Pearson's Chi-squared test
data: df$Sex and df$Survived
X-squared = 205.14, df = 1, p-value < 2.2e-16
Code
# Alternatively
chisq.test(table(df$Sex, df$Survived), correct = FALSE)
Pearson's Chi-squared test
data: table(df$Sex, df$Survived)
X-squared = 205.14, df = 1, p-value < 2.2e-16
SInce the p-value is less than 0.05, it can concluded that there is a significant association between gender and survival.
3 Survival Rate by Passenger Class
Problem Statement: Investigate the relationship between passenger class (Pclass) and survival.
Code
| Pclass | Survived | Freq | Total | SurvivalRate |
|---|---|---|---|---|
| 1st | No | 64 | 184 | 0.3478261 |
| 1st | Yes | 120 | 184 | 0.6521739 |
| 2nd | No | 90 | 173 | 0.5202312 |
| 2nd | Yes | 83 | 173 | 0.4797688 |
| 3rd | No | 270 | 355 | 0.7605634 |
| 3rd | Yes | 85 | 355 | 0.2394366 |
3.1 Visualization
Code
survival_rate_pclass %>%
ggplot(aes(x = Pclass, y = Freq, fill = Survived)) +
geom_bar(stat = "identity", position = "fill") +
geom_text(aes(label = scales::percent(SurvivalRate)),
position = position_fill(vjust = 0.5)) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Passenger Class", y = "Percentage", fill = "Survived",
title = "Survival Rate by Passenger Class") +
coord_flip()3.2 Pairwise Proportion Test with Bonferroni Correction
Code
pairwise.prop.test(x = survival_rate_pclass$Freq[survival_rate_pclass$Survived == "Yes"],
n = survival_rate_pclass$Total[survival_rate_pclass$Survived == "Yes"],
p.adjust.method = "bonferroni")
Pairwise comparisons using Pairwise comparison of proportions
data: survival_rate_pclass$Freq[survival_rate_pclass$Survived == "Yes"] out of survival_rate_pclass$Total[survival_rate_pclass$Survived == "Yes"]
1 2
2 0.0044 -
3 < 2e-16 1.4e-07
P value adjustment method: bonferroni
4 Survival Rate by Age
Problem statement: Analyze the age distribution of passengers to see average age of survivors and deaths.
| Survived | Average Age |
|---|---|
| No | 30.62618 |
| Yes | 28.19330 |
Average age of survived passengers is 28.2 years and average age of passengers who did not survive is 30.6 years.0
4.1 Visualization
Code
df %>%
ggplot(aes(x = Survived, y = Age)) +
geom_boxplot() +
labs(x = "Survived", y = "Age", title = "Boxplot of Age by Survival")Code
df %>%
ggplot(aes(fill = Survived, x = Age)) +
geom_density(alpha = 0.5) +
labs(x = "Age", y = "Density", fill = "Survived",
title = "Age Distribution by Survival")4.2 Two-sample t-test
Code
t.test(Age ~ Survived, data = df)
Welch Two Sample t-test
data: Age by Survived
t = 2.1845, df = 596.68, p-value = 0.02931
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
0.2456156 4.6201457
sample estimates:
mean in group No mean in group Yes
30.62618 28.19330
The average age of survived passengers is higher than that of passengers who did not survive. The p-value is 0.02931, indicating that the difference in means is statistically significant.
5 Survival Rate by Family Size
Problem Statement: Visually analyze how family size affects the likelihood of survival.
Code
| FamilySize | Survived | Total | SurvivalRate |
|---|---|---|---|
| 1 | 128 | 402 | 0.3184080 |
| 2 | 76 | 139 | 0.5467626 |
| 3 | 53 | 93 | 0.5698925 |
| 4 | 21 | 27 | 0.7777778 |
| 5 | 3 | 11 | 0.2727273 |
| 6 | 3 | 22 | 0.1363636 |
| 7 | 4 | 12 | 0.3333333 |
| 8 | 0 | 6 | 0.0000000 |
5.1 Visualization
Code
survival_rate_family_size %>%
ggplot(aes(x = as.factor(FamilySize), y = SurvivalRate, fill = SurvivalRate)) +
geom_col() +
labs(x = "Family Size", y = "Survival Rate", fill = "Survived",
title = "Family Size Distribution by Survival") +
scale_fill_gradient(low = "red2", high = "darkgreen") +
scale_y_continuous(labels = scales::percent)5.2 Logistic Regression
Code
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.4671533 | 0.1070614 | -7.1089872 | 0.0000000 |
| as.factor(FamilySize)2 | 2.5823413 | 0.2012288 | 4.7145154 | 0.0000024 |
| as.factor(FamilySize)3 | 2.8363281 | 0.2352234 | 4.4320004 | 0.0000093 |
| as.factor(FamilySize)4 | 7.4921875 | 0.4751293 | 4.2385532 | 0.0000225 |
| as.factor(FamilySize)5 | 0.8027344 | 0.6854163 | -0.3205810 | 0.7485280 |
| as.factor(FamilySize)6 | 0.3379934 | 0.6304182 | -1.7206498 | 0.0853144 |
| as.factor(FamilySize)7 | 1.0703125 | 0.6216608 | 0.1093050 | 0.9129605 |
| as.factor(FamilySize)8 | 0.0000004 | 594.1635642 | -0.0249173 | 0.9801209 |
Family sizes of 2, 3, and 4 were associated with a positive and significant effect on survival (p < 0.001), with log-odds of 2.58, 2.84, and 7.49, respectively. But, family sizes of 5, 6, 7, and 8 showed no statistically significant effect (p > 0.05). This suggests that smaller family sizes (2–4) are associated with higher chance to survive compared to single person family size, while larger family sizes do not have a meaningful impact compared to single person family size.
6 Visually Inspecting Age Distribution by Pclass and Gender
Code
df %>%
ggplot(aes(x = Age, fill = Sex)) +
geom_histogram(alpha = 0.45, bins = 30) +
facet_grid(Pclass~., labeller = label_both, scales = "free_y") +
scale_fill_manual(values = c("blue", "red")) +
labs(x = "Age", y = "Freq.", fill = "Gender",
title = "Age Distribution by Class and Gender")7 Survival Rate by Age Group
Problem Statement: I’ve already investigated if there is any relationship between age and survival rate. But this time I have grouped the passengers by age and see the difference in survival rate.
Code
| AgeGroup | Survived | Total | SurvivalRate |
|---|---|---|---|
| Adults | 226 | 588 | 0.3843537 |
| Children | 61 | 113 | 0.5398230 |
| Elderly | 1 | 11 | 0.0909091 |
7.1 Visualization
Code
survival_rate_age_group %>%
ggplot(aes(x = AgeGroup, y = SurvivalRate, fill = SurvivalRate)) +
geom_col() +
geom_text(aes(label = paste0(scales::percent(SurvivalRate), " (", Survived,"/",Total,")")),
nudge_y = 0.03, size = 5) +
labs(x = "Age Group", y = "Survival Rate", fill = "Age Group",
title = "Survival Rate by Age Group") +
scale_fill_gradient(low = "#d7504b", high = "#548220") +
scale_y_continuous(labels = scales::percent, limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
scale_x_discrete(limits = c("Children", "Adults", "Elderly")) + # manually reposition the bars
guides(fill = "none")Code
survival_rate_age_group %>%
ggplot(aes(x = reorder(AgeGroup, -SurvivalRate), # arrange the bars from highest to lowest
y = SurvivalRate, fill = SurvivalRate)) +
geom_col() +
geom_text(aes(label = paste0(scales::percent(SurvivalRate), " (", Survived,"/",Total,")")),
nudge_y = 0.03, size = 5) +
labs(x = "Age Group", y = "Survival Rate", fill = "Age Group",
title = "Survival Rate by Age Group") +
scale_fill_gradient(low = "#d7504b", high = "#548220") +
scale_y_continuous(labels = scales::percent, limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
guides(fill = "none")The survival rate of children is higher than that of adults and elderlies.