The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. Our objective is to analyze the given data and find out which groups of people were more likely to survive the accident.
Before we dive into the problem, we need to understand the data. First, we import the data into a data frame and display it to get a brief overview of the dataset.
data <- read.csv("data_input/train.csv")
data
Table 2.1. Titanic Dataset
Additional information:
C = Cherbourg, Q = Queenstown, S = Southampton
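As a side note on the import step, blank fields such as a missing port code can be read as NA from the start. The sketch below is only an illustration: `csv_text` and `sample_data` are hypothetical names, and the three inline rows stand in for the real train.csv.

```r
# Hypothetical three-row sample laid out like a few columns of train.csv.
csv_text <- "PassengerId,Survived,Pclass,Sex,Age,Embarked
1,0,3,male,22,S
2,1,1,female,38,
3,1,3,female,,Q"

# na.strings = c("", "NA") turns blank character fields into NA;
# blank numeric fields are read as NA by default.
sample_data <- read.csv(text = csv_text, na.strings = c("", "NA"))

str(sample_data)
```

Reading blanks as NA would let `is.na()` find the unknown port codes, instead of comparing against an empty string.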
We want to see the data structure and its components to evaluate whether some data types need to be changed.
str(data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
From the information above, some data types need to be changed:
- Survived -> factor
- Pclass -> factor
- Sex -> factor
- Embarked -> factor
data$Pclass <- as.factor(data$Pclass)
data$Sex <- as.factor(data$Sex)
data$Embarked <- as.factor(data$Embarked)
data$Survived <- as.factor(data$Survived)
str(data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The dataset is now ready for the next steps.
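The four conversions above could also be done in one step. This is a minimal sketch on a hypothetical three-row data frame (`toy`) and an assumed helper vector `factor_cols`; it is not the code used in this analysis.

```r
# Hypothetical miniature of the Titanic data frame.
toy <- data.frame(
  Survived = c(0L, 1L, 1L),
  Pclass = c(3L, 1L, 3L),
  Sex = c("male", "female", "female"),
  Embarked = c("S", "C", "S"),
  Age = c(22, 38, 26),
  stringsAsFactors = FALSE
)

# Names of the columns that should become factors.
factor_cols <- c("Survived", "Pclass", "Sex", "Embarked")

# lapply() converts each listed column; the other columns are left untouched.
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```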
Importing libraries.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RColorBrewer)
In this step, we explore the data to find anomalies and insights.
summary(data)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 2
## 1st Qu.: 7.91 Class :character C:168
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
##
There are 177 missing values in the “Age” column, which we can fill through NA imputation. Moreover, “Name” and “Cabin” are considered to be unique values. To predict whether a passenger survived, we don’t need unique values, so we can drop those columns. There are also 2 missing values in the “Embarked” column.
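The summary reports NA values and blank strings in different ways, so a single count per column can be handy. The sketch below assumes a hypothetical four-row data frame (`toy`) and a helper named `count_missing`; neither is part of the original analysis.

```r
# Hypothetical sample with both NA values and blank strings.
toy <- data.frame(
  Age = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Treat a value as missing when it is NA or an empty string.
count_missing <- function(x) sum(is.na(x) | x == "")

# One count per column.
missing_per_column <- sapply(toy, count_missing)
print(missing_per_column)
```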
First, we want to look at the 2 passengers whose embarked status is unknown.
as.data.frame(data[data$Embarked == "", ])
Table 3.1.1. Passengers with Unknown Embarked Status
There are 2 passengers with the information as shown below:
1. Ms. Amelie Icard:
- PassengerId = 62
- Sex = Female
- Age = 38
- Pclass = 1
- SibSp = 0
- Parch = 0
- Fare = 80
- Survived = yes
2. Mrs. George Nelson Stone:
- PassengerId = 830
- Sex = Female
- Age = 62
- Pclass = 1
- SibSp = 0
- Parch = 0
- Fare = 80
- Survived = yes
aggregate(Fare ~ Pclass + Embarked, data = data, FUN = "median")
Table 3.1.2. Aggregate Table of Passenger Class and Embarked Status Based on Fare Price
Both passengers share the same values in several columns. Hence, we can infer their embarked status from the passengers who match the following conditions:
- Passenger Class = 1 (because they were in 1st class)
- 78 < Fare < 90 (because their fare was 80)
- PassengerId > 60 (because both passengers’ Ids were above 60)
- SibSp and Parch = 0 (because neither of them traveled with a sibling, spouse, parent, or child)
summary(as.data.frame(data[data$Pclass == 1 & data$Fare > 78 & data$Fare < 90 & data$PassengerId > 60 & data$Parch == 0 & data$SibSp == 0, ]))
## PassengerId Survived Pclass Name Sex Age
## Min. : 62.0 0:2 1:10 Length:10 female:8 Min. :16.00
## 1st Qu.:257.2 1:8 2: 0 Class :character male :2 1st Qu.:24.00
## Median :301.0 3: 0 Mode :character Median :30.00
## Mean :420.4 Mean :33.22
## 3rd Qu.:696.2 3rd Qu.:38.00
## Max. :830.0 Max. :62.00
## NA's :1
## SibSp Parch Ticket Fare Cabin
## Min. :0 Min. :0 Length:10 Min. :78.85 Length:10
## 1st Qu.:0 1st Qu.:0 Class :character 1st Qu.:79.20 Class :character
## Median :0 Median :0 Mode :character Median :80.00 Mode :character
## Mean :0 Mean :0 Mean :81.91
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:85.66
## Max. :0 Max. :0 Max. :86.50
##
## Embarked
## :2
## C:4
## Q:0
## S:4
##
##
##
We get 2 options, “C” and “S”, with 4 matching passengers each. We will choose “C” because the median 1st-class fare for “C” in Table 3.1.2 is close to 80.
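The choice of port by median fare can be sketched as a small calculation. The fares below are hypothetical illustrative values, not the real data, and `first_class`, `target_fare` and `closest_port` are assumed names.

```r
# Hypothetical 1st-class fares by port (illustrative values only).
first_class <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q"),
  Fare = c(76.7, 79.2, 83.2, 26.6, 52.0, 86.5, 90.0)
)

# Median fare for each port.
median_fare <- tapply(first_class$Fare, first_class$Embarked, median)

# Pick the port whose median fare is nearest to the fare being matched.
target_fare <- 80
closest_port <- names(median_fare)[which.min(abs(median_fare - target_fare))]
print(closest_port)
```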
data[data$Embarked == "", "Embarked"] <- "C"
“Age” varies across passengers, and the 3 passenger classes may have different median ages. So, we will make a boxplot to find the median for each class.
ggplot(data, aes(x = Pclass, y = Age, fill = Pclass)) +
geom_boxplot(na.rm = TRUE, alpha = 0.3) +
labs(title = "Boxplot Age for Each Passenger Class",
x = "Passenger Class",
y = "Age",
fill = "Passenger Class") +
# theme_minimal() goes first; added last, it would reset the theme() settings below.
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14),
legend.position = "none")
Each class has a different median “Age”. To get the exact value of each median, we subset “Age” by class and compute it.
medianclass1 <- data[data$Pclass == 1, c("Age", "Pclass")]
medianclass2 <- data[data$Pclass == 2, c("Age", "Pclass")]
medianclass3 <- data[data$Pclass == 3, c("Age", "Pclass")]
median1 <- median(medianclass1$Age, na.rm = TRUE)
median2 <- median(medianclass2$Age, na.rm = TRUE)
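median3 <- median(medianclass3$Age, na.rm = TRUE)
# Return the class median when the age is missing; otherwise keep the recorded age.
# age: one passenger's age (may be NA); pclass: that passenger's class.
impute_age <- function(age, pclass){
  if (is.na(age)){
    if (pclass == 1){
      return (median1)
    }
    else if (pclass == 2){
      return (median2)
    }
    else {
      return (median3)
    }
  }
  else {
    return (age)
  }
}
# Apply the function element-wise over the Age and Pclass columns.
data$Age <- mapply(FUN = impute_age, age = data$Age, pclass = data$Pclass)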
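The element-wise imputation above can also be written in vectorized form. The following is a sketch on a hypothetical eight-row data frame (`toy`), using base R's `ave()` to broadcast each class median. It is an alternative under those assumptions, not the method used in this analysis.

```r
# Hypothetical sample with missing ages in every class.
toy <- data.frame(
  Pclass = factor(c(1, 1, 1, 2, 2, 3, 3, 3)),
  Age = c(40, NA, 36, 30, NA, 20, 24, NA)
)

# ave() returns, for every row, the median age of that row's class.
class_median <- ave(toy$Age, toy$Pclass,
                    FUN = function(x) median(x, na.rm = TRUE))

# Keep recorded ages and fill only the missing ones.
toy$Age_imputed <- ifelse(is.na(toy$Age), class_median, toy$Age)
print(toy)
```

This avoids hard-coding one median variable per class, so it would also work if the number of classes changed.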
summary(data)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :26.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.07 Mean :0.523 Mean :0.3816
## 3rd Qu.:37.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 0
## 1st Qu.: 7.91 Class :character C:170
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
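The summary still lists the blank “Embarked” level with a count of 0, because reassigning values does not remove a factor level. A minimal sketch of dropping it, assuming a hypothetical five-element factor named `embarked`:

```r
# Hypothetical port codes with one blank entry.
embarked <- factor(c("S", "C", "", "Q", "S"))

# Reassign the blank entry; the "" level remains, now with count 0.
embarked[embarked == ""] <- "C"
print(table(embarked))

# droplevels() removes levels that no longer occur.
embarked <- droplevels(embarked)
print(levels(embarked))
```

Dropping the unused level also keeps it out of plot legends and facets.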
Correlation between gender, passenger class, and survival status.
ggplot(data = data, aes(x = Survived, fill = Sex)) +
geom_bar(position = "dodge") +
labs(title = "Survived Status Based on Gender",
x = "Survived Status",
y = "Total",
fill = "Gender") +
theme_minimal()
Figure 4.1. Survived Status for Each Gender
From the chart above, we know that the majority of people who survived were female, while the majority of males did not survive.
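The bar chart shows counts. The same comparison can be expressed as proportions within each gender. The sketch below uses a hypothetical ten-row data frame (`toy`) with made-up values, so the resulting rates are illustrative only.

```r
# Hypothetical sample: 4 female and 6 male passengers.
toy <- data.frame(
  Sex = c("female", "female", "female", "female",
          "male", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# Cross-tabulate gender against survival status.
counts <- table(toy$Sex, toy$Survived, dnn = c("Sex", "Survived"))

# margin = 1 divides each row by its total, giving rates within each gender.
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```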
ggplot(data = data, aes(x = Survived, fill = Pclass)) +
geom_bar(position = "dodge") +
labs(title = "Passenger Class Based on Survival Status Barplot",
x = "Survived Status",
y = "Total Passenger",
fill = "Passenger Class") +
scale_fill_brewer(palette = "Set2") +
# theme_minimal() goes first; added last, it would reset legend.position.
theme_minimal() +
theme(legend.position = "top")
Figure 4.2. Survived Status for Each Class
This gives a new insight into the data distribution: passengers from 3rd class were more likely to not survive, 2nd-class passengers were split roughly evenly, and 1st-class passengers were more likely to survive the accident.
Aggregating male passengers based on passenger class.
male <- data[data$Sex == "male", ]
aggregate(Sex ~ Pclass + Survived, data = male, FUN = "length")
Table 4.1. Passengers Survived Male Only
Aggregating female passengers based on passenger class.
female <- data[data$Sex == "female", ]
aggregate(Sex ~ Pclass + Survived, data = female, FUN = "length")
Table 4.2. Passengers Survived Female Only
ggplot(data = male, aes(x = Survived, fill = Pclass)) +
geom_bar(position = "dodge") +
labs(title = "Passenger Class (Male) Based on Survival Status Barplot",
x = "Survived Status",
y = "Total Passenger",
fill = "Passenger Class") +
scale_fill_brewer(palette = "Set2") +
# theme_minimal() goes first; added last, it would reset legend.position.
theme_minimal() +
theme(legend.position = "top")
Figure 4.3. Distribution of Male Passengers Survived for Each Class
Male passengers from 3rd class were less likely to survive than those from the other classes.
ggplot(data = female, aes(x = Survived, fill = Pclass)) +
geom_bar(position = "dodge") +
labs(title = "Passenger Class (Female) Based on Survival Status Barplot",
x = "Survived Status",
y = "Total Passenger",
fill = "Passenger Class") +
scale_fill_brewer(palette = "Set2") +
# theme_minimal() goes first; added last, it would reset legend.position.
theme_minimal() +
theme(legend.position = "top")
Figure 4.4. Distribution of Female Passengers Survived for Each Class
Female passengers from 3rd class were less likely to survive than those from the other classes.
data_subset <- data[, c("Survived", "Pclass", "Sex")]
data_agg <- data_subset %>%
mutate(Survived = as.factor(Survived)) %>%
count(Sex, Survived, Pclass, sort = TRUE)
data_agg
Table 4.3. Number of Passenger Survived Based on Gender and Passenger Class
ggplot(data_agg, mapping = aes(x = Survived, y = n)) +
geom_col(aes(fill = Sex), position = "dodge") +
labs(title = "Passenger Survived Based on Gender and Passenger Class",
x = "Survived Status",
y = "Total") +
facet_wrap(~Pclass) +
theme_bw()
Figure 4.5. Barchart of People Survived Based on Gender and Passenger Class
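Counts by gender and class can be complemented with a survival rate per group. The sketch below assumes a hypothetical eight-row data frame (`toy`) and an added helper column `SurvivedNum`; the values are made up for illustration.

```r
# Hypothetical sample with Survived stored as a factor, as in this analysis.
toy <- data.frame(
  Sex = c("female", "female", "male", "male",
          "female", "female", "male", "male"),
  Pclass = factor(c(1, 1, 1, 1, 3, 3, 3, 3)),
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0))
)

# as.integer() on a factor returns level codes (1, 2), so convert via character.
toy$SurvivedNum <- as.integer(as.character(toy$Survived))

# The mean of a 0/1 column is the share of survivors in each group.
rate_table <- aggregate(SurvivedNum ~ Sex + Pclass, data = toy, FUN = mean)
print(rate_table)
```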
We want to see the distribution of survival status by age.
ggplot(data, aes(x = Age)) +
geom_histogram(data = data[data$Survived == 0, ], fill = "red", alpha=0.6, position="identity", bins = 20) +
geom_histogram(data = data[data$Survived == 1, ], fill = "blue", alpha=0.6, position="identity", bins = 20) +
theme_classic()
Figure 4.6. Histogram of People Survived Based on Age (Survived = blue, Not Survived = red)
Most passengers older than 60 did not survive the accident. Meanwhile, most passengers younger than 10 survived.
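The reading of the histogram can be checked numerically by grouping ages into bands. This sketch uses a hypothetical ten-row data frame (`toy`) and assumed band edges at 10 and 60; the resulting rates are illustrative, not results from the Titanic data.

```r
# Hypothetical ages and survival flags.
toy <- data.frame(
  Age = c(2, 8, 9, 25, 30, 45, 61, 70, 65, 35),
  Survived = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 1)
)

# right = FALSE makes the bands [0, 10), [10, 60) and [60, Inf).
toy$AgeBand <- cut(toy$Age,
                   breaks = c(0, 10, 60, Inf),
                   labels = c("under 10", "10 to 59", "60 and over"),
                   right = FALSE)

# Share of survivors within each age band.
band_rate <- tapply(toy$Survived, toy$AgeBand, mean)
print(round(band_rate, 2))
```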
People from the same area may share characteristics that affect how they react when an accident happens. So, we will look for a relationship between embarked status and survival status.
ggplot(data = data, aes(x = Survived, fill = Embarked)) +
geom_bar(position = "dodge") +
labs(title = "Embarked Status Based on Survived Status Barplot",
x = "Survived Status",
y = "Total Passenger",
fill = "Embarked Status") +
scale_fill_brewer(palette = "Set1") +
# theme_minimal() goes first; added last, it would reset legend.position.
theme_minimal() +
theme(legend.position = "top")
Figure 4.7. Barplot of People Survived Based on Embarked Status
Southampton (S) has the largest number of passengers who did not survive, followed by Cherbourg (C), while Queenstown (Q) has the fewest. Because the ports differ greatly in passenger numbers (644, 170, and 77), these counts should not be read as survival rates. We would like to dig deeper into these findings, so we will subset our data to find more information based on gender.
data_subset2 <- data[, c("Survived", "Embarked", "Sex")]
data_agg2 <- data_subset2 %>%
mutate(Survived = as.factor(Survived)) %>%
count(Sex, Survived, Embarked, sort = TRUE)
data_agg2
Table 4.4. Table of People Survived Based on Gender and Embarked Status
ggplot(data_agg2, mapping = aes(x = Survived, y = n)) +
geom_col(aes(fill = Sex), position = "dodge") +
labs(title = "People Survived Based on Gender and Embarked Status",
x = "Survived Status",
y = "Total") +
facet_wrap(~Embarked) +
theme_bw()
Figure 4.8. Barplot of People Survived Based on Gender and Embarked Status
As shown above, most casualties were male passengers who boarded at Southampton (S). The number of female passengers who boarded at Southampton (S) and didn’t survive is almost the same as the number of female passengers who boarded at Cherbourg (C) and survived.
data_subset3 <- data[, c("Survived", "Embarked", "Pclass")]
data_agg3 <- data_subset3 %>%
mutate(Survived = as.factor(Survived)) %>%
count(Pclass, Survived, Embarked, sort = TRUE)
data_agg3
Table 4.5. Table of People Survived Based on Passenger Class and Embarked Status
ggplot(data_agg3, mapping = aes(x = Survived, y = n)) +
geom_col(aes(fill = Pclass), position = "dodge") +
labs(title = "People Survived Based on Passenger Class and Embarked Status",
x = "Survived Status",
y = "Total") +
scale_fill_brewer(palette = "Set2") +
facet_wrap(~Embarked)
Figure 4.9. Barplot of People Survived Based on Passenger Class and Embarked Status
From the figure above, we could say that 1st-class passengers were not always more likely to survive. The number of casualties among 1st-class passengers who boarded at Southampton (S) is higher than among 3rd-class passengers who boarded at Cherbourg (C) or Queenstown (Q).
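Casualty counts depend on how many passengers each group had, so it may help to put counts and survival rates side by side. The sketch below uses a hypothetical eight-row data frame (`toy`) built so that one group has more casualties but also a higher survival rate. It illustrates the calculation only and says nothing about the real figures.

```r
# Hypothetical sample: six 1st-class passengers from S, two 3rd-class from Q.
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 1, 1, 3, 3),
  Embarked = c("S", "S", "S", "S", "S", "S", "Q", "Q"),
  Survived = c(1, 1, 1, 1, 0, 0, 1, 0)
)

groups <- list(Pclass = toy$Pclass, Embarked = toy$Embarked)

# Number of casualties per class and port (NA where a group has no passengers).
died <- tapply(toy$Survived == 0, groups, sum)

# Survival rate per class and port.
rate <- tapply(toy$Survived, groups, mean)

print(died)
print(round(rate, 2))
```

In this toy sample the 1st-class Southampton group has more casualties (2 against 1) and still a higher survival rate (about 0.67 against 0.5).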