About the Data
This dataset comes from Kaggle and consists of a train set and a test set. For this LBB, however, I will only do Exploratory Data Analysis on the train data. I also won’t be using any additional libraries for EDA (e.g. tidyverse or dplyr).
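The code that loaded the data is not shown. A minimal base R sketch, assuming the Kaggle train file was saved locally as `train.csv` (both the file name and the helper `load_train` are hypothetical), could look like this:

```r
# Hypothetical helper: read the Kaggle train file with base R only.
# stringsAsFactors = TRUE turns text columns into factors, matching the
# "Factor" columns in the structure output (it is not the default in R >= 4.0.0).
load_train <- function(path) {
  read.csv(path, stringsAsFactors = TRUE)
}

# Assumed file name; adjust it to wherever the file was downloaded.
if (file.exists("train.csv")) {
  data <- load_train("train.csv")
  str(data)
}
```

Calling `str()` on the result is what produces a structure listing like the following one.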
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Here are descriptions and notes for some of the variables:
survival : 0 (no), 1 (yes)
pclass : ticket class, resembling socio-economic status (SES). 1 (upper), 2 (middle), 3 (lower)
sibsp : # of siblings / spouses aboard the Titanic. Sibling (brother/sister/stepbrother/stepsister), spouse (husband/wife; mistresses & fiancés are ignored)
parch : # of parents / children aboard the Titanic. If a child travelled with a nanny, then parch = 0
ticket : Ticket number
fare : passenger fare
cabin : cabin number
embarked : port of embarkation. C (Cherbourg), Q (Queenstown), S (Southampton)
Data Preparation
Change the class & assign levels for some variables:
data$Survived <- as.factor(data$Survived)
levels(data$Survived) <- c("No", "Yes")
data$Pclass <- as.factor(data$Pclass)
levels(data$Pclass) <- c("Upper", "Middle", "Lower")
levels(data$Embarked) <- c("Unknown", "Cherbourg", "Queenstown", "Southampton")
head(data)
## PassengerId Survived Pclass
## 1 1 No Lower
## 2 2 Yes Upper
## 3 3 Yes Lower
## 4 4 Yes Upper
## 5 5 No Lower
## 6 6 No Lower
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 Southampton
## 2 PC 17599 71.2833 C85 Cherbourg
## 3 STON/O2. 3101282 7.9250 Southampton
## 4 113803 53.1000 C123 Southampton
## 5 373450 8.0500 Southampton
## 6 330877 8.4583 Queenstown
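Assigning `levels()` after `as.factor()` relies on the levels already being in the expected order (1, 2, 3). As an alternative sketch, `factor()` can map each code to its label explicitly. The toy vector `pclass_raw` below is made up for illustration and is not part of the dataset:

```r
# Toy vector standing in for data$Pclass (values are illustrative only).
pclass_raw <- c(3, 1, 3, 2)

# Map each code to its label explicitly instead of relying on level order.
pclass <- factor(pclass_raw,
                 levels = c(1, 2, 3),
                 labels = c("Upper", "Middle", "Lower"))

print(pclass)
table(pclass)
```

The explicit mapping keeps working even if a class happens to be missing from the data, which is the case where order-based relabeling can go wrong.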
Data Summary
## PassengerId Survived Pclass
## Min. : 1.0 No :549 Upper :216
## 1st Qu.:223.5 Yes:342 Middle:184
## Median :446.0 Lower :491
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 Unknown : 2
## B96 B98 : 4 Cherbourg :168
## C23 C25 C27: 4 Queenstown : 77
## G6 : 4 Southampton:644
## C22 C26 : 3
## D : 3
## (Other) :186
Some notable points from the above summary:
1. Only 342 of the 891 passengers (about 38%) survived; the ratio of those who survived to those who did NOT is roughly 2 : 3.
2. More than half of the passengers (491 of 891) bought 3rd class tickets.
3. There are far more male than female passengers (577 vs. 314, almost twice the number).
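These three points can be checked with a few lines of base R. The sketch below re-enters the counts printed in the summary by hand, so it does not depend on the `data` object; the variable names are only illustrative:

```r
# Counts copied from the summary output above.
survived <- c(No = 549, Yes = 342)
pclass   <- c(Upper = 216, Middle = 184, Lower = 491)
sex      <- c(female = 314, male = 577)

# Share of each category
surv_share  <- round(prop.table(survived), 3)
class_share <- round(prop.table(pclass), 3)

# Male-to-female ratio
sex_ratio <- round(sex[["male"]] / sex[["female"]], 2)

print(surv_share)
print(class_share)
print(sex_ratio)
```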
# Comparison of survival status
plot(data$Survived,
     col = "dark blue",
     main = "How Many Titanic Passengers Survived?",
     xlab = "Status",
     ylab = "Amount")
There is also a stark contrast in where the passengers embarked from:
# install package for color palette in graphic
# install.packages("wesanderson")
library(wesanderson)
# Comparison of embarkation ports
plot(data$Embarked,
     col = wes_palette(n=4, name = "GrandBudapest2"),
     main = "Where The Titanic Passengers Embarked From",
     xlab = "Place",
     ylab = "Amount")
Note: the colors come from the wesanderson color palette package.
According to a Titanic route map that I found on Google, the ship sailed from Southampton, where most of its passengers boarded. It then called at Cherbourg and Queenstown, where it picked up a smaller number of passengers.
Titanic Route
Survivor Profile
Let’s look at the survivor profile based on several passenger identifiers:
Based on Ticket Class
Does being able to purchase a first-class ticket ensure a passenger’s safety?
# Create table to calculate proportion
surv_ticket <- as.data.frame(
  sort(prop.table(table(droplevels(data[data$Survived == "Yes", "Pclass"]))),
       decreasing = TRUE))
surv_ticket
## Var1 Freq
## 1 Upper 0.3976608
## 2 Lower 0.3479532
## 3 Middle 0.2543860
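# Creating pie chart
pie(surv_ticket$Freq, labels = surv_ticket$Var1, main = "Survivors' Ticket Class (%)")
In the middle of a disastrous situation, the shares of survivors holding each ticket class are not far apart (about 40% Upper, 35% Lower, 25% Middle). Keep in mind, though, that this is the composition of the survivors, not the survival rate within each class: Upper class passengers were only 216 of the 891 in the sample (about 24%) yet make up about 40% of the survivors, so ticket class may have mattered more than the pie chart alone suggests.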
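To see how ticket class relates to the chance of survival, the survivor counts can be divided by the number of passengers in each class. The sketch below is not part of the original analysis: the survivor counts are derived from the proportions above multiplied by 342 survivors, and the class totals come from the data summary.

```r
# Survivors per class: the proportions above times 342 survivors.
survivors <- c(Upper = 136, Middle = 87, Lower = 119)
# All passengers per class, from the data summary.
totals <- c(Upper = 216, Middle = 184, Lower = 491)

# Survival rate within each class
class_rate <- round(survivors / totals, 3)
print(class_rate)
```

With the full data frame, `prop.table(table(data$Pclass, data$Survived), margin = 1)` should give the same row-wise rates.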
Based on Gender & Age
If you were on the sinking Titanic, would you be inclined to prioritize women over men? Or did it not matter at all?
## PassengerId Survived Pclass
## 2 2 Yes Upper
## 3 3 Yes Lower
## 4 4 Yes Upper
## 9 9 Yes Lower
## 10 10 Yes Middle
## 11 11 Yes Lower
## Name Sex Age SibSp Parch
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## 11 Sandstrom, Miss. Marguerite Rut female 4 1 1
## Ticket Fare Cabin Embarked
## 2 PC 17599 71.2833 C85 Cherbourg
## 3 STON/O2. 3101282 7.9250 Southampton
## 4 113803 53.1000 C123 Southampton
## 9 347742 11.1333 Southampton
## 10 237736 30.0708 Cherbourg
## 11 PP 9549 16.7000 G6 Southampton
And as a reminder, here is the number of passengers who survived:
## [1] 342
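The code that built the `surv` subset is not shown. Presumably it is something like `surv <- data[data$Survived == "Yes", ]` followed by `nrow(surv)`, but that is an assumption. The same pattern on a made-up toy data frame:

```r
# Toy data frame standing in for the Titanic train data (values are made up).
toy <- data.frame(
  PassengerId = 1:4,
  Survived = factor(c("No", "Yes", "Yes", "No")),
  Sex = factor(c("male", "female", "female", "male"))
)

# Keep only the rows of passengers who survived
surv_toy <- toy[toy$Survived == "Yes", ]

# Number of survivors in the toy data
nrow(surv_toy)
```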
Of these 342 passengers, the breakdown by gender is:
# Number of survivors by gender
surv_age <- as.data.frame(sort(table(droplevels(surv$Sex)), decreasing = TRUE))
# Change column name
names(surv_age)[1] <- "Survivor.Gender"
# Check table
surv_age
## Survivor.Gender Freq
## 1 female 233
## 2 male 109
# Graph (using wesanderson library with "Moonrise3" color palette)
graphics::barplot(xtabs(Freq ~ Survivor.Gender, surv_age),
                  col = wes_palette(n=2, name = "Moonrise3"),
                  main = "Survivors' Gender")
Female survivors outnumber male survivors by more than 2 to 1, even though there were almost twice as many men as women aboard. Being female therefore raised the chance of survival considerably, possibly because women were prioritized during the rescue.
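The size of the gender effect can be estimated by turning the counts into row-wise survival rates. This is a sketch using only numbers already printed in this post (233 and 109 survivors, 314 and 577 passengers in total); the object names are illustrative:

```r
# 2 x 2 table of counts: rows are gender, columns are survival status.
counts <- matrix(c(314 - 233, 233,
                   577 - 109, 109),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("female", "male"),
                                 Survived = c("No", "Yes")))

# Row-wise proportions: survival rate within each gender
rates <- prop.table(counts, margin = 1)
print(round(rates, 3))

# How many times higher the female survival rate is
rate_ratio <- rates["female", "Yes"] / rates["male", "Yes"]
print(round(rate_ratio, 1))
```

Under these counts, the female survival rate is several times the male rate, which is a larger gap than the raw survivor counts suggest.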
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 19.00 28.00 28.34 36.00 80.00 52
From this summary alone, and keeping in mind that 52 survivors have no recorded age (NA), we can see that:
1. Both infants and the elderly had a chance to survive (ages range from 0.42 to 80).
2. The average age of survivors was about 28.
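The missing ages matter when computing statistics by hand: `summary()` reports the NA's separately, but `mean()` returns `NA` unless told to drop them. A small sketch with a made-up age vector:

```r
# Made-up ages with one missing value (illustrative only).
ages <- c(22, 38, NA, 26)

# Without na.rm the result is NA
mean_with_na <- mean(ages)

# With na.rm = TRUE the missing value is dropped first
mean_no_na <- mean(ages, na.rm = TRUE)

# Count of missing values
n_missing <- sum(is.na(ages))

print(mean_with_na)
print(mean_no_na)
print(n_missing)
```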
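# Filter/subset age (rows with N/A age are dropped by subset())
# Comparisons are used instead of %in% 30:60, which only matches
# whole-number ages and would silently drop fractional ones such as 0.42
sixty <- subset(surv, Age > 60)
thirtysixty <- subset(surv, Age >= 30 & Age <= 60)
youngadult <- subset(surv, Age >= 17 & Age < 30)
kids <- subset(surv, Age < 17)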
# adding age cluster
sixty$agecluster <- "Age > 60"
thirtysixty$agecluster <- "Age 30-60"
youngadult$agecluster <- "Age 17 to 29"
kids$agecluster <- "Below 16"
# Combine them all again
surv_agecl <- rbind(sixty, thirtysixty, youngadult, kids)
head(surv_agecl)
## PassengerId Survived Pclass
## 276 276 Yes Upper
## 484 484 Yes Lower
## 571 571 Yes Middle
## 631 631 Yes Upper
## 830 830 Yes Upper
## 2 2 Yes Upper
## Name Sex Age SibSp Parch
## 276 Andrews, Miss. Kornelia Theodosia female 63 1 0
## 484 Turkula, Mrs. (Hedwig) female 63 0 0
## 571 Harris, Mr. George male 62 0 0
## 631 Barkworth, Mr. Algernon Henry Wilson male 80 0 0
## 830 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## Ticket Fare Cabin Embarked agecluster
## 276 13502 77.9583 D7 Southampton Age > 60
## 484 4134 9.5875 Southampton Age > 60
## 571 S.W./PP 752 10.5000 Southampton Age > 60
## 631 27042 30.0000 A23 Southampton Age > 60
## 830 113572 80.0000 B28 Unknown Age > 60
## 2 PC 17599 71.2833 C85 Cherbourg Age 30-60
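An alternative to subsetting four times and binding the pieces back together is `cut()`, which assigns every age to a bin in one step. The sketch below uses a made-up age vector, and both the break points and the labels are my own assumptions. With `right = FALSE`, an age of exactly 60 lands in the last bin, which differs slightly from the clusters used in this post.

```r
# Made-up ages, including fractional and missing values (illustrative only).
ages <- c(0.42, 16, 17, 29.5, 30, 59, 63, NA)

# Left-closed bins: [0,17), [17,30), [30,60), [60,Inf)
age_group <- cut(ages,
                 breaks = c(0, 17, 30, 60, Inf),
                 labels = c("Below 17", "Age 17 to 29", "Age 30 to 59", "Age 60+"),
                 right = FALSE)

# Frequency of each bin, keeping a separate count for missing ages
table(age_group, useNA = "ifany")
```

Because `cut()` returns a factor, no separate `as.factor()` step is needed afterwards, and rows with a missing age stay in the data as `NA` rather than being dropped.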
# Change class
surv_agecl$agecluster <- as.factor(surv_agecl$agecluster)
# Calculate frequency of each age cluster
agecl <- as.data.frame(sort(table(droplevels(surv_agecl$agecluster)), decreasing = TRUE))
graphics::barplot(xtabs(Freq ~ Var1, agecl),
                  col = wes_palette(n=4, name = "Moonrise3"),
                  main = "Survivors' Age Cluster")
As I expected, most survivors were of ‘productive’/‘healthy’ age, while those aged 16 and below (whom I think of as ‘kids’) make up a smaller group. This may be because there were not many kids aboard to begin with, and because they were easily distressed when separated from their families (remember: some kids travelled with only their nannies). Elderly passengers form the smallest group of survivors, perhaps in part because of the harsh conditions they faced.
Conclusion
To sum up the analysis above, there were 342 survivors among the 891 passengers in our train data. Most passengers embarked at Southampton. Among the survivors, the largest share held Upper class tickets (about 40%), followed by Lower (about 35%) and Middle (about 25%).
Most of the survivors were female, and most were of ‘productive’/‘healthy’ age (17-60). My assumption is that female passengers were prioritized during the rescue (remember Jack & Rose in the final scene of “Titanic”?), and that those within that age bracket were strong enough to get themselves out of the situation despite the harsh conditions, without having to depend on another adult (unlike, e.g., kids with their nannies).
Before we part, this is an image to remind you that Jack could have survived the Titanic if Rose just scooted over a little:
Jack & Rose Last Scene