In my first post, “Titanic Part 1”, I did an exploratory data analysis to find out which variables affected a passenger’s chance of surviving the sinking. Now, in part 2, I will build some visualizations based on that EDA.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Some columns are still not the right type, so we need to convert them to factors.
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
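The three conversions above can also be written as a single dplyr step. This is only a rough sketch on a made-up four-row data frame (`toy` is not the real data), and the readable labels `"died"` and `"survived"` are my own assumption, not something the dataset provides.

```r
library(dplyr)

# Hypothetical toy rows standing in for train.csv
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male")
)

toy <- toy %>%
  mutate(
    # The labels are an assumption added for readability
    Survived = factor(Survived, levels = c(0, 1), labels = c("died", "survived")),
    # Convert the remaining categorical columns in one call
    across(c(Pclass, Sex), as.factor)
  )

str(toy)
```

Labelled levels make plot legends easier to read, but keeping the plain 0/1 levels as in the post works just as well.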
Now that each column has the right type, we can look at the summary statistics.
summary(titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 Length:891
## 1st Qu.: 7.91 Class :character Class :character
## Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
So we have:
891 passengers
549 deaths and 342 survivors
314 female passengers and 577 male passengers
3 passenger classes (Pclass)
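The summary also reports 177 missing values in Age, and the `str()` output shows that Cabin contains empty strings rather than `NA`. A minimal sketch of how both kinds of gaps could be counted, using a made-up four-row data frame (`toy`, `na_counts` and `blank_counts` are illustrative names only):

```r
# Hypothetical toy rows; the real data has 891
toy <- data.frame(
  Age   = c(22, 38, NA, 35),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# True NA values per column
na_counts <- colSums(is.na(toy))

# Empty strings are not NA, so they are counted separately
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

na_counts
blank_counts
```

On the toy rows, Age has one `NA` and Cabin has two blanks, which `is.na()` alone would miss.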
titanic %>%
  group_by(Sex, Pclass) %>%
  count(Survived, name = "Total")
As the table shows, female passengers survived more often than male passengers, and first-class passengers survived more often than third-class passengers. The plots below look at each of these factors in turn.
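Counts per group are easier to compare once they are turned into rates. The sketch below shows one way to do that with `summarise()`; it runs on six invented rows, so the numbers it prints say nothing about the real passengers, and the column names `Survivors` and `Rate` are my own.

```r
library(dplyr)

# Hypothetical toy rows, not the real data
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Pclass   = c(1, 1, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 0, 1, 0)
)

rates <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(
    Total     = n(),
    # Comparing with "1" works for both a numeric and a factor column
    Survivors = sum(Survived == "1"),
    Rate      = Survivors / Total,
    .groups   = "drop"
  )

rates
```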
Next, create a new column for family size and see whether family size matters for a passenger’s survival.
titanic <- titanic %>%
  mutate(Fsize = SibSp + Parch + 1)
titanic %>%
  ggplot() +
  geom_bar(aes(x = Pclass, fill = factor(Fsize))) +
  facet_wrap(~Survived)
Based on the plot, third class carried the largest families and also had the most deaths of any class. Among the survivors in each class, passengers travelling alone appear to be the largest group. The bars show counts rather than rates, so this is only suggestive. One possible reason is that people travelling alone did not have to look after anyone else when boarding the lifeboats.
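Because the bar chart shows counts, a rate per family size is a useful cross-check. Here is a small base R sketch on invented rows (the values are not from the real data); in the post `Survived` is a factor, so it would need to be turned back into 0/1 numbers before taking a mean.

```r
# Hypothetical toy rows, not the real data
toy <- data.frame(
  SibSp    = c(0, 1, 0, 3, 1, 0),
  Parch    = c(0, 0, 0, 1, 2, 0),
  Survived = c(0, 1, 1, 0, 1, 0)
)

# Same definition as in the post: siblings/spouses + parents/children + self
toy$Fsize <- toy$SibSp + toy$Parch + 1

# The mean of a 0/1 column is the share of survivors in each group
rate_by_fsize <- tapply(toy$Survived, toy$Fsize, mean)

rate_by_fsize
```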
What about sex? Does sex matter for surviving the disaster?
ggplot(titanic, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar(stat = "count", position = "fill") +
  labs(y = "Proportion of Passengers", title = "Titanic Survival Rates by Sex")
As we can see, female passengers survived at a higher rate than male passengers, which is consistent with the captain’s order to save women and children first. The order was not carried out consistently, though, so some female passengers still did not survive.
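With `position = "fill"`, each bar is rescaled to a height of 1, so the plot shows shares within each sex rather than counts. The same shares can be sketched as a table with `prop.table()`; the rows below are made up for illustration and `shares` is just a name I picked.

```r
# Hypothetical toy rows, not the real data
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 0, 1)
)

# Rows are sexes, columns are the Survived values
counts <- table(toy$Sex, toy$Survived)

# margin = 1 normalises within each row, so every row sums to 1
shares <- prop.table(counts, margin = 1)

shares
```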
titanic %>%
  ggplot(aes(x = Sex, fill = Survived)) +
  theme_bw() +
  facet_wrap(~Pclass) +
  geom_bar() +
  labs(y = "Passenger Count", title = "Titanic Survival Counts by Pclass and Sex")
As for Pclass, more of the first-class passengers survived, possibly because their cabins were closer to the lifeboats. Many of the emigrants in third class died, possibly because their poor English meant they did not understand what was happening.
Next, a brief look at the age distribution. Which age range had the most deaths?
ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  geom_histogram(binwidth = 5) +
  labs(y = "Passenger Count", x = "Age (binwidth = 5)", title = "Titanic Age Distribution")
## Warning: Removed 177 rows containing non-finite values (stat_bin).
Several children still did not survive, even though the order was to save women and children first. Many of the passengers who died were between 20 and 40 years old.
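To put numbers on what the histogram suggests, ages can be grouped into bands and the deaths counted per band. This is a sketch on invented rows, and the 20-year bands are my own choice rather than anything used in the plot above.

```r
# Hypothetical toy rows, not the real data
toy <- data.frame(
  Age      = c(2, 14, 22, 27, 35, 38, 54, NA),
  Survived = c(0, 1, 0, 1, 0, 1, 0, 0)
)

# right = FALSE gives bands closed on the left: [0,20), [20,40), ...
# include.lowest = TRUE closes the last band so that age 80 is kept
toy$AgeBand <- cut(toy$Age, breaks = c(0, 20, 40, 60, 80),
                   right = FALSE, include.lowest = TRUE)

# table() drops the rows with a missing age by default
deaths_by_band <- table(toy$AgeBand[toy$Survived == 0])

deaths_by_band
```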
ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  facet_wrap(Sex ~ Pclass) +
  geom_density(alpha = 0.5) +
  labs(y = "Density", x = "Age", title = "Titanic Age Distribution by Survival, Pclass and Sex")
## Warning: Removed 177 rows containing non-finite values (stat_density).
Finally, the last plot shows that female passengers were the most likely to survive the sinking. Many third-class passengers did not make it, while first- and second-class passengers had a higher survival rate than third class.