Continuing From Part One

In my first post, “Titanic Part 1”, I did an exploratory data analysis to find out which variables affected the passengers’ chances of surviving the sinking. Now, in part 2, I will build some visualizations based on that EDA.

Load The Packages and Read the Data

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Some columns are still not in the right type: Survived, Pclass and Sex are categorical, so we need to convert them to factors.

Change The Data Types

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)

str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
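
The three conversions above can also be written as a single step. Here is a minimal sketch using dplyr's across(); the `toy` data frame is a made-up stand-in with the same column names as train.csv, so the snippet runs on its own. In the post itself you would apply the same mutate() call to `titanic`.

```r
library(dplyr)

# Hypothetical toy rows (not real passengers), same column names as train.csv
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 26, NA)
)

# Convert all three categorical columns to factors in one call;
# Age is left untouched and stays numeric
toy <- toy %>%
  mutate(across(c(Survived, Pclass, Sex), as.factor))

str(toy)
```

This keeps the list of categorical columns in one place, which is handy if more columns (for example Embarked) need the same treatment later.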

Now the columns are in the right types, so we can take a look at the summary statistics.

summary(titanic)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin             Embarked        
##  Min.   :  0.00   Length:891         Length:891        
##  1st Qu.:  7.91   Class :character   Class :character  
##  Median : 14.45   Mode  :character   Mode  :character  
##  Mean   : 32.20                                        
##  3rd Qu.: 31.00                                        
##  Max.   :512.33                                        
## 

So we have:

  1. 891 passengers

  2. 549 deaths and 342 survivors

  3. 314 female passengers and 577 male passengers

  4. 3 passenger classes (Pclass)
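
The counts in the list above are easier to compare as percentages; from the summary, 342 of 891 passengers is about 38.4% survivors. Below is a small sketch of how that could be computed for any categorical column. `count_share()` is a hypothetical helper I made up for illustration, and `toy_survived` is a made-up vector standing in for titanic$Survived.

```r
# Hypothetical helper: counts and percentages for one categorical column
count_share <- function(x) {
  counts <- table(x)
  data.frame(
    value   = names(counts),
    n       = as.integer(counts),
    percent = round(100 * as.integer(counts) / sum(counts), 1)
  )
}

# Toy vector standing in for titanic$Survived (0 = died, 1 = survived)
toy_survived <- factor(c(0, 1, 1, 0, 0), levels = c(0, 1))

count_share(toy_survived)
```

With the real data you would call count_share(titanic$Survived), count_share(titanic$Sex), and so on.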

Survival Chance Based on Sex and Passenger Class

titanic %>%
  group_by(Sex, Pclass) %>%
  count(Survived, name = "Total")

As we can see, female passengers had a higher survival rate than male passengers, which fits the captain’s order to save the women and children first. The order was not followed consistently, though, so some female passengers still did not survive. As for Pclass, more of the first-class passengers survived, likely because their cabins were closer to the lifeboats, while many of the emigrants in third class died, possibly because their poor English meant they did not understand what was happening.
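
The count table gives totals per group, but a survival rate per group makes the comparison more direct. Here is a minimal sketch, assuming Survived is a factor with levels "0" and "1" as in the conversion step earlier; the `toy` rows are made up so the snippet runs on its own, and the column names Total, Survivors and SurvivalRate are my own choice.

```r
library(dplyr)

# Hypothetical toy rows (not real passengers)
toy <- data.frame(
  Survived = factor(c(1, 1, 0, 0, 0, 1, 0, 0)),
  Pclass   = factor(c(1, 1, 3, 3, 1, 3, 3, 3)),
  Sex      = factor(c("female", "female", "female", "male",
                      "male", "male", "male", "male"))
)

# One row per Sex/Pclass combination, with group size and share of survivors
rates <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(
    Total        = n(),
    Survivors    = sum(Survived == "1"),
    SurvivalRate = mean(Survived == "1"),
    .groups      = "drop"
  )

rates
```

Swapping `toy` for `titanic` gives the same table for the real passengers.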

Visualizations Based on Selected Variables

First, create a new column for family size (siblings/spouses plus parents/children plus the passenger) and see whether family size matters for a passenger’s survival.

titanic <- titanic %>%
    mutate(Fsize=SibSp+Parch+1)
titanic %>% 
  ggplot()+
  geom_bar(aes(x=Pclass,fill=factor(Fsize))) +
  facet_wrap(~Survived)

Based on the plot, 3rd-class passengers brought more family members than those in the other classes, and 3rd class also has the highest number of deaths. Among the survivors in every class, passengers travelling alone are the largest group. Maybe that is because they did not have to take care of others when getting into a lifeboat? Keep in mind that the bars show counts, not rates, so the size of each group also plays a role.
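
Since the bar chart shows counts, a quick way to check the family-size idea is to compute the share of survivors for each family size. This is only a sketch on made-up rows (the `toy` data frame is hypothetical), assuming Survived is a factor with levels "0" and "1"; Fsize is built the same way as above.

```r
library(dplyr)

# Hypothetical toy rows (not real passengers)
toy <- data.frame(
  Survived = factor(c(0, 0, 1, 1, 1, 0)),
  SibSp    = c(0, 0, 1, 1, 3, 3),
  Parch    = c(0, 0, 0, 0, 2, 2)
)

# Family size = siblings/spouses + parents/children + the passenger
by_fsize <- toy %>%
  mutate(Fsize = SibSp + Parch + 1) %>%
  group_by(Fsize) %>%
  summarise(
    Total        = n(),
    SurvivalRate = mean(Survived == "1"),
    .groups      = "drop"
  )

by_fsize
```

Running the same pipeline on `titanic` shows whether passengers travelling alone really did better, or whether they are simply the most numerous group.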

How about sex? Does sex matter for surviving the incident?

ggplot(titanic, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  # position = "fill" rescales each bar to 1, so the y-axis is a proportion
  geom_bar(position = "fill") +
  labs(y = "Proportion of Passengers", title = "Titanic Survival Rates by Sex")

As we can see, female passengers had a higher survival rate than male passengers, which fits the captain’s order to save the women and children first. The order was not followed consistently, though, so some female passengers still did not survive.
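
Because position = "fill" scales every bar to a height of 1, the y-axis of the plot above reads as a share rather than a count. If percentages are easier to read, the axis labels can be formatted with the scales package (it is installed together with ggplot2). A minimal sketch on made-up rows; the `toy` data frame and the axis title are my own assumptions:

```r
library(ggplot2)

# Hypothetical toy rows (not real passengers)
toy <- data.frame(
  Sex      = factor(c("female", "female", "male", "male", "male")),
  Survived = factor(c(1, 1, 0, 0, 1))
)

p <- ggplot(toy, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar(position = "fill") +
  # Show 0.25 as 25%, 0.5 as 50%, and so on
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of Passengers", title = "Survival Share by Sex (toy data)")

p
```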

titanic %>%
  ggplot(aes(x= Sex, fill = Survived)) +
  theme_bw()+
  facet_wrap(~Pclass)+
  geom_bar()+
  labs(y = "Passenger Count", title = "Titanic Survival Rates by Pclass and Sex")

As for Pclass, more of the first-class passengers survived, likely because their cabins were closer to the lifeboats, while many of the emigrants in third class died, possibly because their poor English meant they did not understand what was happening.

Next, a quick look at the age distribution. Which age range has the highest number of deaths?

ggplot(titanic, aes(x= Age, fill =  Survived))+
  theme_bw()+
  geom_histogram(binwidth = 5)+
  labs(y = "Passenger Count",x= "Age(binwidth=5)", title = "Titanic Age Distribution")
## Warning: Removed 177 rows containing non-finite values (stat_bin).

Several children still did not survive, even though the order was to save women and children first. Many of the passengers who died were between 20 and 40 years old.
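
To back up the reading of the histogram with numbers, the deaths can be counted per age band. Below is a minimal sketch, assuming Survived is a factor with levels "0" and "1"; the 20-year bands and the `toy` rows are my own choices for illustration, and rows with a missing Age are dropped first (the same 177 rows the histogram warns about in the real data).

```r
library(dplyr)

# Hypothetical toy rows (not real passengers)
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 1, 0, 1, 0)),
  Age      = c(4, 22, 27, 35, 38, 61, NA)
)

deaths_by_age <- toy %>%
  # Keep only passengers with a known age who did not survive
  filter(!is.na(Age), Survived == "0") %>%
  # Bands [0,20), [20,40), [40,60), [60,80]; include.lowest keeps age 80
  mutate(AgeGroup = cut(Age, breaks = seq(0, 80, by = 20),
                        right = FALSE, include.lowest = TRUE)) %>%
  count(AgeGroup, name = "Deaths") %>%
  arrange(desc(Deaths))

deaths_by_age
```

On `titanic`, the first row of this table is the age band with the most deaths.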

ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  facet_wrap(Sex ~ Pclass) +
  # alpha makes the overlapping density curves see-through
  geom_density(alpha = 0.5) +
  labs(y = "Density", x = "Age", title = "Titanic Survival Rates by Age, Pclass and Sex")
## Warning: Removed 177 rows containing non-finite values (stat_density).

Conclusion

Finally, in the last plot we can see that female passengers were the most likely to survive the sinking. Many of the 3rd-class passengers did not make it, while 1st- and 2nd-class passengers had a higher survival rate than those in 3rd class.