library(tidyverse)
library(ggplot2)
library(ggpie)
knitr::include_graphics("titanicsink.jpg")

1 INTRODUCTION

Here we explore data about the passengers of a renowned ship, the RMS Titanic. By the end of this project, we may uncover some facts about its passengers. To start, let’s import the data!

2 IMPORT DATA

tnc <- read.csv("Titanic.csv")

3 EDA

dim(tnc)
## [1] 418  12
head(tnc)
PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
892          0         3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292          Q
893          1         3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000          S
894          0         2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875          Q
895          0         3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625          S
896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875         S
897          0         3       Svensson, Mr. Johan Cervin                    male    14.0  0      0      7538     9.2250          S
tail(tnc)
     PassengerId  Survived  Pclass  Name                            Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
413  1304         1         3       Henriksson, Miss. Jenny Lovisa  female  28.0  0      0      347086              7.7750           S
414  1305         0         3       Spector, Mr. Woolf              male    NA    0      0      A.5. 3236           8.0500           S
415  1306         1         1       Oliva y Ocana, Dona. Fermina    female  39.0  0      0      PC 17758            108.9000  C105   C
416  1307         0         3       Saether, Mr. Simon Sivertsen    male    38.5  0      0      SOTON/O.Q. 3101262  7.2500           S
417  1308         0         3       Ware, Mr. Frederick             male    NA    0      0      359309              8.0500           S
418  1309         0         3       Peter, Master. Michael J        male    NA    1      1      2668                22.3583          C
str(tnc)
## 'data.frame':    418 obs. of  12 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Survived   : int  0 1 0 0 1 0 1 0 1 0 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

3.1 Checking for Missing Values

anyNA(tnc)
## [1] TRUE
colSums(is.na(tnc))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0          86 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0
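Note that the count for Cabin is 0 even though the str() output shows empty strings in that column: is.na() only detects true NA values. A minimal sketch of how blank strings could be counted alongside NA is shown here. The toy data frame and its values are made up for illustration and only stand in for tnc.

```r
# Toy data frame standing in for tnc (values are illustrative, not from the file).
toy <- data.frame(
  Age   = c(34.5, NA, 62),
  Cabin = c("", "C105", ""),
  stringsAsFactors = FALSE
)

# is.na() only sees true NA values, so the blank cabins are not counted.
na_counts <- colSums(is.na(toy))

# Counting empty strings separately reveals gaps hidden in character columns.
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(na_counts)
print(blank_counts)
```

Whether a blank Cabin should be treated as missing is a modelling choice; the sketch only makes the blanks visible.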

3.2 Excluding Missing Values

tnc <- na.exclude(tnc)
dim(tnc)
## [1] 331  12
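The row count drops from 418 to 331, a loss of 87 rows. That matches the 86 missing Age values plus the 1 missing Fare value, so no passenger is missing both. As a rough sketch of the same filtering step, complete.cases() can restrict the check to chosen columns. The toy data frame below is an assumption for illustration, not a sample of the real file.

```r
# Toy data: rows 2 and 5 lack Age, row 3 lacks Fare (illustrative values only).
toy <- data.frame(
  PassengerId = 1:5,
  Age  = c(34.5, NA, 62, 27, NA),
  Fare = c(7.83, 7.00, NA, 8.66, 12.29)
)

# Keep only the rows where both Age and Fare are present.
keep <- complete.cases(toy[, c("Age", "Fare")])
cleaned <- toy[keep, ]

# Number of rows removed by the filter.
rows_dropped <- nrow(toy) - nrow(cleaned)
print(rows_dropped)
```

Limiting the check to named columns avoids dropping rows because of gaps in columns that the analysis never uses.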

3.3 Summarizing the data

summary(tnc)
##   PassengerId        Survived          Pclass          Name          
##  Min.   : 892.0   Min.   :0.0000   Min.   :1.000   Length:331        
##  1st Qu.: 992.5   1st Qu.:0.0000   1st Qu.:1.000   Class :character  
##  Median :1100.0   Median :0.0000   Median :2.000   Mode  :character  
##  Mean   :1100.2   Mean   :0.3837   Mean   :2.142                     
##  3rd Qu.:1210.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1307.0   Max.   :1.0000   Max.   :3.000                     
##      Sex                 Age            SibSp            Parch       
##  Length:331         Min.   : 0.17   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :27.00   Median :0.0000   Median :0.0000  
##                     Mean   :30.18   Mean   :0.4834   Mean   :0.3988  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :76.00   Max.   :8.0000   Max.   :6.0000  
##     Ticket               Fare           Cabin             Embarked        
##  Length:331         Min.   :  0.00   Length:331         Length:331        
##  Class :character   1st Qu.:  8.05   Class :character   Class :character  
##  Mode  :character   Median : 16.00   Mode  :character   Mode  :character  
##                     Mean   : 40.98                                        
##                     3rd Qu.: 40.63                                        
##                     Max.   :512.33

3.4 Checking the Unique Values of Each Column

sapply(tnc, n_distinct)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         331           2           3         331           2          78 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         284         148          73           3
unique(tnc$Survived)
## [1] 0 1
unique(tnc$Pclass)
## [1] 3 2 1
unique(tnc$Sex)
## [1] "male"   "female"
unique(tnc$Embarked)
## [1] "Q" "S" "C"

Here are some facts and explanations about the data:

  1. PassengerId is unique for each passenger; no two passengers share an id

  2. Survived contains 2 categories: 0 = did not survive, 1 = survived

  3. Pclass is the passenger class (1 = first, 2 = second, 3 = third)

  4. Name, Sex, and Age are self-explanatory

  5. SibSp is the number of siblings/spouses the passenger had on board the RMS Titanic

  6. Parch is the number of parents/children the passenger had on board the RMS Titanic

  7. Ticket is the ticket number. It is not unique per passenger: there are 284 distinct tickets for 331 passengers, so some tickets are shared

  8. Fare is the ticket price. The mean fare is 40.98 and the most expensive ticket costs 512.33 (historically, Titanic fares were quoted in pounds sterling rather than dollars)

  9. Cabin is the cabin number the passenger stayed in (blank if none is recorded)

  10. Embarked is the port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton

Now, I will remove the SibSp and Parch columns, as they do not tell us much about the relationships between the variables or observations.

# Drop SibSp and Parch by name rather than by position
tnc <- tnc[, !(names(tnc) %in% c("SibSp", "Parch"))]
names(tnc)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "Ticket"      "Fare"        "Cabin"       "Embarked"

4 DATA VISUALIZATION

head(tnc)
PassengerId  Survived  Pclass  Name                                          Sex     Age   Ticket   Fare     Cabin  Embarked
892          0         3       Kelly, Mr. James                              male    34.5  330911   7.8292          Q
893          1         3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  363272   7.0000          S
894          0         2       Myles, Mr. Thomas Francis                     male    62.0  240276   9.6875          Q
895          0         3       Wirz, Mr. Albert                              male    27.0  315154   8.6625          S
896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  3101298  12.2875         S
897          0         3       Svensson, Mr. Johan Cervin                    male    14.0  7538     9.2250          S

4.1 Proportion of Males and Females On Board

ggpie3D(data = tnc, group_key = "Sex", 
        count_type = "full", 
        tilt_degrees = 8, 
        label_size=2) + 
  ggtitle("Percentage Between Males and Females On Board") + 
  theme(plot.title = element_text(hjust = 0.5))

Of the 331 passengers recorded in the data, 62% are male (204 passengers) and 38% are female (127 passengers).
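The percentages can be reproduced from the two counts. The sketch below hard-codes the counts quoted in the text instead of reading the file; on the real data, something like prop.table(table(tnc$Sex)) should give the same shares.

```r
# Counts taken from the text above, not recomputed from Titanic.csv.
sex_counts <- c(male = 204, female = 127)

# Share of each sex as a whole-number percentage.
sex_pct <- round(100 * sex_counts / sum(sex_counts))

print(sum(sex_counts))
print(sex_pct)
```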

4.2 Ticket per Class

ggplot(tnc, aes(Pclass, Fare))+
  geom_col(aes(fill= Sex), position = "dodge")+
  geom_jitter(aes(col=Survived, size=Fare))+
  labs(title = "Ticket Fare per Class of Titanic Passenger", 
       x = "Passenger Class")+
  theme(plot.title = element_text(hjust = 0.5))

From the graph above we can see that:

  1. the most expensive fare was a first class ticket bought by a female passenger, at over 500. Because the bars are dodged rather than stacked, each bar's height is the highest single fare in its group, not a total

  2. the lowest bar belongs to third class female passengers, whose highest fare is slightly below 50

  3. one first class passenger paid nothing at all

  4. third class passengers form the largest group in the data

  5. some first class tickets cost under 100, the same range as second and third class tickets

  6. all second and third class fares are under 100
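Reading exact values off a chart is error-prone, so it can help to compute the same quantities directly. With position = "dodge", bars that share a class and sex are drawn on top of each other, so the visible height is the group's maximum fare. The sketch below computes that maximum, plus the minimum fare per class, on a made-up toy data frame. The values are illustrative assumptions, not rows from the file.

```r
# Toy fares (illustrative only); the real analysis would use tnc instead.
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3),
  Sex    = c("female", "female", "male", "female", "male", "female", "male"),
  Fare   = c(512.33, 80.00, 0.00, 26.00, 13.00, 7.75, 8.05)
)

# Highest fare per class and sex: what a dodged geom_col bar effectively shows.
top_fare <- aggregate(Fare ~ Pclass + Sex, data = toy, FUN = max)

# Lowest fare per class: useful for spotting zero-fare tickets.
low_fare <- aggregate(Fare ~ Pclass, data = toy, FUN = min)

print(top_fare)
print(low_fare)
```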

Based on insight #3, let’s find out who boarded the RMS Titanic without paying a single penny.

filter(tnc, Pclass == 1, Fare == 0)
PassengerId  Survived  Pclass  Name                     Sex   Age  Ticket  Fare  Cabin        Embarked
1264         0         1       Ismay, Mr. Joseph Bruce  male  49   112058  0     B52 B54 B56  S

The first class passenger who paid no fare is Mr. Joseph Bruce Ismay, the chairman of the White Star Line, the company that owned the RMS Titanic.

4.3 Passengers’ Age by Sex

ggplot(tnc, aes(Sex, Age))+
  geom_boxplot(outlier.shape=NA, aes(fill=Sex), col="Blue")+
  geom_jitter(alpha=0.5, col="orange")+
  labs(title = "Passengers Age")+
  theme(plot.title = element_text(hjust = 0.5))

From the plot above we can see that:

  1. for both sexes, most passengers are aged 20-40, followed by 41-60 and then 0-20

  2. there are babies on board

  3. the oldest male passenger is under 70

  4. the oldest female passenger is above 70

  5. the median age (the line inside each box) is almost the same for both sexes
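The age-band and median readings above can also be checked numerically. The sketch below bins ages with cut() and computes the median per sex with tapply(). The toy ages and the exact band edges (0-20, 21-40, 41-60, 61+) are assumptions chosen for illustration.

```r
# Toy ages (illustrative only); the real analysis would use tnc$Age and tnc$Sex.
toy <- data.frame(
  Sex = c("male", "male", "male", "female", "female", "female"),
  Age = c(0.17, 27, 62, 22, 47, 76)
)

# Bin ages into bands; intervals are right-closed, e.g. (20, 40].
toy$AgeBand <- cut(toy$Age,
                   breaks = c(0, 20, 40, 60, Inf),
                   labels = c("0-20", "21-40", "41-60", "61+"))

# Passengers per sex and age band.
band_counts <- table(toy$Sex, toy$AgeBand)

# Median age per sex, the statistic a boxplot draws as its centre line.
median_age <- tapply(toy$Age, toy$Sex, median)

print(band_counts)
print(median_age)
```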

4.4 Proportion of Passengers Who Survived, Grouped by Sex

ggplot(tnc, aes(Sex, Survived))+
  labs(title = "Passenger Survived", 
       x = "Sex",
       y = "Survived")+
  geom_bar(stat = "identity", aes(fill= Sex))+
  coord_polar("y", start=0, direction = 1)+
  theme_void()

As we can see from the pie chart above, all surviving passengers in the data are female. To verify this, let’s check the table.

filter(tnc, Survived == 1, Sex == "male")
PassengerId Survived Pclass Name Sex Age Ticket Fare Cabin Embarked
filter(tnc, Survived == 0, Sex == "female")
PassengerId Survived Pclass Name Sex Age Ticket Fare Cabin Embarked

Both filters return no rows, so we can conclude that all females recorded in the data survived the accident and all males recorded in the data did not. A split this clean describes this file only and should not be read as the survival pattern of the whole ship.
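The same check can be done in one step with a cross-tabulation instead of two separate filters. The sketch below uses a small made-up data frame that mimics the pattern found above; on the real data the call would be table(tnc$Sex, tnc$Survived).

```r
# Toy data mimicking the pattern in the file (illustrative values only).
toy <- data.frame(
  Sex      = c("male", "female", "male", "female", "male"),
  Survived = c(0, 1, 0, 1, 0)
)

# Rows are sexes, columns are survival outcomes; zero cells mark empty groups.
survival_table <- table(Sex = toy$Sex, Survived = toy$Survived)

# Survival rate per sex: the mean of a 0/1 column is the share of 1s.
survival_rate <- tapply(toy$Survived, toy$Sex, mean)

print(survival_table)
print(survival_rate)
```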

4.5 Percentage of Embarkation In Each Port

ggrosepie(data = tnc, group_key = c("Embarked", "Sex"), 
          count_type = "full", 
          label_info = "all",
          show_tick = F,
          donut_frac = NULL,
          donut_label_size = "5") + 
  ggtitle("Percentage of Embarkation In Each Port") + 
  theme(plot.title = element_text(hjust = 0.5))

# Count passengers per port and fill each bar by the share of each sex
ggplot(tnc, aes(Embarked))+
  geom_bar(aes(fill = Sex), position = "fill")+
  labs(title = "Proportion Based On Sex In Each Embarkation Port", 
       x = "Embarkation Port",
       y = "Proportion")+
  theme(plot.title = element_text(hjust = 0.5))

From the 2 graphs above, the data show that:

  1. Southampton has the most embarkations, with 69% of all passengers, the majority of them male

  2. Cherbourg has the second most embarkations, with 25% of all passengers, the majority of them male

  3. Queenstown has the fewest embarkations, with 7% of all passengers. However, it has the highest proportion of females of the three ports

From the “Percentage of Embarkation In Each Port” graph above, we can see the percentage of passengers that embarked at each port. Unfortunately, that graph does not show the split between males and females, so we have to read it from the “Proportion Based On Sex In Each Embarkation Port” graph. Using 2 graphs is inefficient, since both pieces of information fit in a single graph. In the next visualization we will show both at once, this time for the proportion of males and females in each class.
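Before plotting, both quantities can be computed from a single cross-tabulation. The sketch below derives the share of passengers per port and the sex split within each port using prop.table(). The toy data frame is a made-up assumption, so its percentages do not match the real figures above.

```r
# Toy embarkation data (illustrative only); the real analysis would use tnc.
toy <- data.frame(
  Embarked = c("S", "S", "S", "S", "C", "C", "Q", "Q"),
  Sex      = c("male", "male", "male", "female",
               "male", "female", "female", "female")
)

# Share of all passengers boarding at each port, in percent.
port_share <- round(100 * prop.table(table(toy$Embarked)), 1)

# Sex split within each port: margin = 1 makes every row sum to 100.
sex_within_port <- round(
  100 * prop.table(table(toy$Embarked, toy$Sex), margin = 1), 1
)

print(port_share)
print(sex_within_port)
```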

4.6 Proportion of Males And Females In Each Class

ggnestedpie(tnc, group_key = c("Pclass", "Sex"), count_type = "full",
            inner_label_info = "all", 
            inner_label_split = NULL,
            inner_label_size = 2,
            outer_label_type = "circle", 
            outer_label_pos = "in", 
            outer_label_info = "all")+
  labs(title = "Proportions of Males And Females In Each Class")+
  theme(plot.title = element_text(hjust=0.5))
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

From the nested pie chart above we can see that:

  1. 1st class is the second largest, at 29.6% of all passengers: males make up 15.11% and females 14.50% of all passengers

  2. 2nd class is the smallest, at 26.6% of all passengers: males make up 17.82% and females 8.76% of all passengers

  3. 3rd class is the largest, at 43.8% of all passengers: males make up 28.70% and females 15.11% of all passengers
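As a consistency check, the percentages above can be rebuilt from counts. The counts in the sketch below are not printed anywhere in this document: they are back-calculated from the reported percentages and the total of 331 passengers, so treat them as an assumption. On the real data, table(tnc$Pclass, tnc$Sex) would supply them directly.

```r
# Counts back-calculated from the reported percentages (an assumption,
# not output copied from the analysis).
counts <- matrix(
  c(50, 48,
    59, 29,
    95, 50),
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass = c("1", "2", "3"), Sex = c("male", "female"))
)

# Each class/sex cell as a percentage of all passengers (inner ring labels).
cell_pct <- round(100 * counts / sum(counts), 2)

# Each class as a percentage of all passengers (outer ring labels).
class_pct <- round(100 * rowSums(counts) / sum(counts), 1)

print(cell_pct)
print(class_pct)
```

The column totals of these counts, 204 males and 127 females, agree with the figures reported in section 4.1.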