EDA : Subset Data of the Titanic Passenger Onboard

Raja Palawija

21 February 2023

Brief History of the Titanic Disaster

The sinking of the Titanic is one of the most infamous shipwrecks in history.

The RMS Titanic, widely considered unsinkable, struck an iceberg late on Sunday, April 14, 1912, during her maiden voyage and sank in the early hours of April 15. The Titanic’s distress signals were heard by a nearby ship. Unfortunately, there weren’t enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

Federal law soon required all large ocean-going vessels to be equipped with wireless for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.

Preprocessing of Passenger Data

The data comes from Kaggle. The goal of this project is to explore train.csv. The results will feed the next project, which analyses what sorts of people were likely to survive.

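# Load the packages used in this report
library(reshape)  # melt()
library(dplyr)    # %>%, mutate(), select()
library(ggplot2)  # ggplot()
library(lessR)    # PieChart()

# Load CSV file
titanic_data <- read.csv("dataInputs/train.csv")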
str(titanic_data) 
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Check missing values (empty or NA)

# Check missing value using "reshape" library
missing_data <- melt(apply(titanic_data[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
##      [,1]       [,2] 
## [1,] "Age"      "177"
## [2,] "Cabin"    "687"
## [3,] "Embarked" "2"
  • Missing values in Age will be replaced with 0 and later categorised as “Uncategorised”.
  • Cabin is missing for about 77% of passengers (687 of 891), so we drop this variable instead of imputing it.
  • Missing values in Embarked will be replaced with the most common port.
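
The same missing-value counts can also be obtained with base R alone, without the reshape package. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions, not rows from train.csv.

```r
# Toy data frame standing in for titanic_data (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# A cell counts as missing when it is NA or an empty string
missing_counts <- colSums(is.na(toy) | toy == "")

# Keep only the columns that have at least one missing cell
missing_counts[missing_counts > 0]

# Share of missing cells per column, in percent
round(100 * missing_counts / nrow(toy), 1)
```

Applied to the real data, 687 missing Cabin values out of 891 rows is about 77%.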

Clean-up data

titanic_data$Surviveddetailed[titanic_data$Survived == 0] <- "No"
titanic_data$Surviveddetailed[titanic_data$Survived == 1] <- "Yes"
titanic_data$Sex[titanic_data$Sex == "male"] <- "Male"
titanic_data$Sex[titanic_data$Sex == "female"] <- "Female"
titanic_data <- titanic_data %>% 
  mutate(
     Age = as.integer(Age),
     Embarked = as.factor(Embarked),
     Survived = as.factor(Survived), 
     Pclass = as.factor(Pclass),
     Sex = as.factor(Sex),
     SibSp = as.factor(SibSp), 
     Parch = as.factor(Parch),
     Surviveddetailed = as.factor(Surviveddetailed)) %>% 
 select(-c(1,11))

Update missing data

# Replace missing Age values with 0 (a placeholder, not a real age)
titanic_data$Age[is.na(titanic_data$Age)] <- 0
# Find the most common value of Embarked (it is S)
table(titanic_data$Embarked)
## 
##       C   Q   S 
##   2 168  77 644
# Replace missing Embarked values with the most common port
titanic_data$Embarked[which(is.na(titanic_data$Embarked) | titanic_data$Embarked=="")] <- 'S'
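
The most common value does not have to be read off the table by hand. As a hedged sketch, assuming a plain character vector (the `embarked` vector below is made up for illustration), the mode can be computed and filled in like this:

```r
# Made-up port codes with one empty string and one NA
embarked <- c("S", "C", "", "S", "Q", NA, "S")

# Flag the entries that need to be filled in
is_missing <- is.na(embarked) | embarked == ""

# Most frequent value among the entries that are present
mode_value <- names(which.max(table(embarked[!is_missing])))

# Fill the gaps with that value
embarked[is_missing] <- mode_value
embarked
```

When two values are tied, `which.max()` returns the first one, so ties would need a deliberate rule.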

Create new variables for “Age Category” and “Title of Passenger”

# Create a new variable to categorise Age. Ages stored as 0 (missing, or below 1 before as.integer() truncated them) become "Uncategorised"
titanic_data$Age_Category[titanic_data$Age < 1] <- "Uncategorised"
titanic_data$Age_Category[titanic_data$Age > 0 & titanic_data$Age <=14] <- "Children"
titanic_data$Age_Category[titanic_data$Age >=15 & titanic_data$Age <=24] <- "Youth"
titanic_data$Age_Category[titanic_data$Age >=25 & titanic_data$Age <=64] <- "Adults"
titanic_data$Age_Category[titanic_data$Age >= 65] <- "Seniors"
titanic_data$Age_Category <- as.factor(titanic_data$Age_Category)
levels(titanic_data$Age_Category)
## [1] "Adults"        "Children"      "Seniors"       "Uncategorised"
## [5] "Youth"
# Create a new variable holding the title of each passenger
titanic_data$Titles <- regmatches(as.character(titanic_data$Name),regexpr("\\,[A-Za-z ]{1,20}\\.", as.character(titanic_data$Name)))
# Drop the leading ", " and the trailing "."
titanic_data$Titles <- substr(titanic_data$Titles, 3, nchar(titanic_data$Titles) - 1)
# Group rare titles as "Others" using exact matches instead of substring replacement
rare_titles <- c("Dr", "Rev", "Col", "Major", "the Countess", "Sir", "Jonkheer", "Lady", "Capt", "Don")
titanic_data$Titles[titanic_data$Titles %in% rare_titles] <- "Others"
titanic_data$Titles[titanic_data$Titles %in% c("Ms", "Mlle")] <- "Miss"
titanic_data$Titles[titanic_data$Titles == "Mme"] <- "Mrs"
titanic_data$Titles <- as.factor(titanic_data$Titles)
levels(titanic_data$Titles)
## [1] "Master" "Miss"   "Mr"     "Mrs"    "Others"
# Update embarked location 
switch.location <- function(x){
  y <- switch(as.character(x),
       "C" = "Cherbourg",
       "Q" = "Queenstown",
       "S" = "Southampton")
  return(y)
}

titanic_data$Embarked <- as.factor(sapply(titanic_data$Embarked, FUN = switch.location))
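
The age bins defined above can also be expressed with `cut()`, which keeps all the boundaries in one place. This is a sketch under stated assumptions: the `age` vector is made up, missing ages are kept as NA instead of 0, and ages below 1 therefore land in “Children” rather than “Uncategorised”, which differs from the report.

```r
# Made-up ages covering every bin boundary, plus one missing value
age <- c(NA, 0.42, 2, 14, 15, 24, 25, 64, 65, 80)

# Left-closed bins: [0,15), [15,25), [25,65), [65,Inf)
age_category <- cut(
  age,
  breaks = c(0, 15, 25, 65, Inf),
  labels = c("Children", "Youth", "Adults", "Seniors"),
  right  = FALSE
)

# Missing ages get their own explicit label
age_category <- as.character(age_category)
age_category[is.na(age_category)] <- "Uncategorised"
age_category <- factor(
  age_category,
  levels = c("Children", "Youth", "Adults", "Seniors", "Uncategorised")
)

table(age_category)
```

Setting the factor levels explicitly also puts the categories in age order instead of alphabetical order, which reads more naturally in tables and bar charts.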

Ensure the data is as wanted

# Check for remaining NA values
colSums(is.na(titanic_data))
##         Survived           Pclass             Name              Sex 
##                0                0                0                0 
##              Age            SibSp            Parch           Ticket 
##                0                0                0                0 
##             Fare         Embarked Surviveddetailed     Age_Category 
##                0                0                0                0 
##           Titles 
##                0
# Check structure of each column
str(titanic_data)
## 'data.frame':    891 obs. of  13 variables:
##  $ Survived        : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass          : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name            : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex             : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age             : num  22 38 26 35 35 0 54 2 27 14 ...
##  $ SibSp           : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch           : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ticket          : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare            : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked        : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 1 3 3 3 2 3 3 3 1 ...
##  $ Surviveddetailed: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Age_Category    : Factor w/ 5 levels "Adults","Children",..: 5 1 1 1 1 4 1 2 1 2 ...
##  $ Titles          : Factor w/ 5 levels "Master","Miss",..: 3 4 2 4 3 3 3 1 4 4 ...

Passenger Information at a Glance

Data information

Details of the data after pre-processing:

  1. Survived = Survival (0 = No; 1 = Yes)
  2. Pclass = Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  3. Name = Name
  4. Sex = Sex
  5. Age = Age in whole years (0 = not available)
  6. SibSp = Number of Siblings/Spouses Aboard
  7. Parch = Number of Parents/Children Aboard
  8. Ticket = Ticket Number
  9. Fare = Passenger Fare
  10. Embarked = Port of Embarkation (Cherbourg, Queenstown or Southampton)
  11. Surviveddetailed = Survival as a label (No or Yes)
  12. Age_Category = Age Category of Passenger
  13. Titles = Titles of Passenger
# Number of passengers
nrow(titanic_data)
## [1] 891
# Summary of the passenger data
summary(titanic_data)
##  Survived Pclass      Name               Sex           Age        SibSp  
##  0:549    1:216   Length:891         Female:314   Min.   : 0.00   0:608  
##  1:342    2:184   Class :character   Male  :577   1st Qu.: 6.00   1:209  
##           3:491   Mode  :character                Median :24.00   2: 28  
##                                                   Mean   :23.78   3: 16  
##                                                   3rd Qu.:35.00   4: 18  
##                                                   Max.   :80.00   5:  5  
##                                                                   8:  7  
##  Parch      Ticket               Fare               Embarked   Surviveddetailed
##  0:678   Length:891         Min.   :  0.00   Cherbourg  :168   No :549         
##  1:118   Class :character   1st Qu.:  7.91   Queenstown : 77   Yes:342         
##  2: 80   Mode  :character   Median : 14.45   Southampton:646                   
##  3:  5                      Mean   : 32.20                                     
##  4:  4                      3rd Qu.: 31.00                                     
##  5:  5                      Max.   :512.33                                     
##  6:  1                                                                         
##         Age_Category    Titles   
##  Adults       :425   Master: 40  
##  Children     : 71   Miss  :185  
##  Seniors      : 11   Mr    :517  
##  Uncategorised:184   Mrs   :126  
##  Youth        :200   Others: 23  
##                                  
## 
# Take a glimpse at the data
head(titanic_data, n=1)
##   Survived Pclass                    Name  Sex Age SibSp Parch    Ticket Fare
## 1        0      3 Braund, Mr. Owen Harris Male  22     1     0 A/5 21171 7.25
##      Embarked Surviveddetailed Age_Category Titles
## 1 Southampton               No        Youth     Mr
# Show a comparison between male and female passengers using pie chart

passenger_sex <- data.frame(sex = titanic_data$Sex)

PieChart(sex, hole = 0, 
         values = "%", 
         data = passenger_sex,
         fill = c("#B9F3FC", "#93C6E7"), 
         color = "black",
         values_size=getOption("10"), 
         main = "Gender Comparison"
)

This data set covers 891 passengers (577 male and 314 female), of whom 342 survived the disaster and 549 did not. The output above gives other insights too. For instance, the first passenger listed is Mr. Owen Harris Braund, who was 22 years old when he died on the Titanic. About 65% of the passengers are male and 35% are female.
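
The percentages quoted here follow directly from the counts in the summary() output. As a small sketch, with the counts typed in by hand so that it runs without train.csv:

```r
# Counts taken from the summary() output above
sex_counts      <- c(Female = 314, Male = 577)
survival_counts <- c(No = 549, Yes = 342)

# Convert counts to percentages of all 891 passengers
sex_share      <- round(100 * prop.table(sex_counts), 1)
survival_share <- round(100 * prop.table(survival_counts), 1)

sex_share       # Female 35.2, Male 64.8
survival_share  # No 61.6, Yes 38.4
```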

Passengers data statistics and numbers

Age distribution

# Remove age = 0
hist_age <- titanic_data[titanic_data$Age != 0, ]

# Use histogram to see the Age distribution on passengers data
hist(hist_age$Age,
     breaks=30,
     col = "#93C6E7",
     main = "Age Distribution of Titanic Passengers",
     xlab = "Age Range", 
     ylab = "Freq")

# Use a box plot to see the outliers in passenger age (placeholder zeros excluded)
boxplot(hist_age$Age,
        col = "#93C6E7",
        main = "Age of Passengers",
        xlab = NULL, 
        ylab = "Age")

# Central tendency of age 
median(titanic_data$Age)
## [1] 24

The histogram and box plot show that many of the passengers on the Titanic were between 20 and 35 years old. They also show that the “Age” data is skewed and contains outliers.

Because of that, we use the median to measure central tendency, which gives 24 years. Note that this median is computed on the full Age column, including the 184 zero placeholders, so it understates the typical age of the passengers whose age is known.
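
To see how placeholder zeros affect the median, here is a small illustration that uses only the first ten Age values printed by str(titanic_data) at the top of this report. It is a sketch of the effect, not a recomputation on the full data.

```r
# First ten Age values from str(titanic_data); NA marks the missing age
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# Approach used in this report: store missing ages as 0
age_zero <- ifelse(is.na(age), 0, age)
median(age_zero)            # 26.5, the placeholder zero is counted as an age

# Alternative: leave NA in place and skip it when summarising
median(age, na.rm = TRUE)   # 27
```

With a single placeholder the shift is small, but it grows with the share of zeros in the column.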

Numbers of passengers

# Number of passengers based on age category
table(titanic_data$Age_Category)
## 
##        Adults      Children       Seniors Uncategorised         Youth 
##           425            71            11           184           200
# Number of passengers based on passengers class 
table(titanic_data$Pclass)
## 
##   1   2   3 
## 216 184 491
# Number of passengers based on the port passengers embarked 
table(titanic_data$Embarked)
## 
##   Cherbourg  Queenstown Southampton 
##         168          77         646
# Number of passengers based on titles 
table(titanic_data$Titles)
## 
## Master   Miss     Mr    Mrs Others 
##     40    185    517    126     23

From the tables above, we can see:

  • Most passengers were Adults (425), followed by Youth (200), Uncategorised (184), Children (71) and Seniors (11). “Uncategorised” means that we do not have usable age information: 177 passengers have no recorded age, and ages below 1 were truncated to 0 by as.integer().
  • Most passengers travelled in 3rd class (491). Perhaps surprisingly, more travelled in 1st class (216) than in 2nd (184).
  • About 72.5% of passengers (646 of 891) embarked at Southampton, followed by Cherbourg and Queenstown.
  • Finally, there were 517 passengers titled Mr, 126 Mrs (married women), 185 Miss (unmarried women), 40 Master and 23 with other titles on board. Master is a title for an underage male: it is used for a person under 18, and Mr is used once he turns 18 and enters adulthood. “Others” groups the small number of passengers with any other title.

Details of survivors

Based on Age

plot_age <- ggplot(data = titanic_data, aes(x = Age, fill = Survived)) +
  geom_histogram() +
  scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
  labs(title = NULL,
       y = "Passenger Count",
       x = "Passenger Age",
       fill = "Survived?",
       caption = "Source : Kaggle") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        legend.position = "bottom",
        plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
        axis.title = element_text(face = "bold.italic", size = rel(0.85)))

plot_age

Based on passenger class

plot_pclass <- ggplot(data = titanic_data, aes(x = Pclass, fill = Surviveddetailed)) +
  geom_bar(width = 0.4) +
  scale_fill_manual(values = c("#93C6E7", "#B9F3FC"))  +
  stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
  labs(title = "Survivor Number by Passenger Class",
       y = "Count",
       x = "Passenger Class",
       fill = "Survived?",
       caption = "Source : Kaggle") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        legend.position = "bottom",
        plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
        axis.title = element_text(face = "bold.italic", size = rel(0.85)))

plot_pclass

Most survivors came from 1st class (136 people), followed by 3rd class (119 people) and 2nd class (87 people).
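
Raw survivor counts depend on how many passengers each class carried, so it helps to look at survival rates as well. A minimal sketch, with the survivor counts from the sentence above and the class sizes from table(titanic_data$Pclass) typed in by hand:

```r
# Survivors and total passengers per class, taken from this report
survivors  <- c("1" = 136, "2" = 87,  "3" = 119)
passengers <- c("1" = 216, "2" = 184, "3" = 491)

# Survival rate per class, in percent
survival_rate <- round(100 * survivors / passengers, 1)
survival_rate  # 1st: 63.0, 2nd: 47.3, 3rd: 24.2
```

Seen this way, 3rd class has more survivors than 2nd class in absolute terms but the lowest survival rate of the three.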

Based on embarked port

plot_embarked <- ggplot(data = titanic_data, aes(x = Embarked, fill = Surviveddetailed)) +
  geom_bar(width = 0.4) +
  scale_fill_manual(values = c("#93C6E7", "#B9F3FC"))  +
  stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
  labs(title = "Survivor Number by Embarked Port",
       y = "Count",
       x = "Embarked Port",
       fill = "Survived?",
       caption = "Source : Kaggle") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        legend.position = "bottom",
        plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
        axis.title = element_text(face = "bold.italic", size = rel(0.85)))

plot_embarked

Most survivors embarked at Southampton (219 people), followed by Cherbourg (93 people) and Queenstown (30 people).

Based on sex

plot_gender <- ggplot(data = titanic_data, aes(x = Sex, fill = Surviveddetailed)) +
  geom_bar(width = 0.4) +
  scale_fill_manual(values = c("#93C6E7", "#B9F3FC"))  +
  stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
  labs(title = "Survivor Number by Gender",
       y = "Count",
       x = "Gender",
       fill = "Survived?",
       caption = "Source : Kaggle") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        legend.position = "bottom",
        plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
        axis.title = element_text(face = "bold.italic", size = rel(0.85)))

plot_gender

Most of the survivors were female (233 people); only 109 male passengers survived.
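
The same comparison can be made as a two-way table with row percentages. The sketch below rebuilds row-level data from the counts reported above (314 women of whom 233 survived, 577 men of whom 109 survived); the vectors `sex` and `survived` are reconstructed for illustration and are not read from train.csv.

```r
# Rebuild one row per passenger from the counts in this report
sex      <- rep(c("Female", "Male"), times = c(314, 577))
survived <- c(rep(c("Yes", "No"), times = c(233, 81)),   # women
              rep(c("Yes", "No"), times = c(109, 468)))  # men

# Cross-tabulate, then convert each row to percentages
tab  <- table(Sex = sex, Survived = survived)
rate <- round(100 * prop.table(tab, margin = 1), 1)

tab
rate  # Female: 25.8 No, 74.2 Yes; Male: 81.1 No, 18.9 Yes
```

On the real data frame, the equivalent call would be table(titanic_data$Sex, titanic_data$Surviveddetailed).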

Based on age category

plot_agecat <- ggplot(data = titanic_data, aes(x = Age_Category, fill = Surviveddetailed)) +
  geom_bar(width = 0.4) +
  scale_fill_manual(values = c("#93C6E7", "#B9F3FC"))  +
  stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
  labs(title = "Survivor Number by Age Category",
       y = "Count",
       x = "Age Category",
       fill = "Survived?",
       caption = "Source : Kaggle") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        legend.position = "bottom",
        plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
        axis.title = element_text(face = "bold.italic", size = rel(0.85)))

plot_agecat

Most of the survivors were Adults (171 people); Seniors had the fewest, with only 1 survivor.
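
The four bar charts in this section differ only in the grouping column and its label, so they could be produced by one helper. The following is a sketch, not the code used for the figures above: `plot_survival_by()` is a hypothetical function name, and the `toy` data frame is made up so that the example runs on its own. It assumes ggplot2 3.3.0 or later for `after_stat()`.

```r
library(ggplot2)

# Hypothetical helper: stacked bar chart of survival counts for one grouping column
plot_survival_by <- function(data, group_col, group_label) {
  ggplot(data, aes(x = .data[[group_col]], fill = Surviveddetailed)) +
    geom_bar(width = 0.4) +
    scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
    stat_count(aes(label = after_stat(count)), geom = "text",
               position = position_stack(vjust = 0.5), show.legend = FALSE) +
    labs(title = paste("Survivor Number by", group_label),
         y = "Count", x = group_label, fill = "Survived?",
         caption = "Source : Kaggle") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
          legend.position = "bottom",
          plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
          axis.title = element_text(face = "bold.italic", size = rel(0.85)))
}

# Small made-up data frame so the sketch runs without train.csv
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 3, 3, 3)),
  Surviveddetailed = factor(c("Yes", "No", "Yes", "No", "No", "Yes"))
)
plot_toy <- plot_survival_by(toy, "Pclass", "Passenger Class")
```

With the real data, a call such as plot_survival_by(titanic_data, "Sex", "Gender") would stand in for the corresponding block above.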

Conclusion

Here are the main insights from this subset of the Titanic passenger data:

  • The data covers 891 passengers (577 male and 314 female), and only 342 of them survived the disaster. About 65% of the passengers are male and 35% are female, yet most of the survivors were female (233 people); only 109 male passengers survived.
  • The median of the passengers’ “Age” data is 24 years.
  • Most passengers were Adults (425 people). Adults also account for the most survivors (171 people), while Seniors had the fewest, with only 1 survivor.
  • Most passengers travelled in 3rd class, but the most survivors came from 1st class (136 people), followed by 3rd class (119 people) and 2nd class (87 people).
  • About 72.5% of passengers embarked at Southampton, followed by Cherbourg and Queenstown. By port, the most survivors embarked at Southampton (219 people), followed by Cherbourg (93 people) and Queenstown (30 people).