Titanic

Narcel Reedus
September 14, 2017

Summary

The maiden voyage of the Titanic; made popular by James Cameron's epic 1997 film, is one of the most infamous tragedies in human history. The "unsinkable" oceanliner struck an iceberg and sank to the bottom of the Atlantic killing 1,502 of its 2,224 passengers and crew.

Today, with the help of the ship's manifest, data analysts and data scientists from around the globe are able to discover numerous insights by appyling machine learning algorithms to predict who lived and who ultimately died onboard the RMS Titanic.

My goal here is to use machine learning and feature engineering to visualize the distinguishing factors between the passengers that were more likely to survive the shipwreck from those who ultimately perished.

The final graph shows that women in first class without children were more likely to survive the Titanic while men in third class with children were more likely to perish.

Analysis

Load dplyr and stringr libraries

library(dplyr)
library(stringr)
library(printr)
library(pander)
library(knitr)
library(httpuv)
library(caTools)

The first step is to become intimately familiar with the data. Even though there are only 891 observations and 12 variables, there are many insights hidden within each data point.

I begin by loading the titanic_train.csv data. Assign dataset to train. View data and view the structure of the data with str.

titanic_train <- read.csv("C:/Users/narce/OneDrive/Documents/GitHub/Titanic/titanic_train.csv")
train <- titanic_train
View(train)
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Variables

PassengerID

The PassengerID varible simply counts the number of observations. It does not directly connect to actual passengers onboard the Titanic. I will not use the PassengerID in this project.

Survived

Survived is a binary variable that shows whether a passenger survived or perished. Survived is hugely important and will serve as as our dependent variable.

Pclass

The Pclass variable translate well as a look into the socio-economic status of each passenger. 1 = first class or Upper deck, 2 = second class or Middle deck, and 3 = third class or lower deck. I will certainly take advange of this independent variable classification as I build new features.

Name

The Name variable is a string that presents this dataset with a bit of a challenge. There are a number of titles included within the name variable that we can extract and attempt to assign to gender. For instance, Jonkheer is an adult male but Miss is used to describe unwed adult women as well as female children.

Sex

The Sex variable signifies gender.

Age

The Age variable is important but the Age value is missing for several passengers. I will impute the Age for the missing values.

SibSp

SipSp lumps siblings (brother, sister, stepbrother, stepsister) together with Spouses (husbands and wives)

Parch

The Parch variable combines Parents travelling with children.

Ticket

The ticket variable provides a ticket number. However, some passengers do not have a ticket number or share a ticket number with their family.

Fare

The Fare variable relates to the price passengers' paid to travel on the Titanic. There are a number of passengers that did not pay a Fare. These missing fare values will be reconciled.

Cabin

The Cabin variable has a lot of missing values and may not play an inportant role in this analysis.

Embarked

The Embarked variable correlates to where passengers bordered the ship: C = Cherbourg, Q = Queenstown, S = Southampton

Cleaning

There are a number of passengers that appear to not have paid a fare. I filter the Fare variable for 0.0 to take a closer look at this anomaly. I assign the results to zero.fare

zero.fare <- train %>%
  filter(Fare == 0.0)
zero.fare
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
180 0 3 Leonard, Mr. Lionel male 36 0 0 LINE 0 S
264 0 1 Harrison, Mr. William male 40 0 0 112059 0 B94 S
272 1 3 Tornquist, Mr. William Henry male 25 0 0 LINE 0 S
278 0 2 Parkes, Mr. Francis "Frank" male NA 0 0 239853 0 S
303 0 3 Johnson, Mr. William Cahoone Jr male 19 0 0 LINE 0 S
414 0 2 Cunningham, Mr. Alfred Fleming male NA 0 0 239853 0 S
467 0 2 Campbell, Mr. William male NA 0 0 239853 0 S
482 0 2 Frost, Mr. Anthony Wood "Archie" male NA 0 0 239854 0 S
598 0 3 Johnson, Mr. Alfred male 49 0 0 LINE 0 S
634 0 1 Parr, Mr. William Henry Marsh male NA 0 0 112052 0 S
675 0 2 Watson, Mr. Ennis Hastings male NA 0 0 239856 0 S
733 0 2 Knight, Mr. Robert J male NA 0 0 239855 0 S
807 0 1 Andrews, Mr. Thomas Jr male 39 0 0 112050 0 A36 S
816 0 1 Fry, Mr. Richard male NA 0 0 112058 0 B102 S
823 0 1 Reuchlin, Jonkheer. John George male 38 0 0 19972 0 S

I take the zero.fare anomaly a step further by grouping zero.class by Pclass and then summarizing. The results are six passengers in first class, five passengers in first class, four passengers in third class paid zero fare.

zero.fare.pclass <- zero.fare %>%
  group_by(Pclass) %>%
  summarize(Total = n()) %>%
  arrange(desc(Total))
zero.fare.pclass
Pclass Total
2 6
1 5
3 4

The next phase of this analysis to the get a better understanding of the survivors by gender, denoted here as Sex. My first step is to seperate adult males and females from the male and female adolecents on the ship. I used regex function str_extract to extract passengers by title (Mr. Mrs. Miss. Rev...) by locating the word just before the "."") period in the Name variable and then create a new feature called Title.

train <- train %>%
mutate(Title = str_extract(Name, "[a-zA-Z0-9]+\\."))
table(train$Title)
Capt. Col. Countess. Don. Dr. Jonkheer. Lady. Major. Master. Miss. Mlle. Mme. Mr. Mrs. Ms. Rev. Sir.
1 2 1 1 7 1 1 2 40 182 2 1 517 125 1 6 1

Now I create a new data.frame (titles.lookup) that contains the title variable and a New.Title variable condensed by gender (Mr. Mrs. Miss, and Master). My goal is to line up the adult male titles to Mr., married adult female titles to Mrs., adolescent females and unwed adult females as Miss., and adolescent boys as Master.

titles.lookup <- data.frame(Title = c("Mr.", "Capt.", "Col.", "Don.", "Dr.", 
                                      "Jonkheer.", "Major", "Rev.", "Sir", 
                                      "Mrs.", "Dana.", "Lady.", "Mme.", 
                                      "Countess.", "Miss.", "Mlle.", "Ms.", 
                                      "Master."), 
                            New.Title = c(rep("Mr.", 9),
                                          rep("Mrs.", 5),
                                          rep("Miss.", 3),
                                          "Master."),
                            stringsAsFactors = FALSE)
View(titles.lookup)
knitr::kable(titles.lookup)
Title New.Title
Mr. Mr.
Capt. Mr.
Col. Mr.
Don. Mr.
Dr. Mr.
Jonkheer. Mr.
Major Mr.
Rev. Mr.
Sir Mr.
Mrs. Mrs.
Dana. Mrs.
Lady. Mrs.
Mme. Mrs.
Countess. Mrs.
Miss. Miss.
Mlle. Miss.
Ms. Miss.
Master. Master.

Currently there are 18 different titles in the Titanic dataset. To arrcurately predict survivors by Sex I need to simplfy both the Titles and Sex variables. I will use the New.Titles table to left_join (Mr. Mrs. Miss. and Master) to match the titles.lookup data.frame. This will create a more condensed Title variable that more accurately identifies the Sex and Title of each passenger.

train <- train %>%
  left_join(titles.lookup, by = "Title")
train <- train %>%
  mutate(Title = New.Title) %>%
  select(-New.Title)

There may be an error in reassigning titles by Sex. To mke sure that male and female passengers have the correct title I will filter females male titles and males with female titles.

train %>%
  filter((Sex == "female" & (Title == "Mr." | Title == "Master.")) |
         (Sex == "male" & (Title == "Mrs." | Title == "Miss.")))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
797 1 1 Leader, Dr. Alice (Farnham) female 49 0 0 17465 25.9292 D17 S Mr.

I found one woman with the title of Dr. Since all the other titles with Dr. are male I will change this female passengers' title to Mrs.

train$Title[train$PassengerId == 797] <- "Mrs."

There are numerous male passengers with zero fare. I generate summary stats for passengers with the title "Mr." by Fare and Pclass (min, max, mean, median, var, SD,IQR).

mr.fare.stats <- train %>%
  filter(Title == "Mr.") %>%
  group_by(Pclass) %>%
  summarize(Fare.Min = min(Fare),
            Fare.Max = max(Fare),
            Fare.Mean = mean(Fare),
            Fare.Median = median(Fare),
            Fare.Var = var(Fare),
            Fare.SD = sd(Fare),
            Fare.IQR = IQR(Fare))
mr.fare.stats
Pclass Fare.Min Fare.Max Fare.Mean Fare.Median Fare.Var Fare.SD Fare.IQR
1 0 512.3292 66.67414 39.8625 6212.1977 78.81750 49.56670
2 0 73.5000 19.05412 13.0000 230.8936 15.19518 14.50000
3 0 69.5500 11.29976 7.8958 109.0203 10.44128 0.91665

In case I need to come back to passengers that had zero fare, I will create a binary table tracking feature to identify zero fare passengers from paid passengers if ifelse statment - 0 = Y, Else N.

train$Fare.Zero <- ifelse(train$Fare == 0.0, "Y", "N")

In order to impute the missing Fare values I must find the median fare by Pclass. I will create a lookup table for zero fare values using filter, group_by, and summarise and then assign it to zero.fare.lookup.

zero.fare.lookup <- train %>%
  filter(Title == "Mr.") %>%
  group_by(Pclass, Title) %>%
  summarise(New.Fare = median(Fare))
  
zero.fare.lookup
## # A tibble: 3 x 3
## # Groups:   Pclass [?]
##   Pclass Title New.Fare
##    <int> <chr>    <dbl>
## 1      1   Mr.  39.8625
## 2      2   Mr.  13.0000
## 3      3   Mr.   7.8958
knitr::kable(zero.fare.lookup)
Pclass Title New.Fare
1 Mr. 39.8625
2 Mr. 13.0000
3 Mr. 7.8958

Now that I have determined the median fare for first class as 39.86, second class 13.00, and third class 7.89, I can impute those values into the zero fare values by left_joining the zero.fare.lookup data.frame to the train dataset. This will replace zero fares with the median value per Pclass.

train <- train %>%
  left_join(zero.fare.lookup, by = c("Pclass", "Title")) %>%
  mutate(fare = ifelse(Fare == 0.0, New.Fare, Fare)) %>%
  select(-New.Fare)

I will generate summary stats based on age that will be helpful in imputing more missing values and create new insights of male survivors by age, title and Pclass.

age.stats <- train %>%
  group_by(Pclass, Title) %>%
  summarize(Age.Min = min(Age, na.rm = TRUE),
            Age.Max = max(Age, na.rm = TRUE),
            Age.Mean = mean(Age, na.rm = TRUE),
            Age.Median = median(Age, na.rm = TRUE),
            Age.Var = var(Age, na.rm = TRUE),
            Age.SD = sd(Age, na.rm = TRUE),
            Age.IQR = IQR(Age, na.rm = TRUE)) %>%
  arrange(Title, Pclass)

age.stats
## # A tibble: 13 x 9
## # Groups:   Pclass [3]
##    Pclass   Title Age.Min Age.Max  Age.Mean Age.Median    Age.Var
##     <int>   <chr>   <dbl>   <dbl>     <dbl>      <dbl>      <dbl>
##  1      1 Master.    0.92      11  5.306667        4.0  26.682133
##  2      2 Master.    0.67       8  2.258889        1.0   5.487936
##  3      3 Master.    0.42      12  5.350833        4.0  12.914017
##  4      1   Miss.    2.00      63 29.744681       30.0 159.498612
##  5      2   Miss.    2.00      50 22.560606       24.0 174.246212
##  6      3   Miss.    0.75      45 16.123188       18.0  94.037910
##  7      1     Mr.   17.00      80 42.184211       40.0 199.138578
##  8      2     Mr.   16.00      70 33.588889       31.0 150.261673
##  9      3     Mr.   11.00      74 28.724891       26.0 110.059948
## 10      1    Mrs.   17.00      62 40.631579       41.5 155.049787
## 11      2    Mrs.   14.00      57 33.682927       32.0 106.471951
## 12      3    Mrs.   15.00      63 33.515152       31.0 100.632576
## 13      1    <NA>   45.00      52 48.666667       49.0  12.333333
## # ... with 2 more variables: Age.SD <dbl>, Age.IQR <dbl>

As I did before, I will create a binary table tracking feature to identify missing Age values with an if ifelse statment - 0 = Y, Else N.

train$Age.Missing <- ifelse(is.na(train$Age), "Y", "N")

This is a lookup table for the Age values selecting Pclass, Title, Age.Mean, Age.Median.

age.lookup <- age.stats %>%
  select(Pclass, Title, Age.Mean, Age.Median)

I will impute missing ages by using this lookup table.

train <- train %>%
  left_join(age.lookup, by = c("Pclass", "Title")) %>%
  mutate(Age = ifelse(Age.Missing == "Y",
                      ifelse(Title == "Miss.", Age.Median, Age.Mean),
                      Age)) %>%
  select(-Age.Median, -Age.Mean)

For missing fare values by age and sex I will create 13 new Ticket-based features using Group_by ticket, summarise group.count, Avg fare = max fare / n(), sum of Female.Count, ratio of the n() of males in ticket count / number of people on that ticket.

ticket.lookup <- train %>%
  group_by(Ticket) %>%
  summarise(Group.Count = n(),
            Avg.Fare = max(Fare) / n(),
            Female.Count = sum(Sex == "female"),
            Male.Count = sum(Sex == "male"),
            Child.Count = sum(Age < 18),
            Elderly.Count = sum(Age > 54.0),
            Female.Ratio = sum(Sex == "female") / n(),
            Male.Ratio = sum(Sex == "Male") / n(),
            Child.Ratio = sum(Age < 18) / n(),
            Elderly.Ratio = sum(Age > 54.0) / n(),
            Female.Child.Ratio = (sum(Age < 18) +
                                  sum(Sex == "female" & Age >=18)) / n(),
            Min.Age = min(Age),
            Max.Age = max(Age))
ticket.lookup %>% filter(Ticket == "3101295")
Ticket Group.Count Avg.Fare Female.Count Male.Count Child.Count Elderly.Count Female.Ratio Male.Ratio Child.Ratio Elderly.Ratio Female.Child.Ratio Min.Age Max.Age
3101295 6 6.614583 1 5 5 0 0.1666667 0 0.8333333 0 1 1 41

Populate train data with ticket lookup table

train <- train %>%
  left_join(ticket.lookup, by = "Ticket")

Load install the ggplot2 library - load ggplot2

library(ggplot2)

It is important to create factors for Survived and Pclass.

train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)

Here I am creating a subset of passengers that are traveling with children.

tickets.children <- train %>%
  filter(Child.Count > 0)
ggplot(tickets.children, aes(x = Pclass, fill = Survived)) + 
  theme_bw() +
  geom_bar() +
  facet_wrap(~ Title) +
  labs(y = "Count of Passengers",
       title = "Survival Rates for Ticket Group Traveling with Children")

ggplot(tickets.children, aes(x = Pclass, fill = Survived)) +
  theme_bw() +
  geom_bar() +
  facet_wrap(~ Title) +
  labs(y = "Count of Passengers",
       title = "Survival Rates for Ticket Groups Traveling without Children")

This graph shows that women in first class without children were more likely to survive the Titanic while men in third class with children were more likely to perish.