Titanic

Narcel Reedus
September 14, 2017

Summary

The maiden voyage of the Titanic; made popular by James Cameron's epic 1997 film, is one of the most infamous tragedies in human history. The "unsinkable" oceanliner struck an iceberg and sank to the bottom of the Atlantic killing 1,502 of its 2,224 passengers and crew.

Today, with the help of the ship's manifest, data analysts and data scientists from around the globe are able to discover numerous insights by appyling machine learning algorithms to predict who lived and who ultimately died onboard the RMS Titanic.

My goal here is to use machine learning and feature engineering to visualize the distinguishing factors between the passengers that were more likely to survive the shipwreck from those who ultimately perished.

The final graph shows that women in first class without children were more likely to survive the Titanic while men in third class with children were more likely to perish.

Analysis

Load dplyr and stringr libraries

library(dplyr)
library(stringr)
library(printr)
library(pander)
library(knitr)
library(httpuv)
library(caTools)

The first step is to become intimately familiar with the data. Even though there are only 891 observations and 12 variables, there are many insights hidden within each data point.

I begin by loading the titanic_train.csv data. Assign dataset to train. View data and view the structure of the data with str.

titanic_train <- read.csv("C:/Users/narce/OneDrive/Documents/GitHub/Titanic/titanic_train.csv")
train <- titanic_train
View(train)
str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Variables

PassengerID

The PassengerID varible simply counts the number of observations. It does not directly connect to actual passengers onboard the Titanic. I will not use the PassengerID in this project.

Survived

Survived is a binary variable that shows whether a passenger survived or perished. Survived is hugely important and will serve as as our dependent variable.

Pclass

The Pclass variable translate well as a look into the socio-economic status of each passenger. 1 = first class or Upper deck, 2 = second class or Middle deck, and 3 = third class or lower deck. I will certainly take advange of this independent variable classification as I build new features.

Name

The Name variable is a string that presents this dataset with a bit of a challenge. There are a number of titles included within the name variable that we can extract and attempt to assign to gender. For instance, Jonkheer is an adult male but Miss is used to describe unwed adult women as well as female children.

Sex

The Sex variable signifies gender.

Age

The Age variable is important but the Age value is missing for several passengers. I will impute the Age for the missing values.

SibSp

SipSp lumps siblings (brother, sister, stepbrother, stepsister) together with Spouses (husbands and wives)

Parch

The Parch variable combines Parents travelling with children.

Ticket

The ticket variable provides a ticket number. However, some passengers do not have a ticket number or share a ticket number with their family.

Fare

The Fare variable relates to the price passengers' paid to travel on the Titanic. There are a number of passengers that did not pay a Fare. These missing fare values will be reconciled.

Cabin

The Cabin variable has a lot of missing values and may not play an inportant role in this analysis.

Embarked

The Embarked variable correlates to where passengers bordered the ship: C = Cherbourg, Q = Queenstown, S = Southampton

Cleaning

There are a number of passengers that appear to not have paid a fare. I filter the Fare variable for 0.0 to take a closer look at this anomaly. I assign the results to zero.fare

zero.fare <- train %>%
  filter(Fare == 0.0)
zero.fare

PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Cabin	Embarked
180	0	3	Leonard, Mr. Lionel	male	36	LINE		S
264	0	1	Harrison, Mr. William	male	40	112059	B94	S
272	1	3	Tornquist, Mr. William Henry	male	25	LINE		S
278	0	2	Parkes, Mr. Francis "Frank"	male	NA	239853		S
303	0	3	Johnson, Mr. William Cahoone Jr	male	19	LINE		S
414	0	2	Cunningham, Mr. Alfred Fleming	male	NA	239853		S
467	0	2	Campbell, Mr. William	male	NA	239853		S
482	0	2	Frost, Mr. Anthony Wood "Archie"	male	NA	239854		S
598	0	3	Johnson, Mr. Alfred	male	49	LINE		S
634	0	1	Parr, Mr. William Henry Marsh	male	NA	112052		S
675	0	2	Watson, Mr. Ennis Hastings	male	NA	239856		S
733	0	2	Knight, Mr. Robert J	male	NA	239855		S
807	0	1	Andrews, Mr. Thomas Jr	male	39	112050	A36	S
816	0	1	Fry, Mr. Richard	male	NA	112058	B102	S
823	0	1	Reuchlin, Jonkheer. John George	male	38	19972		S

I take the zero.fare anomaly a step further by grouping zero.class by Pclass and then summarizing. The results are six passengers in first class, five passengers in first class, four passengers in third class paid zero fare.

zero.fare.pclass <- zero.fare %>%
  group_by(Pclass) %>%
  summarize(Total = n()) %>%
  arrange(desc(Total))
zero.fare.pclass

Pclass	Total
2	6
1	5
3	4

The next phase of this analysis to the get a better understanding of the survivors by gender, denoted here as Sex. My first step is to seperate adult males and females from the male and female adolecents on the ship. I used regex function str_extract to extract passengers by title (Mr. Mrs. Miss. Rev...) by locating the word just before the "."") period in the Name variable and then create a new feature called Title.

train <- train %>%
mutate(Title = str_extract(Name, "[a-zA-Z0-9]+\\."))
table(train$Title)

Capt.	Col.	Countess.	Don.	Dr.	Jonkheer.	Lady.	Major.	Master.	Miss.	Mlle.	Mme.	Mr.	Mrs.	Ms.	Rev.	Sir.
1	2	1	1	7	1	1	2	40	182	2	1	517	125	1	6	1

Now I create a new data.frame (titles.lookup) that contains the title variable and a New.Title variable condensed by gender (Mr. Mrs. Miss, and Master). My goal is to line up the adult male titles to Mr., married adult female titles to Mrs., adolescent females and unwed adult females as Miss., and adolescent boys as Master.

titles.lookup <- data.frame(Title = c("Mr.", "Capt.", "Col.", "Don.", "Dr.", 
                                      "Jonkheer.", "Major", "Rev.", "Sir", 
                                      "Mrs.", "Dana.", "Lady.", "Mme.", 
                                      "Countess.", "Miss.", "Mlle.", "Ms.", 
                                      "Master."), 
                            New.Title = c(rep("Mr.", 9),
                                          rep("Mrs.", 5),
                                          rep("Miss.", 3),
                                          "Master."),
                            stringsAsFactors = FALSE)
View(titles.lookup)
knitr::kable(titles.lookup)

Title	New.Title
Mr.	Mr.
Capt.	Mr.
Col.	Mr.
Don.	Mr.
Dr.	Mr.
Jonkheer.	Mr.
Major	Mr.
Rev.	Mr.
Sir	Mr.
Mrs.	Mrs.
Dana.	Mrs.
Lady.	Mrs.
Mme.	Mrs.
Countess.	Mrs.
Miss.	Miss.
Mlle.	Miss.
Ms.	Miss.
Master.	Master.

Currently there are 18 different titles in the Titanic dataset. To arrcurately predict survivors by Sex I need to simplfy both the Titles and Sex variables. I will use the New.Titles table to left_join (Mr. Mrs. Miss. and Master) to match the titles.lookup data.frame. This will create a more condensed Title variable that more accurately identifies the Sex and Title of each passenger.

train <- train %>%
  left_join(titles.lookup, by = "Title")

train <- train %>%
  mutate(Title = New.Title) %>%
  select(-New.Title)

There may be an error in reassigning titles by Sex. To mke sure that male and female passengers have the correct title I will filter females male titles and males with female titles.

train %>%
  filter((Sex == "female" & (Title == "Mr." | Title == "Master.")) |
         (Sex == "male" & (Title == "Mrs." | Title == "Miss.")))

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
797	1	1	Leader, Dr. Alice (Farnham)	female	49	0	0	17465	25.9292	D17	S	Mr.

I found one woman with the title of Dr. Since all the other titles with Dr. are male I will change this female passengers' title to Mrs.

train$Title[train$PassengerId == 797] <- "Mrs."

There are numerous male passengers with zero fare. I generate summary stats for passengers with the title "Mr." by Fare and Pclass (min, max, mean, median, var, SD,IQR).

mr.fare.stats <- train %>%
  filter(Title == "Mr.") %>%
  group_by(Pclass) %>%
  summarize(Fare.Min = min(Fare),
            Fare.Max = max(Fare),
            Fare.Mean = mean(Fare),
            Fare.Median = median(Fare),
            Fare.Var = var(Fare),
            Fare.SD = sd(Fare),
            Fare.IQR = IQR(Fare))
mr.fare.stats

Pclass	Fare.Max	Fare.Mean	Fare.Median	Fare.Var	Fare.SD	Fare.IQR
1	512.3292	66.67414	39.8625	6212.1977	78.81750	49.56670
2	73.5000	19.05412	13.0000	230.8936	15.19518	14.50000
3	69.5500	11.29976	7.8958	109.0203	10.44128	0.91665

In case I need to come back to passengers that had zero fare, I will create a binary table tracking feature to identify zero fare passengers from paid passengers if ifelse statment - 0 = Y, Else N.

train$Fare.Zero <- ifelse(train$Fare == 0.0, "Y", "N")

In order to impute the missing Fare values I must find the median fare by Pclass. I will create a lookup table for zero fare values using filter, group_by, and summarise and then assign it to zero.fare.lookup.

zero.fare.lookup <- train %>%
  filter(Title == "Mr.") %>%
  group_by(Pclass, Title) %>%
  summarise(New.Fare = median(Fare))
  
zero.fare.lookup

## # A tibble: 3 x 3
## # Groups:   Pclass [?]
##   Pclass Title New.Fare
##    <int> <chr>    <dbl>
## 1      1   Mr.  39.8625
## 2      2   Mr.  13.0000
## 3      3   Mr.   7.8958

knitr::kable(zero.fare.lookup)

Pclass	Title	New.Fare
1	Mr.	39.8625
2	Mr.	13.0000
3	Mr.	7.8958

Now that I have determined the median fare for first class as 39.86, second class 13.00, and third class 7.89, I can impute those values into the zero fare values by left_joining the zero.fare.lookup data.frame to the train dataset. This will replace zero fares with the median value per Pclass.

train <- train %>%
  left_join(zero.fare.lookup, by = c("Pclass", "Title")) %>%
  mutate(fare = ifelse(Fare == 0.0, New.Fare, Fare)) %>%
  select(-New.Fare)

I will generate summary stats based on age that will be helpful in imputing more missing values and create new insights of male survivors by age, title and Pclass.

age.stats <- train %>%
  group_by(Pclass, Title) %>%
  summarize(Age.Min = min(Age, na.rm = TRUE),
            Age.Max = max(Age, na.rm = TRUE),
            Age.Mean = mean(Age, na.rm = TRUE),
            Age.Median = median(Age, na.rm = TRUE),
            Age.Var = var(Age, na.rm = TRUE),
            Age.SD = sd(Age, na.rm = TRUE),
            Age.IQR = IQR(Age, na.rm = TRUE)) %>%
  arrange(Title, Pclass)

age.stats

## # A tibble: 13 x 9
## # Groups:   Pclass [3]
##    Pclass   Title Age.Min Age.Max  Age.Mean Age.Median    Age.Var
##     <int>   <chr>   <dbl>   <dbl>     <dbl>      <dbl>      <dbl>
##  1      1 Master.    0.92      11  5.306667        4.0  26.682133
##  2      2 Master.    0.67       8  2.258889        1.0   5.487936
##  3      3 Master.    0.42      12  5.350833        4.0  12.914017
##  4      1   Miss.    2.00      63 29.744681       30.0 159.498612
##  5      2   Miss.    2.00      50 22.560606       24.0 174.246212
##  6      3   Miss.    0.75      45 16.123188       18.0  94.037910
##  7      1     Mr.   17.00      80 42.184211       40.0 199.138578
##  8      2     Mr.   16.00      70 33.588889       31.0 150.261673
##  9      3     Mr.   11.00      74 28.724891       26.0 110.059948
## 10      1    Mrs.   17.00      62 40.631579       41.5 155.049787
## 11      2    Mrs.   14.00      57 33.682927       32.0 106.471951
## 12      3    Mrs.   15.00      63 33.515152       31.0 100.632576
## 13      1    <NA>   45.00      52 48.666667       49.0  12.333333
## # ... with 2 more variables: Age.SD <dbl>, Age.IQR <dbl>

As I did before, I will create a binary table tracking feature to identify missing Age values with an if ifelse statment - 0 = Y, Else N.

train$Age.Missing <- ifelse(is.na(train$Age), "Y", "N")

This is a lookup table for the Age values selecting Pclass, Title, Age.Mean, Age.Median.

age.lookup <- age.stats %>%
  select(Pclass, Title, Age.Mean, Age.Median)

I will impute missing ages by using this lookup table.

train <- train %>%
  left_join(age.lookup, by = c("Pclass", "Title")) %>%
  mutate(Age = ifelse(Age.Missing == "Y",
                      ifelse(Title == "Miss.", Age.Median, Age.Mean),
                      Age)) %>%
  select(-Age.Median, -Age.Mean)

For missing fare values by age and sex I will create 13 new Ticket-based features using Group_by ticket, summarise group.count, Avg fare = max fare / n(), sum of Female.Count, ratio of the n() of males in ticket count / number of people on that ticket.

ticket.lookup <- train %>%
  group_by(Ticket) %>%
  summarise(Group.Count = n(),
            Avg.Fare = max(Fare) / n(),
            Female.Count = sum(Sex == "female"),
            Male.Count = sum(Sex == "male"),
            Child.Count = sum(Age < 18),
            Elderly.Count = sum(Age > 54.0),
            Female.Ratio = sum(Sex == "female") / n(),
            Male.Ratio = sum(Sex == "Male") / n(),
            Child.Ratio = sum(Age < 18) / n(),
            Elderly.Ratio = sum(Age > 54.0) / n(),
            Female.Child.Ratio = (sum(Age < 18) +
                                  sum(Sex == "female" & Age >=18)) / n(),
            Min.Age = min(Age),
            Max.Age = max(Age))

ticket.lookup %>% filter(Ticket == "3101295")

Ticket	Group.Count	Avg.Fare	Female.Count	Male.Count	Child.Count	Elderly.Count	Female.Ratio	Male.Ratio	Child.Ratio	Elderly.Ratio	Female.Child.Ratio	Min.Age	Max.Age
3101295	6	6.614583	1	5	5	0	0.1666667	0	0.8333333	0	1	1	41

Populate train data with ticket lookup table

train <- train %>%
  left_join(ticket.lookup, by = "Ticket")

Load install the ggplot2 library - load ggplot2

library(ggplot2)

It is important to create factors for Survived and Pclass.

train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)

Here I am creating a subset of passengers that are traveling with children.

tickets.children <- train %>%
  filter(Child.Count > 0)

ggplot(tickets.children, aes(x = Pclass, fill = Survived)) + 
  theme_bw() +
  geom_bar() +
  facet_wrap(~ Title) +
  labs(y = "Count of Passengers",
       title = "Survival Rates for Ticket Group Traveling with Children")

ggplot(tickets.children, aes(x = Pclass, fill = Survived)) +
  theme_bw() +
  geom_bar() +
  facet_wrap(~ Title) +
  labs(y = "Count of Passengers",
       title = "Survival Rates for Ticket Groups Traveling without Children")

This graph shows that women in first class without children were more likely to survive the Titanic while men in third class with children were more likely to perish.