Narcel Reedus
September 14, 2017
The maiden voyage of the Titanic; made popular by James Cameron's epic 1997 film, is one of the most infamous tragedies in human history. The "unsinkable" oceanliner struck an iceberg and sank to the bottom of the Atlantic killing 1,502 of its 2,224 passengers and crew.
Today, with the help of the ship's manifest, data analysts and data scientists from around the globe are able to discover numerous insights by appyling machine learning algorithms to predict who lived and who ultimately died onboard the RMS Titanic.
My goal here is to use machine learning and feature engineering to visualize the distinguishing factors between the passengers that were more likely to survive the shipwreck from those who ultimately perished.
The final graph shows that women in first class without children were more likely to survive the Titanic while men in third class with children were more likely to perish.
Load dplyr and stringr libraries
library(dplyr)
library(stringr)
library(printr)
library(pander)
library(knitr)
library(httpuv)
library(caTools)The first step is to become intimately familiar with the data. Even though there are only 891 observations and 12 variables, there are many insights hidden within each data point.
I begin by loading the titanic_train.csv data. Assign dataset to train. View data and view the structure of the data with str.
titanic_train <- read.csv("C:/Users/narce/OneDrive/Documents/GitHub/Titanic/titanic_train.csv")
train <- titanic_train
View(train)
str(train)## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
PassengerID
The PassengerID varible simply counts the number of observations. It does not directly connect to actual passengers onboard the Titanic. I will not use the PassengerID in this project.
Survived
Survived is a binary variable that shows whether a passenger survived or perished. Survived is hugely important and will serve as as our dependent variable.
Pclass
The Pclass variable translate well as a look into the socio-economic status of each passenger. 1 = first class or Upper deck, 2 = second class or Middle deck, and 3 = third class or lower deck. I will certainly take advange of this independent variable classification as I build new features.
Name
The Name variable is a string that presents this dataset with a bit of a challenge. There are a number of titles included within the name variable that we can extract and attempt to assign to gender. For instance, Jonkheer is an adult male but Miss is used to describe unwed adult women as well as female children.
Sex
The Sex variable signifies gender.
Age
The Age variable is important but the Age value is missing for several passengers. I will impute the Age for the missing values.
SibSp
SipSp lumps siblings (brother, sister, stepbrother, stepsister) together with Spouses (husbands and wives)
Parch
The Parch variable combines Parents travelling with children.
Ticket
The ticket variable provides a ticket number. However, some passengers do not have a ticket number or share a ticket number with their family.
Fare
The Fare variable relates to the price passengers' paid to travel on the Titanic. There are a number of passengers that did not pay a Fare. These missing fare values will be reconciled.
Cabin
The Cabin variable has a lot of missing values and may not play an inportant role in this analysis.
Embarked
The Embarked variable correlates to where passengers bordered the ship: C = Cherbourg, Q = Queenstown, S = Southampton
There are a number of passengers that appear to not have paid a fare. I filter the Fare variable for 0.0 to take a closer look at this anomaly. I assign the results to zero.fare
zero.fare <- train %>%
filter(Fare == 0.0)
zero.fare| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36 | 0 | 0 | LINE | 0 | S | |
| 264 | 0 | 1 | Harrison, Mr. William | male | 40 | 0 | 0 | 112059 | 0 | B94 | S |
| 272 | 1 | 3 | Tornquist, Mr. William Henry | male | 25 | 0 | 0 | LINE | 0 | S | |
| 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | NA | 0 | 0 | 239853 | 0 | S | |
| 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19 | 0 | 0 | LINE | 0 | S | |
| 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NA | 0 | 0 | 239853 | 0 | S | |
| 467 | 0 | 2 | Campbell, Mr. William | male | NA | 0 | 0 | 239853 | 0 | S | |
| 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NA | 0 | 0 | 239854 | 0 | S | |
| 598 | 0 | 3 | Johnson, Mr. Alfred | male | 49 | 0 | 0 | LINE | 0 | S | |
| 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NA | 0 | 0 | 112052 | 0 | S | |
| 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NA | 0 | 0 | 239856 | 0 | S | |
| 733 | 0 | 2 | Knight, Mr. Robert J | male | NA | 0 | 0 | 239855 | 0 | S | |
| 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39 | 0 | 0 | 112050 | 0 | A36 | S |
| 816 | 0 | 1 | Fry, Mr. Richard | male | NA | 0 | 0 | 112058 | 0 | B102 | S |
| 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38 | 0 | 0 | 19972 | 0 | S |
I take the zero.fare anomaly a step further by grouping zero.class by Pclass and then summarizing. The results are six passengers in first class, five passengers in first class, four passengers in third class paid zero fare.
zero.fare.pclass <- zero.fare %>%
group_by(Pclass) %>%
summarize(Total = n()) %>%
arrange(desc(Total))
zero.fare.pclass| Pclass | Total |
|---|---|
| 2 | 6 |
| 1 | 5 |
| 3 | 4 |
The next phase of this analysis to the get a better understanding of the survivors by gender, denoted here as Sex. My first step is to seperate adult males and females from the male and female adolecents on the ship. I used regex function str_extract to extract passengers by title (Mr. Mrs. Miss. Rev...) by locating the word just before the "."") period in the Name variable and then create a new feature called Title.
train <- train %>%
mutate(Title = str_extract(Name, "[a-zA-Z0-9]+\\."))
table(train$Title)| Capt. | Col. | Countess. | Don. | Dr. | Jonkheer. | Lady. | Major. | Master. | Miss. | Mlle. | Mme. | Mr. | Mrs. | Ms. | Rev. | Sir. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 7 | 1 | 1 | 2 | 40 | 182 | 2 | 1 | 517 | 125 | 1 | 6 | 1 |
Now I create a new data.frame (titles.lookup) that contains the title variable and a New.Title variable condensed by gender (Mr. Mrs. Miss, and Master). My goal is to line up the adult male titles to Mr., married adult female titles to Mrs., adolescent females and unwed adult females as Miss., and adolescent boys as Master.
titles.lookup <- data.frame(Title = c("Mr.", "Capt.", "Col.", "Don.", "Dr.",
"Jonkheer.", "Major", "Rev.", "Sir",
"Mrs.", "Dana.", "Lady.", "Mme.",
"Countess.", "Miss.", "Mlle.", "Ms.",
"Master."),
New.Title = c(rep("Mr.", 9),
rep("Mrs.", 5),
rep("Miss.", 3),
"Master."),
stringsAsFactors = FALSE)
View(titles.lookup)
knitr::kable(titles.lookup)| Title | New.Title |
|---|---|
| Mr. | Mr. |
| Capt. | Mr. |
| Col. | Mr. |
| Don. | Mr. |
| Dr. | Mr. |
| Jonkheer. | Mr. |
| Major | Mr. |
| Rev. | Mr. |
| Sir | Mr. |
| Mrs. | Mrs. |
| Dana. | Mrs. |
| Lady. | Mrs. |
| Mme. | Mrs. |
| Countess. | Mrs. |
| Miss. | Miss. |
| Mlle. | Miss. |
| Ms. | Miss. |
| Master. | Master. |
Currently there are 18 different titles in the Titanic dataset. To arrcurately predict survivors by Sex I need to simplfy both the Titles and Sex variables. I will use the New.Titles table to left_join (Mr. Mrs. Miss. and Master) to match the titles.lookup data.frame. This will create a more condensed Title variable that more accurately identifies the Sex and Title of each passenger.
train <- train %>%
left_join(titles.lookup, by = "Title")train <- train %>%
mutate(Title = New.Title) %>%
select(-New.Title)There may be an error in reassigning titles by Sex. To mke sure that male and female passengers have the correct title I will filter females male titles and males with female titles.
train %>%
filter((Sex == "female" & (Title == "Mr." | Title == "Master.")) |
(Sex == "male" & (Title == "Mrs." | Title == "Miss.")))| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 797 | 1 | 1 | Leader, Dr. Alice (Farnham) | female | 49 | 0 | 0 | 17465 | 25.9292 | D17 | S | Mr. |
I found one woman with the title of Dr. Since all the other titles with Dr. are male I will change this female passengers' title to Mrs.
train$Title[train$PassengerId == 797] <- "Mrs."There are numerous male passengers with zero fare. I generate summary stats for passengers with the title "Mr." by Fare and Pclass (min, max, mean, median, var, SD,IQR).
mr.fare.stats <- train %>%
filter(Title == "Mr.") %>%
group_by(Pclass) %>%
summarize(Fare.Min = min(Fare),
Fare.Max = max(Fare),
Fare.Mean = mean(Fare),
Fare.Median = median(Fare),
Fare.Var = var(Fare),
Fare.SD = sd(Fare),
Fare.IQR = IQR(Fare))
mr.fare.stats| Pclass | Fare.Min | Fare.Max | Fare.Mean | Fare.Median | Fare.Var | Fare.SD | Fare.IQR |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 512.3292 | 66.67414 | 39.8625 | 6212.1977 | 78.81750 | 49.56670 |
| 2 | 0 | 73.5000 | 19.05412 | 13.0000 | 230.8936 | 15.19518 | 14.50000 |
| 3 | 0 | 69.5500 | 11.29976 | 7.8958 | 109.0203 | 10.44128 | 0.91665 |
In case I need to come back to passengers that had zero fare, I will create a binary table tracking feature to identify zero fare passengers from paid passengers if ifelse statment - 0 = Y, Else N.
train$Fare.Zero <- ifelse(train$Fare == 0.0, "Y", "N")In order to impute the missing Fare values I must find the median fare by Pclass. I will create a lookup table for zero fare values using filter, group_by, and summarise and then assign it to zero.fare.lookup.
zero.fare.lookup <- train %>%
filter(Title == "Mr.") %>%
group_by(Pclass, Title) %>%
summarise(New.Fare = median(Fare))
zero.fare.lookup## # A tibble: 3 x 3
## # Groups: Pclass [?]
## Pclass Title New.Fare
## <int> <chr> <dbl>
## 1 1 Mr. 39.8625
## 2 2 Mr. 13.0000
## 3 3 Mr. 7.8958
knitr::kable(zero.fare.lookup)| Pclass | Title | New.Fare |
|---|---|---|
| 1 | Mr. | 39.8625 |
| 2 | Mr. | 13.0000 |
| 3 | Mr. | 7.8958 |
Now that I have determined the median fare for first class as 39.86, second class 13.00, and third class 7.89, I can impute those values into the zero fare values by left_joining the zero.fare.lookup data.frame to the train dataset. This will replace zero fares with the median value per Pclass.
train <- train %>%
left_join(zero.fare.lookup, by = c("Pclass", "Title")) %>%
mutate(fare = ifelse(Fare == 0.0, New.Fare, Fare)) %>%
select(-New.Fare)I will generate summary stats based on age that will be helpful in imputing more missing values and create new insights of male survivors by age, title and Pclass.
age.stats <- train %>%
group_by(Pclass, Title) %>%
summarize(Age.Min = min(Age, na.rm = TRUE),
Age.Max = max(Age, na.rm = TRUE),
Age.Mean = mean(Age, na.rm = TRUE),
Age.Median = median(Age, na.rm = TRUE),
Age.Var = var(Age, na.rm = TRUE),
Age.SD = sd(Age, na.rm = TRUE),
Age.IQR = IQR(Age, na.rm = TRUE)) %>%
arrange(Title, Pclass)
age.stats## # A tibble: 13 x 9
## # Groups: Pclass [3]
## Pclass Title Age.Min Age.Max Age.Mean Age.Median Age.Var
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Master. 0.92 11 5.306667 4.0 26.682133
## 2 2 Master. 0.67 8 2.258889 1.0 5.487936
## 3 3 Master. 0.42 12 5.350833 4.0 12.914017
## 4 1 Miss. 2.00 63 29.744681 30.0 159.498612
## 5 2 Miss. 2.00 50 22.560606 24.0 174.246212
## 6 3 Miss. 0.75 45 16.123188 18.0 94.037910
## 7 1 Mr. 17.00 80 42.184211 40.0 199.138578
## 8 2 Mr. 16.00 70 33.588889 31.0 150.261673
## 9 3 Mr. 11.00 74 28.724891 26.0 110.059948
## 10 1 Mrs. 17.00 62 40.631579 41.5 155.049787
## 11 2 Mrs. 14.00 57 33.682927 32.0 106.471951
## 12 3 Mrs. 15.00 63 33.515152 31.0 100.632576
## 13 1 <NA> 45.00 52 48.666667 49.0 12.333333
## # ... with 2 more variables: Age.SD <dbl>, Age.IQR <dbl>
As I did before, I will create a binary table tracking feature to identify missing Age values with an if ifelse statment - 0 = Y, Else N.
train$Age.Missing <- ifelse(is.na(train$Age), "Y", "N")This is a lookup table for the Age values selecting Pclass, Title, Age.Mean, Age.Median.
age.lookup <- age.stats %>%
select(Pclass, Title, Age.Mean, Age.Median)I will impute missing ages by using this lookup table.
train <- train %>%
left_join(age.lookup, by = c("Pclass", "Title")) %>%
mutate(Age = ifelse(Age.Missing == "Y",
ifelse(Title == "Miss.", Age.Median, Age.Mean),
Age)) %>%
select(-Age.Median, -Age.Mean)For missing fare values by age and sex I will create 13 new Ticket-based features using Group_by ticket, summarise group.count, Avg fare = max fare / n(), sum of Female.Count, ratio of the n() of males in ticket count / number of people on that ticket.
ticket.lookup <- train %>%
group_by(Ticket) %>%
summarise(Group.Count = n(),
Avg.Fare = max(Fare) / n(),
Female.Count = sum(Sex == "female"),
Male.Count = sum(Sex == "male"),
Child.Count = sum(Age < 18),
Elderly.Count = sum(Age > 54.0),
Female.Ratio = sum(Sex == "female") / n(),
Male.Ratio = sum(Sex == "Male") / n(),
Child.Ratio = sum(Age < 18) / n(),
Elderly.Ratio = sum(Age > 54.0) / n(),
Female.Child.Ratio = (sum(Age < 18) +
sum(Sex == "female" & Age >=18)) / n(),
Min.Age = min(Age),
Max.Age = max(Age))ticket.lookup %>% filter(Ticket == "3101295")| Ticket | Group.Count | Avg.Fare | Female.Count | Male.Count | Child.Count | Elderly.Count | Female.Ratio | Male.Ratio | Child.Ratio | Elderly.Ratio | Female.Child.Ratio | Min.Age | Max.Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3101295 | 6 | 6.614583 | 1 | 5 | 5 | 0 | 0.1666667 | 0 | 0.8333333 | 0 | 1 | 1 | 41 |
Populate train data with ticket lookup table
train <- train %>%
left_join(ticket.lookup, by = "Ticket")Load install the ggplot2 library - load ggplot2
library(ggplot2)It is important to create factors for Survived and Pclass.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)Here I am creating a subset of passengers that are traveling with children.
tickets.children <- train %>%
filter(Child.Count > 0)ggplot(tickets.children, aes(x = Pclass, fill = Survived)) +
theme_bw() +
geom_bar() +
facet_wrap(~ Title) +
labs(y = "Count of Passengers",
title = "Survival Rates for Ticket Group Traveling with Children")ggplot(tickets.children, aes(x = Pclass, fill = Survived)) +
theme_bw() +
geom_bar() +
facet_wrap(~ Title) +
labs(y = "Count of Passengers",
title = "Survival Rates for Ticket Groups Traveling without Children")This graph shows that women in first class without children were more likely to survive the Titanic while men in third class with children were more likely to perish.