The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard the Titanic | |
| parch | # of parents/children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).
parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them.
This is my first kernel, and below are the main points that characterize my work:
library(kableExtra)
library(ggplot2) # Visualization
## Warning: The package `vctrs` (>= 0.3.8) is required as of rlang 1.0.0.
library(tibble)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.2
library(dplyr) # Data manipulation
library(corrplot) # Correlation plot
library(pROC) # ROC curve
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
# library(party)
# library(HDoutliers)
# Importing train/test dataset in R
train <- read.csv('../input/train.csv', header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
train$Set <- "train"
test <- read.csv('../input/test.csv', header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
test$Set <- "test"
# To streamline the cleaning process, the train and test datasets are bound together. The Set column added above keeps track of the original data source; test gets a placeholder Survived column so that both data frames have the same columns
test$Survived <- NA
full <- rbind(train,test)
# Coercing features from char to factor
full$Sex <- as.factor(full$Sex)
full$Embarked <- as.factor(full$Embarked)
full$Pclass <- as.factor(full$Pclass)
# The levels of the Survived feature (outcome) are temporarily changed. The original ones are kept in a variable
full$Survived <- as.factor(full$Survived)
origLevels <- levels(full$Survived)
levels(full$Survived) <- c("dead","survived")
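The relabelling step can be reversed once the readable labels are no longer needed. The following is a minimal sketch on a toy vector; `outcome` and `restored` are hypothetical names, and the toy values are not taken from the dataset. A possible reason for the relabelling (an assumption, the kernel does not state it) is that caret expects factor levels that are valid R names when class probabilities are requested.

```r
# Toy stand-in for full$Survived (hypothetical values, not from the dataset)
outcome <- as.factor(c(0, 1, 1, NA, 0))

# Keep the original levels, then switch to readable labels
origLevels <- levels(outcome)
levels(outcome) <- c("dead", "survived")

# ... model training and prediction would happen here ...

# Restore the original 0/1 coding, e.g. before writing a submission file
levels(outcome) <- origLevels
restored <- as.integer(as.character(outcome))
```

Going through `as.character()` before `as.integer()` matters: calling `as.integer()` directly on a factor returns the internal level codes (1, 2) rather than the labels (0, 1).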
The distribution of passengers in the training set for each Pclass and sex is shown below:
g <- ggplot(filter(full,!is.na(Survived)), aes(x = Pclass, fill = Sex))
g <- g + geom_bar(colour = "black", position="dodge")
g <- g + labs(title = "Passenger distribution", x = "Pclass", y = "# passengers") + theme_bw()
g
At first glance it may seem that males in the third class had a better chance of surviving than females in the same class. However, the bars count all passengers with a known outcome, not only survivors, so the taller male bar only shows that more men than women travelled in third class.
Let’s first have a look at the percentage of missing values for each feature:
missingValue <- sapply(full, function(feature){
  100 * sum(is.na(feature))/length(feature) # share of NA values, in percent
})
missingValue <- data.frame(feature = names(missingValue), percentage = missingValue) %>% filter(percentage != 0)
g <- ggplot(missingValue,
aes(reorder(feature, percentage),percentage))
g <- g + geom_bar(stat = "identity", fill="dark green", color="black")
g <- g + labs(title = "Missing values Overview [%]", x = "Feature", y = "Missing Values [%]")
g <- g + coord_flip() + theme_bw()
g
The Cabin feature has almost 80% missing values. The NAs here may simply mean that the passenger had no cabin.
A few entries from the Name feature:
head(full$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
Each Name is of the form "Family name, Title. Given names", where the first string is the family name.
The idea behind this extraction is not to use them for training the model, but to use them in order to:
+ understand the family status (married or maiden)
+ determine whether or not the passenger is part of the crew
+ handle NAs for the Age
Regular expressions are used in order to extract Name.family and Name.title:
full$Name.family <- gsub(pattern = "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)", x=full$Name, replacement="\\1", ignore.case=TRUE, perl = TRUE)
full$Name.title <- gsub(pattern = "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)", x=full$Name, replacement="\\2", ignore.case=TRUE, perl = TRUE)
Name.family <- full %>% group_by(Name.family) %>% summarise(Name.family.count = n())
full <- inner_join(full,Name.family, by = "Name.family")
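To see what the two capture groups return, the same pattern can be tried on a few of the names printed by head(full$Name). This is only an illustrative sketch in base R; `sample_names`, `name_pattern`, `family` and `title` are hypothetical names introduced here.

```r
# Three names copied from the head(full$Name) output
sample_names <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina")

# Same pattern as in the kernel: group 1 = family name, group 2 = title, group 3 = the rest.
# Note that "|" inside a character class is a literal pipe, not an alternation.
name_pattern <- "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)"

family <- gsub(pattern = name_pattern, x = sample_names, replacement = "\\1", perl = TRUE)
title  <- gsub(pattern = name_pattern, x = sample_names, replacement = "\\2", perl = TRUE)
```

Here `family` holds "Braund", "Cumings", "Heikkinen" and `title` holds "Mr.", "Mrs.", "Miss.". A name that does not follow the expected form would be returned unchanged by gsub, so it is worth checking the table of extracted titles for unexpected values.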
knitr::kable(table(full$Name.title),"html") %>%
kable_styling() %>%
scroll_box(height = "200px")
| Var1 | Freq |
|---|---|
| Capt. | 1 |
| Col. | 4 |
| Don. | 1 |
| Dona. | 1 |
| Dr. | 8 |
| Jonkheer. | 1 |
| Lady. | 1 |
| Major. | 2 |
| Master. | 61 |
| Miss. | 260 |
| Mlle. | 2 |
| Mme. | 1 |
| Mr. | 757 |
| Mrs. | 197 |
| Ms. | 2 |
| Rev. | 8 |
| Sir. | 1 |
| the Countess. | 1 |
As we can see from the last output, many titles occur only a few times. The less frequent titles are pooled into the more common ones:
full <- full %>% mutate(
  Title = as.factor(case_when(
    # Boys younger than 18
    Name.title == "Master." ~ "Master",
    # Girls or unmarried women
    Name.title == "Miss." |
      Name.title == "Mlle." ~ "Miss",
    # Married women
    Name.title == "Mrs." |
      Name.title == "Mme." |
      Name.title == "Dona." ~ "Mrs",
    # Women with unspecified status
    Name.title == "Ms." |
      Name.title == "Lady." |
      Name.title == "the Countess." |
      Name.title == "Dr." & Sex == "female" ~ "Mrs", # In this group because there are too few observations
    # Men
    Name.title == "Mr." |
      Name.title == "Sir." |
      Name.title == "Rev." |
      Name.title == "Don." |
      Name.title == "Jonkheer." |
      Name.title == "Dr." & Sex == "male" ~ "Mr",
    # Crew
    Name.title == "Capt." |
      Name.title == "Col." |
      Name.title == "Major." ~ "Mr" # In this group because a separate level was not significant in the final model
  ))
)
The bar chart below shows the distribution of the newly created Title feature:
g <- ggplot(full, aes(Title))
g <- g + geom_bar(color="black", aes(fill=Title))
g <- g + coord_flip() + labs(title = "Passenger titles", x = "Title", y = "Count") + theme_bw()
g
Using the same approach as for the Title (checking the structure of the entries and using a regex), the Ticket.number is extracted:
full$Ticket.number <- gsub(pattern = "^((.*) )?(\\d*)?$", x=full$Ticket, replacement="\\3")
# Since we want to treat them as numbers, the tickets equal to LINE are replaced with 0
full$Ticket.number <- gsub(pattern = "^LINE$", x=full$Ticket.number, replacement="0")
full$Ticket.number <- as.integer(full$Ticket.number)
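The effect of the two substitutions can be checked on a handful of ticket strings. The sketch below uses illustrative values in the formats found in the Ticket column; `tickets` and `ticket_number` are hypothetical names.

```r
# Illustrative ticket strings: with a prefix, without a prefix, and the special value LINE
tickets <- c("A/5 21171", "PC 17599", "113803", "LINE")

# Same calls as in the kernel: keep only the trailing digits (group 3).
# "LINE" contains no digits and no space, so the first pattern does not match
# and the string is left unchanged for the second substitution.
ticket_number <- gsub(pattern = "^((.*) )?(\\d*)?$", x = tickets, replacement = "\\3")
ticket_number <- gsub(pattern = "^LINE$", x = ticket_number, replacement = "0")
ticket_number <- as.integer(ticket_number)
```

Any ticket that still contains non-digit characters after these two steps would become NA in `as.integer()` with a warning, which is a convenient signal that a format was missed.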
full <- full %>% mutate(Cabin.Deck = as.factor(case_when(
  grepl("^A",Cabin) ~ "A",
  grepl("^B",Cabin) ~ "B",
  grepl("^C",Cabin) ~ "C",
  grepl("^D",Cabin) ~ "D",
  grepl("^E",Cabin) ~ "E",
  grepl("^F",Cabin) ~ "F",
  grepl("^G",Cabin) ~ "G",
  grepl("^T",Cabin) ~ "T",
  TRUE ~ "NONE"
)))
g <- ggplot(filter(full,Cabin.Deck != "NONE" & Set=="train"), aes(Cabin.Deck))
g <- g + geom_bar(color="black", aes(fill=Survived, alpha=Pclass), position = "dodge")
g <- g + labs(title = "Cabin Deck distribution", x = "Cabin Deck", y = "Count") + theme_bw()
g
## Warning: Using alpha for a discrete variable is not advised.
Two different imputation techniques are tried:
Age and Title: looking at the Age distribution for each Title, it seems reasonable to impute a missing Age with the median Age of the Title group the passenger belongs to.
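The group-wise median imputation could be sketched as follows. This is not the kernel's own code, only a minimal base R illustration on toy data; the data frame `toy` and the column `Age.imputed` are hypothetical.

```r
# Toy data (hypothetical values): two Title groups, each with one missing Age
toy <- data.frame(
  Title = c("Master", "Master", "Master", "Mr", "Mr", "Mr"),
  Age   = c(2, 4, NA, 30, NA, 40)
)

# ave() applies the function within each Title group and returns a vector
# aligned with the original rows; only NA entries are replaced by the group median
toy$Age.imputed <- ave(toy$Age, toy$Title, FUN = function(a) {
  ifelse(is.na(a), median(a, na.rm = TRUE), a)
})
```

In this toy example the missing Master age becomes 3 and the missing Mr age becomes 35, while observed ages are left untouched. A group in which every Age is missing would keep its NAs, so such groups need a fallback.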