Comprehensive EDA

The Titanic dataset provides 11 predictors and 1 binary response variable: Survived. The ultimate goal is to predict survival from each passenger's characteristics.

In this Exploratory Data Analysis, I will examine the relationships between the predictors and the response variable, use feature engineering to create new predictors that may contribute to the analysis, and handle missing values and outliers as appropriate. The output of this analysis will be a preprocessed dataset, ready for machine learning algorithms.

library(dplyr)
library(ggplot2)
library(tidyverse)
library(RColorBrewer)
library(scales)
library(ggpubr)
train <- read_csv('train.csv')
test <- read_csv('test.csv') #test has 1 less col 
test$Survived <- NA
original <- rbind(train,test)
titanic <- original

Let’s begin by getting a glimpse at the data:

head(titanic)

## # A tibble: 6 x 12
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Brau~ male     22     1     0 A/5 2~  7.25 <NA> 
## 2           2        1      1 Cumi~ fema~    38     1     0 PC 17~ 71.3  C85  
## 3           3        1      3 Heik~ fema~    26     0     0 STON/~  7.92 <NA> 
## 4           4        1      1 Futr~ fema~    35     1     0 113803 53.1  C123 
## 5           5        0      3 Alle~ male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Mora~ male     NA     0     0 330877  8.46 <NA> 
## # ... with 1 more variable: Embarked <chr>

summary(titanic)

##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3838   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                 NA's   :418                                        
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin             Embarked        
##  Length:1309        Min.   :  0.000   Length:1309        Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
##                     Mean   : 33.295                                        
##                     3rd Qu.: 31.275                                        
##                     Max.   :512.329                                        
##                     NA's   :1

We see missing data for a few features. Let’s count the missing values in each column:

#convert these variables to factors, easier to plot 
factors <- c('Survived', 'Pclass', 'Sex', 'Cabin', 'Embarked')
titanic[factors] <- lapply(titanic[factors], function(x) as.factor(x))

#missing values to keep in mind
titanic %>% 
  summarize_all(~(sum(is.na(.))))

## # A tibble: 1 x 12
##   PassengerId Survived Pclass  Name   Sex   Age SibSp Parch Ticket  Fare Cabin
##         <int>    <int>  <int> <int> <int> <int> <int> <int>  <int> <int> <int>
## 1           0      418      0     0     0   263     0     0      0     1  1014
## # ... with 1 more variable: Embarked <int>

Age: 263 missing - this is 20% of the dataset; too large to drop and may be too complex to set to the simple median
Cabin: 1014 missing - we will likely discard this variable
Fare: 1 missing
Embarked: 2 missing
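
To put those counts on a common scale, here is a minimal sketch that reports the share of missing values per column instead of the raw count. It runs on a small made-up tibble (`toy` is a stand-in, not the real data) and assumes a dplyr version that provides across().

```r
library(dplyr)

# Toy stand-in for the combined data (values are made up for illustration)
toy <- tibble(
  Age   = c(22, NA, 26, NA),
  Fare  = c(7.25, 71.3, NA, 8.05),
  Cabin = c(NA, "C85", NA, NA)
)

# Share of missing values per column, in percent
pct_missing <- toy %>%
  summarize(across(everything(), ~ round(mean(is.na(.x)) * 100)))
pct_missing
```

Swapping `toy` for `titanic` should give the percentages quoted in the list above.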

With this in mind, let’s begin our variable exploration!

PassengerId - unique to each passenger
Survived - dependent variable (the target)

Pclass

There are three levels: 1, 2, 3

Q: How is Pclass distributed onboard?

The majority of passengers are in Pclass 3
Surprisingly, there are more people in Pclass 1 than in Pclass 2

titanic %>% 
  count(Pclass) %>% 
  mutate(proportion = round(n/sum(n)*100))

## # A tibble: 3 x 3
##   Pclass     n proportion
##   <fct>  <int>      <dbl>
## 1 1        323         25
## 2 2        277         21
## 3 3        709         54

Q: Are there more men or women in each Pclass?

There are more men than women in every Pclass, but the ratio of men to women is a lot higher in Pclass 3

titanic %>% 
  ggplot(aes(x=Pclass, fill=Sex)) +
  geom_bar(position = "dodge", width = 0.4)

Q: Survival rate? Were those who paid more prioritized for rescue?

Yes, those in Pclass 1 had much better chances of survival

titanic %>% 
  filter(!is.na(Survived)) %>%  
  ggplot(aes(x=Pclass, fill=Survived))+
  geom_bar(position='dodge')+
  scale_fill_brewer(palette = 'Set1')

Name

format: {LastName, Title. FirstName MiddleName}

head(titanic$Name)

## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

Individual names are unique, but we can extract the Title to help identify patterns based on this variable

Q: What is the frequency of each Title?

titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)
table(titanic$Sex, titanic$Title)

##         
##          Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme  Mr Mrs
##   female    0   0   0    1   1        0    1     0      0  260    2   1   0 197
##   male      1   4   1    0   7        1    0     2     61    0    0   0 757   0
##         
##           Ms Rev Sir the Countess
##   female   2   0   0            1
##   male     0   8   1            0

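Some Titles appear only once or a handful of times, and ‘Mlle’ and ‘Miss’ refer to the same thing. To avoid overfitting, let’s consolidate these Titles into fewer groups. Note that the code below also folds ‘Ms’ and ‘Mme’ into Miss, although ‘Mme’ (Madame) is strictly the French equivalent of ‘Mrs’.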

titanic <- titanic %>% 
  mutate(Title = case_when(Title %in% c('Master', 'Miss', 'Mr', 'Mrs') ~ Title,
                           Title %in% c('Mlle', 'Mme', 'Ms') ~ "Miss",
                           TRUE ~ "Other"))

table(titanic$Sex, titanic$Title)

##         
##          Master Miss  Mr Mrs Other
##   female      0  265   0 197     4
##   male       61    0 757   0    25
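
To see what the regular expression behind Title is doing, here is a small sketch that applies the same gsub() call to three of the names printed earlier. Nothing here is new logic; `nms` and `titles` are just illustrative variable names.

```r
# Names copied from the head(titanic$Name) output above
nms <- c("Braund, Mr. Owen Harris",
         "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
         "Heikkinen, Miss. Laina")

# '(.*, )' removes everything up to and including the comma and space;
# '(\\..*)' removes the first period and everything after it
titles <- gsub('(.*, )|(\\..*)', '', nms)
titles
```

What survives is the text between the comma and the first period, which is why a multi-word title such as "the Countess" comes through intact.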

Q: Survival rate based on Title?

Women had better chances than men; they appear to have been prioritized for rescue
Men with the Title “Mr” had the worst survival rate
Back in the day, “Master” was used in England for boys who were too young to be addressed as Mister; it looks like children had much better chances of survival

titanic %>% 
  filter(!is.na(Survived)) %>%  
  ggplot(aes(x=Title, fill=Survived))+
  geom_bar(position='dodge')+
  scale_fill_brewer(palette = 'Set1')

Sex

Q: A transatlantic crossing sounds rough. Were more men than women making it?

Yes, 64% of those onboard were men

count(titanic, Sex) %>% 
  mutate(prop = round(n/sum(n)*100)) %>% 
  arrange(desc(prop))

## # A tibble: 2 x 3
##   Sex        n  prop
##   <fct>  <int> <dbl>
## 1 male     843    64
## 2 female   466    36

Q: We already suspect that women had a higher survival rate. But if we hold Pclass constant, was the survival rate still higher for women?

Perhaps surprisingly, yes.
In Pclass 1, women survived almost with certainty; their male counterparts, though, were still more likely to die than to survive
In Pclass 3, survival chances for women were almost 50/50; this contrasts sharply with the much higher survival rates for women in the higher classes

titanic %>% 
  filter(!is.na(Survived)) %>%  
  ggplot(aes(x=Sex, fill=Survived)) +
  geom_bar(position='dodge') +
  scale_fill_brewer(palette = 'Set1') + 
  facet_wrap(~Pclass, nrow=3, scales = "free_y")

SibSp + Parch

SibSp = siblings and spouses onboard
Parch = parents and children onboard

They both tell us how having family with you impacts survival. Let’s combine these into a new feature: Family

Q: Did having other family members impact your survival?

Seems so. If you had no family onboard, your chances of survival were worse
Smaller family units (2-4) had better odds of survival
However, large family units (5+) did not

titanic$Family = titanic$SibSp + titanic$Parch + 1

titanic %>% 
  filter(!is.na(Survived)) %>%  
  ggplot(aes(x=Family, fill=Survived)) +
  geom_bar(position='dodge') +
  scale_fill_brewer(palette = 'Set1')

Ticket

There were 1309 passengers with 929 unique tickets

Let’s use Ticket to create a variable: N_Per_Ticket

#Count of how many passengers per ticket 
n_per_ticket <- titanic %>% 
  group_by(Ticket) %>% 
  count() 

#merge onto the main dataset 
titanic <- merge(x=n_per_ticket, y=titanic, 
                 by.x="Ticket", by.y="Ticket",
                 all.x=TRUE, all.y=TRUE)
colnames(titanic)[colnames(titanic) == "n"] <- "N_Per_Ticket"
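
As an aside, the count-then-merge step above can also be written in one call. This is a sketch on made-up data, assuming a dplyr version where add_count() accepts the `name` argument; unlike merge(), it leaves the rows in their original order and keeps the tibble class.

```r
library(dplyr)

# Toy tickets (made-up values): two passengers share ticket "A"
toy <- tibble(PassengerId = 1:3, Ticket = c("A", "A", "B"))

# add_count() appends the per-ticket count without a separate merge
toy <- toy %>% add_count(Ticket, name = "N_Per_Ticket")
toy
```

I kept the merge() version in the analysis itself, so the outputs that follow reflect that approach.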

Q: Are people on the same ticket because they’re from the same family?

Not necessarily. There are groups traveling together (on the same ticket) when they are not “Family”. And there are families who do not share the same ticket.
There are 284 instances where Family != N_Per_Ticket

titanic %>% 
  filter(N_Per_Ticket != Family) %>% 
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1   284

It is not clear what matters more: how big your family is (SibSp + Parch + 1), or how large your travel group is. Let’s create another variable, Groups, to be the larger of Family and N_Per_Ticket.

titanic <- titanic %>% 
  mutate(Groups = ifelse(Family > N_Per_Ticket, Family, N_Per_Ticket)) 

titanic %>% 
  filter(!is.na(Survived)) %>%  
  ggplot(aes(x=Groups, fill=Survived)) +
  geom_bar(position='dodge') +
  scale_fill_brewer(palette = 'Set1')

The distribution is similar to what we saw for Family. Let’s consolidate everyone into four buckets with a new variable: Group_Size.

titanic <- titanic %>% 
  mutate(Group_Size = case_when(Groups == 1 ~ 1,
                                Groups == 2 ~ 2,
                                Groups > 2 & Groups < 5 ~ 3,
                                Groups >= 5 ~ 4))

titanic %>% 
  filter(!is.na(Survived)) %>%  
  ggplot(aes(x=Group_Size, fill=Survived)) +
  geom_bar(position='dodge') +
  scale_fill_brewer(palette = 'Set1')
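
For reference, the two steps above (take the larger of Family and N_Per_Ticket, then bucket it) can be sketched with base R's pmax() and cut(). The data below is made up, and the breaks are my reading of the case_when() rules: 1, 2, 3-4, and 5+.

```r
library(dplyr)

# Made-up family and ticket sizes for four passengers
toy <- tibble(Family = c(1, 2, 3, 6), N_Per_Ticket = c(1, 1, 4, 5))

toy <- toy %>%
  mutate(
    # element-wise maximum of the two group measures
    Groups = pmax(Family, N_Per_Ticket),
    # right-closed bins (0,1], (1,2], (2,4], (4,Inf] coded as 1..4
    Group_Size = as.numeric(cut(Groups, breaks = c(0, 1, 2, 4, Inf),
                                labels = FALSE))
  )
toy
```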

Fare

Does Fare refer to price for the passenger, or the ticket?

For the ticket, as we can see by isolating Pclass 3: there is a linear relationship between Fare and N_Per_Ticket

titanic %>% 
  filter(Pclass==3) %>% 
  ggplot(aes(x=N_Per_Ticket, y=Fare)) + 
  geom_jitter(color="blue", size=2, alpha=0.3)

## Warning: Removed 1 rows containing missing values (geom_point).

This has caused outliers in Fare. For example, there are Fare values in Pclass 3 that are much higher than those in Pclass 2.

titanic %>% 
  filter(Fare<50) %>% 
  ggplot(aes(x=Fare, y=Pclass, fill=Pclass))+
  geom_boxplot()

We can combine Fare and N_Per_Ticket to create a new feature: Fare_Per_Person.
Let’s check the distribution for Pclass 3 again.

titanic$Fare_Per_Person <-  titanic$Fare/titanic$N_Per_Ticket

titanic %>% 
  filter(Pclass==3) %>% 
  ggplot(aes(x=N_Per_Ticket, y=Fare_Per_Person)) + 
  geom_jitter(color="blue", size=2, alpha=0.3)

## Warning: Removed 1 rows containing missing values (geom_point).

That looks like a more useful variable for us. We can isolate the relationship between how much you paid and your survival rate.

Now here is something interesting - there are people who had 0 Fare. Maybe they got a free ticket. For the purposes of this analysis, we should change their Fare_Per_Person to the median of their Pclass.

q1 <- quantile(filter(titanic, Pclass==1)$Fare_Per_Person)
q2 <- quantile(filter(titanic, Pclass==2)$Fare_Per_Person)
q3 <- quantile(filter(titanic, Pclass==3)$Fare_Per_Person, na.rm = TRUE)

titanic <- titanic %>% 
  mutate(Fare_Per_Person = case_when(Fare_Per_Person == 0 & Pclass == 1 ~ q1[3],
                                     Fare_Per_Person == 0 & Pclass == 2 ~ q2[3], 
                                     Fare_Per_Person == 0 & Pclass == 3 ~ q3[3],
                                     TRUE ~ Fare_Per_Person))
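
The same zero-fare replacement can be sketched without the three separate quantile objects by grouping on Pclass. This is only an illustration on made-up fares; as in the code above, the median is taken over all fares in the class, zeros included.

```r
library(dplyr)

# Made-up per-person fares with one zero in each class
toy <- tibble(
  Pclass = factor(c(1, 1, 1, 3, 3, 3)),
  Fare_Per_Person = c(0, 30, 40, 7, 0, 9)
)

toy <- toy %>%
  group_by(Pclass) %>%
  # replace zeros with the median fare of the passenger's own class
  mutate(Fare_Per_Person = ifelse(Fare_Per_Person == 0,
                                  median(Fare_Per_Person, na.rm = TRUE),
                                  Fare_Per_Person)) %>%
  ungroup()
toy
```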

While we’re at it, let’s address our missing value in Fare. One way to deal with missing values is to delete the row. But in this case, it is coming from the test set. We will have to make predictions about it.

Filtering it out, we see that this person traveled alone. We can then set his Fare_Per_Person to the median of Pclass 3.

filter(titanic, is.na(Fare))

##   Ticket N_Per_Ticket PassengerId Survived Pclass               Name  Sex  Age
## 1   3701            1        1044     <NA>      3 Storey, Mr. Thomas male 60.5
##   SibSp Parch Fare Cabin Embarked Title Family Groups Group_Size
## 1     0     0   NA  <NA>        S    Mr      1      1          1
##   Fare_Per_Person
## 1              NA

titanic$Fare_Per_Person[is.na(titanic$Fare_Per_Person)] <- q3[3]

There are still outliers.

titanic %>% 
  ggplot(aes(x=Fare_Per_Person, y=Pclass, fill=Pclass))+
  geom_boxplot()

Looking at this dataset, it seems that ensemble tree algorithms could be a good fit for it. Those are relatively immune to outliers. I think it’s best to leave Fare_Per_Person as is. It is already a significant improvement over the original Fare.

Another thing we can do is create Fare_Groups. For algorithms that are sensitive to outliers, this can be a good solution.

Q: Survival rate by Fare_Groups?
* we see a clearer pattern: a higher fare means a higher survival rate

cuts <- c(-Inf, 7, 8, 10.5, 13, 26, 39, Inf)
labs <- c("0", "1", "2", "3", "4", "5", "6")
titanic <- titanic %>% 
  mutate(Fare_Groups = cut(Fare_Per_Person, breaks = cuts, labels=labs))
titanic$Fare_Groups <- as.factor(titanic$Fare_Groups)

titanic %>% 
  filter(!is.na(Survived)) %>% 
  ggplot(aes(x=Fare_Groups, fill=Survived)) +
  geom_bar(position = 'dodge')+
  scale_fill_brewer(palette = 'Set1')
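
A quick sketch of how cut() assigns the buckets, using the same `cuts` and `labs` as above on a few hand-picked fares (`fares` and `groups` are illustrative names). The intervals are right-closed by default, so a fare of exactly 7 lands in group "0" and exactly 8 lands in group "1".

```r
cuts <- c(-Inf, 7, 8, 10.5, 13, 26, 39, Inf)
labs <- c("0", "1", "2", "3", "4", "5", "6")

# Intervals are right-closed by default: (-Inf, 7], (7, 8], ...
fares <- c(7, 7.25, 8, 40)
groups <- cut(fares, breaks = cuts, labels = labs)
as.character(groups)
```

cut() already returns a factor, so the as.factor() call in the code above is harmless but not strictly needed.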

Embarked

There are 2 missing values:

filter(titanic, is.na(Embarked))

##   Ticket N_Per_Ticket PassengerId Survived Pclass
## 1 113572            2         830        1      1
## 2 113572            2          62        1      1
##                                        Name    Sex Age SibSp Parch Fare Cabin
## 1 Stone, Mrs. George Nelson (Martha Evelyn) female  62     0     0   80   B28
## 2                       Icard, Miss. Amelie female  38     0     0   80   B28
##   Embarked Title Family Groups Group_Size Fare_Per_Person Fare_Groups
## 1     <NA>   Mrs      1      2          2              40           6
## 2     <NA>  Miss      1      2          2              40           6

These 2 share the same Ticket and Cabin. Since they’re traveling together, they likely have embarked together.

Given that they’re in the Pclass 1, where are they most likely to have embarked?
* At ‘S’

titanic %>% 
  filter(Pclass==1 & !is.na(Embarked)) %>% 
  count(Embarked) %>% 
  mutate(proportion=n/sum(n))

## # A tibble: 3 x 3
##   Embarked     n proportion
##   <fct>    <int>      <dbl>
## 1 C          141    0.439  
## 2 Q            3    0.00935
## 3 S          177    0.551

We also know that their Fare_Per_Person was 40. This is a pretty high price for Pclass 1. Does where you embark change how much you paid?
* We see that given they embarked at ‘C’, they were more likely to have paid the higher price. But since probability of boarding at ‘S’ is higher, ‘S’ remains more likely.

titanic %>% 
  filter(Pclass==1& (Embarked=='C' | Embarked=='S')) %>% 
  ggplot(aes(x=Embarked, y=Fare_Per_Person, fill=Embarked))+
  geom_boxplot() +
  ylim(20, 60)

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

#set NA to 'S'
titanic$Embarked[is.na(titanic$Embarked)] <- 'S'

Survival rate?
* surprisingly, there does seem to be some kind of relationship

titanic %>% 
  filter(!is.na(Survived)) %>% 
  ggplot(aes(x=Embarked, fill=Survived)) +
  geom_bar(position = 'dodge')+
  scale_fill_brewer(palette = 'Set1')

Cabin

format: {letter+numbers}

head(titanic$Cabin)

## [1] B77 B77 B79 E68 E67 E67
## 186 Levels: A10 A11 A14 A16 A18 A19 A20 A21 A23 A24 A26 A29 A31 A32 A34 ... T

There are 1014 missing values of Cabin, or roughly 77% of the dataset. The best way to handle this may be to throw this variable away. But first, let’s try to figure out what information we’re missing exactly.

1) values are missing b/c these people did not have cabins (this could affect survival)
2) values are missing b/c we do not know which cabins they had

This is a subtle distinction, let’s see if we can find some additional information online.
https://titanic.fandom.com/wiki/Third_Class_cabins
https://titanic.fandom.com/wiki/First_Class_Staterooms

The most important finding: “steerage” (open-space dorms) did not apply to the Titanic. All passengers were housed in cabins. Therefore the second explanation is the correct one: the values are missing because we do not know which cabins these passengers had.

For those passengers that we have the data for, Cabin provides potentially important information about the Deck the passenger’s accommodation was on. Since lower decks flooded first as the ship sank, this mattered. Let’s extract Deck from Cabin. But keep in mind that this variable will likely lead to overfitting. We will have to test the models and see.

The format is {‘A10’}. Are there any that do not follow this rule?
1) Within ‘F’, there are a few that look different. It is likely that the leading ‘F’ is extra, so what remains follows our normal format.

titanic %>% 
  filter((!Pclass==1) & !is.na(Cabin) & substr(Cabin, start=1, stop=2)=='F ') %>% 
  select(PassengerId, Sex, Age, Pclass, Cabin)

##   PassengerId    Sex Age Pclass Cabin
## 1        1213   male  25      3 F E57
## 2        1180   male  NA      3 F E46
## 3         129 female  NA      3 F E69
## 4         700   male  42      3 F G63
## 5         949   male  25      3 F G63
## 6          76   male  25      3 F G73
## 7         716   male  19      3 F G73

2) There is one entry with Deck T, which does not exist. Let’s override it to “Unknown”.

titanic <- titanic %>% 
  mutate(Deck = case_when(is.na(Cabin) ~ "Unknown",
                          substr(Cabin, start=1, stop=1)=="T" ~ "Unknown",
                          substr(Cabin, start=1, stop=2)=='F ' ~ substr(Cabin, start=3, stop=3),
                          TRUE ~  substr(Cabin, start=1, stop=1)
  ))
#see the resulting variable - 
count(titanic, Deck)

## # A tibble: 8 x 2
##   Deck        n
##   <chr>   <int>
## 1 A          22
## 2 B          65
## 3 C          94
## 4 D          46
## 5 E          44
## 6 F          14
## 7 G           9
## 8 Unknown  1015
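
To make the Deck rules concrete, here is the same case_when() applied to a handful of cabin strings. The values in `toy` are picked by hand for illustration (the multi-cabin entry "B57 B59" is an assumed format, not taken from the output above).

```r
library(dplyr)

# Hand-picked cabin strings covering each rule
toy <- tibble(Cabin = c(NA, "C85", "F G63", "T", "B57 B59"))

toy <- toy %>%
  mutate(Deck = case_when(
    is.na(Cabin) ~ "Unknown",                            # no cabin recorded
    substr(Cabin, 1, 1) == "T" ~ "Unknown",              # the lone Deck T entry
    substr(Cabin, 1, 2) == "F " ~ substr(Cabin, 3, 3),   # drop the extra 'F '
    TRUE ~ substr(Cabin, 1, 1)                           # first letter is the deck
  ))
toy$Deck
```

The order of the conditions matters: case_when() uses the first rule that matches, so the NA check has to come before any substr() comparison.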

Also note most passengers with Deck data are Pclass 1. If this variable is useful, it would be for Pclass 1.

titanic %>% 
  filter(Deck!='Unknown') %>% 
  count(Pclass)

## # A tibble: 3 x 2
##   Pclass     n
##   <fct>  <int>
## 1 1        255
## 2 2         23
## 3 3         16

Survival rate by Deck (ignoring missing data and looking only at Pclass 1)?
* Deck A actually had the worst survival rate

titanic %>% 
  filter(!is.na(Survived) & Deck!='Unknown' & Pclass==1) %>% 
  ggplot(aes(x=Deck, fill=Survived)) +
  geom_bar(position = 'dodge')+
  scale_fill_brewer(palette = 'Set1')

Age

There are 263 missing values of Age, or 20% of the dataset. This information is potentially impactful, so we should do our best to replace it. The median age of the dataset is 28.

Our Title variable can perhaps help. Miss vs Mrs, and Mr vs Master, each have an age component. We can also guess that older people could afford to be in a better Pclass. Let’s use these two variables, and set each missing Age to the median of its group.

#lookup table 

lookup <- titanic %>% 
  group_by(Pclass, Title) %>% 
  summarize(med_age = median(Age, na.rm=TRUE))

#rows with missing ages 
missing_ages <- titanic %>% filter(is.na(Age))

#merge the two 
temp <- merge(x=missing_ages, y=lookup,
              by = c('Pclass', 'Title'),
              all.x=TRUE) 
temp <- temp %>% select(PassengerId, med_age)
titanic <- merge(x=titanic, y=temp,
                 by = 'PassengerId',
                 all.x=TRUE)

#add med_age to original dataset
titanic <- titanic %>% 
  mutate(med_age = case_when(is.na(med_age) ~ Age,
                             TRUE ~ med_age))

#confirm that Age and our best guesses share the same median 
titanic %>%
  group_by(Pclass, Title) %>%
  summarize(n =n(),
            missing= sum(is.na(Age)),
            prop_missing = round(missing/n*100),
            age = median(Age, na.rm=TRUE),
            new = median(med_age, na.rm=TRUE))

## # A tibble: 14 x 7
## # Groups:   Pclass [3]
##    Pclass Title      n missing prop_missing   age   new
##    <fct>  <chr>  <int>   <int>        <dbl> <dbl> <dbl>
##  1 1      Master     5       0            0   6     6  
##  2 1      Miss      63       1            2  30    30  
##  3 1      Mr       159      27           17  41.5  41.5
##  4 1      Mrs       77      10           13  45    45  
##  5 1      Other     19       1            5  48.5  48.5
##  6 2      Master    11       0            0   2     2  
##  7 2      Miss      51       2            4  20    20  
##  8 2      Mr       150      13            9  30    30  
##  9 2      Mrs       55       1            2  30.5  30.5
## 10 2      Other     10       0            0  41.5  41.5
## 11 3      Master    45       8           18   6     6  
## 12 3      Miss     151      48           32  18    18  
## 13 3      Mr       448     136           30  26    26  
## 14 3      Mrs       65      16           25  31    31

#replace Age
titanic$Age <- titanic$med_age
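
For comparison, the lookup-and-merge imputation above can be sketched more compactly with a grouped mutate(). This runs on a made-up tibble and is meant as an equivalent formulation, not a replacement for the code that produced the table above.

```r
library(dplyr)

# Made-up ages with one missing value per (Pclass, Title) group
toy <- tibble(
  Pclass = factor(c(1, 1, 1, 3, 3, 3)),
  Title  = c("Mr", "Mr", "Mr", "Miss", "Miss", "Miss"),
  Age    = c(40, NA, 44, 18, 22, NA)
)

toy <- toy %>%
  group_by(Pclass, Title) %>%
  # fill each missing Age with the median of its own group
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()
toy$Age
```

One thing to watch with either approach: a group whose ages are all missing would get NA as its median, so it is worth checking that no NA remains afterwards.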

Given that we replaced 20% of the Age data, and there are still outliers, perhaps bucketing Age could be useful.

titanic <- titanic %>% 
  mutate(Age_Group = case_when(Age < 16 ~ "0-15",
                               Age >= 16 & Age < 20 ~ "16-19",
                               Age >= 20 & Age < 24 ~ "20-23",
                               Age >= 24 & Age < 30 ~ "24-29",
                               Age >= 30 & Age < 40 ~ "30-39",
                               Age >= 40 & Age < 55 ~ "40-54",
                               Age >= 55 ~ "55+"))
titanic %>% count(Age_Group)

## # A tibble: 7 x 2
##   Age_Group     n
##   <chr>     <int>
## 1 0-15        123
## 2 16-19       158
## 3 20-23       138
## 4 24-29       344
## 5 30-39       263
## 6 40-54       216
## 7 55+          67

titanic$Age_Group <- as.factor(titanic$Age_Group)

Survival rate by Age?
* the group with the best survival rate was children 15 and under
* the worst were people in their 20s
* surprisingly, people 55+ were not really better off, even though one might expect them to have been prioritized in the rescue

titanic %>% 
  filter(!is.na(Survived)) %>% 
  ggplot(aes(x = Age_Group, fill=Survived))+
  geom_bar(position='dodge')+
  scale_fill_brewer(palette = 'Set1')

Fate

Did the fate of other people in your group impact your survival? ‘Group’ here means those who were on the same ticket as you. Identifying family members based on last name was difficult, so I have decided to use ticket for simplicity. If this variable appears to be important after running feature_importance_ on models, we can return to do more work on it.

Who are your companions?

Companions <- titanic %>%
  group_by(Ticket) %>%
  summarize(n = n(),
            Died = sum(Survived==0, na.rm = T),
            Lived = sum(Survived==1, na.rm = T),
            Known_Fate = Died + Lived) %>% 
  arrange(desc(n))
Companions

## # A tibble: 929 x 5
##    Ticket           n  Died Lived Known_Fate
##    <chr>        <int> <int> <int>      <int>
##  1 CA. 2343        11     7     0          7
##  2 1601             8     2     5          7
##  3 CA 2144          8     6     0          6
##  4 3101295          7     6     0          6
##  5 347077           7     1     3          4
##  6 347082           7     7     0          7
##  7 PC 17608         7     0     2          2
##  8 S.O.C. 14879     7     5     0          5
##  9 113781           6     2     2          4
## 10 19950            6     2     2          4
## # ... with 919 more rows

For example: Ticket CA. 2343
* there are 11 people who share this ticket
* of those, we know the fate of 7 (the other 4 are in the test set)
* of those 7, all of them died
* could this mean that the rest suffered the same fate?

Let’s go through and add a Fate variable to our dataset. For each passenger in the training set, we should ignore her own fate, since including it would mean using the dependent variable to engineer an independent variable.

There are 3 levels.
* 0.5 = default
* 1 = better fate (other people in your group, when fate is known, more often lived)
* 0 = worse fate (other people in your group, when fate is known, more often died)

Survival rate based on Fate?
* as expected, if other members of your group survived, your chances were higher as well

full <- merge(x=titanic, y=Companions,
              by.x="Ticket", by.y="Ticket",
              all.x=TRUE, all.y=TRUE)
full$Fate <- NA

trainfate <- full %>% 
  filter(!is.na(Survived)) %>% 
  mutate(Known_Fate = Known_Fate-1,
         Fate = case_when(Known_Fate==0 ~ 0.5,
                          n==1 ~ 0.5,
                          n>1 & Survived==1 ~ (Lived-1)/Known_Fate,
                          n>1 & Survived==0 ~ Lived/Known_Fate)) 
testfate <- full %>% 
  filter(is.na(Survived)) %>% 
  mutate(Fate = case_when(Known_Fate==0 ~ 0.5,
                          n==1 ~ 0.5,
                          n>1 ~ Lived/Known_Fate)) 

full <- rbind(trainfate, testfate)
full <- full %>% arrange(PassengerId)

full <- full %>% 
  mutate(Fate = case_when(Fate < 0.5 ~ 0,
                          Fate > 0.5 ~ 1,
                          TRUE ~ Fate))

titanic$Fate <- as.factor(full$Fate)

titanic %>% 
  filter(!is.na(Survived)) %>% 
  ggplot(aes(x=Fate, fill=Survived))+
  geom_bar(position='dodge')+
  scale_fill_brewer(palette = 'Set1')
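
The leave-one-out idea behind Fate is easier to follow on a tiny example. The sketch below is my own condensed restatement of the rules above on made-up tickets; the helper columns `own_known`, `others_known`, `others_lived` and `share` are hypothetical names that do not appear in the analysis.

```r
library(dplyr)

# Made-up tickets; NA in Survived marks a test-set passenger
toy <- tibble(
  Ticket   = c("A", "A", "A", "B", "C", "C"),
  Survived = c(1, 1, NA, 0, 0, NA)
)

toy <- toy %>%
  group_by(Ticket) %>%
  mutate(
    own_known = !is.na(Survived),
    # companions with a known outcome, excluding the passenger's own
    others_known = sum(!is.na(Survived)) - own_known,
    # companions known to have lived, again excluding the passenger
    others_lived = sum(Survived == 1, na.rm = TRUE) -
      ifelse(own_known & Survived == 1, 1, 0),
    # 0.5 is the default when nothing is known about the companions
    share = ifelse(others_known == 0, 0.5, others_lived / others_known),
    Fate = case_when(share < 0.5 ~ 0, share > 0.5 ~ 1, TRUE ~ 0.5)
  ) %>%
  ungroup()
toy$Fate
```

On ticket "C", the training passenger gets the default 0.5 because her only companion is in the test set, while that companion gets 0 because the one known outcome on the ticket was a death.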

This concludes our EDA!

We can go through the variables and delete the ones we do not need.
Here is our preprocessed data, ready for the next stage!

titanic$Name <- NULL
titanic$SibSp <- NULL
titanic$Parch <- NULL
titanic$Cabin <- NULL
titanic$N_Per_Ticket <- NULL
titanic$Fare <- NULL
titanic$Family_Size <- NULL
titanic$Family <- NULL
titanic$Ticket <- NULL
titanic$Groups <- NULL
titanic$med_age <- NULL
str(titanic)

## 'data.frame':    1309 obs. of  13 variables:
##  $ PassengerId    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived       : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass         : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex            : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age            : num  22 38 26 35 35 26 54 2 27 14 ...
##  $ Embarked       : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  $ Title          : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ Group_Size     : num  2 2 1 2 1 1 2 4 3 2 ...
##  $ Fare_Per_Person: num  7.25 35.64 7.92 26.55 8.05 ...
##  $ Fare_Groups    : Factor w/ 7 levels "0","1","2","3",..: 2 6 2 6 3 3 5 1 1 5 ...
##  $ Deck           : chr  "Unknown" "C" "Unknown" "C" ...
##  $ Age_Group      : Factor w/ 7 levels "0-15","16-19",..: 3 5 4 5 5 4 6 1 4 1 ...
##  $ Fate           : Factor w/ 3 levels "0","0.5","1": 2 2 2 1 2 2 2 1 3 1 ...

#export file: 
ttrain <- titanic[1:891,]
ttest <- titanic[892:1309,]
write.csv(ttrain,"preprocessed_train.csv", row.names = FALSE)
write.csv(ttest,"preprocessed_test.csv", row.names = FALSE)
