This project analyzes which factors influenced the survival of passengers on the Titanic. The analysis uses three models:
1. Logistic Regression
2. Decision Tree
3. Support Vector Machine (SVM)
The three models will be compared by accuracy to see which one best explains passenger survival on the Titanic.

1 Input Data

1.1 Library

library(tidyverse)
library(corrplot)
library(ggplot2)
library(dplyr)
library(e1071)
library(caret)
library(rpart)
library(rpart.plot)
library(partykit)
library(alluvial)
library(plotly)
library(kernlab)
library(ROCR)
library(pscl)
library(gtools)
library(rattle)
library(RColorBrewer)

1.2 Data Setup

Because the data comes already split into train and test files, I will combine the two again for data cleansing.

data_train <- read.csv("train.csv", na.strings = "")
data_test <- read.csv("test.csv", na.strings = "")
full_data <- bind_rows(data_train, data_test)

2 Data Cleansing and Exploration

glimpse(full_data)

># Observations: 1,309
># Variables: 12
># $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
># $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
># $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
># $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
># $ Sex         <fct> male, female, female, female, male, male, male, male, f...
># $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
># $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
># $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
># $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
># $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
># $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
># $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...

Data description:
- Survived = Survival (0 = No, 1 = Yes)
- Pclass = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex = Sex (male, female)
- Age = Age in years
- SibSp = Number of siblings / spouses aboard the Titanic
- Parch = Number of parents / children aboard the Titanic
- Ticket = Ticket number
- Fare = Passenger fare
- Cabin = Cabin number
- Embarked = Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

First, I will check whether any variable has missing values:

colSums(is.na(full_data))

># PassengerId    Survived      Pclass        Name         Sex         Age 
>#           0         418           0           0           0         263 
>#       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
>#           0           0           0           1        1014           2

The result shows that several variables have missing values: Survived (418 NA), Age (263 NA), Fare (1 NA), Cabin (1014 NA), and Embarked (2 NA). I will handle them one by one.

Beginning with the Age variable, I will replace the missing Age values with the mean age of all passengers on the Titanic, then group Age into categories (“00-12”, “13-17”, “18-59”, “>60” years, where “>60” also includes age 60) to simplify the analysis.

full_data <- full_data %>% 
  mutate(
    Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age),
    `Age Group` = case_when(Age < 13 ~ "00-12",
                            Age >= 13 & Age < 18 ~ "13-17",
                            Age >= 18 & Age < 60 ~ "18-59",
                            Age >= 60 ~ ">60"))
table(full_data$`Age Group`)

># 
>#   >60 00-12 13-17 18-59 
>#    40    94    60  1115
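The same imputation and binning can be tried on a small vector first. This is a minimal sketch that uses base R's cut() instead of case_when(); the `ages` vector is made-up illustration data, not taken from the Titanic files, and the labels mirror the ones used above.

```r
# Hypothetical ages for illustration only (not from the Titanic data)
ages <- c(4, 15, NA, 35, 60, 71)

# Replace the missing age with the mean of the observed ages
ages_filled <- ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages)

# Left-closed bins: [0,13), [13,18), [18,60), [60,Inf)
age_group <- cut(ages_filled,
                 breaks = c(0, 13, 18, 60, Inf),
                 labels = c("00-12", "13-17", "18-59", ">60"),
                 right = FALSE)

print(data.frame(ages, ages_filled, age_group))
```

With right = FALSE each bin includes its lower bound, so age 60 lands in “>60”, which matches the `Age >= 60` condition in the case_when() call.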

For the Embarked variable, I will replace the missing values with the most frequent value, Southampton (S).

table(full_data$Embarked)

># 
>#   C   Q   S 
># 270 123 914

full_data$Embarked <- replace(full_data$Embarked, which(is.na(full_data$Embarked)), 'S')
table(full_data$Embarked)

># 
>#   C   Q   S 
># 270 123 916
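Instead of hard-coding 'S', the most frequent value could also be looked up from the data. The following is a minimal sketch under that assumption; the `embarked` vector is made-up illustration data and `most_common` is a name introduced here.

```r
# Hypothetical port codes for illustration only
embarked <- c("S", "C", NA, "S", "Q", NA, "S")

# table() drops NA by default; which.max() returns the most frequent value
most_common <- names(which.max(table(embarked)))

# Fill the missing entries with that value
embarked[is.na(embarked)] <- most_common
print(table(embarked))
```

If two values are tied for most frequent, which.max() returns the first one, so a tie would need a deliberate choice.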

The Name variable contains both the name and the title of each passenger. I will extract only the title into a new Title variable.

full_data$Title <- gsub('(.*, )|(\\..*)', '', full_data$Name)
table(full_data$Title)

># 
>#         Capt          Col          Don         Dona           Dr     Jonkheer 
>#            1            4            1            1            8            1 
>#         Lady        Major       Master         Miss         Mlle          Mme 
>#            1            2           61          260            2            1 
>#           Mr          Mrs           Ms          Rev          Sir the Countess 
>#          757          197            2            8            1            1

Because there are too many titles, I will merge some of them so that only five remain: Master, Miss, Mr, Mrs, and Rare Title.

full_data$Title[full_data$Title %in% c("Mlle", "Ms")] <- "Miss"
full_data$Title[full_data$Title == "Mme"] <- "Mrs"
full_data$Title[!(full_data$Title %in% c('Master', 'Miss', 'Mr', 'Mrs'))] <- "Rare Title"
table(full_data$Title)

># 
>#     Master       Miss         Mr        Mrs Rare Title 
>#         61        264        757        198         29
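To see what the regular expression and the regrouping do, they can be run on a few names. This is a small sketch: the first name appears in the data above, while the other three are invented names in the same "Surname, Title. Given name" format.

```r
# First name is from the data; the rest are hypothetical examples
names_example <- c("Braund, Mr. Owen Harris",
                   "Doe, Mlle. Jane",
                   "Roe, Mme. Anne",
                   "Poe, Col. Richard")

# Remove everything up to ", " and everything from the first "." onward
titles <- gsub('(.*, )|(\\..*)', '', names_example)
print(titles)

# Same regrouping as above: French forms merged, the rest become "Rare Title"
titles[titles %in% c("Mlle", "Ms")] <- "Miss"
titles[titles == "Mme"] <- "Mrs"
titles[!(titles %in% c("Master", "Miss", "Mr", "Mrs"))] <- "Rare Title"
print(titles)
```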

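For family size, I will create a Familysize variable (siblings/spouses + parents/children + the passenger) and divide it into three categories (“1”, “2-5”, “>5”) to simplify the analysis. As coded, “2-5” covers sizes 2 to 4 and “>5” covers sizes 5 and up.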

# Bin the numeric size before labeling it. Once the column holds character
# values, comparisons such as "11" < "5" are done as strings.
family_n <- full_data$SibSp + full_data$Parch + 1
full_data$Familysize <- case_when(family_n == 1 ~ "1",
                                  family_n < 5 ~ "2-5",
                                  TRUE ~ ">5")
table(full_data$Familysize)

># 
>#  >5   1 2-5 
>#  82 790 437
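Binning family size is easy to get wrong once the values have been converted to character, because R then compares them as strings. The sketch below uses made-up sizes and base R's cut() as an alternative, with the same cut points as the code above (sizes 2 to 4 get the “2-5” label, 5 and up get “>5”).

```r
# Hypothetical family sizes, including the largest value in the data (11)
family_n <- c(1, 2, 4, 5, 8, 11)

# Pitfall: character values are compared alphabetically, not numerically
print("11" < "5")

# Binning the numeric values avoids the problem.
# Intervals are right-closed: (0,1], (1,4], (4,Inf]
family_group <- cut(family_n,
                    breaks = c(0, 1, 4, Inf),
                    labels = c("1", "2-5", ">5"))
print(table(family_group))
```

cut() returns a factor directly, so a later as.factor() call on this column would not be needed.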

Recheck the missing values:

colSums(is.na(full_data))

># PassengerId    Survived      Pclass        Name         Sex         Age 
>#           0         418           0           0           0           0 
>#       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
>#           0           0           0           1        1014           0 
>#   Age Group       Title  Familysize 
>#           0           0           0

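Survived, Fare, and Cabin still have missing values. The missing values in Survived belong to the test data, which will be used for prediction. Fare still has one missing value, which was not imputed above. I will not use the Cabin variable later because it is not very useful for the analysis (1014 of its 1309 values are missing).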
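The recheck above still reports one missing Fare. One option, which is not part of the original analysis, would be to fill it with the median fare. A minimal sketch on made-up values (the `fare` vector and `fare_median` are introduced here for illustration):

```r
# Hypothetical fares for illustration only
fare <- c(7.25, 71.28, NA, 8.05, 53.10)

# Median of the observed fares; less sensitive to a few expensive tickets
fare_median <- median(fare, na.rm = TRUE)

# Fill the missing entry
fare[is.na(fare)] <- fare_median
print(fare)
```

The median is used here rather than the mean because fares are typically right-skewed; this is a design assumption, and the mean would work the same way mechanically.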

Now, I will convert some of the variables to factors:

full_data <- full_data %>% 
  mutate( Survived = as.factor(Survived), 
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked),
         `Age Group` = as.factor(`Age Group`),
         Title = as.factor(Title),
         Familysize = as.factor(Familysize))
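When many columns need the same conversion, they can also be converted in one step from a vector of column names. This is a base R sketch on a made-up miniature data frame; `df` and `factor_cols` are names introduced here.

```r
# Hypothetical miniature data frame for illustration only
df <- data.frame(Survived = c(0, 1, 1),
                 Pclass   = c(3, 1, 2),
                 Fare     = c(7.25, 71.28, 13.00))

# Convert only the listed columns; Fare stays numeric
factor_cols <- c("Survived", "Pclass")
df[factor_cols] <- lapply(df[factor_cols], as.factor)

str(df)
```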

Check the class of each variable again:

glimpse(full_data)

># Observations: 1,309
># Variables: 15
># $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
># $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
># $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
># $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
># $ Sex         <fct> male, female, female, female, male, male, male, male, f...
># $ Age         <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.88...
># $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
># $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
># $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
># $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
># $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
># $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...
># $ `Age Group` <fct> 18-59, 18-59, 18-59, 18-59, 18-59, 18-59, 18-59, 00-12,...
># $ Title       <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss,...
># $ Familysize  <fct> 2-5, 2-5, 1, 2-5, 1, 1, 1, >5, 2-5, 2-5, 2-5, 1, 1, >5,...

Now, I will discard the variables that are not used in the analysis, keeping only “Survived”, “Pclass”, “Sex”, “Fare”, “Embarked”, “Age Group”, “Title”, “Familysize”, and “Age”.

final_data <- full_data %>% 
  select(Survived, Pclass, Sex, Fare, Embarked, `Age Group`, Title, Familysize, Age)

Now, I will split the data back into train and test sets, as at the beginning.

train_fix <- final_data[1:891,]
test_fix <- final_data[892:1309,]
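The row numbers above depend on the order in which the two files were combined. Since Survived is missing exactly for the 418 test rows, the split could also be keyed on that column. A minimal sketch on a made-up data frame (`combined`, `train_part`, and `test_part` are names introduced here):

```r
# Hypothetical combined data: Survived is NA for rows that came from the test file
combined <- data.frame(PassengerId = 1:6,
                       Survived    = c(0, 1, 1, 0, NA, NA))

# Rows with a known outcome form the training set; the rest are for prediction
train_part <- combined[!is.na(combined$Survived), ]
test_part  <- combined[is.na(combined$Survived), ]

print(nrow(train_part))
print(nrow(test_part))
```

This only works as long as Survived has no missing values in the training file itself.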

3 Exploratory Data

3.1 Alluvial Graph

I will make an alluvial graph to see in general how the variables relate to each other.

alluvialgraph <- train_fix %>% 
  group_by(Survived, Sex, Pclass, `Age Group`) %>% 
  summarise(total = n()) %>% 
  ungroup %>% 
  na.omit()

alluvial(alluvialgraph[, c(1:4)], 
         freq = alluvialgraph$total, border = NA,
         col = ifelse(alluvialgraph$Survived == "1", "green", "red"),
         cex = 0.65,
         ordering = list(
           order(alluvialgraph$Survived, alluvialgraph$Pclass == 1),
           order(alluvialgraph$Sex, alluvialgraph$Pclass == 1),
           NULL,
           NULL
         ))

Notes:
- Green (passenger survived)
- Red (passenger did not survive)
From the graph, it can be seen that females were more likely to survive than males.
From the Pclass perspective, there is a tendency for higher classes to have higher survival rates.
From the Age Group perspective, most of the data falls in the “18-59” category, where more passengers appear to have died than survived.
I will look in more detail at how each variable relates to the survival rate using bar charts.