Load library

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.8     v dplyr   1.0.9
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Preface

On the night of April 14, 1912, the RMS Titanic struck an iceberg in the North Atlantic Ocean and sank in the early hours of the following morning. Of the estimated 2,224 people on board, only about 706 survived.

Problem Statement

The goal of this exercise is to predict which passengers survived the sinking of the Titanic, based on the input variables described below. We are provided two datasets:

1. train.csv, containing 891 records
2. test.csv, containing 418 records.

The two datasets are provided with the intent that models are formulated using the train dataset and model performance is evaluated on the test dataset.

Variables

Data

Load both the train and test data sets

setwd("C:/Users/Galih Dwika Putra R/Documents/project_learning/kaggle/titanic")
titan_train <- read.csv("train.csv",sep=",")
titan_test <- read.csv("test.csv",sep=",")
head(titan_train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
head(titan_test)
##   PassengerId Pclass                                         Name    Sex  Age
## 1         892      3                             Kelly, Mr. James   male 34.5
## 2         893      3             Wilkes, Mrs. James (Ellen Needs) female 47.0
## 3         894      2                    Myles, Mr. Thomas Francis   male 62.0
## 4         895      3                             Wirz, Mr. Albert   male 27.0
## 5         896      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0
## 6         897      3                   Svensson, Mr. Johan Cervin   male 14.0
##   SibSp Parch  Ticket    Fare Cabin Embarked
## 1     0     0  330911  7.8292              Q
## 2     1     0  363272  7.0000              S
## 3     0     0  240276  9.6875              Q
## 4     0     0  315154  8.6625              S
## 5     1     1 3101298 12.2875              S
## 6     0     0    7538  9.2250              S

Exploratory Data Analysis

In the training data set we have 12 features: 11 predictor features and 1 target feature, Survived. We will explore these features to understand the distribution and statistics of each one.

Survived

Survived is the only target feature. It is a binary variable: 1 indicates the passenger survived and 0 indicates they did not. Because it is inherently categorical, it is better to convert it to a factor data type first.

# transfrom Survived as factor
titan_train$Survived <- as.factor(titan_train$Survived)
str(titan_train$Survived)
##  Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
print(paste("Missing value : ",sum(is.na(titan_train$Survived))),sep="")
## [1] "Missing value :  0"
print("Statistic summary of Survived : ")
## [1] "Statistic summary of Survived : "
print(summary(titan_train$Survived))
##   0   1 
## 549 342
# create mode function
getmode <- function(x){
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x,uniqx)))]
}
# get mode of Survived
print(paste("Mode value : ",getmode(titan_train$Survived)),sep="")
## [1] "Mode value :  0"

There are no missing values in the Survived feature. It has a complete 891 observations, consisting of 549 passengers who did not survive and 342 who did, with a mode of 0.

# Visualize Survived
titan_train %>% ggplot(aes(x=Survived)) + 
  geom_bar(fill=c("blue","red"))+
  ggtitle("Titanic Survival Count")+
  scale_x_discrete(labels=c("Not Survived","Survived"))+
  theme(plot.title = element_text(hjust=0.5))+
  geom_text(aes(label=..count..),stat="count",vjust=1.5)

Pclass

Pclass is a numeric feature that indicates the cabin class of the passenger. There are 3 classes, with 3 as the lowest class and 1 as the highest (most luxurious) one.

Because it encodes an ordered quality (1 is better than 2, and 2 is better than 3), it is better to transform it into a factor variable.

# transform Pclass as factor
titan_train$Pclass <- as.factor(titan_train$Pclass)
# Don't forget to do the same to test data set
titan_test$Pclass <- as.factor(titan_test$Pclass)

# Check for missing value in train and test data set
print(paste("Missing value in train data set:",sum(is.na(titan_train$Pclass))),sep="")
## [1] "Missing value in train data set: 0"
print(paste("Missing value in test data set: ",
            sum(is.na(titan_test$Pclass))),sep="")
## [1] "Missing value in test data set:  0"
print("Summary of train data set")
## [1] "Summary of train data set"
summary(titan_train$Pclass)
##   1   2   3 
## 216 184 491
print("Summary of test data set")
## [1] "Summary of test data set"
summary(titan_test$Pclass)
##   1   2   3 
## 107  93 218
print(paste("Mode value of Pclass train data:",getmode(titan_train$Pclass)),sep=" ")
## [1] "Mode value of Pclass train data: 3"
print(paste("Mode value of Pclass test data:",getmode(titan_test$Pclass)),sep=" ")
## [1] "Mode value of Pclass test data: 3"

Pclass has essentially the same characteristics in the train and test data. Class 3 is the most frequent value (which makes sense), followed by class 1, with class 2 the least frequent. Both data sets have 0 missing values.

# Visualize Pclass from training
titan_train %>% ggplot(aes(x=Pclass))+
  geom_bar(fill=c("green","dark blue","purple"))+
  ggtitle("Titanic Cabin Class Distribution in Training Data")+
  scale_x_discrete(labels=c("First Class","Second Class","Third Class"))+
  geom_text(aes(label=..count..),stat="count",vjust=1.5,color="white")+
  theme(plot.title = element_text(hjust=0.5))

titan_test %>% ggplot(aes(x=Pclass))+
  geom_bar(fill=c("green","dark blue","purple"))+
  ggtitle("Titanic Cabin Class Distribution in Test Data")+
  scale_x_discrete(labels=c("First Class","Second Class","Third Class"))+
  geom_text(aes(label=..count..),stat="count",vjust=1.5,color="white")+
  theme(plot.title = element_text(hjust=0.5))

The following chart shows that second- and third-class passengers who did not survive account for around 52.63% of all passengers in the training data, a markedly larger share than the survivors from those same classes. This may indicate some correlation between the Pclass and Survived features.

titan_train %>% ggplot(aes(x=Pclass,fill=Survived))+
  geom_bar()+
  ggtitle("Survivability by Cabin Class Distribution in training")+
  scale_x_discrete(labels=c("First Class","Second Class","Third Class"))+
  scale_fill_discrete(labels=c("Not Survived","Survived"))+
  geom_text(aes(label=..count..),stat="count",position = position_stack(0.5))+
  theme(plot.title = element_text(hjust=0.5))
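
As a quick sanity check, the 52.63% figure quoted above can be reproduced directly from the training data (a minimal sketch that simply counts second- and third-class passengers who did not survive as a share of all 891 passengers):

# Sanity check (sketch): share of all training passengers who were
# second/third class and did not survive
with(titan_train, mean(Pclass %in% c("2","3") & Survived == "0")) * 100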

Name

The name of the passenger. At a glance, there is not much insight we can extract from this feature.

head(titan_train$Name)
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

But it contains a title, which may turn out to be useful. The name format is: “LastName, Title. FirstName MiddleName”. Let’s extract all titles from this feature and mutate a new feature called ‘titlePass’.

# Check for missing value
print(paste("Missing value of Name in training:",sum(is.na(titan_train$Name))),sep=" ")
## [1] "Missing value of Name in training: 0"
# It has zero missing value
# Mutate new feature
titan_train <- titan_train %>%
  mutate(titlePass=str_extract(titan_train$Name,
                                          '[a-zA-Z]+(?=\\.)'))
titan_train %>% ggplot(aes(x=fct_infreq(titlePass)))+
  geom_bar(fill="magenta")+
  xlab("Title of Passanger")+
  ggtitle("Passanger Title Distribution in training data")+
  geom_text(aes(label=..count..),stat="count",vjust=-0.1)+
  theme(axis.text.x = element_text(angle=45))+
  theme(plot.title = element_text(hjust=0.5))

The most frequent title in the training data is ‘Mr’, followed by Miss, then Mrs, Master, and the rest. From this alone we can estimate that a large percentage of passengers are male. The test data set has a similar distribution, and also zero missing values.

# Check for missing value
print(paste("Missing value of Name in training:",sum(is.na(titan_test$Name))),sep=" ")
## [1] "Missing value of Name in training: 0"
# It has zero missing value
# Mutate new feature
titan_test <- titan_test %>%
  mutate(titlePass=str_extract(titan_test$Name,
                                          '[a-zA-Z]+(?=\\.)'))
titan_test %>% ggplot(aes(x=fct_infreq(titlePass)))+
  geom_bar(fill="magenta")+
  xlab("Title of Passanger")+
  ggtitle("Passanger Title Distribution in test data")+
  geom_text(aes(label=..count..),stat="count",vjust=-0.1)+
  theme(axis.text.x = element_text(angle=45))+
  theme(plot.title = element_text(hjust=0.5))

Sex

A binary feature of the passenger: either male or female. We will transform it into a factor.

# Convert Sex Feature into factor data
titan_train$Sex <- as.factor(titan_train$Sex)
print(paste("Missing value of Sex feature : ",
      sum(is.na(titan_train$Sex))),sep=" ")
## [1] "Missing value of Sex feature :  0"
titan_train %>% ggplot(aes(x=Sex))+
  geom_bar(fill=c("magenta","purple"))+
  ggtitle("Passanger Sex Distribution")+
  xlab("Sex")+
  geom_text(aes(label=..count..),stat="count",vjust=1.5)

This chart confirms our estimation from the titles in the Name variable: in the Titanic training data, male is the dominant sex. Next, we will look at survivability by sex.

titan_train %>% ggplot(aes(x=Sex,fill=Survived))+
  geom_bar()+
  xlab("Sex")+
  ggtitle("Survivability by Sex in Training Data")+
  geom_text(aes(label=..count..),stat="count",position = position_stack(0.5))+
  scale_fill_discrete(labels=c("Not Survived","Survived"))

The test data set has a similar Sex distribution.

print(paste("Missing value of Sex in Test Data:",
            sum(is.na(titan_test$Sex))),sep=" ")
## [1] "Missing value of Sex in Test Data: 0"
titan_test %>% ggplot(aes(x=Sex))+
  geom_bar(fill=c("chocolate","coral"))+
  xlab("Sex")+
  ggtitle("Survivability by Sex in Test Data")+
  geom_text(aes(label=..count..),stat="count",position = position_stack(0.5))

NOTE: Because of what Sex represents, we should not leave it as a plain factor for modelling. Depending on how a model encodes factors, the levels may end up as numbers (e.g. female = 0, male = 1), which implies an artificial ordering, as if female were numerically “lower” than male. This could distort the model’s predictions later. The solution is to mutate two dummy features: one binary feature for male and another binary feature for female.
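
A minimal sketch of that dummy encoding is shown below (not part of the pipeline here; the column names Sex_male and Sex_female are illustrative):

# Sketch: dummy (one-hot) encoding of Sex; column names are illustrative
titan_train %>%
  mutate(Sex_male   = as.integer(Sex == "male"),
         Sex_female = as.integer(Sex == "female")) %>%
  select(Sex, Sex_male, Sex_female) %>%
  head()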

Age

No further definition.

summary(titan_train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
print(paste("Missing value of age feature in training:",
            sum(is.na(titan_train$Age))))
## [1] "Missing value of age feature in training: 177"
titan_test %>% ggplot(aes(x=Age))+
  geom_histogram(aes(y=..density..),binwidth = 5,fill="darkslateblue",
                 color="cyan1")+
  geom_density(color="red",lwd=1)+
  ggtitle("Age Distribution in Training Data")
## Warning: Removed 86 rows containing non-finite values (stat_bin).
## Warning: Removed 86 rows containing non-finite values (stat_density).

The histogram and density plot indicate that the age distribution is centred around 20-30 years old and presumably right-skewed, with 177 missing values (we will deal with those later). To verify the skewness, I use the skewness function in R: if the skewness coefficient is greater than 0, we can confirm that Age is right-skewed.

library(moments)
skewness(titan_train$Age,na.rm = TRUE)# don't forget to add na.rm, 
## [1] 0.3882899
# because we still have missing value

The skewness of the Age feature in the training data set is 0.3882899, which is greater than 0. So the Age feature in the training data is confirmed to be right-skewed, meaning the majority of values lie below the mean.

For age in test data set,

print(summary(titan_test$Age))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.17   21.00   27.00   30.27   39.00   76.00      86
print(paste("Missing value of age feature in training:",
            sum(is.na(titan_test$Age))))
## [1] "Missing value of age feature in training: 86"
print(skewness(titan_test$Age,na.rm=TRUE))#0.4552923
## [1] 0.4552923
titan_test %>% ggplot(aes(x=Age))+
  geom_histogram(aes(y=..density..),binwidth = 5,fill="darkslateblue",
                 color="cyan1")+
  geom_density(color="red",lwd=1)+
  ggtitle("Age Distribution in Test Data")
## Warning: Removed 86 rows containing non-finite values (stat_bin).
## Warning: Removed 86 rows containing non-finite values (stat_density).

Both the training and test data sets have a similar age distribution centred around 20-30 years old. The test data set has 86 missing values and a skewness of 0.4552923 (positive, i.e. right-skewed). Even the percentiles of the two data sets are nearly the same, with only small differences.

For further analysis, here are more statistics about Age in the training and test data sets:

# Training data
print(paste("Standard Deviation of Age in Training Data:",
            sd(titan_train$Age,na.rm=TRUE),sep=" "))
## [1] "Standard Deviation of Age in Training Data: 14.526497332334"
print(paste("IQR of Age in Training Data:",
            IQR(titan_train$Age,na.rm=TRUE),sep=" "))
## [1] "IQR of Age in Training Data: 17.875"
print(paste("Median Absolute Deviation of Age in Training Data:",
            mad(titan_train$Age,na.rm=TRUE),sep=" "))
## [1] "Median Absolute Deviation of Age in Training Data: 13.3434"
print("----------------",end="\n\n")
## [1] "----------------"
# Test data
print(paste("Standard Deviation of Age in Test Data:",
            sd(titan_test$Age,na.rm=TRUE),sep=" "))
## [1] "Standard Deviation of Age in Test Data: 14.1812092356244"
print(paste("IQR of Age in Test Data:",
            IQR(titan_test$Age,na.rm=TRUE),sep=" "))
## [1] "IQR of Age in Test Data: 18"
print(paste("Median Absolute Deviation of Age in Test Data:",
            mad(titan_test$Age,na.rm=TRUE),sep=" "))
## [1] "Median Absolute Deviation of Age in Test Data: 11.8608"

Again, the two data sets have seemingly similar statistics. Now, to proceed, let’s compare Age against survival status.

titan_train %>% ggplot(aes(x=Survived,y=Age))+
  geom_violin(trim=FALSE,aes(fill=Survived))+
  scale_fill_manual(values=c("cyan1","coral"))+
  scale_color_manual(values = c("red","magenta"))+
  stat_summary(fun=median, geom="point",size=3)+
  ggtitle("Violin Plot of Age by Survived in Training Data")+
  scale_x_discrete(labels=c("Not Survived","Survived"))
## Warning: Removed 177 rows containing non-finite values (stat_ydensity).
## Warning: Removed 177 rows containing non-finite values (stat_summary).

It seems there is a small difference in median Age between the survived and not-survived groups. We will analyze this later.

SibSp

The number of siblings and spouses brought along by the passenger. This is a little tricky, actually. Say Mr A and Mrs B are a married couple: they share the same surname, and their SibSp values would both be 1. So not only is the definition ambiguous, the number itself mixes two different relationships. The solutions for this feature are:

1. We need to split siblings and spouses into separate features

2. Transform them into dummy variables

We will deal with this later; for now, I will continue exploring the data in its original form.
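
As a rough illustration of the second point, a binary companion flag could be derived as in the sketch below (illustrative only; the name has_sibsp is made up, and the final treatment is decided during pre-processing):

# Sketch: binary flag for travelling with at least one sibling or spouse
titan_train %>%
  mutate(has_sibsp = as.integer(SibSp > 0)) %>%
  count(has_sibsp)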

print(summary(titan_train$SibSp))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.523   1.000   8.000
print(paste("Missing value of SibSp in training data:",
            sum(is.na(titan_train$SibSp))))
## [1] "Missing value of SibSp in training data: 0"
print("---------------------------",end="\n")
## [1] "---------------------------"
print(summary(titan_test$SibSp))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4474  1.0000  8.0000
print(paste("Missing value of SibSp in test data:",
            sum(is.na(titan_test$SibSp))))
## [1] "Missing value of SibSp in test data: 0"

From the summary above, we at least get the minimum and maximum number of siblings or spouses brought by a passenger. Neither data set has missing values. The summary statistics also indicate that SibSp, in both the training and test data, is right-skewed (mean > median).

To make it clear, let’s visualize this data

SibSp_tab <- table(titan_train$SibSp)
sibSp_df <- data.frame("SibSp"=names(SibSp_tab),"Count"=SibSp_tab)
sibSp_df %>% ggplot(aes(x=SibSp,y=Count.Freq))+
  geom_col(aes(fill=Count.Freq))+
  ggtitle("Sibling and Spouse Distribution in Training Data")+
  geom_text(aes(label=Count.Freq),vjust=-0.1)

SibSp_tab2 <- table(titan_test$SibSp)
sibSp_df2 <- data.frame("SibSp"=names(SibSp_tab2),"Count"=SibSp_tab2)
sibSp_df2 %>% ggplot(aes(x=SibSp,y=Count.Freq))+
  geom_col(aes(fill=Count.Freq))+
  ggtitle("Sibling and Spouse Distribution in Test Data")+
  geom_text(aes(label=Count.Freq),vjust=-0.1)

Both have a similar distribution, where passengers travelling alone are the most frequent type. In terms of survivability:

titan_train %>% ggplot(aes(x=SibSp))+
  geom_freqpoly(aes(color=Survived),position = position_dodge2())+
  ggtitle("Frequency of Survived by SibSp in Training Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Parch

Parch is similar to SibSp: it indicates the number of parents and children brought along by the passenger. Again, it is an ambiguous variable, and the solution is the same as the one we will apply to SibSp.

print(summary(titan_train$Parch))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3816  0.0000  6.0000
print(paste("Missing value of Parch from training data:",
            sum(is.na(titan_train$Parch))),sep=" ")
## [1] "Missing value of Parch from training data: 0"
print(paste("Standard deviation of Parch from training data:",
            sd(titan_train$Parch)),sep=" ")
## [1] "Standard deviation of Parch from training data: 0.806057221129948"
print(paste("The IQR of Parch from training data:",
            IQR(titan_train$Parch)),sep=" ")
## [1] "The IQR of Parch from training data: 0"
print(paste("Median Absolute Deviation of Parch from training data:",
            mad(titan_train$Parch)),sep=" ")
## [1] "Median Absolute Deviation of Parch from training data: 0"
print(paste("The range of Parch from training data:",
            range(titan_train$Parch)[2]),sep=" ")
## [1] "The range of Parch from training data: 6"
print("------------------------------------------",end="\n")
## [1] "------------------------------------------"
print(summary(titan_test$Parch))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3923  0.0000  9.0000
print(paste("Missing value of Parch from training data:",
            sum(is.na(titan_test$Parch))),sep=" ")
## [1] "Missing value of Parch from training data: 0"
print(paste("Standard deviation of Parch from training data:",
            sd(titan_test$Parch)),sep=" ")
## [1] "Standard deviation of Parch from training data: 0.981428878537169"
print(paste("The IQR of Parch from training data:",
            IQR(titan_test$Parch)),sep=" ")
## [1] "The IQR of Parch from training data: 0"
print(paste("Median Absolute Deviation of Parch from training data:",
            mad(titan_test$Parch)),sep=" ")
## [1] "Median Absolute Deviation of Parch from training data: 0"
print(paste("The range of Parch from training data:",
            range(titan_test$Parch)[2]),sep=" ")
## [1] "The range of Parch from training data: 9"

Both the training and test data sets have a similar distribution for Parch: 0 missing values, right-skewed (median and mode < mean), the same median, and similar means.

parch_tab <- table(titan_train$Parch)
parch_df <- data.frame(parch_tab)
parch_df %>% ggplot(aes(x=Var1,y=Freq))+
  geom_col(aes(fill=Freq))+
  ggtitle("Survivability by Parch Distribution in Test Data")+
  geom_text(aes(label=Freq),vjust=-0.1)

parch_tab2 <- table(titan_test$Parch)
parch_df2 <- data.frame(parch_tab2)
parch_df2 %>% ggplot(aes(x=Var1,y=Freq))+
  geom_col(aes(fill=Freq))+
  ggtitle("Survivability by Parch Distribution in Test Data")+
  geom_text(aes(label=Freq),vjust=-0.1)

More than 75% of Titanic passengers in both the training and test data did not bring along a parent or child, similar to the insight we extracted from the SibSp feature.
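
That claim follows from the summaries above (the 3rd quartile of Parch is 0 in both data sets); a quick check, as a sketch:

# Quick check (sketch): share of passengers travelling without parents/children
mean(titan_train$Parch == 0)
mean(titan_test$Parch == 0)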

Ticket

This feature records the ticket number. As we can see at a glance, the tickets do not follow a fixed pattern that we can analyze. There have actually been efforts to decipher the prefixes on the tickets: some prefixes are easy to read as the departure harbour, and for the numeric part it is tempting to treat the first digit as an indicator of cabin class, but the rest, information about cabin number, deck, and so on, remains a mystery. This is because Titanic tickets were sold through multiple agencies rather than a central one, so each agency had its own ticket format; this is understandable, given how rudimentary administrative systems were at the time.

print(paste("Missing value of Ticket in training data:",
      sum(is.na(titan_train$Ticket))),sep=" ")
## [1] "Missing value of Ticket in training data: 0"
print(paste("Missing value of Ticket in Test data:",
      sum(is.na(titan_test$Ticket))),sep=" ")
## [1] "Missing value of Ticket in Test data: 0"

For this analysis, I will not dig further into this feature. Let’s keep it simple and drop it later.

Fare

Fare records the price paid for the ticket. The values appear to be in British pounds written as decimals, so 7.25 is £7.25, roughly 7 pounds and 5 shillings in the pre-decimal currency of the time.
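
Assuming the fares really are decimal pounds, converting the fractional part back to shillings is simple arithmetic (one pound = 20 shillings); a small worked sketch:

# Worked example (assuming decimal pounds): fractional pounds -> shillings
fare_example <- 7.25
(fare_example - floor(fare_example)) * 20   # 0.25 pound = 5 shillings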

summary(titan_train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33
print(paste("Missing value of Fare in Training data:",
            sum(is.na(titan_train$Fare))),sep=" ")
## [1] "Missing value of Fare in Training data: 0"
print(paste("Skewness of Fare in Training data:",
            skewness(titan_train$Fare)),sep=" ")
## [1] "Skewness of Fare in Training data: 4.77925329237236"
print(paste("Standard Deviation of Fare in Training data:",
            sd(titan_train$Fare)),sep=" ")
## [1] "Standard Deviation of Fare in Training data: 49.6934285971809"
print(paste("Median Absolute Deviation of Fare from Training data:",
            mad(titan_train$Fare)),sep=" ")
## [1] "Median Absolute Deviation of Fare from Training data: 10.23616692"
print(paste("IQR of Fare from Training data:",
            IQR(titan_train$Fare)),sep="  ")
## [1] "IQR of Fare from Training data: 23.0896"
print("--------------------------------------------",end="\n")
## [1] "--------------------------------------------"
summary(titan_test$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   7.896  14.454  35.627  31.500 512.329       1
print(paste("Missing value of Fare in Test data:",
            sum(is.na(titan_test$Fare))),sep=" ")
## [1] "Missing value of Fare in Test data: 1"
print(paste("Skewness of Fare in Test data:",
            skewness(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "Skewness of Fare in Test data: 3.67393667584391"
print(paste("Standard Deviation of Fare in Test data:",
            sd(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "Standard Deviation of Fare in Test data: 55.9075761799738"
print(paste("Median Absolute Deviation of Fare from Test data:",
            mad(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "Median Absolute Deviation of Fare from Test data: 10.118745"
print(paste("IQR of Fare from Test data:",
            IQR(titan_test$Fare,na.rm=TRUE)),sep="  ")
## [1] "IQR of Fare from Test data: 23.6042"

There is 1 missing Fare value in the test data set, but the overall statistics of both data sets are more or less the same. The only notable difference is that Fare in the test data has a larger variance than in the training data. Furthermore, both are right-skewed.

titan_train %>% ggplot(aes(x=Fare))+
  geom_histogram(aes(y=..density..),fill="coral",color="black")+
  geom_density(color="brown",lwd=1)+
  ggtitle("Fare Distribution of Training Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram above gives a strong indication of outliers in Fare in the training data; we will analyze this in the next section. The test data shows the same kind of distribution.

titan_test %>% ggplot(aes(x=Fare))+
  geom_histogram(aes(y=..density..),fill="coral",color="black")+
  geom_density(color="brown",lwd=1)+
  ggtitle("Fare Distribution of Test Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).

titan_train %>% ggplot(aes(x=Survived,y=Fare))+
  geom_boxplot(aes(fill=Survived))+
  scale_color_manual(values=c("coral","cyan2"))+
  ggtitle("Survivability by Fare in Training")+
  scale_x_discrete(labels=c("Not Survived","Survived"))

The boxplot also confirms the existence of outliers in Fare in the training data, and we can see that the median fare of those who survived is slightly higher than that of those who did not make it. This is another interesting insight, and it may be related to the fact that first-class passengers had a higher probability of surviving.
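
As a rough way to quantify those outliers ahead of the next section, a 1.5 * IQR rule can be applied to Fare (a heuristic sketch; fare_q3 and fare_upper are illustrative names):

# Heuristic sketch: count Fare values above Q3 + 1.5 * IQR in the training data
fare_q3    <- quantile(titan_train$Fare, 0.75, na.rm = TRUE)
fare_upper <- fare_q3 + 1.5 * IQR(titan_train$Fare, na.rm = TRUE)
sum(titan_train$Fare > fare_upper, na.rm = TRUE)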

Cabin

This feature holds the cabin number of each passenger: a letter prefix followed by a series of digits.

head(titan_train$Cabin)
## [1] ""     "C85"  ""     "C123" ""     ""
summary(titan_train$Cabin)
##    Length     Class      Mode 
##       891 character character
print(paste("Missing value of Cabin in training:",
            sum(is.na(titan_train$Cabin)),"(",
            round((sum(is.na(titan_train$Cabin))/length(titan_train$Cabin))*100,2),"%)",sep=""))
## [1] "Missing value of Cabin in training:0(0%)"
# Define a function to convert blank value to NA
na_conv <- function(x){
  return(mutate_all(x,~na_if(.,"")))
}

Let’s try to convert training data

# Original data
for(col in names(titan_train)){
  print(paste("Missing value of",
              col,
              sum(is.na(titan_train[[col]]))),sep=" ")
}
## [1] "Missing value of PassengerId 0"
## [1] "Missing value of Survived 0"
## [1] "Missing value of Pclass 0"
## [1] "Missing value of Name 0"
## [1] "Missing value of Sex 0"
## [1] "Missing value of Age 177"
## [1] "Missing value of SibSp 0"
## [1] "Missing value of Parch 0"
## [1] "Missing value of Ticket 0"
## [1] "Missing value of Fare 0"
## [1] "Missing value of Cabin 0"
## [1] "Missing value of Embarked 0"
## [1] "Missing value of titlePass 0"
# Convert blank value to NA
titan_train <- na_conv(titan_train)

# Re-check the training data after the conversion
for(col in names(titan_train)){
  print(paste("Missing value of",
              col,
              sum(is.na(titan_train[[col]]))),sep=" ")
}
## [1] "Missing value of PassengerId 0"
## [1] "Missing value of Survived 0"
## [1] "Missing value of Pclass 0"
## [1] "Missing value of Name 0"
## [1] "Missing value of Sex 0"
## [1] "Missing value of Age 177"
## [1] "Missing value of SibSp 0"
## [1] "Missing value of Parch 0"
## [1] "Missing value of Ticket 0"
## [1] "Missing value of Fare 0"
## [1] "Missing value of Cabin 687"
## [1] "Missing value of Embarked 2"
## [1] "Missing value of titlePass 0"
# Convert titan_test missing value to NA
titan_test <- na_conv(titan_test)
head(titan_test$Cabin)
## [1] NA NA NA NA NA NA
summary(titan_test$Cabin)
##    Length     Class      Mode 
##       418 character character
print(paste("Missing value of Cabin in training:",
            sum(is.na(titan_test$Cabin)),"(",
            round((sum(is.na(titan_test$Cabin))/length(titan_test$Cabin))*100,2),"%)",sep=""))
## [1] "Missing value of Cabin in training:327(78.23%)"

In short, this feature is close to useless. Such a huge percentage of missing values cannot really be fixed: any imputation technique would only produce estimates at best, and estimates covering such a large proportion of the data cannot be trusted as a valid predictor. Nevertheless, I will try to conduct some analysis using Pclass and Fare.

To be fair, this feature does have potential. Its letter prefix indicates the deck where the passenger stayed, and the Titanic’s limited lifeboats were located on particular decks, so the deck could turn out to have a significant effect on survivability.
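
If one did want to pursue that idea, extracting the deck letter from the Cabin prefix is straightforward (a sketch, not used further in this analysis; the column name deck is illustrative):

# Sketch: extract the deck letter (first character) of Cabin; NA stays NA
titan_train %>%
  mutate(deck = str_extract(Cabin, "^[A-Za-z]")) %>%
  count(deck)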

Embarked

Embarked indicates the port from which the passenger departed. It is encoded with an abbreviation of each city: “S” for Southampton, “C” for Cherbourg, and “Q” for Queenstown. Here we will analyze it as a factor.

titan_train$Embarked <- as.factor(titan_train$Embarked)
head(titan_train$Embarked)
## [1] S C S S S Q
## Levels: C Q S
summary(titan_train$Embarked)
##    C    Q    S NA's 
##  168   77  644    2
print(paste("Missing value of Embarked in training data;",
            sum(is.na(titan_train$Embarked))),sep=" ")
## [1] "Missing value of Embarked in training data; 2"
titan_train %>% ggplot(aes(x=Embarked))+
  geom_bar(fill=c("brown","cyan1","darksalmon","azure2"))+
  ggtitle("Embarked Distribution in Training Data")+
  geom_text(aes(label=..count..),stat="count",vjust=0.5)

For test data set,

titan_test$Embarked <- as.factor(titan_test$Embarked)
head(titan_test$Embarked)
## [1] Q S Q S S S
## Levels: C Q S
summary(titan_test$Embarked)
##   C   Q   S 
## 102  46 270
print(paste("Missing value of Embarked in training data;",
            sum(is.na(titan_test$Embarked))),sep=" ")
## [1] "Missing value of Embarked in training data; 0"
titan_test %>% ggplot(aes(x=Embarked))+
  geom_bar(fill=c("brown","cyan1","darksalmon"))+
  ggtitle("Embarked Distribution in Test Data")+
  geom_text(aes(label=..count..),stat="count",vjust=0.5)

In terms of survival:

titan_train %>% ggplot(aes(x=Embarked,fill=Survived))+
  geom_bar()+
  ggtitle("Survivability by Embarked Port in Traning Data")+
  geom_text(aes(label=..count..),
            stat="count",position=position_stack())

The most frequent port is Southampton, which is understandable since Southampton, England, was the port from which the Titanic set sail on April 10, 1912. Ironically, as shown in the plot, it is also the port from which most of the passengers who died had departed.

There are 2 missing values in the training data and 0 in the test data. The distributions are similar.

Pre Processing Data

Split data set

As best practice, and to prevent data leakage and in-sample bias, let us split the training data into train and validation sets before we proceed to cleaning and pre-processing. From here on we will work with three separate data sets: training, validation, and testing.

# Load library caTools
#install.packages("caTools")
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
# Split the training data into train and validation sets with an 80:20 ratio
set.seed(123)
# Note: caTools::sample.split is usually given the outcome vector
# (e.g. titan_train$Survived) so that the split is stratified by class
idx <- sample.split(titan_train,SplitRatio=0.8)
train_X <- titan_train[idx,]
valid_X <- titan_train[!idx,]

dim(train_X) # 686 observations in train data set
## [1] 686  13
dim(valid_X) # 205 observations in valid data set
## [1] 205  13

Missing Value

Let’s recalculate how many missing values there are in each feature of the data sets.

for(col in names(train_X)){
  print(paste("Missing value in",col,":",sum(is.na(train_X[[col]]))),sep=" ")
}
## [1] "Missing value in PassengerId : 0"
## [1] "Missing value in Survived : 0"
## [1] "Missing value in Pclass : 0"
## [1] "Missing value in Name : 0"
## [1] "Missing value in Sex : 0"
## [1] "Missing value in Age : 131"
## [1] "Missing value in SibSp : 0"
## [1] "Missing value in Parch : 0"
## [1] "Missing value in Ticket : 0"
## [1] "Missing value in Fare : 0"
## [1] "Missing value in Cabin : 526"
## [1] "Missing value in Embarked : 1"
## [1] "Missing value in titlePass : 0"
for(col in names(valid_X)){
  print(paste("Missing value in",col,":",sum(is.na(valid_X[[col]]))),sep=" ")
}
## [1] "Missing value in PassengerId : 0"
## [1] "Missing value in Survived : 0"
## [1] "Missing value in Pclass : 0"
## [1] "Missing value in Name : 0"
## [1] "Missing value in Sex : 0"
## [1] "Missing value in Age : 46"
## [1] "Missing value in SibSp : 0"
## [1] "Missing value in Parch : 0"
## [1] "Missing value in Ticket : 0"
## [1] "Missing value in Fare : 0"
## [1] "Missing value in Cabin : 161"
## [1] "Missing value in Embarked : 1"
## [1] "Missing value in titlePass : 0"
for(col in names(titan_test)){
  print(paste("Missing value in",col,":",sum(is.na(titan_test[[col]]))),sep=" ")
}
## [1] "Missing value in PassengerId : 0"
## [1] "Missing value in Pclass : 0"
## [1] "Missing value in Name : 0"
## [1] "Missing value in Sex : 0"
## [1] "Missing value in Age : 86"
## [1] "Missing value in SibSp : 0"
## [1] "Missing value in Parch : 0"
## [1] "Missing value in Ticket : 0"
## [1] "Missing value in Fare : 1"
## [1] "Missing value in Cabin : 327"
## [1] "Missing value in Embarked : 0"
## [1] "Missing value in titlePass : 0"

For train_X and valid_X, the features with missing values are Age, Cabin, and Embarked. The technique of replacing missing values is called imputation, and there are many kinds: mean imputation, median imputation, mode imputation, regression imputation, backward fill, forward fill, and so on. Here I will explain my imputation strategy for each feature.
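
For comparison only, the simplest of those techniques, median imputation, would look like the sketch below (this is not the strategy used here; age_median is an illustrative name):

# Baseline sketch (not applied): median imputation of Age in train_X
age_median <- median(train_X$Age, na.rm = TRUE)
train_X %>%
  mutate(Age = if_else(is.na(Age), age_median, Age)) %>%
  summarise(remaining_missing_age = sum(is.na(Age)))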

Age Imputation

To impute the Age feature, I make use of information already available in the data set. We already have the ‘titlePass’ feature, and we know its distribution. For context, a title is not just a random honorific; it follows rules based on culture and etiquette. “Mister” (“Mr”) is the title for a male who has reached adulthood, i.e. over 18. “Master”, on the contrary, is the title for a male under 18. “Miss” is used for an underage and/or unmarried woman (roughly under 30), “Mrs” for a married woman, and “Ms” is a marriage-neutral title, used when we are not sure about a person’s marital status.

# Check titlePass
unique(train_X$titlePass)
##  [1] "Mr"       "Mrs"      "Miss"     "Master"   "Rev"      "Dr"      
##  [7] "Mme"      "Ms"       "Major"    "Sir"      "Col"      "Mlle"    
## [13] "Countess" "Jonkheer"

There are other titles such as Mme, Don, Rev, Lady, Sir, Mlle, Col, Major, Capt, Countess, and Jonkheer.

Mme (Madame) is a French title for a woman in general. Don is a title used in Spain and Hispanic America for the master of a household. Rev is a title for a member of the Christian clergy or an official worker of the church. Mlle is a French title specifically for a young, unmarried woman. Lady and Countess are noble titles for women, usually already married to another noble. Jonkheer is a noble title from the Dutch rank system.

From the preceding explanation, it makes sense to try to impute Age using its relationship with titlePass, but I found that title alone is not enough. Say I can estimate a passenger’s age from the title, e.g. Miss would be a young unmarried female and Mrs a married female; what about Ms? How can I deduce an age from such a marriage-neutral title?

For this, I decided to analyze the passengers’ surnames. In Western culture, especially in Western Europe, women tend to take their husband’s family name after marriage; in fact, the majority of people still follow this historical practice. According to a 2016 survey cited by the BBC (2020), around 85% of those aged between 18 and 30 in the US say they follow the practice, and for British women the figure is almost 90%. The same source notes that these figures are lower than they were a generation ago. Based on this, I decided that surname is a reasonable variable to use for estimating and imputing Age.

# First, we need to mutate  surname feature
head(train_X$Name)
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Moran, Mr. James"                                   
## [6] "McCarthy, Mr. Timothy J"
# Name formatting is :
# LastName,  Title (.) FirstName MiddleName
# The maiden name is written inside a bracket
train_X <- train_X %>% mutate(surname = str_extract(Name,".+(?=,.+)"))
head(train_X)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 6           6        0      3
## 7           7        0      1
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 6                                    Moran, Mr. James   male  NA     0     0
## 7                             McCarthy, Mr. Timothy J   male  54     0     0
##             Ticket    Fare Cabin Embarked titlePass   surname
## 1        A/5 21171  7.2500  <NA>        S        Mr    Braund
## 2         PC 17599 71.2833   C85        C       Mrs   Cumings
## 3 STON/O2. 3101282  7.9250  <NA>        S      Miss Heikkinen
## 4           113803 53.1000  C123        S       Mrs  Futrelle
## 6           330877  8.4583  <NA>        Q        Mr     Moran
## 7            17463 51.8625   E46        S        Mr  McCarthy

Before any further analysis, let’s check whether our title feature contains mistakes.

train_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n())
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 3
## # Groups:   titlePass [14]
##    titlePass Sex    count
##    <chr>     <fct>  <int>
##  1 Col       male       1
##  2 Countess  female     1
##  3 Dr        female     1
##  4 Dr        male       5
##  5 Jonkheer  male       1
##  6 Major     male       1
##  7 Master    male      31
##  8 Miss      female   131
##  9 Mlle      female     1
## 10 Mme       female     1
## 11 Mr        male     407
## 12 Mrs       female    98
## 13 Ms        female     1
## 14 Rev       male       5
## 15 Sir       male       1

The table above confirms that the titles do indeed divide passengers by sex; nothing is misplaced here. Now let’s check whether the title feature also divides passengers by age.

train_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 6
## # Groups:   titlePass [14]
##    titlePass Sex    count median   min   max
##    <chr>     <fct>  <int>  <dbl> <dbl> <dbl>
##  1 Col       male       1     60 60       60
##  2 Countess  female     1     33 33       33
##  3 Dr        female     1     49 49       49
##  4 Dr        male       5     38 23       54
##  5 Jonkheer  male       1     38 38       38
##  6 Major     male       1     45 45       45
##  7 Master    male      31      4  0.67    12
##  8 Miss      female   131     22  0.75    63
##  9 Mlle      female     1     24 24       24
## 10 Mme       female     1     24 24       24
## 11 Mr        male     407     29 11       80
## 12 Mrs       female    98     36 14       63
## 13 Ms        female     1     28 28       28
## 14 Rev       male       5     42 27       57
## 15 Sir       male       1     49 49       49

Now we do seem to have found some misplaced data.

First of all, look at the summary for the title “Miss”. As noted, “Miss” is a Western title for a young unmarried woman (under 30), but the data shows ages for passengers titled “Miss” reaching up to 63; culturally, this does not quite make sense. Even granting that one or two women were still unmarried at that age, the more appropriate title would be “Ms”.

Second, it appears that several underage male passengers were given the title “Mr”, which indicates a misplacement.

Lastly, for the title “Mrs”, the minimum age is 14. Judged by modern social values and law this would be unlikely and potentially a misplacement, but we are talking about the early 20th century here, so I decided to dig deeper before making any decision.

# Check missplace for passangers with title "Miss"
quantile(train_X[which(train_X$titlePass=="Miss"),"Age"],c(0.25,0.75),na.rm=TRUE)
## 25% 75% 
##  16  30

The quantiles for “Miss” fit the explanation of the title: 75% of the women titled “Miss” are 30 or younger.

# get the index of passangers with title Miss over 30 years old
over30_Miss <- which(train_X$titlePass=="Miss" & train_X$Age>30)
train_X[c(over30_Miss),"titlePass"] <- "Ms"
train_X[c(over30_Miss),]
##     PassengerId Survived Pclass                              Name    Sex  Age
## 12           12        1      1          Bonnell, Miss. Elizabeth female 58.0
## 62           62        1      1               Icard, Miss. Amelie female 38.0
## 124         124        1      2               Webber, Miss. Susan female 32.5
## 178         178        0      1        Isham, Miss. Ann Elizabeth female 50.0
## 196         196        1      1              Lurette, Miss. Elise female 58.0
## 212         212        1      2        Cameron, Miss. Clear Annie female 35.0
## 259         259        1      1                  Ward, Miss. Anna female 35.0
## 270         270        1      1            Bissette, Miss. Amelia female 35.0
## 276         276        1      1 Andrews, Miss. Kornelia Theodosia female 63.0
## 277         277        0      3 Lindblom, Miss. Augusta Charlotta female 45.0
## 319         319        1      1          Wick, Miss. Mary Natalie female 31.0
## 326         326        1      1          Young, Miss. Marie Grice female 36.0
## 338         338        1      1   Burns, Miss. Elizabeth Margaret female 41.0
## 347         347        1      2         Smith, Miss. Marion Elsie female 40.0
## 358         358        0      2         Funk, Miss. Annie Clemmer female 38.0
## 381         381        1      1             Bidois, Miss. Rosalie female 42.0
## 397         397        0      3               Olsson, Miss. Elina female 31.0
## 413         413        1      1            Minahan, Miss. Daisy E female 33.0
## 459         459        1      2               Toomey, Miss. Ellen female 50.0
## 497         497        1      1    Eustis, Miss. Elizabeth Mussey female 54.0
## 504         504        0      3    Laitinen, Miss. Kristina Sofia female 37.0
## 527         527        1      2              Ridsdale, Miss. Lucy female 50.0
## 610         610        1      1         Shutes, Miss. Elizabeth W female 40.0
## 717         717        1      1     Endres, Miss. Caroline Louise female 38.0
## 768         768        0      3                Mangan, Miss. Mary female 30.5
## 836         836        1      1       Compton, Miss. Sara Rebecca female 39.0
##     SibSp Parch       Ticket     Fare Cabin Embarked titlePass  surname
## 12      0     0       113783  26.5500  C103        S        Ms  Bonnell
## 62      0     0       113572  80.0000   B28     <NA>        Ms    Icard
## 124     0     0        27267  13.0000  E101        S        Ms   Webber
## 178     0     0     PC 17595  28.7125   C49        C        Ms    Isham
## 196     0     0     PC 17569 146.5208   B80        C        Ms  Lurette
## 212     0     0 F.C.C. 13528  21.0000  <NA>        S        Ms  Cameron
## 259     0     0     PC 17755 512.3292  <NA>        C        Ms     Ward
## 270     0     0     PC 17760 135.6333   C99        S        Ms Bissette
## 276     1     0        13502  77.9583    D7        S        Ms  Andrews
## 277     0     0       347073   7.7500  <NA>        S        Ms Lindblom
## 319     0     2        36928 164.8667    C7        S        Ms     Wick
## 326     0     0     PC 17760 135.6333   C32        C        Ms    Young
## 338     0     0        16966 134.5000   E40        C        Ms    Burns
## 347     0     0        31418  13.0000  <NA>        S        Ms    Smith
## 358     0     0       237671  13.0000  <NA>        S        Ms     Funk
## 381     0     0     PC 17757 227.5250  <NA>        C        Ms   Bidois
## 397     0     0       350407   7.8542  <NA>        S        Ms   Olsson
## 413     1     0        19928  90.0000   C78        Q        Ms  Minahan
## 459     0     0 F.C.C. 13531  10.5000  <NA>        S        Ms   Toomey
## 497     1     0        36947  78.2667   D20        C        Ms   Eustis
## 504     0     0         4135   9.5875  <NA>        S        Ms Laitinen
## 527     0     0  W./C. 14258  10.5000  <NA>        S        Ms Ridsdale
## 610     0     0     PC 17582 153.4625  C125        S        Ms   Shutes
## 717     0     0     PC 17757 227.5250   C45        C        Ms   Endres
## 768     0     0       364850   7.7500  <NA>        Q        Ms   Mangan
## 836     1     1     PC 17756  83.1583   E49        C        Ms  Compton

Proceed to Mr,

quantile(train_X[which(train_X$titlePass=="Mr"),"Age"],c(0.25,0.75),na.rm=TRUE)
## 25% 75% 
##  23  39

The quantiles for the “Mr” title show that the middle 50% of passengers titled “Mr” are aged roughly 23 to 39. Going back to the definition stated before, “Mr” is a title for a man who has already reached adulthood, so our only concern is the underage male passengers titled “Mr”.

# Filter underage male passangers with title "Mr"
under18_Mr <- which(train_X$titlePass=="Mr" & train_X$Age<18)
train_X[c(under18_Mr),"titlePass"] <- "Master"
train_X[c(under18_Mr),]
##     PassengerId Survived Pclass                            Name  Sex Age SibSp
## 87           87        0      3          Ford, Mr. William Neal male  16     1
## 139         139        0      3             Osen, Mr. Olaf Elon male  16     0
## 221         221        1      3  Sunderland, Mr. Victor Francis male  16     0
## 267         267        0      3       Panula, Mr. Ernesti Arvid male  16     4
## 283         283        0      3       de Pelsmaeker, Mr. Alfons male  16     0
## 334         334        0      3 Vander Planke, Mr. Leo Edmondus male  16     2
## 353         353        0      3              Elias, Mr. Tannous male  15     1
## 501         501        0      3                Calic, Mr. Petar male  17     0
## 533         533        0      3            Elias, Mr. Joseph Jr male  17     1
## 575         575        0      3    Rush, Mr. Alfred George John male  16     0
## 722         722        0      3       Jensen, Mr. Svend Lauritz male  17     1
## 732         732        0      3        Hassan, Mr. Houssein G N male  11     0
## 747         747        0      3     Abbott, Mr. Rossmore Edward male  16     1
## 792         792        0      2             Gaskell, Mr. Alfred male  16     0
## 842         842        0      2        Mudd, Mr. Thomas Charles male  16     0
## 845         845        0      3             Culumovic, Mr. Jeso male  17     0
##     Parch          Ticket    Fare Cabin Embarked titlePass       surname
## 87      3      W./C. 6608 34.3750  <NA>        S    Master          Ford
## 139     0            7534  9.2167  <NA>        S    Master          Osen
## 221     0 SOTON/OQ 392089  8.0500  <NA>        S    Master    Sunderland
## 267     1         3101295 39.6875  <NA>        S    Master        Panula
## 283     0          345778  9.5000  <NA>        S    Master de Pelsmaeker
## 334     0          345764 18.0000  <NA>        S    Master Vander Planke
## 353     1            2695  7.2292  <NA>        C    Master         Elias
## 501     0          315086  8.6625  <NA>        S    Master         Calic
## 533     1            2690  7.2292  <NA>        C    Master         Elias
## 575     0      A/4. 20589  8.0500  <NA>        S    Master          Rush
## 722     0          350048  7.0542  <NA>        S    Master        Jensen
## 732     0            2699 18.7875  <NA>        C    Master        Hassan
## 747     1       C.A. 2673 20.2500  <NA>        S    Master        Abbott
## 792     0          239865 26.0000  <NA>        S    Master       Gaskell
## 842     0     S.O./P.P. 3 10.5000  <NA>        S    Master          Mudd
## 845     0          315090  8.6625  <NA>        S    Master     Culumovic

Now, let’s deal with title “Mrs”.

under30_Mrs <- which(train_X$titlePass=="Mrs" & train_X$Age < 30)
train_X[c(under30_Mrs),c("Name","Age","Sex")]
##                                                                                   Name
## 9                                    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
## 10                                                 Nasser, Mrs. Nicholas (Adele Achem)
## 42                            Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
## 54                                  Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)
## 67                                                        Nye, Mrs. (Elizabeth Ramell)
## 134                                      Weisz, Mrs. Leopold (Mathilde Francoise Pede)
## 143                               Hakkarainen, Mrs. Pekka Pietari (Elin Matilda Dolck)
## 152                                                  Pears, Mrs. Thomas (Edith Wearne)
## 248                                                    Hamalainen, Mrs. William (Anna)
## 256                                            Touma, Mrs. Darwis (Hanne Youssef Razi)
## 292                                            Bishop, Mrs. Dickinson H (Helen Walton)
## 308 Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)
## 313                                              Lahtinen, Mrs. William (Anna Sylfven)
## 324                                Caldwell, Mrs. Albert Francis (Sylvia Mae Harbaugh)
## 400                                                   Trout, Mrs. William H (Jessie L)
## 438                                              Richards, Mrs. Sidney (Emily Hocking)
## 474                                       Jerwan, Mrs. Amin S (Marie Marthe Thuillard)
## 547                                                  Beane, Mrs. Edward (Ethel Clarke)
## 568                                        Palsson, Mrs. Nils (Alma Cornelia Berglund)
## 601                                Jacobsohn, Mrs. Sidney Samuel (Amy Frances Christy)
## 618                                    Lobb, Mrs. William Arthur (Cordelia K Stanlick)
## 701                                  Astor, Mrs. John Jacob (Madeleine Talmadge Force)
## 782                                          Dick, Mrs. Albert Adrian (Vera Gillespie)
## 831                                            Yasbeck, Mrs. Antoni (Selini Alexander)
## 859                                              Baclini, Mrs. Solomon (Latifa Qurban)
## 875                                              Abelson, Mrs. Samuel (Hannah Wizosky)
## 881                                       Shelley, Mrs. William (Imanita Parrish Hall)
##     Age    Sex
## 9    27 female
## 10   14 female
## 42   27 female
## 54   29 female
## 67   29 female
## 134  29 female
## 143  24 female
## 152  22 female
## 248  24 female
## 256  29 female
## 292  19 female
## 308  17 female
## 313  26 female
## 324  22 female
## 400  28 female
## 438  24 female
## 474  23 female
## 547  19 female
## 568  29 female
## 601  24 female
## 618  26 female
## 701  18 female
## 782  17 female
## 831  15 female
## 859  24 female
## 875  28 female
## 881  25 female

Apparently, all females under 30 with the title “Mrs” were already married, as shown by the maiden name written inside brackets. So there are no misplacements for this title.
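
The same check can be done programmatically by looking for a bracketed maiden name in each record (a sketch; the regex and the column name are illustrative):

# Sketch: do all passengers titled Mrs and under 30 carry a bracketed (maiden) name?
train_X %>%
  filter(titlePass == "Mrs", Age < 30) %>%
  summarise(all_have_maiden_name = all(str_detect(Name, "\\(.+\\)")))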

Let’s re-check the age range by title.

train_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 6
## # Groups:   titlePass [14]
##    titlePass Sex    count median   min   max
##    <chr>     <fct>  <int>  <dbl> <dbl> <dbl>
##  1 Col       male       1     60 60       60
##  2 Countess  female     1     33 33       33
##  3 Dr        female     1     49 49       49
##  4 Dr        male       5     38 23       54
##  5 Jonkheer  male       1     38 38       38
##  6 Major     male       1     45 45       45
##  7 Master    male      47      9  0.67    17
##  8 Miss      female   105     18  0.75    30
##  9 Mlle      female     1     24 24       24
## 10 Mme       female     1     24 24       24
## 11 Mr        male     391     30 18       80
## 12 Mrs       female    98     36 14       63
## 13 Ms        female    27     38 28       63
## 14 Rev       male       5     42 27       57
## 15 Sir       male       1     49 49       49

Now every title has an appropriate age range. I proceed to valid_X and titan_test.

# Mutate surname for valid_X
valid_X <- valid_X %>% mutate(surname=str_extract(Name,".+(?=,.+)"))
str(valid_X)
## 'data.frame':    205 obs. of  14 variables:
##  $ PassengerId: int  5 8 11 18 21 24 31 34 37 44 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 1 2 3 2 ...
##  $ Name       : chr  "Allen, Mr. William Henry" "Palsson, Master. Gosta Leonard" "Sandstrom, Miss. Marguerite Rut" "Williams, Mr. Charles Eugene" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
##  $ Age        : num  35 2 4 NA 35 28 40 66 NA 3 ...
##  $ SibSp      : int  0 3 1 0 0 0 0 0 0 1 ...
##  $ Parch      : int  0 1 1 0 0 0 0 0 0 2 ...
##  $ Ticket     : chr  "373450" "349909" "PP 9549" "244373" ...
##  $ Fare       : num  8.05 21.07 16.7 13 26 ...
##  $ Cabin      : chr  NA NA "G6" NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 1 3 1 1 ...
##  $ titlePass  : chr  "Mr" "Master" "Miss" "Mr" ...
##  $ surname    : chr  "Allen" "Palsson" "Sandstrom" "Williams" ...
# Mutate surname for titan_test
titan_test <- titan_test %>% mutate(surname=str_extract(Name,".+(?=,.+)"))
str(titan_test)
## 'data.frame':    418 obs. of  13 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  NA NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ titlePass  : chr  "Mr" "Mrs" "Mr" "Mr" ...
##  $ surname    : chr  "Kelly" "Wilkes" "Myles" "Wirz" ...

Age range by title in valid_X :

valid_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 12 x 6
## # Groups:   titlePass [12]
##    titlePass Sex    count median   min   max
##    <chr>     <fct>  <int>  <dbl> <dbl> <dbl>
##  1 Capt      male       1     70 70       70
##  2 Col       male       1     56 56       56
##  3 Don       male       1     40 40       40
##  4 Dr        male       1     50 50       50
##  5 Lady      female     1     48 48       48
##  6 Major     male       1     52 52       52
##  7 Master    male       9      3  0.42    11
##  8 Miss      female    51     19  0.75    36
##  9 Mlle      female     1     24 24       24
## 10 Mr        male     110     30 14       66
## 11 Mrs       female    27     32 18       62
## 12 Rev       male       1     51 51       51

Each gender is matched properly with its title, but there are some indications of misplacement, such as the age ranges for “Miss” and “Mr”.

# Let's start with "Miss"
over30_Miss_val <- which(valid_X$titlePass=="Miss" & valid_X$Age>=30)
valid_X[c(over30_Miss_val),"titlePass"] <- "Ms"
valid_X[c(over30_Miss_val),c("Name","Age","titlePass")]
##                               Name Age titlePass
## 216        Newell, Miss. Madeleine  31        Ms
## 219          Bazzani, Miss. Albina  32        Ms
## 258           Cherry, Miss. Gladys  30        Ms
## 310 Francatelli, Miss. Laura Mabel  30        Ms
## 323      Slayter, Miss. Hilda Mary  30        Ms
## 388               Buss, Miss. Kate  36        Ms
## 538            LeRoy, Miss. Bertha  30        Ms
## 541        Crosby, Miss. Harriet R  36        Ms
## 577           Garside, Miss. Ethel  34        Ms
## 843        Serepeca, Miss. Augusta  30        Ms

And then for “Mr”:

under18_Mr_val <- which(valid_X$titlePass=="Mr" & valid_X$Age<18)
valid_X[c(under18_Mr_val),"titlePass"] <- "Master"
valid_X[c(under18_Mr_val),c("Name","Age","titlePass")]
##                            Name Age titlePass
## 164             Calic, Mr. Jovo  17    Master
## 434  Kallio, Mr. Nikolai Erland  17    Master
## 551 Thayer, Mr. John Borland Jr  17    Master
## 684 Goodwin, Mr. Charles Edward  14    Master
## 687    Panula, Mr. Jaako Arnold  14    Master
## 765      Eklund, Mr. Hans Linus  16    Master

Recheck for valid_X

valid_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 13 x 6
## # Groups:   titlePass [13]
##    titlePass Sex    count median   min   max
##    <chr>     <fct>  <int>  <dbl> <dbl> <dbl>
##  1 Capt      male       1   70   70       70
##  2 Col       male       1   56   56       56
##  3 Don       male       1   40   40       40
##  4 Dr        male       1   50   50       50
##  5 Lady      female     1   48   48       48
##  6 Major     male       1   52   52       52
##  7 Master    male      15   11    0.42    17
##  8 Miss      female    41   15    0.75    26
##  9 Mlle      female     1   24   24       24
## 10 Mr        male     104   31.5 18       66
## 11 Mrs       female    27   32   18       62
## 12 Ms        female    10   30.5 30       36
## 13 Rev       male       1   51   51       51

Now, for titan_test

titan_test %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## Warning in min(Age, na.rm = TRUE): no non-missing arguments to min; returning
## Inf
## Warning in max(Age, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 6
## # Groups:   titlePass [9]
##   titlePass Sex    count median    min    max
##   <chr>     <chr>  <int>  <dbl>  <dbl>  <dbl>
## 1 Col       male       2   50    47      53  
## 2 Dona      female     1   39    39      39  
## 3 Dr        male       1   53    53      53  
## 4 Master    male      21    7     0.33   14.5
## 5 Miss      female    78   22     0.17   45  
## 6 Mr        male     240   28.5  14      67  
## 7 Mrs       female    72   36.5  16      76  
## 8 Ms        female     1   NA   Inf    -Inf  
## 9 Rev       male       2   35.5  30      41

The suspicious titles in titan_test are Mr, Miss, and Mrs. We start with Mr.

under18_Mr_test <-which(titan_test$titlePass=="Mr" & titan_test$Age<18)
titan_test[c(under18_Mr_test),"titlePass"] <- "Master"
titan_test[c(under18_Mr_test),c("Age","titlePass")]
##     Age titlePass
## 6    14    Master
## 61   17    Master
## 188  17    Master
## 213  17    Master
## 231  14    Master
## 270  17    Master
## 404  17    Master

For “Miss” in titan_test:

quantile(titan_test[which(titan_test$titlePass=="Miss"),"Age"],c(0.25,0.75),na.rm = TRUE)
##   25%   75% 
## 17.75 29.25
over30_Miss_test <- which(titan_test$titlePass=="Miss" & titan_test$Age >=30)
# Convert the title of passengers over 30 to "Ms"
titan_test[c(over30_Miss_test),"titlePass"] <- "Ms"
titan_test[c(over30_Miss_test),c("Name","titlePass","Age")]
##                                       Name titlePass Age
## 7                     Connolly, Miss. Kate        Ms  30
## 60             Chaudanson, Miss. Victorine        Ms  36
## 75                    Geiger, Miss. Amalie        Ms  35
## 113               Evans, Miss. Edith Corse        Ms  36
## 142                   Daniels, Miss. Sarah        Ms  33
## 207               McGowan, Miss. Katherine        Ms  35
## 209          Rosenbaum, Miss. Edith Louise        Ms  33
## 215 Andersson, Miss. Ida Augusta Margareta        Ms  38
## 221         Duran y More, Miss. Florentina        Ms  30
## 292 Daly, Miss. Margaret Marcella Maggie""        Ms  30
## 314                    Carr, Miss. Jeannie        Ms  37
## 325                 Kreuchen, Miss. Emilie        Ms  39
## 350                 Walcroft, Miss. Nellie        Ms  31
## 372              Wilson, Miss. Helen Alice        Ms  31
## 376               Bowen, Miss. Grace Scott        Ms  45
## 401                Bonnell, Miss. Caroline        Ms  30

Finally, we will deal with “Mrs” in titan_test

quantile(titan_test[which(titan_test$titlePass=="Mrs"),"Age"],c(0.25,0.75),na.rm = TRUE)
##  25%  75% 
## 26.0 49.5
# Check if all "Mrs"  already married
titan_test[which(titan_test$titlePass=="Mrs"),"Name"] 
##  [1] "Wilkes, Mrs. James (Ellen Needs)"                               
##  [2] "Hirvonen, Mrs. Alexander (Helga E Lindqvist)"                   
##  [3] "Abrahim, Mrs. Joseph (Sophie Halaut Easu)"                      
##  [4] "Snyder, Mrs. John Pillsbury (Nelle Stevenson)"                  
##  [5] "Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)"        
##  [6] "del Carlo, Mrs. Sebastiano (Argenia Genovesi)"                  
##  [7] "Assaf Khalil, Mrs. Mariana (Miriam\")\""                        
##  [8] "Flegenheim, Mrs. Alfred (Antoinette)"                           
##  [9] "Ryerson, Mrs. Arthur Larned (Emily Maria Borie)"                
## [10] "Dean, Mrs. Bertram (Eva Georgetta Light)"                       
## [11] "Johnston, Mrs. Andrew G (Elizabeth Lily\" Watson)\""            
## [12] "Corbett, Mrs. Walter H (Irene Colvin)"                          
## [13] "Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)"               
## [14] "Bucknell, Mrs. William Robert (Emma Eliza Ward)"                
## [15] "Coutts, Mrs. William (Winnie Minnie\" Treanor)\""               
## [16] "Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)"            
## [17] "Fortune, Mrs. Mark (Mary McDougald)"                            
## [18] "Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)"           
## [19] "Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)"    
## [20] "Davidson, Mrs. Thornton (Orian Hays)"                           
## [21] "Cavendish, Mrs. Tyrell William (Julia Florence Siegel)"         
## [22] "Stengel, Mrs. Charles Emil Henry (Annie May Morris)"            
## [23] "Thomas, Mrs. Alexander (Thamine Thelma\")\""                    
## [24] "Straus, Mrs. Isidor (Rosalie Ida Blun)"                         
## [25] "Chapman, Mrs. John Henry (Sara Elizabeth Lawry)"                
## [26] "Schabert, Mrs. Paul (Emma Mock)"                                
## [27] "Lefebre, Mrs. Frank (Frances)"                                  
## [28] "Earnshaw, Mrs. Boulton (Olive Potter)"                          
## [29] "Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)"                 
## [30] "Peacock, Mrs. Benjamin (Edith Nile)"                            
## [31] "Kink-Heilmann, Mrs. Anton (Luise Heilmann)"                     
## [32] "Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)"    
## [33] "Becker, Mrs. Allen Oliver (Nellie E Baumgardner)"               
## [34] "Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)"          
## [35] "Marvin, Mrs. Daniel Warner (Mary Graham Carmichael Farquarson)" 
## [36] "Douglas, Mrs. Frederick Charles (Mary Helene Baxter)"           
## [37] "Rasmussen, Mrs. (Lena Jacobsen Solvang)"                        
## [38] "Howard, Mrs. Benjamin (Ellen Truelove Arman)"                   
## [39] "Widener, Mrs. George Dunton (Eleanor Elkins)"                   
## [40] "Cook, Mrs. (Selena Rogers)"                                     
## [41] "Candee, Mrs. Edward (Helen Churchill Hungerford)"               
## [42] "Moubarek, Mrs. George (Omine Amenia\" Alexander)\""             
## [43] "Douglas, Mrs. Walter Donald (Mahala Dutton)"                    
## [44] "Lindstrom, Mrs. Carl Johan (Sigrid Posse)"                      
## [45] "Christy, Mrs. (Alice Frances)"                                  
## [46] "Karnes, Mrs. J Frank (Claire Bennett)"                          
## [47] "Hold, Mrs. Stephen (Annie Margaret Hill)"                       
## [48] "Khalil, Mrs. Betros (Zahie Maria\" Elias)\""                    
## [49] "Wells, Mrs. Arthur Henry (Addie\" Dart Trevaskis)\""            
## [50] "Clark, Mrs. Walter Miller (Virginia McDowell)"                  
## [51] "Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)"     
## [52] "Hansen, Mrs. Claus Peter (Jennie L Howard)"                     
## [53] "White, Mrs. John Stuart (Ella Holmes)"                          
## [54] "Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) "         
## [55] "Nakid, Mrs. Said (Waika Mary\" Mowad)\""                        
## [56] "Cardeza, Mrs. James Warburton Martinez (Charlotte Wardle Drake)"
## [57] "Whabee, Mrs. George Joseph (Shawneene Abi-Saab)"                
## [58] "Greenfield, Mrs. Leo David (Blanche Strouse)"                   
## [59] "Brown, Mrs. John Murray (Caroline Lane Lamson)"                 
## [60] "Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)"            
## [61] "Mallet, Mrs. Albert (Antoinette Magnin)"                        
## [62] "Ware, Mrs. John James (Florence Louise Long)"                   
## [63] "Harder, Mrs. George Achilles (Dorothy Annan)"                   
## [64] "Sage, Mrs. John (Annie Bullen)"                                 
## [65] "Gibson, Mrs. Leonard (Pauline C Boeson)"                        
## [66] "Dodge, Mrs. Washington (Ruth Vidaver)"                          
## [67] "Risien, Mrs. Samuel (Emma)"                                     
## [68] "McNamee, Mrs. Neal (Eileen O'Leary)"                            
## [69] "Lines, Mrs. Ernest H (Elizabeth Lindsey James)"                 
## [70] "Smith, Mrs. Lucien Philip (Mary Eloise Hughes)"                 
## [71] "Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)"
## [72] "Minahan, Mrs. William Edward (Lillian E Thorpe)"

Based on this data, all passengers with the title “Mrs” are already married, so there is no misplacement here.

Let’s recheck the age range by title in titan_test.

titan_test %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 6
## # Groups:   titlePass [9]
##   titlePass Sex    count median   min   max
##   <chr>     <chr>  <int>  <dbl> <dbl> <dbl>
## 1 Col       male       2   50   47       53
## 2 Dona      female     1   39   39       39
## 3 Dr        male       1   53   53       53
## 4 Master    male      28   10.8  0.33    17
## 5 Miss      female    62   20    0.17    29
## 6 Mr        male     233   29   18       67
## 7 Mrs       female    72   36.5 16       76
## 8 Ms        female    17   34   30       45
## 9 Rev       male       2   35.5 30       41

After fixing the age ranges in all data sets, we begin the imputation process for Age.

# get the index which Age feature is missing
miss_age_train <- which(is.na(train_X$Age))
train_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 6
## # Groups:   titlePass [14]
##    titlePass Sex    count median   min   max
##    <chr>     <fct>  <int>  <dbl> <dbl> <dbl>
##  1 Col       male       1     60 60       60
##  2 Countess  female     1     33 33       33
##  3 Dr        female     1     49 49       49
##  4 Dr        male       5     38 23       54
##  5 Jonkheer  male       1     38 38       38
##  6 Major     male       1     45 45       45
##  7 Master    male      47      9  0.67    17
##  8 Miss      female   105     18  0.75    30
##  9 Mlle      female     1     24 24       24
## 10 Mme       female     1     24 24       24
## 11 Mr        male     391     30 18       80
## 12 Mrs       female    98     36 14       63
## 13 Ms        female    27     38 28       63
## 14 Rev       male       5     42 27       57
## 15 Sir       male       1     49 49       49
unique(train_X[c(miss_age_train),"titlePass"])
## [1] "Mr"     "Mrs"    "Miss"   "Master" "Dr"
# Titles with missing Age in train_X: "Mr", "Miss", "Mrs", "Master", "Dr"
# We use median imputation; it is simple and keeps each group's centre unchanged, although it adds no randomness
# If "Mr", impute age == 30
# If "Miss", impute age == 18
# If "Mrs", impute age == 36
# If "Master", impute age == 9
# If "Dr" and "male", impute age == 38
# If "Dr" and "female", impute age == 49
train_X <- train_X %>% 
  mutate(Age2 = case_when(is.na(Age)==FALSE~Age,
                          is.na(Age) & titlePass=="Mr"~30,
                          is.na(Age) & titlePass=="Miss"~18,
                          is.na(Age) & titlePass=="Mrs"~36,
                          is.na(Age) & titlePass=="Master"~9,
                          is.na(Age) & titlePass=="Dr" &
                            Sex=="male"~38,
                          is.na(Age) & titlePass=="Dr" &
                            Sex=="female"~49))
# Check missing value of Age2 in train_X
sum(is.na(train_X$Age2))
## [1] 0

Apply the same steps to valid_X and titan_test.

# get the index which Age feature is missing
miss_age_train_val <- which(is.na(valid_X$Age))
valid_X %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 13 x 6
## # Groups:   titlePass [13]
##    titlePass Sex    count median   min   max
##    <chr>     <fct>  <int>  <dbl> <dbl> <dbl>
##  1 Capt      male       1   70   70       70
##  2 Col       male       1   56   56       56
##  3 Don       male       1   40   40       40
##  4 Dr        male       1   50   50       50
##  5 Lady      female     1   48   48       48
##  6 Major     male       1   52   52       52
##  7 Master    male      15   11    0.42    17
##  8 Miss      female    41   15    0.75    26
##  9 Mlle      female     1   24   24       24
## 10 Mr        male     104   31.5 18       66
## 11 Mrs       female    27   32   18       62
## 12 Ms        female    10   30.5 30       36
## 13 Rev       male       1   51   51       51
unique(valid_X[c(miss_age_train_val),"titlePass"])
## [1] "Mr"     "Miss"   "Mrs"    "Master"
# Titles with missing Age in valid_X: "Mr", "Miss", "Mrs", "Master"
# We apply the same median imputation, using the per-title medians of valid_X
# If "Mr", impute age == 31.5
# If "Miss", impute age == 15.0
# If "Mrs", impute age == 32
# If "Master", impute age == 11.0

Imputation of Age in valid_X

# If "Mr", impute age == 31.5
# If "Miss", impute age == 15.0
# If "Mrs", impute age == 32
# If "Master", impute age == 11.0
valid_X <- valid_X %>% 
  mutate(Age2 = case_when(is.na(Age)==FALSE~Age,
                          is.na(Age) & titlePass=="Mr"~31.5,
                          is.na(Age) & titlePass=="Miss"~15.0,
                          is.na(Age) & titlePass=="Mrs"~32,
                          is.na(Age) & titlePass=="Master"~11.0))

#check the missing values of Age2 in valid_X
sum(is.na(valid_X$Age2))
## [1] 0

Imputation of Age in titan_test

miss_age_test <- which(is.na(titan_test$Age))
titan_test %>% group_by(titlePass,Sex) %>% 
  summarise(count=n(),
            median = median(Age,na.rm=TRUE),
            min=min(Age,na.rm=TRUE),
            max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 6
## # Groups:   titlePass [9]
##   titlePass Sex    count median   min   max
##   <chr>     <chr>  <int>  <dbl> <dbl> <dbl>
## 1 Col       male       2   50   47       53
## 2 Dona      female     1   39   39       39
## 3 Dr        male       1   53   53       53
## 4 Master    male      28   10.8  0.33    17
## 5 Miss      female    62   20    0.17    29
## 6 Mr        male     233   29   18       67
## 7 Mrs       female    72   36.5 16       76
## 8 Ms        female    17   34   30       45
## 9 Rev       male       2   35.5 30       41
unique(titan_test[c(miss_age_test),"titlePass"])
## [1] "Mr"     "Mrs"    "Miss"   "Ms"     "Master"
# if the title is "Mr", impute age == 29.00
# if the title is "Mrs", impute age ==36.50
# if the title is "Miss", impute age ==20.00
# if the title is "Ms",  impute age == 34.00
# if the title is "Master", impute age == 10.75
# if the title is "Mr", impute age == 29.00
# if the title is "Mrs", impute age ==36.50
# if the title is "Miss", impute age ==20.00
# if the title is "Ms",  impute age == 34.00
# if the title is "Master", impute age == 10.75
titan_test <- titan_test %>% 
  mutate(Age2 = case_when(is.na(Age)==FALSE~Age,
                          is.na(Age) & titlePass=="Mr"~29.00,
                          is.na(Age) & titlePass=="Miss"~20.00,
                          is.na(Age) & titlePass=="Mrs"~36.50,
                          is.na(Age) & titlePass=="Master"~10.75,
                          is.na(Age) & titlePass=="Ms"~34.00))

# Check the missing values of Age2 in titan_test
sum(is.na(titan_test$Age2))
## [1] 0

Now we have the Age2 feature, a complete version of the Age feature.
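
As a side note, the hard-coded medians above could also be derived and applied programmatically. Below is a minimal sketch using a per-title median lookup joined back onto train_X; it ignores the Sex split used for “Dr” above and is not the route taken in this project.

# Hypothetical alternative: impute Age from a per-title median lookup table
title_medians <- train_X %>%
  group_by(titlePass) %>%
  summarise(med_age = median(Age, na.rm = TRUE), .groups = "drop")
train_X %>%
  left_join(title_medians, by = "titlePass") %>%
  mutate(Age2_alt = coalesce(Age, med_age)) %>%   # keep observed Age, fill NAs with the title median
  select(Name, titlePass, Age, Age2_alt) %>%
  head()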

Cabin Imputation

Missing values of the Cabin feature in each data set:

paste("Missing value of Cabin in train_X:",round((sum(is.na(train_X$Cabin))/length(train_X$Cabin))*100,2),"%")
## [1] "Missing value of Cabin in train_X: 76.68 %"
paste("Missing value of Cabin in valid_X:",round((sum(is.na(valid_X$Cabin))/length(valid_X$Cabin))*100,2),"%")
## [1] "Missing value of Cabin in valid_X: 78.54 %"
paste("Missing value of Cabin in titan_test:",round((sum(is.na(titan_test$Cabin))/length(titan_test$Cabin))*100,2),"%")
## [1] "Missing value of Cabin in titan_test: 78.23 %"

As stated in the previous section, with such a huge percentage of missing values it is hardly worthwhile to extract information from the Cabin feature. A simple solution is to drop it entirely.

Admittedly, with extensive effort there is some insight we might still extract from Cabin (see the sketch below). But after reviewing previous work on this data set, I decided the benefit is not worth the effort; I may revisit the Cabin feature on a future occasion.
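
For reference, one such piece of insight is the deck letter at the start of each recorded cabin. A minimal sketch of that extraction (illustration only, not used further in this project):

# Hypothetical deck extraction: the first letter of Cabin, e.g. "C85" -> "C";
# rows with a missing Cabin simply stay NA
train_X %>%
  mutate(deck = str_extract(Cabin, "^[A-Za-z]")) %>%
  count(deck, sort = TRUE)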

# Drop Cabin 
trainX_new <- train_X %>% select(!Cabin)
validX_new <- valid_X %>% select(!Cabin)
titan_test_new <- titan_test %>% select(!Cabin)

names(trainX_new)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Embarked"    "titlePass"   "surname"     "Age2"
names(validX_new)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Embarked"    "titlePass"   "surname"     "Age2"
names(titan_test_new)
##  [1] "PassengerId" "Pclass"      "Name"        "Sex"         "Age"        
##  [6] "SibSp"       "Parch"       "Ticket"      "Fare"        "Embarked"   
## [11] "titlePass"   "surname"     "Age2"

Embarked Imputation

There is only one missing value of Embarked in trainX_new and one in validX_new; titan_test_new has none.

sum(is.na(trainX_new$Embarked))
## [1] 1
sum(is.na(validX_new$Embarked))
## [1] 1
sum(is.na(titan_test_new$Embarked))
## [1] 0

Let’s pull up those specific observations.

trainX_new[which(is.na(trainX_new$Embarked)),]
##    PassengerId Survived Pclass                Name    Sex Age SibSp Parch
## 62          62        1      1 Icard, Miss. Amelie female  38     0     0
##    Ticket Fare Embarked titlePass surname Age2
## 62 113572   80     <NA>        Ms   Icard   38
validX_new[which(is.na(validX_new$Embarked)),]
##     PassengerId Survived Pclass                                      Name
## 830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##        Sex Age SibSp Parch Ticket Fare Embarked titlePass surname Age2
## 830 female  62     0     0 113572   80     <NA>       Mrs   Stone   62

To impute these values, I did a simple search on an online encyclopedia dedicated to the Titanic and found the answer: both passengers embarked at Southampton. (For your information, much of the data used in this project is sourced from this site.) Source: encyclopedia-titanica.org

# Impute Embarked value in trainX_new
trainX_new[which(is.na(trainX_new$Embarked)),"Embarked"] <- "S"
# Impute Embarked value in validX_new
validX_new[which(is.na(validX_new$Embarked)),"Embarked"] <- "S"

sum(is.na(trainX_new$Embarked))
## [1] 0
sum(is.na(validX_new$Embarked))
## [1] 0
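
As a quick cross-check on this choice, “S” is by far the most common embarkation port in the data, so even a simple mode imputation would have given the same value. A one-line sketch:

# Frequency of each embarkation port in trainX_new (after the imputation above)
sort(table(trainX_new$Embarked), decreasing = TRUE)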

Fare Imputation

There is only one missing value of Fare in titan_test_new. Here is that record:

titan_test_new[which(is.na(titan_test_new$Fare)),]
##     PassengerId Pclass               Name  Sex  Age SibSp Parch Ticket Fare
## 153        1044      3 Storey, Mr. Thomas male 60.5     0     0   3701   NA
##     Embarked titlePass surname Age2
## 153        S        Mr  Storey 60.5
# Impute with the median Fare for Pclass 3 passengers who embarked at S
titan_test_new %>% group_by(Embarked,Pclass) %>% 
  summarise(med_Fare = median(Fare,na.rm=TRUE)) # 8.05
## `summarise()` has grouped output by 'Embarked'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 3
## # Groups:   Embarked [3]
##   Embarked Pclass med_Fare
##   <fct>    <fct>     <dbl>
## 1 C        1         75.2 
## 2 C        2         15.0 
## 3 C        3          7.23
## 4 Q        1         90   
## 5 Q        2         11.5 
## 6 Q        3          7.75
## 7 S        1         51.7 
## 8 S        2         21   
## 9 S        3          8.05
# Impute 
titan_test_new[which(is.na(titan_test_new$Fare)),"Fare"] <- 8.05

# Check
sum(is.na(titan_test_new$Fare))
## [1] 0

Aside from missing values, the Fare feature appears to have another problem.

summary(trainX_new$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.925  14.500  33.666  31.275 512.329
summary(validX_new$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.896  13.000  27.313  29.125 263.000
summary(titan_test_new$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.896  14.454  35.561  31.472 512.329

As the summaries show, the minimum Fare in every data set is zero. I don’t think a zero fare is plausible; it is more likely a data entry error.

trainX_new[which(trainX_new$Fare == 0),]
##     PassengerId Survived Pclass                             Name  Sex Age SibSp
## 264         264        0      1            Harrison, Mr. William male  40     0
## 272         272        1      3     Tornquist, Mr. William Henry male  25     0
## 303         303        0      3  Johnson, Mr. William Cahoone Jr male  19     0
## 467         467        0      2            Campbell, Mr. William male  NA     0
## 482         482        0      2 Frost, Mr. Anthony Wood "Archie" male  NA     0
## 598         598        0      3              Johnson, Mr. Alfred male  49     0
## 634         634        0      1    Parr, Mr. William Henry Marsh male  NA     0
## 675         675        0      2       Watson, Mr. Ennis Hastings male  NA     0
## 807         807        0      1           Andrews, Mr. Thomas Jr male  39     0
## 816         816        0      1                 Fry, Mr. Richard male  NA     0
## 823         823        0      1  Reuchlin, Jonkheer. John George male  38     0
##     Parch Ticket Fare Embarked titlePass   surname Age2
## 264     0 112059    0        S        Mr  Harrison   40
## 272     0   LINE    0        S        Mr Tornquist   25
## 303     0   LINE    0        S        Mr   Johnson   19
## 467     0 239853    0        S        Mr  Campbell   30
## 482     0 239854    0        S        Mr     Frost   30
## 598     0   LINE    0        S        Mr   Johnson   49
## 634     0 112052    0        S        Mr      Parr   30
## 675     0 239856    0        S        Mr    Watson   30
## 807     0 112050    0        S        Mr   Andrews   39
## 816     0 112058    0        S        Mr       Fry   30
## 823     0  19972    0        S  Jonkheer  Reuchlin   38

All of these “free riders” are male, aged 19 to 49 where Age is known, departed from Southampton, and travelled alone.

trainX_new[which(trainX_new$SibSp==0 & trainX_new$Parch==0 & trainX_new$Embarked=="S"),] %>% group_by(Pclass) %>% 
  summarise(fare_med = median(Fare)) 
## # A tibble: 3 x 2
##   Pclass fare_med
##   <fct>     <dbl>
## 1 1         26.6 
## 2 2         13   
## 3 3          7.92
# Pclass == 1 then fare==26.55
# Pclass == 2 then fare==13
# Pclass == 3 then fare==7.925

# Impute 
trainX_new <- trainX_new %>% mutate(Fare2=
  case_when(SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==1~26.55,
            SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==2~13,
            SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==3~7.925,
            TRUE~Fare)
)

summary(trainX_new$Fare2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.013   7.925  15.048  33.951  31.275 512.329

In validX_new, the “free riders” have similar characteristics to those in trainX_new, so I apply the same solution here.

validX_new[which(validX_new$Fare==0),]
##     PassengerId Survived Pclass                           Name  Sex Age SibSp
## 180         180        0      3            Leonard, Mr. Lionel male  36     0
## 278         278        0      2    Parkes, Mr. Francis "Frank" male  NA     0
## 414         414        0      2 Cunningham, Mr. Alfred Fleming male  NA     0
## 733         733        0      2           Knight, Mr. Robert J male  NA     0
##     Parch Ticket Fare Embarked titlePass    surname Age2
## 180     0   LINE    0        S        Mr    Leonard 36.0
## 278     0 239853    0        S        Mr     Parkes 31.5
## 414     0 239853    0        S        Mr Cunningham 31.5
## 733     0 239855    0        S        Mr     Knight 31.5
validX_new[which(validX_new$SibSp==0 & validX_new$Parch==0 & validX_new$Embarked=="S"),] %>% group_by(Pclass) %>% 
  summarise(fare_med = median(Fare)) 
## # A tibble: 3 x 2
##   Pclass fare_med
##   <fct>     <dbl>
## 1 1         47.2 
## 2 2         13   
## 3 3          7.90
# if Pclass==1 then fare==47.2
# if Pclass==2 then fare==13
# if Pclass==3 then fare==7.8958

validX_new <- validX_new %>% mutate(Fare2=
  case_when(SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==1~47.2,
            SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==2~13,
            SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==3~7.8958,
            TRUE~Fare)
)

summary(validX_new$Fare2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.496   7.896  13.000  27.542  29.125 263.000

At last, for titan_test_new

titan_test_new[which(titan_test_new$Fare==0),]
##     PassengerId Pclass                                  Name  Sex Age SibSp
## 267        1158      1 Chisholm, Mr. Roderick Robert Crispin male  NA     0
## 373        1264      1               Ismay, Mr. Joseph Bruce male  49     0
##     Parch Ticket Fare Embarked titlePass  surname Age2
## 267     0 112051    0        S        Mr Chisholm   29
## 373     0 112058    0        S        Mr    Ismay   49

It is also reasonable to apply the same solution to zero Fare in titan_test_new.

titan_test_new[which(titan_test_new$SibSp==0 & titan_test_new$Parch==0 & titan_test_new$Embarked=="S"),] %>% group_by(Pclass) %>% 
  summarise(fare_med = median(Fare)) 
## # A tibble: 3 x 2
##   Pclass fare_med
##   <fct>     <dbl>
## 1 1         30.5 
## 2 2         13   
## 3 3          7.90
# if pclass==1 then Fare == 30.5
# if Pclass==2 then Fare == 13
# if Pclass==3 then Fare == 7.8958

# impute
titan_test_new <- titan_test_new %>% mutate(Fare2=
  case_when(SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==1~30.5,
            SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==2~13,
            SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==3~7.8958,
            TRUE~Fare)
)

summary(titan_test_new$Fare2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.171   7.896  14.456  35.707  31.472 512.329

To prevent confusion, let’s drop the Age and Fare features from each data set.

trainX_new <- trainX_new %>% select(!c(Age,Fare))
validX_new <- validX_new %>% select(!c(Age,Fare))
titan_test_new <- titan_test_new %>% select(!c(Age,Fare))

Normalization and Scaling

summary(trainX_new)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:423    1:175   Length:686         female:234  
##  1st Qu.:223.2   1:263    2:142   Class :character   male  :452  
##  Median :445.5            3:369   Mode  :character               
##  Mean   :446.1                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##      SibSp            Parch           Ticket          Embarked
##  Min.   :0.0000   Min.   :0.0000   Length:686         C:133   
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character   Q: 59   
##  Median :0.0000   Median :0.0000   Mode  :character   S:494   
##  Mean   :0.5394   Mean   :0.3848                              
##  3rd Qu.:1.0000   3rd Qu.:0.0000                              
##  Max.   :8.0000   Max.   :6.0000                              
##   titlePass           surname               Age2           Fare2        
##  Length:686         Length:686         Min.   : 0.67   Min.   :  4.013  
##  Class :character   Class :character   1st Qu.:21.00   1st Qu.:  7.925  
##  Mode  :character   Mode  :character   Median :30.00   Median : 15.048  
##                                        Mean   :29.75   Mean   : 33.951  
##                                        3rd Qu.:36.00   3rd Qu.: 31.275  
##                                        Max.   :80.00   Max.   :512.329

I will conduct normality tests and apply standardization to the Age2 and Fare2 features.

There are two reasons:

  1. As we know from the preceding data exploration, the distributions of Fare and Age in the Titanic data sets are right-skewed, which means they are not normally distributed.
  2. Furthermore, their values are on a very different scale from our target feature, “Survived”, which could impact the model accuracy (a plain z-score sketch follows this list for reference).
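
For reference, plain z-score standardization is a minimal sketch of that scaling idea; it is not what I end up using, since the Yeo-Johnson step applied later centers and scales its output on its own.

# Illustration only: z-score standardization of the two numeric features
fare_z <- as.numeric(scale(trainX_new$Fare2))  # (x - mean(x)) / sd(x)
age_z  <- as.numeric(scale(trainX_new$Age2))
summary(fare_z)
summary(age_z)
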
# Q-Q plots of Fare2 in each data set
qqnorm(trainX_new$Fare2,main="Q-Q Plot for Fare Normality in Training Data")
qqline(trainX_new$Fare2,col="magenta")

qqnorm(validX_new$Fare2,main="Q-Q Plot for Fare Normality in Validation Data")
qqline(validX_new$Fare2,col="coral")

qqnorm(titan_test_new$Fare2,main="Q-Q Plot for Fare Normality in Test Data")
qqline(titan_test_new$Fare2,col="coral")

From the Q-Q plots of each data set, we can see how the data diverge from the straight reference line, especially around the tails.

qqnorm(trainX_new$Age2,main="Q-Q Plot for Age Normality in Training Data")
qqline(trainX_new$Age2,col="magenta")

qqnorm(validX_new$Age2,main="Q-Q Plot for Age Normality in Validation Data")
qqline(validX_new$Age2,col="coral")

qqnorm(titan_test_new$Age2,main="Q-Q Plot for Age Normality in Test Data")
qqline(titan_test_new$Age2,col="coral")

For Age, even though it does not completely follow a normal distribution, its points lie closer to the theoretical quantiles than those of Fare.

To confirm our interpretation of the Q-Q plots, I will perform the Shapiro-Wilk test of normality.

shapiro.test(trainX_new$Age2)
## 
##  Shapiro-Wilk normality test
## 
## data:  trainX_new$Age2
## W = 0.97015, p-value = 1.307e-10
shapiro.test(validX_new$Age2)
## 
##  Shapiro-Wilk normality test
## 
## data:  validX_new$Age2
## W = 0.97388, p-value = 0.0007291
shapiro.test(titan_test_new$Age2)
## 
##  Shapiro-Wilk normality test
## 
## data:  titan_test_new$Age2
## W = 0.96171, p-value = 5.655e-09

Normality test using the Kolmogorov-Smirnov test:

ks.test(trainX_new$Age2,"pnorm")
## Warning in ks.test(trainX_new$Age2, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  trainX_new$Age2
## D = 0.97241, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(validX_new$Age2,"pnorm")
## Warning in ks.test(validX_new$Age2, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  validX_new$Age2
## D = 0.96938, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(titan_test_new$Age2,"pnorm")
## Warning in ks.test(titan_test_new$Age2, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  titan_test_new$Age2
## D = 0.97473, p-value < 2.2e-16
## alternative hypothesis: two-sided

All of these tests indicate that Age2 is not normally distributed (note that ks.test(x, "pnorm") with default parameters compares against a standard normal, so the Shapiro-Wilk results are the more direct evidence here). Now, for Fare2:

qqnorm(trainX_new$Fare2,main="Q-Q Plot for Fare Normality in Training Data")
qqline(trainX_new$Fare2,col="magenta")

qqnorm(validX_new$Fare2,main="Q-Q Plot for Fare Normality in Validation Data")
qqline(validX_new$Fare2,col="coral")

qqnorm(titan_test_new$Fare2,main="Q-Q Plot for Fare Normality in Test Data")
qqline(titan_test_new$Fare2,col="coral")


Just like Age, the Fare feature in every data set is most likely not normally distributed.

# Shapiro-Wilk Test
shapiro.test(trainX_new$Fare2)
## 
##  Shapiro-Wilk normality test
## 
## data:  trainX_new$Fare2
## W = 0.50906, p-value < 2.2e-16
shapiro.test(validX_new$Fare2)
## 
##  Shapiro-Wilk normality test
## 
## data:  validX_new$Fare2
## W = 0.60136, p-value < 2.2e-16
shapiro.test(titan_test_new$Fare2)
## 
##  Shapiro-Wilk normality test
## 
## data:  titan_test_new$Fare2
## W = 0.53761, p-value < 2.2e-16
#Kolmogrov-Smirnov Test
ks.test(trainX_new$Fare2,"pnorm")
## Warning in ks.test(trainX_new$Fare2, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  trainX_new$Fare2
## D = 0.99997, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(validX_new$Fare2,"pnorm")
## Warning in ks.test(validX_new$Fare2, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  validX_new$Fare2
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(titan_test_new$Fare2,"pnorm")
## Warning in ks.test(titan_test_new$Fare2, "pnorm"): ties should not be present
## for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  titan_test_new$Fare2
## D = 0.99924, p-value < 2.2e-16
## alternative hypothesis: two-sided

Both the Shapiro-Wilk and Kolmogorov-Smirnov test results confirm this deduction.

For normalization, I will use a log transformation and a Yeo-Johnson transformation, then compare them to see which one does a better job of normalizing the Fare and Age features.

# Log transformation
log_fare <- log(trainX_new$Fare2)
hist(log_fare)

shapiro.test(log_fare)
## 
##  Shapiro-Wilk normality test
## 
## data:  log_fare
## W = 0.89446, p-value < 2.2e-16
log_age <- log(trainX_new$Age2)
hist(log_age)

shapiro.test(log_age)
## 
##  Shapiro-Wilk normality test
## 
## data:  log_age
## W = 0.75829, p-value < 2.2e-16
#install.packages("bestNormalize")
library(bestNormalize)
## Warning: package 'bestNormalize' was built under R version 4.1.3
yeoJ_fare <- yeojohnson(trainX_new$Fare2)
hist(yeoJ_fare$x.t)

shapiro.test(yeoJ_fare$x.t)
## 
##  Shapiro-Wilk normality test
## 
## data:  yeoJ_fare$x.t
## W = 0.92003, p-value < 2.2e-16
yeoJ_age <- yeojohnson(trainX_new$Age2)
hist(yeoJ_age$x.t)

shapiro.test(yeoJ_age$x.t)
## 
##  Shapiro-Wilk normality test
## 
## data:  yeoJ_age$x.t
## W = 0.97735, p-value = 8.164e-09

All things considered, the Yeo-Johnson transformation generates more centered and symmetric data than the log transformation, even though both are still far from normally distributed.

iqr_yeo_fare <- IQR(yeoJ_fare$x.t)
q1_yeo_fare <- quantile(yeoJ_fare$x.t,0.25)
q3_yeo_fare <-quantile(yeoJ_fare$x.t,0.75)

# Check for outliers
yeoJ_fare$x.t[which(yeoJ_fare$x.t < q1_yeo_fare - (1.5*iqr_yeo_fare)|
                      yeoJ_fare$x.t > q3_yeo_fare + (1.5*iqr_yeo_fare))]
## numeric(0)
# Check for normality using Q-Q plot
qqnorm(yeoJ_fare$x.t,main="Q-Q Plot for YeoJohnson-Fare Normality in Training Data")
qqline(yeoJ_fare$x.t,col="magenta")

iqr_yeo_age <- IQR(yeoJ_age$x.t)
q1_yeo_age <- quantile(yeoJ_age$x.t,0.25)
q3_yeo_age <-quantile(yeoJ_age$x.t,0.75)

# Check for outliers
yeoJ_age$x.t[which(yeoJ_age$x.t < q1_yeo_age - (1.5*iqr_yeo_age)|
                      yeoJ_age$x.t > q3_yeo_age + (1.5*iqr_yeo_age))]
##  [1] -2.395473  2.398930 -2.551608  2.763624  2.733512 -2.395473 -2.527724
##  [8] -2.527724 -2.527724  2.213698  2.275667 -2.395473 -2.538902 -2.395473
## [15] -2.527724  2.337409  2.398930 -2.563026 -2.395473  2.275667  2.763624
## [22]  2.337409  2.213698  2.213698  3.297668 -2.395473  2.703350 -2.574566
## [29] -2.527724 -2.395473 -2.527724 -2.551608  2.943285
# QQ plot
qqnorm(yeoJ_age$x.t,main="Q-Q Plot for YeoJohnson Age Normality in Training Data")
qqline(yeoJ_age$x.t,col="magenta")

Based on the number of outliers left after transformation, Yeo-Johnson does not seem to be the right technique for Age (the untransformed Age2 has fewer outliers than the transformed version). I tried several other transformation techniques, but the best result I could get was 24 outliers remaining, the same number as in the untransformed Age2.

Finally, I compared the outliers in the original Age feature against Age2. Apparently, the increase in outliers happens right after the imputation process: Age has only 6 outliers, while Age2 has 24.

The reason is that before imputation, 75% of Age lies under 39 years, while after median imputation by title, 75% of passengers’ ages lie under 36 years. The imputed Age therefore has a smaller IQR, which makes the 1.5*IQR rule flag more points as outliers.

quantile(train_X$Age,c(0.25,0.75),na.rm=TRUE)
## 25% 75% 
##  21  39
quantile(train_X$Age2,c(0.25,0.75))
## 25% 75% 
##  21  36
IQR(train_X$Age,na.rm=TRUE)
## [1] 18
IQR(train_X$Age2)
## [1] 15
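
To make this comparison easy to reproduce, here is a minimal sketch of an outlier counter using the same 1.5*IQR rule applied elsewhere in this section; it should recover the 6-versus-24 figures quoted above.

# Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], ignoring missing values
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
count_outliers(train_X$Age)   # original Age, NAs ignored
count_outliers(train_X$Age2)  # Age after median imputation by title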

Hence, I simply let these outliers be. For model-building purposes, I will compare the log-transformed Age and the Yeo-Johnson-transformed Age as features in a later section.

trainX_new$YJTrans_Fare2 <- yeoJ_fare$x.t
trainX_new$YJTrans_Age2 <- yeoJ_age$x.t
trainX_new$log_age <- log_age
str(trainX_new)
## 'data.frame':    686 obs. of  16 variables:
##  $ PassengerId  : int  1 2 3 4 6 7 9 10 12 13 ...
##  $ Survived     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
##  $ Name         : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex          : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
##  $ SibSp        : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch        : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ Ticket       : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
##  $ titlePass    : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ surname      : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
##  $ Age2         : num  22 38 26 35 30 54 27 14 58 20 ...
##  $ Fare2        : num  7.25 71.28 7.92 53.1 8.46 ...
##  $ YJTrans_Fare2: num  -1.235 1.36 -1.077 1.167 -0.964 ...
##  $ YJTrans_Age2 : num  -0.5323 0.644 -0.2239 0.4337 0.0738 ...
##  $ log_age      : num  3.09 3.64 3.26 3.56 3.4 ...

Normalization and Scaling in validX_new

yeoJ_age_valid <- predict(yeoJ_age, newdata = validX_new$Age2)
yeoJ_fare_valid <- predict(yeoJ_fare, newdata = validX_new$Fare2)
hist(yeoJ_age_valid)

hist(validX_new$Age2)

summary(yeoJ_age_valid)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.61150 -0.77177  0.07384 -0.14596  0.29127  2.70335
skewness(yeoJ_age_valid)
## [1] -0.1632515
kurtosis(yeoJ_age_valid)
## [1] 3.363006
#outlier on Yeo Johnson Age2
length(yeoJ_age_valid[which(yeoJ_age_valid < quantile(yeoJ_age_valid,0.25) - (1.5*IQR(yeoJ_age_valid))|
                     yeoJ_age_valid > quantile(yeoJ_age_valid,0.75) + (1.5*IQR(yeoJ_age_valid)))])
## [1] 10
# Outlier on Age2
length(validX_new$Age2[which(validX_new$Age2 < quantile(validX_new$Age2,0.25) - (1.5*IQR(validX_new$Age2))|
                     validX_new$Age2 > quantile(validX_new$Age2,0.75) + (1.5*IQR(validX_new$Age2)))])
## [1] 5
hist(yeoJ_fare_valid)

hist(validX_new$Fare2)

summary(yeoJ_fare_valid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4384 -1.0833 -0.2913 -0.1113  0.6662  1.9274
skewness(yeoJ_fare_valid)
## [1] 0.380994
kurtosis(yeoJ_fare_valid)
## [1] 1.811828
length(yeoJ_fare_valid[which(yeoJ_fare_valid < quantile(yeoJ_fare_valid,0.25) - (1.5*IQR(yeoJ_fare_valid))|
                     yeoJ_fare_valid > quantile(yeoJ_fare_valid,0.75) + (1.5*IQR(yeoJ_fare_valid)))])
## [1] 0
length(validX_new$Fare2[which(validX_new$Fare2 < quantile(validX_new$Fare2,0.25) - (1.5*IQR(validX_new$Fare2))|
                     validX_new$Fare2 > quantile(validX_new$Fare2,0.75) + (1.5*IQR(validX_new$Fare2)))])
## [1] 23

Add yeoJ_age_valid and yeoJ_fare_valid to validX_new dataframe

validX_new$YJTrans_Fare2 <- yeoJ_fare_valid
validX_new$YJTrans_Age2 <- yeoJ_age_valid
validX_new$log_Age2 <- log(validX_new$Age2)

str(validX_new)
## 'data.frame':    205 obs. of  16 variables:
##  $ PassengerId  : int  5 8 11 18 21 24 31 34 37 44 ...
##  $ Survived     : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 1 2 2 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 1 2 3 2 ...
##  $ Name         : chr  "Allen, Mr. William Henry" "Palsson, Master. Gosta Leonard" "Sandstrom, Miss. Marguerite Rut" "Williams, Mr. Charles Eugene" ...
##  $ Sex          : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
##  $ SibSp        : int  0 3 1 0 0 0 0 0 0 1 ...
##  $ Parch        : int  0 1 1 0 0 0 0 0 0 2 ...
##  $ Ticket       : chr  "373450" "349909" "PP 9549" "244373" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 1 3 1 1 ...
##  $ titlePass    : chr  "Mr" "Master" "Miss" "Mr" ...
##  $ surname      : chr  "Allen" "Palsson" "Sandstrom" "Williams" ...
##  $ Age2         : num  35 2 4 31.5 35 28 40 66 31.5 3 ...
##  $ Fare2        : num  8.05 21.07 16.7 13 26 ...
##  $ YJTrans_Fare2: num  -1.0495 0.3261 0.0462 -0.2913 0.5531 ...
##  $ YJTrans_Age2 : num  0.434 -2.395 -2.158 0.183 0.434 ...
##  $ log_Age2     : num  3.555 0.693 1.386 3.45 3.555 ...

Transformation on titan_test

yeoJ_age_test <- predict(yeojohnson(titan_test_new$Age2))
yeoJ_fare_test <- predict(yeojohnson(titan_test_new$Fare2))
head(yeoJ_age_test)
## [1]  0.4138232  1.2951371  2.2788698 -0.1527104 -0.5518869 -1.2413137
hist(yeoJ_age_test)

hist(titan_test_new$Age2)

summary(yeoJ_age_test)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.80232 -0.55189  0.00177  0.00000  0.55955  3.14418
skewness(yeoJ_age_test)
## [1] 0.08055547
kurtosis(yeoJ_age_test)
## [1] 3.715383
#outlier on Yeo Johnson Age2
length(yeoJ_age_test[which(yeoJ_age_test < quantile(yeoJ_age_test,0.25) - (1.5*IQR(yeoJ_age_test))|
                     yeoJ_age_test > quantile(yeoJ_age_test,0.75) + (1.5*IQR(yeoJ_age_test)))])
## [1] 19
# Outlier on Age2
length(titan_test_new$Age2[which(titan_test_new$Age2 < quantile(titan_test_new$Age2,0.25) - (1.5*IQR(titan_test_new$Age2))|
                     titan_test_new$Age2 > quantile(titan_test_new$Age2,0.75) + (1.5*IQR(titan_test_new$Age2)))])
## [1] 16
hist(yeoJ_fare_test)

hist(titan_test_new$Fare2)

summary(yeoJ_fare_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.9287 -1.0422 -0.1084  0.0000  0.7535  2.0217
skewness(yeoJ_fare_test)
## [1] 0.2385711
kurtosis(yeoJ_fare_test)
## [1] 1.856749
#outlier on Yeo Johnson Fare2
length(yeoJ_fare_test[which(yeoJ_fare_test < quantile(yeoJ_fare_test,0.25) - (1.5*IQR(yeoJ_fare_test))|
                     yeoJ_fare_test > quantile(yeoJ_fare_test,0.75) + (1.5*IQR(yeoJ_fare_test)))])
## [1] 0
# Outlier on Fare2
length(titan_test_new$Fare2[which(titan_test_new$Fare2 < quantile(titan_test_new$Fare2,0.25) - (1.5*IQR(titan_test_new$Fare2))|
                     titan_test_new$Fare2 > quantile(titan_test_new$Fare2,0.75) + (1.5*IQR(titan_test_new$Fare2)))])
## [1] 55

Add yeoJ_fare_test and yeoJ_age_test to the titan_test_new data frame.

titan_test_new$YJTrans_Fare2 <- yeoJ_fare_test
titan_test_new$YJTrans_Age2 <- yeoJ_age_test
titan_test_new$log_Age2 <- log(titan_test_new$Age2)

str(titan_test_new)
## 'data.frame':    418 obs. of  15 variables:
##  $ PassengerId  : int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name         : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex          : chr  "male" "female" "male" "male" ...
##  $ SibSp        : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch        : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket       : chr  "330911" "363272" "240276" "315154" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ titlePass    : chr  "Mr" "Mrs" "Mr" "Mr" ...
##  $ surname      : chr  "Kelly" "Wilkes" "Myles" "Wirz" ...
##  $ Age2         : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ Fare2        : num  7.83 7 9.69 8.66 12.29 ...
##  $ YJTrans_Fare2: num  -1.057 -1.259 -0.698 -0.883 -0.335 ...
##  $ YJTrans_Age2 : num  0.414 1.295 2.279 -0.153 -0.552 ...
##  $ log_Age2     : num  3.54 3.85 4.13 3.3 3.09 ...

Surprisingly, the Yeo-Johnson transformation produces a similar pattern on all data sets: it eliminates the outliers in Fare but increases the number of outliers in Age, and in every case it returns more centered data.
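
One detail worth noting: for validX_new I reused the transformers fitted on the training data via predict(), while for titan_test_new new transformers were fitted on the test data itself. If we preferred to keep the test set on exactly the same scale as training, a minimal sketch (reusing the yeoJ_age and yeoJ_fare objects fitted earlier) would be:

# Hypothetical alternative: apply the training-fitted Yeo-Johnson transformers
# to the test data instead of refitting them on titan_test_new
yeoJ_age_test_reused  <- predict(yeoJ_age,  newdata = titan_test_new$Age2)
yeoJ_fare_test_reused <- predict(yeoJ_fare, newdata = titan_test_new$Fare2)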

Feature Engineering

As I stated in the preceding section (Exploratory Data Analysis), the Sex and Embarked features are nominal measurements: there is no order among their levels and no sense in which one group ranks above another. For this reason, I will convert these two features from factors into binary indicator features.

# unique value Sex feature
unique(trainX_new$Sex)
## [1] male   female
## Levels: female male
unique(trainX_new$Embarked)
## [1] S C Q
## Levels: C Q S
# Convert each of this unique values to binary feature

trainX_new <- trainX_new %>% mutate(
  SexMale = case_when(
    Sex=="male"~1,
    TRUE~0
  ),
  SexFemale = case_when(
    Sex=="female"~1,
    TRUE~0
  ),
  Embarked_S = case_when(
    Embarked=="S"~1,
    TRUE~0
  ),
  Embarked_C = case_when(
    Embarked=="C"~1,
    TRUE~0
  ),
  Embarked_Q = case_when(
    Embarked=="Q"~1,
    TRUE~0
  )
)

str(trainX_new)
## 'data.frame':    686 obs. of  21 variables:
##  $ PassengerId  : int  1 2 3 4 6 7 9 10 12 13 ...
##  $ Survived     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
##  $ Name         : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex          : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
##  $ SibSp        : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch        : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ Ticket       : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
##  $ titlePass    : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ surname      : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
##  $ Age2         : num  22 38 26 35 30 54 27 14 58 20 ...
##  $ Fare2        : num  7.25 71.28 7.92 53.1 8.46 ...
##  $ YJTrans_Fare2: num  -1.235 1.36 -1.077 1.167 -0.964 ...
##  $ YJTrans_Age2 : num  -0.5323 0.644 -0.2239 0.4337 0.0738 ...
##  $ log_age      : num  3.09 3.64 3.26 3.56 3.4 ...
##  $ SexMale      : num  1 0 0 0 1 1 0 0 0 1 ...
##  $ SexFemale    : num  0 1 1 1 0 0 1 1 1 0 ...
##  $ Embarked_S   : num  1 0 1 1 0 1 1 0 1 1 ...
##  $ Embarked_C   : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ Embarked_Q   : num  0 0 0 0 1 0 0 0 0 0 ...

Sex and Embarked engineering in validX_new

# unique value Sex feature
unique(validX_new$Sex)
## [1] male   female
## Levels: female male
unique(validX_new$Embarked)
## [1] S C Q
## Levels: C Q S
# Convert each of this unique values to binary feature

validX_new <- validX_new %>% mutate(
  SexMale = case_when(
    Sex=="male"~1,
    TRUE~0
  ),
  SexFemale = case_when(
    Sex=="female"~1,
    TRUE~0
  ),
  Embarked_S = case_when(
    Embarked=="S"~1,
    TRUE~0
  ),
  Embarked_C = case_when(
    Embarked=="C"~1,
    TRUE~0
  ),
  Embarked_Q = case_when(
    Embarked=="Q"~1,
    TRUE~0
  )
)

str(validX_new)
## 'data.frame':    205 obs. of  21 variables:
##  $ PassengerId  : int  5 8 11 18 21 24 31 34 37 44 ...
##  $ Survived     : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 1 2 2 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 1 2 3 2 ...
##  $ Name         : chr  "Allen, Mr. William Henry" "Palsson, Master. Gosta Leonard" "Sandstrom, Miss. Marguerite Rut" "Williams, Mr. Charles Eugene" ...
##  $ Sex          : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
##  $ SibSp        : int  0 3 1 0 0 0 0 0 0 1 ...
##  $ Parch        : int  0 1 1 0 0 0 0 0 0 2 ...
##  $ Ticket       : chr  "373450" "349909" "PP 9549" "244373" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 1 3 1 1 ...
##  $ titlePass    : chr  "Mr" "Master" "Miss" "Mr" ...
##  $ surname      : chr  "Allen" "Palsson" "Sandstrom" "Williams" ...
##  $ Age2         : num  35 2 4 31.5 35 28 40 66 31.5 3 ...
##  $ Fare2        : num  8.05 21.07 16.7 13 26 ...
##  $ YJTrans_Fare2: num  -1.0495 0.3261 0.0462 -0.2913 0.5531 ...
##  $ YJTrans_Age2 : num  0.434 -2.395 -2.158 0.183 0.434 ...
##  $ log_Age2     : num  3.555 0.693 1.386 3.45 3.555 ...
##  $ SexMale      : num  1 1 0 1 1 1 1 1 1 0 ...
##  $ SexFemale    : num  0 0 1 0 0 0 0 0 0 1 ...
##  $ Embarked_S   : num  1 1 1 1 1 1 0 1 0 0 ...
##  $ Embarked_C   : num  0 0 0 0 0 0 1 0 1 1 ...
##  $ Embarked_Q   : num  0 0 0 0 0 0 0 0 0 0 ...

Sex and Embarked engineering in titan_test_new

unique(titan_test_new$Sex)
## [1] "male"   "female"
unique(titan_test_new$Embarked)
## [1] Q S C
## Levels: C Q S
# Convert each of this unique values to binary feature

titan_test_new <- titan_test_new %>% mutate(
  SexMale = case_when(
    Sex=="male"~1,
    TRUE~0
  ),
  SexFemale = case_when(
    Sex=="female"~1,
    TRUE~0
  ),
  Embarked_S = case_when(
    Embarked=="S"~1,
    TRUE~0
  ),
  Embarked_C = case_when(
    Embarked=="C"~1,
    TRUE~0
  ),
  Embarked_Q = case_when(
    Embarked=="Q"~1,
    TRUE~0
  )
)

str(titan_test_new)
## 'data.frame':    418 obs. of  20 variables:
##  $ PassengerId  : int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name         : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex          : chr  "male" "female" "male" "male" ...
##  $ SibSp        : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch        : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket       : chr  "330911" "363272" "240276" "315154" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ titlePass    : chr  "Mr" "Mrs" "Mr" "Mr" ...
##  $ surname      : chr  "Kelly" "Wilkes" "Myles" "Wirz" ...
##  $ Age2         : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ Fare2        : num  7.83 7 9.69 8.66 12.29 ...
##  $ YJTrans_Fare2: num  -1.057 -1.259 -0.698 -0.883 -0.335 ...
##  $ YJTrans_Age2 : num  0.414 1.295 2.279 -0.153 -0.552 ...
##  $ log_Age2     : num  3.54 3.85 4.13 3.3 3.09 ...
##  $ SexMale      : num  1 0 1 1 0 1 0 1 0 1 ...
##  $ SexFemale    : num  0 1 0 0 1 0 1 0 1 0 ...
##  $ Embarked_S   : num  0 1 0 1 1 1 0 1 0 1 ...
##  $ Embarked_C   : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Embarked_Q   : num  1 0 1 0 0 0 1 0 0 0 ...
str(trainX_new)
## 'data.frame':    686 obs. of  21 variables:
##  $ PassengerId  : int  1 2 3 4 6 7 9 10 12 13 ...
##  $ Survived     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
##  $ Name         : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex          : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
##  $ SibSp        : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch        : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ Ticket       : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
##  $ titlePass    : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ surname      : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
##  $ Age2         : num  22 38 26 35 30 54 27 14 58 20 ...
##  $ Fare2        : num  7.25 71.28 7.92 53.1 8.46 ...
##  $ YJTrans_Fare2: num  -1.235 1.36 -1.077 1.167 -0.964 ...
##  $ YJTrans_Age2 : num  -0.5323 0.644 -0.2239 0.4337 0.0738 ...
##  $ log_age      : num  3.09 3.64 3.26 3.56 3.4 ...
##  $ SexMale      : num  1 0 0 0 1 1 0 0 0 1 ...
##  $ SexFemale    : num  0 1 1 1 0 0 1 1 1 0 ...
##  $ Embarked_S   : num  1 0 1 1 0 1 1 0 1 1 ...
##  $ Embarked_C   : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ Embarked_Q   : num  0 0 0 0 1 0 0 0 0 0 ...
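
The same one-hot conversion is written out three times above. A compact alternative, assuming we want a small reusable helper (the add_sex_embarked_dummies function below is hypothetical and not used elsewhere in this project), could look like this:

# Add the five indicator columns to any data frame with Sex and Embarked columns
add_sex_embarked_dummies <- function(df) {
  df %>% mutate(
    SexMale    = as.numeric(Sex == "male"),
    SexFemale  = as.numeric(Sex == "female"),
    Embarked_S = as.numeric(Embarked == "S"),
    Embarked_C = as.numeric(Embarked == "C"),
    Embarked_Q = as.numeric(Embarked == "Q")
  )
}
# e.g. trainX_new <- add_sex_embarked_dummies(trainX_new)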

Feature Selection

In a common data analysis workflow there are many feature selection methods to choose from, each with its own strengths and weaknesses. Generally, these methods are divided into four groups:

  1. Filter
  2. Wrapper
  3. Embedded
  4. Hybrid

For this project, I will use one of the most popular and reliable feature selection methods, called Boruta. For a detailed explanation of Boruta, please visit this link: Analytics Vidhya

# install boruta package
#install.packages("Boruta")
library(Boruta)
## Warning: package 'Boruta' was built under R version 4.1.3

Before proceeding, it’s a good idea to once again have a look at the data structure. Here I realized that several features should not be involved in, or need to be separated out before, the selection process.

str(trainX_new)
## 'data.frame':    686 obs. of  21 variables:
##  $ PassengerId  : int  1 2 3 4 6 7 9 10 12 13 ...
##  $ Survived     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
##  $ Name         : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex          : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
##  $ SibSp        : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch        : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ Ticket       : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Embarked     : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
##  $ titlePass    : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ surname      : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
##  $ Age2         : num  22 38 26 35 30 54 27 14 58 20 ...
##  $ Fare2        : num  7.25 71.28 7.92 53.1 8.46 ...
##  $ YJTrans_Fare2: num  -1.235 1.36 -1.077 1.167 -0.964 ...
##  $ YJTrans_Age2 : num  -0.5323 0.644 -0.2239 0.4337 0.0738 ...
##  $ log_age      : num  3.09 3.64 3.26 3.56 3.4 ...
##  $ SexMale      : num  1 0 0 0 1 1 0 0 0 1 ...
##  $ SexFemale    : num  0 1 1 1 0 0 1 1 1 0 ...
##  $ Embarked_S   : num  1 0 1 1 0 1 1 0 1 1 ...
##  $ Embarked_C   : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ Embarked_Q   : num  0 0 0 0 1 0 0 0 0 0 ...

First of all, I want to discard “PassengerId”, or rather, convert it into row names.

trainX_select <- trainX_new %>% remove_rownames() %>% column_to_rownames("PassengerId")
validX_select <- validX_new %>% remove_rownames() %>% column_to_rownames("PassengerId")
titan_test_select <- titan_test_new %>% remove_rownames() %>% 
  column_to_rownames("PassengerId")

Next, I need to discard the Name, titlePass, surname, and Ticket features, because they are hard to interpret in the analysis. Moreover, the highly imbalanced proportions of their levels would cause problems in the model.
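
A quick tabulation of the passenger titles illustrates the imbalance concern (a sketch, not part of the original run):

sort(table(trainX_select$titlePass), decreasing = TRUE)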

trainX_select <- trainX_select %>% select(-c(Name,Ticket,surname,titlePass))
validX_select <- validX_select %>% select(-c(Name,Ticket,surname,titlePass))
titan_test_select <- titan_test_select %>% select(-c(Name,Ticket,surname,titlePass))

Furthermore, I also realized that, due to the data transformations, our Titanic data set now contains some redundant features, for example the Sex feature alongside the SexMale and SexFemale dummies. For features like this, I want to keep the transformed versions and discard the originals: Sex, Embarked, Age2, and Fare2.

# Data set with Yeo Johnson transformed Age2
trainX_trans <- trainX_select %>% select(-c(Sex,Embarked,Age2,Fare2,log_age))
validX_trans <- validX_select %>% select(-c(Sex,Embarked,Age2,Fare2,log_Age2))
titan_test_trans <- titan_test_select %>% select(-c(Sex,Embarked,Age2,Fare2,log_Age2))
# Data set with log transformed Age2
trainX_trans_log <- trainX_select %>% select(-c(Sex,Embarked,Age2,Fare2,YJTrans_Age2))
validX_trans_log <- validX_select %>% select(-c(Sex,Embarked,Age2,Fare2,YJTrans_Age2))
titan_test_trans_log <- titan_test_select %>% select(-c(Sex,Embarked,Age2,Fare2,YJTrans_Age2))
str(trainX_trans)
## 'data.frame':    686 obs. of  11 variables:
##  $ Survived     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
##  $ SibSp        : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch        : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ YJTrans_Fare2: num  -1.235 1.36 -1.077 1.167 -0.964 ...
##  $ YJTrans_Age2 : num  -0.5323 0.644 -0.2239 0.4337 0.0738 ...
##  $ SexMale      : num  1 0 0 0 1 1 0 0 0 1 ...
##  $ SexFemale    : num  0 1 1 1 0 0 1 1 1 0 ...
##  $ Embarked_S   : num  1 0 1 1 0 1 1 0 1 1 ...
##  $ Embarked_C   : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ Embarked_Q   : num  0 0 0 0 1 0 0 0 0 0 ...
str(trainX_trans_log)
## 'data.frame':    686 obs. of  11 variables:
##  $ Survived     : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
##  $ Pclass       : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
##  $ SibSp        : int  1 1 0 1 0 0 0 1 0 0 ...
##  $ Parch        : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ YJTrans_Fare2: num  -1.235 1.36 -1.077 1.167 -0.964 ...
##  $ log_age      : num  3.09 3.64 3.26 3.56 3.4 ...
##  $ SexMale      : num  1 0 0 0 1 1 0 0 0 1 ...
##  $ SexFemale    : num  0 1 1 1 0 0 1 1 1 0 ...
##  $ Embarked_S   : num  1 0 1 1 0 1 1 0 1 1 ...
##  $ Embarked_C   : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ Embarked_Q   : num  0 0 0 0 1 0 0 0 0 0 ...
# Apply Boruta
set.seed(141)
boruta_titan <- Boruta(Survived~., data=trainX_trans, doTrace=2)
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
## After 10 iterations, +9.2 secs:
##  confirmed 8 attributes: Embarked_S, Parch, Pclass, SexFemale, SexMale and 3 more;
##  still have 2 attributes left.
##  11. run of importance source...
##  12. run of importance source...
##  13. run of importance source...
##  14. run of importance source...
## After 14 iterations, +13 secs:
##  confirmed 1 attribute: Embarked_Q;
##  still have 1 attribute left.
##  15. run of importance source...
##  16. run of importance source...
##  17. run of importance source...
##  18. run of importance source...
##  19. run of importance source...
##  20. run of importance source...
##  21. run of importance source...
##  22. run of importance source...
##  23. run of importance source...
##  24. run of importance source...
##  25. run of importance source...
##  26. run of importance source...
##  27. run of importance source...
##  28. run of importance source...
##  29. run of importance source...
##  30. run of importance source...
##  31. run of importance source...
##  32. run of importance source...
##  33. run of importance source...
##  34. run of importance source...
##  35. run of importance source...
##  36. run of importance source...
##  37. run of importance source...
##  38. run of importance source...
##  39. run of importance source...
##  40. run of importance source...
##  41. run of importance source...
##  42. run of importance source...
##  43. run of importance source...
##  44. run of importance source...
##  45. run of importance source...
##  46. run of importance source...
##  47. run of importance source...
##  48. run of importance source...
##  49. run of importance source...
##  50. run of importance source...
##  51. run of importance source...
##  52. run of importance source...
##  53. run of importance source...
##  54. run of importance source...
##  55. run of importance source...
##  56. run of importance source...
##  57. run of importance source...
##  58. run of importance source...
##  59. run of importance source...
##  60. run of importance source...
##  61. run of importance source...
##  62. run of importance source...
##  63. run of importance source...
##  64. run of importance source...
##  65. run of importance source...
##  66. run of importance source...
##  67. run of importance source...
##  68. run of importance source...
##  69. run of importance source...
##  70. run of importance source...
##  71. run of importance source...
##  72. run of importance source...
##  73. run of importance source...
##  74. run of importance source...
##  75. run of importance source...
##  76. run of importance source...
##  77. run of importance source...
##  78. run of importance source...
##  79. run of importance source...
##  80. run of importance source...
##  81. run of importance source...
##  82. run of importance source...
##  83. run of importance source...
##  84. run of importance source...
##  85. run of importance source...
##  86. run of importance source...
##  87. run of importance source...
##  88. run of importance source...
##  89. run of importance source...
##  90. run of importance source...
##  91. run of importance source...
##  92. run of importance source...
##  93. run of importance source...
##  94. run of importance source...
##  95. run of importance source...
##  96. run of importance source...
##  97. run of importance source...
##  98. run of importance source...
##  99. run of importance source...
boruta_titan
## Boruta performed 99 iterations in 1.628624 mins.
##  9 attributes confirmed important: Embarked_Q, Embarked_S, Parch,
## Pclass, SexFemale and 4 more;
##  No attributes deemed unimportant.
##  1 tentative attributes left: Embarked_C;

Surprisingly, none of the features in trainX_trans was rejected: nine attributes were confirmed as important, and only Embarked_C remains tentative.

plot(boruta_titan,las=2)

Because no feature was rejected, the only thing left to do is print out the final data frame of attribute statistics derived from Boruta.

attStats(boruta_titan)
##                 meanImp medianImp     minImp    maxImp  normHits  decision
## Pclass        29.373456 29.410889 26.3415017 33.056757 1.0000000 Confirmed
## SibSp         13.188614 13.179391  9.9398120 15.867781 1.0000000 Confirmed
## Parch          9.258545  9.195480  7.2416953 12.758392 0.9797980 Confirmed
## YJTrans_Fare2 26.320235 26.210472 22.2362680 30.762680 1.0000000 Confirmed
## YJTrans_Age2  21.451385 21.374573 17.9053520 26.006113 1.0000000 Confirmed
## SexMale       24.412133 24.269320 22.2670036 27.175844 1.0000000 Confirmed
## SexFemale     24.461863 24.504033 22.9301114 26.990550 1.0000000 Confirmed
## Embarked_S     9.085314  8.971972  6.0090053 12.977801 0.9797980 Confirmed
## Embarked_C     3.383204  3.483069  0.3167022  6.135968 0.5555556 Tentative
## Embarked_Q     7.286877  7.400099  4.4782788 10.231091 0.9494949 Confirmed
# just to ensure both data sets have the same column names
colnames(validX_trans) <- colnames(trainX_trans)
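
Since Embarked_C is still tentative, Boruta also provides helpers to resolve it automatically rather than judging it by hand. A brief sketch using Boruta’s TentativeRoughFix() and getSelectedAttributes() (not run as part of this analysis):

# Resolve the remaining tentative attribute by comparing its median importance
# with the best shadow attribute, then list the final set of selected features
boruta_titan_fixed <- TentativeRoughFix(boruta_titan)
getSelectedAttributes(boruta_titan_fixed, withTentative = FALSE)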

Apply the same procedure to the log-transformed data set.

# Apply Boruta in log transformed data set
set.seed(141)
boruta_titan_log <- Boruta(Survived~., data=trainX_trans_log, doTrace=2)
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
## After 10 iterations, +10 secs:
##  confirmed 8 attributes: Embarked_S, log_age, Parch, Pclass, SexFemale and 3 more;
##  still have 2 attributes left.
##  11. run of importance source...
##  12. run of importance source...
##  13. run of importance source...
##  14. run of importance source...
## After 14 iterations, +18 secs:
##  confirmed 1 attribute: Embarked_Q;
##  still have 1 attribute left.
##  15. run of importance source...
##  16. run of importance source...
##  17. run of importance source...
##  18. run of importance source...
##  19. run of importance source...
##  20. run of importance source...
##  21. run of importance source...
##  22. run of importance source...
##  23. run of importance source...
##  24. run of importance source...
##  25. run of importance source...
##  26. run of importance source...
##  27. run of importance source...
##  28. run of importance source...
##  29. run of importance source...
##  30. run of importance source...
##  31. run of importance source...
##  32. run of importance source...
##  33. run of importance source...
##  34. run of importance source...
##  35. run of importance source...
##  36. run of importance source...
##  37. run of importance source...
##  38. run of importance source...
##  39. run of importance source...
##  40. run of importance source...
##  41. run of importance source...
##  42. run of importance source...
##  43. run of importance source...
##  44. run of importance source...
##  45. run of importance source...
##  46. run of importance source...
##  47. run of importance source...
##  48. run of importance source...
##  49. run of importance source...
##  50. run of importance source...
##  51. run of importance source...
##  52. run of importance source...
##  53. run of importance source...
##  54. run of importance source...
##  55. run of importance source...
##  56. run of importance source...
##  57. run of importance source...
##  58. run of importance source...
##  59. run of importance source...
##  60. run of importance source...
##  61. run of importance source...
##  62. run of importance source...
##  63. run of importance source...
##  64. run of importance source...
##  65. run of importance source...
##  66. run of importance source...
##  67. run of importance source...
##  68. run of importance source...
##  69. run of importance source...
##  70. run of importance source...
##  71. run of importance source...
##  72. run of importance source...
##  73. run of importance source...
##  74. run of importance source...
##  75. run of importance source...
##  76. run of importance source...
##  77. run of importance source...
##  78. run of importance source...
##  79. run of importance source...
##  80. run of importance source...
##  81. run of importance source...
##  82. run of importance source...
##  83. run of importance source...
##  84. run of importance source...
##  85. run of importance source...
##  86. run of importance source...
##  87. run of importance source...
##  88. run of importance source...
##  89. run of importance source...
##  90. run of importance source...
##  91. run of importance source...
##  92. run of importance source...
##  93. run of importance source...
##  94. run of importance source...
##  95. run of importance source...
##  96. run of importance source...
##  97. run of importance source...
##  98. run of importance source...
##  99. run of importance source...
boruta_titan_log
## Boruta performed 99 iterations in 1.654999 mins.
##  9 attributes confirmed important: Embarked_Q, Embarked_S, log_age,
## Parch, Pclass and 4 more;
##  No attributes deemed unimportant.
##  1 tentative attributes left: Embarked_C;

As with the Yeo-Johnson transformed data, the Boruta run on the log-transformed data set confirms the same nine attributes as important, again leaving only Embarked_C as tentative.

colnames(validX_trans_log) <- colnames(trainX_trans_log)
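
The same attribute statistics could also be inspected for the log-transformed run (output not shown in the original):

attStats(boruta_titan_log)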

Modelling

There is no one-size-fits-all algorithm. To get a better model, I decided to simply run multiple models and possibly ensemble them.
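
As a preview of what “ensembling” could mean here, below is a minimal majority-vote sketch over three hypothetical class-prediction vectors; the actual models are fitted in the sections that follow.

# Hypothetical predictions from three models for four passengers (illustration only)
pred_glm <- factor(c("0", "1", "1", "0"), levels = c("0", "1"))
pred_rf  <- factor(c("0", "1", "0", "0"), levels = c("0", "1"))
pred_svm <- factor(c("1", "1", "1", "0"), levels = c("0", "1"))
# A passenger is predicted to survive if at least two of the three models say so
votes <- (pred_glm == "1") + (pred_rf == "1") + (pred_svm == "1")
factor(ifelse(votes >= 2, "1", "0"), levels = c("0", "1"))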

# First of all, I need to recombine the trainX_ and validX_ data sets so that cross-validation can be run on the full training data

titan_cross_df <- rbind(trainX_trans,validX_trans)
titan_cross_log_df <- rbind(trainX_trans_log,validX_trans_log)

dim(titan_cross_df)
## [1] 891  11
dim(titan_cross_log_df)
## [1] 891  11

1. Logistic Regression

library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
#library(boot)
#library(MASS)
set.seed(909)
train_glm_control <- trainControl(method="cv",number=10)
titan_check_glm <- train(Survived~.,data=titan_cross_df,
                         trControl=train_glm_control,
                         method="glm",
                         family="binomial")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
titan_check_glm_step <- train(Survived~.,data=titan_cross_df,
                              trControl=train_glm_control,
                              method="glmStepAIC",
                              family="binomial")
## Start:  AIC=721.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=721.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=721.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_S     1   701.08 719.08
## - Embarked_C     1   702.63 720.63
## <none>               701.08 721.08
## - YJTrans_Fare2  1   704.57 722.57
## - Parch          1   705.15 723.15
## - Pclass2        1   706.51 724.51
## - SibSp          1   715.48 733.48
## - Pclass3        1   716.17 734.17
## - YJTrans_Age2   1   735.77 753.77
## - SexMale        1   891.71 909.71
## 
## Step:  AIC=719.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_C
## 
##                 Df Deviance    AIC
## <none>               701.08 719.08
## - YJTrans_Fare2  1   704.57 720.57
## - Parch          1   705.20 721.20
## - Embarked_C     1   705.59 721.59
## - Pclass2        1   706.56 722.56
## - SibSp          1   715.53 731.53
## - Pclass3        1   716.18 732.18
## - YJTrans_Age2   1   735.98 751.98
## - SexMale        1   898.82 914.82
## Start:  AIC=711.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=711.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=711.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   691.97 709.97
## - Embarked_S     1   693.35 711.35
## - Parch          1   693.50 711.50
## <none>               691.97 711.97
## - YJTrans_Fare2  1   694.08 712.08
## - Pclass2        1   696.41 714.41
## - SibSp          1   706.19 724.19
## - Pclass3        1   706.45 724.45
## - YJTrans_Age2   1   721.17 739.17
## - SexMale        1   900.52 918.52
## 
## Step:  AIC=709.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Parch          1   693.51 709.51
## <none>               691.97 709.97
## - YJTrans_Fare2  1   694.08 710.08
## - Embarked_S     1   695.41 711.41
## - Pclass2        1   696.42 712.42
## - SibSp          1   706.21 722.21
## - Pclass3        1   706.49 722.49
## - YJTrans_Age2   1   721.18 737.18
## - SexMale        1   902.52 918.52
## 
## Step:  AIC=709.51
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   694.44 708.44
## <none>               693.51 709.51
## - Embarked_S     1   697.24 711.24
## - Pclass2        1   700.26 714.26
## - SibSp          1   707.34 721.34
## - Pclass3        1   715.43 729.43
## - YJTrans_Age2   1   722.39 736.39
## - SexMale        1   903.69 917.69
## 
## Step:  AIC=708.44
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance    AIC
## <none>              694.44 708.44
## - Embarked_S    1   698.34 710.34
## - SibSp         1   710.20 722.20
## - Pclass2       1   710.23 722.23
## - YJTrans_Age2  1   723.80 735.80
## - Pclass3       1   785.32 797.32
## - SexMale       1   916.31 928.31
## Start:  AIC=727.26
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=727.26
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=727.26
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_S     1   707.57 725.57
## - Embarked_C     1   707.65 725.65
## <none>               707.26 727.26
## - Pclass2        1   709.59 727.59
## - YJTrans_Fare2  1   709.82 727.82
## - Parch          1   710.22 728.22
## - SibSp          1   717.16 735.16
## - Pclass3        1   720.08 738.08
## - YJTrans_Age2   1   731.54 749.54
## - SexMale        1   899.33 917.33
## 
## Step:  AIC=725.57
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_C
## 
##                 Df Deviance    AIC
## <none>               707.57 725.57
## - YJTrans_Fare2  1   710.03 726.03
## - Pclass2        1   710.09 726.09
## - Parch          1   710.67 726.67
## - Embarked_C     1   710.67 726.67
## - SibSp          1   717.63 733.63
## - Pclass3        1   720.28 736.28
## - YJTrans_Age2   1   732.29 748.29
## - SexMale        1   907.72 923.72
## Start:  AIC=725.69
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=725.69
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=725.69
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   705.78 723.78
## - Embarked_S     1   706.20 724.20
## - Parch          1   706.97 724.97
## <none>               705.69 725.69
## - YJTrans_Fare2  1   708.48 726.48
## - Pclass2        1   708.72 726.72
## - SibSp          1   715.58 733.58
## - Pclass3        1   719.49 737.49
## - YJTrans_Age2   1   729.17 747.17
## - SexMale        1   905.21 923.21
## 
## Step:  AIC=723.78
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Parch          1   707.05 723.05
## <none>               705.78 723.78
## - Embarked_S     1   708.37 724.37
## - YJTrans_Fare2  1   708.76 724.76
## - Pclass2        1   708.82 724.82
## - SibSp          1   715.87 731.87
## - Pclass3        1   719.70 735.70
## - YJTrans_Age2   1   729.23 745.23
## - SexMale        1   906.62 922.62
## 
## Step:  AIC=723.05
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   708.87 722.87
## <none>               707.05 723.05
## - Embarked_S     1   709.90 723.90
## - Pclass2        1   711.63 725.63
## - SibSp          1   716.92 730.92
## - Pclass3        1   727.09 741.09
## - YJTrans_Age2   1   730.11 744.11
## - SexMale        1   907.62 921.62
## 
## Step:  AIC=722.87
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance    AIC
## <none>              708.87 722.87
## - Embarked_S    1   712.06 724.06
## - SibSp         1   717.32 729.32
## - Pclass2       1   722.32 734.32
## - YJTrans_Age2  1   732.83 744.83
## - Pclass3       1   802.55 814.55
## - SexMale       1   924.10 936.10
## Start:  AIC=714.17
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=714.17
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=714.17
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_S     1   694.33 712.33
## - Embarked_C     1   694.38 712.38
## <none>               694.17 714.17
## - Parch          1   696.66 714.66
## - YJTrans_Fare2  1   697.09 715.09
## - Pclass2        1   699.29 717.29
## - SibSp          1   706.24 724.24
## - Pclass3        1   711.00 729.00
## - YJTrans_Age2   1   731.96 749.96
## - SexMale        1   889.94 907.94
## 
## Step:  AIC=712.33
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   695.91 711.91
## <none>               694.33 712.33
## - Parch          1   696.93 712.93
## - YJTrans_Fare2  1   697.19 713.19
## - Pclass2        1   699.66 715.66
## - SibSp          1   706.55 722.55
## - Pclass3        1   711.07 727.07
## - YJTrans_Age2   1   732.65 748.65
## - SexMale        1   898.26 914.26
## 
## Step:  AIC=711.91
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale
## 
##                 Df Deviance    AIC
## <none>               695.91 711.91
## - Parch          1   698.62 712.62
## - YJTrans_Fare2  1   699.39 713.39
## - Pclass2        1   702.17 716.17
## - SibSp          1   709.76 723.76
## - Pclass3        1   713.26 727.26
## - YJTrans_Age2   1   735.57 749.57
## - SexMale        1   901.72 915.72
## Start:  AIC=725.64
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=725.64
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=725.64
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   705.66 723.66
## - Embarked_S     1   706.52 724.52
## - Parch          1   706.91 724.91
## <none>               705.64 725.64
## - YJTrans_Fare2  1   708.44 726.44
## - Pclass2        1   710.37 728.37
## - SibSp          1   719.59 737.59
## - Pclass3        1   720.70 738.70
## - YJTrans_Age2   1   728.52 746.52
## - SexMale        1   901.00 919.00
## 
## Step:  AIC=723.66
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Parch          1   706.92 722.92
## <none>               705.66 723.66
## - YJTrans_Fare2  1   708.52 724.52
## - Embarked_S     1   708.59 724.59
## - Pclass2        1   710.41 726.41
## - SibSp          1   719.68 735.68
## - Pclass3        1   720.90 736.90
## - YJTrans_Age2   1   728.55 744.55
## - SexMale        1   902.19 918.19
## 
## Step:  AIC=722.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   708.70 722.70
## <none>               706.92 722.92
## - Embarked_S     1   710.17 724.17
## - Pclass2        1   713.18 727.18
## - SibSp          1   720.98 734.98
## - Pclass3        1   727.17 741.17
## - YJTrans_Age2   1   729.00 743.00
## - SexMale        1   903.00 917.00
## 
## Step:  AIC=722.7
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance    AIC
## <none>              708.70 722.70
## - Embarked_S    1   712.18 724.18
## - SibSp         1   722.47 734.47
## - Pclass2       1   724.91 736.91
## - YJTrans_Age2  1   731.51 743.51
## - Pclass3       1   802.07 814.07
## - SexMale       1   915.42 927.42
## Start:  AIC=714.35
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=714.35
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=714.35
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   694.41 712.41
## - Embarked_S     1   694.82 712.82
## - Parch          1   695.45 713.45
## <none>               694.35 714.35
## - YJTrans_Fare2  1   696.54 714.54
## - Pclass2        1   698.53 716.53
## - SibSp          1   708.39 726.39
## - Pclass3        1   709.14 727.14
## - YJTrans_Age2   1   724.42 742.42
## - SexMale        1   891.83 909.83
## 
## Step:  AIC=712.41
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Parch          1   695.49 711.49
## <none>               694.41 712.41
## - Embarked_S     1   696.42 712.42
## - YJTrans_Fare2  1   696.72 712.72
## - Pclass2        1   698.58 714.58
## - SibSp          1   708.51 724.51
## - Pclass3        1   709.30 725.30
## - YJTrans_Age2   1   724.46 740.46
## - SexMale        1   893.90 909.90
## 
## Step:  AIC=711.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   696.84 710.84
## <none>               695.49 711.49
## - Embarked_S     1   697.77 711.77
## - Pclass2        1   701.50 715.50
## - SibSp          1   709.33 723.33
## - Pclass3        1   717.00 731.00
## - YJTrans_Age2   1   725.07 739.07
## - SexMale        1   895.63 909.63
## 
## Step:  AIC=710.84
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance    AIC
## <none>              696.84 710.84
## - Embarked_S    1   699.33 711.33
## - SibSp         1   711.03 723.03
## - Pclass2       1   711.64 723.64
## - YJTrans_Age2  1   727.18 739.18
## - Pclass3       1   788.19 800.19
## - SexMale       1   908.79 920.79
## Start:  AIC=711.47
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=711.47
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=711.47
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   691.55 709.55
## - Embarked_S     1   692.59 710.59
## <none>               691.47 711.47
## - Parch          1   694.28 712.28
## - YJTrans_Fare2  1   694.64 712.64
## - Pclass2        1   696.79 714.79
## - SibSp          1   704.79 722.79
## - Pclass3        1   707.25 725.25
## - YJTrans_Age2   1   723.41 741.41
## - SexMale        1   881.42 899.42
## 
## Step:  AIC=709.55
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## <none>               691.55 709.55
## - Parch          1   694.32 710.32
## - YJTrans_Fare2  1   694.90 710.90
## - Embarked_S     1   696.24 712.24
## - Pclass2        1   696.87 712.87
## - SibSp          1   704.94 720.94
## - Pclass3        1   707.50 723.50
## - YJTrans_Age2   1   723.44 739.44
## - SexMale        1   884.20 900.20
## Start:  AIC=719.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=719.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=719.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   699.50 717.50
## - YJTrans_Fare2  1   699.66 717.66
## - Embarked_S     1   699.76 717.76
## - Parch          1   700.35 718.35
## <none>               699.29 719.29
## - Pclass2        1   705.85 723.85
## - SibSp          1   709.45 727.45
## - Pclass3        1   721.90 739.90
## - YJTrans_Age2   1   731.18 749.18
## - SexMale        1   900.41 918.41
## 
## Step:  AIC=717.5
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   699.95 715.95
## - Parch          1   700.54 716.54
## <none>               699.50 717.50
## - Embarked_S     1   702.67 718.67
## - Pclass2        1   706.11 722.11
## - SibSp          1   709.92 725.92
## - Pclass3        1   722.41 738.41
## - YJTrans_Age2   1   731.40 747.40
## - SexMale        1   901.33 917.33
## 
## Step:  AIC=715.95
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                Df Deviance    AIC
## - Parch         1   700.57 714.57
## <none>              699.95 715.95
## - Embarked_S    1   703.41 717.41
## - SibSp         1   711.83 725.83
## - Pclass2       1   715.12 729.12
## - YJTrans_Age2  1   731.97 745.97
## - Pclass3       1   805.98 819.98
## - SexMale       1   906.07 920.07
## 
## Step:  AIC=714.57
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance    AIC
## <none>              700.57 714.57
## - Embarked_S    1   704.13 716.13
## - SibSp         1   715.48 727.48
## - Pclass2       1   715.63 727.63
## - YJTrans_Age2  1   732.22 744.22
## - Pclass3       1   806.49 818.49
## - SexMale       1   910.38 922.38
## Start:  AIC=712.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=712.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=712.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   692.55 710.55
## - Parch          1   692.80 710.80
## - Embarked_S     1   693.00 711.00
## <none>               692.49 712.49
## - YJTrans_Fare2  1   694.60 712.60
## - Pclass2        1   696.77 714.77
## - Pclass3        1   708.93 726.93
## - SibSp          1   710.36 728.36
## - YJTrans_Age2   1   722.80 740.80
## - SexMale        1   885.11 903.11
## 
## Step:  AIC=710.55
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Parch          1   692.84 708.84
## <none>               692.55 710.55
## - YJTrans_Fare2  1   694.77 710.77
## - Embarked_S     1   694.91 710.91
## - Pclass2        1   696.85 712.85
## - Pclass3        1   709.20 725.20
## - SibSp          1   710.66 726.66
## - YJTrans_Age2   1   722.87 738.87
## - SexMale        1   887.42 903.42
## 
## Step:  AIC=708.84
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   694.80 708.80
## <none>               692.84 708.84
## - Embarked_S     1   695.41 709.41
## - Pclass2        1   698.17 712.17
## - SibSp          1   710.99 724.99
## - Pclass3        1   713.54 727.54
## - YJTrans_Age2   1   722.91 736.91
## - SexMale        1   889.91 903.91
## 
## Step:  AIC=708.8
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance    AIC
## <none>              694.80 708.80
## - Embarked_S    1   697.72 709.72
## - Pclass2       1   709.59 721.59
## - SibSp         1   713.44 725.44
## - YJTrans_Age2  1   725.67 737.67
## - Pclass3       1   790.70 802.70
## - SexMale       1   904.79 916.79
## Start:  AIC=797.11
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=797.11
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=797.11
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance     AIC
## - Embarked_C     1   777.28  795.28
## - Embarked_S     1   777.64  795.64
## - Parch          1   779.05  797.05
## <none>               777.11  797.11
## - YJTrans_Fare2  1   779.68  797.68
## - Pclass2        1   782.03  800.03
## - SibSp          1   791.38  809.38
## - Pclass3        1   794.54  812.54
## - YJTrans_Age2   1   809.85  827.85
## - SexMale        1   995.09 1013.09
## 
## Step:  AIC=795.28
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     YJTrans_Age2 + SexMale + Embarked_S
## 
##                 Df Deviance     AIC
## - Parch          1   779.19  795.19
## <none>               777.28  795.28
## - YJTrans_Fare2  1   780.05  796.05
## - Embarked_S     1   780.39  796.39
## - Pclass2        1   782.23  798.23
## - SibSp          1   791.76  807.76
## - Pclass3        1   794.97  810.97
## - YJTrans_Age2   1   810.00  826.00
## - SexMale        1   996.61 1012.61
## 
## Step:  AIC=795.19
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 + 
##     SexMale + Embarked_S
## 
##                 Df Deviance     AIC
## - YJTrans_Fare2  1   780.49  794.49
## <none>               779.19  795.19
## - Embarked_S     1   782.69  796.69
## - Pclass2        1   786.55  800.55
## - SibSp          1   793.40  807.40
## - Pclass3        1   805.13  819.13
## - YJTrans_Age2   1   811.26  825.26
## - SexMale        1   997.47 1011.47
## 
## Step:  AIC=794.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale + 
##     Embarked_S
## 
##                Df Deviance     AIC
## <none>              780.49  794.49
## - Embarked_S    1   784.24  796.24
## - SibSp         1   795.47  807.47
## - Pclass2       1   797.78  809.78
## - YJTrans_Age2  1   813.15  825.15
## - Pclass3       1   887.71  899.71
## - SexMale       1  1012.13 1024.13
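
The repeated rank-deficient-fit warnings above are expected: SexMale and SexFemale always sum to one, as do the three Embarked dummies, so the design matrix contains perfectly collinear columns and glm cannot estimate one coefficient in each group (the stepwise runs drop Embarked_Q and SexFemale immediately for the same reason). A minimal sketch of how the redundancy could be removed before training (an assumption on my part, not done in the original run):

# Drop one dummy from each perfectly collinear group before refitting
titan_cross_nored <- titan_cross_df %>% select(-c(SexFemale, Embarked_Q))
set.seed(909)
titan_check_glm_nored <- train(Survived ~ ., data = titan_cross_nored,
                               trControl = train_glm_control,
                               method = "glm", family = "binomial")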

Apply the same steps to the log-transformed data set.

set.seed(821)
#train_glm_control <- trainControl(method="cv",number=10)
titan_log_check_glm <- train(Survived~.,data=titan_cross_log_df,
                         trControl=train_glm_control,
                         method="glm",
                         family="binomial")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
titan_log_check_glm_step <- train(Survived~.,data=titan_cross_log_df,
                              trControl=train_glm_control,
                              method="glmStepAIC",
                              family="binomial")
## Start:  AIC=702.9
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=702.9
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=702.9
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_S     1   683.32 701.32
## - Embarked_C     1   683.40 701.40
## <none>               682.90 702.90
## - YJTrans_Fare2  1   685.79 703.79
## - Parch          1   686.51 704.51
## - Pclass2        1   686.99 704.99
## - Pclass3        1   695.52 713.52
## - SibSp          1   702.03 720.03
## - log_age        1   722.48 740.48
## - SexMale        1   893.38 911.38
## 
## Step:  AIC=701.32
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_C
## 
##                 Df Deviance    AIC
## <none>               683.32 701.32
## - YJTrans_Fare2  1   686.15 702.15
## - Parch          1   687.20 703.20
## - Embarked_C     1   687.58 703.58
## - Pclass2        1   687.65 703.65
## - Pclass3        1   695.73 711.73
## - SibSp          1   702.72 718.72
## - log_age        1   723.34 739.34
## - SexMale        1   902.94 918.94
## Start:  AIC=705.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=705.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=705.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   686.12 704.12
## - Embarked_S     1   686.38 704.38
## - YJTrans_Fare2  1   687.12 705.12
## - Parch          1   687.15 705.15
## <none>               685.99 705.99
## - Pclass2        1   691.46 709.46
## - SibSp          1   700.38 718.38
## - Pclass3        1   703.71 721.71
## - log_age        1   718.94 736.94
## - SexMale        1   889.60 907.60
## 
## Step:  AIC=704.12
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Parch          1   687.27 703.27
## - YJTrans_Fare2  1   687.38 703.38
## <none>               686.12 704.12
## - Embarked_S     1   688.38 704.38
## - Pclass2        1   691.59 707.59
## - SibSp          1   700.74 716.74
## - Pclass3        1   704.02 720.02
## - log_age        1   719.12 735.12
## - SexMale        1   890.86 906.86
## 
## Step:  AIC=703.27
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + log_age + 
##     SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   687.77 701.77
## <none>               687.27 703.27
## - Embarked_S     1   689.85 703.85
## - Pclass2        1   694.58 708.58
## - SibSp          1   701.63 715.63
## - Pclass3        1   711.72 725.72
## - log_age        1   719.16 733.16
## - SexMale        1   892.22 906.22
## 
## Step:  AIC=701.77
## .outcome ~ Pclass2 + Pclass3 + SibSp + log_age + SexMale + Embarked_S
## 
##              Df Deviance    AIC
## <none>            687.77 701.77
## - Embarked_S  1   690.48 702.48
## - Pclass2     1   703.15 715.15
## - SibSp       1   705.59 717.59
## - log_age     1   720.81 732.81
## - Pclass3     1   789.39 801.39
## - SexMale     1   903.50 915.50
## Start:  AIC=705.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=705.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=705.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   685.95 703.95
## - Embarked_S     1   686.86 704.86
## <none>               685.92 705.92
## - YJTrans_Fare2  1   688.78 706.78
## - Pclass2        1   690.80 708.80
## - Parch          1   692.98 710.98
## - SibSp          1   699.17 717.17
## - Pclass3        1   700.09 718.09
## - log_age        1   733.16 751.16
## - SexMale        1   900.08 918.08
## 
## Step:  AIC=703.95
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Embarked_S     1   687.83 703.83
## <none>               685.95 703.95
## - YJTrans_Fare2  1   688.79 704.79
## - Pclass2        1   690.83 706.83
## - Parch          1   693.03 709.03
## - SibSp          1   699.19 715.19
## - Pclass3        1   700.09 716.09
## - log_age        1   733.17 749.17
## - SexMale        1   902.19 918.19
## 
## Step:  AIC=703.83
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale
## 
##                 Df Deviance    AIC
## <none>               687.83 703.83
## - YJTrans_Fare2  1   691.09 705.09
## - Pclass2        1   693.78 707.78
## - Parch          1   695.45 709.45
## - Pclass3        1   702.12 716.12
## - SibSp          1   702.76 716.76
## - log_age        1   736.84 750.84
## - SexMale        1   913.35 927.35
## Start:  AIC=715.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=715.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=715.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   695.61 713.61
## - Embarked_S     1   697.13 715.13
## <none>               695.61 715.61
## - Pclass2        1   698.17 716.17
## - YJTrans_Fare2  1   700.32 718.32
## - Parch          1   701.41 719.41
## - Pclass3        1   704.59 722.59
## - SibSp          1   711.67 729.67
## - log_age        1   733.17 751.17
## - SexMale        1   902.99 920.99
## 
## Step:  AIC=713.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## <none>               695.61 713.61
## - Pclass2        1   698.17 714.17
## - Embarked_S     1   699.37 715.37
## - YJTrans_Fare2  1   700.35 716.35
## - Parch          1   701.43 717.43
## - Pclass3        1   704.60 720.60
## - SibSp          1   711.72 727.72
## - log_age        1   733.17 749.17
## - SexMale        1   905.86 921.86
## Start:  AIC=692.81
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=692.81
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=692.81
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_S     1   672.99 690.99
## - Embarked_C     1   673.16 691.16
## <none>               672.81 692.81
## - Parch          1   675.48 693.48
## - YJTrans_Fare2  1   675.93 693.93
## - Pclass2        1   677.11 695.11
## - Pclass3        1   685.56 703.56
## - SibSp          1   691.12 709.12
## - log_age        1   716.75 734.75
## - SexMale        1   893.86 911.86
## 
## Step:  AIC=690.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_C
## 
##                 Df Deviance    AIC
## <none>               672.99 690.99
## - Embarked_C     1   675.10 691.10
## - Parch          1   675.76 691.76
## - YJTrans_Fare2  1   676.03 692.03
## - Pclass2        1   677.48 693.48
## - Pclass3        1   685.64 701.64
## - SibSp          1   691.50 707.50
## - log_age        1   717.35 733.35
## - SexMale        1   903.52 919.52
## Start:  AIC=719.65
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=719.65
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=719.65
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   699.67 717.67
## - Embarked_S     1   700.54 718.54
## <none>               699.65 719.65
## - YJTrans_Fare2  1   701.96 719.96
## - Parch          1   702.99 720.99
## - Pclass2        1   704.86 722.86
## - SibSp          1   713.14 731.14
## - Pclass3        1   714.62 732.62
## - log_age        1   744.59 762.59
## - SexMale        1   889.67 907.67
## 
## Step:  AIC=717.67
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - Embarked_S     1   701.63 717.63
## <none>               699.67 717.67
## - YJTrans_Fare2  1   701.96 717.96
## - Parch          1   703.02 719.02
## - Pclass2        1   704.86 720.86
## - SibSp          1   713.14 729.14
## - Pclass3        1   714.65 730.65
## - log_age        1   744.60 760.60
## - SexMale        1   893.89 909.89
## 
## Step:  AIC=717.63
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale
## 
##                 Df Deviance    AIC
## <none>               701.63 717.63
## - YJTrans_Fare2  1   704.40 718.40
## - Parch          1   705.44 719.44
## - Pclass2        1   707.74 721.74
## - Pclass3        1   716.44 730.44
## - SibSp          1   717.35 731.35
## - log_age        1   748.51 762.51
## - SexMale        1   903.95 917.95
## Start:  AIC=704.98
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=704.98
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=704.98
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   685.04 703.04
## - YJTrans_Fare2  1   685.38 703.38
## - Embarked_S     1   685.61 703.61
## <none>               684.98 704.98
## - Parch          1   687.10 705.10
## - Pclass2        1   691.77 709.77
## - SibSp          1   697.12 715.12
## - Pclass3        1   705.79 723.79
## - log_age        1   729.14 747.14
## - SexMale        1   883.19 901.19
## 
## Step:  AIC=703.04
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## - YJTrans_Fare2  1   685.49 701.49
## <none>               685.04 703.04
## - Parch          1   687.14 703.14
## - Embarked_S     1   687.70 703.70
## - Pclass2        1   691.86 707.86
## - SibSp          1   697.35 713.35
## - Pclass3        1   706.07 722.07
## - log_age        1   729.30 745.30
## - SexMale        1   885.70 901.70
## 
## Step:  AIC=701.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + log_age + SexMale + 
##     Embarked_S
## 
##              Df Deviance    AIC
## - Parch       1   687.15 701.15
## <none>            685.49 701.49
## - Embarked_S  1   688.35 702.35
## - SibSp       1   699.76 713.76
## - Pclass2     1   701.70 715.70
## - log_age     1   730.10 744.10
## - Pclass3     1   788.03 802.03
## - SexMale     1   889.93 903.93
## 
## Step:  AIC=701.15
## .outcome ~ Pclass2 + Pclass3 + SibSp + log_age + SexMale + Embarked_S
## 
##              Df Deviance    AIC
## <none>            687.15 701.15
## - Embarked_S  1   690.28 702.28
## - Pclass2     1   703.01 715.01
## - SibSp       1   705.77 717.77
## - log_age     1   730.21 742.21
## - Pclass3     1   789.21 801.21
## - SexMale     1   893.88 905.88
## Start:  AIC=708.25
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=708.25
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=708.25
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   688.40 706.40
## - Embarked_S     1   688.68 706.68
## <none>               688.25 708.25
## - Pclass2        1   691.10 709.10
## - YJTrans_Fare2  1   691.69 709.69
## - Parch          1   692.39 710.39
## - Pclass3        1   700.70 718.70
## - SibSp          1   705.84 723.84
## - log_age        1   727.32 745.32
## - SexMale        1   892.35 910.35
## 
## Step:  AIC=706.4
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## <none>               688.40 706.40
## - Embarked_S     1   690.91 706.91
## - Pclass2        1   691.27 707.27
## - YJTrans_Fare2  1   692.08 708.08
## - Parch          1   692.51 708.51
## - Pclass3        1   701.04 717.04
## - SibSp          1   706.28 722.28
## - log_age        1   727.59 743.59
## - SexMale        1   893.21 909.21
## Start:  AIC=707.07
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=707.07
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=707.07
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   687.39 705.39
## - Embarked_S     1   687.48 705.48
## <none>               687.07 707.07
## - YJTrans_Fare2  1   689.30 707.30
## - Parch          1   690.16 708.16
## - Pclass2        1   690.27 708.27
## - Pclass3        1   700.16 718.16
## - SibSp          1   700.38 718.38
## - log_age        1   723.67 741.67
## - SexMale        1   900.13 918.13
## 
## Step:  AIC=705.39
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## <none>               687.39 705.39
## - YJTrans_Fare2  1   689.83 705.83
## - Parch          1   690.33 706.33
## - Pclass2        1   690.65 706.65
## - Embarked_S     1   690.66 706.66
## - Pclass3        1   700.92 716.92
## - SibSp          1   701.14 717.14
## - log_age        1   723.97 739.97
## - SexMale        1   901.49 917.49
## Start:  AIC=707.45
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=707.45
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=707.45
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance    AIC
## - Embarked_C     1   687.55 705.55
## - Embarked_S     1   688.02 706.02
## <none>               687.45 707.45
## - YJTrans_Fare2  1   690.14 708.14
## - Pclass2        1   690.86 708.86
## - Parch          1   691.74 709.74
## - Pclass3        1   701.30 719.30
## - SibSp          1   702.47 720.47
## - log_age        1   732.57 750.57
## - SexMale        1   882.76 900.76
## 
## Step:  AIC=705.55
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance    AIC
## <none>               687.55 705.55
## - Embarked_S     1   690.31 706.31
## - YJTrans_Fare2  1   690.46 706.46
## - Pclass2        1   690.96 706.96
## - Parch          1   691.83 707.83
## - Pclass3        1   701.51 717.51
## - SibSp          1   702.82 718.82
## - log_age        1   732.84 748.84
## - SexMale        1   884.41 900.41
## Start:  AIC=784.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C + 
##     Embarked_Q
## 
## 
## Step:  AIC=784.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + SexFemale + Embarked_S + Embarked_C
## 
## 
## Step:  AIC=784.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S + Embarked_C
## 
##                 Df Deviance     AIC
## - Embarked_C     1   764.38  782.38
## - Embarked_S     1   764.94  782.94
## <none>               764.29  784.29
## - YJTrans_Fare2  1   766.97  784.97
## - Parch          1   768.25  786.25
## - Pclass2        1   768.95  786.95
## - Pclass3        1   779.88  797.88
## - SibSp          1   781.19  799.19
## - log_age        1   809.85  827.85
## - SexMale        1   992.87 1010.87
## 
## Step:  AIC=782.38
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 + 
##     log_age + SexMale + Embarked_S
## 
##                 Df Deviance     AIC
## <none>               764.38  782.38
## - YJTrans_Fare2  1   767.22  783.22
## - Embarked_S     1   767.34  783.34
## - Parch          1   768.30  784.30
## - Pclass2        1   769.06  785.06
## - Pclass3        1   780.19  796.19
## - SibSp          1   781.52  797.52
## - log_age        1   810.00  826.00
## - SexMale        1   994.81 1010.81

Check the results of the cross-validated logistic regression models.

titan_check_glm
## Generalized Linear Model 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 801, 802, 802, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7979401  0.5679377
titan_check_glm_step
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8002747  0.5738776
titan_log_check_glm
## Generalized Linear Model 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 801, 802, 802, 801, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8080706  0.5876776
titan_log_check_glm_step
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 801, 802, 803, 802, 802, 802, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8114683  0.5947568

In short, stepwise logistic regression with the log-transformed Age feature is the most accurate of the four logistic models I built, at roughly 81.1% cross-validated accuracy.
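To see which predictors the stepwise procedure actually kept, the fitted glm stored inside the caret object can be inspected directly. A minimal sketch, assuming the train objects printed above (this call was not run as part of the analysis):

# Hedged sketch: the finalModel slot of a caret "glmStepAIC" fit is an ordinary
# glm object, so the retained formula and coefficients can be read off.
formula(titan_log_check_glm_step$finalModel)        # predictors kept by stepAIC
coef(summary(titan_log_check_glm_step$finalModel))  # estimates, SEs, z, p-values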

2. Decision Tree

In this project, I use the C5.0 decision tree algorithm, the successor of C4.5, which improves on it in speed, memory use, and (through optional boosting and rule-based models) accuracy.

# load library

library(C50)
## Warning: package 'C50' was built under R version 4.1.2
set.seed(123)
# define train control
train_control <- trainControl(method="cv", number=7)
#train model
titan_check_c50tree <- train(Survived~.,data=titan_cross_df,trControl=train_control,
                             method="C5.0")
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
titan_check_c50tree_log <- train(Survived~.,data=titan_cross_log_df,
                                 trControl=train_control,
                                 method="C5.0")
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
# Result of C5.0  Cross Validation
titan_check_c50tree
## C5.0 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (7 fold) 
## Summary of sample sizes: 765, 764, 763, 763, 764, 764, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8248351  0.6106760
##   rules  FALSE   10      0.8226380  0.6145652
##   rules  FALSE   20      0.8147995  0.5999015
##   rules   TRUE    1      0.8068545  0.5727792
##   rules   TRUE   10      0.8012387  0.5676018
##   rules   TRUE   20      0.7934529  0.5535953
##   tree   FALSE    1      0.8259687  0.6129549
##   tree   FALSE   10      0.8215223  0.6127888
##   tree   FALSE   20      0.8136565  0.5965768
##   tree    TRUE    1      0.8057384  0.5705426
##   tree    TRUE   10      0.8000875  0.5702632
##   tree    TRUE   20      0.8046137  0.5735554
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = tree and winnow
##  = FALSE.
titan_check_c50tree_log
## C5.0 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (7 fold) 
## Summary of sample sizes: 764, 764, 763, 764, 763, 764, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8270968  0.6211079
##   rules  FALSE   10      0.8260071  0.6206052
##   rules  FALSE   20      0.8137040  0.5963872
##   rules   TRUE    1      0.8282217  0.6254257
##   rules   TRUE   10      0.8181155  0.6039522
##   rules   TRUE   20      0.8192755  0.6070621
##   tree   FALSE    1      0.8248559  0.6162159
##   tree   FALSE   10      0.8192843  0.6083743
##   tree   FALSE   20      0.8215252  0.6091254
##   tree    TRUE    1      0.8271056  0.6228232
##   tree    TRUE   10      0.8248998  0.6189354
##   tree    TRUE   20      0.8170522  0.6038817
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
##  = TRUE.

The C5.0 decision tree is more accurate than the logistic regression models, and the tree trained on the log-transformed data set is the best so far, at roughly 82.8% cross-validated accuracy. The tuning profile can also be visualised directly from the train object, as sketched below.
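A minimal sketch of that visualisation, assuming the caret objects fitted above (not run here): caret's plot method on a train object shows how the cross-validated accuracy changes across the tuning grid (boosting trials, tree vs. rules, winnowing).

# Hedged sketch: CV accuracy across the C5.0 tuning grid
plot(titan_check_c50tree_log)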

3. Random Forest

As the name suggests, Random Forest (RF) is a collection of decision trees. It is a common ensemble method that aggregates the results of multiple predictors: each tree is trained on a bootstrap sample of the original data (bagging), a random subset of predictors is considered at each split, and the final prediction is the majority vote of the trees. The result is less interpretable than a single tree but usually more accurate.

set.seed(761)
train_control_rf <- trainControl(method="cv", number=10)
#train model
titan_check_rf <- train(Survived~.,data=titan_cross_df,
                             trControl=train_control_rf,
                             method="rf")

titan_log_check_rf <- train(Survived~.,data=titan_cross_log_df,
                             trControl=train_control_rf,
                             method="rf")

Let’s look at the cross-validated accuracy of both random forest models.

titan_check_rf
## Random Forest 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 803, 802, 802, 801, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8136281  0.5862745
##    6    0.8203825  0.6111680
##   11    0.8057627  0.5805231
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
titan_log_check_rf
## Random Forest 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 802, 801, 802, 802, 803, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8237782  0.6081292
##    6    0.8102820  0.5927915
##   11    0.8068485  0.5882175
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Surprisingly, both random forest models (on the Yeo-Johnson and the log-transformed data sets) score lower than the C5.0 tree trained on the log-transformed data, although they still perform well compared with most of the other models.
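As a quick sanity check on what the forest is relying on, caret's varImp() can be applied to the fitted train object. A minimal sketch, assuming the objects above (not part of the original run):

# Hedged sketch: variable importance from the random forest fit
varImp(titan_log_check_rf)        # relative importance of each predictor
plot(varImp(titan_log_check_rf))  # dotplot of the same ranking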

4. K-Nearest Neighbour (KNN)

The K-Nearest Neighbour (KNN) algorithm predicts the class of a point from the classes of its k nearest neighbours in feature space. Pre-processing matters here because it directly affects the distance calculations. Unlike the previous models, KNN has no fitted equation and little descriptive ability.

Here, the parameter ‘k’ needs to be chosen carefully: a value that is too small makes the model sensitive to noise (high variance), while a value that is too large over-smooths the decision boundary (high bias).
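Because the distances depend on the predictor scales, one common refinement is to let caret centre and scale the predictors and to search a wider grid of k values. A hedged sketch of such a call (the runs below use caret's defaults instead, and the object name titan_knn_scaled is an assumption):

# Hedged sketch, not the call used below: standardise predictors and tune k explicitly
titan_knn_scaled <- train(Survived ~ ., data = titan_cross_log_df,
                          method     = "knn",
                          preProcess = c("center", "scale"),
                          tuneGrid   = data.frame(k = seq(3, 21, by = 2)),
                          trControl  = trainControl(method = "cv", number = 10))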

set.seed(721)
train_control_knn <- trainControl(method="cv", number=10)
#train model
titan_check_knn <- train(Survived~.,data=titan_cross_df,
                             trControl=train_control_knn,
                             method="knn")

titan_log_check_knn <- train(Survived~.,data=titan_cross_log_df,
                             trControl=train_control_knn,
                             method="knn")

Check the results of both KNN models.

titan_check_knn
## k-Nearest Neighbors 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 801, 802, 802, 801, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8114062  0.5902591
##   7  0.8001822  0.5650622
##   9  0.8159511  0.5994137
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
titan_log_check_knn
## k-Nearest Neighbors 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 803, 802, 801, 802, 801, 802, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8180842  0.6070636
##   7  0.8102818  0.5903405
##   9  0.8102562  0.5851296
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

Still, neither KNN model beats the accuracy of the C5.0 model.

After running all of these models, we can conclude that the cross-validated accuracy ranges from roughly 79% to 83%, that the log-transformed Age feature generally gives higher accuracy, and that the C5.0 decision tree is probably the most suitable model for this data. Therefore, I will build the final Titanic passenger survival classifier with C5.0. A quick side-by-side comparison of the cross-validated models is sketched below.
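One convenient way to make that comparison explicit is caret's resamples() helper, which collects the cross-validation results of several train objects. A minimal sketch using the 10-fold models fitted above (the C5.0 runs used 7 folds, so they cannot be pooled into the same object; this code was not part of the original run):

# Hedged sketch: side-by-side summary of the 10-fold CV models fitted above
model_list <- list(glm      = titan_log_check_glm,
                   glm_step = titan_log_check_glm_step,
                   rf       = titan_log_check_rf,
                   knn      = titan_log_check_knn)
cv_results <- resamples(model_list)
summary(cv_results)   # accuracy and kappa distribution per model
bwplot(cv_results)    # boxplots of the resampled accuracy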

library(C50)
# Build model
titan_treec50_fit <- C5.0(trainX_trans[,-1],trainX_trans[,1],rules = TRUE)
titan_treec50_fit
## 
## Call:
## C5.0.default(x = trainX_trans[, -1], y = trainX_trans[, 1], rules = TRUE)
## 
## Rule-Based Model
## Number of samples: 686 
## Number of predictors: 10 
## 
## Number of Rules: 7 
## 
## Non-standard options: attempt to group attributes
summary(titan_treec50_fit)
## 
## Call:
## C5.0.default(x = trainX_trans[, -1], y = trainX_trans[, 1], rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Nov 28 04:03:10 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 686 cases (11 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (19/1, lift 1.5)
##  SibSp > 2
##  SexMale > 0
##  ->  class 0  [0.905]
## 
## Rule 2: (420/67, lift 1.4)
##  YJTrans_Age2 > -1.278483
##  SexMale > 0
##  ->  class 0  [0.839]
## 
## Rule 3: (263/51, lift 1.3)
##  Pclass = 3
##  Embarked_S > 0
##  ->  class 0  [0.804]
## 
## Rule 4: (129/7, lift 2.4)
##  Pclass in {1, 2}
##  SexMale <= 0
##  ->  class 1  [0.939]
## 
## Rule 5: (10, lift 2.4)
##  Parch > 0
##  YJTrans_Fare2 <= 0.2986848
##  SexMale <= 0
##  Embarked_S > 0
##  ->  class 1  [0.917]
## 
## Rule 6: (18/1, lift 2.3)
##  SibSp <= 2
##  YJTrans_Age2 <= -1.278483
##  SexMale > 0
##  ->  class 1  [0.900]
## 
## Rule 7: (77/9, lift 2.3)
##  SexMale <= 0
##  Embarked_S <= 0
##  ->  class 1  [0.873]
## 
## Default class: 0
## 
## 
## Evaluation on training data (686 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       7  103(15.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     407    16    (a): class 0
##      87   176    (b): class 1
## 
## 
##  Attribute usage:
## 
##   91.25% SexMale
##   63.85% YJTrans_Age2
##   57.14% Pclass
##   50.15% Embarked_S
##    5.39% SibSp
##    1.46% Parch
##    1.46% YJTrans_Fare2
## 
## 
## Time: 0.0 secs

Do the same for the log-transformed data set.

titan_log_treec50_fit <- C5.0(trainX_trans_log[,-1],trainX_trans_log[,1],rules = TRUE)
titan_log_treec50_fit
## 
## Call:
## C5.0.default(x = trainX_trans_log[, -1], y = trainX_trans_log[, 1], rules
##  = TRUE)
## 
## Rule-Based Model
## Number of samples: 686 
## Number of predictors: 10 
## 
## Number of Rules: 7 
## 
## Non-standard options: attempt to group attributes
summary(titan_log_treec50_fit)
## 
## Call:
## C5.0.default(x = trainX_trans_log[, -1], y = trainX_trans_log[, 1], rules
##  = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Nov 28 04:03:10 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 686 cases (11 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (19/1, lift 1.5)
##  SibSp > 2
##  SexMale > 0
##  ->  class 0  [0.905]
## 
## Rule 2: (420/67, lift 1.4)
##  log_age > 2.564949
##  SexMale > 0
##  ->  class 0  [0.839]
## 
## Rule 3: (263/51, lift 1.3)
##  Pclass = 3
##  Embarked_S > 0
##  ->  class 0  [0.804]
## 
## Rule 4: (129/7, lift 2.4)
##  Pclass in {1, 2}
##  SexMale <= 0
##  ->  class 1  [0.939]
## 
## Rule 5: (10, lift 2.4)
##  Parch > 0
##  YJTrans_Fare2 <= 0.2986848
##  SexMale <= 0
##  Embarked_S > 0
##  ->  class 1  [0.917]
## 
## Rule 6: (18/1, lift 2.3)
##  SibSp <= 2
##  log_age <= 2.564949
##  SexMale > 0
##  ->  class 1  [0.900]
## 
## Rule 7: (77/9, lift 2.3)
##  SexMale <= 0
##  Embarked_S <= 0
##  ->  class 1  [0.873]
## 
## Default class: 0
## 
## 
## Evaluation on training data (686 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       7  103(15.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     407    16    (a): class 0
##      87   176    (b): class 1
## 
## 
##  Attribute usage:
## 
##   91.25% SexMale
##   63.85% log_age
##   57.14% Pclass
##   50.15% Embarked_S
##    5.39% SibSp
##    1.46% Parch
##    1.46% YJTrans_Fare2
## 
## 
## Time: 0.0 secs
# Predict the value in validation data set
titan_treec50_predict <- predict(titan_treec50_fit,validX_trans)
titan_log_treec50_predict <- predict(titan_log_treec50_fit,validX_trans_log)
titan_treec50_predict
##   [1] 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0
##  [38] 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1
##  [75] 1 1 0 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0
## [112] 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1
## [149] 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
## [186] 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0
## Levels: 0 1

Plot the decision trees. The models are refitted without rules = TRUE, since only tree-structured C5.0 fits can be plotted.

titan_treec50_plot <- C5.0(trainX_trans[,-1],trainX_trans[,1])
plot(titan_treec50_plot)

titan_log_treec50_plot <- C5.0(trainX_trans_log[,-1],trainX_trans_log[,1])
plot(titan_log_treec50_plot)

Finally, let’s evaluate the predictions against the validation labels.

confusionMatrix(validX_trans[,1],titan_treec50_predict)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 111  15
##          1  23  56
##                                           
##                Accuracy : 0.8146          
##                  95% CI : (0.7546, 0.8654)
##     No Information Rate : 0.6537          
##     P-Value [Acc > NIR] : 2.742e-07       
##                                           
##                   Kappa : 0.6012          
##                                           
##  Mcnemar's Test P-Value : 0.2561          
##                                           
##             Sensitivity : 0.8284          
##             Specificity : 0.7887          
##          Pos Pred Value : 0.8810          
##          Neg Pred Value : 0.7089          
##              Prevalence : 0.6537          
##          Detection Rate : 0.5415          
##    Detection Prevalence : 0.6146          
##       Balanced Accuracy : 0.8085          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(validX_trans_log[,1],titan_log_treec50_predict)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 111  15
##          1  23  56
##                                           
##                Accuracy : 0.8146          
##                  95% CI : (0.7546, 0.8654)
##     No Information Rate : 0.6537          
##     P-Value [Acc > NIR] : 2.742e-07       
##                                           
##                   Kappa : 0.6012          
##                                           
##  Mcnemar's Test P-Value : 0.2561          
##                                           
##             Sensitivity : 0.8284          
##             Specificity : 0.7887          
##          Pos Pred Value : 0.8810          
##          Neg Pred Value : 0.7089          
##              Prevalence : 0.6537          
##          Detection Rate : 0.5415          
##    Detection Prevalence : 0.6146          
##       Balanced Accuracy : 0.8085          
##                                           
##        'Positive' Class : 0               
## 

Conclusion

Based on the feature analysis with the Boruta algorithm, all features (Pclass, SibSp, Parch, transformed Age, SexMale, SexFemale, Embarked_S, Embarked_Q, Embarked_C) are significantly associated with the target variable. The C5.0 model confirms this: of the 10 features supplied, 7 appear in the decision rules, and the remaining 3 (SexFemale, Embarked_Q, Embarked_C) are the complements of features already used, so they are effectively represented by those 7. The first determining factor is sex. For male passengers, the next factor is age: a boy whose transformed Age2 value is below -1.278 and who travels with no more than two siblings or a spouse has a markedly higher predicted chance of survival. For female passengers, the next determining features are passenger class and port of embarkation: women travelling in first or second class, women who did not embark at Southampton, and Southampton passengers travelling with parents or children (at moderate fares) all have high predicted chances of survival.

Using these decision rules, the model classifies Titanic passengers with about 85% accuracy on the training data and roughly 81.5% accuracy on the held-out validation set.
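For completeness, the fitted C5.0 model can also be applied to the Kaggle test set. The sketch below is hypothetical: it assumes the test data have been put through the same dummy-coding and transformation steps as the training data and stored in a data frame called testX_trans (that name is an assumption, not an object created above).

# Hedged sketch: score the pre-processed test set and write a Kaggle submission.
# testX_trans is a hypothetical data frame with the same columns as trainX_trans[,-1].
test_pred <- predict(titan_treec50_fit, testX_trans)
submission <- data.frame(PassengerId = titan_test$PassengerId,
                         Survived    = as.integer(as.character(test_pred)))
write.csv(submission, "submission.csv", row.names = FALSE)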