library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.9
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
On April 14, 1912, the RMS Titanic struck an iceberg in the North Atlantic Ocean and sank. Of the 2,224 people on board, only 706 survived.
The goal of this exercise is to predict survivors on the Titanic based on nine input variables, described below. We are provided two datasets:
1. train.csv, containing 891 records
2. test.csv, containing 418 records.
The two datasets are provided with the intent that models are formulated using the train dataset and model performance is evaluated on the test dataset.
Load both train and test daata set
setwd("C:/Users/Galih Dwika Putra R/Documents/project_learning/kaggle/titanic")
titan_train <- read.csv("train.csv",sep=",")
titan_test <- read.csv("test.csv",sep=",")
head(titan_train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
head(titan_test)
## PassengerId Pclass Name Sex Age
## 1 892 3 Kelly, Mr. James male 34.5
## 2 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0
## 3 894 2 Myles, Mr. Thomas Francis male 62.0
## 4 895 3 Wirz, Mr. Albert male 27.0
## 5 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0
## 6 897 3 Svensson, Mr. Johan Cervin male 14.0
## SibSp Parch Ticket Fare Cabin Embarked
## 1 0 0 330911 7.8292 Q
## 2 1 0 363272 7.0000 S
## 3 0 0 240276 9.6875 Q
## 4 0 0 315154 8.6625 S
## 5 1 1 3101298 12.2875 S
## 6 0 0 7538 9.2250 S
In training data set, we have 12 features; 11 as predictor features and 1 features, Survived, as target feature. We will explore these features to know the distribution and statistic of each parameters.
Suvived is the only target feature. it is a binary variable, 1 indicated for survived and 0 for not. Because it’s internal characteristic as binary, it would be better to convert it first into factor data type
# transfrom Survived as factor
titan_train$Survived <- as.factor(titan_train$Survived)
str(titan_train$Survived)
## Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
print(paste("Missing value : ",sum(is.na(titan_train$Survived))),sep="")
## [1] "Missing value : 0"
print("Statistic summary of Survived : ")
## [1] "Statistic summary of Survived : "
print(summary(titan_train$Survived))
## 0 1
## 549 342
# create mode function
getmode <- function(x){
uniqx <- unique(x)
uniqx[which.max(tabulate(match(x,uniqx)))]
}
# get mode of Survived
print(paste("Mode value : ",getmode(titan_train$Survived)),sep="")
## [1] "Mode value : 0"
There are no missing value in Survived feature. It has complete 891 obs consist of 549 not survived and 342 survived passangers, and mode value of 0.
# Visualize Survived
titan_train %>% ggplot(aes(x=Survived)) +
geom_bar(fill=c("blue","red"))+
ggtitle("Titanic Survival Count")+
scale_x_discrete(labels=c("Not Survived","Survived"))+
theme(plot.title = element_text(hjust=0.5))+
geom_text(aes(label=..count..),stat="count",vjust=1.5)
Pclass is a featured of numeric value that indicate the cabin class of the passangers. There are 3 class here, with 3 as the lowest class and 1 as the highest or the luxurious one.
Because of its internal characteristic that indicate some kind of quality order(1 to 3 has different quality hierarchy; 1 is better than 2, and 2 is better than 3), it would be better to transform it as factor variable
# transform Pclass as factor
titan_train$Pclass <- as.factor(titan_train$Pclass)
# Don't forget to do the same to test data set
titan_test$Pclass <- as.factor(titan_test$Pclass)
# Check for missing value in train and test data set
print(paste("Missing value in train data set:",sum(is.na(titan_train$Pclass))),sep="")
## [1] "Missing value in train data set: 0"
print(paste("Missing value in test data set: ",
sum(is.na(titan_test$Pclass))),sep="")
## [1] "Missing value in test data set: 0"
print("Summary of train data set")
## [1] "Summary of train data set"
summary(titan_train$Pclass)
## 1 2 3
## 216 184 491
print("Summary of test data set")
## [1] "Summary of test data set"
summary(titan_test$Pclass)
## 1 2 3
## 107 93 218
print(paste("Mode value of Pclass train data:",getmode(titan_train$Pclass)),sep=" ")
## [1] "Mode value of Pclass train data: 3"
print(paste("Mode value of Pclass test data:",getmode(titan_test$Pclass)),sep=" ")
## [1] "Mode value of Pclass test data: 3"
We got basically same characteristic of Pclass from train and test data. Class 3 is a the most frequent value (which is make sense), followed by class 1, and class 2 as the least frequent value. Both data set has 0 missing value.
# Visualize Pclass from training
titan_train %>% ggplot(aes(x=Pclass))+
geom_bar(fill=c("green","dark blue","purple"))+
ggtitle("Titanic Cabin Class Distribution in Training Data")+
scale_x_discrete(labels=c("First Class","Second Class","Third Class"))+
geom_text(aes(label=..count..),stat="count",vjust=1.5,color="white")+
theme(plot.title = element_text(hjust=0.5))
titan_test %>% ggplot(aes(x=Pclass))+
geom_bar(fill=c("green","dark blue","purple"))+
ggtitle("Titanic Cabin Class Distribution in Test Data")+
scale_x_discrete(labels=c("First Class","Second Class","Third Class"))+
geom_text(aes(label=..count..),stat="count",vjust=1.5,color="white")+
theme(plot.title = element_text(hjust=0.5))
The following chart shows that around 52.63% of casualty comes from passangers in third and second class, with relatively high difference compared to its pair, survived passangers from the same cabin class. It may indicate some correlation between these Pclass and Survived features.
titan_train %>% ggplot(aes(x=Pclass,fill=Survived))+
geom_bar()+
ggtitle("Survivability by Cabin Class Distribution in training")+
scale_x_discrete(labels=c("First Class","Second Class","Third Class"))+
scale_fill_discrete(labels=c("Not Survived","Survived"))+
geom_text(aes(label=..count..),stat="count",position = position_stack(0.5))+
theme(plot.title = element_text(hjust=0.5))
The name of passangers. In a glance, there is not many insight we can extract from this feature.
head(titan_train$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
But, it contains a title, and it maybe can be considered usefull. The name formatting is like this: “Last name, title(.) First and Middle Name” let’s extract all titles from this feature and mutate a new feature called ‘titlePass’
# Check for missing value
print(paste("Missing value of Name in training:",sum(is.na(titan_train$Name))),sep=" ")
## [1] "Missing value of Name in training: 0"
# It has zero missing value
# Mutate new feature
titan_train <- titan_train %>%
mutate(titlePass=str_extract(titan_train$Name,
'[a-zA-Z]+(?=\\.)'))
titan_train %>% ggplot(aes(x=fct_infreq(titlePass)))+
geom_bar(fill="magenta")+
xlab("Title of Passanger")+
ggtitle("Passanger Title Distribution in training data")+
geom_text(aes(label=..count..),stat="count",vjust=-0.1)+
theme(axis.text.x = element_text(angle=45))+
theme(plot.title = element_text(hjust=0.5))
The most frequent title in training data is ‘Mr’, followed by Miss, then Mrs, Master, and the rest. From this data alone, we can estimate that huge percentage of passanger is male. Test data set has similiar distribution, and also zero missing value.
# Check for missing value
print(paste("Missing value of Name in training:",sum(is.na(titan_test$Name))),sep=" ")
## [1] "Missing value of Name in training: 0"
# It has zero missing value
# Mutate new feature
titan_test <- titan_test %>%
mutate(titlePass=str_extract(titan_test$Name,
'[a-zA-Z]+(?=\\.)'))
titan_test %>% ggplot(aes(x=fct_infreq(titlePass)))+
geom_bar(fill="magenta")+
xlab("Title of Passanger")+
ggtitle("Passanger Title Distribution in test data")+
geom_text(aes(label=..count..),stat="count",vjust=-0.1)+
theme(axis.text.x = element_text(angle=45))+
theme(plot.title = element_text(hjust=0.5))
A binary feature of passangers; either Male or Female. We will transform it into factor
# Convert Sex Feature into factor data
titan_train$Sex <- as.factor(titan_train$Sex)
print(paste("Missing value of Sex feature : ",
sum(is.na(titan_train$Sex))),sep=" ")
## [1] "Missing value of Sex feature : 0"
titan_train %>% ggplot(aes(x=Sex))+
geom_bar(fill=c("magenta","purple"))+
ggtitle("Passanger Sex Distribution")+
xlab("Sex")+
geom_text(aes(label=..count..),stat="count",vjust=1.5)
This chart confirm our estimation from title in name variable, that in titanic training data, male is the dominant sex. Now, we wi
titan_train %>% ggplot(aes(x=Sex,fill=Survived))+
geom_bar()+
xlab("Sex")+
ggtitle("Survivability by Sex in Training Data")+
geom_text(aes(label=..count..),stat="count",position = position_stack(0.5))+
scale_fill_discrete(labels=c("Not Survived","Survived"))
Test data set has similiar Sex feature distribution.
print(paste("Missing value of Sex in Test Data:",
sum(is.na(titan_test$Sex))),sep=" ")
## [1] "Missing value of Sex in Test Data: 0"
titan_test %>% ggplot(aes(x=Sex))+
geom_bar(fill=c("chocolate","coral"))+
xlab("Sex")+
ggtitle("Survivability by Sex in Test Data")+
geom_text(aes(label=..count..),stat="count",position = position_stack(0.5))
NOTE : Because of its internal quality, we can not make Sex feature as a simple factor. Remember, factor will treat each category as if they are ordinal, start from 0,1,2, etc. So, with current convention, it’s as if we create an hierarchy that female (0) is lower or less important than male (1), numerically. This could distort our model later in order to make prediction. The solution is to mutate two dummy features, one is binary feature for Male and the other is binary feature for female
No further definition.
summary(titan_train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
print(paste("Missing value of age feature in training:",
sum(is.na(titan_train$Age))))
## [1] "Missing value of age feature in training: 177"
titan_test %>% ggplot(aes(x=Age))+
geom_histogram(aes(y=..density..),binwidth = 5,fill="darkslateblue",
color="cyan1")+
geom_density(color="red",lwd=1)+
ggtitle("Age Distribution in Training Data")
## Warning: Removed 86 rows containing non-finite values (stat_bin).
## Warning: Removed 86 rows containing non-finite values (stat_density).
Freqpoly and histogram plot indicate that the age distribution is centered around 20-30 years old, and presummably right-skewed, and has 177 missing values (we will deal with it later). To prove the skewness, i use skewness test in R. If the coefficient of skewness is more than 0, i can confirm that Age is right skewed
library(moments)
skewness(titan_train$Age,na.rm = TRUE)# don't forget to add na.rm,
## [1] 0.3882899
# because we still have missing value
The skewness of age feature in training data set is 0.3882899, which is more than 0. So, our age feature in training data set is confirmend and prove to be right-skewed, where the majority of data is less than its mean.
For age in test data set,
print(summary(titan_test$Age))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.17 21.00 27.00 30.27 39.00 76.00 86
print(paste("Missing value of age feature in training:",
sum(is.na(titan_test$Age))))
## [1] "Missing value of age feature in training: 86"
print(skewness(titan_test$Age,na.rm=TRUE))#0.4552923
## [1] 0.4552923
titan_test %>% ggplot(aes(x=Age))+
geom_histogram(aes(y=..density..),binwidth = 5,fill="darkslateblue",
color="cyan1")+
geom_density(color="red",lwd=1)+
ggtitle("Age Distribution in Test Data")
## Warning: Removed 86 rows containing non-finite values (stat_bin).
## Warning: Removed 86 rows containing non-finite values (stat_density).
Both training and test data set, has similiar age distribution around 20-30 years old. Test data set has 86 missing value, and skewness of 0.4552923 (Positive or Right Skewed). Even the percentile of training and test data set are almost similiar, with small difference.
For further analysis, here is more statistic about Age in training and test data set
# Training data
print(paste("Standard Deviation of Age in Training Data:",
sd(titan_train$Age,na.rm=TRUE),sep=" "))
## [1] "Standard Deviation of Age in Training Data: 14.526497332334"
print(paste("IQR of Age in Training Data:",
IQR(titan_train$Age,na.rm=TRUE),sep=" "))
## [1] "IQR of Age in Training Data: 17.875"
print(paste("Median Absolute Deviation of Age in Training Data:",
mad(titan_train$Age,na.rm=TRUE),sep=" "))
## [1] "Median Absolute Deviation of Age in Training Data: 13.3434"
print("----------------",end="\n\n")
## [1] "----------------"
# Test data
print(paste("Standard Deviation of Age in Test Data:",
sd(titan_test$Age,na.rm=TRUE),sep=" "))
## [1] "Standard Deviation of Age in Test Data: 14.1812092356244"
print(paste("IQR of Age in Test Data:",
IQR(titan_test$Age,na.rm=TRUE),sep=" "))
## [1] "IQR of Age in Test Data: 18"
print(paste("Median Absolute Deviation of Age in Test Data:",
mad(titan_test$Age,na.rm=TRUE),sep=" "))
## [1] "Median Absolute Deviation of Age in Test Data: 11.8608"
Again, these two data set seemingly similiar statistic. Now, to proceed, let’s try compare Age data with Survived status.
titan_train %>% ggplot(aes(x=Survived,y=Age))+
geom_violin(trim=FALSE,aes(fill=Survived))+
scale_fill_manual(values=c("cyan1","coral"))+
scale_color_manual(values = c("red","magenta"))+
stat_summary(fun=median, geom="point",size=3)+
ggtitle("Violin Plot of Age by Survived in Training Data")+
scale_x_discrete(labels=c("Not Survived","Survived"))
## Warning: Removed 177 rows containing non-finite values (stat_ydensity).
## Warning: Removed 177 rows containing non-finite values (stat_summary).
It seems there are small difference of median between Age in Survived group and Not Survived group. We will analyze it later.
The number of Sibling and Spouse bring along by passangers. This is litle tricky, actually. Let’s say Mr A and Mrs B is a spouse. They both have the same surname, and their SibSp feature would have same value of 1. So, not only that it’s ambigous in definition, it also confusing in numeric. The solution for this feature are :
1. We need to split between sibling and spouse
2. transform it to dummy variables
We will deal with this later. For now, i will continue to explore the data as its original form.
print(summary(titan_train$SibSp))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.523 1.000 8.000
print(paste("Missing value of SibSp in training data:",
sum(is.na(titan_train$SibSp))))
## [1] "Missing value of SibSp in training data: 0"
print("---------------------------",end="\n")
## [1] "---------------------------"
print(summary(titan_test$SibSp))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4474 1.0000 8.0000
print(paste("Missing value of SibSp in test data:",
sum(is.na(titan_test$SibSp))))
## [1] "Missing value of SibSp in test data: 0"
From above summary, at least we can get information about minimal and maximum number of sibling or spouse brought by passangers. Both have missing value. The statistical summary also indicate that SibSp data, both in training and test data set, are right skewed. (mean > median).
To make it clear, let’s visualize this data
SibSp_tab <- table(titan_train$SibSp)
sibSp_df <- data.frame("SibSp"=names(SibSp_tab),"Count"=SibSp_tab)
sibSp_df %>% ggplot(aes(x=SibSp,y=Count.Freq))+
geom_col(aes(fill=Count.Freq))+
ggtitle("Sibling and Spouse Distribution in Training Data")+
geom_text(aes(label=Count.Freq),vjust=-0.1)
SibSp_tab2 <- table(titan_test$SibSp)
sibSp_df2 <- data.frame("SibSp"=names(SibSp_tab2),"Count"=SibSp_tab2)
sibSp_df2 %>% ggplot(aes(x=SibSp,y=Count.Freq))+
geom_col(aes(fill=Count.Freq))+
ggtitle("Sibling and Spouse Distribution in Test Data")+
geom_text(aes(label=Count.Freq),vjust=-0.1)
They both have similiar distribution, where people who travel alone in Titanic is the most frequent type of passanger. In term of survivability,
titan_train %>% ggplot(aes(x=SibSp))+
geom_freqpoly(aes(color=Survived),position = position_dodge2())+
ggtitle("Frequency of Survived by SibSp in Training Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Parch
Parch is same as SibSp, it is a feature that indicate the number of parent and children brought along by passangers. Again, it’s an ambigous variable, and the solution for it is same as the solution we will apply to SibSp.
print(summary(titan_train$Parch))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3816 0.0000 6.0000
print(paste("Missing value of Parch from training data:",
sum(is.na(titan_train$Parch))),sep=" ")
## [1] "Missing value of Parch from training data: 0"
print(paste("Standard deviation of Parch from training data:",
sd(titan_train$Parch)),sep=" ")
## [1] "Standard deviation of Parch from training data: 0.806057221129948"
print(paste("The IQR of Parch from training data:",
IQR(titan_train$Parch)),sep=" ")
## [1] "The IQR of Parch from training data: 0"
print(paste("Median Absolute Deviation of Parch from training data:",
mad(titan_train$Parch)),sep=" ")
## [1] "Median Absolute Deviation of Parch from training data: 0"
print(paste("The range of Parch from training data:",
range(titan_train$Parch)[2]),sep=" ")
## [1] "The range of Parch from training data: 6"
print("------------------------------------------",end="\n")
## [1] "------------------------------------------"
print(summary(titan_test$Parch))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3923 0.0000 9.0000
print(paste("Missing value of Parch from training data:",
sum(is.na(titan_test$Parch))),sep=" ")
## [1] "Missing value of Parch from training data: 0"
print(paste("Standard deviation of Parch from training data:",
sd(titan_test$Parch)),sep=" ")
## [1] "Standard deviation of Parch from training data: 0.981428878537169"
print(paste("The IQR of Parch from training data:",
IQR(titan_test$Parch)),sep=" ")
## [1] "The IQR of Parch from training data: 0"
print(paste("Median Absolute Deviation of Parch from training data:",
mad(titan_test$Parch)),sep=" ")
## [1] "Median Absolute Deviation of Parch from training data: 0"
print(paste("The range of Parch from training data:",
range(titan_test$Parch)[2]),sep=" ")
## [1] "The range of Parch from training data: 9"
Both training and test data set, has similiar distribution for Parch. They have 0 missing value, right-skewed (median, mode < mean), the same median, and similiar mean.
parch_tab <- table(titan_train$Parch)
parch_df <- data.frame(parch_tab)
parch_df %>% ggplot(aes(x=Var1,y=Freq))+
geom_col(aes(fill=Freq))+
ggtitle("Survivability by Parch Distribution in Test Data")+
geom_text(aes(label=Freq),vjust=-0.1)
parch_tab2 <- table(titan_test$Parch)
parch_df2 <- data.frame(parch_tab2)
parch_df2 %>% ggplot(aes(x=Var1,y=Freq))+
geom_col(aes(fill=Freq))+
ggtitle("Survivability by Parch Distribution in Test Data")+
geom_text(aes(label=Freq),vjust=-0.1)
More than 75% of Titanic passangers in training and test dataset did not brought along their parent or children, similiar to the insight we extract from SibSp feature.
This is a feature that displays the ticket number. As we can see in the glimpse, each ticket do not have some kind fixed pattern for us to analyze. There is actually an effort to dechipering the prefix on ticket. Some prefix is easy to understand as departed harbour, and for the ticket numeric, it easy to consider the first number as indicator for cabin class, but the rest, the information about cabin number, on which deck, etc are still mystery. This is due the fact that Titanic ticket was sold through multiple agency, not centered one. So each agency has their own ticket formating, and this is understandable, as the administration system are still lag behind at that point of time.
print(paste("Missing value of Ticket in training data:",
sum(is.na(titan_train$Ticket))),sep=" ")
## [1] "Missing value of Ticket in training data: 0"
print(paste("Missing value of Ticket in Test data:",
sum(is.na(titan_test$Ticket))),sep=" ")
## [1] "Missing value of Ticket in Test data: 0"
For this analysis, i will not indulge myself to research this feature. Let’s keep it simple, by drop this feature later.
Fare is a feature that display Titanic’s ticket fare when its sold. The number itself comes in pound-shilling. So, 7.25 means 7 pounds and 25 shilling.
summary(titan_train$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.91 14.45 32.20 31.00 512.33
print(paste("Missing value of Fare in Training data:",
sum(is.na(titan_train$Fare))),sep=" ")
## [1] "Missing value of Fare in Training data: 0"
print(paste("Skewness of Fare in Training data:",
skewness(titan_train$Fare)),sep=" ")
## [1] "Skewness of Fare in Training data: 4.77925329237236"
print(paste("Standard Deviation of Fare in Training data:",
sd(titan_train$Fare)),sep=" ")
## [1] "Standard Deviation of Fare in Training data: 49.6934285971809"
print(paste("Median Absolute Deviation of Fare from Training data:",
mad(titan_train$Fare)),sep=" ")
## [1] "Median Absolute Deviation of Fare from Training data: 10.23616692"
print(paste("IQR of Fare from Training data:",
IQR(titan_train$Fare)),sep=" ")
## [1] "IQR of Fare from Training data: 23.0896"
print("--------------------------------------------",end="\n")
## [1] "--------------------------------------------"
summary(titan_test$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 7.896 14.454 35.627 31.500 512.329 1
print(paste("Missing value of Fare in Test data:",
sum(is.na(titan_test$Fare))),sep=" ")
## [1] "Missing value of Fare in Test data: 1"
print(paste("Skewness of Fare in Test data:",
skewness(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "Skewness of Fare in Test data: 3.67393667584391"
print(paste("Standard Deviation of Fare in Test data:",
sd(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "Standard Deviation of Fare in Test data: 55.9075761799738"
print(paste("Median Absolute Deviation of Fare from Test data:",
mad(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "Median Absolute Deviation of Fare from Test data: 10.118745"
print(paste("IQR of Fare from Test data:",
IQR(titan_test$Fare,na.rm=TRUE)),sep=" ")
## [1] "IQR of Fare from Test data: 23.6042"
There are 1 missing value of Fare from Test data set. But, overall statistic of both data are same, more or less. The only significant different is fare of test data has larger variance compare to training data set. Furthermore, both of them is right-skewed.
titan_train %>% ggplot(aes(x=Fare))+
geom_histogram(aes(y=..density..),fill="coral",color="black")+
geom_density(color="brown",lwd=1)+
ggtitle("Fare Distribution of Training Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From histogram above, there is strong indication of outlier in Fare training data set. We will analyze this in next section. The same distribution also display in test data set.
titan_test %>% ggplot(aes(x=Fare))+
geom_histogram(aes(y=..density..),fill="coral",color="black")+
geom_density(color="brown",lwd=1)+
ggtitle("Fare Distribution of Test Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).
titan_train %>% ggplot(aes(x=Survived,y=Fare))+
geom_boxplot(aes(fill=Survived))+
scale_color_manual(values=c("coral","cyan2"))+
ggtitle("Survivability by Fare in Training")+
scale_x_discrete(labels=c("Not Survived","Survived"))
The histogram also confirm the existence of outlier in Fare of training data set, and we also can see that the fare median of those who suvived is slightly higher than those who not make it. This is another interesting insight, and this also may have some correlation with the fact that passanger from class 1 has higher propability to survived.
This feature is series of cabin number of each passangers. There are prefix of alphabet followed by series of numbers.
head(titan_train$Cabin)
## [1] "" "C85" "" "C123" "" ""
summary(titan_train$Cabin)
## Length Class Mode
## 891 character character
print(paste("Missing value of Cabin in training:",
sum(is.na(titan_train$Cabin)),"(",
round((sum(is.na(titan_train$Cabin))/length(titan_train$Cabin))*100,2),"%)",sep=""))
## [1] "Missing value of Cabin in training:0(0%)"
# Define a function to convert blank value to NA
na_conv <- function(x){
return(mutate_all(x,~na_if(.,"")))
}
Let’s try to convert training data
# Original data
for(col in names(titan_train)){
print(paste("Missing value of",
col,
sum(is.na(titan_train[[col]]))),sep=" ")
}
## [1] "Missing value of PassengerId 0"
## [1] "Missing value of Survived 0"
## [1] "Missing value of Pclass 0"
## [1] "Missing value of Name 0"
## [1] "Missing value of Sex 0"
## [1] "Missing value of Age 177"
## [1] "Missing value of SibSp 0"
## [1] "Missing value of Parch 0"
## [1] "Missing value of Ticket 0"
## [1] "Missing value of Fare 0"
## [1] "Missing value of Cabin 0"
## [1] "Missing value of Embarked 0"
## [1] "Missing value of titlePass 0"
# Convert blank value to NA
titan_train <- na_conv(titan_train)
for(col in names(titan_test)){
print(paste("Missing value of",
col,
sum(is.na(titan_train[[col]]))),sep=" ")
}
## [1] "Missing value of PassengerId 0"
## [1] "Missing value of Pclass 0"
## [1] "Missing value of Name 0"
## [1] "Missing value of Sex 0"
## [1] "Missing value of Age 177"
## [1] "Missing value of SibSp 0"
## [1] "Missing value of Parch 0"
## [1] "Missing value of Ticket 0"
## [1] "Missing value of Fare 0"
## [1] "Missing value of Cabin 687"
## [1] "Missing value of Embarked 2"
## [1] "Missing value of titlePass 0"
# Convert titan_test missing value to NA
titan_test <- na_conv(titan_test)
head(titan_test$Cabin)
## [1] NA NA NA NA NA NA
summary(titan_test$Cabin)
## Length Class Mode
## 418 character character
print(paste("Missing value of Cabin in training:",
sum(is.na(titan_test$Cabin)),"(",
round((sum(is.na(titan_test$Cabin))/length(titan_test$Cabin))*100,2),"%)",sep=""))
## [1] "Missing value of Cabin in training:327(78.23%)"
In short, this feature is useless. Huge percentage of missing value is unfixable. Any technique of imputation would only generate estimate at best, and it can’t be trusted as valid predictor in large porpotion. Nevertheles, i will try to conduct some analysis by using Pclass and Fare.
To be fair, this feature indeed has a potential. Its alphabet prefix indicate the deck where passangers side in. Titanic has limited safeboat available in certain deck. So, this position will able to indicate whether the certain deck has significant effect to survivability.
Embarked is feature about which city port the passangers departed from. It symbolize with abbrevation of each city : “S” for Southampton, “C” for Cherbourg, and “Q” for Queenstown. Here we will analyze it as factor data.
titan_train$Embarked <- as.factor(titan_train$Embarked)
head(titan_train$Embarked)
## [1] S C S S S Q
## Levels: C Q S
summary(titan_train$Embarked)
## C Q S NA's
## 168 77 644 2
print(paste("Missing value of Embarked in training data;",
sum(is.na(titan_train$Embarked))),sep=" ")
## [1] "Missing value of Embarked in training data; 2"
titan_train %>% ggplot(aes(x=Embarked))+
geom_bar(fill=c("brown","cyan1","darksalmon","azure2"))+
ggtitle("Embarked Distribution in Training Data")+
geom_text(aes(label=..count..),stat="count",vjust=0.5)
For test data set,
titan_test$Embarked <- as.factor(titan_test$Embarked)
head(titan_test$Embarked)
## [1] Q S Q S S S
## Levels: C Q S
summary(titan_test$Embarked)
## C Q S
## 102 46 270
print(paste("Missing value of Embarked in training data;",
sum(is.na(titan_test$Embarked))),sep=" ")
## [1] "Missing value of Embarked in training data; 0"
titan_test %>% ggplot(aes(x=Embarked))+
geom_bar(fill=c("brown","cyan1","darksalmon"))+
ggtitle("Embarked Distribution in Test Data")+
geom_text(aes(label=..count..),stat="count",vjust=0.5)
In surival term, here
titan_train %>% ggplot(aes(x=Embarked,fill=Survived))+
geom_bar()+
ggtitle("Survivability by Embarked Port in Traning Data")+
geom_text(aes(label=..count..),
stat="count",position=position_stack())
The most frequent port is Southampton, and this is understandable
considering Southampton, England, was the port where Titanic set sails
in April, 10 1912. Ironically, as shown in the plot, this is also the
port where most dead passangers came from.
There are 2 missing values in training data, and 0 missing value in test data. The distribution is similiar.
As best practice, to prevent data leakage and in-sample bias, before we proceed to cleaning and pre-processing data, let’s split training data set into train and validation data set. So, from here on, we would deal with three separate data; training, validation, and testing.
# Load library caTools
#install.packages("caTools")
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
# Split data training into train and validation with ratio 80:20
set.seed(123)
idx <- sample.split(titan_train,SplitRatio=0.8)
train_X <- titan_train[idx,]
valid_X <- titan_train[!idx,]
dim(train_X) # 686 observations in train data set
## [1] 686 13
dim(valid_X) # 205 observations in valid data set
## [1] 205 13
Let’s recalculate how much missing value in every features of data set
for(col in names(train_X)){
print(paste("Missing value in",col,":",sum(is.na(train_X[[col]]))),sep=" ")
}
## [1] "Missing value in PassengerId : 0"
## [1] "Missing value in Survived : 0"
## [1] "Missing value in Pclass : 0"
## [1] "Missing value in Name : 0"
## [1] "Missing value in Sex : 0"
## [1] "Missing value in Age : 131"
## [1] "Missing value in SibSp : 0"
## [1] "Missing value in Parch : 0"
## [1] "Missing value in Ticket : 0"
## [1] "Missing value in Fare : 0"
## [1] "Missing value in Cabin : 526"
## [1] "Missing value in Embarked : 1"
## [1] "Missing value in titlePass : 0"
for(col in names(valid_X)){
print(paste("Missing value in",col,":",sum(is.na(valid_X[[col]]))),sep=" ")
}
## [1] "Missing value in PassengerId : 0"
## [1] "Missing value in Survived : 0"
## [1] "Missing value in Pclass : 0"
## [1] "Missing value in Name : 0"
## [1] "Missing value in Sex : 0"
## [1] "Missing value in Age : 46"
## [1] "Missing value in SibSp : 0"
## [1] "Missing value in Parch : 0"
## [1] "Missing value in Ticket : 0"
## [1] "Missing value in Fare : 0"
## [1] "Missing value in Cabin : 161"
## [1] "Missing value in Embarked : 1"
## [1] "Missing value in titlePass : 0"
for(col in names(titan_test)){
print(paste("Missing value in",col,":",sum(is.na(titan_test[[col]]))),sep=" ")
}
## [1] "Missing value in PassengerId : 0"
## [1] "Missing value in Pclass : 0"
## [1] "Missing value in Name : 0"
## [1] "Missing value in Sex : 0"
## [1] "Missing value in Age : 86"
## [1] "Missing value in SibSp : 0"
## [1] "Missing value in Parch : 0"
## [1] "Missing value in Ticket : 0"
## [1] "Missing value in Fare : 1"
## [1] "Missing value in Cabin : 327"
## [1] "Missing value in Embarked : 0"
## [1] "Missing value in titlePass : 0"
For train_X and valid_X, the features with missing values are Age,Cabin, and Embarked. Technique to replace missing value of data is called imputation. There are many kinds of imputation technique, such as mean imputation, media imputation, mode imputation, regression imputation, backfill, fowardfill, etc.Here i will explain my imputation strategy for each features.
To impute Age feature, i utilize some necessary information i can obtain from the data set. Here, we already have ‘titlePass’ feature, and we already know the distribution of it. For your information, title is not just some random honorific for people, it has some kind of rule based on culture and ethic. “Mister” or “Mr” is a title for male who has already reach to adulthood or a male person over 18. “Master”, on the contrary, is a title for male under 18. “Miss” is a title used for underage or/and unmarried woman (under 30) “Mrs” for married woman “Ms” is neutral married title. Used if we are not sure about the person’s marriage status.
# Check titlePass
unique(train_X$titlePass)
## [1] "Mr" "Mrs" "Miss" "Master" "Rev" "Dr"
## [7] "Mme" "Ms" "Major" "Sir" "Col" "Mlle"
## [13] "Countess" "Jonkheer"
There are other title like Mme, Don, Rev, Lady, Sir, Mlle, Col, Major, Capt, Countess, Jonkheer.
Mme (Madame) is a French title for woman in general. Don is title used in Spain and Hispanic America, it is a title for master of a household Rev is a title for member of Christian clergy or official worker of the church Mlle is a french title specifically for young and unmarried woman. Lady, and Countess are noble title for a woman and usually already married to another noble. Jonkheer is a noble title from Dutch rank system.
From precede explanation, it is make sense to try impute Age by using its correlation with titlePass, but i found it’s not enough. Let’s say i can estimate passangers age from their title, like Miss would be female under 18, and Mrs would be married female, how about Ms? How can i deduce such neutral marriage title?
For this, i decide to conduct analysis on passangers surname. In Western culture, especially in Eestern Europe, women tend to take their husband’s family name after married. in fact, majority of modern peoples still conduct this historical practice. Cited from BBC,2020, according to a 2016 survey, around 85% of those aged between 18 and 30 in US saying they still follow the practice. For British women, the figure is almost 90%. The data also stated that these figures are lower than they were a generation ago. Based on this study, i decide that surname is a reasonable variable to use for estimate and impute age variable.
# First, we need to mutate surname feature
head(train_X$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Moran, Mr. James"
## [6] "McCarthy, Mr. Timothy J"
# Name formatting is :
# LastName, Title (.) FirstName MiddleName
# The maiden name is written inside a bracket
train_X <- train_X %>% mutate(surname = str_extract(Name,".+(?=,.+)"))
head(train_X)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 6 6 0 3
## 7 7 0 1
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 6 Moran, Mr. James male NA 0 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## Ticket Fare Cabin Embarked titlePass surname
## 1 A/5 21171 7.2500 <NA> S Mr Braund
## 2 PC 17599 71.2833 C85 C Mrs Cumings
## 3 STON/O2. 3101282 7.9250 <NA> S Miss Heikkinen
## 4 113803 53.1000 C123 S Mrs Futrelle
## 6 330877 8.4583 <NA> Q Mr Moran
## 7 17463 51.8625 E46 S Mr McCarthy
Before further analysis, Let’s check that whether our title feature has mistake or not.
train_X %>% group_by(titlePass,Sex) %>%
summarise(count=n())
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 3
## # Groups: titlePass [14]
## titlePass Sex count
## <chr> <fct> <int>
## 1 Col male 1
## 2 Countess female 1
## 3 Dr female 1
## 4 Dr male 5
## 5 Jonkheer male 1
## 6 Major male 1
## 7 Master male 31
## 8 Miss female 131
## 9 Mlle female 1
## 10 Mme female 1
## 11 Mr male 407
## 12 Mrs female 98
## 13 Ms female 1
## 14 Rev male 5
## 15 Sir male 1
The table above confirm that our title indeed divide passangers by their gender/sex. No missplaced here. Now, let’s check if our title feature also able to divide passangers’s age
train_X %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 6
## # Groups: titlePass [14]
## titlePass Sex count median min max
## <chr> <fct> <int> <dbl> <dbl> <dbl>
## 1 Col male 1 60 60 60
## 2 Countess female 1 33 33 33
## 3 Dr female 1 49 49 49
## 4 Dr male 5 38 23 54
## 5 Jonkheer male 1 38 38 38
## 6 Major male 1 45 45 45
## 7 Master male 31 4 0.67 12
## 8 Miss female 131 22 0.75 63
## 9 Mlle female 1 24 24 24
## 10 Mme female 1 24 24 24
## 11 Mr male 407 29 11 80
## 12 Mrs female 98 36 14 63
## 13 Ms female 1 28 28 28
## 14 Rev male 5 42 27 57
## 15 Sir male 1 49 49 49
Now, we seems to found some missplace of data here.
First of all, at summary of title “Miss”; as we know, “Miss” is a western title for young unmarried woman under 30, but the data shows us that the range of passangers with title “Miss” is reach up to 58, culturally, this seems not make sense. Let’s say that indeed there are one or two women who still not tied to marrieage at that age, the more appropriate title would be “Ms”.
Second, it appears that there were several underage male passengers who were given the title “Mr”, which indicate a missplace.
At last, in title “Mrs”, the minimum age is 14. If we were going to judge it according to modern social value and law, it is unlikely to happens,and potentially is a missplace. But we talk about early 19th here, so i decide to dig deeper before make any decision.
# Check missplace for passangers with title "Miss"
quantile(train_X[which(train_X$titlePass=="Miss"),"Age"],c(0.25,0.75),na.rm=TRUE)
## 25% 75%
## 16 30
The quantile of Miss is fit to my explanation about the the title. 75% women under 30 is being entitled by “Miss”.
# get the index of passangers with title Miss over 30 years old
over30_Miss <- which(train_X$titlePass=="Miss" & train_X$Age>30)
train_X[c(over30_Miss),"titlePass"] <- "Ms"
train_X[c(over30_Miss),]
## PassengerId Survived Pclass Name Sex Age
## 12 12 1 1 Bonnell, Miss. Elizabeth female 58.0
## 62 62 1 1 Icard, Miss. Amelie female 38.0
## 124 124 1 2 Webber, Miss. Susan female 32.5
## 178 178 0 1 Isham, Miss. Ann Elizabeth female 50.0
## 196 196 1 1 Lurette, Miss. Elise female 58.0
## 212 212 1 2 Cameron, Miss. Clear Annie female 35.0
## 259 259 1 1 Ward, Miss. Anna female 35.0
## 270 270 1 1 Bissette, Miss. Amelia female 35.0
## 276 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0
## 277 277 0 3 Lindblom, Miss. Augusta Charlotta female 45.0
## 319 319 1 1 Wick, Miss. Mary Natalie female 31.0
## 326 326 1 1 Young, Miss. Marie Grice female 36.0
## 338 338 1 1 Burns, Miss. Elizabeth Margaret female 41.0
## 347 347 1 2 Smith, Miss. Marion Elsie female 40.0
## 358 358 0 2 Funk, Miss. Annie Clemmer female 38.0
## 381 381 1 1 Bidois, Miss. Rosalie female 42.0
## 397 397 0 3 Olsson, Miss. Elina female 31.0
## 413 413 1 1 Minahan, Miss. Daisy E female 33.0
## 459 459 1 2 Toomey, Miss. Ellen female 50.0
## 497 497 1 1 Eustis, Miss. Elizabeth Mussey female 54.0
## 504 504 0 3 Laitinen, Miss. Kristina Sofia female 37.0
## 527 527 1 2 Ridsdale, Miss. Lucy female 50.0
## 610 610 1 1 Shutes, Miss. Elizabeth W female 40.0
## 717 717 1 1 Endres, Miss. Caroline Louise female 38.0
## 768 768 0 3 Mangan, Miss. Mary female 30.5
## 836 836 1 1 Compton, Miss. Sara Rebecca female 39.0
## SibSp Parch Ticket Fare Cabin Embarked titlePass surname
## 12 0 0 113783 26.5500 C103 S Ms Bonnell
## 62 0 0 113572 80.0000 B28 <NA> Ms Icard
## 124 0 0 27267 13.0000 E101 S Ms Webber
## 178 0 0 PC 17595 28.7125 C49 C Ms Isham
## 196 0 0 PC 17569 146.5208 B80 C Ms Lurette
## 212 0 0 F.C.C. 13528 21.0000 <NA> S Ms Cameron
## 259 0 0 PC 17755 512.3292 <NA> C Ms Ward
## 270 0 0 PC 17760 135.6333 C99 S Ms Bissette
## 276 1 0 13502 77.9583 D7 S Ms Andrews
## 277 0 0 347073 7.7500 <NA> S Ms Lindblom
## 319 0 2 36928 164.8667 C7 S Ms Wick
## 326 0 0 PC 17760 135.6333 C32 C Ms Young
## 338 0 0 16966 134.5000 E40 C Ms Burns
## 347 0 0 31418 13.0000 <NA> S Ms Smith
## 358 0 0 237671 13.0000 <NA> S Ms Funk
## 381 0 0 PC 17757 227.5250 <NA> C Ms Bidois
## 397 0 0 350407 7.8542 <NA> S Ms Olsson
## 413 1 0 19928 90.0000 C78 Q Ms Minahan
## 459 0 0 F.C.C. 13531 10.5000 <NA> S Ms Toomey
## 497 1 0 36947 78.2667 D20 C Ms Eustis
## 504 0 0 4135 9.5875 <NA> S Ms Laitinen
## 527 0 0 W./C. 14258 10.5000 <NA> S Ms Ridsdale
## 610 0 0 PC 17582 153.4625 C125 S Ms Shutes
## 717 0 0 PC 17757 227.5250 C45 C Ms Endres
## 768 0 0 364850 7.7500 <NA> Q Ms Mangan
## 836 1 1 PC 17756 83.1583 E49 C Ms Compton
Proceed to Mr,
quantile(train_X[which(train_X$titlePass=="Mr"),"Age"],c(0.25,0.75),na.rm=TRUE)
## 25% 75%
## 23 39
The quantile of Mr title shows us that 50% of the passangers titled “Mr” are aged around 22 to 40 years old. Back to the definition we stated before, “Mr” is a title for a men whose already reach adulthood. So, our only concern is which underage male passanger who titled “Mr”
# Filter underage male passangers with title "Mr"
under18_Mr <- which(train_X$titlePass=="Mr" & train_X$Age<18)
train_X[c(under18_Mr),"titlePass"] <- "Master"
train_X[c(under18_Mr),]
## PassengerId Survived Pclass Name Sex Age SibSp
## 87 87 0 3 Ford, Mr. William Neal male 16 1
## 139 139 0 3 Osen, Mr. Olaf Elon male 16 0
## 221 221 1 3 Sunderland, Mr. Victor Francis male 16 0
## 267 267 0 3 Panula, Mr. Ernesti Arvid male 16 4
## 283 283 0 3 de Pelsmaeker, Mr. Alfons male 16 0
## 334 334 0 3 Vander Planke, Mr. Leo Edmondus male 16 2
## 353 353 0 3 Elias, Mr. Tannous male 15 1
## 501 501 0 3 Calic, Mr. Petar male 17 0
## 533 533 0 3 Elias, Mr. Joseph Jr male 17 1
## 575 575 0 3 Rush, Mr. Alfred George John male 16 0
## 722 722 0 3 Jensen, Mr. Svend Lauritz male 17 1
## 732 732 0 3 Hassan, Mr. Houssein G N male 11 0
## 747 747 0 3 Abbott, Mr. Rossmore Edward male 16 1
## 792 792 0 2 Gaskell, Mr. Alfred male 16 0
## 842 842 0 2 Mudd, Mr. Thomas Charles male 16 0
## 845 845 0 3 Culumovic, Mr. Jeso male 17 0
## Parch Ticket Fare Cabin Embarked titlePass surname
## 87 3 W./C. 6608 34.3750 <NA> S Master Ford
## 139 0 7534 9.2167 <NA> S Master Osen
## 221 0 SOTON/OQ 392089 8.0500 <NA> S Master Sunderland
## 267 1 3101295 39.6875 <NA> S Master Panula
## 283 0 345778 9.5000 <NA> S Master de Pelsmaeker
## 334 0 345764 18.0000 <NA> S Master Vander Planke
## 353 1 2695 7.2292 <NA> C Master Elias
## 501 0 315086 8.6625 <NA> S Master Calic
## 533 1 2690 7.2292 <NA> C Master Elias
## 575 0 A/4. 20589 8.0500 <NA> S Master Rush
## 722 0 350048 7.0542 <NA> S Master Jensen
## 732 0 2699 18.7875 <NA> C Master Hassan
## 747 1 C.A. 2673 20.2500 <NA> S Master Abbott
## 792 0 239865 26.0000 <NA> S Master Gaskell
## 842 0 S.O./P.P. 3 10.5000 <NA> S Master Mudd
## 845 0 315090 8.6625 <NA> S Master Culumovic
Now, let’s deal with title “Mrs”.
under30_Mrs <- which(train_X$titlePass=="Mrs" & train_X$Age < 30)
train_X[c(under30_Mrs),c("Name","Age","Sex")]
## Name
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
## 10 Nasser, Mrs. Nicholas (Adele Achem)
## 42 Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
## 54 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)
## 67 Nye, Mrs. (Elizabeth Ramell)
## 134 Weisz, Mrs. Leopold (Mathilde Francoise Pede)
## 143 Hakkarainen, Mrs. Pekka Pietari (Elin Matilda Dolck)
## 152 Pears, Mrs. Thomas (Edith Wearne)
## 248 Hamalainen, Mrs. William (Anna)
## 256 Touma, Mrs. Darwis (Hanne Youssef Razi)
## 292 Bishop, Mrs. Dickinson H (Helen Walton)
## 308 Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)
## 313 Lahtinen, Mrs. William (Anna Sylfven)
## 324 Caldwell, Mrs. Albert Francis (Sylvia Mae Harbaugh)
## 400 Trout, Mrs. William H (Jessie L)
## 438 Richards, Mrs. Sidney (Emily Hocking)
## 474 Jerwan, Mrs. Amin S (Marie Marthe Thuillard)
## 547 Beane, Mrs. Edward (Ethel Clarke)
## 568 Palsson, Mrs. Nils (Alma Cornelia Berglund)
## 601 Jacobsohn, Mrs. Sidney Samuel (Amy Frances Christy)
## 618 Lobb, Mrs. William Arthur (Cordelia K Stanlick)
## 701 Astor, Mrs. John Jacob (Madeleine Talmadge Force)
## 782 Dick, Mrs. Albert Adrian (Vera Gillespie)
## 831 Yasbeck, Mrs. Antoni (Selini Alexander)
## 859 Baclini, Mrs. Solomon (Latifa Qurban)
## 875 Abelson, Mrs. Samuel (Hannah Wizosky)
## 881 Shelley, Mrs. William (Imanita Parrish Hall)
## Age Sex
## 9 27 female
## 10 14 female
## 42 27 female
## 54 29 female
## 67 29 female
## 134 29 female
## 143 24 female
## 152 22 female
## 248 24 female
## 256 29 female
## 292 19 female
## 308 17 female
## 313 26 female
## 324 22 female
## 400 28 female
## 438 24 female
## 474 23 female
## 547 19 female
## 568 29 female
## 601 24 female
## 618 26 female
## 701 18 female
## 782 17 female
## 831 15 female
## 859 24 female
## 875 28 female
## 881 25 female
Apparently, all female under 30 with title “Mrs” already married, it’s proved by their maiden name written inside bracket. So, there are no missplace in this title.
let’s do a recheck to the age range by title
train_X %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 6
## # Groups: titlePass [14]
## titlePass Sex count median min max
## <chr> <fct> <int> <dbl> <dbl> <dbl>
## 1 Col male 1 60 60 60
## 2 Countess female 1 33 33 33
## 3 Dr female 1 49 49 49
## 4 Dr male 5 38 23 54
## 5 Jonkheer male 1 38 38 38
## 6 Major male 1 45 45 45
## 7 Master male 47 9 0.67 17
## 8 Miss female 105 18 0.75 30
## 9 Mlle female 1 24 24 24
## 10 Mme female 1 24 24 24
## 11 Mr male 391 30 18 80
## 12 Mrs female 98 36 14 63
## 13 Ms female 27 38 28 63
## 14 Rev male 5 42 27 57
## 15 Sir male 1 49 49 49
Now all title has appropriate age range. I proceed to valid_X and titan_test
# Mutate surname for valid_X
valid_X <- valid_X %>% mutate(surname=str_extract(Name,".+(?=,.+)"))
str(valid_X)
## 'data.frame': 205 obs. of 14 variables:
## $ PassengerId: int 5 8 11 18 21 24 31 34 37 44 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 1 2 3 2 ...
## $ Name : chr "Allen, Mr. William Henry" "Palsson, Master. Gosta Leonard" "Sandstrom, Miss. Marguerite Rut" "Williams, Mr. Charles Eugene" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
## $ Age : num 35 2 4 NA 35 28 40 66 NA 3 ...
## $ SibSp : int 0 3 1 0 0 0 0 0 0 1 ...
## $ Parch : int 0 1 1 0 0 0 0 0 0 2 ...
## $ Ticket : chr "373450" "349909" "PP 9549" "244373" ...
## $ Fare : num 8.05 21.07 16.7 13 26 ...
## $ Cabin : chr NA NA "G6" NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 1 3 1 1 ...
## $ titlePass : chr "Mr" "Master" "Miss" "Mr" ...
## $ surname : chr "Allen" "Palsson" "Sandstrom" "Williams" ...
# Mutate surname for titan_test
titan_test <- titan_test %>% mutate(surname=str_extract(Name,".+(?=,.+)"))
str(titan_test)
## 'data.frame': 418 obs. of 13 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr NA NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ titlePass : chr "Mr" "Mrs" "Mr" "Mr" ...
## $ surname : chr "Kelly" "Wilkes" "Myles" "Wirz" ...
Age range by title in valid_X :
valid_X %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 12 x 6
## # Groups: titlePass [12]
## titlePass Sex count median min max
## <chr> <fct> <int> <dbl> <dbl> <dbl>
## 1 Capt male 1 70 70 70
## 2 Col male 1 56 56 56
## 3 Don male 1 40 40 40
## 4 Dr male 1 50 50 50
## 5 Lady female 1 48 48 48
## 6 Major male 1 52 52 52
## 7 Master male 9 3 0.42 11
## 8 Miss female 51 19 0.75 36
## 9 Mlle female 1 24 24 24
## 10 Mr male 110 30 14 66
## 11 Mrs female 27 32 18 62
## 12 Rev male 1 51 51 51
Each gender is placed properly in title, but there are some indication of missplaces, like those range in “Miss”, and “Mr”
# Let's start with "Miss"
over30_Miss_val <- which(valid_X$titlePass=="Miss" & valid_X$Age>=30)
valid_X[c(over30_Miss_val),"titlePass"] <- "Ms"
valid_X[c(over30_Miss_val),c("Name","Age","titlePass")]
## Name Age titlePass
## 216 Newell, Miss. Madeleine 31 Ms
## 219 Bazzani, Miss. Albina 32 Ms
## 258 Cherry, Miss. Gladys 30 Ms
## 310 Francatelli, Miss. Laura Mabel 30 Ms
## 323 Slayter, Miss. Hilda Mary 30 Ms
## 388 Buss, Miss. Kate 36 Ms
## 538 LeRoy, Miss. Bertha 30 Ms
## 541 Crosby, Miss. Harriet R 36 Ms
## 577 Garside, Miss. Ethel 34 Ms
## 843 Serepeca, Miss. Augusta 30 Ms
and then for “Mr”
under18_Mr_val <- which(valid_X$titlePass=="Mr" & valid_X$Age<18)
valid_X[c(under18_Mr_val),"titlePass"] <- "Master"
valid_X[c(under18_Mr_val),c("Name","Age","titlePass")]
## Name Age titlePass
## 164 Calic, Mr. Jovo 17 Master
## 434 Kallio, Mr. Nikolai Erland 17 Master
## 551 Thayer, Mr. John Borland Jr 17 Master
## 684 Goodwin, Mr. Charles Edward 14 Master
## 687 Panula, Mr. Jaako Arnold 14 Master
## 765 Eklund, Mr. Hans Linus 16 Master
Recheck for valid_X
valid_X %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 13 x 6
## # Groups: titlePass [13]
## titlePass Sex count median min max
## <chr> <fct> <int> <dbl> <dbl> <dbl>
## 1 Capt male 1 70 70 70
## 2 Col male 1 56 56 56
## 3 Don male 1 40 40 40
## 4 Dr male 1 50 50 50
## 5 Lady female 1 48 48 48
## 6 Major male 1 52 52 52
## 7 Master male 15 11 0.42 17
## 8 Miss female 41 15 0.75 26
## 9 Mlle female 1 24 24 24
## 10 Mr male 104 31.5 18 66
## 11 Mrs female 27 32 18 62
## 12 Ms female 10 30.5 30 36
## 13 Rev male 1 51 51 51
Now, for titan_test
titan_test %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## Warning in min(Age, na.rm = TRUE): no non-missing arguments to min; returning
## Inf
## Warning in max(Age, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 6
## # Groups: titlePass [9]
## titlePass Sex count median min max
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Col male 2 50 47 53
## 2 Dona female 1 39 39 39
## 3 Dr male 1 53 53 53
## 4 Master male 21 7 0.33 14.5
## 5 Miss female 78 22 0.17 45
## 6 Mr male 240 28.5 14 67
## 7 Mrs female 72 36.5 16 76
## 8 Ms female 1 NA Inf -Inf
## 9 Rev male 2 35.5 30 41
The suspicious title from titan_test are Mr,Miss, and Mrs. We start with Mr
under18_Mr_test <-which(titan_test$titlePass=="Mr" & titan_test$Age<18)
titan_test[c(under18_Mr_test),"titlePass"] <- "Master"
titan_test[c(under18_Mr_test),c("Age","titlePass")]
## Age titlePass
## 6 14 Master
## 61 17 Master
## 188 17 Master
## 213 17 Master
## 231 14 Master
## 270 17 Master
## 404 17 Master
for “Miss” in titan_test
quantile(titan_test[which(titan_test$titlePass=="Miss"),"Age"],c(0.25,0.75),na.rm = TRUE)
## 25% 75%
## 17.75 29.25
over30_Miss_test <- which(titan_test$titlePass=="Miss" & titan_test$Age >=30)
# Convert the title of passengers over 30 to "Ms"
titan_test[c(over30_Miss_test),"titlePass"] <- "Ms"
titan_test[c(over30_Miss_test),c("Name","titlePass","Age")]
## Name titlePass Age
## 7 Connolly, Miss. Kate Ms 30
## 60 Chaudanson, Miss. Victorine Ms 36
## 75 Geiger, Miss. Amalie Ms 35
## 113 Evans, Miss. Edith Corse Ms 36
## 142 Daniels, Miss. Sarah Ms 33
## 207 McGowan, Miss. Katherine Ms 35
## 209 Rosenbaum, Miss. Edith Louise Ms 33
## 215 Andersson, Miss. Ida Augusta Margareta Ms 38
## 221 Duran y More, Miss. Florentina Ms 30
## 292 Daly, Miss. Margaret Marcella Maggie"" Ms 30
## 314 Carr, Miss. Jeannie Ms 37
## 325 Kreuchen, Miss. Emilie Ms 39
## 350 Walcroft, Miss. Nellie Ms 31
## 372 Wilson, Miss. Helen Alice Ms 31
## 376 Bowen, Miss. Grace Scott Ms 45
## 401 Bonnell, Miss. Caroline Ms 30
Finally, we will deal with “Mrs” in titan_test
quantile(titan_test[which(titan_test$titlePass=="Mrs"),"Age"],c(0.25,0.75),na.rm = TRUE)
## 25% 75%
## 26.0 49.5
# Check if all "Mrs" already married
titan_test[which(titan_test$titlePass=="Mrs"),"Name"]
## [1] "Wilkes, Mrs. James (Ellen Needs)"
## [2] "Hirvonen, Mrs. Alexander (Helga E Lindqvist)"
## [3] "Abrahim, Mrs. Joseph (Sophie Halaut Easu)"
## [4] "Snyder, Mrs. John Pillsbury (Nelle Stevenson)"
## [5] "Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)"
## [6] "del Carlo, Mrs. Sebastiano (Argenia Genovesi)"
## [7] "Assaf Khalil, Mrs. Mariana (Miriam\")\""
## [8] "Flegenheim, Mrs. Alfred (Antoinette)"
## [9] "Ryerson, Mrs. Arthur Larned (Emily Maria Borie)"
## [10] "Dean, Mrs. Bertram (Eva Georgetta Light)"
## [11] "Johnston, Mrs. Andrew G (Elizabeth Lily\" Watson)\""
## [12] "Corbett, Mrs. Walter H (Irene Colvin)"
## [13] "Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)"
## [14] "Bucknell, Mrs. William Robert (Emma Eliza Ward)"
## [15] "Coutts, Mrs. William (Winnie Minnie\" Treanor)\""
## [16] "Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)"
## [17] "Fortune, Mrs. Mark (Mary McDougald)"
## [18] "Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)"
## [19] "Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)"
## [20] "Davidson, Mrs. Thornton (Orian Hays)"
## [21] "Cavendish, Mrs. Tyrell William (Julia Florence Siegel)"
## [22] "Stengel, Mrs. Charles Emil Henry (Annie May Morris)"
## [23] "Thomas, Mrs. Alexander (Thamine Thelma\")\""
## [24] "Straus, Mrs. Isidor (Rosalie Ida Blun)"
## [25] "Chapman, Mrs. John Henry (Sara Elizabeth Lawry)"
## [26] "Schabert, Mrs. Paul (Emma Mock)"
## [27] "Lefebre, Mrs. Frank (Frances)"
## [28] "Earnshaw, Mrs. Boulton (Olive Potter)"
## [29] "Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)"
## [30] "Peacock, Mrs. Benjamin (Edith Nile)"
## [31] "Kink-Heilmann, Mrs. Anton (Luise Heilmann)"
## [32] "Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)"
## [33] "Becker, Mrs. Allen Oliver (Nellie E Baumgardner)"
## [34] "Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)"
## [35] "Marvin, Mrs. Daniel Warner (Mary Graham Carmichael Farquarson)"
## [36] "Douglas, Mrs. Frederick Charles (Mary Helene Baxter)"
## [37] "Rasmussen, Mrs. (Lena Jacobsen Solvang)"
## [38] "Howard, Mrs. Benjamin (Ellen Truelove Arman)"
## [39] "Widener, Mrs. George Dunton (Eleanor Elkins)"
## [40] "Cook, Mrs. (Selena Rogers)"
## [41] "Candee, Mrs. Edward (Helen Churchill Hungerford)"
## [42] "Moubarek, Mrs. George (Omine Amenia\" Alexander)\""
## [43] "Douglas, Mrs. Walter Donald (Mahala Dutton)"
## [44] "Lindstrom, Mrs. Carl Johan (Sigrid Posse)"
## [45] "Christy, Mrs. (Alice Frances)"
## [46] "Karnes, Mrs. J Frank (Claire Bennett)"
## [47] "Hold, Mrs. Stephen (Annie Margaret Hill)"
## [48] "Khalil, Mrs. Betros (Zahie Maria\" Elias)\""
## [49] "Wells, Mrs. Arthur Henry (Addie\" Dart Trevaskis)\""
## [50] "Clark, Mrs. Walter Miller (Virginia McDowell)"
## [51] "Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)"
## [52] "Hansen, Mrs. Claus Peter (Jennie L Howard)"
## [53] "White, Mrs. John Stuart (Ella Holmes)"
## [54] "Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) "
## [55] "Nakid, Mrs. Said (Waika Mary\" Mowad)\""
## [56] "Cardeza, Mrs. James Warburton Martinez (Charlotte Wardle Drake)"
## [57] "Whabee, Mrs. George Joseph (Shawneene Abi-Saab)"
## [58] "Greenfield, Mrs. Leo David (Blanche Strouse)"
## [59] "Brown, Mrs. John Murray (Caroline Lane Lamson)"
## [60] "Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)"
## [61] "Mallet, Mrs. Albert (Antoinette Magnin)"
## [62] "Ware, Mrs. John James (Florence Louise Long)"
## [63] "Harder, Mrs. George Achilles (Dorothy Annan)"
## [64] "Sage, Mrs. John (Annie Bullen)"
## [65] "Gibson, Mrs. Leonard (Pauline C Boeson)"
## [66] "Dodge, Mrs. Washington (Ruth Vidaver)"
## [67] "Risien, Mrs. Samuel (Emma)"
## [68] "McNamee, Mrs. Neal (Eileen O'Leary)"
## [69] "Lines, Mrs. Ernest H (Elizabeth Lindsey James)"
## [70] "Smith, Mrs. Lucien Philip (Mary Eloise Hughes)"
## [71] "Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)"
## [72] "Minahan, Mrs. William Edward (Lillian E Thorpe)"
Based on this data, all passangers with title “Mrs” already married, so there is no missplace here.
let’s do recheck on age range by their title in titan_test
titan_test %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 6
## # Groups: titlePass [9]
## titlePass Sex count median min max
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Col male 2 50 47 53
## 2 Dona female 1 39 39 39
## 3 Dr male 1 53 53 53
## 4 Master male 28 10.8 0.33 17
## 5 Miss female 62 20 0.17 29
## 6 Mr male 233 29 18 67
## 7 Mrs female 72 36.5 16 76
## 8 Ms female 17 34 30 45
## 9 Rev male 2 35.5 30 41
After we fix all age range in all dataset, we begin the imputation process for Age.
# get the index which Age feature is missing
miss_age_train <- which(is.na(train_X$Age))
train_X %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 15 x 6
## # Groups: titlePass [14]
## titlePass Sex count median min max
## <chr> <fct> <int> <dbl> <dbl> <dbl>
## 1 Col male 1 60 60 60
## 2 Countess female 1 33 33 33
## 3 Dr female 1 49 49 49
## 4 Dr male 5 38 23 54
## 5 Jonkheer male 1 38 38 38
## 6 Major male 1 45 45 45
## 7 Master male 47 9 0.67 17
## 8 Miss female 105 18 0.75 30
## 9 Mlle female 1 24 24 24
## 10 Mme female 1 24 24 24
## 11 Mr male 391 30 18 80
## 12 Mrs female 98 36 14 63
## 13 Ms female 27 38 28 63
## 14 Rev male 5 42 27 57
## 15 Sir male 1 49 49 49
unique(train_X[c(miss_age_train),"titlePass"])
## [1] "Mr" "Mrs" "Miss" "Master" "Dr"
# Title that have miss age in train_X : "Mr","Miss","Mrs","Master","Dr"
# We use median imputation, it's quite simple and would not impact the data distribution despite the lack of randomness
# If "Mr", impute age == 30
# If "Miss", impute age == 18.5
# If "Mrs", impute age == 35
# If "Master", impute age == 9
# If "Dr" and "male", impute age == 38
# If "Dr" and "female", impute age == 49
train_X <- train_X %>%
mutate(Age2 = case_when(is.na(Age)==FALSE~Age,
is.na(Age) & titlePass=="Mr"~30,
is.na(Age) & titlePass=="Miss"~18,
is.na(Age) & titlePass=="Mrs"~36,
is.na(Age) & titlePass=="Master"~9,
is.na(Age) & titlePass=="Dr" &
Sex=="male"~38,
is.na(Age) & titlePass=="Dr" &
Sex=="female"~49))
# Check missing value of Age2 in train_X
sum(is.na(train_X$Age2))
## [1] 0
Apply the same steps in valid_X and titan_test
# get the index which Age feature is missing
miss_age_train_val <- which(is.na(valid_X$Age))
valid_X %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 13 x 6
## # Groups: titlePass [13]
## titlePass Sex count median min max
## <chr> <fct> <int> <dbl> <dbl> <dbl>
## 1 Capt male 1 70 70 70
## 2 Col male 1 56 56 56
## 3 Don male 1 40 40 40
## 4 Dr male 1 50 50 50
## 5 Lady female 1 48 48 48
## 6 Major male 1 52 52 52
## 7 Master male 15 11 0.42 17
## 8 Miss female 41 15 0.75 26
## 9 Mlle female 1 24 24 24
## 10 Mr male 104 31.5 18 66
## 11 Mrs female 27 32 18 62
## 12 Ms female 10 30.5 30 36
## 13 Rev male 1 51 51 51
unique(valid_X[c(miss_age_train_val),"titlePass"])
## [1] "Mr" "Miss" "Mrs" "Master"
# Title that have miss age in train_X : "Mr","Miss","Mrs","Master","Dr"
# We use median imputation, it's quite simple and would not impact the data distribution despite the lack of randomness
# If "Mr", impute age == 31.5
# If "Miss", impute age == 15.0
# If "Mrs", impute age == 32
# If "Master", impute age == 11.0
Imputation of Age in valid_X
# If "Mr", impute age == 31.5
# If "Miss", impute age == 15.0
# If "Mrs", impute age == 32
# If "Master", impute age == 11.0
valid_X <- valid_X %>%
mutate(Age2 = case_when(is.na(Age)==FALSE~Age,
is.na(Age) & titlePass=="Mr"~31.5,
is.na(Age) & titlePass=="Miss"~15.0,
is.na(Age) & titlePass=="Mrs"~32,
is.na(Age) & titlePass=="Master"~11.0))
#check the missing values of Age2 in valid_X
sum(is.na(valid_X$Age2))
## [1] 0
Imputation of Age in titan_test
miss_age_test <- which(is.na(titan_test$Age))
titan_test %>% group_by(titlePass,Sex) %>%
summarise(count=n(),
median = median(Age,na.rm=TRUE),
min=min(Age,na.rm=TRUE),
max=max(Age,na.rm=TRUE))
## `summarise()` has grouped output by 'titlePass'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 6
## # Groups: titlePass [9]
## titlePass Sex count median min max
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Col male 2 50 47 53
## 2 Dona female 1 39 39 39
## 3 Dr male 1 53 53 53
## 4 Master male 28 10.8 0.33 17
## 5 Miss female 62 20 0.17 29
## 6 Mr male 233 29 18 67
## 7 Mrs female 72 36.5 16 76
## 8 Ms female 17 34 30 45
## 9 Rev male 2 35.5 30 41
unique(titan_test[c(miss_age_test),"titlePass"])
## [1] "Mr" "Mrs" "Miss" "Ms" "Master"
# if the title is "Mr", impute age == 29.00
# if the title is "Mrs", impute age ==36.50
# if the title is "Miss", impute age ==20.00
# if the title is "Ms", impute age == 34.00
# if the title is "Master", impute age == 10.75
# if the title is "Mr", impute age == 29.00
# if the title is "Mrs", impute age ==36.50
# if the title is "Miss", impute age ==20.00
# if the title is "Ms", impute age == 34.00
# if the title is "Master", impute age == 10.75
titan_test <- titan_test %>%
mutate(Age2 = case_when(is.na(Age)==FALSE~Age,
is.na(Age) & titlePass=="Mr"~29.00,
is.na(Age) & titlePass=="Miss"~20.00,
is.na(Age) & titlePass=="Mrs"~36.50,
is.na(Age) & titlePass=="Master"~10.75,
is.na(Age) & titlePass=="Ms"~34.00))
#check the missing values of Age2 in valid_X
sum(is.na(titan_test$Age2))
## [1] 0
Now, we have feature Age2, the complete version of feature Age.
Missing values of Cabin feature in each data sets
paste("Missing value of Cabin in train_X:",round((sum(is.na(train_X$Cabin))/length(train_X$Cabin))*100,2),"%")
## [1] "Missing value of Cabin in train_X: 76.68 %"
paste("Missing value of Cabin in valid_X:",round((sum(is.na(valid_X$Cabin))/length(valid_X$Cabin))*100,2),"%")
## [1] "Missing value of Cabin in valid_X: 78.54 %"
paste("Missing value of Cabin in titan_test:",round((sum(is.na(titan_test$Cabin))/length(titan_test$Cabin))*100,2),"%")
## [1] "Missing value of Cabin in titan_test: 78.23 %"
As i stated in previous section, with huge percentage of missing value, it’s useless to try to extract any information from Cabin feature. A simple solution would be to drop it entirely.
Actually, with extensive effort, there are some insight we may able to extract from Cabin. But, after conduct literacy study from previous research, i decide that the benefit is not worth the effort. I may conduct this special analysis with Cabin feature in future occassion.
# Drop Cabin
trainX_new <- train_X %>% select(!Cabin)
validX_new <- valid_X %>% select(!Cabin)
titan_test_new <- titan_test %>% select(!Cabin)
names(trainX_new)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Embarked" "titlePass" "surname" "Age2"
names(validX_new)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Embarked" "titlePass" "surname" "Age2"
names(titan_test_new)
## [1] "PassengerId" "Pclass" "Name" "Sex" "Age"
## [6] "SibSp" "Parch" "Ticket" "Fare" "Embarked"
## [11] "titlePass" "surname" "Age2"
There is only 1 missing value of Embarked in trainX_new and validX_new data set.
sum(is.na(trainX_new$Embarked))
## [1] 1
sum(is.na(validX_new$Embarked))
## [1] 1
sum(is.na(titan_test_new$Embarked))
## [1] 0
We call that specific observation.
trainX_new[which(is.na(trainX_new$Embarked)),]
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 62 62 1 1 Icard, Miss. Amelie female 38 0 0
## Ticket Fare Embarked titlePass surname Age2
## 62 113572 80 <NA> Ms Icard 38
validX_new[which(is.na(validX_new$Embarked)),]
## PassengerId Survived Pclass Name
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Embarked titlePass surname Age2
## 830 female 62 0 0 113572 80 <NA> Mrs Stone 62
To impute these values, i conduct a simple searching in a online forum dedicated to Titanic, and found the answer. (For your information, most of the data use in this project is sourced from this forum) Source : encyclopedia-titanica.org
# Impute Embarked value in trainX_new
trainX_new[which(is.na(trainX_new$Embarked)),"Embarked"] <- "S"
# Impute Embarked value in validX_new
validX_new[which(is.na(validX_new$Embarked)),"Embarked"] <- "S"
sum(is.na(trainX_new$Embarked))
## [1] 0
sum(is.na(validX_new$Embarked))
## [1] 0
There is only one missing value of Fare in titan_test_new. Here is the data
titan_test_new[which(is.na(titan_test_new$Fare)),]
## PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare
## 153 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NA
## Embarked titlePass surname Age2
## 153 S Mr Storey 60.5
# Impute the value with median Fare in Pclass 3
titan_test_new %>% group_by(Embarked,Pclass) %>%
summarise(med_Fare = median(Fare,na.rm=TRUE)) # 8.05
## `summarise()` has grouped output by 'Embarked'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 3
## # Groups: Embarked [3]
## Embarked Pclass med_Fare
## <fct> <fct> <dbl>
## 1 C 1 75.2
## 2 C 2 15.0
## 3 C 3 7.23
## 4 Q 1 90
## 5 Q 2 11.5
## 6 Q 3 7.75
## 7 S 1 51.7
## 8 S 2 21
## 9 S 3 8.05
# Impute
titan_test_new[which(is.na(titan_test_new$Fare)),"Fare"] <- 8.05
# Check
sum(is.na(titan_test_new$Fare))
## [1] 0
Aside of missing value, Fare feature is likely to has another problem.
summary(trainX_new$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.925 14.500 33.666 31.275 512.329
summary(validX_new$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.896 13.000 27.313 29.125 263.000
summary(titan_test_new$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.896 14.454 35.561 31.472 512.329
as the summary show us, the minimum value of Fare in all data set is zero. I don’t think this is possible, but a mere inputing error.
trainX_new[which(trainX_new$Fare == 0),]
## PassengerId Survived Pclass Name Sex Age SibSp
## 264 264 0 1 Harrison, Mr. William male 40 0
## 272 272 1 3 Tornquist, Mr. William Henry male 25 0
## 303 303 0 3 Johnson, Mr. William Cahoone Jr male 19 0
## 467 467 0 2 Campbell, Mr. William male NA 0
## 482 482 0 2 Frost, Mr. Anthony Wood "Archie" male NA 0
## 598 598 0 3 Johnson, Mr. Alfred male 49 0
## 634 634 0 1 Parr, Mr. William Henry Marsh male NA 0
## 675 675 0 2 Watson, Mr. Ennis Hastings male NA 0
## 807 807 0 1 Andrews, Mr. Thomas Jr male 39 0
## 816 816 0 1 Fry, Mr. Richard male NA 0
## 823 823 0 1 Reuchlin, Jonkheer. John George male 38 0
## Parch Ticket Fare Embarked titlePass surname Age2
## 264 0 112059 0 S Mr Harrison 40
## 272 0 LINE 0 S Mr Tornquist 25
## 303 0 LINE 0 S Mr Johnson 19
## 467 0 239853 0 S Mr Campbell 30
## 482 0 239854 0 S Mr Frost 30
## 598 0 LINE 0 S Mr Johnson 49
## 634 0 112052 0 S Mr Parr 30
## 675 0 239856 0 S Mr Watson 30
## 807 0 112050 0 S Mr Andrews 39
## 816 0 112058 0 S Mr Fry 30
## 823 0 19972 0 S Jonkheer Reuchlin 38
All of these “free-rider” are Male, aged from 19-40 years old, departed from Southampton, and travel alone.
trainX_new[which(trainX_new$SibSp==0 & trainX_new$Parch==0 & trainX_new$Embarked=="S"),] %>% group_by(Pclass) %>%
summarise(fare_med = median(Fare))
## # A tibble: 3 x 2
## Pclass fare_med
## <fct> <dbl>
## 1 1 26.6
## 2 2 13
## 3 3 7.92
# Pclass == 1 then fare==26.55
# Pclass == 2 then fare==13
# Pclass == 3 then fare==7.925
# Impute
trainX_new <- trainX_new %>% mutate(Fare2=
case_when(SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==1~26.55,
SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==2~13,
SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==3~7.925,
TRUE~Fare)
)
summary(trainX_new$Fare2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.013 7.925 15.048 33.951 31.275 512.329
in validX_new, our “free-rider” has similiar characteristic as those in trainX_new. So, here i will apply the same solution
validX_new[which(validX_new$Fare==0),]
## PassengerId Survived Pclass Name Sex Age SibSp
## 180 180 0 3 Leonard, Mr. Lionel male 36 0
## 278 278 0 2 Parkes, Mr. Francis "Frank" male NA 0
## 414 414 0 2 Cunningham, Mr. Alfred Fleming male NA 0
## 733 733 0 2 Knight, Mr. Robert J male NA 0
## Parch Ticket Fare Embarked titlePass surname Age2
## 180 0 LINE 0 S Mr Leonard 36.0
## 278 0 239853 0 S Mr Parkes 31.5
## 414 0 239853 0 S Mr Cunningham 31.5
## 733 0 239855 0 S Mr Knight 31.5
validX_new[which(validX_new$SibSp==0 & validX_new$Parch==0 & validX_new$Embarked=="S"),] %>% group_by(Pclass) %>%
summarise(fare_med = median(Fare))
## # A tibble: 3 x 2
## Pclass fare_med
## <fct> <dbl>
## 1 1 47.2
## 2 2 13
## 3 3 7.90
# if Pclass==1 then fare==47.2
# if Pclass==2 then fare==13
# if Pclass==3 then fate==7.8958
validX_new <- validX_new %>% mutate(Fare2=
case_when(SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==1~47.2,
SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==2~13,
SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==3~7.8958,
TRUE~Fare)
)
summary(validX_new$Fare2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.496 7.896 13.000 27.542 29.125 263.000
At last, for titan_test_new
titan_test_new[which(titan_test_new$Fare==0),]
## PassengerId Pclass Name Sex Age SibSp
## 267 1158 1 Chisholm, Mr. Roderick Robert Crispin male NA 0
## 373 1264 1 Ismay, Mr. Joseph Bruce male 49 0
## Parch Ticket Fare Embarked titlePass surname Age2
## 267 0 112051 0 S Mr Chisholm 29
## 373 0 112058 0 S Mr Ismay 49
It is also reasonable to apply the same solution to zero Fare in titan_test_new.
titan_test_new[which(titan_test_new$SibSp==0 & titan_test_new$Parch==0 & titan_test_new$Embarked=="S"),] %>% group_by(Pclass) %>%
summarise(fare_med = median(Fare))
## # A tibble: 3 x 2
## Pclass fare_med
## <fct> <dbl>
## 1 1 30.5
## 2 2 13
## 3 3 7.90
# if pclass==1 then Fare == 30.5
# if Pclass==2 then Fare == 13
# if Pclass==3 then Fare == 7.8958
# impute
titan_test_new <- titan_test_new %>% mutate(Fare2=
case_when(SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==1~30.5,
SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==2~13,
SibSp==0 & Parch==0 & Embarked=="S" & Fare==0 & Pclass==3~7.8958,
TRUE~Fare)
)
summary(titan_test_new$Fare2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.171 7.896 14.456 35.707 31.472 512.329
To prevent confussion, let’s drop Age and Fare feature from each data set
trainX_new <- trainX_new %>% select(!c(Age,Fare))
validX_new <- validX_new %>% select(!c(Age,Fare))
titan_test_new <- titan_test_new %>% select(!c(Age,Fare))
summary(trainX_new)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:423 1:175 Length:686 female:234
## 1st Qu.:223.2 1:263 2:142 Class :character male :452
## Median :445.5 3:369 Mode :character
## Mean :446.1
## 3rd Qu.:668.5
## Max. :891.0
## SibSp Parch Ticket Embarked
## Min. :0.0000 Min. :0.0000 Length:686 C:133
## 1st Qu.:0.0000 1st Qu.:0.0000 Class :character Q: 59
## Median :0.0000 Median :0.0000 Mode :character S:494
## Mean :0.5394 Mean :0.3848
## 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :8.0000 Max. :6.0000
## titlePass surname Age2 Fare2
## Length:686 Length:686 Min. : 0.67 Min. : 4.013
## Class :character Class :character 1st Qu.:21.00 1st Qu.: 7.925
## Mode :character Mode :character Median :30.00 Median : 15.048
## Mean :29.75 Mean : 33.951
## 3rd Qu.:36.00 3rd Qu.: 31.275
## Max. :80.00 Max. :512.329
I will conduct normality test and apply standarization on Age2 and Fare features.
There are two reason :
# Histogram of Fare2 and Age2
qqnorm(trainX_new$Fare2,main="Q-Q Plot for Fare Normality in Training Data")
qqline(trainX_new$Fare2,col="magenta")
qqnorm(validX_new$Fare2,main="Q-Q Plot for Fare Normality in Validation Data")
qqline(validX_new$Fare2,col="coral")
qqnorm(titan_test_new$Fare2,main="Q-Q Plot for Fare Normality in Test Data")
qqline(titan_test_new$Fare2,col="coral")
From Q-Q plot in data sets, we can see the diverse of data from the
straight line, especially around the tail.
qqnorm(trainX_new$Age2,main="Q-Q Plot for Age Normality in Training Data")
qqline(trainX_new$Age2,col="magenta")
qqnorm(validX_new$Age2,main="Q-Q Plot for Age Normality in Validation Data")
qqline(validX_new$Age2,col="coral")
qqnorm(titan_test_new$Age2,main="Q-Q Plot for Age Normality in Test Data")
qqline(titan_test_new$Age2,col="coral")
For Age, even though they are not completely follow normal distribution,
Age has closer distance to theoritical Quantiles rather than Fare.
To ensure our interpretation of Q-Q Plot, i will perform Saphiro-Wilk, to test data normality.
shapiro.test(trainX_new$Age2)
##
## Shapiro-Wilk normality test
##
## data: trainX_new$Age2
## W = 0.97015, p-value = 1.307e-10
shapiro.test(validX_new$Age2)
##
## Shapiro-Wilk normality test
##
## data: validX_new$Age2
## W = 0.97388, p-value = 0.0007291
shapiro.test(titan_test_new$Age2)
##
## Shapiro-Wilk normality test
##
## data: titan_test_new$Age2
## W = 0.96171, p-value = 5.655e-09
Normality test using Kolmogrov-Smirnov test
ks.test(trainX_new$Age2,"pnorm")
## Warning in ks.test(trainX_new$Age2, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: trainX_new$Age2
## D = 0.97241, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(validX_new$Age2,"pnorm")
## Warning in ks.test(validX_new$Age2, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: validX_new$Age2
## D = 0.96938, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(titan_test_new$Age2,"pnorm")
## Warning in ks.test(titan_test_new$Age2, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: titan_test_new$Age2
## D = 0.97473, p-value < 2.2e-16
## alternative hypothesis: two-sided
From all test, it is proved that Age2 is not normally distributed. Now, for Fare2
qqnorm(trainX_new$Fare2,main="Q-Q Plot for Fare Normality in Training Data")
qqline(trainX_new$Fare2,col="magenta")
qqnorm(validX_new$Fare2,main="Q-Q Plot for Fare Normality in Validation Data")
qqline(validX_new$Fare2,col="coral")
qqnorm(titan_test_new$Fare2,main="Q-Q Plot for Fare Normality in Test Data")
qqline(titan_test_new$Fare2,col="coral")
#
Same as Age, Fare feature in all data set are most likely to not normally distributed.
# Shapiro-Wilk Test
shapiro.test(trainX_new$Fare2)
##
## Shapiro-Wilk normality test
##
## data: trainX_new$Fare2
## W = 0.50906, p-value < 2.2e-16
shapiro.test(validX_new$Fare2)
##
## Shapiro-Wilk normality test
##
## data: validX_new$Fare2
## W = 0.60136, p-value < 2.2e-16
shapiro.test(titan_test_new$Fare2)
##
## Shapiro-Wilk normality test
##
## data: titan_test_new$Fare2
## W = 0.53761, p-value < 2.2e-16
#Kolmogrov-Smirnov Test
ks.test(trainX_new$Fare2,"pnorm")
## Warning in ks.test(trainX_new$Fare2, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: trainX_new$Fare2
## D = 0.99997, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(validX_new$Fare2,"pnorm")
## Warning in ks.test(validX_new$Fare2, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: validX_new$Fare2
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(titan_test_new$Fare2,"pnorm")
## Warning in ks.test(titan_test_new$Fare2, "pnorm"): ties should not be present
## for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: titan_test_new$Fare2
## D = 0.99924, p-value < 2.2e-16
## alternative hypothesis: two-sided
Both Shapiro-Wilk and Kolmogorov-Smirnov test’s results are confirmed this deduction.
For normalization*, i will use log transformation and Yeo Johnson transformation, then compare them to see which one is better at normalize Fare and Age features.
# Log transformation
log_fare <- log(trainX_new$Fare2)
hist(log_fare)
shapiro.test(log_fare)
##
## Shapiro-Wilk normality test
##
## data: log_fare
## W = 0.89446, p-value < 2.2e-16
log_age <- log(trainX_new$Age2)
hist(log_age)
shapiro.test(log_age)
##
## Shapiro-Wilk normality test
##
## data: log_age
## W = 0.75829, p-value < 2.2e-16
#install.packages("bestNormalize")
library(bestNormalize)
## Warning: package 'bestNormalize' was built under R version 4.1.3
yeoJ_fare <- yeojohnson(trainX_new$Fare2)
hist(yeoJ_fare$x.t)
shapiro.test(yeoJ_fare$x.t)
##
## Shapiro-Wilk normality test
##
## data: yeoJ_fare$x.t
## W = 0.92003, p-value < 2.2e-16
yeoJ_age <- yeojohnson(trainX_new$Age2)
hist(yeoJ_age$x.t)
shapiro.test(yeoJ_age$x.t)
##
## Shapiro-Wilk normality test
##
## data: yeoJ_age$x.t
## W = 0.97735, p-value = 8.164e-09
For all consideration, it seems that Yeo Johnson transformation generate more centered an symmetric data than log transformation, even though they all are still considered as not normally distributed.
iqr_yeo_fare <- IQR(yeoJ_fare$x.t)
q1_yeo_fare <- quantile(yeoJ_fare$x.t,0.25)
q3_yeo_fare <-quantile(yeoJ_fare$x.t,0.75)
# Check for outliers
yeoJ_fare$x.t[which(yeoJ_fare$x.t < q1_yeo_fare - (1.5*iqr_yeo_fare)|
yeoJ_fare$x.t > q3_yeo_fare + (1.5*iqr_yeo_fare))]
## numeric(0)
# Check for normality using Q-Q plot
qqnorm(yeoJ_fare$x.t,main="Q-Q Plot for YeoJohnson-Fare Normality in Training Data")
qqline(yeoJ_fare$x.t,col="magenta")
iqr_yeo_age <- IQR(yeoJ_age$x.t)
q1_yeo_age <- quantile(yeoJ_age$x.t,0.25)
q3_yeo_age <-quantile(yeoJ_age$x.t,0.75)
# Check for outliers
yeoJ_age$x.t[which(yeoJ_age$x.t < q1_yeo_age - (1.5*iqr_yeo_age)|
yeoJ_age$x.t > q3_yeo_age + (1.5*iqr_yeo_age))]
## [1] -2.395473 2.398930 -2.551608 2.763624 2.733512 -2.395473 -2.527724
## [8] -2.527724 -2.527724 2.213698 2.275667 -2.395473 -2.538902 -2.395473
## [15] -2.527724 2.337409 2.398930 -2.563026 -2.395473 2.275667 2.763624
## [22] 2.337409 2.213698 2.213698 3.297668 -2.395473 2.703350 -2.574566
## [29] -2.527724 -2.395473 -2.527724 -2.551608 2.943285
# QQ plot
qqnorm(yeoJ_age$x.t,main="Q-Q Plot for YeoJohnson Age Normality in Training Data")
qqline(yeoJ_age$x.t,col="magenta")
Based on outliers number left after transformed, it seems that Yeo Johnson is not the right technique for Age (the untransformed Age2 has less outlier than the transformed). So, i running test on multiple transformation technique, but the best result i can get is 24 (the same outlier number as the original data) outliers left for the transformed ones.
Finally i try to compare outlier data on Age feature against Age2. Apparently, the increasing outlier number of Age variable happens right after imputation process. Age feature has only 6, and Age2 has 24.
The reason this happens is because before imputation, 75% of Age lies under 39 years old. While after imputation using median on each gender and title, 75% of passanger’s age lies under 36 years old. So, imputed Age generate smaller IQR and make it more sensitive to outlier.
quantile(train_X$Age,c(0.25,0.75),na.rm=TRUE)
## 25% 75%
## 21 39
quantile(train_X$Age2,c(0.25,0.75))
## 25% 75%
## 21 36
IQR(train_X$Age,na.rm=TRUE)
## [1] 18
IQR(train_X$Age2)
## [1] 15
Hence, i just let this outliers exist. For model building purpose, i will compare between log transformed Age and Yeo Johnson transformed Age as feature in the later section.
trainX_new$YJTrans_Fare2 <- yeoJ_fare$x.t
trainX_new$YJTrans_Age2 <- yeoJ_age$x.t
trainX_new$log_age <- log_age
str(trainX_new)
## 'data.frame': 686 obs. of 16 variables:
## $ PassengerId : int 1 2 3 4 6 7 9 10 12 13 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 0 0 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
## $ titlePass : chr "Mr" "Mrs" "Miss" "Mrs" ...
## $ surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
## $ Age2 : num 22 38 26 35 30 54 27 14 58 20 ...
## $ Fare2 : num 7.25 71.28 7.92 53.1 8.46 ...
## $ YJTrans_Fare2: num -1.235 1.36 -1.077 1.167 -0.964 ...
## $ YJTrans_Age2 : num -0.5323 0.644 -0.2239 0.4337 0.0738 ...
## $ log_age : num 3.09 3.64 3.26 3.56 3.4 ...
Normalization and Scaling in validX_new
yeoJ_age_valid <- predict(yeoJ_age, newdata = validX_new$Age2)
yeoJ_fare_valid <- predict(yeoJ_fare, newdata = validX_new$Fare2)
hist(yeoJ_age_valid)
hist(validX_new$Age2)
summary(yeoJ_age_valid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.61150 -0.77177 0.07384 -0.14596 0.29127 2.70335
skewness(yeoJ_age_valid)
## [1] -0.1632515
kurtosis(yeoJ_age_valid)
## [1] 3.363006
#outlier on Yeo Johnson Age2
length(yeoJ_age_valid[which(yeoJ_age_valid < quantile(yeoJ_age_valid,0.25) - (1.5*IQR(yeoJ_age_valid))|
yeoJ_age_valid > quantile(yeoJ_age_valid,0.75) + (1.5*IQR(yeoJ_age_valid)))])
## [1] 10
# Outlier on Age2
length(validX_new$Age2[which(validX_new$Age2 < quantile(validX_new$Age2,0.25) - (1.5*IQR(validX_new$Age2))|
validX_new$Age2 > quantile(validX_new$Age2,0.75) + (1.5*IQR(validX_new$Age2)))])
## [1] 5
hist(yeoJ_fare_valid)
hist(validX_new$Fare2)
summary(yeoJ_fare_valid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4384 -1.0833 -0.2913 -0.1113 0.6662 1.9274
skewness(yeoJ_fare_valid)
## [1] 0.380994
kurtosis(yeoJ_fare_valid)
## [1] 1.811828
length(yeoJ_fare_valid[which(yeoJ_fare_valid < quantile(yeoJ_fare_valid,0.25) - (1.5*IQR(yeoJ_fare_valid))|
yeoJ_fare_valid > quantile(yeoJ_fare_valid,0.75) + (1.5*IQR(yeoJ_fare_valid)))])
## [1] 0
length(validX_new$Fare2[which(validX_new$Fare2 < quantile(validX_new$Fare2,0.25) - (1.5*IQR(validX_new$Fare2))|
validX_new$Fare2 > quantile(validX_new$Fare2,0.75) + (1.5*IQR(validX_new$Fare2)))])
## [1] 23
Add yeoJ_age_valid and yeoJ_fare_valid to validX_new dataframe
validX_new$YJTrans_Fare2 <- yeoJ_fare_valid
validX_new$YJTrans_Age2 <- yeoJ_age_valid
validX_new$log_Age2 <- log(validX_new$Age2)
str(validX_new)
## 'data.frame': 205 obs. of 16 variables:
## $ PassengerId : int 5 8 11 18 21 24 31 34 37 44 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 1 2 3 2 ...
## $ Name : chr "Allen, Mr. William Henry" "Palsson, Master. Gosta Leonard" "Sandstrom, Miss. Marguerite Rut" "Williams, Mr. Charles Eugene" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
## $ SibSp : int 0 3 1 0 0 0 0 0 0 1 ...
## $ Parch : int 0 1 1 0 0 0 0 0 0 2 ...
## $ Ticket : chr "373450" "349909" "PP 9549" "244373" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 1 3 1 1 ...
## $ titlePass : chr "Mr" "Master" "Miss" "Mr" ...
## $ surname : chr "Allen" "Palsson" "Sandstrom" "Williams" ...
## $ Age2 : num 35 2 4 31.5 35 28 40 66 31.5 3 ...
## $ Fare2 : num 8.05 21.07 16.7 13 26 ...
## $ YJTrans_Fare2: num -1.0495 0.3261 0.0462 -0.2913 0.5531 ...
## $ YJTrans_Age2 : num 0.434 -2.395 -2.158 0.183 0.434 ...
## $ log_Age2 : num 3.555 0.693 1.386 3.45 3.555 ...
Transformation on titan_test
yeoJ_age_test <- predict(yeojohnson(titan_test_new$Age2))
yeoJ_fare_test <- predict(yeojohnson(titan_test_new$Fare2))
head(yeoJ_age_test)
## [1] 0.4138232 1.2951371 2.2788698 -0.1527104 -0.5518869 -1.2413137
hist(yeoJ_age_test)
hist(titan_test_new$Age2)
summary(yeoJ_age_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.80232 -0.55189 0.00177 0.00000 0.55955 3.14418
skewness(yeoJ_age_test)
## [1] 0.08055547
kurtosis(yeoJ_age_test)
## [1] 3.715383
#outlier on Yeo Johnson Age2
length(yeoJ_age_test[which(yeoJ_age_test < quantile(yeoJ_age_test,0.25) - (1.5*IQR(yeoJ_age_test))|
yeoJ_age_test > quantile(yeoJ_age_test,0.75) + (1.5*IQR(yeoJ_age_test)))])
## [1] 19
# Outlier on Age2
length(titan_test_new$Age2[which(titan_test_new$Age2 < quantile(titan_test_new$Age2,0.25) - (1.5*IQR(titan_test_new$Age2))|
titan_test_new$Age2 > quantile(titan_test_new$Age2,0.75) + (1.5*IQR(titan_test_new$Age2)))])
## [1] 16
hist(yeoJ_fare_test)
hist(titan_test_new$Fare2)
summary(yeoJ_fare_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.9287 -1.0422 -0.1084 0.0000 0.7535 2.0217
skewness(yeoJ_fare_test)
## [1] 0.2385711
kurtosis(yeoJ_fare_test)
## [1] 1.856749
#outlier on Yeo Johnson Age2
length(yeoJ_fare_test[which(yeoJ_fare_test < quantile(yeoJ_fare_test,0.25) - (1.5*IQR(yeoJ_fare_test))|
yeoJ_fare_test > quantile(yeoJ_fare_test,0.75) + (1.5*IQR(yeoJ_fare_test)))])
## [1] 0
# Outlier on Age2
length(titan_test_new$Fare2[which(titan_test_new$Fare2 < quantile(titan_test_new$Fare2,0.25) - (1.5*IQR(titan_test_new$Fare2))|
titan_test_new$Fare2 > quantile(titan_test_new$Fare2,0.75) + (1.5*IQR(titan_test_new$Fare2)))])
## [1] 55
Add yeoJ_fare_test and yeoJ_age_valid to titan_test_new dataframe
titan_test_new$YJTrans_Fare2 <- yeoJ_fare_test
titan_test_new$YJTrans_Age2 <- yeoJ_age_test
titan_test_new$log_Age2 <- log(titan_test_new$Age2)
str(titan_test_new)
## 'data.frame': 418 obs. of 15 variables:
## $ PassengerId : int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ titlePass : chr "Mr" "Mrs" "Mr" "Mr" ...
## $ surname : chr "Kelly" "Wilkes" "Myles" "Wirz" ...
## $ Age2 : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ Fare2 : num 7.83 7 9.69 8.66 12.29 ...
## $ YJTrans_Fare2: num -1.057 -1.259 -0.698 -0.883 -0.335 ...
## $ YJTrans_Age2 : num 0.414 1.295 2.279 -0.153 -0.552 ...
## $ log_Age2 : num 3.54 3.85 4.13 3.3 3.09 ...
Suprisingly, Yeo Johnson transformation generate similar pattern on all data sets. They eliminate outliers in Fare, but increasing outliers in Age. All of them also manage to return a more centered data.
As i stated on preceding section (Exploratory Data Analysis), Sex and Embarked features are nominal measurements, which means there are no order or any kinds level to compare which group is better to another. For this reason, i will convert these two features, from factor data type to binary feature.
# unique value Sex feature
unique(trainX_new$Sex)
## [1] male female
## Levels: female male
unique(trainX_new$Embarked)
## [1] S C Q
## Levels: C Q S
# Convert each of this unique values to binary feature
trainX_new <- trainX_new %>% mutate(
SexMale = case_when(
Sex=="male"~1,
TRUE~0
),
SexFemale = case_when(
Sex=="female"~1,
TRUE~0
),
Embarked_S = case_when(
Embarked=="S"~1,
TRUE~0
),
Embarked_C = case_when(
Embarked=="C"~1,
TRUE~0
),
Embarked_Q = case_when(
Embarked=="Q"~1,
TRUE~0
)
)
str(trainX_new)
## 'data.frame': 686 obs. of 21 variables:
## $ PassengerId : int 1 2 3 4 6 7 9 10 12 13 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 0 0 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
## $ titlePass : chr "Mr" "Mrs" "Miss" "Mrs" ...
## $ surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
## $ Age2 : num 22 38 26 35 30 54 27 14 58 20 ...
## $ Fare2 : num 7.25 71.28 7.92 53.1 8.46 ...
## $ YJTrans_Fare2: num -1.235 1.36 -1.077 1.167 -0.964 ...
## $ YJTrans_Age2 : num -0.5323 0.644 -0.2239 0.4337 0.0738 ...
## $ log_age : num 3.09 3.64 3.26 3.56 3.4 ...
## $ SexMale : num 1 0 0 0 1 1 0 0 0 1 ...
## $ SexFemale : num 0 1 1 1 0 0 1 1 1 0 ...
## $ Embarked_S : num 1 0 1 1 0 1 1 0 1 1 ...
## $ Embarked_C : num 0 1 0 0 0 0 0 1 0 0 ...
## $ Embarked_Q : num 0 0 0 0 1 0 0 0 0 0 ...
Sex and Embarked engineering in validX_new
# unique value Sex feature
unique(validX_new$Sex)
## [1] male female
## Levels: female male
unique(validX_new$Embarked)
## [1] S C Q
## Levels: C Q S
# Convert each of this unique values to binary feature
validX_new <- validX_new %>% mutate(
SexMale = case_when(
Sex=="male"~1,
TRUE~0
),
SexFemale = case_when(
Sex=="female"~1,
TRUE~0
),
Embarked_S = case_when(
Embarked=="S"~1,
TRUE~0
),
Embarked_C = case_when(
Embarked=="C"~1,
TRUE~0
),
Embarked_Q = case_when(
Embarked=="Q"~1,
TRUE~0
)
)
str(validX_new)
## 'data.frame': 205 obs. of 21 variables:
## $ PassengerId : int 5 8 11 18 21 24 31 34 37 44 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 1 2 3 2 ...
## $ Name : chr "Allen, Mr. William Henry" "Palsson, Master. Gosta Leonard" "Sandstrom, Miss. Marguerite Rut" "Williams, Mr. Charles Eugene" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
## $ SibSp : int 0 3 1 0 0 0 0 0 0 1 ...
## $ Parch : int 0 1 1 0 0 0 0 0 0 2 ...
## $ Ticket : chr "373450" "349909" "PP 9549" "244373" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 1 3 1 1 ...
## $ titlePass : chr "Mr" "Master" "Miss" "Mr" ...
## $ surname : chr "Allen" "Palsson" "Sandstrom" "Williams" ...
## $ Age2 : num 35 2 4 31.5 35 28 40 66 31.5 3 ...
## $ Fare2 : num 8.05 21.07 16.7 13 26 ...
## $ YJTrans_Fare2: num -1.0495 0.3261 0.0462 -0.2913 0.5531 ...
## $ YJTrans_Age2 : num 0.434 -2.395 -2.158 0.183 0.434 ...
## $ log_Age2 : num 3.555 0.693 1.386 3.45 3.555 ...
## $ SexMale : num 1 1 0 1 1 1 1 1 1 0 ...
## $ SexFemale : num 0 0 1 0 0 0 0 0 0 1 ...
## $ Embarked_S : num 1 1 1 1 1 1 0 1 0 0 ...
## $ Embarked_C : num 0 0 0 0 0 0 1 0 1 1 ...
## $ Embarked_Q : num 0 0 0 0 0 0 0 0 0 0 ...
Sex and Embarked engineering in titan_test_new
unique(titan_test_new$Sex)
## [1] "male" "female"
unique(titan_test_new$Embarked)
## [1] Q S C
## Levels: C Q S
# Convert each of this unique values to binary feature
titan_test_new <- titan_test_new %>% mutate(
SexMale = case_when(
Sex=="male"~1,
TRUE~0
),
SexFemale = case_when(
Sex=="female"~1,
TRUE~0
),
Embarked_S = case_when(
Embarked=="S"~1,
TRUE~0
),
Embarked_C = case_when(
Embarked=="C"~1,
TRUE~0
),
Embarked_Q = case_when(
Embarked=="Q"~1,
TRUE~0
)
)
str(titan_test_new)
## 'data.frame': 418 obs. of 20 variables:
## $ PassengerId : int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ titlePass : chr "Mr" "Mrs" "Mr" "Mr" ...
## $ surname : chr "Kelly" "Wilkes" "Myles" "Wirz" ...
## $ Age2 : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ Fare2 : num 7.83 7 9.69 8.66 12.29 ...
## $ YJTrans_Fare2: num -1.057 -1.259 -0.698 -0.883 -0.335 ...
## $ YJTrans_Age2 : num 0.414 1.295 2.279 -0.153 -0.552 ...
## $ log_Age2 : num 3.54 3.85 4.13 3.3 3.09 ...
## $ SexMale : num 1 0 1 1 0 1 0 1 0 1 ...
## $ SexFemale : num 0 1 0 0 1 0 1 0 1 0 ...
## $ Embarked_S : num 0 1 0 1 1 1 0 1 0 1 ...
## $ Embarked_C : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Embarked_Q : num 1 0 1 0 0 0 1 0 0 0 ...
str(trainX_new)
## 'data.frame': 686 obs. of 21 variables:
## $ PassengerId : int 1 2 3 4 6 7 9 10 12 13 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 0 0 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
## $ titlePass : chr "Mr" "Mrs" "Miss" "Mrs" ...
## $ surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
## $ Age2 : num 22 38 26 35 30 54 27 14 58 20 ...
## $ Fare2 : num 7.25 71.28 7.92 53.1 8.46 ...
## $ YJTrans_Fare2: num -1.235 1.36 -1.077 1.167 -0.964 ...
## $ YJTrans_Age2 : num -0.5323 0.644 -0.2239 0.4337 0.0738 ...
## $ log_age : num 3.09 3.64 3.26 3.56 3.4 ...
## $ SexMale : num 1 0 0 0 1 1 0 0 0 1 ...
## $ SexFemale : num 0 1 1 1 0 0 1 1 1 0 ...
## $ Embarked_S : num 1 0 1 1 0 1 1 0 1 1 ...
## $ Embarked_C : num 0 1 0 0 0 0 0 1 0 0 ...
## $ Embarked_Q : num 0 0 0 0 1 0 0 0 0 0 ...
In common data analysis procedure, there are many feature selection method to be choosen. Each method has its own strength and weakness. Generally, these methods are divided into 4 groups : 1. Filter 2. Wrapper 3. Embedded 4. Hybrid For this project, i will utilize one of the most popular and reliable feature selection method named Boruta. For detailed explanationof Boruta, please visit this link analytics vidyha
# install boruta package
#install.packages("Boruta")
library(Boruta)
## Warning: package 'Boruta' was built under R version 4.1.3
before proceeding, it’s a good idea to, once again, have a look at the data structure. Here, I just realized that there are several features that should not be involved or need to be separated before we start the selection process
str(trainX_new)
## 'data.frame': 686 obs. of 21 variables:
## $ PassengerId : int 1 2 3 4 6 7 9 10 12 13 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 1 2 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 0 0 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 1 3 3 ...
## $ titlePass : chr "Mr" "Mrs" "Miss" "Mrs" ...
## $ surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
## $ Age2 : num 22 38 26 35 30 54 27 14 58 20 ...
## $ Fare2 : num 7.25 71.28 7.92 53.1 8.46 ...
## $ YJTrans_Fare2: num -1.235 1.36 -1.077 1.167 -0.964 ...
## $ YJTrans_Age2 : num -0.5323 0.644 -0.2239 0.4337 0.0738 ...
## $ log_age : num 3.09 3.64 3.26 3.56 3.4 ...
## $ SexMale : num 1 0 0 0 1 1 0 0 0 1 ...
## $ SexFemale : num 0 1 1 1 0 0 1 1 1 0 ...
## $ Embarked_S : num 1 0 1 1 0 1 1 0 1 1 ...
## $ Embarked_C : num 0 1 0 0 0 0 0 1 0 0 ...
## $ Embarked_Q : num 0 0 0 0 1 0 0 0 0 0 ...
First of all, i want to discard “PassangerId”, or rather, convert it into row names.
trainX_select <- trainX_new %>% remove_rownames() %>% column_to_rownames("PassengerId")
validX_select <- validX_new %>% remove_rownames() %>% column_to_rownames("PassengerId")
titan_test_select <- titan_test_new %>% remove_rownames() %>%
column_to_rownames("PassengerId")
Next, i need to discard Name,titlePass,surname, and Ticket feature, because they are unexplainable in term of data analysis interpretation. Moreover, the imbalance propotion of each levels would cause error in model.
trainX_select <- trainX_select %>% select(-c(Name,Ticket,surname,titlePass))
validX_select <- validX_select %>% select(-c(Name,Ticket,surname,titlePass))
titan_test_select <- titan_test_select %>% select(-c(Name,Ticket,surname,titlePass))
Furthermore, I also realized, that due to data transformation, our titanic data set has some redundant features. For example, the Sex feature with the SexMale and SexFemale features. For features like this, I want to discard non-transformed features such as Sex, Embarked, Age2, and Fare2.
# Data set with Yeo Johnson transformed Age2
trainX_trans <- trainX_select %>% select(-c(Sex,Embarked,Age2,Fare2,log_age))
validX_trans <- validX_select %>% select(-c(Sex,Embarked,Age2,Fare2,log_Age2))
titan_test_trans <- titan_test_select %>% select(-c(Sex,Embarked,Age2,Fare2,log_Age2))
# Data set with log transformed Age2
trainX_trans_log <- trainX_select %>% select(-c(Sex,Embarked,Age2,Fare2,YJTrans_Age2))
validX_trans_log <- validX_select %>% select(-c(Sex,Embarked,Age2,Fare2,YJTrans_Age2))
titan_test_trans_log <- titan_test_select %>% select(-c(Sex,Embarked,Age2,Fare2,YJTrans_Age2))
str(trainX_trans)
## 'data.frame': 686 obs. of 11 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 0 0 0 ...
## $ YJTrans_Fare2: num -1.235 1.36 -1.077 1.167 -0.964 ...
## $ YJTrans_Age2 : num -0.5323 0.644 -0.2239 0.4337 0.0738 ...
## $ SexMale : num 1 0 0 0 1 1 0 0 0 1 ...
## $ SexFemale : num 0 1 1 1 0 0 1 1 1 0 ...
## $ Embarked_S : num 1 0 1 1 0 1 1 0 1 1 ...
## $ Embarked_C : num 0 1 0 0 0 0 0 1 0 0 ...
## $ Embarked_Q : num 0 0 0 0 1 0 0 0 0 0 ...
str(trainX_trans_log)
## 'data.frame': 686 obs. of 11 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 2 1 3 ...
## $ SibSp : int 1 1 0 1 0 0 0 1 0 0 ...
## $ Parch : int 0 0 0 0 0 0 2 0 0 0 ...
## $ YJTrans_Fare2: num -1.235 1.36 -1.077 1.167 -0.964 ...
## $ log_age : num 3.09 3.64 3.26 3.56 3.4 ...
## $ SexMale : num 1 0 0 0 1 1 0 0 0 1 ...
## $ SexFemale : num 0 1 1 1 0 0 1 1 1 0 ...
## $ Embarked_S : num 1 0 1 1 0 1 1 0 1 1 ...
## $ Embarked_C : num 0 1 0 0 0 0 0 1 0 0 ...
## $ Embarked_Q : num 0 0 0 0 1 0 0 0 0 0 ...
# Apply Boruta
set.seed(141)
boruta_titan <- Boruta(Survived~., data=trainX_trans, doTrace=2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## After 10 iterations, +9.2 secs:
## confirmed 8 attributes: Embarked_S, Parch, Pclass, SexFemale, SexMale and 3 more;
## still have 2 attributes left.
## 11. run of importance source...
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## After 14 iterations, +13 secs:
## confirmed 1 attribute: Embarked_Q;
## still have 1 attribute left.
## 15. run of importance source...
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## 23. run of importance source...
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## 27. run of importance source...
## 28. run of importance source...
## 29. run of importance source...
## 30. run of importance source...
## 31. run of importance source...
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## 76. run of importance source...
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## 84. run of importance source...
## 85. run of importance source...
## 86. run of importance source...
## 87. run of importance source...
## 88. run of importance source...
## 89. run of importance source...
## 90. run of importance source...
## 91. run of importance source...
## 92. run of importance source...
## 93. run of importance source...
## 94. run of importance source...
## 95. run of importance source...
## 96. run of importance source...
## 97. run of importance source...
## 98. run of importance source...
## 99. run of importance source...
boruta_titan
## Boruta performed 99 iterations in 1.628624 mins.
## 9 attributes confirmed important: Embarked_Q, Embarked_S, Parch,
## Pclass, SexFemale and 4 more;
## No attributes deemed unimportant.
## 1 tentative attributes left: Embarked_C;
Suprisingly, all features in trainX_trans are deemed as important.
plot(boruta_titan,las=2)
Because there are no tentative feature, the only thing we need to do to print out a final data frame result derived from Boruta
attStats(boruta_titan)
## meanImp medianImp minImp maxImp normHits decision
## Pclass 29.373456 29.410889 26.3415017 33.056757 1.0000000 Confirmed
## SibSp 13.188614 13.179391 9.9398120 15.867781 1.0000000 Confirmed
## Parch 9.258545 9.195480 7.2416953 12.758392 0.9797980 Confirmed
## YJTrans_Fare2 26.320235 26.210472 22.2362680 30.762680 1.0000000 Confirmed
## YJTrans_Age2 21.451385 21.374573 17.9053520 26.006113 1.0000000 Confirmed
## SexMale 24.412133 24.269320 22.2670036 27.175844 1.0000000 Confirmed
## SexFemale 24.461863 24.504033 22.9301114 26.990550 1.0000000 Confirmed
## Embarked_S 9.085314 8.971972 6.0090053 12.977801 0.9797980 Confirmed
## Embarked_C 3.383204 3.483069 0.3167022 6.135968 0.5555556 Tentative
## Embarked_Q 7.286877 7.400099 4.4782788 10.231091 0.9494949 Confirmed
# just to ensure all data set has the same column name
colnames(validX_trans) <- colnames(trainX_trans)
Apply the same to the log_transformation data set
# Apply Boruta in log transformed data set
set.seed(141)
boruta_titan_log <- Boruta(Survived~., data=trainX_trans_log, doTrace=2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## After 10 iterations, +10 secs:
## confirmed 8 attributes: Embarked_S, log_age, Parch, Pclass, SexFemale and 3 more;
## still have 2 attributes left.
## 11. run of importance source...
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## After 14 iterations, +18 secs:
## confirmed 1 attribute: Embarked_Q;
## still have 1 attribute left.
## 15. run of importance source...
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## 23. run of importance source...
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## 27. run of importance source...
## 28. run of importance source...
## 29. run of importance source...
## 30. run of importance source...
## 31. run of importance source...
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## 76. run of importance source...
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## 84. run of importance source...
## 85. run of importance source...
## 86. run of importance source...
## 87. run of importance source...
## 88. run of importance source...
## 89. run of importance source...
## 90. run of importance source...
## 91. run of importance source...
## 92. run of importance source...
## 93. run of importance source...
## 94. run of importance source...
## 95. run of importance source...
## 96. run of importance source...
## 97. run of importance source...
## 98. run of importance source...
## 99. run of importance source...
boruta_titan_log
## Boruta performed 99 iterations in 1.654999 mins.
## 9 attributes confirmed important: Embarked_Q, Embarked_S, log_age,
## Parch, Pclass and 4 more;
## No attributes deemed unimportant.
## 1 tentative attributes left: Embarked_C;
Same as yeoJohnson transformed data, Boruta test in log transformed data set also deemed all features to be important.
colnames(validX_trans_log)<-colnames(trainX_trans_log)
There is no one-fit-all algorithm for any case. To create better model, i decide to just run multiple model and maybe ensembling it.
# First of all, i need to ensemble trainX_ and validX_ data set, to enable me conduct cross validation
titan_cross_df <- rbind(trainX_trans,validX_trans)
titan_cross_log_df <- rbind(trainX_trans_log,validX_trans_log)
dim(titan_cross_df)
## [1] 891 11
dim(titan_cross_log_df)
## [1] 891 11
1. Logistic Regression
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
#library(boot)
#library(MASS)
set.seed(909)
train_glm_control <- trainControl(method="cv",number=10)
titan_check_glm <- train(Survived~.,data=titan_cross_df,
trControl=train_glm_control,
method="glm",
family="binomial")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
titan_check_glm_step <- train(Survived~.,data=titan_cross_df,
trControl=train_glm_control,
method="glmStepAIC",
family="binomial")
## Start: AIC=721.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=721.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=721.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_S 1 701.08 719.08
## - Embarked_C 1 702.63 720.63
## <none> 701.08 721.08
## - YJTrans_Fare2 1 704.57 722.57
## - Parch 1 705.15 723.15
## - Pclass2 1 706.51 724.51
## - SibSp 1 715.48 733.48
## - Pclass3 1 716.17 734.17
## - YJTrans_Age2 1 735.77 753.77
## - SexMale 1 891.71 909.71
##
## Step: AIC=719.08
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_C
##
## Df Deviance AIC
## <none> 701.08 719.08
## - YJTrans_Fare2 1 704.57 720.57
## - Parch 1 705.20 721.20
## - Embarked_C 1 705.59 721.59
## - Pclass2 1 706.56 722.56
## - SibSp 1 715.53 731.53
## - Pclass3 1 716.18 732.18
## - YJTrans_Age2 1 735.98 751.98
## - SexMale 1 898.82 914.82
## Start: AIC=711.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=711.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=711.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 691.97 709.97
## - Embarked_S 1 693.35 711.35
## - Parch 1 693.50 711.50
## <none> 691.97 711.97
## - YJTrans_Fare2 1 694.08 712.08
## - Pclass2 1 696.41 714.41
## - SibSp 1 706.19 724.19
## - Pclass3 1 706.45 724.45
## - YJTrans_Age2 1 721.17 739.17
## - SexMale 1 900.52 918.52
##
## Step: AIC=709.97
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 693.51 709.51
## <none> 691.97 709.97
## - YJTrans_Fare2 1 694.08 710.08
## - Embarked_S 1 695.41 711.41
## - Pclass2 1 696.42 712.42
## - SibSp 1 706.21 722.21
## - Pclass3 1 706.49 722.49
## - YJTrans_Age2 1 721.18 737.18
## - SexMale 1 902.52 918.52
##
## Step: AIC=709.51
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 694.44 708.44
## <none> 693.51 709.51
## - Embarked_S 1 697.24 711.24
## - Pclass2 1 700.26 714.26
## - SibSp 1 707.34 721.34
## - Pclass3 1 715.43 729.43
## - YJTrans_Age2 1 722.39 736.39
## - SexMale 1 903.69 917.69
##
## Step: AIC=708.44
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 694.44 708.44
## - Embarked_S 1 698.34 710.34
## - SibSp 1 710.20 722.20
## - Pclass2 1 710.23 722.23
## - YJTrans_Age2 1 723.80 735.80
## - Pclass3 1 785.32 797.32
## - SexMale 1 916.31 928.31
## Start: AIC=727.26
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=727.26
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=727.26
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_S 1 707.57 725.57
## - Embarked_C 1 707.65 725.65
## <none> 707.26 727.26
## - Pclass2 1 709.59 727.59
## - YJTrans_Fare2 1 709.82 727.82
## - Parch 1 710.22 728.22
## - SibSp 1 717.16 735.16
## - Pclass3 1 720.08 738.08
## - YJTrans_Age2 1 731.54 749.54
## - SexMale 1 899.33 917.33
##
## Step: AIC=725.57
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_C
##
## Df Deviance AIC
## <none> 707.57 725.57
## - YJTrans_Fare2 1 710.03 726.03
## - Pclass2 1 710.09 726.09
## - Parch 1 710.67 726.67
## - Embarked_C 1 710.67 726.67
## - SibSp 1 717.63 733.63
## - Pclass3 1 720.28 736.28
## - YJTrans_Age2 1 732.29 748.29
## - SexMale 1 907.72 923.72
## Start: AIC=725.69
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=725.69
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=725.69
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 705.78 723.78
## - Embarked_S 1 706.20 724.20
## - Parch 1 706.97 724.97
## <none> 705.69 725.69
## - YJTrans_Fare2 1 708.48 726.48
## - Pclass2 1 708.72 726.72
## - SibSp 1 715.58 733.58
## - Pclass3 1 719.49 737.49
## - YJTrans_Age2 1 729.17 747.17
## - SexMale 1 905.21 923.21
##
## Step: AIC=723.78
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 707.05 723.05
## <none> 705.78 723.78
## - Embarked_S 1 708.37 724.37
## - YJTrans_Fare2 1 708.76 724.76
## - Pclass2 1 708.82 724.82
## - SibSp 1 715.87 731.87
## - Pclass3 1 719.70 735.70
## - YJTrans_Age2 1 729.23 745.23
## - SexMale 1 906.62 922.62
##
## Step: AIC=723.05
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 708.87 722.87
## <none> 707.05 723.05
## - Embarked_S 1 709.90 723.90
## - Pclass2 1 711.63 725.63
## - SibSp 1 716.92 730.92
## - Pclass3 1 727.09 741.09
## - YJTrans_Age2 1 730.11 744.11
## - SexMale 1 907.62 921.62
##
## Step: AIC=722.87
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 708.87 722.87
## - Embarked_S 1 712.06 724.06
## - SibSp 1 717.32 729.32
## - Pclass2 1 722.32 734.32
## - YJTrans_Age2 1 732.83 744.83
## - Pclass3 1 802.55 814.55
## - SexMale 1 924.10 936.10
## Start: AIC=714.17
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=714.17
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=714.17
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_S 1 694.33 712.33
## - Embarked_C 1 694.38 712.38
## <none> 694.17 714.17
## - Parch 1 696.66 714.66
## - YJTrans_Fare2 1 697.09 715.09
## - Pclass2 1 699.29 717.29
## - SibSp 1 706.24 724.24
## - Pclass3 1 711.00 729.00
## - YJTrans_Age2 1 731.96 749.96
## - SexMale 1 889.94 907.94
##
## Step: AIC=712.33
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 695.91 711.91
## <none> 694.33 712.33
## - Parch 1 696.93 712.93
## - YJTrans_Fare2 1 697.19 713.19
## - Pclass2 1 699.66 715.66
## - SibSp 1 706.55 722.55
## - Pclass3 1 711.07 727.07
## - YJTrans_Age2 1 732.65 748.65
## - SexMale 1 898.26 914.26
##
## Step: AIC=711.91
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale
##
## Df Deviance AIC
## <none> 695.91 711.91
## - Parch 1 698.62 712.62
## - YJTrans_Fare2 1 699.39 713.39
## - Pclass2 1 702.17 716.17
## - SibSp 1 709.76 723.76
## - Pclass3 1 713.26 727.26
## - YJTrans_Age2 1 735.57 749.57
## - SexMale 1 901.72 915.72
## Start: AIC=725.64
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=725.64
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=725.64
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 705.66 723.66
## - Embarked_S 1 706.52 724.52
## - Parch 1 706.91 724.91
## <none> 705.64 725.64
## - YJTrans_Fare2 1 708.44 726.44
## - Pclass2 1 710.37 728.37
## - SibSp 1 719.59 737.59
## - Pclass3 1 720.70 738.70
## - YJTrans_Age2 1 728.52 746.52
## - SexMale 1 901.00 919.00
##
## Step: AIC=723.66
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 706.92 722.92
## <none> 705.66 723.66
## - YJTrans_Fare2 1 708.52 724.52
## - Embarked_S 1 708.59 724.59
## - Pclass2 1 710.41 726.41
## - SibSp 1 719.68 735.68
## - Pclass3 1 720.90 736.90
## - YJTrans_Age2 1 728.55 744.55
## - SexMale 1 902.19 918.19
##
## Step: AIC=722.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 708.70 722.70
## <none> 706.92 722.92
## - Embarked_S 1 710.17 724.17
## - Pclass2 1 713.18 727.18
## - SibSp 1 720.98 734.98
## - Pclass3 1 727.17 741.17
## - YJTrans_Age2 1 729.00 743.00
## - SexMale 1 903.00 917.00
##
## Step: AIC=722.7
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 708.70 722.70
## - Embarked_S 1 712.18 724.18
## - SibSp 1 722.47 734.47
## - Pclass2 1 724.91 736.91
## - YJTrans_Age2 1 731.51 743.51
## - Pclass3 1 802.07 814.07
## - SexMale 1 915.42 927.42
## Start: AIC=714.35
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=714.35
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=714.35
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 694.41 712.41
## - Embarked_S 1 694.82 712.82
## - Parch 1 695.45 713.45
## <none> 694.35 714.35
## - YJTrans_Fare2 1 696.54 714.54
## - Pclass2 1 698.53 716.53
## - SibSp 1 708.39 726.39
## - Pclass3 1 709.14 727.14
## - YJTrans_Age2 1 724.42 742.42
## - SexMale 1 891.83 909.83
##
## Step: AIC=712.41
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 695.49 711.49
## <none> 694.41 712.41
## - Embarked_S 1 696.42 712.42
## - YJTrans_Fare2 1 696.72 712.72
## - Pclass2 1 698.58 714.58
## - SibSp 1 708.51 724.51
## - Pclass3 1 709.30 725.30
## - YJTrans_Age2 1 724.46 740.46
## - SexMale 1 893.90 909.90
##
## Step: AIC=711.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 696.84 710.84
## <none> 695.49 711.49
## - Embarked_S 1 697.77 711.77
## - Pclass2 1 701.50 715.50
## - SibSp 1 709.33 723.33
## - Pclass3 1 717.00 731.00
## - YJTrans_Age2 1 725.07 739.07
## - SexMale 1 895.63 909.63
##
## Step: AIC=710.84
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 696.84 710.84
## - Embarked_S 1 699.33 711.33
## - SibSp 1 711.03 723.03
## - Pclass2 1 711.64 723.64
## - YJTrans_Age2 1 727.18 739.18
## - Pclass3 1 788.19 800.19
## - SexMale 1 908.79 920.79
## Start: AIC=711.47
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=711.47
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=711.47
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 691.55 709.55
## - Embarked_S 1 692.59 710.59
## <none> 691.47 711.47
## - Parch 1 694.28 712.28
## - YJTrans_Fare2 1 694.64 712.64
## - Pclass2 1 696.79 714.79
## - SibSp 1 704.79 722.79
## - Pclass3 1 707.25 725.25
## - YJTrans_Age2 1 723.41 741.41
## - SexMale 1 881.42 899.42
##
## Step: AIC=709.55
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 691.55 709.55
## - Parch 1 694.32 710.32
## - YJTrans_Fare2 1 694.90 710.90
## - Embarked_S 1 696.24 712.24
## - Pclass2 1 696.87 712.87
## - SibSp 1 704.94 720.94
## - Pclass3 1 707.50 723.50
## - YJTrans_Age2 1 723.44 739.44
## - SexMale 1 884.20 900.20
## Start: AIC=719.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=719.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=719.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 699.50 717.50
## - YJTrans_Fare2 1 699.66 717.66
## - Embarked_S 1 699.76 717.76
## - Parch 1 700.35 718.35
## <none> 699.29 719.29
## - Pclass2 1 705.85 723.85
## - SibSp 1 709.45 727.45
## - Pclass3 1 721.90 739.90
## - YJTrans_Age2 1 731.18 749.18
## - SexMale 1 900.41 918.41
##
## Step: AIC=717.5
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 699.95 715.95
## - Parch 1 700.54 716.54
## <none> 699.50 717.50
## - Embarked_S 1 702.67 718.67
## - Pclass2 1 706.11 722.11
## - SibSp 1 709.92 725.92
## - Pclass3 1 722.41 738.41
## - YJTrans_Age2 1 731.40 747.40
## - SexMale 1 901.33 917.33
##
## Step: AIC=715.95
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 700.57 714.57
## <none> 699.95 715.95
## - Embarked_S 1 703.41 717.41
## - SibSp 1 711.83 725.83
## - Pclass2 1 715.12 729.12
## - YJTrans_Age2 1 731.97 745.97
## - Pclass3 1 805.98 819.98
## - SexMale 1 906.07 920.07
##
## Step: AIC=714.57
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 700.57 714.57
## - Embarked_S 1 704.13 716.13
## - SibSp 1 715.48 727.48
## - Pclass2 1 715.63 727.63
## - YJTrans_Age2 1 732.22 744.22
## - Pclass3 1 806.49 818.49
## - SexMale 1 910.38 922.38
## Start: AIC=712.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=712.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=712.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 692.55 710.55
## - Parch 1 692.80 710.80
## - Embarked_S 1 693.00 711.00
## <none> 692.49 712.49
## - YJTrans_Fare2 1 694.60 712.60
## - Pclass2 1 696.77 714.77
## - Pclass3 1 708.93 726.93
## - SibSp 1 710.36 728.36
## - YJTrans_Age2 1 722.80 740.80
## - SexMale 1 885.11 903.11
##
## Step: AIC=710.55
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 692.84 708.84
## <none> 692.55 710.55
## - YJTrans_Fare2 1 694.77 710.77
## - Embarked_S 1 694.91 710.91
## - Pclass2 1 696.85 712.85
## - Pclass3 1 709.20 725.20
## - SibSp 1 710.66 726.66
## - YJTrans_Age2 1 722.87 738.87
## - SexMale 1 887.42 903.42
##
## Step: AIC=708.84
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 694.80 708.80
## <none> 692.84 708.84
## - Embarked_S 1 695.41 709.41
## - Pclass2 1 698.17 712.17
## - SibSp 1 710.99 724.99
## - Pclass3 1 713.54 727.54
## - YJTrans_Age2 1 722.91 736.91
## - SexMale 1 889.91 903.91
##
## Step: AIC=708.8
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 694.80 708.80
## - Embarked_S 1 697.72 709.72
## - Pclass2 1 709.59 721.59
## - SibSp 1 713.44 725.44
## - YJTrans_Age2 1 725.67 737.67
## - Pclass3 1 790.70 802.70
## - SexMale 1 904.79 916.79
## Start: AIC=797.11
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=797.11
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=797.11
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 777.28 795.28
## - Embarked_S 1 777.64 795.64
## - Parch 1 779.05 797.05
## <none> 777.11 797.11
## - YJTrans_Fare2 1 779.68 797.68
## - Pclass2 1 782.03 800.03
## - SibSp 1 791.38 809.38
## - Pclass3 1 794.54 812.54
## - YJTrans_Age2 1 809.85 827.85
## - SexMale 1 995.09 1013.09
##
## Step: AIC=795.28
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## YJTrans_Age2 + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 779.19 795.19
## <none> 777.28 795.28
## - YJTrans_Fare2 1 780.05 796.05
## - Embarked_S 1 780.39 796.39
## - Pclass2 1 782.23 798.23
## - SibSp 1 791.76 807.76
## - Pclass3 1 794.97 810.97
## - YJTrans_Age2 1 810.00 826.00
## - SexMale 1 996.61 1012.61
##
## Step: AIC=795.19
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + YJTrans_Age2 +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 780.49 794.49
## <none> 779.19 795.19
## - Embarked_S 1 782.69 796.69
## - Pclass2 1 786.55 800.55
## - SibSp 1 793.40 807.40
## - Pclass3 1 805.13 819.13
## - YJTrans_Age2 1 811.26 825.26
## - SexMale 1 997.47 1011.47
##
## Step: AIC=794.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Age2 + SexMale +
## Embarked_S
##
## Df Deviance AIC
## <none> 780.49 794.49
## - Embarked_S 1 784.24 796.24
## - SibSp 1 795.47 807.47
## - Pclass2 1 797.78 809.78
## - YJTrans_Age2 1 813.15 825.15
## - Pclass3 1 887.71 899.71
## - SexMale 1 1012.13 1024.13
Apply the same steps to log transformed data set
set.seed(821)
#train_glm_control <- trainControl(method="cv",number=10)
titan_log_check_glm <- train(Survived~.,data=titan_cross_log_df,
trControl=train_glm_control,
method="glm",
family="binomial")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
titan_log_check_glm_step <- train(Survived~.,data=titan_cross_log_df,
trControl=train_glm_control,
method="glmStepAIC",
family="binomial")
## Start: AIC=702.9
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=702.9
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=702.9
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_S 1 683.32 701.32
## - Embarked_C 1 683.40 701.40
## <none> 682.90 702.90
## - YJTrans_Fare2 1 685.79 703.79
## - Parch 1 686.51 704.51
## - Pclass2 1 686.99 704.99
## - Pclass3 1 695.52 713.52
## - SibSp 1 702.03 720.03
## - log_age 1 722.48 740.48
## - SexMale 1 893.38 911.38
##
## Step: AIC=701.32
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_C
##
## Df Deviance AIC
## <none> 683.32 701.32
## - YJTrans_Fare2 1 686.15 702.15
## - Parch 1 687.20 703.20
## - Embarked_C 1 687.58 703.58
## - Pclass2 1 687.65 703.65
## - Pclass3 1 695.73 711.73
## - SibSp 1 702.72 718.72
## - log_age 1 723.34 739.34
## - SexMale 1 902.94 918.94
## Start: AIC=705.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=705.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=705.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 686.12 704.12
## - Embarked_S 1 686.38 704.38
## - YJTrans_Fare2 1 687.12 705.12
## - Parch 1 687.15 705.15
## <none> 685.99 705.99
## - Pclass2 1 691.46 709.46
## - SibSp 1 700.38 718.38
## - Pclass3 1 703.71 721.71
## - log_age 1 718.94 736.94
## - SexMale 1 889.60 907.60
##
## Step: AIC=704.12
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## - Parch 1 687.27 703.27
## - YJTrans_Fare2 1 687.38 703.38
## <none> 686.12 704.12
## - Embarked_S 1 688.38 704.38
## - Pclass2 1 691.59 707.59
## - SibSp 1 700.74 716.74
## - Pclass3 1 704.02 720.02
## - log_age 1 719.12 735.12
## - SexMale 1 890.86 906.86
##
## Step: AIC=703.27
## .outcome ~ Pclass2 + Pclass3 + SibSp + YJTrans_Fare2 + log_age +
## SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 687.77 701.77
## <none> 687.27 703.27
## - Embarked_S 1 689.85 703.85
## - Pclass2 1 694.58 708.58
## - SibSp 1 701.63 715.63
## - Pclass3 1 711.72 725.72
## - log_age 1 719.16 733.16
## - SexMale 1 892.22 906.22
##
## Step: AIC=701.77
## .outcome ~ Pclass2 + Pclass3 + SibSp + log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 687.77 701.77
## - Embarked_S 1 690.48 702.48
## - Pclass2 1 703.15 715.15
## - SibSp 1 705.59 717.59
## - log_age 1 720.81 732.81
## - Pclass3 1 789.39 801.39
## - SexMale 1 903.50 915.50
## Start: AIC=705.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=705.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=705.92
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 685.95 703.95
## - Embarked_S 1 686.86 704.86
## <none> 685.92 705.92
## - YJTrans_Fare2 1 688.78 706.78
## - Pclass2 1 690.80 708.80
## - Parch 1 692.98 710.98
## - SibSp 1 699.17 717.17
## - Pclass3 1 700.09 718.09
## - log_age 1 733.16 751.16
## - SexMale 1 900.08 918.08
##
## Step: AIC=703.95
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## - Embarked_S 1 687.83 703.83
## <none> 685.95 703.95
## - YJTrans_Fare2 1 688.79 704.79
## - Pclass2 1 690.83 706.83
## - Parch 1 693.03 709.03
## - SibSp 1 699.19 715.19
## - Pclass3 1 700.09 716.09
## - log_age 1 733.17 749.17
## - SexMale 1 902.19 918.19
##
## Step: AIC=703.83
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale
##
## Df Deviance AIC
## <none> 687.83 703.83
## - YJTrans_Fare2 1 691.09 705.09
## - Pclass2 1 693.78 707.78
## - Parch 1 695.45 709.45
## - Pclass3 1 702.12 716.12
## - SibSp 1 702.76 716.76
## - log_age 1 736.84 750.84
## - SexMale 1 913.35 927.35
## Start: AIC=715.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=715.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=715.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 695.61 713.61
## - Embarked_S 1 697.13 715.13
## <none> 695.61 715.61
## - Pclass2 1 698.17 716.17
## - YJTrans_Fare2 1 700.32 718.32
## - Parch 1 701.41 719.41
## - Pclass3 1 704.59 722.59
## - SibSp 1 711.67 729.67
## - log_age 1 733.17 751.17
## - SexMale 1 902.99 920.99
##
## Step: AIC=713.61
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 695.61 713.61
## - Pclass2 1 698.17 714.17
## - Embarked_S 1 699.37 715.37
## - YJTrans_Fare2 1 700.35 716.35
## - Parch 1 701.43 717.43
## - Pclass3 1 704.60 720.60
## - SibSp 1 711.72 727.72
## - log_age 1 733.17 749.17
## - SexMale 1 905.86 921.86
## Start: AIC=692.81
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=692.81
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=692.81
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_S 1 672.99 690.99
## - Embarked_C 1 673.16 691.16
## <none> 672.81 692.81
## - Parch 1 675.48 693.48
## - YJTrans_Fare2 1 675.93 693.93
## - Pclass2 1 677.11 695.11
## - Pclass3 1 685.56 703.56
## - SibSp 1 691.12 709.12
## - log_age 1 716.75 734.75
## - SexMale 1 893.86 911.86
##
## Step: AIC=690.99
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_C
##
## Df Deviance AIC
## <none> 672.99 690.99
## - Embarked_C 1 675.10 691.10
## - Parch 1 675.76 691.76
## - YJTrans_Fare2 1 676.03 692.03
## - Pclass2 1 677.48 693.48
## - Pclass3 1 685.64 701.64
## - SibSp 1 691.50 707.50
## - log_age 1 717.35 733.35
## - SexMale 1 903.52 919.52
## Start: AIC=719.65
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=719.65
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=719.65
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 699.67 717.67
## - Embarked_S 1 700.54 718.54
## <none> 699.65 719.65
## - YJTrans_Fare2 1 701.96 719.96
## - Parch 1 702.99 720.99
## - Pclass2 1 704.86 722.86
## - SibSp 1 713.14 731.14
## - Pclass3 1 714.62 732.62
## - log_age 1 744.59 762.59
## - SexMale 1 889.67 907.67
##
## Step: AIC=717.67
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## - Embarked_S 1 701.63 717.63
## <none> 699.67 717.67
## - YJTrans_Fare2 1 701.96 717.96
## - Parch 1 703.02 719.02
## - Pclass2 1 704.86 720.86
## - SibSp 1 713.14 729.14
## - Pclass3 1 714.65 730.65
## - log_age 1 744.60 760.60
## - SexMale 1 893.89 909.89
##
## Step: AIC=717.63
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale
##
## Df Deviance AIC
## <none> 701.63 717.63
## - YJTrans_Fare2 1 704.40 718.40
## - Parch 1 705.44 719.44
## - Pclass2 1 707.74 721.74
## - Pclass3 1 716.44 730.44
## - SibSp 1 717.35 731.35
## - log_age 1 748.51 762.51
## - SexMale 1 903.95 917.95
## Start: AIC=704.98
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=704.98
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=704.98
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 685.04 703.04
## - YJTrans_Fare2 1 685.38 703.38
## - Embarked_S 1 685.61 703.61
## <none> 684.98 704.98
## - Parch 1 687.10 705.10
## - Pclass2 1 691.77 709.77
## - SibSp 1 697.12 715.12
## - Pclass3 1 705.79 723.79
## - log_age 1 729.14 747.14
## - SexMale 1 883.19 901.19
##
## Step: AIC=703.04
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## - YJTrans_Fare2 1 685.49 701.49
## <none> 685.04 703.04
## - Parch 1 687.14 703.14
## - Embarked_S 1 687.70 703.70
## - Pclass2 1 691.86 707.86
## - SibSp 1 697.35 713.35
## - Pclass3 1 706.07 722.07
## - log_age 1 729.30 745.30
## - SexMale 1 885.70 901.70
##
## Step: AIC=701.49
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + log_age + SexMale +
## Embarked_S
##
## Df Deviance AIC
## - Parch 1 687.15 701.15
## <none> 685.49 701.49
## - Embarked_S 1 688.35 702.35
## - SibSp 1 699.76 713.76
## - Pclass2 1 701.70 715.70
## - log_age 1 730.10 744.10
## - Pclass3 1 788.03 802.03
## - SexMale 1 889.93 903.93
##
## Step: AIC=701.15
## .outcome ~ Pclass2 + Pclass3 + SibSp + log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 687.15 701.15
## - Embarked_S 1 690.28 702.28
## - Pclass2 1 703.01 715.01
## - SibSp 1 705.77 717.77
## - log_age 1 730.21 742.21
## - Pclass3 1 789.21 801.21
## - SexMale 1 893.88 905.88
## Start: AIC=708.25
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=708.25
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=708.25
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 688.40 706.40
## - Embarked_S 1 688.68 706.68
## <none> 688.25 708.25
## - Pclass2 1 691.10 709.10
## - YJTrans_Fare2 1 691.69 709.69
## - Parch 1 692.39 710.39
## - Pclass3 1 700.70 718.70
## - SibSp 1 705.84 723.84
## - log_age 1 727.32 745.32
## - SexMale 1 892.35 910.35
##
## Step: AIC=706.4
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 688.40 706.40
## - Embarked_S 1 690.91 706.91
## - Pclass2 1 691.27 707.27
## - YJTrans_Fare2 1 692.08 708.08
## - Parch 1 692.51 708.51
## - Pclass3 1 701.04 717.04
## - SibSp 1 706.28 722.28
## - log_age 1 727.59 743.59
## - SexMale 1 893.21 909.21
## Start: AIC=707.07
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=707.07
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=707.07
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 687.39 705.39
## - Embarked_S 1 687.48 705.48
## <none> 687.07 707.07
## - YJTrans_Fare2 1 689.30 707.30
## - Parch 1 690.16 708.16
## - Pclass2 1 690.27 708.27
## - Pclass3 1 700.16 718.16
## - SibSp 1 700.38 718.38
## - log_age 1 723.67 741.67
## - SexMale 1 900.13 918.13
##
## Step: AIC=705.39
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 687.39 705.39
## - YJTrans_Fare2 1 689.83 705.83
## - Parch 1 690.33 706.33
## - Pclass2 1 690.65 706.65
## - Embarked_S 1 690.66 706.66
## - Pclass3 1 700.92 716.92
## - SibSp 1 701.14 717.14
## - log_age 1 723.97 739.97
## - SexMale 1 901.49 917.49
## Start: AIC=707.45
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=707.45
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=707.45
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 687.55 705.55
## - Embarked_S 1 688.02 706.02
## <none> 687.45 707.45
## - YJTrans_Fare2 1 690.14 708.14
## - Pclass2 1 690.86 708.86
## - Parch 1 691.74 709.74
## - Pclass3 1 701.30 719.30
## - SibSp 1 702.47 720.47
## - log_age 1 732.57 750.57
## - SexMale 1 882.76 900.76
##
## Step: AIC=705.55
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 687.55 705.55
## - Embarked_S 1 690.31 706.31
## - YJTrans_Fare2 1 690.46 706.46
## - Pclass2 1 690.96 706.96
## - Parch 1 691.83 707.83
## - Pclass3 1 701.51 717.51
## - SibSp 1 702.82 718.82
## - log_age 1 732.84 748.84
## - SexMale 1 884.41 900.41
## Start: AIC=784.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C +
## Embarked_Q
##
##
## Step: AIC=784.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + SexFemale + Embarked_S + Embarked_C
##
##
## Step: AIC=784.29
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S + Embarked_C
##
## Df Deviance AIC
## - Embarked_C 1 764.38 782.38
## - Embarked_S 1 764.94 782.94
## <none> 764.29 784.29
## - YJTrans_Fare2 1 766.97 784.97
## - Parch 1 768.25 786.25
## - Pclass2 1 768.95 786.95
## - Pclass3 1 779.88 797.88
## - SibSp 1 781.19 799.19
## - log_age 1 809.85 827.85
## - SexMale 1 992.87 1010.87
##
## Step: AIC=782.38
## .outcome ~ Pclass2 + Pclass3 + SibSp + Parch + YJTrans_Fare2 +
## log_age + SexMale + Embarked_S
##
## Df Deviance AIC
## <none> 764.38 782.38
## - YJTrans_Fare2 1 767.22 783.22
## - Embarked_S 1 767.34 783.34
## - Parch 1 768.30 784.30
## - Pclass2 1 769.06 785.06
## - Pclass3 1 780.19 796.19
## - SibSp 1 781.52 797.52
## - log_age 1 810.00 826.00
## - SexMale 1 994.81 1010.81
check the result
titan_check_glm
## Generalized Linear Model
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 801, 802, 802, 802, 802, 802, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7979401 0.5679377
titan_check_glm_step
## Generalized Linear Model with Stepwise Feature Selection
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8002747 0.5738776
titan_log_check_glm
## Generalized Linear Model
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 801, 802, 802, 801, 802, 802, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8080706 0.5876776
titan_log_check_glm_step
## Generalized Linear Model with Stepwise Feature Selection
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 801, 802, 803, 802, 802, 802, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8114683 0.5947568
In short,stepwise logistic regression with log-transformed Age feature is the model with highest accuracy among four logistic models i builded; It has 81.14% accuracy.
2. Decision Tree
In this project, i use C.50 algorithm decision tree, because it combines the speed from C4.5 and accuracy from CART.
# load library
library(C50)
## Warning: package 'C50' was built under R version 4.1.2
set.seed(123)
# define train control
train_control <- trainControl(method="cv", number=7)
#train model
titan_check_c50tree <- train(Survived~.,data=titan_cross_df,trControl=train_control,
method="C5.0")
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
titan_check_c50tree_log <- train(Survived~.,data=titan_cross_log_df,
trControl=train_control,
method="C5.0")
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
# Result of C5.0 Cross Validation
titan_check_c50tree
## C5.0
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (7 fold)
## Summary of sample sizes: 765, 764, 763, 763, 764, 764, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8248351 0.6106760
## rules FALSE 10 0.8226380 0.6145652
## rules FALSE 20 0.8147995 0.5999015
## rules TRUE 1 0.8068545 0.5727792
## rules TRUE 10 0.8012387 0.5676018
## rules TRUE 20 0.7934529 0.5535953
## tree FALSE 1 0.8259687 0.6129549
## tree FALSE 10 0.8215223 0.6127888
## tree FALSE 20 0.8136565 0.5965768
## tree TRUE 1 0.8057384 0.5705426
## tree TRUE 10 0.8000875 0.5702632
## tree TRUE 20 0.8046137 0.5735554
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = tree and winnow
## = FALSE.
titan_check_c50tree_log
## C5.0
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (7 fold)
## Summary of sample sizes: 764, 764, 763, 764, 763, 764, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8270968 0.6211079
## rules FALSE 10 0.8260071 0.6206052
## rules FALSE 20 0.8137040 0.5963872
## rules TRUE 1 0.8282217 0.6254257
## rules TRUE 10 0.8181155 0.6039522
## rules TRUE 20 0.8192755 0.6070621
## tree FALSE 1 0.8248559 0.6162159
## tree FALSE 10 0.8192843 0.6083743
## tree FALSE 20 0.8215252 0.6091254
## tree TRUE 1 0.8271056 0.6228232
## tree TRUE 10 0.8248998 0.6189354
## tree TRUE 20 0.8170522 0.6038817
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
## = TRUE.
C5.0 decision tree has higher accuracy than logistic regression model, and log transformed decision tree has the highest accuracy among them. Its accuracy is 82.7%.
3. Random Forest
As the name sugest, Random Forest (RF) is a collection of decision trees. It is a common type of ensemble methods that aggregate results from multiple predictors. Random Forest utilizes bagging technique that allows each tree trained on random sampling of original data set and takes the majority vote from the trees. It less interpretable, but has better accuracy than decision tree.
set.seed(761)
train_control_rf <- trainControl(method="cv", number=10)
#train model
titan_check_rf <- train(Survived~.,data=titan_cross_df,
trControl=train_control_rf,
method="rf")
titan_log_check_rf <- train(Survived~.,data=titan_cross_log_df,
trControl=train_control_rf,
method="rf")
Let’s see the accuracy result
titan_check_rf
## Random Forest
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 803, 802, 802, 801, 802, 802, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8136281 0.5862745
## 6 0.8203825 0.6111680
## 11 0.8057627 0.5805231
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
titan_log_check_rf
## Random Forest
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 802, 802, 801, 802, 802, 803, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8237782 0.6081292
## 6 0.8102820 0.5927915
## 11 0.8068485 0.5882175
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Supprisingly, both random forest model using yeo Johnson and Log transformed data set is lower than C5.0 decision tree using log transformed data set. While generally, these models still genereate good and higher result compared to other.
4. K-Nearest Neighbour (KNN)
K-Nearest Neighbor (KNN) algorithm predicts based on the specified number (k) of the nearest neighboring data points. Here, the pre-processing of the data is significant as it impacts the distance measurements directly. Unlike others, the model does not have a mathematical formula, neither any descriptive ability.
Here, the parameter ‘k’ needs to be chosen wisely; as a value lower than optimal leads to bias, whereas a higher value impacts prediction accuracy.
set.seed(721)
train_control_knn <- trainControl(method="cv", number=10)
#train model
titan_check_knn <- train(Survived~.,data=titan_cross_df,
trControl=train_control_knn,
method="knn")
titan_log_check_knn <- train(Survived~.,data=titan_cross_log_df,
trControl=train_control_knn,
method="knn")
Check the result of KNN model
titan_check_knn
## k-Nearest Neighbors
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 801, 802, 802, 801, 802, 802, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8114062 0.5902591
## 7 0.8001822 0.5650622
## 9 0.8159511 0.5994137
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
titan_log_check_knn
## k-Nearest Neighbors
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 803, 802, 801, 802, 801, 802, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8180842 0.6070636
## 7 0.8102818 0.5903405
## 9 0.8102562 0.5851296
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
Still, no KNN model has accuracy higher than the accuracy of C5.0 model.
After running all of these model, we can conclude that accuracy of our model is ranged from 79%-82.7%, higher accuracy with log-transformed Age2 feature, and propably, the most suitable model for this data is the decision tree algorithm c5.0. Therefore, I will build a Titanic passenger survival classification model using this model.
library(C50)
# Build model
titan_treec50_fit <- C5.0(trainX_trans[,-1],trainX_trans[,1],rules = TRUE)
titan_treec50_fit
##
## Call:
## C5.0.default(x = trainX_trans[, -1], y = trainX_trans[, 1], rules = TRUE)
##
## Rule-Based Model
## Number of samples: 686
## Number of predictors: 10
##
## Number of Rules: 7
##
## Non-standard options: attempt to group attributes
summary(titan_treec50_fit)
##
## Call:
## C5.0.default(x = trainX_trans[, -1], y = trainX_trans[, 1], rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Nov 28 04:03:10 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 686 cases (11 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (19/1, lift 1.5)
## SibSp > 2
## SexMale > 0
## -> class 0 [0.905]
##
## Rule 2: (420/67, lift 1.4)
## YJTrans_Age2 > -1.278483
## SexMale > 0
## -> class 0 [0.839]
##
## Rule 3: (263/51, lift 1.3)
## Pclass = 3
## Embarked_S > 0
## -> class 0 [0.804]
##
## Rule 4: (129/7, lift 2.4)
## Pclass in {1, 2}
## SexMale <= 0
## -> class 1 [0.939]
##
## Rule 5: (10, lift 2.4)
## Parch > 0
## YJTrans_Fare2 <= 0.2986848
## SexMale <= 0
## Embarked_S > 0
## -> class 1 [0.917]
##
## Rule 6: (18/1, lift 2.3)
## SibSp <= 2
## YJTrans_Age2 <= -1.278483
## SexMale > 0
## -> class 1 [0.900]
##
## Rule 7: (77/9, lift 2.3)
## SexMale <= 0
## Embarked_S <= 0
## -> class 1 [0.873]
##
## Default class: 0
##
##
## Evaluation on training data (686 cases):
##
## Rules
## ----------------
## No Errors
##
## 7 103(15.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 407 16 (a): class 0
## 87 176 (b): class 1
##
##
## Attribute usage:
##
## 91.25% SexMale
## 63.85% YJTrans_Age2
## 57.14% Pclass
## 50.15% Embarked_S
## 5.39% SibSp
## 1.46% Parch
## 1.46% YJTrans_Fare2
##
##
## Time: 0.0 secs
Do the same to log-transformed data set
titan_log_treec50_fit <- C5.0(trainX_trans_log[,-1],trainX_trans_log[,1],rules = TRUE)
titan_log_treec50_fit
##
## Call:
## C5.0.default(x = trainX_trans_log[, -1], y = trainX_trans_log[, 1], rules
## = TRUE)
##
## Rule-Based Model
## Number of samples: 686
## Number of predictors: 10
##
## Number of Rules: 7
##
## Non-standard options: attempt to group attributes
summary(titan_log_treec50_fit)
##
## Call:
## C5.0.default(x = trainX_trans_log[, -1], y = trainX_trans_log[, 1], rules
## = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Nov 28 04:03:10 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 686 cases (11 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (19/1, lift 1.5)
## SibSp > 2
## SexMale > 0
## -> class 0 [0.905]
##
## Rule 2: (420/67, lift 1.4)
## log_age > 2.564949
## SexMale > 0
## -> class 0 [0.839]
##
## Rule 3: (263/51, lift 1.3)
## Pclass = 3
## Embarked_S > 0
## -> class 0 [0.804]
##
## Rule 4: (129/7, lift 2.4)
## Pclass in {1, 2}
## SexMale <= 0
## -> class 1 [0.939]
##
## Rule 5: (10, lift 2.4)
## Parch > 0
## YJTrans_Fare2 <= 0.2986848
## SexMale <= 0
## Embarked_S > 0
## -> class 1 [0.917]
##
## Rule 6: (18/1, lift 2.3)
## SibSp <= 2
## log_age <= 2.564949
## SexMale > 0
## -> class 1 [0.900]
##
## Rule 7: (77/9, lift 2.3)
## SexMale <= 0
## Embarked_S <= 0
## -> class 1 [0.873]
##
## Default class: 0
##
##
## Evaluation on training data (686 cases):
##
## Rules
## ----------------
## No Errors
##
## 7 103(15.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 407 16 (a): class 0
## 87 176 (b): class 1
##
##
## Attribute usage:
##
## 91.25% SexMale
## 63.85% log_age
## 57.14% Pclass
## 50.15% Embarked_S
## 5.39% SibSp
## 1.46% Parch
## 1.46% YJTrans_Fare2
##
##
## Time: 0.0 secs
# Predict the value in validation data set
titan_treec50_predict <- predict(titan_treec50_fit,validX_trans)
titan_log_treec50_predict <- predict(titan_log_treec50_fit,validX_trans_log)
titan_treec50_predict
## [1] 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0
## [38] 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1
## [75] 1 1 0 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0
## [112] 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1
## [149] 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
## [186] 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0
## Levels: 0 1
Plot of decision tree
titan_treec50_plot <- C5.0(trainX_trans[,-1],trainX_trans[,1])
plot(titan_treec50_plot)
titan_log_treec50_plot <- C5.0(trainX_trans_log[,-1],trainX_trans_log[,1])
plot(titan_log_treec50_plot)
At last, let’s evaluate our prediction
confusionMatrix(validX_trans[,1],titan_treec50_predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 111 15
## 1 23 56
##
## Accuracy : 0.8146
## 95% CI : (0.7546, 0.8654)
## No Information Rate : 0.6537
## P-Value [Acc > NIR] : 2.742e-07
##
## Kappa : 0.6012
##
## Mcnemar's Test P-Value : 0.2561
##
## Sensitivity : 0.8284
## Specificity : 0.7887
## Pos Pred Value : 0.8810
## Neg Pred Value : 0.7089
## Prevalence : 0.6537
## Detection Rate : 0.5415
## Detection Prevalence : 0.6146
## Balanced Accuracy : 0.8085
##
## 'Positive' Class : 0
##
confusionMatrix(validX_trans_log[,1],titan_log_treec50_predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 111 15
## 1 23 56
##
## Accuracy : 0.8146
## 95% CI : (0.7546, 0.8654)
## No Information Rate : 0.6537
## P-Value [Acc > NIR] : 2.742e-07
##
## Kappa : 0.6012
##
## Mcnemar's Test P-Value : 0.2561
##
## Sensitivity : 0.8284
## Specificity : 0.7887
## Pos Pred Value : 0.8810
## Neg Pred Value : 0.7089
## Prevalence : 0.6537
## Detection Rate : 0.5415
## Detection Prevalence : 0.6146
## Balanced Accuracy : 0.8085
##
## 'Positive' Class : 0
##
Based on feature analysis using the Boruta algorithm, all features (Pclass, SibSp, Parch, transformed Age, SexMale, SexFemale, Embarked_S, Embarked_Q, Embarked_C) have a significant correlation with the target features. These results were then confirmed with the C5.0 model, where out of the 10 features used to build the model, 7 features were used as the basis for making algorithm decisions (3 are feature pairs, such as SexFemale, Embarked_Q, and Embarked_C which can be considered sufficiently represented in 7 features the). The first determining factor is the gender variable, if male, the next determining factor is age. If the passenger’s age is below the limit of 1.278(Age2 transformed), and brings a relative or partner under two(2), then the probability of survival is higher than vice versa. If the passenger is a woman, then the next determining feature is cabin class, coming from Southampton, bringing children or parents, then her chances of survival are higher.
Using this decision rule, our model can predict the safety of Titanic passengers with 85% accuracy.