Introduction

The sinking of the Titanic in the early morning hours of 15 April 1912 is considered one of the deadliest peacetime maritime disasters in history, causing the deaths of more than 1,500 people.

In this project, we will use the Titanic Disaster dataset downloaded from Kaggle to analyze the data and apply machine learning techniques to predict the survivors based on passenger information such as age, gender, or the deck they stayed on.

Now, let’s take a look at our dataset.

library(readr)
train<-read.csv("train.csv")
test<-read.csv("test.csv")

#First 2 rows of dataset
head(train,2)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
##      Ticket    Fare Cabin Embarked
## 1 A/5 21171  7.2500              S
## 2  PC 17599 71.2833   C85        C
#Dimension of train dataset and test dataset.
dim(train)
## [1] 891  12
dim(test)
## [1] 418  11

We have two dataset files: a train set and a test set. We will use the training data for descriptive analysis and feature selection. After training the machine learning models on the training data, we will use the test data to evaluate the accuracy of each model.

Train data

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Below is a description of all the attributes in the Titanic dataset.

  1. PassengerId: The unique ID for each passenger.

  2. Survived: 1 = survived; 0 = died.

  3. Pclass: The ticket class of the passenger (1 = Upper Deck, 2 = Middle Deck, 3 = Lower Deck).

  4. Name: Name of passenger.

  5. Sex: Gender of passenger.

  6. Age: Age of passenger.

  7. SibSp: The number of siblings or spouses the passenger had on board. Sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés were ignored).

  8. Parch: The number of parents or children the passenger had on board. Parent = mother, father; child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, so Parch = 0 for them.

  9. Ticket: Ticket number of the passenger.

  10. Fare: Ticket fare.

  11. Cabin: The cabin in which the passenger stayed.

  12. Embarked: The port from which the passenger embarked (S = Southampton, C = Cherbourg, Q = Queenstown).

Preparing dataset

The most important step before analyzing the dataset is dealing with missing values.

#Checking NA value
sapply(train, function(x) sum(is.na(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
#checking empty value
sapply(train,function(x) sum(x==""))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

In the training dataset, we have 177 NA values in Age, as well as 687 empty values in Cabin and 2 in Embarked.

An empty value in Cabin could indicate that the passenger had to share a cabin with other passengers. So we convert the empty values to 0, and use 1 for passengers who had their own cabin.

train$Cabin<-ifelse(train$Cabin=="",0,1)

For Embarked, the number of missing values is very small, so we can assign them the port from which most passengers embarked.

a<-table(train$Embarked)
print(a)
## 
##       C   Q   S 
##   2 168  77 644
#
a<-data.frame(a)
train$Embarked<-ifelse(train$Embarked=="",as.character(a[which.max(a$Freq),1]),train$Embarked)

From the table, most passengers boarded at Southampton. Therefore, we assign the empty values the port S.

To deal with the NA values in the Age attribute, we assign to each missing value the median age of the passenger's ticket class.

#Number of passengers with unknown age in each class
a<-train[which(is.na(train$Age)),]
table(a$Pclass)
## 
##   1   2   3 
##  30  11 136
#Median age of each ticket class
avg<-tapply(train$Age,train$Pclass,median,na.rm=TRUE)
print(avg)
##  1  2  3 
## 37 29 24
avg<-as.data.frame(avg)
#Assign the class medians to NA values.
train[which(is.na(train$Age)&train$Pclass==3),c("Age")]<-avg$avg[3]
train[which(is.na(train$Age)&train$Pclass==2),c("Age")]<-avg$avg[2]
train[which(is.na(train$Age)&train$Pclass==1),c("Age")]<-avg$avg[1]

There could be some outliers in the fare, so we check the distribution of Fare by ticket class.

library(ggplot2)
ggplot(train,aes(x=factor(Pclass),y=Fare,fill=factor(Pclass))) + geom_boxplot(outlier.colour="black", outlier.shape=16,
outlier.size=2, notch=FALSE) + theme_classic()+labs(title = "Distribution of ticket price",x="Class",y="Fare price")+
  theme(legend.position="none")

In general, the higher classes paid higher fares. However, we can see some possible outliers where fares were unusually high compared to the rest of the same class.

EDA

In this part, we analyze the distribution of some attributes to gain better insight into the dataset.

a<-prop.table(table(train$Survived))
a<-as.data.frame(a)
library(ggplot2)
ggplot(a, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black") +geom_text(aes(label = round(Freq,2)), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+ labs(title = "Distribution of survivors")+scale_fill_discrete(name = "Survived", labels = c("Dead", "Survived"))+theme_void()

In the training dataset, deaths accounted for around 62% of passengers.

a<-prop.table(table(train$Sex))
a<-as.data.frame(a)

ggplot(a, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black") +geom_text(aes(label = round(Freq,2)), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+ labs(title = "Distribution of gender")+scale_fill_discrete(name = "Sex", labels = c("Female", "Male"))+theme_void()

Only around 35% of the Titanic passengers were female.

#PClass
a<-prop.table(table(train$Pclass))
a<-as.data.frame(a)
ggplot(a, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black") +geom_text(aes(label = round(Freq,2)), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+ labs(title = "Distribution of ticket class") +scale_fill_discrete(name = "Class", labels = c("Upper Deck", "Middel Deck","Lower Deck"))+theme_void()

As we expected, the majority of passengers, more than half, were in third class.

a<-prop.table(table(train$Embarked))
a<-as.data.frame(a)
library(ggplot2)
ggplot(a, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black") +geom_text(aes(label = round(Freq,2)), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+labs(title = "Distribution of on-boarding port")+scale_fill_discrete(name = "Boarding port", labels = c("Cherbourg", "Queenstown","Southampton"))+theme_void()

Southampton was the main boarding port of the Titanic: more than 70% of passengers boarded there. Cherbourg was the second port on the route, with around 19% of passengers.

a<-prop.table(table(train$Cabin))
a<-as.data.frame(a)
ggplot(a, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black") +geom_text(aes(label = round(Freq,2)), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+labs(title = "Distribution of cabin") +scale_fill_discrete(name="Cabin",label=c("Shared cabin","Owned cabin"))+theme_void()

There were relatively few private cabins on the Titanic; around 77% of passengers had to share a cabin with others.

ggplot(train,aes(x=Age))+ 
  geom_histogram(color="darkblue", fill="lightblue") + labs(title="Distribution of passenger age",x="Age",y="Frequency")+theme_classic()

#Statistics
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   26.00   29.07   37.00   80.00

The Titanic passengers were quite young: the average age was about 29, and 75% of passengers were 37 or younger. Moreover, there were also infants on board.

The training dataset includes the two variables Parch and SibSp. We can combine them into a new family-size variable, which counts the passenger together with all relatives (siblings, spouse, parents, or children) travelling with them.

#Adding a new Family size attribute
train$Fam_size<-train$SibSp+train$Parch+1
#Plotting
a<-prop.table(table(train$Fam_size))
a<-as.data.frame(a)
ggplot(a, aes(x =Var1, y = Freq, fill = Var1)) + geom_bar(stat='identity',color = 'black') +theme_classic() + labs(title = "Family size",x="Number of members",y="Proportion")+
  theme(legend.position="none")

The plot shows that around 60% of passengers traveled alone, followed by families of two or three members.

In the next step, we look at some conditional distributions in the training dataset.

#Contingency table
a<-prop.table(table(train$Sex,train$Survived),margin = 2)
print(a)
##         
##                  0         1
##   female 0.1475410 0.6812865
##   male   0.8524590 0.3187135
a<-as.data.frame(a)
ggplot(a,aes(x=Var2,y=Freq,fill=Var1))+geom_bar(stat="identity",position="dodge",color = 'black') + labs(title="Conditional distribution of sex among survival",x="Survived",y="Proportion",fill="Gender")+theme_classic()

The training data shows that female passengers had a much higher survival rate: among the survivors, about 68% were female. It could be a case of "ladies first".

Next, we want to know whether passengers in a higher class had a higher chance of survival.

a<-prop.table(table(train$Pclass,train$Survived),margin = 2)
print(a)
##    
##             0         1
##   1 0.1457195 0.3976608
##   2 0.1766849 0.2543860
##   3 0.6775956 0.3479532
a<-as.data.frame(a)
ggplot(a,aes(x=Var2,y=Freq,fill=Var1))+geom_bar(stat="identity",position="dodge",color = 'black')+ labs(title="Conditional distribution of class among survival",x="Survived",y="Proportion")+theme_classic()+scale_fill_discrete(name = "Class", labels = c("Upper Deck", "Middle Deck","Lower Deck"))

The bar chart indicates that passengers on the lower deck had a small chance of survival: about 68% of the deaths were third-class passengers.

a<-prop.table(table(train$Survived,train$Fam_size),margin=1)
#Contingency table
a<-data.frame(a)
ggplot(a,aes(x=Var2,y=Freq,fill=Var2))+geom_bar(stat = 'identity',color="black") + facet_grid(.~Var1) + labs(title="Survival rate based on family size",x="Family size",y="Proportion")+theme_classic()+theme(legend.position="none")

Since the majority of passengers traveled alone, most of the deaths were also solo travelers. On the other hand, it is harder to escape a sinking ship when traveling with many family members; passengers who traveled alone or with two or three others had a better chance of survival.

#Contingency table

a<-prop.table(table(train$Cabin,train$Pclass),margin=1)
a<-data.frame(a)

library(ggplot2)
ggplot(a,aes(x=Var1,y=Freq,fill=Var2))+geom_bar(stat="identity",position="dodge",color = 'black') + labs(title = "Cabin ownership by class",x="Cabin",y="Proportion")+scale_fill_discrete(name = "Class", labels = c("Upper deck", "Middle deck","Lower deck"))+theme_classic()

#Conditional Survival and cabin 
a<-prop.table(table(train$Cabin,train$Survived),margin=2)
a<-data.frame(a)
#Contingency table
library(ggplot2)
ggplot(a,aes(x=Var2,y=Freq,fill=Var1))+geom_bar(stat="identity",position="dodge",color = 'black') + labs(title="Conditional distribution of cabin owned among survival",x="Survived",y="Proportion")+scale_fill_discrete(name = "Cabin", labels = c("Shared cabin", "Private cabin"))+theme_classic()

It can be seen that the majority of deaths, more than 87%, were passengers who shared a cabin. Most passengers who owned a cabin were in the upper class, while middle- and lower-deck passengers generally had to share with others.

We will cut Age into age groups.

#Age
train$age_interval<-cut(train$Age,breaks = c(0,20,40,60,80))


ggplot(train,aes(x=age_interval,fill=Sex))+geom_bar(position='dodge',color='black')+facet_grid(Sex~Survived) + labs(title="Number of survivors in each age group",x="Age interval",y="Frequency")+theme_classic()

It shows that male passengers died at a much higher rate than female passengers, while no female passengers aged 60 to 80 died. Moreover, female passengers aged 20 to 40 survived the most.

Correlation analysis

After exploring the dataset, it seems that age, class, and gender have a strong relation with the chance of survival. We will calculate the correlations to check whether this assumption is true.
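
For two variables \(x\) and \(y\), the correlation computed below is the Pearson correlation coefficient (the default of R's cor()):

\[ r_{xy} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}} \]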

We will encode the categorical variables as dummy variables.

library(fastDummies)
train<-dummy_cols(train,select_columns = c("Sex","Pclass","Embarked","Cabin"),remove_first_dummy = TRUE)
colnames(train)
##  [1] "PassengerId"  "Survived"     "Pclass"       "Name"         "Sex"         
##  [6] "Age"          "SibSp"        "Parch"        "Ticket"       "Fare"        
## [11] "Cabin"        "Embarked"     "Fam_size"     "age_interval" "Sex_male"    
## [16] "Pclass_2"     "Pclass_3"     "Embarked_Q"   "Embarked_S"   "Cabin_1"

Now, there are some additional dummy variables in our training dataset.

In the Titanic dataset, "Name", "PassengerId", "SibSp", "Parch", and "Ticket" are no longer meaningful for the model. Therefore, we drop these variables from the training data.

library(dplyr)
train_select<-train%>%select(c("Survived","Pclass_2","Pclass_3","Sex_male","Age","Fare","Cabin_1","Fam_size",
                   "Embarked_Q","Embarked_S"))
library(ggcorrplot)
ggcorrplot(cor(train_select),hc.order = TRUE, type = "lower",
          outline.col = "white")

In the descriptive analysis, we thought that age could have a strong relation with the chance of survival. However, the correlation matrix shows almost no relation between survival and age. Moreover, the Sex_male dummy has a relatively strong negative correlation with survival, which implies that being female is positively related to survival.

Test data

Preparing dataset

Now, we join the gender_submission file to the test data to create the Survived variable for the test set.

#Import gender file
gender_test<-read.csv("D:\\project\\titanic_data\\gender_submission.csv")
#Adding gender_submission file
test<-left_join(test,gender_test,by="PassengerId")
dim(test)
## [1] 418  12

We continue preparing the test dataset for prediction. First, we check for NA values and empty values in the test data.

#Checking NA value
sapply(test, function(x) sum(is.na(x)))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1           0           0           0
#checking empty value
sapply(test,function(x) sum(x==""))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          NA           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0          NA         327           0           0

In the Titanic test dataset, there are 86 NA values in Age and one in Fare. Furthermore, just as in the training data, the Cabin variable contains a large number of empty values: 327 passengers shared their cabin with others.

To deal with the missing ages, we find the ticket class of each passenger with a missing age and then assign the median age of that class.

#Number of passengers with unknown age in each class
a<-test[which(is.na(test$Age)),]
table(a$Pclass)
## 
##  1  2  3 
##  9  5 72
#Median age of each ticket class
avg<-tapply(test$Age,test$Pclass,median,na.rm=TRUE)
print(avg)
##    1    2    3 
## 42.0 26.5 24.0
avg<-as.data.frame(avg)
#Assign the class medians to NA values.
test[which(is.na(test$Age)&test$Pclass==3),c("Age")]<-avg$avg[3]
test[which(is.na(test$Age)&test$Pclass==2),c("Age")]<-avg$avg[2]
test[which(is.na(test$Age)&test$Pclass==1),c("Age")]<-avg$avg[1]

For the empty Cabin values, we again denote 1 for passengers who owned a cabin and 0 for shared-cabin passengers.

#Encode for cabin
test$Cabin<-ifelse(test$Cabin=="",0,1)

In the test data, we have one NA value in Fare. We use the same technique as for the ages: find the passenger's ticket class, then assign the median fare of that class to the NA value.

#Fare not known in each class
a<-test[which(is.na(test$Fare)),]
table(a$Pclass)
## 
## 3 
## 1
#Assign median value of fare in class 3 into NA values
avg_fare<-tapply(test$Fare,test$Pclass,median,na.rm=TRUE)
avg_fare<-as.data.frame(avg_fare)
test[which(is.na(test$Fare)),c("Fare")]<-avg_fare$avg_fare[3]

We create the family-size variable for the test data by adding the SibSp and Parch variables (plus one for the passenger).

#Adding a new Family size attribute
test$Fam_size<-test$SibSp+test$Parch+1

Now, we create the dummy variables for the test dataset.

#Get dummies for the categorical variables
library(fastDummies)
test<-dummy_cols(test,select_columns = c("Sex","Pclass","Embarked","Cabin"),remove_first_dummy =TRUE)

For feature selection, we drop Ticket, PassengerId, Name, and the original categorical variables used to create the dummies.

test_select<-test%>%select(c("Survived","Pclass_2","Pclass_3","Sex_male","Age","Fare","Cabin_1","Fam_size",
                             "Embarked_Q","Embarked_S"))

Logistic regression

Since we have to predict whether each passenger survived or not, this is a binary classification problem. Therefore, logistic regression will be applied to predict the categorical Survived variable.
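
Formally, logistic regression models the survival probability through the logistic (sigmoid) link,

\[ P(\text{Survived}=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}, \]

and a passenger is classified as a survivor when this estimated probability exceeds 0.5, which is the threshold used in the prediction step below.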

In the previous step, we selected the features of the training data for fitting the model.

#Logistic regression
lg_model<-glm(Survived~.,data=train_select,family = "binomial")

print(lg_model)
## 
## Call:  glm(formula = Survived ~ ., family = "binomial", data = train_select)
## 
## Coefficients:
## (Intercept)     Pclass_2     Pclass_3     Sex_male          Age         Fare  
##    3.744578    -0.306486    -1.550411    -2.734216    -0.040659     0.001949  
##     Cabin_1     Fam_size   Embarked_Q   Embarked_S  
##    0.959271    -0.233351    -0.174857    -0.472206  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  881 Residual
## Null Deviance:       1187 
## Residual Deviance: 775.4     AIC: 795.4
predicted_prob<-predict(lg_model,train_select,type="response")
predicted_class<-ifelse(predicted_prob>0.5,'Survived','Dead')
tab<-table(Predicted=predicted_class,Actual=train_select$Survived)
#The accuracy of predicting deaths
sum(tab[1,1])/sum(tab[,1])
## [1] 0.8743169
#The accuracy of predicting survivors
sum(tab[2,2])/sum(tab[,2])
## [1] 0.7134503
#The accuracy of the model
sum(diag(tab))/sum(tab)
## [1] 0.8125701

The accuracy of the model is about 81.3%. It correctly identifies 87.4% of the deaths, but only 71.3% of the survivors.

#Plotting the confusion matrix
library(e1071)
library(caTools)
library(ggplot2)
library(scales)
library(caret)
train_select$Survived<-ifelse(train_select$Survived==0,"Dead","Survived")
confusion_matrix<-confusionMatrix(factor(predicted_class),factor(train_select$Survived))


  ggplotConfusionMatrix <- function(m){
    mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
                     "Kappa", percent_format()(m$overall[2]))
    p <-
      ggplot(data = as.data.frame(m$table) ,
             aes(x = Reference, y = Prediction)) +
      geom_tile(aes(fill = log(Freq)), colour = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Reference, y = Prediction, label = Freq)) +
      theme(legend.position = "none") +
      ggtitle(mytitle)
    return(p)
  }
ggplotConfusionMatrix(confusion_matrix)

After training the classification model on the training dataset, we use it to predict on the test data to estimate its accuracy on unseen data.

#Logistic regression


predicted_prob<-predict(lg_model,test_select,type="response")
predicted_class<-ifelse(predicted_prob>0.5,'Survived','Dead')
tab<-table(Predicted=predicted_class,Actual=test_select$Survived)

#The accuracy of predicting deaths
sum(tab[1,1])/sum(tab[,1])
## [1] 0.9285714
#The accuracy of predicting survivors
sum(tab[2,2])/sum(tab[,2])
## [1] 0.9210526
#The accuracy of the model
sum(diag(tab))/sum(tab)
## [1] 0.9258373

We get better accuracy when predicting on the test data: about 92.6%. Around 93% of the deaths and 92% of the survivors are predicted correctly.

Plotting the confusion matrix

test_select$Survived<-ifelse(test_select$Survived==0,"Dead","Survived")
confusion_matrix<-confusionMatrix(factor(predicted_class),factor(test_select$Survived))


  ggplotConfusionMatrix <- function(m){
    mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
                     "Kappa", percent_format()(m$overall[2]))
    p <-
      ggplot(data = as.data.frame(m$table) ,
             aes(x = Reference, y = Prediction)) +
      geom_tile(aes(fill = log(Freq)), colour = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Reference, y = Prediction, label = Freq)) +
      theme(legend.position = "none") +
      ggtitle(mytitle)
    return(p)
  }
ggplotConfusionMatrix(confusion_matrix)

Naive Bayes classifier

Naive Bayes is a classification method based on Bayes' theorem with an assumption of independence among the predictors.
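
Under this independence assumption, the posterior probability of a class \(y\) given the predictors \(x_1, \dots, x_p\) factorizes as

\[ P(y \mid x_1, \dots, x_p) \propto P(y) \prod_{j=1}^{p} P(x_j \mid y), \]

and the predicted class is the one with the largest posterior.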

library(naivebayes)
naive_model <- naive_bayes(Survived ~ ., data = train_select, usekernel = T)

predicted_prob_naive<-predict(naive_model,train_select,type="class")
tab_naive<-table(predicted_prob_naive,train_select$Survived)
#The accuracy of predicting deaths
sum(tab_naive[1,1])/sum(tab_naive[,1])
## [1] 0.9744991
#The accuracy of predicting survivors
sum(tab_naive[2,2])/sum(tab_naive[,2])
## [1] 0.4269006
#The accuracy of the model
sum(diag(tab_naive))/sum(tab_naive)
## [1] 0.7643098

The accuracy of the naive Bayes model on the training data is lower than that of the logistic regression model, only about 76%. Moreover, there is a surprising gap in misclassification between the two classes: while 97.4% of the deaths are predicted correctly, only 42.7% of the survivors are.

confusion_matrix<-confusionMatrix(factor(predicted_prob_naive),factor(train_select$Survived))
ggplotConfusionMatrix <- function(m){
    mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
                     "Kappa", percent_format()(m$overall[2]))
    p <-
      ggplot(data = as.data.frame(m$table) ,
             aes(x = Reference, y = Prediction)) +
      geom_tile(aes(fill = log(Freq)), colour = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Reference, y = Prediction, label = Freq)) +
      theme(legend.position = "none") +
      ggtitle(mytitle)
    return(p)
  }
ggplotConfusionMatrix(confusion_matrix)

predicted_prob_naive<-predict(naive_model,test_select,type="class")
tab_naive<-table(predicted_prob_naive,test_select$Survived)
#The accuracy of predicting deaths
sum(tab_naive[1,1])/sum(tab_naive[,1])
## [1] 0.9511278
#The accuracy of predicting survivors
sum(tab_naive[2,2])/sum(tab_naive[,2])
## [1] 0.3947368
#The accuracy of the model
sum(diag(tab_naive))/sum(tab_naive)
## [1] 0.7488038

The accuracy of the naive Bayes model on the test data is much lower than that of the logistic regression model, only about 75% compared to 92.6%. As on the training data, 95.1% of the deaths are predicted correctly, but only 39.5% of the survivors.

confusion_matrix<-confusionMatrix(factor(predicted_prob_naive),factor(test_select$Survived))
ggplotConfusionMatrix <- function(m){
    mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
                     "Kappa", percent_format()(m$overall[2]))
    p <-
      ggplot(data = as.data.frame(m$table) ,
             aes(x = Reference, y = Prediction)) +
      geom_tile(aes(fill = log(Freq)), colour = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Reference, y = Prediction, label = Freq)) +
      theme(legend.position = "none") +
      ggtitle(mytitle)
    return(p)
  }
ggplotConfusionMatrix(confusion_matrix)

head(test_select)
##   Survived Pclass_2 Pclass_3 Sex_male  Age    Fare Cabin_1 Fam_size Embarked_Q
## 1     Dead        0        1        1 34.5  7.8292       0        1          1
## 2 Survived        0        1        0 47.0  7.0000       0        2          0
## 3     Dead        1        0        1 62.0  9.6875       0        1          1
## 4     Dead        0        1        1 27.0  8.6625       0        1          0
## 5 Survived        0        1        0 22.0 12.2875       0        3          0
## 6     Dead        0        1        1 14.0  9.2250       0        1          0
##   Embarked_S
## 1          0
## 2          1
## 3          0
## 4          1
## 5          1
## 6          1

KNN

K-nearest neighbors (KNN) is a simple, lazy classification model. In a KNN model, we need to choose the number of neighbors \(k\). We can compare the accuracy obtained for each value of \(k\) to choose the optimal one.

Since the KNN model is based on Euclidean distance, we first normalize the data.
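
Concretely, for two passengers with feature vectors \(x\) and \(z\), KNN uses the Euclidean distance, and each feature is min-max scaled before the distances are computed:

\[ d(x, z) = \sqrt{\sum_{j=1}^{p} (x_j - z_j)^2}, \qquad x'_j = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}. \]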

#Normalize train data
train_knn<-train_select
normalize<-function(x){
  return ((x - min(x)) / (max(x) - min(x))) } # min-max normalization helper
train_knn_normalize<-sapply(train_knn%>%select(-c("Survived")), normalize)
#Normalize test data
test_knn<-test_select
test_knn_normalize<-sapply(test_knn%>%select(-c("Survived")), normalize)
library(class)
i=1                          # declaration to initiate for loop
k.optm=1                     # declaration to initiate for loop

for (i in 1:28){ 
  knn_model<-  knn(train=train_knn_normalize, test=test_knn_normalize, cl=train_select[,c("Survived")], k=i)
  k.optm[i] <- 100*sum(test_select$Survived==knn_model)/NROW(test_select$Survived)
  k=i  
  cat(k,'=',k.optm[i],'\n')       # to print % accuracy 
}
## 1 = 77.03349 
## 2 = 82.77512 
## 3 = 84.44976 
## 4 = 84.21053 
## 5 = 85.16746 
## 6 = 86.60287 
## 7 = 86.84211 
## 8 = 86.60287 
## 9 = 85.88517 
## 10 = 86.36364 
## 11 = 86.1244 
## 12 = 87.79904 
## 13 = 87.55981 
## 14 = 87.55981 
## 15 = 87.79904 
## 16 = 87.79904 
## 17 = 87.79904 
## 18 = 87.55981 
## 19 = 88.03828 
## 20 = 88.51675 
## 21 = 88.03828 
## 22 = 88.03828 
## 23 = 87.08134 
## 24 = 86.36364 
## 25 = 86.60287 
## 26 = 88.03828 
## 27 = 88.27751 
## 28 = 88.99522
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")  # to plot % accuracy wrt to k-value

Among the values tested, \(k=28\) gives the highest accuracy, so we choose \(k=28\) for our KNN model.

#Applying model
knn_model<-knn(train=train_knn_normalize, test=test_knn_normalize, cl=train_select[,c("Survived")], k=28)
tab_knn<-table(knn_model,test_select$Survived)
#The accuracy of predicting deaths
sum(tab_knn[1,1])/sum(tab_knn[,1])
## [1] 0.9511278
#The accuracy of predicting survivors
sum(tab_knn[2,2])/sum(tab_knn[,2])
## [1] 0.7894737
#The accuracy of the model
sum(diag(tab_knn))/sum(tab_knn)
## [1] 0.8923445

The KNN model performs well, with an accuracy of about 89.2%. In particular, it correctly predicts 95.1% of the deaths, but only 78.9% of the survivors.

#Plotting confusion matrix
confusion_matrix_knn<-confusionMatrix(factor(knn_model),factor(test_select$Survived))
ggplotConfusionMatrix <- function(m){
    mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
                     "Kappa", percent_format()(m$overall[2]))
    p <-
      ggplot(data = as.data.frame(m$table) ,
             aes(x = Reference, y = Prediction)) +
      geom_tile(aes(fill = log(Freq)), colour = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Reference, y = Prediction, label = Freq)) +
      theme(legend.position = "none") +
      ggtitle(mytitle)
    return(p)
  }
ggplotConfusionMatrix(confusion_matrix_knn)

Conclusion

Of the three classification methods applied to the Titanic dataset, the logistic regression model performs much better than the naive Bayes classifier. However, the accuracy of the logistic regression model on the test data is noticeably higher than on the training data, which suggests the model may be underfitting. To address this, the best strategy could be to increase the model complexity, or to train the model on a larger dataset.
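
As a rough illustration of the first remedy only (not part of the analysis above), one could refit the logistic regression with an added interaction term between sex and third class, which the correlation analysis suggested are among the strongest predictors. This is a minimal sketch that assumes train_select is still in the workspace with Survived recoded as "Dead"/"Survived"; the names train_glm, lg_model_int, and pred_int are hypothetical:

#A sketch only: a slightly more complex logistic regression with a
#sex-by-class interaction term added to the original features.
train_glm<-train_select
train_glm$Survived<-ifelse(train_glm$Survived=="Survived",1,0)  #back to 0/1 for glm
lg_model_int<-glm(Survived~.+Sex_male:Pclass_3,data=train_glm,family="binomial")
#In-sample accuracy of the more complex model
pred_int<-ifelse(predict(lg_model_int,train_glm,type="response")>0.5,1,0)
mean(pred_int==train_glm$Survived)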