“There is no danger that Titanic will sink. The boat is unsinkable and nothing but inconvenience will be suffered by the passengers.”
Phillip Franklin, White Star Line vice-president, 1912
Once thought to be unsinkable, the Titanic ironically sank on her maiden voyage in the North Atlantic Ocean on 15 April 1912. It is generally believed that some 1,500 people perished and only 706 survived. It has been claimed that most of the 706 survivors were women and children, and that the majority were first-class passengers.
Based on this claim, we will perform a classification analysis of the Titanic data to predict which passengers survived and which did not. In this report, we will use the Titanic dataset taken from the Kaggle Titanic challenge.
Based on the problem stated above, we will focus on the positive case of predicting survival; in this context, the positive class refers to the passengers who survived. Because we care most about correctly identifying survivors in this binary classification task, the main evaluation metrics of interest will be Recall and Precision.
This report explains how we used three different methods, Naive Bayes, Decision Tree, and Random Forest, to predict survival from the dataset. Our goal is to find out which model performs best in terms of two important measures: Recall and Precision. Recall tells us how many of the actual survivors the model correctly identifies, while Precision tells us how many of the passengers the model labels as survivors actually survived. We will compare these metrics for each model to determine which one is the best for this case.
Here is the list of libraries used for this report:
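The code chunk itself is not shown; judging from the functions used later in this report, the required packages are likely the following (the exact list, and e1071 as the Naive Bayes implementation, are assumptions; GGally is loaded separately just before the correlation plot):
library(dplyr)        # %>%, mutate(), select()
library(caret)        # upSample(), trainControl(), train(), confusionMatrix()
library(e1071)        # naiveBayes()
library(partykit)     # ctree(), ctree_control()
library(randomForest) # backend for caret's method = "rf"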
The titanic dataset has three csv files: gender_submission.csv, test.csv, and train.csv. We will use the train.csv file and split its data for training and testing the models we build. As for gender_submission.csv and test.csv, both files are specific to the Kaggle challenge submission, so they will not be used within this report's scope.
First, read train.csv and assign it to
titanic:
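A minimal read sketch, assuming train.csv sits in the working directory:
titanic <- read.csv("train.csv", stringsAsFactors = FALSE) # keep text columns as character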
Check the head and tail of titanic:
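The calls themselves are the usual ones:
head(titanic) # first six rows
tail(titanic) # last six rows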
Let's see the structure of the titanic dataframe:
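The output below matches dplyr's glimpse(), so presumably:
glimpse(titanic)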
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The explanation of each column is:
PassengerId: The id of the passenger
Survived: The survival of the passenger (1 = Yes and 0 = No)
Pclass: Ticket class
Name: The name of the passenger
Sex: The sex of the passenger
Age: The age of the passenger (in years)
SibSp: The number of siblings or spouses aboard the Titanic
Parch: The number of parents or children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown and S = Southampton)
We also want to know how many rows and columns there are in
titanic:
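For example, with dim():
dim(titanic)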
## [1] 891 12
There are 891 rows and 12 columns.
Given the scope of this analysis, we will remove columns that are not useful for prediction, namely PassengerId, Name, Ticket, and Cabin, bringing the number of columns down to 8:
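A sketch of the column removal, assuming dplyr's select() is used (the original chunk is hidden):
titanic <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)
dim(titanic)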
## [1] 891 8
We want to make sure that all the data types are correct. One way is to check the unique values of a column to determine whether it should be a factor. We will only check the columns suspected to have fewer than 10 unique values, in this case the Survived, Pclass, Sex, and Embarked columns:
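The checks producing the output below would be, for example:
unique(titanic$Survived)
unique(titanic$Pclass)
unique(titanic$Sex)
unique(titanic$Embarked)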
## [1] 0 1
## [1] 3 1 2
## [1] "male" "female"
## [1] "S" "C" "Q" ""
Notice that the Embarked column contains an empty value (""), most likely because the port at which those passengers embarked is unknown. So, before changing the data types, we will replace the empty values. Check the positions of the empty Embarked values in the titanic dataframe:
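One way to locate them, using base R's which():
which(titanic$Embarked == "")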
The empty values are in rows 62 and 830. Since Embarked is a categorical (character) column, a median is not really meaningful here; instead, we will fill the empty values with the most frequent port:
embarked_mode <- names(which.max(table(titanic$Embarked[titanic$Embarked != ""])))
titanic[titanic$Embarked == "", ]$Embarked <- embarked_mode
After assigning the empty values, let's recheck to make sure that the empty string value has disappeared:
## [1] "S" "C" "Q"
Great. Now we will continue with the earlier step of changing data types. The results above show that these four columns have fewer than 10 unique values, so they can be treated as factors. Changing the data types to factor:
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
Recheck the newly changed data types:
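The structure below is base R's str() output, so presumably:
str(titanic)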
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Done.
Now, we need to make sure that there are no missing values. Let’s check for missing values:
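For example, counting NA values per column:
colSums(is.na(titanic))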
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
There are 177 missing values in the Age column. For the sake of data quality, we will impute the missing Age values with the median age by sex. First, find the median age of female passengers and male passengers:
female_pass <- titanic[titanic$Sex == "female", ]
female_median <- median(female_pass$Age, na.rm = TRUE)
female_median
## [1] 27
male_pass <- titanic[titanic$Sex == "male", ]
male_median <- median(male_pass$Age, na.rm = TRUE)
male_median
## [1] 29
Set all NA values in the Age column to 27 for female passengers and 29 for male passengers:
titanic$Age <- ifelse(titanic$Sex == "female" & is.na(titanic$Age), female_median, titanic$Age)
titanic$Age <- ifelse(titanic$Sex == "male" & is.na(titanic$Age), male_median, titanic$Age)
Recheck for any missing values:
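Counting NA values per column again:
colSums(is.na(titanic))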
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
Now we're good: there are no missing values left. Let's move on to the next section, Data Exploration.
Check the statistical summary of the titanic dataframe:
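Presumably via summary():
summary(titanic)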
## Survived Pclass Sex Age SibSp Parch
## 0:549 1:216 female:314 Min. : 0.42 Min. :0.000 Min. :0.0000
## 1:342 2:184 male :577 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000
## 3:491 Median :29.00 Median :0.000 Median :0.0000
## Mean :29.44 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## Fare Embarked
## Min. : 0.00 C:168
## 1st Qu.: 7.91 Q: 77
## Median : 14.45 S:646
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
The insights gained from the summary above:
There were 342 passengers who survived and 549 who did not
There were 216 passengers in 1st class, 184 in 2nd class, and 491 in 3rd class
The passengers consisted of 314 females and 577 males
The average age of the passengers was 29.44 years, with a minimum of 0.42 and a maximum of 80.00
The average fare was 32.20, the minimum fare was 0.00, and the maximum fare was 512.33
As many as 646 passengers embarked from Southampton, followed by 168 from Cherbourg and 77 from Queenstown
In this section, we will build models on the titanic data to determine the survivability of passengers. Since that is the goal of this report, the Survived column will be the target variable.
To find the best columns to use as predictor variables, we will use GGally to check the correlation between all the columns in titanic. ggcorr() needs the columns to be numeric rather than factors, so before generating the plot we convert the factor columns to numeric and store the result in titanic_plot:
library(GGally)
titanic_plot <- titanic %>%
  mutate(
    Survived = as.numeric(Survived),
    Pclass = as.numeric(Pclass),
    Sex = as.numeric(Sex),
    Embarked = as.numeric(Embarked)
  )
ggcorr(titanic_plot, label = T, label_size = 2.9)
From the plot above, the column with the highest correlation to Survived is Sex at -0.5, followed by Pclass at -0.3 and Fare at 0.3. The remaining columns are only around -0.1, 0.1, or 0, which means they are not highly correlated.
In conclusion, we will use Sex, Pclass, and Fare as the predictor variables.
Now, we want to divide the titanic data into two datasets: titanic_train for training the models, and titanic_test for testing them:
set.seed(2024)
index <- sample(x = nrow(titanic),
                size = nrow(titanic) * 0.8)
titanic_train <- titanic[index, ]
titanic_test <- titanic[-index, ]
Good. We will check whether titanic_train is balanced:
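A sketch of the balance check, assuming the numbers below are rounded percentages:
round(prop.table(table(titanic_train$Survived)) * 100)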
##
## 0 1
## 63 37
The proportion is 63:37, so it needs to be balanced before training. We will do an upsampling:
titanic_train <- upSample(x = titanic_train %>% select(-Survived),
                          y = titanic_train$Survived,
                          yname = "Survived")
Let's check the new proportion after the upsampling process:
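Checking the class proportion the same way:
round(prop.table(table(titanic_train$Survived)) * 100)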
##
## 0 1
## 50 50
Great, now the proportion is balanced with 50% Survived and 50% Not Survived. But before we build the models, we have to divide the titanic_test data into titanic_test_x for the predictor variables and titanic_test_y for the target variable:
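A minimal sketch of this split, assuming titanic_test_x keeps only the chosen predictors:
titanic_test_x <- titanic_test %>% select(Sex, Pclass, Fare)
titanic_test_y <- titanic_test$Survived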
Now we will build the models: Naive Bayes, Decision Tree, and Random Forest.
For the first model, we will build a Naive Bayes model:
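One way to fit it, assuming the naiveBayes() function from the e1071 package and the three chosen predictors:
model_nb <- naiveBayes(Survived ~ Sex + Pclass + Fare, data = titanic_train)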
We will predict on the test data using model_nb and create a confusion matrix for later evaluation:
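A sketch of this step, assuming survivors ("1") are treated as the positive class; the eval_nb object is reused in the comparison later:
pred_nb <- predict(model_nb, newdata = titanic_test_x)
eval_nb <- confusionMatrix(data = pred_nb, reference = titanic_test_y, positive = "1")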
The second model is a Decision Tree model:
model_dt <- ctree(formula = Survived ~ Sex + Pclass + Fare,
                  data = titanic_train,
                  control = ctree_control(mincriterion = 0.5,
                                          minsplit = 0,
                                          minbucket = 0))
We will predict on the test data using model_dt and create a confusion matrix for later evaluation:
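A corresponding sketch for the Decision Tree, under the same assumptions:
pred_dt <- predict(model_dt, newdata = titanic_test_x)
eval_dt <- confusionMatrix(data = pred_dt, reference = titanic_test_y, positive = "1")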
Lastly, we will create a Random Forest Model. But first, create the control variables:
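The chunk is not shown; one plausible setup uses caret's trainControl() with k-fold cross-validation (the specific folds and repeats here are assumptions):
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)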
And create the model:
model_rf <- train(Survived ~ Sex + Pclass + Fare,
                  data = titanic_train,
                  method = "rf",
                  trControl = control,
                  importance = T)
We will predict on the test data using model_rf and create a confusion matrix for later evaluation:
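And a corresponding sketch for the Random Forest:
pred_rf <- predict(model_rf, newdata = titanic_test_x)
eval_rf <- confusionMatrix(data = pred_rf, reference = titanic_test_y, positive = "1")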
Now that we have created confusion matrices for all the models, let's see the comparison:
compare_nb <- tibble(Model = "Naive Bayes",
                     Accuracy = round((eval_nb$overall["Accuracy"] * 100), 2),
                     Recall = round((eval_nb$byClass["Recall"] * 100), 2),
                     Precision = round((eval_nb$byClass["Precision"] * 100), 2),
                     Specificity = round((eval_nb$byClass["Specificity"] * 100), 2))
compare_dt <- tibble(Model = "Decision Tree",
                     Accuracy = round((eval_dt$overall["Accuracy"] * 100), 2),
                     Recall = round((eval_dt$byClass["Recall"] * 100), 2),
                     Precision = round((eval_dt$byClass["Precision"] * 100), 2),
                     Specificity = round((eval_dt$byClass["Specificity"] * 100), 2))
compare_rf <- tibble(Model = "Random Forest",
                     Accuracy = round((eval_rf$overall["Accuracy"] * 100), 2),
                     Recall = round((eval_rf$byClass["Recall"] * 100), 2),
                     Precision = round((eval_rf$byClass["Precision"] * 100), 2),
                     Specificity = round((eval_rf$byClass["Specificity"] * 100), 2))
rbind(compare_nb, compare_dt, compare_rf)
Based on the background case stated above, we want the model with the highest Recall and Precision metrics. From the comparison, Random Forest has the highest Recall value at 77.22% and the highest Precision value at 85.92%.
We’ve looked at how different methods could predict if passengers on the Titanic survived or not. We focused on finding the best method for this task by paying attention to two important measures: Recall and Precision.
We found that for this case, Sex, Pclass, and Fare are the best predictor columns because they have high correlation values with the Survived column.
We also found that between Naive Bayes, Decision Tree, and Random Forest, the Random Forest model performed the best. It had the highest chances of spotting survivors correctly and making accurate predictions. With a Recall rate of 77.22% and a Precision rate of 85.92%, Random Forest showed it could find survivors well while also avoiding wrong guesses.
This suggests Random Forest is a good choice for predicting survival in situations similar to the Titanic disaster. Using Random Forest can help us make better decisions and understand survival patterns more effectively.