“There is no danger that Titanic will sink. The boat is unsinkable and nothing but inconvenience will be suffered by the passengers.”
Phillip Franklin, White Star Line vice-president, 1912
Once thought to be unsinkable, the Titanic ironically sank on her maiden voyage in the North Atlantic Ocean on 15 April 1912. It is generally believed that some 1,500 people perished and only 706 survived. It has been claimed that most of the 706 survivors were women and children, and that a majority were first-class passengers.
Based on this claim, we will perform a classification analysis of the Titanic data to predict which passengers survived. In this report, we will use the Titanic dataset taken from the Titanic Kaggle challenge.
Based on the problem stated above, we will focus on the positive case of predicting survival; in this context, the positive case refers to passengers who survived. Because of this focus, we will concentrate on evaluation metrics for binary classification that emphasize the correct identification of survivors. Therefore, the main metrics of interest will be Recall and Precision.
This report explains how we used two different methods, Logistic Regression and K-nearest neighbor (k-NN), to make predictions based on the dataset. We trained each model to predict passenger survival. Our goal is to find out which model performs best in terms of two important measures: Recall and Precision. Recall tells us how well the model finds all the actual survivors, while Precision tells us how often the model is correct when it predicts survival. We will compare these metrics for each model to determine which one is the best model for this case.
Here is the list of libraries used for this report:
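The setup chunk is not echoed, so the exact list is an assumption; based on the functions used later in this report (%>% and mutate from dplyr, ggcorr from GGally, knn from class, and confusionMatrix from caret), it is likely something like:
library(dplyr)   # data wrangling: %>%, mutate, select
library(GGally)  # ggcorr() correlation plot
library(class)   # knn()
library(caret)   # confusionMatrix()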
The Titanic dataset has three csv files: gender_submission.csv, test.csv, and train.csv. We will use the train.csv file and split the data for training and testing the model. As for gender_submission.csv and test.csv, both files are related to the Kaggle challenge submission, so they will not be used within this report's scope.
First, read train.csv and assign it to
titanic:
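The reading chunk is not shown; a minimal sketch, assuming train.csv sits in the working directory:
titanic <- read.csv("train.csv", stringsAsFactors = FALSE)  # keep text columns as character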
Check the head and tail of titanic:
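A sketch of the inspection calls (the chunks themselves are not echoed):
head(titanic)  # first six rows
tail(titanic)  # last six rows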
Let’s see the structure of the titanic dataframe:
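The output below has the format of dplyr's glimpse(), so the call was presumably:
glimpse(titanic)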
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The explanation of each column is:
PassengerId: The id of the passenger
Survived: The survival of the passenger (1 = Yes and 0 = No)
Pclass: Ticket class
Name: The name of the passenger
Sex: The sex of the passenger
Age: The age of the passenger (in years)
SibSp: The number of siblings or spouses aboard the Titanic
Parch: The number of parents or children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown and S = Southampton)
We also want to know how many rows and columns there are in
titanic:
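Presumably via base R's dim():
dim(titanic)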
## [1] 891 12
There are 891 rows and 12 columns.
Because of the scope of this analysis, we will remove unimportant columns such as PassengerId, Name, Ticket, and Cabin, bringing the number of columns down to 8:
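One way to do this with dplyr (the original chunk is not echoed):
titanic <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)  # drop columns outside the analysis scope
dim(titanic)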
## [1] 891 8
We want to make sure that all the data types are correct. One way is to check the unique values of a column to decide whether it should be a factor data type. We will only check the columns suspected to have fewer than 10 unique values, in this case the Survived, Pclass, Sex, and Embarked columns:
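A sketch of the checks producing the output below:
unique(titanic$Survived)
unique(titanic$Pclass)
unique(titanic$Sex)
unique(titanic$Embarked)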
## [1] 0 1
## [1] 3 1 2
## [1] "male" "female"
## [1] "S" "C" "Q" ""
Notice that in the Embarked column, there is an empty value (""). It is likely because the ports at which those passengers embarked were unknown. So, before changing the data types, we will convert the empty values into another value. Check the positions of the empty Embarked values in the titanic dataframe:
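Likely something like the following, which returns rows 62 and 830:
which(titanic$Embarked == "")  # row positions of the empty values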
The empty values are in rows 62 and 830. We will assign the empty
values of the Embarked column to the median value of
Embarked:
embarked_median <- median(titanic$Embarked)
titanic[titanic$Embarked == "", ]$Embarked <- embarked_median

After assigning the empty values to the median value, let’s recheck to make sure that the empty string value has disappeared:
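Presumably the same unique-value check as before:
unique(titanic$Embarked)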
## [1] "S" "C" "Q"
Great. Now we will continue with the prior step, which is changing the data types. The result above showed us that these four columns have fewer than 10 unique values, so their data types could be regarded as factors. But instead of the factor data type, we will change them into numerical data types. Why? Because the Logistic Regression and KNN models rely on mathematical computations that involve distances or gradients, which are inherently defined for numeric values.
We will change Survived and Pclass into
numerical directly:
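The conversion chunk is not echoed; presumably a direct as.numeric() cast, which matches the str() output further below:
titanic$Survived <- as.numeric(titanic$Survived)
titanic$Pclass <- as.numeric(titanic$Pclass)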
And for Sex and Embarked, we will convert the character values into numerical values using ifelse:
titanic$Sex <- ifelse(titanic$Sex == "female", 0, 1)
titanic$Embarked <- ifelse(titanic$Embarked == "S", 0,
                           ifelse(titanic$Embarked == "C", 1, 2))

Recheck the newly changed data types:
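Presumably via str(), matching the output below:
str(titanic)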
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : num 1 0 0 0 1 1 1 1 0 0 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: num 0 1 0 0 0 2 0 0 0 1 ...
Nice, now all the columns in titanic are numerical.
Now, we need to make sure that there are no missing values. Let’s check for missing values:
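A sketch of the missing-value check producing the output below:
colSums(is.na(titanic))  # count of NA values per column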
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
There are 177 missing values in the Age column. For the sake of data clarity, we will fill the missing values in the Age column with the median age based on sex. Search for the median age of female passengers and of male passengers:
female_pass <- titanic[titanic$Sex == "0", ]
female_median <- median(female_pass$Age, na.rm = TRUE)
female_median

## [1] 27
male_pass <- titanic[titanic$Sex == "1", ]
male_median <- median(male_pass$Age, na.rm = TRUE)
male_median

## [1] 29
Set all NA values in the Age column to 27 for female passengers and 29 for male passengers:

titanic$Age <- ifelse(titanic$Sex == "0" & is.na(titanic$Age), female_median, titanic$Age)
titanic$Age <- ifelse(titanic$Sex == "1" & is.na(titanic$Age), male_median, titanic$Age)

Recheck for any missing values:
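Presumably the same check as before:
colSums(is.na(titanic))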
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
Now we’re good! There are no more missing values. Let’s move on to the next section, which is data exploration.
Check the summary statistics of the titanic dataframe:
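Presumably via base R's summary():
summary(titanic)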
## Survived Pclass Sex Age
## Min. :0.0000 Min. :1.000 Min. :0.0000 Min. : 0.42
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:22.00
## Median :0.0000 Median :3.000 Median :1.0000 Median :29.00
## Mean :0.3838 Mean :2.309 Mean :0.6476 Mean :29.44
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:35.00
## Max. :1.0000 Max. :3.000 Max. :1.0000 Max. :80.00
## SibSp Parch Fare Embarked
## Min. :0.000 Min. :0.0000 Min. : 0.00 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median : 14.45 Median :0.0000
## Mean :0.523 Mean :0.3816 Mean : 32.20 Mean :0.3614
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00 3rd Qu.:1.0000
## Max. :8.000 Max. :6.0000 Max. :512.33 Max. :2.0000
The insights gained from the summary above:
The average age of the passengers is 29.44, whereas the minimum age is 0.42 and the maximum age is 80.00
The average fare is 32.20, the minimum fare is 0.00, and the maximum fare is 512.33
In this section, we will create a model based on titanic
data to determine survivability of passengers.
The goal of this report is to determine survivability of passengers,
so the Survived column will be the target variable.
To search for the best columns to use as predictor variables, we will use GGally to check the correlation of all the columns in titanic. GGally requires that none of the columns be factor data types, so before generating the plot, we make sure all the columns are numeric and assign the result to titanic_plot:
titanic_plot <- titanic %>%
  mutate(
    Survived = as.numeric(Survived),
    Pclass = as.numeric(Pclass),
    Sex = as.numeric(Sex),
    Embarked = as.numeric(Embarked)
  )

ggcorr(titanic_plot, label = T, label_size = 2.9)

From the plot above, the column that has the highest correlation with the Survived column is Sex with -0.5, followed by Pclass with -0.3 and Fare with 0.3. The rest of the columns only have correlations of -0.1, 0.1, or 0, which means they are not highly correlated with Survived.
In conclusion, we will use Sex, Pclass, and
Fare to be the predictor variables.
Now, we want to divide titanic data into two datasets,
titanic_train for training the model, and
titanic_test for testing the model that’s been made:
set.seed(2024)
index <- sample(x = nrow(titanic),
                size = nrow(titanic) * 0.8)
titanic_train <- titanic[index, ]
titanic_test <- titanic[-index, ]

Good. We will check whether titanic_train is balanced:
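The check is presumably a class proportion table, which matches the output format below:
round(prop.table(table(titanic_train$Survived)) * 100)  # percentage of each Survived class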
##
## 0 1
## 63 37
The proportion is 63:37, which is balanced enough. But before we build the models, we have to divide the titanic_train and titanic_test data into titanic_train_x & titanic_test_x for the predictor variables and titanic_train_y & titanic_test_y for the target variable.
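The split chunk is not echoed; a minimal sketch, assuming the predictors chosen above (Sex, Pclass, Fare) and Survived as the target:
titanic_train_x <- titanic_train %>% select(Sex, Pclass, Fare)
titanic_train_y <- titanic_train$Survived
titanic_test_x <- titanic_test %>% select(Sex, Pclass, Fare)
titanic_test_y <- titanic_test$Survived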
Now we will build the Logistic Regression and KNN models:
For the first model, we will build a Logistic Regression model:
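The model chunk is not shown; presumably a binomial glm() on the selected predictors (the exact call is an assumption):
model_lr <- glm(Survived ~ Sex + Pclass + Fare,
                data = titanic_train,
                family = "binomial")
summary(model_lr)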
We will predict on the test data using model_lr and also create a confusion matrix for later evaluation:
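A sketch of this step; the 0.5 cutoff and treating the survivor class "1" as the positive class are assumptions consistent with the discussion above:
# predicted probabilities of survival
pred_lr_prob <- predict(model_lr, newdata = titanic_test, type = "response")
# convert probabilities into class labels with a 0.5 cutoff
pred_lr <- ifelse(pred_lr_prob > 0.5, 1, 0)
# confusion matrix with "1" (survived) as the positive class
eval_lr <- confusionMatrix(data = as.factor(pred_lr),
                           reference = as.factor(titanic_test_y),
                           positive = "1")
eval_lr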
The second model is a K-nearest neighbor (k-NN) model. First, we have to scale titanic_train_x and titanic_test_x using Z-score standardization. This should be done because it preserves the shape of the original distribution of each feature while centering it around zero and scaling it to a standard deviation of 1, which ensures that the relative relationships between data points within each feature are maintained:
titanic_train_xs <- scale(x = titanic_train_x)
titanic_test_xs <- scale(x = titanic_test_x,
                         center = attr(titanic_train_xs, "scaled:center"),
                         scale = attr(titanic_train_xs, "scaled:scale"))

Search for the optimal k value:
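The value 26.68 below equals the square root of the number of training rows (712), so the rule of thumb used is presumably k ≈ sqrt(n):
sqrt(nrow(titanic_train_xs))  # rule-of-thumb starting point for k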
## [1] 26.68333
The k value is 26.
In the KNN model, we don’t train a model first and then predict. Instead, we can use knn from the class library to predict directly. Save the result to pred_knn and also create a confusion matrix for later evaluation:
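A sketch, assuming k = 26 as chosen above and "1" (survived) as the positive class:
pred_knn <- knn(train = titanic_train_xs,
                test = titanic_test_xs,
                cl = as.factor(titanic_train_y),
                k = 26)
eval_knn <- confusionMatrix(data = pred_knn,
                            reference = as.factor(titanic_test_y),
                            positive = "1")
eval_knn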
Now that we have created confusion matrices for both models, let’s see the comparison:
compare_lr <- data_frame(Model = "Logistic Regression",
Accuracy = round((eval_lr$overall["Accuracy"] * 100), 2),
Recall = round((eval_lr$byClass["Recall"] * 100), 2),
Precision = round((eval_lr$byClass["Precision"] * 100), 2),
Specificity = round((eval_lr$byClass["Specificity"] * 100), 2))
compare_knn <- data_frame(Model = "K-nearest neighbor (k-NN)",
Accuracy = round((eval_knn$overall["Accuracy"] * 100), 2),
Recall = round((eval_knn$byClass["Recall"] * 100), 2),
Precision = round((eval_knn$byClass["Precision"] * 100), 2),
Specificity = round((eval_knn$byClass["Specificity"] * 100), 2))
rbind(compare_lr, compare_knn)

Based on the background case stated above, we want to see which model has the highest Recall and Precision metrics. With that being said, based solely on the recall and precision metrics obtained, K-nearest neighbor (k-NN) appears to be a more suitable choice than Logistic Regression for predicting survivability. The higher recall of k-NN, 70.89%, indicates that it is better at finding actual survivors among all those who survived. Although Logistic Regression shows a higher precision of 95.35%, implying it is more accurate when it predicts survival, its lower recall of 51.90% suggests it misses more survivors. In contrast, k-NN achieves a better balance between recall and precision, making it a more reliable option for identifying survivors accurately.
We’ve looked at how different methods could predict if passengers on the Titanic survived or not. We focused on finding the best method for this task by paying attention to two important measures: Recall and Precision.
We found that for this case, Sex, Pclass, and Fare are the best predictor columns because they have high correlation values with the Survived column.
We also found that between Logistic Regression and K-nearest neighbor (k-NN), the K-nearest neighbor (k-NN) model performed the best. It had the highest chances of spotting survivors correctly and making accurate predictions. With a Recall rate of 70.89% and a Precision rate of 86.15%, K-nearest neighbor (k-NN) showed it could find survivors well while also avoiding wrong guesses.
This suggests K-nearest neighbor (k-NN) is a good choice for predicting survival in situations similar to the Titanic disaster. Using K-nearest neighbor (k-NN) can help us make better decisions and understand survival patterns more effectively.