“There is no danger that Titanic will sink. The boat is unsinkable and nothing but inconvenience will be suffered by the passengers.”
Phillip Franklin, White Star Line vice-president, 1912
Once thought to be unsinkable, the Titanic ironically sank on her maiden voyage in the North Atlantic Ocean on 15 April 1912. It is generally believed that some 1,500 people perished and only 706 survived. It has been claimed that most of the 706 survivors were women and children, and that a majority were first-class passengers.
Based on this claim, we will perform a classification analysis of the Titanic data to predict which passengers survived. In this report, we will use the Titanic dataset taken from the Titanic Kaggle challenge.
Based on the problem stated above, we will focus on the positive case of predicting survival; in this context, the positive case refers to passengers who survived. Because of this focus, we will concentrate on evaluation metrics for binary classification that emphasize the correct identification of survivors. Therefore, the main metrics of interest will be Recall and Precision.
This report explains how we used two different methods, Logistic Regression and K-nearest neighbor (k-NN), to make predictions based on the dataset. We trained each model to predict passenger survival. Our goal is to find out which model performs best in terms of two important measures: Recall and Precision. Recall tells us how well the model finds all the actual survivors, while Precision tells us how often the model is correct when it predicts survival. We will compare these metrics for each model to determine which one is the best model for this case.
Here is the list of libraries used for this report:
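The setup chunk is not echoed, so the exact list is an assumption; based on the functions used later in this report (%>% and mutate from dplyr, ggcorr from GGally, knn from class, and confusionMatrix from caret), it is likely something like:
library(dplyr)   # data wrangling: %>%, mutate, select
library(GGally)  # ggcorr() correlation plot
library(class)   # knn()
library(caret)   # confusionMatrix()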
The Titanic dataset has three csv files: gender_submission.csv, test.csv, and train.csv. We will use the train.csv file and split the data for training and testing the model. As for gender_submission.csv and test.csv, both files are related to the Kaggle challenge submission, so they will not be used within this report's scope.
First, read train.csv and assign it to
titanic:
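The reading chunk is not shown; a minimal sketch, assuming train.csv sits in the working directory:
titanic <- read.csv("train.csv", stringsAsFactors = FALSE)  # keep text columns as character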
Check the head and tail of titanic:
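A sketch of the inspection calls (the chunks themselves are not echoed):
head(titanic)  # first six rows
tail(titanic)  # last six rows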
Let’s see the structure of the titanic dataframe:
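The output below has the format of dplyr's glimpse(), so the call was presumably:
glimpse(titanic)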
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The explanation of each column is:
PassengerId: The id of the passenger
Survived: The survival of the passenger (1 = Yes and 0 = No)
Pclass: Ticket class
Name: The name of the passenger
Sex: The sex of the passenger
Age: The age of the passenger (in years)
SibSp: The number of siblings or spouses aboard the Titanic
Parch: The number of parents or children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown and S = Southampton)
We also want to know how many rows and columns there are in
titanic:
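Presumably via base R's dim():
dim(titanic)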
## [1] 891 12
There are 891 rows and 12 columns.
Because of the scope of this analysis, we will remove unimportant columns such as PassengerId, Name, Ticket, and Cabin, bringing the number of columns down to 8:
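One way to do this with dplyr (the original chunk is not echoed):
titanic <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)  # drop columns outside the analysis scope
dim(titanic)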
## [1] 891 8
We want to make sure that all the data types are correct. One way is to check the unique values of a column to decide whether it should be a factor data type. We will only check the columns suspected to have fewer than 10 unique values, in this case the Survived, Pclass, Sex, and Embarked columns:
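A sketch of the checks producing the output below:
unique(titanic$Survived)
unique(titanic$Pclass)
unique(titanic$Sex)
unique(titanic$Embarked)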
## [1] 0 1
## [1] 3 1 2
## [1] "male" "female"
## [1] "S" "C" "Q" ""
Notice that in the Embarked column, there is an empty value (""). It is likely because the ports at which those passengers embarked were unknown. So, before changing the data types, we will convert the empty values into another value. Check the positions of the empty Embarked values in the titanic dataframe:
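Likely something like the following, which returns rows 62 and 830:
which(titanic$Embarked == "")  # row positions of the empty values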
The empty values are in rows 62 and 830. We will assign the empty
values of the Embarked column to the median value of
Embarked:
embarked_median <- median(titanic$Embarked)
titanic[titanic$Embarked == "", ]$Embarked <- embarked_median

After assigning the empty values to the median value, let’s recheck to make sure that the empty string value has disappeared:
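Presumably the same unique-value check as before:
unique(titanic$Embarked)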
## [1] "S" "C" "Q"
Great. Now we will continue with the prior step, which is changing the data types. The result above showed us that these four columns have fewer than 10 unique values, so their data types could be regarded as factors. But instead of the factor data type, we will change them into numerical data types. Why? Because the Logistic Regression and KNN models rely on mathematical computations that involve distances or gradients, which are inherently defined for numeric values.
We will change Survived and Pclass into
numerical directly:
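The conversion chunk is not echoed; presumably a direct as.numeric() cast, which matches the str() output further below:
titanic$Survived <- as.numeric(titanic$Survived)
titanic$Pclass <- as.numeric(titanic$Pclass)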
And for Sex and Embarked, we will convert the character values into numerical values using ifelse:
titanic$Sex <- ifelse(titanic$Sex == "female", 0, 1)
titanic$Embarked <- ifelse(titanic$Embarked == "S", 0,
                           ifelse(titanic$Embarked == "C", 1, 2))

Recheck the newly changed data types:
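Presumably via str(), matching the output below:
str(titanic)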
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : num 1 0 0 0 1 1 1 1 0 0 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: num 0 1 0 0 0 2 0 0 0 1 ...
Nice, now all the columns in titanic are numerical.
Now, we need to make sure that there are no missing values. Let’s check for missing values:
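A sketch of the missing-value check producing the output below:
colSums(is.na(titanic))  # count of NA values per column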
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
There are 177 missing values in the Age column. For the sake of data clarity, we will fill the missing values in the Age column with the median age based on sex. Search for the median age of female passengers and of male passengers:
female_pass <- titanic[titanic$Sex == "0", ]
female_median <- median(female_pass$Age, na.rm = TRUE)
female_median

## [1] 27
male_pass <- titanic[titanic$Sex == "1", ]
male_median <- median(male_pass$Age, na.rm = TRUE)
male_median

## [1] 29
Set all NA values in the Age column to 27 for female passengers and 29 for male passengers:

titanic$Age <- ifelse(titanic$Sex == "0" & is.na(titanic$Age), female_median, titanic$Age)
titanic$Age <- ifelse(titanic$Sex == "1" & is.na(titanic$Age), male_median, titanic$Age)

Recheck for any missing values:
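Presumably the same check as before:
colSums(is.na(titanic))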
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
Now we’re good! There are no more missing values. Let’s move on to the next section, which is data exploration.
Check the summary statistics of the titanic dataframe:
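Presumably via base R's summary():
summary(titanic)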
## Survived Pclass Sex Age
## Min. :0.0000 Min. :1.000 Min. :0.0000 Min. : 0.42
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:22.00
## Median :0.0000 Median :3.000 Median :1.0000 Median :29.00
## Mean :0.3838 Mean :2.309 Mean :0.6476 Mean :29.44
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:35.00
## Max. :1.0000 Max. :3.000 Max. :1.0000 Max. :80.00
## SibSp Parch Fare Embarked
## Min. :0.000 Min. :0.0000 Min. : 0.00 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median : 14.45 Median :0.0000
## Mean :0.523 Mean :0.3816 Mean : 32.20 Mean :0.3614
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00 3rd Qu.:1.0000
## Max. :8.000 Max. :6.0000 Max. :512.33 Max. :2.0000
The insights gained from the summary above:
The average age of the passengers is 29.44, whereas the minimum age is 0.42 and the maximum age is 80.00
The average fare is 32.20, the minimum fare is 0.00, and the maximum fare is 512.33
In this section, we will create a model based on titanic
data to determine survivability of passengers.
The goal of this report is to determine survivability of passengers,
so the Survived column will be the target variable.
To search for the best columns to use as predictor variables, we will use GGally to check the correlation of all the columns in titanic. GGally requires that none of the columns be factor data types, so before generating the plot, we make sure all the columns are numeric and assign the result to titanic_plot:
titanic_plot <- titanic %>%
  mutate(
    Survived = as.numeric(Survived),
    Pclass = as.numeric(Pclass),
    Sex = as.numeric(Sex),
    Embarked = as.numeric(Embarked)
  )

ggcorr(titanic_plot, label = T, label_size = 2.9)

From the plot above, the column that has the highest correlation with the Survived column is Sex with -0.5, followed by Pclass with -0.3 and Fare with 0.3. The rest of the columns only have correlations of -0.1, 0.1, or 0, which means they are not highly correlated with Survived.
In conclusion, we will use Sex, Pclass, and
Fare to be the predictor variables.
Now, we want to divide titanic data into two datasets,
titanic_train for training the model, and
titanic_test for testing the model that’s been made:
set.seed(2024)
index <- sample(x = nrow(titanic),
                size = nrow(titanic) * 0.8)
titanic_train <- titanic[index, ]
titanic_test <- titanic[-index, ]

Good. We will check whether titanic_train is balanced:
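The check is presumably a class proportion table, which matches the output format below:
round(prop.table(table(titanic_train$Survived)) * 100)  # percentage of each Survived class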
##
## 0 1
## 63 37
The proportion is 63:37, which is balanced enough. But before we build the models, we have to divide the titanic_train and titanic_test data into titanic_train_x & titanic_test_x for the predictor variables and titanic_train_y & titanic_test_y for the target variable.
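The split chunk is not echoed; a minimal sketch, assuming the predictors chosen above (Sex, Pclass, Fare) and Survived as the target:
titanic_train_x <- titanic_train %>% select(Sex, Pclass, Fare)
titanic_train_y <- titanic_train$Survived
titanic_test_x <- titanic_test %>% select(Sex, Pclass, Fare)
titanic_test_y <- titanic_test$Survived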
Now we will build the Logistic Regression and KNN models:
For the first model, we will build a Logistic Regression model:
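The model chunk is not shown; presumably a binomial glm() on the selected predictors (the exact call is an assumption):
model_lr <- glm(Survived ~ Sex + Pclass + Fare,
                data = titanic_train,
                family = "binomial")
summary(model_lr)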
We will predict on the test data using model_lr and also create a confusion matrix for later evaluation:
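A sketch of this step; the 0.5 cutoff and treating the survivor class "1" as the positive class are assumptions consistent with the discussion above:
# predicted probabilities of survival
pred_lr_prob <- predict(model_lr, newdata = titanic_test, type = "response")
# convert probabilities into class labels with a 0.5 cutoff
pred_lr <- ifelse(pred_lr_prob > 0.5, 1, 0)
# confusion matrix with "1" (survived) as the positive class
eval_lr <- confusionMatrix(data = as.factor(pred_lr),
                           reference = as.factor(titanic_test_y),
                           positive = "1")
eval_lr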
The second model is a K-nearest neighbor (k-NN) model. First, we have to scale titanic_train_x and titanic_test_x using Z-score standardization. This should be done because it preserves the shape of the original distribution of each feature while centering it around zero and scaling it to a standard deviation of 1, which ensures that the relative relationships between data points within each feature are maintained:
titanic_train_xs <- scale(x = titanic_train_x)
titanic_test_xs <- scale(x = titanic_test_x,
                         center = attr(titanic_train_xs, "scaled:center"),
                         scale = attr(titanic_train_xs, "scaled:scale"))

Search for the optimal k value:
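The value 26.68 below equals the square root of the number of training rows (712), so the rule of thumb used is presumably k ≈ sqrt(n):
sqrt(nrow(titanic_train_xs))  # rule-of-thumb starting point for k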
## [1] 26.68333
The k value is 26.
In the KNN model, we don’t train a model first and then predict. Instead, we can use knn from the class library to predict directly. Save the result to pred_knn and also create a confusion matrix for later evaluation:
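A sketch, assuming k = 26 as chosen above and "1" (survived) as the positive class:
pred_knn <- knn(train = titanic_train_xs,
                test = titanic_test_xs,
                cl = as.factor(titanic_train_y),
                k = 26)
eval_knn <- confusionMatrix(data = pred_knn,
                            reference = as.factor(titanic_test_y),
                            positive = "1")
eval_knn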
Now that we have created confusion matrices for both models, let’s see the comparison:
compare_lr <- data_frame(Model = "Logistic Regression",
Accuracy = round((eval_lr$overall["Accuracy"] * 100), 2),
Recall = round((eval_lr$byClass["Recall"] * 100), 2),
Precision = round((eval_lr$byClass["Precision"] * 100), 2),
Specificity = round((eval_lr$byClass["Specificity"] * 100), 2))
compare_knn <- data_frame(Model = "K-nearest neighbor (k-NN)",
Accuracy = round((eval_knn$overall["Accuracy"] * 100), 2),
Recall = round((eval_knn$byClass["Recall"] * 100), 2),
Precision = round((eval_knn$byClass["Precision"] * 100), 2),
Specificity = round((eval_knn$byClass["Specificity"] * 100), 2))
rbind(compare_lr, compare_knn)

Based on the background case stated above, we want to see which model has the highest Recall and Precision metrics. With that being said, based solely on the recall and precision metrics obtained, K-nearest neighbor (k-NN) appears to be a more suitable choice than Logistic Regression for predicting survivability. The higher recall of k-NN, 70.89%, indicates that it is better at finding actual survivors among all those who survived. Although Logistic Regression shows a higher precision of 95.35%, implying it is more accurate when it predicts survival, its lower recall of 51.90% suggests it misses more survivors. In contrast, k-NN achieves a better balance between recall and precision, making it a more reliable option for identifying survivors accurately.
We’ve looked at how different methods could predict if passengers on the Titanic survived or not. We focused on finding the best method for this task by paying attention to two important measures: Recall and Precision.
We found that for this case, Sex, Pclass, and Fare are the best predictor columns because they have high correlation values with the Survived column.
We also found that between Logistic Regression and K-nearest neighbor (k-NN), the K-nearest neighbor (k-NN) model performed the best. It had the highest chances of spotting survivors correctly and making accurate predictions. With a Recall rate of 70.89% and a Precision rate of 86.15%, K-nearest neighbor (k-NN) showed it could find survivors well while also avoiding wrong guesses.
This suggests K-nearest neighbor (k-NN) is a good choice for predicting survival in situations similar to the Titanic disaster. Using K-nearest neighbor (k-NN) can help us make better decisions and understand survival patterns more effectively.