“There is no danger that Titanic will sink. The boat is unsinkable and nothing but inconvenience will be suffered by the passengers.”
Phillip Franklin, White Star Line vice-president, 1912
Once thought to be unsinkable, the Titanic ironically sank on her maiden voyage in the North Atlantic Ocean on 15 April 1912. It is generally believed that some 1,500 people perished and only 706 survived. It has been claimed that most of the 706 survivors were women and children, and that the majority were first-class passengers.
Based on this claim, we will perform a classification analysis of the Titanic data to predict which passengers survived and which did not. In this report, we will use the Titanic dataset taken from the Kaggle Titanic challenge.
Based on the problem stated above, we will focus on the positive case of predicting survival; in this context, the positive class refers to the passengers who survived. Because we care most about correctly identifying survivors in this binary classification task, the main evaluation metrics of interest will be Recall and Precision.
This report explains how we used three different methods, Naive Bayes, Decision Tree, and Random Forest, to predict survival from the dataset. Our goal is to find out which model performs best in terms of two important measures: Recall and Precision. Recall tells us how many of the actual survivors the model correctly identifies, while Precision tells us how many of the passengers the model labels as survivors actually survived. We will compare these metrics for each model to determine which one is the best for this case.
Here is the list of libraries used for this report:
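The code chunk itself is not shown; judging from the functions used later in this report, the required packages are likely the following (the exact list, and e1071 as the Naive Bayes implementation, are assumptions; GGally is loaded separately just before the correlation plot):
library(dplyr)        # %>%, mutate(), select()
library(caret)        # upSample(), trainControl(), train(), confusionMatrix()
library(e1071)        # naiveBayes()
library(partykit)     # ctree(), ctree_control()
library(randomForest) # backend for caret's method = "rf"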
The titanic dataset has three csv files: gender_submission.csv, test.csv, and train.csv. We will use the train.csv file and split its data for training and testing the models we build. As for gender_submission.csv and test.csv, both files are specific to the Kaggle challenge submission, so they will not be used within this report's scope.
First, read train.csv and assign it to
titanic:
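A minimal read sketch, assuming train.csv sits in the working directory:
titanic <- read.csv("train.csv", stringsAsFactors = FALSE) # keep text columns as character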
Check the head and tail of titanic:
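The calls themselves are the usual ones:
head(titanic) # first six rows
tail(titanic) # last six rows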
Let's see the structure of the titanic dataframe:
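The output below matches dplyr's glimpse(), so presumably:
glimpse(titanic)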
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The explanation of each column is:
PassengerId: The id of the passenger
Survived: The survival of the passenger (1 = Yes and 0 = No)
Pclass: Ticket class
Name: The name of the passenger
Sex: The sex of the passenger
Age: The age of the passenger (in years)
SibSp: The number of siblings or spouses aboard the Titanic
Parch: The number of parents or children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown and S = Southampton)
We also want to know how many rows and columns there are in
titanic:
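For example, with dim():
dim(titanic)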
## [1] 891 12
There are 891 rows and 12 columns.
Given the scope of this analysis, we will remove columns that are not useful for prediction, namely PassengerId, Name, Ticket, and Cabin, bringing the number of columns down to 8:
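A sketch of the column removal, assuming dplyr's select() is used (the original chunk is hidden):
titanic <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)
dim(titanic)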
## [1] 891 8
We want to make sure that all the data types are correct. One way is to check the unique values of a column to determine whether it should be a factor. We will only check the columns suspected to have fewer than 10 unique values, in this case the Survived, Pclass, Sex, and Embarked columns:
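The checks producing the output below would be, for example:
unique(titanic$Survived)
unique(titanic$Pclass)
unique(titanic$Sex)
unique(titanic$Embarked)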
## [1] 0 1
## [1] 3 1 2
## [1] "male" "female"
## [1] "S" "C" "Q" ""
Notice that the Embarked column contains an empty value (""), most likely because the port at which those passengers embarked is unknown. So, before changing the data types, we will replace the empty values. Check the positions of the empty Embarked values in the titanic dataframe:
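One way to locate them, using base R's which():
which(titanic$Embarked == "")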
The empty values are in rows 62 and 830. Since Embarked is a categorical (character) column, a median is not really meaningful here; instead, we will fill the empty values with the most frequent port:
embarked_mode <- names(which.max(table(titanic$Embarked[titanic$Embarked != ""])))
titanic[titanic$Embarked == "", ]$Embarked <- embarked_mode
After assigning the empty values, let's recheck to make sure that the empty string value has disappeared:
## [1] "S" "C" "Q"
Great. Now we will continue with the earlier step of changing data types. The results above show that these four columns have fewer than 10 unique values, so they can be treated as factors. Changing the data types to factor:
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
Recheck the newly changed data types:
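The structure below is base R's str() output, so presumably:
str(titanic)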
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Done.
Now, we need to make sure that there are no missing values. Let’s check for missing values:
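For example, counting NA values per column:
colSums(is.na(titanic))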
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
There are 177 missing values in the Age column. For the sake of data quality, we will impute the missing Age values with the median age by sex. First, find the median age of female passengers and male passengers:
female_pass <- titanic[titanic$Sex == "female", ]
female_median <- median(female_pass$Age, na.rm = TRUE)
female_median
## [1] 27
male_pass <- titanic[titanic$Sex == "male", ]
male_median <- median(male_pass$Age, na.rm = TRUE)
male_median
## [1] 29
Set all NA values in the Age column to 27 for female passengers and 29 for male passengers:
titanic$Age <- ifelse(titanic$Sex == "female" & is.na(titanic$Age), female_median, titanic$Age)
titanic$Age <- ifelse(titanic$Sex == "male" & is.na(titanic$Age), male_median, titanic$Age)
Recheck for any missing values:
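Counting NA values per column again:
colSums(is.na(titanic))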
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
Now we're good: there are no missing values left. Let's move on to the next section, Data Exploration.
Check the statistical summary of the titanic dataframe:
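Presumably via summary():
summary(titanic)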
## Survived Pclass Sex Age SibSp Parch
## 0:549 1:216 female:314 Min. : 0.42 Min. :0.000 Min. :0.0000
## 1:342 2:184 male :577 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000
## 3:491 Median :29.00 Median :0.000 Median :0.0000
## Mean :29.44 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## Fare Embarked
## Min. : 0.00 C:168
## 1st Qu.: 7.91 Q: 77
## Median : 14.45 S:646
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
The insights gained from the summary above:
There were 342 passengers who survived and 549 who did not
There were 216 passengers in 1st class, 184 in 2nd class, and 491 in 3rd class
The passengers consisted of 314 females and 577 males
The average age of the passengers was 29.44 years, with a minimum of 0.42 and a maximum of 80.00
The average fare was 32.20, the minimum fare was 0.00, and the maximum fare was 512.33
As many as 646 passengers embarked from Southampton, followed by 168 from Cherbourg and 77 from Queenstown
In this section, we will build models on the titanic data to determine the survivability of passengers. Since that is the goal of this report, the Survived column will be the target variable.
To find the best columns to use as predictor variables, we will use GGally to check the correlation between all the columns in titanic. ggcorr() needs the columns to be numeric rather than factors, so before generating the plot we convert the factor columns to numeric and store the result in titanic_plot:
library(GGally)
titanic_plot <- titanic %>%
  mutate(
    Survived = as.numeric(Survived),
    Pclass = as.numeric(Pclass),
    Sex = as.numeric(Sex),
    Embarked = as.numeric(Embarked)
  )
ggcorr(titanic_plot, label = T, label_size = 2.9)
From the plot above, the column with the highest correlation to Survived is Sex at -0.5, followed by Pclass at -0.3 and Fare at 0.3. The remaining columns are only around -0.1, 0.1, or 0, which means they are not highly correlated.
In conclusion, we will use Sex, Pclass, and Fare as the predictor variables.
Now, we want to divide the titanic data into two datasets: titanic_train for training the models, and titanic_test for testing them:
set.seed(2024)
index <- sample(x = nrow(titanic),
                size = nrow(titanic) * 0.8)
titanic_train <- titanic[index, ]
titanic_test <- titanic[-index, ]
Good. We will check whether titanic_train is balanced:
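A sketch of the balance check, assuming the numbers below are rounded percentages:
round(prop.table(table(titanic_train$Survived)) * 100)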
##
## 0 1
## 63 37
The proportion is 63:37, so it needs to be balanced before training. We will do an upsampling:
titanic_train <- upSample(x = titanic_train %>% select(-Survived),
                          y = titanic_train$Survived,
                          yname = "Survived")
Let's check the new proportion after the upsampling process:
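Checking the class proportion the same way:
round(prop.table(table(titanic_train$Survived)) * 100)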
##
## 0 1
## 50 50
Great, now the proportion is balanced with 50% Survived and 50% Not Survived. But before we build the models, we have to divide the titanic_test data into titanic_test_x for the predictor variables and titanic_test_y for the target variable:
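A minimal sketch of this split, assuming titanic_test_x keeps only the chosen predictors:
titanic_test_x <- titanic_test %>% select(Sex, Pclass, Fare)
titanic_test_y <- titanic_test$Survived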
Now we will build the models: Naive Bayes, Decision Tree, and Random Forest.
For the first model, we will build a Naive Bayes model:
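One way to fit it, assuming the naiveBayes() function from the e1071 package and the three chosen predictors:
model_nb <- naiveBayes(Survived ~ Sex + Pclass + Fare, data = titanic_train)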
We will predict on the test data using model_nb and create a confusion matrix for later evaluation:
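A sketch of this step, assuming survivors ("1") are treated as the positive class; the eval_nb object is reused in the comparison later:
pred_nb <- predict(model_nb, newdata = titanic_test_x)
eval_nb <- confusionMatrix(data = pred_nb, reference = titanic_test_y, positive = "1")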
The second model is a Decision Tree model:
model_dt <- ctree(formula = Survived ~ Sex + Pclass + Fare,
                  data = titanic_train,
                  control = ctree_control(mincriterion = 0.5,
                                          minsplit = 0,
                                          minbucket = 0))
We will predict on the test data using model_dt and create a confusion matrix for later evaluation:
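A corresponding sketch for the Decision Tree, under the same assumptions:
pred_dt <- predict(model_dt, newdata = titanic_test_x)
eval_dt <- confusionMatrix(data = pred_dt, reference = titanic_test_y, positive = "1")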
Lastly, we will create a Random Forest Model. But first, create the control variables:
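The chunk is not shown; one plausible setup uses caret's trainControl() with k-fold cross-validation (the specific folds and repeats here are assumptions):
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)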
And create the model:
model_rf <- train(Survived ~ Sex + Pclass + Fare,
                  data = titanic_train,
                  method = "rf",
                  trControl = control,
                  importance = T)
We will predict on the test data using model_rf and create a confusion matrix for later evaluation:
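And a corresponding sketch for the Random Forest:
pred_rf <- predict(model_rf, newdata = titanic_test_x)
eval_rf <- confusionMatrix(data = pred_rf, reference = titanic_test_y, positive = "1")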
Now that we have created confusion matrices for all the models, let's see the comparison:
compare_nb <- tibble(Model = "Naive Bayes",
                     Accuracy = round((eval_nb$overall["Accuracy"] * 100), 2),
                     Recall = round((eval_nb$byClass["Recall"] * 100), 2),
                     Precision = round((eval_nb$byClass["Precision"] * 100), 2),
                     Specificity = round((eval_nb$byClass["Specificity"] * 100), 2))
compare_dt <- tibble(Model = "Decision Tree",
                     Accuracy = round((eval_dt$overall["Accuracy"] * 100), 2),
                     Recall = round((eval_dt$byClass["Recall"] * 100), 2),
                     Precision = round((eval_dt$byClass["Precision"] * 100), 2),
                     Specificity = round((eval_dt$byClass["Specificity"] * 100), 2))
compare_rf <- tibble(Model = "Random Forest",
                     Accuracy = round((eval_rf$overall["Accuracy"] * 100), 2),
                     Recall = round((eval_rf$byClass["Recall"] * 100), 2),
                     Precision = round((eval_rf$byClass["Precision"] * 100), 2),
                     Specificity = round((eval_rf$byClass["Specificity"] * 100), 2))
rbind(compare_nb, compare_dt, compare_rf)
Based on the background case stated above, we want the model with the highest Recall and Precision metrics. From the comparison, Random Forest has the highest Recall value at 77.22% and the highest Precision value at 85.92%.
We’ve looked at how different methods could predict if passengers on the Titanic survived or not. We focused on finding the best method for this task by paying attention to two important measures: Recall and Precision.
We found that for this case, Sex, Pclass, and Fare are the best predictor columns because they have high correlation values with the Survived column.
We also found that between Naive Bayes, Decision Tree, and Random Forest, the Random Forest model performed the best. It had the highest chances of spotting survivors correctly and making accurate predictions. With a Recall rate of 77.22% and a Precision rate of 85.92%, Random Forest showed it could find survivors well while also avoiding wrong guesses.
This suggests Random Forest is a good choice for predicting survival in situations similar to the Titanic disaster. Using Random Forest can help us make better decisions and understand survival patterns more effectively.