1 Prelude

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

RMS Titanic

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster.

I would like to apply machine learning to predict whether a passenger survived this catastrophe. The dataset provides information on the fate of the passengers on the Titanic, summarized by economic status (class), sex, age and survival.

Needed Libraries

#Packages for data frame transformation
library(dplyr)
library(tibble)
library(tidyr)
library(lubridate)
library(tidyverse)

#Packages for visualization & interactivity
library(ggplot2)
library(plotly)
library(highcharter)
library(leaflet)

# Arrange the plots side by side
library(gridExtra)

#Packages for Random Forest modelling & evaluation
library(randomForest)
library(MLmetrics) #for evaluation metrics such as RMSE or MAPE
library(caret)
library(pROC)

The data has been split into two groups:

    1. training set (train.csv) contains the details of a subset of the passengers on board (891 passengers to be exact, where each passenger gets a different row in the table)
    2. test set (test.csv) (we will use this later for testing the model)

2 Exploratory Data Analysis

titanic <- read.csv("data_input/train.csv", stringsAsFactors = T)
head(titanic)
titanic_test <- read.csv("data_input/test.csv", stringsAsFactors = T)

Explanation of each column:

  • PassengerId : Unique identifier of the passenger
  • Survived : Survival indicator; 0 = did not survive, 1 = survived
  • Pclass : Ticket class, a proxy for socio-economic status (SES): 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
  • Name : Name
  • Sex : Sex
  • Age : Age (fractional if less than 1; if estimated, it is in the form xx.5)
  • SibSp : # of siblings / spouses aboard the Titanic
  • Parch : # of parents / children aboard the Titanic
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

sibsp: The dataset defines family relations

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.


SUMMARY

summary(titanic)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <fct> , C85, , C123, , , E46, , , , G6, C103, , , , , , , , , , …
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

Facts

  • The train dataset has 891 rows and 12 columns, including the target variable (Survived)

  • 5 columns are integers, 5 are factors and 2 are doubles

  • 38% of passengers survived the Titanic (the mean of Survived is 0.38)

  • The median Pclass is 3, the lowest class

  • There are 314 female and 577 male passengers

  • Passenger ages range from 0.42 to 80 years

  • 177 passengers have no age data

  • The median fare is 14.45 and the mean is 32.20, with a maximum of 512.33

  • 168 passengers embarked at Cherbourg, 77 at Queenstown and 644 at Southampton

MISSING DATA

# Count missing values (NA or empty strings) per column of 'titanic'
total <- sapply(titanic, function(x) sum(is.na(x) | x == "")) # Total missing values
percent <- round((total / nrow(titanic)) * 100, 1)   # Percentage of missing values

# Create a data frame of missing data summary
missing_data <- data.frame(Total = total, Percent = percent) %>%
  arrange(desc(Percent))

# Display the top 5 columns with the highest percentage of missing values
head(missing_data, 5)

Facts

  • Embarked has only two missing values which can easily be filled.

  • Age has 177 missing values, a bit tricky to fill this.

  • Cabin has 687 missing values; since 77% of its values are missing, it looks like we might want to drop it from the dataset.

colnames(titanic)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

Of the 12 columns above, I will use all except PassengerId, Ticket and Cabin, as these are unlikely to be correlated with survival; Name will later be reduced to a Title feature.


2.1 Survivor Proportion

Checking the proportion of survivors

# Summarize survival counts and percentages
titanic_summarized <- titanic %>%
  group_by(Survived) %>%
  summarise(Count = n()) %>%
  mutate(Survived = ifelse(Survived == 1, "Survived", "Not Survived"),
         Percentage = Count/sum(Count)*100)

# Define colors for the chart
colors <- c("Not Survived" = "red", "Survived" = "blue") # Red for Not Survived, Blue for Survived

# Plotting with specified colors
highchart() %>%
  hc_add_series(
    data = titanic_summarized,
    type = "pie",
    name = "Survival",
    hcaes(x = Survived, y = Count, color = Survived) # Use color mapping
  ) %>%
  hc_colors(colors = unname(colors[titanic_summarized$Survived])) %>%  # Apply the color mapping
  hc_size(width = 400, height = 300) %>%   # Adjust the size as needed
  hc_tooltip(crosshairs=TRUE, borderWidth=5, sort=TRUE, shared=TRUE, table=TRUE,
            pointFormat=paste('<br><b>Percentage: {point.Percentage:.2f}%</b>')) %>%
  hc_title(text="Proportion of the Survivor", margin=0, style=list(color="#144746", useHTML=TRUE))

We can see that only 38.38% of the passengers survived.

2.2 Gender - Age & Survival Rate

See who survived and who did not across age ranges, for each gender

# Separate data for women and men
women <- titanic %>% filter(Sex == 'female')
men <- titanic %>% filter(Sex == 'male')

# Plot for Female
p_women <- ggplot() +
  geom_histogram(data = women %>% filter(Survived == 1), aes(x = Age, fill = 'survived'), binwidth = 4, alpha = 0.6) +
  geom_histogram(data = women %>% filter(Survived == 0), aes(x = Age, fill = 'not survived'), binwidth = 4, alpha = 0.6) +
  labs(title = "Women", fill = "") +
  theme_minimal() +
  xlim(NA, 80) +
  scale_fill_manual(values = c('survived' = 'blue', 'not survived' = 'red'))

# Plot for Male
p_men <- ggplot() +
  geom_histogram(data = men %>% filter(Survived == 1), aes(x = Age, fill = 'survived'), binwidth = 4, alpha = 0.6) +
  geom_histogram(data = men %>% filter(Survived == 0), aes(x = Age, fill = 'not survived'), binwidth = 4, alpha = 0.6) +
  labs(title = "Men", fill = "") +
  theme_minimal() +
  xlim(NA, 80) +
  scale_fill_manual(values = c('survived' = 'blue', 'not survived' = 'red'))

grid.arrange(p_women, p_men, ncol = 2)

As we can see here, women have a higher probability of survival than men, especially between 18 and 40 years old.

Men, on the other hand, face a high probability of not surviving between 18 and 50 years old, and their lowest probability of survival is between ages 5 and 18. Another thing to note is that infants of both genders have a slightly higher probability of survival.

Since certain ages appear to have increased odds of survival, and because I want every feature to be roughly on the same scale, I will create age groups later on.


Find the Survival Rate of each gender

#Number of Passenger by Gender
passenger_female <- nrow(titanic[titanic$Sex == "female", ])
passenger_male <- nrow(titanic[titanic$Sex == "male", ])

#Number of Passenger by Gender and Survived condition
female <- nrow(titanic[titanic$Sex == "female" & titanic$Survived == 1, ])
male <- nrow(titanic[titanic$Sex == "male" & titanic$Survived == 1, ])

#Survival Rate
rate_female <- round((female/passenger_female) * 100, 2)
rate_male <- round((male/passenger_male) * 100,2)

#Print the survival rates
cat("Percentage of Female Survivors is", rate_female, "percent")
## Percentage of Female Survivors is 74.2 percent
cat("Percentage of Male Survivors is", rate_male, "percent")
## Percentage of Male Survivors is 18.89 percent

2.3 Histogram of Age Distribution

Find the age distribution of the passenger

hist(titanic$Age, breaks=30, xlim=c(0,90), col=rgb(1,0,0,0.5), xlab="Age", 
     ylab="Amount", main="Age Distribution" )

It can be seen that most of the passengers were between 16 and 40 years old, with a smaller spike for babies around 2 years old.

(1 bar = 2 years)

2.4 Ticket Class

ticket_class <- titanic %>%
  group_by(Pclass) %>%
  summarise(Count = n()) %>%
  mutate(Class = case_when(
    Pclass == 1 ~ "Class One",
    Pclass == 2 ~ "Class Two",
    Pclass == 3 ~ "Class Three"
  ))


# Prepare the data for Highcharts
pie_data <- list_parse2(
  ticket_class %>% 
    select(Class, Count) %>% 
    rename(name = Class, y = Count) %>% 
    mutate(name = as.character(name)) # Ensure names are characters
)

# Plotting
highchart() %>%
  hc_chart(type = "pie") %>%
  hc_add_series(
    data = pie_data,
    showInLegend = TRUE
  ) %>%
  hc_title(text = "Proportion of Ticket Classes") %>%
  hc_tooltip(pointFormat=paste('<br><b>Percentage: {point.percentage:.1f}%'))

Survival rate based on ticket class

ticket_class2 <- titanic %>%
  group_by(Pclass, Survived) %>%
  summarise(Count = n()) %>%
  mutate(Class = case_when(
    Pclass == 1 ~ "Class One",
    Pclass == 2 ~ "Class Two",
    Pclass == 3 ~ "Class Three"
  ),
  Survived = case_when(
    Survived == 0 ~ "Not Survived",
    Survived == 1 ~ "Survived"
  )) %>%
  select(c(Class, Survived, Count)) %>%
  arrange(desc(Count))
## `summarise()` has grouped output by 'Pclass'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `Pclass`
ticket_class2
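The counts above are easier to compare as rates. As a small sketch using only dplyr (already loaded), we can compute the share of survivors within each ticket class; the column names here are my own:

#Survival rate within each ticket class (sketch)
titanic %>%
  group_by(Pclass) %>%
  summarise(Passengers = n(),
            Survivors = sum(Survived), # Survived is 0/1, so the sum counts survivors
            Rate = round(mean(Survived) * 100, 1)) # percentage who survived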

2.5 Port of Embarkation & Survival Rate

Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

embark <- titanic %>%
  group_by(Embarked) %>%
  summarise(Count = n()) %>%
  mutate(Embarked = recode(
    Embarked,
    "C" = "Cherbourg, France",
    "Q" = "Queenstown, Ireland",
    "S" = "Southampton, U.K."
  ), #rename C, Q, S to Cherbourg, Queenstown and Southampton
  lat = c(0, 49.630001, 51.857222, 50.909698), #latitude for each port (the empty level first)
  lon = c(0, -1.620000, -8.299167, -1.404351)) #longitude for each port (the empty level first)


leaflet(embark) %>%
  setView(lat = 51.206077, lng = -3.163499, zoom = 6) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    popup = ~paste(Embarked, Count, "passengers"),
    label = ~paste(Embarked, Count),
    color = "#ff2f00",         # Outline color for circle marker
    fillColor = "#FF0000",     # Fill color for circle marker
    fillOpacity = 0.9,
    radius = 8                 # Size of the circle marker
  )

There are two passengers that don’t have a Port of Embarkation:

titanic %>% 
  filter(Embarked == "") %>% 
  select(PassengerId, Pclass, Survived, Name, Sex)

We might also want to check the survival rate for each Port of Embarkation and gender.

embark_c <- titanic %>%
  filter(Embarked %in% c("S", "Q", "C")) %>%
  group_by(Embarked, Pclass, Sex, Survived) %>%
  summarise(Count = n()) %>%
  arrange(desc(Survived), desc(Count)) %>%
  mutate(
    Embarked = case_when(
      Embarked == "S" ~ "Southampton",
      Embarked == "Q" ~ "Queenstown",
      Embarked == "C" ~ "Cherbourg",
      TRUE ~ ""
    ),
    Survived = case_when(
      Survived == 1 ~ "Survived",
      TRUE ~ "Not Survived"
    )
  )

# embark_c$Embarked <- factor(embark_c$Embarked, 
#                               levels = c("Cherbourg", "Queenstown", "Southampton"))

plot <- ggplot(data = embark_c, aes(x = Pclass, y = Count, fill = Sex)) +
  geom_col(position = "dodge") +
  facet_wrap(~Embarked+Survived, ncol = 2) + # Optional: Facet by 'Embarked' to separate plots by survival status
  labs(x = "Passenger Class", y = "Count", fill = "Sex") +
  theme_minimal()+
  theme(
    panel.spacing.y = unit(3, "lines"), # Adjust vertical spacing between panels
    panel.spacing.x = unit(1, "lines")  # Adjust horizontal spacing between panels if needed
  )

ggplotly(plot, tooltip = "Count")

2.6 Siblings & Families

sibparch <- titanic %>%
  mutate(Relatives = case_when(SibSp + Parch == 0 ~ "Alone",
                               TRUE ~ "Not Alone"),
         Survived = ifelse(Survived == 1, "Survived", "Not Survived")) %>%
  select(PassengerId, Relatives, Pclass, Survived) %>%
  arrange(desc(Survived))%>%
  group_by(Relatives, Survived) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'Relatives'. You can override using the
## `.groups` argument.
sibparch
# sibparch <- titanic %>% 
#   mutate(Relatives = SibSp + Parch) %>% 
#   select(PassengerId, Relatives, Pclass, Survived) %>% 
#   arrange(desc(Relatives))
# sibparch

Facts

  • There are 374 passengers who had no relatives on board and did not survive

  • The other three groups (“Not Alone and Survived”, “Not Alone and Not Survived” and “Alone and Survived”) have similar counts


3 Data Preprocessing

  • Drop the PassengerId column because it contributes nothing to a person’s survival probability.

  • Drop Cabin, as most of its values are empty.

  • Drop Ticket: with 681 unique ticket numbers, it would be tricky to convert them into useful categories.

  • Fill missing Age values with random draws based on the mean and standard deviation of the observed ages.

  • Fill missing Embarked values with the most common port, “S”.

  • Fill missing Fare values with 0.

# Age
# Compute mean and standard deviation from the training data
mean_age <- mean(titanic$Age, na.rm = TRUE)
std_age <- sd(titanic$Age, na.rm = TRUE) # It's better to use training data for both

# Create a function to impute Age
impute_age <- function(dataset) {
  n_missing <- sum(is.na(dataset$Age))
  # Draw random ages within one standard deviation of the training mean
  rand_age <- runif(n_missing, mean_age - std_age, mean_age + std_age)
  dataset$Age[is.na(dataset$Age)] <- rand_age
  dataset$Age <- as.integer(dataset$Age)
  dataset$Fare <- replace_na(dataset$Fare, 0) # fill missing Fare values with 0
  return(dataset)
}

titanic2 <- titanic %>%
  select(-PassengerId, -Cabin, -Ticket)
  
# Apply the imputation function to both datasets
train_df <- impute_age(titanic2)
test_df <- impute_age(titanic_test)

# Check for missing values in train_df's Age
sum(is.na(train_df$Age))
## [1] 0
#Fill NA in Embarked column with the common one
# train_df$Embarked %>% replace_na("S")
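Note that the commented line above would not actually work here: because the CSV was read with stringsAsFactors = T, the two missing ports are empty strings (""), not NA. A minimal sketch of the intended fill (the later numeric conversion of Embarked also covers this implicitly through its TRUE ~ 0 branch):

#Replace the empty-string level with the most common port "S" (sketch)
train_df$Embarked[train_df$Embarked == ""] <- "S"
train_df$Embarked <- droplevels(train_df$Embarked) #drop the now-unused "" level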
  • Name title extraction
# Define the mapping of titles to numerical codes
titles <- c("Mr" = 1, "Miss" = 2, "Mrs" = 3, "Master" = 4, "Rare" = 5)

# Function to process titles in a dataset
process_titles <- function(dataset) {
  dataset %>%
    # Extract titles
    mutate(Title = str_extract(Name, "([A-Za-z]+)\\.")) %>%
    # Standardize titles
    mutate(Title = case_when(
      Title %in% c("Mr") ~ "Mr",
      Title %in% c("Miss", "Mlle", "Ms") ~ "Miss",
      Title %in% c("Mrs", "Mme") ~ "Mrs",
      Title == "Master" ~ "Master",
      TRUE ~ "Rare"  # Treat all other titles as "Rare"
    )) %>%
    # Map titles to numerical codes
    mutate(Title = titles[Title]) %>%
    # Ensure all entries have a valid title code; fill any NA values with 5 (Rare)
    mutate(Title = ifelse(is.na(Title), 5, Title)) %>%
    # Optionally, remove the 'Name' column
    select(-Name)
}

# Apply the function to both the training and test datasets
train_df <- process_titles(train_df)
test_df <- process_titles(test_df)
  • Convert sex info to numeric
# Define the mapping of genders to numerical codes
genders <- c(male = 0, female = 1)

# Apply the mapping to both datasets
train_df$Sex <- genders[train_df$Sex]
test_df$Sex <- genders[test_df$Sex]
  • Convert Embarked to numeric
train_df <- train_df %>% 
  mutate(Embarked = case_when(Embarked == "S" ~ 0,
                              Embarked == "C" ~ 1, 
                              Embarked == "Q" ~ 2,
                              TRUE ~ 0))

test_df <- test_df %>% 
  mutate(Embarked = case_when(Embarked == "S" ~ 0,
                              Embarked == "C" ~ 1, 
                              Embarked == "Q" ~ 2,
                              TRUE ~ 0))

4 Creating Categories

With the following features:

  • Age, we want to categorize every age into a group, making sure to avoid having 80% of the data fall into group 1 (see the distribution check after the code below).
# Function to categorize age
categorize_age <- function(age) {
  case_when(
    age <= 11 ~ 0,
    age > 11 & age <= 18 ~ 1,
    age > 18 & age <= 22 ~ 2,
    age > 22 & age <= 27 ~ 3,
    age > 27 & age <= 33 ~ 4,
    age > 33 & age <= 40 ~ 5,
    age > 40 & age <= 66 ~ 6,
    age > 66 ~ 7  # Adjusted to include a separate category for > 66
  )
}

# Apply the categorization to both data frames
train_df <- train_df %>%
  mutate(Age = as.integer(Age), # Ensure Age is integer type
         Age = categorize_age(Age))

test_df <- test_df %>%
  mutate(Age = as.integer(Age), # Ensure Age is integer type
         Age = categorize_age(Age))
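To verify that no single group dominates (the concern mentioned above), a quick sanity check of the group sizes can help; this is just a sketch using base R:

#Count passengers per age group to confirm a reasonable spread
table(train_df$Age)

#Or as proportions of the whole dataset
round(prop.table(table(train_df$Age)), 2)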
  • Fare, we will do the same as with the ‘Age’ feature. Keep in mind that we have to do it proportionally, using the ntile function from dplyr
categorize_fare <- function(fare) {
  ntile(fare, 5)
}

train_df <- train_df %>%
  mutate(Fare = categorize_fare(Fare), Fare = as.integer(Fare))

test_df <- test_df %>%
  mutate(Fare = categorize_fare(Fare), Fare = as.integer(Fare))

5 Creating New Features

  • Age and Class
train_df <- train_df %>%
  mutate(Age_Class = Age * Pclass)

test_df <- test_df %>%
  mutate(Age_Class = Age * Pclass)
  • Fare per Person
train_df <- train_df %>%
  mutate(Fare_Per_Person = Fare / ((SibSp + Parch) + 1),
         Fare_Per_Person = as.integer(Fare_Per_Person))

test_df <- test_df %>%
  mutate(Fare_Per_Person = Fare / ((SibSp + Parch) + 1),
         Fare_Per_Person = as.integer(Fare_Per_Person))

6 Building Machine Learning Model

We will use a Random Forest model to predict who survived the catastrophe and assess the goodness of the model.

Imagine walking through a lush forest, where each tree represents a decision maker. This forest is not ordinary; it’s a Random Forest in the world of machine learning, simplified for everyone to understand.

Picture the Scene: The Trees: Each tree in our magical forest makes decisions based on different pieces of information it observes. Imagine these trees looking at various aspects of an apple to decide if it’s ripe enough to eat. One tree might consider its color, another its size, and yet another its firmness.

Diverse Opinions: Just like in a community where everyone brings a different perspective, each tree in our forest has its unique viewpoint, leading to a variety of decisions.

Coming Together: Now, to decide if an apple is truly ripe, the forest does something remarkable. It combines the wisdom of all its trees. Each tree casts a vote based on its individual assessment.

The Final Verdict: The forest then tallies these votes. The outcome with the most votes is the final decision of the Random Forest on whether the apple is ripe or not.

Why is this Useful? In real life, Random Forests help us make sense of complex problems by breaking them down into simpler, smaller decisions, much like how different trees look at various aspects of the apple. This approach makes Random Forests very powerful and accurate in predicting outcomes in diverse areas, from deciding which movie you might like next to identifying diseases from medical tests.

The Magic of Ensemble Learning: This process is a part of what’s known as ensemble learning, where multiple models (our trees) join forces to make a decision, often leading to better results than any single model (tree) could achieve on its own.


To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
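To make the voting idea concrete, here is a purely illustrative toy sketch (the votes are made up, not part of our model): three hypothetical “trees” vote on whether each of five apples is ripe, and the forest takes the majority.

#Toy illustration of majority voting (hypothetical votes, 1 = ripe)
tree_votes <- rbind(
  tree_color    = c(1, 0, 1, 1, 0),
  tree_size     = c(1, 0, 0, 1, 0),
  tree_firmness = c(1, 1, 1, 1, 0)
)

#The "forest" decides an apple is ripe if at least 2 of the 3 trees vote 1
forest_decision <- colSums(tree_votes) >= 2
forest_decision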

Model making:

set.seed(100) # for reproducibility

#Make Survived column as factor
train_df <- train_df %>% 
  mutate(Survived = as.factor(Survived))

ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)

train_forest <- train(Survived ~ ., data = train_df, method = "rf", trControl = ctrl)

train_forest
## Random Forest 
## 
## 891 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 712, 714, 713, 713, 712, 713, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8204428  0.6058867
##    6    0.7871595  0.5418680
##   10    0.7793089  0.5246811
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Here is an explanation of the above result:

  • number = 5 means 5-fold cross-validation is applied to the train_df data; the data is divided into K parts, each of which is used as test data in turn.

  • repeats = 3 means the K-fold cross-validation is repeated three times.


Random Forest computations are large and slow (a consequence of the method’s robustness). It is therefore highly suggested to save the model, so that it can be reused without having to repeat the training process.
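A minimal sketch of saving and reloading the trained model with base R (the file name is just an example):

#Save the trained caret model to disk
saveRDS(train_forest, "train_forest.RDS")

#...later, reload it without retraining
train_forest <- readRDS("train_forest.RDS")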

If you look at the model summary, several values of mtry were tried (mtry is the number of predictors considered when splitting a node; one predictor can be used more than once). Each candidate mtry is evaluated across all resamples, and caret automatically selects the mtry that produces the best evaluation metric (in this case Accuracy).

In this case, the chosen model has mtry = 2, which had the highest accuracy when evaluated on the resampled data (the bootstrap samples can be thought of as the training data for the individual decision trees in the random forest).

train_forest$results

The accuracy for mtry = 2 has a standard deviation of about 0.027 (2.7%). The standard deviation shows us how precise the estimate is.

Confusion Matrix

train_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.51%
## Confusion matrix:
##     0   1 class.error
## 0 509  40  0.07285974
## 1 116 226  0.33918129

The first row is about the not-survived predictions: 509 passengers were correctly classified as not survived (true negatives) and 40 were wrongly classified as survived (false positives).

The second row is about the survived predictions: 116 passengers were wrongly classified as not survived (false negatives) and 226 were correctly classified as survived (true positives).

A confusion matrix gives you a lot of information about how well your model does, but there’s a way to get even more, like computing the classifier’s precision.

Here the confusion matrix comes with an OOB estimate of the error rate of 17.51%,

which means \[Accuracy = 100 - 17.51 = 82.49 \%\]

Precision & Recall

confusionMatrix(data = as.factor(train_df$Survived), reference = as.factor(train_forest$finalModel$predicted), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 509  40
##          1 116 226
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7983, 0.8493)
##     No Information Rate : 0.7015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6137          
##                                           
##  Mcnemar's Test P-Value : 1.916e-09       
##                                           
##             Sensitivity : 0.8496          
##             Specificity : 0.8144          
##          Pos Pred Value : 0.6608          
##          Neg Pred Value : 0.9271          
##              Prevalence : 0.2985          
##          Detection Rate : 0.2536          
##    Detection Prevalence : 0.3838          
##       Balanced Accuracy : 0.8320          
##                                           
##        'Positive' Class : 1               
## 
  • Sensitivity/Recall: how good the model is at finding the positive class
  • Pos Pred Value/Precision: how precise the model’s positive-class predictions are

Our model predicts a passenger’s survival correctly 66% of the time (precision). The recall tells us that it identified 85% of the people who actually survived.
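As a sanity check, these values can be recomputed directly from the counts in the confusion matrix above (reading the printed table with rows as predictions and 1 as the positive class):

#Counts read from the confusion matrix above
tp <- 226 #predicted survived, reference survived
fp <- 116 #predicted survived, reference not survived
fn <- 40  #predicted not survived, reference survived

tp / (tp + fp) #precision (Pos Pred Value), about 0.66
tp / (tp + fn) #recall (Sensitivity), about 0.85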

F1 Score

You can combine precision and recall into one score, called the F-score. The F-score is computed as the harmonic mean of precision and recall. Note that the harmonic mean assigns much more weight to low values. As a result, the classifier will only get a high F-score if both recall and precision are high.

# Precision and Recall values (from the confusion matrix above)
precision <- 0.6608
recall <- 0.8496

# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)

# Print F1 Score
f1_score
## [1] 0.7434

There we have it, a 74% F-score. The score is not that high, because the precision is only 66%. Unfortunately the F-score is not perfect, because it favors classifiers that have similar precision and recall. This is a problem, because you sometimes want a high precision and sometimes a high recall, and increasing precision sometimes decreases recall and vice versa (depending on the threshold). This is called the precision/recall trade-off.

ROC and AUC

Accuracy alone falls short in showing how well the model classifies both classes. To overcome this shortcoming, ROC and AUC are available as further evaluation tools beyond the confusion matrix.

Receiver-Operating Curve (ROC) is a curve that plots the relationship between True Positive Rate (Sensitivity or Recall) and False Positive Rate (1-Specificity). A good model should ideally have a high True Positive Rate and a low False Positive Rate.

Why can this happen? Accuracy works well when the classes are balanced, but it is misleading for imbalanced classes.

Since the test data has no Survived column, I will split the training data into its own train and validation sets, so we can compute the ROC and AUC later.

set.seed(123) # for reproducibility
trainIndex <- createDataPartition(train_df$Survived, p = .8, 
                                  list = FALSE, 
                                  times = 1)
trainData <- train_df[trainIndex, ]
validationData <- train_df[-trainIndex, ]
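createDataPartition samples within each class, so the survival proportions should be roughly the same in both splits; a quick check:

#Check that the class balance is preserved in both splits
prop.table(table(trainData$Survived))
prop.table(table(validationData$Survived))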

Then we retrain on the trainData:

model <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + Age_Class + Fare_Per_Person, 
                      data = trainData,
                      ntree = 100)

Predict probabilities for the validation set.

Make sure to predict the probability of each class, as pROC requires probabilities for the positive class (usually Survived = 1)

validationData$probSurvived <- predict(model, newdata = validationData, type = "prob")[,2]

Use the roc function from pROC to generate the ROC object, then plot it and calculate AUC:

rocObj <- roc(response = validationData$Survived, 
              predictor = validationData$probSurvived,
              levels = c("0", "1"))

# Plot ROC curve
plot(rocObj, main="ROC Curve", col="#1c61b6")

  • A bad model produces a curve close to the diagonal of the plot -> the model is barely better than guessing one of the classes.

The diagonal line represents a purely random classifier (e.g. a coin flip), and therefore your classifier should be as far away from it as possible. Our Random Forest model seems to do a good job.

Calculate AUC:

auc(rocObj)
## Area under the curve: 0.8192

AUC indicates the area under the ROC curve. The closer to 1, the better the model performance.

The ROC AUC Score is the corresponding score to the ROC AUC Curve. It is simply computed by measuring the area under the curve, which is called AUC.

A classifier that is 100% correct would have a ROC AUC score of 1, and a completely random classifier would have a score of 0.5.

Our AUC value is close to 1 (0.8192), meaning that our model is quite able to separate the positive class from the negative class.
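If we later need a concrete cut-off probability rather than a ranking, pROC can suggest one. A small sketch using the "best" threshold (by default, Youden's criterion):

#Find the threshold that best balances sensitivity and specificity
coords(rocObj, x = "best", ret = c("threshold", "sensitivity", "specificity"))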

Even though random forests are often labeled as non-interpretable models, we can at least see which predictors are used most (i.e. are most important) when building the forest:

# varImp(train_forest)
train_forest$finalModel$importance
##                 MeanDecreaseGini
## Pclass                  33.21952
## Sex                     92.45657
## Age                     20.18701
## SibSp                   15.61285
## Parch                   10.72975
## Fare                    16.44319
## Embarked                12.24665
## Title                    0.00000
## Age_Class               32.75138
## Fare_Per_Person         16.89692
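The randomForest package also ships a quick visual for the same information; a one-line sketch applied to the final model:

#Dot chart of the mean decrease in Gini per predictor
varImpPlot(train_forest$finalModel, main = "Variable Importance")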

7 Conclusion

Title, Embarked and Parch don’t play a significant role in our random forest classifier’s prediction process, so we can drop them and see whether the accuracy increases. Let’s train the classifier again with those columns dropped.

set.seed(100)

#Make Survived column as factor
train_df2 <- train_df %>% 
  mutate(Survived = as.factor(Survived)) %>% 
  select(-c(Title, Embarked, Parch))

ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)

train_forest2 <- train(Survived ~ ., data = train_df2, method = "rf", trControl = ctrl)

train_forest2
## Random Forest 
## 
## 891 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 712, 714, 713, 713, 712, 713, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8167122  0.6012328
##   4     0.7961568  0.5596998
##   7     0.7826693  0.5316147
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

As we can see above, the model accuracy decreased slightly to 0.817 (from 0.820) at mtry = 2; therefore, we will keep the first model to predict the test data.

Apply the model to the test data and see the results

#Predict with test data
Survived <- predict(train_forest, newdata = test_df)

#column bind test data with predicted result
prediction_result <- cbind(test_df, Survived) %>% select(c(PassengerId, Survived))
head(prediction_result)
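If you want to submit the result to Kaggle, the prediction can be written out as a CSV; a minimal sketch (the file name is just an example):

#Write the PassengerId/Survived pairs to a CSV file for submission
write.csv(prediction_result, "titanic_submission.csv", row.names = FALSE)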

8 Summary

We started with data exploration to gain insight from the dataset, checked for missing data and learned which features are important. During this process we used several R libraries for visualization and analysis. In the preprocessing part, we filled the missing values, converted features into numeric ones, grouped values into categories and created a few new features. We used the Random Forest machine learning algorithm and applied cross-validation to it.

We gave a brief explanation of how random forest works, took a look at the importance it assigns to the different features, and tried to optimize performance by simply taking the unimportant predictors out. However, we did not do hyperparameter tuning at this point, to keep the process simple.

Lastly, we looked at its confusion matrix and computed the model’s precision, recall and F-score.

Before

head(titanic)

After

head(train_df)

Of course there is still room for improvement, such as more extensive feature engineering (comparing and plotting the features against each other, and identifying and removing noisy features). Another thing that could improve the overall result is hyperparameter tuning on several machine learning models.

