The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
RMS Titanic
The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster.
I would like to implement machine learning to predict will passenger survived during this catastrophy. It provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age and survival.
Needed Libraries
#Packages for dataframe transformation
library(dplyr)
library(tibble)
library(tidyr)
library(lubridate)
library(highcharter)
library(leaflet)
library(tidyverse)
#Package for visualization& interactive
library(ggplot2)
library(plotly)
# Arrange the plots side by side
library(gridExtra)
#Package for RandomForest
library(randomForest)
library(MLmetrics) #for RMSE or MAPE error checking
library(caret)
library(pROC)The data has been split into two groups:
titanic <- read.csv("data_input/train.csv", stringsAsFactors = T)
head(titanic)titanic_test <- read.csv("data_input/test.csv", stringsAsFactors = T)Explanation of each column:
PassengerId : Number of the passengerSurvived : Indicator of survival, 0 is not survived, 1 is survived.Pclass : A proxy for socio-economic status (SES) Ticket class: 1 = 1st (Upper), 2 = 2nd (Middle), and 3 = 3rd (Lower)Name : NameSex : SexAge : Age (is fractional if less than 1. If the age is estimated, is it in the form of xx.5)SibSp : # of siblings / spouses aboard the TitanicParch : # of parents / children aboard the TitanicTicket : Ticket numberFare : Passenger fareCabin : Cabin numberEmbarked : Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southamptonsibsp: The dataset defines family relations
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
SUMMARY
summary(titanic)## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
glimpse(titanic)## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <fct> male, female, female, female, male, male, male, male, fema…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <fct> , C85, , C123, , , E46, , , , G6, C103, , , , , , , , , , …
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
Facts
The train dataset has 891 examples and 12 columns including target variable(which is Survived)
5 columns are in integer format, 5 in character and 2 in double
38%(Mean on Survived 0.38) of passengers survived the Titanic
Median of Pclass was in 3rd class, which is the lowest one
There are 314 female and 577 male passengers
Passengers age ranged from 0.42 to 80 years old
There are 177 passengers that has no age data
Fare average is 14.45 with maximum of 512.33
Embarked from Cherbourg are 168 passengers, 77 passengers from Queenstown and 644 passengers from Southampton
MISSING DATA
# Assuming 'train_df' is your dataset in R
total <- sapply(titanic, function(x) sum(is.na(x) | x == "")) # Total missing values
percent <- round((total / nrow(titanic)) * 100, 1) # Percentage of missing values
# Create a data frame of missing data summary
missing_data <- data.frame(Total = total, Percent = percent) %>%
arrange(desc(Percent))
# Display the top 5 columns with the highest percentage of missing values
head(missing_data, 5)Facts
Embarked has only two missing values which can easily be filled.
Age has 177 missing values, a bit tricky to fill this.
Cabin has 687 missing values but it looks like that we might want to drop it from the dataset, since 77 % of it are missing.
colnames(titanic)## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
From 12 columns above, i will use all except PassengerId, Ticket and Name as it is more likely not correlated with a high survival rate.
Checking the proportional of the survival
# Assuming titanic_summarized is already created as before
titanic_summarized <- titanic %>%
group_by(Survived) %>%
summarise(Count = n()) %>%
mutate(Survived = ifelse(Survived == 1, "Survived", "Not Survived"),
Percentage = Count/sum(Count)*100)
# Define colors for the chart
colors <- c("Not Survived" = "red", "Survived" = "blue") # Red for Not Survived, Green for Survived
# Plotting with specified colors
highchart() %>%
hc_add_series(
data = titanic_summarized,
type = "pie",
name = "Survival",
hcaes(x = Survived, y = Count, color = Survived) # Use color mapping
) %>%
hc_colors(colors = unname(colors[titanic_summarized$Survived])) %>% # Apply the color mapping
hc_size(width = 400, height = 300) %>% # Adjust the size as needed
hc_tooltip(crosshairs=TRUE, borderWidth=5, sort=TRUE, shared=TRUE, table=TRUE,
pointFormat=paste('<br><b>Percentage: {point.Percentage:.2f}%</b>')) %>%
hc_title(text="Proportion of the Survivor", margin=0, style=list(color="#144746", useHTML=TRUE))We could see that only 38,38% of passenger is survived.
See the survived or not survived from range of age and gender
# Separate data for women and men
women <- titanic %>% filter(Sex == 'female')
men <- titanic %>% filter(Sex == 'male')
# Plot for Female
p_women <- ggplot() +
geom_histogram(data = women %>% filter(Survived == 1), aes(x = Age, fill = 'survived'), binwidth = 4, alpha = 0.6) +
geom_histogram(data = women %>% filter(Survived == 0), aes(x = Age, fill = 'not survived'), binwidth = 4, alpha = 0.6) +
labs(title = "Women", fill = "") +
theme_minimal() +
xlim(NA, 80) +
scale_fill_manual(values = c('survived' = 'blue', 'not survived' = 'red'))
# Plot for Male
p_men <- ggplot() +
geom_histogram(data = men %>% filter(Survived == 1), aes(x = Age, fill = 'survived'), binwidth = 4, alpha = 0.6) +
geom_histogram(data = men %>% filter(Survived == 0), aes(x = Age, fill = 'not survived'), binwidth = 4, alpha = 0.6) +
labs(title = "Men", fill = "") +
theme_minimal() +
xlim(NA, 80) +
scale_fill_manual(values = c('survived' = 'blue', 'not survived' = 'red'))
grid.arrange(p_women, p_men, ncol = 2)As we can see here, woman have a high probability of survival compared to men, especially when they are between 18 to 40 years old.
Men, at the other side, facing high probability of not survived between 18 and 50 years old. They are facing lowest probability of survival between age 5 to 18. Another thing to note is that infants in both gender also have a little bit higher probability of survival.
Since there seem to be certain ages, which have increased odds of survival and because I want every feature to be roughly on the same scale, I will create age groups later on.
Find the Survival Rate of each gender
#Number of Passenger by Gender
passenger_female <- nrow(titanic[titanic$Sex == "female", ])
passenger_male <- nrow(titanic[titanic$Sex == "male", ])
#Number of Passenger by Gender and Survived condition
female <- nrow(titanic[titanic$Sex == "female" & titanic$Survived == 1, ])
male <- nrow(titanic[titanic$Sex == "male" & titanic$Survived == 1, ])
#Survival Rate
rate_female <- round((female/passenger_female) * 100, 2)
rate_male <- round((male/passenger_male) * 100,2)
#Male Survival
cat("Percentage of Female Survivor is", rate_female, "Percents")## Percentage of Female Survivor is 74.2 Percents
cat("Percentage of Male Survivor is", rate_male, "Percents")## Percentage of Male Survivor is 18.89 Percents
Find the age distribution of the passenger
hist(titanic$Age, breaks=30, xlim=c(0,90), col=rgb(1,0,0,0.5), xlab="Age",
ylab="Amount", main="Age Distribution" )It can be seen that most of the passengers ranged in age from 16 to 40 years and babies were around 2 years old.
(1 bar = 2 years)
ticket_class <- titanic %>%
group_by(Pclass) %>%
summarise(Count = n()) %>%
mutate(Class = case_when(
Pclass == 1 ~ "Class One",
Pclass == 2 ~ "Class Two",
Pclass == 3 ~ "Class Three"
))
# Prepare the data for Highcharts
pie_data <- list_parse2(
ticket_class %>%
select(Class, Count) %>%
rename(name = Class, y = Count) %>%
mutate(name = as.character(name)) # Ensure names are characters
)
# Plotting
highchart() %>%
hc_chart(type = "pie") %>%
hc_add_series(
data = pie_data,
showInLegend = TRUE
) %>%
hc_title(text = "Proportion of Ticket Classes") %>%
hc_tooltip(pointFormat=paste('<br><b>Percentage: {point.percentage:.1f}%'))Survival rate based on ticket class
ticket_class2 <- titanic %>%
group_by(Pclass, Survived) %>%
summarise(Count = n()) %>%
mutate(Class = case_when(
Pclass == 1 ~ "Class One",
Pclass == 2 ~ "Class Two",
Pclass == 3 ~ "Class Three"
),
Survived = case_when(
Survived == 0 ~ "Not Survived",
Survived == 1 ~ "Survived"
)) %>%
select(c(Class, Survived, Count)) %>%
arrange(desc(Count))## `summarise()` has grouped output by 'Pclass'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `Pclass`
ticket_class2Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
embark <- titanic %>%
group_by(Embarked) %>%
summarise(Count = n()) %>%
mutate(Embarked = recode(
Embarked,
"C" = "Cherbourg, France",
"Q" = "Queenstown, Ireland",
"S" = "Southampton, U.K."
),#rename the CSQ to Cherbourg, Southampton and Queenstown
lat = c(0, 49.630001, 51.857222, 50.909698), #adding latitude for these city
lon = c(0, -1.620000, -8.299167, -1.404351)) #adding longitude for these city
leaflet(embark) %>%
setView(lat = 51.206077, lng = -3.163499, zoom = 6) %>%
addTiles() %>%
addCircleMarkers(
lng = ~lon,
lat = ~lat,
popup = ~paste(Embarked, Count, "passengers"),
label = ~paste(Embarked, Count),
color = "#ff2f00", # Outline color for circle marker
fillColor = "#FF0000", # Fill color for circle marker
fillOpacity = 0.9,
radius = 8 # Size of the circle marker
)There are two passengers that don’t have Port of Embarkation:
titanic %>%
filter(Embarked == "") %>%
select(PassengerId, Pclass, Survived, Name, Sex)We might also want to check survival rate on each Port of Embarkation and Gender.
embark_c <- titanic %>%
filter(Embarked %in% c("S", "Q", "C")) %>%
group_by(Embarked, Pclass, Sex, Survived) %>%
summarise(Count = n()) %>%
arrange(desc(Survived), desc(Count)) %>%
mutate(
Embarked = case_when(
Embarked == "S" ~ "Southampton",
Embarked == "Q" ~ "Queenstown",
Embarked == "C" ~ "Cherbourg",
TRUE ~ ""
),
Survived = case_when(
Survived == 1 ~ "Survived",
TRUE ~ "Not Survived"
)
)
# embark_c$Embarked <- factor(embark_c$Embarked,
# levels = c("Cherbourg", "Queenstown", "Southampton"))
plot <- ggplot(data = embark_c, aes(x = Pclass, y = Count, fill = Sex)) +
geom_col(position = "dodge") +
facet_wrap(~Embarked+Survived, ncol = 2) + # Optional: Facet by 'Embarked' to separate plots by survival status
labs(x = "Passenger Class", y = "Count", fill = "Sex") +
theme_minimal()+
theme(
panel.spacing.y = unit(3, "lines"), # Adjust vertical spacing between panels
panel.spacing.x = unit(1, "lines") # Adjust horizontal spacing between panels if needed
)
ggplotly(plot, tooltip = "Count")sibparch <- titanic %>%
mutate(Relatives = case_when(SibSp + Parch == 0 ~ "Alone",
TRUE ~ "Not Alone"),
Survived = ifelse(Survived == 1, "Survived", "Not Survived")) %>%
select(PassengerId, Relatives, Pclass, Survived) %>%
arrange(desc(Survived))%>%
group_by(Relatives, Survived) %>%
summarise(count = n()) %>%
arrange(desc(count))## `summarise()` has grouped output by 'Relatives'. You can override using the
## `.groups` argument.
sibparch# sibparch <- titanic %>%
# mutate(Relatives = SibSp + Parch) %>%
# select(PassengerId, Relatives, Pclass, Survived) %>%
# arrange(desc(Relatives))
# sibparchFacts
There are 374 passengers that has no relatives on board and not survived
The rest on “Not Alone and Survived”, “Not Alone and Not Survived” and “Alone and Survived” shares the similar amount
Drop PassengerId column because it gives no contribution to a persons survival probability.
Drop Cabin as it has lot of empty column
Drop Ticket as attribute has 681 unique tickets, it will be a bit tricky to convert them into useful categories. So we will drop it from the dataset.
Fill Age Column with mean and standard deviation
Fill Embarked NA column with the common one which is “S”
Fill Fare NA column with 0
# Age
# Compute mean and standard deviation from the training data
mean_age <- mean(titanic$Age, na.rm = TRUE)
std_age <- sd(titanic$Age, na.rm = TRUE) # It's better to use training data for both
# Create a function to impute Age
impute_age <- function(dataset) {
is_null <- sum(is.na(dataset$Age))
rand_age <- runif(is_null, mean_age - std_age, mean_age + std_age)
dataset$Age <- ifelse(is.na(dataset$Age), rand_age, dataset$Age)
dataset$Age <- as.integer(dataset$Age)
dataset$Fare <- replace_na(0)
return(dataset)
}
titanic2 <- titanic %>%
select(-PassengerId, -Cabin, -Ticket)
# Apply the imputation function to both datasets
train_df <- impute_age(titanic2)
test_df <- impute_age(titanic_test)
# Check for missing values in train_df's Age
sum(is.na(train_df$Age))## [1] 0
#Fill NA in Embarked column with the common one
# train_df$Embarked %>% replace_na("S")# Define the mapping of titles to numerical codes
titles <- c("Mr" = 1, "Miss" = 2, "Mrs" = 3, "Master" = 4, "Rare" = 5)
# Function to process titles in a dataset
process_titles <- function(dataset) {
dataset %>%
# Extract titles
mutate(Title = str_extract(Name, "([A-Za-z]+)\\.")) %>%
# Standardize titles
mutate(Title = case_when(
Title %in% c("Mr") ~ "Mr",
Title %in% c("Miss", "Mlle", "Ms") ~ "Miss",
Title %in% c("Mrs", "Mme") ~ "Mrs",
Title == "Master" ~ "Master",
TRUE ~ "Rare" # Treat all other titles as "Rare"
)) %>%
# Map titles to numerical codes
mutate(Title = titles[Title]) %>%
# Ensure all entries have a valid title code; fill any NA values with 5 (Rare)
mutate(Title = ifelse(is.na(Title), 5, Title)) %>%
# Optionally, remove the 'Name' column
select(-Name)
}
# Apply the function to both the training and test datasets
train_df <- process_titles(train_df)
test_df <- process_titles(test_df)sex info to numeric# Define the mapping of titles to numerical codes
genders <- c(male = 0, female = 1)
# Apply the mapping to both datasets
train_df$Sex <- genders[train_df$Sex]
test_df$Sex <- genders[test_df$Sex]Embarked to numerictrain_df <- train_df %>%
mutate(Embarked = case_when(Embarked == "S" ~ 0,
Embarked == "C" ~ 1,
Embarked == "Q" ~ 2,
TRUE ~ 0))
test_df <- test_df %>%
mutate(Embarked = case_when(Embarked == "S" ~ 0,
Embarked == "C" ~ 1,
Embarked == "Q" ~ 2,
TRUE ~ 0))With following features:
# Function to categorize age
categorize_age <- function(age) {
case_when(
age <= 11 ~ 0,
age > 11 & age <= 18 ~ 1,
age > 18 & age <= 22 ~ 2,
age > 22 & age <= 27 ~ 3,
age > 27 & age <= 33 ~ 4,
age > 33 & age <= 40 ~ 5,
age > 40 & age <= 66 ~ 6,
age > 66 ~ 7 # Adjusted to include a separate category for > 66
)
}
# Apply the categorization to both data frames
train_df <- train_df %>%
mutate(Age = as.integer(Age), # Ensure Age is integer type
Age = categorize_age(Age))
test_df <- test_df %>%
mutate(Age = as.integer(Age), # Ensure Age is integer type
Age = categorize_age(Age))ntile function from dplyrcategorize_fare <- function(fare) {
ntile(fare, 5)
}
train_df <- train_df %>%
mutate(Fare = categorize_fare(Fare), Fare = as.integer(Fare))
test_df <- test_df %>%
mutate(Fare = categorize_fare(Fare), Fare = as.integer(Fare))train_df <- train_df %>%
mutate(Age_Class = Age * Pclass)
test_df <- test_df %>%
mutate(Age_Class = Age * Pclass)train_df <- train_df %>%
mutate(Fare_Per_Person = Fare / ((SibSp + Parch) + 1),
Fare_Per_Person = as.integer(Fare_Per_Person))
test_df <- test_df %>%
mutate(Fare_Per_Person = Fare / ((SibSp + Parch) + 1),
Fare_Per_Person = as.integer(Fare_Per_Person))We will use Random Forest Model to make a prediction on who will survived during catastrophy and see the godness of the model.
Imagine walking through a lush forest, where each tree represents a decision maker. This forest is not ordinary; it’s a Random Forest in the world of machine learning, simplified for everyone to understand.
Picture the Scene: The Trees: Each tree in our magical forest makes decisions based on different pieces of information it observes. Imagine these trees looking at various aspects of an apple to decide if it’s ripe enough to eat. One tree might consider its color, another its size, and yet another its firmness.
Diverse Opinions: Just like in a community where everyone brings a different perspective, each tree in our forest has its unique viewpoint, leading to a variety of decisions.
Coming Together: Now, to decide if an apple is truly ripe, the forest does something remarkable. It combines the wisdom of all its trees. Each tree casts a vote based on its individual assessment.
The Final Verdict: The forest then tallies these votes. The outcome with the most votes is the final decision of the Random Forest on whether the apple is ripe or not.
Why is this Useful? In real life, Random Forests help us make sense of complex problems by breaking them down into simpler, smaller decisions, much like how different trees look at various aspects of the apple. This approach makes Random Forests very powerful and accurate in predicting outcomes in diverse areas, from deciding which movie you might like next to identifying diseases from medical tests.
The Magic of Ensemble Learning: This process is a part of what’s known as ensemble learning, where multiple models (our trees) join forces to make a decision, often leading to better results than any single model (tree) could achieve on its own.
The illustration visualizes this concept in a fun and accessible way, making the idea of Random Forests and ensemble learning approachable to everyone, regardless of their background in data science.
To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
Model making:
set.seed(100) # for reproducibility
#Make Survived column as factor
train_df <- train_df %>%
mutate(Survived = as.factor(Survived))
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)
train_forest <- train(Survived ~ ., data = train_df, method = "rf", trControl = ctrl)
train_forest## Random Forest
##
## 891 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 712, 714, 713, 713, 712, 713, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8204428 0.6058867
## 6 0.7871595 0.5418680
## 10 0.7793089 0.5246811
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Here is the explanation of the above result:
Number = 5 means 5 K-Fold Cross Validation applied for the train_df data, this step will divide the data in K parts, each part will be used as test data in turn.
Repeats = 3 means the k-fold Cross Validation will be done three times.
## Warning: package 'knitr' was built under R version 4.3.3
Random Forest computations are very large and long (a consequence of their robust nature). Therefore, it highly suggested to save the model, so that it can be used repeatedly without having to repeat the creation process.
If you look at the summary model, several mtry experiments were carried out (the number of predictors that can be used for splitting nodes (1 predictor can be used more than 1 time)). At each repeat, a different mtry will be tried (the mtry number is chosen randomly too). Random forest will automatically select the mtry that produces the best evaluation metrics (in this case Accuracy).
In this case, the model chosen is with mtry = 2, which has the highest accuracy when tested on boostrap sampling data (can be considered as train data for making decision trees in random forests).
train_forest$resultswith Standard Deviation of 0.027(or 2.7%) in mtry 2. The standard deviation shows us, how precise the estimates are.
Confusion Matrix
train_forest$finalModel##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 17.51%
## Confusion matrix:
## 0 1 class.error
## 0 509 40 0.07285974
## 1 116 226 0.33918129
The first row is about the not-survived-predictions: 503 passengers were correctly classified as not survived (called true negatives) and 46 where wrongly classified as not survived (false positives).
The second row is about the survived-predictions: 117 passengers where wrongly classified as survived (false negatives) and 225 where correctly classified as survived (true positives).
A confusion matrix gives you a lot of information about how well your model does, but theres a way to get even more, like computing the classifiers precision.
Here we can find the Confusion Matrix result with OOB Estimate of error rate in 18.07%
it means the \[Accuracy = 100 - 18.07 = 81.93 \%\]
Precision& Recall
confusionMatrix(data = as.factor(train_df$Survived), reference = as.factor(train_forest$finalModel$predicted), positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 509 40
## 1 116 226
##
## Accuracy : 0.8249
## 95% CI : (0.7983, 0.8493)
## No Information Rate : 0.7015
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6137
##
## Mcnemar's Test P-Value : 1.916e-09
##
## Sensitivity : 0.8496
## Specificity : 0.8144
## Pos Pred Value : 0.6608
## Neg Pred Value : 0.9271
## Prevalence : 0.2985
## Detection Rate : 0.2536
## Detection Prevalence : 0.3838
## Balanced Accuracy : 0.8320
##
## 'Positive' Class : 1
##
Our model predicts 65% of the time, a passengers survival correctly (precision). The recall tells us that it predicted the survival of 83 % of the people who actually survived.
F1 Score You can combine precision and recall into one score, which is called the F-score. The F-score is computed with the harmonic mean of precision and recall. Note that it assigns much more weight to low values. As a result of that, the classifier will only get a high F-score, if both recall and precision are high.
# Precision and Recall values
precision <- 0.6579
recall <- 0.8303
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print F1 Score
f1_score## [1] 0.7341142
There we have it, a 73 % F-score. The score is not that high, because we have a recall of 83%. But unfortunately the F-score is not perfect, because it favors classifiers that have a similar precision and recall. This is a problem, because you sometimes want a high precision and sometimes a high recall. The thing is that an increasing precision, sometimes results in an decreasing recall and vice versa (depending on the threshold). This is called the precision/recall tradeoff.
ROC and AUC
Accuracy has shortcomings to show the goodness of the model in classifying both classes. To overcome this lack of accuracy, ROC and AUC are available as other evaluation tools after the Confusion Matrix.
Receiver-Operating Curve (ROC) is a curve that plots the relationship between True Positive Rate (Sensitivity or Recall) and False Positive Rate (1-Specificity). A good model should ideally have a high True Positive Rate and a low False Positive Rate.
Why is that can be happened? Accuracy is good when we are facing a balanced class but not on imbalance class.
In the test data, we dont have any Survived column, therefore, i will try to split the training data to its own train and test data so we are able to see the ROC and AUC test later.
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(train_df$Survived, p = .8,
list = FALSE,
times = 1)
trainData <- train_df[trainIndex, ]
validationData <- train_df[-trainIndex, ]and do retrain on the trainData
model <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + Age_Class + Fare_Per_Person,
data = trainData,
ntree = 100)Predict probabilities for the validation set.
Make sure to predict the probability of each class, as pROC requires probabilities for the positive class (usually Survived = 1)
validationData$probSurvived <- predict(model, newdata = validationData, type = "prob")[,2]Use the roc function from pROC to generate the ROC object, then plot it and calculate AUC:
rocObj <- roc(response = validationData$Survived,
predictor = validationData$probSurvived,
levels = c("0", "1"))
# Plot ROC curve
plot(rocObj, main="ROC Curve", col="#1c61b6")The vertical line in the middle represents a purely random classifier (e.g a coin flip) and therefore your classifier should be as far away from it as possible. Our Random Forest model seems to do a good job.
Calculate AUC:
auc(rocObj)## Area under the curve: 0.8192
AUC indicates the area under the ROC curve. The closer to 1, the better the model performance.
The ROC AUC Score is the corresponding score to the ROC AUC Curve. It is simply computed by measuring the area under the curve, which is called AUC.
A classifiers that is 100% correct, would have a ROC AUC Score of 1 and a completely random classiffier would have a score of 0.5.
The AUC value is close to 1 (0.8393), meaning that our model is quite able to classify the positive class and negative class well.
Even though random forests are labeled as non-interpretable models, at least we can see what predictors are most used (important) in creating random forests:
# varImp(train_forest)
train_forest$finalModel$importance## MeanDecreaseGini
## Pclass 33.21952
## Sex 92.45657
## Age 20.18701
## SibSp 15.61285
## Parch 10.72975
## Fare 16.44319
## Embarked 12.24665
## Title 0.00000
## Age_Class 32.75138
## Fare_Per_Person 16.89692
Title, Embarked and Parch doesn’t play a significant role in our random forest classifiers prediction process, you can choose to drop them and see the new results if it will increase the Accuracy or not. Let’s train the classifier again with dropped column.
set.seed(100)
#Make Survived column as factor
train_df2 <- train_df %>%
mutate(Survived = as.factor(Survived)) %>%
select(-c(Title, Embarked, Parch))
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)
train_forest2 <- train(Survived ~ ., data = train_df2, method = "rf", trControl = ctrl)
train_forest2## Random Forest
##
## 891 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 712, 714, 713, 713, 712, 713, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8167122 0.6012328
## 4 0.7961568 0.5596998
## 7 0.7826693 0.5316147
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
As we can see above, the model accuracy is slightly decrease to 0.80 on mtry 2, therefore, we will keep the first model to predict the test data.
Implement the prediction to data test and see the results
#Predict with test data
Survived <- predict(train_forest, newdata = test_df)
#column bind test data with predicted result
prediction_result <- cbind(test_df, Survived) %>% select(c(PassengerId, Survived))
head(prediction_result)We started with the data exploration to get insight from the dataset, checked about missing data and learned which features are important. During this process, we used some R libraries to do the visualization and analyzing. For preprocessing part, we are trying to fill the missing values, converted features into numeric ones, groupd value into categories and created a few new features. We used Random Forest machine learning algorithm and applied cross validation on it.
Brief explanation on how random forest works, took a look at the importance it assigns to the different features and tried to optimize the performance by simply take any unimportant predictors out. However, we are not doing hyperparameter tuning at this moment to simplify the process.
Lastly, we looked at it’s confusion matrix and computed the models precision, recall and f-score.
Before
head(titanic)After
head(train_df)Of course there is still room for improvement, like doing a more extensive feature engineering, by comparing and plotting the features against each other and identifying and removing the noisy features. Another thing that can improve the overall result is doing hyperparameter tuning on several machine learning models.
Article: