Titanic sinking illustration.
The sinking of the Titanic is one of the most infamous shipwrecks in history.
Royal Mail Ship (RMS) Titanic, a British passenger liner, sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during its maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died. There were not enough lifeboats, and because of a “women and children first” protocol for loading them, a disproportionate number of men were left aboard. This project aims to predict passenger survival aboard the Titanic using a dataset from Kaggle.
library(tidyverse) # data manipulation
library(caret) # data splitting and model training
library(e1071) # building naive bayes model
library(partykit) # building decision tree model
library(randomForest) # building random forest model
library(mice) # imputation
library(GGally) # building correlation checking
library(RColorBrewer) # visualization
library(yardstick) # model evaluation
library(plotly) # interactive visualization
<- read.csv("train.csv")
train <- read.csv("test.csv")
test <- read.csv("gender_submission.csv")
test_result <- inner_join(test, test_result, by = "PassengerId")
test <- rbind(train, test)
titanic str(titanic)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
There are 12 variables in this dataset, and Survived
will be the target variable.
Dataset explanations:
* PassengerId: Passenger ID
* Pclass: Ticket class (1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower))
* Name: Name of the passenger
* Sex: Sex of the passenger
* Age: Age of the passenger in years
* SibSp: Number of siblings/spouses aboard the Titanic
* Parch: Number of parents/children aboard the Titanic
* Ticket: Ticket number
* Fare: Passenger fare
* Cabin: Cabin number
* Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
* Survived: Survival status (0 = No, 1 = Yes)
Convert data types
titanic <- titanic %>% mutate(
  Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
  Pclass = factor(Pclass, levels = c(1, 2, 3), labels = c("1st", "2nd", "3rd")),
  Sex = factor(Sex, levels = c("female", "male"), labels = c("Female", "Male")),
  Embarked = factor(Embarked)
)
The Name variable by itself is unlikely to provide useful information to the model, but we can extract each passenger’s title from it and see whether that is useful.
# Modifying name
titanic <- titanic %>%
  separate(Name, into = c("LastName", "firstname"), sep = ", ") %>%
  separate(firstname, into = c("Title", "FirstName"), sep = "\\. ")
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [514].
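The warning flags a single name that contained an extra separator; if curious, the affected row can be inspected directly (a quick sketch):
titanic[514, c("LastName", "Title", "FirstName")]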
$Title == "Mlle", ]$Title <- "Miss"
titanic[titanic$Title == "Mme", ]$Title <- "Mrs"
titanic[titanic$Title == "Ms", ]$Title <- "Miss"
titanic[titanic
table(titanic$Title)
##
## Capt Col Don Dona Dr Jonkheer
## 1 4 1 1 8 1
## Lady Major Master Miss Mr Mrs
## 1 2 61 264 757 198
## Rev Sir the Countess
## 8 1 1
There were 18 titles in total before merging the variants, and several of them are quite rare. I will combine the rare titles into a single Other category.
titanic$Title <- ifelse(titanic$Title %in% c("Master", "Miss", "Mr", "Mrs"), titanic$Title, "Other")
titanic$Title <- as.factor(titanic$Title)
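A quick sanity check (output omitted here) confirms that only the five grouped levels remain:
table(titanic$Title)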
The dataset also records the number of parents/children (Parch) and siblings/spouses (SibSp) each passenger had aboard the Titanic. We can combine these into a new variable representing each passenger’s family size.
titanic$FamilySize <- 1 + titanic$SibSp + titanic$Parch
Then, check for missing values.
anyNA(titanic)
## [1] TRUE
colSums(is.na(titanic))
## PassengerId Survived Pclass LastName Title FirstName
## 0 0 0 0 0 0
## Sex Age SibSp Parch Ticket Fare
## 0 263 0 0 0 1
## Cabin Embarked FamilySize
## 0 0 0
There are 263 missing values in passengers’ Age and 1 missing value in Fare. Let’s check the respective variables.
titanic %>% filter(is.na(Fare))
## PassengerId Survived Pclass LastName Title FirstName Sex Age SibSp Parch
## 1 1044 No 3rd Storey Mr Thomas Male 60.5 0 0
## Ticket Fare Cabin Embarked FamilySize
## 1 3701 NA S 1
The missing Fare belongs to Mr. Thomas Storey from the 3rd class. We can impute the missing value with the mean or the median, depending on the distribution of the data.
hist(titanic$Fare)
Since Fare is not normally distributed, we can impute the missing value using the median fare of the corresponding ticket class.
titanic %>% group_by(Pclass) %>%
  summarise(median = median(Fare, na.rm = TRUE))
## # A tibble: 3 x 2
## Pclass median
## <fct> <dbl>
## 1 1st 60
## 2 2nd 15.0
## 3 3rd 8.05
titanic[is.na(titanic$Fare), ]$Fare <- 8.05
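Hardcoding 8.05 is fine here because only one 3rd-class fare is missing. As a more general sketch (same result on this data; titanic_alt is an illustrative copy, not used below), each missing fare can be filled with the median of its own ticket class:
# sketch only; writes to a copy so the base-R subsetting later is unaffected
titanic_alt <- titanic %>%
  group_by(Pclass) %>%
  mutate(Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare)) %>%
  ungroup()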
Next, let’s deal with the missing Age values. We can impute them using the mice package.
set.seed(123)
mice_mod <- mice(titanic[, !names(titanic) %in% c('PassengerId', 'LastName', 'FirstName', 'Ticket', 'Cabin', 'FamilySize', 'Survived')], method = 'pmm')
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
mice_output <- complete(mice_mod)
Let’s compare the distribution of the original data and the imputed data.
par(mfrow=c(1,2))
hist(titanic$Age, freq=F, main='Age: Original Data',
col='blue')
hist(mice_output$Age, freq=F, main='Age: MICE Output',
col='lightblue')
The distributions of the original and the imputed data are similar, so we can use the mice output to fill in the missing Age values.
titanic$Age <- mice_output$Age
Then, check for the missing values once again to make sure.
anyNA(titanic)
## [1] FALSE
Now we’re good to go. Next, check the data summary.
summary(titanic)
## PassengerId Survived Pclass LastName Title
## Min. : 1 No :815 1st:323 Length:1309 Master: 61
## 1st Qu.: 328 Yes:494 2nd:277 Class :character Miss :264
## Median : 655 3rd:709 Mode :character Mr :757
## Mean : 655 Mrs :198
## 3rd Qu.: 982 Other : 29
## Max. :1309
## FirstName Sex Age SibSp
## Length:1309 Female:466 Min. : 0.17 Min. :0.0000
## Class :character Male :843 1st Qu.:20.00 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.0000
## Mean :29.45 Mean :0.4989
## 3rd Qu.:37.00 3rd Qu.:1.0000
## Max. :80.00 Max. :8.0000
## Parch Ticket Fare Cabin
## Min. :0.000 Length:1309 Min. : 0.000 Length:1309
## 1st Qu.:0.000 Class :character 1st Qu.: 7.896 Class :character
## Median :0.000 Mode :character Median : 14.454 Mode :character
## Mean :0.385 Mean : 33.276
## 3rd Qu.:0.000 3rd Qu.: 31.275
## Max. :9.000 Max. :512.329
## Embarked FamilySize
## : 2 Min. : 1.000
## C:270 1st Qu.: 1.000
## Q:123 Median : 1.000
## S:914 Mean : 1.884
## 3rd Qu.: 2.000
## Max. :11.000
Looks like there are 2 passengers with an empty port of embarkation. Let’s check the corresponding passengers.
%>% filter(Embarked == "") titanic
## PassengerId Survived Pclass LastName Title FirstName
## 1 62 Yes 1st Icard Miss Amelie
## 2 830 Yes 1st Stone Mrs George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked FamilySize
## 1 Female 38 0 0 113572 80 B28 1
## 2 Female 62 0 0 113572 80 B28 1
Both passengers with a missing port of embarkation were in the 1st class and paid a fare of 80. Let’s see where most 1st-class passengers embarked.
ggplot(titanic, aes(Embarked, fill = Pclass)) +
geom_bar(position = "fill")
Looks like about half of the passengers who embarked from Cherbourg were in the 1st class. To make sure, let’s look at the average fare passengers paid by port of embarkation.
%>% filter(Embarked != "") %>% group_by(Embarked) %>%
titanic summarise(mean = mean(Fare),
median = median(Fare))
## # A tibble: 3 x 3
## Embarked mean median
## <fct> <dbl> <dbl>
## 1 C 62.3 28.5
## 2 Q 12.4 7.75
## 3 S 27.4 13
On average, passengers who embarked from Cherbourg paid higher fares than passengers from the other ports. Based on this exploration, we can conclude that the 2 passengers with a missing port of embarkation most likely embarked from Cherbourg.
$Embarked == "", ]$Embarked <- "C"
titanic[titanic$Embarked <- droplevels(titanic$Embarked) titanic
Before we continue our analysis, let’s see if there is correlation among the predictors.
ggcorr(titanic, label = T)
## Warning in ggcorr(titanic, label = T): data in column(s) 'Survived', 'Pclass',
## 'LastName', 'Title', 'FirstName', 'Sex', 'Ticket', 'Cabin', 'Embarked' are not
## numeric and were ignored
There is a strong correlation between Parch, SibSp, and FamilySize. This is inevitable, since FamilySize is derived from the other two. To avoid redundant, collinear predictors, we keep only one of them; let’s keep FamilySize.
Some variables are unique identifiers and do not provide any useful information for our model, so let’s exclude them.
titanic <- titanic %>% select(-c(PassengerId, LastName, FirstName, Ticket, Cabin, Parch, SibSp))
How was the survival rate overall?
prop.table(table(titanic$Survived))
##
## No Yes
## 0.6226127 0.3773873
More than half of the passengers (about 62%) did not survive the disaster, so a useful model needs to beat a 62% majority-class baseline. Let’s see if passengers with particular characteristics had a better chance of surviving than others.
ggplot(titanic, aes(as.factor(FamilySize), fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival by Family Size Class",
x = "Family Size",
y = "Survived") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Passengers who were aboard alone, as well as those with very large families, were less likely to survive.
ggplot(titanic, aes(Sex, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival by Sex",
y = "Survived")+
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Female passengers had a higher chance of surviving than their male counterparts.
ggplot(titanic, aes(Pclass, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival by Ticket Class",
y = "Survived") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Passengers in the 1st class were more likely to survive than passengers in other classes.
ggplot(train, aes(x = Fare, y = as.numeric(Survived)))+
geom_point(alpha = 0.1)+
geom_smooth(color="red", fill="#69b3a2", se=TRUE)+
scale_y_continuous(breaks = seq(1, 2, 0.5), labels = c(0, 0.5, 1)) +
labs(title = "Survival by Fare",
y = "Survived")+
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Passengers who paid higher fares had a higher chance of surviving.
From the data wrangling done beforehand, it seems that passengers in the 1st class paid much higher fares than others. To make sure, let’s visualize it.
ggplot(titanic, aes(Pclass, Fare)) +
geom_boxplot() + labs(title = "Fare Paid by Passengers from Each Ticket Class") +
theme_classic()
Passengers in the 1st class indeed paid higher fares than passengers from the other classes. Since fare and ticket class are strongly correlated and capture much the same information, I will exclude one of these two variables.
titanic <- titanic %>% select(-Fare)
ggplot(titanic, aes(Embarked, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival by Port of Embarkation",
y = "Survived") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Passengers who embarked from Cherbourg were more likely to survive than passengers who embarked from the other ports.
ggplot(titanic, aes(Title, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival by Title",
y = "Survived") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Passengers titled Miss and Mrs had a higher chance of surviving. It makes sense since women were more likely to survive this disaster.
ggplot(train, aes(x = Age, y = as.numeric(Survived)))+
geom_point(alpha = 0.1)+
geom_smooth(color="red", fill="#69b3a2", se=TRUE)+
scale_y_continuous(breaks = seq(0, 1, 0.5), labels = c (0, 0.5, 1))+
labs(title = "Survival by Age",
y = "Survived") +
theme_minimal()
## Warning: Removed 177 rows containing non-finite values (stat_smooth).
## Warning: Removed 177 rows containing missing values (geom_point).
The younger the passengers, the more likely they survived.
For this project, I will use three machine learning algorithms to predict the survival of the passengers, namely Naive Bayes, Decision Tree, and Random Forest, and compare their performance.
Split the data into train and test sets. Although the dataset originally came split into two files, we combined them earlier, so here I split it again with a random 80/20 sample.
set.seed(610)
index <- sample(nrow(titanic), nrow(titanic) * 0.8)
train <- titanic[index, ]
test <- titanic[-index, ]
Before continuing the analysis, let’s see the proportion of both classes.
prop.table(table(train$Survived))
##
## No Yes
## 0.6150907 0.3849093
Build the Naive Bayes model and use it to predict survival on the test data.
model_nb <- naiveBayes(Survived ~ ., data = train, laplace = 1)
naive_pred <- predict(model_nb, newdata = test, type = "class")
naive_prob <- predict(model_nb, newdata = test, type = "raw")

naive_table <- select(test, Survived) %>%
  bind_cols(surv_pred = naive_pred) %>%
  bind_cols(surv_eprob = round(naive_prob[, 1], 4)) %>%
  bind_cols(surv_pprob = round(naive_prob[, 2], 4))
Let’s see how well the model predicts new data.
naive_table %>%
  conf_mat(Survived, surv_pred) %>%
  autoplot(type = "heatmap")

eval_naive <- naive_table %>%
  summarise(
    Accuracy = accuracy_vec(Survived, surv_pred),
    Sensitivity = sens_vec(Survived, surv_pred),
    Specificity = spec_vec(Survived, surv_pred),
    Precision = precision_vec(Survived, surv_pred)
  )

eval_naive
## Accuracy Sensitivity Specificity Precision
## 1 0.8549618 0.8947368 0.7802198 0.8843931
This model correctly predicts about 85.5% of cases overall.
Another way to evaluate model performance is the ROC curve and the area under it (AUC), a performance measurement for classification problems. It tells us how capable the model is of distinguishing between classes: the higher the AUC, the better the model is at ranking the positive class above the negative one. It ranges from 0 to 1.
library(ROCR)
pred <- prediction(predictions = naive_prob[, 2], labels = test$Survived)
perf <- performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
plot(perf)

auc <- performance(pred, "auc")
auc@y.values[[1]]
## [1] 0.8960221
This model also has an AUC of 0.896, which means it performs well at distinguishing the positive class from the negative one.
eval_naive <- eval_naive %>%
  cbind(AUC = auc@y.values[[1]])
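As a cross-check, the same AUC can be computed with yardstick, which is already loaded. A small sketch, assuming the naive_table built above and a yardstick version that supports the event_level argument:
# surv_pprob holds P(Survived = "Yes"); "Yes" is the second factor level
roc_auc_vec(naive_table$Survived, naive_table$surv_pprob, event_level = "second")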
Build the decision tree model.
model_dt <- ctree(Survived ~ ., data = train)
Predict the test data using the model.
<- predict(model_dt, newdata = test, type = "response")
dt_pred <- predict(model_dt, newdata = test, type = "prob")
dt_prob
<- select(test, Survived) %>%
dt_table bind_cols(surv_pred = dt_pred) %>%
bind_cols(surv_eprob = round(dt_prob[,1],4)) %>%
bind_cols(surv_pprob = round(dt_prob[,2],4))
Check the model’s performance on new data.
eval_dt <- dt_table %>%
  summarise(
    Accuracy = accuracy_vec(Survived, surv_pred),
    Sensitivity = sens_vec(Survived, surv_pred),
    Specificity = spec_vec(Survived, surv_pred),
    Precision = precision_vec(Survived, surv_pred)
  )

eval_dt
## Accuracy Sensitivity Specificity Precision
## 1 0.8625954 0.8947368 0.8021978 0.8947368
This decision tree model gives us 86.3% accuracy in predicting new data, slightly higher than the Naive Bayes model.
plot(model_dt, type = "simple")
ROC
pred.dt <- prediction(predictions = dt_prob[, 2], labels = test$Survived)
perf.dt <- performance(prediction.obj = pred.dt, measure = "tpr", x.measure = "fpr")
plot(perf.dt)

auc2 <- performance(pred.dt, "auc")
auc2@y.values[[1]]
## [1] 0.8880535
Based on the AUC value, which is slightly lower than the Naive Bayes model’s, this model also performs well.
eval_dt <- eval_dt %>%
  cbind(AUC = auc2@y.values[[1]])
Before building the random forest, we need to check for variables with little variation (near-zero variance), since they provide little information to the model.
n0_var <- nearZeroVar(titanic)
n0_var
## integer(0)
Turns out our data doesn’t have any near-zero-variance variables. Now we’re good to go.
set.seed(417)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_rf <- train(Survived ~ ., data = train, method = "rf", trControl = ctrl)
# print output
model_rf
## Random Forest
##
## 1047 samples
## 6 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 839, 838, 837, 837, 837, 838, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8631264 0.7070843
## 6 0.8590055 0.6978656
## 11 0.8453044 0.6710996
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# final model used
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.94%
## Confusion matrix:
## No Yes class.error
## No 584 60 0.0931677
## Yes 86 317 0.2133995
# visualize model
plot(model_rf)
# the most important predictors
varImp(model_rf) %>% plot()
In a random forest model, we can inspect which variables matter most. The three most important variables in this model are Title, Sex, and ticket class (Pclass).
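For reference, the same importances can be read as a table instead of a plot. A sketch using caret’s varImp output (with the formula interface, factor levels appear as dummy variables such as TitleMr):
varImp(model_rf)$importance %>%
  tibble::rownames_to_column("Variable") %>%
  arrange(desc(Overall)) %>%
  head(3)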
Use the fitted model to predict survival on the test data.
<- predict(object = model_rf, newdata = test, type = "raw")
pred_rf <- predict(object = model_rf, newdata = test, type = "prob")
prob_rf
<- select(test, Survived) %>%
rf_table bind_cols(surv_pred = pred_rf) %>%
bind_cols(surv_eprob = round(prob_rf[,1],4)) %>%
bind_cols(surv_pprob = round(prob_rf[,2],4))
Let’s see how well the model predicts new data.
rf_table %>%
  conf_mat(Survived, surv_pred) %>%
  autoplot(type = "heatmap")

eval_rf <- rf_table %>%
  summarise(
    Accuracy = accuracy_vec(Survived, surv_pred),
    Sensitivity = sens_vec(Survived, surv_pred),
    Specificity = spec_vec(Survived, surv_pred),
    Precision = precision_vec(Survived, surv_pred)
  )

eval_rf
## Accuracy Sensitivity Specificity Precision
## 1 0.8587786 0.8947368 0.7912088 0.8895349
The random forest model also performs fairly well; its accuracy falls between those of the two previous models. Let’s look at its ROC curve and AUC as well.
pred.rf <- prediction(predictions = prob_rf[, 2], labels = test$Survived)
perf.rf <- performance(prediction.obj = pred.rf, measure = "tpr", x.measure = "fpr")
plot(perf.rf)

auc3 <- performance(pred.rf, "auc")
auc3@y.values[[1]]
## [1] 0.871088
Turns out Random Forest gives the lowest AUC of the three models, although it is still reasonably high.
eval_rf <- eval_rf %>%
  cbind(AUC = auc3@y.values[[1]])
rbind("Naive Bayes" = eval_naive, "Decision Tree" = eval_dt, "Random Forest" = eval_rf) %>% data.frame()
## Accuracy Sensitivity Specificity Precision AUC
## Naive Bayes 0.8549618 0.8947368 0.7802198 0.8843931 0.8960221
## Decision Tree 0.8625954 0.8947368 0.8021978 0.8947368 0.8880535
## Random Forest 0.8587786 0.8947368 0.7912088 0.8895349 0.8710880
Overall, the three models perform comparably: the Decision Tree scores highest on accuracy, specificity, and precision, while Naive Bayes gives the highest AUC, meaning it is the best at distinguishing the positive class from the negative one. In conclusion, Naive Bayes is a solid choice for predicting the survival of passengers aboard the Titanic based on ticket class, title, sex, age, port of embarkation, and family size.
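To close, a usage sketch: scoring a hypothetical new passenger with the chosen Naive Bayes model. All field values below are made up for illustration, and the factor levels assume the final titanic data frame.
# hypothetical passenger; every value here is illustrative only
new_passenger <- data.frame(
  Pclass = factor("1st", levels = levels(titanic$Pclass)),
  Title = factor("Mrs", levels = levels(titanic$Title)),
  Sex = factor("Female", levels = levels(titanic$Sex)),
  Age = 30,
  Embarked = factor("C", levels = levels(titanic$Embarked)),
  FamilySize = 2
)
predict(model_nb, newdata = new_passenger, type = "class")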