RMS Titanic, better known simply as the Titanic, was a British ocean liner that sank after striking an iceberg on 15 April 1912, during her maiden voyage. Of the estimated 2,224 passengers and crew on board, 1,496 perished, making it one of the deadliest sinkings of its time. The disaster has since become infamous, inspiring films and other media.
One of the biggest causes of the high death toll was the lack of lifeboats. There weren’t nearly enough for everyone on board, forcing the crew to prioritize some groups of passengers over others.
Last time, we took a look at the passenger data and did some Exploratory Data Analysis to figure out how the variables relate to each other. This time, we’ll try creating models that predict whether or not a passenger survived based on their characteristics.
You can find the data set that we will be using for this project here. It consists of three files: train, test, and gender_submission. In this case, we’ll only be using train, as the test data does not include the correct answers (the Survived values), so we’d have no way to check the model’s accuracy with it. As usual, let’s read the train data set using read.csv and use dplyr’s glimpse to check the data types.
library(dplyr)
train <- read.csv("data/train.csv")
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The data consists of 12 columns: PassengerId, Survived (0 = did not survive, 1 = survived), Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton).
First off, let’s drop the columns that have too many distinct categorical values: PassengerId, Name, Cabin, and Ticket.
train <- train %>%
select(-c(PassengerId, Name, Cabin, Ticket))
glimpse(train)
## Rows: 891
## Columns: 8
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "…
Let’s check for NA values. Since several of our columns use strings/characters, let’s also make sure that blank values are treated as NA. For that, we’ll use mutate with across on all character columns and turn blank values ("") into NA.
train %>% mutate(across(where(is.character), ~ na_if(.,""))) %>% is.na() %>% colSums()
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 2
Seems like we have some NA values, mostly in Age. Let’s drop those rows using na.omit. (Note that the blank-to-NA conversion above was only used for the check and not saved back into train, so the two blank Embarked values remain in the data.)
train <- train %>% na.omit()
train %>% is.na() %>% colSums()
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
Some columns, such as Survived, Pclass, Sex, and Embarked, contain a small set of repeating values. Let’s convert them into a categorical format (factors) using mutate.
train <- train %>%
mutate(Survived=as.factor(Survived),
Pclass=as.factor(Pclass),
Sex=as.factor(Sex),
Embarked=as.factor(Embarked))
glimpse(train)
## Rows: 714
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Pclass <fct> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex <fct> male, female, female, female, male, male, male, female, femal…
## $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q, S…
Although we already have a file labeled test, we still need to split our own data so that we can evaluate the model. Let’s split the data with a proportion of 0.8 using initial_split from the rsample package.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(777)
splitter <- initial_split(data= train, prop=0.8)
# splitting
train_data <- training(splitter)
test_data <- testing(splitter)
Now, let’s check if the data distribution in our training data is balanced.
table(train_data$Survived) %>%
prop.table()
##
## 0 1
## 0.5831874 0.4168126
Seems quite balanced to me. Looks like we’re ready to start making models.
The first model we’ll try to make is Logistic Regression. We’ll
simply use glm and all of our predictors as a starting
point.
logres_all <- glm(formula = Survived ~., family = "binomial",
data = train_data)
summary(logres_all)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.963905 535.411352 0.028 0.97770
## Pclass2 -0.851380 0.364589 -2.335 0.01953 *
## Pclass3 -1.932228 0.376431 -5.133 2.85e-07 ***
## Sexmale -2.504881 0.239834 -10.444 < 2e-16 ***
## Age -0.043218 0.009238 -4.678 2.89e-06 ***
## SibSp -0.432944 0.141890 -3.051 0.00228 **
## Parch 0.043208 0.136527 0.316 0.75164
## Fare 0.003056 0.003025 1.010 0.31251
## EmbarkedC -10.973225 535.411284 -0.020 0.98365
## EmbarkedQ -12.083389 535.411561 -0.023 0.98199
## EmbarkedS -11.388800 535.411265 -0.021 0.98303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 775.69 on 570 degrees of freedom
## Residual deviance: 520.98 on 560 degrees of freedom
## AIC: 542.98
##
## Number of Fisher Scoring iterations: 12
While this model seems fine, there are still predictors that don’t seem to contribute much to our target variable (Parch and Fare have large p-values, and the Embarked dummies have huge standard errors). In this case, let’s try stepwise selection. Using the previous model with all the predictors, we can pass it into step with the backward direction. This will drop predictors one by one until it reaches the lowest AIC value it can find.
logres_step <- step(object = logres_all,
direction = "backward",
trace=F)
summary(logres_step)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare, family = "binomial",
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.717439 0.550948 6.747 1.51e-11 ***
## Pclass2 -0.942016 0.354094 -2.660 0.00781 **
## Pclass3 -2.033789 0.367815 -5.529 3.21e-08 ***
## Sexmale -2.502982 0.232155 -10.782 < 2e-16 ***
## Age -0.044320 0.009160 -4.838 1.31e-06 ***
## SibSp -0.425369 0.133364 -3.190 0.00143 **
## Fare 0.003894 0.002972 1.310 0.19023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 775.69 on 570 degrees of freedom
## Residual deviance: 524.72 on 564 degrees of freedom
## AIC: 538.72
##
## Number of Fisher Scoring iterations: 5
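As a quick check on the AIC reasoning above (a small aside that wasn’t part of the original write-up), we can also compare the two models directly; the values should match the 542.98 and 538.72 reported in the summaries.
AIC(logres_all, logres_step)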
Let’s use this model to predict data from our test group.
pred_logres <- predict(object = logres_step,
newdata = test_data,
type = "response")
head(pred_logres)
## 1 2 3 4 5 6
## 0.27351103 0.86373450 0.05460894 0.38926516 0.62944359 0.15191265
Now, let’s store those predictions in a new column of a separate results table, classifying a passenger as survived (1) when the predicted probability is above 0.5.
test_result <- test_data
test_result$logres_prediction <- ifelse(pred_logres > 0.5, yes = 1, no = 0)
rmarkdown::paged_table(test_result)
While we could manually check which predictions are accurate, a more efficient way of evaluating this model is a confusion matrix, available through the confusionMatrix function in the caret package. Make sure the prediction column has the same format as the Survived column, i.e. a factor.
library(caret)
logresMatrix <- confusionMatrix(data = as.factor(test_result$logres_prediction),
reference = test_result$Survived,
positive = "1")
logresMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 19
## 1 7 33
##
## Accuracy : 0.8182
## 95% CI : (0.7451, 0.8777)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 1.573e-06
##
## Kappa : 0.5867
##
## Mcnemar's Test P-Value : 0.03098
##
## Sensitivity : 0.6346
## Specificity : 0.9231
## Pos Pred Value : 0.8250
## Neg Pred Value : 0.8155
## Prevalence : 0.3636
## Detection Rate : 0.2308
## Detection Prevalence : 0.2797
## Balanced Accuracy : 0.7788
##
## 'Positive' Class : 1
##
We can see that our logistic regression model achieves an accuracy of 0.8182, a sensitivity of 0.6346, a specificity of 0.9231, and a precision (positive predictive value) of 0.8250.
Let’s input these values into a data frame as we go, in order to better evaluate the different models.
evaluation <- tibble(Model = "Logistic Regression",
                     Accuracy = logresMatrix$overall[1],
                     Sensitivity = logresMatrix$byClass[1],
                     Specificity = logresMatrix$byClass[2],
                     Precision = logresMatrix$byClass[3])
rmarkdown::paged_table(evaluation)
K-Nearest Neighbour classifies an observation based on the properties of its nearest neighbours. Before making a K-Nearest Neighbour model, we first have to remove the categorical predictors from our data frame, since KNN works on numeric distances. Let’s use select to achieve this.
train_knn <- train_data %>%
select(-c(Pclass, Sex, Embarked))
test_knn <- test_data %>%
select(-c(Pclass, Sex, Embarked))
We then have to put the target and the predictors into separate variables, as below.
#predictors
train_x <- train_knn %>% select(-Survived)
test_x <- test_knn %>% select(-Survived)
# target
train_y <- train_knn$Survived
test_y <- test_knn$Survived
Then, we have to scale the values in our data frame. We do this for both the training and test data, reusing the training data’s center and scale for the test set so that both are scaled consistently.
# train
train_xs <- scale(train_x)
# test
test_xs <- scale(test_x,
center = attr(train_xs, "scaled:center"),
scale = attr(train_xs, "scaled:scale"))
A common rule of thumb for choosing K is to take the square root of the number of rows in the training data.
sqrt(nrow(train_data))
## [1] 23.89561
Let’s round that up to 24 and start building the KNN model. Unlike logistic regression, knn immediately returns the predicted classes, as shown below.
library(class)
knn <- knn(train=train_xs,
test=test_xs,
cl=train_y,
k=24)
head(knn)
## [1] 0 1 1 0 0 0
## Levels: 0 1
Now we can input it into our result data frame by inserting a new column.
test_result$knn_prediction <- knn
rmarkdown::paged_table(test_result)
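The square-root rule for K is only a heuristic. As a small sketch that wasn’t part of the original analysis, we could also compare a few other (arbitrarily chosen) K values on the test split:
# try several K values and compute the accuracy of each on the test split
ks <- c(5, 15, 24, 35)
sapply(ks, function(k) {
  pred <- class::knn(train = train_xs, test = test_xs, cl = train_y, k = k)
  mean(pred == test_y)
})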
Finally, in order to evaluate the model, we’ll use a confusion matrix.
KNNMatrix <- confusionMatrix(data=test_result$knn_prediction,
reference= test_result$Survived,
positive= "1")
KNNMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 76 27
## 1 15 25
##
## Accuracy : 0.7063
## 95% CI : (0.6244, 0.7794)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 0.04770
##
## Kappa : 0.3324
##
## Mcnemar's Test P-Value : 0.08963
##
## Sensitivity : 0.4808
## Specificity : 0.8352
## Pos Pred Value : 0.6250
## Neg Pred Value : 0.7379
## Prevalence : 0.3636
## Detection Rate : 0.1748
## Detection Prevalence : 0.2797
## Balanced Accuracy : 0.6580
##
## 'Positive' Class : 1
##
As for our K-Nearest Neighbour model, it achieves an accuracy of 0.7063, a sensitivity of 0.4808, a specificity of 0.8352, and a precision of 0.6250.
Let’s add these to our previous data frame using
add_row.
evaluation <- evaluation %>% add_row(Model= "K-Nearest Neighbour",
Accuracy = KNNMatrix$overall[1],
Sensitivity = KNNMatrix$byClass[1],
Specificity = KNNMatrix$byClass[2],
Precision = KNNMatrix$byClass[3])
rmarkdown::paged_table(evaluation)
The next model we’ll try out is Naive Bayes. Naive Bayes is based on Bayes’ theorem, where the probability of an event is updated as new information becomes available. We’ll use the naiveBayes function from the e1071 library, once more with a formula that includes all the predictors.
library(e1071)
nb <- naiveBayes(
formula = Survived ~ .,
data = test_data
)
nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.6363636 0.3636364
##
## Conditional probabilities:
## Pclass
## Y 1 2 3
## 0 0.1208791 0.2087912 0.6703297
## 1 0.5192308 0.2500000 0.2307692
##
## Sex
## Y female male
## 0 0.08791209 0.91208791
## 1 0.59615385 0.40384615
##
## Age
## Y [,1] [,2]
## 0 30.53297 13.30033
## 1 31.09615 17.26524
##
## SibSp
## Y [,1] [,2]
## 0 0.4725275 0.8735707
## 1 0.4807692 0.7273477
##
## Parch
## Y [,1] [,2]
## 0 0.4395604 1.0350394
## 1 0.3653846 0.6576502
##
## Fare
## Y [,1] [,2]
## 0 24.45302 39.95824
## 1 51.46803 58.49952
##
## Embarked
## Y C Q S
## 0 0.00000000 0.13186813 0.02197802 0.84615385
## 1 0.01923077 0.28846154 0.03846154 0.65384615
Here we can see the conditional probabilities for each predictor: the row labeled 1 shows how that characteristic is distributed among surviving passengers, and the row labeled 0 among those who did not survive (for the numeric predictors, the two columns are the mean and standard deviation instead). We can take away some interesting things from this, such as the high probability that a non-surviving passenger was male (91.2%).
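To make that intuition concrete, here is a minimal hand calculation (not part of the original pipeline) that applies Bayes’ rule to the Sex table printed above, ignoring all the other predictors:
# P(Survived = 1 | female) is proportional to P(female | Survived = 1) * P(Survived = 1)
p1 <- 0.59615385 * 0.3636364  # numerator for class 1
p0 <- 0.08791209 * 0.6363636  # numerator for class 0
p1 / (p1 + p0)                # roughly 0.79: knowing a passenger is female raises the survival estimate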
Now, let’s try predicting the test data using this model with predict. Setting the type to class makes it output the predicted class.
nb_prediction <- predict(
object = nb,
newdata = test_data,
type = "class"
)
test_result$nb_prediction <- nb_prediction
rmarkdown::paged_table(test_result)
Let’s use a confusion matrix to evaluate this model.
NBMatrix <- confusionMatrix(data = nb_prediction,
reference = test_data$Survived,
positive = "1",
mode = "everything")
NBMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 14
## 1 11 38
##
## Accuracy : 0.8252
## 95% CI : (0.7528, 0.8836)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 5.904e-07
##
## Kappa : 0.6175
##
## Mcnemar's Test P-Value : 0.6892
##
## Sensitivity : 0.7308
## Specificity : 0.8791
## Pos Pred Value : 0.7755
## Neg Pred Value : 0.8511
## Precision : 0.7755
## Recall : 0.7308
## F1 : 0.7525
## Prevalence : 0.3636
## Detection Rate : 0.2657
## Detection Prevalence : 0.3427
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 1
##
Our Naive Bayes model achieves an accuracy of 0.8252, a sensitivity of 0.7308, a specificity of 0.8791, and a precision of 0.7755.
evaluation <- evaluation %>% add_row(Model= "Naive Bayes",
Accuracy = NBMatrix$overall[1],
Sensitivity = NBMatrix$byClass[1],
Specificity = NBMatrix$byClass[2],
Precision = NBMatrix$byClass[3])
rmarkdown::paged_table(evaluation)
Decision Trees are a bit different from the other models. Although we can use them to predict data just like the previous models we’ve discussed, they can also output a set of rules that is immediately interpretable. To see this, we’ll create a model with all the predictors using ctree from the partykit package and plot it.
library(partykit)
tree <- ctree(formula = Survived ~ .,
data = train_data)
plot(tree, type = "simple")
We can see that the resulting tree has set Sex as the root node, making it the most important splitting variable. Then, depending on the sex, it branches out into interior nodes on Pclass, Fare, and Age.
A good way to see how to interpret this plot is to test it on an existing passenger profile. Let’s pick a row at random… how about 17? We’ll leave out the Survived column for now so that we can check the answer afterwards.
train[17,] %>% select(-Survived)
## Pclass Sex Age SibSp Parch Fare Embarked
## 19 3 female 31 1 0 18 S
Following the passenger profile above: since their sex is female, we move on to check their Pclass. Their Pclass is 3, which means they would be predicted as 0 according to the corresponding leaf node. Now, let’s check our answer.
train[17,]$Survived
## [1] 0
## Levels: 0 1
Seems like we were correct.
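If we’d rather not trace the plot by hand, here is a small aside (not in the original write-up): partykit’s predict can also report which terminal node an observation falls into.
# ask the fitted tree which terminal node passenger 17 lands in
predict(tree, newdata = train[17, ], type = "node")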
Now, let’s predict with this model like we did for the previous ones. Set the type to response so that it outputs the predicted class.
tree_prediction <- predict(
object = tree,
newdata = test_data,
type = "response"
)
test_result$tree_prediction <- tree_prediction
rmarkdown::paged_table(test_result)
As always, let’s check the performance of our model with a confusion matrix.
treeMatrix <- confusionMatrix(
data = tree_prediction,
reference = test_data$Survived
)
treeMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 90 29
## 1 1 23
##
## Accuracy : 0.7902
## 95% CI : (0.7143, 0.8538)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 5.117e-05
##
## Kappa : 0.4876
##
## Mcnemar's Test P-Value : 8.244e-07
##
## Sensitivity : 0.9890
## Specificity : 0.4423
## Pos Pred Value : 0.7563
## Neg Pred Value : 0.9583
## Prevalence : 0.6364
## Detection Rate : 0.6294
## Detection Prevalence : 0.8322
## Balanced Accuracy : 0.7157
##
## 'Positive' Class : 0
##
These values are not too bad, especially when it comes to Sensitivity (note that, because we didn’t set positive = "1" this time, Sensitivity and Specificity here are reported with respect to class 0). But we can improve this model by adjusting some of its controls, such as mincriterion, minsplit and minbucket. Typically we would call this pruning, but seeing as our tree is already quite simple, how about we loosen the parameters from their defaults a bit instead? It’s possible that our tree is underfitted.
The default mincriterion is 0.95; let’s lower it to 0.75, and lower minsplit and minbucket as well.
# tuning model decision tree
tree_tuned <- ctree(formula = Survived ~ .,
data = train_data,
control = ctree_control(mincriterion = 0.75,
minsplit = 10,
minbucket = 5))
plot(tree_tuned, type="simple")
Let’s predict using this new model.
tree_tuned_prediction <- predict(
object = tree_tuned,
newdata = test_data,
type = "response"
)
test_result$tree_tuned_prediction <- tree_tuned_prediction
tunedTreeMatrix <- confusionMatrix(
data = tree_tuned_prediction,
reference = test_data$Survived
)
tunedTreeMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 20
## 1 3 32
##
## Accuracy : 0.8392
## 95% CI : (0.7685, 0.8952)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 7.221e-08
##
## Kappa : 0.6263
##
## Mcnemar's Test P-Value : 0.0008492
##
## Sensitivity : 0.9670
## Specificity : 0.6154
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.9143
## Prevalence : 0.6364
## Detection Rate : 0.6154
## Detection Prevalence : 0.7552
## Balanced Accuracy : 0.7912
##
## 'Positive' Class : 0
##
Seems like we managed to get a better accuracy score. Let’s store both models’ metrics in the evaluation data frame to make them easier to compare.
evaluation <- evaluation %>% add_row(Model= "Decision Tree",
Accuracy = treeMatrix$overall[1],
Sensitivity = treeMatrix$byClass[1],
Specificity = treeMatrix$byClass[2],
Precision = treeMatrix$byClass[3]) %>%
add_row(Model= "Tuned Decision Tree",
Accuracy = tunedTreeMatrix$overall[1],
Sensitivity = tunedTreeMatrix$byClass[1],
Specificity = tunedTreeMatrix$byClass[2],
Precision = tunedTreeMatrix$byClass[3])
rmarkdown::paged_table(evaluation)
While our Tuned Decision Tree has the higher Accuracy, the original Decision Tree still does better on Sensitivity.
The last method we’ll be trying out is Random Forest. A random forest builds many decision trees on random subsets of the observations and predictors and aggregates their votes; wrapped in caret’s train, it is also tuned using repeated cross-validation. To build this model, we first create the control parameters and pass them into train.
# train_ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
#
# forest <- train(Survived ~ .,
# data = train_data,
# method = "rf",
# trControl = train_ctrl)
# saveRDS(forest, "forest.RDS")
Since training a Random Forest model takes quite a while, I’ve saved the resulting model in RDS format. Let’s load it using readRDS.
forest <- readRDS("forest.RDS")
Same as before, we can predict using this model. For a Random Forest trained with caret, predict only accepts type raw or prob, so let’s use raw to see the predicted class.
forest_prediction <- predict(
object = forest,
newdata = test_data,
type = "raw"
)
test_result$forest_prediction <- forest_prediction
rmarkdown::paged_table(test_result)
We can’t interpret a Random Forest the same way we did the Decision Tree, but we can see the importance of each variable/predictor used in the model by plotting varImp.
varImp(forest) %>% plot
Similar to the conclusion that we’ve reached when looking at our
Decision Tree, Sex is the most important variable in our model, followed
by Fare, Age, then Class.
We can access the model’s OOB (out-of-bag) error through its finalModel. The OOB error estimates how the model performs on unseen data: each tree is evaluated on the observations that were left out of the bootstrap sample used to build it.
forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.32%
## Confusion matrix:
## 0 1 class.error
## 0 308 25 0.07507508
## 1 91 147 0.38235294
Our Random Forest model has an OOB error rate of 20.32%, which corresponds to an accuracy of about 79.68%.
evaluation <- evaluation %>% add_row(Model= "Random Forest",
                                     Accuracy = 0.7968)
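As a side note (a sketch only, output not shown here), if we wanted test-set metrics directly comparable with the other models, we could also run the forest predictions through confusionMatrix:
forestMatrix <- confusionMatrix(data = forest_prediction,
                                reference = test_data$Survived,
                                positive = "1")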
To conclude the performance of all our models, let’s call the evaluation data frame. In the imaginary case where we would use these models to predict whether a potential passenger survives, I think it would be better to be safe than sorry. That means we’d prioritize the negative (0) case and focus on reducing the number of False Positives (passengers predicted to survive who ended up not surviving). For that, Precision is the metric to look at.
rmarkdown::paged_table(evaluation)
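As a quick sanity check on that reasoning (a hand calculation, not part of the original pipeline), Precision is TP / (TP + FP); plugging in the logistic regression confusion matrix from earlier:
# 33 true positives and 7 false positives for the logistic regression model
33 / (33 + 7)  # 0.825, matching the Precision stored for Logistic Regression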
As we can see from the table, our Logistic Regression, Tuned Decision Tree, and Random Forest models seem the most suited for the job.
Throughout the course of making these models, we’ve learned that
there are many ways to improve a model, such as only using important
predictors, or watching out for instances of overfitting or
underfitting. Aside from that, we’ve also discovered that there are many
tools that may help us in choosing important predictors, such as looking
at a model’s summary, using stepwise models, and plotting out varImp.
We can also conclude that different models can be used depending on what we’re looking for. For example, KNN can only use numerical variables, which explains why its score is quite terrible in our case, since we have a lot of categorical variables. Meanwhile, if we’d like a set of interpretable rules rather than immediate predictions, we can use the Decision Tree.
We can conclude that the biggest contributing factors to a passenger’s survival are their Sex, followed by their Fare, Age, and then Class. It would seem that female passengers were prioritized during the evacuation, giving them a higher probability of survival. What’s probably most interesting to me personally is that Fare seems to affect a passenger’s likelihood of surviving more than their Class does. Through this analysis and these models, we can safely say that gender as well as socioeconomic status played a big role during the evacuation of the Titanic’s passengers.