Titanic was a ship that sank in 1912 with roughly 2,200 souls on board. I will describe a classification algorithm that predicts who survived and who died that night. Some data visualization is done along the way.
setwd("C:/Users/Admin/Documents/Dell E6320/R Datasets")
library(tidyverse)
library(caret)
library(randomForest)
dat <- read_csv("titanic_data.csv")
head(dat)
# A tibble: 6 x 7
  PassengerId Survived Pclass Sex      Age  Fare Embarked
        <dbl>    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>
1           1        0      3 male      22  7.25 S
2           2        1      1 female    38 71.3  C
3           3        1      3 female    26  7.92 S
4           4        1      1 female    35 53.1  S
5           5        0      3 male      35  8.05 S
6           6        0      3 male      NA  8.46 Q
We can get rid of PassengerId, since it is just a row identifier with no predictive value. We also need to convert Survived, Pclass and Sex to factors.
dat <- dat %>%
  select(-PassengerId) %>%
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass),
         Sex = factor(Sex))
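Before going further, a quick look at the structure (a minimal sketch on the same dat object) confirms the conversions and shows where values are missing:
summary(dat)  # factor columns now show level counts; Age reports a block of NA's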
Age has missing values. Let’s check its distribution.
dat %>%
ggplot(aes(x = Age)) +
geom_density() +
theme_classic()
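As a numeric cross-check of the plot (a small sketch on the pre-imputation data), a mean sitting above the median is the classic signature of right skew:
mean(dat$Age, na.rm = TRUE)    # dragged upward by the long right tail
median(dat$Age, na.rm = TRUE)  # robust to the tail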
Age is skewed to the right, so the mean would be dragged upward by the tail. We should therefore impute the missing values using the median.
dat[is.na(dat$Age), "Age"] <- median(dat$Age, na.rm = TRUE)  # replace NAs with the median age
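A one-line sanity check confirms the imputation left no missing ages behind:
sum(is.na(dat$Age))  # should print 0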
We will now create the train and test sets: the train set gets 70% of the observations and the test set the remaining 30%. I will use the createDataPartition() function from the caret package, which preserves the proportions of the Survived variable in both sets. I will train the model on the train set and evaluate it on the test set.
indexes <- createDataPartition(dat$Survived, p = 0.7)
train <- dat[indexes$Resample1, ]
test <- dat[-indexes$Resample1, ]
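Two optional refinements, sketched here rather than part of the original run: seeding the RNG before partitioning would make the split reproducible, and tabulating Survived in each set verifies the stratification:
# set.seed(42)  # hypothetical seed; call before createDataPartition() to reproduce a split
prop.table(table(train$Survived))  # class proportions in the train set
prop.table(table(test$Survived))   # should match the train set closely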
I will use all remaining variables except Embarked as predictors.
train_model <- randomForest(Survived ~ Pclass + Sex + Age + Fare, data = train)
pred_model <- predict(train_model, test)
# cbind() coerces factors to their underlying integer codes, so the original
# 0/1 labels of Survived appear as 1/2 in the confusion matrix below
model_df <- cbind(Actuals = test$Survived, Predicted = pred_model) %>%
  as.data.frame() %>%
  mutate(Actuals = factor(Actuals),
         Predicted = factor(Predicted))
# note: confusionMatrix(data, reference) expects predictions first; with the
# actuals passed first the matrix margins are transposed, though accuracy is unaffected
mod_evaluation <- confusionMatrix(model_df$Actuals, model_df$Predicted)
mod_evaluation
Confusion Matrix and Statistics

          Reference
Prediction   1   2
         1 152  12
         2  34  68

               Accuracy : 0.8271
                 95% CI : (0.7762, 0.8705)
    No Information Rate : 0.6992
    P-Value [Acc > NIR] : 1.269e-06

                  Kappa : 0.6187

 Mcnemar's Test P-Value : 0.00196

            Sensitivity : 0.8172
            Specificity : 0.8500
         Pos Pred Value : 0.9268
         Neg Pred Value : 0.6667
             Prevalence : 0.6992
         Detection Rate : 0.5714
   Detection Prevalence : 0.6165
      Balanced Accuracy : 0.8336

       'Positive' Class : 1
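As a quick arithmetic check, the reported accuracy can be recomputed directly from the matrix, since the diagonal holds the correct calls:
(152 + 68) / (152 + 12 + 34 + 68)  # 220 / 266 ≈ 0.8271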
Our model is 82.71% accurate on the test set! The performance can be improved through feature engineering or by using other advanced classification algorithms such as extreme gradient boosting.
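As a starting point for that feature engineering, the fitted forest can already tell us which predictors drove the splits, and caret makes trying a boosted model a small change. Both snippets below are sketches rather than part of the original analysis (xgb_model is a hypothetical name, and method = "xgbTree" assumes the xgboost package is installed):
importance(train_model)   # mean decrease in Gini impurity per predictor
varImpPlot(train_model)   # the same information as a dot chart

xgb_model <- train(Survived ~ Pclass + Sex + Age + Fare, data = train,
                   method = "xgbTree")
confusionMatrix(predict(xgb_model, test), test$Survived)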