Classification Algorithm: A Case Study of the Titanic Dataset

Background

The Titanic was a ship that sank in April 1912, more than 100 years ago, with about 2,200 people on board. I will describe a classification algorithm that predicts who survived and who died that night. Data visualization was done here.

Importing the data and libraries

setwd("C:/Users/Admin/Documents/Dell E6320/R  Datasets")
library(tidyverse)
library(caret)
library(randomForest)

dat <- read_csv("titanic_data.csv")
head(dat)

# A tibble: 6 x 7
  PassengerId Survived Pclass Sex      Age  Fare Embarked
        <dbl>    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
1           1        0      3 male      22  7.25 S       
2           2        1      1 female    38 71.3  C       
3           3        1      3 female    26  7.92 S       
4           4        1      1 female    35 53.1  S       
5           5        0      3 male      35  8.05 S       
6           6        0      3 male      NA  8.46 Q

We can drop PassengerId, which is only a row identifier and carries no predictive information. We also need to convert Survived, Pclass and Sex to factors.

dat <- dat %>% 
  select(-PassengerId) %>% 
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass),
         Sex = factor(Sex))

Handling missing values in Age

Age has missing values. Let’s check its distribution.

dat %>% 
  ggplot(aes(x = Age)) +
  geom_density() + 
  theme_classic()

Age is skewed to the right, so the mean would be pulled toward the older passengers. We therefore impute the missing values with the median.

dat[is.na(dat$Age), "Age"] <- median(dat$Age, na.rm = TRUE)
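
To see why the median is the safer fill-in value for a right-skewed variable, here is a minimal sketch on a made-up vector of ages. The values and the names ages, age_mean, age_median and ages_imputed are illustrative assumptions, not taken from the Titanic data.

```r
# Toy right-skewed ages with gaps (illustrative values only)
ages <- c(2, 18, 21, 22, 24, 25, 28, 30, 71, NA, NA)

n_missing  <- sum(is.na(ages))            # number of values to impute
age_mean   <- mean(ages, na.rm = TRUE)    # pulled upward by the single 71
age_median <- median(ages, na.rm = TRUE)  # middle observed value

# Replace only the missing entries, leaving observed ages untouched
ages_imputed <- ifelse(is.na(ages), age_median, ages)

c(missing = n_missing, mean = round(age_mean, 2), median = age_median)
```

In this toy vector the mean sits above the median because of one elderly passenger, which is the same pattern the density plot shows for Age.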

Train and Test sets

We will now create the train and test sets. The train set will hold 70% of the observations and the test set the remaining 30%. I will use the createDataPartition function from the caret package, which samples within each class so that the proportion of survivors is roughly the same in both sets. I will train the model on the train set and evaluate it on the test set.

indexes <- createDataPartition(dat$Survived, p = 0.7)
train <- dat[indexes$Resample1, ]
test <- dat[-indexes$Resample1, ]
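
The split above is random, so each run produces different sets unless a seed is fixed. The sketch below shows one way to make the split reproducible and to check that the class balance carries over. It runs on a synthetic data frame; the seed value and the names toy, idx, toy_train and toy_test are assumptions made for illustration.

```r
library(caret)

set.seed(42)  # arbitrary seed, chosen only so the split can be repeated

# Synthetic outcome with a 60/40 class balance (not the Titanic data)
toy <- data.frame(
  Survived = factor(rep(c(0, 1), times = c(120, 80))),
  x = rnorm(200)
)

# list = FALSE returns a matrix of row indices instead of a list
idx <- createDataPartition(toy$Survived, p = 0.7, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class proportions should be close to 0.6 / 0.4 in both sets
prop.table(table(toy_train$Survived))
prop.table(table(toy_test$Survived))
```

Using list = FALSE is a stylistic choice; indexing through indexes$Resample1, as in the code above, gives the same kind of split.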

Building the model

I will use all variables except Embarked.

train_model <- randomForest(Survived ~ Pclass + Sex + Age + Fare, data = train)
pred_model <- predict(train_model, test)
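
Before scoring the test set, it can help to look inside the fitted forest. The following sketch fits a small forest on synthetic data and prints its out-of-bag confusion matrix and variable importance. The data, the seed, ntree = 200 and the names toy and toy_model are assumptions for illustration; the same two inspection calls should work on train_model.

```r
library(randomForest)

set.seed(1)  # arbitrary seed so the toy forest can be reproduced

# Synthetic stand-in for the training data (illustrative values only)
n <- 300
toy <- data.frame(
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age  = round(runif(n, 1, 70)),
  Fare = round(rexp(n, rate = 1 / 30), 2)
)

# Outcome depends on Sex, so there is a known signal to look for
p_survive <- ifelse(toy$Sex == "female", 0.8, 0.2)
toy$Survived <- factor(rbinom(n, 1, p_survive), levels = c(0, 1))

# importance = TRUE also stores permutation-based importance
toy_model <- randomForest(Survived ~ Sex + Age + Fare, data = toy,
                          ntree = 200, importance = TRUE)

toy_model$confusion    # out-of-bag confusion matrix with class errors
importance(toy_model)  # one row per predictor
```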

Evaluating the model performance

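# cbind() stores the factors' integer codes, so Survived 0/1 appears as 1/2 below
model_df <- cbind(Actuals = test$Survived, Predicted = pred_model) %>% 
  as.data.frame() %>% 
  mutate(Actuals = factor(Actuals),
         Predicted = factor(Predicted))

# The actual values are passed first here, so they fill the "Prediction" rows
# of the output and the model's predictions fill the "Reference" columns
mod_evaluation <- confusionMatrix(model_df$Actuals, model_df$Predicted)
mod_evaluation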

Confusion Matrix and Statistics

          Reference
Prediction   1   2
         1 152  12
         2  34  68
                                          
               Accuracy : 0.8271          
                 95% CI : (0.7762, 0.8705)
    No Information Rate : 0.6992          
    P-Value [Acc > NIR] : 1.269e-06       
                                          
                  Kappa : 0.6187          
                                          
 Mcnemar's Test P-Value : 0.00196         
                                          
            Sensitivity : 0.8172          
            Specificity : 0.8500          
         Pos Pred Value : 0.9268          
         Neg Pred Value : 0.6667          
             Prevalence : 0.6992          
         Detection Rate : 0.5714          
   Detection Prevalence : 0.6165          
      Balanced Accuracy : 0.8336          
                                          
       'Positive' Class : 1
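
The call above passes the actual values first, while confusionMatrix expects the predictions first and the reference second, so the per-class statistics are computed with the predictions treated as the truth. As a rough cross-check, the sketch below re-reads the same four counts with predictions as rows and actual outcomes as columns. Treating survival (class "1", shown as 2 above) as the positive class is an assumption, and the object names are illustrative.

```r
# Counts copied from the output above, relabelled so that rows are the
# model's predictions and columns are the actual outcomes.
# "0" = did not survive, "1" = survived.
counts <- matrix(c(152, 12,   # actual 0: predicted 0, predicted 1
                   34, 68),   # actual 1: predicted 0, predicted 1
                 nrow = 2,
                 dimnames = list(Predicted = c("0", "1"),
                                 Actual    = c("0", "1")))

accuracy    <- sum(diag(counts)) / sum(counts)
sensitivity <- counts["1", "1"] / sum(counts[, "1"])  # survivors found
specificity <- counts["0", "0"] / sum(counts[, "0"])  # deaths found
precision   <- counts["1", "1"] / sum(counts["1", ])  # predicted survivors who survived

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
# accuracy 0.8271, sensitivity 0.6667, specificity 0.9268, precision 0.85
```

Read this way, the model identifies about two thirds of the actual survivors and about 93% of those who died. With caret, the conventional call would be confusionMatrix(data = pred_model, reference = test$Survived, positive = "1"), which also keeps the original 0/1 labels.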

Our model classifies about 82.7% of the test passengers correctly (95% CI: 77.6% to 87.1%). Performance could be improved by feature engineering or by trying other classification algorithms such as extreme gradient boosting.