Overview

This exercise is part of Algoritma’s LearnByBuilding practice in its full-stack Data Science programme.

In this exercise, we will be exploring the data of Titanic’s tragic accident, which killed 1502 of the 2224 passengers and crew aboard, to see whether certain socio-demographic traits contributed to the passengers’ survivability. The datasets were obtained from Kaggle (https://www.kaggle.com/competitions/titanic/overview) on January 26, 2023.

Without further ado, let us “dive” into how we got these results.

First Glimpse

First, we import the dataset into RStudio. Although Kaggle provides separate train and test files, the provided test file contains no Survived labels, so in this exercise we will create our own train and test sets from the provided train data.

Preliminary examination of the imported dataset in the environment pane shows that the train dataset has 891 observations of 12 variables. The variable Survived, indicating whether each passenger died (0) or survived (1), will be our target variable for the machine learning.

Before inspecting the data further, we drop unneeded columns, remove observations containing NAs, and split the data into train and test sets; we then call the head() function to view the first 5 observations:

library(dplyr)

# import the Kaggle file (file name as downloaded; adjust the path as needed)
train = read.csv("train.csv")

# drop identifier columns and the NA-heavy Cabin, then remove rows with NAs
train_setup = train %>% select(-c(PassengerId, Name, Ticket, Cabin)) %>% na.omit()

RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling: 80% of the rows go to training
index = sample(nrow(train_setup), nrow(train_setup) * 0.8)

# splitting
train_clean = train_setup[index, ]
test = train_setup[-index, ]
head(train_clean, 5)
#>     Survived Pclass    Sex Age SibSp Parch    Fare Embarked
#> 255        0      3 female  41     0     2 20.2125        S
#> 708        1      1   male  42     0     0 26.2875        S
#> 364        0      3   male  35     0     0  7.0500        S
#> 788        0      3   male   8     4     1 29.1250        Q
#> 836        1      1 female  39     1     1 83.1583        C

Since the train data has several NAs in the variables Age and Cabin, we decided to omit observations with NA in Age and to drop the variable Cabin. We also removed the variables PassengerId, Name, and Ticket, as they only function as indices and would interfere with the model if kept.

The variables Pclass, SibSp, and Parch provide, respectively, the passenger’s ticket class (where 1 denotes the highest socio-economic status and 3 the lowest), the number of siblings or spouses aboard the Titanic, and the number of parents or children aboard. The last variable, Embarked, shows the passenger’s port of embarkation: Southampton (S), Cherbourg (C), or Queenstown (Q).

The type of each variable can be inspected as follows:

str(train_clean)
#> 'data.frame':    571 obs. of  8 variables:
#>  $ Survived: int  0 1 0 0 1 0 0 0 0 1 ...
#>  $ Pclass  : int  3 1 3 3 1 2 1 2 3 2 ...
#>  $ Sex     : chr  "female" "male" "male" "male" ...
#>  $ Age     : num  41 42 35 8 39 27 56 16 30 28 ...
#>  $ SibSp   : int  0 0 0 4 1 1 0 0 0 0 ...
#>  $ Parch   : int  2 0 0 1 1 0 0 0 0 0 ...
#>  $ Fare    : num  20.21 26.29 7.05 29.12 83.16 ...
#>  $ Embarked: chr  "S" "S" "S" "Q" ...
#>  - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#>   ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...

Data cleansing

Upon closer investigation, several variables need their data types changed by explicit coercion, with details as follows:

#>      Variables      Type   Coerced
#> 1  PassengerId   integer   integer
#> 2     Survived   integer   logical
#> 3       Pclass   integer    factor
#> 4         Name character character
#> 5          Sex character    factor
#> 6          Age   numeric   numeric
#> 7        SibSp   integer   integer
#> 8        Parch   integer   integer
#> 9       Ticket character character
#> 10        Fare   numeric   numeric
#> 11       Cabin character character
#> 12    Embarked character    factor

(Note: one might consider integer for Age, but there were passengers less than 1 year old whose recorded ages are fractional, such as 0.42, making numeric the suitable data type.)

We transform the data types by running the following lines:

# coerce the target to logical and the categorical predictors to factor
train_clean = train_clean %>% 
  mutate(Survived = as.logical(Survived)) %>% 
  mutate_at(vars(Pclass, Sex, Embarked), as.factor)

test = test %>% 
  mutate(Survived = as.logical(Survived)) %>%
  mutate_at(vars(Pclass, Sex, Embarked), as.factor)

We can then check if the data types have been transformed:

str(train_clean)
#> 'data.frame':    571 obs. of  8 variables:
#>  $ Survived: logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
#>  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 3 1 2 1 2 3 2 ...
#>  $ Sex     : Factor w/ 2 levels "female","male": 1 2 2 2 1 1 2 2 2 1 ...
#>  $ Age     : num  41 42 35 8 39 27 56 16 30 28 ...
#>  $ SibSp   : int  0 0 0 4 1 1 0 0 0 0 ...
#>  $ Parch   : int  2 0 0 1 1 0 0 0 0 0 ...
#>  $ Fare    : num  20.21 26.29 7.05 29.12 83.16 ...
#>  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 4 4 3 2 4 4 4 4 4 ...
#>  - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#>   ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
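
Note that Embarked retains an empty-string factor level: one passenger in train_clean has no recorded port of embarkation. We keep that row in this exercise, but a stricter cleaning pass could drop it. A sketch, using a hypothetical train_noblank object so the results below stay reproducible:

# optional stricter cleaning (not applied in this exercise)
train_noblank = train_clean %>% filter(Embarked != "") %>% droplevels()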

Exploring the data

After understanding the dataset and conducting a preliminary transformation of the data types, we take a quick statistical summary of the variables:

summary(train_clean)
#>   Survived       Pclass      Sex           Age            SibSp       
#>  Mode :logical   1:148   female:201   Min.   : 0.42   Min.   :0.0000  
#>  FALSE:340       2:142   male  :370   1st Qu.:20.25   1st Qu.:0.0000  
#>  TRUE :231       3:281                Median :28.00   Median :0.0000  
#>                                       Mean   :29.30   Mean   :0.5044  
#>                                       3rd Qu.:37.50   3rd Qu.:1.0000  
#>                                       Max.   :74.00   Max.   :5.0000  
#>      Parch             Fare        Embarked
#>  Min.   :0.0000   Min.   :  0.00    :  1   
#>  1st Qu.:0.0000   1st Qu.:  8.05   C:104   
#>  Median :0.0000   Median : 15.55   Q: 22   
#>  Mean   :0.4203   Mean   : 34.01   S:444   
#>  3rd Qu.:1.0000   3rd Qu.: 31.85           
#>  Max.   :6.0000   Max.   :512.33

From this quick snapshot of the data, we found that:

  1. The original train data contained 891 passengers, each with a unique passenger ID, of which 714 had a recorded age; after removing NAs and taking the 80% split, 571 observations remain in train_clean.

  2. Only 231 of the 571 passengers recorded in this training split survived.

  3. Most passengers boarded the ship with ticket class 3, meaning they were mostly not of the upper class, as evidenced by the median fare paid of 15.55.

  4. The passengers of the Titanic vary in age, from a child less than a year old as the youngest to a 74-year-old as the oldest in this split. The small discrepancy between the mean (29.30) and median (28.00) age suggests the age distribution of the recorded passengers is roughly symmetric.

  5. Most passengers seem to have boarded the Titanic without their siblings, spouses, parents, or children, as shown by the low medians and third quartiles of the variables SibSp and Parch. Maximum values of 5 for SibSp and 6 for Parch were observed.

  6. The numerical variables have only weak pairwise correlation, which we can verify with a quick check (see the sketch below).
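
To verify the last point, a pairwise correlation matrix can be computed; a minimal sketch using base R’s cor() on the numeric columns:

# pairwise Pearson correlations among the numeric predictors
train_clean %>% 
  select_if(is.numeric) %>%   # Age, SibSp, Parch, Fare
  cor() %>% 
  round(2)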

Building the Model

Prior to doing any modelling, we check the class balance of the target:

prop.table(table(train_clean$Survived)) %>% round(2)
#> 
#> FALSE  TRUE 
#>   0.6   0.4

The data is reasonably balanced (60/40), so we can continue with the modelling.

Classification I: Logistic Regression

titanic_glm <- glm(formula = Survived ~ 1,  # intercept-only model
                  data = train_clean,
                  family = "binomial")
summary(titanic_glm)
#> 
#> Call:
#> glm(formula = Survived ~ 1, family = "binomial", data = train_clean)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -1.018  -1.018  -1.018   1.345   1.345  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.38653    0.08527  -4.533 5.81e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 770.64  on 570  degrees of freedom
#> Residual deviance: 770.64  on 570  degrees of freedom
#> AIC: 772.64
#> 
#> Number of Fisher Scoring iterations: 4

The intercept of -0.3865 is the log of odds of a passenger surviving the Titanic crash. To interpret this value, we convert it into a probability and an odds:

library(gtools)  # provides inv.logit(); boot::inv.logit would also work

titanic_glm_summary = data.frame(log_odds = titanic_glm$coefficients,
                               probability = inv.logit(titanic_glm$coefficients),
                               odds = exp(titanic_glm$coefficients),
                               row.names = "Intercept")
titanic_glm_summary
#>             log_odds probability      odds
#> Intercept -0.3865279   0.4045534 0.6794118

The model indicates that the probability of a passenger surviving the Titanic is about 0.40 (or 40%), and that the odds of surviving are 0.68, i.e. a passenger was 0.68 times as likely to survive as not to survive.
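
As a sanity check, the odds follow directly from the probability:

\[odds = \frac{p}{1-p} = \frac{0.4046}{1-0.4046} \approx 0.68\]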

Categorical Predictor

For the first attempt at modelling with a categorical predictor, we use the variable Pclass to see if socio-economic status matters in a life-and-death situation.

titanic_glm_class <- glm(Survived~Pclass,
                data = train_clean,
                family = "binomial")
summary(titanic_glm_class)
#> 
#> Call:
#> glm(formula = Survived ~ Pclass, family = "binomial", data = train_clean)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.4200  -0.7694  -0.7694   0.9528   1.6503  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   0.5543     0.1708   3.246  0.00117 ** 
#> Pclass2      -0.7237     0.2399  -3.017  0.00255 ** 
#> Pclass3      -1.6200     0.2187  -7.407 1.29e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 770.64  on 570  degrees of freedom
#> Residual deviance: 709.88  on 568  degrees of freedom
#> AIC: 715.88
#> 
#> Number of Fisher Scoring iterations: 4

We can then interpret the coefficients as follows:

# note: the Middle/Lower rows are contrasts against the upper-class
# baseline (the intercept), not standalone class effects
glm_class_summary = data.frame(log_odds = titanic_glm_class$coefficients,
                               probability = inv.logit(titanic_glm_class$coefficients),
                               odds = exp(titanic_glm_class$coefficients),
                             row.names = c("Upper Class", "Middle Class", "Lower Class"))
glm_class_summary
#>                log_odds probability      odds
#> Upper Class   0.5543107   0.6351351 1.7407407
#> Middle Class -0.7237289   0.3265724 0.4849406
#> Lower Class  -1.6199789   0.1652078 0.1979029

Interpretation: since the Middle Class and Lower Class rows are contrasts against the upper-class baseline, the negative coefficients mean the probability of surviving drops drastically for passengers who boarded the lower classes; a lower-class passenger (Pclass = 3) had roughly 0.20 times the odds of survival of an upper-class passenger.
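
Because the coefficients are contrasts, the per-class survival probabilities come from adding each contrast to the intercept before applying the inverse logit. A minimal sketch (this should give roughly 0.64, 0.46, and 0.26 for the upper, middle, and lower classes):

# per-class survival probabilities implied by the model
coefs = coef(titanic_glm_class)
inv.logit(c(Upper  = coefs[["(Intercept)"]],
            Middle = coefs[["(Intercept)"]] + coefs[["Pclass2"]],
            Lower  = coefs[["(Intercept)"]] + coefs[["Pclass3"]]))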

Numerical Predictor

We can also use a numerical predictor in glm; here, we use the variable Fare.

titanic_glm_fare <- glm(Survived~Fare,
                data = train_clean,
                family = "binomial")
glm_fare_summary=data.frame(log_odds=titanic_glm_fare$coefficients,
                               probability=inv.logit(titanic_glm_fare$coefficients),
                               odds=exp(titanic_glm_fare$coefficients))
glm_fare_summary
#>                log_odds probability      odds
#> (Intercept) -0.84870178   0.2997053 0.4279702
#> Fare         0.01441983   0.5036049 1.0145243

The function of this model would be:

\[Log_{odds} = -0.8487 + 0.0144 \times Fare\]

so the probability of surviving would be about 30% if Fare were 0, and the log of odds of survival would increase by 0.0144 for every one-unit increase in Fare.

Stepwise Selection

To combine predictors, we fit a logistic regression on all remaining variables and let step() add and drop terms in both directions, minimising AIC:

model_all=glm(Survived~., data=train_clean,family="binomial")
titanic_step = step(model_all,
                   direction = "both",
                   trace = F)
summary(titanic_step)
#> 
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
#>     data = train_clean)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8872  -0.6272  -0.4103   0.6782   2.4664  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  4.648615   0.528860   8.790  < 2e-16 ***
#> Pclass2     -1.624485   0.326393  -4.977 6.45e-07 ***
#> Pclass3     -2.579307   0.320817  -8.040 9.00e-16 ***
#> Sexmale     -2.542184   0.236967 -10.728  < 2e-16 ***
#> Age         -0.055996   0.009744  -5.746 9.11e-09 ***
#> SibSp       -0.384307   0.136376  -2.818  0.00483 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 770.64  on 570  degrees of freedom
#> Residual deviance: 520.40  on 565  degrees of freedom
#> AIC: 532.4
#> 
#> Number of Fisher Scoring iterations: 5
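
The table below converts the stepwise model’s coefficients into probabilities and odds, presumably produced with the same pattern as before:

glm_step_summary = data.frame(log_odds = titanic_step$coefficients,
                              probability = inv.logit(titanic_step$coefficients),
                              odds = exp(titanic_step$coefficients))
glm_step_summary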
#>               log_odds probability         odds
#> (Intercept)  4.6486151  0.99051596 104.44024932
#> Pclass2     -1.6244848  0.16458729   0.19701315
#> Pclass3     -2.5793066  0.07048214   0.07582656
#> Sexmale     -2.5421842  0.07295331   0.07869433
#> Age         -0.0559956  0.48600476   0.94554330
#> SibSp       -0.3843072  0.40508848   0.68092223

The stepwise approach yielded a model with negative coefficients for Pclass2, Pclass3, Sexmale, Age, and SibSp, indicating that the log of odds of survival decrease if a passenger was in class 2 or 3, male, older, or aboard with more siblings or spouses. The intercept’s probability of 0.99 corresponds to the baseline case: a first-class female of age 0 travelling with no siblings or spouses.
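
To make the coefficients concrete, consider a hypothetical passenger: a 30-year-old male in third class with no siblings or spouses aboard. A sketch of the prediction (this should come out to a survival probability of roughly 0.10):

# hypothetical passenger for illustration only
new_passenger = data.frame(Pclass = factor("3", levels = c("1", "2", "3")),
                           Sex = factor("male", levels = c("female", "male")),
                           Age = 30,
                           SibSp = 0)
predict(titanic_step, newdata = new_passenger, type = "response")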

We will use this stepwise model to predict on our test dataset.

pred_glm=predict(object = titanic_step,
        newdata = test,
        type = "response")
pred_glm_label=ifelse(pred_glm>0.5,TRUE,FALSE)
table(pred_glm_label);table(test$Survived)
#> pred_glm_label
#> FALSE  TRUE 
#>    81    62
#> 
#> FALSE  TRUE 
#>    84    59

The marginal counts of predicted and actual labels are close (a difference of 3 in each class), but these totals alone cannot tell us how many individual predictions were wrong; we will look into the confusion matrix later in this exercise. First, we will build another type of classification model, KNN.

Classification II: K-Nearest Neighbors

Before doing the modelling, we separate out the numerical predictors and scale them, applying the train set’s centers and scales to the test set:

# numerical predictors only; KNN works on numeric distances
titanic_train_x = train_clean %>% select_if(is.numeric)
titanic_test_x = test %>% select_if(is.numeric) %>% na.omit()

# the target labels
titanic_train_y = train_clean[, 1]
titanic_test_y = test[, 1]

# scale the train predictors, then reuse the train centers and scales
# on the test predictors to avoid information leakage
titanic_train_x_scale = scale(titanic_train_x)

titanic_test_x_scale = scale(x = titanic_test_x,
                           center = attr(titanic_train_x_scale, "scaled:center"),
                           scale = attr(titanic_train_x_scale, "scaled:scale"))

Using the knn() function from the class library, and considering that the square root of the number of training observations is about 23.9 (sqrt(571)), we pick the odd value k = 23 to avoid tied votes:

library(class)
test = na.omit(test)

# classify using the scaled predictors prepared above
pred_knn = knn(train = titanic_train_x_scale,
              test = titanic_test_x_scale,
              cl = titanic_train_y,
              k = 23)
test$predict_knn = pred_knn
test$predict_glm = pred_glm_label

Evaluating the Models

By using confusion matrices, we can evaluate which model results in fewer false negatives, meaning people who survived but were predicted as deceased (higher sensitivity is preferable).
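
As a reminder, sensitivity and specificity are computed from the confusion matrix counts:

\[Sensitivity = \frac{TP}{TP + FN}, \qquad Specificity = \frac{TN}{TN + FP}\]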

library(caret)
test=test %>% mutate_at(vars(Survived,predict_knn,predict_glm),as.factor)
confusionMatrix(data = test$predict_knn,
                reference = test$Survived,
                positive="TRUE")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction FALSE TRUE
#>      FALSE    72   28
#>      TRUE     12   31
#>                                           
#>                Accuracy : 0.7203          
#>                  95% CI : (0.6391, 0.7921)
#>     No Information Rate : 0.5874          
#>     P-Value [Acc > NIR] : 0.0006709       
#>                                           
#>                   Kappa : 0.3987          
#>                                           
#>  Mcnemar's Test P-Value : 0.0177061       
#>                                           
#>             Sensitivity : 0.5254          
#>             Specificity : 0.8571          
#>          Pos Pred Value : 0.7209          
#>          Neg Pred Value : 0.7200          
#>              Prevalence : 0.4126          
#>          Detection Rate : 0.2168          
#>    Detection Prevalence : 0.3007          
#>       Balanced Accuracy : 0.6913          
#>                                           
#>        'Positive' Class : TRUE            
#> 
confusionMatrix(data = test$predict_glm,
                reference = test$Survived,
                positive="TRUE")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction FALSE TRUE
#>      FALSE    67   14
#>      TRUE     17   45
#>                                           
#>                Accuracy : 0.7832          
#>                  95% CI : (0.7066, 0.8477)
#>     No Information Rate : 0.5874          
#>     P-Value [Acc > NIR] : 6.118e-07       
#>                                           
#>                   Kappa : 0.5561          
#>                                           
#>  Mcnemar's Test P-Value : 0.7194          
#>                                           
#>             Sensitivity : 0.7627          
#>             Specificity : 0.7976          
#>          Pos Pred Value : 0.7258          
#>          Neg Pred Value : 0.8272          
#>              Prevalence : 0.4126          
#>          Detection Rate : 0.3147          
#>    Detection Prevalence : 0.4336          
#>       Balanced Accuracy : 0.7802          
#>                                           
#>        'Positive' Class : TRUE            
#> 

The logistic regression model resulted in a higher sensitivity (0.76 vs. 0.53 for KNN) and is thus more appropriate for this use. KNN would be preferable if we desired a higher specificity (0.86 vs. 0.80).