This exercise is part of Algoritma’s LearnByBuilding practice in its full stack Data Science programme.
In this exercise, we will explore data from the Titanic’s tragic accident, which killed 1,502 of the 2,224 passengers and crew, to see whether certain socio-demographic traits contributed to passengers’ survival. The datasets were obtained from Kaggle (https://www.kaggle.com/competitions/titanic/overview) on January 26, 2023.
Without further ado, let us “dive” into how we got these results.
First, we import the datasets into RStudio. In this exercise, we use the train and test datasets provided.
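A minimal import sketch, assuming the Kaggle CSV files sit in the working directory under their default names (train.csv, test.csv):
library(dplyr) # provides %>%, select(), and mutate(), used throughout
train <- read.csv("train.csv")
# Kaggle's test.csv has no Survived labels, so the exercise below instead
# carves a validation split out of train
test_raw <- read.csv("test.csv")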
Preliminary examination of the imported datasets in the environment pane shows that the train dataset has 891 observations of 12 variables. The variable Survived, indicating whether each passenger died (0) or survived (1), will be our target variable for the machine learning.
For this purpose, we first do some light wrangling and splitting (explained below), then look at the column names, data types, and the first 5 observations by calling head():
train_setup=train %>% select(-c(PassengerId,Name,Ticket,Cabin)) %>% na.omit()
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index=sample(nrow(train_setup), nrow(train_setup)*0.8)
# splitting
train_clean=train_setup[index,]
test=train_setup[-index,]
head(train_clean,5)
#> Survived Pclass Sex Age SibSp Parch Fare Embarked
#> 255 0 3 female 41 0 2 20.2125 S
#> 708 1 1 male 42 0 0 26.2875 S
#> 364 0 3 male 35 0 0 7.0500 S
#> 788 0 3 male 8 4 1 29.1250 Q
#> 836 1 1 female 39 1 1 83.1583 C
Since the train data has several NAs in the variables Age and Cabin, we omit the observations with missing Age and drop the Cabin variable entirely. We also remove the variables PassengerId, Name, and Ticket, as they function only as identifiers and would interfere with the model if kept.
The variables Pclass, SibSp, and Parch provide the passenger’s ticket class (where 1 denotes upper socio-economic status and 3 the lowest), the number of siblings or spouses aboard the Titanic, and the number of parents or children aboard, respectively. The last variable, Embarked, records the port of embarkation: Southampton (S), Cherbourg (C), or Queenstown (Q).
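A quick tabulation gives a feel for these categorical variables (run here on the training split built above):
table(train_clean$Pclass)
table(train_clean$Embarked)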
The type of each variable can be inspected as follows:
str(train_clean)
#> 'data.frame': 571 obs. of 8 variables:
#> $ Survived: int 0 1 0 0 1 0 0 0 0 1 ...
#> $ Pclass : int 3 1 3 3 1 2 1 2 3 2 ...
#> $ Sex : chr "female" "male" "male" "male" ...
#> $ Age : num 41 42 35 8 39 27 56 16 30 28 ...
#> $ SibSp : int 0 0 0 4 1 1 0 0 0 0 ...
#> $ Parch : int 2 0 0 1 1 0 0 0 0 0 ...
#> $ Fare : num 20.21 26.29 7.05 29.12 83.16 ...
#> $ Embarked: chr "S" "S" "S" "Q" ...
#> - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#> ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
Upon closer investigation, we find that several variables need explicit type coercion, as detailed below:
#> Variables Type Coerced
#> 1 PassengerId integer integer
#> 2 Survived integer logical
#> 3 Pclass integer factor
#> 4 Name character character
#> 5 Sex character factor
#> 6 Age numeric numeric
#> 7 SibSp integer integer
#> 8 Parch integer integer
#> 9 Ticket character character
#> 10 Fare numeric numeric
#> 11 Cabin character character
#> 12 Embarked character factor
(Note: Age must remain numeric rather than integer because some passengers were less than 1 year old and their ages are recorded as fractions, e.g. 0.42.)
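We can verify this with a quick check (the exact rows returned depend on the sampling split):
train_clean %>% filter(Age < 1) %>% select(Age, Survived)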
We transform the data types by running the following lines:
train_clean <- train_clean %>%
  mutate(Survived = as.logical(Survived)) %>%
  mutate_at(vars(Pclass, Sex, Embarked), as.factor)
test <- test %>%
  mutate(Survived = as.logical(Survived)) %>%
  mutate_at(vars(Pclass, Sex, Embarked), as.factor)
We can then check if the data types have been transformed:
str(train_clean)
#> 'data.frame': 571 obs. of 8 variables:
#> $ Survived: logi FALSE TRUE FALSE FALSE TRUE FALSE ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 3 1 2 1 2 3 2 ...
#> $ Sex : Factor w/ 2 levels "female","male": 1 2 2 2 1 1 2 2 2 1 ...
#> $ Age : num 41 42 35 8 39 27 56 16 30 28 ...
#> $ SibSp : int 0 0 0 4 1 1 0 0 0 0 ...
#> $ Parch : int 2 0 0 1 1 0 0 0 0 0 ...
#> $ Fare : num 20.21 26.29 7.05 29.12 83.16 ...
#> $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 4 4 3 2 4 4 4 4 4 ...
#> - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#> ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
After understanding the dataset and conducting a preliminary transformation of the data types, we are particularly interested in the summary statistics of the variables. (Note that Embarked retains an empty-string factor level from one passenger with no recorded port; we leave it as-is here, though it could also be dropped or imputed.)
summary(train_clean)
#> Survived Pclass Sex Age SibSp
#> Mode :logical 1:148 female:201 Min. : 0.42 Min. :0.0000
#> FALSE:340 2:142 male :370 1st Qu.:20.25 1st Qu.:0.0000
#> TRUE :231 3:281 Median :28.00 Median :0.0000
#> Mean :29.30 Mean :0.5044
#> 3rd Qu.:37.50 3rd Qu.:1.0000
#> Max. :74.00 Max. :5.0000
#> Parch Fare Embarked
#> Min. :0.0000 Min. : 0.00 : 1
#> 1st Qu.:0.0000 1st Qu.: 8.05 C:104
#> Median :0.0000 Median : 15.55 Q: 22
#> Mean :0.4203 Mean : 34.01 S:444
#> 3rd Qu.:1.0000 3rd Qu.: 31.85
#> Max. :6.0000 Max. :512.33
From this quick snapshot of the data, we find that:
- The original train data held 891 passengers with unique passenger IDs, of whom 714 had a recorded age; after removing missing values and taking the 80% split, 571 observations remain here.
- Only 231 of these 571 passengers survived (about 40%).
- Most passengers (281 of 571) boarded with ticket class 3, meaning the majority were unlikely to be upper-class, which is also evident from the median fare paid of 15.55.
- The passengers vary widely in age, from an infant of 0.42 years to a 74-year-old (the oldest in this split). The mean (29.30) and median (28.00) of Age are close, so the age distribution is roughly symmetric.
- Most passengers seem to have boarded the Titanic without siblings, spouses, parents, or children, as shown by the low medians and third quartiles of SibSp and Parch; the maxima in this split are 5 for SibSp and 6 for Parch.
- The numerical variables are only weakly correlated with one another.
Prior to doing any modelling, we will check the class balance of the target variable:
prop.table(table(train_clean$Survived)) %>% round(2)
#>
#> FALSE TRUE
#> 0.6 0.4
The data is reasonably balanced (60:40), so we can continue with the modelling. We start with an intercept-only logistic regression (no predictors) as a baseline:
titanic_glm <- glm(formula = Survived~TRUE,
data = train_clean,
family = "binomial")
summary(titanic_glm)
#>
#> Call:
#> glm(formula = Survived ~ TRUE, family = "binomial", data = train_clean)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.018 -1.018 -1.018 1.345 1.345
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.38653 0.08527 -4.533 5.81e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 770.64 on 570 degrees of freedom
#> Residual deviance: 770.64 on 570 degrees of freedom
#> AIC: 772.64
#>
#> Number of Fisher Scoring iterations: 4
The intercept of -0.3865 is the log-odds of a passenger surviving the Titanic crash. To interpret this value, we need to convert it into a probability and an odds.
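Recall how the three quantities relate for the intercept coefficient:
\[ p = \frac{e^{\beta_0}}{1 + e^{\beta_0}}, \qquad \text{odds} = e^{\beta_0} = \frac{p}{1 - p} \]
In code: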
library(gtools) # one package that provides inv.logit(); boot::inv.logit() works too
titanic_glm_summary <- data.frame(log_odds = titanic_glm$coefficients,
                                  probability = inv.logit(titanic_glm$coefficients),
                                  odds = exp(titanic_glm$coefficients),
                                  row.names = "Intercept")
titanic_glm_summary
#> log_odds probability odds
#> Intercept -0.3865279 0.4045534 0.6794118
The model indicates that the probability of a passenger surviving the Titanic is 0.40 (40%); equivalently, the odds of surviving are 0.68, i.e. surviving was about 0.68 times as likely as not surviving.
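As a sanity check, an intercept-only logistic model’s fitted probability should equal the raw survival rate in the training data:
mean(train_clean$Survived) # ~0.4045, matching inv.logit(-0.3865)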
For our first model with a categorical predictor, we use the variable Pclass to see whether socio-economic status mattered in a life-and-death situation.
titanic_glm_class <- glm(Survived~Pclass,
data = train_clean,
family = "binomial")
summary(titanic_glm_class)
#>
#> Call:
#> glm(formula = Survived ~ Pclass, family = "binomial", data = train_clean)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.4200 -0.7694 -0.7694 0.9528 1.6503
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.5543 0.1708 3.246 0.00117 **
#> Pclass2 -0.7237 0.2399 -3.017 0.00255 **
#> Pclass3 -1.6200 0.2187 -7.407 1.29e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 770.64 on 570 degrees of freedom
#> Residual deviance: 709.88 on 568 degrees of freedom
#> AIC: 715.88
#>
#> Number of Fisher Scoring iterations: 4
We can then interpret the coefficients, as follows:
glm_class_summary=data.frame(log_odds=titanic_glm_class$coefficients,
probability=inv.logit(titanic_glm_class$coefficients),
odds=exp(titanic_glm_class$coefficients),
row.names = c("Upper Class", "Middle Class", "Lower Class"))
glm_class_summary
#> log_odds probability odds
#> Upper Class 0.5543107 0.6351351 1.7407407
#> Middle Class -0.7237289 0.3265724 0.4849406
#> Lower Class -1.6199789 0.1652078 0.1979029
Interpretation: the intercept row corresponds to upper-class passengers (the baseline factor level), while the Pclass2 and Pclass3 rows are contrasts relative to that baseline rather than stand-alone class probabilities. The strongly negative contrasts mean the probability of surviving drops drastically for passengers in the lower classes, especially Pclass = 3.
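Because of this baseline coding, the survival probability for each class is obtained by adding the contrast to the intercept before applying inv.logit(), as in this quick sketch:
cf <- coef(titanic_glm_class)
inv.logit(cf["(Intercept)"])                 # upper class : ~0.64
inv.logit(cf["(Intercept)"] + cf["Pclass2"]) # middle class: ~0.46
inv.logit(cf["(Intercept)"] + cf["Pclass3"]) # lower class : ~0.26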
We can also use a numerical predictor in glm(); here, we use the variable Fare.
titanic_glm_fare <- glm(Survived~Fare,
data = train_clean,
family = "binomial")
glm_fare_summary=data.frame(log_odds=titanic_glm_fare$coefficients,
probability=inv.logit(titanic_glm_fare$coefficients),
odds=exp(titanic_glm_fare$coefficients))
glm_fare_summary
#> log_odds probability odds
#> (Intercept) -0.84870178 0.2997053 0.4279702
#> Fare 0.01441983 0.5036049 1.0145243
The function of this model would be:
\[ \log(\text{odds}) = -0.8487 + 0.0144 \times \text{Fare} \]
so the probability of surviving is about 30% when Fare is 0, and the log-odds of survival increase by 0.0144 for every one-unit increase in Fare.
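For example, at a hypothetical fare of 50 the predicted survival probability comes out just under one half:
# by hand, using the rounded coefficients
inv.logit(-0.8487 + 0.0144 * 50) # ~0.47
# or directly from the fitted model
predict(titanic_glm_fare, newdata = data.frame(Fare = 50), type = "response")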
Next, we fit a model with all remaining predictors and let stepwise selection (searching in both directions, minimising AIC) look for a more parsimonious model:
model_all <- glm(Survived ~ ., data = train_clean, family = "binomial")
titanic_step <- step(model_all,
                     direction = "both",
                     trace = F)
summary(titanic_step)
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
#> data = train_clean)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8872 -0.6272 -0.4103 0.6782 2.4664
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.648615 0.528860 8.790 < 2e-16 ***
#> Pclass2 -1.624485 0.326393 -4.977 6.45e-07 ***
#> Pclass3 -2.579307 0.320817 -8.040 9.00e-16 ***
#> Sexmale -2.542184 0.236967 -10.728 < 2e-16 ***
#> Age -0.055996 0.009744 -5.746 9.11e-09 ***
#> SibSp -0.384307 0.136376 -2.818 0.00483 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 770.64 on 570 degrees of freedom
#> Residual deviance: 520.40 on 565 degrees of freedom
#> AIC: 532.4
#>
#> Number of Fisher Scoring iterations: 5
#> log_odds probability odds
#> (Intercept) 4.6486151 0.99051596 104.44024932
#> Pclass2 -1.6244848 0.16458729 0.19701315
#> Pclass3 -2.5793066 0.07048214 0.07582656
#> Sexmale -2.5421842 0.07295331 0.07869433
#> Age -0.0559956 0.48600476 0.94554330
#> SibSp -0.3843072 0.40508848 0.68092223
The stepwise approach yields a model with negative coefficients for Pclass2, Pclass3, Sexmale, Age, and SibSp: being in second or third class, being male, being older, and boarding with more siblings or spouses all reduce the log-odds of survival. The intercept’s probability of 0.99 belongs to the baseline passenger (first class, female, Age 0, SibSp 0); more loosely, a young upper-class female passenger had a very high predicted chance of surviving the crash.
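For instance, we can score a hypothetical passenger (values chosen purely for illustration) with predict():
new_pax <- data.frame(Pclass = factor("3", levels = c("1", "2", "3")),
                      Sex = factor("male", levels = c("female", "male")),
                      Age = 30,
                      SibSp = 0)
predict(titanic_step, newdata = new_pax, type = "response") # a low survival probability, ~0.10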
We now use this stepwise model to make predictions on our test split.
pred_glm <- predict(object = titanic_step,
                    newdata = test,
                    type = "response")
# label a passenger as survived when the predicted probability exceeds 0.5
pred_glm_label <- pred_glm > 0.5
table(pred_glm_label); table(test$Survived)
#> pred_glm_label
#> FALSE TRUE
#> 81 62
#>
#> FALSE TRUE
#> 84 59
The marginal counts differ only slightly (62 predicted survivors versus 59 actual), but marginal tables alone cannot tell us how many individual predictions were wrong; we will look into the full confusion matrix later in this exercise. First, we will build another type of classification model, KNN.
Before doing the modelling, we want to scale the predictors first:
# KNN works on distances, so keep only the numeric predictors
titanic_train_x <- train_clean %>% select_if(is.numeric)
titanic_test_x <- test %>% select_if(is.numeric) %>% na.omit()
# target labels
titanic_train_y <- train_clean[, 1]
titanic_test_y <- test[, 1]
# z-score scaling; the test set reuses the training centres and scales
titanic_train_x_scale <- scale(titanic_train_x)
titanic_test_x_scale <- scale(x = titanic_test_x,
                              center = attr(titanic_train_x_scale, "scaled:center"),
                              scale = attr(titanic_train_x_scale, "scaled:scale"))
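The test predictors are standardised with the training set’s statistics to avoid information leakage; for each numeric column x,
\[ z = \frac{x - \bar{x}_{\text{train}}}{s_{\text{train}}} \]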
Using the knn() function from the class package, and choosing k near the square root of the number of training observations (sqrt(571) ≈ 23.9, rounded to the odd number 23 to avoid tied votes), we get:
library(class)
test <- na.omit(test)
# use the scaled predictors so that Fare does not dominate the distance metric
pred_knn <- knn(train = titanic_train_x_scale,
                test = titanic_test_x_scale,
                cl = titanic_train_y,
                k = 23)
test$predict_knn <- pred_knn
test$predict_glm <- pred_glm_label
By using the confusion matrix, we can evaluate which model produces fewer false negatives, meaning people who survived but were predicted as deceased (so a higher sensitivity is preferable).
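As a reminder, in terms of true/false positives and negatives,
\[ \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP} \]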
library(caret)
test=test %>% mutate_at(vars(Survived,predict_knn,predict_glm),as.factor)
confusionMatrix(data = test$predict_knn,
reference = test$Survived,
positive="TRUE")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 72 28
#> TRUE 12 31
#>
#> Accuracy : 0.7203
#> 95% CI : (0.6391, 0.7921)
#> No Information Rate : 0.5874
#> P-Value [Acc > NIR] : 0.0006709
#>
#> Kappa : 0.3987
#>
#> Mcnemar's Test P-Value : 0.0177061
#>
#> Sensitivity : 0.5254
#> Specificity : 0.8571
#> Pos Pred Value : 0.7209
#> Neg Pred Value : 0.7200
#> Prevalence : 0.4126
#> Detection Rate : 0.2168
#> Detection Prevalence : 0.3007
#> Balanced Accuracy : 0.6913
#>
#> 'Positive' Class : TRUE
#>
confusionMatrix(data = test$predict_glm,
reference = test$Survived,
positive="TRUE")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 67 14
#> TRUE 17 45
#>
#> Accuracy : 0.7832
#> 95% CI : (0.7066, 0.8477)
#> No Information Rate : 0.5874
#> P-Value [Acc > NIR] : 6.118e-07
#>
#> Kappa : 0.5561
#>
#> Mcnemar's Test P-Value : 0.7194
#>
#> Sensitivity : 0.7627
#> Specificity : 0.7976
#> Pos Pred Value : 0.7258
#> Neg Pred Value : 0.8272
#> Prevalence : 0.4126
#> Detection Rate : 0.3147
#> Detection Prevalence : 0.4336
#> Balanced Accuracy : 0.7802
#>
#> 'Positive' Class : TRUE
#>
The logistic regression model achieved the higher sensitivity (0.76 versus 0.53 for KNN) as well as the higher accuracy (0.78 versus 0.72), and is thus the more appropriate model for this purpose. KNN would be preferable only if we prioritised specificity (0.86 versus 0.80).