Prediction the Survival of Titanic Passengers Using Logistic Regression Model and K-NN
Introduction
The RMS Titanic sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean, four days into the ship’s maiden voyage from Southampton to New York City. The largest ocean liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg at around 23:40 (ship’s time) on Sunday, 14 April 1912. Her sinking two hours and forty minutes later at 02:20 (ship’s time; 05:18 GMT) on Monday, 15 April, resulted in the deaths of more than 1,500 people, making it one of the deadliest peacetime marine disasters in history. In this time, we’ll try to predict whether the passengers is survived or not.
Data Import
This data contains 3 dataset, such as :
1. train.csv
: is using for training data (complete with predictor variable)
2. test.csv
: is using for testing data (without survived variable)
3. gender_submission.csv
: contains id and survived variable
Let’s read the data
Data train
Train data has 891 obs and 12 columns. We’ll use this data to make model later.
Data Test
Test data has 418 obs and 11 columns. If we see in train data has 12 columns and in test data has 11 columns. The difference is that in train data has Survived
variable which in test data has not. The Survived
variable in test data is contained in gender_submission.csv
. Let’s read the data.
Some information about the data :
Pclass
: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = LowerName
: Name of the passengerSex
: Gender of passengerAge
: Age in yearsSibsp
: siblings / spouses aboard the Titanic
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)Parch
: parents / children aboard the Titanic
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.Ticket
: Ticket numberFare
: Passenger fareCabin
: Cabin numberEmbarked
: Port of Embarkation
C = Cherbourg
Q = Queenstown
S = Southampton
Data Preparation and EDA
To make it easier to read, let’s join the test data and survive, so that we just only work with only two data.
Now, we only have 2 data : train
and test
dataset. Let’s check the if there is missing value in data and the data type of data.
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There is 177 missing value in age column. It was around 19% of our data. Instead of remove the data. let’s try to replace the age number with the mean of age.
train_clean <- train %>%
mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
colSums(is.na(train_clean))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Let’s check if there’s any missing value in data test.
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 0 0 0
There’s some missing value in test data. To avoid NA from affecting the prediction results, we will delete data that contains NA
In this case, Survived
is the target variable, and some variables are not contain much information about the modeling. Let’s select some predictor variable and change the data type.
data_train <- train_clean %>%
select(-c(PassengerId, Name, Ticket, Embarked, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch))
data_test <- test_clean %>%
select(-c(PassengerId, Name, Ticket, Embarked, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch))
Let’s check the structure of data train by using str()
## 'data.frame': 891 obs. of 7 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
Before we continue to modeling, let’s check the proportion of our data in class target.
##
## 0 1
## 0.6161616 0.3838384
In the train dataset, the proportion of our data was around 61,61% not survived and around 38,38% survived. It’s balanced enough. Let’s continue to make the model.
Modeling
Logistic Regression
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8306 -0.6225 -0.4347 0.6019 2.4601
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.510e+00 4.662e-01 7.531 5.05e-14 ***
## Pclass2 -1.057e+00 2.987e-01 -3.539 0.000402 ***
## Pclass3 -2.016e+00 2.946e-01 -6.841 7.86e-12 ***
## Sexmale -2.719e+00 2.007e-01 -13.544 < 2e-16 ***
## Age -3.639e-02 8.343e-03 -4.362 1.29e-05 ***
## SibSp1 8.279e-02 2.239e-01 0.370 0.711570
## SibSp2 -2.858e-01 5.342e-01 -0.535 0.592645
## SibSp3 -2.353e+00 7.175e-01 -3.280 0.001040 **
## SibSp4 -1.773e+00 7.687e-01 -2.306 0.021108 *
## SibSp5 -1.611e+01 9.549e+02 -0.017 0.986541
## SibSp8 -1.609e+01 7.542e+02 -0.021 0.982984
## Parch1 3.808e-01 2.891e-01 1.317 0.187769
## Parch2 4.165e-02 3.789e-01 0.110 0.912462
## Parch3 2.485e-01 1.040e+00 0.239 0.811241
## Parch4 -1.609e+01 1.053e+03 -0.015 0.987805
## Parch5 -1.329e+00 1.168e+00 -1.138 0.255228
## Parch6 -1.671e+01 2.400e+03 -0.007 0.994444
## Fare 2.810e-03 2.483e-03 1.132 0.257803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 768.71 on 873 degrees of freedom
## AIC: 804.71
##
## Number of Fisher Scoring iterations: 15
From model m_LR1
, we see that so many variables is not signicificant effected the survival rate. Let’s try do model fitting by using stepwise with backward
method.
## Start: AIC=804.71
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
##
## Df Deviance AIC
## - Parch 6 777.98 801.98
## - Fare 1 770.13 804.13
## <none> 768.71 804.71
## - SibSp 6 794.41 818.41
## - Age 1 789.10 823.10
## - Pclass 2 817.59 849.59
## - Sex 1 997.14 1031.14
##
## Step: AIC=801.98
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 779.24 801.24
## <none> 777.98 801.98
## - SibSp 6 805.17 817.17
## - Age 1 805.91 827.91
## - Pclass 2 836.86 856.86
## - Sex 1 1013.15 1035.15
##
## Step: AIC=801.24
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 779.24 801.24
## - SibSp 6 805.29 815.29
## - Age 1 808.21 828.21
## - Pclass 2 880.32 898.32
## - Sex 1 1020.51 1040.51
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8259 -0.5998 -0.4326 0.6147 2.4463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.914045 0.406967 9.618 < 2e-16 ***
## Pclass2 -1.220029 0.264067 -4.620 3.83e-06 ***
## Pclass3 -2.288869 0.245340 -9.329 < 2e-16 ***
## Sexmale -2.708998 0.195246 -13.875 < 2e-16 ***
## Age -0.041263 0.008001 -5.157 2.51e-07 ***
## SibSp1 0.142801 0.210125 0.680 0.49676
## SibSp2 -0.142497 0.519745 -0.274 0.78396
## SibSp3 -2.073128 0.685422 -3.025 0.00249 **
## SibSp4 -1.668710 0.744919 -2.240 0.02508 *
## SibSp5 -16.004773 956.874492 -0.017 0.98666
## SibSp8 -15.833751 753.839723 -0.021 0.98324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 779.24 on 880 degrees of freedom
## AIC: 801.24
##
## Number of Fisher Scoring iterations: 15
Model m_LR2
which use stepwise has AIC of 801.24 and m_LR1
has AIC of 804.71. We’ll using model m_LR2
to predict.
Prediction
Let’s see the distribution of predicted data.
ggplot(data_test, aes(x=prob)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
From the plot, we can see that the data interpreted the distribution of prediction data lean more towards 0 which means Not Survived. If we set the threshold is 0.5 (by default). Let’s give the label of the prediction data by :
- If the probability is greater than 0.5, the label will get 1 (survived).
- If the probability is lower than 0.5, the label will get 0 (not survived)
data_test$pred_label <- factor(ifelse(data_test$prob > 0.5, 1, 0)) #1 = survived, 0 = not survived
data_test %>%
select(Survived, pred_label) %>%
head()
## Survived pred_label
## 1 0 0
## 2 1 0
## 3 0 0
## 4 0 0
## 5 1 1
## 6 0 0
K-NN Model
Because of K-NN is counting based on distance, we should make sure our data is only contain numeric data.
#Select data train and data test without PassengerId, Name, Sex, Ticket, Cabin, and Embarked
train_k <- train_clean %>%
select(-c(PassengerId, Name, Sex, Ticket, Cabin, Embarked))
test_k <- test_clean %>%
select(-c(PassengerId, Name, Sex, Ticket, Cabin, Embarked))
After select the predictor and target, let’s scale the data because the scale in some variable like fare
is not the same with Age
#scale the predictor and the target variable(Survived don't need to be scale)
train_x <- scale(train_k[, -1])
test_x <- scale(test_k[, -6], center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
Then, let’s create the target Variable Survived
#Create the target variable in data train and data test
train_y <- train_k$Survived
test_y <- test_k$Survived
Before we go ahead, let’s find the optimum k-number of KNN based on the number of observation in data train.
## [1] 29.84962
We, get the number of k = 29.84962 and we’ll use 29 for the k. Use the function knn()
from package class
to do modeling. In KNN, there is no need to do the prediction. We just need to make the model, and the prediction is automatically done in modeling.
Let’s see some sample of model_knn
## [1] 0 0 0 0 0 0 0 1 0 0
## Levels: 0 1
Evaluation Model
We have made model and prediction with logistic regression and K-NN model, let’s make some model evaluation to see if our model is good enough. Before we do some evaluation model, let’s get to know what metric will we value more.
TN : True Negative FP : False Positive FN : False Negative TP : True Positive
There is 4 types of metrics in evaluation model, such as :
- Accuracy: how precisely our model predicts the target class (globally)
(TP+TN/TOTAL)
- Sensitivity/ Recall: measure of the model’s goodness of the
positive
class
(TP/(TP+FN))
- Specificity: a measure of the model’s goodness of the
negative
class
(TN/(TN+FP))
- Pos Pred Value/Precision: how precision the model predicts positive classes
(TP/(TP+FP))
In this case, the positive class (1) is survived. We’ll use sensitivity / recall metric for model evaluation in this case, because we don’t wish to see the False Negative class (actually survived but the model predict not survived).
Logistic Regression
For model evaluation, we’ll use confusionMatrix()
function from package caret
confusionMatrix(reference = as.factor(data_test$Survived),
data = as.factor(data_test$pred_label), positive = "1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 191 5
## 1 13 122
##
## Accuracy : 0.9456
## 95% CI : (0.9154, 0.9675)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8864
##
## Mcnemar's Test P-Value : 0.09896
##
## Sensitivity : 0.9606
## Specificity : 0.9363
## Pos Pred Value : 0.9037
## Neg Pred Value : 0.9745
## Prevalence : 0.3837
## Detection Rate : 0.3686
## Detection Prevalence : 0.4079
## Balanced Accuracy : 0.9485
##
## 'Positive' Class : 1
##
From the confusion matrix of the prediction using logistic regression model, we can get information such as:
1. The model can predict correctly in positive class and negative class base on the real data is 94.56% (Accuracy)
2. The measurement of the model’s goodness of the positive
class (survived) / Sensitivity is 96.06%
3. Based on the Specificity metrics, we assume the model is good enough to predict the data test.
Model K-NN
Let’s try the model evaluation of K-NN model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 151 63
## 1 53 64
##
## Accuracy : 0.6495
## 95% CI : (0.5955, 0.7009)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 0.1173
##
## Kappa : 0.2478
##
## Mcnemar's Test P-Value : 0.4034
##
## Sensitivity : 0.5039
## Specificity : 0.7402
## Pos Pred Value : 0.5470
## Neg Pred Value : 0.7056
## Prevalence : 0.3837
## Detection Rate : 0.1934
## Detection Prevalence : 0.3535
## Balanced Accuracy : 0.6221
##
## 'Positive' Class : 1
##
From the confusion matrix of the prediction using K-NN model, we can get information such as:
1. The model can predict correctly in positive class and negative class base on the real data is 64.95% (Accuracy)
2. The measurement of the model’s goodness of the positive
class (survived) / Sensitivity is 50.39%
3. Based on the Specificity metrics, we assume the model has not good enough and we’ll try to tune the model a bit.
Model Improvement
In this case, logistic regression model has perdict the data test well. So, we’ll try to tune the model in K-NN. Let’s try tune the model by balancing the class target.
##
## 0 1
## 0.6161616 0.3838384
The proportion of survived
and not survived
in data train train_k
is not balance enough. Let’s do down-sampling to make it balance.
#do down sampling in train data
train_tune <- downSample(x = train_k[, -1], y = as.factor(train_k[, 1]), yname = "Survived")
rmarkdown::paged_table(train_tune)
The number of rows get decrease because we do down-sampling to balancing the proportion in class target Survived
. let’s check the proportion of class target.
##
## 0 1
## 0.5 0.5
Now, the propotion is balance enough. Let’s do the K-NN modeling and check the evaluation model. Is it get better result ? P.s : we only make some improvement in data train, and the data test in still the same.
#the column of Survived (target) is in column 6
train_xT <- scale(train_tune[, -6])
train_yT <- train_tune$Survived
#make K-NN modeling by the data which we have balancing and store in model_knnT
model_knnT <- knn(train = train_xT, test = test_x, cl = train_yT, k = 29)
#check the evaluation model again of model_knnT by using confusion Matrix
confusionMatrix(data = as.factor(model_knnT), reference = as.factor(test_y),
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104 39
## 1 100 88
##
## Accuracy : 0.5801
## 95% CI : (0.5249, 0.6338)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 0.9206
##
## Kappa : 0.1859
##
## Mcnemar's Test P-Value : 3.597e-07
##
## Sensitivity : 0.6929
## Specificity : 0.5098
## Pos Pred Value : 0.4681
## Neg Pred Value : 0.7273
## Prevalence : 0.3837
## Detection Rate : 0.2659
## Detection Prevalence : 0.5680
## Balanced Accuracy : 0.6014
##
## 'Positive' Class : 1
##
The previous K-NN model we have done model_knn
which the class target has not balance only got the score of Sensitivity is 50.39%. After we balancing the proportion of positive and negative class in target Survived
, we got the Sensitivity score of 65.35%. This K-NN model model_knnT
is better than model_knn
by evaluating with Sensitivity metric.
Conclusion
The titanic ship accident that occurred in 1912 might have caused trauma for Survivors. If I am a psychologist who want to help the Survivors overcome the trauma, I’ll use the Sensitivity metric. I can tolerate False Positive
rather than False Negative
. I don’t wish Survivor who actually survived, but we predict not survive. Therefore, we can get more people to get treatment. For the model of prediction, I rather use logistic regression rather than K-NN model for this case.
Reference
- [wikipedia] (https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic)
- [Kaggle] (https://www.kaggle.com/c/titanic)