We will try to predict the survival of Titanic passengers based on some of the predictor variables using Logistic Regression and K-Nearest Neighbor(KNN).
Load the required packages.
library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(tidyverse)
library(caret)
library(scales)
library(GGally)
library(class)The data that will be use is titanic data from Kaggle. You can download the data here
data <- read.csv("data_input/titanic.csv")
str(data)## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
head(data)## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
PassengerID : ID of the passenger survival: Survival status of the passenger (0 = No, 1 = Yes) pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) name : Name of the passenger sex : Gender of the passenger
Age : Age in years
sibsp : numbers of siblings / spouses aboard the Titanic
parch : numbers of parents / children aboard the Titanic
ticket : Ticket number
fare : Passenger fare
cabin : Cabin number
embarked : Port of Embarkation(C = Cherbourg, Q = Queenstown, S = Southampton)
colSums(is.na(data))## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There are missing values in our data so we need to delete those data.
data_clean <- na.omit(data)Re-check for missing values
colSums(is.na(data_clean))## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Great! There are no more missing values in our data.
We need to change the data types of some variables and also eliminate some of the irrelevant data for the model
data_clean <- data_clean %>%
mutate(Survived = as.factor(Survived)) %>%
mutate(Pclass = as.factor(Pclass)) %>%
mutate(Sex = as.factor(Sex)) %>%
mutate(Embarked = as.factor(Embarked)) %>%
dplyr::select(-PassengerId, -Name, -Ticket, -Cabin)Let’s see the proportion of the target variable
prop.table(table(data_clean$Survived))##
## 0 1
## 0.5938375 0.4061625
It seems that the proportion of the target variable is pretty balanced so we don’t need to additional pre-processing of the data to balance the proportion
We will split our data into train and test data with the proportion of 80% train data and 20% test data
RNGkind(sample.kind = "Rounding")
set.seed(417)
index_data_clean <- sample(nrow(data_clean), nrow(data_clean)*0.8)
data_train <- data_clean[index_data_clean,]
data_test <- data_clean[-index_data_clean,]Check the proportion of the train data
prop.table(table(data_train$Survived))##
## 0 1
## 0.5989492 0.4010508
head(data_clean)## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 22 1 0 7.2500 S
## 2 1 1 female 38 1 0 71.2833 C
## 3 1 3 female 26 0 0 7.9250 S
## 4 1 1 female 35 1 0 53.1000 S
## 5 0 3 male 35 0 0 8.0500 S
## 7 0 1 male 54 0 0 51.8625 S
We will create a logistic regression model using all predictor variables and named it model
model <- glm(formula = Survived~., family = "binomial",
data = data_train)
summary(model)##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6468 -0.6764 -0.4080 0.6712 2.3997
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.529034 609.237874 0.027 0.97836
## Pclass2 -1.009352 0.355580 -2.839 0.00453 **
## Pclass3 -2.207270 0.375362 -5.880 4.09e-09 ***
## Sexmale -2.540175 0.244564 -10.387 < 2e-16 ***
## Age -0.040920 0.009115 -4.489 7.15e-06 ***
## SibSp -0.365919 0.142287 -2.572 0.01012 *
## Parch -0.058835 0.134371 -0.438 0.66149
## Fare 0.001882 0.002871 0.655 0.51218
## EmbarkedC -12.377652 609.237721 -0.020 0.98379
## EmbarkedQ -13.202104 609.237953 -0.022 0.98271
## EmbarkedS -12.776693 609.237702 -0.021 0.98327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 769.06 on 570 degrees of freedom
## Residual deviance: 522.68 on 560 degrees of freedom
## AIC: 544.68
##
## Number of Fisher Scoring iterations: 13
In the first model, there are many insignificant predictor variables, therefore we will create a new model using stepwise method and we will name it model2
library(MASS)
model2 <- stepAIC(model, direction = "backward")## Start: AIC=544.68
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 3 525.69 541.69
## - Parch 1 522.87 542.87
## - Fare 1 523.14 543.14
## <none> 522.68 544.68
## - SibSp 1 529.88 549.88
## - Age 1 544.65 564.65
## - Pclass 2 562.00 580.00
## - Sex 1 655.64 675.64
##
## Step: AIC=541.69
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
##
## Df Deviance AIC
## - Parch 1 525.93 539.93
## - Fare 1 526.59 540.59
## <none> 525.69 541.69
## - SibSp 1 533.43 547.43
## - Age 1 549.12 563.12
## - Pclass 2 568.78 580.78
## - Sex 1 662.94 676.94
##
## Step: AIC=539.93
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 526.68 538.68
## <none> 525.93 539.93
## - SibSp 1 535.08 547.08
## - Age 1 549.28 561.28
## - Pclass 2 571.60 581.60
## - Sex 1 665.91 677.91
##
## Step: AIC=538.68
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 526.68 538.68
## - SibSp 1 535.33 545.33
## - Age 1 551.21 561.21
## - Pclass 2 603.60 611.60
## - Sex 1 668.90 678.90
The summary of the model2
summary(model2)##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7059 -0.6792 -0.4054 0.6654 2.4088
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.096828 0.489796 8.364 < 2e-16 ***
## Pclass2 -1.261508 0.307725 -4.099 4.14e-05 ***
## Pclass3 -2.502068 0.314380 -7.959 1.74e-15 ***
## Sexmale -2.528740 0.235241 -10.750 < 2e-16 ***
## Age -0.042459 0.008987 -4.724 2.31e-06 ***
## SibSp -0.376970 0.134639 -2.800 0.00511 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 769.06 on 570 degrees of freedom
## Residual deviance: 526.68 on 565 degrees of freedom
## AIC: 538.68
##
## Number of Fisher Scoring iterations: 5
We will create prediction of survival probability based on model2
data_test$prob_survival<-predict(model2, type = "response", newdata = data_test)We will create a plot to see the distribution of probability prediction data
ggplot(data_test, aes(x=prob_survival)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()In the plot above, we can see that the prediction result have a tendency towards 0 meaning there are more predictions of not survived than survived.
We will create a threshold of 0.5, meaning that if the prediction probability > 0.5 then it will be considered that the passengers survived.
data_test$pred_survival <- factor(ifelse(data_test$prob_survival> 0.5, 1 , 0))
data_test[1:10, c("pred_survival", "Survived")]## pred_survival Survived
## 4 1 1
## 9 1 1
## 10 1 1
## 12 1 1
## 23 1 1
## 24 1 1
## 28 0 0
## 40 1 1
## 42 1 0
## 51 0 0
We will create a confusion matrix to evaluate the model
log_conf <- confusionMatrix(data_test$pred_survival, data_test$Survived, positive = "1")
log_conf## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 73 14
## 1 9 47
##
## Accuracy : 0.8392
## 95% CI : (0.7685, 0.8952)
## No Information Rate : 0.5734
## P-Value [Acc > NIR] : 9.57e-12
##
## Kappa : 0.6677
##
## Mcnemar's Test P-Value : 0.4042
##
## Sensitivity : 0.7705
## Specificity : 0.8902
## Pos Pred Value : 0.8393
## Neg Pred Value : 0.8391
## Prevalence : 0.4266
## Detection Rate : 0.3287
## Detection Prevalence : 0.3916
## Balanced Accuracy : 0.8304
##
## 'Positive' Class : 1
##
Accuracy = refers to how accurate the model can correctly predict the target variable. Re-call/Sensitivity = refers to the probability of a positive test when the actual outcome is positive Specificity = refers to the probability of a negative test when the actual outcome is negative Precision = refers to how many of our positive prediction is correct
\[ Accuracy=\frac {TP+TN}{TP+TN+FP+FN}\\ Sensitivity = \frac {TP}{TP+FN}\\ Specificity = \frac {TN}{TN+FP}\\ Precision = \frac {TP}{TP+FP} \] Logistic regression produced coefficients in the form of log of odds. To interpret the coefficients from the model, we need to transforms the coefficients into odds.
exp(model2$coefficients) %>%
data.frame() ## .
## (Intercept) 60.14918886
## Pclass2 0.28322648
## Pclass3 0.08191538
## Sexmale 0.07975942
## Age 0.95842936
## SibSp 0.68593653
Interpretation: -Pclass2: Passengers in 2nd class is 0.28 times more likely to survived -Pclass3: Passengers in 3rd class is 0.08 times more likely to survived -Sexmale: Male passengers is 0.08 times more likely to survived -Age: For every 1 year increase in age, the passenger is 0.95 times more likely to survived -SibSp: For every 1 sibling, the passenger is 0.68 times more likely to survived
We will create dummy variables for categorical data since K-NN is not suitable to use with categorical data
dmy <- dummyVars(" ~Survived+Sex+Pclass+Embarked", data = data_clean)
dmy <- data.frame(predict(dmy, newdata = data_clean))
str(dmy)## 'data.frame': 714 obs. of 11 variables:
## $ Survived.0: num 1 0 0 0 1 1 1 0 0 0 ...
## $ Survived.1: num 0 1 1 1 0 0 0 1 1 1 ...
## $ Sex.female: num 0 1 1 1 0 0 0 1 1 1 ...
## $ Sex.male : num 1 0 0 0 1 1 1 0 0 0 ...
## $ Pclass.1 : num 0 1 0 1 0 1 0 0 0 0 ...
## $ Pclass.2 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Pclass.3 : num 1 0 1 0 1 0 1 1 0 1 ...
## $ Embarked. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Embarked.C: num 0 1 0 0 0 0 0 0 1 0 ...
## $ Embarked.Q: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Embarked.S: num 1 0 1 1 1 1 1 1 0 1 ...
We will delete the dummy variables that only consisted of 2 categories
dmy$Survived.0 <- NULL
dmy$Sex.female <- NULLLet’s check the names of the dummy variables
names(dmy)## [1] "Survived.1" "Sex.male" "Pclass.1" "Pclass.2" "Pclass.3"
## [6] "Embarked." "Embarked.C" "Embarked.Q" "Embarked.S"
Create train and test data from the dummy variables
set.seed(417)
dmy_train <- dmy[index_data_clean,2:9]
dmy_test <- dmy[-index_data_clean,2:9]
dmy_train_label <- dmy[index_data_clean,1]
dmy_test_label <- dmy[-index_data_clean,1]Picking optimum K
sqrt(nrow(data_train))## [1] 23.89561
pred_knn <- class::knn(train = dmy_train,
test = dmy_test,
cl = dmy_train_label,
k = 24)pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label),"1")
pred_knn_conf## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 26
## 1 2 35
##
## Accuracy : 0.8042
## 95% CI : (0.7296, 0.8658)
## No Information Rate : 0.5734
## P-Value [Acc > NIR] : 4.642e-09
##
## Kappa : 0.5785
##
## Mcnemar's Test P-Value : 1.383e-05
##
## Sensitivity : 0.5738
## Specificity : 0.9756
## Pos Pred Value : 0.9459
## Neg Pred Value : 0.7547
## Prevalence : 0.4266
## Detection Rate : 0.2448
## Detection Prevalence : 0.2587
## Balanced Accuracy : 0.7747
##
## 'Positive' Class : 1
##
Comparing the results of the confusion matrix from Logistic Regression and K-NN model that we have created
eval_logit <- data_frame(Accuracy = log_conf$overall[1],
Recall = log_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = log_conf$byClass[3])
eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
Recall = pred_knn_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = pred_knn_conf$byClass[3])eval_logit## # A tibble: 1 x 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.839 0.770 0.890 0.839
eval_knn## # A tibble: 1 x 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.804 0.574 0.890 0.946
Logistic regression performs better in all metrics compared to K-NN except for precision. In this case, the evaluation metrics will be precision because precision measures how many of our positive prediction is correct. K-NN has 94.5% precision, meanwhile Logistic regression only has 83.9% precision. Therefore, K-NN is the better model in predicting the survival chance of titanic passengers.