In this post, we build a predictive model to determine whether a passenger of the Titanic survived the sinking of the ship. The model answers questions like "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, gender, socio-economic class, etc.). Logistic regression and K-Nearest Neighbor (K-NN) are used as the classification methods. The dataset is split into two groups, a training set and a test set, taken from here.
Load libraries
# libraries
library(readr)
library(tidyverse)
library(gtools)
library(class)
library(caret)
library(e1071)
library(car)
library(rsample)
Read data train.csv
titanic_train <- read.csv("train.csv")
head(titanic_train)
Description:
PassengerId: Row ID in the data set
Survived: If a passenger survived or not (0 = No, 1 = Yes)
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: Passenger's name
Sex: Passenger's gender (male or female)
Age: Passenger's age (in years)
SibSp: # of siblings / spouses aboard the Titanic
Parch: # of parents / children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Check missing values
colSums(is.na(titanic_train))
#> PassengerId Survived Pclass Name Sex Age
#> 0 0 0 0 0 177
#> SibSp Parch Ticket Fare Cabin Embarked
#> 0 0 0 0 0 0
Since the Age variable has missing values, we will keep only the rows that have a value for Age. This matters because we do not want missing data to interfere with the model later on. We will do this in the data wrangling section.
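For reference, an alternative to dropping rows is simple imputation; below is a minimal sketch (hypothetical, not used in this post) that fills missing Age with the median age:
# alternative (not used here): impute missing Age with the median age
titanic_imputed <- titanic_train %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))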
Data wrangling
glimpse(titanic_train)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
# data selection and transformation
titanic_train_clean <- titanic_train %>%
  filter(!is.na(Age), Embarked != "") %>% # drop rows with missing Age or empty Embarked
  select(-c(PassengerId, Name, Ticket, Cabin)) %>% # drop columns that are not useful
  mutate(
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked)
  )
str(titanic_train_clean)
#> 'data.frame': 712 obs. of 8 variables:
#> $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
#> $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
#> $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
#> $ SibSp : int 1 1 0 1 0 0 3 0 1 1 ...
#> $ Parch : int 0 0 0 0 0 0 1 2 0 1 ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 3 3 3 1 3 ...
Check missing values after data wrangling
colSums(is.na(titanic_train_clean))
#> Survived Pclass Sex Age SibSp Parch Fare Embarked
#> 0 0 0 0 0 0 0 0
summary(titanic_train_clean)
#> Survived Pclass Sex Age SibSp Parch
#> 0:424 1:184 female:259 Min. : 0.42 Min. :0.000 Min. :0.0000
#> 1:288 2:173 male :453 1st Qu.:20.00 1st Qu.:0.000 1st Qu.:0.0000
#> 3:355 Median :28.00 Median :0.000 Median :0.0000
#> Mean :29.64 Mean :0.514 Mean :0.4326
#> 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:1.0000
#> Max. :80.00 Max. :5.000 Max. :6.0000
#> Fare Embarked
#> Min. : 0.00 C:130
#> 1st Qu.: 8.05 Q: 28
#> Median : 15.65 S:554
#> Mean : 34.57
#> 3rd Qu.: 33.00
#> Max. :512.33
prop.table(table(titanic_train_clean$Survived))
#>
#> 0 1
#> 0.5955056 0.4044944
Based on the proportions above, the classes of the target variable (Survived) are balanced enough, so we do not need additional data pre-processing to balance them.
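For reference, if the classes had been heavily imbalanced, one common remedy is downsampling the majority class; a minimal sketch using caret's downSample() (hypothetical, not needed here):
# not needed here: downsample the majority class with caret
titanic_balanced <- downSample(x = titanic_train_clean %>% select(-Survived),
                               y = titanic_train_clean$Survived,
                               yname = "Survived")
prop.table(table(titanic_balanced$Survived)) # proportions become 50:50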
When doing prediction, we need to show that the model generalizes to new data. Thus, we have to evaluate the model on unseen data, so we split the data into train data and test data.
Let's use 80% of the data as train data and the rest as test data.
RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(titanic_train_clean), nrow(titanic_train_clean)*0.8)
titanic_train <- titanic_train_clean[index,]
titanic_test <- titanic_train_clean[-index,]
# re-check class balance
table(titanic_train$Survived)
#>
#> 0 1
#> 343 226
Build the logistic regression model using all predictors
titanic_survival <- glm(Survived ~ ., data = titanic_train, family = "binomial")
summary(titanic_survival)
#>
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6780 -0.6943 -0.4103 0.6845 2.3938
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.050751 0.583123 6.947 0.00000000000374 ***
#> Pclass2 -1.150354 0.356663 -3.225 0.00126 **
#> Pclass3 -2.376974 0.373226 -6.369 0.00000000019061 ***
#> Sexmale -2.491100 0.244823 -10.175 < 0.0000000000000002 ***
#> Age -0.037553 0.009033 -4.157 0.00003221446050 ***
#> SibSp -0.261659 0.141787 -1.845 0.06497 .
#> Parch -0.027426 0.131446 -0.209 0.83473
#> Fare 0.001380 0.002765 0.499 0.61754
#> EmbarkedQ -1.019041 0.697578 -1.461 0.14406
#> EmbarkedS -0.310492 0.302544 -1.026 0.30476
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 764.57 on 568 degrees of freedom
#> Residual deviance: 522.99 on 559 degrees of freedom
#> AIC: 542.99
#>
#> Number of Fisher Scoring iterations: 5
💡 Interpretation:
Convert the log of odds to odds
exp(titanic_survival$coefficients)
#> (Intercept) Pclass2 Pclass3 Sexmale Age SibSp
#> 57.44055420 0.31652480 0.09283107 0.08281880 0.96314357 0.76977338
#> Parch Fare EmbarkedQ EmbarkedS
#> 0.97294705 1.00138142 0.36094098 0.73308636
Holding the other variables constant, the odds of survival for passengers in ticket class 2 are 0.31652480 times the odds for passengers in ticket class 1.
Holding the other variables constant, the odds of survival for male passengers are 0.08281880 times the odds for female passengers.
Holding the other variables constant, the odds of survival for passengers who embarked at port S (Southampton) are 0.73308636 times the odds for passengers who embarked at port C (Cherbourg).
Holding the other variables constant, each additional year of age multiplies the odds of survival by 0.96314357.
Holding the other variables constant, each additional sibling/spouse aboard multiplies the odds of survival by 0.76977338.
Holding the other variables constant, each additional parent/child aboard multiplies the odds of survival by 0.97294705.
Holding the other variables constant, each 1-point increase in passenger fare multiplies the odds of survival by 1.00138142.
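To make these odds concrete, the model's log of odds for a single passenger can be converted to a survival probability; a minimal sketch for one hypothetical passenger profile (the profile below is made up for illustration):
# hypothetical passenger: 2nd class, male, age 30, 1 sibling/spouse aboard,
# no parents/children, fare 20, embarked at S
new_passenger <- data.frame(Pclass = factor(2, levels = c(1, 2, 3)),
                            Sex = factor("male", levels = c("female", "male")),
                            Age = 30, SibSp = 1, Parch = 0, Fare = 20,
                            Embarked = factor("S", levels = c("C", "Q", "S")))
log_odds <- predict(titanic_survival, newdata = new_passenger, type = "link")
exp(log_odds) / (1 + exp(log_odds)) # equals predict(..., type = "response")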
Looking at the Pr(>|z|) values, the significant predictor variables are Pclass, Sex, and Age.
Let's try stepwise regression to find a more parsimonious model.
# logistic regression with step
step(titanic_survival, direction = "backward", trace = 0)
#>
#> Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
#> data = titanic_train)
#>
#> Coefficients:
#> (Intercept) Pclass2 Pclass3 Sexmale Age SibSp
#> 3.96186 -1.32407 -2.60295 -2.46487 -0.03837 -0.25703
#>
#> Degrees of Freedom: 568 Total (i.e. Null); 563 Residual
#> Null Deviance: 764.6
#> Residual Deviance: 525.9 AIC: 537.9
# save model resulted from stepwise
titanic_survival2 <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
data = titanic_train)
summary(titanic_survival2)
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
#> data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7034 -0.6870 -0.4194 0.6725 2.4041
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.961858 0.489064 8.101 0.000000000000000546 ***
#> Pclass2 -1.324069 0.311397 -4.252 0.000021184394278967 ***
#> Pclass3 -2.602950 0.315110 -8.260 < 0.0000000000000002 ***
#> Sexmale -2.464869 0.235042 -10.487 < 0.0000000000000002 ***
#> Age -0.038373 0.008942 -4.291 0.000017773245301492 ***
#> SibSp -0.257034 0.132223 -1.944 0.0519 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 764.57 on 568 degrees of freedom
#> Residual deviance: 525.95 on 563 degrees of freedom
#> AIC: 537.95
#>
#> Number of Fisher Scoring iterations: 5
With the model from the stepwise selection, we predict on the test data we set aside earlier.
# predict probability of survived and save to a new column in test data
titanic_test$pred.Survived <- predict(object = titanic_survival2, newdata = titanic_test, type = "response")
Classify the titanic_test data based on pred.Survived and save the result to a new column named pred.Label
titanic_test$pred.Label <- ifelse(titanic_test$pred.Survived > 0.5, 1, 0) %>% as.factor()
# see the prediction result
titanic_test %>% select(pred.Survived, pred.Label, Survived)
Let's look at the probability distribution of the predicted data.
ggplot(titanic_test, aes(x = pred.Survived)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()
💡 Insight: The graph shows that the predicted probabilities are skewed toward 0, meaning the model leans toward predicting 0 (did not survive) on new data.
Even a fitted model can still predict wrongly, so we evaluate it with a confusion matrix.
There are 4 metrics to evaluate a model's performance (each is computed below; see also the sketch after this list):
* Recall/Sensitivity: how well the model predicts the positive class of the target variable
* Specificity: how well the model predicts the negative class of the target variable
* Accuracy: how well the model predicts the target variable overall
* Precision: how precisely the model predicts the positive class of the target variable
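As a quick reference, all four metrics can be computed from the cell counts of a confusion matrix; a minimal sketch with a hypothetical helper (TP, TN, FP, FN taken from the matrix below):
# hypothetical helper: the four metrics from confusion matrix counts
eval_metrics <- function(TP, TN, FP, FN) {
  c(Accuracy    = (TP + TN) / (TP + TN + FP + FN),
    Recall      = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    Precision   = TP / (TP + FP))
}
eval_metrics(TP = 52, TN = 71, FP = 10, FN = 10)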
log_conf <- confusionMatrix(data=titanic_test$pred.Label, reference=titanic_test$Survived, positive="1")
log_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 71 10
#> 1 10 52
#>
#> Accuracy : 0.8601
#> 95% CI : (0.7923, 0.9124)
#> No Information Rate : 0.5664
#> P-Value [Acc > NIR] : 0.00000000000003934
#>
#> Kappa : 0.7153
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.8387
#> Specificity : 0.8765
#> Pos Pred Value : 0.8387
#> Neg Pred Value : 0.8765
#> Prevalence : 0.4336
#> Detection Rate : 0.3636
#> Detection Prevalence : 0.4336
#> Balanced Accuracy : 0.8576
#>
#> 'Positive' Class : 1
#>
Accuracy: how well the model predicts the target class (positive or negative). It is used when the positive and negative classes are equally important or the class proportions are balanced.
(TP+TN)/TOTAL
(52+71)/(71+10+10+52)
#> [1] 0.8601399
Sensitivity/Recall: how many of the actually positive cases are predicted as positive.
TP/(TP+FN)
52/(52+10)
#> [1] 0.8387097
Pos Pred Value/Precision: how many of the predicted-positive cases are actually positive.
TP/(TP+FP)
52/(52+10)
#> [1] 0.8387097
In this Titanic survival case, we want to avoid labeling passengers as survivors when they actually did not survive. Therefore, we use Precision as the metric, to minimize False Positives (passengers who did not survive but were predicted as survived).
Specificity: how many of the actually negative cases are predicted as negative.
TN/(TN+FP)
71/(71+10)
#> [1] 0.8765432
There are 3 assumptions in Logistic Regression:
* No Multicollinearity: the predictor variables do not interfere with each other
* Independence of Observations: the observations are independent of each other
* Linearity of Predictor & Log of Odds: each predictor variable has a linear relationship with the log of odds
vif(titanic_survival2)
#> GVIF Df GVIF^(1/(2*Df))
#> Pclass 1.430189 2 1.093574
#> Sex 1.105275 1 1.051321
#> Age 1.464801 1 1.210289
#> SibSp 1.138885 1 1.067186
All GVIF values are close to 1 (well below the usual threshold of 10), so there is no multicollinearity between predictors: the assumption is fulfilled.
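The linearity of predictor and log of odds assumption can be eyeballed with a plot; a rough sketch (assuming titanic_survival2 and titanic_train from above) for the Age predictor:
# rough visual check: Age vs the model's log of odds (linear predictor)
titanic_train %>%
  mutate(log_odds = predict(titanic_survival2, type = "link")) %>%
  ggplot(aes(x = Age, y = log_odds)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(title = "Age vs Log of Odds") +
  theme_minimal()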
K-Nearest Neighbor (K-NN)
This method classifies new data by comparing the test data with the train data. How close the characteristics of the two are is measured with Euclidean distance. The method then selects the k closest neighbors of the new data point and decides its class by majority voting.
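For intuition, here is a toy computation of the Euclidean distance between two hypothetical passengers described by two numeric features:
# toy example: Euclidean distance between two hypothetical passengers
p1 <- c(Age = 22, Fare = 7.25)
p2 <- c(Age = 38, Fare = 71.28)
sqrt(sum((p1 - p2)^2)) # the distance k-NN compares (scaling matters, see below)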
set.seed(100) # seed for a reproducible split
index <- initial_split(data=titanic_train_clean, # beginning data before splitting
prop = 0.8, # split proportion 80:20
strata = Survived) # class label
titanic_knn_train <- training(index)
titanic_knn_test <- testing(index)
# check proportion
prop.table(table(titanic_knn_train$Survived))
#>
#> 0 1
#> 0.5957821 0.4042179
prop.table(table(titanic_knn_test$Survived))
#>
#> 0 1
#> 0.5944056 0.4055944
# train data predictor
train_titanic_x <- titanic_knn_train %>% select_if(is.numeric) # select only numeric column for scaling
# train data target
train_titanic_y <- titanic_knn_train %>% select(Survived) # select target class only
# test data predictor
test_titanic_x <- titanic_knn_test %>% select_if(is.numeric)
# test data target
test_titanic_y <- titanic_knn_test %>% select(Survived)
Scaling for test data and train data
# run this scaling step only once
train_titanic_x <- scale(train_titanic_x)
test_titanic_x <- scale(test_titanic_x,
                        center = attr(train_titanic_x, "scaled:center"), # mean of train data
                        scale = attr(train_titanic_x, "scaled:scale")) # std deviation of train data
Select a k value based on the number of train data rows
sqrt(nrow(train_titanic_x))
#> [1] 23.85372
Since there are 2 target classes (survived or not survived), we pick an odd k to avoid tied votes: k = 23, the nearest odd number to sqrt(n), for our model.
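Before committing to k = 23, a quick sanity check over neighboring odd values can be informative; a minimal sketch (assuming the scaled matrices from above):
# sketch: compare test accuracy for a few odd k values around sqrt(n)
for (k in c(21, 23, 25)) {
  pred <- knn(train = train_titanic_x, test = test_titanic_x,
              cl = train_titanic_y$Survived, k = k)
  print(paste("k =", k, "accuracy:", round(mean(pred == test_titanic_y$Survived), 4)))
}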
# create k-NN model
titanic_knn_pred <- knn(train = train_titanic_x, # train data predictor
test = test_titanic_x, # test data predictor
cl = train_titanic_y$Survived, # train data target
                        k = 23) # number of neighbors used for classification
knn_conf <- confusionMatrix(data = titanic_knn_pred, reference = test_titanic_y$Survived, positive = "1")
knn_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 74 30
#> 1 11 28
#>
#> Accuracy : 0.7133
#> 95% CI : (0.6318, 0.7858)
#> No Information Rate : 0.5944
#> P-Value [Acc > NIR] : 0.002108
#>
#> Kappa : 0.3727
#>
#> Mcnemar's Test P-Value : 0.004937
#>
#> Sensitivity : 0.4828
#> Specificity : 0.8706
#> Pos Pred Value : 0.7179
#> Neg Pred Value : 0.7115
#> Prevalence : 0.4056
#> Detection Rate : 0.1958
#> Detection Prevalence : 0.2727
#> Balanced Accuracy : 0.6767
#>
#> 'Positive' Class : 1
#>
# put each model's metrics side by side (tibble replaces the deprecated data_frame)
eval_logit <- tibble(Accuracy = log_conf$overall[1],
                     Recall = log_conf$byClass[1],
                     Specificity = log_conf$byClass[2],
                     Precision = log_conf$byClass[3])
eval_knn <- tibble(Accuracy = knn_conf$overall[1],
                   Recall = knn_conf$byClass[1],
                   Specificity = knn_conf$byClass[2],
                   Precision = knn_conf$byClass[3])
# Model Evaluation Logit
eval_logit
# Model Evaluation K-NN
eval_knn
According to the evaluation above, the Logistic Regression model has better precision than the k-NN model (83.87% vs 71.79%). This means Logistic Regression is the better model for predicting whether or not a Titanic passenger survived the sinking of the ship.