The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this project, we will try to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data from kaggle. The algorithm that we will use are logistic regression and k-nearest neighbor which is included in supervised learning.
library(dplyr)#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(MASS)#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
library(caret)#> Loading required package: ggplot2
#> Loading required package: lattice
titanic <- read.csv('data_input/train.csv')
titanicglimpse(titanic)#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
summary(titanic)#> PassengerId Survived Pclass Name
#> Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
#> 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
#> Median :446.0 Median :0.0000 Median :3.000 Mode :character
#> Mean :446.0 Mean :0.3838 Mean :2.309
#> 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
#> Max. :891.0 Max. :1.0000 Max. :3.000
#>
#> Sex Age SibSp Parch
#> Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
#> Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
#> Mode :character Median :28.00 Median :0.000 Median :0.0000
#> Mean :29.70 Mean :0.523 Mean :0.3816
#> 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
#> Max. :80.00 Max. :8.000 Max. :6.0000
#> NA's :177
#> Ticket Fare Cabin Embarked
#> Length:891 Min. : 0.00 Length:891 Length:891
#> Class :character 1st Qu.: 7.91 Class :character Class :character
#> Mode :character Median : 14.45 Mode :character Mode :character
#> Mean : 32.20
#> 3rd Qu.: 31.00
#> Max. :512.33
#>
Parch, min value to Q3 tends to 0, meaning that the data is not very informative (it may not be used later in modeling)
colSums(is.na(titanic))#> PassengerId Survived Pclass Name Sex Age
#> 0 0 0 0 0 177
#> SibSp Parch Ticket Fare Cabin Embarked
#> 0 0 0 0 0 0
The Age column has 177 missing values, we will fill these missing values with the median value of Age
Glossary data titanic:
PassengerId : Unique ID Number for each Titanic passenger (out of 831)Survived : Survival (0 = No, 1 = Yes)Pclass : Ticket Class (1= Upper, 2= Middle, 3= Low)Name : Unique Name of each Titanic passengerSex : Female or MaleAge : Age in year (out of 80)SibSp : Number of siblings / spouses aboard the Titanic (out of 8)Parch : Number of parents . children aboard the Titanic (out of 6)Ticket : Unique Ticket Number (out of 831)Fare : Passenger Fare (out of 512.33)Cabin : Cabin NumberEmbarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)Data Resume: - The data has 891 rows and 12 columns. - We can ignore PassengerId, Ticket, Name because it cannot be classified as a category or number that can affect the model. - Our target variable is Survived. - Survived, Pclass, Sex data type can be changed into a factor.
titanic <- titanic %>%
mutate(Embarked_S = ifelse(Embarked == "S", 1, 0), #create column Embarked_S
Embarked_C = ifelse(Embarked == "C", 1, 0), #create column Embarked_C
Embarked_Q = ifelse(Embarked == "Q", 1, 0), #create column Embarked_Q
Sex = ifelse(Sex == "female", 0, 1),
Age = replace_na(Age, replace = round(median(Age, na.rm = T))), #fill missing value with mean
Age = as.integer(Age)) %>%
dplyr::select(-c(PassengerId, Ticket, Cabin, Name, Embarked)) %>%
mutate_at(c('Survived', 'Pclass', 'Sex', 'Embarked_S', 'Embarked_C', 'Embarked_Q'), as.factor)glimpse(titanic)#> Rows: 891
#> Columns: 10
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3,~
#> $ Sex <fct> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,~
#> $ Age <int> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0,~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0,~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
#> $ Embarked_S <fct> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,~
#> $ Embarked_C <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
#> $ Embarked_Q <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
anyNA(titanic)#> [1] FALSE
prop.table(table(titanic$Survived))#>
#> 0 1
#> 0.6161616 0.3838384
The target variable has a fairly balanced proportion, doesn’t need additional pre-processing.
RNGkind(sample.kind = "Rounding")
set.seed(303)
index <- sample(nrow(titanic), nrow(titanic)*0.7)
#Data Splitting
t_train <- titanic[index,]
t_test <- titanic[-index,]model_all <- glm(formula = Survived ~ ., family = "binomial", data = t_train)
summary(model_all)#>
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = t_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6390 -0.6078 -0.4148 0.5991 2.3877
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 15.708014 535.411546 0.029 0.9766
#> Pclass2 -0.837625 0.359627 -2.329 0.0199 *
#> Pclass3 -2.021656 0.358867 -5.633 0.0000000177 ***
#> Sex1 -2.755338 0.244294 -11.279 < 0.0000000000000002 ***
#> Age -0.039604 0.009294 -4.261 0.0000203320 ***
#> SibSp -0.370450 0.135074 -2.743 0.0061 **
#> Parch -0.195371 0.160192 -1.220 0.2226
#> Fare 0.003919 0.003151 1.244 0.2136
#> Embarked_S1 -12.010468 535.411300 -0.022 0.9821
#> Embarked_C1 -11.445806 535.411328 -0.021 0.9829
#> Embarked_Q1 -11.598464 535.411409 -0.022 0.9827
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 840.01 on 622 degrees of freedom
#> Residual deviance: 546.78 on 612 degrees of freedom
#> AIC: 568.78
#>
#> Number of Fisher Scoring iterations: 12
model_back <- stepAIC(model_all, direction = "backward", trace = F)
summary(model_back)#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked_S,
#> family = "binomial", data = t_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6255 -0.5960 -0.4075 0.6132 2.4064
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.455798 0.501566 8.884 < 0.0000000000000002 ***
#> Pclass2 -1.052548 0.318168 -3.308 0.000939 ***
#> Pclass3 -2.282750 0.291912 -7.820 0.00000000000000528 ***
#> Sex1 -2.714222 0.235121 -11.544 < 0.0000000000000002 ***
#> Age -0.039171 0.009164 -4.275 0.00001914625862058 ***
#> SibSp -0.389312 0.127386 -3.056 0.002242 **
#> Embarked_S1 -0.573857 0.246259 -2.330 0.019790 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 840.01 on 622 degrees of freedom
#> Residual deviance: 549.94 on 616 degrees of freedom
#> AIC: 563.94
#>
#> Number of Fisher Scoring iterations: 5
t_test$prob_survive<-predict(model_back, type = "response", newdata = t_test)t_test$pred_survive <- factor(ifelse(t_test$prob_survive > 0.5, 1, 0))
t_test[1:10, c("pred_survive", "Survived")]At a glance the prediction results have the same value as the actual data.
log_conf <- confusionMatrix(t_test$pred_survive, t_test$Survived, positive = "1")
log_conf#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 145 24
#> 1 32 67
#>
#> Accuracy : 0.791
#> 95% CI : (0.7374, 0.8381)
#> No Information Rate : 0.6604
#> P-Value [Acc > NIR] : 0.000001877
#>
#> Kappa : 0.5439
#>
#> Mcnemar's Test P-Value : 0.3496
#>
#> Sensitivity : 0.7363
#> Specificity : 0.8192
#> Pos Pred Value : 0.6768
#> Neg Pred Value : 0.8580
#> Prevalence : 0.3396
#> Detection Rate : 0.2500
#> Detection Prevalence : 0.3694
#> Balanced Accuracy : 0.7777
#>
#> 'Positive' Class : 1
#>
eval_logit <- data_frame(Accuracy = log_conf$overall[1],
Recall = log_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = log_conf$byClass[3]) %>% print()#> # A tibble: 1 x 4
#> Accuracy Recall Specificity Precision
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.791 0.736 0.819 0.677
titanic2 <- titanic #duplicate dataset
for (i in names(titanic2)){
if(is.factor(titanic2[,i])){
titanic2[,i] = as.numeric(titanic2[,i])
}else if (is.character(titanic2[,i])){
titanic2 <- dplyr::select(titanic2,-i)
}
}titanic2 <- titanic2 %>%
mutate(Survived = as.factor(Survived))glimpse(titanic2)#> Rows: 891
#> Columns: 10
#> $ Survived <fct> 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2,~
#> $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3,~
#> $ Sex <dbl> 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1,~
#> $ Age <int> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0,~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0,~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
#> $ Embarked_S <dbl> 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1,~
#> $ Embarked_C <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,~
#> $ Embarked_Q <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,~
RNGkind(sample.kind = "Rounding")
set.seed(303)
knn_train <- titanic2[index,]
knn_test <- titanic2[-index,]#Predictor
knn_train_x <- knn_train %>% select_if(is.numeric)
knn_test_x <- knn_test %>% select_if(is.numeric)
#Target
knn_train_y <- knn_train[,'Survived']
knn_test_y <- knn_test[,'Survived']knn_train_xs <- scale(knn_train_x)
knn_test_xs <- scale(knn_test_x,
center = attr(knn_train_xs, "scaled:center"),
scale = attr(knn_train_xs, "scaled:scale"))sqrt(nrow(knn_train_xs))#> [1] 24.95997
pred_knn <- class::knn(train=knn_train_xs, test=knn_test_xs, cl=knn_train_y,k=25)pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(knn_test_y),"2")
pred_knn_conf#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 155 27
#> 2 22 64
#>
#> Accuracy : 0.8172
#> 95% CI : (0.7656, 0.8616)
#> No Information Rate : 0.6604
#> P-Value [Acc > NIR] : 0.000000009161
#>
#> Kappa : 0.5868
#>
#> Mcnemar's Test P-Value : 0.5677
#>
#> Sensitivity : 0.7033
#> Specificity : 0.8757
#> Pos Pred Value : 0.7442
#> Neg Pred Value : 0.8516
#> Prevalence : 0.3396
#> Detection Rate : 0.2388
#> Detection Prevalence : 0.3209
#> Balanced Accuracy : 0.7895
#>
#> 'Positive' Class : 2
#>
eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
Recall = pred_knn_conf$byClass[1],
Specificity = pred_knn_conf$byClass[2],
Precision = pred_knn_conf$byClass[3]) %>% print()#> # A tibble: 1 x 4
#> Accuracy Recall Specificity Precision
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.817 0.703 0.876 0.744
# Model Evaluation Logit
eval_logit# Model Evaluation K-NN
eval_knnBased on the performance results of the two methods (Logistics Regression and K-NN), the model’s ability to correctly predict the actual data of survivors is better by using the K-NN method in terms of Accuracy, Specifity, and Precision. But if you refer to the Recall value, Logistic Regression is better with a value of 73,6%.