In this research we will analyze the survivability of a titanic passenger during the tragic disaster of the cruise ship. Many will consider the survival of the person is a luck or even a miracle, it can be true, but in this case we will try to interpret those miracle into information based on the provided data. Hopefully we can undertand the parameters that are crucial to the survival of a passenger.
But first we will need to load the necessary packages to help this research. Those packages must be installed first to be able to be load as below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
The dataset we are using today are provided from Kaggle containing passenger information from the Titanic, more specifically their data regarding their time at the ship.
data_titanic <- read.csv("titanic/train.csv")
str(data_titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Detailed information on the column:
Survived : Passenger survival, 0 = No; 1 = YesPclass : Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rdName : Passenger NameSex : Passenger SexAge : Passenger AgeSibSp : Number of siblings / spouses aboard the
TitanicParch : Number of parents / children aboard the
TitanicTicket : Ticket NumberFare : The ticket priceCabin : Cabin numberEmbarked : Port of Embarkation, C = Cherbourg; Q =
Queenstown; S = SouthamptonIn the provided dataset there are some unmatched data type, in other words it is still poorly presented. A adjustment that are needed to be done.
data_titanic <- data_titanic %>%
mutate_at(vars(Survived, Pclass,Sex), as.factor) %>%
mutate(Survived = factor(Survived, levels = c(0,1), labels = c("Not Survived", "Survived")))
glimpse(data_titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <fct> Not Survived, Survived, Survived, Survived, Not Survived, …
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <fct> male, female, female, female, male, male, male, male, fema…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Next step is to recognize the missing data in the dataset
colSums(is.na(data_titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
It is observable that in the Age column there are significant amount of missing data. But we decide to drop it all because in our consideration that the Age column is such a determinant factor in deciding the rate of survival in each passenger.
data_titanic <- data_titanic[!is.na(data_titanic$Age),]
colSums(is.na(data_titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Our target variable is in the Survival column wheter the
passenger survived or not. Therefore it is essential to make sure the
dataset in balanced so that the model is sensitive to both target class
(0=Not Survived, 1=Survived).
prop.table(table(data_titanic$Survived))
##
## Not Survived Survived
## 0.5938375 0.4061625
table(data_titanic$Survived)
##
## Not Survived Survived
## 424 290
As we can see, the data is fairly balance, so we can continue using this dataset without anymore action about skewed data
From the given predictor, there has been an attention to the Cabin column, it contain an information about a passenger’s cabin in string. We will process the string data to extract the number of cabin a passenger has. We suspect it can be a predicting factor for out model.
data_titanic$totalCabin <- ifelse(test = data_titanic$Cabin == "", yes = 0, no=sapply(strsplit(data_titanic$Cabin, " "), length))
We will devide the dataset into two parts. The first one served as the trainning data, that will be the base of the prediction model. And the second part is the testing data, a unseen data for the model so that we can evaluate the model’s performance.
set.seed(303)
intrain <- sample(nrow(data_titanic), nrow(data_titanic)*0.8)
titanic_train <- data_titanic[intrain,]
titanic_test <- data_titanic[-intrain,]
names(data_titanic)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "totalCabin"
The step where we actually make the prediction model using logistic
regression or in this case using glm() function. We will
not use all of the variables since not all of the variable are
considered important to predict the survavibility of a passenger.
titanic_model <- glm(formula = Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+totalCabin, family = "binomial",
data = titanic_train)
summary(titanic_model)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked + totalCabin, family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5696 -0.7152 -0.4085 0.6871 2.3991
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.570e+01 5.354e+02 0.029 0.9766
## Pclass2 -8.621e-01 4.094e-01 -2.106 0.0352 *
## Pclass3 -2.011e+00 4.291e-01 -4.686 2.78e-06 ***
## Sexmale -2.513e+00 2.454e-01 -10.244 < 2e-16 ***
## Age -3.776e-02 8.794e-03 -4.294 1.75e-05 ***
## SibSp -3.706e-01 1.469e-01 -2.522 0.0117 *
## Parch -1.225e-01 1.340e-01 -0.914 0.3608
## Fare -5.504e-04 3.047e-03 -0.181 0.8566
## EmbarkedC -1.171e+01 5.354e+02 -0.022 0.9826
## EmbarkedQ -1.289e+01 5.354e+02 -0.024 0.9808
## EmbarkedS -1.217e+01 5.354e+02 -0.023 0.9819
## totalCabin 2.555e-01 2.903e-01 0.880 0.3788
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 771.40 on 570 degrees of freedom
## Residual deviance: 533.81 on 559 degrees of freedom
## AIC: 557.81
##
## Number of Fisher Scoring iterations: 12
We have successfully made a model using all the predictor, but as we
can see, based on the data there are still many predictor that are not
significant that cost us efficiency. So we decide on using
stepwise method to reduce the predictor unto a set of
variables that are important only.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
titanic_model_step <- stepAIC(titanic_model, direction = "backward", trace = FALSE)
summary(titanic_model_step)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6247 -0.6973 -0.4178 0.6843 2.3653
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.903644 0.471246 8.284 < 2e-16 ***
## Pclass2 -1.207460 0.298297 -4.048 5.17e-05 ***
## Pclass3 -2.414140 0.300302 -8.039 9.05e-16 ***
## Sexmale -2.443079 0.233547 -10.461 < 2e-16 ***
## Age -0.039576 0.008629 -4.586 4.51e-06 ***
## SibSp -0.412410 0.137856 -2.992 0.00278 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 771.40 on 570 degrees of freedom
## Residual deviance: 539.77 on 565 degrees of freedom
## AIC: 551.77
##
## Number of Fisher Scoring iterations: 4
After using the stepwise method, we can conclude that
the most significant predictors are Pclass,
Sex, Age and SibSp. In other
words, those parameters are the deciding factors wheter a passenger can
survive the disaster or not.
Using model that are developed using stepwise method, we
will try to predict the rate of survival of each passenger.
titanic_test$pred <- predict(titanic_model_step, type = "response", newdata = titanic_test)
titanic_test[,c("PassengerId","Survived","pred")]
## PassengerId Survived pred
## 3 3 Survived 0.61314086
## 4 4 Survived 0.89149291
## 11 11 Survived 0.71479577
## 12 12 Survived 0.83316813
## 17 17 Not Survived 0.06402196
## 19 19 Not Survived 0.46263115
## 25 25 Not Survived 0.48392270
## 36 36 Not Survived 0.35114094
## 39 39 Not Survived 0.48808242
## 41 41 Not Survived 0.37614655
## 45 45 Survived 0.67646305
## 57 57 Survived 0.86588942
## 60 60 Not Survived 0.03074031
## 62 62 Survived 0.91680934
## 68 68 Not Survived 0.15374636
## 69 69 Survived 0.30302733
## 85 85 Survived 0.88323167
## 104 104 Not Survived 0.09452667
## 109 109 Not Survived 0.07889511
## 111 111 Not Survived 0.40143322
## 112 112 Not Survived 0.62322177
## 113 113 Not Survived 0.13892562
## 114 114 Not Survived 0.57091343
## 121 121 Not Survived 0.19737115
## 133 133 Not Survived 0.31368076
## 145 145 Not Survived 0.38716151
## 158 158 Not Survived 0.10518950
## 165 165 Not Survived 0.06643478
## 171 171 Not Survived 0.27816988
## 173 173 Survived 0.73837095
## 174 174 Not Survived 0.14372788
## 175 175 Not Survived 0.31958515
## 176 176 Not Survived 0.11121907
## 190 190 Not Survived 0.08484223
## 195 195 Survived 0.89681349
## 196 196 Survived 0.83316813
## 201 201 Not Survived 0.11287580
## 209 209 Survived 0.70188533
## 217 217 Survived 0.60371253
## 229 229 Not Survived 0.38716151
## 232 232 Not Survived 0.10897319
## 240 240 Not Survived 0.25866935
## 243 243 Not Survived 0.29016289
## 253 253 Not Survived 0.27029358
## 262 262 Survived 0.06169098
## 263 263 Not Survived 0.26702181
## 267 267 Not Survived 0.03781770
## 268 268 Survived 0.08663914
## 270 270 Survived 0.92542804
## 282 282 Not Survived 0.11287580
## 294 294 Not Survived 0.63173928
## 295 295 Not Survived 0.12972485
## 297 297 Not Survived 0.13197524
## 303 303 Not Survived 0.15374636
## 310 310 Survived 0.93798576
## 312 312 Survived 0.91423373
## 314 314 Not Survived 0.11287580
## 315 315 Not Survived 0.13457911
## 319 319 Survived 0.93564340
## 320 320 Survived 0.87081714
## 322 322 Not Survived 0.11689984
## 323 323 Survived 0.81889969
## 329 329 Survived 0.46263115
## 342 342 Survived 0.84768985
## 345 345 Not Survived 0.23656166
## 351 351 Not Survived 0.13425865
## 354 354 Not Survived 0.08663914
## 361 361 Not Survived 0.04978304
## 362 362 Not Survived 0.21298881
## 371 371 Survived 0.51468587
## 373 373 Not Survived 0.15374636
## 387 387 Not Survived 0.04499351
## 388 388 Survived 0.78099187
## 406 406 Not Survived 0.18169856
## 408 408 Survived 0.43093617
## 415 415 Survived 0.06327441
## 425 425 Not Survived 0.11121907
## 434 434 Not Survived 0.16432916
## 436 436 Survived 0.94965375
## 462 462 Not Survived 0.09119325
## 472 472 Not Survived 0.07889511
## 479 479 Not Survived 0.13892562
## 490 490 Survived 0.15159226
## 497 497 Survived 0.79480600
## 531 531 Survived 0.90066385
## 536 536 Survived 0.91827624
## 540 540 Survived 0.95404188
## 544 544 Survived 0.19376484
## 551 551 Survived 0.68735369
## 557 557 Survived 0.83084065
## 559 559 Survived 0.87520425
## 562 562 Not Survived 0.07333146
## 566 566 Not Survived 0.06132893
## 577 577 Survived 0.79422892
## 589 589 Not Survived 0.13892562
## 593 593 Not Survived 0.05659171
## 596 596 Not Survived 0.05782810
## 601 601 Survived 0.71535693
## 604 604 Not Survived 0.06327441
## 617 617 Not Survived 0.06229458
## 620 620 Not Survived 0.31521160
## 625 625 Not Survived 0.14372788
## 636 636 Survived 0.83034328
## 661 661 Survived 0.20700994
## 666 666 Not Survived 0.13727142
## 669 669 Not Survived 0.06566104
## 676 676 Not Survived 0.15896635
## 683 683 Not Survived 0.14866748
## 685 685 Not Survived 0.07351884
## 686 686 Not Survived 0.24072666
## 688 688 Not Survived 0.15374636
## 695 695 Not Survived 0.28618568
## 700 700 Not Survived 0.06813114
## 704 704 Not Survived 0.12532198
## 705 705 Not Survived 0.08355820
## 708 708 Survived 0.44976555
## 709 709 Survived 0.95404188
## 720 720 Not Survived 0.09452667
## 721 721 Survived 0.92119746
## 734 734 Not Survived 0.34138273
## 738 738 Survived 0.51884418
## 764 764 Survived 0.88760488
## 770 770 Not Survived 0.09796880
## 780 780 Survived 0.90041871
## 781 781 Survived 0.72611879
## 785 785 Not Survived 0.12532198
## 787 787 Survived 0.68506345
## 797 797 Survived 0.87701168
## 805 805 Survived 0.11689984
## 806 806 Not Survived 0.10152223
## 813 813 Not Survived 0.24378340
## 824 824 Survived 0.60371253
## 825 825 Not Survived 0.06402196
## 832 832 Survived 0.45210897
## 835 835 Not Survived 0.15896635
## 842 842 Not Survived 0.40610087
## 848 848 Not Survived 0.08796596
## 859 859 Survived 0.63173928
## 876 876 Survived 0.71009956
## 880 880 Survived 0.84388199
## 881 881 Survived 0.84641940
## 882 882 Not Survived 0.09452667
## 884 884 Not Survived 0.29838143
ggplot(titanic_test, aes(x=pred)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
We already produce the survivability rate of each passenger, now to be able to examine our model we will set a threshold of 0.75 to decide a passenger will survive or not.
titanic_test$pred_label <- factor(ifelse(titanic_test$pred > 0.75, "Survived","Not Survived"))
library(caret)
## Loading required package: lattice
conf_matrix <- confusionMatrix(titanic_test$pred_label, titanic_test$Survived, positive = "Survived")
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Survived Survived
## Not Survived 85 27
## Survived 0 31
##
## Accuracy : 0.8112
## 95% CI : (0.7373, 0.8717)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 2.541e-08
##
## Kappa : 0.5772
##
## Mcnemar's Test P-Value : 5.624e-07
##
## Sensitivity : 0.5345
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.7589
## Prevalence : 0.4056
## Detection Rate : 0.2168
## Detection Prevalence : 0.2168
## Balanced Accuracy : 0.7672
##
## 'Positive' Class : Survived
##
Based on the confusion matrix above we can conclude that our model can predict class Survived with the accuracy of 79,72 percent with 0.75 threshold, with many errors on Survived passengers are predicted not survived but actually do survive the accident.
In these research of understanding the dataset of Titanic passengers, we found a pattern in parameters of survival in the disaster is ticket class, sex, age, and number of siblings or spouses. The prediction model can only achieve accuracy of 79.7% with many errors on passengers came out survived despite all otherwise factors, and this passengers can be truly considered miracluous statistically.