The sinking of the Titanic is one of the most interesting case studies for predicting whether a passenger survived or not.
In this LBB Project, we revisit the Titanic dataset from Kaggle that we explored at the beginning stage of the LBB Project for Programming for Data Science with R.
We will predict whether each passenger aboard the Titanic survived or not.
The original owner of the data provides three datasets, which we can explore and use to create the predictions:
Dataset train.csv contains the details of a subset of the passengers aboard the Titanic (891 passengers in total) and, importantly, records whether each of them survived or not, also known as the “ground truth”.
Dataset test.csv contains similar information to train.csv but does not disclose the “ground truth” for each passenger.
Dataset gender_submission.csv is a set of predictions that assumes all female passengers, and only female passengers, survived.
Before we dive into further analysis, let us prepare our datasets.
# Load the necessary packages
library(readr)
library(dplyr)
library(gtools)
library(car)
library(caret)
library(CMplot)
library(class)
#library(lubridate) # working with datetime
#library(GGally) # correlation relationship
#library(MLmetrics) # for MAE calculations
#library(performance) # model performance comparison
#library(lmtest) # Testing Linear Regression Model
Before we can create the dataset, we will check the information contained in each of the original Kaggle datasets, train.csv and test.csv.
# Read the first dataset
titanic_train <- read.csv("data_input/train.csv")
# Check the structure of first dataset
glimpse(titanic_train)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
This dataset contains information on a total of 891 passengers.
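In the glimpse() output above, Cabin holds empty strings ("") rather than NA, so a later count of NA values will not see them. A minimal sketch, using a small made-up CSV snippet instead of the real file, of how the `na.strings` argument of `read.csv()` could make such blanks count as missing. The three rows below are illustrative and not taken from train.csv, and the object names are hypothetical:

```r
# Toy CSV text standing in for train.csv (illustrative rows only).
csv_text <- "PassengerId,Survived,Cabin,Embarked
1,0,,S
2,1,C85,C
3,1,,
"

# Default behaviour: blank character fields are read as "" and not as NA.
toy_default <- read.csv(text = csv_text)

# Treat both "" and "NA" as missing while reading.
toy_na <- read.csv(text = csv_text, na.strings = c("", "NA"))

colSums(is.na(toy_default))  # no column reports missing values
colSums(is.na(toy_na))       # Cabin and Embarked now report missing values
```

Whether blanks in Cabin and Embarked should be treated as missing is a modelling choice; the report itself reads the files with the default settings.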
# Read the second dataset
titanic_test <- read.csv("data_input/test.csv")
# Check the structure of second dataset
glimpse(titanic_test)
#> Rows: 418
#> Columns: 11
#> $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
#> $ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
#> $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
#> $ Sex <chr> "male", "female", "male", "male", "female", "male", "femal…
#> $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
#> $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
#> $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
#> $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
#> $ Cabin <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
#> $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…
This dataset contains information on a total of 418 passengers.
The only difference between the train and test dataframes is the extra column Survived in train; all other columns have the same meaning in both.
Therefore, our data description applies to both dataframes and can be explained as follows:
PassengerId: Row number
Survived: Survival status of the passenger (1 = Yes, 0 = No)
Pclass: Ticket class and a proxy for socio-economic status (1 = 1st class (Upper), 2 = 2nd class (Middle), 3 = 3rd class (Lower))
Name: Name of the passenger
Sex: Gender of the passenger (male / female)
Age: Age of the passenger in years; fractional if less than 1. If the age is estimated, it is in the form xx.5
SibSp: Number of siblings / spouses aboard the Titanic, with family relations defined as follows
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: Number of parents / children aboard the Titanic, with family relations defined as follows
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch = 0 for them
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
We will change the following columns into the categorical factor datatype: Survived, Pclass, Sex, SibSp, Parch, and Embarked.
# Change Data Type
titanic_train <- titanic_train %>%
  mutate_at(vars(Survived, Pclass, Sex, SibSp,
                 Parch, Embarked),
            as.factor)
# Confirm Data Type Change
str(titanic_train)
#> 'data.frame': 891 obs. of 12 variables:
#> $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
#> $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
#> $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
#> $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#> [1] TRUE
#> PassengerId Survived Pclass Name Sex Age
#> 0 0 0 0 0 177
#> SibSp Parch Ticket Fare Cabin Embarked
#> 0 0 0 0 0 0
There are 177 missing values in column Age.
#> [1] 0.423445
The printed value of 0.423445 equals 177 / 418, that is, the missing count divided by the 418 rows of the test set. Relative to the 891 rows of titanic_train, the share of missing Age values is 177 / 891, or about 19.87%. This is still a significant share, so we will keep the data as it is, without removing or otherwise handling those missing values.
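The calls behind the missing-value output above are not shown in this report. A minimal sketch, on a made-up data frame, of one way to obtain such counts and shares in base R. The data frame `toy` and its values are assumptions for illustration. `colMeans(is.na(...))` divides by the row count of the same data frame, so the denominator cannot be mixed up with that of another dataset:

```r
# Toy data frame standing in for titanic_train (values are illustrative).
toy <- data.frame(
  Age  = c(22, 38, NA, 35, NA),
  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

anyNA(toy)                        # TRUE if at least one value is missing
na_count <- colSums(is.na(toy))   # number of missing values per column
na_share <- colMeans(is.na(toy))  # share of missing values per column

na_count
na_share
```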
We will change the following columns into the categorical factor datatype: Pclass, Sex, SibSp, Parch, and Embarked.
# Change Data Type
titanic_test <- titanic_test %>%
  mutate_at(vars(Pclass, Sex, SibSp,
                 Parch, Embarked),
            as.factor)
# Confirm Data Type Change
str(titanic_test)
#> 'data.frame': 418 obs. of 11 variables:
#> $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
#> $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
#> $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
#> $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
#> $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
#> $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
#> $ Ticket : chr "330911" "363272" "240276" "315154" ...
#> $ Fare : num 7.83 7 9.69 8.66 12.29 ...
#> $ Cabin : chr "" "" "" "" ...
#> $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
#> [1] TRUE
#> PassengerId Pclass Name Sex Age SibSp
#> 0 0 0 0 86 0
#> Parch Ticket Fare Cabin Embarked
#> 0 0 1 0 0
There are 86 missing values in column Age and 1 missing value in column Fare.
#> [1] 0.2057416
About 20.57% of the rows in titanic_test have a missing value in column Age. As this is still a significant amount, we will keep the data as it is, without removing or otherwise handling those missing values.
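The two str() outputs show different factor levels in the two dataframes: Parch has 7 levels in titanic_train and 8 in titanic_test, and Embarked has an extra "" level in titanic_train. A minimal sketch, on made-up vectors, of one way to give both factors a shared set of levels. The vector names and values are hypothetical, and whether this step is needed depends on which columns a model ends up using:

```r
# Toy stand-ins for one column in the train and test sets (illustrative values).
train_parch <- c("0", "1", "2", "0")
test_parch  <- c("0", "1", "9", "0")   # "9" appears only in the toy test vector

# Build one shared level set and apply it to both vectors.
shared_levels <- sort(union(train_parch, test_parch))
train_parch_f <- factor(train_parch, levels = shared_levels)
test_parch_f  <- factor(test_parch,  levels = shared_levels)

levels(train_parch_f)
identical(levels(train_parch_f), levels(test_parch_f))
```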
To build the model, we will use the dataset titanic_train, which contains our target variable, Survived.
#>
#> 0 1
#> 0.6161616 0.3838384
We observe that the classes are reasonably balanced, with 61.6% (Not Survived) to 38.4% (Survived).
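The call that produced the proportions above is not shown. A minimal sketch of one way to compute them, rebuilding the target from the class counts implied by the confusion matrix later in this report (468 + 81 = 549 not survived, 109 + 233 = 342 survived) instead of reading the CSV. The object names are hypothetical:

```r
# Rebuild the target from class counts: 549 not survived (0), 342 survived (1).
survived <- factor(c(rep(0, 549), rep(1, 342)))

# Class proportions of the target variable.
class_share <- prop.table(table(survived))
round(class_share, 3)
```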
titanic_train_null <- glm(Survived ~ 1, data = titanic_train, family = "binomial")
summary(titanic_train_null)
#>
#> Call:
#> glm(formula = Survived ~ 1, family = "binomial", data = titanic_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.47329 0.06889 -6.87 0.0000000000064 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1186.7 on 890 degrees of freedom
#> Residual deviance: 1186.7 on 890 degrees of freedom
#> AIC: 1188.7
#>
#> Number of Fisher Scoring iterations: 4
#> [1] 0.6229494
#> [1] 0.650889
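💡 Insight Interpretation :
The odds of survival are exp(-0.47329) ≈ 0.62, meaning there are about 0.62 survivors for every passenger who did not survive.
The second value, 0.650889, results from applying the inverse logit to the odds (0.6229494) rather than to the log-odds. Applying it to the intercept (-0.47329) gives a survival probability of about 38.4%, with the remaining 61.6% not surviving, which matches the class proportions of the target variable.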
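As a cross-check on the intercept-only model, a minimal sketch that rebuilds the target from the class counts implied by the confusion matrix in this report (549 not survived, 342 survived) and converts the intercept to odds and to a probability. The object names are hypothetical. `plogis()` is base R's inverse logit and must be applied to the log-odds, not to the odds:

```r
# Rebuild the target from class counts: 549 not survived (0), 342 survived (1).
toy <- data.frame(Survived = factor(c(rep(0, 549), rep(1, 342))))

# Intercept-only logistic regression.
null_model <- glm(Survived ~ 1, data = toy, family = "binomial")

log_odds <- unname(coef(null_model)[1])  # log-odds of survival
odds     <- exp(log_odds)                # odds of survival
prob     <- plogis(log_odds)             # inverse logit of the log-odds

c(log_odds = log_odds, odds = odds, prob = prob)
```

For an intercept-only model the probability equals the observed share of survivors, 342 / 891.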
Let us take another look at what our dataframe titanic_train looks like.
titanic_train_gender <- glm(Survived ~ Sex, data = titanic_train, family = "binomial")
summary(titanic_train_gender)
#>
#> Call:
#> glm(formula = Survived ~ Sex, family = "binomial", data = titanic_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.0566 0.1290 8.191 0.000000000000000258 ***
#> Sexmale -2.5137 0.1672 -15.036 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1186.7 on 890 degrees of freedom
#> Residual deviance: 917.8 on 889 degrees of freedom
#> AIC: 921.8
#>
#> Number of Fisher Scoring iterations: 4
💡 Coefficients Information :
Intercept : 1.0566 is the log-odds of survival for “female” passengers (the reference level)
Sexmale : -2.5137 is the log of the odds ratio of survival for “male” passengers relative to “female” passengers
Calculate the survival odds and probability for passengers with “female” gender
#> [1] 2.876574
#> [1] 0.9466762
💡 Insight Interpretation :
Among “female” passengers, the odds of survival are exp(1.0566) ≈ 2.88, that is, about 2.88 survivors for every female passenger who did not survive.
The value 0.9466762 results from applying the inverse logit to the odds (2.876574) rather than to the log-odds. Applying it to the log-odds gives a survival probability of about 74.2% for “female” passengers. For “male” passengers the log-odds are 1.0566 - 2.5137 = -1.4571, which corresponds to a survival probability of about 18.9%.
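As a cross-check on the gender model, a minimal sketch that rebuilds Sex and Survived from the four cell counts of the confusion matrix in this report. Because the model predicts survival for every female and non-survival for every male, the "Prediction 1" row corresponds to females (233 survived, 81 did not) and the "Prediction 0" row to males (109 survived, 468 did not). The object names are hypothetical:

```r
# Rebuild Sex and Survived from the four cell counts:
# female: 233 survived, 81 not; male: 109 survived, 468 not.
toy <- data.frame(
  Sex      = factor(rep(c("female", "female", "male", "male"),
                        times = c(233, 81, 109, 468))),
  Survived = factor(rep(c(1, 0, 1, 0), times = c(233, 81, 109, 468)))
)

gender_model <- glm(Survived ~ Sex, data = toy, family = "binomial")
b <- unname(coef(gender_model))     # b[1] = intercept, b[2] = Sexmale

odds_female <- exp(b[1])            # odds of survival, female
prob_female <- plogis(b[1])         # probability of survival, female
prob_male   <- plogis(b[1] + b[2])  # probability of survival, male
odds_ratio  <- exp(b[2])            # male odds relative to female odds

round(c(odds_female, prob_female, prob_male, odds_ratio), 4)
```

With a single binary predictor, the fitted probabilities equal the observed survival shares within each group, 233 / 314 for females and 109 / 577 for males.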
Let us evaluate our model with predictions based on our gender model.
titanic_train$pred_Survived <- predict(titanic_train_gender,
                                       titanic_train,
                                       type = "response")
head(titanic_train)
Classify titanic_train based on pred_Survived and save the result into a new column named pred_Label.
titanic_train$pred_Label <- ifelse(titanic_train$pred_Survived > 0.5, yes = "1", no = "0")
head(titanic_train)
confusionMatrix(data = as.factor(titanic_train$pred_Label), # Prediction results (pred_Label)
                reference = titanic_train$Survived, # actual target column
                positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 468 109
#> 1 81 233
#>
#> Accuracy : 0.7868
#> 95% CI : (0.7584, 0.8132)
#> No Information Rate : 0.6162
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.5421
#>
#> Mcnemar's Test P-Value : 0.05014
#>
#> Sensitivity : 0.6813
#> Specificity : 0.8525
#> Pos Pred Value : 0.7420
#> Neg Pred Value : 0.8111
#> Prevalence : 0.3838
#> Detection Rate : 0.2615
#> Detection Prevalence : 0.3524
#> Balanced Accuracy : 0.7669
#>
#> 'Positive' Class : 1
#>
Based on our confusion matrix above,
Target variable = Survived
Positive Class = Survived
FP : Predicted Survived (+) but actually Not Survived (-)
FN : Predicted Not Survived (-) but actually Survived (+)
In this business case, the main concern is False Positives (FP), hence we will use the Precision metric.
In this model, the Pos Pred Value, which measures Precision, is 74.20%.
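The headline metrics can be reproduced by hand from the four cell counts of the confusion matrix above, which makes the role of FP in Precision explicit. A minimal sketch in base R; the variable names are hypothetical:

```r
# Cell counts taken from the confusion matrix above (positive class = "1").
tp <- 233  # predicted 1, actual 1
fp <- 81   # predicted 1, actual 0
fn <- 109  # predicted 0, actual 1
tn <- 468  # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)   # Pos Pred Value
recall      <- tp / (tp + fn)   # Sensitivity
specificity <- tn / (tn + fp)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 4)
```

Precision falls as FP grows, which is why it is the metric to watch when wrongly predicting survival is the costlier mistake.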