A is a planning officer in a rescue organisation. Upon news of the Titanic sinking, she is required to determine how many rescuers her organisation should deploy to the site. The number of rescuers is derived from the number of passengers who would likely survive, according to the ship's manifest data.
A will use a generalized linear model (GLM) to build her prediction model.
Before creating the model, A needs to ensure that her data is ready.
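The preparation chunk itself is not shown; a plausible sketch, assuming the manifest is read from a train.csv file into a titanic object, is:

# read the manifest data (file and object names assumed)
titanic <- read.csv("train.csv", stringsAsFactors = TRUE)
# the [1] TRUE line below suggests a sanity check such as this one
is.data.frame(titanic)
# count the missing values per column
colSums(is.na(titanic))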
## [1] TRUE
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
A decided that Age is relevant to the target variable, hence she needs to impute the missing rows in the Age variable with its mean or median. A needs to explore the Age data first.
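The exploratory chunk is not shown; presumably a simple histogram call:

hist(titanic$Age)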
From this histogram, the data looks fairly balanced (approximately normally distributed). She then confirmed the exact median and mean of the Age variable with the summary() function.
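The confirming call is presumably:

summary(titanic$Age)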
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
The median and mean are within close range of each other, so either value is acceptable. She will impute the missing values in the Age variable with its mean.
A saved the imputed dataset into the titanic_clean object.
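The imputation chunk is not shown; a minimal sketch, assuming the raw data lives in titanic, is:

# copy the data, then replace missing Age values with the mean Age
titanic_clean <- titanic
titanic_clean$Age[is.na(titanic_clean$Age)] <- mean(titanic_clean$Age, na.rm = TRUE)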
To ensure the imputation worked, she checked the data summary again.
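Presumably via:

summary(titanic_clean)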
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:22.00
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :29.70
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:35.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
A checked the updated data structure with the str() function.
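The chunk is presumably:

str(titanic_clean)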
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Variable explanation:
survival : passenger survival status (0 = No, 1 = Yes)
pclass : passenger ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex : passenger sex
age : passenger age in years
sibsp : the number of siblings / spouses aboard the Titanic
parch : the number of parents / children aboard the Titanic
ticket : ticket number
fare : passenger fare
cabin : cabin number
embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
A identified 4 variables that carry no information about whether a passenger survived:
PassengerId
Ticket
Name
Cabin
So she removed them from the dataset to speed up the modelling process, as sketched below.
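The removal chunk is not shown; a plausible version using dplyr (the package is assumed to be loaded here) is:

# select() comes from dplyr
library(dplyr)

# drop the four uninformative columns, then check the class balance of the target
titanic_clean <- titanic_clean %>%
  select(-PassengerId, -Ticket, -Name, -Cabin)
prop.table(table(titanic_clean$Survived))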
##
## 0 1
## 0.6161616 0.3838384
A detected no severe class imbalance in the target variable (roughly 62:38), so she confirmed that she can use this dataset. She then cross-tabulated Survived against each remaining predictor, as in the sketch below.
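Each of the following tables presumably comes from a call of the form:

table(titanic_clean$Survived, titanic_clean$Pclass)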
##
## 1 2 3
## 0 80 97 372
## 1 136 87 119
##
## female male
## 0 81 468
## 1 233 109
##
## 0.42 0.67 0.75 0.83 0.92 1 2 3 4 5 6 7 8 9 10 11 12
## 0 0 0 0 0 0 2 7 1 3 0 1 2 2 6 2 3 0
## 1 1 1 2 2 1 5 3 5 7 4 2 1 2 2 0 1 1
##
## 13 14 14.5 15 16 17 18 19 20 20.5 21 22 23 23.5 24 24.5 25 26
## 0 0 3 1 1 11 7 17 16 12 1 19 16 10 1 15 1 17 12
## 1 2 3 0 4 6 6 9 9 3 0 5 11 5 0 15 0 6 6
##
## 27 28 28.5 29 29.6991176470588 30 30.5 31 32 32.5 33 34 34.5 35
## 0 7 18 2 12 125 15 2 9 9 1 9 9 1 7
## 1 11 7 0 8 52 10 0 8 9 1 6 6 0 11
##
## 36 36.5 37 38 39 40 40.5 41 42 43 44 45 45.5 46 47 48 49 50
## 0 11 1 5 6 9 7 2 4 7 4 6 7 2 3 8 3 2 5
## 1 11 0 1 5 5 6 0 2 6 1 3 5 0 0 1 6 4 5
##
## 51 52 53 54 55 55.5 56 57 58 59 60 61 62 63 64 65 66 70
## 0 5 3 0 5 1 1 2 2 2 2 2 3 2 0 2 3 1 2
## 1 2 3 1 3 1 0 2 0 3 0 2 0 2 2 0 0 0 0
##
## 70.5 71 74 80
## 0 1 2 1 0
## 1 0 0 0 1
##
## 0 1 2 3 4 5 8
## 0 398 97 15 12 15 5 7
## 1 210 112 13 4 3 0 0
##
## 0 1 2 3 4 5 6
## 0 445 53 40 2 4 4 1
## 1 233 65 40 3 0 1 0
No single variable perfectly determines survival on its own, so A reckons that she can use all the variables.
A will split her dataset into training and testing data so she can check her model for overfitting. She decided to use 80% of randomly selected rows for fitting the model and the remaining 20% for testing it.
library(rsample)

set.seed(921)
splitted <- initial_split(data = titanic_clean, prop = 0.8, strata = "Survived")
titanic_train <- training(splitted)
titanic_test <- testing(splitted)

She confirmed that the split datasets have similar target-variable proportions by comparing the class proportions of the full, training, and testing sets.
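The comparison chunk is presumably three prop.table() calls, one each for the full, training, and testing sets:

prop.table(table(titanic_clean$Survived))
prop.table(table(titanic_train$Survived))
prop.table(table(titanic_test$Survived))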
##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 0.6162465 0.3837535
##
## 0 1
## 0.6158192 0.3841808
All of the split datasets indeed show a similar target proportion.
The variables show varying degrees of correlation with the target variable. However, A reckons that all of them might be relevant to the passengers' survival, hence she removes no variable at this point.
A will use the glm() function to generate a classification model from all (mostly numeric) available variables.
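Based on the Call shown in the output below, the fitting chunk was presumably (the object name titanic_mdl is inferred from its later use):

titanic_mdl <- glm(Survived ~ ., data = titanic_train, family = "binomial")
summary(titanic_mdl)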
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5739 -0.6283 -0.4407 0.6738 2.3670
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.432864 535.411599 0.031 0.97552
## Pclass -0.950849 0.156136 -6.090 1.13e-09 ***
## Sexmale -2.658113 0.220486 -12.056 < 2e-16 ***
## Age -0.034028 0.008724 -3.901 9.60e-05 ***
## SibSp -0.376437 0.122142 -3.082 0.00206 **
## Parch -0.012432 0.133808 -0.093 0.92597
## Fare 0.002422 0.002569 0.943 0.34576
## EmbarkedC -11.615881 535.411299 -0.022 0.98269
## EmbarkedQ -11.803506 535.411373 -0.022 0.98241
## EmbarkedS -12.104306 535.411273 -0.023 0.98196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 651.82 on 704 degrees of freedom
## AIC: 671.82
##
## Number of Fisher Scoring iterations: 12
Her first model yields an AIC of 671.82, and she wonders whether she can improve it by eliminating some variables. She uses backward feature selection, hoping to reach a lower (better) AIC.
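The selection chunk is presumably a step() call on the full model:

# backward stepwise elimination: at each step, drop the predictor whose
# removal lowers AIC the most, until no removal improves AIC
titanic_md_fin <- step(titanic_mdl, direction = "backward")
titanic_md_fin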
## Start: AIC=671.82
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Parch 1 651.83 669.83
## - Embarked 3 656.04 670.04
## - Fare 1 652.79 670.79
## <none> 651.82 671.82
## - SibSp 1 662.97 680.97
## - Age 1 667.86 685.86
## - Pclass 1 688.89 706.89
## - Sex 1 833.87 851.87
##
## Step: AIC=669.83
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 3 656.10 668.10
## - Fare 1 652.80 668.80
## <none> 651.83 669.83
## - SibSp 1 664.42 680.42
## - Age 1 667.87 683.87
## - Pclass 1 689.72 705.72
## - Sex 1 841.28 857.28
##
## Step: AIC=668.1
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 657.97 667.97
## <none> 656.10 668.10
## - SibSp 1 670.81 680.81
## - Age 1 672.55 682.55
## - Pclass 1 696.67 706.67
## - Sex 1 853.59 863.59
##
## Step: AIC=667.97
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 657.97 667.97
## - SibSp 1 671.12 679.12
## - Age 1 675.21 683.21
## - Pclass 1 731.96 739.96
## - Sex 1 860.28 868.28
##
## Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Coefficients:
## (Intercept) Pclass Sexmale Age SibSp
## 4.79907 -1.05618 -2.69570 -0.03484 -0.37588
##
## Degrees of Freedom: 713 Total (i.e. Null); 709 Residual
## Null Deviance: 950.9
## Residual Deviance: 658 AIC: 668
The backward feature selection generates a model with an AIC of 667.97. Since this score is slightly lower (better) than the initial model's 671.82, she decides to use this model, assigned to the titanic_md_fin object.
With the better-AIC model in hand, A applied it to her testing dataset. She used the predict() function with type = "response" to generate survival probabilities, and assigned them to the Pred_Survived column.
titanic_test$Pred_Survived <- predict(titanic_md_fin, newdata = titanic_test, type = "response")
titanic_test

She added another column, class_survived, recording which class each predicted probability falls into.
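The classification chunk is not shown; a sketch consistent with the 0.5 cutoff used later in the report is:

titanic_test$class_survived <- ifelse(titanic_test$Pred_Survived > 0.5, "1", "0")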
A will evaluate her model by comparing the correct predictions (the class_survived column) against the actual outcomes (the Survived column). Before continuing with the comparison, A converted both columns to the factor data type.
titanic_test <- titanic_test %>%
  mutate(Survived = as.factor(Survived),
         class_survived = as.factor(class_survived))

A compared her predictions against the actual classes with confusionMatrix() from the caret package.
library(caret)

confusionMatrix(data = titanic_test$class_survived,
                reference = titanic_test$Survived,
                positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 14
## 1 17 54
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6329
##
## Mcnemar's Test P-Value : 0.7194
##
## Sensitivity : 0.7941
## Specificity : 0.8440
## Pos Pred Value : 0.7606
## Neg Pred Value : 0.8679
## Prevalence : 0.3842
## Detection Rate : 0.3051
## Detection Prevalence : 0.4011
## Balanced Accuracy : 0.8191
##
## 'Positive' Class : 1
##
As a rescue planner, she needs to deploy as few rescuers as possible while still optimizing the rescue process. She therefore focuses on her model's precision: she does not want to over-count the survivors by classifying passengers into the survived category when they actually did not survive.
In the summary above, her model is only 76.06% precise (Pos Pred Value) in classifying the surviving passengers. Hence, she needs to tune her model.
To avoid disturbing the previous results, A assigned her test data to a new object, titanic_test1.
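The copying chunk is presumably:

titanic_test1 <- titanic_test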
She decided to add back all the variables previously removed by the feature selection. She reused her earlier model titanic_mdl, which includes all variables (except the completely irrelevant ones), and replaced the Pred_Survived column with the probabilities generated by titanic_mdl.
titanic_test1$Pred_Survived <- predict(titanic_mdl, newdata = titanic_test1, type = "response")
titanic_test1

The class_survived column is likewise replaced with the new classification generated from these probabilities.
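Presumably with the same 0.5 cutoff as before:

titanic_test1$class_survived <- ifelse(titanic_test1$Pred_Survived > 0.5, "1", "0")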
As before, A will evaluate this model by comparing the predictions (class_survived) against the actual outcomes (Survived), converting both columns to factors first.
titanic_test1 <- titanic_test1 %>%
  mutate(Survived = as.factor(Survived),
         class_survived = as.factor(class_survived))

A compared her predictions with confusionMatrix():
confusionMatrix(data = titanic_test1$class_survived,
                reference = titanic_test1$Survived,
                positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 15
## 1 16 53
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6309
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7794
## Specificity : 0.8532
## Pos Pred Value : 0.7681
## Neg Pred Value : 0.8611
## Prevalence : 0.3842
## Detection Rate : 0.2994
## Detection Prevalence : 0.3898
## Balanced Accuracy : 0.8163
##
## 'Positive' Class : 1
##
This model generates a slightly better precision (76.81% compared to 76.06%). However, the improvement is too small, so she decided to upsample her training dataset, titanic_train.
To ensure the target variable (the Survived column) is in the correct format for upsampling, the mutated dataset is assigned to a new object, titanic_train2.
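The mutation chunk is not shown; a plausible sketch (upSample() requires a factor target) is:

titanic_train2 <- titanic_train %>%
  mutate(Survived = as.factor(Survived))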
Upsampling the training data:
set.seed(921)
titanic_train_up <- upSample(x = titanic_train2[,-1],
                             y = titanic_train2$Survived, yname = "Survived")
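The balance check that follows is presumably two prop.table() calls, before and after upsampling:

prop.table(table(titanic_train2$Survived))
prop.table(table(titanic_train_up$Survived))

##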
## 0 1
## 0.6162465 0.3837535
##
## 0 1
## 0.5 0.5
A again uses the glm() function to generate a classification model from all available variables, this time fitted on the upsampled training data.
titanic_mdl_up <- glm(Survived ~ ., data = titanic_train_up, family = "binomial")
summary(titanic_mdl_up)

##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7279 -0.6785 -0.0694 0.6893 2.1890
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.225169 535.411500 0.030 0.97582
## Pclass -0.878612 0.138607 -6.339 2.31e-10 ***
## Sexmale -2.679176 0.197648 -13.555 < 2e-16 ***
## Age -0.032568 0.007430 -4.383 1.17e-05 ***
## SibSp -0.389671 0.109863 -3.547 0.00039 ***
## Parch 0.009659 0.121404 0.080 0.93658
## Fare 0.002984 0.002707 1.102 0.27035
## EmbarkedC -11.097147 535.411274 -0.021 0.98346
## EmbarkedQ -11.413284 535.411327 -0.021 0.98299
## EmbarkedS -11.667092 535.411249 -0.022 0.98261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1219.94 on 879 degrees of freedom
## Residual deviance: 829.28 on 870 degrees of freedom
## AIC: 849.28
##
## Number of Fisher Scoring iterations: 12
A then applied the upsampled model to her testing dataset. She used the predict() function with type = "response" to generate survival probabilities and assigned them to the Pred_Survived column. (Note that the AIC of 849.28 is not directly comparable to the earlier models, since the upsampled training set contains more rows.)
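The testing copy is presumably created the same way as before:

titanic_test2 <- titanic_test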
titanic_test2$Pred_Survived <- predict(titanic_mdl_up, newdata = titanic_test2, type = "response")
titanic_test2$class_survived <- ifelse(titanic_test2$Pred_Survived > 0.5, "1","0")
titanic_test2

A will evaluate this model the same way: comparing the predictions (class_survived) against the actual outcomes (Survived), after converting both columns to factors.
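The conversion chunk is not shown; presumably it mirrors the earlier ones:

titanic_test2 <- titanic_test2 %>%
  mutate(Survived = as.factor(Survived),
         class_survived = as.factor(class_survived))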
A compared her predictions with confusionMatrix():
confusionMatrix(data = titanic_test2$class_survived,
                reference = titanic_test2$Survived,
                positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 13
## 1 21 55
##
## Accuracy : 0.8079
## 95% CI : (0.7421, 0.8632)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 2.854e-08
##
## Kappa : 0.6028
##
## Mcnemar's Test P-Value : 0.2299
##
## Sensitivity : 0.8088
## Specificity : 0.8073
## Pos Pred Value : 0.7237
## Neg Pred Value : 0.8713
## Prevalence : 0.3842
## Detection Rate : 0.3107
## Detection Prevalence : 0.4294
## Balanced Accuracy : 0.8081
##
## 'Positive' Class : 1
##
As a rescue planner, she needs to deploy as few rescuers as possible while still optimizing the rescue process. She again focuses on her model's precision, wanting to avoid classifying passengers into the survived category when they actually did not survive.
In the summary above, her model is only 72.37% precise in classifying the surviving passengers, worse than both previous models.
As a final attempt, A tried K-NN classification. Before classifying the data into two classes (Survived and Not Survived), A needed to scale her data and split the variables into predictors and target, for both the insample and testing data.
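The structure check below presumably comes from:

str(titanic_clean)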
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Insample Data
# predictor variables in `train`
knn_train_x <- knn_train[,-1]
# target variable in `train`
knn_train_y <- knn_train[,1]

Testing Data
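The testing-data chunk presumably mirrors the insample one (object names assumed):

# predictor variables in `test`
knn_test_x <- knn_test[,-1]
# target variable in `test`
knn_test_y <- knn_test[,1]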
To ensure all variables contribute with similar weight during classification, A needed to standardize the variables using z-score standardization via the scale() function.
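A sketch of the standardization and of the rule-of-thumb k (the square root of the number of training rows); the scaled-object names are assumed, and factor predictors such as Sex and Embarked would first need a numeric encoding:

# z-score standardization of the training predictors
knn_train_xs <- scale(knn_train_x)
# scale the test predictors using the training means and standard deviations
knn_test_xs <- scale(knn_test_x,
                     center = attr(knn_train_xs, "scaled:center"),
                     scale  = attr(knn_train_xs, "scaled:scale"))
# rule-of-thumb starting point for k
sqrt(nrow(knn_train))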
## [1] 26.72078
Based on the computation above, A takes the optimum k to be 27: rounding 26.72 to the nearest odd number avoids tied votes, since the target variable has an even number of classes (2: Survived and Not Survived).
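The classification itself is presumably a knn() call from the class package, evaluated with confusionMatrix() as before (object names assumed):

library(class)

# classify each test row by majority vote among its 27 nearest training neighbours
knn_pred <- knn(train = knn_train_xs, test = knn_test_xs,
                cl = as.factor(knn_train_y), k = 27)
confusionMatrix(data = knn_pred,
                reference = as.factor(knn_test_y),
                positive = "1")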
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 15
## 1 16 53
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6309
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7794
## Specificity : 0.8532
## Pos Pred Value : 0.7681
## Neg Pred Value : 0.8611
## Prevalence : 0.3842
## Detection Rate : 0.2994
## Detection Prevalence : 0.3898
## Balanced Accuracy : 0.8163
##
## 'Positive' Class : 1
##
A compared the performance of her models as follows:
Original model, with variables as follows:
Pclass
Sex
Age
SibSp
Precision: 76.06%
Tuning #1: adding back the parent/children count (Parch), embarkation port (Embarked), and fare paid (Fare) into the model. Precision: 76.81%
Tuning #2: upsampling the training (insample) data. Precision: 72.37%
Tuning #3: K-NN classification. Precision: 76.81%
None of the models delivers satisfactory performance. Therefore, A concluded that generalized linear models and K-NN classification might not be suitable for this dataset.