this LBB is used to learn about Classification in Machine Learning by predicting which of the passenger of titanic would survive the titanic incident using the datasets provided by kaggle. By Classification in Machine Learning through this dataset we can predict which passenger is most likely to survive.
First lets load the necessary library we will be using
library(dplyr) # for wrangling
library(inspectdf) # for EDA
library(gtools) # for ML model & assumption
library(caret) # for ML model & evaluation
library(readxl)
library(rsample)
library(zoo)
library(imbalance)
library(class)lets load in the data we will be using into an object. The datasets have provided two data, one is train data and the other is to test our prediction, the test data.
Based on this data frame, the target variable would be “survived”
where -0 = didn’t survived
-1 = survived
Lets check the structure of our data
#> 'data.frame': 891 obs. of 12 variables:
#> $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
#> $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : chr "male" "female" "female" "female" ...
#> $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
#> $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
#> $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : chr "S" "C" "S" "S" ...
It seems that the type of data on multiple variables such as “Survived”, “sex”, “Pclass”, and also “embarked” is wrong. lets change it into factors. While also changing the type of data, lets also remove some unused data
survive <- survive %>% mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Embarked = as.factor(Embarked),
Sex = case_when(Sex == "male" ~ 1,
Sex == "female" ~ 2),
Sex = as.factor(Sex)) %>% select(-c("PassengerId", "Name", "Cabin", "Ticket"))
str(survive)#> 'data.frame': 891 obs. of 8 variables:
#> $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#> $ Sex : Factor w/ 2 levels "1","2": 1 2 2 2 1 1 1 1 2 2 ...
#> $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
#> $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
#> $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Then lets check for any missing values
#> Survived Pclass Sex Age SibSp Parch Fare Embarked
#> 0 0 0 177 0 0 0 0
It seems there is many missing values within our Age coloumns, lets try to fill those missing values using
#> Survived Pclass Sex Age SibSp Parch Fare Embarked
#> 0 0 0 0 0 0 0 0
In this step, we want to know how spread is the category data and the numeric data
Insight
-The age of Titanic passengers ranges
from 3 month year old baby to 80 year’s old.
-The average age of
the passengers are 29 years old.
-64% of Titanics passengers were
male
-only 38% of Titanic passengers survived.
Cross validation is a simple method used to split data into 2 parts which is train data and test data. However in this case Kaggle already splitted the data and have given us the splitted data.
RNGkind(sample.kind = "Rounding")
set.seed(123)
splitter <- initial_split(data = survive_clean, prop = 0.8, strata = "Survived")
train_clean <- training(splitter)
test_clean <- testing(splitter)Balanced class proportions are important for the training data because we will be using this data to train the model.
Balanced proportions are essential to ensure that the classification model learns the characteristics of each class equally and is not dominated by just one class. This prevents the model from being biased towards the class with the larger proportion, which would result in the model being only good at predicting one class.
#>
#> 0 1
#> 0.616573 0.383427
#>
#> 0 1
#> 439 273
It seems the data provided by kaggle is not balanced yet so we need to balance this.
to balance the model we can use either upsampling or downsampling. in this case we will be using upsampling to balance our class
#>
#> 0 1
#> 0.5 0.5
After checking the proporsion of survived and non-surviver, we now have a balanced class
lets do a modelling using Logistic Regression. This modelling will be using the function called “glm”. The variable used will be all the variables provided by the data sets
#>
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = train_Nclean)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 13.097655 617.012211 0.021 0.983064
#> Pclass2 -1.100741 0.294414 -3.739 0.000185 ***
#> Pclass3 -2.137240 0.280000 -7.633 0.0000000000000229 ***
#> Sex2 2.768607 0.207654 13.333 < 0.0000000000000002 ***
#> Age -0.028188 0.006665 -4.229 0.0000234722718158 ***
#> SibSp -0.382933 0.108183 -3.540 0.000401 ***
#> Parch -0.060876 0.119285 -0.510 0.609813
#> Fare 0.001777 0.002303 0.772 0.440196
#> EmbarkedC -11.704093 617.012131 -0.019 0.984866
#> EmbarkedQ -11.527751 617.012180 -0.019 0.985094
#> EmbarkedS -11.899387 617.012119 -0.019 0.984613
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1217.17 on 877 degrees of freedom
#> Residual deviance: 802.38 on 867 degrees of freedom
#> AIC: 824.38
#>
#> Number of Fisher Scoring iterations: 13
Insight
-it seems that Pclass2, Pclass3, Sex2,
Age, and SibSp played a significant role in determining wether the
passenger would survive or not
#> Pclass2
#> 0.3326246
#> Pclass3
#> 0.11798
#> Sex2
#> 15.93642
#> Age
#> 0.972206
#> SibSp
#> 0.6818588
Insight
What these number tells us is that:
-People in class 2 are 0.33 times more likely to survive than
people in class 1.
-People in class 3 are 0.11 more likely to
survive than people in class 1.
-Sex2 or woman are 15.9 times more
like to survive compared to men.
-For every increase in age, the
likely survivability of a person is increase by 0.97 times.
-for
every increase in number of siblings or spouses, the likely survivalhood
of a person is increase by 0.68 times.
In the first model, theres still many variable prediktor that isn’t significant towards the target variable. That is why we will be tuning our model using Stepwise method
#> Start: AIC=824.38
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
#>
#> Df Deviance AIC
#> - Embarked 3 804.28 820.28
#> - Parch 1 802.64 822.64
#> - Fare 1 803.02 823.02
#> <none> 802.38 824.38
#> - SibSp 1 817.15 837.15
#> - Age 1 821.13 841.13
#> - Pclass 2 864.83 882.83
#> - Sex 1 1034.79 1054.79
#>
#> Step: AIC=820.28
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
#>
#> Df Deviance AIC
#> - Parch 1 804.68 818.68
#> - Fare 1 805.28 819.28
#> <none> 804.28 820.28
#> - SibSp 1 820.47 834.47
#> - Age 1 822.49 836.49
#> - Pclass 2 867.22 879.22
#> - Sex 1 1053.13 1067.13
#>
#> Step: AIC=818.68
#> Survived ~ Pclass + Sex + Age + SibSp + Fare
#>
#> Df Deviance AIC
#> - Fare 1 805.47 817.47
#> <none> 804.68 818.68
#> - Age 1 822.62 834.62
#> - SibSp 1 824.03 836.03
#> - Pclass 2 869.47 879.47
#> - Sex 1 1060.15 1072.15
#>
#> Step: AIC=817.47
#> Survived ~ Pclass + Sex + Age + SibSp
#>
#> Df Deviance AIC
#> <none> 805.47 817.47
#> - Age 1 823.83 833.83
#> - SibSp 1 824.06 834.06
#> - Pclass 2 912.53 920.53
#> - Sex 1 1067.69 1077.69
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
#> data = train_Nclean)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.365448 0.317030 4.307 0.0000165 ***
#> Pclass2 -1.256699 0.265939 -4.726 0.0000023 ***
#> Pclass3 -2.229983 0.233344 -9.557 < 0.0000000000000002 ***
#> Sex2 2.800483 0.198799 14.087 < 0.0000000000000002 ***
#> Age -0.027729 0.006626 -4.185 0.0000286 ***
#> SibSp -0.405622 0.103601 -3.915 0.0000903 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1217.17 on 877 degrees of freedom
#> Residual deviance: 805.47 on 872 degrees of freedom
#> AIC: 817.47
#>
#> Number of Fisher Scoring iterations: 5
Insight
-Some variable prediktor such as “Fare”
and “Parch” are thrown out of the model_step -The significance level of
Pclass2 has increased compared to our Model_all.
Lets see if the odds of Pclass 2 has also increase/decrease
#> Pclass2
#> 0.2845919
It seems that the odds of survival for people in Pclass2 have decreased from 0.33 times more likely to survive compared to Pclass1 to 0.28 more likely to survive # Prediction using our model_step that was tuned from model_all, lets try to predict using our current datasets
test_clean$survivability <- predict(object = model_step,
newdata = test_clean,
type = "response")
head(test_clean)Data classification to classify which survives and which didn’t surive in the test data
The next step after doing prediction with our model is to do evaluation. we will evaluate our model using “Confusion matrix”
confusionMatrix(data = as.factor(test_clean$Pred_survive),
reference = test_clean$Survived,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 82 16
#> 1 28 53
#>
#> Accuracy : 0.7542
#> 95% CI : (0.6844, 0.8154)
#> No Information Rate : 0.6145
#> P-Value [Acc > NIR] : 0.00005396
#>
#> Kappa : 0.4974
#>
#> Mcnemar's Test P-Value : 0.09725
#>
#> Sensitivity : 0.7681
#> Specificity : 0.7455
#> Pos Pred Value : 0.6543
#> Neg Pred Value : 0.8367
#> Prevalence : 0.3855
#> Detection Rate : 0.2961
#> Detection Prevalence : 0.4525
#> Balanced Accuracy : 0.7568
#>
#> 'Positive' Class : 1
#>
Insight
-our model have an accuracy of
75.42%
there are multiple type of accuracy such as :
-Recall : From all
of the number of actual positive in the data, how much did our model
predict correctly -Accuracy : How well did our model guessed the target
-Precision : From all prediction, how well our model guessed right about
class positive -Specificity : From all of the number of actual negative
in the data, how much did our model predict correctly
Recall <- round((53)/(53+16),2)
Accuracy <- round((82+53)/nrow(test_clean),2)
Precision <- round((53)/(53+28),2)
Specificity <- round((82)/(82+16),2)
performance_logistic <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance_logisticWe see that our model fails in identifying people who actually didn’t survive but was predicted to survive so lets tune our model to increase the Precision value. this means that we need to decrease the number of False Positive.
test_clean$Pred_survive_new <- ifelse(test_clean$survivability > 0.7, yes = 1, no =0)
confusionMatrix(data = as.factor(test_clean$Pred_survive_new),
reference = test_clean$Survived,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 98 26
#> 1 12 43
#>
#> Accuracy : 0.7877
#> 95% CI : (0.7205, 0.8452)
#> No Information Rate : 0.6145
#> P-Value [Acc > NIR] : 0.0000005459
#>
#> Kappa : 0.5343
#>
#> Mcnemar's Test P-Value : 0.03496
#>
#> Sensitivity : 0.6232
#> Specificity : 0.8909
#> Pos Pred Value : 0.7818
#> Neg Pred Value : 0.7903
#> Prevalence : 0.3855
#> Detection Rate : 0.2402
#> Detection Prevalence : 0.3073
#> Balanced Accuracy : 0.7570
#>
#> 'Positive' Class : 1
#>
Recall <- round((43)/(43+26),2)
Accuracy <- round((98+43)/nrow(test_clean),2)
Precision <- round((43)/(43+12),2)
Specificity <- round((98)/(98+26),2)
performance_tuned <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance_tunedInsight
-After tuning our model the Accuracy and
Precision of our model increased
#> 'data.frame': 891 obs. of 8 variables:
#> $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#> $ Sex : Factor w/ 2 levels "1","2": 1 2 2 2 1 1 1 1 2 2 ...
#> $ Age : num 22 38 26 35 35 35 54 2 27 14 ...
#> $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
#> $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
To prepare for KNN, we first need to take out categorical factor variables
#> 'data.frame': 891 obs. of 5 variables:
#> $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#> $ Age : num 22 38 26 35 35 35 54 2 27 14 ...
#> $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
#> $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
Before we predict, we need to find the optimum K
#> [1] 29.42788
Insight
-Target binary -> K Optimum: 29
-Jumlah target -> 2 (1 dan 0)
saving the values to data test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 89 21
#> 1 27 42
#>
#> Accuracy : 0.7318
#> 95% CI : (0.6606, 0.7952)
#> No Information Rate : 0.648
#> P-Value [Acc > NIR] : 0.0105
#>
#> Kappa : 0.4247
#>
#> Mcnemar's Test P-Value : 0.4705
#>
#> Sensitivity : 0.6667
#> Specificity : 0.7672
#> Pos Pred Value : 0.6087
#> Neg Pred Value : 0.8091
#> Prevalence : 0.3520
#> Detection Rate : 0.2346
#> Detection Prevalence : 0.3855
#> Balanced Accuracy : 0.7170
#>
#> 'Positive' Class : 1
#>
Performance of Logistic Regression
Performance of Tuned Logistic Regression
Performance of K Nearest Neighbor
If we compared just the logistic Regression and the K Nearest Neighbor model without the tuned model, it seems it is better to use Logistic Regression due to the higher Accuracy, Recall, Precision, and Specificity of the model.
In conclusion if i were one of the passenger on board in the Titanic, where then the Titanic crash and is sinking, I would prefer if i were to be predicted not survive but then survive, instead of being predicted to survive but didn’t