Intro

this LBB is used to learn about Classification in Machine Learning by predicting which of the passenger of titanic would survive the titanic incident using the datasets provided by kaggle. By Classification in Machine Learning through this dataset we can predict which passenger is most likely to survive.

Data Preparation

First lets load the necessary library we will be using

library(dplyr) # for wrangling
library(inspectdf) # for EDA
library(gtools) # for ML model & assumption 
library(caret) # for ML model & evaluation 
library(readxl)
library(rsample)
library(zoo)
library(imbalance)
library(class)

lets load in the data we will be using into an object. The datasets have provided two data, one is train data and the other is to test our prediction, the test data.

survive <- read.csv("train.csv")
survive

Based on this data frame, the target variable would be “survived” where -0 = didn’t survived
-1 = survived

Data Cleansing

Lets check the structure of our data

str(survive)

#> 'data.frame':    891 obs. of  12 variables:
#>  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
#>  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
#>  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#>  $ Sex        : chr  "male" "female" "female" "female" ...
#>  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
#>  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
#>  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
#>  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#>  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Cabin      : chr  "" "C85" "" "C123" ...
#>  $ Embarked   : chr  "S" "C" "S" "S" ...

It seems that the type of data on multiple variables such as “Survived”, “sex”, “Pclass”, and also “embarked” is wrong. lets change it into factors. While also changing the type of data, lets also remove some unused data

survive <- survive %>% mutate(Survived = as.factor(Survived),
                          Pclass = as.factor(Pclass),
                          Embarked = as.factor(Embarked),
                          Sex = case_when(Sex == "male" ~ 1,
                                          Sex == "female" ~ 2),
                          Sex = as.factor(Sex)) %>% select(-c("PassengerId", "Name", "Cabin", "Ticket"))
str(survive)

#> 'data.frame':    891 obs. of  8 variables:
#>  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#>  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#>  $ Sex     : Factor w/ 2 levels "1","2": 1 2 2 2 1 1 1 1 2 2 ...
#>  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
#>  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
#>  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
#>  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Then lets check for any missing values

survive %>% is.na() %>% colSums()

#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
#>        0        0        0      177        0        0        0        0

It seems there is many missing values within our Age coloumns, lets try to fill those missing values using

# train cleaning
survive_clean <- survive %>% na.locf()
survive_clean %>% is.na() %>% colSums()

#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
#>        0        0        0        0        0        0        0        0

Exploratory Data Analysis

In this step, we want to know how spread is the category data and the numeric data

survive_clean %>% inspect_cat()

survive_clean %>% inspect_num()

Insight
-The age of Titanic passengers ranges from 3 month year old baby to 80 year’s old.
-The average age of the passengers are 29 years old.
-64% of Titanics passengers were male
-only 38% of Titanic passengers survived.

Cross Validation

Cross validation is a simple method used to split data into 2 parts which is train data and test data. However in this case Kaggle already splitted the data and have given us the splitted data.

RNGkind(sample.kind = "Rounding")
set.seed(123)

splitter <- initial_split(data = survive_clean, prop = 0.8, strata = "Survived")
train_clean <- training(splitter)
test_clean <- testing(splitter)

Check Imbalance Class

Balanced class proportions are important for the training data because we will be using this data to train the model.

Balanced proportions are essential to ensure that the classification model learns the characteristics of each class equally and is not dominated by just one class. This prevents the model from being biased towards the class with the larger proportion, which would result in the model being only good at predicting one class.

table(train_clean$Survived) %>% prop.table()

#> 
#>        0        1 
#> 0.616573 0.383427

table(train_clean$Survived)

#> 
#>   0   1 
#> 439 273

It seems the data provided by kaggle is not balanced yet so we need to balance this.

Balancing model

to balance the model we can use either upsampling or downsampling. in this case we will be using upsampling to balance our class

train_Nclean <- upSample(x = train_clean[, -1],
                           y = train_clean$Survived,
                           yname = "Survived")

table(train_Nclean$Survived) %>% prop.table()

#> 
#>   0   1 
#> 0.5 0.5

After checking the proporsion of survived and non-surviver, we now have a balanced class

Modeling

lets do a modelling using Logistic Regression. This modelling will be using the function called “glm”. The variable used will be all the variables provided by the data sets

Step 1: building the model

model_all <- glm(Survived ~.,
                 family = "binomial",
                 data = train_Nclean)
summary(model_all)

#> 
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = train_Nclean)
#> 
#> Coefficients:
#>               Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)  13.097655 617.012211   0.021             0.983064    
#> Pclass2      -1.100741   0.294414  -3.739             0.000185 ***
#> Pclass3      -2.137240   0.280000  -7.633   0.0000000000000229 ***
#> Sex2          2.768607   0.207654  13.333 < 0.0000000000000002 ***
#> Age          -0.028188   0.006665  -4.229   0.0000234722718158 ***
#> SibSp        -0.382933   0.108183  -3.540             0.000401 ***
#> Parch        -0.060876   0.119285  -0.510             0.609813    
#> Fare          0.001777   0.002303   0.772             0.440196    
#> EmbarkedC   -11.704093 617.012131  -0.019             0.984866    
#> EmbarkedQ   -11.527751 617.012180  -0.019             0.985094    
#> EmbarkedS   -11.899387 617.012119  -0.019             0.984613    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1217.17  on 877  degrees of freedom
#> Residual deviance:  802.38  on 867  degrees of freedom
#> AIC: 824.38
#> 
#> Number of Fisher Scoring iterations: 13

Insight
-it seems that Pclass2, Pclass3, Sex2, Age, and SibSp played a significant role in determining wether the passenger would survive or not

Step 2: Finding the odds prediktor

#odds of Pclass2  
exp(model_all$coefficients[2])

#>   Pclass2 
#> 0.3326246

#odds of Pclass3
exp(model_all$coefficients[3])

#> Pclass3 
#> 0.11798

#odds of Sex2
exp(model_all$coefficients[4])

#>     Sex2 
#> 15.93642

#odds of Age
exp(model_all$coefficients[5])

#>      Age 
#> 0.972206

#odds of SibSp
exp(model_all$coefficients[6])

#>     SibSp 
#> 0.6818588

Insight
What these number tells us is that:
-People in class 2 are 0.33 times more likely to survive than people in class 1.
-People in class 3 are 0.11 more likely to survive than people in class 1.
-Sex2 or woman are 15.9 times more like to survive compared to men.
-For every increase in age, the likely survivability of a person is increase by 0.97 times.
-for every increase in number of siblings or spouses, the likely survivalhood of a person is increase by 0.68 times.

Model Tuning

In the first model, theres still many variable prediktor that isn’t significant towards the target variable. That is why we will be tuning our model using Stepwise method

model_step <- step(object = model_all,
                   direction = "backward")

#> Start:  AIC=824.38
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
#> 
#>            Df Deviance     AIC
#> - Embarked  3   804.28  820.28
#> - Parch     1   802.64  822.64
#> - Fare      1   803.02  823.02
#> <none>          802.38  824.38
#> - SibSp     1   817.15  837.15
#> - Age       1   821.13  841.13
#> - Pclass    2   864.83  882.83
#> - Sex       1  1034.79 1054.79
#> 
#> Step:  AIC=820.28
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
#> 
#>          Df Deviance     AIC
#> - Parch   1   804.68  818.68
#> - Fare    1   805.28  819.28
#> <none>        804.28  820.28
#> - SibSp   1   820.47  834.47
#> - Age     1   822.49  836.49
#> - Pclass  2   867.22  879.22
#> - Sex     1  1053.13 1067.13
#> 
#> Step:  AIC=818.68
#> Survived ~ Pclass + Sex + Age + SibSp + Fare
#> 
#>          Df Deviance     AIC
#> - Fare    1   805.47  817.47
#> <none>        804.68  818.68
#> - Age     1   822.62  834.62
#> - SibSp   1   824.03  836.03
#> - Pclass  2   869.47  879.47
#> - Sex     1  1060.15 1072.15
#> 
#> Step:  AIC=817.47
#> Survived ~ Pclass + Sex + Age + SibSp
#> 
#>          Df Deviance     AIC
#> <none>        805.47  817.47
#> - Age     1   823.83  833.83
#> - SibSp   1   824.06  834.06
#> - Pclass  2   912.53  920.53
#> - Sex     1  1067.69 1077.69

summary(model_step)

#> 
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
#>     data = train_Nclean)
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)  1.365448   0.317030   4.307            0.0000165 ***
#> Pclass2     -1.256699   0.265939  -4.726            0.0000023 ***
#> Pclass3     -2.229983   0.233344  -9.557 < 0.0000000000000002 ***
#> Sex2         2.800483   0.198799  14.087 < 0.0000000000000002 ***
#> Age         -0.027729   0.006626  -4.185            0.0000286 ***
#> SibSp       -0.405622   0.103601  -3.915            0.0000903 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1217.17  on 877  degrees of freedom
#> Residual deviance:  805.47  on 872  degrees of freedom
#> AIC: 817.47
#> 
#> Number of Fisher Scoring iterations: 5

Insight
-Some variable prediktor such as “Fare” and “Parch” are thrown out of the model_step -The significance level of Pclass2 has increased compared to our Model_all.

Lets see if the odds of Pclass 2 has also increase/decrease

# odds of Pclass2
exp(model_step$coefficients[2])

#>   Pclass2 
#> 0.2845919

It seems that the odds of survival for people in Pclass2 have decreased from 0.33 times more likely to survive compared to Pclass1 to 0.28 more likely to survive # Prediction using our model_step that was tuned from model_all, lets try to predict using our current datasets

test_clean$survivability <- predict(object = model_step,
                                 newdata = test_clean,
                                 type = "response")
head(test_clean)

Data classification to classify which survives and which didn’t surive in the test data

test_clean$Pred_survive <- ifelse(test_clean$survivability > 0.5, yes = 1, no = 0)
head(test_clean)

Model Evaluation

The next step after doing prediction with our model is to do evaluation. we will evaluate our model using “Confusion matrix”

confusionMatrix(data = as.factor(test_clean$Pred_survive),
                reference = test_clean$Survived,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 82 16
#>          1 28 53
#>                                           
#>                Accuracy : 0.7542          
#>                  95% CI : (0.6844, 0.8154)
#>     No Information Rate : 0.6145          
#>     P-Value [Acc > NIR] : 0.00005396      
#>                                           
#>                   Kappa : 0.4974          
#>                                           
#>  Mcnemar's Test P-Value : 0.09725         
#>                                           
#>             Sensitivity : 0.7681          
#>             Specificity : 0.7455          
#>          Pos Pred Value : 0.6543          
#>          Neg Pred Value : 0.8367          
#>              Prevalence : 0.3855          
#>          Detection Rate : 0.2961          
#>    Detection Prevalence : 0.4525          
#>       Balanced Accuracy : 0.7568          
#>                                           
#>        'Positive' Class : 1               
#>

Insight
-our model have an accuracy of 75.42%

Accuracy

there are multiple type of accuracy such as :
-Recall : From all of the number of actual positive in the data, how much did our model predict correctly -Accuracy : How well did our model guessed the target -Precision : From all prediction, how well our model guessed right about class positive -Specificity : From all of the number of actual negative in the data, how much did our model predict correctly

Recall <- round((53)/(53+16),2)
Accuracy <-  round((82+53)/nrow(test_clean),2)
Precision <- round((53)/(53+28),2)
Specificity <- round((82)/(82+16),2)
performance_logistic <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance_logistic

Model tuning

We see that our model fails in identifying people who actually didn’t survive but was predicted to survive so lets tune our model to increase the Precision value. this means that we need to decrease the number of False Positive.

test_clean$Pred_survive_new <- ifelse(test_clean$survivability > 0.7, yes = 1, no =0)

confusionMatrix(data = as.factor(test_clean$Pred_survive_new),
                reference = test_clean$Survived,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 98 26
#>          1 12 43
#>                                           
#>                Accuracy : 0.7877          
#>                  95% CI : (0.7205, 0.8452)
#>     No Information Rate : 0.6145          
#>     P-Value [Acc > NIR] : 0.0000005459    
#>                                           
#>                   Kappa : 0.5343          
#>                                           
#>  Mcnemar's Test P-Value : 0.03496         
#>                                           
#>             Sensitivity : 0.6232          
#>             Specificity : 0.8909          
#>          Pos Pred Value : 0.7818          
#>          Neg Pred Value : 0.7903          
#>              Prevalence : 0.3855          
#>          Detection Rate : 0.2402          
#>    Detection Prevalence : 0.3073          
#>       Balanced Accuracy : 0.7570          
#>                                           
#>        'Positive' Class : 1               
#>

Accuracy

Recall <- round((43)/(43+26),2)
Accuracy <-  round((98+43)/nrow(test_clean),2)
Precision <- round((43)/(43+12),2)
Specificity <- round((98)/(98+26),2)
performance_tuned <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance_tuned

Insight
-After tuning our model the Accuracy and Precision of our model increased

K Nearest Neighbor Classifier

Pre-processing Data

str(survive_clean)

#> 'data.frame':    891 obs. of  8 variables:
#>  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#>  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#>  $ Sex     : Factor w/ 2 levels "1","2": 1 2 2 2 1 1 1 1 2 2 ...
#>  $ Age     : num  22 38 26 35 35 35 54 2 27 14 ...
#>  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
#>  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
#>  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

To prepare for KNN, we first need to take out categorical factor variables

knn_data <- survive_clean %>% select(-c("Pclass", "Sex", "Embarked"))
str(knn_data)

#> 'data.frame':    891 obs. of  5 variables:
#>  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#>  $ Age     : num  22 38 26 35 35 35 54 2 27 14 ...
#>  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
#>  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
#>  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(123)
splitter <- initial_split(data = knn_data, prop = 0.8)
train_knn <- training(splitter)
test_knn <- testing(splitter)

Check Imbalance Class

table(train_knn$Survived) %>% 
  prop.table()

#> 
#>         0         1 
#> 0.6081461 0.3918539

Balancing

train_knn_up <- upSample(x = train_knn[, -1],
                           y = train_knn$Survived,
                           yname = "Survived")

Check imbalance Class

table(train_knn_up$Survived) %>% 
  prop.table()

#> 
#>   0   1 
#> 0.5 0.5

Scaling

Preparing data for scaling

train_Knn_x <- train_knn_up %>% select(-Survived)
test_knn_x <- test_knn %>% select(-Survived)

train_y <- train_knn_up$Survived
test_y <- test_knn$Survived

Scaling

#train
train_xs <- scale(train_Knn_x)

#test
test_xs <- scale(test_knn_x,
                 center = attr(train_xs, "scaled:center"),
                 scale = attr(train_xs, "scaled:scale"))

predict

Before we predict, we need to find the optimum K

sqrt(nrow(train_knn_up))

#> [1] 29.42788

Insight
-Target binary -> K Optimum: 29
-Jumlah target -> 2 (1 dan 0)

knn <- knn(train = train_xs,
           test = test_xs,
           cl = train_y,
           k = 29)

saving the values to data test

test_knn$pred_label <- knn
test_knn

Model Evaluation

confusionMatrix(data = test_knn$pred_label,
                reference = test_knn$Survived,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 89 21
#>          1 27 42
#>                                           
#>                Accuracy : 0.7318          
#>                  95% CI : (0.6606, 0.7952)
#>     No Information Rate : 0.648           
#>     P-Value [Acc > NIR] : 0.0105          
#>                                           
#>                   Kappa : 0.4247          
#>                                           
#>  Mcnemar's Test P-Value : 0.4705          
#>                                           
#>             Sensitivity : 0.6667          
#>             Specificity : 0.7672          
#>          Pos Pred Value : 0.6087          
#>          Neg Pred Value : 0.8091          
#>              Prevalence : 0.3520          
#>          Detection Rate : 0.2346          
#>    Detection Prevalence : 0.3855          
#>       Balanced Accuracy : 0.7170          
#>                                           
#>        'Positive' Class : 1               
#>

Accuracy

Recall <- round((42)/(42+21),2)
Accuracy <-  round((89+42)/nrow(test_clean),2)
Precision <- round((42)/(42+27),2)
Specificity <- round((89)/(89+21),2)
performance_knn <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance_knn

Comparison

Performance of Logistic Regression

performance_logistic

Performance of Tuned Logistic Regression

performance_tuned

Performance of K Nearest Neighbor

performance_knn

If we compared just the logistic Regression and the K Nearest Neighbor model without the tuned model, it seems it is better to use Logistic Regression due to the higher Accuracy, Recall, Precision, and Specificity of the model.

Conclusion

In conclusion if i were one of the passenger on board in the Titanic, where then the Titanic crash and is sinking, I would prefer if i were to be predicted not survive but then survive, instead of being predicted to survive but didn’t

Reference

https://www.kaggle.com/c/titanic/data

LBB Classification in Machine Learning I

Jonathan Christopher Pandiana

2024-07-23

Intro

Data Preparation

Data Cleansing

Exploratory Data Analysis

Cross Validation

Check Imbalance Class

Balancing model

Modeling

Step 1: building the model

Step 2: Finding the odds prediktor

Model Tuning

Model Evaluation

Accuracy

Model tuning

Accuracy

K Nearest Neighbor Classifier

Pre-processing Data

Cross Validation

Check Imbalance Class

Balancing

Check imbalance Class

Scaling

predict

Model Evaluation

Accuracy

Comparison

Conclusion

Reference