Naive Bayes Method

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence
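
Note on packages: the code below assumes that dplyr (for %>%, select() and mutate()), caret (for train(), trainControl(), confusionMatrix() and varImp()), partykit (for ctree()) and ggplot2 are attached, if they are not already loaded earlier in the document. A minimal, assumed setup:

library(dplyr)     # %>%, select(), mutate()
library(caret)     # train(), trainControl(), confusionMatrix(), varImp()
library(partykit)  # ctree() for the decision tree (party::ctree is similar)
library(ggplot2)   # the accuracy comparison chart in the Conclusion
# caret additionally needs klaR, kernlab and randomForest installed
# for method = "nb", "svmLinear" and "rf" respectively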

Read Data

df_nb <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_nb <- df_nb %>% 
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
# 80/20 train-test split (the 10-fold CV happens inside train() later)
index_df_nb <- sample(nrow(df_nb), nrow(df_nb)*0.8)
train_df_nb <- df_nb[index_df_nb,]
test_df_nb <- df_nb[-index_df_nb,]
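
Before modelling, it is worth checking how balanced the target is; this will matter when reading the accuracies later:

# proportion of each credit_rep class in the training set
prop.table(table(train_df_nb$credit_rep))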

Data Modelling

# split the predictors (all columns except credit_rep, column 2) from the response
predvar_nb <- train_df_nb[ , -2]
respvar_nb <- train_df_nb$credit_rep

Model Making

model_naive <- train(predvar_nb,
                     respvar_nb,
                     method = "nb",
                     trControl=trainControl(method='cv',number=10))
model_naive
#> Naive Bayes 
#> 
#> 5583 samples
#>    6 predictor
#>    2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 5025, 5025, 5025, 5025, 5024, 5024, ... 
#> Resampling results across tuning parameters:
#> 
#>   usekernel  Accuracy   Kappa       
#>   FALSE      0.7126998  0.0000000000
#>    TRUE      0.7126995  0.0005246613
#> 
#> Tuning parameter 'fL' was held constant at a value of 0
#> Tuning
#>  parameter 'adjust' was held constant at a value of 1
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were fL = 0, usekernel = FALSE and adjust
#>  = 1.

Model Evaluation

predict_nb <- predict(model_naive,
                      newdata = test_df_nb)
confusionMatrix(predict_nb, test_df_nb$credit_rep)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1000  396
#>          1    0    0
#>                                              
#>                Accuracy : 0.7163             
#>                  95% CI : (0.6919, 0.7399)   
#>     No Information Rate : 0.7163             
#>     P-Value [Acc > NIR] : 0.5135             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7163             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7163             
#>          Detection Rate : 0.7163             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Naive Bayes classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.63%. Note, however, that the model predicts the majority class '0' for every test case (Specificity = 0, Kappa = 0), so its accuracy is exactly the No Information Rate.
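
This is easy to verify by hand: since every prediction is '0', the accuracy is simply the share of the majority class in the test set, read straight off the confusion matrix:

# accuracy = correct predictions / all predictions
(1000 + 0) / (1000 + 396 + 0 + 0)
#> [1] 0.7163324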

Variable Importance

var_nb <- varImp(model_naive)
plot(var_nb)


The variable importance plot shows that moral_all is clearly the most important variable for predicting the outcome.
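
To read the scores as numbers rather than from the plot, the importance table can be pulled out of the varImp object (a small sketch; for a two-class problem caret may report one column per class):

# importance scores as a data frame, sorted from most to least important
imp_nb <- var_nb$importance
imp_nb[order(imp_nb[, 1], decreasing = TRUE), , drop = FALSE]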

Support Vector Machine

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence

Read Data

df_svm <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_svm <- df_svm %>% 
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_svm <- sample(nrow(df_svm), nrow(df_svm)*0.8)
train_df_svm <- df_svm[index_df_svm,]
test_df_svm <- df_svm[-index_df_svm,]

Data Modelling

# predictor/response split (kept for symmetry with the Naive Bayes workflow;
# the SVM model below is fit with the formula interface and does not use these)
predvar_svm <- train_df_svm[ , -2]
respvar_svm <- train_df_svm$credit_rep

Model Making

model_svm <- train(credit_rep ~.,
                   data = train_df_svm,
                   method = "svmLinear",
                   trControl=trainControl(method='cv',number=10))
model_svm
#> Support Vector Machines with Linear Kernel 
#> 
#> 5583 samples
#>    6 predictor
#>    2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 5025, 5025, 5025, 5025, 5024, 5024, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa
#>   0.7126998  0    
#> 
#> Tuning parameter 'C' was held constant at a value of 1

Model Evaluation

predict_svm <- predict(model_svm,
                      newdata = test_df_svm)
confusionMatrix(predict_svm, test_df_svm$credit_rep)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1000  396
#>          1    0    0
#>                                              
#>                Accuracy : 0.7163             
#>                  95% CI : (0.6919, 0.7399)   
#>     No Information Rate : 0.7163             
#>     P-Value [Acc > NIR] : 0.5135             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7163             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7163             
#>          Detection Rate : 0.7163             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Support Vector Machine classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.63%. As with Naive Bayes, the model only ever predicts the majority class, so its accuracy again equals the No Information Rate.

Variable Importance

var_svm <- varImp(model_svm)
plot(var_svm)


The variable importance plot shows that moral_all is clearly the most important variable for predicting the outcome.

Decision Tree

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence

Read Data

df_dt <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_dt <- df_dt %>% 
  select(credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_dt <- sample(nrow(df_dt), nrow(df_dt)*0.8)
train_df_dt <- df_dt[index_df_dt,]
test_df_dt <- df_dt[-index_df_dt,]

Model Making

model_dt <- ctree(formula = credit_rep ~ .,
                  data = train_df_dt)
plot(model_dt, type = "simple")


Model Evaluation

# note: this evaluation is on the training set; a held-out version is
# sketched after the interpretation below
predict_dt <- predict(object = model_dt,
                      newdata = train_df_dt,
                      type = "response")
confusionMatrix(data = predict_dt,
                reference = train_df_dt$credit_rep,
                positive = "0")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 3979 1604
#>          1    0    0
#>                                              
#>                Accuracy : 0.7127             
#>                  95% CI : (0.7006, 0.7245)   
#>     No Information Rate : 0.7127             
#>     P-Value [Acc > NIR] : 0.5067             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7127             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7127             
#>          Detection Rate : 0.7127             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Decision Tree classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.27%. Two caveats: the tree also predicts only the majority class (Kappa = 0), and unlike the other models this confusion matrix was computed on the training data.
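
For a like-for-like comparison with the other models, the tree can be evaluated on the held-out test set instead; a sketch using the same functions, just swapping in test_df_dt:

# evaluate the decision tree on the held-out test set
predict_dt_test <- predict(object = model_dt,
                           newdata = test_df_dt,
                           type = "response")
confusionMatrix(data = predict_dt_test,
                reference = test_df_dt$credit_rep,
                positive = "0")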

Random Forest

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence

Read Data

df_rf <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_rf <- df_rf %>% 
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_rf <- sample(nrow(df_rf), nrow(df_rf)*0.8)
train_df_rf <- df_rf[index_df_rf,]
test_df_rf <- df_rf[-index_df_rf,]

Model Making

model_forest <- train(credit_rep ~ .,
                      data = train_df_rf,
                      method = "rf",
                      trControl = trainControl(method='cv',number=10))
 
saveRDS(model_forest, "model_forest.RDS")

Model Evaluation

predict_rf <- predict(model_forest,
                      newdata = test_df_rf)
confusionMatrix(predict_rf, test_df_rf$credit_rep)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1000  396
#>          1    0    0
#>                                              
#>                Accuracy : 0.7163             
#>                  95% CI : (0.6919, 0.7399)   
#>     No Information Rate : 0.7163             
#>     P-Value [Acc > NIR] : 0.5135             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7163             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7163             
#>          Detection Rate : 0.7163             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Random Forest classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.63%. Like the other models, it predicts only the majority class, so accuracy equals the No Information Rate.
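
Random forests also carry an out-of-bag (OOB) error estimate, a useful sanity check alongside the cross-validated accuracy. caret stores the underlying randomForest fit in finalModel; printing it shows the OOB estimate and a per-class confusion matrix:

# inspect the underlying randomForest object, including its OOB error estimate
model_forest$finalModel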

Variable Importance

var_rf <- varImp(model_forest)
plot(var_rf)


The variable importance plot shows that moral_all is clearly the most important variable for predicting the outcome.

Conclusion

Model Accuracy Comparison

acc_rank <- data.frame(Method = c("Naive Bayes", "Support Vector Machine", 
                                  "Decision Tree", "Random Forest"),
                       Accuracy = c(71.63, 71.63, 71.27, 71.63))
 
ggplot(acc_rank, aes(x=Method, y=Accuracy)) +
  geom_segment(aes(x=Method, xend=Method, y=71, yend=Accuracy)) +
  geom_point(size=5, color="red", fill=alpha("orange", 0.3), alpha=0.7, 
              shape=21, stroke=2) +
  geom_text(label = acc_rank$Accuracy,
            hjust=1.5,
            vjust=0) +
  labs(y= "Accuracy (%)") +
  theme_minimal()


As we can see, Naive Bayes, Random Forest and Support Vector Machine all reach the same accuracy of 71.63%, while Decision Tree has the lowest at 71.27% (keeping in mind that the Decision Tree figure is a training-set accuracy, and that every model simply predicts the majority class).

Model Pros & Cons

Naive Bayes

Pros:

  • Fast training time (because of the “naive” assumption)

  • Often used as a baseline classifier against which more complex models are compared

  • Well suited to text classification/text analysis, which can involve a very large number of word predictors

Cons:

  • Skewness due to data scarcity: if a predictor level never occurs together with one of the target classes in the training data, the model assigns that combination a probability of exactly 0, which biases the predictions (see the Laplace-smoothing sketch after this list)
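
This zero-frequency problem is usually handled with Laplace smoothing, which caret exposes for method = "nb" as the fL tuning parameter (held at 0 in the model above). A sketch of refitting with fL = 1; when a tuneGrid is supplied, usekernel and adjust must be given as well:

# refit Naive Bayes with Laplace smoothing so no class-conditional
# probability can be exactly zero
model_naive_laplace <- train(predvar_nb,
                             respvar_nb,
                             method = "nb",
                             tuneGrid = expand.grid(fL = 1,
                                                    usekernel = FALSE,
                                                    adjust = 1),
                             trControl = trainControl(method = "cv", number = 10))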

Support Vector Machine

Pros:

  • Works really well with a clear margin of separation

  • Effective in high dimensional spaces

  • Effective in cases where the number of dimensions is greater than the number of samples

Cons:

  • Doesn’t perform well on large datasets, because the required training time grows quickly

  • Doesn’t perform well when the dataset is noisy, i.e. when the target classes overlap; adjusting the softness of the margin via the cost parameter C can help (see the sketch after this list)
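
The cost parameter C controls the trade-off between a wide margin and misclassified training points; it was held constant at 1 in the model above. A sketch of letting caret search over a few candidate values instead:

# tune the cost parameter C instead of keeping the default C = 1
model_svm_tuned <- train(credit_rep ~ .,
                         data = train_df_svm,
                         method = "svmLinear",
                         tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2, 4)),
                         trControl = trainControl(method = "cv", number = 10))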

Decision Tree

Pros:

  • Powerful, yet still interpretable

  • Can be used for regression cases

Cons:

  • Tends to overfit (a growth-control sketch follows this list)

  • Had the lowest accuracy of the four methods in our comparison
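
Overfitting in a conditional inference tree can be limited through ctree_control(), for example by capping the tree depth or raising the significance threshold a split must pass (a sketch, assuming the partykit implementation of ctree()):

# grow a deliberately smaller tree to reduce overfitting
model_dt_small <- ctree(credit_rep ~ .,
                        data = train_df_dt,
                        control = ctree_control(maxdepth = 3,
                                                mincriterion = 0.99))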

Random Forest

Pros:

  • Easy to interpret

  • Handles both categorical and continuous data well

  • Works well on a large dataset

Cons:

  • Can be prone to overfitting

  • The ensemble of trees can be quite large, thus making pruning necessary

  • Training can be very time-consuming (see the sketch after this list for ways to speed it up)
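
Training time can be cut by fixing mtry instead of letting caret search over it, growing fewer trees, and resampling less heavily; a sketch of all three (ntree is passed through to randomForest):

# a faster random forest: fixed mtry, fewer trees, 5-fold CV
model_forest_fast <- train(credit_rep ~ .,
                           data = train_df_rf,
                           method = "rf",
                           tuneGrid = data.frame(mtry = 2),
                           ntree = 100,
                           trControl = trainControl(method = "cv", number = 5))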