Naive Bayes Method

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence
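
Note on packages: the code below assumes that dplyr (for %>%, select() and mutate()), caret (for train(), trainControl(), confusionMatrix() and varImp()), partykit (for ctree()) and ggplot2 are attached, if they are not already loaded earlier in the document. A minimal, assumed setup:

library(dplyr)     # %>%, select(), mutate()
library(caret)     # train(), trainControl(), confusionMatrix(), varImp()
library(partykit)  # ctree() for the decision tree (party::ctree is similar)
library(ggplot2)   # the accuracy comparison chart in the Conclusion
# caret additionally needs klaR, kernlab and randomForest installed
# for method = "nb", "svmLinear" and "rf" respectively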

Read Data

df_nb <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_nb <- df_nb %>% 
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
# 80/20 train-test split (the 10-fold CV happens inside train() later)
index_df_nb <- sample(nrow(df_nb), nrow(df_nb)*0.8)
train_df_nb <- df_nb[index_df_nb,]
test_df_nb <- df_nb[-index_df_nb,]
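
Before modelling, it is worth checking how balanced the target is; this will matter when reading the accuracies later:

# proportion of each credit_rep class in the training set
prop.table(table(train_df_nb$credit_rep))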

Data Modelling

# split the predictors (all columns except credit_rep, column 2) from the response
predvar_nb <- train_df_nb[ , -2]
respvar_nb <- train_df_nb$credit_rep

Model Making

model_naive <- train(predvar_nb,
                     respvar_nb,
                     method = "nb",
                     trControl=trainControl(method='cv',number=10))
model_naive
#> Naive Bayes 
#> 
#> 5583 samples
#>    6 predictor
#>    2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 5025, 5025, 5025, 5025, 5024, 5024, ... 
#> Resampling results across tuning parameters:
#> 
#>   usekernel  Accuracy   Kappa       
#>   FALSE      0.7126998  0.0000000000
#>    TRUE      0.7126995  0.0005246613
#> 
#> Tuning parameter 'fL' was held constant at a value of 0
#> Tuning
#>  parameter 'adjust' was held constant at a value of 1
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were fL = 0, usekernel = FALSE and adjust
#>  = 1.

Model Evaluation

predict_nb <- predict(model_naive,
                      newdata = test_df_nb)
confusionMatrix(predict_nb, test_df_nb$credit_rep)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1000  396
#>          1    0    0
#>                                              
#>                Accuracy : 0.7163             
#>                  95% CI : (0.6919, 0.7399)   
#>     No Information Rate : 0.7163             
#>     P-Value [Acc > NIR] : 0.5135             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7163             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7163             
#>          Detection Rate : 0.7163             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Naive Bayes classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.63%. Note, however, that the model predicts the majority class '0' for every test case (Specificity = 0, Kappa = 0), so its accuracy is exactly the No Information Rate.
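
This is easy to verify by hand: since every prediction is '0', the accuracy is simply the share of the majority class in the test set, read straight off the confusion matrix:

# accuracy = correct predictions / all predictions
(1000 + 0) / (1000 + 396 + 0 + 0)
#> [1] 0.7163324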

Variable Importance

var_nb <- varImp(model_naive)
plot(var_nb)


The variable importance plot shows that moral_all is clearly the most important variable for predicting the outcome.
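
To read the scores as numbers rather than from the plot, the importance table can be pulled out of the varImp object (a small sketch; for a two-class problem caret may report one column per class):

# importance scores as a data frame, sorted from most to least important
imp_nb <- var_nb$importance
imp_nb[order(imp_nb[, 1], decreasing = TRUE), , drop = FALSE]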

Support Vector Machine

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence

Read Data

df_svm <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_svm <- df_svm %>% 
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_svm <- sample(nrow(df_svm), nrow(df_svm)*0.8)
train_df_svm <- df_svm[index_df_svm,]
test_df_svm <- df_svm[-index_df_svm,]

Data Modelling

# predictor/response split (kept for symmetry with the Naive Bayes workflow;
# the SVM model below is fit with the formula interface and does not use these)
predvar_svm <- train_df_svm[ , -2]
respvar_svm <- train_df_svm$credit_rep

Model Making

model_svm <- train(credit_rep ~.,
                   data = train_df_svm,
                   method = "svmLinear",
                   trControl=trainControl(method='cv',number=10))
model_svm
#> Support Vector Machines with Linear Kernel 
#> 
#> 5583 samples
#>    6 predictor
#>    2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 5025, 5025, 5025, 5025, 5024, 5024, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa
#>   0.7126998  0    
#> 
#> Tuning parameter 'C' was held constant at a value of 1

Model Evaluation

predict_svm <- predict(model_svm,
                      newdata = test_df_svm)
confusionMatrix(predict_svm, test_df_svm$credit_rep)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1000  396
#>          1    0    0
#>                                              
#>                Accuracy : 0.7163             
#>                  95% CI : (0.6919, 0.7399)   
#>     No Information Rate : 0.7163             
#>     P-Value [Acc > NIR] : 0.5135             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7163             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7163             
#>          Detection Rate : 0.7163             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Support Vector Machine classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.63%. As with Naive Bayes, the model only ever predicts the majority class, so its accuracy again equals the No Information Rate.

Variable Importance

var_svm <- varImp(model_svm)
plot(var_svm)


The variable importance plot shows that moral_all is clearly the most important variable for predicting the outcome.

Decision Tree

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence

Read Data

df_dt <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_dt <- df_dt %>% 
  select(credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_dt <- sample(nrow(df_dt), nrow(df_dt)*0.8)
train_df_dt <- df_dt[index_df_dt,]
test_df_dt <- df_dt[-index_df_dt,]

Model Making

model_dt <- ctree(formula = credit_rep ~ .,
                  data = train_df_dt)
plot(model_dt, type = "simple")


Model Evaluation

# note: this evaluation is on the training set; a held-out version is
# sketched after the interpretation below
predict_dt <- predict(object = model_dt,
                      newdata = train_df_dt,
                      type = "response")
confusionMatrix(data = predict_dt,
                reference = train_df_dt$credit_rep,
                positive = "0")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 3979 1604
#>          1    0    0
#>                                              
#>                Accuracy : 0.7127             
#>                  95% CI : (0.7006, 0.7245)   
#>     No Information Rate : 0.7127             
#>     P-Value [Acc > NIR] : 0.5067             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7127             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7127             
#>          Detection Rate : 0.7127             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Decision Tree classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.27%. Two caveats: the tree also predicts only the majority class (Kappa = 0), and unlike the other models this confusion matrix was computed on the training data.
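
For a like-for-like comparison with the other models, the tree can be evaluated on the held-out test set instead; a sketch using the same functions, just swapping in test_df_dt:

# evaluate the decision tree on the held-out test set
predict_dt_test <- predict(object = model_dt,
                           newdata = test_df_dt,
                           type = "response")
confusionMatrix(data = predict_dt_test,
                reference = test_df_dt$credit_rep,
                positive = "0")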

Random Forest

Case:

Classify whether or not customers have paid their credit card bills, based on their age, tendency to commit crime, morality, religion, credit history and place of residence

Read Data

df_rf <- read.csv("data_input/default_sample.csv")

Data Preprocessing

df_rf <- df_rf %>% 
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history, 
         religious_province) %>% 
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_rf <- sample(nrow(df_rf), nrow(df_rf)*0.8)
train_df_rf <- df_rf[index_df_rf,]
test_df_rf <- df_rf[-index_df_rf,]

Model Making

model_forest <- train(credit_rep ~ .,
                      data = train_df_rf,
                      method = "rf",
                      trControl = trainControl(method='cv',number=10))
 
saveRDS(model_forest, "model_forest.RDS")

Model Evaluation

predict_rf <- predict(model_forest,
                      newdata = test_df_rf)
confusionMatrix(predict_rf, test_df_rf$credit_rep)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1000  396
#>          1    0    0
#>                                              
#>                Accuracy : 0.7163             
#>                  95% CI : (0.6919, 0.7399)   
#>     No Information Rate : 0.7163             
#>     P-Value [Acc > NIR] : 0.5135             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.7163             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.7163             
#>          Detection Rate : 0.7163             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 0                  
#> 

Interpretation

The final output shows that we built a Random Forest classifier that predicts whether or not a person pays their credit card bill, with an accuracy of approximately 71.63%. Like the other models, it predicts only the majority class, so accuracy equals the No Information Rate.
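
Random forests also carry an out-of-bag (OOB) error estimate, a useful sanity check alongside the cross-validated accuracy. caret stores the underlying randomForest fit in finalModel; printing it shows the OOB estimate and a per-class confusion matrix:

# inspect the underlying randomForest object, including its OOB error estimate
model_forest$finalModel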

Variable Importance

var_rf <- varImp(model_forest)
plot(var_rf)


The variable importance plot shows that moral_all is clearly the most important variable for predicting the outcome.

Conclusion

Model Accuracy Comparison

acc_rank <- data.frame(Method = c("Naive Bayes", "Support Vector Machine", 
                                  "Decision Tree", "Random Forest"),
                       Accuracy = c(71.63, 71.63, 71.27, 71.63))
 
ggplot(acc_rank, aes(x=Method, y=Accuracy)) +
  geom_segment(aes(x=Method, xend=Method, y=71, yend=Accuracy)) +
  geom_point(size=5, color="red", fill=alpha("orange", 0.3), alpha=0.7, 
              shape=21, stroke=2) +
  geom_text(label = acc_rank$Accuracy,
            hjust=1.5,
            vjust=0) +
  labs(y= "Accuracy (%)") +
  theme_minimal()


As we can see, Naive Bayes, Random Forest and Support Vector Machine all reach the same accuracy of 71.63%, while Decision Tree has the lowest at 71.27% (keeping in mind that the Decision Tree figure is a training-set accuracy, and that every model simply predicts the majority class).

Model Pros & Cons

Naive Bayes

Pros:

  • Fast training time (because of the “naive” assumption)

  • Often used as a baseline classifier against which more complex models are compared

  • Well suited to text classification/text analysis, which can involve a very large number of word predictors

Cons:

  • Skewness due to data scarcity: if a predictor level never occurs together with one of the target classes in the training data, the model assigns that combination a probability of exactly 0, which biases the predictions (see the Laplace-smoothing sketch after this list)
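
This zero-frequency problem is usually handled with Laplace smoothing, which caret exposes for method = "nb" as the fL tuning parameter (held at 0 in the model above). A sketch of refitting with fL = 1; when a tuneGrid is supplied, usekernel and adjust must be given as well:

# refit Naive Bayes with Laplace smoothing so no class-conditional
# probability can be exactly zero
model_naive_laplace <- train(predvar_nb,
                             respvar_nb,
                             method = "nb",
                             tuneGrid = expand.grid(fL = 1,
                                                    usekernel = FALSE,
                                                    adjust = 1),
                             trControl = trainControl(method = "cv", number = 10))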

Support Vector Machine

Pros:

  • Works really well with a clear margin of separation

  • Effective in high dimensional spaces

  • Effective in cases where the number of dimensions is greater than the number of samples

Cons:

  • Doesn’t perform well on large datasets, because the required training time grows quickly

  • Doesn’t perform well when the dataset is noisy, i.e. when the target classes overlap; adjusting the softness of the margin via the cost parameter C can help (see the sketch after this list)
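
The cost parameter C controls the trade-off between a wide margin and misclassified training points; it was held constant at 1 in the model above. A sketch of letting caret search over a few candidate values instead:

# tune the cost parameter C instead of keeping the default C = 1
model_svm_tuned <- train(credit_rep ~ .,
                         data = train_df_svm,
                         method = "svmLinear",
                         tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2, 4)),
                         trControl = trainControl(method = "cv", number = 10))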

Decision Tree

Pros:

  • Powerful, yet still interpretable

  • Can be used for regression cases

Cons:

  • Tends to overfit (a growth-control sketch follows this list)

  • Had the lowest accuracy of the four methods in our comparison
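
Overfitting in a conditional inference tree can be limited through ctree_control(), for example by capping the tree depth or raising the significance threshold a split must pass (a sketch, assuming the partykit implementation of ctree()):

# grow a deliberately smaller tree to reduce overfitting
model_dt_small <- ctree(credit_rep ~ .,
                        data = train_df_dt,
                        control = ctree_control(maxdepth = 3,
                                                mincriterion = 0.99))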

Random Forest

Pros:

  • Easy to interpret

  • Handles both categorical and continuous data well

  • Works well on a large dataset

Cons:

  • Can be prone to overfitting

  • The ensemble of trees can be quite large, thus making pruning necessary

  • Training can be very time-consuming (see the sketch after this list for ways to speed it up)
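
Training time can be cut by fixing mtry instead of letting caret search over it, growing fewer trees, and resampling less heavily; a sketch of all three (ntree is passed through to randomForest):

# a faster random forest: fixed mtry, fewer trees, 5-fold CV
model_forest_fast <- train(credit_rep ~ .,
                           data = train_df_rf,
                           method = "rf",
                           tuneGrid = data.frame(mtry = 2),
                           ntree = 100,
                           trControl = trainControl(method = "cv", number = 5))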