Case:
Classify whether or not customers have paid their credit card bills based on age, tendency to commit crime, morals, religion, credit history, and province of residence.
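The code below assumes the following packages have been loaded; the original write-up does not show its library() calls, so this list is a reconstruction (ctree() is assumed to come from partykit, though the older party package exports it as well). caret will additionally load klaR, kernlab, and randomForest on demand for the "nb", "svmLinear", and "rf" methods.

library(dplyr)     # select(), mutate(), %>%
library(caret)     # train(), trainControl(), confusionMatrix(), varImp()
library(partykit)  # ctree() for the decision tree section
library(ggplot2)   # the accuracy comparison chart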
df_nb <- read.csv("data_input/default_sample.csv")

df_nb <- df_nb %>%
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history,
         religious_province) %>%
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

# 80/20 train-test split
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_nb <- sample(nrow(df_nb), nrow(df_nb) * 0.8)
train_df_nb <- df_nb[index_df_nb, ]
test_df_nb <- df_nb[-index_df_nb, ]

# separate the predictors (all columns except credit_rep) from the response
predvar_nb <- train_df_nb[, -2]
respvar_nb <- train_df_nb$credit_rep

model_naive <- train(predvar_nb,
                     respvar_nb,
                     method = "nb",
                     trControl = trainControl(method = "cv", number = 10))
model_naive
#> Naive Bayes
#>
#> 5583 samples
#> 6 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 5025, 5025, 5025, 5025, 5024, 5024, ...
#> Resampling results across tuning parameters:
#>
#> usekernel Accuracy Kappa
#> FALSE 0.7126998 0.0000000000
#> TRUE 0.7126995 0.0005246613
#>
#> Tuning parameter 'fL' was held constant at a value of 0
#> Tuning
#> parameter 'adjust' was held constant at a value of 1
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were fL = 0, usekernel = FALSE and adjust
#> = 1.
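caret tuned only usekernel above, holding fL and adjust at their defaults. To search over all three Naive Bayes parameters, a custom grid can be passed to train(); a minimal sketch, where the candidate values are illustrative assumptions rather than choices from the original analysis:

# Hedged sketch: illustrative tuning grid for caret's "nb" method
search_grid <- expand.grid(fL = c(0, 1),              # Laplace smoothing
                           usekernel = c(TRUE, FALSE),
                           adjust = c(1, 2))          # kernel bandwidth adjustment
model_naive_tuned <- train(predvar_nb, respvar_nb,
                           method = "nb",
                           trControl = trainControl(method = "cv", number = 10),
                           tuneGrid = search_grid)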
predict_nb <- predict(model_naive, newdata = test_df_nb)

confusionMatrix(predict_nb, test_df_nb$credit_rep)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1000 396
#> 1 0 0
#>
#> Accuracy : 0.7163
#> 95% CI : (0.6919, 0.7399)
#> No Information Rate : 0.7163
#> P-Value [Acc > NIR] : 0.5135
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 1.0000
#> Specificity : 0.0000
#> Pos Pred Value : 0.7163
#> Neg Pred Value : NaN
#> Prevalence : 0.7163
#> Detection Rate : 0.7163
#> Detection Prevalence : 1.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : 0
#>
The final output shows that the Naive Bayes classifier predicts whether or not a customer pays their credit card bill with an accuracy of approximately 71.63%. Note, however, that the model assigns class 0 to every observation: the accuracy exactly matches the No Information Rate and Kappa is 0, so the model does no better than always guessing the majority class.
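A quick check of the class balance makes this concrete; a minimal sketch:

# Share of each class; if class 0 makes up roughly 71-72% of observations,
# a constant "0" prediction reaches the same accuracy as the model above
prop.table(table(df_nb$credit_rep))
prop.table(table(test_df_nb$credit_rep))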
var_nb <- varImp(model_naive)
plot(var_nb)
[Figure: variable importance plot for the Naive Bayes model]
From the plot above, moral_all is the most important variable for predicting the outcome.
Case:
Classify whether or not customers have paid their credit card bills based on age, tendency to commit crime, morals, religion, credit history, and province of residence.
df_svm <- read.csv("data_input/default_sample.csv")

df_svm <- df_svm %>%
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history,
         religious_province) %>%
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

# 80/20 train-test split
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_svm <- sample(nrow(df_svm), nrow(df_svm) * 0.8)
train_df_svm <- df_svm[index_df_svm, ]
test_df_svm <- df_svm[-index_df_svm, ]

# defined for symmetry with the Naive Bayes section, but unused below
# because the formula interface is used instead
predvar_svm <- train_df_svm[, -2]
respvar_svm <- train_df_svm$credit_rep

model_svm <- train(credit_rep ~ .,
                   data = train_df_svm,
                   method = "svmLinear",
                   trControl = trainControl(method = "cv", number = 10))
model_svm
#> Support Vector Machines with Linear Kernel
#>
#> 5583 samples
#> 6 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 5025, 5025, 5025, 5025, 5024, 5024, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.7126998 0
#>
#> Tuning parameter 'C' was held constant at a value of 1
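The cost parameter C was held constant at 1. To have caret search over several values instead, a grid can be supplied; a minimal sketch, where the candidate values are illustrative assumptions:

# Hedged sketch: tuning the linear-SVM cost parameter
model_svm_tuned <- train(credit_rep ~ .,
                         data = train_df_svm,
                         method = "svmLinear",
                         trControl = trainControl(method = "cv", number = 10),
                         tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))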
predict_svm <- predict(model_svm, newdata = test_df_svm)

confusionMatrix(predict_svm, test_df_svm$credit_rep)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1000 396
#> 1 0 0
#>
#> Accuracy : 0.7163
#> 95% CI : (0.6919, 0.7399)
#> No Information Rate : 0.7163
#> P-Value [Acc > NIR] : 0.5135
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 1.0000
#> Specificity : 0.0000
#> Pos Pred Value : 0.7163
#> Neg Pred Value : NaN
#> Prevalence : 0.7163
#> Detection Rate : 0.7163
#> Detection Prevalence : 1.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : 0
#>
The final output shows that the Support Vector Machine classifier predicts whether or not a customer pays their credit card bill with an accuracy of approximately 71.63%. As with Naive Bayes, it predicts class 0 for every observation, so the accuracy again equals the No Information Rate.
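One possible remedy for this majority-class collapse is to rebalance the classes inside resampling, which caret's trainControl() supports via its sampling argument. A minimal sketch using down-sampling; the choice of "down" is an assumption, not something tried in the original analysis:

# Down-sample the majority class within each CV fold
ctrl_down <- trainControl(method = "cv", number = 10, sampling = "down")
model_svm_down <- train(credit_rep ~ .,
                        data = train_df_svm,
                        method = "svmLinear",
                        trControl = ctrl_down)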
var_svm <- varImp(model_svm)
plot(var_svm)
[Figure: variable importance plot for the SVM model]
From the plot above, moral_all is again the most important variable for predicting the outcome.
Case:
Classify whether or not customers have paid their credit card bills based on age, tendency to commit crime, morals, religion, credit history, and province of residence.
df_dt <- read.csv("data_input/default_sample.csv")

df_dt <- df_dt %>%
  select(credit_rep, delinquent, moral_all, muslim, poor_credit_history,
         religious_province) %>%
  mutate(credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

# 80/20 train-test split (note: age is not included in this section)
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_dt <- sample(nrow(df_dt), nrow(df_dt) * 0.8)
train_df_dt <- df_dt[index_df_dt, ]
test_df_dt <- df_dt[-index_df_dt, ]

model_dt <- ctree(formula = credit_rep ~ .,
                  data = train_df_dt)
plot(model_dt, type = "simple")
[Figure: decision tree diagram for model_dt]
# note: predictions here are made on the training set, not the test set
predict_dt <- predict(object = model_dt,
                      newdata = train_df_dt,
                      type = "response")

confusionMatrix(data = predict_dt,
                reference = train_df_dt$credit_rep,
                positive = "0")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 3979 1604
#> 1 0 0
#>
#> Accuracy : 0.7127
#> 95% CI : (0.7006, 0.7245)
#> No Information Rate : 0.7127
#> P-Value [Acc > NIR] : 0.5067
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 1.0000
#> Specificity : 0.0000
#> Pos Pred Value : 0.7127
#> Neg Pred Value : NaN
#> Prevalence : 0.7127
#> Detection Rate : 0.7127
#> Detection Prevalence : 1.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : 0
#>
The final output shows that the Decision Tree classifier predicts whether or not a customer pays their credit card bill with an accuracy of approximately 71.27%. Two caveats: unlike the other models, this confusion matrix was computed on the training set rather than the test set, and the tree likewise predicts class 0 for every observation.
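For a like-for-like comparison with the other models, the tree should also be scored on the held-out test set; a minimal sketch (its output is not shown in the original, so no numbers are reported here):

# Held-out evaluation of the decision tree
predict_dt_test <- predict(object = model_dt,
                           newdata = test_df_dt,
                           type = "response")
confusionMatrix(data = predict_dt_test,
                reference = test_df_dt$credit_rep,
                positive = "0")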
Case:
Classify whether or not customers have paid their credit card bills based on age, tendency to commit crime, morals, religion, credit history, and province of residence.
df_rf <- read.csv("data_input/default_sample.csv")

df_rf <- df_rf %>%
  select(age, credit_rep, delinquent, moral_all, muslim, poor_credit_history,
         religious_province) %>%
  mutate(age = as.integer(age),
         credit_rep = as.factor(credit_rep),
         delinquent = as.factor(delinquent),
         moral_all = as.factor(moral_all),
         muslim = as.factor(muslim),
         poor_credit_history = as.factor(poor_credit_history),
         religious_province = as.factor(religious_province))

# 80/20 train-test split
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_df_rf <- sample(nrow(df_rf), nrow(df_rf) * 0.8)
train_df_rf <- df_rf[index_df_rf, ]
test_df_rf <- df_rf[-index_df_rf, ]

model_forest <- train(credit_rep ~ .,
                      data = train_df_rf,
                      method = "rf",
                      trControl = trainControl(method = "cv", number = 10))

# save the fitted forest so it does not have to be retrained
saveRDS(model_forest, "model_forest.RDS")

predict_rf <- predict(model_forest, newdata = test_df_rf)

confusionMatrix(predict_rf, test_df_rf$credit_rep)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1000 396
#> 1 0 0
#>
#> Accuracy : 0.7163
#> 95% CI : (0.6919, 0.7399)
#> No Information Rate : 0.7163
#> P-Value [Acc > NIR] : 0.5135
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 1.0000
#> Specificity : 0.0000
#> Pos Pred Value : 0.7163
#> Neg Pred Value : NaN
#> Prevalence : 0.7163
#> Detection Rate : 0.7163
#> Detection Prevalence : 1.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : 0
#>
The final output shows that the Random Forest classifier predicts whether or not a customer pays their credit card bill with an accuracy of approximately 71.63%. Once again the model predicts class 0 for every observation, so the accuracy equals the No Information Rate.
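Because the forest was saved with saveRDS() above, a later session can restore it without retraining; a minimal sketch:

# Restore the fitted forest from disk instead of retraining
model_forest <- readRDS("model_forest.RDS")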
var_rf <- varImp(model_forest)
plot(var_rf)
[Figure: variable importance plot for the random forest model]
From the plot above, moral_all is once again the most important variable for predicting the outcome.
acc_rank <- data.frame(Method = c("Naive Bayes", "Support Vector Machine",
                                  "Decision Tree", "Random Forest"),
                       Accuracy = c(71.63, 71.63, 71.27, 71.63))

# lollipop chart of the four accuracies
ggplot(acc_rank, aes(x = Method, y = Accuracy)) +
  geom_segment(aes(x = Method, xend = Method, y = 71, yend = Accuracy)) +
  geom_point(size = 5, color = "red", fill = alpha("orange", 0.3), alpha = 0.7,
             shape = 21, stroke = 2) +
  geom_text(label = acc_rank$Accuracy, hjust = 1.5, vjust = 0) +
  labs(y = "Accuracy (%)") +
  theme_minimal()
[Figure: lollipop chart comparing the four models' accuracies]
As we can see, Naive Bayes, Random Forest, and Support Vector Machine share the same accuracy of 71.63%, while the Decision Tree has the lowest value at 71.27% (measured, as noted above, on its training set). Since every model simply predicts the majority class, these figures track the class balance rather than genuine predictive skill, so the ranking should be read with caution.
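Rather than typing the accuracies by hand, they can be pulled straight from the confusionMatrix() results. This sketch assumes the four results were stored in objects named cm_nb, cm_svm, cm_dt, and cm_rf — hypothetical names not used in the original code:

# e.g. cm_nb <- confusionMatrix(predict_nb, test_df_nb$credit_rep), etc.
acc_rank <- data.frame(
  Method = c("Naive Bayes", "Support Vector Machine",
             "Decision Tree", "Random Forest"),
  Accuracy = round(100 * c(cm_nb$overall["Accuracy"],
                           cm_svm$overall["Accuracy"],
                           cm_dt$overall["Accuracy"],
                           cm_rf$overall["Accuracy"]), 2)
)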
Naive Bayes
Pros:
Fast training time (because of the “naive” assumption)
Is often used as a base classifier (reference) to be compared with more complex models
Good for the case of text classification/text analysis which can have a lot of word predictors
Cons:
Relies on the "naive" assumption that predictors are independent of one another, which rarely holds in real data
Support Vector Machine
Pros:
Works really well with a clear margin of separation
Effective in high dimensional spaces
Effective in cases where the number of dimensions is greater than the number of samples
Cons:
Doesn’t perform well when we have large data set because the required training time is higher
Doesn’t perform very well, when the data set has more noise i.e. target classes are overlapping
Decision Tree
Pros:
Robust and powerful, yet still interpretable
Can also be used for regression cases
Cons:
Tends to overfit
Has the lowest accuracy of the four methods in our comparison
Random Forest
Pros:
Easy to interpret
Handles both categorical and continuous data well
Works well on large datasets
Cons:
Prone to overfitting
The ensemble can grow quite large, making pruning necessary
Training can be very time-consuming