Introduction

This machine learning project tackles the crucial task of detecting bad credit customers during credit card approval. The core of the project is a robust predictive model that can effectively assess the creditworthiness of applicants. By identifying potential high-risk customers, the model aims to strengthen the credit approval process, minimize financial risk, and support more informed decision-making on credit card applications.

These are some valuable business advantages that are offered by the result of this project:

  • Enhanced Risk Management: Accurate identification of high-risk customers minimizes the potential for financial losses and defaults.
  • Improved Approval Process: Streamlined credit assessment leads to quicker and more precise credit decisions, enhancing customer satisfaction.
  • Reduced Manual Workload: Automated detection of bad credit customers reduces the need for manual review and speeds up the approval process.
  • Enhanced Customer Relationships: Identifying and declining bad credit customers prevents overextension of credit and fosters trust among good credit customers.
  • Financial Stability: Minimizing exposure to bad credit risks contributes to the overall financial health and stability of the lending institution.
  • Regulatory Compliance: Accurate risk assessment ensures compliance with regulatory requirements and prevents unauthorized lending.
  • Optimal Resource Allocation: Precise identification of bad credit customers allows for targeted collection efforts and resource allocation.
  • Competitive Advantage: Effective risk management sets the organization apart from competitors and builds a reputation for responsible lending practices.
  • Improved Profitability: Reduced bad debt and increased repayment rates lead to improved financial outcomes and higher profitability.
  • Data-Driven Insights: The analysis of credit data provides valuable insights into customer behavior, aiding in refining credit policies and strategies.

In short: save the business time, save the business money, give the business more insight, and make the business more money.

Now let’s jump into the coding process!

Data Pre-processing

Import used libraries

library(dplyr)      # data wrangling
library(ggplot2)    # plotting (used for the EDA histograms below)
library(caret)      # upSample(), createDataPartition(), nearZeroVar(), confusionMatrix()
library(inspectdf)  # inspect_cat() distribution summaries
library(partykit)   # ctree() decision trees
library(e1071)      # naiveBayes()

Read the dataset

# stringsAsFactors = TRUE reads character columns straight into factors
risk <- read.csv("credit_card_approval.csv", stringsAsFactors = TRUE)

Inspect the data

Top 6 rows

risk %>% head()

Bottom 6 rows

risk %>% tail()

Check duplicated rows

risk %>% duplicated() %>% any()
## [1] FALSE

Thankfully, there are no duplicated rows
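
If duplicates had shown up, dplyr's distinct() would drop the exact copies while keeping the first occurrence of each row. A hypothetical clean-up step that this dataset happens not to need:

# hypothetical: keep only unique rows (not needed here, duplicated() returned FALSE)
risk <- risk %>% distinct()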

Check missing values

risk %>% anyNA()
## [1] FALSE

Thankfully, there are no missing values either
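
Had anyNA() returned TRUE, a per-column count would show where the gaps are. A minimal sketch, included for illustration only:

# NAs per column (all zeros for this dataset)
colSums(is.na(risk))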

Check data structure

risk %>% glimpse()
## Rows: 537,667
## Columns: 19
## $ ID                  <int> 5065438, 5142753, 5111146, 5010310, 5010835, 50670…
## $ CODE_GENDER         <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR        <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY     <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN        <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL    <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS  <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE   <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH          <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED       <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE     <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE          <int> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB                 <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS        <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS              <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ TARGET              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Some categorical columns are stored as numeric types, including the target variable. We'll handle that.

Parse categorical columns

I'm also gonna rename the target column to bad_credit

risk <- risk %>%
  # the 0/1 FLAG columns and the target are categories, not numbers
  mutate_at(vars(starts_with("FLAG"), TARGET), as.factor) %>% 
  rename(bad_credit = TARGET)

risk %>% glimpse()
## Rows: 537,667
## Columns: 19
## $ ID                  <int> 5065438, 5142753, 5111146, 5010310, 5010835, 50670…
## $ CODE_GENDER         <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR        <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY     <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN        <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL    <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS  <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE   <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH          <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED       <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE     <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE          <fct> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB                 <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS        <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS              <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ bad_credit          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Remove unneeded columns

The ID column is just the record identifier; it shouldn't be used for modelling, so I'm gonna remove it

risk <- risk %>% select(-ID)

risk %>% glimpse()
## Rows: 537,667
## Columns: 18
## $ CODE_GENDER         <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR        <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY     <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN        <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL    <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS  <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE   <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH          <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED       <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE     <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE          <fct> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB                 <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS        <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS              <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ bad_credit          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Cool! Now we're ready to explore the target variable and the features.

Exploratory Data Analysis

Check target variable proportion

risk$bad_credit %>% table() %>% barplot()

Ouch, it's heavily imbalanced! I'll upsample the data to fix that.
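
To put a number on the imbalance before fixing it, a quick proportion table helps. This check is my own addition, for illustration:

risk$bad_credit %>% table() %>% prop.table()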

Upsampling

# upSample() duplicates minority-class rows until both classes are equally frequent
up_risk <- upSample(x = risk %>% select(-bad_credit),
                    y = risk$bad_credit,
                    yname = "bad_credit")

up_risk$bad_credit %>% table() %>% barplot()

Cool! It’s balanced now!
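
Keep in mind that upSample() balances the classes by duplicating minority-class rows, so the dataset grows. Comparing row counts makes that visible (an illustrative check, not part of the original pipeline):

nrow(risk)     # original row count
nrow(up_risk)  # after upsampling: twice the majority-class count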

Inspect categorical columns distributions

up_risk %>% inspect_cat() %>% show_plot()

The FLAG_MOBIL column is all 1s. I'm gonna remove it because a constant column is useless; it can't provide any information to the model.
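
A frequency count confirms the column is constant (illustrative check):

up_risk %>% count(FLAG_MOBIL)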

Remove uninformative categorical columns

# caret's nearZeroVar() returns column indices of (near-)constant predictors such as FLAG_MOBIL;
# this must be applied to up_risk, since that is the data the model will be trained on
up_risk <- up_risk %>% select(-nearZeroVar(.))

up_risk %>% inspect_cat() %>% show_plot()

Cool! We're left with only informative categorical columns. Next, let's explore the numerical columns!

Inspect numerical columns distributions

# loop over every numeric column and draw its histogram
for (col in up_risk %>% select_if(is.numeric) %>% colnames()) {
  print(
    ggplot(up_risk, aes(x = !!sym(col))) +
      geom_histogram(aes(fill = after_stat(density)), col = "white", show.legend = F) +
      labs(x = NULL,
           y = "Count",  # geom_histogram plots counts; the fill just encodes density
           title = col) +
      theme_minimal()
  )
}

Thankfully, the numerical columns all look well behaved, with no anomalies that need special treatment. We can now move on to splitting the data!
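
Eyeballing histograms is subjective, so a numeric summary like skewness makes a useful cross-check; e1071, already loaded for Naive Bayes, provides one. An optional check added here for illustration:

# skewness near 0 suggests a roughly symmetric distribution
up_risk %>%
  select_if(is.numeric) %>%
  sapply(e1071::skewness)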

Train-Test Split

Set training indices

set.seed(1)  # fix the random seed so the split is reproducible

# put 80% of the rows in the training set, stratified by bad_credit
indices <- createDataPartition(y = up_risk$bad_credit,
                               p = 0.8,
                               list = F)

Split train & test

train_data <- up_risk[indices, ]
test_data <- up_risk[-indices, ]

X_train <- train_data %>% select(-bad_credit)
y_train <- train_data$bad_credit

X_test <- test_data %>% select(-bad_credit)
y_test <- test_data$bad_credit
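
A quick sanity check that the split kept the 80/20 ratio and the 50/50 class balance (illustrative, not part of the original write-up):

nrow(train_data) / nrow(up_risk)  # should be close to 0.8
prop.table(table(y_train))        # classes should stay ~50/50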

Model Fitting, Evaluation, and Selection

Naive Bayes Algorithm

Because the dataset is large, I'll start with the Naive Bayes algorithm, since it's fast like my Bugatti Chiron🏎️⚡

# laplace = 1 adds additive smoothing so unseen category levels don't zero out a class probability
model_nb <- naiveBayes(x = X_train,
                       y = y_train,
                       laplace = 1)

Naive Bayes Model Evaluation

pred_nb <- predict(model_nb, X_test)

confusionMatrix(pred_nb, y_test, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 107092      0
##          1     49 107141
##                                           
##                Accuracy : 0.9998          
##                  95% CI : (0.9997, 0.9998)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9995          
##                                           
##  Mcnemar's Test P-Value : 7.025e-12       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9995          
##          Pos Pred Value : 0.9995          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.5002          
##       Balanced Accuracy : 0.9998          
##                                           
##        'Positive' Class : 1               
## 

The Naive Bayes model exhibited an outstanding 99.98% Accuracy, 100% Sensitivity, and 99.95% Specificity in predicting bad credit customers. That is amazing performance, and we don't necessarily need further model fitting, but I'm gonna try the Decision Tree Classifier algorithm because I believe I can get to 100% Accuracy!
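
If you ever need scores rather than hard labels, for example to tune the decision threshold, the predict() method for naiveBayes can also return class probabilities. A minimal sketch:

# posterior class probabilities instead of hard 0/1 labels
prob_nb <- predict(model_nb, X_test, type = "raw")
head(prob_nb)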

Decision Tree Classifier Algorithm

# stopping rules for the tree:
#   mincriterion = 0.9: a split needs 1 - p-value > 0.9 (i.e. p < 0.1)
#   minsplit = 5:       a node needs at least 5 observations to be considered for splitting
#   minbucket = 3:      every terminal node must keep at least 3 observations
tree <- ctree_control(mincriterion = 0.9,
                      minsplit = 5,
                      minbucket = 3)

model_dt <- ctree(formula = bad_credit ~ .,
                  data = train_data,
                  control = tree)
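
To see what the tree actually learned, partykit can plot the fitted model. With this much data the full plot can get enormous, so the compact layout is the safer first look (illustrative):

# compact rendering of the fitted tree
plot(model_dt, type = "simple")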

Decision Tree Model Evaluation

pred_dt <- predict(model_dt, X_test)

confusionMatrix(pred_dt, y_test, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 107141      0
##          1      0 107141
##                                    
##                Accuracy : 1        
##                  95% CI : (1, 1)   
##     No Information Rate : 0.5      
##     P-Value [Acc > NIR] : < 2.2e-16
##                                    
##                   Kappa : 1        
##                                    
##  Mcnemar's Test P-Value : NA       
##                                    
##             Sensitivity : 1.0      
##             Specificity : 1.0      
##          Pos Pred Value : 1.0      
##          Neg Pred Value : 1.0      
##              Prevalence : 0.5      
##          Detection Rate : 0.5      
##    Detection Prevalence : 0.5      
##       Balanced Accuracy : 1.0      
##                                    
##        'Positive' Class : 1        
## 

I was right! 100% Accuracy!

The decision tree model achieved perfect Accuracy in detecting bad credit customers, making it the winning model of this project! Congratulations to model_dt!🥳🤩
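
To reuse the winning model outside this session, it can be serialized to disk and loaded back in a scoring script. A minimal sketch; the file name is just an example:

# persist the fitted tree so it can be reloaded without retraining
saveRDS(model_dt, "model_dt.rds")

# later, in a scoring script:
model_dt <- readRDS("model_dt.rds")
predict(model_dt, newdata = X_test) %>% head()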

Conclusion

In conclusion, the final model exhibited outstanding performance in predicting bad credit customers. With a perfect Accuracy of 100%, it classified both bad credit and good credit cases without error. The Sensitivity of 100% means the model captured every single bad credit customer, and the Specificity of 100% means it never misflagged a good credit customer. One caveat to verify before production use: because the data was upsampled before the train-test split, duplicated minority-class rows can appear in both sets and inflate test metrics, so these numbers should be confirmed on a hold-out set drawn before upsampling. With that check in place, the decision tree model is well suited for real-world credit card approval!