Introduction

This machine learning project tackles the crucial task of detecting bad credit customers during credit card approval. The core of the project is a robust predictive model that can effectively assess the creditworthiness of applicants. By identifying potential high-risk customers, the model aims to strengthen the credit approval process, minimize financial risk, and support more informed decision-making on credit card applications.

These are some valuable business advantages that are offered by the result of this project:

  • Enhanced Risk Management: Accurate identification of high-risk customers minimizes the potential for financial losses and defaults.
  • Improved Approval Process: Streamlined credit assessment leads to quicker and more precise credit decisions, enhancing customer satisfaction.
  • Reduced Manual Workload: Automated detection of bad credit customers reduces the need for manual review and speeds up the approval process.
  • Enhanced Customer Relationships: Identifying and declining bad credit customers prevents overextension of credit and fosters trust among good credit customers.
  • Financial Stability: Minimizing exposure to bad credit risks contributes to the overall financial health and stability of the lending institution.
  • Regulatory Compliance: Accurate risk assessment ensures compliance with regulatory requirements and prevents unauthorized lending.
  • Optimal Resource Allocation: Precise identification of bad credit customers allows for targeted collection efforts and resource allocation.
  • Competitive Advantage: Effective risk management sets the organization apart from competitors and builds a reputation for responsible lending practices.
  • Improved Profitability: Reduced bad debt and increased repayment rates lead to improved financial outcomes and higher profitability.
  • Data-Driven Insights: The analysis of credit data provides valuable insights into customer behavior, aiding in refining credit policies and strategies.

In short: save the business time, save the business money, give the business more insight, and make the business more money.

Now let’s jump into the coding process!

Data Pre-processing

Import used libraries

library(dplyr)      # data wrangling
library(ggplot2)    # plotting (used for the EDA histograms below)
library(caret)      # upSample(), createDataPartition(), nearZeroVar(), confusionMatrix()
library(inspectdf)  # inspect_cat() distribution summaries
library(partykit)   # ctree() decision trees
library(e1071)      # naiveBayes()

Read the dataset

# stringsAsFactors = TRUE reads character columns straight into factors
risk <- read.csv("credit_card_approval.csv", stringsAsFactors = TRUE)

Inspect the data

Top 6 rows

risk %>% head()

Bottom 6 rows

risk %>% tail()

Check duplicated rows

risk %>% duplicated() %>% any()
## [1] FALSE

Thankfully, there are no duplicated rows
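
If duplicates had shown up, dplyr's distinct() would drop the exact copies while keeping the first occurrence of each row. A hypothetical clean-up step that this dataset happens not to need:

# hypothetical: keep only unique rows (not needed here, duplicated() returned FALSE)
risk <- risk %>% distinct()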

Check missing values

risk %>% anyNA()
## [1] FALSE

Thankfully, there are no missing values either
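
Had anyNA() returned TRUE, a per-column count would show where the gaps are. A minimal sketch, included for illustration only:

# NAs per column (all zeros for this dataset)
colSums(is.na(risk))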

Check data structure

risk %>% glimpse()
## Rows: 537,667
## Columns: 19
## $ ID                  <int> 5065438, 5142753, 5111146, 5010310, 5010835, 50670…
## $ CODE_GENDER         <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR        <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY     <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN        <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL    <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS  <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE   <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH          <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED       <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE     <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE          <int> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB                 <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS        <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS              <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ TARGET              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Some categorical columns are stored as numeric types, including the target variable. We'll handle that.

Parse categorical columns

I'm also gonna rename the target column to bad_credit

risk <- risk %>%
  # the 0/1 FLAG columns and the target are categories, not numbers
  mutate_at(vars(starts_with("FLAG"), TARGET), as.factor) %>% 
  rename(bad_credit = TARGET)

risk %>% glimpse()
## Rows: 537,667
## Columns: 19
## $ ID                  <int> 5065438, 5142753, 5111146, 5010310, 5010835, 50670…
## $ CODE_GENDER         <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR        <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY     <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN        <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL    <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS  <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE   <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH          <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED       <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE     <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE          <fct> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB                 <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS        <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS              <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ bad_credit          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Remove unneeded columns

The ID column is just the record identifier; it shouldn't be used for modelling, so I'm gonna remove it

risk <- risk %>% select(-ID)

risk %>% glimpse()
## Rows: 537,667
## Columns: 18
## $ CODE_GENDER         <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR        <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY     <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN        <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL    <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS  <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE   <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH          <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED       <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE     <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE          <fct> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB                 <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS        <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS              <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ bad_credit          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Cool! Now we're ready to explore the target variable and the features.

Exploratory Data Analysis

Check target variable proportion

risk$bad_credit %>% table() %>% barplot()

Ouch, it's heavily imbalanced! I'll upsample the data to fix that.
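
To put a number on the imbalance before fixing it, a quick proportion table helps. This check is my own addition, for illustration:

risk$bad_credit %>% table() %>% prop.table()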

Upsampling

# upSample() duplicates minority-class rows until both classes are equally frequent
up_risk <- upSample(x = risk %>% select(-bad_credit),
                    y = risk$bad_credit,
                    yname = "bad_credit")

up_risk$bad_credit %>% table() %>% barplot()

Cool! It’s balanced now!
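
Keep in mind that upSample() balances the classes by duplicating minority-class rows, so the dataset grows. Comparing row counts makes that visible (an illustrative check, not part of the original pipeline):

nrow(risk)     # original row count
nrow(up_risk)  # after upsampling: twice the majority-class count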

Inspect categorical columns distributions

up_risk %>% inspect_cat() %>% show_plot()

The FLAG_MOBIL column is all 1s. I'm gonna remove it because a constant column is useless; it can't provide any information to the model.
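
A frequency count confirms the column is constant (illustrative check):

up_risk %>% count(FLAG_MOBIL)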

Remove uninformative categorical columns

# caret's nearZeroVar() returns column indices of (near-)constant predictors such as FLAG_MOBIL;
# this must be applied to up_risk, since that is the data the model will be trained on
up_risk <- up_risk %>% select(-nearZeroVar(.))

up_risk %>% inspect_cat() %>% show_plot()

Cool! We're left with only informative categorical columns. Next, let's explore the numerical columns!

Inspect numerical columns distributions

# loop over every numeric column and draw its histogram
for (col in up_risk %>% select_if(is.numeric) %>% colnames()) {
  print(
    ggplot(up_risk, aes(x = !!sym(col))) +
      geom_histogram(aes(fill = after_stat(density)), col = "white", show.legend = F) +
      labs(x = NULL,
           y = "Count",  # geom_histogram plots counts; the fill just encodes density
           title = col) +
      theme_minimal()
  )
}

Thankfully, the numerical columns all look well behaved, with no anomalies that need special treatment. We can now move on to splitting the data!
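
Eyeballing histograms is subjective, so a numeric summary like skewness makes a useful cross-check; e1071, already loaded for Naive Bayes, provides one. An optional check added here for illustration:

# skewness near 0 suggests a roughly symmetric distribution
up_risk %>%
  select_if(is.numeric) %>%
  sapply(e1071::skewness)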

Train-Test Split

Set training indices

set.seed(1)  # fix the random seed so the split is reproducible

# put 80% of the rows in the training set, stratified by bad_credit
indices <- createDataPartition(y = up_risk$bad_credit,
                               p = 0.8,
                               list = F)

Split train & test

train_data <- up_risk[indices, ]
test_data <- up_risk[-indices, ]

X_train <- train_data %>% select(-bad_credit)
y_train <- train_data$bad_credit

X_test <- test_data %>% select(-bad_credit)
y_test <- test_data$bad_credit
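
A quick sanity check that the split kept the 80/20 ratio and the 50/50 class balance (illustrative, not part of the original write-up):

nrow(train_data) / nrow(up_risk)  # should be close to 0.8
prop.table(table(y_train))        # classes should stay ~50/50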

Model Fitting, Evaluation, and Selection

Naive Bayes Algorithm

Because the dataset is large, I'll start with the Naive Bayes algorithm, since it's fast like my Bugatti Chiron🏎️⚡

# laplace = 1 adds additive smoothing so unseen category levels don't zero out a class probability
model_nb <- naiveBayes(x = X_train,
                       y = y_train,
                       laplace = 1)

Naive Bayes Model Evaluation

pred_nb <- predict(model_nb, X_test)

confusionMatrix(pred_nb, y_test, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 107092      0
##          1     49 107141
##                                           
##                Accuracy : 0.9998          
##                  95% CI : (0.9997, 0.9998)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9995          
##                                           
##  Mcnemar's Test P-Value : 7.025e-12       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9995          
##          Pos Pred Value : 0.9995          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.5002          
##       Balanced Accuracy : 0.9998          
##                                           
##        'Positive' Class : 1               
## 

The Naive Bayes model exhibited an outstanding 99.98% Accuracy, 100% Sensitivity, and 99.95% Specificity in predicting bad credit customers. That is amazing performance, and we don't necessarily need further model fitting, but I'm gonna try the Decision Tree Classifier algorithm because I believe I can get to 100% Accuracy!
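
If you ever need scores rather than hard labels, for example to tune the decision threshold, the predict() method for naiveBayes can also return class probabilities. A minimal sketch:

# posterior class probabilities instead of hard 0/1 labels
prob_nb <- predict(model_nb, X_test, type = "raw")
head(prob_nb)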

Decision Tree Classifier Algorithm

# stopping rules for the tree:
#   mincriterion = 0.9: a split needs 1 - p-value > 0.9 (i.e. p < 0.1)
#   minsplit = 5:       a node needs at least 5 observations to be considered for splitting
#   minbucket = 3:      every terminal node must keep at least 3 observations
tree <- ctree_control(mincriterion = 0.9,
                      minsplit = 5,
                      minbucket = 3)

model_dt <- ctree(formula = bad_credit ~ .,
                  data = train_data,
                  control = tree)
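
To see what the tree actually learned, partykit can plot the fitted model. With this much data the full plot can get enormous, so the compact layout is the safer first look (illustrative):

# compact rendering of the fitted tree
plot(model_dt, type = "simple")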

Decision Tree Model Evaluation

pred_dt <- predict(model_dt, X_test)

confusionMatrix(pred_dt, y_test, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 107141      0
##          1      0 107141
##                                    
##                Accuracy : 1        
##                  95% CI : (1, 1)   
##     No Information Rate : 0.5      
##     P-Value [Acc > NIR] : < 2.2e-16
##                                    
##                   Kappa : 1        
##                                    
##  Mcnemar's Test P-Value : NA       
##                                    
##             Sensitivity : 1.0      
##             Specificity : 1.0      
##          Pos Pred Value : 1.0      
##          Neg Pred Value : 1.0      
##              Prevalence : 0.5      
##          Detection Rate : 0.5      
##    Detection Prevalence : 0.5      
##       Balanced Accuracy : 1.0      
##                                    
##        'Positive' Class : 1        
## 

I was right! 100% Accuracy!

The decision tree model achieved perfect Accuracy in detecting bad credit customers, making it the winning model of this project! Congratulations to model_dt!🥳🤩
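
To reuse the winning model outside this session, it can be serialized to disk and loaded back in a scoring script. A minimal sketch; the file name is just an example:

# persist the fitted tree so it can be reloaded without retraining
saveRDS(model_dt, "model_dt.rds")

# later, in a scoring script:
model_dt <- readRDS("model_dt.rds")
predict(model_dt, newdata = X_test) %>% head()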

Conclusion

In conclusion, the final model exhibited outstanding performance in predicting bad credit customers. With a perfect Accuracy of 100%, it classified both bad credit and good credit cases without error. The Sensitivity of 100% means the model captured every single bad credit customer, and the Specificity of 100% means it never misflagged a good credit customer. One caveat to verify before production use: because the data was upsampled before the train-test split, duplicated minority-class rows can appear in both sets and inflate test metrics, so these numbers should be confirmed on a hold-out set drawn before upsampling. With that check in place, the decision tree model is well suited for real-world credit card approval!