This machine learning project centers on detecting bad credit customers for credit card approval. The goal is a robust predictive model that can effectively assess the creditworthiness of applicants. By identifying potential high-risk customers, the model aims to enhance the credit approval process, minimize financial risk, and support more informed decision-making for credit card applications.
Here are some of the business advantages this project offers:

- Enhanced Risk Management: Accurate identification of high-risk customers minimizes the potential for financial losses and defaults.
- Improved Approval Process: Streamlined credit assessment leads to quicker and more precise credit decisions, enhancing customer satisfaction.
- Reduced Manual Workload: Automated detection of bad credit customers reduces the need for manual review and speeds up the approval process.
- Enhanced Customer Relationships: Identifying and declining bad credit customers prevents overextension of credit and fosters trust among good credit customers.
- Financial Stability: Minimizing exposure to bad credit risks contributes to the overall financial health and stability of the lending institution.
- Regulatory Compliance: Accurate risk assessment ensures compliance with regulatory requirements and prevents unauthorized lending.
- Optimal Resource Allocation: Precise identification of bad credit customers allows for targeted collection efforts and resource allocation.
- Competitive Advantage: Effective risk management sets the organization apart from competitors and builds a reputation for responsible lending practices.
- Improved Profitability: Reduced bad debt and increased repayment rates lead to improved financial outcomes and higher profitability.
- Data-Driven Insights: The analysis of credit data provides valuable insights into customer behavior, aiding in refining credit policies and strategies.

In short: save the business more time, save the business more money, provide the business more insight, and make the business more money.
Now let’s jump into the coding process!
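First, the import and a duplicate check. The loading code isn’t shown here, so as context, it could look something like this (the file name is a placeholder, and reading strings as factors is my assumption based on the glimpse output below):

library(tidyverse)   # dplyr, ggplot2, etc.
library(caret)       # upSample(), confusionMatrix()

# Placeholder file name; adjust to the actual dataset
risk <- read.csv("credit_record.csv", stringsAsFactors = TRUE)

# TRUE would mean at least one fully duplicated row
risk %>% duplicated() %>% any()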
## [1] FALSE
Alhamdulillah, there are no duplicated rows!
## Rows: 537,667
## Columns: 19
## $ ID <int> 5065438, 5142753, 5111146, 5010310, 5010835, 50670…
## $ CODE_GENDER <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE <int> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ TARGET <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Some categorical columns are stored as numerical, including the target variable. We’ll handle that. I’m also gonna rename the target column to bad_credit.
# Convert the FLAG columns and the target to factors, then rename the target
risk <- risk %>%
  mutate_at(vars(starts_with("FLAG"), TARGET), as.factor) %>%
  rename(bad_credit = TARGET)
risk %>% glimpse()

## Rows: 537,667
## Columns: 19
## $ ID <int> 5065438, 5142753, 5111146, 5010310, 5010835, 50670…
## $ CODE_GENDER <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE <fct> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ bad_credit <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
The ID column is just the record identifier and shouldn’t be used for modelling, so I’m gonna remove it.
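Dropping it is a one-liner; something like:

risk <- risk %>% select(-ID)
risk %>% glimpse()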
## Rows: 537,667
## Columns: 18
## $ CODE_GENDER <fct> F, F, M, F, M, F, M, M, F, F, M, F, F, F, F, M, F,…
## $ FLAG_OWN_CAR <fct> Y, N, Y, Y, Y, Y, Y, Y, N, N, Y, N, N, N, N, Y, Y,…
## $ FLAG_OWN_REALTY <fct> N, N, Y, Y, Y, Y, N, N, Y, Y, Y, N, Y, Y, N, Y, Y,…
## $ CNT_CHILDREN <fct> 2+ children, No children, No children, 1 children,…
## $ AMT_INCOME_TOTAL <dbl> 270000, 81000, 270000, 112500, 139500, 144000, 180…
## $ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, Secondary / seconda…
## $ NAME_FAMILY_STATUS <fct> Married, Single / not married, Married, Married, M…
## $ NAME_HOUSING_TYPE <fct> With parents, House / apartment, House / apartment…
## $ DAYS_BIRTH <int> -13258, -17876, -19579, -15109, -17281, -15394, -1…
## $ DAYS_EMPLOYED <int> -2300, -377, -1028, -1956, -5578, -2959, -219, -32…
## $ FLAG_MOBIL <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_WORK_PHONE <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_PHONE <fct> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ FLAG_EMAIL <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JOB <fct> Managers, Private service staff, Laborers, Core st…
## $ BEGIN_MONTHS <int> -6, -4, 0, -3, -29, -25, -19, -18, -43, -38, -15, …
## $ STATUS <fct> C, 0, C, 0, 0, 0, X, X, 0, 0, 0, X, 0, X, 0, C, X,…
## $ bad_credit <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Cool! Now we’re ready to explore the target variable and the features.
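Starting with the class balance of the target (a sketch; the prop.table() call is my addition):

# Class counts and proportions of the target
risk$bad_credit %>% table() %>% barplot()
risk$bad_credit %>% table() %>% prop.table()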
Ah hell nah. It’s imbalanced! I will upsample the data!
# upSample() comes from caret: it duplicates minority-class rows until the classes are even
up_risk <- upSample(x = risk %>% select(-bad_credit),
                    y = risk$bad_credit,
                    yname = "bad_credit")

up_risk$bad_credit %>% table() %>% barplot()

Cool! It’s balanced now!
The FLAG_MOBIL column is all 1s, so I’m gonna remove it: a zero-variance column is useless and won’t provide any information to the model.
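Something like:

# Drop the zero-variance column
up_risk <- up_risk %>% select(-FLAG_MOBIL)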
Cool! We’re now left with informative categorical columns. Now, let’s explore the numerical columns!
for (col in up_risk %>% select_if(is.numeric) %>% colnames()) {
  print(
    ggplot(up_risk, aes(x = !!sym(col))) +
      geom_histogram(aes(fill = after_stat(density)), col = "white", show.legend = FALSE) +
      labs(x = NULL,
           y = "Count",
           title = col) +
      theme_minimal()
  )
}

Alhamdulillah, all numerical columns are distributed normally. We can now move on to cross-validation!
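The splitting code isn’t shown above, so here’s a minimal sketch, assuming a stratified 80/20 split with caret (the proportion, the seed, and the train/test names are my assumptions):

set.seed(123)  # seed value is arbitrary

# Stratified split so both classes keep the 50/50 balance
idx   <- createDataPartition(up_risk$bad_credit, p = 0.8, list = FALSE)
train <- up_risk[idx, ]
test  <- up_risk[-idx, ]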
Because I have large data, I’ll start with the Naive Bayes algorithm because it’s fast like my Bugatti Chiron🏎️⚡
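A sketch of the fit and evaluation, assuming e1071’s naiveBayes() and the train/test objects from the split above (the names model_nb and pred_nb are mine):

library(e1071)

# Fit Naive Bayes on the training set
model_nb <- naiveBayes(bad_credit ~ ., data = train)

# Predict classes on the test set and evaluate
pred_nb <- predict(model_nb, newdata = test)
confusionMatrix(pred_nb, test$bad_credit, positive = "1")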
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107092 0
## 1 49 107141
##
## Accuracy : 0.9998
## 95% CI : (0.9997, 0.9998)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9995
##
## Mcnemar's Test P-Value : 7.025e-12
##
## Sensitivity : 1.0000
## Specificity : 0.9995
## Pos Pred Value : 0.9995
## Neg Pred Value : 1.0000
## Prevalence : 0.5000
## Detection Rate : 0.5000
## Detection Prevalence : 0.5002
## Balanced Accuracy : 0.9998
##
## 'Positive' Class : 1
##
The Naive Bayes model exhibited outstanding 99.98% Accuracy, 100% Sensitivity, and 99.95% Specificity in predicting bad credit customers. This is an amazing performance. We don’t necessarily need further model fitting, but I’m gonna try the Decision Tree Classifier algorithm because I believe I could get to 100% Accuracy!
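A sketch of the tree, assuming rpart (the writeup only confirms the model object is called model_dt, not which package was used):

library(rpart)

# Fit a classification tree on the training set
model_dt <- rpart(bad_credit ~ ., data = train, method = "class")

# Predict classes on the test set and evaluate
pred_dt <- predict(model_dt, newdata = test, type = "class")
confusionMatrix(pred_dt, test$bad_credit, positive = "1")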
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107141 0
## 1 0 107141
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : 1
##
I was right! 100% Accuracy! The decision tree model achieved perfect Accuracy in detecting bad credit customers, making it the winning model of this project! Congratulations to model_dt! 🥳🤩
In conclusion, the final model exhibited outstanding performance in predicting bad credit customers. With a perfect Accuracy of 100%, it classified both bad and non-bad credit cases without error. The Sensitivity of 100% reinforces the model’s efficacy in capturing every instance of a bad credit customer, and the Specificity of 100% shows it correctly identified every non-bad credit customer as well. In brief, the decision tree model’s perfect accuracy renders it suitable for real-world applications!