Name | Matrix Number |
---|---|
TAN YANG YI | 24061644 |
SHARON LEE JOO WEI | 24063813 |
LEE RONG PHEI | 24064031 |
NURUL SARAH IZZATI BINTI ZAHID | 24064189 |
YEE SEE MARN | 23102510 |
Credit scoring plays a critical role in financial decision-making, influencing loan approvals, interest rate assignment, and risk assessment. With the increasing reliance on data-driven methodologies, machine learning has become an indispensable tool for evaluating customers’ creditworthiness and improving credit score classification. The goal of this project is to establish a robust machine learning framework using R to analyze financial data, including credit history, loan amounts, and income levels. Subsequently, the most effective model will be developed to accurately predict customers’ credit score bands (Good, Standard, and Poor) based on their financial profiles.
The framework covers the key stages of the data lifecycle, namely data cleaning, exploratory data analysis (EDA), modeling, and evaluation, using advanced tools in R. Ultimately, this project seeks to uncover actionable insights from the financial data, improve prediction accuracy on the credit score classification problem, and demonstrate how technology can empower financial institutions to make informed decisions.
How accurate are the selected machine learning models (Logistic Regression, Random Forest, and XGBoost) in classifying individuals into the credit score bands (Good, Standard, Poor) based on their financial data?
How effectively does the regression model predict an individual’s Bad Rate% based on financial attributes such as income, loan amount, and credit history, providing insights for risk assessment, scorecard thresholds, and data-driven decision-making?
To develop machine learning-based credit scoring classification systems that accurately classify individuals into appropriate credit score categories (Good, Standard, or Poor) using various classification techniques.
To further develop a regression model that estimates the customer’s Bad Rate%, enabling more detailed risk assessments. This approach allows for differentiating between various risk levels, such as a 20% Bad Rate versus an 80% Bad Rate.
Accurately classifying customers into Good, Standard, and Poor categories streamlines the customer monitoring process, allowing the bank to take prompt actions that mitigate credit risk.
The regression model plays a crucial role in developing a scorecard, which helps establish the loan approval/rejection threshold.
Interest rate optimization is achieved by segmenting customers based on varying Bad Rate percentages, enabling the bank to offer more tailored financial products and services.
A more granular assessment of the Bad Rate percentage allows the collections team to take proactive measures in recovering outstanding payments, thereby minimizing potential losses for the bank.
Identifying customers with lower Bad Rate percentages creates opportunities for cross-selling additional bank products, driving new business growth.
Dimension : 100,000 Rows & 28 Columns
Contents & Structure :
[Numeric] ID : Represents a unique identification of an entry
[Character] Name: Represents the name of a person
[Character] Age : Represents the age of the person
[Character] Occupation : Represents the occupation of the person
[Character] Annual_Income : Represents the annual income of the person
[Character] Outstanding_Debt : Represents the remaining debt to be paid (in USD)
[Numeric] Monthly_Balance : Represents the monthly balance amount of the customer (in USD)
[Numeric] Num_Bank_Accounts : Represents the number of bank accounts a person holds
[Numeric] Num_Credit_Card : Represents the number of other credit cards held by a person
[Character] Num_of_Loan : Represents the number of loans taken from the bank
[Numeric] Interest_Rate : Represents the interest rate on credit card
[Numeric] Amount_invested_monthly : Represents the monthly amount invested by the customer (in USD)
[Numeric] Num_of_Delayed_Payment : Represents the average number of payments delayed by a person
[Numeric] Num_Credit_Inquiries : Represents the number of credit card inquiries
[Character] Credit_History_Age : Represents the age of credit history of the person
[Character] Payment_Behaviour : Represents the payment behavior of the customer
[Numeric] Credit_Utilization_Ratio : Represents the utilization ratio of credit card
[Character] Credit_Score : Represents the bracket of credit score (Poor, Standard, Good)
## # A tibble: 28 × 8
## Variable Class Min `1st Qu.` Median Mean `3rd Qu.` Max
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ID numeric 5634 43132.75 80631… 8063… 118130.25 1556…
## 2 Customer_ID character <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Month character <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Name character <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Age character <NA> <NA> <NA> <NA> <NA> <NA>
## 6 SSN character <NA> <NA> <NA> <NA> <NA> <NA>
## 7 Occupation character <NA> <NA> <NA> <NA> <NA> <NA>
## 8 Annual_Income character <NA> <NA> <NA> <NA> <NA> <NA>
## 9 Monthly_Inhand_Salary numeric 303.6… 1625.568… 3093.… 4194… 5957.448… 1520…
## 10 Num_Bank_Accounts numeric -1 3 6 17.0… 7 1798
## # ℹ 18 more rows
The summary table reveals that some numerical columns are incorrectly classified as "character" data type. Additionally, outliers and erroneous values, such as -1 and 1798 in the Num_Bank_Accounts column, have been identified. These issues necessitate a data type conversion and cleaning in the next phase to ensure the data is properly formatted, readable and accurate.
## ID Customer_ID Month Name Age SSN Occupation
## 1 5634 CUS_0xd40 January Aaron Maashoh 23 821-00-0265 Scientist
## 2 5635 CUS_0xd40 February Aaron Maashoh 23 821-00-0265 Scientist
## 3 5636 CUS_0xd40 March Aaron Maashoh -500 821-00-0265 Scientist
## 4 5637 CUS_0xd40 April Aaron Maashoh 23 821-00-0265 Scientist
## 5 5638 CUS_0xd40 May Aaron Maashoh 23 821-00-0265 Scientist
## 6 5639 CUS_0xd40 June Aaron Maashoh 23 821-00-0265 Scientist
## Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts Num_Credit_Card
## 1 19114.12 1824.843 3 4
## 2 19114.12 NA 3 4
## 3 19114.12 NA 3 4
## 4 19114.12 NA 3 4
## 5 19114.12 1824.843 3 4
## 6 19114.12 NA 3 4
## Interest_Rate Num_of_Loan
## 1 3 4
## 2 3 4
## 3 3 4
## 4 3 4
## 5 3 4
## 6 3 4
## Type_of_Loan
## 1 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 2 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 3 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 4 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 5 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 6 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## Delay_from_due_date Num_of_Delayed_Payment Changed_Credit_Limit
## 1 3 7 11.27
## 2 -1 11.27
## 3 3 7 _
## 4 5 4 6.27
## 5 6 11.27
## 6 8 4 9.27
## Num_Credit_Inquiries Credit_Mix Outstanding_Debt Credit_Utilization_Ratio
## 1 4 _ 809.98 26.82262
## 2 4 Good 809.98 31.94496
## 3 4 Good 809.98 28.60935
## 4 4 Good 809.98 31.37786
## 5 4 Good 809.98 24.79735
## 6 4 Good 809.98 27.26226
## Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month
## 1 22 Years and 1 Months No 49.57495
## 2 <NA> No 49.57495
## 3 22 Years and 3 Months No 49.57495
## 4 22 Years and 4 Months No 49.57495
## 5 22 Years and 5 Months No 49.57495
## 6 22 Years and 6 Months No 49.57495
## Amount_invested_monthly Payment_Behaviour Monthly_Balance
## 1 80.41529543900253 High_spent_Small_value_payments 312.49408867943663
## 2 118.28022162236736 Low_spent_Large_value_payments 284.62916249607184
## 3 81.699521264648 Low_spent_Medium_value_payments 331.2098628537912
## 4 199.4580743910713 Low_spent_Small_value_payments 223.45130972736786
## 5 41.420153086217326 High_spent_Medium_value_payments 341.48923103222177
## 6 62.430172331195294 !@9#%8 340.4792117872438
## Credit_Score
## 1 Good
## 2 Good
## 3 Good
## 4 Good
## 5 Good
## 6 Good
From the data summarized above, the first layer of cleaning has been performed on the variables that were expected to be numeric but were recorded as character. The cleaning process involves removing underscores from these variables, converting them to numeric format, and rounding the numeric values to two decimal places.
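A minimal sketch of this first cleaning layer in base R, assuming the affected columns were read in as character vectors. The helper name `clean_numeric` and the toy values are illustrative only and are not taken from the project's code.

```r
# Hypothetical helper: strip underscores, convert to numeric, round to 2 d.p.
clean_numeric <- function(x) {
  x <- gsub("_", "", x, fixed = TRUE)   # "28_" -> "28", "_" -> ""
  x[!is.na(x) & x == ""] <- NA          # a bare "_" carries no value
  round(as.numeric(x), 2)
}

# Toy rows that imitate the raw character columns (values are made up)
raw <- data.frame(
  Age = c("23", "-500", "28_"),
  Annual_Income = c("19114.12", "34847.84_", "_"),
  stringsAsFactors = FALSE
)

cols <- c("Age", "Annual_Income")
raw[cols] <- lapply(raw[cols], clean_numeric)
str(raw)
```

Erroneous values such as -500 survive this step on purpose; only the data type changes here.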
## # A tibble: 28 × 8
## Variable Class Min `1st Qu.` Median Mean `3rd Qu.` Max
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ID numeric 5634 43132.75 80631… 8063… 118130.25 1556…
## 2 Customer_ID character <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Month character <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Name character <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Age numeric -500 24 33 110.… 42 8698
## 6 SSN character <NA> <NA> <NA> <NA> <NA> <NA>
## 7 Occupation character <NA> <NA> <NA> <NA> <NA> <NA>
## 8 Annual_Income numeric 7005.… 19457.5 37578… 1764… 72790.92 2419…
## 9 Monthly_Inhand_Salary numeric 303.65 1625.57 3093.… 4194… 5957.45 1520…
## 10 Num_Bank_Accounts numeric -1 3 6 17.0… 7 1798
## # ℹ 18 more rows
After converting the data types, the numeric variables now appear in the "Numeric" format. However, the previously mentioned erroneous and outlier values still persist and need to be addressed.
Function to compute MODE
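Base R has no built-in function for the statistical mode, so a helper along the following lines is one possibility. The name `get_mode` and the tie-breaking rule (the first value encountered wins) are assumptions made for this sketch.

```r
# Hypothetical helper: most frequent non-missing value (first one wins on ties)
get_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy vector containing two implausible loan counts
get_mode(c(4, 4, -100, 4, 1496))
```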
Function to handle missing or outlier values using backward or forward filling:
1. Missing values and outliers are first replaced with 0, which serves as a placeholder for scanning.
2. Backward Filling: Fills the current value with the most recent available value from the previous months.
3. Forward Filling: Fills the current value with the most recent available value from the upcoming months.
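The steps above can be sketched as follows for one customer's values in month order. This assumes the placeholder has already been written into the vector and that no `NA` remains. The function name `fill_from_neighbours` is hypothetical, and the terms "backward" and "forward" follow the definitions given in the list.

```r
# Hypothetical helper. `x` holds one customer's values in month order;
# `placeholder` marks entries that were missing or flagged as outliers.
fill_from_neighbours <- function(x, placeholder = 0) {
  n <- length(x)
  # Backward filling (as defined above): take the value from the previous month
  for (i in seq_len(n)) {
    if (i > 1 && x[i] == placeholder && x[i - 1] != placeholder) {
      x[i] <- x[i - 1]
    }
  }
  # Forward filling (as defined above): take the value from the upcoming month
  for (i in rev(seq_len(n))) {
    if (i < n && x[i] == placeholder && x[i + 1] != placeholder) {
      x[i] <- x[i + 1]
    }
  }
  x
}

age <- c(0, 23, 0, 0, 23, 24)   # 0 = placeholder for a missing/outlier entry
fill_from_neighbours(age)
```

In practice the function would be applied per `Customer_ID`, so that values are never borrowed from a different customer.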
Function to handle missing or outlier values using backward or forward filling:
1. Missing values and outliers are first replaced with 99, which serves as a placeholder for scanning.
2. Backward Filling: Fills the current value with the most recent available value from the previous months.
3. Forward Filling: Fills the current value with the most recent available value from the upcoming months.
Function to handle missing or outlier values using backward or forward filling:
1. Missing values and outliers are first replaced with 0, which serves as a placeholder for scanning.
2. Backward Filling: Finds the current value by using the most recent available value from the previous months.
3. Forward Filling: Finds the current value by using the most recent available value from the upcoming months.
4. The filled value is then adjusted by adding or subtracting 1, depending on whether backward or forward filling is applied.
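One possible reading of this variant is a column that increases by 1 each month, such as a credit history length counted in months. Under that assumption, a value borrowed from the previous month gets 1 added and a value borrowed from the next month gets 1 subtracted. The function name `fill_with_step` and the toy values are hypothetical.

```r
# Hypothetical helper for a column assumed to grow by 1 each month.
fill_with_step <- function(x, placeholder = 0) {
  n <- length(x)
  # Value taken from the previous month: add 1
  for (i in seq_len(n)) {
    if (i > 1 && x[i] == placeholder && x[i - 1] != placeholder) {
      x[i] <- x[i - 1] + 1
    }
  }
  # Value taken from the upcoming month: subtract 1
  for (i in rev(seq_len(n))) {
    if (i < n && x[i] == placeholder && x[i + 1] != placeholder) {
      x[i] <- x[i + 1] - 1
    }
  }
  x
}

history_months <- c(0, 266, 0, 268, 269)   # 0 = placeholder
fill_with_step(history_months)
```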
Function to detect outliers using the IQR method:
This method is used when outliers cannot be identified from the frequency table.
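A minimal sketch of an IQR-based check, assuming the conventional 1.5 × IQR fences; the multiplier actually used in the project is not stated. The helper name `is_iqr_outlier` and the toy vector are illustrative.

```r
# Hypothetical helper: flag values outside [Q1 - k * IQR, Q3 + k * IQR]
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  !is.na(x) & (x < q[1] - k * iqr | x > q[2] + k * iqr)
}

# Toy income vector with one extreme value
income <- c(19114, 34848, 37579, 72791, 143163, 24198062)
income[is_iqr_outlier(income)]
```

Flagged entries can then be overwritten with the placeholder and handed to the filling functions, or replaced with the mode, depending on the variable.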
Type_of_Loan & Num_of_Loan
## Type_of_Loan
## 1 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 2 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 3 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## Num_of_Loan
## 1 4
## 2 4
## 3 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.00 1.00 3.00 3.01 5.00 1496.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.533 5.000 9.000
Use mode to replace the outlier with the most frequently occurring value in the dataset
Occupation
Annual Income
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7006 19458 37579 176416 72791 24198062
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7006 19343 37000 50505 71683 179987
Interest Rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 13.00 72.47 20.00 5797.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 13.00 14.53 20.00 34.00
Credit_Mix
Payment_Behaviour
Total_EMI_per_month
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 30.31 69.25 1403.12 161.22 82331.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.23 566.07 1166.15 1426.22 1945.96 4998.07
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 28.42 65.01 98.12 141.51 1701.96
Amount_invested_monthly
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 74.53 135.93 637.41 265.73 10000.00 4479
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 74.6 131.2 195.8 239.5 1977.3
Monthly_Balance
## Min. 1st Qu.
## -333333333333333314868224222 270
## Median Mean
## 337 -30364372469635625254402
## 3rd Qu. Max.
## 470 1602
## NA's
## 1200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 270.19 337.12 403.12 471.57 1602.04
Use backward or forward filling to replace a missing or outlier value with the available value from the previous or next month
Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -500.0 24.0 33.0 110.7 42.0 8698.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 24.00 33.00 33.31 42.00 56.00
Monthly_Inhand_Salary
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 303.6 1625.6 3093.8 4194.2 5957.4 15204.6 15002
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 303.6 1626.8 3096.4 4198.8 5961.7 15204.6
Num_Bank_Accounts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00 3.00 6.00 17.09 7.00 1798.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 5.000 5.369 7.000 11.000
Num_Credit_Card
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 5.00 22.47 7.00 1499.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 5.000 5.534 7.000 11.000
Delay_from_due_date
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.00 10.00 18.00 21.07 28.00 67.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 18.00 21.08 28.00 67.00
Num_of_Delayed_Payment
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3.00 9.00 14.00 30.92 18.00 4397.00 7002
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 14.00 13.21 18.00 28.00
Changed_Credit_Limit
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -6.49 5.32 9.40 10.39 14.87 36.97 2091
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.49 5.32 9.40 10.39 14.86 36.97
Num_Credit_Inquiries
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 27.75 9.00 2597.00 1965
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 5.000 5.773 8.000 17.000
Credit_History_Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 144.0 219.0 221.2 302.0 404.0
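The range above (1 to 404) is consistent with Credit_History_Age having been converted from text such as "22 Years and 1 Months" into a number of months. The conversion code is not shown, so the following is only a sketch of how such a conversion could be done; `history_to_months` is a hypothetical name.

```r
# Hypothetical helper: "22 Years and 1 Months" -> 22 * 12 + 1 = 265
history_to_months <- function(x) {
  pattern <- "^([0-9]+) Years and ([0-9]+) Months$"
  years  <- as.numeric(sub(pattern, "\\1", x))
  months <- as.numeric(sub(pattern, "\\2", x))
  years * 12 + months
}

history_to_months(c("22 Years and 1 Months", NA, "22 Years and 3 Months"))
```

Missing entries stay `NA` and can afterwards be filled from the neighbouring months.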
Payment_of_Min_Amount
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5943 1.0000 1.0000
## # A tibble: 33 × 8
## Variable Class Min `1st Qu.` Median Mean `3rd Qu.` Max
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Customer_ID character <NA> <NA> <NA> <NA> <NA> <NA>
## 2 Month character <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Age numeric 14 24 33 33.3… 42 56
## 4 Occupation character <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Annual_Income numeric 7005.… 19342.97… 36999… 5050… 71683.47 1799…
## 6 Monthly_Inhand_Salary numeric 303.65 1626.76 3096.… 4198… 5961.74 1520…
## 7 Num_Bank_Accounts numeric 0 3 5 5.36… 7 11
## 8 Num_Credit_Card numeric 0 4 5 5.53… 7 11
## 9 Interest_Rate numeric 1 7 13 14.5… 20 34
## 10 Delay_from_due_date numeric 0 10 18 21.0… 28 67
## # ℹ 23 more rows
Exploratory Data Analysis (EDA) is a crucial step in examining and summarizing datasets to reveal patterns, trends, anomalies, and relationships. It uses both visual and quantitative methods to provide a thorough understanding of the data before moving on to more sophisticated modeling or analysis.
For our credit score classification project, three primary analyses were carried out to explore the financial data, study the distribution of variables, and investigate the underlying relationships among them.
- Univariate Analysis
- Bivariate Analysis
- Correlation Analysis
Annual_Income & Monthly_Inhand_Salary & Monthly_Balance & Amount_invested_monthly:
A right-skewed distribution is observed, with a higher density of data points at lower income levels: most individuals earn lower incomes, and only a small portion earn significantly higher incomes. This pattern is also observed in Monthly_Balance and Amount_invested_monthly, as individuals with lower incomes tend to have smaller savings or investments, while only a few individuals with high incomes have significantly larger balances and investments.
Occupation:
Relatively even across the various categories.
Credit_Mix and Credit_Score:
The majority of the population falls into the "Standard" credit mix category, followed by "Good" and "Bad," which is consistent with the distribution observed in the credit score variable, suggesting a potential positive correlation between these two variables.
Payment_Behaviour:
The most common behaviour is "Low spent, Small value payments", indicating a cautious approach to spending.
The majority of customers have a "Standard" credit score, followed by "Poor" and "Good," indicating class imbalance, where the "Standard" category is overrepresented, while the "Poor" and "Good" categories are underrepresented.
Observation:
While most variables display clear differentiation in boxplots across the three classes, certain variables such as Changed_Credit_Limit, Credit_Utilization_Ratio, one-hot-encoded Loan types (e.g., CreditBuilderLoan, HomeEquityLoan, etc.), and Total_EMI_per_month struggle to show distinct differences. This indicates that these particular variables may have limited predictive power in distinguishing between the classes.
Observation:
Occupation x Credit_Score:
No clear relationship has been observed between Occupation and Credit_Score, indicating that occupation may not be a significant predictor of credit score. Hence, 'Occupation' will be dropped from further consideration.
Credit_Mix x Credit_Score:
As already noted in Section 4.1.2, these two variables again show a positive correlation, reinforcing their role as predictive variables in credit score classification.
The correlation heatmap shows the relationships among the features and with the credit score. The correlation coefficient ranges between -1 and +1. The closer the value is to +1, the stronger the positive relationship. The closer the value is to -1, the stronger the negative relationship. Values near zero indicate a weak or no relationship.
Observation:
Independent Variables with High Inter-collinearity (> 0.70)
1. Monthly_Balance <-> Annual_Income <-> Monthly_Inhand_Salary:
These variables are highly correlated with each other, indicating that these variables provide overlapping information about the financial standing of the individuals and could lead to multicollinearity issues.
2. Credit_Mix <-> Num_Bank_Accounts <-> Interest_Rate <-> Num_of_Delayed_Payment <-> Min_Amount_Pymt_Ind:
These variables are highly correlated with Credit_Mix (correlation > 0.7) but exhibit lower correlations with each other (correlation < 0.7). This suggests that Credit_Mix serves as a comprehensive measure encompassing various aspects of financial behavior and creditworthiness. However, the high correlation between Credit_Mix and these variables indicates multicollinearity, which may complicate the interpretation of their individual contributions in a predictive model.
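A short sketch of how pairs above the 0.70 threshold could be listed from a correlation matrix. It assumes all columns are numeric and uses simulated data, so the output does not reproduce the project's heatmap. The helper name `high_cor_pairs` is hypothetical.

```r
# Hypothetical helper: list variable pairs whose |correlation| exceeds a cutoff
high_cor_pairs <- function(df, cutoff = 0.70) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = round(cm[idx], 2)
  )
}

# Simulated data: annual income is built from monthly salary plus noise
set.seed(1)
salary <- runif(200, 300, 15000)
toy <- data.frame(
  Monthly_Inhand_Salary    = salary,
  Annual_Income            = salary * 12 + rnorm(200, sd = 2000),
  Credit_Utilization_Ratio = runif(200, 20, 50)
)
high_cor_pairs(toy)
```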
Observations:
1. Variables with the Strongest POSITIVE Correlation with Credit Score (the Higher the value, the Worse the Credit Score)
Credit_Mix (0.50): A diverse mix of credit types helps to mitigate risk by ensuring that an individual isn't overly reliant on a single type of credit. This diversification can indicate that the individual is managing multiple forms of credit responsibly, reducing the overall risk.
Interest_Rate (0.49): Higher interest rates are often charged to individuals who are considered higher risk. This reflects the increased likelihood of credit loss, as lenders compensate for the higher risk by charging more.
Payment_of_Min_Amount (0.44): Individuals who only pay the minimum amount due and allow the outstanding balance to roll over are often experiencing financial strain. This behavior can signal that they are struggling to manage their finances effectively.
2. Variables with the Strongest NEGATIVE Correlation with Credit Score (the Lower the value, the Worse the Credit Score)
Credit_History_Months (-0.39): Longer credit history suggests that the individual has experience in managing credit over time. It indicates maturity and an ability to maintain financial health over the long term without being charged off or blacklisted by financial institutions.
Monthly_Balance (-0.21): A lower remaining balance at the end of each month indicates financial difficulties. It suggests that the individual has little room for any unexpected expenses or financial setbacks, which could lead to further financial trouble.
Annual_Income (-0.21): Higher annual income provides individuals with greater purchasing power and the ability to sustain themselves through financial downturns. It often correlates with better credit scores as higher income individuals are more capable of meeting their financial obligations.
3. Variables with Insignificant Correlation with Credit Score (close to ZERO)
This suggests that other factors may be more critical in determining an individual's credit score, and these variables may not be useful predictors in credit score modeling.
- Occupation (0.03)
- Credit_Utilization_Ratio (-0.05)
- Total_EMI_per_month (0.07)
For the classification model predicting Credit Score (Poor, Standard, or Good), 3 types of models will be explored, and their performance will be compared to recommend one for adoption:
Model Type 1: Multinomial Logistic Regression (MLR)
Model Type 2: Random Forest (RF)
Model Type 3: XGBoost (XG)
Data will be scaled prior to modelling due to:
Because class imbalance was observed, 2 class-balancing techniques have been attempted: over-sampling and under-sampling.
The data were partitioned into 70% (train sample) and 30% (test sample) to provide a hold-out sample for model testing, and all 3 model types will be fitted on 3 separate data sets as follows:
## [1] "There is no character column, proceed."
## [1] "There is no NA Value in train set, proceed."
## [1] "There is no NA Value in test set, proceed."
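The partitioning code itself is not shown. As an illustration only, a stratified 70/30 split that preserves the class proportions can be written in base R as below. The function name `stratified_split`, the seed, and the toy labels are assumptions.

```r
# Hypothetical helper: sample a share `p` of the row indices within each class
stratified_split <- function(y, p = 0.70, seed = 123) {
  set.seed(seed)
  train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }))
  sort(unname(train_idx))
}

# Toy labels with roughly the class shares seen in this data set
toy_y <- factor(rep(c("Good", "Standard", "Poor"), times = c(18, 53, 29)))
train_idx <- stratified_split(toy_y)

table(toy_y[train_idx])    # train sample
table(toy_y[-train_idx])   # hold-out test sample
```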
Original
##
## Good Standard Poor
## 12480 37222 20299
Over-Sampling
##
## Good Standard Poor
## 37222 37222 37222
Under-Sampling
##
## Good Standard Poor
## 12480 12480 12480
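The counts above show every class raised to the majority size under over-sampling and cut to the minority size under under-sampling. A base R sketch of that idea follows. It is not necessarily the routine used in the project, and `balance_classes` is a hypothetical name.

```r
# Hypothetical helper: resample every class to the same size.
# "up"   keeps all rows and adds random duplicates up to the largest class;
# "down" draws a random subset of each class down to the smallest class.
balance_classes <- function(df, target, method = c("up", "down"), seed = 123) {
  method <- match.arg(method)
  set.seed(seed)
  groups <- split(seq_len(nrow(df)), df[[target]])
  sizes <- lengths(groups)
  n <- if (method == "up") max(sizes) else min(sizes)
  idx <- unlist(lapply(groups, function(g) {
    if (method == "up") {
      c(g, g[sample.int(length(g), n - length(g), replace = TRUE)])
    } else {
      g[sample.int(length(g), n)]
    }
  }))
  df[unname(idx), , drop = FALSE]
}

toy <- data.frame(
  x = 1:10,
  Credit_Score = rep(c("Good", "Standard", "Poor"), times = c(2, 5, 3))
)
table(balance_classes(toy, "Credit_Score", "up")$Credit_Score)    # 5 per class
table(balance_classes(toy, "Credit_Score", "down")$Credit_Score)  # 2 per class
```

Only the training rows should be passed to such a function; the test sample keeps its original distribution.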
Model Type 1: Multinomial Logistic Regression
Model Type 2: Random Forest
Model Type 3: XGBoost
Train data sets 1 to 3 are used for evaluation across the 3 models:
1) Scaled Data (original train_new): MLR, RF, XGBoost
2) Scaled Data + Oversampling (up_train): MLR, RF, XGBoost
3) Scaled Data + Undersampling (down_train): MLR, RF, XGBoost
For the test sample, however, evaluation is performed only on the original data “test_new”, so that all models attempted are evaluated without bias. Altering the test data via over- or under-sampling could lead to biased evaluation and overly optimistic performance metrics, as the test set would no longer reflect the original class distribution.
Confusion Matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 7217 5091 172
## 1 4143 28230 4849
## 2 1247 8558 10494
##
## Overall Statistics
##
## Accuracy : 0.6563
## 95% CI : (0.6528, 0.6598)
## No Information Rate : 0.5983
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.413
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5725 0.6741 0.6764
## Specificity 0.9083 0.6803 0.8200
## Pos Pred Value 0.5783 0.7584 0.5170
## Neg Pred Value 0.9063 0.5836 0.8990
## Prevalence 0.1801 0.5983 0.2216
## Detection Rate 0.1031 0.4033 0.1499
## Detection Prevalence 0.1783 0.5317 0.2900
## Balanced Accuracy 0.7404 0.6772 0.7482
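As a check on how the headline statistics are derived, the counts from the matrix above can be re-entered by hand. The sketch assumes the layout printed in the output, with rows as predictions and columns as reference labels.

```r
# Counts copied from the confusion matrix above (rows = Prediction, cols = Reference)
cm <- matrix(
  c(7217,  5091,   172,
    4143, 28230,  4849,
    1247,  8558, 10494),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("0", "1", "2"), Reference = c("0", "1", "2"))
)

accuracy    <- sum(diag(cm)) / sum(cm)      # share of cases on the diagonal
sensitivity <- diag(cm) / colSums(cm)       # per class: correct / column total
nir         <- max(colSums(cm)) / sum(cm)   # always guessing the largest reference class

round(accuracy, 4)     # 0.6563
round(sensitivity, 4)  # 0.5725 0.6741 0.6764
round(nir, 4)          # 0.5983
```

The row totals of this matrix (12480, 37222, 20299) equal the training class counts listed under "Original", so it may be worth confirming that the predicted and reference labels were passed in the intended order.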
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 3106 2182 60
## 1 1763 12120 2069
## 2 551 3697 4451
##
## Overall Statistics
##
## Accuracy : 0.6559
## 95% CI : (0.6505, 0.6613)
## No Information Rate : 0.6
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.412
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5731 0.6734 0.6764
## Specificity 0.9088 0.6807 0.8186
## Pos Pred Value 0.5808 0.7598 0.5117
## Neg Pred Value 0.9061 0.5815 0.9000
## Prevalence 0.1807 0.6000 0.2193
## Detection Rate 0.1035 0.4040 0.1484
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7409 0.6770 0.7475
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 31023 5168 1031
## 1 7802 21397 8023
## 2 6240 5588 25394
##
## Overall Statistics
##
## Accuracy : 0.6968
## 95% CI : (0.6941, 0.6995)
## No Information Rate : 0.4036
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5453
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.6884 0.6655 0.7372
## Specificity 0.9069 0.8010 0.8468
## Pos Pred Value 0.8335 0.5748 0.6822
## Neg Pred Value 0.8114 0.8555 0.8784
## Prevalence 0.4036 0.2879 0.3085
## Detection Rate 0.2778 0.1916 0.2274
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.7977 0.7332 0.7920
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4520 701 127
## 1 3417 9111 3424
## 2 1432 1299 5968
##
## Overall Statistics
##
## Accuracy : 0.6533
## 95% CI : (0.6479, 0.6587)
## No Information Rate : 0.3704
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.471
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.4824 0.8200 0.6270
## Specificity 0.9599 0.6378 0.8667
## Pos Pred Value 0.8452 0.5712 0.6861
## Neg Pred Value 0.8033 0.8576 0.8333
## Prevalence 0.3123 0.3704 0.3173
## Detection Rate 0.1507 0.3037 0.1989
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7212 0.7289 0.7468
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 10389 1760 331
## 1 2563 7224 2693
## 2 2076 1864 8540
##
## Overall Statistics
##
## Accuracy : 0.6985
## 95% CI : (0.6939, 0.7032)
## No Information Rate : 0.4014
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5478
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.6913 0.6659 0.7385
## Specificity 0.9067 0.8023 0.8477
## Pos Pred Value 0.8325 0.5788 0.6843
## Neg Pred Value 0.8141 0.8548 0.8788
## Prevalence 0.4014 0.2897 0.3089
## Detection Rate 0.2775 0.1929 0.2281
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.7990 0.7341 0.7931
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4522 705 121
## 1 3411 9117 3424
## 2 1431 1310 5958
##
## Overall Statistics
##
## Accuracy : 0.6533
## 95% CI : (0.6478, 0.6586)
## No Information Rate : 0.3711
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4708
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.4829 0.8190 0.6270
## Specificity 0.9600 0.6377 0.8663
## Pos Pred Value 0.8455 0.5715 0.6849
## Neg Pred Value 0.8036 0.8566 0.8336
## Prevalence 0.3121 0.3711 0.3168
## Detection Rate 0.1507 0.3039 0.1986
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7214 0.7284 0.7466
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 12448 32 0
## 1 45 37092 85
## 2 0 54 20245
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9965, 0.9973)
## No Information Rate : 0.5311
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9949
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9964 0.9977 0.9958
## Specificity 0.9994 0.9960 0.9989
## Pos Pred Value 0.9974 0.9965 0.9973
## Neg Pred Value 0.9992 0.9974 0.9983
## Prevalence 0.1785 0.5311 0.2904
## Detection Rate 0.1778 0.5299 0.2892
## Detection Prevalence 0.1783 0.5317 0.2900
## Balanced Accuracy 0.9979 0.9969 0.9974
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4066 1245 37
## 1 1160 12957 1835
## 2 69 1585 7045
##
## Overall Statistics
##
## Accuracy : 0.8023
## 95% CI : (0.7977, 0.8068)
## No Information Rate : 0.5263
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6719
##
## Mcnemar's Test P-Value : 0.0000008754
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.7679 0.8207 0.7901
## Specificity 0.9481 0.7893 0.9215
## Pos Pred Value 0.7603 0.8122 0.8099
## Neg Pred Value 0.9501 0.7985 0.9121
## Prevalence 0.1765 0.5263 0.2972
## Detection Rate 0.1355 0.4319 0.2348
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.8580 0.8050 0.8558
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 37217 5 0
## 1 155 36904 163
## 2 0 16 37206
##
## Overall Statistics
##
## Accuracy : 0.997
## 95% CI : (0.9966, 0.9973)
## No Information Rate : 0.3347
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9954
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9959 0.9994 0.9956
## Specificity 0.9999 0.9957 0.9998
## Pos Pred Value 0.9999 0.9915 0.9996
## Neg Pred Value 0.9979 0.9997 0.9978
## Prevalence 0.3347 0.3307 0.3346
## Detection Rate 0.3333 0.3305 0.3332
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.9979 0.9976 0.9977
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4398 933 17
## 1 1474 12451 2027
## 2 75 1223 7401
##
## Overall Statistics
##
## Accuracy : 0.8084
## 95% CI : (0.8039, 0.8128)
## No Information Rate : 0.4869
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6881
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.7395 0.8524 0.7836
## Specificity 0.9605 0.7725 0.9368
## Pos Pred Value 0.8224 0.7805 0.8508
## Neg Pred Value 0.9372 0.8465 0.9040
## Prevalence 0.1982 0.4869 0.3148
## Detection Rate 0.1466 0.4150 0.2467
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.8500 0.8125 0.8602
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 12477 3 0
## 1 43 12408 29
## 2 0 13 12467
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.9971, 0.9981)
## No Information Rate : 0.3344
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9965
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9966 0.9987 0.9977
## Specificity 0.9999 0.9971 0.9995
## Pos Pred Value 0.9998 0.9942 0.9990
## Neg Pred Value 0.9983 0.9994 0.9988
## Prevalence 0.3344 0.3318 0.3338
## Detection Rate 0.3333 0.3314 0.3330
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.9982 0.9979 0.9986
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4720 586 42
## 1 2674 10753 2525
## 2 384 961 7354
##
## Overall Statistics
##
## Accuracy : 0.7609
## 95% CI : (0.7561, 0.7657)
## No Information Rate : 0.41
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6264
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.6068 0.8742 0.7413
## Specificity 0.9717 0.7063 0.9330
## Pos Pred Value 0.8826 0.6741 0.8454
## Neg Pred Value 0.8759 0.8899 0.8795
## Prevalence 0.2593 0.4100 0.3307
## Detection Rate 0.1573 0.3584 0.2451
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7893 0.7902 0.8371
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 9019 3225 236
## 1 3946 29124 4152
## 2 1283 4820 14196
##
## Overall Statistics
##
## Accuracy : 0.7477
## 95% CI : (0.7445, 0.7509)
## No Information Rate : 0.531
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5825
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.6330 0.7836 0.7639
## Specificity 0.9379 0.7534 0.8813
## Pos Pred Value 0.7227 0.7824 0.6993
## Neg Pred Value 0.9091 0.7546 0.9117
## Prevalence 0.2035 0.5310 0.2655
## Detection Rate 0.1288 0.4161 0.2028
## Detection Prevalence 0.1783 0.5317 0.2900
## Balanced Accuracy 0.7855 0.7685 0.8226
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 3748 1497 103
## 1 1956 12127 1869
## 2 602 2274 5823
##
## Overall Statistics
##
## Accuracy : 0.7233
## 95% CI : (0.7182, 0.7283)
## No Information Rate : 0.53
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5429
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5944 0.7628 0.7470
## Specificity 0.9325 0.7287 0.8705
## Pos Pred Value 0.7008 0.7602 0.6694
## Neg Pred Value 0.8962 0.7315 0.9074
## Prevalence 0.2102 0.5300 0.2598
## Detection Rate 0.1249 0.4042 0.1941
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7634 0.7458 0.8087
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 31962 4594 666
## 1 6965 23608 6649
## 2 3959 3425 29838
##
## Overall Statistics
##
## Accuracy : 0.7649
## 95% CI : (0.7624, 0.7673)
## No Information Rate : 0.3841
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6473
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.7453 0.7465 0.8031
## Specificity 0.9235 0.8299 0.9009
## Pos Pred Value 0.8587 0.6342 0.8016
## Neg Pred Value 0.8533 0.8923 0.9017
## Prevalence 0.3841 0.2832 0.3327
## Detection Rate 0.2862 0.2114 0.2672
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.8344 0.7882 0.8520
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4511 724 113
## 1 3084 9864 3004
## 2 1006 914 6779
##
## Overall Statistics
##
## Accuracy : 0.7052
## 95% CI : (0.7, 0.7103)
## No Information Rate : 0.3834
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5459
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5245 0.8576 0.6850
## Specificity 0.9609 0.6709 0.9045
## Pos Pred Value 0.8435 0.6184 0.7793
## Neg Pred Value 0.8341 0.8834 0.8537
## Prevalence 0.2867 0.3834 0.3299
## Detection Rate 0.1504 0.3288 0.2260
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7427 0.7642 0.7948
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 10686 1576 218
## 1 2230 7954 2296
## 2 1365 1059 10056
##
## Overall Statistics
##
## Accuracy : 0.7665
## 95% CI : (0.7621, 0.7707)
## No Information Rate : 0.3814
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6497
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.7483 0.7512 0.8000
## Specificity 0.9225 0.8314 0.9025
## Pos Pred Value 0.8562 0.6373 0.8058
## Neg Pred Value 0.8560 0.8944 0.8993
## Prevalence 0.3814 0.2828 0.3357
## Detection Rate 0.2854 0.2124 0.2686
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.8354 0.7913 0.8513
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 4534 702 112
## 1 3172 9597 3183
## 2 1038 873 6788
##
## Overall Statistics
##
## Accuracy : 0.6973
## 95% CI : (0.6921, 0.7025)
## No Information Rate : 0.3724
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5362
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5185 0.8590 0.6732
## Specificity 0.9617 0.6625 0.9040
## Pos Pred Value 0.8478 0.6016 0.7803
## Neg Pred Value 0.8292 0.8879 0.8453
## Prevalence 0.2915 0.3724 0.3361
## Detection Rate 0.1511 0.3199 0.2263
## Detection Prevalence 0.1783 0.5318 0.2900
## Balanced Accuracy 0.7401 0.7607 0.7886
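The overall and per-class statistics in each block above can be recomputed from the confusion matrix alone. As a minimal sketch in base R, the snippet below takes the counts from the block that reports an accuracy of 0.7609 and derives accuracy, Cohen's kappa, and per-class sensitivity. The variable names are illustrative and are not taken from the original analysis code.

```r
# Counts copied from the block with Accuracy 0.7609
# (rows = predicted class, columns = reference class).
cm <- matrix(c(4720,   586,   42,
               2674, 10753, 2525,
                384,   961, 7354),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

n <- sum(cm)

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the marginals alone would give.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions over reference totals.
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
round(sensitivity, 4)
```

Rounded to four decimals, these values should agree with the printed Accuracy (0.7609), Kappa (0.6264), and Sensitivity row (0.6068, 0.8742, 0.7413) of that block.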
Model Performance Interpretation:
Random Forest (RF)
XGBoost (XG)
Multinomial Logistic Regression (MLR)
Comparison between Class Balancing methods
Credit scoring requires stable and reliable predictions, and XGBoost’s consistent performance makes it a strong candidate for production deployment. It is less likely to experience unexpected performance drops when applied to new customer data. Therefore, the XGBoost model trained on the original dataset without class balancing is selected as the final model, as class balancing does not provide significant benefits in this case.
For the regression model, the development of a credit scorecard is demonstrated. Because the main purpose of a credit scorecard is the loan approval process, where the decision is either “Approve” or “Reject”, a binary target was created: Credit Score = Poor (=1) vs. Standard or Good (=0).
Benefits:
Simplicity and Clarity: A binary classification model simplifies the process by categorizing individuals into two distinct groups: high-risk (Poor) and lower-risk (Standard/Good). This clarity makes the outcome easier for stakeholders to interpret and act on, i.e. to approve or reject the loan.
Flexibility in Score Band Creation: With the binary model, more specific score bands can be created based on the probability of being classified as Poor. This allows further segmentation and targeted decision-making, giving a more granular view of credit risk that can support, for example, collection strategy, differentiated interest rates, and cross-selling.
## [1] "There is no character column, proceed."
Step 1: Binary target creation: Poor (=1) vs. Standard or Good (=0) & Data Scaling
## [1] "There is no NA Value in train set, proceed."
## [1] "There is no NA Value in test set, proceed."
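As a rough illustration of this step, the sketch below builds the binary target and standardises one numeric predictor in base R. The toy rows and the label strings ("Poor", "Standard", "Good") are assumptions for illustration; only the column names `Credit_Score2` and `Annual_Income` appear in the report's output, and the original scaling code is not shown.

```r
# Hypothetical toy rows; the real data set has many more columns.
train <- data.frame(
  Credit_Score  = c("Poor", "Standard", "Good", "Poor", "Standard"),
  Annual_Income = c(20000, 50000, 90000, 30000, 60000)
)
test <- data.frame(
  Credit_Score  = c("Good", "Poor"),
  Annual_Income = c(80000, 25000)
)

# Binary target: Poor = 1, Standard or Good = 0.
train$Credit_Score2 <- as.integer(train$Credit_Score == "Poor")
test$Credit_Score2  <- as.integer(test$Credit_Score == "Poor")

# Standardise with the training mean and SD so that the test set
# does not influence the scaling parameters.
mu    <- mean(train$Annual_Income)
sigma <- sd(train$Annual_Income)
train$Annual_Income <- (train$Annual_Income - mu) / sigma
test$Annual_Income  <- (test$Annual_Income - mu) / sigma

# Same kind of check as the NA messages above.
stopifnot(!any(is.na(train)), !any(is.na(test)))
```

Reusing the training mean and standard deviation for the test set is one common choice; whether the original pipeline scaled the two sets this way is not stated.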
Step 2: Data Processing & Variable Reduction
## ✔ Binning on 70001 rows and 30 columns in 00:00:24
## [1] "Selected Variables and their Information Values:"
##
##
## |var | iv|
## |:----------------------|---------:|
## |Outstanding_Debt | 1.2744588|
## |Interest_Rate | 1.0757602|
## |Num_Credit_Inquiries | 0.8840514|
## |Delay_from_due_date | 0.7414860|
## |Credit_Mix | 0.6800040|
## |Credit_History_Months | 0.6680570|
## |Num_Credit_Card | 0.6424321|
## |No_of_Loan | 0.6046970|
## |Num_Bank_Accounts | 0.4811811|
## |Min_Amount_Pymt_Ind | 0.4318173|
## |Num_of_Delayed_Payment | 0.3673243|
## |Annual_Income | 0.2671015|
## |Monthly_Balance | 0.2061845|
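To make the Information Value column concrete, the sketch below recomputes the Weight of Evidence and IV for Outstanding_Debt from the `neg` and `pos` bin counts printed for that variable in the scorecard output of this report. It uses the standard definitions, WOE = ln(share of Poor / share of non-Poor) and IV = sum over bins of (share of Poor - share of non-Poor) x WOE; the variable names are illustrative.

```r
# Bin counts for Outstanding_Debt as printed in the scorecard output:
# neg = Standard/Good (target 0), pos = Poor (target 1).
neg <- c(32094, 9003, 3898, 4707)
pos <- c(3857, 3241, 8905, 4296)

pos_dist <- pos / sum(pos)   # share of all Poor cases in each bin
neg_dist <- neg / sum(neg)   # share of all non-Poor cases in each bin

# Weight of Evidence per bin and the Information Value of the variable.
woe    <- log(pos_dist / neg_dist)
bin_iv <- (pos_dist - neg_dist) * woe
iv     <- sum(bin_iv)

round(woe, 4)
round(iv, 4)
```

The result should match the 1.2744588 listed for Outstanding_Debt in the table above, up to rounding.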
Step 1: Model Fitting
A Generalized Linear Model (GLM) with a binomial family is fitted to predict the binary Credit Score outcome, using all predictors retained after variable reduction in Step 2.
##
## Call:
## glm(formula = Credit_Score2 ~ ., family = binomial(), data = iv_var_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.11406 0.01012 -110.065 < 0.0000000000000002 ***
## Outstanding_Debt 0.09094 0.01445 6.292 0.00000000031348802 ***
## Interest_Rate 0.55220 0.01490 37.056 < 0.0000000000000002 ***
## Num_Credit_Inquiries 0.41867 0.01433 29.217 < 0.0000000000000002 ***
## Delay_from_due_date 0.36932 0.01285 28.746 < 0.0000000000000002 ***
## Credit_Mix -0.50941 0.02439 -20.886 < 0.0000000000000002 ***
## Credit_History_Months -0.11739 0.01472 -7.975 0.00000000000000153 ***
## Num_Credit_Card 0.32867 0.01182 27.815 < 0.0000000000000002 ***
## No_of_Loan 0.11079 0.01418 7.815 0.00000000000000551 ***
## Num_Bank_Accounts 0.03480 0.01444 2.410 0.01596 *
## Min_Amount_Pymt_Ind -0.05410 0.01745 -3.100 0.00194 **
## Num_of_Delayed_Payment -0.02309 0.01529 -1.511 0.13085
## Annual_Income -0.12447 0.01493 -8.339 < 0.0000000000000002 ***
## Monthly_Balance 0.08669 0.01607 5.393 0.00000006919048661 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 84300 on 70000 degrees of freedom
## Residual deviance: 67401 on 69987 degrees of freedom
## AIC: 67429
##
## Number of Fisher Scoring iterations: 4
## Outstanding_Debt Interest_Rate Num_Credit_Inquiries
## 2.600563 2.378558 2.118945
## Delay_from_due_date Credit_Mix Credit_History_Months
## 1.893779 6.616354 2.282167
## Num_Credit_Card No_of_Loan Num_Bank_Accounts
## 1.433335 2.323659 2.174574
## Min_Amount_Pymt_Ind Num_of_Delayed_Payment Annual_Income
## 2.870790 2.416286 2.039456
## Monthly_Balance
## 2.345378
Step 2: Scorecard Creation
Before creating the scorecard, the variables Credit_Mix and Num_of_Delayed_Payment were excluded due to multicollinearity and lack of statistical significance, respectively. Credit_Mix had a Variance Inflation Factor (VIF) of 6.6, indicating moderate multicollinearity, which can inflate regression coefficient variance and compromise model reliability. Num_of_Delayed_Payment had a p-value of 0.13085, above the standard threshold of 0.05, suggesting it is not a significant predictor of the target variable. Removing these variables enhances model interpretability, reduces the risk of overfitting, and improves scorecard stability.
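For readers unfamiliar with VIF, the sketch below shows the textbook definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. It runs on simulated data with a hypothetical helper `vif_manual`; the report does not show which function produced the VIF values above, and GLM-specific implementations may compute them slightly differently.

```r
# Hypothetical helper: VIF of each column from an auxiliary regression
# of that column on all remaining columns.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors: x3 is built from x1, so both are collinear,
# while x2 is independent of the other two.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + 0.4 * rnorm(200)

v <- vif_manual(data.frame(x1, x2, x3))
round(v, 2)
```

In this toy set-up the collinear pair x1 and x3 receives a clearly higher VIF than x2, which is the same pattern that singled out Credit_Mix in the output above.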
Weight of Evidence (WOE) bins were then created for the predictor variables. The credit scorecard was generated from the WOE-transformed data and a logistic regression model, with a base score of 600.
##
## Call:
## glm(formula = Credit_Score2 ~ ., family = binomial(), data = iv_var_train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.111447 0.010092 -110.136 < 0.0000000000000002 ***
## Outstanding_Debt 0.006852 0.013920 0.492 0.623
## Interest_Rate 0.475849 0.014347 33.167 < 0.0000000000000002 ***
## Num_Credit_Inquiries 0.414445 0.014225 29.135 < 0.0000000000000002 ***
## Delay_from_due_date 0.286338 0.012205 23.461 < 0.0000000000000002 ***
## Credit_History_Months -0.105541 0.014630 -7.214 0.0000000000005437 ***
## Num_Credit_Card 0.298691 0.011690 25.551 < 0.0000000000000002 ***
## No_of_Loan 0.060274 0.013930 4.327 0.0000151163866710 ***
## Num_Bank_Accounts -0.078943 0.013537 -5.832 0.0000000054903309 ***
## Min_Amount_Pymt_Ind -0.209276 0.015720 -13.313 < 0.0000000000000002 ***
## Annual_Income -0.112643 0.014898 -7.561 0.0000000000000401 ***
## Monthly_Balance 0.084244 0.016030 5.255 0.0000001476924288 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 84300 on 70000 degrees of freedom
## Residual deviance: 67972 on 69989 degrees of freedom
## AIC: 67996
##
## Number of Fisher Scoring iterations: 4
## ✔ Binning on 70001 rows and 12 columns in 00:00:18
## $basepoints
## variable bin woe points
## <char> <lgcl> <lgcl> <num>
## 1: basepoints NA NA 0
##
## $Outstanding_Debt
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Outstanding_Debt [-Inf,-0.2) 35951 0.5135784 32094 3857 0.1072849
## 2: Outstanding_Debt [-0.2,0.1) 12244 0.1749118 9003 3241 0.2647011
## 3: Outstanding_Debt [0.1,1.1) 12803 0.1828974 3898 8905 0.6955401
## 4: Outstanding_Debt [1.1, Inf) 9003 0.1286124 4707 4296 0.4771743
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: -1.2233059 0.557483949 1.274459 -0.2 FALSE 43
## 2: -0.1262024 0.002710392 1.274459 0.1 FALSE 43
## 3: 1.7216229 0.620238970 1.274459 1.1 FALSE 42
## 4: 0.8041071 0.094025537 1.274459 Inf FALSE 42
##
## $Interest_Rate
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Interest_Rate [-Inf,-0.4) 30776 0.43965086 26659 4117 0.13377307
## 2: Interest_Rate [-0.4,0) 6554 0.09362723 6020 534 0.08147696
## 3: Interest_Rate [0,0.7) 16637 0.23766803 11395 5242 0.31508084
## 4: Interest_Rate [0.7, Inf) 16034 0.22905387 5628 10406 0.64899588
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: -0.9725285 0.324395573 1.07576 -0.4 FALSE 76
## 2: -1.5269731 0.144780222 1.07576 0 FALSE 95
## 3: 0.1190020 0.003447832 1.07576 0.7 FALSE 38
## 4: 1.5101020 0.603136556 1.07576 Inf FALSE -9
##
## $Num_Credit_Inquiries
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Num_Credit_Inquiries [-Inf,-0.2) 35167 0.5023785 30875 4292 0.1220462
## 2: Num_Credit_Inquiries [-0.2,0.6) 17431 0.2490107 11476 5955 0.3416327
## 3: Num_Credit_Inquiries [0.6,1.2) 7492 0.1070270 3493 3999 0.5337694
## 4: Num_Credit_Inquiries [1.2, Inf) 9911 0.1415837 3858 6053 0.6107355
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: -1.0777204 0.44161036 0.8840514 -0.2 FALSE 75
## 2: 0.2394469 0.01495778 0.8840514 0.6 FALSE 35
## 3: 1.0307569 0.13062361 0.8840514 1.2 FALSE 12
## 4: 1.3458787 0.29685964 0.8840514 Inf FALSE 2
##
## $Delay_from_due_date
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Delay_from_due_date [-Inf,-1.1) 5463 0.07804174 5037 426 0.07797913
## 2: Delay_from_due_date [-1.1,-0.4) 25137 0.35909487 21374 3763 0.14969965
## 3: Delay_from_due_date [-0.4,0.6) 26017 0.37166612 18116 7901 0.30368605
## 4: Delay_from_due_date [0.6, Inf) 13384 0.19119727 5175 8209 0.61334429
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: -1.57465305 0.12653558 0.741486 -1.1 FALSE 75
## 2: -0.84148517 0.20588152 0.741486 -0.4 FALSE 60
## 3: 0.06566736 0.00162452 0.741486 0.6 FALSE 41
## 4: 1.35686532 0.40744440 0.741486 Inf FALSE 14
##
## $Credit_History_Months
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Credit_History_Months [-Inf,-0.4) 23968 0.34239511 12129 11839 0.4939503
## 2: Credit_History_Months [-0.4,0.2) 17489 0.24983929 12473 5016 0.2868089
## 3: Credit_History_Months [0.2,0.4) 3823 0.05461351 3207 616 0.1611300
## 4: Credit_History_Months [0.4, Inf) 24721 0.35315210 21893 2828 0.1143967
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: 0.87127344 0.2955326876 0.668057 -0.4 FALSE 49
## 2: -0.01545995 0.0000595197 0.668057 0.2 FALSE 42
## 3: -0.75437069 0.0257830653 0.668057 0.4 FALSE 37
## 4: -1.15112365 0.3466817057 0.668057 Inf FALSE 34
##
## $Num_Credit_Card
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Num_Credit_Card [-Inf,-1) 12514 0.1787689 11200 1314 0.1050024
## 2: Num_Credit_Card [-1,-0.5) 10110 0.1444265 8692 1418 0.1402572
## 3: Num_Credit_Card [-0.5,1) 36980 0.5282782 25936 11044 0.2986479
## 4: Num_Credit_Card [1, Inf) 10397 0.1485264 3874 6523 0.6273925
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: -1.24736431 0.2003401691 0.6424321 -1 FALSE 69
## 2: -0.91768208 0.0963810664 0.6424321 -0.5 FALSE 62
## 3: 0.04172888 0.0009278876 0.6424321 1 FALSE 42
## 4: 1.41652038 0.3447829434 0.6424321 Inf FALSE 12
##
## $No_of_Loan
## variable bin count count_distr neg pos posprob woe
## <char> <char> <int> <num> <int> <int> <num> <num>
## 1: No_of_Loan [-Inf,-1) 15807 0.2258111 13974 1833 0.1159613 -1.1357709
## 2: No_of_Loan [-1,-0.5) 11119 0.1588406 8453 2666 0.2397698 -0.2584686
## 3: No_of_Loan [-0.5,0.5) 21720 0.3102813 17034 4686 0.2157459 -0.3951585
## 4: No_of_Loan [0.5, Inf) 21355 0.3050671 10241 11114 0.5204402 0.9772799
## bin_iv total_iv breaks is_special_values points
## <num> <num> <char> <lgcl> <num>
## 1: 0.21676833 0.604697 -1 FALSE 47
## 2: 0.01001233 0.604697 -0.5 FALSE 44
## 3: 0.04420788 0.604697 0.5 FALSE 44
## 4: 0.33370844 0.604697 Inf FALSE 38
##
## $Num_Bank_Accounts
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Num_Bank_Accounts [-Inf,0) 34941 0.49915001 29370 5571 0.1594402
## 2: Num_Bank_Accounts [0,1) 18299 0.26141055 11656 6643 0.3630253
## 3: Num_Bank_Accounts [1,1.5) 13040 0.18628305 7234 5806 0.4452454
## 4: Num_Bank_Accounts [1.5, Inf) 3721 0.05315638 1442 2279 0.6124698
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: -0.7669256 0.24271269 0.4811811 0 FALSE 38
## 2: 0.3332161 0.03090239 0.4811811 1 FALSE 44
## 3: 0.6755733 0.09490216 0.4811811 1.5 FALSE 46
## 4: 1.3531793 0.11266384 0.4811811 Inf FALSE 50
##
## $Min_Amount_Pymt_Ind
## variable bin count count_distr neg pos
## <char> <char> <int> <num> <int> <int>
## 1: Min_Amount_Pymt_Ind [-Inf,0.823375414) 28283 0.4040371 24449 3834
## 2: Min_Amount_Pymt_Ind [0.823375414, Inf) 41718 0.5959629 25253 16465
## posprob woe bin_iv total_iv breaks is_special_values
## <num> <num> <num> <num> <char> <lgcl>
## 1: 0.1355585 -0.9572071 0.2900677 0.4318173 0.823375414 FALSE
## 2: 0.3946738 0.4677655 0.1417496 0.4318173 Inf FALSE
## points
## <num>
## 1: 28
## 2: 50
##
## $Annual_Income
## variable bin count count_distr neg pos posprob woe
## <char> <char> <int> <num> <int> <int> <num> <num>
## 1: Annual_Income [-Inf,-1) 5691 0.08129884 3026 2665 0.4682833 0.7684360
## 2: Annual_Income [-1,-0.8) 13076 0.18679733 8021 5055 0.3865861 0.4337883
## 3: Annual_Income [-0.8,-0.4) 14242 0.20345424 10967 3275 0.2299537 -0.3130993
## 4: Annual_Income [-0.4,0.9) 24316 0.34736647 16607 7709 0.3170341 0.1280377
## 5: Annual_Income [0.9, Inf) 12676 0.18108313 11081 1595 0.1258283 -1.0428846
## bin_iv total_iv breaks is_special_values points
## <num> <num> <char> <lgcl> <num>
## 1: 0.054101270 0.2671015 -1 FALSE 49
## 2: 0.038019467 0.2671015 -0.8 FALSE 46
## 3: 0.018572147 0.2671015 -0.4 FALSE 40
## 4: 0.005843768 0.2671015 0.9 FALSE 44
## 5: 0.150564880 0.2671015 Inf FALSE 34
##
## $Monthly_Balance
## variable bin count count_distr neg pos posprob
## <char> <char> <int> <num> <int> <int> <num>
## 1: Monthly_Balance [-Inf,-0.5) 24996 0.3570806 15180 9816 0.3927028
## 2: Monthly_Balance [-0.5,-0.3) 10361 0.1480122 7068 3293 0.3178265
## 3: Monthly_Balance [-0.3,0.7) 21972 0.3138812 16689 5283 0.2404424
## 4: Monthly_Balance [0.7, Inf) 12672 0.1810260 10765 1907 0.1504893
## woe bin_iv total_iv breaks is_special_values points
## <num> <num> <num> <char> <lgcl> <num>
## 1: 0.4595085 0.081861581 0.2061845 -0.5 FALSE 40
## 2: 0.1316950 0.002636163 0.2061845 -0.3 FALSE 42
## 3: -0.2547822 0.019241691 0.2061845 0.7 FALSE 44
## 4: -0.8352953 0.102445093 0.2061845 Inf FALSE 48
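The `points` column follows from the usual scorecard scaling, in which a fixed number of points corresponds to a doubling of the odds. The sketch below reproduces the Interest_Rate points from the printed WOE values and the refitted GLM coefficients. The base score of 600 is stated in the text, but the base odds of 1/19 and the 50 points to double the odds are assumptions that the source does not state, as is the even spreading of the base points over the 11 predictors (suggested by the zero `basepoints` row).

```r
points0 <- 600      # base score stated in the text
odds0   <- 1 / 19   # ASSUMED base odds (not stated in the source)
pdo     <- 50       # ASSUMED points to double the odds (not stated)
n_vars  <- 11       # predictors in the refitted model

intercept <- -1.111447   # from the refitted GLM summary
coef_ir   <- 0.475849    # Interest_Rate coefficient from the same summary
woe_ir    <- c(-0.9725285, -1.5269731, 0.1190020, 1.5101020)

# Scaling constants: points per unit of log-odds and the score offset.
scale_factor <- pdo / log(2)
score_offset <- points0 + scale_factor * log(odds0)

# Base points, spread evenly over the predictors.
base_share <- (score_offset - scale_factor * intercept) / n_vars

# Points per bin: a higher WOE (riskier bin) lowers the score
# when the coefficient is positive.
points_ir <- round(base_share - scale_factor * coef_ir * woe_ir)
points_ir
```

Under these assumptions the four values come out as 76, 95, 38, and -9, matching the Interest_Rate block above. Because the sign of each coefficient decides whether a riskier bin gains or loses points, predictors with negative coefficients, such as Num_Bank_Accounts, show points that rise with WOE.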
The Credit Scorecard model was evaluated using several key metrics to assess its performance and generalization ability: AUC (Area Under the Curve), the Gini coefficient, the KS (Kolmogorov-Smirnov) statistic, and the Population Stability Index (PSI). AUC, Gini, and KS were calculated for both the training and testing datasets, while PSI compares the score distributions of the two.
AUC (Area Under the Curve)
Purpose: Measures model’s ability to distinguish between classes.
Result: Training AUC = 0.77, Testing AUC = 0.77.
Interpretation: Values above 0.7 are generally regarded as good discrimination, and the identical training and testing AUC suggests that the model is not overfitting.
Gini Coefficient
Purpose: Indicates model’s discriminatory power.
Result: Training Gini = 0.55, Testing Gini = 0.54.
Interpretation: Indicates good model performance which aligns well with the AUC values and further confirms the model’s ability to distinguish between the classes.
KS (Kolmogorov-Smirnov) Statistic
Purpose: Evaluates separation between positive and negative classes.
Result: Training KS = 0.41, Testing KS = 0.40.
Interpretation: The model separates good and bad customers well, and the near-identical training and testing values indicate that it generalizes well to unseen data.
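The three discrimination metrics above are closely related: Gini is 2 x AUC - 1 (so an AUC of 0.77 gives a Gini of 0.54), and KS is the largest gap between the cumulative score distributions of the two classes. A minimal base R sketch on made-up predictions follows; the helper `auc_gini_ks` and the toy data are hypothetical and are not the code used in the report.

```r
# Hypothetical helper: AUC via the rank (Mann-Whitney) formula,
# Gini from AUC, and KS from the two empirical CDFs.
auc_gini_ks <- function(prob, label) {
  pos <- prob[label == 1]
  neg <- prob[label == 0]
  r   <- rank(prob)  # average ranks, so ties are handled
  auc <- (sum(r[label == 1]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
  grid <- sort(unique(prob))
  ks   <- max(abs(ecdf(pos)(grid) - ecdf(neg)(grid)))
  c(auc = auc, gini = 2 * auc - 1, ks = ks)
}

# Toy predicted probabilities of Poor and the true labels.
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   1,   0,   0)

m <- auc_gini_ks(prob, label)
m
```

For these eight toy cases the function returns an AUC of 0.8125, a Gini of 0.625, and a KS of 0.5.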
Population Stability Index (PSI)
Purpose: Measures the stability of model scores over time by comparing distributions.
Result: PSI = 0.0725.
Interpretation: PSI values < 0.1 indicate a very stable population. PSI is ideally computed on out-of-time data to detect changes across periods; here it was computed on out-of-sample data to check for immediate stability. The result suggests that the model’s score distribution is stable between the training and testing sets.
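PSI compares the share of cases falling into each score band in two samples: PSI = sum over bands of (actual share - expected share) x ln(actual share / expected share). The sketch below applies this to hypothetical band counts; the helper `psi` and the counts are illustrative only and do not reproduce the 0.0725 reported above.

```r
# Hypothetical helper: PSI between two vectors of score-band counts.
psi <- function(expected, actual) {
  e <- expected / sum(expected)  # expected (e.g. training) shares
  a <- actual / sum(actual)      # actual (e.g. testing) shares
  sum((a - e) * log(a / e))
}

# Made-up counts per score band for a training and a testing sample.
train_counts <- c(100, 200, 400, 200, 100)
test_counts  <- c(45, 95, 200, 105, 55)

psi_value <- psi(train_counts, test_counts)
psi_value
```

Identical distributions give a PSI of exactly 0, and the small shifts in this toy example stay well below the 0.1 threshold quoted above. Every band needs a non-zero count in both samples, otherwise the logarithm is undefined.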
Score vs. Poor Rate
The scatter plot shows that there is a clear trend where higher scores are associated with lower predicted Poor rate %. The plot shows a high density of data points at higher predicted probabilities for lower scores. This suggests that the model is effectively capturing the risk associated with lower scores.
The box plot reveals a clear trend where category 1 (poor credit scores) has a lower median score and wider IQR compared to category 0 (better credit scores). This visual distinction aligns well with the model’s strong performance metrics, indicating effective risk differentiation.
XGBoost Integration:
A combination of XGBoost and Logistic Regression is explored for further model refinement. Initially, an XGBoost model is trained to predict the probability of the target variable, effectively capturing complex patterns within the data. The predictions from the XGBoost model are then used as input for a logistic regression model. This approach leverages the predictive power of XGBoost along with the interpretability of Logistic Regression.
Performance: Integrating XGBoost substantially improves model performance on both the training and testing datasets, with AUC of 88%, Gini of 76%, and KS of 63%, compared with roughly 77%, 54%, and 40% for the original logistic regression scorecard.
In the context of creating a credit scorecard for various credit strategies, the original Logistic Regression model would be more suitable as it produces a more spread-out distribution of predicted scores, allowing more flexible strategy implementation based on different score buckets.
The XGBoost-integrated model tends to generate scores that cluster more tightly, particularly at high scores. Although this makes it less flexible for score-band strategies, it is highly predictive and can be a good solution for loan approval assessment, depending on the bank’s risk appetite.
## [1] train-auc:0.826324 test-auc:0.823245
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.846951 test-auc:0.843589
## [3] train-auc:0.844459 test-auc:0.842036
## [4] train-auc:0.849353 test-auc:0.847870
## [5] train-auc:0.850215 test-auc:0.849140
## [6] train-auc:0.851124 test-auc:0.850050
## [7] train-auc:0.853734 test-auc:0.852037
## [8] train-auc:0.854966 test-auc:0.852981
## [9] train-auc:0.854992 test-auc:0.852579
## [10] train-auc:0.855786 test-auc:0.853195
## [11] train-auc:0.857319 test-auc:0.854595
## [12] train-auc:0.858093 test-auc:0.855209
## [13] train-auc:0.860979 test-auc:0.857076
## [14] train-auc:0.861954 test-auc:0.857998
## [15] train-auc:0.862942 test-auc:0.858280
## [16] train-auc:0.864723 test-auc:0.859675
## [17] train-auc:0.867083 test-auc:0.861602
## [18] train-auc:0.867633 test-auc:0.861877
## [19] train-auc:0.868723 test-auc:0.862319
## [20] train-auc:0.869359 test-auc:0.862614
## [21] train-auc:0.870649 test-auc:0.863096
## [22] train-auc:0.870843 test-auc:0.863090
## [23] train-auc:0.871917 test-auc:0.864180
## [24] train-auc:0.872689 test-auc:0.864622
## [25] train-auc:0.873753 test-auc:0.865416
## [26] train-auc:0.874480 test-auc:0.865953
## [27] train-auc:0.875185 test-auc:0.866507
## [28] train-auc:0.875637 test-auc:0.866634
## [29] train-auc:0.876062 test-auc:0.866869
## [30] train-auc:0.876747 test-auc:0.867242
## [31] train-auc:0.877028 test-auc:0.867539
## [32] train-auc:0.877531 test-auc:0.867816
## [33] train-auc:0.878295 test-auc:0.868334
## [34] train-auc:0.879271 test-auc:0.869032
## [35] train-auc:0.879655 test-auc:0.869175
## [36] train-auc:0.880007 test-auc:0.869433
## [37] train-auc:0.880355 test-auc:0.869704
## [38] train-auc:0.880658 test-auc:0.869838
## [39] train-auc:0.881000 test-auc:0.870132
## [40] train-auc:0.881428 test-auc:0.870334
## [41] train-auc:0.881907 test-auc:0.870667
## [42] train-auc:0.882200 test-auc:0.870814
## [43] train-auc:0.882636 test-auc:0.871176
## [44] train-auc:0.882954 test-auc:0.871311
## [45] train-auc:0.883059 test-auc:0.871166
## [46] train-auc:0.883301 test-auc:0.871234
## [47] train-auc:0.884098 test-auc:0.871741
## [48] train-auc:0.884338 test-auc:0.871931
## [49] train-auc:0.884803 test-auc:0.872190
## [50] train-auc:0.885485 test-auc:0.872467
## [51] train-auc:0.885773 test-auc:0.872629
## [52] train-auc:0.885985 test-auc:0.872784
## [53] train-auc:0.886490 test-auc:0.872982
## [54] train-auc:0.886761 test-auc:0.873067
## [55] train-auc:0.887473 test-auc:0.873386
## [56] train-auc:0.887783 test-auc:0.873531
## [57] train-auc:0.887985 test-auc:0.873722
## [58] train-auc:0.888125 test-auc:0.873835
## [59] train-auc:0.888689 test-auc:0.874100
## [60] train-auc:0.888995 test-auc:0.874326
## [61] train-auc:0.889294 test-auc:0.874491
## [62] train-auc:0.889463 test-auc:0.874627
## [63] train-auc:0.890343 test-auc:0.875128
## [64] train-auc:0.891154 test-auc:0.875499
## [65] train-auc:0.891396 test-auc:0.875623
## [66] train-auc:0.891849 test-auc:0.875668
## [67] train-auc:0.892395 test-auc:0.876022
## [68] train-auc:0.892721 test-auc:0.876162
## [69] train-auc:0.892849 test-auc:0.876251
## [70] train-auc:0.893146 test-auc:0.876275
## [71] train-auc:0.893314 test-auc:0.876287
## [72] train-auc:0.893575 test-auc:0.876350
## [73] train-auc:0.893780 test-auc:0.876426
## [74] train-auc:0.893955 test-auc:0.876570
## [75] train-auc:0.894310 test-auc:0.876700
## [76] train-auc:0.894851 test-auc:0.877130
## [77] train-auc:0.894968 test-auc:0.877193
## [78] train-auc:0.895287 test-auc:0.877185
## [79] train-auc:0.895386 test-auc:0.877175
## [80] train-auc:0.896052 test-auc:0.877651
## [81] train-auc:0.896237 test-auc:0.877643
## [82] train-auc:0.896665 test-auc:0.877772
## [83] train-auc:0.897000 test-auc:0.877993
## [84] train-auc:0.897348 test-auc:0.878158
## [85] train-auc:0.897516 test-auc:0.878255
## [86] train-auc:0.897955 test-auc:0.878480
## [87] train-auc:0.898052 test-auc:0.878518
## [88] train-auc:0.898434 test-auc:0.878778
## [89] train-auc:0.898657 test-auc:0.878882
## [90] train-auc:0.898803 test-auc:0.879008
## [91] train-auc:0.899105 test-auc:0.879207
## [92] train-auc:0.899615 test-auc:0.879477
## [93] train-auc:0.899841 test-auc:0.879579
## [94] train-auc:0.899992 test-auc:0.879628
## [95] train-auc:0.900416 test-auc:0.879832
## [96] train-auc:0.900559 test-auc:0.879850
## [97] train-auc:0.900676 test-auc:0.879821
## [98] train-auc:0.900997 test-auc:0.879983
## [99] train-auc:0.901348 test-auc:0.880127
## [100] train-auc:0.901820 test-auc:0.880399
## ✔ Binning on 70001 rows and 12 columns in 00:00:17
Future Improvement:
Automated Binning with Manual Review: To further refine the risk model and ensure that risk rankings are intuitive and sensible, a hybrid binning process could be adopted that combines automated binning with manual review and adjustment by domain experts.
This process would create initial bins based on statistical criteria, such as maximizing information value or minimizing within-bin variance. A manual review would then check that the bins make intuitive sense and accurately reflect the underlying risk patterns. This combination of automation and expert judgment is expected to produce a more robust and interpretable scorecard.
Exploration of Alternative Models: Another potential refinement is to explore other types of models for scorecard development, such as random forests or SVR. Comparing the performance of these alternatives with the current model would help identify the most effective approach for predicting risk and improving the overall performance of the scorecard.
This project successfully demonstrated the application of machine learning to enhance credit scoring and risk assessment in financial decision-making. By leveraging advanced tools in R, the framework covered essential stages of the data lifecycle, including data cleaning, exploratory data analysis (EDA), modeling, and evaluation. Among the three models tested, XGBoost emerged as the strongest for classifying customers into credit score bands (Good, Standard, and Poor), offering superior prediction performance and reliability.
The regression models built for scorecard development have delivered satisfactory results. The logistic regression model allows a more granular assessment of credit risk and is suitable for various credit strategies such as collection strategy, interest rate assignment, and cross-selling initiatives. The XGBoost-integrated model, which is highly predictive, can be a good solution for loan approval assessment.
Overall, the project demonstrates the power of machine learning in empowering financial institutions with actionable insights, streamlining risk management, and enhancing customer segmentation. By achieving accurate classification and reliable risk predictions, the developed models offer a strong foundation for informed financial decisions and uncovering new business opportunities.