WQD7004 OCC1 - Machine Learning-Based Prediction in Credit Score Classification

Team Member Introduction - Group 12

Name	Matrix Number
TAN YANG YI	24061644
SHARON LEE JOO WEI	24063813
LEE RONG PHEI	24064031
NURUL SARAH IZZATI BINTI ZAHID	24064189
YEE SEE MARN	23102510

1. Introduction

Credit scoring plays a critical role in financial decision-making, influencing loan approvals, assigning interest rates , and risk assessment. With the increasing reliance on data-driven methodologies, machine learning has become an indispensable tool for evaluating customers’ credit scoring and enhancing credit score classification. The goal of this project is to establish a robust machine learning framework using R to analyze financial data, including credit history, loan amounts, and income levels. Subsequently, the most effective model will be developed to accurately predict customers’ credit score bands (Good, Standard and Bad) based on their financial profiles.

The framework encompasses key stages of the data lifecycle— data cleaning, exploratory data analysis (EDA), modeling, and evaluation by leveraging advanced tools in R. Ultimately, this project seeks to uncover actionable insights from the financial data, improve prediction accuracy on credit score classification problem, and demonstrate how technology can empower financial institutions to make informed decisions.

1.1 Questions

How accurate are the selected machine learning models—Logistic Regression, Random Forest, and XGBoost—in classifying individuals into the credit score bands (Good, Standard, Poor) based on their financial data?
How effectively does the regression model predict an individual’s Bad Rate% based on financial attributes such as income, loan amount, and credit history, providing insights for risk assessment, scorecard thresholds, and data-driven decision-making?

1.2 Objectives

To develop machine learning-based credit scoring classification systems that accurately classify individuals into appropriate credit score categories (Good, Standard, or Poor) using various classification techniques.
To further develop a regression model that estimates the customer’s Bad Rate%, enabling more detailed risk assessments. This approach allows for differentiating between various risk levels, such as a 20% Bad Rate versus an 80% Bad Rate.

1.3 Project Contributions

Accurately classifying customers into Good, Standard, and Poor categories streamlines the customer monitoring process, allowing the bank to take prompt actions that mitigate credit risk.
The regression model plays a crucial role in developing a scorecard, which helps establish the loan approval/rejection threshold.
Interest rate optimization is achieved by segmenting customers based on varying Bad Rate percentages, enabling the bank to offer more tailored financial products and services.
A more granular assessment of the Bad Rate percentage allows the collections team to take proactive measures in recovering outstanding payments, thereby minimizing potential losses for the bank.
Identifying customers with lower Bad Rate percentages creates opportunities for cross-selling additional bank products, driving new business growth.

2. Data Understanding

2.1 Data Introduction

Title : Credit Score Classification
Year : 2022
Source : Kaggle Dataset
Purpose : To enable the development and evaluation of machine learning models for predicting and classifying individuals’ credit scores based on financial attributes.

2.2 Data Overview

Dimension : 100,000 Rows & 28 Columns

Contents & Structure :

Demographic

[Character] ID  : Represents a unique identification of an entry
[Character] Name: Represents the name of a person
[Character] Age : Represents the age of the person
[Character] Occupation : Represents the occupation of the person

Financial Profile

[Character] Annual_Income : Represents the annual income of the person 
[Character] Outstanding_Debt : Represents the remaining debt to be paid (in USD)
[Numeric] Monthly_Balance : Represents the monthly balance amount of the customer (in USD)
[Numeric] Num_Bank_Accounts : Represents the number of bank accounts a person holds
[Numeric] Num_Credit_Card : Represents the number of other credit cards held by a person
[Character] Num_of_Loan : Represents the number of loans taken from the bank

Loan Information

[Character] Num_of_Loan : Represents the number of loans taken from the bank
[Numeric] Interest_Rate : Represents the interest rate on credit card
[Numeric] Amount_invested_monthly : Represents the monthly amount invested by the customer (in USD)

Credit Behavior

[Numeric] Num_of_Delayed_Payment : Represents the average number of payments delayed by a person
[Numeric] Num_Credit_Inquiries : Represents the number of credit card inquiries
[Character] Credit_History_Age : Represents the age of credit history of the person
[Character] Payment_Behaviour : Represents the payment behavior of the customer (in USD)
[Numeric] Credit_Utilization_Ratio : Represents the utilization ratio of credit card

Credit Score

[Character] Credit Score : Represents the bracket of credit score (Poor, Standard, Good)

2.3 Data Summary

## # A tibble: 28 × 8
##    Variable              Class     Min    `1st Qu.` Median Mean  `3rd Qu.` Max  
##    <chr>                 <chr>     <chr>  <chr>     <chr>  <chr> <chr>     <chr>
##  1 ID                    numeric   5634   43132.75  80631… 8063… 118130.25 1556…
##  2 Customer_ID           character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  3 Month                 character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  4 Name                  character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  5 Age                   character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  6 SSN                   character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  7 Occupation            character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  8 Annual_Income         character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  9 Monthly_Inhand_Salary numeric   303.6… 1625.568… 3093.… 4194… 5957.448… 1520…
## 10 Num_Bank_Accounts     numeric   -1     3         6      17.0… 7         1798 
## # ℹ 18 more rows

The summary table reveals that some numerical columns are incorrectly classified as "character" data type. Additionally, outliers and erroneous values, such as -1 and 1798 in the Num_Bank_Accounts column, have been identified. These issues necessitate a data type conversion and cleaning in the next phase to ensure the data is properly formatted, readable and accurate.

##     ID Customer_ID    Month          Name  Age         SSN Occupation
## 1 5634   CUS_0xd40  January Aaron Maashoh   23 821-00-0265  Scientist
## 2 5635   CUS_0xd40 February Aaron Maashoh   23 821-00-0265  Scientist
## 3 5636   CUS_0xd40    March Aaron Maashoh -500 821-00-0265  Scientist
## 4 5637   CUS_0xd40    April Aaron Maashoh   23 821-00-0265  Scientist
## 5 5638   CUS_0xd40      May Aaron Maashoh   23 821-00-0265  Scientist
## 6 5639   CUS_0xd40     June Aaron Maashoh   23 821-00-0265  Scientist
##   Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts Num_Credit_Card
## 1      19114.12              1824.843                 3               4
## 2      19114.12                    NA                 3               4
## 3      19114.12                    NA                 3               4
## 4      19114.12                    NA                 3               4
## 5      19114.12              1824.843                 3               4
## 6      19114.12                    NA                 3               4
##   Interest_Rate Num_of_Loan
## 1             3           4
## 2             3           4
## 3             3           4
## 4             3           4
## 5             3           4
## 6             3           4
##                                                          Type_of_Loan
## 1 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 2 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 3 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 4 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 5 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 6 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
##   Delay_from_due_date Num_of_Delayed_Payment Changed_Credit_Limit
## 1                   3                      7                11.27
## 2                  -1                                       11.27
## 3                   3                      7                    _
## 4                   5                      4                 6.27
## 5                   6                                       11.27
## 6                   8                      4                 9.27
##   Num_Credit_Inquiries Credit_Mix Outstanding_Debt Credit_Utilization_Ratio
## 1                    4          _           809.98                 26.82262
## 2                    4       Good           809.98                 31.94496
## 3                    4       Good           809.98                 28.60935
## 4                    4       Good           809.98                 31.37786
## 5                    4       Good           809.98                 24.79735
## 6                    4       Good           809.98                 27.26226
##      Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month
## 1 22 Years and 1 Months                    No            49.57495
## 2                  <NA>                    No            49.57495
## 3 22 Years and 3 Months                    No            49.57495
## 4 22 Years and 4 Months                    No            49.57495
## 5 22 Years and 5 Months                    No            49.57495
## 6 22 Years and 6 Months                    No            49.57495
##   Amount_invested_monthly                Payment_Behaviour    Monthly_Balance
## 1       80.41529543900253  High_spent_Small_value_payments 312.49408867943663
## 2      118.28022162236736   Low_spent_Large_value_payments 284.62916249607184
## 3         81.699521264648  Low_spent_Medium_value_payments  331.2098628537912
## 4       199.4580743910713   Low_spent_Small_value_payments 223.45130972736786
## 5      41.420153086217326 High_spent_Medium_value_payments 341.48923103222177
## 6      62.430172331195294                           !@9#%8  340.4792117872438
##   Credit_Score
## 1         Good
## 2         Good
## 3         Good
## 4         Good
## 5         Good
## 6         Good

3. Data Cleaning

3.1 Remove “_” & Data Type Conversion

From the data summarized above, the first layer of cleaning has been performed on the variables that were expected to be numeric but were recorded as character. The cleaning process involves removing underscores from these variables, converting them to numeric format, and rounding the numeric values to two decimal places.

## # A tibble: 28 × 8
##    Variable              Class     Min    `1st Qu.` Median Mean  `3rd Qu.` Max  
##    <chr>                 <chr>     <chr>  <chr>     <chr>  <chr> <chr>     <chr>
##  1 ID                    numeric   5634   43132.75  80631… 8063… 118130.25 1556…
##  2 Customer_ID           character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  3 Month                 character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  4 Name                  character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  5 Age                   numeric   -500   24        33     110.… 42        8698 
##  6 SSN                   character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  7 Occupation            character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  8 Annual_Income         numeric   7005.… 19457.5   37578… 1764… 72790.92  2419…
##  9 Monthly_Inhand_Salary numeric   303.65 1625.57   3093.… 4194… 5957.45   1520…
## 10 Num_Bank_Accounts     numeric   -1     3         6      17.0… 7         1798 
## # ℹ 18 more rows

After converting the data types, the numeric variables now appear in the "Numeric" format. However, the previously mentioned erroneous and outlier values still persist and need to be addressed.

3.2 Function used for the following cleaning process

Function to compute MODE

Function to handle missing or outlier values using backward or forward filling:

1. Missing or outlier were replaced with 0 for scanning purpose.
2. Backward Filling: Fills the current value by using the most recent available value from the previous months.
3. Forward Filling: Fills the current value by using the most recent available value from the upcoming months.

Function to handle missing or outlier values using backward or forward filling:

1. Missing or outlier were replaced with 99 for scanning purpose.
2. Backward Filling: Fills the current value by using the most recent available value from the previous months.
3. Forward Filling: Fills the current value by using the most recent available value from the upcoming months.

Function to handle missing or outlier values using backward or forward filling:

1. Missing or outlier were replaced with 0 for scanning purpose.
2. Backward Filling: Find the current value by using the most recent available value from the previous months.
3. Forward Filling: Find the current value by using the most recent available value from the upcoming months.
4. All values are then adjusted by adding or subtracting 1, depending on whether backward or forward filling is applied.

Function to detect outliers using IQR method: 
This method is used when it's impossible to detect outlier from the frequency table.

3.3 Frequency-based One-Hot Encoding

Type_of_Loan & Num_of_Loan

##                                                          Type_of_Loan
## 1 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 2 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 3 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
##   Num_of_Loan
## 1           4
## 2           4
## 3           4

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -100.00    1.00    3.00    3.01    5.00 1496.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.533   5.000   9.000

3.4 Replacing Missing Values/Erroneous Values/Outlier with Mode

Use mode to replace the outlier with the most frequently occurring value in the dataset

Occupation

Annual Income

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7006    19458    37579   176416    72791 24198062

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7006   19343   37000   50505   71683  179987

Interest Rate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   13.00   72.47   20.00 5797.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   13.00   14.53   20.00   34.00

Credit_Mix

Payment_Behavior

Total_EMI_per_month

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    30.31    69.25  1403.12   161.22 82331.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.23  566.07 1166.15 1426.22 1945.96 4998.07

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   28.42   65.01   98.12  141.51 1701.96

3.5 Replacing Missing Values/Erroneous Values/Outlier with Mean

Amount_invested_monthly

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.00    74.53   135.93   637.41   265.73 10000.00     4479

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    74.6   131.2   195.8   239.5  1977.3

Monthly_Balance

##                         Min.                      1st Qu. 
## -333333333333333314868224222                          270 
##                       Median                         Mean 
##                          337     -30364372469635625254402 
##                      3rd Qu.                         Max. 
##                          470                         1602 
##                         NA's 
##                         1200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01  270.19  337.12  403.12  471.57 1602.04

3.6 Backward/Forward Filling

to fill the missing/outlier value with previous/next month available value

Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -500.0    24.0    33.0   110.7    42.0  8698.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   24.00   33.00   33.31   42.00   56.00

Monthly_Inhand_Salary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   303.6  1625.6  3093.8  4194.2  5957.4 15204.6   15002

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   303.6  1626.8  3096.4  4198.8  5961.7 15204.6

Num_Bank_Accounts

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00    3.00    6.00   17.09    7.00 1798.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   5.369   7.000  11.000

Num_Credit_Card

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    5.00   22.47    7.00 1499.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.000   5.000   5.534   7.000  11.000

Delay_from_due_date

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5.00   10.00   18.00   21.07   28.00   67.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   18.00   21.08   28.00   67.00

Num_of_Delayed_Payment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -3.00    9.00   14.00   30.92   18.00 4397.00    7002

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   14.00   13.21   18.00   28.00

Changed_Credit_Limit

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -6.49    5.32    9.40   10.39   14.87   36.97    2091

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.49    5.32    9.40   10.39   14.86   36.97

Num_Credit_Inquiries

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    6.00   27.75    9.00 2597.00    1965

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   5.773   8.000  17.000

Credit_History_Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   144.0   219.0   221.2   302.0   404.0

3.7 Binary Encoding

Payment_of_Min_Amount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5943  1.0000  1.0000

3.8 Drop irrelevant variables & Rename Variable

## # A tibble: 33 × 8
##    Variable              Class     Min    `1st Qu.` Median Mean  `3rd Qu.` Max  
##    <chr>                 <chr>     <chr>  <chr>     <chr>  <chr> <chr>     <chr>
##  1 Customer_ID           character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  2 Month                 character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  3 Age                   numeric   14     24        33     33.3… 42        56   
##  4 Occupation            character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  5 Annual_Income         numeric   7005.… 19342.97… 36999… 5050… 71683.47  1799…
##  6 Monthly_Inhand_Salary numeric   303.65 1626.76   3096.… 4198… 5961.74   1520…
##  7 Num_Bank_Accounts     numeric   0      3         5      5.36… 7         11   
##  8 Num_Credit_Card       numeric   0      4         5      5.53… 7         11   
##  9 Interest_Rate         numeric   1      7         13     14.5… 20        34   
## 10 Delay_from_due_date   numeric   0      10        18     21.0… 28        67   
## # ℹ 23 more rows

4. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in examining and summarizing datasets to reveal patterns, trends, anomalies, and relationships. It uses both visual and quantitative methods to provide a thorough understanding of the data before moving on to more sophisticated modeling or analysis.

For our credit score classification project, three primary analysis were carried out to explore the financial data, study the distribution of variables, and investigate the underlying relationships among them.
  - Univariate Analysis
  - Bivariate Analysis
  - Correlation Analysis

4.1 Univariate Analysis

4.1.1 Distribution of Numerical Variables

Annual_Income & Monthly_Inhand_Salary & Monthly_Balance & Amount_Invested_Monthly:

Right-skewed distribution observed with a higher density of data points at lower income levels, indicating that most individuals have lower incomes, with fewer individuals having very high incomes. There is evident income inequality, with most people earning less and only a small portion earning significantly higher incomes. this pattern is also observed in Monthly_Balance and Amount_Invested_Monthly, as individuals with lower incomes tend to have smaller savings or investments, while only a few individuals with high incomes have significantly larger balances and investments.

4.1.2 Distribution of Categorical Variables

Occupation:

Relatively even across the various categories.

Credit_Mix and Credit_Score:

The majority of the population falls into the "Standard" credit mix category, followed by "Good" and "Bad," which is consistent with the distribution observed in the credit score variable, suggesting a potential positive correlation between these two variables.

Payment_Behavior:

The most common behaviour is "Low spent, Small value payments", indicating a cautious approach to spending.

4.1.3 Distribution of Credit Score (Target variable in Credit Classification)

The majority of customers have a "Standard" credit score, followed by "Poor" and "Good," indicating class imbalance, where the "Standard" category is overrepresented, while the "Poor" and "Good" categories are underrepresented.

4.2 Bivariate Analysis

4.2.1 Distribution of Numerical Columns with Credit Score

Observation:

While most variables display clear differentiation in boxplots across the three classes, certain variables such as Changed_Credit_Limit, Credit_Utilization_Ratio, one-hot-encoded Loan types (e.g., CreditBuilderLoan, HomeEquityLoan, etc.), and Total_EMI_per_month struggle to show distinct differences. This indicates that these particular variables may have limited predictive power in distinguishing between the classes.

4.2.2 Distribution of categorical columns with Credit Score

Observation:

Occupation x Credit_Score:
No clear relationship has been observed between occupation and credit_score, indicating that occupation may not be a significant predictor of credit score. Hence, 'Occupation’ will be dropped from further consideration.

Credit_Mix x Credit_Score:
As already noted in Section 4.1.2, these two variables again show a positive correlation, reinforcing their role as predictive variables in credit score classification.

4.3 Correlation Analysis

4.3.1 Correlation HeatMap of numerical columns

Correlation heatmap shows the relationship between the features with the credit score. The correlation coefficient ranges between -1 and +1. The closer the value is to +1, the stronger the positive relationship. The closer the value is to -1, the stronger the negative relationship. Values near zero indicate a weak or no relationship.

Observation:

Independent Variables with High Inter-collinearity (> 0.70)

1. Monthly_Balance <-> Annual_Income <-> Monthly_Inhand_Salary: 
These variables are highly correlated with each other, indicating that these variables provide overlapping information about the financial standing of the individuals and could lead to multicollinearity issues.

2. Credit_Mix <-> Num_Bank_Accounts <-> Interest_Rate <-> Num_of_Delayed_Payment <-> Min_Amount_Pymt_Ind: 
These variables are highly correlated with Credit_Mix (correlation > 0.7) but exhibit lower correlations with each other (correlation < 0.7). This suggests that Credit_Mix serves as a comprehensive measure encompassing various aspects of financial behavior and creditworthiness.However, the high correlation between these variables indicates multicollinearity, which may complicate the interpretation of their individual contributions in a predictive model.

4.3.2 Correlation of all features with Credit Score

Observations:

1.  Variables with the Strongest POSITIVE Correlation with Credit Score (the Higher the value, the Worse the Credit Score)

Credit_Mix (0.50): A diverse mix of credit types helps to mitigate risk by ensuring that an individual isn't overly reliant on a single type of credit. This diversification can indicate that the individual is managing multiple forms of credit responsibly, reducing the overall risk.

Interest_Rate (0.49): Higher interest rates are often charged to individuals who are considered higher risk. This reflects the increased likelihood of credit loss, as lenders compensate for the higher risk by charging more.

Payment_of_Min_Amount (0.44): Individuals who only pay the minimum amount due and allow the outstanding balance to roll over are often experiencing financial strain. This behavior can signal that they are struggling to manage their finances effectively.

2.  Variables with the Strongest NEGATIVE Correlation with Credit Score (the Lower the value, the Worse the Credit Score)

Credit_History_Months (-0.39): Longer credit history suggests that the individual has experience in managing credit over time. It indicates maturity and an ability to maintain financial health over the long term without being charged off or blacklisted by financial institutions.

Monthly_Balance (-0.21): A lower remaining balance at the end of each month indicates financial difficulties. It suggests that the individual has little room for any unexpected expenses or financial setbacks, which could lead to further financial trouble.

Annual_Income (-0.21): Higher annual income provides individuals with greater purchasing power and the ability to sustain themselves through financial downturns. It often correlates with better credit scores as higher income individuals are more capable of meeting their financial obligations.

3.  Variables with Insignificant Correlation with Credit Score (close to ZERO) This suggests that other factors may be more critical in determining an individual's credit score, and these variables may not be useful predictors in credit score modeling.

-   Occupation (0.03)
-   Credit_Utilization_Ratio (-0.05)
-   Total_EMI_per_month (0.07)

5. Model Modeling & Evaluation - Credit Score Classification Model

For Classification Model to predict Credit Score = Poor/ Standard/ Good, 3 types of models will be explored, and their performance will be compared for adoption recommendation:

Model Type 1: Multinomial Logistic Regression (MLR)
Model Type 2: Random Forest (RF)
Model Type 3: XGBoost (XG)

5.1 Data Preparation for Modelling

Data will be scaled prior to modelling due to:

Most machine learning algorithms benefit from scaling.
Scaled data has seen to require less computational power.

As class imbalance observed, 2 class balancing techniques below has been attempted:

Over-sampling
Under-sampling

Data partition of 70% (Train sample) vs. 30% (Test sample) has been conducted to have a hold-out sample for model testing, and all 3 model types will be conducted on 3 separate data sets as follows:

Scaled Data (original)
Scaled Data + Over-sampling
Scaled Data + Under-sampling

## [1] "There is no character column, proceed."

## [1] "There is no NA Value in train set, proceed."

## [1] "There is no NA Value in test set, proceed."

Original

## 
##     Good Standard     Poor 
##    12480    37222    20299

Over-Sampling

## 
##     Good Standard     Poor 
##    37222    37222    37222

Under-Sampling

## 
##     Good Standard     Poor 
##    12480    12480    12480

5.2 Data Modelling

Model Type 1: Multinomial Logistic Regression

## character(0)

Model Type 2: Random Forest

Model Type 3: XG Boost

5.3 Classification Model Evaluation

Train data sets 1 to 3 are used for evaluation purpose across the 3 models:
1) Scaled Data (original train_new): MLR, RF, XGBoost
2) Scaled Data + Oversampling (up_train): MLR, RF, XGBoost
3) Scaled Data + Undersampling (down_train): MLR, RF, XGBoost

However, Test sample, evaluation will be performed on only original data “test_new” for unbiased evaluation across all models attempted.
Altering the test data via over- or under-sampling could lead to biased evaluation and overly optimistic performance metrics, as the test set would no longer reflect the original class distribution.

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  7217  5091   172
##          1  4143 28230  4849
##          2  1247  8558 10494
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6563               
##                  95% CI : (0.6528, 0.6598)     
##     No Information Rate : 0.5983               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.413                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5725   0.6741   0.6764
## Specificity            0.9083   0.6803   0.8200
## Pos Pred Value         0.5783   0.7584   0.5170
## Neg Pred Value         0.9063   0.5836   0.8990
## Prevalence             0.1801   0.5983   0.2216
## Detection Rate         0.1031   0.4033   0.1499
## Detection Prevalence   0.1783   0.5317   0.2900
## Balanced Accuracy      0.7404   0.6772   0.7482

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  3106  2182    60
##          1  1763 12120  2069
##          2   551  3697  4451
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6559               
##                  95% CI : (0.6505, 0.6613)     
##     No Information Rate : 0.6                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.412                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5731   0.6734   0.6764
## Specificity            0.9088   0.6807   0.8186
## Pos Pred Value         0.5808   0.7598   0.5117
## Neg Pred Value         0.9061   0.5815   0.9000
## Prevalence             0.1807   0.6000   0.2193
## Detection Rate         0.1035   0.4040   0.1484
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7409   0.6770   0.7475

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 31023  5168  1031
##          1  7802 21397  8023
##          2  6240  5588 25394
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6968               
##                  95% CI : (0.6941, 0.6995)     
##     No Information Rate : 0.4036               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5453               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6884   0.6655   0.7372
## Specificity            0.9069   0.8010   0.8468
## Pos Pred Value         0.8335   0.5748   0.6822
## Neg Pred Value         0.8114   0.8555   0.8784
## Prevalence             0.4036   0.2879   0.3085
## Detection Rate         0.2778   0.1916   0.2274
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.7977   0.7332   0.7920

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4520  701  127
##          1 3417 9111 3424
##          2 1432 1299 5968
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6479, 0.6587)     
##     No Information Rate : 0.3704               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.471                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.4824   0.8200   0.6270
## Specificity            0.9599   0.6378   0.8667
## Pos Pred Value         0.8452   0.5712   0.6861
## Neg Pred Value         0.8033   0.8576   0.8333
## Prevalence             0.3123   0.3704   0.3173
## Detection Rate         0.1507   0.3037   0.1989
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7212   0.7289   0.7468

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 10389  1760   331
##          1  2563  7224  2693
##          2  2076  1864  8540
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6985               
##                  95% CI : (0.6939, 0.7032)     
##     No Information Rate : 0.4014               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5478               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6913   0.6659   0.7385
## Specificity            0.9067   0.8023   0.8477
## Pos Pred Value         0.8325   0.5788   0.6843
## Neg Pred Value         0.8141   0.8548   0.8788
## Prevalence             0.4014   0.2897   0.3089
## Detection Rate         0.2775   0.1929   0.2281
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.7990   0.7341   0.7931

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4522  705  121
##          1 3411 9117 3424
##          2 1431 1310 5958
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6478, 0.6586)     
##     No Information Rate : 0.3711               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4708               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.4829   0.8190   0.6270
## Specificity            0.9600   0.6377   0.8663
## Pos Pred Value         0.8455   0.5715   0.6849
## Neg Pred Value         0.8036   0.8566   0.8336
## Prevalence             0.3121   0.3711   0.3168
## Detection Rate         0.1507   0.3039   0.1986
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7214   0.7284   0.7466

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 12448    32     0
##          1    45 37092    85
##          2     0    54 20245
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9969               
##                  95% CI : (0.9965, 0.9973)     
##     No Information Rate : 0.5311               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9949               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9964   0.9977   0.9958
## Specificity            0.9994   0.9960   0.9989
## Pos Pred Value         0.9974   0.9965   0.9973
## Neg Pred Value         0.9992   0.9974   0.9983
## Prevalence             0.1785   0.5311   0.2904
## Detection Rate         0.1778   0.5299   0.2892
## Detection Prevalence   0.1783   0.5317   0.2900
## Balanced Accuracy      0.9979   0.9969   0.9974

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  4066  1245    37
##          1  1160 12957  1835
##          2    69  1585  7045
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8023               
##                  95% CI : (0.7977, 0.8068)     
##     No Information Rate : 0.5263               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6719               
##                                                
##  Mcnemar's Test P-Value : 0.0000008754         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7679   0.8207   0.7901
## Specificity            0.9481   0.7893   0.9215
## Pos Pred Value         0.7603   0.8122   0.8099
## Neg Pred Value         0.9501   0.7985   0.9121
## Prevalence             0.1765   0.5263   0.2972
## Detection Rate         0.1355   0.4319   0.2348
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.8580   0.8050   0.8558

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 37217     5     0
##          1   155 36904   163
##          2     0    16 37206
## 
## Overall Statistics
##                                                
##                Accuracy : 0.997                
##                  95% CI : (0.9966, 0.9973)     
##     No Information Rate : 0.3347               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9954               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9959   0.9994   0.9956
## Specificity            0.9999   0.9957   0.9998
## Pos Pred Value         0.9999   0.9915   0.9996
## Neg Pred Value         0.9979   0.9997   0.9978
## Prevalence             0.3347   0.3307   0.3346
## Detection Rate         0.3333   0.3305   0.3332
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.9979   0.9976   0.9977

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  4398   933    17
##          1  1474 12451  2027
##          2    75  1223  7401
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8084               
##                  95% CI : (0.8039, 0.8128)     
##     No Information Rate : 0.4869               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6881               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7395   0.8524   0.7836
## Specificity            0.9605   0.7725   0.9368
## Pos Pred Value         0.8224   0.7805   0.8508
## Neg Pred Value         0.9372   0.8465   0.9040
## Prevalence             0.1982   0.4869   0.3148
## Detection Rate         0.1466   0.4150   0.2467
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.8500   0.8125   0.8602

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 12477     3     0
##          1    43 12408    29
##          2     0    13 12467
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9976               
##                  95% CI : (0.9971, 0.9981)     
##     No Information Rate : 0.3344               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9965               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9966   0.9987   0.9977
## Specificity            0.9999   0.9971   0.9995
## Pos Pred Value         0.9998   0.9942   0.9990
## Neg Pred Value         0.9983   0.9994   0.9988
## Prevalence             0.3344   0.3318   0.3338
## Detection Rate         0.3333   0.3314   0.3330
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.9982   0.9979   0.9986

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  4720   586    42
##          1  2674 10753  2525
##          2   384   961  7354
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7609               
##                  95% CI : (0.7561, 0.7657)     
##     No Information Rate : 0.41                 
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6264               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6068   0.8742   0.7413
## Specificity            0.9717   0.7063   0.9330
## Pos Pred Value         0.8826   0.6741   0.8454
## Neg Pred Value         0.8759   0.8899   0.8795
## Prevalence             0.2593   0.4100   0.3307
## Detection Rate         0.1573   0.3584   0.2451
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7893   0.7902   0.8371

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  9019  3225   236
##          1  3946 29124  4152
##          2  1283  4820 14196
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7477               
##                  95% CI : (0.7445, 0.7509)     
##     No Information Rate : 0.531                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5825               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6330   0.7836   0.7639
## Specificity            0.9379   0.7534   0.8813
## Pos Pred Value         0.7227   0.7824   0.6993
## Neg Pred Value         0.9091   0.7546   0.9117
## Prevalence             0.2035   0.5310   0.2655
## Detection Rate         0.1288   0.4161   0.2028
## Detection Prevalence   0.1783   0.5317   0.2900
## Balanced Accuracy      0.7855   0.7685   0.8226

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  3748  1497   103
##          1  1956 12127  1869
##          2   602  2274  5823
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7233               
##                  95% CI : (0.7182, 0.7283)     
##     No Information Rate : 0.53                 
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5429               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5944   0.7628   0.7470
## Specificity            0.9325   0.7287   0.8705
## Pos Pred Value         0.7008   0.7602   0.6694
## Neg Pred Value         0.8962   0.7315   0.9074
## Prevalence             0.2102   0.5300   0.2598
## Detection Rate         0.1249   0.4042   0.1941
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7634   0.7458   0.8087

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 31962  4594   666
##          1  6965 23608  6649
##          2  3959  3425 29838
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7649               
##                  95% CI : (0.7624, 0.7673)     
##     No Information Rate : 0.3841               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6473               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7453   0.7465   0.8031
## Specificity            0.9235   0.8299   0.9009
## Pos Pred Value         0.8587   0.6342   0.8016
## Neg Pred Value         0.8533   0.8923   0.9017
## Prevalence             0.3841   0.2832   0.3327
## Detection Rate         0.2862   0.2114   0.2672
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.8344   0.7882   0.8520

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4511  724  113
##          1 3084 9864 3004
##          2 1006  914 6779
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7052               
##                  95% CI : (0.7, 0.7103)        
##     No Information Rate : 0.3834               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5459               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5245   0.8576   0.6850
## Specificity            0.9609   0.6709   0.9045
## Pos Pred Value         0.8435   0.6184   0.7793
## Neg Pred Value         0.8341   0.8834   0.8537
## Prevalence             0.2867   0.3834   0.3299
## Detection Rate         0.1504   0.3288   0.2260
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7427   0.7642   0.7948

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 10686  1576   218
##          1  2230  7954  2296
##          2  1365  1059 10056
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7665               
##                  95% CI : (0.7621, 0.7707)     
##     No Information Rate : 0.3814               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6497               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7483   0.7512   0.8000
## Specificity            0.9225   0.8314   0.9025
## Pos Pred Value         0.8562   0.6373   0.8058
## Neg Pred Value         0.8560   0.8944   0.8993
## Prevalence             0.3814   0.2828   0.3357
## Detection Rate         0.2854   0.2124   0.2686
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.8354   0.7913   0.8513

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4534  702  112
##          1 3172 9597 3183
##          2 1038  873 6788
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6973               
##                  95% CI : (0.6921, 0.7025)     
##     No Information Rate : 0.3724               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5362               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5185   0.8590   0.6732
## Specificity            0.9617   0.6625   0.9040
## Pos Pred Value         0.8478   0.6016   0.7803
## Neg Pred Value         0.8292   0.8879   0.8453
## Prevalence             0.2915   0.3724   0.3361
## Detection Rate         0.1511   0.3199   0.2263
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7401   0.7607   0.7886

Model Performance Interpretation:

Random Forest (RF)

Nearly 100% accuracy & F1 score observed on training sample but only 75-80% on testing sample
This ~20-25% gap between training and testing performance is a classic indicator of overfitting.
The model has essentially “memorized” the training data rather than learning generalizable patterns.

XGBoost (XG)

Testing Accuracy & F1 score around 70-73% vs Training around 75%, indicates very good performance.
Minimal gap between Train vs. Test sample performance shows that XG model has better generalization capability as compared to RF model, and will be more reliable performance on new/ unseen data.

Multinomial Logistic Regression (MLR)

MLR models show the worst performance across all metrics vs. the other 2 models.
Such observation could suggest that the relationship between variables might be non-linear.

Comparison between Class Balancing methods

Class Balancing methods have insignificant impact on overall model performance.
For RF, slightly improved performance observed with over-sampling method on the Test sample.
For XG, despite Train sample shows slightly better performance for over/under-sampling, Test sample shows worse Accuracy, F1 score and Precision.
For MLR, both Train & Test samples shows slightly higher accuracy for over/under-sampling, however, overall performance still the worst compared to other models.

5.4 Conclusion

Credit scoring requires stable and reliable predictions, and XGBoost’s consistent performance makes it a strong candidate for production deployment. It is less likely to experience unexpected performance drops when applied to new customer data. Therefore, the XGBoost model trained on the original dataset without class balancing is selected as the final model, as class balancing does not provide significant benefits in this case.

6. Model Modeling & Evaluation - Credit Score Regression Model

For Regression Model, development of Credit Scorecard would be demonstrated.
As the main purpose of Credit Scorecard is for loan approval process, where the decision is either “Approve” or “Reject”, binary target has been created with Credit Score = Poor (=1) vs. Standard or Good (=0).

Benefits:

Simplicity and Clarity: A binary classification model simplifies the process by categorizing individuals into two distinct groups i.e. high-risk (poor) and lower-risk (standard/good). This clarity facilitates easier interpretation and decision-making for stakeholders i.e. to approve or reject the loan.

Flexibility in Score Band Creation: With the binary model, more specific score bands can be created based on the probability of being classified as poor. This allows for further segmentation and targeted decision-making, providing a more granular understanding of credit risk within the broader category i.e. collection strategy, interest offering differentiation, cross-selling purpose.

## [1] "There is no character column, proceed."

6.1 Data Preparation for Modelling

Step 1: Binary target creation: Poor (=1) vs. Standard or Good (=0) & Data Scaling

## [1] "There is no NA Value in train set, proceed."

## [1] "There is no NA Value in test set, proceed."

Step 2: Data Processing & Variable Reduction

Feature selection based on IV: Threshold of IV > 0.2 being applied to help in filtering out irrelevant or less important variables that do not contribute meaningfully to the model. This reduces noise and potential overfitting, leading to a more effective model.
Next, independent variables with high correlation with each other will also be dropped, preserving the other one with higher IV to prevent multicollinearity:

Drop Monthly_Inhand_Salary due to lower IV and high correlation observed vs. Annual_Income.

## ✔ Binning on 70001 rows and 30 columns in 00:00:24

## [1] "Selected Variables and their Information Values:"

## 
## 
## |var                    |        iv|
## |:----------------------|---------:|
## |Outstanding_Debt       | 1.2744588|
## |Interest_Rate          | 1.0757602|
## |Num_Credit_Inquiries   | 0.8840514|
## |Delay_from_due_date    | 0.7414860|
## |Credit_Mix             | 0.6800040|
## |Credit_History_Months  | 0.6680570|
## |Num_Credit_Card        | 0.6424321|
## |No_of_Loan             | 0.6046970|
## |Num_Bank_Accounts      | 0.4811811|
## |Min_Amount_Pymt_Ind    | 0.4318173|
## |Num_of_Delayed_Payment | 0.3673243|
## |Annual_Income          | 0.2671015|
## |Monthly_Balance        | 0.2061845|

6.2 Data Modelling

Step 1: Model Fitting

Fitting of Generalized Linear Model (GLM) model to predict the binary outcome of Credit Score, using all predictors post variable reduction in Step 2.

## 
## Call:
## glm(formula = Credit_Score2 ~ ., family = binomial(), data = iv_var_train)
## 
## Coefficients:
##                        Estimate Std. Error  z value             Pr(>|z|)    
## (Intercept)            -1.11406    0.01012 -110.065 < 0.0000000000000002 ***
## Outstanding_Debt        0.09094    0.01445    6.292  0.00000000031348802 ***
## Interest_Rate           0.55220    0.01490   37.056 < 0.0000000000000002 ***
## Num_Credit_Inquiries    0.41867    0.01433   29.217 < 0.0000000000000002 ***
## Delay_from_due_date     0.36932    0.01285   28.746 < 0.0000000000000002 ***
## Credit_Mix             -0.50941    0.02439  -20.886 < 0.0000000000000002 ***
## Credit_History_Months  -0.11739    0.01472   -7.975  0.00000000000000153 ***
## Num_Credit_Card         0.32867    0.01182   27.815 < 0.0000000000000002 ***
## No_of_Loan              0.11079    0.01418    7.815  0.00000000000000551 ***
## Num_Bank_Accounts       0.03480    0.01444    2.410              0.01596 *  
## Min_Amount_Pymt_Ind    -0.05410    0.01745   -3.100              0.00194 ** 
## Num_of_Delayed_Payment -0.02309    0.01529   -1.511              0.13085    
## Annual_Income          -0.12447    0.01493   -8.339 < 0.0000000000000002 ***
## Monthly_Balance         0.08669    0.01607    5.393  0.00000006919048661 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 84300  on 70000  degrees of freedom
## Residual deviance: 67401  on 69987  degrees of freedom
## AIC: 67429
## 
## Number of Fisher Scoring iterations: 4

##       Outstanding_Debt          Interest_Rate   Num_Credit_Inquiries 
##               2.600563               2.378558               2.118945 
##    Delay_from_due_date             Credit_Mix  Credit_History_Months 
##               1.893779               6.616354               2.282167 
##        Num_Credit_Card             No_of_Loan      Num_Bank_Accounts 
##               1.433335               2.323659               2.174574 
##    Min_Amount_Pymt_Ind Num_of_Delayed_Payment          Annual_Income 
##               2.870790               2.416286               2.039456 
##        Monthly_Balance 
##               2.345378

Step 2: Scorecard Creation

Before creating the scorecard, the variables Credit_Mix and Num_of_Delayed_Payment were excluded due to multicollinearity and lack of statistical significance, respectively. Credit_Mix had a Variance Inflation Factor (VIF) of 6.6, indicating moderate multicollinearity, which can inflate regression coefficient variance and compromise model reliability. Num_of_Delayed_Payment had a p-value of 0.13085, above the standard threshold of 0.05, suggesting it is not a significant predictor of the target variable. Removing these variables enhances model interpretability, reduces the risk of overfitting, and improves scorecard stability.

The Weight of Evidence (WOE) bins for predictor variables were then created, forming the basis for the credit scorecard, which was generated using the WOE-transformed data and a logistic regression model, with a base score of 600.

## 
## Call:
## glm(formula = Credit_Score2 ~ ., family = binomial(), data = iv_var_train2)
## 
## Coefficients:
##                        Estimate Std. Error  z value             Pr(>|z|)    
## (Intercept)           -1.111447   0.010092 -110.136 < 0.0000000000000002 ***
## Outstanding_Debt       0.006852   0.013920    0.492                0.623    
## Interest_Rate          0.475849   0.014347   33.167 < 0.0000000000000002 ***
## Num_Credit_Inquiries   0.414445   0.014225   29.135 < 0.0000000000000002 ***
## Delay_from_due_date    0.286338   0.012205   23.461 < 0.0000000000000002 ***
## Credit_History_Months -0.105541   0.014630   -7.214   0.0000000000005437 ***
## Num_Credit_Card        0.298691   0.011690   25.551 < 0.0000000000000002 ***
## No_of_Loan             0.060274   0.013930    4.327   0.0000151163866710 ***
## Num_Bank_Accounts     -0.078943   0.013537   -5.832   0.0000000054903309 ***
## Min_Amount_Pymt_Ind   -0.209276   0.015720  -13.313 < 0.0000000000000002 ***
## Annual_Income         -0.112643   0.014898   -7.561   0.0000000000000401 ***
## Monthly_Balance        0.084244   0.016030    5.255   0.0000001476924288 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 84300  on 70000  degrees of freedom
## Residual deviance: 67972  on 69989  degrees of freedom
## AIC: 67996
## 
## Number of Fisher Scoring iterations: 4

## ✔ Binning on 70001 rows and 12 columns in 00:00:18

## $basepoints
##      variable    bin    woe points
##        <char> <lgcl> <lgcl>  <num>
## 1: basepoints     NA     NA      0
## 
## $Outstanding_Debt
##            variable         bin count count_distr   neg   pos   posprob
##              <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Outstanding_Debt [-Inf,-0.2) 35951   0.5135784 32094  3857 0.1072849
## 2: Outstanding_Debt  [-0.2,0.1) 12244   0.1749118  9003  3241 0.2647011
## 3: Outstanding_Debt   [0.1,1.1) 12803   0.1828974  3898  8905 0.6955401
## 4: Outstanding_Debt  [1.1, Inf)  9003   0.1286124  4707  4296 0.4771743
##           woe      bin_iv total_iv breaks is_special_values points
##         <num>       <num>    <num> <char>            <lgcl>  <num>
## 1: -1.2233059 0.557483949 1.274459   -0.2             FALSE     43
## 2: -0.1262024 0.002710392 1.274459    0.1             FALSE     43
## 3:  1.7216229 0.620238970 1.274459    1.1             FALSE     42
## 4:  0.8041071 0.094025537 1.274459    Inf             FALSE     42
## 
## $Interest_Rate
##         variable         bin count count_distr   neg   pos    posprob
##           <char>      <char> <int>       <num> <int> <int>      <num>
## 1: Interest_Rate [-Inf,-0.4) 30776  0.43965086 26659  4117 0.13377307
## 2: Interest_Rate    [-0.4,0)  6554  0.09362723  6020   534 0.08147696
## 3: Interest_Rate     [0,0.7) 16637  0.23766803 11395  5242 0.31508084
## 4: Interest_Rate  [0.7, Inf) 16034  0.22905387  5628 10406 0.64899588
##           woe      bin_iv total_iv breaks is_special_values points
##         <num>       <num>    <num> <char>            <lgcl>  <num>
## 1: -0.9725285 0.324395573  1.07576   -0.4             FALSE     76
## 2: -1.5269731 0.144780222  1.07576      0             FALSE     95
## 3:  0.1190020 0.003447832  1.07576    0.7             FALSE     38
## 4:  1.5101020 0.603136556  1.07576    Inf             FALSE     -9
## 
## $Num_Credit_Inquiries
##                variable         bin count count_distr   neg   pos   posprob
##                  <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Num_Credit_Inquiries [-Inf,-0.2) 35167   0.5023785 30875  4292 0.1220462
## 2: Num_Credit_Inquiries  [-0.2,0.6) 17431   0.2490107 11476  5955 0.3416327
## 3: Num_Credit_Inquiries   [0.6,1.2)  7492   0.1070270  3493  3999 0.5337694
## 4: Num_Credit_Inquiries  [1.2, Inf)  9911   0.1415837  3858  6053 0.6107355
##           woe     bin_iv  total_iv breaks is_special_values points
##         <num>      <num>     <num> <char>            <lgcl>  <num>
## 1: -1.0777204 0.44161036 0.8840514   -0.2             FALSE     75
## 2:  0.2394469 0.01495778 0.8840514    0.6             FALSE     35
## 3:  1.0307569 0.13062361 0.8840514    1.2             FALSE     12
## 4:  1.3458787 0.29685964 0.8840514    Inf             FALSE      2
## 
## $Delay_from_due_date
##               variable         bin count count_distr   neg   pos    posprob
##                 <char>      <char> <int>       <num> <int> <int>      <num>
## 1: Delay_from_due_date [-Inf,-1.1)  5463  0.07804174  5037   426 0.07797913
## 2: Delay_from_due_date [-1.1,-0.4) 25137  0.35909487 21374  3763 0.14969965
## 3: Delay_from_due_date  [-0.4,0.6) 26017  0.37166612 18116  7901 0.30368605
## 4: Delay_from_due_date  [0.6, Inf) 13384  0.19119727  5175  8209 0.61334429
##            woe     bin_iv total_iv breaks is_special_values points
##          <num>      <num>    <num> <char>            <lgcl>  <num>
## 1: -1.57465305 0.12653558 0.741486   -1.1             FALSE     75
## 2: -0.84148517 0.20588152 0.741486   -0.4             FALSE     60
## 3:  0.06566736 0.00162452 0.741486    0.6             FALSE     41
## 4:  1.35686532 0.40744440 0.741486    Inf             FALSE     14
## 
## $Credit_History_Months
##                 variable         bin count count_distr   neg   pos   posprob
##                   <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Credit_History_Months [-Inf,-0.4) 23968  0.34239511 12129 11839 0.4939503
## 2: Credit_History_Months  [-0.4,0.2) 17489  0.24983929 12473  5016 0.2868089
## 3: Credit_History_Months   [0.2,0.4)  3823  0.05461351  3207   616 0.1611300
## 4: Credit_History_Months  [0.4, Inf) 24721  0.35315210 21893  2828 0.1143967
##            woe       bin_iv total_iv breaks is_special_values points
##          <num>        <num>    <num> <char>            <lgcl>  <num>
## 1:  0.87127344 0.2955326876 0.668057   -0.4             FALSE     49
## 2: -0.01545995 0.0000595197 0.668057    0.2             FALSE     42
## 3: -0.75437069 0.0257830653 0.668057    0.4             FALSE     37
## 4: -1.15112365 0.3466817057 0.668057    Inf             FALSE     34
## 
## $Num_Credit_Card
##           variable       bin count count_distr   neg   pos   posprob
##             <char>    <char> <int>       <num> <int> <int>     <num>
## 1: Num_Credit_Card [-Inf,-1) 12514   0.1787689 11200  1314 0.1050024
## 2: Num_Credit_Card [-1,-0.5) 10110   0.1444265  8692  1418 0.1402572
## 3: Num_Credit_Card  [-0.5,1) 36980   0.5282782 25936 11044 0.2986479
## 4: Num_Credit_Card  [1, Inf) 10397   0.1485264  3874  6523 0.6273925
##            woe       bin_iv  total_iv breaks is_special_values points
##          <num>        <num>     <num> <char>            <lgcl>  <num>
## 1: -1.24736431 0.2003401691 0.6424321     -1             FALSE     69
## 2: -0.91768208 0.0963810664 0.6424321   -0.5             FALSE     62
## 3:  0.04172888 0.0009278876 0.6424321      1             FALSE     42
## 4:  1.41652038 0.3447829434 0.6424321    Inf             FALSE     12
## 
## $No_of_Loan
##      variable        bin count count_distr   neg   pos   posprob        woe
##        <char>     <char> <int>       <num> <int> <int>     <num>      <num>
## 1: No_of_Loan  [-Inf,-1) 15807   0.2258111 13974  1833 0.1159613 -1.1357709
## 2: No_of_Loan  [-1,-0.5) 11119   0.1588406  8453  2666 0.2397698 -0.2584686
## 3: No_of_Loan [-0.5,0.5) 21720   0.3102813 17034  4686 0.2157459 -0.3951585
## 4: No_of_Loan [0.5, Inf) 21355   0.3050671 10241 11114 0.5204402  0.9772799
##        bin_iv total_iv breaks is_special_values points
##         <num>    <num> <char>            <lgcl>  <num>
## 1: 0.21676833 0.604697     -1             FALSE     47
## 2: 0.01001233 0.604697   -0.5             FALSE     44
## 3: 0.04420788 0.604697    0.5             FALSE     44
## 4: 0.33370844 0.604697    Inf             FALSE     38
## 
## $Num_Bank_Accounts
##             variable        bin count count_distr   neg   pos   posprob
##               <char>     <char> <int>       <num> <int> <int>     <num>
## 1: Num_Bank_Accounts   [-Inf,0) 34941  0.49915001 29370  5571 0.1594402
## 2: Num_Bank_Accounts      [0,1) 18299  0.26141055 11656  6643 0.3630253
## 3: Num_Bank_Accounts    [1,1.5) 13040  0.18628305  7234  5806 0.4452454
## 4: Num_Bank_Accounts [1.5, Inf)  3721  0.05315638  1442  2279 0.6124698
##           woe     bin_iv  total_iv breaks is_special_values points
##         <num>      <num>     <num> <char>            <lgcl>  <num>
## 1: -0.7669256 0.24271269 0.4811811      0             FALSE     38
## 2:  0.3332161 0.03090239 0.4811811      1             FALSE     44
## 3:  0.6755733 0.09490216 0.4811811    1.5             FALSE     46
## 4:  1.3531793 0.11266384 0.4811811    Inf             FALSE     50
## 
## $Min_Amount_Pymt_Ind
##               variable                bin count count_distr   neg   pos
##                 <char>             <char> <int>       <num> <int> <int>
## 1: Min_Amount_Pymt_Ind [-Inf,0.823375414) 28283   0.4040371 24449  3834
## 2: Min_Amount_Pymt_Ind [0.823375414, Inf) 41718   0.5959629 25253 16465
##      posprob        woe    bin_iv  total_iv      breaks is_special_values
##        <num>      <num>     <num>     <num>      <char>            <lgcl>
## 1: 0.1355585 -0.9572071 0.2900677 0.4318173 0.823375414             FALSE
## 2: 0.3946738  0.4677655 0.1417496 0.4318173         Inf             FALSE
##    points
##     <num>
## 1:     28
## 2:     50
## 
## $Annual_Income
##         variable         bin count count_distr   neg   pos   posprob        woe
##           <char>      <char> <int>       <num> <int> <int>     <num>      <num>
## 1: Annual_Income   [-Inf,-1)  5691  0.08129884  3026  2665 0.4682833  0.7684360
## 2: Annual_Income   [-1,-0.8) 13076  0.18679733  8021  5055 0.3865861  0.4337883
## 3: Annual_Income [-0.8,-0.4) 14242  0.20345424 10967  3275 0.2299537 -0.3130993
## 4: Annual_Income  [-0.4,0.9) 24316  0.34736647 16607  7709 0.3170341  0.1280377
## 5: Annual_Income  [0.9, Inf) 12676  0.18108313 11081  1595 0.1258283 -1.0428846
##         bin_iv  total_iv breaks is_special_values points
##          <num>     <num> <char>            <lgcl>  <num>
## 1: 0.054101270 0.2671015     -1             FALSE     49
## 2: 0.038019467 0.2671015   -0.8             FALSE     46
## 3: 0.018572147 0.2671015   -0.4             FALSE     40
## 4: 0.005843768 0.2671015    0.9             FALSE     44
## 5: 0.150564880 0.2671015    Inf             FALSE     34
## 
## $Monthly_Balance
##           variable         bin count count_distr   neg   pos   posprob
##             <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Monthly_Balance [-Inf,-0.5) 24996   0.3570806 15180  9816 0.3927028
## 2: Monthly_Balance [-0.5,-0.3) 10361   0.1480122  7068  3293 0.3178265
## 3: Monthly_Balance  [-0.3,0.7) 21972   0.3138812 16689  5283 0.2404424
## 4: Monthly_Balance  [0.7, Inf) 12672   0.1810260 10765  1907 0.1504893
##           woe      bin_iv  total_iv breaks is_special_values points
##         <num>       <num>     <num> <char>            <lgcl>  <num>
## 1:  0.4595085 0.081861581 0.2061845   -0.5             FALSE     40
## 2:  0.1316950 0.002636163 0.2061845   -0.3             FALSE     42
## 3: -0.2547822 0.019241691 0.2061845    0.7             FALSE     44
## 4: -0.8352953 0.102445093 0.2061845    Inf             FALSE     48

6.3 Regression Model Evaluation

The evaluation of the Credit Scorecard model was performed using several key metrics to assess its performance and generalization ability. The metrics calculated include AUC (Area Under the Curve), Gini coefficient, and KS (Kolmogorov-Smirnov) statistic. These metrics were calculated for both the training and testing datasets.

AUC (Area Under the Curve)
- Purpose: Measures model’s ability to distinguish between classes.
- Result: Training AUC = 0.77, Testing AUC = 0.77.
- Interpretation: Values > 0.7 indicate good performance and reliable predictions.
Gini Coefficient
- Purpose: Indicates model’s discriminatory power.
- Result: Training Gini = 0.55, Testing Gini = 0.54.
- Interpretation: Indicates good model performance which aligns well with the AUC values and further confirms the model’s ability to distinguish between the classes.
KS (Kolmogorov-Smirnov) Statistic
- Purpose: Evaluates separation between positive and negative classes.
- Result: Training KS = 0.41, Testing KS = 0.40.
- Interpretation: Good at distinguishing between good vs. bad and generalizing well to unseen data.
Population Stability Index (PSI)
- Purpose: Measures the stability of model scores over time by comparing distributions.
- Result: PSI = 0.0725.
- Interpretation: PSI values < 0.1 indicate a very stable population. While PSI is ideally performed on out-of-time data to detect changes over different periods, in this case, it was tested on out-of-sample data to check for immediate stability. The result suggests that the model’s score distribution between the training and testing sets is stable, indicating good model performance.

Score vs. Poor Rate

The scatter plot shows that there is a clear trend where higher scores are associated with lower predicted Poor rate %. The plot shows a high density of data points at higher predicted probabilities for lower scores. This suggests that the model is effectively capturing the risk associated with lower scores.

The box plot reveals a clear trend where category 1 (poor credit scores) has a lower median score and wider IQR compared to category 0 (better credit scores). This visual distinction aligns well with the model’s strong performance metrics, indicating effective risk differentiation.

XG Boost Integration:

A combination of XGBoost and Logistic Regression is explored for further model refinement. Initially, an XGBoost model is trained to predict the probability of the target variable, effectively capturing complex patterns within the data. The predictions from the XGBoost model are then used as input for a logistic regression model. This approach leverages the predictive power of XGBoost along with the interpretability of Logistic Regression.

Performance: The integration of XGBoost has been observed to significantly enhance model performance, particularly in terms of AUC (88%), Gini (76%), and KS statistics (63%), for both training and testing datasets.

In the context of creating a credit scorecard for various credit strategies, the original Logistic Regression model would be more suitable as it produces a more spread-out distribution of predicted scores, allowing more flexible strategy implementation based on different score buckets.

XGBoost intergrated model tends to generate scores that cluster more tightly, showing higher clustering at high scores, despite lack flexibility, it is highly predictive and can be a great solution for loan approval assessment, depending on the bank risk appetite.

## [1]  train-auc:0.826324  test-auc:0.823245 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
## 
## [2]  train-auc:0.846951  test-auc:0.843589 
## [3]  train-auc:0.844459  test-auc:0.842036 
## [4]  train-auc:0.849353  test-auc:0.847870 
## [5]  train-auc:0.850215  test-auc:0.849140 
## [6]  train-auc:0.851124  test-auc:0.850050 
## [7]  train-auc:0.853734  test-auc:0.852037 
## [8]  train-auc:0.854966  test-auc:0.852981 
## [9]  train-auc:0.854992  test-auc:0.852579 
## [10] train-auc:0.855786  test-auc:0.853195 
## [11] train-auc:0.857319  test-auc:0.854595 
## [12] train-auc:0.858093  test-auc:0.855209 
## [13] train-auc:0.860979  test-auc:0.857076 
## [14] train-auc:0.861954  test-auc:0.857998 
## [15] train-auc:0.862942  test-auc:0.858280 
## [16] train-auc:0.864723  test-auc:0.859675 
## [17] train-auc:0.867083  test-auc:0.861602 
## [18] train-auc:0.867633  test-auc:0.861877 
## [19] train-auc:0.868723  test-auc:0.862319 
## [20] train-auc:0.869359  test-auc:0.862614 
## [21] train-auc:0.870649  test-auc:0.863096 
## [22] train-auc:0.870843  test-auc:0.863090 
## [23] train-auc:0.871917  test-auc:0.864180 
## [24] train-auc:0.872689  test-auc:0.864622 
## [25] train-auc:0.873753  test-auc:0.865416 
## [26] train-auc:0.874480  test-auc:0.865953 
## [27] train-auc:0.875185  test-auc:0.866507 
## [28] train-auc:0.875637  test-auc:0.866634 
## [29] train-auc:0.876062  test-auc:0.866869 
## [30] train-auc:0.876747  test-auc:0.867242 
## [31] train-auc:0.877028  test-auc:0.867539 
## [32] train-auc:0.877531  test-auc:0.867816 
## [33] train-auc:0.878295  test-auc:0.868334 
## [34] train-auc:0.879271  test-auc:0.869032 
## [35] train-auc:0.879655  test-auc:0.869175 
## [36] train-auc:0.880007  test-auc:0.869433 
## [37] train-auc:0.880355  test-auc:0.869704 
## [38] train-auc:0.880658  test-auc:0.869838 
## [39] train-auc:0.881000  test-auc:0.870132 
## [40] train-auc:0.881428  test-auc:0.870334 
## [41] train-auc:0.881907  test-auc:0.870667 
## [42] train-auc:0.882200  test-auc:0.870814 
## [43] train-auc:0.882636  test-auc:0.871176 
## [44] train-auc:0.882954  test-auc:0.871311 
## [45] train-auc:0.883059  test-auc:0.871166 
## [46] train-auc:0.883301  test-auc:0.871234 
## [47] train-auc:0.884098  test-auc:0.871741 
## [48] train-auc:0.884338  test-auc:0.871931 
## [49] train-auc:0.884803  test-auc:0.872190 
## [50] train-auc:0.885485  test-auc:0.872467 
## [51] train-auc:0.885773  test-auc:0.872629 
## [52] train-auc:0.885985  test-auc:0.872784 
## [53] train-auc:0.886490  test-auc:0.872982 
## [54] train-auc:0.886761  test-auc:0.873067 
## [55] train-auc:0.887473  test-auc:0.873386 
## [56] train-auc:0.887783  test-auc:0.873531 
## [57] train-auc:0.887985  test-auc:0.873722 
## [58] train-auc:0.888125  test-auc:0.873835 
## [59] train-auc:0.888689  test-auc:0.874100 
## [60] train-auc:0.888995  test-auc:0.874326 
## [61] train-auc:0.889294  test-auc:0.874491 
## [62] train-auc:0.889463  test-auc:0.874627 
## [63] train-auc:0.890343  test-auc:0.875128 
## [64] train-auc:0.891154  test-auc:0.875499 
## [65] train-auc:0.891396  test-auc:0.875623 
## [66] train-auc:0.891849  test-auc:0.875668 
## [67] train-auc:0.892395  test-auc:0.876022 
## [68] train-auc:0.892721  test-auc:0.876162 
## [69] train-auc:0.892849  test-auc:0.876251 
## [70] train-auc:0.893146  test-auc:0.876275 
## [71] train-auc:0.893314  test-auc:0.876287 
## [72] train-auc:0.893575  test-auc:0.876350 
## [73] train-auc:0.893780  test-auc:0.876426 
## [74] train-auc:0.893955  test-auc:0.876570 
## [75] train-auc:0.894310  test-auc:0.876700 
## [76] train-auc:0.894851  test-auc:0.877130 
## [77] train-auc:0.894968  test-auc:0.877193 
## [78] train-auc:0.895287  test-auc:0.877185 
## [79] train-auc:0.895386  test-auc:0.877175 
## [80] train-auc:0.896052  test-auc:0.877651 
## [81] train-auc:0.896237  test-auc:0.877643 
## [82] train-auc:0.896665  test-auc:0.877772 
## [83] train-auc:0.897000  test-auc:0.877993 
## [84] train-auc:0.897348  test-auc:0.878158 
## [85] train-auc:0.897516  test-auc:0.878255 
## [86] train-auc:0.897955  test-auc:0.878480 
## [87] train-auc:0.898052  test-auc:0.878518 
## [88] train-auc:0.898434  test-auc:0.878778 
## [89] train-auc:0.898657  test-auc:0.878882 
## [90] train-auc:0.898803  test-auc:0.879008 
## [91] train-auc:0.899105  test-auc:0.879207 
## [92] train-auc:0.899615  test-auc:0.879477 
## [93] train-auc:0.899841  test-auc:0.879579 
## [94] train-auc:0.899992  test-auc:0.879628 
## [95] train-auc:0.900416  test-auc:0.879832 
## [96] train-auc:0.900559  test-auc:0.879850 
## [97] train-auc:0.900676  test-auc:0.879821 
## [98] train-auc:0.900997  test-auc:0.879983 
## [99] train-auc:0.901348  test-auc:0.880127 
## [100]    train-auc:0.901820  test-auc:0.880399

## ✔ Binning on 70001 rows and 12 columns in 00:00:17

Future Improvement:

Automated Binning with Manual Review: To further refine the risk model and ensure that risk rankings are intuitive and sensible, it would be wise to attempt hybrid binning process that combines automated binning with manual review and tweaking by domain experts.

This process will create initial bins based on statistical criteria, such as maximizing information value or minimizing within-bin variance. Following this automated binning, a manual review will be conducted to ensure that the bins make intuitive sense and accurately reflect the underlying risk patterns. This combination of automation and domain expert judgment is believed to create a more robust and interpretable scorecard.
Exploration of Alternative Models: Other potential refinement would be by exploring other types of models for scorecard development. This could include machine learning techniques such as random forests, or SVR. By comparing the performance of these alternative models with the current model, we can identify the most effective approach for accurately predicting risk and improving the overall performance of the scorecard.

7.Project Conclusion

This project successfully demonstrated the application of machine learning to enhance credit scoring and risk assessment in financial decision-making. By leveraging advanced tools in R, the framework covered essential stages of the data lifecycle, including data cleaning, exploratory data analysis (EDA), modeling, and evaluation. Among the 3 models tested, XGBoost emerged as the strongest for classifying customers into credit score bands (Good, Standard, and Poor), offering superior prediction performance and reliability.

The regression models built for scorecard development has delivered satisfactory results. Logistic regression model allows a more granular assessment of credit risk and suitable for various credit strategies such as collection strategy, interest rate assignment and cross-selling initiatives. XGBoost integrated model which is highly predictive, can be a great solution for loan approval assessment.

Overall, the project demonstrates the power of machine learning in empowering financial institutions with actionable insights, streamlining risk management, and enhancing customer segmentation. By achieving accurate classification and reliable risk predictions, the developed models offer a strong foundation for informed financial decisions and uncovering new business opportunities.