WQD7004 OCC1 - Machine Learning-Based Prediction in Credit Score Classification

Team Member Introduction - Group 12

Name	Matrix Number
TAN YANG YI	24061644
SHARON LEE JOO WEI	24063813
LEE RONG PHEI	24064031
NURUL SARAH IZZATI BINTI ZAHID	24064189
YEE SEE MARN	23102510

1. Introduction

Credit scoring plays a critical role in financial decision-making, influencing loan approvals, assigning interest rates , and risk assessment. With the increasing reliance on data-driven methodologies, machine learning has become an indispensable tool for evaluating customers’ credit scoring and enhancing credit score classification. The goal of this project is to establish a robust machine learning framework using R to analyze financial data, including credit history, loan amounts, and income levels. Subsequently, the most effective model will be developed to accurately predict customers’ credit score bands (Good, Standard and Bad) based on their financial profiles.

The framework encompasses key stages of the data lifecycle— data cleaning, exploratory data analysis (EDA), modeling, and evaluation by leveraging advanced tools in R. Ultimately, this project seeks to uncover actionable insights from the financial data, improve prediction accuracy on credit score classification problem, and demonstrate how technology can empower financial institutions to make informed decisions.

1.1 Questions

How accurate are the selected machine learning models—Logistic Regression, Random Forest, and XGBoost—in classifying individuals into the credit score bands (Good, Standard, Poor) based on their financial data?
How effectively does the regression model predict an individual’s Bad Rate% based on financial attributes such as income, loan amount, and credit history, providing insights for risk assessment, scorecard thresholds, and data-driven decision-making?

1.2 Objectives

To develop machine learning-based credit scoring classification systems that accurately classify individuals into appropriate credit score categories (Good, Standard, or Poor) using various classification techniques.
To further develop a regression model that estimates the customer’s Bad Rate%, enabling more detailed risk assessments. This approach allows for differentiating between various risk levels, such as a 20% Bad Rate versus an 80% Bad Rate.

1.3 Project Contributions

Accurately classifying customers into Good, Standard, and Poor categories streamlines the customer monitoring process, allowing the bank to take prompt actions that mitigate credit risk.
The regression model plays a crucial role in developing a scorecard, which helps establish the loan approval/rejection threshold.
Interest rate optimization is achieved by segmenting customers based on varying Bad Rate percentages, enabling the bank to offer more tailored financial products and services.
A more granular assessment of the Bad Rate percentage allows the collections team to take proactive measures in recovering outstanding payments, thereby minimizing potential losses for the bank.
Identifying customers with lower Bad Rate percentages creates opportunities for cross-selling additional bank products, driving new business growth.

2. Data Understanding

2.1 Data Introduction

Title : Credit Score Classification
Year : 2022
Source : Kaggle Dataset
Purpose : To enable the development and evaluation of machine learning models for predicting and classifying individuals’ credit scores based on financial attributes.

2.2 Data Overview

Dimension : 100,000 Rows & 28 Columns

Contents & Structure :

Demographic

[Character] ID  : Represents a unique identification of an entry
[Character] Name: Represents the name of a person
[Character] Age : Represents the age of the person
[Character] Occupation : Represents the occupation of the person

Financial Profile

[Character] Annual_Income : Represents the annual income of the person 
[Character] Outstanding_Debt : Represents the remaining debt to be paid (in USD)
[Numeric] Monthly_Balance : Represents the monthly balance amount of the customer (in USD)
[Numeric] Num_Bank_Accounts : Represents the number of bank accounts a person holds
[Numeric] Num_Credit_Card : Represents the number of other credit cards held by a person
[Character] Num_of_Loan : Represents the number of loans taken from the bank

Loan Information

[Character] Num_of_Loan : Represents the number of loans taken from the bank
[Numeric] Interest_Rate : Represents the interest rate on credit card
[Numeric] Amount_invested_monthly : Represents the monthly amount invested by the customer (in USD)

Credit Behavior

[Numeric] Num_of_Delayed_Payment : Represents the average number of payments delayed by a person
[Numeric] Num_Credit_Inquiries : Represents the number of credit card inquiries
[Character] Credit_History_Age : Represents the age of credit history of the person
[Character] Payment_Behaviour : Represents the payment behavior of the customer (in USD)
[Numeric] Credit_Utilization_Ratio : Represents the utilization ratio of credit card

Credit Score

[Character] Credit Score : Represents the bracket of credit score (Poor, Standard, Good)

2.3 Data Summary

rawdata <- read.csv("C:/Users/sharo/OneDrive/Desktop/UM_Master_in_Data_Science/Programming for Data Science/Group Project/dataset_raw.csv")

#Function to generate data summary
# Define the function
create_summary_table <- function(df) {
  # Initialize an empty list to store results
  summary_list <- list()
  # Loop through each column in the dataframe
  for (col_name in names(df)) {
    col_summary <- summary(df[[col_name]])
    summary_df <- data.frame(Variable = col_name,
      Statistic = names(col_summary),
      Value = as.character(col_summary)
    )
    summary_list[[col_name]] <- summary_df
  }
  
  # Combine all column summaries into a single dataframe
  summary_table <- do.call(rbind, summary_list)
  
  # Reshape the summary table
  summary_table <- summary_table %>%
    pivot_wider(
      names_from = Statistic,
      values_from = Value,
      values_fill = NA
    ) %>%
    select(
      Variable, Class = Class,Min = Min., `1st Qu.` = `1st Qu.`, Median = Median, 
      Mean = Mean, `3rd Qu.` = `3rd Qu.`, Max = Max., ) %>%
    mutate(Class = ifelse(is.na(Class), "numeric", Class))
  
  return(summary_table)
}

# Call the function with the dataframe
summary_result <- create_summary_table(rawdata)
# Print the summary table
summary_result

## # A tibble: 28 × 8
##    Variable              Class     Min    `1st Qu.` Median Mean  `3rd Qu.` Max  
##    <chr>                 <chr>     <chr>  <chr>     <chr>  <chr> <chr>     <chr>
##  1 ID                    numeric   5634   43132.75  80631… 8063… 118130.25 1556…
##  2 Customer_ID           character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  3 Month                 character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  4 Name                  character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  5 Age                   character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  6 SSN                   character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  7 Occupation            character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  8 Annual_Income         character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  9 Monthly_Inhand_Salary numeric   303.6… 1625.568… 3093.… 4194… 5957.448… 1520…
## 10 Num_Bank_Accounts     numeric   -1     3         6      17.0… 7         1798 
## # ℹ 18 more rows

The summary table reveals that some numerical columns are incorrectly classified as "character" data type. Additionally, outliers and erroneous values, such as -1 and 1798 in the Num_Bank_Accounts column, have been identified. These issues necessitate a data type conversion and cleaning in the next phase to ensure the data is properly formatted, readable and accurate.

head(rawdata)

##     ID Customer_ID    Month          Name  Age         SSN Occupation
## 1 5634   CUS_0xd40  January Aaron Maashoh   23 821-00-0265  Scientist
## 2 5635   CUS_0xd40 February Aaron Maashoh   23 821-00-0265  Scientist
## 3 5636   CUS_0xd40    March Aaron Maashoh -500 821-00-0265  Scientist
## 4 5637   CUS_0xd40    April Aaron Maashoh   23 821-00-0265  Scientist
## 5 5638   CUS_0xd40      May Aaron Maashoh   23 821-00-0265  Scientist
## 6 5639   CUS_0xd40     June Aaron Maashoh   23 821-00-0265  Scientist
##   Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts Num_Credit_Card
## 1      19114.12              1824.843                 3               4
## 2      19114.12                    NA                 3               4
## 3      19114.12                    NA                 3               4
## 4      19114.12                    NA                 3               4
## 5      19114.12              1824.843                 3               4
## 6      19114.12                    NA                 3               4
##   Interest_Rate Num_of_Loan
## 1             3           4
## 2             3           4
## 3             3           4
## 4             3           4
## 5             3           4
## 6             3           4
##                                                          Type_of_Loan
## 1 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 2 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 3 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 4 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 5 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 6 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
##   Delay_from_due_date Num_of_Delayed_Payment Changed_Credit_Limit
## 1                   3                      7                11.27
## 2                  -1                                       11.27
## 3                   3                      7                    _
## 4                   5                      4                 6.27
## 5                   6                                       11.27
## 6                   8                      4                 9.27
##   Num_Credit_Inquiries Credit_Mix Outstanding_Debt Credit_Utilization_Ratio
## 1                    4          _           809.98                 26.82262
## 2                    4       Good           809.98                 31.94496
## 3                    4       Good           809.98                 28.60935
## 4                    4       Good           809.98                 31.37786
## 5                    4       Good           809.98                 24.79735
## 6                    4       Good           809.98                 27.26226
##      Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month
## 1 22 Years and 1 Months                    No            49.57495
## 2                  <NA>                    No            49.57495
## 3 22 Years and 3 Months                    No            49.57495
## 4 22 Years and 4 Months                    No            49.57495
## 5 22 Years and 5 Months                    No            49.57495
## 6 22 Years and 6 Months                    No            49.57495
##   Amount_invested_monthly                Payment_Behaviour    Monthly_Balance
## 1       80.41529543900253  High_spent_Small_value_payments 312.49408867943663
## 2      118.28022162236736   Low_spent_Large_value_payments 284.62916249607184
## 3         81.699521264648  Low_spent_Medium_value_payments  331.2098628537912
## 4       199.4580743910713   Low_spent_Small_value_payments 223.45130972736786
## 5      41.420153086217326 High_spent_Medium_value_payments 341.48923103222177
## 6      62.430172331195294                           !@9#%8  340.4792117872438
##   Credit_Score
## 1         Good
## 2         Good
## 3         Good
## 4         Good
## 5         Good
## 6         Good

3. Data Cleaning

3.1 Remove “_” & Data Type Conversion

From the data summarized above, the first layer of cleaning has been performed on the variables that were expected to be numeric but were recorded as character. The cleaning process involves removing underscores from these variables, converting them to numeric format, and rounding the numeric values to two decimal places.

rawdata[c("Age","Annual_Income","Num_of_Loan","Num_of_Delayed_Payment","Changed_Credit_Limit","Outstanding_Debt","Amount_invested_monthly","Monthly_Balance","Monthly_Inhand_Salary","Credit_Utilization_Ratio","Total_EMI_per_month")] <- lapply(rawdata[c("Age","Annual_Income","Num_of_Loan","Num_of_Delayed_Payment","Changed_Credit_Limit","Outstanding_Debt","Amount_invested_monthly","Monthly_Balance","Monthly_Inhand_Salary","Credit_Utilization_Ratio","Total_EMI_per_month")], function(x) {
           if (is.character(x)) {
             # if is character then to replace underscores, convert to numeric, and round
             round(as.numeric(gsub('_', '', x)), 2)
           } else if (is.numeric(x)) {
             # if is numeric then to round numeric values
             round(x, 2)
           } else {
             # Return the data unchanged if neither character nor numeric
             x
           }
         })
# After conversion to Numeric and Basic cleaning, check for data summary
# Call the function with the dataframe
summary_result <- create_summary_table(rawdata)
# Print the summary table
print(summary_result)

## # A tibble: 28 × 8
##    Variable              Class     Min    `1st Qu.` Median Mean  `3rd Qu.` Max  
##    <chr>                 <chr>     <chr>  <chr>     <chr>  <chr> <chr>     <chr>
##  1 ID                    numeric   5634   43132.75  80631… 8063… 118130.25 1556…
##  2 Customer_ID           character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  3 Month                 character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  4 Name                  character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  5 Age                   numeric   -500   24        33     110.… 42        8698 
##  6 SSN                   character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  7 Occupation            character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  8 Annual_Income         numeric   7005.… 19457.5   37578… 1764… 72790.92  2419…
##  9 Monthly_Inhand_Salary numeric   303.65 1625.57   3093.… 4194… 5957.45   1520…
## 10 Num_Bank_Accounts     numeric   -1     3         6      17.0… 7         1798 
## # ℹ 18 more rows

After converting the data types, the numeric variables now appear in the "Numeric" format. However, the previously mentioned erroneous and outlier values still persist and need to be addressed.

3.2 Function used for the following cleaning process

Function to compute MODE

calculate_mode <- function(x) {
    x <- x[!(x=='!@9#%8' | is.na(x) | x=='_' | x=='_______')]  # Exclude error and null value from the MODE computation
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
  }

Function to handle missing or outlier values using backward or forward filling:

1. Missing or outlier were replaced with 0 for scanning purpose.
2. Backward Filling: Fills the current value by using the most recent available value from the previous months.
3. Forward Filling: Fills the current value by using the most recent available value from the upcoming months.

fill_backward <- function(x) {
    if (all(x == 0, na.rm = TRUE)) return(x) # Return unchanged if all values are zero
    while (any(x == 0, na.rm = TRUE)) {
      x <- ifelse(x == 0, coalesce(lag(x, default = NA)), x)
    }
    x
  }
  fill_forward <- function(x) {
    if (all(is.na(x), na.rm = TRUE)) return(x) # Return unchanged if all values are NA
    while (any(is.na(x), na.rm = TRUE)) {
      x <- ifelse(is.na(x), coalesce(lead(x, default = NA)), x)
    }
    x
  }

Function to handle missing or outlier values using backward or forward filling:

1. Missing or outlier were replaced with 99 for scanning purpose.
2. Backward Filling: Fills the current value by using the most recent available value from the previous months.
3. Forward Filling: Fills the current value by using the most recent available value from the upcoming months.

fill_backward_99 <- function(x) {
    if (all(x == 99, na.rm = TRUE)) return(x) # Return unchanged if all values are zero
    while (any(x == 99, na.rm = TRUE)) {
      x <- ifelse(x == 99, coalesce(lag(x, default = NA)), x)
    }
    x
  }
  fill_forward_99 <- function(x) {
    if (all(is.na(x), na.rm = TRUE)) return(x) # Return unchanged if all values are NA
    while (any(is.na(x), na.rm = TRUE)) {
      x <- ifelse(is.na(x), coalesce(lead(x, default = NA)), x)
    }
    x
  }

Function to handle missing or outlier values using backward or forward filling:

1. Missing or outlier were replaced with 0 for scanning purpose.
2. Backward Filling: Find the current value by using the most recent available value from the previous months.
3. Forward Filling: Find the current value by using the most recent available value from the upcoming months.
4. All values are then adjusted by adding or subtracting 1, depending on whether backward or forward filling is applied.

fill_n <- function(x) {
    # Return unchanged if all values are zero
    if (all(x == 0, na.rm = TRUE)) return(x)
    
    # Iterate over the vector until all zeros are filled
    while (any(x == 0, na.rm = TRUE)) {
      for (i in seq_along(x)) {
        if (x[i] == 0) {
          # Look for the nearest non-zero value in the previous and next values
          prev_value <- if (i > 1) x[i - 1] else NA
          next_value <- if (i < length(x)) x[i + 1] else NA
          
          # Replace with the first available non-zero neighbor
          x[i] <- if (!is.na(prev_value) && prev_value != 0) prev_value+1 else if (!is.na(next_value) && next_value != 0) next_value-1 else x[i]
        }
      }
    }
    x 
    return(x)
  }

Function to detect outliers using IQR method: 
This method is used when it's impossible to detect outlier from the frequency table.

lower_bound <- function(column) {
  Q1 <- quantile(column, 0.25, na.rm = TRUE)
  Q3 <- quantile(column, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- max(0,Q1 - 1.5 * IQR)
}

upper_bound <- function(column) {
  Q1 <- quantile(column, 0.25, na.rm = TRUE)
  Q3 <- quantile(column, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  upper_bound <- max(0,Q3 + 1.5 * IQR)
}

3.3 Frequency-based One-Hot Encoding

Type_of_Loan & Num_of_Loan

##                                                          Type_of_Loan
## 1 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 2 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
## 3 Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan
##   Num_of_Loan
## 1           4
## 2           4
## 3           4

# Checking pre cleaning, negative value and outlier are observed
  summary(rawdata$Num_of_Loan)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -100.00    1.00    3.00    3.01    5.00 1496.00

# To split the Type_of_loan into individual columns
  df_split <- rawdata %>%select(Customer_ID,Month,Num_of_Loan,Type_of_Loan)%>%
    separate(Type_of_Loan, into = paste("Loan", 1:9, sep = "_"), 
             sep = ", ", fill = "right")%>%
    mutate(across(c(Loan_1:Loan_9), ~ gsub("and", "", .)))
# To count the frequency of each type of loan and Total number of Loan
  df_long <- df_split %>%mutate(across(c(Loan_1:Loan_9), ~ gsub(" ", "", .)))%>%
    pivot_longer(cols = starts_with("Loan"), values_to = "Loan_Type", 
                 values_drop_na = TRUE) %>%
    count(Customer_ID,Month,Num_of_Loan,Loan_Type) %>%
    spread(key = Loan_Type, value = n, fill = 0)%>%select(-V1)%>%
    mutate(No_of_Loan = rowSums(across(4:12)))
# It is confirmed that Num_of_loan equals the sum of all Loan types after excluding outlier and negative value
  checking <-df_long%>%select(Customer_ID,Month,Num_of_Loan,No_of_Loan)%>%
   filter(Num_of_Loan<=9&Num_of_Loan>=0&Num_of_Loan!=No_of_Loan)
# to replace the new columns into df table
  rawdata<-rawdata%>%select(-Num_of_Loan,-Type_of_Loan)%>%
    left_join(df_long,by=c("Customer_ID","Month"))%>%select(-Num_of_Loan)
# Checking post cleaning
  summary(rawdata$No_of_Loan)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.533   5.000   9.000

3.4 Replacing Missing Values/Erroneous Values/Outlier with Mode

Use mode to replace the outlier with the most frequently occurring value in the dataset

Occupation

# Confirmed that customer is unlikely to have multiple occupation in this dataset.
  CustxOccupation_cnt<-rawdata%>%select(Customer_ID,Occupation)%>%distinct()%>%
  filter(!grepl('_______',Occupation))%>%count(Customer_ID)%>%filter(n>1)

# Replace erroneous values with mode
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Occupation = calculate_mode(Occupation)) %>%ungroup()

Annual Income

#Checking pre cleaning 
  summary(rawdata$Annual_Income)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7006    19458    37579   176416    72791 24198062

# For customers with more than one distinct annual income, the minimum change is 159%, highlighting an outlier. This suggests that such a drastic fluctuation in income is unlikely for a single customer, confirming that a customer is unlikely to have multiple distinct incomes in this dataset.
  CustxIncome<-rawdata%>%group_by(Customer_ID)%>%
    mutate(Income_Mode = as.numeric(calculate_mode(Annual_Income)))%>%ungroup()
  CustxIncome<-CustxIncome%>%select(Customer_ID,Annual_Income,Income_Mode)%>%
    distinct()%>%count(Customer_ID)%>%filter(n>1)%>%
    left_join(CustxIncome,by="Customer_ID")%>%
    select(Customer_ID,n,Annual_Income,Income_Mode)%>%distinct()%>%
    filter(Annual_Income!=Income_Mode)%>%
    mutate(pct_chg=round((Annual_Income-Income_Mode)/Income_Mode*100,2))%>%
    arrange(pct_chg)
  
# Replace outlier with mode
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Annual_Income = calculate_mode(Annual_Income))%>%ungroup()  
# Checking post cleaning
  summary(rawdata$Annual_Income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7006   19343   37000   50505   71683  179987

Interest Rate

# Checking pre cleaning
  summary(rawdata$Interest_Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   13.00   72.47   20.00 5797.00

# Interest_Rate count frequency reveals that Interest_Rate>34 is potential outlier
# After removing IR>34%,it is confirmed that customer is unlikely to have multiple Interest Rate in this dataset
  freq_cnt<-rawdata%>%select(Customer_ID,Interest_Rate)%>%distinct()%>%
    count(Interest_Rate)
  Interest_Rate<-rawdata%>%select(Customer_ID,Interest_Rate)%>%
    filter(Interest_Rate<=34)%>%group_by(Customer_ID)%>%distinct()%>%
    count(Customer_ID)
  
# Replace outlier with mode
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Interest_Rate = calculate_mode(Interest_Rate)) %>%ungroup()
# Checking post cleaning
  summary(rawdata$Interest_Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   13.00   14.53   20.00   34.00

Credit_Mix

# Checking pre cleaning, Distinct Credit_Mix, Credit_Mix='_' is observed
  credit_mix_check<-rawdata%>%select(Credit_Mix)%>%distinct()
# Confirmed that customer is unlikely to have multiple credit mix in this dataset.
  credit_mix_check<-rawdata%>%select(Customer_ID,Credit_Mix)%>%
    filter(Credit_Mix!='_')%>%distinct()%>%count(Customer_ID)%>%arrange(desc(n))
# A credit mix of '_' is unlikely, as its max count is only 6, suggesting valid values are available to replace '_'.
  credit_mix_check<-rawdata%>%select(Customer_ID,Credit_Mix)%>%
    filter(Credit_Mix=='_')%>%count(Customer_ID)
  
# Replace '_' with mode
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Credit_Mix = calculate_mode(Credit_Mix)) %>%ungroup()
# Checking post cleaning, Distinct Credit_Mix
  credit_mix_check<-rawdata%>%select(Credit_Mix)%>%distinct()

Payment_Behavior

# Checking pre cleaning, Distinct Payment Behavior, erroneous value '!@9#%8' is observed
  PB<-rawdata%>%select(Payment_Behaviour)%>%distinct()
# Confirmed that customer is likely to have multiple payment behavior in this dataset.
  PB<-rawdata%>%select(Customer_ID,Payment_Behaviour)%>%
    filter(Payment_Behaviour!='!@9#%8')%>%
    distinct()%>%count(Customer_ID)
  
# Replace only erroneous value with mode
  rawdata<-rawdata%>%group_by(Customer_ID)%>%
    mutate(Payment_Behaviour = ifelse(Payment_Behaviour=='!@9#%8',
                  calculate_mode(Payment_Behaviour),Payment_Behaviour))%>%
    ungroup() 
# Checking post cleaning, Distinct Payment Behavior
  PB<-rawdata%>%select(Payment_Behaviour)%>%distinct()

Total_EMI_per_month

# Checking pre cleaning, outlier observed
  summary(rawdata$Total_EMI_per_month)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    30.31    69.25  1403.12   161.22 82331.00

  summary(rawdata$Outstanding_Debt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.23  566.07 1166.15 1426.22 1945.96 4998.07

# Assuming >5000 is outlier as maximum outstanding_debt<5000 
# No_of_Loan & Total_EMI_per_month are positively related
  EMI<-rawdata%>%select(Customer_ID,No_of_Loan,Total_EMI_per_month)%>%
    mutate(Total_EMI_per_month=ifelse(Total_EMI_per_month>5000,
              NA,Total_EMI_per_month))%>%distinct()%>%group_by(No_of_Loan)%>%
    summarise(EMI = mean(Total_EMI_per_month, na.rm = TRUE))
 ggplot(EMI, aes(x = No_of_Loan, y = EMI)) +
  geom_line(color = "blue") +
  geom_point(size = 3) +
  labs(title = "Average EMI by Number of Loans",
       x = "Number of Loans",
       y = "Average Total EMI per Month") +
  theme_minimal()

# To use IQR method to set boundaries for each number of loan and filter outlier/error
# To replace with mode with those exceed boundaries
  rawdata <- rawdata %>%
    mutate(Total_EMI_per_month1=ifelse(Total_EMI_per_month>5000,
                                       NA,Total_EMI_per_month))%>%
    group_by(No_of_Loan)%>%
    mutate(Lower_EMI = lower_bound(Total_EMI_per_month1),
           Upper_EMI = upper_bound(Total_EMI_per_month1))%>%ungroup()%>%
    mutate(Total_EMI_per_month1=ifelse(Total_EMI_per_month1>Upper_EMI,NA,
                                       Total_EMI_per_month1))%>%
    group_by(Customer_ID)%>%
    mutate(Total_EMI_per_month1=ifelse(is.na(Total_EMI_per_month1),
                 calculate_mode(Total_EMI_per_month1),Total_EMI_per_month1))%>%
    mutate(Total_EMI_per_month1=ifelse(is.na(Total_EMI_per_month1),
                 calculate_mode(Total_EMI_per_month),Total_EMI_per_month1))%>%
    select(-Total_EMI_per_month,-Lower_EMI,-Upper_EMI)%>%
    rename(Total_EMI_per_month=Total_EMI_per_month1)
# Checking post cleaning
  summary(rawdata$Total_EMI_per_month)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   28.42   65.01   98.12  141.51 1701.96

3.5 Replacing Missing Values/Erroneous Values/Outlier with Mean

Amount_invested_monthly

# Checking pre cleaning, outlier and NA value is observed.
  summary(rawdata$Amount_invested_monthly)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.00    74.53   135.93   637.41   265.73 10000.00     4479

# The value 10000 is highly frequent and may be a potential outlier, while NA values require treatment.
  AIM<-rawdata%>%select(Customer_ID,Month,Amount_invested_monthly)%>%
    count(Amount_invested_monthly)%>%arrange(by=desc(n))
  
# Replace NA and 10000 with mean of the other months
  rawdata$Amount_invested_monthly<-
    ifelse(rawdata$Amount_invested_monthly==10000,NA,
           rawdata$Amount_invested_monthly)
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Amount_invested_monthly = round(ifelse(is.na(Amount_invested_monthly),
            mean(Amount_invested_monthly, na.rm = TRUE),
            Amount_invested_monthly),2)) %>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Amount_invested_monthly)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    74.6   131.2   195.8   239.5  1977.3

Monthly_Balance

# Checking pre cleaning, negative outlier and NA value are observed.
  summary(rawdata$Monthly_Balance)

##                         Min.                      1st Qu. 
## -333333333333333314868224222                          270 
##                       Median                         Mean 
##                          337     -30364372469635625254402 
##                      3rd Qu.                         Max. 
##                          470                         1602 
##                         NA's 
##                         1200

# The value -333333333333333314868224222.00 is confirmed as an outlier, and observed NA values need to be addressed.
  MB<-rawdata%>%select(Customer_ID,Month,Monthly_Balance)%>%count(Monthly_Balance)
  
# Replace outlier and na with mean
  rawdata$Monthly_Balance<-
    ifelse(rawdata$Monthly_Balance==-333333333333333314868224222.00,
           NA,rawdata$Monthly_Balance)
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Monthly_Balance = round(ifelse(is.na(Monthly_Balance),
                          mean(Monthly_Balance, na.rm = TRUE),
                          Monthly_Balance),2))%>% 
    ungroup()
# Checking post cleaning
  summary(rawdata$Monthly_Balance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01  270.19  337.12  403.12  471.57 1602.04

3.6 Backward/Forward Filling

to fill the missing/outlier value with previous/next month available value

Age

# Checking pre cleaning, Outlier is observed
  summary(rawdata$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -500.0    24.0    33.0   110.7    42.0  8698.0

# Ages ≥95 are confirmed as outliers due to low frequency, while -500 is erroneous.
  age<-rawdata%>%select(Customer_ID,Month,Age)%>%count(Age)
  
# To replace 0 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata$Age<-ifelse(rawdata$Age>=95|rawdata$Age==-500,0,rawdata$Age)
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Age = fill_backward(Age)) %>%
    mutate(Age = fill_forward(Age))%>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   24.00   33.00   33.31   42.00   56.00

Monthly_Inhand_Salary

# Checking pre cleaning, NA is observed
  summary(rawdata$Monthly_Inhand_Salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   303.6  1625.6  3093.8  4194.2  5957.4 15204.6   15002

# It is possible for each customer to have a different Monthly_Inhand_Salary
  MIS<-rawdata%>%select(Customer_ID,Monthly_Inhand_Salary)%>%
    filter(!is.na(Monthly_Inhand_Salary))%>%
    distinct()%>%count(Customer_ID)%>%filter(n>1)
# Confirm no 0 value
  MIS<-rawdata%>%select(Customer_ID,Monthly_Inhand_Salary)%>%
    filter(Monthly_Inhand_Salary==0)
  
# To replace 0 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata$Monthly_Inhand_Salary<-
    ifelse(is.na(rawdata$Monthly_Inhand_Salary),0,rawdata$Monthly_Inhand_Salary)
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Monthly_Inhand_Salary = fill_backward(Monthly_Inhand_Salary)) %>%
    mutate(Monthly_Inhand_Salary = fill_forward(Monthly_Inhand_Salary))%>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Monthly_Inhand_Salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   303.6  1626.8  3096.4  4198.8  5961.7 15204.6

Num_Bank_Accounts

# Checking pre cleaning, negative value and outlier are observed
  summary(rawdata$Num_Bank_Accounts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00    3.00    6.00   17.09    7.00 1798.00

# Values greater than 11 are confirmed as outliers and further treatement is needed, while -1 can be replaced with zero.
  NBA<-rawdata%>%select(Customer_ID,Month,Num_Bank_Accounts)%>%
    count(Num_Bank_Accounts)
  
# To replace 99 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata$Num_Bank_Accounts<-ifelse(rawdata$Num_Bank_Accounts>11,99,
                ifelse(rawdata$Num_Bank_Accounts==-1,0,rawdata$Num_Bank_Accounts))
  rawdata <- rawdata %>%
    group_by(Customer_ID) %>%
    mutate(Num_Bank_Accounts = fill_backward_99(Num_Bank_Accounts))%>%
    mutate(Num_Bank_Accounts = fill_forward_99(Num_Bank_Accounts))%>%
    ungroup()
#Checking post cleaning
  summary(rawdata$Num_Bank_Accounts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   5.369   7.000  11.000

Num_Credit_Card

# Checking pre cleaning, outlier is observed
  summary(rawdata$Num_Credit_Card)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    5.00   22.47    7.00 1499.00

# Values greater than 11 are confirmed as outliers and further treatement is needed
  freq_cnt<-rawdata%>%select(Customer_ID,Num_Credit_Card)%>%distinct()%>%
    count(Num_Credit_Card)
# It is possible for each customer to have a different Num_Credit_Card
  Num_CC_cnt<-rawdata%>%select(Customer_ID,Num_Credit_Card)%>%
    filter(Num_Credit_Card<=11)%>%
    distinct()%>%count(Customer_ID)%>%arrange(desc(n))
  
# To replace 99 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata$Num_Credit_Card<-ifelse(rawdata$Num_Credit_Card>11,
                                  99,rawdata$Num_Credit_Card)
  rawdata <- rawdata %>%group_by(Customer_ID) %>%
    mutate(Num_Credit_Card = fill_backward_99(Num_Credit_Card)) %>%
    mutate(Num_Credit_Card = fill_forward_99(Num_Credit_Card))%>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Num_Credit_Card)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.000   5.000   5.534   7.000  11.000

Delay_from_due_date

# Checking pre cleaning, negative day is observed
  summary(rawdata$Delay_from_due_date)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5.00   10.00   18.00   21.07   28.00   67.00

# Frequency count confirmed no outlier detected
  freq_cnt<-rawdata%>%select(Customer_ID,Delay_from_due_date)%>%distinct()%>%
    count(Delay_from_due_date)
# To floor Delay_from_due_date at 0 
  rawdata$Delay_from_due_date<-ifelse(rawdata$Delay_from_due_date<=0,0
                                      ,rawdata$Delay_from_due_date)
# Checking post cleaning
  summary(rawdata$Delay_from_due_date)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   18.00   21.08   28.00   67.00

Num_of_Delayed_Payment

# Checking pre cleaning, negative value, outlier and NA value are observed
  summary(rawdata$Num_of_Delayed_Payment)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -3.00    9.00   14.00   30.92   18.00 4397.00    7002

# Num_of_Delayed_Payment >28 are confirmed as outliers due to low frequency, while negative values will be replaced with zero
  DD<-rawdata%>%select(Customer_ID,Month,Num_of_Delayed_Payment)%>%
    count(Num_of_Delayed_Payment)
  
# To replace 99 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata<-rawdata%>%
    mutate(Num_of_Delayed_Payment=ifelse(Num_of_Delayed_Payment>28|
                  is.na(Num_of_Delayed_Payment),99,
                  ifelse(Num_of_Delayed_Payment<0,0,Num_of_Delayed_Payment)))
  # To replace Num_of_Delayed_Payment with zero if Delay_from_due_date<=0
  rawdata <- rawdata %>%
    mutate(Num_of_Delayed_Payment=
             case_when(Delay_from_due_date<=0 ~ 0, 
                       TRUE ~ Num_of_Delayed_Payment))%>%
    group_by(Customer_ID) %>%
    mutate(Num_of_Delayed_Payment = fill_backward_99(Num_of_Delayed_Payment)) %>%
    mutate(Num_of_Delayed_Payment = fill_forward_99(Num_of_Delayed_Payment))%>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Num_of_Delayed_Payment)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   14.00   13.21   18.00   28.00

Changed_Credit_Limit

# Checking pre cleaning, NA value is observed
  summary(rawdata$Changed_Credit_Limit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -6.49    5.32    9.40   10.39   14.87   36.97    2091

# No outlier is observed
  cc_chg<-rawdata%>%select(Customer_ID,Month,Changed_Credit_Limit)%>%distinct()%>%
    count(Changed_Credit_Limit)
  
# To replace 99 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata$Changed_Credit_Limit<-
    ifelse(is.na(rawdata$Changed_Credit_Limit),99,rawdata$Changed_Credit_Limit)
  rawdata <- rawdata %>%
    group_by(Customer_ID) %>%
    mutate(Changed_Credit_Limit = fill_backward_99(Changed_Credit_Limit))%>%
    mutate(Changed_Credit_Limit = fill_forward_99(Changed_Credit_Limit))%>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Changed_Credit_Limit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.49    5.32    9.40   10.39   14.86   36.97

Num_Credit_Inquiries

# Checking pre cleaning, outlier and NA value are observed
  summary(rawdata$Num_Credit_Inquiries)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    6.00   27.75    9.00 2597.00    1965

# Outlier is confirmed to be >17 due to low frequency 
  credit_in<-rawdata%>%select(Customer_ID,Num_Credit_Inquiries)%>%
    distinct()%>%count(Num_Credit_Inquiries)
  
#  To replace 99 values in a dataset with the available previous or next month's value using a back/forward fill function
  rawdata$Num_Credit_Inquiries<-ifelse(is.na(rawdata$Num_Credit_Inquiries)|
                          rawdata$Num_Credit_Inquiries>17,
                          99,rawdata$Num_Credit_Inquiries)
  rawdata <- rawdata %>%
    group_by(Customer_ID) %>%
    mutate(Num_Credit_Inquiries = fill_backward_99(Num_Credit_Inquiries)) %>%
    mutate(Num_Credit_Inquiries = fill_forward_99(Num_Credit_Inquiries))%>%
    ungroup()
# Checking post cleaning
  summary(rawdata$Num_Credit_Inquiries)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   5.773   8.000  17.000

Credit_History_Age

# Max 33 Years and 8 Months and NA is observed
  credit_age<-rawdata%>%select(Customer_ID,Credit_History_Age)%>%distinct()%>%
    count(Credit_History_Age)
# Convert the age to Months format and replace NA with zero value
  credit_age<-rawdata %>%select(Customer_ID, Month, Credit_History_Age)%>%
    mutate(Years = as.integer(str_extract(Credit_History_Age, "\\d+(?= Years)")),
           Months = as.integer(str_extract(Credit_History_Age, 
           "(?<=and )\\d+(?= Months)")),
           Credit_History_Months = Years * 12 + Months)%>%
    mutate(Credit_History_Months=ifelse(is.na(Credit_History_Months),
                                        0,Credit_History_Months))
  
# To replace 0 values in a dataset with the previous or next month's available value using a back/forward fill function, then adjust the value by +1 or -1 depending on the filling direction
  credit_age<-credit_age %>%
    group_by(Customer_ID) %>%
    mutate(Credit_History_Months = fill_n(Credit_History_Months)) %>%
    ungroup()%>%select(Customer_ID,Month,Credit_History_Months)
  rawdata<-rawdata%>%
    left_join(credit_age,by=c('Customer_ID', 'Month'))%>%
    select(-Credit_History_Age)
# Checking post cleaning
  summary(rawdata$Credit_History_Months)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   144.0   219.0   221.2   302.0   404.0

3.7 Binary Encoding

Payment_of_Min_Amount

# Checking pre cleaning, NM value is observed
  check_min<-rawdata%>%select(Payment_of_Min_Amount)%>%distinct()
# Confirmed that each customer has only one payment_of_min_amount indicator.
  check_min<-rawdata%>%select(Customer_ID,Payment_of_Min_Amount)%>%
    filter(Payment_of_Min_Amount!='NM')%>%distinct()%>%count(Customer_ID)
# Binary Encdoing, Yes=1; No=0
  min_amt_ind<-rawdata%>%select(Customer_ID,Payment_of_Min_Amount)%>%
    filter(Payment_of_Min_Amount!='NM')%>%distinct()%>%
    mutate(Min_Amount_Pymt_Ind=ifelse(Payment_of_Min_Amount=='Yes',1,0))%>%
    select(-Payment_of_Min_Amount)
  
# to update in data
  rawdata<-rawdata%>%left_join(min_amt_ind,by='Customer_ID')%>%
    select(-Payment_of_Min_Amount)
# Checking post cleaning
  summary(rawdata$Min_Amount_Pymt_Ind)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5943  1.0000  1.0000

3.8 Drop irrelevant variables & Rename Variable

rawdata <- rawdata %>% select(-ID, -Name, -SSN)%>%
  rename(CreditBuilderLoan='Credit-BuilderLoan')

# After complete all the data cleaning, perform one last round of data summary
# Call the function with the sample dataframe
summary_result <- create_summary_table(rawdata)

# Print the summary table
summary_result

## # A tibble: 33 × 8
##    Variable              Class     Min    `1st Qu.` Median Mean  `3rd Qu.` Max  
##    <chr>                 <chr>     <chr>  <chr>     <chr>  <chr> <chr>     <chr>
##  1 Customer_ID           character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  2 Month                 character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  3 Age                   numeric   14     24        33     33.3… 42        56   
##  4 Occupation            character <NA>   <NA>      <NA>   <NA>  <NA>      <NA> 
##  5 Annual_Income         numeric   7005.… 19342.97… 36999… 5050… 71683.47  1799…
##  6 Monthly_Inhand_Salary numeric   303.65 1626.76   3096.… 4198… 5961.74   1520…
##  7 Num_Bank_Accounts     numeric   0      3         5      5.36… 7         11   
##  8 Num_Credit_Card       numeric   0      4         5      5.53… 7         11   
##  9 Interest_Rate         numeric   1      7         13     14.5… 20        34   
## 10 Delay_from_due_date   numeric   0      10        18     21.0… 28        67   
## # ℹ 23 more rows

write.csv(rawdata, "dataset_cleaned_v2.csv", row.names = FALSE)

4. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in examining and summarizing datasets to reveal patterns, trends, anomalies, and relationships. It uses both visual and quantitative methods to provide a thorough understanding of the data before moving on to more sophisticated modeling or analysis.

For our credit score classification project, three primary analysis were carried out to explore the financial data, study the distribution of variables, and investigate the underlying relationships among them.
  - Univariate Analysis
  - Bivariate Analysis
  - Correlation Analysis

4.1 Univariate Analysis

4.1.1 Distribution of Numerical Variables

# Read the dataset
rawdata <- read.csv("C:/Users/sharo/OneDrive/Desktop/UM_Master_in_Data_Science/Programming for Data Science/Group Project/dataset_cleaned_v2.csv")
# Select numerical columns
numeric_data <- rawdata[sapply(rawdata, is.numeric)]

# Loop through each numerical column and plot its distribution
for (col_name in colnames(numeric_data)) {
  # Create a histogram for the current numerical column
  p <- ggplot(rawdata, aes(x = .data[[col_name]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Distribution of", col_name),
         x = col_name, 
         y = "Frequency") +
    theme_minimal() +
    theme(axis.text.x = element_text(size = 12),
          axis.text.y = element_text(size = 12),
          plot.title = element_text(hjust = 0.4, size = 8))
  # Print the plot
  print(p)
}

Annual_Income & Monthly_Inhand_Salary & Monthly_Balance & Amount_Invested_Monthly:

Right-skewed distribution observed with a higher density of data points at lower income levels, indicating that most individuals have lower incomes, with fewer individuals having very high incomes. There is evident income inequality, with most people earning less and only a small portion earning significantly higher incomes. this pattern is also observed in Monthly_Balance and Amount_Invested_Monthly, as individuals with lower incomes tend to have smaller savings or investments, while only a few individuals with high incomes have significantly larger balances and investments.

4.1.2 Distribution of Categorical Variables

# Select only categorical columns
categorical_data <- rawdata[sapply(rawdata, function(x) is.factor(x) || is.character(x))]
# Drop 'Customer_ID' (an index) from categorical_data
categorical_data <- categorical_data[, !(colnames(categorical_data) %in% c("Customer_ID", "Month"))]

# Loop through each categorical column and plot its distribution
for (col_name in colnames(categorical_data)) {
  # Create a bar plot for the current categorical column
  p <- ggplot(rawdata, aes(x = .data[[col_name]])) +
    geom_bar(fill = "steelblue", color = "black") +
    geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5, size = 3) +
    labs(title = paste("Distribution of", col_name),
         x = col_name, 
         y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 0.5),
          axis.text.y = element_text(size = 10),
          plot.title = element_text(hjust = 0.5, size = 12))
  
  print(p) # Display the plot
}

Occupation:

Relatively even across the various categories.

Credit_Mix and Credit_Score:

The majority of the population falls into the "Standard" credit mix category, followed by "Good" and "Bad," which is consistent with the distribution observed in the credit score variable, suggesting a potential positive correlation between these two variables.

Payment_Behavior:

The most common behaviour is "Low spent, Small value payments", indicating a cautious approach to spending.

4.1.3 Distribution of Credit Score (Target variable in Credit Classification)

# Summarize the data to calculate counts and percentages
class_distribution <- rawdata %>%
  group_by(Credit_Score) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round((Count / sum(Count)) * 100, 1))  # Calculate percentages

# Create a pie chart
ggplot(class_distribution, aes(x = "", y = Count, fill = Credit_Score)) +
  geom_col(width = 1, color = "black") +
  coord_polar(theta = "y") +  # Convert to polar coordinates for a pie chart
  geom_text(aes(label = paste0(Count, " (", Percentage, "%)")), 
            position = position_stack(vjust = 0.5), size = 4) +  # Add labels
  labs(title = "Distribution of Credit Score Classes",
       x = NULL, 
       y = NULL) +
  scale_fill_manual(values = c("Good" = "forestgreen", 
                               "Standard" = "steelblue", 
                               "Poor" = "firebrick"),
                    name = "Credit Score") +
  theme_minimal() +
  theme(axis.text = element_blank(),  # Remove axis text
        axis.ticks = element_blank(), # Remove axis ticks
        panel.grid = element_blank(), # Remove grid lines
        plot.title = element_text(hjust = 0.5, size = 14))  # Center and style the title

The majority of customers have a "Standard" credit score, followed by "Poor" and "Good," indicating class imbalance, where the "Standard" category is overrepresented, while the "Poor" and "Good" categories are underrepresented.

4.2 Bivariate Analysis

4.2.1 Distribution of Numerical Columns with Credit Score

# Convert 'Credit_Score' to a factor
rawdata$Credit_Score <- factor(rawdata$Credit_Score, levels = c("Poor", "Standard", "Good"))

# Number of columns per group
group_size <- 4 

# Split numeric columns into groups
column_groups <- split(colnames(numeric_data), ceiling(seq_along(colnames(numeric_data)) / group_size))

# Loop through each group and plot
for (i in seq_along(column_groups)) {
  group <- column_groups[[i]]
  plots <- list()
  
  # Generate boxplots for the current group
  for (col_name in group) {
    p <- ggplot(rawdata, aes(x = Credit_Score, y = .data[[col_name]], fill = Credit_Score)) +
      geom_boxplot(outlier.color = "red", outlier.size = 1.5) +
      labs(title = paste(col_name, "by Credit Score"), 
           x = "Credit Score", 
           y = col_name) +
      scale_fill_manual(values = c("Poor" = "firebrick", "Standard" = "steelblue", "Good" = "forestgreen")) +
      theme_minimal() +
      theme(legend.position = "none",
            plot.title = element_text(size = 10, hjust = 0.5),
            axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
            axis.text.y = element_text(size = 8),
            axis.title = element_text(size = 10))
    plots[[col_name]] <- p
  }
  
  # Display the plots for the current group
  grid.arrange(grobs = plots, ncol = 2, top = paste("Boxplots of Credit Score - Group", i))
}

Observation:

While most variables display clear differentiation in boxplots across the three classes, certain variables such as Changed_Credit_Limit, Credit_Utilization_Ratio, one-hot-encoded Loan types (e.g., CreditBuilderLoan, HomeEquityLoan, etc.), and Total_EMI_per_month struggle to show distinct differences. This indicates that these particular variables may have limited predictive power in distinguishing between the classes.

4.2.2 Distribution of categorical columns with Credit Score

# Drop 'Credit_Score' (target variable) from categorical_data
categorical_data1 <- categorical_data[, !(colnames(categorical_data) %in% "Credit_Score")]

# Loop through each categorical column and display the countplot
for (col_name in colnames(categorical_data1)) {
  # Create countplot for the current categorical column
  p <- ggplot(rawdata, aes(x = .data[[col_name]], fill = Credit_Score)) +
    geom_bar(position = "dodge") +
    labs(title = paste(col_name, "by Credit Score"),
         x = col_name, 
         y = "Count", 
         fill = "Credit Score") +
    scale_fill_manual(values = c("Poor" = "firebrick", "Standard" = "steelblue", "Good" = "forestgreen")) +
    theme_minimal() +
    theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
          axis.text.y = element_text(size = 10),
          plot.title = element_text(hjust = 0.5, size = 12))
  
  # Print the plot
  print(p)
}

Observation:

Occupation x Credit_Score:
No clear relationship has been observed between occupation and credit_score, indicating that occupation may not be a significant predictor of credit score. Hence, 'Occupation’ will be dropped from further consideration.

Credit_Mix x Credit_Score:
As already noted in Section 4.1.2, these two variables again show a positive correlation, reinforcing their role as predictive variables in credit score classification.

4.3 Correlation Analysis

4.3.1 Correlation HeatMap of numerical columns

Correlation heatmap shows the relationship between the features with the credit score. The correlation coefficient ranges between -1 and +1. The closer the value is to +1, the stronger the positive relationship. The closer the value is to -1, the stronger the negative relationship. Values near zero indicate a weak or no relationship.

# Convert 'Credit_Score' to numeric
rawdata$Credit_Score_Numeric <- as.numeric(factor(rawdata$Credit_Score, levels = c("Good", "Standard", "Poor")))
rawdata$Credit_Mix_Numeric <- as.numeric(factor(rawdata$Credit_Mix, levels = c("Good", "Standard", "Bad")))
# Select numerical columns
numeric_data <- rawdata[sapply(rawdata, is.numeric)]

# Compute the correlation matrix
correlation_matrix <- cor(numeric_data, use = "complete.obs")

# Melt the correlation matrix for ggplot
correlation_melt <- melt(correlation_matrix)

# Create a new variable to highlight correlations > 0.7
correlation_melt$highlight <- ifelse(abs(correlation_melt$value) > 0.7, "highlight", "normal")

# Create the heatmap with correlation values, highlight cells with correlation > 0.7 and draw borders
ggplot(correlation_melt, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +  # Fill based on correlation value
  geom_tile(data = subset(correlation_melt, highlight == "highlight"), color = "yellow", size = 1, fill = NA) +  # Draw borders for high correlations
  geom_text(aes(label = sprintf("%.2f", value)), size = 1.5, color = "black") +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1, size = 6),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(size = 14, hjust = 0.5))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Observation:

Independent Variables with High Inter-collinearity (> 0.70)

1. Monthly_Balance <-> Annual_Income <-> Monthly_Inhand_Salary: 
These variables are highly correlated with each other, indicating that these variables provide overlapping information about the financial standing of the individuals and could lead to multicollinearity issues.

2. Credit_Mix <-> Num_Bank_Accounts <-> Interest_Rate <-> Num_of_Delayed_Payment <-> Min_Amount_Pymt_Ind: 
These variables are highly correlated with Credit_Mix (correlation > 0.7) but exhibit lower correlations with each other (correlation < 0.7). This suggests that Credit_Mix serves as a comprehensive measure encompassing various aspects of financial behavior and creditworthiness.However, the high correlation between these variables indicates multicollinearity, which may complicate the interpretation of their individual contributions in a predictive model.

4.3.2 Correlation of all features with Credit Score

# Drop 'Customer_ID and 'Credit_Score_Numeric'
rawdata1 <- rawdata[, !(colnames(rawdata) %in% c("Customer_ID", "Credit_Score","Credit_Mix"))]

# Initialize an empty data frame to store correlations
correlation_results <- data.frame(Variable = character(), Correlation = numeric())

# Loop through all features except Credit_Score
for (col_name in colnames(rawdata1)) {
  if (col_name != "Credit_Score_Numeric") {
    if (is.numeric(rawdata1[[col_name]])) {
      # Calculate Pearson Correlation for numeric variables
      correlation <- cor(rawdata1[[col_name]], as.numeric(rawdata1$Credit_Score_Numeric), use = "complete.obs")
    } else {
      # Calculate Cramér's V for categorical variables
      correlation <- CramerV(table(rawdata1[[col_name]], rawdata1$Credit_Score_Numeric))
    }
    # Store the result
    correlation_results <- rbind(correlation_results, data.frame(Variable = col_name, Correlation = correlation))
  }
}

# Arrange by descending correlation
correlation_results <- correlation_results %>% arrange(desc(Correlation))

# Plot correlation graph
ggplot(correlation_results, aes(x = reorder(Variable, Correlation), y = Correlation, fill = Correlation)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.2f", Correlation)), 
            hjust = ifelse(correlation_results$Correlation > 0, -0.1, 1.1), 
            size = 2, 
            color = "black") +
  coord_flip() +
  labs(title = "Correlation of Features with Credit Score",
       x = "Features",
       y = "Correlation") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_minimal()

Observations:

1.  Variables with the Strongest POSITIVE Correlation with Credit Score (the Higher the value, the Worse the Credit Score)

Credit_Mix (0.50): A diverse mix of credit types helps to mitigate risk by ensuring that an individual isn't overly reliant on a single type of credit. This diversification can indicate that the individual is managing multiple forms of credit responsibly, reducing the overall risk.

Interest_Rate (0.49): Higher interest rates are often charged to individuals who are considered higher risk. This reflects the increased likelihood of credit loss, as lenders compensate for the higher risk by charging more.

Payment_of_Min_Amount (0.44): Individuals who only pay the minimum amount due and allow the outstanding balance to roll over are often experiencing financial strain. This behavior can signal that they are struggling to manage their finances effectively.

2.  Variables with the Strongest NEGATIVE Correlation with Credit Score (the Lower the value, the Worse the Credit Score)

Credit_History_Months (-0.39): Longer credit history suggests that the individual has experience in managing credit over time. It indicates maturity and an ability to maintain financial health over the long term without being charged off or blacklisted by financial institutions.

Monthly_Balance (-0.21): A lower remaining balance at the end of each month indicates financial difficulties. It suggests that the individual has little room for any unexpected expenses or financial setbacks, which could lead to further financial trouble.

Annual_Income (-0.21): Higher annual income provides individuals with greater purchasing power and the ability to sustain themselves through financial downturns. It often correlates with better credit scores as higher income individuals are more capable of meeting their financial obligations.

3.  Variables with Insignificant Correlation with Credit Score (close to ZERO) This suggests that other factors may be more critical in determining an individual's credit score, and these variables may not be useful predictors in credit score modeling.

-   Occupation (0.03)
-   Credit_Utilization_Ratio (-0.05)
-   Total_EMI_per_month (0.07)

5. Model Modeling & Evaluation - Credit Score Classification Model

For Classification Model to predict Credit Score = Poor/ Standard/ Good, 3 types of models will be explored, and their performance will be compared for adoption recommendation:

Model Type 1: Multinomial Logistic Regression (MLR)
Model Type 2: Random Forest (RF)
Model Type 3: XGBoost (XG)

5.1 Data Preparation for Modelling

Data will be scaled prior to modelling due to:

Most machine learning algorithms benefit from scaling.
Scaled data has seen to require less computational power.

As class imbalance observed, 2 class balancing techniques below has been attempted:

Over-sampling
Under-sampling

Data partition of 70% (Train sample) vs. 30% (Test sample) has been conducted to have a hold-out sample for model testing, and all 3 model types will be conducted on 3 separate data sets as follows:

Scaled Data (original)
Scaled Data + Over-sampling
Scaled Data + Under-sampling

rawdata <- read.csv("C:/Users/sharo/OneDrive/Desktop/UM_Master_in_Data_Science/Programming for Data Science/Group Project/dataset_cleaned_v2.csv")

# Define custom mapping
creditscore_mapping <- c("Good" = 0, "Standard" = 1, "Poor" = 2)
creditmix_mapping <- c("Good" = 0, "Standard" = 1, "Bad" = 2)
pb_mapping <- c('Low_spent_Small_value_payments'= 5,
'Low_spent_Medium_value_payments'= 4,
'Low_spent_Large_value_payments'= 3,
'High_spent_Small_value_payments'= 2,
'High_spent_Medium_value_payments'= 1,
'High_spent_Large_value_payments'= 0)

# Apply the mapping and remove character & unnecessary column 
rawdata$Credit_Score <- unname(creditscore_mapping[rawdata$Credit_Score])
rawdata$Credit_Mix <- unname(creditmix_mapping[rawdata$Credit_Mix])
rawdata$Payment_Behaviour <- unname(pb_mapping[rawdata$Payment_Behaviour])
rawdata <- rawdata %>%select(-Customer_ID,-Month,-Occupation)%>%mutate(Credit_Score=as.factor(Credit_Score))

# Check if any column is of character type
has_character <- any(sapply(rawdata, is.character))
if (has_character) {
  print("There is a character column, please remove.")
} else {
  print("There is no character column, proceed.")
}

## [1] "There is no character column, proceed."

# Data Splitting & Scaling
set.seed(12345)
df_prep <-  createDataPartition(y = rawdata$Credit_Score,p = 0.7, list= FALSE)

train_df <- rawdata[df_prep,]
test_df <- rawdata[-df_prep,]

# transformation for TRAIN dataset & normalise all numeric variables
train_prep1 <- recipe(Credit_Score ~ .,data = train_df)%>%step_normalize(all_numeric_predictors())

train_prep <- prep(train_prep1, training = train_df)
train_new <- bake(train_prep, new_data = train_df)

# transformation for Test dataset & normalise all numeric variables
test_prep1 <- recipe(Credit_Score ~ ., data = test_df) %>%step_normalize(all_numeric_predictors())
test_prep <- prep(test_prep1, training = test_df)
test_new <- bake(test_prep, new_data = test_df)

# Check if there are any missing values in the entire data frame
any_na <- any(is.na(train_new))
if (any_na) {
  print("There is NA value in train set, please fix")
} else {
  print("There is no NA Value in train set, proceed.")
}

## [1] "There is no NA Value in train set, proceed."

any_na <- any(is.na(test_new))
if (any_na) {
  print("There is NA value in test set, please fix")
} else {
  print("There is no NA Value in test set, proceed.")
}

## [1] "There is no NA Value in test set, proceed."

Original

#Distribution of Credit Score in Original dataset
levels(train_df$Credit_Score) <- c("Good", "Standard", "Poor")
print(table(train_df$Credit_Score))

## 
##     Good Standard     Poor 
##    12480    37222    20299

Over-Sampling

up_train <- upSample(x = train_new %>% select(-Credit_Score),y = train_new$Credit_Score) %>%
  rename(Credit_Score = Class)
#Distribution of Credit Score in Over-sampling dataset
up_train2 <- up_train
levels(up_train2$Credit_Score) <- c("Good", "Standard", "Poor")
print(table(up_train2$Credit_Score))

## 
##     Good Standard     Poor 
##    37222    37222    37222

Under-Sampling

down_train <- downSample(x = train_new %>% select(-Credit_Score),y = train_new$Credit_Score) %>%
  rename(Credit_Score = Class) 
#Distribution of Credit Score in Under-sampling dataset
down_train2<-down_train
levels(down_train2$Credit_Score) <- c("Good", "Standard", "Poor")
print(table(down_train2$Credit_Score))

## 
##     Good Standard     Poor 
##    12480    12480    12480

5.2 Data Modelling

Model Type 1: Multinomial Logistic Regression

# Setting up the model
multiM_1 <- multinom_reg(mixture = 0, penalty = double(1))  %>%set_engine("glmnet") %>%
  fit(Credit_Score ~ ., data = train_new)
variables <- all.vars(multiM_1$fit$terms)
print(variables)

## character(0)

# Full Scaled Dataset
MLR_Train1 = predict(multiM_1,new_data = train_new, type="class")
MLR_Test1 = predict(multiM_1,new_data = test_new, type = "class")

# Oversampling 
multiM_2 <- multinom_reg(mixture = 0, penalty = double(1))  %>%set_engine("glmnet") %>%
  fit(Credit_Score ~ ., data = up_train)
MLR_Train2 = predict(multiM_2,new_data = up_train, type="class")
MLR_Test2 = predict(multiM_2,new_data = test_new, type = "class")

# Under-Sampling
multiM_3 <- multinom_reg(mixture = 0, penalty = double(1))  %>%set_engine("glmnet") %>%
  fit(Credit_Score ~ ., data = down_train)
MLR_Train3 = predict(multiM_3,new_data = down_train, type="class")
MLR_Test3 = predict(multiM_3,new_data = test_new, type = "class")

Model Type 2: Random Forest

# Random Forest

# Full Scaled Dataset
RFM1 <- rand_forest(trees = 30, mode = "classification")  %>%set_engine("randomForest") %>%
  fit(Credit_Score ~ ., data = train_new)
RF_Train1 <- predict(RFM1,new_data = train_new, type = "class")
RF_Test1 <- predict(RFM1,new_data = test_new, type = "class")


# Oversampling
RFM2 <- rand_forest(trees = 30, mode = "classification")  %>%set_engine("randomForest") %>%
  fit(Credit_Score ~ ., data = up_train)
RF_Train2 <- predict(RFM2,new_data = up_train, type = "class")
RF_Test2 <- predict(RFM2,new_data = test_new, type = "class")


# Undersampling
RFM3 <- rand_forest(trees = 30, mode = "classification")  %>%set_engine("randomForest") %>%
  fit(Credit_Score ~ ., data = down_train)
RF_Train3 <- predict(RFM3,new_data = down_train, type = "class")
RF_Test3 <- predict(RFM3,new_data = test_new, type = "class")

Model Type 3: XG Boost

#XG Boost

# Full Scaled Dataset
XG1 <- boost_tree(trees = 30, stop_iter = 100) %>%set_engine("xgboost") %>%
  set_mode("classification") %>%fit(Credit_Score ~ ., data = train_new)
XG_Train1 <- predict(XG1,new_data = train_new, type = "class")
XG_Test1 <- predict(XG1,new_data = test_new, type = "class")


# Oversampling
XG2 <- boost_tree(trees = 30, stop_iter = 100) %>%set_engine("xgboost") %>%
  set_mode("classification") %>%fit(Credit_Score ~ ., data = up_train)
XG_Train2 <- predict(XG2,new_data = up_train, type = "class")
XG_Test2 <- predict(XG2,new_data = test_new, type = "class")


# Undersampling
XG3 <- boost_tree(trees = 30, stop_iter = 100) %>%set_engine("xgboost") %>%
  set_mode("classification") %>%fit(Credit_Score ~ ., data = down_train)
XG_Train3 <- predict(XG3,new_data = down_train, type = "class")
XG_Test3 <- predict(XG3,new_data = test_new, type = "class")

5.3 Classification Model Evaluation

Train data sets 1 to 3 are used for evaluation purpose across the 3 models:
1) Scaled Data (original train_new): MLR, RF, XGBoost
2) Scaled Data + Oversampling (up_train): MLR, RF, XGBoost
3) Scaled Data + Undersampling (down_train): MLR, RF, XGBoost

However, Test sample, evaluation will be performed on only original data “test_new” for unbiased evaluation across all models attempted.
Altering the test data via over- or under-sampling could lead to biased evaluation and overly optimistic performance metrics, as the test set would no longer reflect the original class distribution.

Confusion Matrix

confusionMatrix(train_new$Credit_Score, MLR_Train1$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  7217  5091   172
##          1  4143 28230  4849
##          2  1247  8558 10494
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6563               
##                  95% CI : (0.6528, 0.6598)     
##     No Information Rate : 0.5983               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.413                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5725   0.6741   0.6764
## Specificity            0.9083   0.6803   0.8200
## Pos Pred Value         0.5783   0.7584   0.5170
## Neg Pred Value         0.9063   0.5836   0.8990
## Prevalence             0.1801   0.5983   0.2216
## Detection Rate         0.1031   0.4033   0.1499
## Detection Prevalence   0.1783   0.5317   0.2900
## Balanced Accuracy      0.7404   0.6772   0.7482

confusionMatrix(test_new$Credit_Score, MLR_Test1$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  3106  2182    60
##          1  1763 12120  2069
##          2   551  3697  4451
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6559               
##                  95% CI : (0.6505, 0.6613)     
##     No Information Rate : 0.6                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.412                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5731   0.6734   0.6764
## Specificity            0.9088   0.6807   0.8186
## Pos Pred Value         0.5808   0.7598   0.5117
## Neg Pred Value         0.9061   0.5815   0.9000
## Prevalence             0.1807   0.6000   0.2193
## Detection Rate         0.1035   0.4040   0.1484
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7409   0.6770   0.7475

confusionMatrix(up_train$Credit_Score, MLR_Train2$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 31023  5168  1031
##          1  7802 21397  8023
##          2  6240  5588 25394
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6968               
##                  95% CI : (0.6941, 0.6995)     
##     No Information Rate : 0.4036               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5453               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6884   0.6655   0.7372
## Specificity            0.9069   0.8010   0.8468
## Pos Pred Value         0.8335   0.5748   0.6822
## Neg Pred Value         0.8114   0.8555   0.8784
## Prevalence             0.4036   0.2879   0.3085
## Detection Rate         0.2778   0.1916   0.2274
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.7977   0.7332   0.7920

confusionMatrix(test_new$Credit_Score, MLR_Test2$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4520  701  127
##          1 3417 9111 3424
##          2 1432 1299 5968
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6479, 0.6587)     
##     No Information Rate : 0.3704               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.471                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.4824   0.8200   0.6270
## Specificity            0.9599   0.6378   0.8667
## Pos Pred Value         0.8452   0.5712   0.6861
## Neg Pred Value         0.8033   0.8576   0.8333
## Prevalence             0.3123   0.3704   0.3173
## Detection Rate         0.1507   0.3037   0.1989
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7212   0.7289   0.7468

confusionMatrix(down_train$Credit_Score, MLR_Train3$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 10389  1760   331
##          1  2563  7224  2693
##          2  2076  1864  8540
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6985               
##                  95% CI : (0.6939, 0.7032)     
##     No Information Rate : 0.4014               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5478               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6913   0.6659   0.7385
## Specificity            0.9067   0.8023   0.8477
## Pos Pred Value         0.8325   0.5788   0.6843
## Neg Pred Value         0.8141   0.8548   0.8788
## Prevalence             0.4014   0.2897   0.3089
## Detection Rate         0.2775   0.1929   0.2281
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.7990   0.7341   0.7931

confusionMatrix(test_new$Credit_Score, MLR_Test3$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4522  705  121
##          1 3411 9117 3424
##          2 1431 1310 5958
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6478, 0.6586)     
##     No Information Rate : 0.3711               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4708               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.4829   0.8190   0.6270
## Specificity            0.9600   0.6377   0.8663
## Pos Pred Value         0.8455   0.5715   0.6849
## Neg Pred Value         0.8036   0.8566   0.8336
## Prevalence             0.3121   0.3711   0.3168
## Detection Rate         0.1507   0.3039   0.1986
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7214   0.7284   0.7466

confusionMatrix(train_new$Credit_Score, RF_Train1$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 12448    32     0
##          1    45 37092    85
##          2     0    54 20245
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9969               
##                  95% CI : (0.9965, 0.9973)     
##     No Information Rate : 0.5311               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9949               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9964   0.9977   0.9958
## Specificity            0.9994   0.9960   0.9989
## Pos Pred Value         0.9974   0.9965   0.9973
## Neg Pred Value         0.9992   0.9974   0.9983
## Prevalence             0.1785   0.5311   0.2904
## Detection Rate         0.1778   0.5299   0.2892
## Detection Prevalence   0.1783   0.5317   0.2900
## Balanced Accuracy      0.9979   0.9969   0.9974

confusionMatrix(test_new$Credit_Score, RF_Test1$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  4066  1245    37
##          1  1160 12957  1835
##          2    69  1585  7045
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8023               
##                  95% CI : (0.7977, 0.8068)     
##     No Information Rate : 0.5263               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6719               
##                                                
##  Mcnemar's Test P-Value : 0.0000008754         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7679   0.8207   0.7901
## Specificity            0.9481   0.7893   0.9215
## Pos Pred Value         0.7603   0.8122   0.8099
## Neg Pred Value         0.9501   0.7985   0.9121
## Prevalence             0.1765   0.5263   0.2972
## Detection Rate         0.1355   0.4319   0.2348
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.8580   0.8050   0.8558

confusionMatrix(up_train$Credit_Score, RF_Train2$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 37217     5     0
##          1   155 36904   163
##          2     0    16 37206
## 
## Overall Statistics
##                                                
##                Accuracy : 0.997                
##                  95% CI : (0.9966, 0.9973)     
##     No Information Rate : 0.3347               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9954               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9959   0.9994   0.9956
## Specificity            0.9999   0.9957   0.9998
## Pos Pred Value         0.9999   0.9915   0.9996
## Neg Pred Value         0.9979   0.9997   0.9978
## Prevalence             0.3347   0.3307   0.3346
## Detection Rate         0.3333   0.3305   0.3332
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.9979   0.9976   0.9977

confusionMatrix(test_new$Credit_Score, RF_Test2$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  4398   933    17
##          1  1474 12451  2027
##          2    75  1223  7401
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8084               
##                  95% CI : (0.8039, 0.8128)     
##     No Information Rate : 0.4869               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6881               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7395   0.8524   0.7836
## Specificity            0.9605   0.7725   0.9368
## Pos Pred Value         0.8224   0.7805   0.8508
## Neg Pred Value         0.9372   0.8465   0.9040
## Prevalence             0.1982   0.4869   0.3148
## Detection Rate         0.1466   0.4150   0.2467
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.8500   0.8125   0.8602

confusionMatrix(down_train$Credit_Score, RF_Train3$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 12477     3     0
##          1    43 12408    29
##          2     0    13 12467
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9976               
##                  95% CI : (0.9971, 0.9981)     
##     No Information Rate : 0.3344               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9965               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9966   0.9987   0.9977
## Specificity            0.9999   0.9971   0.9995
## Pos Pred Value         0.9998   0.9942   0.9990
## Neg Pred Value         0.9983   0.9994   0.9988
## Prevalence             0.3344   0.3318   0.3338
## Detection Rate         0.3333   0.3314   0.3330
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.9982   0.9979   0.9986

confusionMatrix(test_new$Credit_Score, RF_Test3$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  4720   586    42
##          1  2674 10753  2525
##          2   384   961  7354
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7609               
##                  95% CI : (0.7561, 0.7657)     
##     No Information Rate : 0.41                 
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6264               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6068   0.8742   0.7413
## Specificity            0.9717   0.7063   0.9330
## Pos Pred Value         0.8826   0.6741   0.8454
## Neg Pred Value         0.8759   0.8899   0.8795
## Prevalence             0.2593   0.4100   0.3307
## Detection Rate         0.1573   0.3584   0.2451
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7893   0.7902   0.8371

confusionMatrix(train_new$Credit_Score, XG_Train1$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  9019  3225   236
##          1  3946 29124  4152
##          2  1283  4820 14196
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7477               
##                  95% CI : (0.7445, 0.7509)     
##     No Information Rate : 0.531                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5825               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6330   0.7836   0.7639
## Specificity            0.9379   0.7534   0.8813
## Pos Pred Value         0.7227   0.7824   0.6993
## Neg Pred Value         0.9091   0.7546   0.9117
## Prevalence             0.2035   0.5310   0.2655
## Detection Rate         0.1288   0.4161   0.2028
## Detection Prevalence   0.1783   0.5317   0.2900
## Balanced Accuracy      0.7855   0.7685   0.8226

confusionMatrix(test_new$Credit_Score, XG_Test1$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0  3748  1497   103
##          1  1956 12127  1869
##          2   602  2274  5823
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7233               
##                  95% CI : (0.7182, 0.7283)     
##     No Information Rate : 0.53                 
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5429               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5944   0.7628   0.7470
## Specificity            0.9325   0.7287   0.8705
## Pos Pred Value         0.7008   0.7602   0.6694
## Neg Pred Value         0.8962   0.7315   0.9074
## Prevalence             0.2102   0.5300   0.2598
## Detection Rate         0.1249   0.4042   0.1941
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7634   0.7458   0.8087

confusionMatrix(up_train$Credit_Score, XG_Train2$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 31962  4594   666
##          1  6965 23608  6649
##          2  3959  3425 29838
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7649               
##                  95% CI : (0.7624, 0.7673)     
##     No Information Rate : 0.3841               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6473               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7453   0.7465   0.8031
## Specificity            0.9235   0.8299   0.9009
## Pos Pred Value         0.8587   0.6342   0.8016
## Neg Pred Value         0.8533   0.8923   0.9017
## Prevalence             0.3841   0.2832   0.3327
## Detection Rate         0.2862   0.2114   0.2672
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.8344   0.7882   0.8520

confusionMatrix(test_new$Credit_Score, XG_Test2$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4511  724  113
##          1 3084 9864 3004
##          2 1006  914 6779
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7052               
##                  95% CI : (0.7, 0.7103)        
##     No Information Rate : 0.3834               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5459               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5245   0.8576   0.6850
## Specificity            0.9609   0.6709   0.9045
## Pos Pred Value         0.8435   0.6184   0.7793
## Neg Pred Value         0.8341   0.8834   0.8537
## Prevalence             0.2867   0.3834   0.3299
## Detection Rate         0.1504   0.3288   0.2260
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7427   0.7642   0.7948

confusionMatrix(down_train$Credit_Score, XG_Train3$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2
##          0 10686  1576   218
##          1  2230  7954  2296
##          2  1365  1059 10056
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7665               
##                  95% CI : (0.7621, 0.7707)     
##     No Information Rate : 0.3814               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6497               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.7483   0.7512   0.8000
## Specificity            0.9225   0.8314   0.9025
## Pos Pred Value         0.8562   0.6373   0.8058
## Neg Pred Value         0.8560   0.8944   0.8993
## Prevalence             0.3814   0.2828   0.3357
## Detection Rate         0.2854   0.2124   0.2686
## Detection Prevalence   0.3333   0.3333   0.3333
## Balanced Accuracy      0.8354   0.7913   0.8513

confusionMatrix(test_new$Credit_Score, XG_Test3$.pred_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0 4534  702  112
##          1 3172 9597 3183
##          2 1038  873 6788
## 
## Overall Statistics
##                                                
##                Accuracy : 0.6973               
##                  95% CI : (0.6921, 0.7025)     
##     No Information Rate : 0.3724               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5362               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5185   0.8590   0.6732
## Specificity            0.9617   0.6625   0.9040
## Pos Pred Value         0.8478   0.6016   0.7803
## Neg Pred Value         0.8292   0.8879   0.8453
## Prevalence             0.2915   0.3724   0.3361
## Detection Rate         0.1511   0.3199   0.2263
## Detection Prevalence   0.1783   0.5318   0.2900
## Balanced Accuracy      0.7401   0.7607   0.7886

# Create vectors of model names and sampling methods
models <- c("MLR", "RF", "XG")
sampling <- c("Original", "Oversampling", "Undersampling")

# Compute metrics function
compute_all_metrics <- function(actual, predicted) {
  # Ensure factors have the same levels
  actual <- factor(actual, levels = c("0", "1", "2"))
  predicted <- factor(predicted, levels = c("0", "1", "2"))
  
  # Compute confusion matrix
  cm <- confusionMatrix(predicted, actual)
  
  # Extract metrics
  metrics <- c(
    Accuracy = cm$overall["Accuracy"],
    F1_Score = mean(cm$byClass[, "F1"], na.rm = TRUE),
    Precision = mean(cm$byClass[, "Precision"], na.rm = TRUE),
    Recall = mean(cm$byClass[, "Recall"], na.rm = TRUE),
    Specificity = mean(cm$byClass[, "Specificity"], na.rm = TRUE)
  )
  
  return(metrics)
}

# Initialize arrays
metrics_names <- c("Accuracy", "F1_Score", "Precision", "Recall", "Specificity")
train_metrics_array <- array(NA, 
                           dim = c(length(models), length(sampling), length(metrics_names)),
                           dimnames = list(models, sampling, metrics_names))
test_metrics_array <- array(NA, 
                          dim = c(length(models), length(sampling), length(metrics_names)),
                          dimnames = list(models, sampling, metrics_names))

# Fill arrays with metrics
for(i in 1:length(sampling)) {
  train_data <- switch(i,
                      train_new$Credit_Score,
                      up_train$Credit_Score,
                      down_train$Credit_Score)
  
  for(j in 1:length(models)) {
    train_pred <- get(paste0(models[j], "_Train", i))$.pred_class
    test_pred <- get(paste0(models[j], "_Test", i))$.pred_class
    
    train_metrics_array[j,i,] <- compute_all_metrics(train_data, train_pred)
    test_metrics_array[j,i,] <- compute_all_metrics(test_new$Credit_Score, test_pred)
  }
}

# Convert to long format
train_metrics_df <- as.data.frame.table(train_metrics_array, responseName = "Value") %>%
  rename(Model = Var1, 
         Sampling = Var2, 
         Metric = Var3)

test_metrics_df <- as.data.frame.table(test_metrics_array, responseName = "Value") %>%
  rename(Model = Var1, 
         Sampling = Var2, 
         Metric = Var3)


# Modified plotting function
create_metrics_plot <- function(data, title) {
  ggplot(data, aes(x = Model, y = Value * 100, fill = Sampling)) +
    facet_wrap(~Metric, ncol = 3, scales = "free_x") +
    geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
    geom_text(aes(label = sprintf("%.1f%%", Value * 100)),
              position = position_dodge(width = 0.9),
              vjust = -0.5,
              size = 3.0) +
    scale_fill_brewer(palette = "Set2") +
    labs(
      title = title,
      y = "Percentage",
      x = "Model"
    ) +
    theme_light() +
    theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
      axis.text.y = element_text(size = 10),
      strip.text = element_text(size = 12, face = "bold", color = "black"),
      strip.background = element_rect(fill = "lightgray", color = NA),
      legend.position = "bottom",
      legend.title = element_text(size = 12),
      legend.text = element_text(size = 11),
      legend.box = "horizontal",
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "gray90"),
      plot.margin = unit(c(1, 1, 4, 1), "lines"),
      panel.spacing = unit(2, "lines"), # Adjusted row gap here
      text = element_text(color = "black"),
      axis.text = element_text(color = "black")
    ) +
    scale_y_continuous(limits = c(0, 105),
                       breaks = seq(0, 100, 20))
}

# Create plots
p1 <- create_metrics_plot(train_metrics_df, "Training Metrics Comparison")
p2 <- create_metrics_plot(test_metrics_df, "Testing Metrics Comparison")

# Display plots in R
print(p1)

print(p2)

Model Performance Interpretation:

Random Forest (RF)

Nearly 100% accuracy & F1 score observed on training sample but only 75-80% on testing sample
This ~20-25% gap between training and testing performance is a classic indicator of overfitting.
The model has essentially “memorized” the training data rather than learning generalizable patterns.

XGBoost (XG)

Testing Accuracy & F1 score around 70-73% vs Training around 75%, indicates very good performance.
Minimal gap between Train vs. Test sample performance shows that XG model has better generalization capability as compared to RF model, and will be more reliable performance on new/ unseen data.

Multinomial Logistic Regression (MLR)

MLR models show the worst performance across all metrics vs. the other 2 models.
Such observation could suggest that the relationship between variables might be non-linear.

Comparison between Class Balancing methods

Class Balancing methods have insignificant impact on overall model performance.
For RF, slightly improved performance observed with over-sampling method on the Test sample.
For XG, despite Train sample shows slightly better performance for over/under-sampling, Test sample shows worse Accuracy, F1 score and Precision.
For MLR, both Train & Test samples shows slightly higher accuracy for over/under-sampling, however, overall performance still the worst compared to other models.

5.4 Conclusion

Credit scoring requires stable and reliable predictions, and XGBoost’s consistent performance makes it a strong candidate for production deployment. It is less likely to experience unexpected performance drops when applied to new customer data. Therefore, the XGBoost model trained on the original dataset without class balancing is selected as the final model, as class balancing does not provide significant benefits in this case.

6. Model Modeling & Evaluation - Credit Score Regression Model

For Regression Model, development of Credit Scorecard would be demonstrated.
As the main purpose of Credit Scorecard is for loan approval process, where the decision is either “Approve” or “Reject”, binary target has been created with Credit Score = Poor (=1) vs. Standard or Good (=0).

Benefits:

Simplicity and Clarity: A binary classification model simplifies the process by categorizing individuals into two distinct groups i.e. high-risk (poor) and lower-risk (standard/good). This clarity facilitates easier interpretation and decision-making for stakeholders i.e. to approve or reject the loan.

Flexibility in Score Band Creation: With the binary model, more specific score bands can be created based on the probability of being classified as poor. This allows for further segmentation and targeted decision-making, providing a more granular understanding of credit risk within the broader category i.e. collection strategy, interest offering differentiation, cross-selling purpose.

rawdata <- read.csv("C:/Users/sharo/OneDrive/Desktop/UM_Master_in_Data_Science/Programming for Data Science/Group Project/dataset_cleaned_v2.csv")

# Define custom mapping
creditscore_mapping <- c("Good" = 0, "Standard" = 1, "Poor" = 2)
creditmix_mapping <- c("Good" = 0, "Standard" = 1, "Bad" = 2)
pb_mapping <- c('Low_spent_Small_value_payments'= 5,
'Low_spent_Medium_value_payments'= 4,
'Low_spent_Large_value_payments'= 3,
'High_spent_Small_value_payments'= 2,
'High_spent_Medium_value_payments'= 1,
'High_spent_Large_value_payments'= 0)

# Apply the mapping and remove character & unnecessary column 
rawdata$Credit_Score <- unname(creditscore_mapping[rawdata$Credit_Score])
rawdata$Credit_Mix <- unname(creditmix_mapping[rawdata$Credit_Mix])
rawdata$Payment_Behaviour <- unname(pb_mapping[rawdata$Payment_Behaviour])
rawdata <- rawdata %>%select(-Customer_ID,-Month,-Occupation)%>%mutate(Credit_Score=as.factor(Credit_Score))

# Check if any column is of character type
has_character <- any(sapply(rawdata, is.character))
if (has_character) {
  print("There is a character column, please remove.")
} else {
  print("There is no character column, proceed.")
}

## [1] "There is no character column, proceed."

set.seed(12345)
df_prep <-  createDataPartition(y = rawdata$Credit_Score,p = 0.7, list= FALSE)

train_df <- rawdata[df_prep,]
test_df <- rawdata[-df_prep,]

6.1 Data Preparation for Modelling

Step 1: Binary target creation: Poor (=1) vs. Standard or Good (=0) & Data Scaling

train_new2 <- train_df %>%
            mutate(Credit_Score2 = if_else(Credit_Score == 2, 1, 0)) %>%   
            mutate(Credit_Score2 = as.factor(Credit_Score2)) %>%  
            select(-Credit_Score) 
recipe_obj <- recipe(Credit_Score2 ~ ., data = train_new2) %>%
            step_normalize(all_numeric_predictors())
recipe_obj <- prep(recipe_obj, training = train_new2)
train_new2 <- bake(recipe_obj, new_data = train_new2)

# Check if there are any missing values in the entire data frame
any_na <- any(is.na(train_new2))
if (any_na) {
  print("There is NA value in train set, please fix")
} else {
  print("There is no NA Value in train set, proceed.")
}

## [1] "There is no NA Value in train set, proceed."

test_new2 <- test_df %>%
            mutate(Credit_Score2 = if_else(Credit_Score == 2, 1, 0)) %>%   
            mutate(Credit_Score2 = as.factor(Credit_Score2)) %>%  
            select(-Credit_Score) 
recipe_obj <- recipe(Credit_Score2 ~ ., data = test_new2) %>%
            step_normalize(all_numeric_predictors())
recipe_obj <- prep(recipe_obj, training = test_new2)
test_new2 <- bake(recipe_obj, new_data = test_new2)

# Check if there are any missing values in the entire data frame
any_na <- any(is.na(test_new2))
if (any_na) {
  print("There is NA value in test set, please fix")
} else {
  print("There is no NA Value in test set, proceed.")
}

## [1] "There is no NA Value in test set, proceed."

Step 2: Data Processing & Variable Reduction

Feature selection based on IV: Threshold of IV > 0.2 being applied to help in filtering out irrelevant or less important variables that do not contribute meaningfully to the model. This reduces noise and potential overfitting, leading to a more effective model.
Next, independent variables with high correlation with each other will also be dropped, preserving the other one with higher IV to prevent multicollinearity:

Drop Monthly_Inhand_Salary due to lower IV and high correlation observed vs. Annual_Income.

# calculate variable's IV 
var_list <- train_new2 %>%
  select(-Credit_Score2) %>%
  names()

bin1 <- woebin(train_new2, 
              y = "Credit_Score2", 
              x = var_list, 
              positive = 1, 
              method = "tree", # opt for tree due to categorical variables
              bin_num_limit = 10)

## ℹ Creating woe binning ...

## ✔ Binning on 70001 rows and 30 columns in 00:00:23

# Calculate IV and order variables
iv <- map_df(bin1, ~pluck(.x, 10, 1)) %>%
    pivot_longer(everything(), names_to = "var", values_to = "iv")
iv <- iv[order(iv$iv, decreasing = TRUE), ] 

# Filter variables: IV >= 0.2 and exclude specified variables
iv_final <- iv %>% 
    filter(iv >= 0.2,
           !var %in% c("Monthly_Inhand_Salary")) %>%
    select(var)

# Add target variable
iv_final2 <- append(iv_final$var, "Credit_Score2")

# Display selected variables and their IVs
print("Selected Variables and their Information Values:")

## [1] "Selected Variables and their Information Values:"

selected_vars <- iv %>%
    filter(var %in% iv_final$var) %>%
    arrange(desc(iv))
print(knitr::kable(selected_vars))

## 
## 
## |var                    |        iv|
## |:----------------------|---------:|
## |Outstanding_Debt       | 1.2744588|
## |Interest_Rate          | 1.0757602|
## |Num_Credit_Inquiries   | 0.8840514|
## |Delay_from_due_date    | 0.7414860|
## |Credit_Mix             | 0.6800040|
## |Credit_History_Months  | 0.6680570|
## |Num_Credit_Card        | 0.6424321|
## |No_of_Loan             | 0.6046970|
## |Num_Bank_Accounts      | 0.4811811|
## |Min_Amount_Pymt_Ind    | 0.4318173|
## |Num_of_Delayed_Payment | 0.3673243|
## |Annual_Income          | 0.2671015|
## |Monthly_Balance        | 0.2061845|

# dataset with all IV variable and target Variable
iv_var_train <- train_new2[, iv_final2]
iv_var_test <- test_new2[, iv_final2]

var_list2 <- iv_var_train %>%
  select(-Credit_Score2) %>%
  names()

6.2 Data Modelling

Step 1: Model Fitting

Fitting of Generalized Linear Model (GLM) model to predict the binary outcome of Credit Score, using all predictors post variable reduction in Step 2.

m1 <- glm( Credit_Score2 ~ ., family = binomial(), data = iv_var_train)
summary(m1)

## 
## Call:
## glm(formula = Credit_Score2 ~ ., family = binomial(), data = iv_var_train)
## 
## Coefficients:
##                        Estimate Std. Error  z value             Pr(>|z|)    
## (Intercept)            -1.11406    0.01012 -110.065 < 0.0000000000000002 ***
## Outstanding_Debt        0.09094    0.01445    6.292  0.00000000031348802 ***
## Interest_Rate           0.55220    0.01490   37.056 < 0.0000000000000002 ***
## Num_Credit_Inquiries    0.41867    0.01433   29.217 < 0.0000000000000002 ***
## Delay_from_due_date     0.36932    0.01285   28.746 < 0.0000000000000002 ***
## Credit_Mix             -0.50941    0.02439  -20.886 < 0.0000000000000002 ***
## Credit_History_Months  -0.11739    0.01472   -7.975  0.00000000000000153 ***
## Num_Credit_Card         0.32867    0.01182   27.815 < 0.0000000000000002 ***
## No_of_Loan              0.11079    0.01418    7.815  0.00000000000000551 ***
## Num_Bank_Accounts       0.03480    0.01444    2.410              0.01596 *  
## Min_Amount_Pymt_Ind    -0.05410    0.01745   -3.100              0.00194 ** 
## Num_of_Delayed_Payment -0.02309    0.01529   -1.511              0.13085    
## Annual_Income          -0.12447    0.01493   -8.339 < 0.0000000000000002 ***
## Monthly_Balance         0.08669    0.01607    5.393  0.00000006919048661 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 84300  on 70000  degrees of freedom
## Residual deviance: 67401  on 69987  degrees of freedom
## AIC: 67429
## 
## Number of Fisher Scoring iterations: 4

vif_values <- vif(m1)
print(vif_values)

##       Outstanding_Debt          Interest_Rate   Num_Credit_Inquiries 
##               2.600563               2.378558               2.118945 
##    Delay_from_due_date             Credit_Mix  Credit_History_Months 
##               1.893779               6.616354               2.282167 
##        Num_Credit_Card             No_of_Loan      Num_Bank_Accounts 
##               1.433335               2.323659               2.174574 
##    Min_Amount_Pymt_Ind Num_of_Delayed_Payment          Annual_Income 
##               2.870790               2.416286               2.039456 
##        Monthly_Balance 
##               2.345378

Step 2: Scorecard Creation

Before creating the scorecard, the variables Credit_Mix and Num_of_Delayed_Payment were excluded due to multicollinearity and lack of statistical significance, respectively. Credit_Mix had a Variance Inflation Factor (VIF) of 6.6, indicating moderate multicollinearity, which can inflate regression coefficient variance and compromise model reliability. Num_of_Delayed_Payment had a p-value of 0.13085, above the standard threshold of 0.05, suggesting it is not a significant predictor of the target variable. Removing these variables enhances model interpretability, reduces the risk of overfitting, and improves scorecard stability.

The Weight of Evidence (WOE) bins for predictor variables were then created, forming the basis for the credit scorecard, which was generated using the WOE-transformed data and a logistic regression model, with a base score of 600.

#Remove variable with P-value>0.05 and vif>5 from the var_list2, left with only 11 variables
var_list2 <- setdiff(var_list2, c("Num_of_Delayed_Payment","Credit_Mix"))
iv_var_train2<-iv_var_train%>%select(-Num_of_Delayed_Payment,-Credit_Mix)
#Train again the model removing the above variable
m2 <- glm( Credit_Score2 ~ ., family = binomial(), data = iv_var_train2)
summary(m2)

## 
## Call:
## glm(formula = Credit_Score2 ~ ., family = binomial(), data = iv_var_train2)
## 
## Coefficients:
##                        Estimate Std. Error  z value             Pr(>|z|)    
## (Intercept)           -1.111447   0.010092 -110.136 < 0.0000000000000002 ***
## Outstanding_Debt       0.006852   0.013920    0.492                0.623    
## Interest_Rate          0.475849   0.014347   33.167 < 0.0000000000000002 ***
## Num_Credit_Inquiries   0.414445   0.014225   29.135 < 0.0000000000000002 ***
## Delay_from_due_date    0.286338   0.012205   23.461 < 0.0000000000000002 ***
## Credit_History_Months -0.105541   0.014630   -7.214   0.0000000000005437 ***
## Num_Credit_Card        0.298691   0.011690   25.551 < 0.0000000000000002 ***
## No_of_Loan             0.060274   0.013930    4.327   0.0000151163866710 ***
## Num_Bank_Accounts     -0.078943   0.013537   -5.832   0.0000000054903309 ***
## Min_Amount_Pymt_Ind   -0.209276   0.015720  -13.313 < 0.0000000000000002 ***
## Annual_Income         -0.112643   0.014898   -7.561   0.0000000000000401 ***
## Monthly_Balance        0.084244   0.016030    5.255   0.0000001476924288 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 84300  on 70000  degrees of freedom
## Residual deviance: 67972  on 69989  degrees of freedom
## AIC: 67996
## 
## Number of Fisher Scoring iterations: 4

bin_train <- woebin(train_new2, 
               y = "Credit_Score2", 
               x = var_list2,
               positive = 1, 
               method = "tree")

## ℹ Creating woe binning ...

## ✔ Binning on 70001 rows and 12 columns in 00:00:14

score_card <- scorecard(bin_train, 
                        m2,
                      points0 = 600, 
                      basepoints_eq0 = TRUE)
score_card

## $basepoints
##      variable    bin    woe points
##        <char> <lgcl> <lgcl>  <num>
## 1: basepoints     NA     NA      0
## 
## $Outstanding_Debt
##            variable         bin count count_distr   neg   pos   posprob
##              <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Outstanding_Debt [-Inf,-0.2) 35951   0.5135784 32094  3857 0.1072849
## 2: Outstanding_Debt  [-0.2,0.1) 12244   0.1749118  9003  3241 0.2647011
## 3: Outstanding_Debt   [0.1,1.1) 12803   0.1828974  3898  8905 0.6955401
## 4: Outstanding_Debt  [1.1, Inf)  9003   0.1286124  4707  4296 0.4771743
##           woe      bin_iv total_iv breaks is_special_values points
##         <num>       <num>    <num> <char>            <lgcl>  <num>
## 1: -1.2233059 0.557483949 1.274459   -0.2             FALSE     43
## 2: -0.1262024 0.002710392 1.274459    0.1             FALSE     43
## 3:  1.7216229 0.620238970 1.274459    1.1             FALSE     42
## 4:  0.8041071 0.094025537 1.274459    Inf             FALSE     42
## 
## $Interest_Rate
##         variable         bin count count_distr   neg   pos    posprob
##           <char>      <char> <int>       <num> <int> <int>      <num>
## 1: Interest_Rate [-Inf,-0.4) 30776  0.43965086 26659  4117 0.13377307
## 2: Interest_Rate    [-0.4,0)  6554  0.09362723  6020   534 0.08147696
## 3: Interest_Rate     [0,0.7) 16637  0.23766803 11395  5242 0.31508084
## 4: Interest_Rate  [0.7, Inf) 16034  0.22905387  5628 10406 0.64899588
##           woe      bin_iv total_iv breaks is_special_values points
##         <num>       <num>    <num> <char>            <lgcl>  <num>
## 1: -0.9725285 0.324395573  1.07576   -0.4             FALSE     76
## 2: -1.5269731 0.144780222  1.07576      0             FALSE     95
## 3:  0.1190020 0.003447832  1.07576    0.7             FALSE     38
## 4:  1.5101020 0.603136556  1.07576    Inf             FALSE     -9
## 
## $Num_Credit_Inquiries
##                variable         bin count count_distr   neg   pos   posprob
##                  <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Num_Credit_Inquiries [-Inf,-0.2) 35167   0.5023785 30875  4292 0.1220462
## 2: Num_Credit_Inquiries  [-0.2,0.6) 17431   0.2490107 11476  5955 0.3416327
## 3: Num_Credit_Inquiries   [0.6,1.2)  7492   0.1070270  3493  3999 0.5337694
## 4: Num_Credit_Inquiries  [1.2, Inf)  9911   0.1415837  3858  6053 0.6107355
##           woe     bin_iv  total_iv breaks is_special_values points
##         <num>      <num>     <num> <char>            <lgcl>  <num>
## 1: -1.0777204 0.44161036 0.8840514   -0.2             FALSE     75
## 2:  0.2394469 0.01495778 0.8840514    0.6             FALSE     35
## 3:  1.0307569 0.13062361 0.8840514    1.2             FALSE     12
## 4:  1.3458787 0.29685964 0.8840514    Inf             FALSE      2
## 
## $Delay_from_due_date
##               variable         bin count count_distr   neg   pos    posprob
##                 <char>      <char> <int>       <num> <int> <int>      <num>
## 1: Delay_from_due_date [-Inf,-1.1)  5463  0.07804174  5037   426 0.07797913
## 2: Delay_from_due_date [-1.1,-0.4) 25137  0.35909487 21374  3763 0.14969965
## 3: Delay_from_due_date  [-0.4,0.6) 26017  0.37166612 18116  7901 0.30368605
## 4: Delay_from_due_date  [0.6, Inf) 13384  0.19119727  5175  8209 0.61334429
##            woe     bin_iv total_iv breaks is_special_values points
##          <num>      <num>    <num> <char>            <lgcl>  <num>
## 1: -1.57465305 0.12653558 0.741486   -1.1             FALSE     75
## 2: -0.84148517 0.20588152 0.741486   -0.4             FALSE     60
## 3:  0.06566736 0.00162452 0.741486    0.6             FALSE     41
## 4:  1.35686532 0.40744440 0.741486    Inf             FALSE     14
## 
## $Credit_History_Months
##                 variable         bin count count_distr   neg   pos   posprob
##                   <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Credit_History_Months [-Inf,-0.4) 23968  0.34239511 12129 11839 0.4939503
## 2: Credit_History_Months  [-0.4,0.2) 17489  0.24983929 12473  5016 0.2868089
## 3: Credit_History_Months   [0.2,0.4)  3823  0.05461351  3207   616 0.1611300
## 4: Credit_History_Months  [0.4, Inf) 24721  0.35315210 21893  2828 0.1143967
##            woe       bin_iv total_iv breaks is_special_values points
##          <num>        <num>    <num> <char>            <lgcl>  <num>
## 1:  0.87127344 0.2955326876 0.668057   -0.4             FALSE     49
## 2: -0.01545995 0.0000595197 0.668057    0.2             FALSE     42
## 3: -0.75437069 0.0257830653 0.668057    0.4             FALSE     37
## 4: -1.15112365 0.3466817057 0.668057    Inf             FALSE     34
## 
## $Num_Credit_Card
##           variable       bin count count_distr   neg   pos   posprob
##             <char>    <char> <int>       <num> <int> <int>     <num>
## 1: Num_Credit_Card [-Inf,-1) 12514   0.1787689 11200  1314 0.1050024
## 2: Num_Credit_Card [-1,-0.5) 10110   0.1444265  8692  1418 0.1402572
## 3: Num_Credit_Card  [-0.5,1) 36980   0.5282782 25936 11044 0.2986479
## 4: Num_Credit_Card  [1, Inf) 10397   0.1485264  3874  6523 0.6273925
##            woe       bin_iv  total_iv breaks is_special_values points
##          <num>        <num>     <num> <char>            <lgcl>  <num>
## 1: -1.24736431 0.2003401691 0.6424321     -1             FALSE     69
## 2: -0.91768208 0.0963810664 0.6424321   -0.5             FALSE     62
## 3:  0.04172888 0.0009278876 0.6424321      1             FALSE     42
## 4:  1.41652038 0.3447829434 0.6424321    Inf             FALSE     12
## 
## $No_of_Loan
##      variable        bin count count_distr   neg   pos   posprob        woe
##        <char>     <char> <int>       <num> <int> <int>     <num>      <num>
## 1: No_of_Loan  [-Inf,-1) 15807   0.2258111 13974  1833 0.1159613 -1.1357709
## 2: No_of_Loan  [-1,-0.5) 11119   0.1588406  8453  2666 0.2397698 -0.2584686
## 3: No_of_Loan [-0.5,0.5) 21720   0.3102813 17034  4686 0.2157459 -0.3951585
## 4: No_of_Loan [0.5, Inf) 21355   0.3050671 10241 11114 0.5204402  0.9772799
##        bin_iv total_iv breaks is_special_values points
##         <num>    <num> <char>            <lgcl>  <num>
## 1: 0.21676833 0.604697     -1             FALSE     47
## 2: 0.01001233 0.604697   -0.5             FALSE     44
## 3: 0.04420788 0.604697    0.5             FALSE     44
## 4: 0.33370844 0.604697    Inf             FALSE     38
## 
## $Num_Bank_Accounts
##             variable        bin count count_distr   neg   pos   posprob
##               <char>     <char> <int>       <num> <int> <int>     <num>
## 1: Num_Bank_Accounts   [-Inf,0) 34941  0.49915001 29370  5571 0.1594402
## 2: Num_Bank_Accounts      [0,1) 18299  0.26141055 11656  6643 0.3630253
## 3: Num_Bank_Accounts    [1,1.5) 13040  0.18628305  7234  5806 0.4452454
## 4: Num_Bank_Accounts [1.5, Inf)  3721  0.05315638  1442  2279 0.6124698
##           woe     bin_iv  total_iv breaks is_special_values points
##         <num>      <num>     <num> <char>            <lgcl>  <num>
## 1: -0.7669256 0.24271269 0.4811811      0             FALSE     38
## 2:  0.3332161 0.03090239 0.4811811      1             FALSE     44
## 3:  0.6755733 0.09490216 0.4811811    1.5             FALSE     46
## 4:  1.3531793 0.11266384 0.4811811    Inf             FALSE     50
## 
## $Min_Amount_Pymt_Ind
##               variable                bin count count_distr   neg   pos
##                 <char>             <char> <int>       <num> <int> <int>
## 1: Min_Amount_Pymt_Ind [-Inf,0.823375414) 28283   0.4040371 24449  3834
## 2: Min_Amount_Pymt_Ind [0.823375414, Inf) 41718   0.5959629 25253 16465
##      posprob        woe    bin_iv  total_iv      breaks is_special_values
##        <num>      <num>     <num>     <num>      <char>            <lgcl>
## 1: 0.1355585 -0.9572071 0.2900677 0.4318173 0.823375414             FALSE
## 2: 0.3946738  0.4677655 0.1417496 0.4318173         Inf             FALSE
##    points
##     <num>
## 1:     28
## 2:     50
## 
## $Annual_Income
##         variable         bin count count_distr   neg   pos   posprob        woe
##           <char>      <char> <int>       <num> <int> <int>     <num>      <num>
## 1: Annual_Income   [-Inf,-1)  5691  0.08129884  3026  2665 0.4682833  0.7684360
## 2: Annual_Income   [-1,-0.8) 13076  0.18679733  8021  5055 0.3865861  0.4337883
## 3: Annual_Income [-0.8,-0.4) 14242  0.20345424 10967  3275 0.2299537 -0.3130993
## 4: Annual_Income  [-0.4,0.9) 24316  0.34736647 16607  7709 0.3170341  0.1280377
## 5: Annual_Income  [0.9, Inf) 12676  0.18108313 11081  1595 0.1258283 -1.0428846
##         bin_iv  total_iv breaks is_special_values points
##          <num>     <num> <char>            <lgcl>  <num>
## 1: 0.054101270 0.2671015     -1             FALSE     49
## 2: 0.038019467 0.2671015   -0.8             FALSE     46
## 3: 0.018572147 0.2671015   -0.4             FALSE     40
## 4: 0.005843768 0.2671015    0.9             FALSE     44
## 5: 0.150564880 0.2671015    Inf             FALSE     34
## 
## $Monthly_Balance
##           variable         bin count count_distr   neg   pos   posprob
##             <char>      <char> <int>       <num> <int> <int>     <num>
## 1: Monthly_Balance [-Inf,-0.5) 24996   0.3570806 15180  9816 0.3927028
## 2: Monthly_Balance [-0.5,-0.3) 10361   0.1480122  7068  3293 0.3178265
## 3: Monthly_Balance  [-0.3,0.7) 21972   0.3138812 16689  5283 0.2404424
## 4: Monthly_Balance  [0.7, Inf) 12672   0.1810260 10765  1907 0.1504893
##           woe      bin_iv  total_iv breaks is_special_values points
##         <num>       <num>     <num> <char>            <lgcl>  <num>
## 1:  0.4595085 0.081861581 0.2061845   -0.5             FALSE     40
## 2:  0.1316950 0.002636163 0.2061845   -0.3             FALSE     42
## 3: -0.2547822 0.019241691 0.2061845    0.7             FALSE     44
## 4: -0.8352953 0.102445093 0.2061845    Inf             FALSE     48

6.3 Regression Model Evaluation

The evaluation of the Credit Scorecard model was performed using several key metrics to assess its performance and generalization ability. The metrics calculated include AUC (Area Under the Curve), Gini coefficient, and KS (Kolmogorov-Smirnov) statistic. These metrics were calculated for both the training and testing datasets.

AUC (Area Under the Curve)
- Purpose: Measures model’s ability to distinguish between classes.
- Result: Training AUC = 0.77, Testing AUC = 0.77.
- Interpretation: Values > 0.7 indicate good performance and reliable predictions.
Gini Coefficient
- Purpose: Indicates model’s discriminatory power.
- Result: Training Gini = 0.55, Testing Gini = 0.54.
- Interpretation: Indicates good model performance which aligns well with the AUC values and further confirms the model’s ability to distinguish between the classes.
KS (Kolmogorov-Smirnov) Statistic
- Purpose: Evaluates separation between positive and negative classes.
- Result: Training KS = 0.41, Testing KS = 0.40.
- Interpretation: Good at distinguishing between good vs. bad and generalizing well to unseen data.
Population Stability Index (PSI)
- Purpose: Measures the stability of model scores over time by comparing distributions.
- Result: PSI = 0.0725.
- Interpretation: PSI values < 0.1 indicate a very stable population. While PSI is ideally performed on out-of-time data to detect changes over different periods, in this case, it was tested on out-of-sample data to check for immediate stability. The result suggests that the model’s score distribution between the training and testing sets is stable, indicating good model performance.

# 1. Calculate scores for both sets
score_train <- scorecard_ply(iv_var_train2, score_card, only_total_score = FALSE)
iv_var_test2<-iv_var_test%>%select(-Num_of_Delayed_Payment,-Credit_Mix)
score_test <- scorecard_ply(iv_var_test2, score_card, only_total_score = FALSE)

# 2. Comprehensive Model Evaluation
# Function for creating performance metrics
get_performance_metrics <- function(actual, predicted_scores, dataset_name) {
    # ROC and AUC
    roc_obj <- roc(actual, predicted_scores, smooth = TRUE) # Add smooth option
    auc_val <- auc(roc_obj)
    
    # Gini Coefficient
    gini <- 2 * auc_val - 1
    
    # KS Statistic
    ks_stat <- max(abs(roc_obj$sensitivities - (1 - roc_obj$specificities)))
    
    return(data.frame(
        Dataset = dataset_name,
        AUC = auc_val,
        Gini = gini,
        KS = ks_stat
    ))
}

# Calculate metrics for both sets
train_metrics <- get_performance_metrics(iv_var_train2$Credit_Score2, 
                                      score_train$score, 
                                      "Training")
test_metrics <- get_performance_metrics(iv_var_test2$Credit_Score2, 
                                     score_test$score, 
                                     "Testing")

# Combine metrics
all_metrics <- rbind(train_metrics, test_metrics)

# 3. Calculate PSI
calculate_psi <- function(expected, actual, buckets = 10) {
    cuts <- quantile(expected, probs = seq(0, 1, length.out = buckets + 1))
    expected_bin <- cut(expected, breaks = cuts, include.lowest = TRUE)
    actual_bin <- cut(actual, breaks = cuts, include.lowest = TRUE)
    
    expected_dist <- table(expected_bin) / length(expected)
    actual_dist <- table(actual_bin) / length(actual)
    
    psi_values <- (actual_dist - expected_dist) * log(actual_dist / expected_dist)
    psi <- sum(psi_values[is.finite(psi_values)])
    
    return(psi)
}

psi_value <- calculate_psi(score_train$score, score_test$score)

# 4. Visualizations

# ROC Curves
roc_train <- roc(iv_var_train2$Credit_Score2, score_train$score, smooth = TRUE)
roc_test <- roc(iv_var_test2$Credit_Score2, score_test$score, smooth = TRUE)

# Plot ROC curves using ggplot2 for a prettier format
roc_train_df <- data.frame(tpr = roc_train$sensitivities, fpr = 1 - roc_train$specificities)
roc_test_df <- data.frame(tpr = roc_test$sensitivities, fpr = 1 - roc_test$specificities)

# Adding the stats result box back with AUC, Gini, and KS results arranged top to bottom
annotation <- data.frame(
    x = rep(0.6, 6),
    y = seq(0.4, 0.1, length.out = 6),
    label = c(paste("Train AUC:", round(train_metrics$AUC, 3)),
              paste("Train Gini:", round(train_metrics$Gini, 3)),
              paste("Train KS:", round(train_metrics$KS, 3)),
              paste("Test AUC:", round(test_metrics$AUC, 3)),
              paste("Test Gini:", round(test_metrics$Gini, 3)),
              paste("Test KS:", round(test_metrics$KS, 3)))
)

ggplot() +
    geom_line(data = roc_train_df, aes(x = fpr, y = tpr), color = "blue", lwd = 1) +  # Slimmer line
    geom_line(data = roc_test_df, aes(x = fpr, y = tpr), color = "red", lwd = 1) +   # Slimmer line
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
    geom_text(data = annotation, aes(x = x, y = y, label = label), color = "black", size = 4, hjust = 0) +
    labs(title = "ROC Curves Comparison",
         x = "False Positive Rate",
         y = "True Positive Rate") +
    theme_minimal() +
    scale_color_manual(name = "Dataset",
                       values = c("blue", "red"),
                       labels = c("Training", "Testing")) +
    theme(legend.position = "bottom")

# Score Distribution Plot using ggplot2 with PSI
score_dist_data <- data.frame(
    Score = c(score_train$score, score_test$score),
    Dataset = factor(c(rep("Training", nrow(score_train)),
                      rep("Testing", nrow(score_test))))
)

ggplot(score_dist_data, aes(x = Score, fill = Dataset)) +
    geom_density(alpha = 0.5) +
    theme_minimal() +
    labs(title = paste("Score Distribution Comparison (PSI:", format(round(psi_value, 4), nsmall = 4), ")"),
         x = "Score",
         y = "Density") +
    scale_fill_manual(values = c("blue", "red"))

Score vs. Poor Rate

The scatter plot shows that there is a clear trend where higher scores are associated with lower predicted Poor rate %. The plot shows a high density of data points at higher predicted probabilities for lower scores. This suggests that the model is effectively capturing the risk associated with lower scores.

The box plot reveals a clear trend where category 1 (poor credit scores) has a lower median score and wider IQR compared to category 0 (better credit scores). This visual distinction aligns well with the model’s strong performance metrics, indicating effective risk differentiation.

# Data for the plot
predicted_probs <- predict(m2, iv_var_train2, type = "response")
score_data <- data.frame(Score = score_train$score, Predicted_Prob = predicted_probs)

# Enhanced Scatter Plot
ggplot(score_data, aes(x = Score, y = Predicted_Prob)) +
    geom_point(alpha = 0.5, color = "blue", size = 2) + # Change point color and size
    geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear trend line
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)
    ) +
    labs(
        title = "Scores vs. Predicted Poor Rate %",
        x = "Score",
        y = "Predicted Poor Rate %"
    )

## `geom_smooth()` using formula = 'y ~ x'

# Data for the plot
score_data <- data.frame(Score = score_train$score, Credit_Score2 = iv_var_train2$Credit_Score2)

# Enhanced Box Plot
ggplot(score_data, aes(x = as.factor(Credit_Score2), y = Score, fill = as.factor(Credit_Score2))) +
    geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) + # Customize outliers
    scale_fill_brewer(palette = "Set3") + # Use a color palette
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)
    ) +
    labs(
        title = "Scores by Credit Score Categories",
        x = "Credit Score Categories",
        y = "Score"
    )

#Distribution of Predicted Score
ggplot(score_data, aes(x = Score)) +
    geom_histogram(binwidth = 10, fill = "#ADD8E6", color = "black", alpha = 0.7) +
    theme_minimal() +
    labs(title = "Distribution of Predicted Scores",
         x = "Score",
         y = "Frequency")

# Extract coefficients from the model
coef_data <- as.data.frame(summary(m2)$coefficients)
coef_data$Variable <- rownames(coef_data)
colnames(coef_data) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)", "Variable")

# Exclude the intercept
coef_data <- coef_data[coef_data$Variable != "(Intercept)", ]

# Order by Estimate and select the top 5
coef_data <- coef_data[order(-coef_data$Estimate), ]
top_5_coef_data <- head(coef_data, 5)

# Plot the Bar Chart
ggplot(top_5_coef_data, aes(x = reorder(Variable, Estimate), y = Estimate, fill = Variable)) +
    geom_bar(stat = "identity", alpha = 0.7) +
    coord_flip() +
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.position = "none"  # Hide the legend
    ) +
    labs(
        title = "Top 5 Variable Strengths in the Scorecard",
        x = "Variables",
        y = "Strength (Coefficient Estimate)"
    ) +
    scale_fill_brewer(palette = "Set3") +  # Optional: Use a color palette
    ylim(0, 0.5)

XG Boost Integration:

A combination of XGBoost and Logistic Regression is explored for further model refinement. Initially, an XGBoost model is trained to predict the probability of the target variable, effectively capturing complex patterns within the data. The predictions from the XGBoost model are then used as input for a logistic regression model. This approach leverages the predictive power of XGBoost along with the interpretability of Logistic Regression.

Performance: The integration of XGBoost has been observed to significantly enhance model performance, particularly in terms of AUC (88%), Gini (76%), and KS statistics (63%), for both training and testing datasets.

In the context of creating a credit scorecard for various credit strategies, the original Logistic Regression model would be more suitable as it produces a more spread-out distribution of predicted scores, allowing more flexible strategy implementation based on different score buckets.

XGBoost intergrated model tends to generate scores that cluster more tightly, showing higher clustering at high scores, despite lack flexibility, it is highly predictive and can be a great solution for loan approval assessment, depending on the bank risk appetite.

# 1. Select the variables 
train_vars <- names(iv_var_train2 %>% select(-Credit_Score2))

# 2. Convert to matrix
train_matrix <- as.matrix(iv_var_train2 %>% select(-Credit_Score2))
test_matrix <- as.matrix(iv_var_test2 %>% select(-Credit_Score2))

# 3. Create target variables
train_target <- as.numeric(as.character(iv_var_train2$Credit_Score2))
test_target <- as.numeric(as.character(iv_var_test2$Credit_Score2))

# 4. Create DMatrix objects
dtrain <- xgb.DMatrix(data = train_matrix, label = train_target)
dtest <- xgb.DMatrix(data = test_matrix, label = test_target)

# 5. Set XGBoost parameters
params <- list(
    objective = "binary:logistic",
    eval_metric = "auc",
    eta = 0.1,
    max_depth = 6,
    min_child_weight = 1,
    subsample = 0.8,
    colsample_bytree = 0.8
)

# 7. Train XGBoost model
set.seed(12345)
xgb_model <- xgb.train(
    params = params,
    data = dtrain,
    nrounds = 100,
    watchlist = list(train = dtrain, test = dtest),
    early_stopping_rounds = 10,
    verbose = 1
)

## [1]  train-auc:0.826324  test-auc:0.823245 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
## 
## [2]  train-auc:0.846951  test-auc:0.843589 
## [3]  train-auc:0.844459  test-auc:0.842036 
## [4]  train-auc:0.849353  test-auc:0.847870 
## [5]  train-auc:0.850215  test-auc:0.849140 
## [6]  train-auc:0.851124  test-auc:0.850050 
## [7]  train-auc:0.853734  test-auc:0.852037 
## [8]  train-auc:0.854966  test-auc:0.852981 
## [9]  train-auc:0.854992  test-auc:0.852579 
## [10] train-auc:0.855786  test-auc:0.853195 
## [11] train-auc:0.857319  test-auc:0.854595 
## [12] train-auc:0.858093  test-auc:0.855209 
## [13] train-auc:0.860979  test-auc:0.857076 
## [14] train-auc:0.861954  test-auc:0.857998 
## [15] train-auc:0.862942  test-auc:0.858280 
## [16] train-auc:0.864723  test-auc:0.859675 
## [17] train-auc:0.867083  test-auc:0.861602 
## [18] train-auc:0.867633  test-auc:0.861877 
## [19] train-auc:0.868723  test-auc:0.862319 
## [20] train-auc:0.869359  test-auc:0.862614 
## [21] train-auc:0.870649  test-auc:0.863096 
## [22] train-auc:0.870843  test-auc:0.863090 
## [23] train-auc:0.871917  test-auc:0.864180 
## [24] train-auc:0.872689  test-auc:0.864622 
## [25] train-auc:0.873753  test-auc:0.865416 
## [26] train-auc:0.874480  test-auc:0.865953 
## [27] train-auc:0.875185  test-auc:0.866507 
## [28] train-auc:0.875637  test-auc:0.866634 
## [29] train-auc:0.876062  test-auc:0.866869 
## [30] train-auc:0.876747  test-auc:0.867242 
## [31] train-auc:0.877028  test-auc:0.867539 
## [32] train-auc:0.877531  test-auc:0.867816 
## [33] train-auc:0.878295  test-auc:0.868334 
## [34] train-auc:0.879271  test-auc:0.869032 
## [35] train-auc:0.879655  test-auc:0.869175 
## [36] train-auc:0.880007  test-auc:0.869433 
## [37] train-auc:0.880355  test-auc:0.869704 
## [38] train-auc:0.880658  test-auc:0.869838 
## [39] train-auc:0.881000  test-auc:0.870132 
## [40] train-auc:0.881428  test-auc:0.870334 
## [41] train-auc:0.881907  test-auc:0.870667 
## [42] train-auc:0.882200  test-auc:0.870814 
## [43] train-auc:0.882636  test-auc:0.871176 
## [44] train-auc:0.882954  test-auc:0.871311 
## [45] train-auc:0.883059  test-auc:0.871166 
## [46] train-auc:0.883301  test-auc:0.871234 
## [47] train-auc:0.884098  test-auc:0.871741 
## [48] train-auc:0.884338  test-auc:0.871931 
## [49] train-auc:0.884803  test-auc:0.872190 
## [50] train-auc:0.885485  test-auc:0.872467 
## [51] train-auc:0.885773  test-auc:0.872629 
## [52] train-auc:0.885985  test-auc:0.872784 
## [53] train-auc:0.886490  test-auc:0.872982 
## [54] train-auc:0.886761  test-auc:0.873067 
## [55] train-auc:0.887473  test-auc:0.873386 
## [56] train-auc:0.887783  test-auc:0.873531 
## [57] train-auc:0.887985  test-auc:0.873722 
## [58] train-auc:0.888125  test-auc:0.873835 
## [59] train-auc:0.888689  test-auc:0.874100 
## [60] train-auc:0.888995  test-auc:0.874326 
## [61] train-auc:0.889294  test-auc:0.874491 
## [62] train-auc:0.889463  test-auc:0.874627 
## [63] train-auc:0.890343  test-auc:0.875128 
## [64] train-auc:0.891154  test-auc:0.875499 
## [65] train-auc:0.891396  test-auc:0.875623 
## [66] train-auc:0.891849  test-auc:0.875668 
## [67] train-auc:0.892395  test-auc:0.876022 
## [68] train-auc:0.892721  test-auc:0.876162 
## [69] train-auc:0.892849  test-auc:0.876251 
## [70] train-auc:0.893146  test-auc:0.876275 
## [71] train-auc:0.893314  test-auc:0.876287 
## [72] train-auc:0.893575  test-auc:0.876350 
## [73] train-auc:0.893780  test-auc:0.876426 
## [74] train-auc:0.893955  test-auc:0.876570 
## [75] train-auc:0.894310  test-auc:0.876700 
## [76] train-auc:0.894851  test-auc:0.877130 
## [77] train-auc:0.894968  test-auc:0.877193 
## [78] train-auc:0.895287  test-auc:0.877185 
## [79] train-auc:0.895386  test-auc:0.877175 
## [80] train-auc:0.896052  test-auc:0.877651 
## [81] train-auc:0.896237  test-auc:0.877643 
## [82] train-auc:0.896665  test-auc:0.877772 
## [83] train-auc:0.897000  test-auc:0.877993 
## [84] train-auc:0.897348  test-auc:0.878158 
## [85] train-auc:0.897516  test-auc:0.878255 
## [86] train-auc:0.897955  test-auc:0.878480 
## [87] train-auc:0.898052  test-auc:0.878518 
## [88] train-auc:0.898434  test-auc:0.878778 
## [89] train-auc:0.898657  test-auc:0.878882 
## [90] train-auc:0.898803  test-auc:0.879008 
## [91] train-auc:0.899105  test-auc:0.879207 
## [92] train-auc:0.899615  test-auc:0.879477 
## [93] train-auc:0.899841  test-auc:0.879579 
## [94] train-auc:0.899992  test-auc:0.879628 
## [95] train-auc:0.900416  test-auc:0.879832 
## [96] train-auc:0.900559  test-auc:0.879850 
## [97] train-auc:0.900676  test-auc:0.879821 
## [98] train-auc:0.900997  test-auc:0.879983 
## [99] train-auc:0.901348  test-auc:0.880127 
## [100]    train-auc:0.901820  test-auc:0.880399

# Continue with predictions and scorecard creation...
train_pred <- predict(xgb_model, dtrain)
test_pred <- predict(xgb_model, dtest)

# Create logistic model for scorecard
train_data <- data.frame(
    Credit_Score2 = iv_var_train2$Credit_Score2,
    xgb_pred = train_pred
)

logistic_model <- glm(Credit_Score2 ~ xgb_pred, 
                     family = binomial(), 
                     data = train_data)

# Create WOE bins
bin_xgb <- woebin(iv_var_train2, 
                  y = "Credit_Score2", 
                  x = train_vars,
                  positive = 1, 
                  method = "tree")

## ✔ Binning on 70001 rows and 12 columns in 00:00:10

# Create scorecard
score_card_xgb <- scorecard(bin_xgb, 
                           logistic_model,
                           points0 = 600, 
                           odds0 = 50,
                           pdo = 20,
                           basepoints_eq0 = TRUE)

# Convert predictions to scores 
# Adjust the spread function for max 500
set.seed(123)

spread_scores <- function(pred, min_score = 300, max_score = 600) {
    # Add small noise
    pred_with_noise <- pred + rnorm(length(pred), 0, 0.02)
    
    # Use logistic transformation for more natural curve
    transformed <- 1 / (1 + exp(-5 * (pred_with_noise - 0.5)))
    
    # Scale to desired range
    scaled <- (transformed - min(transformed)) / (max(transformed) - min(transformed))
    return(min_score + scaled * (max_score - min_score))
}

# Calculate scores
train_scores <- data.frame(
    score = spread_scores(1 - train_pred) 
)

test_scores <- data.frame(
    score = spread_scores(1 - test_pred)
)

# Calculate ROC curves
roc_train <- roc(iv_var_train2$Credit_Score2, train_pred)
roc_test <- roc(iv_var_test2$Credit_Score2, test_pred)

# Calculate metrics
train_metrics <- data.frame(
    AUC = auc(roc_train),
    Gini = 2 * auc(roc_train) - 1,
    KS = max(abs(roc_train$sensitivities - (1 - roc_train$specificities)))
)

test_metrics <- data.frame(
    AUC = auc(roc_test),
    Gini = 2 * auc(roc_test) - 1,
    KS = max(abs(roc_test$sensitivities - (1 - roc_test$specificities)))
)

# Create annotation data frame
annotation <- data.frame(
    x = rep(0.6, 6),
    y = seq(0.4, 0.1, length.out = 6),
    label = c(paste("Train AUC:", round(train_metrics$AUC, 3)),
              paste("Train Gini:", round(train_metrics$Gini, 3)),
              paste("Train KS:", round(train_metrics$KS, 3)),
              paste("Test AUC:", round(test_metrics$AUC, 3)),
              paste("Test Gini:", round(test_metrics$Gini, 3)),
              paste("Test KS:", round(test_metrics$KS, 3)))
)

# Create ROC curve data frames
roc_train_df <- data.frame(
    fpr = 1 - roc_train$specificities,
    tpr = roc_train$sensitivities
)

roc_test_df <- data.frame(
    fpr = 1 - roc_test$specificities,
    tpr = roc_test$sensitivities
)

# Create the plot
ggplot() +
    geom_line(data = roc_train_df, aes(x = fpr, y = tpr), color = "blue", lwd = 1) +  # Slimmer line
    geom_line(data = roc_test_df, aes(x = fpr, y = tpr), color = "red", lwd = 1) +   # Slimmer line
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
    geom_text(data = annotation, aes(x = x, y = y, label = label), color = "black", size = 4, hjust = 0) +
    labs(title = "ROC Curves Comparison",
         x = "False Positive Rate",
         y = "True Positive Rate") +
    theme_minimal() +
    scale_color_manual(name = "Dataset",
                       values = c("blue", "red"),
                       labels = c("Training", "Testing")) +
    theme(legend.position = "bottom")

# Calculate PSI
calculate_psi <- function(expected, actual, bins = 10) {
    # Create breaks for binning
    breaks <- seq(min(c(expected, actual)), max(c(expected, actual)), length.out = bins + 1)
    
    # Calculate distributions
    expected_dist <- hist(expected, breaks = breaks, plot = FALSE)$density
    actual_dist <- hist(actual, breaks = breaks, plot = FALSE)$density
    
    # Add small epsilon to avoid division by zero
    epsilon <- 1e-10
    expected_dist <- expected_dist + epsilon
    actual_dist <- actual_dist + epsilon
    
    # Calculate PSI
    psi <- sum((actual_dist - expected_dist) * log(actual_dist/expected_dist))
    return(psi)
}

# Calculate PSI value
psi_value <- calculate_psi(train_scores$score, test_scores$score)

# Create data frame for plotting
score_dist_data <- data.frame(
    Score = c(train_scores$score, test_scores$score),
    Dataset = factor(c(rep("Training", nrow(train_scores)),
                      rep("Testing", nrow(test_scores))))
)

# Create the distribution plot
ggplot(score_dist_data, aes(x = Score, fill = Dataset)) +
    geom_density(alpha = 0.5) +
    theme_minimal() +
    labs(title = paste("Score Distribution Comparison (PSI:", format(round(psi_value, 4), nsmall = 4), ")"),
         x = "Score",
         y = "Density") +
    scale_fill_manual(values = c("Training" = "blue", "Testing" = "red"))

# Create data frame for plotting
score_data <- data.frame(
    Score = train_scores$score,
    Predicted_Prob = train_pred * 100  # Convert to percentage
)

# Create enhanced scatter plot
ggplot(score_data, aes(x = Score, y = Predicted_Prob)) +
    geom_point(alpha = 0.5, color = "blue", size = 2) +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5, size = 14),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)
    ) +
    labs(
        title = "Scores vs. Predicted Poor Rate %",
        x = "Score",
        y = "Predicted Poor Rate %"
    ) +
    scale_y_continuous(
        limits = c(0, 100),
        breaks = seq(0, 100, by = 25)
    ) +
    scale_x_continuous(
        breaks = seq(floor(min(score_data$Score)), 
                    ceiling(max(score_data$Score)), 
                    by = 50)
    )

# 1. Box Plot
score_data <- data.frame(
    Score = train_scores$score, 
    Credit_Score2 = iv_var_train2$Credit_Score2
)

p1 <- ggplot(score_data, aes(x = as.factor(Credit_Score2), y = Score, 
                            fill = as.factor(Credit_Score2))) +
    geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +
    scale_fill_brewer(palette = "Set3") +
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5, size = 14),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.position = "none"
    ) +
    labs(
        title = "Scores by Credit Score Categories",
        x = "Credit Score Categories",
        y = "Score"
    )

# 2. Histogram of Scores
p2 <- ggplot(score_data, aes(x = Score)) +
    geom_histogram(binwidth = 10, fill = "#ADD8E6", color = "black", alpha = 0.7) +
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5, size = 14),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)
    ) +
    labs(
        title = "Distribution of Predicted Scores",
        x = "Score",
        y = "Frequency"
    )
print(p1)

print(p2)

Future Improvement:

Automated Binning with Manual Review: To further refine the risk model and ensure that risk rankings are intuitive and sensible, it would be wise to attempt hybrid binning process that combines automated binning with manual review and tweaking by domain experts.

This process will create initial bins based on statistical criteria, such as maximizing information value or minimizing within-bin variance. Following this automated binning, a manual review will be conducted to ensure that the bins make intuitive sense and accurately reflect the underlying risk patterns. This combination of automation and domain expert judgment is believed to create a more robust and interpretable scorecard.
Exploration of Alternative Models: Other potential refinement would be by exploring other types of models for scorecard development. This could include machine learning techniques such as random forests, or SVR. By comparing the performance of these alternative models with the current model, we can identify the most effective approach for accurately predicting risk and improving the overall performance of the scorecard.

7.Project Conclusion

This project successfully demonstrated the application of machine learning to enhance credit scoring and risk assessment in financial decision-making. By leveraging advanced tools in R, the framework covered essential stages of the data lifecycle, including data cleaning, exploratory data analysis (EDA), modeling, and evaluation. Among the 3 models tested, XGBoost emerged as the strongest for classifying customers into credit score bands (Good, Standard, and Poor), offering superior prediction performance and reliability.

The regression models built for scorecard development has delivered satisfactory results. Logistic regression model allows a more granular assessment of credit risk and suitable for various credit strategies such as collection strategy, interest rate assignment and cross-selling initiatives. XGBoost integrated model which is highly predictive, can be a great solution for loan approval assessment.

Overall, the project demonstrates the power of machine learning in empowering financial institutions with actionable insights, streamlining risk management, and enhancing customer segmentation. By achieving accurate classification and reliable risk predictions, the developed models offer a strong foundation for informed financial decisions and uncovering new business opportunities.