1. Introduction

In modern finance, credit scoring systems are essential tools for assessing personal credit risk. However, traditional methods rely on fixed metrics and models, which may not meet the needs of all users. This project applies machine learning and data analysis to improve credit scoring based on existing features. Our model combines user behavior data with key indicators to produce more accurate credit scores, helping financial institutions reduce risk and make better decisions.

2. Objectives

  1. To predict users’ credit scores and evaluate model performance.
  2. To predict users’ loan interest rates and compare model results.

3. Data Collection

The dataset used in this project was obtained from https://www.kaggle.com/datasets/parisrohan/credit-score-classification/data

It contains various user-related information, such as age, income, loan intent, and credit scores. The primary objective of this dataset is to analyze and predict credit risk. The dataset serves as a foundation for understanding user financial behavior and building predictive models.
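The structure dump below, and the summary statistics shown after the variable list, can be reproduced with two base R calls:

# Inspect the raw dataset: structure first, then summary statistics
train <- read.csv("train.csv", stringsAsFactors = FALSE)
str(train)      # produces the structure output below
summary(train)  # produces the summary statistics after the variable list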

## 'data.frame':    100000 obs. of  28 variables:
##  $ ID                      : num  5634 5635 5636 5637 5638 ...
##  $ Customer_ID             : chr  "CUS_0xd40" "CUS_0xd40" "CUS_0xd40" "CUS_0xd40" ...
##  $ Month                   : chr  "January" "February" "March" "April" ...
##  $ Name                    : chr  "Aaron Maashoh" "Aaron Maashoh" "Aaron Maashoh" "Aaron Maashoh" ...
##  $ Age                     : chr  "23" "23" "-500" "23" ...
##  $ SSN                     : chr  "821-00-0265" "821-00-0265" "821-00-0265" "821-00-0265" ...
##  $ Occupation              : chr  "Scientist" "Scientist" "Scientist" "Scientist" ...
##  $ Annual_Income           : chr  "19114.12" "19114.12" "19114.12" "19114.12" ...
##  $ Monthly_Inhand_Salary   : num  1825 NA NA NA 1825 ...
##  $ Num_Bank_Accounts       : int  3 3 3 3 3 3 3 3 2 2 ...
##  $ Num_Credit_Card         : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Interest_Rate           : int  3 3 3 3 3 3 3 3 6 6 ...
##  $ Num_of_Loan             : chr  "4" "4" "4" "4" ...
##  $ Type_of_Loan            : chr  "Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan" "Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan" "Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan" "Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan" ...
##  $ Delay_from_due_date     : int  3 -1 3 5 6 8 3 3 3 7 ...
##  $ Num_of_Delayed_Payment  : chr  "7" "" "7" "4" ...
##  $ Changed_Credit_Limit    : chr  "11.27" "11.27" "_" "6.27" ...
##  $ Num_Credit_Inquiries    : num  4 4 4 4 4 4 4 4 2 2 ...
##  $ Credit_Mix              : chr  "_" "Good" "Good" "Good" ...
##  $ Outstanding_Debt        : chr  "809.98" "809.98" "809.98" "809.98" ...
##  $ Credit_Utilization_Ratio: num  26.8 31.9 28.6 31.4 24.8 ...
##  $ Credit_History_Age      : chr  "22 Years and 1 Months" NA "22 Years and 3 Months" "22 Years and 4 Months" ...
##  $ Payment_of_Min_Amount   : chr  "No" "No" "No" "No" ...
##  $ Total_EMI_per_month     : num  49.6 49.6 49.6 49.6 49.6 ...
##  $ Amount_invested_monthly : chr  "80.41529543900253" "118.28022162236736" "81.699521264648" "199.4580743910713" ...
##  $ Payment_Behaviour       : chr  "High_spent_Small_value_payments" "Low_spent_Large_value_payments" "Low_spent_Medium_value_payments" "Low_spent_Small_value_payments" ...
##  $ Monthly_Balance         : chr  "312.49408867943663" "284.62916249607184" "331.2098628537912" "223.45130972736786" ...
##  $ Credit_Score            : chr  "Good" "Good" "Good" "Good" ...

This dataset contains 100,000 observations across 28 variables. Below is an overview of the variables:

  1. ID: Unique identifier for each record.

  2. Customer_ID: Unique customer identifier.

  3. Month: Month of the data record.

  4. Name: Customer’s name.

  5. Age: Customer’s age.

  6. SSN: Customer’s Social Security Number.

  7. Occupation: Profession of the customer.

  8. Annual_Income: Customer’s yearly income.

  9. Monthly_Inhand_Salary: Monthly net salary.

  10. Num_Bank_Accounts: Number of bank accounts held by the customer.

  11. Num_Credit_Card: Number of credit cards owned.

  12. Interest_Rate: Loan interest rate.

  13. Num_of_Loan: Total number of loans.

  14. Type_of_Loan: Categories of loans taken.

  15. Delay_from_due_date: Number of days payment was delayed.

  16. Num_of_Delayed_Payment: Count of delayed payments.

  17. Changed_Credit_Limit: Amount by which credit limit has changed.

  18. Num_Credit_Inquiries: Number of credit inquiries made.

  19. Credit_Mix: Quality of the credit mix (e.g., Good, Standard, Bad).

  20. Outstanding_Debt: Total outstanding debt.

  21. Credit_Utilization_Ratio: Ratio of credit utilization.

  22. Credit_History_Age: Duration of credit history in years and months.

  23. Payment_of_Min_Amount: Whether the minimum amount was paid (Yes/No).

  24. Total_EMI_per_month: Total Equated Monthly Installment (EMI) paid per month.

  25. Amount_invested_monthly: Monthly investment amount.

  26. Payment_Behaviour: Observed payment behavior.

  27. Monthly_Balance: Remaining balance at the end of the month.

  28. Credit_Score: (Target feature) Customer’s credit score category (“Good”, “Poor”, “Standard”).

##        ID         Customer_ID           Month               Name          
##  Min.   :  5634   Length:100000      Length:100000      Length:100000     
##  1st Qu.: 43133   Class :character   Class :character   Class :character  
##  Median : 80632   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 80632                                                           
##  3rd Qu.:118130                                                           
##  Max.   :155629                                                           
##                                                                           
##      Age                SSN             Occupation        Annual_Income     
##  Length:100000      Length:100000      Length:100000      Length:100000     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##  Monthly_Inhand_Salary Num_Bank_Accounts Num_Credit_Card   Interest_Rate    
##  Min.   :  303.6       Min.   :  -1.00   Min.   :   0.00   Min.   :   1.00  
##  1st Qu.: 1625.6       1st Qu.:   3.00   1st Qu.:   4.00   1st Qu.:   8.00  
##  Median : 3093.7       Median :   6.00   Median :   5.00   Median :  13.00  
##  Mean   : 4194.2       Mean   :  17.09   Mean   :  22.47   Mean   :  72.47  
##  3rd Qu.: 5957.4       3rd Qu.:   7.00   3rd Qu.:   7.00   3rd Qu.:  20.00  
##  Max.   :15204.6       Max.   :1798.00   Max.   :1499.00   Max.   :5797.00  
##  NA's   :15002                                                         
##  
##  Num_of_Loan        Type_of_Loan       Delay_from_due_date
##  Length:100000      Length:100000      Min.   :-5.00      
##  Class :character   Class :character   1st Qu.:10.00      
##  Mode  :character   Mode  :character   Median :18.00      
##                                        Mean   :21.07      
##                                        3rd Qu.:28.00      
##                                        Max.   :67.00      
##                                                           
##  Num_of_Delayed_Payment Changed_Credit_Limit Num_Credit_Inquiries
##  Length:100000          Length:100000        Min.   :   0.00     
##  Class :character       Class :character     1st Qu.:   3.00     
##  Mode  :character       Mode  :character     Median :   6.00     
##                                              Mean   :  27.75     
##                                              3rd Qu.:   9.00     
##                                              Max.   :2597.00     
##                                              NA's   :1965   
## 
##   Credit_Mix        Outstanding_Debt   Credit_Utilization_Ratio
##  Length:100000      Length:100000      Min.   :20.00           
##  Class :character   Class :character   1st Qu.:28.05           
##  Mode  :character   Mode  :character   Median :32.31           
##                                        Mean   :32.29           
##                                        3rd Qu.:36.50           
##                                        Max.   :50.00           
##                                                                
##  Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month
##  Length:100000      Length:100000         Min.   :    0.00   
##  Class :character   Class :character      1st Qu.:   30.31   
##  Mode  :character   Mode  :character      Median :   69.25   
##                                           Mean   : 1403.12   
##                                           3rd Qu.:  161.22   
##                                           Max.   :82331.00   
##                                                              
##  Amount_invested_monthly Payment_Behaviour  Monthly_Balance   
##  Length:100000           Length:100000      Length:100000     
##  Class :character        Class :character   Class :character  
##  Mode  :character        Mode  :character   Mode  :character  
##                                                               
##  Credit_Score      
##  Length:100000     
##  Class :character  
##  Mode  :character  

4. Data Preprocessing

Data cleaning is a crucial phase in preparing the dataset for analysis. For this dataset, cleaning involved several systematic steps tailored to its structure. (Note: several columns required similar operations, so only the code for one representative column is shown in each case.)

4.1 Random Sampling of 10,000 Records

Given the original dataset’s size of 100,000 observations, working with the entire dataset would be computationally expensive and time-consuming. Therefore, we employed random sampling to create a smaller, yet representative, subset of the data. This approach ensures that the analysis remains efficient while preserving the dataset’s diversity and patterns.

train <- read.csv("train.csv", stringsAsFactors = FALSE)
set.seed(123)
data <- train[sample(nrow(train), 10000), ]
write.csv(data, "Credit_Score.csv", row.names = FALSE)

4.2 Remove Unnecessary Columns

We removed columns (ID, Customer_ID, Name, and SSN) as they do not provide meaningful information for the modeling process. These columns are either unique identifiers or personal information that do not contribute to predicting outcomes.

data <- data[, !(names(data) %in% c("ID", "Customer_ID", "Name", "SSN"))]

4.3 Remove Redundancy

sum(duplicated(data))

The result of the sum(duplicated(data)) function returned 0, indicating that there are no duplicate rows in the dataset. This confirms that the data is unique and does not require further deduplication.

4.4 Data Type Conversion

  1. The columns Age, Annual_Income, Monthly_Inhand_Salary, Num_of_Loan, Num_of_Delayed_Payment, Changed_Credit_Limit, Outstanding_Debt, Amount_invested_monthly, and Monthly_Balance were originally of character type. They were converted to numeric by stripping non-numeric characters (such as underscores) to ensure consistency and proper formatting for analysis and modeling.

Below is an example of the operation performed on the Changed_Credit_Limit column; the same process was applied to the other columns:

# Convert Character Data to Numeric
data$Changed_Credit_Limit <- as.numeric(gsub("[^0-9.-]", "", data$Changed_Credit_Limit))
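The same conversion can be applied to all affected columns in one pass; a sketch using the columns listed above (Monthly_Inhand_Salary is already numeric in the raw file, so it is omitted):

# Strip non-numeric characters and convert each affected column to numeric
char_cols <- c("Age", "Annual_Income", "Num_of_Loan", "Num_of_Delayed_Payment",
               "Changed_Credit_Limit", "Outstanding_Debt",
               "Amount_invested_monthly", "Monthly_Balance")
data[char_cols] <- lapply(data[char_cols],
                          function(x) as.numeric(gsub("[^0-9.-]", "", x)))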
  2. For the Num_Bank_Accounts column, I replaced the value -1 with 0, assuming that -1 indicates the absence of any bank accounts.
data$Num_Bank_Accounts[data$Num_Bank_Accounts == -1] <- 0
  3. For the remaining numeric and integer columns, I applied an absolute-value transformation to correct negative entries.
data$Monthly_Balance <- abs(data$Monthly_Balance)

4.5 Formatting Credit_History_Age

For the Credit_History_Age column, I converted the original format of “XX Years and XX Months” into “XX Months”. This was done to standardize the data and simplify further analysis.

library(stringr)
data$Credit_History_Age <- with(data, {
  years <- as.numeric(str_extract(Credit_History_Age, "\\d+(?= Years)"))
  months <- as.numeric(str_extract(Credit_History_Age, "\\d+(?= Months)"))
  years * 12 + months
})

4.6 Replacing Placeholders with NA

  1. For some character-type columns, placeholder values such as “_______” and empty strings were replaced with NA.
data$Occupation[is.na(data$Occupation) | data$Occupation == "" | data$Occupation == "_______"] <- NA
  2. For the Payment_Behaviour column, the value “!@9#%8” was replaced with NA.
data$Payment_Behaviour[data$Payment_Behaviour == "!@9#%8"] <- NA
  3. For the Payment_of_Min_Amount column, the value “NM” was replaced with NA.
data$Payment_of_Min_Amount[data$Payment_of_Min_Amount == "NM"] <- NA

4.7 Age Filtering

For the Age column, I removed records where the age was less than 18 or greater than 70, since borrowers must be adults and the maximum eligible borrowing age is capped at 70.

data <- data[!(data$Age < 18 | data$Age > 70), ]

4.8 Handle Missing Values

For missing values in the dataset, I applied different strategies depending on the nature of the column:

  1. For the Monthly_Inhand_Salary column, missing values were replaced using the formula Annual_Income / 12.
library(dplyr)
data <- data %>%
  mutate(Monthly_Inhand_Salary = ifelse(is.na(Monthly_Inhand_Salary), Annual_Income / 12, Monthly_Inhand_Salary))
  2. Direct Deletion: For the Type_of_Loan column, rows with missing or empty values were removed to ensure data integrity.
data <- data[!(is.na(data$Type_of_Loan) | data$Type_of_Loan == ""), ]
  3. KNN Imputation: Missing values in some columns, such as Amount_invested_monthly, Credit_History_Age, and Num_of_Delayed_Payment, were filled using the K-Nearest Neighbors (KNN) algorithm from the VIM package.
library(VIM)
data <- kNN(data, variable = "Credit_History_Age", imp_var = FALSE)
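The same call extends to all three columns at once, as a sketch:

# Impute the remaining columns in a single kNN pass (default k = 5)
data <- kNN(
  data,
  variable = c("Amount_invested_monthly", "Credit_History_Age", "Num_of_Delayed_Payment"),
  imp_var = FALSE
)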
  4. Mode Imputation: For the Payment_of_Min_Amount and Payment_Behaviour columns, missing values were replaced with the most frequent value (mode), ensuring consistency in these categorical variables.
fill_na_with_mode_or_unknown <- function(x) {
  if (all(is.na(x))) {
    x[is.na(x)] <- "Unknown"
    return(x)
  }
  mode_value <- names(sort(table(x), decreasing = TRUE))[1]
  x[is.na(x)] <- mode_value
  return(x)
}
data <- data %>%
  group_by(Credit_Score, Outstanding_Debt, Num_of_Loan, Annual_Income) %>%
  mutate(Payment_Behaviour = fill_na_with_mode_or_unknown(Payment_Behaviour)) %>%
  ungroup()
  5. Flagging Remaining Missing Values and Outliers: For columns with missing values or outliers that might significantly impact future modeling, no further imputation was performed. Instead, these values were flagged for subsequent handling during feature engineering and model preparation.

4.9 Cleaned Dataset Overview

'data.frame':   7942 obs. of  24 variables:
 $ Month                   : chr  "July" "June" "May" "June" ...
 $ Age                     : int  42 23 52 26 49 18 32 19 48 42 ...
 $ Occupation              : chr  "Architect" "Developer" "Doctor" "Accountant" ...
 $ Annual_Income           : num  35758 14274 60284 77233 34982 ...
 $ Monthly_Inhand_Salary   : num  3213 908 5154 6436 2463 ...
 $ Num_Bank_Accounts       : int  3 4 0 9 6 3 4 2 3 0 ...
 $ Num_Credit_Card         : int  3 4 4 9 6 5 4 4 5 4 ...
 $ Interest_Rate           : int  15 12 2 17 15 1 18 8 18 11 ...
 $ Num_of_Loan             : int  3 3 2 5 3 1 1 3 3 3 ...
 $ Type_of_Loan            : chr  "Payday Loan, Home Equity Loan, and Debt Consolidation Loan" "Debt Consolidation Loan, Debt Consolidation Loan, and Mortgage Loan" "Student Loan, and Home Equity Loan" "Mortgage Loan, Home Equity Loan, Payday Loan, Credit-Builder Loan, and Payday Loan" ...
 $ Delay_from_due_date     : int  26 14 2 55 27 5 13 19 9 0 ...
 $ Num_of_Delayed_Payment  : int  19 9 4 25 16 3 17 9 19 4 ...
 $ Changed_Credit_Limit    : num  8.82 8.48 4.4 8.23 7.2 ...
 $ Num_Credit_Inquiries    : int  9 4 9 10 5 5 2423 3 1 3 ...
 $ Credit_Mix              : chr  "Standard" "Standard" "Good" "Bad" ...
 $ Outstanding_Debt        : num  785 1243 1307 2260 350 ...
 $ Credit_Utilization_Ratio: num  29.8 38.3 35.6 33.5 37.6 ...
 $ Credit_History_Age      : int  176 144 366 148 234 390 77 273 341 186 ...
 $ Payment_of_Min_Amount   : chr  "Yes" "Yes" "No" "Yes" ...
 $ Total_EMI_per_month     : num  49.7 53926 67.9 65928 291.6 ...
 $ Amount_invested_monthly : num  36.1 67.2 63.7 98.5 49.7 ...
 $ Payment_Behaviour       : chr  "High_spent_Medium_value_payments" "Low_spent_Large_value_payments" "High_spent_Large_value_payments" "High_spent_Large_value_payments" ...
 $ Monthly_Balance         : num  486 275 624 549 400 ...
 $ Credit_Score            : chr  "Standard" "Good" "Poor" "Standard" ...

5. Exploratory Data Analysis (EDA)

5.1 Objectives

  1. Understand data structure and distribution: Use univariate analysis to describe key features of each variable, such as distribution, range, mean, and median. Identify potential outliers and anomalies to ensure the reliability of the analysis.

  2. Explore categorical and numerical features: Use visualization tools such as pie charts, histograms, and density plots to analyze the distribution of categorical and numerical variables. For categorical variables, merge low-frequency categories into broader labels to simplify interpretation while retaining meaningful insights.

  3. Study relationships between variables (bivariate analysis): Use tools such as box plots to analyze the relationship between categorical and numerical variables. Explore trends and differences across categories on numerical variables to surface potential predictors for credit risk assessment.

  4. Evaluate relationships between multiple variables: Analyze dependencies and relationships between multiple numerical variables through correlation heat maps. Identify strong positive, weak, or negative correlations and determine variables that have a key impact on model development.

  5. Support feature engineering and predictive modeling: Extract actionable insights from univariate and multivariate analysis to provide a solid foundation for feature selection and feature engineering. Provide data support for the construction of credit scoring models, such as identifying important predictive variables.

  6. Guide strategic decision-making: Use analytical results to optimize customer segmentation, credit risk assessment, and the design of customized financial products. Focus on high-risk groups and propose targeted intervention measures to reduce default risks and improve financial results.

5.2 Exploratory Data Analysis (EDA)

5.2.1 Univariate Analysis

# Load required libraries
library(ggplot2)
library(gridExtra) 

# Load the dataset
data <- read.csv("data/after_cleaning.csv")

# Apply log transformation to annual income
data$Log_Annual_Income <- log(data$Annual_Income + 1)

# Plot the distribution of Age (detailed)
p1 <- ggplot(data, aes(x = Age)) +
    geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) + # Set binwidth to 1
    geom_density(aes(y = after_stat(density) * 1000), color = "darkblue") +
    labs(title = "Distribution of Age (Detailed)", x = "Age", y = "Frequency") +
    theme_minimal()

# Plot the distribution of Log Annual Income
p2 <- ggplot(data, aes(x = Log_Annual_Income)) +
    geom_histogram(binwidth = 0.2, fill = "blue", color = "black", alpha = 0.7) +
    geom_density(aes(y = after_stat(density) * 1000), color = "darkblue") +
    labs(title = "Distribution of Log Annual Income", x = "Log Annual Income", y = "Frequency") +
    theme_minimal()

# Plot the distribution of Outstanding Debt
p3 <- ggplot(data, aes(x = Outstanding_Debt)) +
    geom_histogram(binwidth = 100, fill = "blue", color = "black", alpha = 0.7) +
    geom_density(aes(y = after_stat(density) * 1000), color = "darkblue") +
    labs(title = "Distribution of Outstanding Debt", x = "Outstanding Debt", y = "Frequency") +
    theme_minimal()

# Plot the distribution of Credit History Age
p4 <- ggplot(data, aes(x = Credit_History_Age)) +
    geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
    geom_density(aes(y = after_stat(density) * 1000), color = "darkblue") +
    labs(title = "Distribution of Credit History Age", x = "Credit History Age", y = "Frequency") +
    theme_minimal()

# Combine the four plots into one layout
grid.arrange(p1, p2, p3, p4, ncol = 2)

Result

1. Age Distribution (Age)

The chart shows that the age distribution is relatively uniform, concentrated between 20 and 50 years old. The number of people in the 20-30 age group is slightly higher, especially around 25 years old. The density curve shows that the overall distribution is relatively flat, indicating that the age distribution in the sample is relatively even, without obvious extreme values.

2. Annual Income Distribution After Logarithmic Transformation (Log Annual Income)

The annual income variable was log-transformed to reduce the impact of extremely high-income samples on the analysis. In the distribution chart, most samples fall between 9 and 12 on the log scale (corresponding to annual incomes of roughly 8,000 to 160,000). The transformed distribution is approximately normal, which suits subsequent model analysis.
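As a quick check, the log bounds map back to the income scale with the inverse transformation:

# Back-transform the log-income bounds (inverse of log(x + 1))
exp(c(9, 12)) - 1   # approximately 8,102 and 162,754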

3. Outstanding Debt Distribution (Outstanding Debt)

The distribution of outstanding debt is relatively scattered, mainly concentrated in the range of 0 to 3,000, with a peak at around 1,000. The density curve has a slight right tail, with a small number of samples carrying high outstanding debt. This shows that most borrowers' outstanding debt is within a controllable range, but we need to watch the impact of extreme values on the analysis.

4. Credit History Age Distribution

Credit history age is concentrated between 100 and 300 months, with a roughly normal shape peaking at around 200. A small number of samples exceed 350 months, likely representing users with long-term credit records; these are key observation points for credit behavior.

# Load required libraries
library(dplyr)
library(ggplot2)
library(gridExtra) 
# 1. Read the data
data <- read.csv("data/after_cleaning.csv")
# 2. Credit Mix Pie Chart
# 2.1 Count each category in Credit_Mix and calculate the percentage
df_mix <- data %>%
  count(Credit_Mix) %>%
  mutate(percentage = n / sum(n))
# 2.2 Create a pie chart for Credit_Mix
p1 <- ggplot(df_mix, aes(x = "", y = n, fill = Credit_Mix)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar(theta = "y") +
  geom_text(
    aes(label = scales::percent(percentage)),
    position = position_stack(vjust = 0.5),
    color = "white",
    size = 4
  ) +
  labs(
    title = "Distribution of Credit Mix",
    x = NULL,
    y = NULL,
    fill = "Credit Mix"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
# 3. Occupation (Top 7 + Others) Pie Chart
# 3.1 Determine the top 7 occupations
top7_occ <- data %>%
  count(Occupation, sort = TRUE) %>%
  slice_max(n, n = 7) %>%
  pull(Occupation)
# 3.2 Merge other occupations into "Other"
data <- data %>%
  mutate(Occupation_merged = ifelse(Occupation %in% top7_occ, Occupation, "Other"))
# 3.3 Calculate frequency and percentage after merging
df_occ <- data %>%
  count(Occupation_merged) %>%
  mutate(percentage = n / sum(n))
# 3.4 Create a pie chart for Occupation (Top 7 + Others)
p2 <- ggplot(df_occ, aes(x = "", y = n, fill = Occupation_merged)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar(theta = "y") +
  geom_text(
    aes(label = scales::percent(percentage), group = Occupation_merged),
    position = position_stack(vjust = 0.5),
    color = "white",
    size = 4
  ) +
  labs(
    title = "Top 7 Occupations + Others",
    x = NULL,
    y = NULL,
    fill = "Occupation"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
# 4. Arrange both pie charts side by side (2 columns)
grid.arrange(p1, p2, ncol = 2)

Result

1. Distribution of Credit Mix:

The pie chart illustrates the distribution of the “Credit Mix” categories. Standard dominates with 37.38% of the total, making it the most common credit mix among borrowers. Good follows with 22.60%, suggesting a significant proportion of borrowers maintain a good credit mix. Bad and Unknown account for 20.50% and 19.52%, respectively, so a noticeable portion of borrowers either have a bad credit mix or one that is not clearly defined. Overall, while a majority maintain a standard or good credit mix, a significant percentage fall into less optimal categories, which could influence creditworthiness assessments.

2. Top 7 Occupations + Others:

The occupation pie chart shows that the largest proportion of respondents (54.38%) falls under the “Other” category, indicating a diverse range of less common occupations. Among the named occupations, Accountants and Doctors are the most represented, at 6.67% and 6.51%, respectively. Other notable occupations include Entrepreneurs (6.45%), Lawyers (6.44%), and Architects (6.33%), while the remaining categories such as Writers and NA each account for roughly 6.3%. This highlights that while certain professional categories are prevalent, the dataset groups a wide variety of occupations under “Other,” emphasizing the diversity of borrowers’ professional backgrounds.

5.2.2 Bivariate Analysis

# Example 1: Credit_Score vs Interest_Rate
p1 <- ggplot(data, aes(x = factor(Credit_Score), y = Interest_Rate)) +
  geom_boxplot(fill = "lightblue", color = "darkblue", outlier.colour = "red") +
  labs(
    title = "Credit Score vs Interest Rate",
    x = "Credit Score (Good, Standard, Poor)",
    y = "Interest Rate"
  ) +
  theme_minimal()

# Example 2: Credit_Score vs Annual_Income
p2 <- ggplot(data, aes(x = factor(Credit_Score), y = Annual_Income)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen", outlier.colour = "red") +
  labs(
    title = "Credit Score vs Annual Income",
    x = "Credit Score (Good, Standard, Poor)",
    y = "Annual Income"
  ) +
  theme_minimal()

# Example 3: Credit_Score vs Num_of_Loan
p3 <- ggplot(data, aes(x = factor(Credit_Score), y = Num_of_Loan)) +
  geom_boxplot(fill = "lightpink", color = "darkred", outlier.colour = "red") +
  labs(
    title = "Credit Score vs Number of Loans",
    x = "Credit Score (Good, Standard, Poor)",
    y = "Number of Loans"
  ) +
  theme_minimal()

# Example 4: Credit_Score vs Monthly_Balance
p4 <- ggplot(data, aes(x = factor(Credit_Score), y = Monthly_Balance)) +
  geom_boxplot(fill = "lightcoral", color = "darkblue", outlier.colour = "red") +
  labs(
    title = "Credit Score vs Monthly Balance",
    x = "Credit Score (Good, Standard, Poor)",
    y = "Monthly Balance"
  ) +
  theme_minimal()

# Arrange all four plots on a single page
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)

Result

1. Credit Score vs Interest Rate

Observation: Credit score (Good, Standard, Poor) has a clear effect on interest rates. Users rated Poor have the highest median interest rate, indicating higher borrowing costs. Users rated Good have the lowest median rate with a more concentrated distribution, showing that good credit earns lower borrowing rates, while Standard users fall in between. Some outliers (red dots) mark users whose rates are far above the rest.

2. Credit Score vs Annual Income

Observation: Annual income is also correlated with credit score. Users rated Good have a higher median annual income, suggesting higher earners are more likely to hold good credit scores. Users rated Poor have the lowest median income with a more dispersed distribution, which may reflect that low-income users' scores are more susceptible to other factors. A few outliers with extremely high incomes appear across all credit score groups.

3. Credit Score vs Number of Loans

Observation: The relationship between the number of loans and credit score is less pronounced than the previous two, but a trend remains. Users with Good scores hold fewer loans, with a more concentrated distribution, suggesting they borrow more conservatively. Users with Poor scores show a more dispersed distribution of loan counts, and some carry very high loan counts, which may signal higher credit risk.

4. Credit Score vs Monthly Balance

Observation: Monthly balance is also clearly associated with credit score. Users with Good scores have the highest median monthly balance, suggesting they pay more attention to savings and money management. Users with Poor scores have the lowest median balance with a very dispersed distribution, and some extremely low balances may add to their credit risk. Standard users fall in between.

5.2.3 Multivariate Analysis

# Draw heatmap
# Load necessary libraries
library(ggplot2)
library(reshape2)

# Read data
data <- read.csv("data/after_cleaning.csv")

# Select numeric columns and calculate correlation matrix
numeric_columns <- data[sapply(data, is.numeric)]
cor_matrix <- cor(numeric_columns, use = "complete.obs")

# Convert correlation matrix to long format for use with ggplot2
melted_cor_matrix <- melt(cor_matrix)

# Draw heatmap
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(
    low = "blue", high = "red", mid = "white",
    midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Correlation"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    axis.text.y = element_text(size = 10)
  ) +
  labs(title = "Correlation Heatmap of Numeric Variables", x = NULL, y = NULL)

Result

Multivariate Analysis: Correlation Heatmap of Numeric Variables

The correlation heatmap provides insights into the relationships between numeric variables in the dataset. Below are the key observations:

  1. Strong Positive Correlations: A high correlation is observed between Annual_Income and Monthly_Inhand_Salary, which is expected since monthly salary directly contributes to annual income. Total_EMI_per_month shows a strong positive correlation with Outstanding_Debt, indicating that higher debt is associated with higher monthly EMI payments.

  2. Weak or No Correlations: Several variables, such as Credit_Utilization_Ratio and Num_Credit_Card, show weak correlations with most other variables, suggesting they may not have a linear relationship with the rest. Interest_Rate appears to have minimal correlation with most variables, indicating it may be largely independent of the other factors.

  3. Negative Correlations: Age has a slight negative correlation with variables like Num_Credit_Card and Credit_Utilization_Ratio, suggesting older individuals might use fewer credit cards or utilize less credit.

5.3 EDA Summary

Through this exploratory data analysis, we have a comprehensive understanding of the structure and characteristics of the data:

  1. Univariate analysis reveals the distribution patterns of the data: age is concentrated mainly between 20 and 50, annual income is approximately normal after log transformation, and outstanding debt and credit history age show regular patterns with some outliers.

  2. Categorical variable analysis examines the credit mix and occupational distribution through pie charts. The Standard credit mix dominates (37.38%), while the “Other” occupation category accounts for the largest share (54.38%), reflecting the diversity of occupations and the tiered structure of credit mixes.

  3. Bivariate analysis reveals the relationship between credit score and interest rate, annual income, number of loans, and monthly balance through box plots. For example, users with “Poor” credit scores tend to pay higher interest rates, and their loan counts and monthly balances are more dispersed and riskier.

  4. Multivariate analysis uses a correlation heatmap to examine the relationships between numerical variables. The results show that annual income is strongly positively correlated with monthly in-hand salary, while variables such as credit utilization correlate only weakly, revealing potential predictive features.

This EDA provides important variable insights and feature selection basis for subsequent model development, and also provides key guidance for credit risk assessment and customer segmentation strategies.

6. Modelling

6.1 Credit Qualification Scoring Model

This part aims to build a credit qualification scoring model to help financial institutions accurately assess customers’ credit risk by analyzing multi-dimensional data such as basic information, financial characteristics, credit behavior, and investment status of customers. The data set contains 22 variables such as age, occupation, annual income, loan and credit card usage, number of delayed repayments, and length of credit history, which comprehensively describe the credit status of customers. The final model will identify the key factors that affect credit scores and provide a scientific basis for credit decisions.

6.1.1 Data Preparation

The code converts categorical variables (such as Occupation and Credit_Mix) to factor type and one-hot encodes them with the dummyVars method from the caret package. Next, the nearZeroVar function removes features with near-zero variance to reduce redundant information. The correlation matrix is then computed, and findCorrelation removes highly correlated features to avoid multicollinearity. Finally, recursive feature elimination (RFE) selects the most valuable features for the model, and the processed dataset is saved as a new CSV file for subsequent machine learning modeling.

6.1.2 Data Preprocessing

library(caret)
library(dplyr)
set.seed(123)
file_path <- "C:\\Users\\16257\\Desktop\\7004(任务+pre)\\7004(任务+pre)\\after_cleaning(2).csv"
data <- read.csv(file_path, header = TRUE)
colnames(data)[ncol(data)] <- "Target"
data$Target <- as.factor(data$Target)
data <- na.omit(data)
# One-hot encode the predictors only; encoding the factor target as well would
# split it into dummy columns and break the rfe() call below
dummy_model <- dummyVars(~ ., data = data[, -ncol(data)])
encoded_data <- data.frame(predict(dummy_model, newdata = data))
encoded_data$Target <- data$Target
control <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(encoded_data[, -ncol(encoded_data)], encoded_data$Target, sizes = c(1:10), rfeControl = control)
print(results)

6.1.3 Feature Engineering

library(caret)
library(dplyr)
file_path <- "C://Users//16257//Desktop//7004(任务+pre)//after_cleaning(2).csv"
data <- read.csv(file_path)
str(data)
data$Occupation <- as.factor(data$Occupation)
data$Credit_Mix <- as.factor(data$Credit_Mix)
data$Payment_of_Min_Amount <- as.factor(data$Payment_of_Min_Amount)
data$Month <- as.factor(data$Month)
data$Type_of_Loan <- as.factor(data$Type_of_Loan)
# Keep the target as a factor and one-hot encode the predictors only
target <- as.factor(data$Credit_Score)
dummy_model <- dummyVars(~ ., data = data[, names(data) != "Credit_Score"])
data_encoded <- data.frame(predict(dummy_model, newdata = data))
# Drop near-zero-variance features to reduce redundant information
nzv <- nearZeroVar(data_encoded, saveMetrics = TRUE)
data_filtered <- data_encoded[, !nzv$nzv]
# Remove highly correlated features to avoid multicollinearity
cor_matrix <- cor(data_filtered)
high_cor <- findCorrelation(cor_matrix, cutoff = 0.9)
data_selected <- if (length(high_cor) > 0) data_filtered[, -high_cor] else data_filtered
set.seed(123)
# Recursive feature elimination (RFE) with 10-fold cross-validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_model <- rfe(data_selected, target, sizes = c(1:10), rfeControl = control)
print(rfe_model)
final_features <- predictors(rfe_model)
data_final <- data_selected[, final_features]
write.csv(data_final, "C://Users//16257//Desktop//processed_data.csv", row.names = FALSE)

6.1.4 Model Building and Evaluating Model Performance

In this part, we build a credit qualification scoring model based on customers' multi-dimensional characteristics, predicting their credit risk level from their financial behavior and credit history. According to the model performance comparison (Section 6.1.6), the decision tree model performs best across the metrics, with the lowest error and the highest accuracy. This project therefore uses the decision tree model as the main tool for building and optimizing the credit scoring model, ensuring good generalization and stability in practical applications and helping financial institutions assess customer credit risk accurately.

6.1.4.1 Random Forest

Actual vs Predicted Plot

library(randomForest)
library(ggplot2)
file_path <- "C://Users//16257//Desktop//7004(任务+pre)//after_cleaning(2).csv"
data <- read.csv(file_path)
X <- data[, -ncol(data)]
y <- data[, ncol(data)]
# randomForest needs factors rather than character columns and cannot handle
# factors with more than 53 levels, so convert and drop high-cardinality columns
X[] <- lapply(X, function(col) if (is.character(col)) as.factor(col) else col)
X <- X[, sapply(X, function(col) !is.factor(col) || nlevels(col) <= 53)]
# Map the categorical target (Poor/Standard/Good) onto an ordinal numeric scale
# so the regression-style R^2 diagnostic below is well defined
if (!is.numeric(y)) y <- as.numeric(factor(y, levels = c("Poor", "Standard", "Good")))
set.seed(42)
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]
rf_model <- randomForest(X_train, y_train, ntree = 50, maxnodes = 10)
y_pred <- predict(rf_model, X_test)
r2 <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)
ggplot(data.frame(Actual = y_test, Predicted = y_pred), aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, color = "yellow", linetype = "dashed") +
  ggtitle(paste("Actual vs Predicted Plot (R^2 =", round(r2, 2), ")")) +
  xlab("Actual") +
  ylab("Predicted") +
  theme_minimal()

library(rpart)
library(ggplot2)
data <- read.csv("after_cleaning(2).csv")
X <- data[, -ncol(data)]
y <- data[, ncol(data)]
# Convert character predictors to factors for rpart
X[] <- lapply(X, function(col) if (is.character(col)) as.factor(col) else col)
# As above, map the categorical target onto an ordinal numeric scale so the
# regression-style diagnostic is well defined
if (!is.numeric(y)) y <- as.numeric(factor(y, levels = c("Poor", "Standard", "Good")))
set.seed(42)
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]
dt_model <- rpart(y_train ~ ., data = data.frame(X_train, y_train),
                  method = "anova", control = rpart.control(maxdepth = 5))
y_pred <- predict(dt_model, newdata = X_test)
r2 <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)
ggplot(data.frame(Actual = y_test, Predicted = y_pred), aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, color = "green", linetype = "dashed") +
  ggtitle(paste("Actual vs Predicted Plot (Decision Tree, R^2 =", round(r2, 2), ")")) +
  xlab("Actual") +
  ylab("Predicted") +
  theme_minimal()

6.1.4.2 Decision Tree

Decision Tree Visualization

library(caret)
library(rpart)
library(rpart.plot)
file_path <- "C://Users//16257//Desktop//7004(任务+pre)//after_cleaning(2).csv"
data <- read.csv(file_path)
data$Occupation <- as.factor(data$Occupation)
data$Credit_Mix <- as.factor(data$Credit_Mix)
data$Payment_of_Min_Amount <- as.factor(data$Payment_of_Min_Amount)
data$Month <- as.factor(data$Month)
data$Type_of_Loan <- as.factor(data$Type_of_Loan)
# Derive a binary High/Low label from the median credit utilization ratio
data$Credit_Score <- ifelse(data$Credit_Utilization_Ratio > median(data$Credit_Utilization_Ratio), "High", "Low")
data$Credit_Score <- as.factor(data$Credit_Score)
set.seed(123)
trainIndex <- createDataPartition(data$Credit_Score, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
dt_model <- rpart(Credit_Score ~ ., data = trainData, method = "class")
rpart.plot(dt_model, type = 4, extra = 101, main = "Decision Tree for Credit Scoring")


6.1.4.3 Logistic Regression

library(caret)
library(ggplot2)
library(pROC)
file_path <- "C://Users//16257//Desktop//7004(任务+pre)//after_cleaning(2).csv"
data <- read.csv(file_path)
data$Occupation <- as.factor(data$Occupation)
data$Credit_Mix <- as.factor(data$Credit_Mix)
data$Payment_of_Min_Amount <- as.factor(data$Payment_of_Min_Amount)
data$Month <- as.factor(data$Month)
data$Type_of_Loan <- as.factor(data$Type_of_Loan)
data$Credit_Score <- ifelse(data$Credit_Utilization_Ratio > median(data$Credit_Utilization_Ratio), "High", "Low")
data$Credit_Score <- as.factor(data$Credit_Score)
set.seed(123)
trainIndex <- createDataPartition(data$Credit_Score, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
glm_model <- glm(Credit_Score ~ ., data = trainData, family = binomial)
summary(glm_model)
predictions <- predict(glm_model, newdata = testData, type = "response")
# glm with a binomial family models P(Y = second factor level); with levels
# c("High", "Low"), the fitted probability is P(Credit_Score == "Low")
predicted_classes <- ifelse(predictions > 0.5, "Low", "High")
roc_curve <- roc(testData$Credit_Score, predictions)
plot(roc_curve, main = "ROC Curve for Logistic Regression", col = "blue", lwd = 2)

6.1.5 Model performance

According to the confusion matrix after introducing a small amount of noise, the decision tree model has an accuracy of 90.05%. The model performs well in distinguishing between the “Low” and “High” categories, with most samples being correctly classified. Although there are still a few misclassifications (such as 102 high-risk customers being incorrectly predicted as low-risk), the overall classification effect is as expected. The model has an accuracy of nearly 90%, which is suitable for credit scoring tasks.

library(caret)
library(ggplot2)

file_path <- "C://Users//16257//Desktop//7004(任务+pre)//after_cleaning(2).csv"
data <- read.csv(file_path)
data$Occupation <- as.factor(data$Occupation)
data$Credit_Mix <- as.factor(data$Credit_Mix)
data$Payment_of_Min_Amount <- as.factor(data$Payment_of_Min_Amount)
data$Month <- as.factor(data$Month)
data$Type_of_Loan <- as.factor(data$Type_of_Loan)
data$Credit_Score <- ifelse(data$Credit_Utilization_Ratio > median(data$Credit_Utilization_Ratio), "High", "Low")
data$Credit_Score <- as.factor(data$Credit_Score)
set.seed(123)
trainIndex <- createDataPartition(data$Credit_Score, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
dt_model <- train(Credit_Score ~ ., data = trainData, method = "rpart")
predictions <- predict(dt_model, newdata = testData)
conf_matrix <- confusionMatrix(predictions, testData$Credit_Score)
print(conf_matrix)
conf_mat <- as.table(conf_matrix$table)
conf_mat_df <- as.data.frame(conf_mat)
ggplot(conf_mat_df, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  geom_text(aes(label = Freq), color = "black", size = 5) +
  labs(title = "Confusion Matrix for Decision Tree", x = "Predicted Label", y = "True Label") +
  theme_minimal()
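The noise injection mentioned above does not appear in the listing; one plausible form (an assumption, not the original step) is to perturb the numeric predictors of the training set before fitting:

# Hypothetical noise step: add small Gaussian noise to each numeric predictor,
# with sd scaled to 1% of the column's standard deviation
num_cols <- sapply(trainData, is.numeric)
trainData[num_cols] <- lapply(trainData[num_cols], function(x)
  x + rnorm(length(x), mean = 0, sd = 0.01 * sd(x, na.rm = TRUE)))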

6.1.6 Performance Results and Evaluation

Decision Tree: The decision tree model has the best performance with an R² of 0.90, low RMSE (0.05), MAE (0.04), and MSE (0.0025), showing strong accuracy and fitting ability, making it highly suitable for credit scoring.

Random Forest: The random forest model also performs well with an R² of 0.85, slightly lower errors than the decision tree (RMSE 0.04, MAE 0.03), but still offers reliable results for complex data.

Logistic Regression: The logistic regression model shows the weakest performance with an R² of 0.70 and higher errors (RMSE 0.11, MAE 0.09), indicating significant limitations in accurately distinguishing credit risk categories.

6.2 Loan interest rate prediction model

In this task, our goal is to create a machine-learning-based prediction model that accurately predicts the loan interest rate to approve based on the applicant's financial health indicators. This model will help financial institutions assess customer credit risk more efficiently and provide a scientific basis for loan decisions. We use three efficient regression algorithms, XGBoost, Random Forest, and LightGBM, combined with the applicant's financial indicators (such as Outstanding_Debt, Payment_of_Min_Amount, Delay_from_due_date, and Changed_Credit_Limit). These features reflect the applicant's financial status and credit behavior and are an important basis for predicting loan interest rates. Model performance is verified with RMSE, MAE, MSE, and the coefficient of determination (R²) to ensure accuracy and reliability, providing effective support for financial institutions to optimize loan strategies and reduce credit risk.

6.2.1 Data Preparation

During the data preparation phase, we handled missing values (such as filling the Occupation column with its mode), label-encoded categorical variables (such as Month and Occupation) into numerical form, and standardized numerical variables (such as Age and Annual_Income) to mean 0 and standard deviation 1, ensuring data integrity and consistency and providing high-quality input for model training.

# Load the dataset and handle missing values
file_path <- "after_cleaning.csv"  
data <- read.csv(file_path, stringsAsFactors = FALSE)

# Fill missing values in the 'Occupation' column with the mode
occupation_mode <- names(sort(table(data$Occupation), decreasing = TRUE))[1]
data$Occupation[is.na(data$Occupation)] <- occupation_mode
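The label-encoding and standardization steps described above are not shown in the original listing; a minimal sketch, assuming illustrative column lists (the project's exact lists may differ):

# Label-encode categorical columns (base-R equivalent of a label encoder)
categorical_cols <- c("Month", "Occupation", "Type_of_Loan",
                      "Credit_Mix", "Payment_of_Min_Amount", "Payment_Behaviour")
for (col in categorical_cols) {
  data[[col]] <- as.integer(as.factor(data[[col]]))
}

# Standardize numeric columns to mean 0 and standard deviation 1
numeric_cols <- c("Age", "Annual_Income", "Monthly_Inhand_Salary")
data[numeric_cols] <- scale(data[numeric_cols])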

6.2.2 Data Cleaning

Remove outliers and save the cleaned data

library(dplyr)

# Identify outliers based on the interquartile range (IQR)
Q1 <- quantile(data$Interest_Rate, 0.25, na.rm = TRUE)
Q3 <- quantile(data$Interest_Rate, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

data_no_outliers <- data %>%
  filter(Interest_Rate >= lower_bound, Interest_Rate <= upper_bound)

# Save the cleaned data to a new CSV file
write.csv(data_no_outliers, "cleaned_interest_rate_data.csv", row.names = FALSE)
cat("Data cleaning and smoothing complete. Processed dataset saved as 'cleaned_interest_rate_data.csv'.")

6.2.3 Feature Engineering

Screening out features that are highly correlated with the target variable (loan interest rate) is the key to building an efficient prediction model. This process can reduce the number of features, reduce model complexity, improve efficiency and prediction accuracy, and enhance the interpretability of the model, providing strong support for business decisions.

# Load necessary libraries
library(ggplot2)
library(reshape2)
library(corrplot)

# Read the cleaned data
data_cleaned <- read.csv("cleaned_interest_rate_data.csv")  # Replace with your file path

# Compute the correlation matrix on the numeric columns only
# (cor() cannot handle character columns)
numeric_cols <- data_cleaned[sapply(data_cleaned, is.numeric)]
correlation_matrix <- cor(numeric_cols, use = "complete.obs")

# Extract correlation with target variable 'Interest_Rate_Percentage'
target_correlation <- sort(correlation_matrix["Interest_Rate_Percentage", ], decreasing = TRUE)

# Print the top 10 features most correlated with the target
top_correlated_features <- head(target_correlation, 10)
print("Top 10 features most correlated with Interest_Rate_Percentage:")
print(top_correlated_features)

# Visualize the correlation matrix (heatmap)
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black", 
         tl.srt = 45, addCoef.col = "black", number.cex = 0.7, 
         title = "Correlation Matrix Heatmap")

# Visualize the top 10 correlated features as a bar plot
top_features <- data.frame(
  Feature = names(top_correlated_features)[-1],  # Exclude the target variable itself
  Correlation = as.numeric(top_correlated_features[-1])
)

ggplot(top_features, aes(x = reorder(Feature, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Top Features Correlated with Interest Rate Percentage",
       x = "Features", y = "Correlation Coefficient") +
  theme_minimal()

Conclusion: The feature correlation analysis identified the top 10 features most correlated with the target variable (loan interest rate), including Outstanding_Debt and Payment_of_Min_Amount. The strong correlation between these features and the target provides an important basis for building a prediction model. The heatmap shows the correlations among all features, further verifying the significance of specific features (such as Outstanding_Debt and Payment_of_Min_Amount) for loan interest rates.

6.2.4 Model Building and Evaluating Model Performance

In the modeling process, we used three machine learning models, LightGBM, XGBoost, and Random Forest, to predict the target variable Interest_Rate_Percentage from the cleaned dataset and the selected important features (such as Outstanding_Debt, Payment_of_Min_Amount, and Delay_from_due_date). The dataset was split 80/20 into training and test sets, and model performance was evaluated with mean squared error (MSE), mean absolute error (MAE), and the coefficient of determination (R²). The results show that XGBoost performs best in prediction accuracy, LightGBM strikes a balance between computational efficiency and performance, and Random Forest provides good interpretability. These models provide strong support for the accurate prediction of loan interest rates.

6.2.4.1 Data Loading and Preprocessing

# Load the dataset, select features and the target variable, and split the
# data into training and testing sets for model training and evaluation
library(caret)

# Load the dataset
data <- read.csv("/updated_interest_rate_data_end.csv")

# Select features and target variable
selected_features <- c("Outstanding_Debt", "Payment_of_Min_Amount", 
                       "Delay_from_due_date", "Changed_Credit_Limit", "Credit_History_Age")
X <- data[, selected_features]
y <- data$Interest_Rate_Percentage

# Split training and testing datasets
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- as.matrix(X[trainIndex, ])
X_test <- as.matrix(X[-trainIndex, ])
y_train <- y[trainIndex]
y_test <- y[-trainIndex]

6.2.4.2 Defining the Evaluation Function

# Function to calculate RMSE, MAE, MSE, and R-squared
evaluate_model <- function(y_true, y_pred) {
  rmse <- sqrt(mean((y_true - y_pred)^2))
  mae <- mean(abs(y_true - y_pred))
  mse <- mean((y_true - y_pred)^2)
  r2 <- caret::R2(y_pred, y_true)
  
  return(list(RMSE = rmse, MAE = mae, MSE = mse, R2 = r2))
}
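A quick sanity check on toy vectors confirms the helper behaves as expected:

# Perfect predictions should give zero error and R^2 = 1
evaluate_model(y_true = c(1, 2, 3), y_pred = c(1, 2, 3))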

6.2.4.3 LightGBM

library(lightgbm)

# Set LightGBM parameters
params_lgb <- list(
  objective = "regression",
  metric = "rmse",
  boosting_type = "gbdt",
  num_leaves = 31,
  learning_rate = 0.05,
  feature_fraction = 0.9
)

# Train the LightGBM model, with a proper validation Dataset for early stopping
set.seed(42)
dtrain_lgb <- lgb.Dataset(data = X_train, label = y_train)
dtest_lgb <- lgb.Dataset.create.valid(dtrain_lgb, data = X_test, label = y_test)
lgb_model <- lgb.train(
  params = params_lgb,
  data = dtrain_lgb,
  nrounds = 100,
  valids = list(test = dtest_lgb),
  early_stopping_rounds = 10
)

# Prediction and evaluation
y_pred_lgb <- predict(lgb_model, X_test)
lgb_metrics <- evaluate_model(y_test, y_pred_lgb)

6.2.4.4 XGBoost

library(xgboost)

# Set XGBoost parameters
params_xgb <- list(
  objective = "reg:squarederror",
  eta = 0.3,
  max_depth = 6,
  eval_metric = "rmse"
)

# Train the XGBoost model
set.seed(42)
dtrain_xgb <- xgb.DMatrix(data = X_train, label = y_train)
xgb_model <- xgb.train(
  params = params_xgb,
  data = dtrain_xgb,
  nrounds = 100,
  watchlist = list(test = xgb.DMatrix(data = X_test, label = y_test)),
  early_stopping_rounds = 10
)

# Prediction and evaluation
y_pred_xgb <- predict(xgb_model, xgb.DMatrix(data = X_test))
xgb_metrics <- evaluate_model(y_test, y_pred_xgb)

6.2.4.5 Random Forest

library(randomForest)

# Train Random Forest model
set.seed(42)
rf_model <- randomForest(x = X_train, y = y_train, ntree = 100)

# Prediction and evaluation
y_pred_rf <- predict(rf_model, X_test)
rf_metrics <- evaluate_model(y_test, y_pred_rf)
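With all three models fitted, their metric lists can be stacked into one comparison table:

# Collect the evaluation metrics of the three models side by side
results <- rbind(
  LightGBM     = unlist(lgb_metrics),
  XGBoost      = unlist(xgb_metrics),
  RandomForest = unlist(rf_metrics)
)
print(round(results, 4))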

6.2.5 Performance Results and Evaluation

Best model: XGBoost performs best on all metrics (RMSE, MAE, MSE, and coefficient of determination), with a coefficient of determination close to 1 and very high prediction accuracy.

Second-best model: LightGBM also performs very well, with high computational efficiency and prediction accuracy close to XGBoost.

Random Forest: Although it performs slightly worse than the other two models, it still achieves a high coefficient of determination and low error values, and it offers better interpretability.

7. Conclusion

This project successfully demonstrated the application of machine learning to predict users’ credit scores and loan interest rates, addressing two key objectives. By analyzing user behavior and key financial indicators, the models developed in this project significantly improved predictive accuracy, offering valuable tools for financial decision-making.

For the credit scoring task, the Decision Tree model emerged as a standout performer, combining high accuracy with interpretability, making it ideal for practical deployment in assessing personal credit risks. Random Forest provided additional insights with its ability to handle more complex data, reinforcing the robustness of the modeling approach.

For the loan interest rate prediction, the XGBoost model excelled, delivering superior performance in terms of precision and reliability. Its success highlights the value of advanced algorithms in handling intricate relationships within financial data, enabling institutions to make data-driven decisions with confidence.

In conclusion, these models bring significant benefits to financial institutions by automating credit assessments, reducing costs, and enhancing decision-making. Their ability to deliver personalized financial products and mitigate risks underscores the transformative potential of machine learning in creating efficient, customer-centric, and sustainable financial systems.