GROUP: Group 10

Full Name: Kai Ern Chow, Nur Hidayah Binti Ahmad Shafii, Zeng YanTing, Pang Shang

Matric Number: S2193226, 22120931, S2168467, 22106157

1. Introduction

As consumer credit and lending grow in importance, accurate and efficient credit risk assessment matters to both financial institutions and their customers. Sound assessment protects the financial health of banks, increases credit volume, supports responsible lending and economic stability, and ensures that loans are granted to reliable customers. Misjudging an applicant’s risk can lead to financial losses for the institution. Similarly, misestimating the loan amount can either burden the customer with more debt than they can handle or limit their ability to use the loan effectively (Sudhamathy, 2016).

It is essential for financial institutions to accurately predict a suitable loan amount for each borrower without exposing them to excessive risk. Credit Risk Classification focuses on segregating loan applicants into different categories based on their creditworthiness. This classification helps financial institutions determine the level of risk associated with lending to each borrower, which in turn informs decisions on loan approvals and interest rates.

Conversely, Loan Amount Prediction involves creating a predictive model to estimate the most appropriate loan amount based on a thorough analysis of an applicant’s financial health indicators. This includes income level, employment stability, credit history, existing debts, and more. The aim is to determine the optimal loan amount that aligns with the borrower’s ability to repay without causing financial strain, thus minimizing the risk of default.

By applying machine learning techniques in R, lenders can make informed decisions that strike a balance between promoting credit access and maintaining a healthy loan portfolio.

1.1 Objectives

  1. Credit Risk Classification - To develop a machine learning model that classifies applicants into categories of credit risk.
  2. Loan Amount Prediction - To create a predictive model that accurately predicts the loan amount that should be approved for applicants based on their financial health indicators.

2 Methods and Analysis

2.1 Data Preparation

The data was obtained from https://www.kaggle.com/datasets/laotse/credit-risk-dataset/code?datasetId=688532&sortBy=voteCount

Since there is no explicit date or origin provided on the source site, the dataset used in this report is assumed to be the latest available and originates from financial institutions in Malaysia. Prior to analysis, the data underwent thorough preparation, including cleaning to address missing values and inconsistencies, identification and categorization of variables, and transformation to meet analysis assumptions. By assuming the data’s timeliness and origin, this analysis provides valuable insights into credit risk within the Malaysian financial landscape.

2.2 Data Exploration

## 'data.frame':    32581 obs. of  12 variables:
##  $ person_age                : int  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_income             : int  59000 9600 9600 65500 54400 9900 77100 78956 83000 10000 ...
##  $ person_home_ownership     : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...
##  $ person_emp_length         : int  123 5 1 4 8 2 8 5 8 6 ...
##  $ loan_intent               : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
##  $ loan_grade                : chr  "D" "B" "C" "C" ...
##  $ loan_amnt                 : int  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_int_rate             : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_status               : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ loan_percent_income       : num  0.59 0.1 0.57 0.53 0.55 0.25 0.45 0.44 0.42 0.16 ...
##  $ cb_person_default_on_file : chr  "Y" "N" "N" "N" ...
##  $ cb_person_cred_hist_length: int  3 2 3 2 4 2 3 4 2 3 ...

The dataset consists of 32,581 observations of 12 variables, defined as follows:

  1. person_age: The age of the borrower when securing the loan.
  2. person_income: The borrower’s annual earnings at the time of the loan.
  3. person_home_ownership: Type of home ownership.
     • rent: The borrower is currently renting a property.
     • mortgage: The borrower has a mortgage on the property they own.
     • own: The borrower owns their home outright.
     • other: Other categories of home ownership that may be specific to the dataset.
  4. person_emp_length: The length of time, in years, that the borrower has been employed.
  5. loan_intent: Loan purpose.
  6. loan_grade: Classification system based on credit history, collateral quality, and likelihood of repayment of the principal and interest.
     • A: The borrower has a high creditworthiness, indicating low risk.
     • B: The borrower is relatively low-risk, but not as creditworthy as Grade A.
     • C: The borrower’s creditworthiness is moderate.
     • D: The borrower is considered to have higher risk compared to previous grades.
     • E: The borrower’s creditworthiness is lower, indicating a higher risk.
     • F: The borrower poses a significant credit risk.
     • G: The borrower’s creditworthiness is the lowest, signifying the highest risk.
  7. loan_amnt: Total amount of the loan.
  8. loan_int_rate: Interest rate of the loan.
  9. loan_status: Dummy variable indicating default (1) or non-default (0).
     • 0: Non-default - The borrower successfully repaid the loan as agreed, and there was no default.
     • 1: Default - The borrower failed to repay the loan according to the agreed-upon terms and defaulted.
  10. loan_percent_income: Ratio between the loan amount and the annual income.
  11. cb_person_cred_hist_length: The length of the borrower’s credit history, in years, since their first loan.
  12. cb_person_default_on_file: Indicates whether the borrower has previously defaulted.
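The definition of loan_percent_income can be checked against the first ten rows shown in the structure output above. This is a minimal sketch: the vectors are copied by hand from that output, and rounding to two decimals is an assumption about how the column was prepared.

```r
# First ten rows of the str() output above, copied by hand
loan_amnt     <- c(35000, 1000, 5500, 35000, 35000, 2500, 35000, 35000, 35000, 1600)
person_income <- c(59000, 9600, 9600, 65500, 54400, 9900, 77100, 78956, 83000, 10000)
reported      <- c(0.59, 0.10, 0.57, 0.53, 0.55, 0.25, 0.45, 0.44, 0.42, 0.16)

# loan_percent_income = loan amount / annual income (assumed rounded to 2 d.p.)
recomputed <- round(loan_amnt / person_income, 2)
matches <- abs(recomputed - reported) < 1e-8

print(data.frame(loan_amnt, person_income, reported, recomputed, matches))
print(sum(matches))   # number of rows where the recomputed ratio agrees
```

Nine of the ten rows agree. The fifth row does not (35000 / 54400 rounds to 0.64, while 0.55 is reported), so the ratio should be treated as closely, but not perfectly, determined by the other two columns.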

The loan_status variable is a crucial dependent variable. The value of 0 is no default and the value of 1 is default. A default occurs when a borrower is unable to make timely payments, misses payments, or avoids or stops making payments on interest or principal owed.

2.3 Data Cleaning

Data cleaning is an essential step in the data analysis process, and several specific steps were taken here. First, duplicate rows were identified and removed to streamline the dataset. Next, missing values were handled through imputation and deletion. Lastly, formatting adjustments were made to ensure consistency and accuracy.

##    person_age     person_income     person_home_ownership person_emp_length
##  Min.   : 20.00   Min.   :   4000   Length:32581          Min.   :  0.00   
##  1st Qu.: 23.00   1st Qu.:  38500   Class :character      1st Qu.:  2.00   
##  Median : 26.00   Median :  55000   Mode  :character      Median :  4.00   
##  Mean   : 27.73   Mean   :  66075                         Mean   :  4.79   
##  3rd Qu.: 30.00   3rd Qu.:  79200                         3rd Qu.:  7.00   
##  Max.   :144.00   Max.   :6000000                         Max.   :123.00   
##                                                           NA's   :895      
##  loan_intent         loan_grade          loan_amnt     loan_int_rate  
##  Length:32581       Length:32581       Min.   :  500   Min.   : 5.42  
##  Class :character   Class :character   1st Qu.: 5000   1st Qu.: 7.90  
##  Mode  :character   Mode  :character   Median : 8000   Median :10.99  
##                                        Mean   : 9589   Mean   :11.01  
##                                        3rd Qu.:12200   3rd Qu.:13.47  
##                                        Max.   :35000   Max.   :23.22  
##                                                        NA's   :3116   
##   loan_status     loan_percent_income cb_person_default_on_file
##  Min.   :0.0000   Min.   :0.0000      Length:32581             
##  1st Qu.:0.0000   1st Qu.:0.0900      Class :character         
##  Median :0.0000   Median :0.1500      Mode  :character         
##  Mean   :0.2182   Mean   :0.1702                               
##  3rd Qu.:0.0000   3rd Qu.:0.2300                               
##  Max.   :1.0000   Max.   :0.8300                               
##                                                                
##  cb_person_cred_hist_length
##  Min.   : 2.000            
##  1st Qu.: 3.000            
##  Median : 4.000            
##  Mean   : 5.804            
##  3rd Qu.: 8.000            
##  Max.   :30.000            
## 

From the summary, a borrower age of 144 years is implausible. Since the maximum loan tenure in Malaysia extends to the age of 70, borrowers over 70 years old are excluded.

The person_emp_length variable has 895 NA values, and its maximum of 123 years of employment is impossible. Hence, the NA values are replaced with the mode, and rows with an employment length above 60 years (an assumed upper bound on employment) are excluded.

Missing values in loan_int_rate were also handled by imputation: the variable has 3,116 NA values, which are replaced with the median.
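The report does not show its cleaning code, so the following is only a sketch of how the rules described above could be expressed in base R. The helper name clean_credit and the toy data frame are hypothetical; the column names follow the dataset.

```r
# Hypothetical helper applying the cleaning rules described in the text
clean_credit <- function(df) {
  df <- df[!duplicated(df), ]              # drop duplicate rows
  df <- df[df$person_age <= 70, ]          # exclude borrowers over 70

  # Mode of employment length, ignoring NA (ties resolve to the smallest value)
  counts   <- table(df$person_emp_length)
  emp_mode <- as.numeric(names(counts)[which.max(counts)])
  df$person_emp_length[is.na(df$person_emp_length)] <- emp_mode
  df <- df[df$person_emp_length <= 60, ]   # exclude implausible tenures

  # Median imputation for the interest rate
  rate_median <- median(df$loan_int_rate, na.rm = TRUE)
  df$loan_int_rate[is.na(df$loan_int_rate)] <- rate_median
  df
}

# Toy data (made up) with a duplicate row, an impossible age and tenure, and NAs
toy <- data.frame(
  person_age        = c(22, 144, 25, 25, 30),
  person_emp_length = c(123, 4, NA, NA, 2),
  loan_int_rate     = c(16.0, 11.1, NA, NA, 12.9)
)
cleaned <- clean_credit(toy)
print(cleaned)
```

In this sketch the mode and median are computed after the age filter but before the employment-length filter. The report does not state the order it used, so the imputed values on the real data may differ slightly.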

The table below shows the data that has been cleaned.

##    person_age    person_income     person_home_ownership person_emp_length
##  Min.   :20.00   Min.   :   4000   Length:31665          Min.   : 0.000   
##  1st Qu.:23.00   1st Qu.:  39396   Class :character      1st Qu.: 2.000   
##  Median :26.00   Median :  56000   Mode  :character      Median : 4.000   
##  Mean   :27.71   Mean   :  66495                         Mean   : 4.781   
##  3rd Qu.:30.00   3rd Qu.:  80000                         3rd Qu.: 7.000   
##  Max.   :69.00   Max.   :2039784                         Max.   :38.000   
##  loan_intent         loan_grade          loan_amnt     loan_int_rate  
##  Length:31665       Length:31665       Min.   :  500   Min.   : 5.42  
##  Class :character   Class :character   1st Qu.: 5000   1st Qu.: 8.49  
##  Mode  :character   Mode  :character   Median : 8000   Median :10.99  
##                                        Mean   : 9661   Mean   :11.04  
##                                        3rd Qu.:12500   3rd Qu.:13.16  
##                                        Max.   :35000   Max.   :23.22  
##   loan_status     loan_percent_income cb_person_default_on_file
##  Min.   :0.0000   Min.   :0.0000      Length:31665             
##  1st Qu.:0.0000   1st Qu.:0.0900      Class :character         
##  Median :0.0000   Median :0.1500      Mode  :character         
##  Mean   :0.2155   Mean   :0.1696                               
##  3rd Qu.:0.0000   3rd Qu.:0.2300                               
##  Max.   :1.0000   Max.   :0.8300                               
##  cb_person_cred_hist_length
##  Min.   : 2.000            
##  1st Qu.: 3.000            
##  Median : 4.000            
##  Mean   : 5.801            
##  3rd Qu.: 8.000            
##  Max.   :30.000
## [1] "Sum of total missing values = 0"

Grouping the age:

## age_groups
##     <20   21-30   31-40   41-50   51-60 61 - 70     >70 
##       0   22857    7106    1386     251      65       0
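A grouping like the one tabulated above can be produced with cut(). The labels below are copied from the table, but the break points are an assumption: because the minimum age of 20 does not appear in the "<20" bucket, the sketch places age 20 in the "21-30" group. The ages vector is illustrative only.

```r
# Labels copied from the table above; break points are assumed
age_breaks <- c(-Inf, 19, 30, 40, 50, 60, 70, Inf)
age_labels <- c("<20", "21-30", "31-40", "41-50", "51-60", "61 - 70", ">70")

ages <- c(20, 22, 30, 31, 45, 58, 69)   # illustrative values, not from the dataset

# Intervals are right-closed by default: (19, 30], (30, 40], ...
age_groups <- cut(ages, breaks = age_breaks, labels = age_labels)
print(table(age_groups))
```

Empty groups are kept in the table because cut() returns a factor that carries all seven levels.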

Grouping the income using the B40, M40 and T20 classification:

## income_groups
##        <30720   30721-41280   41281-51720   51721-63000   63001-76080 
##          4228          4801          4860          4778          4369 
##   76081-92280  92281-113400 113401-141840 141841-190440       >190440 
##          3201          2352          1630           888           558

Formatting adjustments:

credit_risk$loan_intent <- gsub("HOMEIMPROVEMENT", "HOME IMPROVEMENT", credit_risk$loan_intent)

credit_risk$loan_intent <- gsub("DEBTCONSOLIDATION", "DEBT CONSOLIDATION", credit_risk$loan_intent)

Transform variables to factors:

credit_risk$loan_status <- as.factor(credit_risk$loan_status)
credit_risk$person_home_ownership <- as.factor(credit_risk$person_home_ownership)
credit_risk$loan_intent <- as.factor(credit_risk$loan_intent)
credit_risk$loan_grade <- as.factor(credit_risk$loan_grade)
credit_risk$cb_person_default_on_file <- as.factor(credit_risk$cb_person_default_on_file)

Checking Outlier:

high_income <- credit_risk[credit_risk$person_income > 1000000, ]
high_income

After inspecting all attributes of the rows above, we do not drop any observations based on person_income: although these incomes are outliers, the values are plausible.

2.3 Exploratory Data Analysis

2.3.1 Univariate Analysis

The graph above highlights that employment length, annual income, loan amount, and credit history length all exhibit positively skewed distributions.

The Loan Intention of Borrower pie chart above outlines the distribution of loan purposes. Education represents the highest share at 19.86%, which suggests that a significant portion of borrowers are investing in their education, possibly to further their careers or pursue higher qualifications. Home improvement represents the lowest share at 11.08%, a smaller but still notable portion of borrowers who are renovating or upgrading their homes.

The Home Ownership of Borrower pie chart above shows that renting is the most common housing status among borrowers at 50.7% of the dataset, followed by mortgage holders at 41.3%. The “others” category is the smallest at 0.3% and likely includes borrowers with unconventional housing arrangements, such as living with family. In conclusion, most borrowers who secure loans do not own their homes outright.

The graph above shows that 82% of borrowers have no previous default on file, indicating that most borrowers have a clean repayment record. Conversely, 18% of borrowers have a history of default, a minority who have previously struggled to meet their financial obligations.

The graph above shows that 78% of borrowers have a non-default status, indicating successful repayment according to agreed terms. The remaining 22% are in default, possibly due to financial challenges, unforeseen circumstances, or inadequate financial management. This sizeable minority highlights the need for proactive measures such as financial education, support programs, and risk assessment strategies to mitigate default risk and promote responsible borrowing.
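Because the two classes are imbalanced, accuracy figures need a reference point. As a minimal sketch, the snippet below takes the mean of loan_status from the cleaned summary table (0.2155, which equals the share of defaults because the variable is coded 0/1) and derives the accuracy of a trivial rule that predicts non-default for every borrower. The variable names are illustrative and not taken from the report’s code.

```r
default_rate     <- 0.2155            # mean of loan_status in the cleaned summary
non_default_rate <- 1 - default_rate

# Accuracy obtained by always predicting the majority class
baseline_accuracy <- max(default_rate, non_default_rate)
print(baseline_accuracy)
```

Any classifier trained on this data should be judged against this baseline of roughly 78% and not against 50%.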

2.3.2 Bivariate Analysis

According to the Department of Statistics Malaysia (DOSM), the B40 income group is defined as individuals with yearly incomes of less than RM63,000, the M40 income group as individuals with yearly incomes between RM63,000 and RM141,840, and the T20 income group as individuals with yearly incomes of more than RM141,840.
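The three tiers can be assigned with cut(). This is a sketch under one stated assumption: an income exactly on a threshold is placed in the lower tier, which follows the income binning shown earlier (51721-63000, 63001-76080). The function name income_tier and the sample incomes are illustrative.

```r
# Thresholds from the DOSM definition quoted above (annual income in RM)
income_tier <- function(income) {
  cut(income,
      breaks = c(-Inf, 63000, 141840, Inf),
      labels = c("B40", "M40", "T20"),
      right  = TRUE)    # assumed: boundary values fall in the lower tier
}

incomes <- c(9600, 59000, 63000, 83000, 141840, 2039784)
tiers <- income_tier(incomes)
print(data.frame(income = incomes, tier = tiers))
```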

From the graph above, the positively skewed distribution of yearly income among the age groups 21-30, 31-40, and 41-50 suggests that the majority of borrowers within these age brackets tend to have lower incomes, with fewer individuals earning higher salaries. Furthermore, a significant portion of borrowers in the B40 group are aged 21-30, indicating a concentration of borrowing activity among younger individuals with relatively lower incomes.

The graphs above reveal several trends in loan status across borrower demographics and loan characteristics. Borrowers who rent and borrowers securing loans for education account for the largest numbers of non-default loans. Borrowers with no default history likewise account for the most non-default loans, pointing to a link between past repayment performance and current loan status. Loans graded A also show the most non-default loans, indicating that this grade is associated with lower default risk. These findings underscore the importance of home ownership status, loan purpose, credit history, and loan grade in determining repayment outcomes.

The graph above gives insight into the borrowing patterns of different age groups and the reasons behind their loans. Borrowers aged 21-30 form the largest group of borrowers, with education the predominant reason and home improvement the least common. This suggests a focus on investing in education and early career development among younger borrowers. For borrowers aged 31-40 and 41-50, medical reasons are the main purpose, indicating a shift towards addressing healthcare needs in this age range. For borrowers aged 51-60 and 61-70, personal reasons become the primary motivator, reflecting a diverse range of financial needs in these older age groups. These findings underscore the importance of understanding demographic-specific borrowing behavior and tailoring financial products and services to the needs of different age cohorts.

The graph above relates borrower age to loan status and loan uptake. Borrowers aged 21-30 account for the largest number of defaults, and this age group also takes out the most loans, suggesting higher demand for credit among younger borrowers. This combination highlights the importance of targeted financial education and support programs that help younger borrowers manage their debts and avoid default. It also underscores the need for financial institutions to assess risk carefully and offer appropriate assistance, especially to younger borrowers who may have limited financial experience or resources.

The graph above compares default risk against three variables: interest rate, yearly income, and loan amount. It shows that higher interest rates, lower incomes, and larger loan amounts increase the likelihood of default. Borrowers facing higher interest rates may struggle with repayment, while those with lower incomes may find it challenging to meet their financial obligations. Larger loan amounts add to the burden on borrowers and can lead to higher default rates. Understanding these relationships is crucial for financial institutions to assess risk effectively and tailor lending practices to mitigate default risk, ultimately promoting financial stability for borrowers.

## 'data.frame':    32581 obs. of  8 variables:
##  $ person_age                : int  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_income             : int  59000 9600 9600 65500 54400 9900 77100 78956 83000 10000 ...
##  $ person_emp_length         : int  123 5 1 4 8 2 8 5 8 6 ...
##  $ loan_amnt                 : int  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_int_rate             : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_status               : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ loan_percent_income       : num  0.59 0.1 0.57 0.53 0.55 0.25 0.45 0.44 0.42 0.16 ...
##  $ cb_person_cred_hist_length: int  3 2 3 2 4 2 3 4 2 3 ...

Strong positive correlation (red):

There is a strong positive correlation between loan_amnt (loan amount) and person_income (personal income). This means that the higher an individual’s income, the larger the amount of borrowing is likely to be. There is also a positive correlation between loan_amnt and loan_int_rate (loan interest rate), which may mean that the larger the loan amount, the higher the corresponding interest rate may be.

Strong negative correlation (blue):

A negative correlation is shown between person_age and loan_int_rate, which may mean that older borrowers may receive lower interest rates.

Weak or no correlation (white or nearly white):

The correlation between person_age and loan_status is weak, suggesting that age may have little effect on the repayment status of a loan. The correlation between cb_person_cred_hist_length (credit history length) and person_income is also very weak, indicating that the length of credit history and personal income may not be directly related.
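The coefficients behind a heatmap of this kind can be computed with cor(). The sketch below uses a small synthetic data frame, with made-up values that only borrow the dataset’s column names, to show the call and the shape of the result. It does not reproduce the report’s figures.

```r
# Synthetic stand-in for the numeric columns (all values are made up)
set.seed(1)
n <- 200
toy <- data.frame(person_income = rlnorm(n, meanlog = 11, sdlog = 0.5))
toy$loan_amnt  <- 0.15 * toy$person_income + rnorm(n, sd = 3000)
toy$person_age <- sample(20:69, n, replace = TRUE)

# Pearson correlation matrix; use = "complete.obs" drops rows containing NA
corr <- cor(toy, use = "complete.obs", method = "pearson")
print(round(corr, 2))
```

Reading the printed coefficients alongside the colors helps confirm whether a relationship described as strong really has a coefficient near 1 or -1.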

2.4 Modelling

2.4.1 Credit Risk Classification

In this task, we aim to develop a machine learning model to identify and predict applicants’ risk categories. Credit scoring of the borrower is important for minimizing the credit risk faced by banks, and its output is the basis for deciding whether a new loan application is accepted or rejected. We classify and predict risk based on the loan status (0 = paid; 1 = unpaid) using random forest and logistic regression models. The random forest classifier is a popular choice for classification tasks due to its robustness and accuracy, while logistic regression is a common and interpretable method for binary classification.

library(caret)
library(randomForest)
library(ggplot2)

# Split the data into training and testing sets
set.seed(123)
train_idx <- sample(1:nrow(credit_risk), size = 0.8 * nrow(credit_risk))
credit_train <- credit_risk[train_idx, ]
credit_test <- credit_risk[-train_idx, ]

2.4.1.1 Random Forest Model

# Train RF model
set.seed(123)
rf_model <- randomForest(loan_status ~ ., data = credit_train, importance = TRUE)

#Prediction testing 
predict_risk<-predict(rf_model, credit_test)


# Convert predicted and actual factors to ensure the levels are in the order 1, 0
predict_risk <- factor(predict_risk, levels = c(1, 0))
loan_status_actual <- factor(credit_test$loan_status, levels = c(1, 0))

# Evaluate the prediction model
# (named conf_matrix_rf so that caret::confusionMatrix() is not masked)
conf_matrix_rf <- table(predict_risk, loan_status_actual)
conf_matrix_rf
##             loan_status_actual
## predict_risk    1    0
##            1 1006   26
##            0  392 4909

## [1] "Number of correct predictions: 5915"
## [1] "Number of wrong predictions: 418"

## [1] "Accuracy:  0.93"
## [1] "Sensitivity:  0.72"
## [1] "Precision:  0.97"
## [1] "Recall:  0.72"
## [1] "F1 Score:  0.83"

The Random Forest model for credit risk classification performs well, with an overall accuracy of 93% and a high precision of 97%. It correctly identifies a large number of paid loans (4,909) and unpaid loans (1,006). However, it makes 418 wrong predictions, of which 26 are paid loans incorrectly identified as unpaid and 392 are unpaid loans identified as paid. The model’s sensitivity (recall) is 72%, and the F1 score of 0.83 reflects a good balance between precision and recall. The confusion matrix and the boxplot with jittered points further highlight the model’s strength in classifying paid loans, but also reveal room for improvement in identifying unpaid loans.
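The figures quoted above follow directly from the four cells of the confusion matrix. As a worked check, the sketch below recomputes them, taking unpaid loans (class 1) as the positive class. The variable names tp, fp, fn and tn are illustrative.

```r
# Counts from the Random Forest confusion matrix (positive class = 1, unpaid)
tp <- 1006   # predicted unpaid, actually unpaid
fp <- 26     # predicted unpaid, actually paid
fn <- 392    # predicted paid, actually unpaid
tn <- 4909   # predicted paid, actually paid

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)          # also called sensitivity
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 2))
# accuracy 0.93, precision 0.97, recall 0.72, f1 0.83
```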

2.4.1.2 Logistic Regression Model

# Train a logistic regression model
logit_model <- glm(loan_status ~ ., data = credit_train, family = binomial)

# Predict on the test set
predict_prob_lr <- predict(logit_model, credit_test, type = "response")
predict_risk_lr <- ifelse(predict_prob_lr > 0.5, 1, 0)

# Evaluate the Prediction Model
conf_matrix_lr <- confusionMatrix(as.factor(predict_risk_lr), credit_test$loan_status)
print(conf_matrix_lr) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4716  625
##          1  219  773
##                                          
##                Accuracy : 0.8667         
##                  95% CI : (0.8581, 0.875)
##     No Information Rate : 0.7793         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5676         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9556         
##             Specificity : 0.5529         
##          Pos Pred Value : 0.8830         
##          Neg Pred Value : 0.7792         
##              Prevalence : 0.7793         
##          Detection Rate : 0.7447         
##    Detection Prevalence : 0.8434         
##       Balanced Accuracy : 0.7543         
##                                          
##        'Positive' Class : 0              
## 
# Create a data frame for comparison
comparison_df_lr <- data.frame(Actual = credit_test$loan_status, Predicted = as.factor(predict_risk_lr))

## [1] "Number of correct predictions: 5489"
## [1] "Number of wrong predictions: 844"

## [1] "Precision:  0.78"
## [1] "Recall:  0.55"
## [1] "F1 Score:  0.65"

The Logistic Regression model achieved an accuracy of 87% with a precision of 78%. It correctly identifies 4,716 paid loans and 773 unpaid loans. However, it makes 844 wrong predictions, of which 219 are paid loans incorrectly identified as unpaid and 625 are unpaid loans identified as paid. Taking unpaid loans (class 1) as the positive class, the model’s recall is 55% and its F1 score is 0.65; the Sensitivity of 0.9556 in the caret output refers to class 0, which caret treats as the positive class here. Compared with the RF model, the LR model is noticeably less accurate (87% versus 93%) and makes roughly twice as many wrong predictions (844 versus 418). In summary, the RF model is the better classifier of credit risk based on individual loan repayment status.
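The difference between the 55% recall quoted above and the Sensitivity of 0.9556 in the caret output comes down to which class is treated as positive. The sketch below rebuilds the caret table by hand and computes the recall of each class. The helper class_recall is hypothetical.

```r
# Counts from the caret table above: rows = prediction, columns = reference
cm <- matrix(c(4716, 219, 625, 773), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Recall of a class = correct predictions of that class / actual members of it
class_recall <- function(cm, cls) cm[cls, cls] / sum(cm[, cls])

recall_paid   <- class_recall(cm, "0")   # reported by caret as Sensitivity
recall_unpaid <- class_recall(cm, "1")   # reported by caret as Specificity
print(round(c(paid = recall_paid, unpaid = recall_unpaid), 4))
```

If the intent is to report metrics for unpaid loans directly, caret’s confusionMatrix() accepts a positive argument (for example positive = "1") that selects the positive class.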

2.4.2 Loan Amount Prediction

In this task, we aim to develop machine learning models that accurately predict the loan amount for which applicants should be approved. This is crucial for banks when assessing applicants’ financial health and determining loan amounts. We use LightGBM and XGBoost, two efficient and accurate regression algorithms, to train models on applicants’ financial indicators such as person_income, person_home_ownership, loan_grade, etc. The goal is to predict the loan amount for each applicant accurately, providing a basis for banks’ loan decisions. We use evaluation metrics such as RMSE, MAE, MSE, and R² to validate the accuracy and reliability of the models.

library(caret)
library(randomForest)
library(ggplot2)
library(dplyr)   # provides %>% and select() used below

# Extract target variable
y_train <- credit_train$loan_amnt
y_test <- credit_test$loan_amnt

# Delete the target variable column and keep the other characteristics
X_train <- credit_train %>% select(-loan_amnt)
X_test <- credit_test %>% select(-loan_amnt)

# Convert data to matrix format
X_train <- as.matrix(X_train)
X_test <- as.matrix(X_test)

2.4.2.1 LightGBM Model

# Build LightGBM model
library(lightgbm)
library(data.table)
library(ggplot2)

# Convert data frame to matrix format
X_train <- as.matrix(credit_train[, -which(names(credit_train) == "loan_amnt")])
y_train <- credit_train$loan_amnt

X_test <- as.matrix(credit_test[, -which(names(credit_test) == "loan_amnt")])
y_test <- credit_test$loan_amnt

# Create LightGBM datasets
train_data <- lgb.Dataset(data = X_train, label = y_train)
test_data <- lgb.Dataset(data = X_test, label = y_test)

# set LightGBM params
params <- list(
  objective = "regression",
  metric = "rmse",
  seed = 42,
  num_threads = 1
)

# Train the LightGBM model and record the evaluation log
model <- lgb.train(
  params = params,
  data = train_data,
  nrounds = 100,
  valids = list(train = train_data, test = test_data),
  eval_freq = 10,
  early_stopping_rounds = 10,
  record = TRUE
)
## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000367 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Info] Total Bins 675
## [LightGBM] [Info] Number of data points in the train set: 25332, number of used features: 7
## [LightGBM] [Info] Start training from score 9692.723630
## [1]:  train's rmse:5800.38  test's rmse:5616.08 
## Will train until there is no improvement in 10 rounds.
## [11]:  train's rmse:2317.9  test's rmse:2230.3 
## [21]:  train's rmse:1017.96  test's rmse:973.522 
## [31]:  train's rmse:576.974  test's rmse:551.996 
## [41]:  train's rmse:448.01  test's rmse:440.563 
## [51]:  train's rmse:410.124  test's rmse:413.922 
## [61]:  train's rmse:392.769  test's rmse:405.198 
## [71]:  train's rmse:381.808  test's rmse:401.081 
## [81]:  train's rmse:373.897  test's rmse:398.502 
## [91]:  train's rmse:367.393  test's rmse:397.931 
## [100]:  train's rmse:362.753  test's rmse:397.95 
## Did not meet early stopping, best iteration is: [100]:  train's rmse:362.753  test's rmse:397.95
# Model prediction
predictions <- predict(model, X_test)

# Lightgbm Model evaluation
evaluate_model <- function(actual, predicted) {
  rmse_value <- sqrt(mean((actual - predicted)^2)) # RMSE
  mae_value <- mean(abs(actual - predicted))       # MAE
  mse_value <- mean((actual - predicted)^2)        # MSE
  
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  r2_value <- 1 - (ss_res / ss_tot)                # R²
  
  cat("Model Evaluation Metrics:\n")
  cat("RMSE:", rmse_value, "\n")
  cat("MAE:", mae_value, "\n")
  cat("MSE:", mse_value, "\n")
  cat("R²:", r2_value, "\n")
}


evaluate_model(y_test, predictions)
## Model Evaluation Metrics:
## RMSE: 397.7587 
## MAE: 237.2592 
## MSE: 158212 
## R²: 0.9958508
# Get a log of the training process
eval_log <- model$record_evals

# Extract RMSE on the training set and validation set
train_rmse <- unlist(eval_log$train$rmse$eval)
test_rmse <- unlist(eval_log$test$rmse$eval)
iterations <- seq_along(train_rmse)

# Create a data frame for drawing
eval_df <- data.frame(
  iter = iterations,
  train_rmse = train_rmse,
  test_rmse = test_rmse
)

# draw a chart
learning_curve <- ggplot(eval_df, aes(x = iter)) +
  geom_line(aes(y = train_rmse, color = "Train RMSE"), size = 1) +
  geom_line(aes(y = test_rmse, color = "Validation RMSE"), size = 1, linetype = "dashed") +
  scale_color_manual(values = c("Train RMSE" = "#33a02c", "Validation RMSE" = "#4c72b0")) +
  theme_minimal() +
  theme(
    text = element_text(size = 14, family = "Arial"),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    legend.background = element_rect(fill = "lightgray", size = 0.5, linetype = "solid"),
    panel.background = element_rect(fill = "#d4e6f1", color = NA),
    plot.background = element_rect(fill = "#d4e6f1", color = NA),  
    panel.grid.major = element_line(color = "white")                
  ) +
  labs(
    title = "LightGBM Learning Curve (RMSE)",
    x = "Iteration",
    y = "RMSE",
    color = "Legend"
  )

# print learning_curve
print(learning_curve)

The LightGBM model performs well in predicting loan amounts according to the evaluation metrics: RMSE = 397.7587, MAE = 237.2592, MSE = 158212, and R² = 0.9958. In particular, the R² value close to 1 suggests that the model explains most of the variance. The learning curve shows a marked reduction in RMSE for both the training and validation sets as the number of iterations increases, stabilizing towards the end, which indicates that the model converges and performs well.
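The reported metrics are linked by simple identities, which gives a quick consistency check. The sketch below uses only the figures printed in the evaluation output. The variable names are illustrative, and implied_sd is a derived quantity, not one reported by the model.

```r
# Values copied from the evaluation output above
mse_reported  <- 158212
rmse_reported <- 397.7587
r2_reported   <- 0.9958508

# RMSE is the square root of MSE
rmse_from_mse <- sqrt(mse_reported)
print(rmse_from_mse)

# R^2 = 1 - SS_res / SS_tot and MSE = SS_res / n, so SS_tot / n = MSE / (1 - R^2).
# Its square root is the spread of loan_amnt in the test set implied by the metrics.
implied_sd <- sqrt(mse_reported / (1 - r2_reported))
print(implied_sd)
```

Comparing the RMSE with this implied spread puts the error in context: an RMSE of about 398 is small relative to how widely loan amounts vary in the test set.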

2.4.2.2 XGBoost Model
# Build XGBoost Model
library(xgboost)
library(data.table)

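# Encode factor/character columns as integer codes so the data fit in a numeric matrix.
# Note: levels are derived separately for each data frame, so the train and test sets
# must contain the same set of values in each column for the codes to match.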
convert_to_numeric <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]]) || is.character(df[[col]])) {
      levels <- sort(unique(df[[col]]))
      df[[col]] <- as.numeric(factor(df[[col]], levels = levels))
    }
  }
  return(df)
}

credit_train <- convert_to_numeric(credit_train)
credit_test <- convert_to_numeric(credit_test)

# Convert data frame to matrix format
X_train <- as.matrix(credit_train[, -which(names(credit_train) == "loan_amnt")])
y_train <- credit_train$loan_amnt

X_test <- as.matrix(credit_test[, -which(names(credit_test) == "loan_amnt")])
y_test <- credit_test$loan_amnt

# create DMatrix object
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)


# set XGBoost params
params <- list(
  objective = "reg:squarederror",
  booster = "gbtree",
  seed = 42,
  nthread = 1 
)



# Define a custom evaluation function for recording errors
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(mean((labels - preds)^2))
  return(list(metric = "rmse", value = err))
}


model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  feval = evalerror,
  maximize = FALSE, # RMSE is an error metric, so lower values are better
  print_every_n = 10,
  early_stopping_rounds = 10,
  callbacks = list(
    cb.print.evaluation(period = 10),
    cb.evaluation.log()
  )
)
## [1]  train-rmse:8195.188197  test-rmse:8016.633601 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:501.794972   test-rmse:504.228730 
## [21] train-rmse:340.134299   test-rmse:388.029680 
## [31] train-rmse:316.023554   test-rmse:371.370527 
## [41] train-rmse:296.876646   test-rmse:361.364363 
## [51] train-rmse:280.699559   test-rmse:353.763273 
## [61] train-rmse:264.867155   test-rmse:346.254464 
## [71] train-rmse:255.084645   test-rmse:342.720700 
## [81] train-rmse:243.338333   test-rmse:335.643625 
## [91] train-rmse:234.935492   test-rmse:331.515232 
## [100]    train-rmse:225.312784   test-rmse:327.362310
# Model prediction
predictions <- predict(model, X_test)

# XGBoost model evaluation
evaluate_model <- function(actual, predicted) {
  rmse_value <- sqrt(mean((actual - predicted)^2)) # RMSE
  mae_value <- mean(abs(actual - predicted))       # MAE
  mse_value <- mean((actual - predicted)^2)        # MSE
  
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  r2_value <- 1 - (ss_res / ss_tot)                # R²
  
  cat("Model Evaluation Metrics:\n")
  cat("RMSE:", rmse_value, "\n")
  cat("MAE:", mae_value, "\n")
  cat("MSE:", mse_value, "\n")
  cat("R²:", r2_value, "\n")
}


evaluate_model(y_test, predictions)
## Model Evaluation Metrics:
## RMSE: 327.3623 
## MAE: 201.0726 
## MSE: 107166.1 
## R²: 0.9971895
# Get a log of the training process
eval_log <- model$evaluation_log



# Extract RMSE on the training set and validation set
train_rmse <- unlist(eval_log$train_rmse)
test_rmse <- unlist(eval_log$test_rmse)
iterations <- seq_along(train_rmse)

# Create a data frame for drawing
eval_df <- data.frame(
  iter = iterations,
  train_rmse = train_rmse,
  test_rmse = test_rmse
)

# draw a chart
learning_curve <- ggplot(eval_df, aes(x = iter)) +
  geom_line(aes(y = train_rmse, color = "Train RMSE"), size = 1) +
  geom_line(aes(y = test_rmse, color = "Validation RMSE"), size = 1, linetype = "dashed") +
  scale_color_manual(values = c("Train RMSE" = "#33a02c", "Validation RMSE" = "#4c72b0")) +
  theme_minimal() +
  theme(
    text = element_text(size = 14, family = "Arial"),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(face = "bold"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    legend.background = element_rect(fill = "lightgray", size = 0.5, linetype = "solid"),
    panel.background = element_rect(fill = "#d4e6f1", color = NA),
    plot.background = element_rect(fill = "#d4e6f1", color = NA),  
    panel.grid.major = element_line(color = "white")                
  ) +
  labs(
    title = "XGBoost Learning Curve (RMSE)",
    x = "Iteration",
    y = "RMSE",
    color = "Legend"
  )

# print learning_curve
print(learning_curve)

The XGBoost model also predicts loan amounts well on the test set: RMSE = 327.3623, MAE = 201.0726, MSE = 107166.1, and R² = 0.9972. An R² this close to 1 indicates that the model explains almost all of the variance in the loan amount. The training log shows RMSE dropping from about 8195 at round 1 to about 502 at round 11 on the training set, then improving more slowly. Validation RMSE was still decreasing at the final round (331.5 at round 91, 327.4 at round 100), so early stopping was never triggered and additional rounds might reduce the error further. The gap between training RMSE (225.3) and validation RMSE (327.4) at round 100 points to some overfitting.
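
To put the two regression models side by side, the short sketch below computes the relative error reduction of XGBoost over LightGBM, and the XGBoost validation-to-training RMSE ratio at round 100. All numbers are copied from the outputs reported in this section. The helper rel_reduction and the variable names are illustrative, and the comparison assumes both models were evaluated on the same credit_test split.

```r
# Test-set metrics copied from the evaluation output reported in this section.
metrics <- data.frame(
  model = c("LightGBM", "XGBoost"),
  rmse  = c(397.7587, 327.3623),
  mae   = c(237.2592, 201.0726)
)

# Hypothetical helper: reduction relative to the baseline, as a fraction.
rel_reduction <- function(baseline, candidate) (baseline - candidate) / baseline

rmse_reduction <- rel_reduction(metrics$rmse[1], metrics$rmse[2])
mae_reduction  <- rel_reduction(metrics$mae[1], metrics$mae[2])

# XGBoost RMSE at round 100, copied from the training log.
xgb_train_rmse <- 225.312784
xgb_test_rmse  <- 327.362310
gap_ratio <- xgb_test_rmse / xgb_train_rmse

cat(sprintf("RMSE reduction (XGBoost vs LightGBM): %.1f%%\n", 100 * rmse_reduction))
cat(sprintf("MAE reduction (XGBoost vs LightGBM):  %.1f%%\n", 100 * mae_reduction))
cat(sprintf("XGBoost validation/train RMSE ratio:  %.2f\n", gap_ratio))
```

With these figures, XGBoost's test RMSE is roughly 18% lower than LightGBM's, while its validation RMSE is about 1.45 times its training RMSE.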

3. Conclusion

In today’s financial sector, accurate credit risk assessment and loan amount prediction models are essential. By analyzing risk and financial health indicators, these models help financial institutions balance extending credit against managing risk. In practice, machine learning can substantially improve the accuracy and efficiency of these assessments. The analysis and modeling in this report lead to the following findings:

Borrowers’ loan intentions and home ownership: The document charts the distribution of borrowers applying for loans for different purposes, with education and home improvement being the main reasons. Moreover, most borrowers rent rather than own property.

Borrower default history and loan status: The analysis shows that a majority of borrowers have a history of default and only a minority show non-default status, highlighting the need for financial education and risk assessment strategies to mitigate default risk.

Borrowing patterns across age groups: Borrowers aged 21-30 are the group taking out the most loans, mainly for education reasons, while older groups tend to take out loans for personal reasons. This suggests that lenders need to tailor financial products and services to meet the needs of different age groups.

Model evaluation and prediction: This report evaluates Random Forest, logistic regression, LightGBM, and XGBoost models. The Random Forest model performed best in credit risk classification, with an accuracy of 93% and a precision of 97%, while the XGBoost model performed best in loan amount prediction, with RMSE = 327.3623, MAE = 201.0726, MSE = 107166.1, and R² = 0.9972. These results show which model suits each task and where further optimization is worthwhile.

References

[1] Sudhamathy, G., & Venkateswaran, C. J. (2016). Analytics using R for predicting credit defaulters. 2016 IEEE International Conference on Advances in Computer Applications (ICACA). https://doi.org/10.1109/icaca.2016.7887925

[2] Bussmann, N., Giudici, P., Marinelli, D., & Papenbrock, J. (2020). Explainable machine learning in credit risk management. Computational Economics, 57(1), 203–216. https://doi.org/10.1007/s10614-020-10042-0

[3] El-Qadi, A., Trocan, M., Conde-Cespedes, P., Frossard, T., & Díaz-Rodríguez, N. (2023). Credit risk scoring using a data fusion approach. In Lecture notes in computer science (pp. 769–781). https://doi.org/10.1007/978-3-031-41456-5_58

[4] Li, M. (2021). Uses and abuses of statistical control variables: Ruling out or creating alternative explanations? Journal of Business Research, 126, 472–488. https://doi.org/10.1016/j.jbusres.2020.12.037

[5] Bhattacharyya, S., Jha, S., Tharakunnel, K., & Westland, J. C. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3), 602–613. https://doi.org/10.1016/j.dss.2010.08.008

[6] Pozzolo, A. D., Boracchi, G., Caelen, O., Alippi, C., & Bontempi, G. (2018). Credit Card Fraud Detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784–3797. https://doi.org/10.1109/tnnls.2017.2736643

[7] Kohout, J., Komárek, T., Čech, P., Bodnár, J., & Lokoč, J. (2018). Learning communication patterns for malware discovery in HTTPs data. Expert Systems With Applications, 101, 129–142. https://doi.org/10.1016/j.eswa.2018.02.010

[8] King, M. (2020). What makes a successful corporate investigator. Journal of Financial Crime, 27(3), 701–714. https://doi.org/10.1108/jfc-02-2020-0019

[9] Quijano-Sánchez, L., Liberatore, F., Camacho-Collados, J., & Camacho-Collados, M. (2018). Applying automatic text-based detection of deceptive language to police reports: Extracting behavioral patterns from a multi-step classification model to understand how we lie to the police. Knowledge-based Systems, 149, 155–168. https://doi.org/10.1016/j.knosys.2018.03.010

[10] Yanenkova, I., Nehoda, Y., Drobyazko, S., Zavhorodnii, A., & Berezovska, L. (2021). Modeling of bank credit risk management using the cost risk model. Journal of Risk and Financial Management, 14(5), 211. https://doi.org/10.3390/jrfm14050211