Group 11: Matriculation Numbers and Names

data <- data.frame(
  Matric = c("23119730", "23063431", "23083896", "S2157112", "23070795"),
  Full_Name = c("Kau Zhi Wei", "Chang Qi Han", "Cheah Yining", "Chee Zi Yaw", "Zhuang Zhen Yu")
)

data
##     Matric      Full_Name
## 1 23119730    Kau Zhi Wei
## 2 23063431   Chang Qi Han
## 3 23083896   Cheah Yining
## 4 S2157112    Chee Zi Yaw
## 5 23070795 Zhuang Zhen Yu

Introduction

Background of the study

A loan is a sum of money borrowed from a licensed financial institution, often a bank, that carries a certain amount of interest and must be repaid by the borrower within a predetermined period. A loan serves varying purposes for different groups. From the borrower's perspective, it can act as a financial resource, providing startup capital or initial funding to launch a project or business. For financial institutions, on the other hand, the interest generated from loans is one of their revenue streams. The loan application process can be lengthy, involving multiple steps, from the submission of the application by the applicant to back-and-forth verification and validation, and finally approval. With the growing demand for loans, challenges arise for both parties: applicants may face extended waiting periods, while financial institutions risk overlooking some applications, potentially leading to reputational damage and perceptions of incompetence. On top of that, even after a prolonged verification and validation process, there is no assurance that the chosen applicant will adhere to the repayment timeline. To address these limitations, machine learning approaches have been extensively studied for their potential to predict the eligibility of loan applications.

Objectives

This project aims to achieve the following objectives:

  • To perform exploratory data analysis to understand the relationship of the features with the loan approval status.
  • To build classification models to predict the loan status of the applicants.
  • To develop regression models to predict the credit score of the applicants.

Methodology


The methodology of this study is structured as follows:

  1. Data acquisition: The dataset, which focuses on loan approval data, was obtained from Kaggle.
  2. Data pre-processing: The dataset was explored and checked for missing values, outliers, duplicates and mismatched data types, and action was taken where required, including removing the affected instances and correcting the data types of the features.
  3. Exploratory data analysis: Multiple analyses were carried out, consisting of univariate analysis, bivariate analysis and correlation analysis, to better understand the data distribution and the relationships between features. Different plots were produced to visualize the output, such as boxplots, histograms, pie charts and heatmaps.
  4. Train-test split: The dataset was split in an 80:20 ratio, resulting in 80% training data and 20% test data. Categorical variables were then encoded and numerical variables standardized to ensure consistency across the data.
  5. Machine learning training: In this study, two tasks were performed: binary classification of loan approval status and regression prediction of the applicant's credit score.
  6. Evaluation: The trained models were evaluated using different metrics for each task. For classification, accuracy, precision, recall and F1-score were used, whereas for regression, root mean squared error, mean absolute error and R-squared were applied; these metrics are written out as plain R helpers in the sketch below.
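
For reference, the evaluation metrics listed in step 6 can be expressed as plain R helpers. This is a minimal sketch for illustration only, not the evaluation code used later (which relies on caret); the function names clf_metrics and reg_metrics are ad hoc.

# Sketch: evaluation metrics for the two tasks, written out explicitly.
# `actual` and `predicted` are vectors of class labels (classification)
# or numeric values (regression); `positive` names the positive class.
clf_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / length(actual),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

reg_metrics <- function(actual, predicted) {
  c(rmse = sqrt(mean((actual - predicted)^2)),
    mae  = mean(abs(actual - predicted)),
    r2   = 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
}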

Dataset Exploration

Dataset Details

Dataset Overview

# Import dataset
df <- read.csv("loan_data_amended.csv")
head(df)
##   person_age person_gender person_education person_income person_emp_exp
## 1         22        female           Master         71948              0
## 2         21        female      High School         12282              0
## 3         25        female      High School         12438              3
## 4         23        female         Bachelor         79753              0
## 5         24          male           Master         66135              1
## 6         21        female      High School         12951              0
##   person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
## 1                  RENT     35000    PERSONAL         16.02                0.49
## 2                   OWN      1000   EDUCATION         11.14                0.08
## 3              MORTGAGE      5500     MEDICAL         12.87                0.44
## 4                  RENT     35000     MEDICAL         15.23                0.44
## 5                  RENT     35000     MEDICAL         14.27                0.53
## 6                   OWN      2500     VENTURE         7.14_                0.19
##   cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## 1                          3          561                             No
## 2                          2          504                            Yes
## 3                          3          635                             No
## 4                          2          675                             No
## 5                          4          586                             No
## 6                          2          532                             No
##   loan_status
## 1           1
## 2           0
## 3           1
## 4           1
## 5           1
## 6           1

Feature Description

The description of each feature is tabulated as follows:

variable description
person_age Applicant’s age
person_gender Applicant’s gender
person_education Applicant’s highest education level
person_income Applicant’s annual income
person_emp_exp Work experience (in years)
person_home_ownership Home ownership status
loan_amnt Loan amount requested
loan_intent Purpose of the loan application
loan_int_rate Loan interest rate
loan_percent_income Loan amount as a percentage of annual income
cb_person_cred_hist_length Length of credit history (in years)
credit_score Applicant’s credit score
previous_loan_defaults_on_file Indicator of previous loan defaults
loan_status Loan approval status

Dataset Structure

str(df)
## 'data.frame':    45000 obs. of  14 variables:
##  $ person_age                    : int  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_gender                 : chr  "female" "female" "female" "female" ...
##  $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...
##  $ person_income                 : int  71948 12282 12438 79753 66135 12951 93471 95550 100684 12739 ...
##  $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...
##  $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...
##  $ loan_amnt                     : int  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
##  $ loan_int_rate                 : chr  "16.02" "11.14" "12.87" "15.23" ...
##  $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
##  $ cb_person_cred_hist_length    : chr  "3" "2" "3" "2" ...
##  $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...
##  $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...
##  $ loan_status                   : int  1 0 1 1 1 1 1 1 1 1 ...

This dataset consists of 45,000 instances and 14 variables, comprising categorical and numerical variables. Based on the output above, loan_int_rate and cb_person_cred_hist_length have been incorrectly read as character variables, when in fact they are continuous and discrete numerical variables, respectively. Because these columns may contain text or mixed values, converting them to numeric directly would coerce such entries to NA and risk losing important information, so the offending values are inspected first (see the sketch below) before any conversion.
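
As a quick check, the entries that would be coerced to NA by a direct conversion can be listed first. This is a minimal sketch (the helper non_numeric() is ad hoc; the actual cleaning is performed in the pre-processing section):

# Sketch: show which loan_int_rate and cb_person_cred_hist_length entries
# cannot be parsed as numbers and would become NA under as.numeric()
non_numeric <- function(x) unique(x[is.na(suppressWarnings(as.numeric(x))) & !is.na(x)])
non_numeric(df$loan_int_rate)
non_numeric(df$cb_person_cred_hist_length)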

Summary Statistics

summary(df)
##    person_age     person_gender      person_education   person_income     person_emp_exp   person_home_ownership   loan_amnt     loan_intent        loan_int_rate      loan_percent_income cb_person_cred_hist_length  credit_score   previous_loan_defaults_on_file  loan_status    
##  Min.   : 20.00   Length:45000       Length:45000       Min.   :   8000   Min.   :  0.00   Length:45000          Min.   :  500   Length:45000       Length:45000       Min.   :0.0000      Length:45000               Min.   :390.0   Length:45000                   Min.   :0.0000  
##  1st Qu.: 24.00   Class :character   Class :character   1st Qu.:  47202   1st Qu.:  1.00   Class :character      1st Qu.: 5000   Class :character   Class :character   1st Qu.:0.0700      Class :character           1st Qu.:601.0   Class :character               1st Qu.:0.0000  
##  Median : 26.00   Mode  :character   Mode  :character   Median :  67046   Median :  4.00   Mode  :character      Median : 8000   Mode  :character   Mode  :character   Median :0.1200      Mode  :character           Median :640.0   Mode  :character               Median :0.0000  
##  Mean   : 27.76                                         Mean   :  80311   Mean   :  5.41                         Mean   : 9583                                         Mean   :0.1397                                 Mean   :632.6                                  Mean   :0.2222  
##  3rd Qu.: 30.00                                         3rd Qu.:  95782   3rd Qu.:  8.00                         3rd Qu.:12237                                         3rd Qu.:0.1900                                 3rd Qu.:670.0                                  3rd Qu.:0.0000  
##  Max.   :144.00                                         Max.   :7200766   Max.   :125.00                         Max.   :35000                                         Max.   :0.6600                                 Max.   :850.0                                  Max.   :1.0000  
##                                                         NA's   :9

Here are some key insights based on the results shown above:

  • Potential outliers were identified in the dataset: the maximum applicant age is 144, which is unrealistic, and the maximum work experience is 125 years, suggesting a likely data entry error.
  • Missing values were observed in person_income. These issues are addressed in the next section; a quick count of the affected rows is sketched below.
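
A quick count of the affected rows (a sketch; the corresponding filtering and removal are carried out in the pre-processing below):

# Sketch: number of rows carrying the implausible values flagged above
sum(df$person_age > 100)      # applicants older than 100 years
sum(df$person_emp_exp > 50)   # more than 50 years of work experience
sum(is.na(df$person_income))  # missing annual income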

Data Pre-processing and Exploratory Data Analysis

colSums(is.na(df))
##                     person_age                  person_gender               person_education                  person_income                 person_emp_exp          person_home_ownership                      loan_amnt                    loan_intent                  loan_int_rate            loan_percent_income     cb_person_cred_hist_length                   credit_score previous_loan_defaults_on_file                    loan_status 
##                              0                              0                              0                              9                              0                              0                              0                              0                              0                              0                              0                              0                              0                              0
duplicatecount <- sum(duplicated(df))
print(duplicatecount)
## [1] 0
# Categorical Columns to process
columns <- c("person_gender", "loan_status", "person_education", "person_home_ownership", 
             "previous_loan_defaults_on_file", "loan_intent")

# Extract Unique Values and Counts
uniquevalues <- lapply(df[columns], unique)
counts <- lapply(df[columns], table)

# Print Unique Values
print(uniquevalues)
## $person_gender
## [1] "female" "male"  
## 
## $loan_status
## [1] 1 0
## 
## $person_education
## [1] "Master"      "High School" "Bachelor"    "Associate"   "Doctorate"  
## 
## $person_home_ownership
## [1] "RENT"     "OWN"      "MORTGAGE" "OTHER"   
## 
## $previous_loan_defaults_on_file
## [1] "No"  "Yes"
## 
## $loan_intent
## [1] "PERSONAL"          "EDUCATION"         "MEDICAL"           "VENTURE"           "HOMEIMPROVEMENT"   "DEBTCONSOLIDATION"
# Print Counts
print(counts)
## $person_gender
## 
## female   male 
##  20159  24841 
## 
## $loan_status
## 
##     0     1 
## 35000 10000 
## 
## $person_education
## 
##   Associate    Bachelor   Doctorate High School      Master 
##       12028       13399         621       11972        6980 
## 
## $person_home_ownership
## 
## MORTGAGE    OTHER      OWN     RENT 
##    18489      117     2951    23443 
## 
## $previous_loan_defaults_on_file
## 
##    No   Yes 
## 22142 22858 
## 
## $loan_intent
## 
## DEBTCONSOLIDATION         EDUCATION   HOMEIMPROVEMENT           MEDICAL          PERSONAL           VENTURE 
##              7145              9153              4783              8548              7552              7819
library(plotly)
# Pie chart for person_gender
plot_ly(df, labels = ~person_gender, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
  layout(title = "Distribution of Person Gender")

The pie chart above depicts the distribution of applicants' gender in the dataset. The dataset has more male applicants (55.2%) than female applicants (44.8%), so it is slightly skewed towards male applicants.

# Pie chart for loan_status
plot_ly(df, labels = ~loan_status, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
  layout(title = "Distribution of Loan Status")

Based on the output shown above, only a small proportion of applicants (22.2%) qualify for the loan, and the dataset is highly imbalanced. This suggests a strict loan approval process among the samples in this dataset.
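
The imbalance can be confirmed numerically from the class proportions (a minimal check):

# Sketch: proportion of rejected (0) and approved (1) applications
round(prop.table(table(df$loan_status)), 3)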

# Pie chart for person_education
plot_ly(df, labels = ~person_education, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
  layout(title = "Distribution of Person Education")

The applicants in this dataset come from three major educational backgrounds: Bachelor's (29.8%), Associate (26.7%) and High School (26.6%), whereas the smallest group is applicants with a Doctorate, comprising 1.38% of total applicants.

# Pie chart for person_home_ownership
plot_ly(df, labels = ~person_home_ownership, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
  layout(title = "Distribution of Person Home Ownership")

Most applicants are renting (52.1%) or have a mortgage (41.1%), while only a small group of applicants own a house outright.

# Pie chart for previous_loan_defaults_on_file
plot_ly(df, labels = ~previous_loan_defaults_on_file, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
  layout(title = "Distribution of Previous Loan Defaults")

About half of the applicants have defaulted on a loan previously, which indicates that they might be financially unstable and could struggle with future debt repayment. Therefore, further analysis is required to assess this group of applicants.

# Pie chart for loan_intent
plot_ly(df, labels = ~loan_intent, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
  layout(title = "Distribution of Loan Intent")

Based on the loan intention, the top two reasons are education (20.3%) and medical (19%), indicating that many applicants seek financial support for academic and healthcare expenses, which in turn reflects the high cost of these two areas. Apart from that, a similar number of applicants seek funding for business ventures (17.4%), personal purposes (16.8%) and debt consolidation (15.9%). Home improvement accounts for the smallest proportion (10.6%), which might be due to the low number of homeowners in this dataset.

# check the unique in cb_person_cred_hist_length column
personcred <- unique(df$cb_person_cred_hist_length)
print(personcred)
##  [1] "3"  "2"  "4"  "4_" "3_" "2_" "8"  "7"  "6"  "9"  "10" "5"  "8_" "11" "16" "15" "12" "13" "17" "14" "25" "28" "27" "22" "19" "29" "23" "26" "20" "21" "30" "24" "18" "6_"

After checking the unique values, some entries have trailing underscores ("_"). This causes an inconsistent data format, so the trailing underscores are removed and the columns are converted to numeric.

# Clean trailing underscores
df$cb_person_cred_hist_length <- gsub("_$", "", df$cb_person_cred_hist_length)
df$cb_person_cred_hist_length <- as.numeric(df$cb_person_cred_hist_length)
df$loan_int_rate <- gsub("_$", "", df$loan_int_rate)
df$loan_int_rate <- as.numeric(df$loan_int_rate)
colSums(is.na(df))
##                     person_age                  person_gender               person_education                  person_income                 person_emp_exp          person_home_ownership                      loan_amnt                    loan_intent                  loan_int_rate            loan_percent_income     cb_person_cred_hist_length                   credit_score previous_loan_defaults_on_file                    loan_status 
##                              0                              0                              0                              9                              0                              0                              0                              0                              4                              0                              0                              0                              0                              0

After removing the trailing underscores in loan_int_rate and cb_person_cred_hist_length and converting them to numeric, the missing values were re-checked: there are 4 missing values in loan_int_rate, and the original 9 missing values in person_income remain.

Subsequently, proceeding to univariate analysis of the numerical variables, the boxplots and histograms below show the distribution of values across each numerical variable in this dataset. These visualizations also make it possible to detect potential outliers visually.

library(ggplot2)
library(gridExtra)

numericalcols <- df[sapply(df, is.numeric)]
numericalcols <- numericalcols[, setdiff(names(numericalcols), "loan_status")]

plot_list <- lapply(names(numericalcols), function(col) {
  ggplot(df, aes_string(y = col)) +
    geom_boxplot(fill = "lightblue", color = "black") +
    labs(title = paste("Boxplot of", col), y = col) +
    theme_minimal()
})

do.call(grid.arrange, c(plot_list, ncol = 2))

From the boxplots, distinct patterns in the numerical features are observed:

  1. Outliers are present in most numerical variables, including person_income, loan_amnt, person_age, person_emp_exp, loan_percent_income, and cb_person_cred_hist_length, where a significant number of extreme values lie beyond the upper boundary of the box.
  2. loan_int_rate and credit_score exhibit relatively stable distributions with fewer extreme values. The outliers in loan_int_rate might correspond to high-risk borrowers, whereas credit_score shows an approximately normal distribution.
# Remove rows where person_income has missing values
df <- df[!is.na(df$person_income), ]
dim(df)
## [1] 44991    14

Since the person_income column contains many outliers, mean imputation would be unreliable; as only 9 rows are affected, the rows with missing person_income were removed instead.
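
The skew behind this decision is visible when comparing the centre of person_income with its extremes; a quick sketch:

# Sketch: the mean of person_income sits well above the median and the maximum
# is extreme, so imputing missing values with the mean would be biased upward
c(median = median(df$person_income),
  mean   = mean(df$person_income),
  max    = max(df$person_income))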

numericalcols <- df[sapply(df, is.numeric)]
numericalcols <- numericalcols[, setdiff(names(numericalcols), "loan_status")]

# Create a list of ggplot objects for each numerical column
plot_list <- lapply(names(numericalcols), function(col) {
  ggplot(df, aes_string(x = col)) +
    geom_histogram(aes(y = ..count..), bins = 30, fill = "lightblue", color = "gray") +
    labs(title = paste("Distribution of", col), x = col, y = "Count") +
    theme_minimal()
})

do.call(grid.arrange, c(plot_list, ncol = 2, top = "Distributions of Numerical Columns"))

Based on the above histograms, several insights are obtained:

  1. Six features are skewed to the right: person age, person income, person employment experience, loan amount, loan percent income and credit history length. Most of their values are concentrated toward the lower range, with extreme values present in each of these features; the skewness is quantified in the sketch below.
  2. The distribution of loan interest rates appears multimodal, with its highest peak around 10% and multiple smaller peaks rather than a smooth curve. This may be due to the presence of different loan types, which carry different interest rates.
  3. Credit score is nearly normally distributed.
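
The skewness noted in point 1 can also be quantified; a sketch assuming the e1071 package is installed:

# Sketch (assumes e1071 is available): sample skewness of the right-skewed
# features; values well above 0 confirm the long right tails in the histograms
library(e1071)
sapply(df[, c("person_age", "person_income", "person_emp_exp",
              "loan_amnt", "loan_percent_income", "cb_person_cred_hist_length")],
       skewness, na.rm = TRUE)
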
library(dplyr)
# Filter the data into suitable range for person_age < 100, person_emp_exp < 50
df <- df %>% filter(person_age < 100, person_emp_exp < 50)
dim(df)
## [1] 44977    14

The data were filtered to a suitable range, keeping only rows with person_age < 100 and person_emp_exp < 50, which removes the unrealistic values identified earlier.

# Define the categorical columns
df$loan_status <- factor(df$loan_status, 
                         levels = c("0", "1"), 
                         labels = c("Rejected", "Approved"))

categorical <- c("person_gender", "person_education", "person_home_ownership", "loan_intent")

plot_list <- lapply(categorical, function(col) {
  ggplot(df, aes_string(x = col, fill = "loan_status")) +
    geom_bar(position = "dodge", color = "black", width = 0.7) +  # Side-by-side bars
    labs(title = paste("Bar Chart of", col, "by Loan Status"),
         x = col, y = "Count", fill = "Loan Status") +
    scale_fill_brewer(palette = "Set1") +  # Use a color palette for distinction
    theme_minimal(base_size = 14) +
    theme(
      legend.position = "top",  # Move legend to the top
      axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis text
    )
})

do.call(grid.arrange, c(plot_list, ncol = 2))

For this part of the analysis, loan_status was converted from a numerical to a categorical variable, with 0 and 1 recoded as "Rejected" and "Approved" respectively. While the univariate analysis focused on the distribution of individual features, this section further explores the distribution of four categorical variables, namely person_gender, person_education, person_home_ownership and loan_intent, by the target variable, loan status. Four bar plots are plotted and two general observations hold across all of them:

  • The loan rejection counts are higher than the loan approval counts in every category.
  • The number of rejections in each category is roughly proportional to the number of applicants in that category; the within-group proportions are checked in the sketch below.
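
The second observation can be checked by looking at within-group proportions instead of raw counts; a minimal sketch for one of the categorical variables:

# Sketch: share of rejected vs approved applications within each
# home-ownership group; similar rows mean rejections roughly track group size
round(prop.table(table(df$person_home_ownership, df$loan_status), margin = 1), 2)
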
library(ggplot2)
library(tidyr)

df_long <- df %>%
  select(credit_score, person_income, loan_amnt, loan_int_rate, 
         loan_percent_income, cb_person_cred_hist_length) %>%
  pivot_longer(cols = -credit_score, names_to = "Variable", values_to = "Value")

# Plot scatter plots with facets
ggplot(df_long, aes(x = Value, y = credit_score)) +
  geom_point(color = "blue", alpha = 0.6) +  
  facet_wrap(~ Variable, scales = "free_x") +  
  labs(title = "Scatter Plots of Credit Score vs Other Variables",
       x = "Values of Each Variable", y = "Credit Score") +
  theme_minimal()

library(corrplot)

df2 <- df #make a copy
df2$loan_int_rate[is.na(df2$loan_int_rate)] <- mean(df2$loan_int_rate, na.rm = TRUE)
#  Remove the outliers for the specified numerical columns
numerical_columns <- c("person_age", "person_income", "person_emp_exp", "loan_amnt", 
                       "loan_int_rate", "loan_percent_income", 
                       "cb_person_cred_hist_length", "credit_score")
for (col in numerical_columns) {
  Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)  # First quartile
  Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)  # Third quartile
  IQR <- Q3 - Q1                                           # Interquartile range
  lower_bound <- Q1 - 1.5 * IQR                            # Lower bound
  upper_bound <- Q3 + 1.5 * IQR                            # Upper bound
  
  # Filter the dataframe
  df2 <- df2[
    df2[[col]] >= lower_bound & df2[[col]] <= upper_bound, 
  ]
}
numericalcols <- df2[sapply(df2, is.numeric)]
correlation_matrix <- cor(numericalcols, method = "pearson")
corrplot(correlation_matrix, method = "color", type = "upper", 
         col = colorRampPalette(c("blue", "white", "red"))(100),
         tl.cex = 0.8, number.cex = 0.7, addCoef.col = "black")

For the EDA, missing values and outliers were handled on the entire dataset (without considering the train-test split) in order to compute the correlation heatmap. For the modelling part, the dataset is first split into training and testing sets; missing values are then imputed with the training-set mean and outliers are removed separately on the training and test sets to avoid data leakage.

Results and Discussion

  • Data pre-processing cleaned and prepared the dataset for analysis. Filtering was conducted on the person_age and person_emp_exp columns by excluding values over 100 and 50, respectively, and the 9 rows with missing person_income were removed.

  • EDA: The exploratory data analysis (EDA) provided important insights into both categorical and numerical features. Among the categorical features, males, renters, and individuals with Bachelor's and Associate degrees made up the largest applicant groups and therefore also accounted for most of the approved loans, with education and medical purposes being the most common loan intents. The loan_status variable indicated a significant class imbalance, with only 22.2% of loans being approved. In terms of numerical features, person_age and person_emp_exp were right-skewed, with most values falling below 50 and 10, respectively. Loan amounts were primarily concentrated under 20,000, and interest rates peaked around 10%. The credit_score exhibited a roughly normal distribution centred between 600 and 700, showing a positive correlation with the length of credit history, while the other numerical features demonstrated weaker relationships.

  • Correlation coefficients: The heatmap reveals the relationships between the numerical variables in the dataset. Notably, there is a strong positive correlation between person_age and person_emp_exp (0.90), indicating that older individuals tend to have more work experience. Similarly, loan_percent_income and loan_amnt exhibit a moderately positive correlation (0.65), which is expected since loan_percent_income expresses the loan amount as a share of annual income. loan_int_rate shows a moderate positive correlation with loan_status (0.32), indicating that higher interest rates are associated with loan approval in this dataset, possibly because higher-risk borrowers are charged higher rates. In contrast, loan_percent_income and person_income are negatively correlated (-0.38), implying that individuals with higher incomes tend to have lower loan-to-income ratios. Other correlations, such as those involving credit_score and cb_person_cred_hist_length, show relatively weak associations with the other variables; the correlations with credit_score are ranked in the sketch below. Overall, the heatmap highlights key relationships that can guide further analysis or modelling.
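
The correlations with credit_score mentioned above can be ranked directly from the matrix computed for the heatmap (a quick sketch):

# Sketch: correlations of every numerical feature with credit_score,
# sorted from strongest positive to strongest negative
round(sort(correlation_matrix["credit_score", ], decreasing = TRUE), 2)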

# Define the levels and labels for the factor
education_levels <- c("High School", "Associate", "Bachelor", "Master", "Doctorate")
education_labels <- c(1, 2, 3, 4, 5)

df$person_education <- factor(df$person_education, 
                                      levels = education_levels, 
                                      labels = education_labels)
df$person_education <- as.numeric(df$person_education)

The person_education column was encoded as an ordinal numeric variable, mapping the education levels from High School (1) to Doctorate (5).

Classification Prediction on Loan Status

library(caret) #For stratified k-fold cross validation
library(MLmetrics)
library(randomForest)
library(xgboost)

# Classification prediction on Loan Status
# Split data into training and testing
set.seed(123) 
train_index <- createDataPartition(df$loan_status, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Calculate the mean of loan_int_rate using the training data only
mean_loan <- mean(train_data$loan_int_rate, na.rm = TRUE)

train_data$loan_int_rate[is.na(train_data$loan_int_rate)] <- mean_loan
test_data$loan_int_rate[is.na(test_data$loan_int_rate)] <- mean_loan

#Remove Outliers
numerical_columns <- c("person_age", "person_income", "person_emp_exp", "loan_amnt", 
                       "loan_int_rate", "loan_percent_income", 
                       "cb_person_cred_hist_length", "credit_score")

for (col in numerical_columns) {
  Q1 <- quantile(train_data[[col]], 0.25, na.rm = TRUE)  # First quartile
  Q3 <- quantile(train_data[[col]], 0.75, na.rm = TRUE)  # Third quartile
  IQR <- Q3 - Q1                                        # Interquartile range
  lower_bound <- Q1 - 1.5 * IQR                         # Lower bound
  upper_bound <- Q3 + 1.5 * IQR                         # Upper bound
  
  train_data <- train_data[
    train_data[[col]] >= lower_bound & train_data[[col]] <= upper_bound, 
  ]
  
  test_data <- test_data[
    test_data[[col]] >= lower_bound & test_data[[col]] <= upper_bound, 
  ]
}

dim(train_data)
## [1] 28772    14
dim(test_data)
## [1] 7228   14
# StandardScaler()
numerical_cols <- c("person_age", "person_education", "person_income", "person_emp_exp", 
                    "loan_amnt", "loan_int_rate", "loan_percent_income", 
                    "credit_score", "cb_person_cred_hist_length")
preprocess_scaler <- preProcess(train_data[, numerical_cols], method = c("center", "scale"))
train_data[, numerical_cols] <- predict(preprocess_scaler, train_data[, numerical_cols])
test_data[, numerical_cols] <- predict(preprocess_scaler, test_data[, numerical_cols])

#One-Hot-Encoding
categorical_cols <- c("person_gender", "person_home_ownership", "loan_intent", 
                      "previous_loan_defaults_on_file")
train_encoded <- dummyVars(" ~ .", data = train_data[, categorical_cols]) %>%
  predict(newdata = train_data) %>%
  as.data.frame()
test_encoded <- dummyVars(" ~ .", data = test_data[, categorical_cols]) %>%
  predict(newdata = test_data) %>%
  as.data.frame()

train_data <- bind_cols(train_data, train_encoded) %>%
  select(-one_of(categorical_cols))  # Remove original categorical columns
test_data <- bind_cols(test_data, test_encoded) %>%
  select(-one_of(categorical_cols))  # Remove original categorical columns

dim(train_data)
## [1] 28772    24
dim(test_data)
## [1] 7228   24
# Set up stratified k-fold cross-validation with k-fold = 5
# When trainControl is used with classification tasks, the caret package automatically ensures that each fold maintains the class distribution of the dataset.
train_control <- trainControl(
  method = "cv",
  number = 5, 
  classProbs = TRUE,           # For classification Tasks
  summaryFunction = multiClassSummary,
  savePredictions = TRUE
)

# Define metric storage
results <- data.frame(Model = character(), Accuracy = numeric(),
                      Precision = numeric(), Recall = numeric(), F1 = numeric())

# Logistic Regression
logistic_model <- train(
  loan_status ~ ., data = train_data, method = "glm",
  family = "binomial", trControl = train_control
)
logistic_perf <- logistic_model$results
results <- rbind(results, data.frame(
  Model = "Logistic Regression",
  Accuracy = mean(logistic_perf$Accuracy),  
  Precision = mean(logistic_perf$Precision),  
  Recall = mean(logistic_perf$Recall),
  F1 = mean(logistic_perf$F1) 
))

# Random Forest
rf_model <- train(
  loan_status ~ ., data = train_data, method = "rf",
  trControl = train_control,
  tuneGrid = expand.grid(mtry = floor(sqrt(ncol(train_data)))),
  ntree = 100
)
rf_perf <- rf_model$results
results <- rbind(results, data.frame(
  Model = "Random Forest",
  Accuracy = mean(rf_perf$Accuracy), 
  Precision = mean(rf_perf$Precision),
  Recall = mean(rf_perf$Recall),
  F1 = mean(rf_perf$F1)
))

# XGBoost
tune_grid <- expand.grid(
  nrounds = c(50, 100),  # Fewer boosting rounds
  max_depth = 3,         # Shallower trees
  eta = 0.3,             # Learning rate
  gamma = 0, 
  colsample_bytree = 0.8, # Subsample features
  min_child_weight = 1, 
  subsample = 0.8
) # nrounds is kept small (50-100) because the default tuning grid uses far more boosting rounds, which would take a long time to run

xgb_model <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "xgbTree", 
  trControl = train_control,
  tuneGrid = tune_grid
)
## [18:10:16] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:16] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:18] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:18] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:19] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:19] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:21] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:21] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:22] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:22] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
xgb_perf <- xgb_model$results
results <- rbind(results, data.frame(
  Model = "XGBoost",
  Accuracy = mean(xgb_perf$Accuracy), 
  Precision = mean(xgb_perf$Precision), 
  Recall = mean(xgb_perf$Recall), 
  F1 = mean(xgb_perf$F1)
))

# SVM
importantfea <- varImp(rf_model)$importance %>% arrange(desc(Overall))
selectedfea <- rownames(importantfea)[1:10]  # Top 10 features

svm_model <- train(
  loan_status ~ ., 
  data = train_data[, c(selectedfea, "loan_status")], 
  method = "svmLinear", 
  trControl = train_control
) # Use a Linear Kernel for Large Dataset: faster as it avoids computing complex transformations
svm_perf <- svm_model$results
results <- rbind(results, data.frame(
  Model = "SVM",
  Accuracy = mean(svm_perf$Accuracy), 
  Precision = mean(svm_perf$Precision), 
  Recall = mean(svm_perf$Recall),
  F1 = mean(svm_perf$F1)
))

# KNN
knn_model <- train(
  loan_status ~ ., data = train_data, method = "knn",
  trControl = train_control,
  tuneGrid = data.frame(k = c(3, 5, 7))  # candidate neighbourhood sizes
)
knn_perf <- knn_model$results
results <- rbind(results, data.frame(
  Model = "KNN",
  Accuracy = mean(knn_perf$Accuracy),
  Precision = mean(knn_perf$Precision),
  Recall = mean(knn_perf$Recall),
  F1 = mean(knn_perf$F1)
))

# Display results
results %>% arrange(desc(Accuracy))
##                 Model  Accuracy Precision    Recall        F1
## 1             XGBoost 0.9236238 0.9366766 0.9698139 0.9529451
## 2       Random Forest 0.9235369 0.9317663 0.9755459 0.9531524
## 3 Logistic Regression 0.8956973 0.9277359 0.9426358 0.9351091
## 4                 SVM 0.8898927 0.9236442 0.9395840 0.9315401
## 5                 KNN 0.8860003 0.9207725 0.9377097 0.9291636

The dataset for classifying loan_status was divided into 80% for training and 20% for testing. Mean imputation of the loan_int_rate column and outlier removal were conducted after the train-test split, using statistics computed on the training set only. The numerical columns were then centred and scaled with caret's preProcess (a StandardScaler equivalent), and the categorical columns underwent one-hot encoding. To address the imbalance in the loan status classes, stratified 5-fold cross-validation was performed on the training data to maintain a representative class distribution in each fold; this stratification is verified in the sketch below. Several models, including Logistic Regression, Random Forest, SVM, KNN, and XGBoost, were assessed based on accuracy, precision, recall and F1-score. XGBoost stood out as the top-performing model, achieving the highest cross-validated accuracy together with strong precision, recall and F1-score.
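
The stratification can be verified by building the folds explicitly and inspecting the class balance in each one. A minimal sketch using caret's createFolds (separate from the model training above):

# Sketch: class proportions within each of the 5 folds; the columns should be
# close to the overall Rejected/Approved split of the training data
set.seed(123)
folds <- createFolds(train_data$loan_status, k = 5)
sapply(folds, function(idx) round(prop.table(table(train_data$loan_status[idx])), 3))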

# Final Prediction
final_model <- train(
  loan_status ~ ., data = train_data, method = "xgbTree",
  trControl = trainControl(method = "none"),
  tuneGrid = expand.grid(
    nrounds = 100,  
    max_depth = 3,         
    eta = 0.3,            
    gamma = 0, 
    colsample_bytree = 0.8,
    min_child_weight = 1, 
    subsample = 0.8
  )
)

test_predictions <- predict(final_model, newdata = test_data)

# Precision = Pos Pred Value, Recall = Sensitivity
confusionMatrix(test_predictions, test_data$loan_status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Rejected Approved
##   Rejected     5638      326
##   Approved      153     1111
##                                           
##                Accuracy : 0.9337          
##                  95% CI : (0.9277, 0.9394)
##     No Information Rate : 0.8012          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7821          
##                                           
##  Mcnemar's Test P-Value : 3.876e-15       
##                                           
##             Sensitivity : 0.9736          
##             Specificity : 0.7731          
##          Pos Pred Value : 0.9453          
##          Neg Pred Value : 0.8790          
##              Prevalence : 0.8012          
##          Detection Rate : 0.7800          
##    Detection Prevalence : 0.8251          
##       Balanced Accuracy : 0.8734          
##                                           
##        'Positive' Class : Rejected        
## 

The XGBoost model was then applied to the test data for the final predictions. Overall, XGBoost classifies most of the rejected and approved instances correctly, with only a small number of false positives and false negatives. In the loan approval context, a false positive means that an unqualified applicant's loan request is approved, so the lender may face a financial loss if the borrower is high risk and the loan is defaulted. On the other hand, a false negative means the lender loses a potential customer because a loan that should have been approved is rejected.
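
Since caret treated "Rejected" as the positive class above, the same confusion matrix can be recomputed with "Approved" as the positive class so that sensitivity and precision refer to approved loans; a short sketch:

# Sketch: identical predictions, but metrics reported from the
# perspective of the "Approved" class
confusionMatrix(test_predictions, test_data$loan_status, positive = "Approved")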

# XGBoost provide the tools to visualize the feature importance of the trained model
xgb_finalmodel <- final_model$finalModel  # Extract the underlying XGBoost model

importance <- xgb.importance(model = xgb_finalmodel)

xgb.plot.importance(importance_matrix = importance)  # Show top 10 features

Furthermore, an analysis of feature importance indicated that significant predictors included previous_loan_defaults_on_file, loan_int_rate and person_income, among others.

library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(ggplot2)

# Compute ROC Curve
pred_probs <- predict(xgb_model, newdata = test_data, type = "prob")
roc_curve <- roc(test_data$loan_status, pred_probs[, 2])
## Setting levels: control = Rejected, case = Approved
## Setting direction: controls < cases
ggplot() +
  geom_line(aes(x = 1 - roc_curve$specificities, y = roc_curve$sensitivities), color = "blue") +
  geom_abline(linetype = "dashed", color = "gray") +
  ggtitle(paste("ROC Curve (AUC =", round(auc(roc_curve), 2), ")")) +
  xlab("1 - Specificity (False Positive Rate)") +  # Corrected
  ylab("Sensitivity (True Positive Rate)") +
  theme_minimal()

The Receiver Operating Characteristic (ROC) curve is a graphical representation of sensitivity (true positive rate) against 1 - specificity (false positive rate); in other words, it shows the trade-off between sensitivity and specificity.

  • The curve above shows that XGBoost is effective in predicting loan approval status: an AUC of 0.98 indicates excellent discriminative capability, since the closer the value is to 1, the more effective the model.
  • The shape of the curve also shows that XGBoost maintains a good balance between high sensitivity and a low false positive rate. This means the model detects most applicants who should be rejected while rarely approving unqualified applicants by mistake, which is important in the financial sector; a concrete operating threshold is sketched below.
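
A working probability cut-off can also be read off the curve, for example the threshold that maximises Youden's J statistic; a sketch using pROC's coords():

# Sketch: threshold maximising sensitivity + specificity - 1, with the
# sensitivity and specificity achieved at that cut-off
coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"),
       best.method = "youden")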

Regression Prediction on Credit Score

# Regression prediction on Credit Score
set.seed(124) 
trainindex <- createDataPartition(df$credit_score, p = 0.8, list = FALSE)
traindata <- df[trainindex, ]
testdata <- df[-trainindex, ]
meanloan <- mean(traindata$loan_int_rate, na.rm = TRUE)

traindata$loan_int_rate[is.na(traindata$loan_int_rate)] <- meanloan
testdata$loan_int_rate[is.na(testdata$loan_int_rate)] <- meanloan

# Remove Outliers
numerical_columns <- c("person_age", "person_income", "person_emp_exp", "loan_amnt", 
                       "loan_int_rate", "loan_percent_income", 
                       "cb_person_cred_hist_length", "credit_score")

for (col in numerical_columns) {
  Q1 <- quantile(traindata[[col]], 0.25, na.rm = TRUE)  # First quartile
  Q3 <- quantile(traindata[[col]], 0.75, na.rm = TRUE)  # Third quartile
  IQR <- Q3 - Q1                                        # Interquartile range
  lower_bound <- Q1 - 1.5 * IQR                         # Lower bound
  upper_bound <- Q3 + 1.5 * IQR                         # Upper bound
  
  traindata <- traindata[
    traindata[[col]] >= lower_bound & traindata[[col]] <= upper_bound, 
  ]
  
  testdata <- testdata[
    testdata[[col]] >= lower_bound & testdata[[col]] <= upper_bound, 
  ]
}
dim(traindata)
## [1] 28859    14
dim(testdata)
## [1] 7179   14
# StandardScaler()
numericalcols <- c("person_age", "person_education", "person_income", 
                   "person_emp_exp", "loan_amnt", "loan_percent_income", 
                   "loan_int_rate", "cb_person_cred_hist_length", "credit_score")
preprocessscaler <- preProcess(traindata[, numericalcols], method = c("center", "scale"))
traindata[, numericalcols] <- predict(preprocessscaler, traindata[, numericalcols])
testdata[, numericalcols] <- predict(preprocessscaler, testdata[, numericalcols])

#One-Hot-Encoding
categoricalcols <- c("person_gender", "person_home_ownership", "loan_intent", 
                      "previous_loan_defaults_on_file", "loan_status")
trainencoded <- dummyVars(" ~ .", data = traindata[, categoricalcols]) %>%
  predict(newdata = traindata) %>%
  as.data.frame()
testencoded <- dummyVars(" ~ .", data = testdata[, categoricalcols]) %>%
  predict(newdata = testdata) %>%
  as.data.frame()

traindata <- bind_cols(traindata, trainencoded) %>%
  select(-one_of(categoricalcols))  # Remove original categorical columns
testdata <- bind_cols(testdata, testencoded) %>%
  select(-one_of(categoricalcols))  # Remove original categorical columns

dim(traindata)
## [1] 28859    25
dim(testdata)
## [1] 7179   25
# K-fold Cross Validation
train_control <- trainControl(method = "cv", number = 5)

# Train models using different methods
lm_model <- train(credit_score ~ ., data = traindata, method = "lm", trControl = train_control)
dt_model <- train(credit_score ~ ., data = traindata, method = "rpart", trControl = train_control)
rf_model <- train(credit_score ~ ., data = traindata, method = "rf",
                  trControl = train_control,
                  tuneGrid = expand.grid(mtry = c(2, 3, 4)),
                  ntree = 50)
gbm_model <- train(credit_score ~ ., data = traindata, method = "gbm",
                   trControl = train_control,
                   tuneGrid = expand.grid(n.trees = c(50, 100),
                                          interaction.depth = c(1, 2),
                                          shrinkage = 0.1,
                                          n.minobsinnode = 10),
                   verbose = FALSE)

importantfea <- varImp(rf_model)$importance %>% arrange(desc(Overall))
selectedfea <- rownames(importantfea)[1:10]  # Top 10 features

svm_model <- train(
  credit_score ~ ., 
  data = traindata[, c(selectedfea, "credit_score")], 
  method = "svmLinear", 
  trControl = train_control
)

# Function to calculate metrics
extract_metrics <- function(model, model_name) {
  results <- model$results
  data.frame(
    Model = model_name,
    RMSE = mean(results$RMSE),
    MAE = mean(results$MAE),
    R2 = mean(results$Rsquared)
  )
}

# Collect metrics for each model
# Combine metrics for all models
regressionperformance <- rbind(
  extract_metrics(lm_model, "Linear Regression"),
  extract_metrics(dt_model, "Decision Tree"),
  extract_metrics(rf_model, "Random Forest"),
  extract_metrics(gbm_model, "Gradient Boosting"),
  extract_metrics(svm_model, "SVM")
)
print(regressionperformance)
##               Model      RMSE       MAE         R2
## 1 Linear Regression 0.9483485 0.7714067 0.10077404
## 2     Decision Tree 0.9796231 0.7962962 0.04702660
## 3     Random Forest 0.9603864 0.7806850 0.08037284
## 4 Gradient Boosting 0.9511344 0.7740556 0.09780567
## 5               SVM 0.9628069 0.7679873 0.09136300
# Final Prediction on the Best performance model
final_model <- train(
  credit_score ~ ., 
  data = traindata, 
  method = "lm",
  trControl = trainControl(method = "none")
)

# Make predictions on the test set
test_predictions <- predict(final_model, newdata = testdata)
y_test <- testdata$credit_score
rmse <- sqrt(mean((y_test - test_predictions)^2))
mae <- mean(abs(y_test - test_predictions))
r2 <- 1 - sum((y_test - test_predictions)^2) / sum((y_test - mean(y_test))^2)

# Display performance metrics
final_performance <- data.frame(
  RMSE = rmse,
  MAE = mae,
  R2 = r2
)
print(final_performance)
##        RMSE       MAE       R2
## 1 0.9386298 0.7643898 0.101789

For the regression task, mean imputation of the loan_int_rate column and outlier removal were again conducted after the train-test split, followed by standardization of the numerical columns and one-hot encoding of the categorical columns. Overall, the linear regression model exhibited the best performance, with the lowest RMSE (about 0.95 in cross-validation and 0.94 on the test set) and the highest R-squared (about 0.10). However, an R-squared of 0.10 indicates poor predictive power for credit score. This suggests that key predictive features are missing, so the underlying patterns are not well captured by the models; this is consistent with the correlation heatmap, where the numerical features show only weak correlations with credit score. Features such as payment history, number of past defaults and credit utilization might be more useful for predicting credit score.
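
The weak fit is also apparent when plotting predicted against actual credit scores on the test set; a quick diagnostic sketch:

# Sketch: predicted vs actual (standardized) credit scores for the final
# linear model; a tight cloud around the dashed identity line would indicate
# a good fit, whereas a diffuse cloud reflects the low R-squared
ggplot(data.frame(actual = y_test, predicted = test_predictions),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Predicted vs Actual Credit Score (test set)",
       x = "Actual credit score (standardized)",
       y = "Predicted credit score (standardized)") +
  theme_minimal()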

Conclusion

Overall, each part of this project plays a significant role in the final outcome, from data pre-processing through exploratory data analysis and visualization to selecting the best-performing model for loan approval prediction. All objectives have been achieved in this study:

  • Through exploratory data analysis, a better understanding of the data distribution was obtained and the presence of outliers and extreme values was readily detected.
  • Different loan approval models were built; XGBoost yielded the best performance in terms of accuracy and is considered a well-balanced model.
  • Different credit score prediction models were built, where the linear regression model showed the best performance among all models, albeit with limited predictive power.

Additionally, this study has limitations. Despite the linear regression model demonstrating the best performance among the evaluated models, it obtained an R-squared of only about 0.10, indicating poor predictive power for credit score. It is recommended to include additional features that are more relevant to credit score, such as payment history and credit utilization.