Group Members

  1. HANG XIN YUAN (24204914)
  2. HO WEI AN (U2104506)
  3. TENG KAI QIAN (25087376)
  4. MICHELLE LEE POH YEE(25081023)
  5. LIM JIA WEI (U2004530)
  6. SARAH SAHIRA BINTI ZAMRI (25089582)

Overall Introduction

The “Lending Club” dataset on Kaggle is a large-scale real-world financial dataset containing loan information from a peer-to-peer lending platform. It includes accepted and rejected loan applications from 2007 to 2018, with detailed features such as borrower demographics, credit history, loan purpose, interest rates, and loan status. It can be a benchmark dataset in machine learning due to its size and real-world complexity. Peer-to-peer lending platforms such as Lending Club connect borrowers with individual investors.

Using a real-world Lending Club dataset of accepted loans issued in 2017-2018, this study develops predictive models based on borrower and loan characteristics available at the time of origination to address two main objectives:

  1. Classification (Loan Status Prediction): Accurate prediction of whether a borrower will repay or default on a loan is critical for managing credit risk and protecting investor returns. This study develops classification models to predict loan repayment outcomes based on borrower and loan characteristics available at the time of origination.

  2. Regression (Interest Rate Prediction): Interest rate is the central pricing mechanism in lending. It determines the cost of a loan to the borrower and the return to the lender, and is set based on the perceived risk of the borrower. Understanding which factors drive interest rate, and being able to predict the rate that a borrower is likely to be charged, is valuable to both lenders (for consistent risk-based pricing) and to borrowers (for understanding what affects the rate they receive). This study develops regression models to predict the interest rate assigned to a borrower from their financial profile and loan characteristics.

Objectives

Classification problem

  1. Classification: How can borrower and loan characteristics be used to classify a loan into “Fully Paid” or “Charged Off” groups?

There are few classification techniques can be use which are:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

Then the performance of the classification techniques are evaluated.

Regression problem

  1. Regression: What factors most influence the interest rate assigned to a loan, and can we accurately predict the interest rate based on borrower and loan attributes?

The objective is to develop and evaluate regression models that predict int_rate (the loan’s interest rate) using borrower and loan attributes. Three regression techniques are built and compared:

  1. Multiple Linear Regression
  2. Decision Tree Regression
  3. Random Forest Regression

The performance of these techniques is evaluated using RMSE, MAE, and R-squared.

Important methodological note for Regression: Lending Club assigns each loan a credit grade and sub_grade, and the interest rate is directly determined from that grade. Including grade as a predictor of int_rate would therefore be circular and would produce an artificially perfect model. We deliberately exclude grade from the regression models and instead use the underlying borrower attributes (FICO score, debt-to-income ratio, income, term, etc.) — the same kind of variables that the grading process itself relies on. This produces a more meaningful and realistic predictive model.

Data Preparation Steps

This document prepares the raw Lending Club accepted loans dataset for downstream classification and regression modelling. It performs the following steps:

  1. Loads the full accepted_2007_to_2018Q4.csv (around 2.26 million rows, 151 columns).
  2. Selects 18 relevant columns (17 for modelling, plus issue_d used only for year filtering).
  3. Filters loan_status to only completed loans (Fully Paid or Charged Off).
  4. Filters to loans issued in 2017–2018, then drops issue_d since it is no longer needed.
  5. Cleans text fields (term, int_rate, emp_length) and removes missing values.

Load Libraries

library(dplyr)

Step 1: Load Full Dataset

The full Lending Club dataset is loaded from CSV.

loandata <- read.csv("accepted_2007_to_2018Q4.csv", stringsAsFactors = FALSE)
cat("Loaded:", nrow(loandata), "rows x", ncol(loandata), "columns\n")
## Loaded: 2260701 rows x 151 columns

Step 2: Select Useful Columns

Out of 151 columns, only 18 are needed: 17 will be used for modelling, plus issue_d which is used temporarily for year filtering and then dropped. We keep borrower demographics, loan terms, credit history indicators, and the two target-related columns (loan_status for classification, int_rate for regression).

keep_cols <- c(
  "loan_amnt", "term", "int_rate", "installment", "grade",
  "emp_length", "home_ownership", "annual_inc", "verification_status",
  "purpose", "dti", "fico_range_low", "fico_range_high",
  "open_acc", "revol_util", "total_acc", "loan_status",
  "issue_d"
)
loandata <- loandata[, keep_cols]
cat("Reduced to", ncol(loandata), "columns\n")
## Reduced to 18 columns

Step 3: Filter Loan Status

The loan_status column contains six categories: Fully Paid, Charged Off, Current, Late (16–30 days), Late (31–120 days), and In Grace Period. For classification, only loans with a definitive final outcome are useful, so we keep only “Fully Paid” and “Charged Off”. Loans that are still active or in late stages have unknown final outcomes and are removed.

loandata <- loandata %>%
  filter(loan_status %in% c("Fully Paid", "Charged Off"))
cat("After loan_status filter:", nrow(loandata), "rows\n")
## After loan_status filter: 1345310 rows

Step 4: Filter to Recent Years (2017–2018)

Lending Club operated under different lending standards across its history. Loans from 2007–2010 reflect post-financial-crisis conditions, while 2017–2018 represents the most recent and consistent period before the dataset ends. Focusing on 2017–2018 ensures that the modelling reflects contemporary lending behaviour.

The issue_d column contains the issue date in formats such as "Dec-2018" or "2018-12-01". A regular expression extracts the year reliably from either format.

loandata$issue_year <- as.numeric(sub(".*?([0-9]{4}).*", "\\1", loandata$issue_d))

loandata <- loandata %>% filter(issue_year %in% c(2017, 2018))
cat("After year filter (2017-2018):", nrow(loandata), "rows\n")
## After year filter (2017-2018): 225611 rows
# Drop helper columns
loandata <- loandata %>% select(-issue_d, -issue_year)

Step 5: Clean Text Columns

Several columns contain text that needs to be converted to numeric form for modelling.

5a. Term

The term column is stored as " 36 months" or " 60 months". We extract the numeric value.

loandata$term <- as.numeric(gsub("[^0-9]", "", loandata$term))

5b. Interest Rate

In some versions of the dataset, int_rate is stored as a character string with a percent sign (e.g. "13.99%"). In others it is already numeric. This conditional check handles both cases.

if (is.character(loandata$int_rate)) {
  loandata$int_rate <- as.numeric(gsub("%", "", loandata$int_rate))
}

5c. Employment Length

emp_length is stored as text values like "10+ years", "< 1 year", or "n/a". We convert these into numeric years, with n/a becoming NA.

parse_emp_length <- function(x) {
  x <- trimws(as.character(x))
  ifelse(x == "" | x == "n/a", NA,
  ifelse(x == "< 1 year", 0,
  ifelse(x == "10+ years", 10,
         as.numeric(gsub("[^0-9]", "", x)))))
}
loandata$emp_length <- parse_emp_length(loandata$emp_length)

Step 6: Handle Missing Values

We first inspect missing values per column, then drop rows with any remaining missing values. After all the column selection and parsing above, the only meaningful missing values should be in emp_length (where the original “n/a” became NA) and possibly a few in revol_util and dti.

cat("Missing values per column before na.omit:\n")
## Missing values per column before na.omit:
print(colSums(is.na(loandata)))
##           loan_amnt                term            int_rate         installment 
##                   0                   0                   0                   0 
##               grade          emp_length      home_ownership          annual_inc 
##                   0               17056                   0                   0 
## verification_status             purpose                 dti      fico_range_low 
##                   0                   0                 334                   0 
##     fico_range_high            open_acc          revol_util           total_acc 
##                   0                   0                 225                   0 
##         loan_status 
##                   0
before <- nrow(loandata)
loandata <- na.omit(loandata)
cat("Removed", before - nrow(loandata), "rows with missing values\n")
## Removed 17274 rows with missing values
cat("Rows remaining after cleaning:", nrow(loandata), "\n")
## Rows remaining after cleaning: 208337

Step 7: Final Cleaned Dataset

After all the cleaning steps above, the final dataset is ready for downstream modelling by the classification and regression teams.

cat("Final dataset:", nrow(loandata), "rows x", ncol(loandata), "columns\n")
## Final dataset: 208337 rows x 17 columns

Summary

The data preparation pipeline transformed the raw 2.26 million-row Lending Club file into a focused, clean modelling dataset of approximately 1 million recent loans across 17 columns (after dropping the helper issue_d column used for year filtering), ready for downstream classification and regression.

Classification Problem

Outlier Detection

Outliers are extreme values that may affect model performance.

The IQR capping is used to reduce the influence of extreme outliers while preserving the overall dataset size. This method identifies outliers based on the spread of the middle 50% of the data, where values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are considered outliers.

The Interquartile Range (IQR) method is used to detect outliers in:

  • Annual Income

  • Loan Amount

  • Debt-to-Income Ratio

  • Revolving Utilisation

#Make a copy of cleaned dataset
loandata_cleaned<-loandata

#IQR outliers detection function
iqr_outliers <- function(x){
  
  Q1 <- quantile(x,0.25)
  Q3 <- quantile(x,0.75)
  
  IQR <- IQR(x)
  
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  return(x < lower_bound | x > upper_bound)
}

#99th percentile outliers detection function
percentile_outliers <- function(x, lower_p = 0.01, upper_p = 0.99){

  lower <- quantile(x, lower_p, na.rm = TRUE)
  upper <- quantile(x, upper_p, na.rm = TRUE)

  return(x < lower | x > upper)
}

outliers_income <- iqr_outliers(loandata_cleaned$annual_inc)

outliers_loanamt <- iqr_outliers(loandata_cleaned$loan_amnt)

outliers_dtiratio <- percentile_outliers(loandata_cleaned$dti)

outliers_revolutil <- percentile_outliers(loandata_cleaned$revol_util)

print(paste0("Total count of outliers for Annual Incomes: ",sum(outliers_income)))
## [1] "Total count of outliers for Annual Incomes: 10009"
print(paste0("Total count of outliers for Loan Amount: ",sum(outliers_loanamt)))
## [1] "Total count of outliers for Loan Amount: 4767"
print(paste0("Total count of outliers for DTI ratio: ",sum(outliers_dtiratio)))
## [1] "Total count of outliers for DTI ratio: 4154"
print(paste0("Total count of outliers for Revolving Utilization: ",sum(outliers_revolutil)))
## [1] "Total count of outliers for Revolving Utilization: 4027"

Display the summary of these four variables to view their min, max, 1st quartile and 3rd quartile values.

print("Annual Incomes:")
## [1] "Annual Incomes:"
summary(loandata_cleaned$annual_inc)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##       33    50000    70000    81805    98000 10999200
print("Loan Amount:")
## [1] "Loan Amount:"
summary(loandata_cleaned$loan_amnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    7000   12000   14672   20000   40000
print("DTI ratio:")
## [1] "DTI ratio:"
summary(loandata_cleaned$dti)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   11.36   17.33   18.35   24.05  999.00
print("Revolving Utilization:")
## [1] "Revolving Utilization:"
summary(loandata_cleaned$revol_util)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.60   43.10   44.58   63.30  154.90

Visualise the outliers by plotting boxplots

par(mfrow = c(2, 2),
    mar = c(4, 4, 2, 1))

boxplot(loandata_cleaned$annual_inc,
        main = "Annual Income",
        col = "lightblue")

boxplot(loandata_cleaned$loan_amnt,
        main = "Loan Amount",
        col = "lightgreen")

boxplot(loandata_cleaned$dti,
        main = "DTI Ratio",
        col = "lightpink")

boxplot(loandata_cleaned$revol_util,
        main = "Revolving Utilization",
        col = "lightyellow")

By plotting boxplot, we can observe that there are several outliers beyond the whiskers.

IQR capping is applied to annual_inc and loan_amnt because these financial variables naturally contain highly skewed distributions and extreme values.

iqr_cap <- function(x){

  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)

  IQR_val <- IQR(x, na.rm = TRUE)

  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val

  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound

  return(x)
}
# Apply IQR capping
loandata_cleaned$annual_inc <- iqr_cap(loandata_cleaned$annual_inc)

loandata_cleaned$loan_amnt <- iqr_cap(loandata_cleaned$loan_amnt)

However, 99th percentile capping is used for variables that have a realistic and meaningful boundary such as revol_util and dti. Both of them are in ratio and capped by using 99th percentile capping to avoid excessive trimming of meaningful financial behavior.

cap_99th <- function(x, lower_p = 0.01, upper_p = 0.99){

  lower_bound <- quantile(x, lower_p, na.rm = TRUE)
  upper_bound <- quantile(x, upper_p, na.rm = TRUE)

  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound

  return(x)
}
# Apply 99th percentile capping
loandata_cleaned$dti <- cap_99th(loandata_cleaned$dti)

loandata_cleaned$revol_util <- cap_99th(loandata_cleaned$revol_util)

To check whether any outlier exist after using IQR and 99th percentile capping

outliers_income <- iqr_outliers(loandata_cleaned$annual_inc)

outliers_loanamt <- iqr_outliers(loandata_cleaned$loan_amnt)

outliers_dtiratio <- percentile_outliers(loandata_cleaned$dti)

outliers_revolutil <- percentile_outliers(loandata_cleaned$revol_util)

print(paste0("Total count of outliers for Annual Incomes: ",sum(outliers_income)))
## [1] "Total count of outliers for Annual Incomes: 0"
print(paste0("Total count of outliers for Loan Amount: ",sum(outliers_loanamt)))
## [1] "Total count of outliers for Loan Amount: 0"
print(paste0("Total count of outliers for DTI ratio: ",sum(outliers_dtiratio)))
## [1] "Total count of outliers for DTI ratio: 0"
print(paste0("Total count of outliers for Revolving Utilization: ",sum(outliers_revolutil)))
## [1] "Total count of outliers for Revolving Utilization: 0"

To view the summary of each variables after handling outliers

print("Annual Incomes:")
## [1] "Annual Incomes:"
summary(loandata_cleaned$annual_inc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      33   50000   70000   77329   98000  170000
print("Loan Amount:")
## [1] "Loan Amount:"
summary(loandata_cleaned$loan_amnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    7000   12000   14660   20000   39500
print("DTI ratio:")
## [1] "DTI ratio:"
summary(loandata_cleaned$dti)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.23   11.36   17.33   18.03   24.05   46.48
print("Revolving Utilization:")
## [1] "Revolving Utilization:"
summary(loandata_cleaned$revol_util)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.10   24.60   43.10   44.56   63.30   97.80

Visualize again the outliers by plotting boxplots.

par(mfrow = c(2, 2),
    mar = c(4, 4, 2, 1))

boxplot(loandata_cleaned$annual_inc,
        main = "Annual Income",
        col = "lightblue")

boxplot(loandata_cleaned$loan_amnt,
        main = "Loan Amount",
        col = "lightgreen")

boxplot(loandata_cleaned$dti,
        main = "DTI Ratio",
        col = "lightpink")

boxplot(loandata_cleaned$revol_util,
        main = "Revolving Utilization",
        col = "lightyellow")

We can still observe some points that spread beyond the whisker in DTI Ratio plot because boxplot is based on the IQR method for outlier detection, whereas the DTI values were treated using 99th percentile capping. Therefore, although extreme values have been limited, some observations may still appear as statistical outliers under the IQR rule.

Data Transformation

The categorical target variable loan_status was transformed into a binary numeric format using label encoding, where “Charged Off” is encoded as 1 and “Fully Paid” is encoded as 0 to enable machine learning model training.

#Label encoding
loandata_trans <- loandata_cleaned
loandata_trans$loan_status <- ifelse(loandata_cleaned$loan_status == "Charged Off", 1, 0)

The categorical variables are converted into factor format because classification models require categorical variables to be factors.

loandata_trans <- loandata_trans %>%
  mutate(across(c(term,
                  grade,
                  emp_length,
                  home_ownership,
                  verification_status,
                  purpose,
                  loan_status),
                as.factor))
head(loandata_trans)
##   loan_amnt term int_rate installment grade emp_length home_ownership
## 1      3000   36     7.34       93.10     A          9           RENT
## 2      5000   36    11.98      166.03     B         10            OWN
## 3      7000   36    11.98      232.44     B          0       MORTGAGE
## 4     30000   36    21.85     1143.39     D         10            OWN
## 5     21000   60    20.39      560.94     D         10            OWN
## 6     39500   60     9.43      838.71     B         10       MORTGAGE
##   annual_inc verification_status            purpose   dti fico_range_low
## 1      52000     Source Verified     major_purchase  1.23            760
## 2      55000        Not Verified              other 14.18            675
## 3      40000            Verified   home_improvement 20.25            695
## 4      57000            Verified debt_consolidation 27.58            680
## 5      85000     Source Verified              house 15.76            665
## 6     150000        Not Verified debt_consolidation  4.09            730
##   fico_range_high open_acc revol_util total_acc loan_status
## 1             764        7        0.5        30           0
## 2             679       14       33.9        24           0
## 3             699       13       36.0        29           0
## 4             684       11       53.2        26           0
## 5             669       15       34.2        27           0
## 6             734        6        9.4         9           0

Exploratory Data Analysis (EDA)

To gain comprehensive understanding of the borrower dataset, including their distributional properties, variability and structural characteristics, exploratory data analysis was performed. This step is important in data preprocessing as it helps to identify the underlying trends to provide insights which can guide feature engineering and model development.

Numerical Variable Distribution Analysis

The distribution of numerical variables, such as annual income, loan amount, DTI ratio and Revolving Credit Utilization was visualized using histograms below. The income distribution shows a right-skewed distribution which suggests that most applicants earning between roughly 40,000 and 80,000 annually. The wide range of income levels implies that the borrowers come from different economic backgrounds which may influence their ability to pay loans.

Similarly, the distribution of loan amount shows a right-skewed distribution where smaller loan amounts are more common, while larger loans are less frequent. The peak frequency lies in the lower-to-mid loan amounts, around 10,000 to 20,000.

The DTI values form a bell-shaped curve centered around 15–20%. Few applicants have DTI above 35%, indicating moderate debt levels relative to income.

The Revolving Credit Utilization is fairly uniform distribution peaking near 40–50%. This can be explained that many borrowers use between 20–70% of their available revolving credit.

par(mfrow = c(2,2))

hist(loandata_trans$annual_inc, main = "Income", col="lightblue")
hist(loandata_trans$loan_amnt, main = "Loan Amount", col="lightgreen")
hist(loandata_trans$dti, main = "DTI", col="lightpink")
hist(loandata_trans$revol_util, main = "Revolving Utilization", col="lightyellow")

Feature Engineering

LoanToIncomeRatio is calculated to compare the loan size to income, which can capture the financial burden of borrowers.

Loan-to-Income ratio is derived as a new variable to improve predictive capability. The feature is defined as:

\[\text{Loan-to-Income Ratio} = \frac{\text{Loan Amount}}{\text{Annual Income}}\]

Compared to the absolute loan amount alone, the ratio of loan amount to income provides a relative measure of loan burden in relation to affordability. A higher loan-to-income ratio suggests that the borrower is taking more debt relative to their income, which may increase the likelihood of default.

loandata_trans$LoanToIncomeRatio <-
  loandata_trans$loan_amnt /
  loandata_trans$annual_inc

head(loandata_trans)
##   loan_amnt term int_rate installment grade emp_length home_ownership
## 1      3000   36     7.34       93.10     A          9           RENT
## 2      5000   36    11.98      166.03     B         10            OWN
## 3      7000   36    11.98      232.44     B          0       MORTGAGE
## 4     30000   36    21.85     1143.39     D         10            OWN
## 5     21000   60    20.39      560.94     D         10            OWN
## 6     39500   60     9.43      838.71     B         10       MORTGAGE
##   annual_inc verification_status            purpose   dti fico_range_low
## 1      52000     Source Verified     major_purchase  1.23            760
## 2      55000        Not Verified              other 14.18            675
## 3      40000            Verified   home_improvement 20.25            695
## 4      57000            Verified debt_consolidation 27.58            680
## 5      85000     Source Verified              house 15.76            665
## 6     150000        Not Verified debt_consolidation  4.09            730
##   fico_range_high open_acc revol_util total_acc loan_status LoanToIncomeRatio
## 1             764        7        0.5        30           0        0.05769231
## 2             679       14       33.9        24           0        0.09090909
## 3             699       13       36.0        29           0        0.17500000
## 4             684       11       53.2        26           0        0.52631579
## 5             669       15       34.2        27           0        0.24705882
## 6             734        6        9.4         9           0        0.26333333
  • fico_range_low represents the lower bound of the borrower’s estimated credit score

  • fico_range_high represents the upper bound of the borrower’s estimated credit score

  • fico_avg is calculated by computing the average FICO score to represent overall creditworthiness and the FICO range width to capture variability in credit assessment. This transformation reduces redundancy and improves interpretability for predictive modelling.

\[\text{Average FICO Score} = \frac{\text{FICO Range Low + FICO Range High}}{\text{2}}\]

loandata_trans$fico_avg <- (loandata_trans$fico_range_low +
                              loandata_trans$fico_range_high) / 2

head(loandata_trans)
##   loan_amnt term int_rate installment grade emp_length home_ownership
## 1      3000   36     7.34       93.10     A          9           RENT
## 2      5000   36    11.98      166.03     B         10            OWN
## 3      7000   36    11.98      232.44     B          0       MORTGAGE
## 4     30000   36    21.85     1143.39     D         10            OWN
## 5     21000   60    20.39      560.94     D         10            OWN
## 6     39500   60     9.43      838.71     B         10       MORTGAGE
##   annual_inc verification_status            purpose   dti fico_range_low
## 1      52000     Source Verified     major_purchase  1.23            760
## 2      55000        Not Verified              other 14.18            675
## 3      40000            Verified   home_improvement 20.25            695
## 4      57000            Verified debt_consolidation 27.58            680
## 5      85000     Source Verified              house 15.76            665
## 6     150000        Not Verified debt_consolidation  4.09            730
##   fico_range_high open_acc revol_util total_acc loan_status LoanToIncomeRatio
## 1             764        7        0.5        30           0        0.05769231
## 2             679       14       33.9        24           0        0.09090909
## 3             699       13       36.0        29           0        0.17500000
## 4             684       11       53.2        26           0        0.52631579
## 5             669       15       34.2        27           0        0.24705882
## 6             734        6        9.4         9           0        0.26333333
##   fico_avg
## 1      762
## 2      677
## 3      697
## 4      682
## 5      667
## 6      732

Since fico_avg is computed, fico_range_low and fico_range_high are removed to reduce redundancy

loandata_trans<-loandata_trans %>%
  select(-c(fico_range_low,fico_range_high))

head(loandata_trans)
##   loan_amnt term int_rate installment grade emp_length home_ownership
## 1      3000   36     7.34       93.10     A          9           RENT
## 2      5000   36    11.98      166.03     B         10            OWN
## 3      7000   36    11.98      232.44     B          0       MORTGAGE
## 4     30000   36    21.85     1143.39     D         10            OWN
## 5     21000   60    20.39      560.94     D         10            OWN
## 6     39500   60     9.43      838.71     B         10       MORTGAGE
##   annual_inc verification_status            purpose   dti open_acc revol_util
## 1      52000     Source Verified     major_purchase  1.23        7        0.5
## 2      55000        Not Verified              other 14.18       14       33.9
## 3      40000            Verified   home_improvement 20.25       13       36.0
## 4      57000            Verified debt_consolidation 27.58       11       53.2
## 5      85000     Source Verified              house 15.76       15       34.2
## 6     150000        Not Verified debt_consolidation  4.09        6        9.4
##   total_acc loan_status LoanToIncomeRatio fico_avg
## 1        30           0        0.05769231      762
## 2        24           0        0.09090909      677
## 3        29           0        0.17500000      697
## 4        26           0        0.52631579      682
## 5        27           0        0.24705882      667
## 6         9           0        0.26333333      732

Class Distribution

The class balance of “Charged Off” (1) and “Fully Paid” (0) in loan_status (target variable) is checked.

The target variable loan_status was examined to understand the class balance in the dataset. The results show highly imbalanced distribution where approximately 79.4% of the loan are fully paid (0) and 20.6% are charged off (1).

table(loandata_trans$loan_status)
## 
##      0      1 
## 165466  42871
prop.table(table(loandata_trans$loan_status))
## 
##         0         1 
## 0.7942228 0.2057772
library(ggplot2)
ggplot(loandata_trans,
       aes(x = loan_status,
           fill = loan_status)) +

  geom_bar(color = "black") +

  geom_text(stat = "count",
            aes(label = after_stat(count)),
            vjust = -0.3,
            size = 4,
            fontface = "bold") +

  labs(title = "Class Distribution of Loan Status",
       x = "Loan Status",
       y = "Frequency") +

  scale_x_discrete(labels = c("Fully Paid",
                              "Charged Off")) +

  scale_fill_manual(values = c("lightblue", "tomato")) +

  theme_classic() +

  theme(
    legend.position = "none",
    axis.line = element_line(color = "black"),
    axis.ticks = element_line(color = "black"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

The bar plot shows a dominant off majority class (fully paid) compared to minority class (charged off). The imbalance is common issue in credit risk modelling which can lead to biased model performance where the model is more likely to predict the majority class. Hence, class imbalance need to be addressed so that the model can performs well in predicting default cases.

Data Modelling

Training and Testing Data Splitting

The code uses the R package caret to split the loan dataset into training and testing datasets. To make the process to be reproducible, set.seed(123) is applied.

The dataset is divided into training (70%) and testing (30%) datasets. Since the target variable is imbalanced,we used a stratified sampling to maintain the original proportion of charged off and fully paid in both training and test set. This ensures both set represent same class distribution.

library(caret)
set.seed(123)

sample_index <- createDataPartition(loandata_trans$loan_status,
                    p=0.7,
                    list=FALSE)   

train_data <- loandata_trans[sample_index, ]

test_data <- loandata_trans[-sample_index, ]

prop.table(table(loandata_trans$loan_status))
## 
##         0         1 
## 0.7942228 0.2057772
prop.table(table(train_data$loan_status))
## 
##         0         1 
## 0.7942223 0.2057777
prop.table(table(test_data$loan_status))
## 
##        0        1 
## 0.794224 0.205776
cat("Number of row in training data:", nrow(train_data), "\n")
## Number of row in training data: 145837
cat("Number of row in test data:", nrow(test_data), "\n")
## Number of row in test data: 62500

After splitting the data, the class proportions of target variable loan_status in both training and testing data are observed. There are approximately 79% of fully paid and 21% of charged off in both training and testing set, indicating that the stratified sampling was successfully performed. This balanced preservation of class distribution helps ensure the model developed will be more reliable.

After the dataset is split into training and testing sets, the categorical variables in cat_cols are converted into factors again to ensure the consistency between both datasets. The function levels = levels(train_data[[col]]) is applied to testing data in order to force the corresponding columns in testing data to use the same factor levels as the training data. This is to make sure both datasets have identical category levels and prevent predictions errors caused by mismatched or unseen categories in the testing dataset.

cat_cols <- c("term",
              "grade",
              "emp_length",
              "home_ownership",
              "verification_status",
              "purpose",
              "loan_status")

for (col in cat_cols) {
  train_data[[col]] <- as.factor(train_data[[col]])
  test_data[[col]] <- factor(test_data[[col]],
                             levels = levels(train_data[[col]]))
}

Feature Scaling

Feature scaling is especially useful for Logistic Regression and distance-based algorithms. Feature scaling standardizes numerical variables so they are on similar scales.

First, a list of numeric columns is stored in numeric_cols. The scale() function subtracts the mean from each value and divides by the standard deviation, resulting the variables to have a mean close to 0 and standard deviation close to 1.

Next, the same scaling parameters obtained from the training data are applied to the testing dataset using center and scale attributes from train_scaled. This is to ensure that the testing data is transformed using the statistics same as the training dataset to avoid data leakage and enhance the fair model evaluation.

numeric_cols <- c("loan_amnt",
                  "int_rate",
                  "installment",
                  "annual_inc",
                  "dti",
                  "open_acc",
                  "revol_util",
                  "total_acc",
                  "LoanToIncomeRatio",
                  "fico_avg")

train_scaled <- scale(train_data[numeric_cols])

train_data[numeric_cols] <- train_scaled

test_data[numeric_cols] <- scale(test_data[numeric_cols],
                                 center = attr(train_scaled, "scaled:center"),
                                 scale = attr(train_scaled, "scaled:scale"))
summary(train_data)
##    loan_amnt       term           int_rate        installment      grade    
##  Min.   :-1.4099   36:109221   Min.   :-1.5134   Min.   :-1.4249   A:25619  
##  1st Qu.:-0.7892   60: 36616   1st Qu.:-0.6894   1st Qu.:-0.7379   B:40404  
##  Median :-0.2719               Median :-0.1882   Median :-0.2786   C:45053  
##  Mean   : 0.0000               Mean   : 0.0000   Mean   : 0.0000   D:21881  
##  3rd Qu.: 0.5558               3rd Qu.: 0.3968   3rd Qu.: 0.5343   E: 8509  
##  Max.   : 2.5732               Max.   : 3.0667   Max.   : 4.3600   F: 2842  
##                                                                    G: 1529  
##    emp_length     home_ownership    annual_inc           verification_status
##  10     :52153   ANY     :  135   Min.   :-2.0386   Not Verified   :49902   
##  2      :14410   MORTGAGE:73615   1st Qu.:-0.7191   Source Verified:60258   
##  0      :13891   NONE    :    2   Median :-0.1910   Verified       :35677   
##  3      :12878   OWN     :16982   Mean   : 0.0000                           
##  1      :10418   RENT    :55103   3rd Qu.: 0.5352                           
##  5      : 9767                    Max.   : 2.4497                           
##  (Other):32320                                                              
##                purpose           dti              open_acc      
##  debt_consolidation:80153   Min.   :-1.85527   Min.   :-1.8386  
##  credit_card       :28292   1st Qu.:-0.73748   1st Qu.:-0.6375  
##  home_improvement  :11706   Median :-0.07651   Median :-0.1227  
##  other             :11579   Mean   : 0.00000   Mean   : 0.0000  
##  major_purchase    : 4003   3rd Qu.: 0.66611   3rd Qu.: 0.5637  
##  medical           : 2332   Max.   : 3.13783   Max.   :13.0896  
##  (Other)           : 7772                                       
##    revol_util         total_acc       loan_status LoanToIncomeRatio  
##  Min.   :-1.76604   Min.   :-1.8035   0:115827    Min.   : -0.18387  
##  1st Qu.:-0.79373   1st Qu.:-0.7531   1: 30010    1st Qu.: -0.09397  
##  Median :-0.05557   Median :-0.1875               Median : -0.03126  
##  Mean   : 0.00000   Mean   : 0.0000               Mean   :  0.00000  
##  3rd Qu.: 0.74212   3rd Qu.: 0.5396               3rd Qu.:  0.04868  
##  Max.   : 2.11130   Max.   :10.2353               Max.   :269.23479  
##                                                                      
##     fico_avg      
##  Min.   :-1.1725  
##  1st Qu.:-0.7466  
##  Median :-0.1787  
##  Mean   : 0.0000  
##  3rd Qu.: 0.5311  
##  Max.   : 4.0945  
## 
summary(test_data)
##    loan_amnt        term          int_rate          installment       grade    
##  Min.   :-1.40990   36:46712   Min.   :-1.513369   Min.   :-1.42491   A:11271  
##  1st Qu.:-0.78917   60:15788   1st Qu.:-0.689377   1st Qu.:-0.73087   B:16935  
##  Median :-0.27189              Median :-0.188205   Median :-0.26598   C:19308  
##  Mean   : 0.01107              Mean   :-0.001456   Mean   : 0.01072   D: 9468  
##  3rd Qu.: 0.55576              3rd Qu.: 0.396794   3rd Qu.: 0.55755   E: 3736  
##  Max.   : 2.57315              Max.   : 3.066741   Max.   : 4.34495   F: 1162  
##                                                                       G:  620  
##    emp_length     home_ownership    annual_inc            verification_status
##  10     :22412   ANY     :   70   Min.   :-2.01835   Not Verified   :21378   
##  2      : 6257   MORTGAGE:31445   1st Qu.:-0.71912   Source Verified:26007   
##  0      : 6010   NONE    :    0   Median :-0.19098   Verified       :15115   
##  3      : 5401   OWN     : 7109   Mean   : 0.00847                           
##  1      : 4475   RENT    :23876   3rd Qu.: 0.54841                           
##  4      : 4120                    Max.   : 2.44972                           
##  (Other):13825                                                               
##                purpose           dti               open_acc        
##  debt_consolidation:34406   Min.   :-1.855270   Min.   :-1.838578  
##  credit_card       :12172   1st Qu.:-0.736373   1st Qu.:-0.637462  
##  home_improvement  : 4976   Median :-0.084235   Median :-0.122698  
##  other             : 4960   Mean   :-0.004259   Mean   :-0.001639  
##  major_purchase    : 1705   3rd Qu.: 0.656179   3rd Qu.: 0.563654  
##  medical           : 1018   Max.   : 3.137834   Max.   : 9.486230  
##  (Other)           : 3263                                          
##    revol_util          total_acc         loan_status LoanToIncomeRatio  
##  Min.   :-1.766041   Min.   :-1.803470   0:49639     Min.   :-0.183869  
##  1st Qu.:-0.793730   1st Qu.:-0.753108   1:12861     1st Qu.:-0.093912  
##  Median :-0.063505   Median :-0.187528               Median :-0.031260  
##  Mean   :-0.005479   Mean   :-0.001719               Mean   :-0.003902  
##  3rd Qu.: 0.742123   3rd Qu.: 0.539647               3rd Qu.: 0.050315  
##  Max.   : 2.111296   Max.   : 9.669723               Max.   :14.990758  
##                                                                         
##     fico_avg         
##  Min.   :-1.1724742  
##  1st Qu.:-0.7465720  
##  Median :-0.1787023  
##  Mean   :-0.0004905  
##  3rd Qu.: 0.5311347  
##  Max.   : 4.0945167  
## 

Finally, summary() is applied and we can observe that the mean of the numerical columns in both training and testing data are close to 0.

Class Imbalance Handling

Random oversampling was used to handle the class imbalance. Before oversampling, the number of Charged Off loans was much smaller than the number of Fully Paid loans, which may cause the classification models to become biased toward the majority class.

table(train_data$loan_status)
## 
##      0      1 
## 115827  30010
# Separate majority and minority classes
majority_class <- train_data[train_data$loan_status == "0", ]
minority_class <- train_data[train_data$loan_status == "1", ]

# Random oversampling of minority class
set.seed(123)

minority_oversampled <- minority_class[
  sample(1:nrow(minority_class),
         size = nrow(majority_class),
         replace = TRUE),
]
# Combine majority class and oversampled minority class
train_final <- rbind(majority_class, minority_oversampled)

# Shuffle the training data
train_final <- train_final[sample(1:nrow(train_final)), ]

# Test set remains unchanged
test_model <- test_data

# Confirm target variable is factor
train_final$loan_status <- as.factor(train_final$loan_status)
test_model$loan_status <- as.factor(test_model$loan_status)

SMOTE was considered initially as it can generate synthetic minority-class samples.However, ordinary SMOTE is more suitable for continuous numerical predictors because it creates new observations through interpolation. This dataset contains several categorical predictors, such as term, grade, employment length, home ownership, verification status and loan purpose. When these categorical variables are converted into numeric labels and ordinary SMOTE is applied, the interpolation process may create unrealistic intermediate values, such as decimal values for loan grade or loan term. These values do not correspond to legitimate borrower profiles.Therefore, random oversampling was selected as it balances the minority class while preserving the original categorical values.

table(train_data$loan_status)
## 
##      0      1 
## 115827  30010
table(train_final$loan_status)
## 
##      0      1 
## 115827 115827

table() function is applied to compare the class distribution before and after applying random oversampling. After oversampling, the number of Fully Paid and Charged Off loans in the training dataset becomes balanced.

Model 1: Logistic Regression

Logistic Regression was used as an interpretable baseline model because the target variable is binary. The Logistic Regression model estimates the probability that a loan will be charged off based on the predictor variables such as annual income, loan amount, interest rate, fico_avg, loan-to-income ratio and so on.

logit_model <- glm(loan_status ~ .,
                   data = train_final,
                   family = "binomial")

summary(logit_model)
## 
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train_final)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -0.921862   0.152579  -6.042 1.52e-09 ***
## loan_amnt                            0.064762   0.031478   2.057 0.039647 *  
## term60                               0.356957   0.023447  15.224  < 2e-16 ***
## int_rate                             0.215682   0.021699   9.940  < 2e-16 ***
## installment                          0.163542   0.029462   5.551 2.84e-08 ***
## gradeB                               0.529147   0.022331  23.695  < 2e-16 ***
## gradeC                               0.815175   0.032210  25.308  < 2e-16 ***
## gradeD                               0.855510   0.047868  17.872  < 2e-16 ***
## gradeE                               0.789600   0.069663  11.335  < 2e-16 ***
## gradeF                               0.749065   0.090565   8.271  < 2e-16 ***
## gradeG                               0.823050   0.099009   8.313  < 2e-16 ***
## emp_length1                         -0.022703   0.021782  -1.042 0.297272    
## emp_length2                         -0.071490   0.020088  -3.559 0.000372 ***
## emp_length3                         -0.037366   0.020637  -1.811 0.070196 .  
## emp_length4                         -0.035160   0.022311  -1.576 0.115045    
## emp_length5                         -0.107347   0.022424  -4.787 1.69e-06 ***
## emp_length6                         -0.131266   0.025229  -5.203 1.96e-07 ***
## emp_length7                         -0.060364   0.026756  -2.256 0.024067 *  
## emp_length8                         -0.070384   0.028604  -2.461 0.013868 *  
## emp_length9                         -0.115572   0.027970  -4.132 3.60e-05 ***
## emp_length10                        -0.117245   0.016354  -7.169 7.55e-13 ***
## home_ownershipMORTGAGE              -0.379570   0.142831  -2.657 0.007873 ** 
## home_ownershipNONE                 -10.139296  51.112705  -0.198 0.842755    
## home_ownershipOWN                   -0.116108   0.143266  -0.810 0.417687    
## home_ownershipRENT                   0.103776   0.142848   0.726 0.467549    
## annual_inc                          -0.114408   0.005931 -19.290  < 2e-16 ***
## verification_statusSource Verified   0.144753   0.010962  13.205  < 2e-16 ***
## verification_statusVerified          0.255466   0.012503  20.433  < 2e-16 ***
## purposecredit_card                   0.041106   0.044434   0.925 0.354914    
## purposedebt_consolidation            0.112344   0.043579   2.578 0.009939 ** 
## purposehome_improvement              0.311445   0.046227   6.737 1.61e-11 ***
## purposehouse                         0.192141   0.062951   3.052 0.002271 ** 
## purposemajor_purchase                0.370754   0.050808   7.297 2.94e-13 ***
## purposemedical                       0.438944   0.055323   7.934 2.12e-15 ***
## purposemoving                        0.344802   0.062744   5.495 3.90e-08 ***
## purposeother                         0.268024   0.045860   5.844 5.09e-09 ***
## purposerenewable_energy              0.746263   0.169638   4.399 1.09e-05 ***
## purposesmall_business                0.893228   0.059248  15.076  < 2e-16 ***
## purposevacation                      0.287704   0.062953   4.570 4.87e-06 ***
## purposewedding                      10.027826  32.406412   0.309 0.756987    
## dti                                  0.123536   0.005388  22.929  < 2e-16 ***
## open_acc                             0.154205   0.006745  22.863  < 2e-16 ***
## revol_util                           0.033734   0.005653   5.968 2.40e-09 ***
## total_acc                           -0.193786   0.006839 -28.337  < 2e-16 ***
## LoanToIncomeRatio                   -0.006272   0.004486  -1.398 0.162041    
## fico_avg                            -0.182405   0.006500 -28.064  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 321141  on 231653  degrees of freedom
## Residual deviance: 286752  on 231608  degrees of freedom
## AIC: 286844
## 
## Number of Fisher Scoring iterations: 8

The Logistic Regression model was trained using the random-oversampled training dataset. Since Logistic Regression is suitable for binary classification, it estimates the probability that a loan belongs to the Charged Off class. The predicted probabilities were then converted into class labels for model evaluation.

Logistic Regression Prediction

Although Random Oversampling method was applied to reduce class imbalance during model training, probability threshold tuning remained necessary because classification decisions are still influenced by the default cutoff value of 0.5. This can adjust the decision rule and improve the detection of minority cases by tuning the probability threshold. Threshold tuning suggested a threshold of 0.4 as it provides an optimal balance between recall and F1-score.

probabilities <- predict(logit_model,
                         test_model,
                         type = "response")

log_pred <- ifelse(probabilities > 0.4,
                          1,
                          0)

head(log_pred)
## 2 3 4 5 6 7 
## 0 1 1 1 0 0

Model 2: Decision Tree Classification

Decision Tree classification model was developed to classify borrowers into charged off and fully paid groups based on borrower characteristics. Unlike Logistic Regression, Decision Tree is a rule-based model that separates borrowers using a sequence of splitting conditions. Decision Tree is easier to interpret as the classification process and the decisions made at every node can be visualised.

cp=0.001 is the minimum improvement threshold required for the tree to make a new split. A value of 0.001 means the tree will only split a node if it improves the overall model fit by at least 0.1%. Lowering this value allows more splits to occur.

minsplit = 20 sets the minimum number of observations that must exist in a node before the tree even attempts to split it. This prevents the tree from creating splits based on very small groups of data which could lead to overfitting.

minbucket = 7 sets the minimum number of observations allowed in any terminal node (leaf). The tree will not create a final leaf node that contains fewer than 7 observations.

maxdepth = 5 limits how many levels deep the tree can grow from the root node.

library(rpart)
set.seed(123)
tree_model <- rpart(loan_status ~ .,
                    data = train_final,
                    method = "class",
                    control = rpart.control(
                      cp       = 0.001,   # lower threshold makes more splits
                      minsplit = 20,      # minimum observations to attempt a split
                      minbucket = 7,      # minimum observations in a leaf node
                      maxdepth = 5        # limit depth to keep tree readable
                    ))

Decision Tree Diagram

The Decision Tree structure is visualized to show the classification rules used by the model.

The Decision Tree diagram displays:

  • splitting conditions

  • classification rules

  • predicted classes

  • decision paths used for borrower classification

library(rpart.plot)

rpart.plot(tree_model,
           main  = "Decision Tree")

The first split of the Decision Tree is based on grade, indicating that loan grade provides the strongest initial separation between Fully Paid and Charged Off loans in this tree. Loans with stronger grades such as A and B are mainly directed toward the lower-risk side of the tree, while weaker grades are more likely to move toward the higher-risk branches.

After the initial grade split, the tree further separates borrowers using home ownership, LoanToIncomeRatio, interest rate and revolving utilization. These variables suggest that repayment outcome is influenced by a combination of credit risk grade, borrower housing profile, repayment burden and credit utilization. However, the terminal nodes still contain mixed probabilities rather than pure classifications, showing that Fully Paid and Charged Off loans are not perfectly separable using the available predictors.

Variable Importance

The Decision Tree variable importance plot identifies the variables that contributed most to classify loans as Fully Paid or Charged Off. The results show that grade and int_rate were the most influential variables, indicating that loan grade and borrowing cost played the strongest role in splitting the data. FICO-related variables, revolving utilization and home ownership also contributed, but their importance was much lower. This suggests that the model relied mainly on credit risk indicators when predicting loan status.

tree_model$variable.importance
##            int_rate               grade            fico_avg   LoanToIncomeRatio 
##        1.006685e+04        1.004386e+04        1.913174e+03        9.117994e+02 
##         installment          revol_util      home_ownership           loan_amnt 
##        7.891540e+02        7.015702e+02        3.879545e+02        3.251806e+02 
## verification_status                term           total_acc          annual_inc 
##        2.394755e+02        9.797676e+01        4.504493e+01        3.561110e+01 
##            open_acc          emp_length             purpose                 dti 
##        1.987857e+01        1.499473e+01        1.275823e+01        7.824433e-01
barplot(tree_model$variable.importance,

        main = "Decision Tree Variable Importance",

        ylab = "Importance",
        
        las=2,
        
        cex.names=0.65)

The variable importance plot shows which borrower characteristics contribute the most to loan status classification.

Decision Tree Prediction

The Decision Tree model is used to predict loan status classifications for the testing datasets. Since the Decision Tree model is primarily determined by rule-based splits and the prediction of the class labels are based on terminal nodes, the probability threshold tuning impact is generally less significant.

tree_pred <- predict(tree_model,
                     test_model,
                     type = "class")

head(tree_pred)
## 2 3 4 5 6 7 
## 0 0 1 1 0 0 
## Levels: 0 1

Model 3: Random Forest Classification

Random Forest is an ensemble classification technique that combines multiple decision trees. Instead of relying on a single decision tree,Random Forest builds multiple decision trees by using random parts of the data and each of the trees are using random variables to split at the decision node. Decision Tree applied a majority voting mechanism to aggregate the predictions from individual trees,outputting the modes as final final classification. Compared to a single decision tree which is prone to high variance and overfitting, Random Forest provides robustness by reducing the dependence on only one tree structure and increasing accuracy through combining the predictions of every single tree to average out the mistakes.

library(randomForest)
set.seed(123)
rf_model <- randomForest(loan_status ~ .,
                         data = train_final,
                         ntree = 200)

rf_model
## 
## Call:
##  randomForest(formula = loan_status ~ ., data = train_final, ntree = 200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 4.03%
## Confusion matrix:
##        0      1 class.error
## 0 108488   7339  0.06336174
## 1   2008 113819  0.01733620

The model output shows an out-of-bag (OOB) error rate of 4.03%. This is an internal estimate of prediction error during the Random Forest training process. However, since the final model evaluation is carried out on the independent testing dataset, the OOB error rate should not be treated as the final model performance.

The relatively low OOB error suggests that the Random Forest model fits the oversampled training data well, but the independent testing results are still more important for evaluating generalisation performance.

Variable Importance

Variable importance was used to identify the most influential variables affecting loan status prediction in the Random Forest model. The result shows that int_rate, dti, LoanToIncomeRatio, revol_util, emp_length and installment were among the most important predictors.

sort(importance(rf_model)[,1],decreasing=TRUE)
##            int_rate                 dti   LoanToIncomeRatio          revol_util 
##          11314.5325          11024.5783          10743.6033          10544.0941 
##         installment          emp_length           total_acc          annual_inc 
##           9328.7660           9292.8530           8445.2959           8364.6331 
##            fico_avg            open_acc           loan_amnt               grade 
##           7211.5404           7031.3545           6363.1100           5447.4985 
##             purpose verification_status      home_ownership                term 
##           5182.0493           2354.3219           2125.3479            994.3745
varImpPlot(rf_model)

This suggests that the Random Forest model relied strongly on borrowing cost, debt burden, credit utilisation and repayment-related characteristics when classifying loans as Fully Paid or Charged Off. Variables such as grade, fico_avg and verification_status still contributed to the model, but they were less influential compared with the top-ranked predictors in this Random Forest model.

Random Forest Prediction

rf_prob <- predict(rf_model,
                   test_model,
                   type = "prob")
#The optimal probability threshold was found to be 0.3, it provides the best balance between recall and F1-score. 
rf_pred <- ifelse(rf_prob[,2] > 0.3,
                  1,
                  0)

head(rf_pred)
## 2 3 4 5 6 7 
## 0 0 1 1 0 0

Model Evaluation

The models’ performance are evaluated using four standard evaluation metrics which are accuracy, precision, recall and F1-score. These evaluation metrics were selected due to the reason that the target variable is imbalanced where the fully paid borrowers form the majority class. The evaluation of the model cannot simply rely on accuracy as it may not provide a comprehensive view on the model performance and can be misleading. A more detailed set of metrics is therefore necessary to fully understand a model’s strengths and weaknesses.

A confusion matrix will be utilized to summarize and visualize the performance of the classification models by comparing the predicted class labels against the actual class labels.

It is a 2x2 matrix consisting of 4 components: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). In the project, a positive class refers to loan default where Default=1.

These parameters are the components for computing accuracy, precision, recall and F1-score

\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\] Accuracy measures the overall proportion of correct predictions where the actual charged off loan and actual fully paid loan are correctly predicted. \[Precision = \frac{TP}{TP + FP}\] Precision measures the proportion of predicted charged off loan that were actually charged off. \[Recall = \frac{TP}{TP + FN}\] Recall, also known as sensitivity or true positive rate, measures the proportion of actual charged off loans that the model correctly captured. \[F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}\] F1-score provides a balance between precision and recall. It is useful when both false positives and false negatives need to be considered.

evaluate_model <- function(actual,predicted) {
  
  actual <- as.numeric(as.character(actual))
  cm <- table(Predicted = predicted, Actual = actual)
  
  TN <- cm["0","0"]
  FN <- cm["0","1"]
  FP <- cm["1","0"]
  TP <- cm["1","1"]
  
  accuracy <- (TP+TN)/sum(cm)
  precision <- TP/(TP+FP)
  recall <- TP/(TP+FN)
  f1_score <- (2*precision*recall)/(precision+recall)
  
  list(
    Confusion_Matrix = cm,
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    F1_Score = f1_score
  )
}

The function is applied to each models for evaluation.

log_evaluate <- evaluate_model(test_model$loan_status,log_pred)

tree_evaluate <- evaluate_model(test_model$loan_status,tree_pred)

rf_evaluate <- evaluate_model(test_model$loan_status,rf_pred)

cat("===== Logistic Regression =====\n")
## ===== Logistic Regression =====
print(log_evaluate)
## $Confusion_Matrix
##          Actual
## Predicted     0     1
##         0 21395  2018
##         1 28244 10843
## 
## $Accuracy
## [1] 0.515808
## 
## $Precision
## [1] 0.2774068
## 
## $Recall
## [1] 0.8430915
## 
## $F1_Score
## [1] 0.4174559

The Logistic Regression model achieved an accuracy of 51.58%, precision of 27.74%, recall of 84.31% and F1-score of 41.75%. Besides,the confusion matrix shows that the model correctly spot 10,843 actual charged off loan cases and incorrectly predicted 2,018 actual charged off loan as fully paid. This model also misclassified 28,244 fully paid loan as charged off.

The high recall rate of the Logistic Regression model indicates that the model is sensitive toward detecting borrowers who are likely to be charged off. In the context of loan status prediction, this is significant as missing actual charged off loan which have a lower recall rate may expose financial institutions to credit loss. However, the low precision indicates that there are many fully paid loan incorrectly classified as charged off by the model. This means the model produces many false positives, which may cause reliable and good standing borrowers to be incorrectly treated as high-risk borrowers who fail to fulfill the legal obligations of the loan contract.

Overall, Logistic Regression is useful when the main objective is to identify as many potential defaulted borrowers as possible. In contrast, it will be less suitable when the financial institution wants to minimize rejection of low-risk borrowers.

cat("\n===== Decision Tree =====\n")
## 
## ===== Decision Tree =====
print(tree_evaluate)
## $Confusion_Matrix
##          Actual
## Predicted     0     1
##         0 32467  4798
##         1 17172  8063
## 
## $Accuracy
## [1] 0.64848
## 
## $Precision
## [1] 0.3195165
## 
## $Recall
## [1] 0.6269341
## 
## $F1_Score
## [1] 0.423299

The Decision Tree model achieved an accuracy of 64.85%, which means that it made more correct predictions overall compared to the Logistic Regression model. However, it had a lower recall rate of 62.69%, indicating a lower performance in detecting actual charged off loans.

As shown as the confusion matrix, the model correctly classified 32,467 fully paid loan and 8,063 charged off loans. However, there are 4,798 actual charged off loans that were still predicted as fully paid. At the same time, 17,172 loans are misclassified as charged off loans when they are actually fully paid, resulting in a low precision score. This shows that the Decision Tree still had a poor performance to capture fully paid loans and a significant percentage to overlook high-risk borrowers even though rule-based decisions enhance the interpretability of the model.

Unlike other ‘black boxes’ models, Decision Tree allows the classification process to be visualized through splitting rules at every node, which helps to explain how borrower characteristics such as grade,interest rate and average of FICO affect the prediction.

cat("\n===== Random Forest =====\n")
## 
## ===== Random Forest =====
print(rf_evaluate)
## $Confusion_Matrix
##          Actual
## Predicted     0     1
##         0 32591  4629
##         1 17048  8232
## 
## $Accuracy
## [1] 0.653168
## 
## $Precision
## [1] 0.3256329
## 
## $Recall
## [1] 0.6400746
## 
## $F1_Score
## [1] 0.4316615

The Random Forest model had the strongest overall predictive performance as it achieved the highest accuracy of 65.32%. Other than that, it also had a higher precision rate, 32.56% and F1-score,43.17%. As a result shown in the confusion matrix, Random Forest had correctly predicted 32,591 fully paid loans and 8,232 charged off loans.

However, the Random Forest model achieved the lowest recall rate of 64.01%. This indicates that the model only identifies nearly approximately two-thirds of actual charged off loans. This also reduced the amount of false positives and increased the true negative cases with the cost of missing more actual defaulted borrowers. In other words, Random Forest was more conservative in predicting defaulted cases.

Therefore, Random Forest is the best model if the objective is to achieve stronger overall classification performance and a better trade-off between precision and recall. However, this is not a suitable model if the primary business goal is to capture the borrowers who are more likely to be defaulted and reduce the risk of lending to those borrowers in order to prevent any potential financial losses.

Model Comparison

The performance of those models is compared by using the standard evaluation metrics such as accuracy, precision, recall and also F1-score, each of the metrics reflect the model’s performance from different aspects.

comparison <- data.frame(
  
  Model = c("Logistic Regression",
            "Decision Tree",
            "Random Forest"),
  
  Accuracy = c(log_evaluate$Accuracy,
               tree_evaluate$Accuracy,
               rf_evaluate$Accuracy),
  
  Precision = c(log_evaluate$Precision,
               tree_evaluate$Precision,
               rf_evaluate$Precision),
  
  Recall = c(log_evaluate$Recall,
               tree_evaluate$Recall,
               rf_evaluate$Recall),
  
  F1_Score = c(log_evaluate$F1_Score,
               tree_evaluate$F1_Score,
               rf_evaluate$F1_Score)
)

comparison
##                 Model Accuracy Precision    Recall  F1_Score
## 1 Logistic Regression 0.515808 0.2774068 0.8430915 0.4174559
## 2       Decision Tree 0.648480 0.3195165 0.6269341 0.4232990
## 3       Random Forest 0.653168 0.3256329 0.6400746 0.4316615

Based on the comparison table, Random Forest stood out from Logistic Regression and Decision Tree as the strongest predictive model in this study because it had the highest score in 3 evaluation metrics which are accuracy, precision and F1-score. The high precision rate indicates that the predictions to identify defaulted borrowers are more reliable compared to the others model.

The Logistic Regression model achieved the highest recall rate. This means that it is good to capture the actual charged off loans but at the same time creating more false positive cases, leading to the lowest accuracy. Besides, the Decision Tree had a moderate performance compared to Logistic Regression and Random Forest in terms of accuracy, precision and recall.

In conclusion, there is no perfect model in this prediction analysis by having superior performance across all the evaluation metrics. The Logistic Regression model has the best performance in identifying the actual charged off loans and increasing the sensitivity of the model. While Random Forest has the strongest overall performance with a highest accuracy and least false positive cases generated.

Accuracy Visualization

par(mar = c(5, 4, 4, 8), xpd = TRUE)

barplot(t(as.matrix(comparison[,-1])),
        beside = TRUE,
        names.arg = c("Logistic\nRegression",
              "Decision\nTree",
              "Random\nForest"),
        col = c("skyblue", "orange", "seagreen", "pink"),
        legend.text = colnames(comparison)[-1],
        args.legend = list(
          x = "topright",
          inset = c(-0.25, 0),
          bty = "n"
        ),
        main = "Model Performance Comparison",
        ylab = "Score",
        ylim = c(0, 1)
)

In this dataset, loan status is classified as either Fully Paid or Charged Off, where Charged Off loans refer to cases in which borrowers have defaulted on their repayment obligations, also called as defaulted borrowers.

The model performance visualization highlights the trade-off between accuracy, precision, recall and F1-score. In this study, the models with higher recall rate tend to have a lower accuracy due to the reason that when the models are more likely to predict the loans as charged off but at the same time, it has a higher chance to misclassify actual fully paid loans as charged off loans resulting in higher false positive cases

On the other hand, the model with lower recall is because the models tend to be more conservative in predicting the actual class. This makes the model correctly identify more actual non-default borrowers, resulting in decreasing false positive cases and better overall accuracy.

The low precision values across all models indicate that predicting loan status remains challenging. Many borrowers predicted as default were actually non-default. This suggests that default and non-default borrowers may overlap across the available features, making perfect classification difficult. Hence, the interpretation should be seen as a trade-off between identifying high-risk borrowers and preventing false alarms.

Discussion (Classification)

The feature importance results show that loan status prediction was influenced by several borrower and loan-related characteristics. Across the models, variables such as interest rate, debt-to-income ratio, loan-to-income ratio, revolving utilisation, loan grade and FICO-related variables contributed to distinguishing between Fully Paid and Charged Off loans.

The results demonstrate that each of the three models has different strengths and limitations. Logistic Regression is the most sensitive model in detecting actual defaulted borrowers. This is important for credit risk assessment since it might result in financial loss if a borrower with a high-risk of default is not identified. However, the low precision indicates that the model misclassified many non-default borrowers as default, resulting in rejection of reliable applicants.

Decision Tree had moderate performance across all metrics compared to other models. The tree structure allows the classification process to be explained by decision rules. The data split into branches based on specific features when moving down from the top of the tree, making it easier to understand how borrower characteristics influence loan status prediction.

Random Forest had the highest accuracy, precision and F1-score, suggesting that it provided the strongest overall classification performance. Its ensemble structure allows it to combine multiple decision trees, which helps reduce overfitting and improves prediction stability. However, its recall rate showed that it missed more actual charged off cases.

Generally, the findings show that model selection should not be based on accuracy alone but also interpret from different aspects. If the financial institution aims to maximize overall predictive performance and reduce false alarms, Random Forest is the most suitable model. If the primary goal is to have a strict inspection and screening process to detect default,Logistic Regression may be more appropriate due to its higher recall.

Conclusion (Classification)

This project successfully developed and evaluated three classification models for loan default prediction: Logistic Regression, Decision Tree and Random Forest. The dataset was initially prepared from data understanding to understand the structure of the data and a series of preprocessing steps including data transformation, feature scaling, feature engineering and oversampling class imbalance handling before model training.

According to the evaluation results, Random Forest achieved the highest accuracy, precision and F1-score, making it the strongest overall model in this study. Conversely, Logistic Regression and Decision Tree has the higher recall score which indicated they were more sensitive individual who were likely to be default. This highlights an important trade-off in loan status prediction.

The findings demonstrate that machine learning classification techniques can assist financial institutions in improving lending decisions and reducing financial risk. Model selection should depend on the financial institution’s objective. If the priority is to reduce missed default cases and financial risk, recall should be emphasized. If the priority is to avoid incorrectly rejecting reliable borrowers, precision should also be considered.Accuracy which mean perform well in predicting non-default but fail to capture the actual default.

Regression Problem

Outlier Detection

Outliers are extreme values that deviate significantly from the rest of the data and can distort regression model performance by inflating error metrics such as RMSE and biasing coefficient estimates toward extreme observations.

The IQR method is used to detect outliers across the key numeric predictors and the regression target. Values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged as statistical outliers. Rather than removing flagged records — which would reduce the dataset size and may discard legitimate observations — capping treatments are applied selectively depending on the nature of each variable.

The IQR method is applied to detect outliers in:

  • Annual Income (annual_inc)
  • Interest Rate (int_rate) — the regression target
  • Debt-to-Income Ratio (dti)
  • Revolving Utilisation (revol_util)
outliers <- function(x){
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- IQR(x)
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  return(x < lower_bound | x > upper_bound)
}

cat("Outliers in annual_inc: ", sum(outliers(loandata$annual_inc)), "\n")
## Outliers in annual_inc:  10009
cat("Outliers in int_rate:   ", sum(outliers(loandata$int_rate)),   "\n")
## Outliers in int_rate:    11452
cat("Outliers in dti:        ", sum(outliers(loandata$dti)),        "\n")
## Outliers in dti:         2613
cat("Outliers in revol_util: ", sum(outliers(loandata$revol_util)), "\n")
## Outliers in revol_util:  10

The IQR method flags outliers in income, DTI, and revolving utilisation. Rather than removing legitimate records, we apply capping treatments: annual_inc and dti are winsorised at the 99th percentile, while revol_util is capped at 100 (its theoretical maximum, since revolving utilisation above 100% represents data anomalies). Note that the 11452 ‘outliers’ detected in int_rate are not treated, because int_rate is our regression target and these represent legitimate high-risk loans.

# Cap annual_inc at 99th percentile
inc_cap <- quantile(loandata$annual_inc, 0.99)
n_capped_inc <- sum(loandata$annual_inc > inc_cap)
loandata$annual_inc <- pmin(loandata$annual_inc, inc_cap)
cat("annual_inc capped at", round(inc_cap, 2), 
    "- affected", n_capped_inc, "rows\n")
## annual_inc capped at 285000 - affected 2066 rows
# Cap dti at 99th percentile (handles the "999" placeholders)
dti_cap <- quantile(loandata$dti, 0.99)
n_capped_dti <- sum(loandata$dti > dti_cap)
loandata$dti <- pmin(loandata$dti, dti_cap)
cat("dti capped at", round(dti_cap, 2),
    "- affected", n_capped_dti, "rows\n")
## dti capped at 46.48 - affected 2083 rows
# Cap revol_util at 100 (theoretical maximum)
n_capped_rev <- sum(loandata$revol_util > 100)
loandata$revol_util <- pmin(loandata$revol_util, 100)
cat("revol_util capped at 100 - affected", n_capped_rev, "rows\n")
## revol_util capped at 100 - affected 546 rows

The capping output confirms that the treatments were applied successfully. Around 1% of records had annual_inc and dti values trimmed to the 99th percentile, while several hundred records had revol_util capped at 100. This reduces the influence of extreme values on the linear model coefficients without removing any rows from the dataset.

Data Transformation

Categorical variables are converted to factors. The grade and loan_status columns are removed from the regression dataset:

  • grade directly determines int_rate (would cause target leakage and circular reasoning).

  • loan_status is the target of the separate classification study.

library(dplyr)

loandata <- loandata %>%
  select(-grade, -loan_status)

loandata <- loandata %>%
  mutate(across(c(home_ownership,
                  verification_status,
                  purpose),
                as.factor))

head(loandata)
##   loan_amnt term int_rate installment emp_length home_ownership annual_inc
## 1      3000   36     7.34       93.10          9           RENT      52000
## 2      5000   36    11.98      166.03         10            OWN      55000
## 3      7000   36    11.98      232.44          0       MORTGAGE      40000
## 4     30000   36    21.85     1143.39         10            OWN      57000
## 5     21000   60    20.39      560.94         10            OWN      85000
## 6     40000   60     9.43      838.71         10       MORTGAGE     150000
##   verification_status            purpose   dti fico_range_low fico_range_high
## 1     Source Verified     major_purchase  0.58            760             764
## 2        Not Verified              other 14.18            675             679
## 3            Verified   home_improvement 20.25            695             699
## 4            Verified debt_consolidation 27.58            680             684
## 5     Source Verified              house 15.76            665             669
## 6        Not Verified debt_consolidation  4.09            730             734
##   open_acc revol_util total_acc
## 1        7        0.5        30
## 2       14       33.9        24
## 3       13       36.0        29
## 4       11       53.2        26
## 5       15       34.2        27
## 6        6        9.4         9

After this step, the dataset contains 15 predictor variables: 11 numeric borrower attributes and 4 categorical variables encoded as factors. The target variable int_rate remains in its original numeric form so that error metrics (RMSE, MAE) can be interpreted directly in percentage points.

Exploratory Data Analysis (EDA)

Distribution of Interest Rate (Target)

par(mfrow = c(2,2), mar = c(4,4,2,1))

hist(loandata$int_rate,
     main = "Interest Rate Distribution",
     xlab = "Interest Rate (%)",
     col = "darkgreen",
     border = "white")

hist(loandata$loan_amnt,
     main = "Loan Amount Distribution",
     xlab = "Loan Amount",
     col = "pink",
     border = "white")

hist(loandata$annual_inc,
     main = "Annual Income Distribution (capped)",
     xlab = "Annual Income",
     col = "lightblue",
     border = "white")

hist(loandata$fico_range_low,
     main = "FICO Score (low) Distribution",
     xlab = "FICO Score",
     col = "orange",
     border = "white")

Interest rate is right-skewed, with a long tail of higher-risk loans charged higher rates.

Relationship Between Features and Interest Rate

numeric_vars <- c("int_rate", "loan_amnt", "term", "installment",
                  "emp_length", "annual_inc", "dti",
                  "fico_range_low", "fico_range_high",
                  "open_acc", "revol_util", "total_acc")

cor_matrix <- cor(loandata[numeric_vars])
round(sort(cor_matrix["int_rate", -1], decreasing = TRUE), 4)
##            term      revol_util     installment             dti       loan_amnt 
##          0.3664          0.2870          0.2010          0.2008          0.1414 
##        open_acc      emp_length       total_acc      annual_inc fico_range_high 
##          0.0112         -0.0225         -0.0431         -0.1001         -0.4216 
##  fico_range_low 
##         -0.4216

The correlation analysis shows clear, intuitive relationships:

  • FICO score is strongly negatively correlated with interest rate — higher creditworthiness means lower rates.

  • Term, installment, revol_util, dti are positively correlated — longer terms, larger payments, higher utilisation and higher DTI all increase the rate.

  • Annual income has a smaller but negative effect.

This is exactly the kind of structured signal we want for a regression task. It strongly suggests that meaningful prediction of interest rate is achievable.

library(ggplot2)

ggplot(loandata, aes(x = factor(term), y = int_rate, fill = factor(term))) +
  geom_boxplot() +
  labs(title = "Interest Rate by Loan Term",
       x = "Term (months)", y = "Interest Rate (%)") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

60-month loans carry noticeably higher interest rates than 36-month loans, consistent with longer-term loans bearing more risk.

Data Modelling

Training and Testing Data Splitting

The caret package is used to split the dataset into 70% training and 30% testing, with stratification on the target.

library(caret)
set.seed(123)

sample_index <- createDataPartition(loandata$int_rate,
                                    p = 0.7,
                                    list = FALSE)

train_data <- loandata[sample_index, ]
test_data  <- loandata[-sample_index, ]

cat("Training rows:", nrow(train_data), "\n")
## Training rows: 145838
cat("Test rows:    ", nrow(test_data),  "\n")
## Test rows:     62499
cat_cols <- c("home_ownership", "verification_status", "purpose")

for (col in cat_cols) {
  train_data[[col]] <- as.factor(train_data[[col]])
  test_data[[col]]  <- factor(test_data[[col]],
                              levels = levels(train_data[[col]]))
}

Feature Scaling

numeric_predictors <- c("loan_amnt", "term", "installment",
                        "emp_length", "annual_inc", "dti",
                        "fico_range_low", "fico_range_high",
                        "open_acc", "revol_util", "total_acc")

train_scaled <- scale(train_data[numeric_predictors])

train_data[numeric_predictors] <- train_scaled

test_data[numeric_predictors] <- scale(test_data[numeric_predictors],
                                       center = attr(train_scaled, "scaled:center"),
                                       scale  = attr(train_scaled, "scaled:scale"))

Feature scaling standardises numerical predictors so they have a mean of zero and a standard deviation of one. This is particularly important for linear models, where coefficient magnitudes become directly comparable when predictors are on the same scale. The same scaling parameters from the training set are applied to the test set to prevent data leakage.

Evaluation Metric Function

evaluate_reg <- function(actual, predicted, model_name) {
  rmse_val <- sqrt(mean((actual - predicted)^2))
  mae_val  <- mean(abs(actual - predicted))
  r2_val   <- 1 - sum((actual - predicted)^2) /
                  sum((actual - mean(actual))^2)
  data.frame(Model = model_name,
             RMSE = round(rmse_val, 4),
             MAE  = round(mae_val, 4),
             R2   = round(r2_val, 4))
}

baseline_pred <- rep(mean(train_data$int_rate), nrow(test_data))
baseline_rmse <- sqrt(mean((test_data$int_rate - baseline_pred)^2))
cat("Baseline RMSE (predicting the mean):", round(baseline_rmse, 4), "\n")
## Baseline RMSE (predicting the mean): 5.6176

A helper function computes the three standard regression metrics: RMSE (average prediction error in the same units as the target), MAE (median-like average error, less sensitive to extreme errors), and R-squared (the proportion of variance explained). A baseline RMSE is also computed by predicting the training mean for every test observation. Any useful model must produce predictions that beat this baseline.

Model Building

Model 1: Multiple Linear Regression

lm_model <- lm(int_rate ~ ., data = train_data)
summary(lm_model)
## 
## Call:
## lm(formula = int_rate ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0684  -2.1728  -0.5128   1.7373  20.1702 
## 
## Coefficients:
##                                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                         13.618945   0.284808   47.818  < 2e-16 ***
## loan_amnt                          -15.277628   0.046920 -325.610  < 2e-16 ***
## term                                 5.563151   0.014151  393.137  < 2e-16 ***
## installment                         14.782137   0.043267  341.646  < 2e-16 ***
## emp_length                          -0.092329   0.008777  -10.520  < 2e-16 ***
## home_ownershipMORTGAGE              -0.421249   0.275264   -1.530 0.125933    
## home_ownershipNONE                  -2.017119   3.276375   -0.616 0.538123    
## home_ownershipOWN                    0.071399   0.276125    0.259 0.795964    
## home_ownershipRENT                   0.100870   0.275344    0.366 0.714110    
## annual_inc                          -0.208233   0.010718  -19.429  < 2e-16 ***
## verification_statusSource Verified   0.734149   0.020342   36.090  < 2e-16 ***
## verification_statusVerified          1.342534   0.024030   55.870  < 2e-16 ***
## purposecredit_card                  -1.128430   0.078610  -14.355  < 2e-16 ***
## purposedebt_consolidation           -0.355859   0.077102   -4.615 3.93e-06 ***
## purposehome_improvement             -0.059428   0.081958   -0.725 0.468391    
## purposehouse                         0.307964   0.114577    2.688 0.007192 ** 
## purposemajor_purchase                0.558504   0.092003    6.071 1.28e-09 ***
## purposemedical                       0.606797   0.101604    5.972 2.35e-09 ***
## purposemoving                        0.758710   0.119322    6.358 2.04e-10 ***
## purposeother                         0.916630   0.081780   11.208  < 2e-16 ***
## purposerenewable_energy              1.239478   0.329149    3.766 0.000166 ***
## purposesmall_business                0.765950   0.111685    6.858 7.00e-12 ***
## purposevacation                      0.855244   0.114786    7.451 9.33e-14 ***
## purposewedding                       8.224323   2.309850    3.561 0.000370 ***
## dti                                  0.403456   0.010211   39.513  < 2e-16 ***
## fico_range_low                     -64.657663  19.208818   -3.366 0.000763 ***
## fico_range_high                     63.469164  19.208730    3.304 0.000953 ***
## open_acc                             0.075261   0.012731    5.911 3.40e-09 ***
## revol_util                           0.157690   0.010983   14.357  < 2e-16 ***
## total_acc                           -0.254448   0.012697  -20.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.265 on 145808 degrees of freedom
## Multiple R-squared:  0.6606, Adjusted R-squared:  0.6605 
## F-statistic:  9787 on 29 and 145808 DF,  p-value: < 2.2e-16
lm_pred <- predict(lm_model, newdata = test_data)
lm_eval <- evaluate_reg(test_data$int_rate, lm_pred, "Linear Regression")
print(lm_eval)
##               Model   RMSE    MAE     R2
## 1 Linear Regression 3.2983 2.5186 0.6553

The regression summary shows several highly significant predictors (term, FICO scores, DTI, revolving utilisation), confirming the relationships found in EDA.

Model 2: Decision Tree Regression

library(rpart)
library(rpart.plot)

tree_model <- rpart(int_rate ~ ., data = train_data, method = "anova",
                    control = rpart.control(
                      cp       = 0.001,
                      minsplit = 200,
                      maxdepth = 10
                    ))
rpart.plot(tree_model, fallen.leaves = TRUE, cex = 0.55)

tree_pred <- predict(tree_model, newdata = test_data)
tree_eval <- evaluate_reg(test_data$int_rate, tree_pred, "Decision Tree")
print(tree_eval)
##           Model   RMSE    MAE     R2
## 1 Decision Tree 4.1083 3.1032 0.4652

The decision tree is configured with a low complexity parameter (cp = 0.001) and a minimum split size of 200 to allow it to grow deep enough to capture the structure in the data, while a maximum depth of 10 prevents excessive overfitting. The resulting tree visualises how borrower attributes hierarchically determine interest rate, with FICO score and loan term appearing as the most important early splits.

Model 3: Random Forest Regression

library(ranger)

set.seed(123)
rf_model <- ranger(int_rate ~ ., data = train_data,
                   num.trees     = 300,
                   mtry          = 5,
                   min.node.size = 10,
                   importance    = "impurity")
## Growing trees.. Progress: 37%. Estimated remaining time: 52 seconds.
## Growing trees.. Progress: 73%. Estimated remaining time: 23 seconds.
rf_pred <- predict(rf_model, data = test_data)$predictions
rf_eval <- evaluate_reg(test_data$int_rate, rf_pred, "Random Forest")
print(rf_eval)
##           Model   RMSE    MAE     R2
## 1 Random Forest 2.7307 1.9475 0.7637

The Random Forest is configured with 300 trees, 5 features tried at each split (mtry = 5), and a minimum terminal node size of 10. Compared to a single decision tree, the ensemble averages predictions across many trees, which substantially reduces variance and improves generalization.

importance_df <- data.frame(
  Feature = names(rf_model$variable.importance),
  Importance = rf_model$variable.importance) %>%
  arrange(desc(Importance)) %>%
  head(10)

ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "seagreen") +
  coord_flip() +
  labs(title = "Random Forest: Top 10 Feature Importances",
       x = "Feature", y = "Importance") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The feature importance plot ranks predictors by their contribution to reducing prediction error across all trees in the forest. Installment amount, loan amount, and loan term emerge as the dominant predictors, followed by FICO scores and debt-to-income ratio. This ranking aligns closely with how real lenders price loans: the size of the payment, the amount borrowed, the loan duration, and the borrower’s creditworthiness are all primary inputs to risk-based pricing.

Model Comparison

With all three models evaluated on the held-out test set, the metrics are combined into a single comparison table sorted by RMSE. The model with the lowest RMSE is treated as the best-performing model for this task.

comparison <- rbind(lm_eval, tree_eval, rf_eval)
comparison <- comparison[order(comparison$RMSE), ]
print(comparison)
##               Model   RMSE    MAE     R2
## 3     Random Forest 2.7307 1.9475 0.7637
## 1 Linear Regression 3.2983 2.5186 0.6553
## 2     Decision Tree 4.1083 3.1032 0.4652
ggplot(comparison, aes(x = reorder(Model, -RMSE), y = RMSE, fill = Model)) +
  geom_col() +
  geom_hline(yintercept = baseline_rmse, linetype = "dashed",
             color = "red", linewidth = 1) +
  coord_flip() +
  labs(title = "Model Comparison: RMSE (lower is better)",
       subtitle = "Red dashed line = baseline (predicting the mean)",
       x = "Model", y = "RMSE") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

All three models comfortably beat the baseline (red dashed line at RMSE = 5.6176). Random Forest produces the lowest RMSE, followed by Linear Regression and then Decision Tree.

# Actual vs Predicted for the best model
best_name <- comparison$Model[1]
best_pred <- switch(best_name,
  "Linear Regression" = lm_pred,
  "Decision Tree"     = tree_pred,
  "Random Forest"     = rf_pred)

set.seed(1)
plot_idx <- sample(nrow(test_data), 5000)

ggplot(data.frame(Actual = test_data$int_rate[plot_idx],
                  Predicted = best_pred[plot_idx]),
       aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = paste0("Best Model (", best_name, "): Actual vs Predicted"),
       subtitle = "Sample of 5,000 test points",
       x = "Actual Interest Rate (%)",
       y = "Predicted Interest Rate (%)") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The Actual vs Predicted plot shows predictions clustering along the red diagonal, indicating the best model captures the structure of interest rate well.

Discussion (Regression)

The regression analysis tested three modelling approaches: a linear model, a decision tree, and a random forest. All three models produced predictions substantially better than the naive baseline RMSE of 5.6176, with R-squared values well above zero and RMSE values clearly below the baseline.

Random Forest produced the strongest performance, with R² = 0.7637, RMSE = 2.7307, and MAE = 1.9475, explaining roughly 76% of the variation in interest rate. Linear Regression performed reasonably well at R² = 0.6553, RMSE = 3.2983, MAE = 2.5186, demonstrating that a simple linear model can still capture most of the relationship between borrower attributes and interest rate. Decision Tree was the weakest with R² = 0.4652, RMSE = 4.1083, MAE = 3.1032, since a single tree cannot capture as many interactions as the Random Forest ensemble does.

Random Forest’s clear advantage over Linear Regression suggests that non-linear relationships and interactions between borrower attributes matter for accurate interest rate prediction. While Linear Regression offers superior interpretability through its coefficients, Random Forest delivers meaningfully better predictions in this dataset.

Feature importance from the Random Forest highlights that installment amount, loan amount, loan term, FICO scores, and debt-to-income ratio are the strongest drivers of interest rate. This is consistent with real-world risk-based pricing: lenders charge more when the borrower has lower credit scores, longer terms, higher debt burdens, or larger loan exposures.

By deliberately excluding grade from the predictors, the models prove that interest rate can be reasonably predicted from raw borrower characteristics, not just from Lending Club’s pre-computed grading system. This makes the analysis applicable to lenders that do not have an existing internal grading model.

Conclusion (Regression)

This project developed and evaluated three regression models which is Multiple Linear Regression, Decision Tree, and Random Forest to predict loan interest rate on the Lending Club 2017-2018 dataset. The data was prepared through column selection, year filtering for relevance, outlier capping, factor encoding, stratified train-test splitting, and feature scaling.

Random Forest emerged as the clear best model with R² = 0.7637 and RMSE = 2.7307, substantially outperforming both Linear Regression (R² = 0.6553) and Decision Tree (R² = 0.4652), and well below the baseline RMSE of 5.6176. The most influential predictors were installment amount, loan amount, loan term, FICO score, and DTI, aligning directly with real-world lending practice. These results demonstrate that interest rates can be predicted with meaningful accuracy from borrower attributes alone.

For future work, more advanced models such as Gradient Boosting (XGBoost) could be explored to push predictive performance further, and the analysis could be extended to time-aware modelling to capture macroeconomic effects on interest-rate setting.