The “Lending Club” dataset on Kaggle is a large-scale real-world financial dataset containing loan information from a peer-to-peer lending platform. It includes accepted and rejected loan applications from 2007 to 2018, with detailed features such as borrower demographics, credit history, loan purpose, interest rates, and loan status. It can be a benchmark dataset in machine learning due to its size and real-world complexity. Peer-to-peer lending platforms such as Lending Club connect borrowers with individual investors.
Using a real-world Lending Club dataset of accepted loans issued in 2017-2018, this study develops predictive models based on borrower and loan characteristics available at the time of origination to address two main objectives:
Classification (Loan Status Prediction): Accurate prediction of whether a borrower will repay or default on a loan is critical for managing credit risk and protecting investor returns. This study develops classification models to predict loan repayment outcomes based on borrower and loan characteristics available at the time of origination.
Regression (Interest Rate Prediction): Interest rate is the central pricing mechanism in lending. It determines the cost of a loan to the borrower and the return to the lender, and is set based on the perceived risk of the borrower. Understanding which factors drive interest rate, and being able to predict the rate that a borrower is likely to be charged, is valuable to both lenders (for consistent risk-based pricing) and to borrowers (for understanding what affects the rate they receive). This study develops regression models to predict the interest rate assigned to a borrower from their financial profile and loan characteristics.
There are few classification techniques can be use which are:
Then the performance of the classification techniques are evaluated.
The objective is to develop and evaluate regression models that predict int_rate (the loan’s interest rate) using borrower and loan attributes. Three regression techniques are built and compared:
The performance of these techniques is evaluated using RMSE, MAE, and R-squared.
Important methodological note for Regression:
Lending Club assigns each loan a credit grade and
sub_grade, and the interest rate is directly determined
from that grade. Including grade as a predictor of
int_rate would therefore be circular and would produce an
artificially perfect model. We deliberately exclude
grade from the regression models and instead use
the underlying borrower attributes (FICO score, debt-to-income ratio,
income, term, etc.) — the same kind of variables that the grading
process itself relies on. This produces a more meaningful and realistic
predictive model.
This document prepares the raw Lending Club accepted loans dataset for downstream classification and regression modelling. It performs the following steps:
accepted_2007_to_2018Q4.csv (around 2.26
million rows, 151 columns).issue_d used only for year filtering).loan_status to only completed loans (Fully Paid
or Charged Off).issue_d since it is no longer needed.term, int_rate,
emp_length) and removes missing values.library(dplyr)
The full Lending Club dataset is loaded from CSV.
loandata <- read.csv("accepted_2007_to_2018Q4.csv", stringsAsFactors = FALSE)
cat("Loaded:", nrow(loandata), "rows x", ncol(loandata), "columns\n")
## Loaded: 2260701 rows x 151 columns
Out of 151 columns, only 18 are needed: 17 will be used for
modelling, plus issue_d which is used temporarily for year
filtering and then dropped. We keep borrower demographics, loan terms,
credit history indicators, and the two target-related columns
(loan_status for classification, int_rate for
regression).
keep_cols <- c(
"loan_amnt", "term", "int_rate", "installment", "grade",
"emp_length", "home_ownership", "annual_inc", "verification_status",
"purpose", "dti", "fico_range_low", "fico_range_high",
"open_acc", "revol_util", "total_acc", "loan_status",
"issue_d"
)
loandata <- loandata[, keep_cols]
cat("Reduced to", ncol(loandata), "columns\n")
## Reduced to 18 columns
The loan_status column contains six categories: Fully
Paid, Charged Off, Current, Late (16–30 days), Late (31–120 days), and
In Grace Period. For classification, only loans with a definitive final
outcome are useful, so we keep only “Fully Paid” and “Charged Off”.
Loans that are still active or in late stages have unknown final
outcomes and are removed.
loandata <- loandata %>%
filter(loan_status %in% c("Fully Paid", "Charged Off"))
cat("After loan_status filter:", nrow(loandata), "rows\n")
## After loan_status filter: 1345310 rows
Lending Club operated under different lending standards across its history. Loans from 2007–2010 reflect post-financial-crisis conditions, while 2017–2018 represents the most recent and consistent period before the dataset ends. Focusing on 2017–2018 ensures that the modelling reflects contemporary lending behaviour.
The issue_d column contains the issue date in formats
such as "Dec-2018" or "2018-12-01". A regular
expression extracts the year reliably from either format.
loandata$issue_year <- as.numeric(sub(".*?([0-9]{4}).*", "\\1", loandata$issue_d))
loandata <- loandata %>% filter(issue_year %in% c(2017, 2018))
cat("After year filter (2017-2018):", nrow(loandata), "rows\n")
## After year filter (2017-2018): 225611 rows
# Drop helper columns
loandata <- loandata %>% select(-issue_d, -issue_year)
Several columns contain text that needs to be converted to numeric form for modelling.
The term column is stored as " 36 months"
or " 60 months". We extract the numeric value.
loandata$term <- as.numeric(gsub("[^0-9]", "", loandata$term))
In some versions of the dataset, int_rate is stored as a
character string with a percent sign (e.g. "13.99%"). In
others it is already numeric. This conditional check handles both
cases.
if (is.character(loandata$int_rate)) {
loandata$int_rate <- as.numeric(gsub("%", "", loandata$int_rate))
}
emp_length is stored as text values like
"10+ years", "< 1 year", or
"n/a". We convert these into numeric years, with
n/a becoming NA.
parse_emp_length <- function(x) {
x <- trimws(as.character(x))
ifelse(x == "" | x == "n/a", NA,
ifelse(x == "< 1 year", 0,
ifelse(x == "10+ years", 10,
as.numeric(gsub("[^0-9]", "", x)))))
}
loandata$emp_length <- parse_emp_length(loandata$emp_length)
We first inspect missing values per column, then drop rows with any
remaining missing values. After all the column selection and parsing
above, the only meaningful missing values should be in
emp_length (where the original “n/a” became
NA) and possibly a few in revol_util and
dti.
cat("Missing values per column before na.omit:\n")
## Missing values per column before na.omit:
print(colSums(is.na(loandata)))
## loan_amnt term int_rate installment
## 0 0 0 0
## grade emp_length home_ownership annual_inc
## 0 17056 0 0
## verification_status purpose dti fico_range_low
## 0 0 334 0
## fico_range_high open_acc revol_util total_acc
## 0 0 225 0
## loan_status
## 0
before <- nrow(loandata)
loandata <- na.omit(loandata)
cat("Removed", before - nrow(loandata), "rows with missing values\n")
## Removed 17274 rows with missing values
cat("Rows remaining after cleaning:", nrow(loandata), "\n")
## Rows remaining after cleaning: 208337
After all the cleaning steps above, the final dataset is ready for downstream modelling by the classification and regression teams.
cat("Final dataset:", nrow(loandata), "rows x", ncol(loandata), "columns\n")
## Final dataset: 208337 rows x 17 columns
The data preparation pipeline transformed the raw 2.26 million-row
Lending Club file into a focused, clean modelling dataset of
approximately 1 million recent loans across 17 columns
(after dropping the helper issue_d column used for year
filtering), ready for downstream classification and regression.
Outliers are extreme values that may affect model performance.
The IQR capping is used to reduce the influence of extreme outliers while preserving the overall dataset size. This method identifies outliers based on the spread of the middle 50% of the data, where values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are considered outliers.
The Interquartile Range (IQR) method is used to detect outliers in:
Annual Income
Loan Amount
Debt-to-Income Ratio
Revolving Utilisation
#Make a copy of cleaned dataset
loandata_cleaned<-loandata
#IQR outliers detection function
iqr_outliers <- function(x){
Q1 <- quantile(x,0.25)
Q3 <- quantile(x,0.75)
IQR <- IQR(x)
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
return(x < lower_bound | x > upper_bound)
}
#99th percentile outliers detection function
percentile_outliers <- function(x, lower_p = 0.01, upper_p = 0.99){
lower <- quantile(x, lower_p, na.rm = TRUE)
upper <- quantile(x, upper_p, na.rm = TRUE)
return(x < lower | x > upper)
}
outliers_income <- iqr_outliers(loandata_cleaned$annual_inc)
outliers_loanamt <- iqr_outliers(loandata_cleaned$loan_amnt)
outliers_dtiratio <- percentile_outliers(loandata_cleaned$dti)
outliers_revolutil <- percentile_outliers(loandata_cleaned$revol_util)
print(paste0("Total count of outliers for Annual Incomes: ",sum(outliers_income)))
## [1] "Total count of outliers for Annual Incomes: 10009"
print(paste0("Total count of outliers for Loan Amount: ",sum(outliers_loanamt)))
## [1] "Total count of outliers for Loan Amount: 4767"
print(paste0("Total count of outliers for DTI ratio: ",sum(outliers_dtiratio)))
## [1] "Total count of outliers for DTI ratio: 4154"
print(paste0("Total count of outliers for Revolving Utilization: ",sum(outliers_revolutil)))
## [1] "Total count of outliers for Revolving Utilization: 4027"
Display the summary of these four variables to view their min, max, 1st quartile and 3rd quartile values.
print("Annual Incomes:")
## [1] "Annual Incomes:"
summary(loandata_cleaned$annual_inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33 50000 70000 81805 98000 10999200
print("Loan Amount:")
## [1] "Loan Amount:"
summary(loandata_cleaned$loan_amnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 7000 12000 14672 20000 40000
print("DTI ratio:")
## [1] "DTI ratio:"
summary(loandata_cleaned$dti)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00 11.36 17.33 18.35 24.05 999.00
print("Revolving Utilization:")
## [1] "Revolving Utilization:"
summary(loandata_cleaned$revol_util)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.60 43.10 44.58 63.30 154.90
Visualise the outliers by plotting boxplots
par(mfrow = c(2, 2),
mar = c(4, 4, 2, 1))
boxplot(loandata_cleaned$annual_inc,
main = "Annual Income",
col = "lightblue")
boxplot(loandata_cleaned$loan_amnt,
main = "Loan Amount",
col = "lightgreen")
boxplot(loandata_cleaned$dti,
main = "DTI Ratio",
col = "lightpink")
boxplot(loandata_cleaned$revol_util,
main = "Revolving Utilization",
col = "lightyellow")
By plotting boxplot, we can observe that there are several outliers beyond the whiskers.
IQR capping is applied to annual_inc and
loan_amnt because these financial variables naturally
contain highly skewed distributions and extreme values.
iqr_cap <- function(x){
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- IQR(x, na.rm = TRUE)
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Apply IQR capping
loandata_cleaned$annual_inc <- iqr_cap(loandata_cleaned$annual_inc)
loandata_cleaned$loan_amnt <- iqr_cap(loandata_cleaned$loan_amnt)
However, 99th percentile capping is used for variables that have a
realistic and meaningful boundary such as revol_util and
dti. Both of them are in ratio and capped by using 99th
percentile capping to avoid excessive trimming of meaningful financial
behavior.
cap_99th <- function(x, lower_p = 0.01, upper_p = 0.99){
lower_bound <- quantile(x, lower_p, na.rm = TRUE)
upper_bound <- quantile(x, upper_p, na.rm = TRUE)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Apply 99th percentile capping
loandata_cleaned$dti <- cap_99th(loandata_cleaned$dti)
loandata_cleaned$revol_util <- cap_99th(loandata_cleaned$revol_util)
To check whether any outlier exist after using IQR and 99th percentile capping
outliers_income <- iqr_outliers(loandata_cleaned$annual_inc)
outliers_loanamt <- iqr_outliers(loandata_cleaned$loan_amnt)
outliers_dtiratio <- percentile_outliers(loandata_cleaned$dti)
outliers_revolutil <- percentile_outliers(loandata_cleaned$revol_util)
print(paste0("Total count of outliers for Annual Incomes: ",sum(outliers_income)))
## [1] "Total count of outliers for Annual Incomes: 0"
print(paste0("Total count of outliers for Loan Amount: ",sum(outliers_loanamt)))
## [1] "Total count of outliers for Loan Amount: 0"
print(paste0("Total count of outliers for DTI ratio: ",sum(outliers_dtiratio)))
## [1] "Total count of outliers for DTI ratio: 0"
print(paste0("Total count of outliers for Revolving Utilization: ",sum(outliers_revolutil)))
## [1] "Total count of outliers for Revolving Utilization: 0"
To view the summary of each variables after handling outliers
print("Annual Incomes:")
## [1] "Annual Incomes:"
summary(loandata_cleaned$annual_inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33 50000 70000 77329 98000 170000
print("Loan Amount:")
## [1] "Loan Amount:"
summary(loandata_cleaned$loan_amnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 7000 12000 14660 20000 39500
print("DTI ratio:")
## [1] "DTI ratio:"
summary(loandata_cleaned$dti)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.23 11.36 17.33 18.03 24.05 46.48
print("Revolving Utilization:")
## [1] "Revolving Utilization:"
summary(loandata_cleaned$revol_util)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 24.60 43.10 44.56 63.30 97.80
Visualize again the outliers by plotting boxplots.
par(mfrow = c(2, 2),
mar = c(4, 4, 2, 1))
boxplot(loandata_cleaned$annual_inc,
main = "Annual Income",
col = "lightblue")
boxplot(loandata_cleaned$loan_amnt,
main = "Loan Amount",
col = "lightgreen")
boxplot(loandata_cleaned$dti,
main = "DTI Ratio",
col = "lightpink")
boxplot(loandata_cleaned$revol_util,
main = "Revolving Utilization",
col = "lightyellow")
We can still observe some points that spread beyond the whisker in DTI Ratio plot because boxplot is based on the IQR method for outlier detection, whereas the DTI values were treated using 99th percentile capping. Therefore, although extreme values have been limited, some observations may still appear as statistical outliers under the IQR rule.
The categorical target variable loan_status was
transformed into a binary numeric format using label encoding, where
“Charged Off” is encoded as 1 and “Fully Paid” is encoded as 0 to enable
machine learning model training.
#Label encoding
loandata_trans <- loandata_cleaned
loandata_trans$loan_status <- ifelse(loandata_cleaned$loan_status == "Charged Off", 1, 0)
The categorical variables are converted into factor format because classification models require categorical variables to be factors.
loandata_trans <- loandata_trans %>%
mutate(across(c(term,
grade,
emp_length,
home_ownership,
verification_status,
purpose,
loan_status),
as.factor))
head(loandata_trans)
## loan_amnt term int_rate installment grade emp_length home_ownership
## 1 3000 36 7.34 93.10 A 9 RENT
## 2 5000 36 11.98 166.03 B 10 OWN
## 3 7000 36 11.98 232.44 B 0 MORTGAGE
## 4 30000 36 21.85 1143.39 D 10 OWN
## 5 21000 60 20.39 560.94 D 10 OWN
## 6 39500 60 9.43 838.71 B 10 MORTGAGE
## annual_inc verification_status purpose dti fico_range_low
## 1 52000 Source Verified major_purchase 1.23 760
## 2 55000 Not Verified other 14.18 675
## 3 40000 Verified home_improvement 20.25 695
## 4 57000 Verified debt_consolidation 27.58 680
## 5 85000 Source Verified house 15.76 665
## 6 150000 Not Verified debt_consolidation 4.09 730
## fico_range_high open_acc revol_util total_acc loan_status
## 1 764 7 0.5 30 0
## 2 679 14 33.9 24 0
## 3 699 13 36.0 29 0
## 4 684 11 53.2 26 0
## 5 669 15 34.2 27 0
## 6 734 6 9.4 9 0
To gain comprehensive understanding of the borrower dataset, including their distributional properties, variability and structural characteristics, exploratory data analysis was performed. This step is important in data preprocessing as it helps to identify the underlying trends to provide insights which can guide feature engineering and model development.
The distribution of numerical variables, such as annual income, loan amount, DTI ratio and Revolving Credit Utilization was visualized using histograms below. The income distribution shows a right-skewed distribution which suggests that most applicants earning between roughly 40,000 and 80,000 annually. The wide range of income levels implies that the borrowers come from different economic backgrounds which may influence their ability to pay loans.
Similarly, the distribution of loan amount shows a right-skewed distribution where smaller loan amounts are more common, while larger loans are less frequent. The peak frequency lies in the lower-to-mid loan amounts, around 10,000 to 20,000.
The DTI values form a bell-shaped curve centered around 15–20%. Few applicants have DTI above 35%, indicating moderate debt levels relative to income.
The Revolving Credit Utilization is fairly uniform distribution peaking near 40–50%. This can be explained that many borrowers use between 20–70% of their available revolving credit.
par(mfrow = c(2,2))
hist(loandata_trans$annual_inc, main = "Income", col="lightblue")
hist(loandata_trans$loan_amnt, main = "Loan Amount", col="lightgreen")
hist(loandata_trans$dti, main = "DTI", col="lightpink")
hist(loandata_trans$revol_util, main = "Revolving Utilization", col="lightyellow")
LoanToIncomeRatio is calculated to compare the loan size to income, which can capture the financial burden of borrowers.
Loan-to-Income ratio is derived as a new variable to improve predictive capability. The feature is defined as:
\[\text{Loan-to-Income Ratio} = \frac{\text{Loan Amount}}{\text{Annual Income}}\]
Compared to the absolute loan amount alone, the ratio of loan amount to income provides a relative measure of loan burden in relation to affordability. A higher loan-to-income ratio suggests that the borrower is taking more debt relative to their income, which may increase the likelihood of default.
loandata_trans$LoanToIncomeRatio <-
loandata_trans$loan_amnt /
loandata_trans$annual_inc
head(loandata_trans)
## loan_amnt term int_rate installment grade emp_length home_ownership
## 1 3000 36 7.34 93.10 A 9 RENT
## 2 5000 36 11.98 166.03 B 10 OWN
## 3 7000 36 11.98 232.44 B 0 MORTGAGE
## 4 30000 36 21.85 1143.39 D 10 OWN
## 5 21000 60 20.39 560.94 D 10 OWN
## 6 39500 60 9.43 838.71 B 10 MORTGAGE
## annual_inc verification_status purpose dti fico_range_low
## 1 52000 Source Verified major_purchase 1.23 760
## 2 55000 Not Verified other 14.18 675
## 3 40000 Verified home_improvement 20.25 695
## 4 57000 Verified debt_consolidation 27.58 680
## 5 85000 Source Verified house 15.76 665
## 6 150000 Not Verified debt_consolidation 4.09 730
## fico_range_high open_acc revol_util total_acc loan_status LoanToIncomeRatio
## 1 764 7 0.5 30 0 0.05769231
## 2 679 14 33.9 24 0 0.09090909
## 3 699 13 36.0 29 0 0.17500000
## 4 684 11 53.2 26 0 0.52631579
## 5 669 15 34.2 27 0 0.24705882
## 6 734 6 9.4 9 0 0.26333333
fico_range_low represents the lower bound of the
borrower’s estimated credit score
fico_range_high represents the upper bound of the
borrower’s estimated credit score
fico_avg is calculated by computing the average FICO
score to represent overall creditworthiness and the FICO range width to
capture variability in credit assessment. This transformation reduces
redundancy and improves interpretability for predictive
modelling.
\[\text{Average FICO Score} = \frac{\text{FICO Range Low + FICO Range High}}{\text{2}}\]
loandata_trans$fico_avg <- (loandata_trans$fico_range_low +
loandata_trans$fico_range_high) / 2
head(loandata_trans)
## loan_amnt term int_rate installment grade emp_length home_ownership
## 1 3000 36 7.34 93.10 A 9 RENT
## 2 5000 36 11.98 166.03 B 10 OWN
## 3 7000 36 11.98 232.44 B 0 MORTGAGE
## 4 30000 36 21.85 1143.39 D 10 OWN
## 5 21000 60 20.39 560.94 D 10 OWN
## 6 39500 60 9.43 838.71 B 10 MORTGAGE
## annual_inc verification_status purpose dti fico_range_low
## 1 52000 Source Verified major_purchase 1.23 760
## 2 55000 Not Verified other 14.18 675
## 3 40000 Verified home_improvement 20.25 695
## 4 57000 Verified debt_consolidation 27.58 680
## 5 85000 Source Verified house 15.76 665
## 6 150000 Not Verified debt_consolidation 4.09 730
## fico_range_high open_acc revol_util total_acc loan_status LoanToIncomeRatio
## 1 764 7 0.5 30 0 0.05769231
## 2 679 14 33.9 24 0 0.09090909
## 3 699 13 36.0 29 0 0.17500000
## 4 684 11 53.2 26 0 0.52631579
## 5 669 15 34.2 27 0 0.24705882
## 6 734 6 9.4 9 0 0.26333333
## fico_avg
## 1 762
## 2 677
## 3 697
## 4 682
## 5 667
## 6 732
Since fico_avg is computed, fico_range_low
and fico_range_high are removed to reduce redundancy
loandata_trans<-loandata_trans %>%
select(-c(fico_range_low,fico_range_high))
head(loandata_trans)
## loan_amnt term int_rate installment grade emp_length home_ownership
## 1 3000 36 7.34 93.10 A 9 RENT
## 2 5000 36 11.98 166.03 B 10 OWN
## 3 7000 36 11.98 232.44 B 0 MORTGAGE
## 4 30000 36 21.85 1143.39 D 10 OWN
## 5 21000 60 20.39 560.94 D 10 OWN
## 6 39500 60 9.43 838.71 B 10 MORTGAGE
## annual_inc verification_status purpose dti open_acc revol_util
## 1 52000 Source Verified major_purchase 1.23 7 0.5
## 2 55000 Not Verified other 14.18 14 33.9
## 3 40000 Verified home_improvement 20.25 13 36.0
## 4 57000 Verified debt_consolidation 27.58 11 53.2
## 5 85000 Source Verified house 15.76 15 34.2
## 6 150000 Not Verified debt_consolidation 4.09 6 9.4
## total_acc loan_status LoanToIncomeRatio fico_avg
## 1 30 0 0.05769231 762
## 2 24 0 0.09090909 677
## 3 29 0 0.17500000 697
## 4 26 0 0.52631579 682
## 5 27 0 0.24705882 667
## 6 9 0 0.26333333 732
The class balance of “Charged Off” (1) and “Fully Paid” (0) in
loan_status (target variable) is checked.
The target variable loan_status was examined to
understand the class balance in the dataset. The results show highly
imbalanced distribution where approximately 79.4% of the loan are fully
paid (0) and 20.6% are charged off (1).
table(loandata_trans$loan_status)
##
## 0 1
## 165466 42871
prop.table(table(loandata_trans$loan_status))
##
## 0 1
## 0.7942228 0.2057772
library(ggplot2)
ggplot(loandata_trans,
aes(x = loan_status,
fill = loan_status)) +
geom_bar(color = "black") +
geom_text(stat = "count",
aes(label = after_stat(count)),
vjust = -0.3,
size = 4,
fontface = "bold") +
labs(title = "Class Distribution of Loan Status",
x = "Loan Status",
y = "Frequency") +
scale_x_discrete(labels = c("Fully Paid",
"Charged Off")) +
scale_fill_manual(values = c("lightblue", "tomato")) +
theme_classic() +
theme(
legend.position = "none",
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The bar plot shows a dominant off majority class (fully paid) compared to minority class (charged off). The imbalance is common issue in credit risk modelling which can lead to biased model performance where the model is more likely to predict the majority class. Hence, class imbalance need to be addressed so that the model can performs well in predicting default cases.
The code uses the R package caret to split the loan
dataset into training and testing datasets. To make the process to be
reproducible, set.seed(123) is applied.
The dataset is divided into training (70%) and testing (30%) datasets. Since the target variable is imbalanced,we used a stratified sampling to maintain the original proportion of charged off and fully paid in both training and test set. This ensures both set represent same class distribution.
library(caret)
set.seed(123)
sample_index <- createDataPartition(loandata_trans$loan_status,
p=0.7,
list=FALSE)
train_data <- loandata_trans[sample_index, ]
test_data <- loandata_trans[-sample_index, ]
prop.table(table(loandata_trans$loan_status))
##
## 0 1
## 0.7942228 0.2057772
prop.table(table(train_data$loan_status))
##
## 0 1
## 0.7942223 0.2057777
prop.table(table(test_data$loan_status))
##
## 0 1
## 0.794224 0.205776
cat("Number of row in training data:", nrow(train_data), "\n")
## Number of row in training data: 145837
cat("Number of row in test data:", nrow(test_data), "\n")
## Number of row in test data: 62500
After splitting the data, the class proportions of target variable
loan_status in both training and testing data are observed.
There are approximately 79% of fully paid and 21% of charged off in both
training and testing set, indicating that the stratified sampling was
successfully performed. This balanced preservation of class distribution
helps ensure the model developed will be more reliable.
After the dataset is split into training and testing sets, the
categorical variables in cat_cols are converted into factors again to
ensure the consistency between both datasets. The function
levels = levels(train_data[[col]]) is applied to testing
data in order to force the corresponding columns in testing data to use
the same factor levels as the training data. This is to make sure both
datasets have identical category levels and prevent predictions errors
caused by mismatched or unseen categories in the testing dataset.
cat_cols <- c("term",
"grade",
"emp_length",
"home_ownership",
"verification_status",
"purpose",
"loan_status")
for (col in cat_cols) {
train_data[[col]] <- as.factor(train_data[[col]])
test_data[[col]] <- factor(test_data[[col]],
levels = levels(train_data[[col]]))
}
Feature scaling is especially useful for Logistic Regression and distance-based algorithms. Feature scaling standardizes numerical variables so they are on similar scales.
First, a list of numeric columns is stored in
numeric_cols. The scale() function subtracts
the mean from each value and divides by the standard deviation,
resulting the variables to have a mean close to 0 and standard deviation
close to 1.
Next, the same scaling parameters obtained from the training data are
applied to the testing dataset using center and
scale attributes from train_scaled. This is to
ensure that the testing data is transformed using the statistics same as
the training dataset to avoid data leakage and enhance the fair model
evaluation.
numeric_cols <- c("loan_amnt",
"int_rate",
"installment",
"annual_inc",
"dti",
"open_acc",
"revol_util",
"total_acc",
"LoanToIncomeRatio",
"fico_avg")
train_scaled <- scale(train_data[numeric_cols])
train_data[numeric_cols] <- train_scaled
test_data[numeric_cols] <- scale(test_data[numeric_cols],
center = attr(train_scaled, "scaled:center"),
scale = attr(train_scaled, "scaled:scale"))
summary(train_data)
## loan_amnt term int_rate installment grade
## Min. :-1.4099 36:109221 Min. :-1.5134 Min. :-1.4249 A:25619
## 1st Qu.:-0.7892 60: 36616 1st Qu.:-0.6894 1st Qu.:-0.7379 B:40404
## Median :-0.2719 Median :-0.1882 Median :-0.2786 C:45053
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 D:21881
## 3rd Qu.: 0.5558 3rd Qu.: 0.3968 3rd Qu.: 0.5343 E: 8509
## Max. : 2.5732 Max. : 3.0667 Max. : 4.3600 F: 2842
## G: 1529
## emp_length home_ownership annual_inc verification_status
## 10 :52153 ANY : 135 Min. :-2.0386 Not Verified :49902
## 2 :14410 MORTGAGE:73615 1st Qu.:-0.7191 Source Verified:60258
## 0 :13891 NONE : 2 Median :-0.1910 Verified :35677
## 3 :12878 OWN :16982 Mean : 0.0000
## 1 :10418 RENT :55103 3rd Qu.: 0.5352
## 5 : 9767 Max. : 2.4497
## (Other):32320
## purpose dti open_acc
## debt_consolidation:80153 Min. :-1.85527 Min. :-1.8386
## credit_card :28292 1st Qu.:-0.73748 1st Qu.:-0.6375
## home_improvement :11706 Median :-0.07651 Median :-0.1227
## other :11579 Mean : 0.00000 Mean : 0.0000
## major_purchase : 4003 3rd Qu.: 0.66611 3rd Qu.: 0.5637
## medical : 2332 Max. : 3.13783 Max. :13.0896
## (Other) : 7772
## revol_util total_acc loan_status LoanToIncomeRatio
## Min. :-1.76604 Min. :-1.8035 0:115827 Min. : -0.18387
## 1st Qu.:-0.79373 1st Qu.:-0.7531 1: 30010 1st Qu.: -0.09397
## Median :-0.05557 Median :-0.1875 Median : -0.03126
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.74212 3rd Qu.: 0.5396 3rd Qu.: 0.04868
## Max. : 2.11130 Max. :10.2353 Max. :269.23479
##
## fico_avg
## Min. :-1.1725
## 1st Qu.:-0.7466
## Median :-0.1787
## Mean : 0.0000
## 3rd Qu.: 0.5311
## Max. : 4.0945
##
summary(test_data)
## loan_amnt term int_rate installment grade
## Min. :-1.40990 36:46712 Min. :-1.513369 Min. :-1.42491 A:11271
## 1st Qu.:-0.78917 60:15788 1st Qu.:-0.689377 1st Qu.:-0.73087 B:16935
## Median :-0.27189 Median :-0.188205 Median :-0.26598 C:19308
## Mean : 0.01107 Mean :-0.001456 Mean : 0.01072 D: 9468
## 3rd Qu.: 0.55576 3rd Qu.: 0.396794 3rd Qu.: 0.55755 E: 3736
## Max. : 2.57315 Max. : 3.066741 Max. : 4.34495 F: 1162
## G: 620
## emp_length home_ownership annual_inc verification_status
## 10 :22412 ANY : 70 Min. :-2.01835 Not Verified :21378
## 2 : 6257 MORTGAGE:31445 1st Qu.:-0.71912 Source Verified:26007
## 0 : 6010 NONE : 0 Median :-0.19098 Verified :15115
## 3 : 5401 OWN : 7109 Mean : 0.00847
## 1 : 4475 RENT :23876 3rd Qu.: 0.54841
## 4 : 4120 Max. : 2.44972
## (Other):13825
## purpose dti open_acc
## debt_consolidation:34406 Min. :-1.855270 Min. :-1.838578
## credit_card :12172 1st Qu.:-0.736373 1st Qu.:-0.637462
## home_improvement : 4976 Median :-0.084235 Median :-0.122698
## other : 4960 Mean :-0.004259 Mean :-0.001639
## major_purchase : 1705 3rd Qu.: 0.656179 3rd Qu.: 0.563654
## medical : 1018 Max. : 3.137834 Max. : 9.486230
## (Other) : 3263
## revol_util total_acc loan_status LoanToIncomeRatio
## Min. :-1.766041 Min. :-1.803470 0:49639 Min. :-0.183869
## 1st Qu.:-0.793730 1st Qu.:-0.753108 1:12861 1st Qu.:-0.093912
## Median :-0.063505 Median :-0.187528 Median :-0.031260
## Mean :-0.005479 Mean :-0.001719 Mean :-0.003902
## 3rd Qu.: 0.742123 3rd Qu.: 0.539647 3rd Qu.: 0.050315
## Max. : 2.111296 Max. : 9.669723 Max. :14.990758
##
## fico_avg
## Min. :-1.1724742
## 1st Qu.:-0.7465720
## Median :-0.1787023
## Mean :-0.0004905
## 3rd Qu.: 0.5311347
## Max. : 4.0945167
##
Finally, summary() is applied and we can observe that
the mean of the numerical columns in both training and testing data are
close to 0.
Random oversampling was used to handle the class imbalance. Before oversampling, the number of Charged Off loans was much smaller than the number of Fully Paid loans, which may cause the classification models to become biased toward the majority class.
table(train_data$loan_status)
##
## 0 1
## 115827 30010
# Separate majority and minority classes
majority_class <- train_data[train_data$loan_status == "0", ]
minority_class <- train_data[train_data$loan_status == "1", ]
# Random oversampling of minority class
set.seed(123)
minority_oversampled <- minority_class[
sample(1:nrow(minority_class),
size = nrow(majority_class),
replace = TRUE),
]
# Combine majority class and oversampled minority class
train_final <- rbind(majority_class, minority_oversampled)
# Shuffle the training data
train_final <- train_final[sample(1:nrow(train_final)), ]
# Test set remains unchanged
test_model <- test_data
# Confirm target variable is factor
train_final$loan_status <- as.factor(train_final$loan_status)
test_model$loan_status <- as.factor(test_model$loan_status)
SMOTE was considered initially as it can generate synthetic minority-class samples.However, ordinary SMOTE is more suitable for continuous numerical predictors because it creates new observations through interpolation. This dataset contains several categorical predictors, such as term, grade, employment length, home ownership, verification status and loan purpose. When these categorical variables are converted into numeric labels and ordinary SMOTE is applied, the interpolation process may create unrealistic intermediate values, such as decimal values for loan grade or loan term. These values do not correspond to legitimate borrower profiles.Therefore, random oversampling was selected as it balances the minority class while preserving the original categorical values.
table(train_data$loan_status)
##
## 0 1
## 115827 30010
table(train_final$loan_status)
##
## 0 1
## 115827 115827
table() function is applied to compare the class
distribution before and after applying random oversampling. After
oversampling, the number of Fully Paid and Charged Off loans in the
training dataset becomes balanced.
Logistic Regression was used as an interpretable baseline model because the target variable is binary. The Logistic Regression model estimates the probability that a loan will be charged off based on the predictor variables such as annual income, loan amount, interest rate, fico_avg, loan-to-income ratio and so on.
logit_model <- glm(loan_status ~ .,
data = train_final,
family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train_final)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.921862 0.152579 -6.042 1.52e-09 ***
## loan_amnt 0.064762 0.031478 2.057 0.039647 *
## term60 0.356957 0.023447 15.224 < 2e-16 ***
## int_rate 0.215682 0.021699 9.940 < 2e-16 ***
## installment 0.163542 0.029462 5.551 2.84e-08 ***
## gradeB 0.529147 0.022331 23.695 < 2e-16 ***
## gradeC 0.815175 0.032210 25.308 < 2e-16 ***
## gradeD 0.855510 0.047868 17.872 < 2e-16 ***
## gradeE 0.789600 0.069663 11.335 < 2e-16 ***
## gradeF 0.749065 0.090565 8.271 < 2e-16 ***
## gradeG 0.823050 0.099009 8.313 < 2e-16 ***
## emp_length1 -0.022703 0.021782 -1.042 0.297272
## emp_length2 -0.071490 0.020088 -3.559 0.000372 ***
## emp_length3 -0.037366 0.020637 -1.811 0.070196 .
## emp_length4 -0.035160 0.022311 -1.576 0.115045
## emp_length5 -0.107347 0.022424 -4.787 1.69e-06 ***
## emp_length6 -0.131266 0.025229 -5.203 1.96e-07 ***
## emp_length7 -0.060364 0.026756 -2.256 0.024067 *
## emp_length8 -0.070384 0.028604 -2.461 0.013868 *
## emp_length9 -0.115572 0.027970 -4.132 3.60e-05 ***
## emp_length10 -0.117245 0.016354 -7.169 7.55e-13 ***
## home_ownershipMORTGAGE -0.379570 0.142831 -2.657 0.007873 **
## home_ownershipNONE -10.139296 51.112705 -0.198 0.842755
## home_ownershipOWN -0.116108 0.143266 -0.810 0.417687
## home_ownershipRENT 0.103776 0.142848 0.726 0.467549
## annual_inc -0.114408 0.005931 -19.290 < 2e-16 ***
## verification_statusSource Verified 0.144753 0.010962 13.205 < 2e-16 ***
## verification_statusVerified 0.255466 0.012503 20.433 < 2e-16 ***
## purposecredit_card 0.041106 0.044434 0.925 0.354914
## purposedebt_consolidation 0.112344 0.043579 2.578 0.009939 **
## purposehome_improvement 0.311445 0.046227 6.737 1.61e-11 ***
## purposehouse 0.192141 0.062951 3.052 0.002271 **
## purposemajor_purchase 0.370754 0.050808 7.297 2.94e-13 ***
## purposemedical 0.438944 0.055323 7.934 2.12e-15 ***
## purposemoving 0.344802 0.062744 5.495 3.90e-08 ***
## purposeother 0.268024 0.045860 5.844 5.09e-09 ***
## purposerenewable_energy 0.746263 0.169638 4.399 1.09e-05 ***
## purposesmall_business 0.893228 0.059248 15.076 < 2e-16 ***
## purposevacation 0.287704 0.062953 4.570 4.87e-06 ***
## purposewedding 10.027826 32.406412 0.309 0.756987
## dti 0.123536 0.005388 22.929 < 2e-16 ***
## open_acc 0.154205 0.006745 22.863 < 2e-16 ***
## revol_util 0.033734 0.005653 5.968 2.40e-09 ***
## total_acc -0.193786 0.006839 -28.337 < 2e-16 ***
## LoanToIncomeRatio -0.006272 0.004486 -1.398 0.162041
## fico_avg -0.182405 0.006500 -28.064 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 321141 on 231653 degrees of freedom
## Residual deviance: 286752 on 231608 degrees of freedom
## AIC: 286844
##
## Number of Fisher Scoring iterations: 8
The Logistic Regression model was trained using the random-oversampled training dataset. Since Logistic Regression is suitable for binary classification, it estimates the probability that a loan belongs to the Charged Off class. The predicted probabilities were then converted into class labels for model evaluation.
Although Random Oversampling method was applied to reduce class imbalance during model training, probability threshold tuning remained necessary because classification decisions are still influenced by the default cutoff value of 0.5. This can adjust the decision rule and improve the detection of minority cases by tuning the probability threshold. Threshold tuning suggested a threshold of 0.4 as it provides an optimal balance between recall and F1-score.
probabilities <- predict(logit_model,
test_model,
type = "response")
log_pred <- ifelse(probabilities > 0.4,
1,
0)
head(log_pred)
## 2 3 4 5 6 7
## 0 1 1 1 0 0
Decision Tree classification model was developed to classify borrowers into charged off and fully paid groups based on borrower characteristics. Unlike Logistic Regression, Decision Tree is a rule-based model that separates borrowers using a sequence of splitting conditions. Decision Tree is easier to interpret as the classification process and the decisions made at every node can be visualised.
cp=0.001 is the minimum improvement threshold required
for the tree to make a new split. A value of 0.001 means the tree will
only split a node if it improves the overall model fit by at least 0.1%.
Lowering this value allows more splits to occur.
minsplit = 20 sets the minimum number of observations
that must exist in a node before the tree even attempts to split it.
This prevents the tree from creating splits based on very small groups
of data which could lead to overfitting.
minbucket = 7 sets the minimum number of observations
allowed in any terminal node (leaf). The tree will not create a final
leaf node that contains fewer than 7 observations.
maxdepth = 5 limits how many levels deep the tree can
grow from the root node.
library(rpart)
set.seed(123)
tree_model <- rpart(loan_status ~ .,
data = train_final,
method = "class",
control = rpart.control(
cp = 0.001, # lower threshold makes more splits
minsplit = 20, # minimum observations to attempt a split
minbucket = 7, # minimum observations in a leaf node
maxdepth = 5 # limit depth to keep tree readable
))
The Decision Tree structure is visualized to show the classification rules used by the model.
The Decision Tree diagram displays:
splitting conditions
classification rules
predicted classes
decision paths used for borrower classification
library(rpart.plot)
rpart.plot(tree_model,
main = "Decision Tree")
The first split of the Decision Tree is based on grade, indicating that loan grade provides the strongest initial separation between Fully Paid and Charged Off loans in this tree. Loans with stronger grades such as A and B are mainly directed toward the lower-risk side of the tree, while weaker grades are more likely to move toward the higher-risk branches.
After the initial grade split, the tree further separates borrowers using home ownership, LoanToIncomeRatio, interest rate and revolving utilization. These variables suggest that repayment outcome is influenced by a combination of credit risk grade, borrower housing profile, repayment burden and credit utilization. However, the terminal nodes still contain mixed probabilities rather than pure classifications, showing that Fully Paid and Charged Off loans are not perfectly separable using the available predictors.
The Decision Tree variable importance plot identifies the variables that contributed most to classify loans as Fully Paid or Charged Off. The results show that grade and int_rate were the most influential variables, indicating that loan grade and borrowing cost played the strongest role in splitting the data. FICO-related variables, revolving utilization and home ownership also contributed, but their importance was much lower. This suggests that the model relied mainly on credit risk indicators when predicting loan status.
tree_model$variable.importance
## int_rate grade fico_avg LoanToIncomeRatio
## 1.006685e+04 1.004386e+04 1.913174e+03 9.117994e+02
## installment revol_util home_ownership loan_amnt
## 7.891540e+02 7.015702e+02 3.879545e+02 3.251806e+02
## verification_status term total_acc annual_inc
## 2.394755e+02 9.797676e+01 4.504493e+01 3.561110e+01
## open_acc emp_length purpose dti
## 1.987857e+01 1.499473e+01 1.275823e+01 7.824433e-01
barplot(tree_model$variable.importance,
main = "Decision Tree Variable Importance",
ylab = "Importance",
las=2,
cex.names=0.65)
The variable importance plot shows which borrower characteristics contribute the most to loan status classification.
The Decision Tree model is used to predict loan status classifications for the testing datasets. Since the Decision Tree model is primarily determined by rule-based splits and the prediction of the class labels are based on terminal nodes, the probability threshold tuning impact is generally less significant.
tree_pred <- predict(tree_model,
test_model,
type = "class")
head(tree_pred)
## 2 3 4 5 6 7
## 0 0 1 1 0 0
## Levels: 0 1
Random Forest is an ensemble classification technique that combines multiple decision trees. Instead of relying on a single decision tree,Random Forest builds multiple decision trees by using random parts of the data and each of the trees are using random variables to split at the decision node. Decision Tree applied a majority voting mechanism to aggregate the predictions from individual trees,outputting the modes as final final classification. Compared to a single decision tree which is prone to high variance and overfitting, Random Forest provides robustness by reducing the dependence on only one tree structure and increasing accuracy through combining the predictions of every single tree to average out the mistakes.
library(randomForest)
set.seed(123)
rf_model <- randomForest(loan_status ~ .,
data = train_final,
ntree = 200)
rf_model
##
## Call:
## randomForest(formula = loan_status ~ ., data = train_final, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 4.03%
## Confusion matrix:
## 0 1 class.error
## 0 108488 7339 0.06336174
## 1 2008 113819 0.01733620
The model output shows an out-of-bag (OOB) error rate of 4.03%. This is an internal estimate of prediction error during the Random Forest training process. However, since the final model evaluation is carried out on the independent testing dataset, the OOB error rate should not be treated as the final model performance.
The relatively low OOB error suggests that the Random Forest model fits the oversampled training data well, but the independent testing results are still more important for evaluating generalisation performance.
Variable importance was used to identify the most influential variables affecting loan status prediction in the Random Forest model. The result shows that int_rate, dti, LoanToIncomeRatio, revol_util, emp_length and installment were among the most important predictors.
sort(importance(rf_model)[,1],decreasing=TRUE)
## int_rate dti LoanToIncomeRatio revol_util
## 11314.5325 11024.5783 10743.6033 10544.0941
## installment emp_length total_acc annual_inc
## 9328.7660 9292.8530 8445.2959 8364.6331
## fico_avg open_acc loan_amnt grade
## 7211.5404 7031.3545 6363.1100 5447.4985
## purpose verification_status home_ownership term
## 5182.0493 2354.3219 2125.3479 994.3745
varImpPlot(rf_model)
This suggests that the Random Forest model relied strongly on borrowing cost, debt burden, credit utilisation and repayment-related characteristics when classifying loans as Fully Paid or Charged Off. Variables such as grade, fico_avg and verification_status still contributed to the model, but they were less influential compared with the top-ranked predictors in this Random Forest model.
rf_prob <- predict(rf_model,
test_model,
type = "prob")
#The optimal probability threshold was found to be 0.3, it provides the best balance between recall and F1-score.
rf_pred <- ifelse(rf_prob[,2] > 0.3,
1,
0)
head(rf_pred)
## 2 3 4 5 6 7
## 0 0 1 1 0 0
The models’ performance are evaluated using four standard evaluation metrics which are accuracy, precision, recall and F1-score. These evaluation metrics were selected due to the reason that the target variable is imbalanced where the fully paid borrowers form the majority class. The evaluation of the model cannot simply rely on accuracy as it may not provide a comprehensive view on the model performance and can be misleading. A more detailed set of metrics is therefore necessary to fully understand a model’s strengths and weaknesses.
A confusion matrix will be utilized to summarize and visualize the performance of the classification models by comparing the predicted class labels against the actual class labels.
It is a 2x2 matrix consisting of 4 components: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). In the project, a positive class refers to loan default where Default=1.
True Positive (TP): actual charged off loan correctly predicted as charged off.
True Negative (TN): actual fully paid loan correctly predicted as fully paid.
False Positive (FP): actual fully paid loan incorrectly predicted as charged off.
False Negative (FN): actual charged off loan incorrectly predicted as fully paid.
These parameters are the components for computing accuracy, precision, recall and F1-score
\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\] Accuracy measures the overall proportion of correct predictions where the actual charged off loan and actual fully paid loan are correctly predicted. \[Precision = \frac{TP}{TP + FP}\] Precision measures the proportion of predicted charged off loan that were actually charged off. \[Recall = \frac{TP}{TP + FN}\] Recall, also known as sensitivity or true positive rate, measures the proportion of actual charged off loans that the model correctly captured. \[F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}\] F1-score provides a balance between precision and recall. It is useful when both false positives and false negatives need to be considered.
evaluate_model <- function(actual,predicted) {
actual <- as.numeric(as.character(actual))
cm <- table(Predicted = predicted, Actual = actual)
TN <- cm["0","0"]
FN <- cm["0","1"]
FP <- cm["1","0"]
TP <- cm["1","1"]
accuracy <- (TP+TN)/sum(cm)
precision <- TP/(TP+FP)
recall <- TP/(TP+FN)
f1_score <- (2*precision*recall)/(precision+recall)
list(
Confusion_Matrix = cm,
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score
)
}
The function is applied to each models for evaluation.
log_evaluate <- evaluate_model(test_model$loan_status,log_pred)
tree_evaluate <- evaluate_model(test_model$loan_status,tree_pred)
rf_evaluate <- evaluate_model(test_model$loan_status,rf_pred)
cat("===== Logistic Regression =====\n")
## ===== Logistic Regression =====
print(log_evaluate)
## $Confusion_Matrix
## Actual
## Predicted 0 1
## 0 21395 2018
## 1 28244 10843
##
## $Accuracy
## [1] 0.515808
##
## $Precision
## [1] 0.2774068
##
## $Recall
## [1] 0.8430915
##
## $F1_Score
## [1] 0.4174559
The Logistic Regression model achieved an accuracy of 51.58%, precision of 27.74%, recall of 84.31% and F1-score of 41.75%. Besides,the confusion matrix shows that the model correctly spot 10,843 actual charged off loan cases and incorrectly predicted 2,018 actual charged off loan as fully paid. This model also misclassified 28,244 fully paid loan as charged off.
The high recall rate of the Logistic Regression model indicates that the model is sensitive toward detecting borrowers who are likely to be charged off. In the context of loan status prediction, this is significant as missing actual charged off loan which have a lower recall rate may expose financial institutions to credit loss. However, the low precision indicates that there are many fully paid loan incorrectly classified as charged off by the model. This means the model produces many false positives, which may cause reliable and good standing borrowers to be incorrectly treated as high-risk borrowers who fail to fulfill the legal obligations of the loan contract.
Overall, Logistic Regression is useful when the main objective is to identify as many potential defaulted borrowers as possible. In contrast, it will be less suitable when the financial institution wants to minimize rejection of low-risk borrowers.
cat("\n===== Decision Tree =====\n")
##
## ===== Decision Tree =====
print(tree_evaluate)
## $Confusion_Matrix
## Actual
## Predicted 0 1
## 0 32467 4798
## 1 17172 8063
##
## $Accuracy
## [1] 0.64848
##
## $Precision
## [1] 0.3195165
##
## $Recall
## [1] 0.6269341
##
## $F1_Score
## [1] 0.423299
The Decision Tree model achieved an accuracy of 64.85%, which means that it made more correct predictions overall compared to the Logistic Regression model. However, it had a lower recall rate of 62.69%, indicating a lower performance in detecting actual charged off loans.
As shown as the confusion matrix, the model correctly classified 32,467 fully paid loan and 8,063 charged off loans. However, there are 4,798 actual charged off loans that were still predicted as fully paid. At the same time, 17,172 loans are misclassified as charged off loans when they are actually fully paid, resulting in a low precision score. This shows that the Decision Tree still had a poor performance to capture fully paid loans and a significant percentage to overlook high-risk borrowers even though rule-based decisions enhance the interpretability of the model.
Unlike other ‘black boxes’ models, Decision Tree allows the classification process to be visualized through splitting rules at every node, which helps to explain how borrower characteristics such as grade,interest rate and average of FICO affect the prediction.
cat("\n===== Random Forest =====\n")
##
## ===== Random Forest =====
print(rf_evaluate)
## $Confusion_Matrix
## Actual
## Predicted 0 1
## 0 32591 4629
## 1 17048 8232
##
## $Accuracy
## [1] 0.653168
##
## $Precision
## [1] 0.3256329
##
## $Recall
## [1] 0.6400746
##
## $F1_Score
## [1] 0.4316615
The Random Forest model had the strongest overall predictive performance as it achieved the highest accuracy of 65.32%. Other than that, it also had a higher precision rate, 32.56% and F1-score,43.17%. As a result shown in the confusion matrix, Random Forest had correctly predicted 32,591 fully paid loans and 8,232 charged off loans.
However, the Random Forest model achieved the lowest recall rate of 64.01%. This indicates that the model only identifies nearly approximately two-thirds of actual charged off loans. This also reduced the amount of false positives and increased the true negative cases with the cost of missing more actual defaulted borrowers. In other words, Random Forest was more conservative in predicting defaulted cases.
Therefore, Random Forest is the best model if the objective is to achieve stronger overall classification performance and a better trade-off between precision and recall. However, this is not a suitable model if the primary business goal is to capture the borrowers who are more likely to be defaulted and reduce the risk of lending to those borrowers in order to prevent any potential financial losses.
The performance of those models is compared by using the standard evaluation metrics such as accuracy, precision, recall and also F1-score, each of the metrics reflect the model’s performance from different aspects.
comparison <- data.frame(
Model = c("Logistic Regression",
"Decision Tree",
"Random Forest"),
Accuracy = c(log_evaluate$Accuracy,
tree_evaluate$Accuracy,
rf_evaluate$Accuracy),
Precision = c(log_evaluate$Precision,
tree_evaluate$Precision,
rf_evaluate$Precision),
Recall = c(log_evaluate$Recall,
tree_evaluate$Recall,
rf_evaluate$Recall),
F1_Score = c(log_evaluate$F1_Score,
tree_evaluate$F1_Score,
rf_evaluate$F1_Score)
)
comparison
## Model Accuracy Precision Recall F1_Score
## 1 Logistic Regression 0.515808 0.2774068 0.8430915 0.4174559
## 2 Decision Tree 0.648480 0.3195165 0.6269341 0.4232990
## 3 Random Forest 0.653168 0.3256329 0.6400746 0.4316615
Based on the comparison table, Random Forest stood out from Logistic Regression and Decision Tree as the strongest predictive model in this study because it had the highest score in 3 evaluation metrics which are accuracy, precision and F1-score. The high precision rate indicates that the predictions to identify defaulted borrowers are more reliable compared to the others model.
The Logistic Regression model achieved the highest recall rate. This means that it is good to capture the actual charged off loans but at the same time creating more false positive cases, leading to the lowest accuracy. Besides, the Decision Tree had a moderate performance compared to Logistic Regression and Random Forest in terms of accuracy, precision and recall.
In conclusion, there is no perfect model in this prediction analysis by having superior performance across all the evaluation metrics. The Logistic Regression model has the best performance in identifying the actual charged off loans and increasing the sensitivity of the model. While Random Forest has the strongest overall performance with a highest accuracy and least false positive cases generated.
par(mar = c(5, 4, 4, 8), xpd = TRUE)
barplot(t(as.matrix(comparison[,-1])),
beside = TRUE,
names.arg = c("Logistic\nRegression",
"Decision\nTree",
"Random\nForest"),
col = c("skyblue", "orange", "seagreen", "pink"),
legend.text = colnames(comparison)[-1],
args.legend = list(
x = "topright",
inset = c(-0.25, 0),
bty = "n"
),
main = "Model Performance Comparison",
ylab = "Score",
ylim = c(0, 1)
)
In this dataset, loan status is classified as either Fully Paid or Charged Off, where Charged Off loans refer to cases in which borrowers have defaulted on their repayment obligations, also called as defaulted borrowers.
The model performance visualization highlights the trade-off between accuracy, precision, recall and F1-score. In this study, the models with higher recall rate tend to have a lower accuracy due to the reason that when the models are more likely to predict the loans as charged off but at the same time, it has a higher chance to misclassify actual fully paid loans as charged off loans resulting in higher false positive cases
On the other hand, the model with lower recall is because the models tend to be more conservative in predicting the actual class. This makes the model correctly identify more actual non-default borrowers, resulting in decreasing false positive cases and better overall accuracy.
The low precision values across all models indicate that predicting loan status remains challenging. Many borrowers predicted as default were actually non-default. This suggests that default and non-default borrowers may overlap across the available features, making perfect classification difficult. Hence, the interpretation should be seen as a trade-off between identifying high-risk borrowers and preventing false alarms.
The feature importance results show that loan status prediction was influenced by several borrower and loan-related characteristics. Across the models, variables such as interest rate, debt-to-income ratio, loan-to-income ratio, revolving utilisation, loan grade and FICO-related variables contributed to distinguishing between Fully Paid and Charged Off loans.
The results demonstrate that each of the three models has different strengths and limitations. Logistic Regression is the most sensitive model in detecting actual defaulted borrowers. This is important for credit risk assessment since it might result in financial loss if a borrower with a high-risk of default is not identified. However, the low precision indicates that the model misclassified many non-default borrowers as default, resulting in rejection of reliable applicants.
Decision Tree had moderate performance across all metrics compared to other models. The tree structure allows the classification process to be explained by decision rules. The data split into branches based on specific features when moving down from the top of the tree, making it easier to understand how borrower characteristics influence loan status prediction.
Random Forest had the highest accuracy, precision and F1-score, suggesting that it provided the strongest overall classification performance. Its ensemble structure allows it to combine multiple decision trees, which helps reduce overfitting and improves prediction stability. However, its recall rate showed that it missed more actual charged off cases.
Generally, the findings show that model selection should not be based on accuracy alone but also interpret from different aspects. If the financial institution aims to maximize overall predictive performance and reduce false alarms, Random Forest is the most suitable model. If the primary goal is to have a strict inspection and screening process to detect default,Logistic Regression may be more appropriate due to its higher recall.
This project successfully developed and evaluated three classification models for loan default prediction: Logistic Regression, Decision Tree and Random Forest. The dataset was initially prepared from data understanding to understand the structure of the data and a series of preprocessing steps including data transformation, feature scaling, feature engineering and oversampling class imbalance handling before model training.
According to the evaluation results, Random Forest achieved the highest accuracy, precision and F1-score, making it the strongest overall model in this study. Conversely, Logistic Regression and Decision Tree has the higher recall score which indicated they were more sensitive individual who were likely to be default. This highlights an important trade-off in loan status prediction.
The findings demonstrate that machine learning classification techniques can assist financial institutions in improving lending decisions and reducing financial risk. Model selection should depend on the financial institution’s objective. If the priority is to reduce missed default cases and financial risk, recall should be emphasized. If the priority is to avoid incorrectly rejecting reliable borrowers, precision should also be considered.Accuracy which mean perform well in predicting non-default but fail to capture the actual default.
Outliers are extreme values that deviate significantly from the rest of the data and can distort regression model performance by inflating error metrics such as RMSE and biasing coefficient estimates toward extreme observations.
The IQR method is used to detect outliers across the key numeric predictors and the regression target. Values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged as statistical outliers. Rather than removing flagged records — which would reduce the dataset size and may discard legitimate observations — capping treatments are applied selectively depending on the nature of each variable.
The IQR method is applied to detect outliers in:
annual_inc)int_rate) — the regression targetdti)revol_util)outliers <- function(x){
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- IQR(x)
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
return(x < lower_bound | x > upper_bound)
}
cat("Outliers in annual_inc: ", sum(outliers(loandata$annual_inc)), "\n")
## Outliers in annual_inc: 10009
cat("Outliers in int_rate: ", sum(outliers(loandata$int_rate)), "\n")
## Outliers in int_rate: 11452
cat("Outliers in dti: ", sum(outliers(loandata$dti)), "\n")
## Outliers in dti: 2613
cat("Outliers in revol_util: ", sum(outliers(loandata$revol_util)), "\n")
## Outliers in revol_util: 10
The IQR method flags outliers in income, DTI, and revolving utilisation. Rather than removing legitimate records, we apply capping treatments: annual_inc and dti are winsorised at the 99th percentile, while revol_util is capped at 100 (its theoretical maximum, since revolving utilisation above 100% represents data anomalies). Note that the 11452 ‘outliers’ detected in int_rate are not treated, because int_rate is our regression target and these represent legitimate high-risk loans.
# Cap annual_inc at 99th percentile
inc_cap <- quantile(loandata$annual_inc, 0.99)
n_capped_inc <- sum(loandata$annual_inc > inc_cap)
loandata$annual_inc <- pmin(loandata$annual_inc, inc_cap)
cat("annual_inc capped at", round(inc_cap, 2),
"- affected", n_capped_inc, "rows\n")
## annual_inc capped at 285000 - affected 2066 rows
# Cap dti at 99th percentile (handles the "999" placeholders)
dti_cap <- quantile(loandata$dti, 0.99)
n_capped_dti <- sum(loandata$dti > dti_cap)
loandata$dti <- pmin(loandata$dti, dti_cap)
cat("dti capped at", round(dti_cap, 2),
"- affected", n_capped_dti, "rows\n")
## dti capped at 46.48 - affected 2083 rows
# Cap revol_util at 100 (theoretical maximum)
n_capped_rev <- sum(loandata$revol_util > 100)
loandata$revol_util <- pmin(loandata$revol_util, 100)
cat("revol_util capped at 100 - affected", n_capped_rev, "rows\n")
## revol_util capped at 100 - affected 546 rows
The capping output confirms that the treatments were applied
successfully. Around 1% of records had annual_inc and
dti values trimmed to the 99th percentile, while several
hundred records had revol_util capped at 100. This reduces
the influence of extreme values on the linear model coefficients without
removing any rows from the dataset.
Categorical variables are converted to factors. The
grade and loan_status columns are
removed from the regression dataset:
grade directly determines int_rate
(would cause target leakage and circular reasoning).
loan_status is the target of the separate
classification study.
library(dplyr)
loandata <- loandata %>%
select(-grade, -loan_status)
loandata <- loandata %>%
mutate(across(c(home_ownership,
verification_status,
purpose),
as.factor))
head(loandata)
## loan_amnt term int_rate installment emp_length home_ownership annual_inc
## 1 3000 36 7.34 93.10 9 RENT 52000
## 2 5000 36 11.98 166.03 10 OWN 55000
## 3 7000 36 11.98 232.44 0 MORTGAGE 40000
## 4 30000 36 21.85 1143.39 10 OWN 57000
## 5 21000 60 20.39 560.94 10 OWN 85000
## 6 40000 60 9.43 838.71 10 MORTGAGE 150000
## verification_status purpose dti fico_range_low fico_range_high
## 1 Source Verified major_purchase 0.58 760 764
## 2 Not Verified other 14.18 675 679
## 3 Verified home_improvement 20.25 695 699
## 4 Verified debt_consolidation 27.58 680 684
## 5 Source Verified house 15.76 665 669
## 6 Not Verified debt_consolidation 4.09 730 734
## open_acc revol_util total_acc
## 1 7 0.5 30
## 2 14 33.9 24
## 3 13 36.0 29
## 4 11 53.2 26
## 5 15 34.2 27
## 6 6 9.4 9
After this step, the dataset contains 15 predictor variables: 11
numeric borrower attributes and 4 categorical variables encoded as
factors. The target variable int_rate remains in its
original numeric form so that error metrics (RMSE, MAE) can be
interpreted directly in percentage points.
par(mfrow = c(2,2), mar = c(4,4,2,1))
hist(loandata$int_rate,
main = "Interest Rate Distribution",
xlab = "Interest Rate (%)",
col = "darkgreen",
border = "white")
hist(loandata$loan_amnt,
main = "Loan Amount Distribution",
xlab = "Loan Amount",
col = "pink",
border = "white")
hist(loandata$annual_inc,
main = "Annual Income Distribution (capped)",
xlab = "Annual Income",
col = "lightblue",
border = "white")
hist(loandata$fico_range_low,
main = "FICO Score (low) Distribution",
xlab = "FICO Score",
col = "orange",
border = "white")
Interest rate is right-skewed, with a long tail of higher-risk loans charged higher rates.
numeric_vars <- c("int_rate", "loan_amnt", "term", "installment",
"emp_length", "annual_inc", "dti",
"fico_range_low", "fico_range_high",
"open_acc", "revol_util", "total_acc")
cor_matrix <- cor(loandata[numeric_vars])
round(sort(cor_matrix["int_rate", -1], decreasing = TRUE), 4)
## term revol_util installment dti loan_amnt
## 0.3664 0.2870 0.2010 0.2008 0.1414
## open_acc emp_length total_acc annual_inc fico_range_high
## 0.0112 -0.0225 -0.0431 -0.1001 -0.4216
## fico_range_low
## -0.4216
The correlation analysis shows clear, intuitive relationships:
FICO score is strongly negatively correlated with interest rate — higher creditworthiness means lower rates.
Term, installment, revol_util, dti are positively correlated — longer terms, larger payments, higher utilisation and higher DTI all increase the rate.
Annual income has a smaller but negative effect.
This is exactly the kind of structured signal we want for a regression task. It strongly suggests that meaningful prediction of interest rate is achievable.
library(ggplot2)
ggplot(loandata, aes(x = factor(term), y = int_rate, fill = factor(term))) +
geom_boxplot() +
labs(title = "Interest Rate by Loan Term",
x = "Term (months)", y = "Interest Rate (%)") +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
60-month loans carry noticeably higher interest rates than 36-month loans, consistent with longer-term loans bearing more risk.
The caret package is used to split the dataset into 70%
training and 30% testing, with stratification on the target.
library(caret)
set.seed(123)
sample_index <- createDataPartition(loandata$int_rate,
p = 0.7,
list = FALSE)
train_data <- loandata[sample_index, ]
test_data <- loandata[-sample_index, ]
cat("Training rows:", nrow(train_data), "\n")
## Training rows: 145838
cat("Test rows: ", nrow(test_data), "\n")
## Test rows: 62499
cat_cols <- c("home_ownership", "verification_status", "purpose")
for (col in cat_cols) {
train_data[[col]] <- as.factor(train_data[[col]])
test_data[[col]] <- factor(test_data[[col]],
levels = levels(train_data[[col]]))
}
numeric_predictors <- c("loan_amnt", "term", "installment",
"emp_length", "annual_inc", "dti",
"fico_range_low", "fico_range_high",
"open_acc", "revol_util", "total_acc")
train_scaled <- scale(train_data[numeric_predictors])
train_data[numeric_predictors] <- train_scaled
test_data[numeric_predictors] <- scale(test_data[numeric_predictors],
center = attr(train_scaled, "scaled:center"),
scale = attr(train_scaled, "scaled:scale"))
Feature scaling standardises numerical predictors so they have a mean of zero and a standard deviation of one. This is particularly important for linear models, where coefficient magnitudes become directly comparable when predictors are on the same scale. The same scaling parameters from the training set are applied to the test set to prevent data leakage.
evaluate_reg <- function(actual, predicted, model_name) {
rmse_val <- sqrt(mean((actual - predicted)^2))
mae_val <- mean(abs(actual - predicted))
r2_val <- 1 - sum((actual - predicted)^2) /
sum((actual - mean(actual))^2)
data.frame(Model = model_name,
RMSE = round(rmse_val, 4),
MAE = round(mae_val, 4),
R2 = round(r2_val, 4))
}
baseline_pred <- rep(mean(train_data$int_rate), nrow(test_data))
baseline_rmse <- sqrt(mean((test_data$int_rate - baseline_pred)^2))
cat("Baseline RMSE (predicting the mean):", round(baseline_rmse, 4), "\n")
## Baseline RMSE (predicting the mean): 5.6176
A helper function computes the three standard regression metrics: RMSE (average prediction error in the same units as the target), MAE (median-like average error, less sensitive to extreme errors), and R-squared (the proportion of variance explained). A baseline RMSE is also computed by predicting the training mean for every test observation. Any useful model must produce predictions that beat this baseline.
lm_model <- lm(int_rate ~ ., data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = int_rate ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0684 -2.1728 -0.5128 1.7373 20.1702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.618945 0.284808 47.818 < 2e-16 ***
## loan_amnt -15.277628 0.046920 -325.610 < 2e-16 ***
## term 5.563151 0.014151 393.137 < 2e-16 ***
## installment 14.782137 0.043267 341.646 < 2e-16 ***
## emp_length -0.092329 0.008777 -10.520 < 2e-16 ***
## home_ownershipMORTGAGE -0.421249 0.275264 -1.530 0.125933
## home_ownershipNONE -2.017119 3.276375 -0.616 0.538123
## home_ownershipOWN 0.071399 0.276125 0.259 0.795964
## home_ownershipRENT 0.100870 0.275344 0.366 0.714110
## annual_inc -0.208233 0.010718 -19.429 < 2e-16 ***
## verification_statusSource Verified 0.734149 0.020342 36.090 < 2e-16 ***
## verification_statusVerified 1.342534 0.024030 55.870 < 2e-16 ***
## purposecredit_card -1.128430 0.078610 -14.355 < 2e-16 ***
## purposedebt_consolidation -0.355859 0.077102 -4.615 3.93e-06 ***
## purposehome_improvement -0.059428 0.081958 -0.725 0.468391
## purposehouse 0.307964 0.114577 2.688 0.007192 **
## purposemajor_purchase 0.558504 0.092003 6.071 1.28e-09 ***
## purposemedical 0.606797 0.101604 5.972 2.35e-09 ***
## purposemoving 0.758710 0.119322 6.358 2.04e-10 ***
## purposeother 0.916630 0.081780 11.208 < 2e-16 ***
## purposerenewable_energy 1.239478 0.329149 3.766 0.000166 ***
## purposesmall_business 0.765950 0.111685 6.858 7.00e-12 ***
## purposevacation 0.855244 0.114786 7.451 9.33e-14 ***
## purposewedding 8.224323 2.309850 3.561 0.000370 ***
## dti 0.403456 0.010211 39.513 < 2e-16 ***
## fico_range_low -64.657663 19.208818 -3.366 0.000763 ***
## fico_range_high 63.469164 19.208730 3.304 0.000953 ***
## open_acc 0.075261 0.012731 5.911 3.40e-09 ***
## revol_util 0.157690 0.010983 14.357 < 2e-16 ***
## total_acc -0.254448 0.012697 -20.039 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.265 on 145808 degrees of freedom
## Multiple R-squared: 0.6606, Adjusted R-squared: 0.6605
## F-statistic: 9787 on 29 and 145808 DF, p-value: < 2.2e-16
lm_pred <- predict(lm_model, newdata = test_data)
lm_eval <- evaluate_reg(test_data$int_rate, lm_pred, "Linear Regression")
print(lm_eval)
## Model RMSE MAE R2
## 1 Linear Regression 3.2983 2.5186 0.6553
The regression summary shows several highly significant predictors (term, FICO scores, DTI, revolving utilisation), confirming the relationships found in EDA.
library(rpart)
library(rpart.plot)
tree_model <- rpart(int_rate ~ ., data = train_data, method = "anova",
control = rpart.control(
cp = 0.001,
minsplit = 200,
maxdepth = 10
))
rpart.plot(tree_model, fallen.leaves = TRUE, cex = 0.55)
tree_pred <- predict(tree_model, newdata = test_data)
tree_eval <- evaluate_reg(test_data$int_rate, tree_pred, "Decision Tree")
print(tree_eval)
## Model RMSE MAE R2
## 1 Decision Tree 4.1083 3.1032 0.4652
The decision tree is configured with a low complexity parameter (cp = 0.001) and a minimum split size of 200 to allow it to grow deep enough to capture the structure in the data, while a maximum depth of 10 prevents excessive overfitting. The resulting tree visualises how borrower attributes hierarchically determine interest rate, with FICO score and loan term appearing as the most important early splits.
library(ranger)
set.seed(123)
rf_model <- ranger(int_rate ~ ., data = train_data,
num.trees = 300,
mtry = 5,
min.node.size = 10,
importance = "impurity")
## Growing trees.. Progress: 37%. Estimated remaining time: 52 seconds.
## Growing trees.. Progress: 73%. Estimated remaining time: 23 seconds.
rf_pred <- predict(rf_model, data = test_data)$predictions
rf_eval <- evaluate_reg(test_data$int_rate, rf_pred, "Random Forest")
print(rf_eval)
## Model RMSE MAE R2
## 1 Random Forest 2.7307 1.9475 0.7637
The Random Forest is configured with 300 trees, 5 features tried at
each split (mtry = 5), and a minimum terminal node size of
10. Compared to a single decision tree, the ensemble averages
predictions across many trees, which substantially reduces variance and
improves generalization.
importance_df <- data.frame(
Feature = names(rf_model$variable.importance),
Importance = rf_model$variable.importance) %>%
arrange(desc(Importance)) %>%
head(10)
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "seagreen") +
coord_flip() +
labs(title = "Random Forest: Top 10 Feature Importances",
x = "Feature", y = "Importance") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The feature importance plot ranks predictors by their contribution to reducing prediction error across all trees in the forest. Installment amount, loan amount, and loan term emerge as the dominant predictors, followed by FICO scores and debt-to-income ratio. This ranking aligns closely with how real lenders price loans: the size of the payment, the amount borrowed, the loan duration, and the borrower’s creditworthiness are all primary inputs to risk-based pricing.
With all three models evaluated on the held-out test set, the metrics are combined into a single comparison table sorted by RMSE. The model with the lowest RMSE is treated as the best-performing model for this task.
comparison <- rbind(lm_eval, tree_eval, rf_eval)
comparison <- comparison[order(comparison$RMSE), ]
print(comparison)
## Model RMSE MAE R2
## 3 Random Forest 2.7307 1.9475 0.7637
## 1 Linear Regression 3.2983 2.5186 0.6553
## 2 Decision Tree 4.1083 3.1032 0.4652
ggplot(comparison, aes(x = reorder(Model, -RMSE), y = RMSE, fill = Model)) +
geom_col() +
geom_hline(yintercept = baseline_rmse, linetype = "dashed",
color = "red", linewidth = 1) +
coord_flip() +
labs(title = "Model Comparison: RMSE (lower is better)",
subtitle = "Red dashed line = baseline (predicting the mean)",
x = "Model", y = "RMSE") +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
All three models comfortably beat the baseline (red dashed line at RMSE = 5.6176). Random Forest produces the lowest RMSE, followed by Linear Regression and then Decision Tree.
# Actual vs Predicted for the best model
best_name <- comparison$Model[1]
best_pred <- switch(best_name,
"Linear Regression" = lm_pred,
"Decision Tree" = tree_pred,
"Random Forest" = rf_pred)
set.seed(1)
plot_idx <- sample(nrow(test_data), 5000)
ggplot(data.frame(Actual = test_data$int_rate[plot_idx],
Predicted = best_pred[plot_idx]),
aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.2, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
labs(title = paste0("Best Model (", best_name, "): Actual vs Predicted"),
subtitle = "Sample of 5,000 test points",
x = "Actual Interest Rate (%)",
y = "Predicted Interest Rate (%)") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The Actual vs Predicted plot shows predictions clustering along the red diagonal, indicating the best model captures the structure of interest rate well.
The regression analysis tested three modelling approaches: a linear model, a decision tree, and a random forest. All three models produced predictions substantially better than the naive baseline RMSE of 5.6176, with R-squared values well above zero and RMSE values clearly below the baseline.
Random Forest produced the strongest performance, with R² = 0.7637, RMSE = 2.7307, and MAE = 1.9475, explaining roughly 76% of the variation in interest rate. Linear Regression performed reasonably well at R² = 0.6553, RMSE = 3.2983, MAE = 2.5186, demonstrating that a simple linear model can still capture most of the relationship between borrower attributes and interest rate. Decision Tree was the weakest with R² = 0.4652, RMSE = 4.1083, MAE = 3.1032, since a single tree cannot capture as many interactions as the Random Forest ensemble does.
Random Forest’s clear advantage over Linear Regression suggests that non-linear relationships and interactions between borrower attributes matter for accurate interest rate prediction. While Linear Regression offers superior interpretability through its coefficients, Random Forest delivers meaningfully better predictions in this dataset.
Feature importance from the Random Forest highlights that installment amount, loan amount, loan term, FICO scores, and debt-to-income ratio are the strongest drivers of interest rate. This is consistent with real-world risk-based pricing: lenders charge more when the borrower has lower credit scores, longer terms, higher debt burdens, or larger loan exposures.
By deliberately excluding grade from the predictors, the
models prove that interest rate can be reasonably predicted from raw
borrower characteristics, not just from Lending Club’s pre-computed
grading system. This makes the analysis applicable to lenders that do
not have an existing internal grading model.
This project developed and evaluated three regression models which is Multiple Linear Regression, Decision Tree, and Random Forest to predict loan interest rate on the Lending Club 2017-2018 dataset. The data was prepared through column selection, year filtering for relevance, outlier capping, factor encoding, stratified train-test splitting, and feature scaling.
Random Forest emerged as the clear best model with R² = 0.7637 and RMSE = 2.7307, substantially outperforming both Linear Regression (R² = 0.6553) and Decision Tree (R² = 0.4652), and well below the baseline RMSE of 5.6176. The most influential predictors were installment amount, loan amount, loan term, FICO score, and DTI, aligning directly with real-world lending practice. These results demonstrate that interest rates can be predicted with meaningful accuracy from borrower attributes alone.
For future work, more advanced models such as Gradient Boosting (XGBoost) could be explored to push predictive performance further, and the analysis could be extended to time-aware modelling to capture macroeconomic effects on interest-rate setting.