Predicting Loan Repayment: Leveraging Decision Trees and Random Forest Models on LendingClub Data Using Python

Independent Data Analysis Project

Author
Affiliations

John Karuitha, PhD

Published

November 15, 2024

Modified

November 15, 2024

Executive Summary

This project explores the use of machine learning models to predict loan repayment behavior using historical data from LendingClub.com. Focusing on the years 2007 to 2010, the analysis aims to help investors assess borrower risk and make more informed lending decisions. The dataset includes borrower profiles, loan characteristics, and repayment outcomes, allowing us to identify key factors that influence loan repayment. We employ Decision Trees and Random Forests to classify whether a borrower is likely to repay their loan in full. To address the class imbalance inherent in financial data, we apply the Synthetic Minority Over-sampling Technique (SMOTE) to balance the training set. On the held-out test set, the Random Forest model reaches higher overall accuracy than the Decision Tree (0.78 versus 0.69) and higher precision on loans that were not fully repaid (0.26 versus 0.18), while the Decision Tree recalls more of those loans (0.30 versus 0.22). The results demonstrate the practical application of predictive analytics in credit risk assessment, with implications for investment strategies in peer-to-peer lending.

Keywords

Data Analysis, Python, Pandas, Seaborn, NumPy, Machine Learning, Decision Trees, Random Forests, SMOTE, Scikit-learn, Classification

1 Background

In this project, we delve into peer-to-peer lending data from LendingClub.com, a financial platform that connects borrowers directly with individual investors. The primary goal of this analysis is to develop predictive models that can help investors make better decisions by assessing the likelihood of a borrower fully repaying their loan. By leveraging historical data from Lending Club, we aim to build classification models that identify factors influencing loan repayment behavior.

The dataset we analyze spans the years 2007 to 2010, well before Lending Club went public. This timeframe captures the company’s early growth phase and reflects the borrower risk profiles and lending criteria of that time. Because the period covers the 2008 financial crisis and its immediate aftermath, it offers a view of consumer credit behavior under economic stress.

We will utilize machine learning techniques, specifically Decision Trees and Random Forests, to classify whether borrowers successfully repaid their loans. By applying these models, we aim to uncover patterns in the lending data that indicate repayment likelihood. This project not only demonstrates the application of advanced classification algorithms but also highlights the practical value of using predictive analytics in financial decision-making.

Our ultimate objective is to create models that assist investors in identifying trustworthy borrowers, thereby maximizing returns and minimizing the risk of default. The insights gained through this analysis have broader implications for improving lending strategies, credit risk assessment, and investment decisions in the peer-to-peer lending space.


2 Key Insights

The dataset reveals various factors that influence the likelihood of loan repayment, such as credit scores, income levels, and loan purposes. We will explore these features in-depth to build predictive models that effectively classify borrowers based on their risk profiles.


3 Loading the Packages Used in the Analysis

We load the Python packages required for data manipulation, visualization, and modeling:

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Data splitting, evaluation metrics, and the two classifiers used below
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# SMOTE oversampling from the imbalanced-learn package
from imblearn.over_sampling import SMOTE

4 Data

For this project, we use publicly available data from LendingClub.com, which connects borrowers with investors interested in funding loans. The dataset covers loans from 2007 to 2010 and has already been cleaned of missing values.

4.0.1 Data Dictionary

The dataset contains variables describing the loan, the borrower’s financial profile, and the repayment status. Key variables include credit.policy, purpose, int.rate, and fico. The not.fully.paid column, which equals 1 when a loan was not repaid in full, serves as the target variable for prediction.


5 Exploratory Data Analysis

We start by loading the dataset and exploring its structure:

Code
lending = pd.read_csv('loan_data.csv')
lending.columns = lending.columns.str.replace(".", "_", regex=False)  # treat "." literally, not as a regex wildcard
lending.head()
   credit_policy             purpose  int_rate  installment  log_annual_inc    dti  fico  days_with_cr_line  revol_bal  revol_util  inq_last_6mths  delinq_2yrs  pub_rec  not_fully_paid
0              1  debt_consolidation    0.1189       829.10       11.350407  19.48   737        5639.958333      28854        52.1               0            0        0               0
1              1         credit_card    0.1071       228.22       11.082143  14.29   707        2760.000000      33623        76.7               0            0        0               0
2              1  debt_consolidation    0.1357       366.86       10.373491  11.63   682        4710.000000       3511        25.6               1            0        0               0
3              1  debt_consolidation    0.1008       162.34       11.350407   8.10   712        2699.958333      33667        73.2               1            0        0               0
4              1         credit_card    0.1426       102.92       11.299732  14.97   667        4066.000000       4740        39.5               0            1        0               0

5.0.1 Checking for Missing Values

Code
sns.heatmap(lending.isnull(), cmap="coolwarm")
plt.title("Heatmap of Missing Values")
plt.show()

The dataset has no missing values, which is ideal for our analysis focused on predictive modeling.
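
A numeric count per column is a useful complement to the heatmap, since a handful of missing cells can be hard to see in a plot. The sketch below uses a small hypothetical frame (`demo`, not the LendingClub data) to show the pattern; on the project data the equivalent call would be `lending.isnull().sum()`.

```python
import pandas as pd

# Hypothetical toy frame for illustration only; not the LendingClub data.
demo = pd.DataFrame({
    "fico": [737, 707, None, 712],
    "int_rate": [0.1189, 0.1071, 0.1357, 0.1008],
})

# Count missing values per column; a column with no gaps reports 0.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(demo.isnull().values.any())
print(has_missing)
```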


6 Data Visualization

6.0.1 FICO Scores Distribution by Credit Policy

Code
sns.displot(x="fico", hue="credit_policy", data=lending, palette="rocket", kde=True)
plt.title("Distribution of FICO Scores by Credit Policy")
plt.show()

6.0.2 FICO Scores Distribution by Loan Payment Status

Code
sns.displot(x="fico", hue="not_fully_paid", data=lending, palette="rocket", kde=True)
plt.title("Distribution of FICO Scores by Payment Status")
plt.show()

6.0.3 Loan Purpose vs. Payment Status

Code
plt.figure(figsize=(12, 8))
sns.countplot(x="purpose", hue="not_fully_paid", data=lending, palette="mako")
plt.title("Loan Purpose vs Payment Status")
plt.show()

6.0.4 FICO Score vs. Interest Rate

Code
g = sns.jointplot(x="fico", y="int_rate", data=lending, joint_kws={"color":"purple"}, marginal_kws={"color":"purple"})
g.figure.suptitle("FICO Score vs Interest Rate", y=1.02)  # title the whole figure, not the last marginal axis
plt.show()

We observe that lower FICO scores are associated with higher interest rates, reflecting higher risk.
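
The direction of this relationship can also be summarized with a correlation coefficient. As a minimal sketch, the snippet below computes Pearson's r by hand on only the five rows displayed by `lending.head()` earlier, so the value is illustrative rather than an estimate for the full dataset. On the full data, `lending["fico"].corr(lending["int_rate"])` would be the usual route.

```python
from math import sqrt

# FICO scores and interest rates from the five rows shown by lending.head().
fico = [737, 707, 682, 712, 667]
int_rate = [0.1189, 0.1071, 0.1357, 0.1008, 0.1426]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(fico, int_rate)
print(f"Pearson r on the five displayed rows: {r:.2f}")
```

A negative value is consistent with the pattern in the plot: higher FICO scores go with lower interest rates.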


7 Feature Engineering

We convert the categorical purpose variable into dummy variables:

Code
cat = pd.get_dummies(lending["purpose"], drop_first=True, prefix="purpose")
lending = pd.concat([lending.drop("purpose", axis=1), cat], axis=1)
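
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical four-row column. The category `all_other` is an assumed example value used only for illustration. The first category in sorted order is dropped and becomes the implicit baseline, encoded as all zeros in the remaining dummy columns.

```python
import pandas as pd

# Hypothetical toy column; "all_other" is an assumed category for illustration.
toy = pd.DataFrame({"purpose": ["credit_card", "debt_consolidation",
                                "credit_card", "all_other"]})

# drop_first=True drops the first category in sorted order ("all_other" here),
# so a row of that category has no dummy column set.
dummies = pd.get_dummies(toy["purpose"], drop_first=True, prefix="purpose")
print(list(dummies.columns))
print(dummies)
```

Dropping one level avoids perfectly collinear dummy columns. Tree-based models do not require this, but it keeps the feature matrix compact.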

8 Machine Learning Models

8.1 Train-Test Split

We start by separating the target variable (not_fully_paid) from the features:

Code
X = lending.drop("not_fully_paid", axis=1)
y = lending["not_fully_paid"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
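
The split above draws rows at random without regard to class. Because the not-fully-paid class is the minority, one optional variation is a stratified split, which keeps the class ratio identical in the training and test sets. The sketch below shows this on hypothetical labels (`X_demo` and `y_demo` are placeholders, not project data). It is not what the analysis above does, and adopting it would change the reported numbers.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 repaid (0) and 20 not fully paid (1).
y_demo = [0] * 80 + [1] * 20
X_demo = [[i] for i in range(100)]  # placeholder single-feature rows

# stratify=y_demo keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)
print(sum(y_tr), len(y_tr))  # minority count and size of the training set
print(sum(y_te), len(y_te))  # minority count and size of the test set
```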

9 Handling Class Imbalance Using SMOTE

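Many real-world datasets, particularly in finance, have imbalanced class distributions. We use SMOTE (Synthetic Minority Over-sampling Technique) to balance the training data. The resampling is fitted on the training split only, so the test set keeps its original class distribution: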

Code
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f'Old data had {y_train.shape[0]} rows; new data has {y_smote.shape[0]} rows.')
Old data had 6704 rows; new data has 11228 rows.
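
The printed row counts imply the class sizes in the training set, and the way SMOTE creates new rows can be illustrated in a few lines. The sketch below assumes SMOTE's default behaviour of oversampling the minority class until both classes are equal in size. The `interpolate` helper is a simplified, hypothetical stand-in for the core idea (a new point on the line segment between a minority sample and one of its minority-class neighbours), not imbalanced-learn's implementation.

```python
import random

# Row counts printed above.
rows_before, rows_after = 6704, 11228

# With both classes balanced after resampling, each holds half the rows.
majority = rows_after // 2
minority = rows_before - majority
synthetic = rows_after - rows_before
print(majority, minority, synthetic)

def interpolate(point, neighbour, rng):
    """Create one synthetic sample on the segment between two minority points."""
    lam = rng.random()  # uniform in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbour)]

rng = random.Random(42)
# Hypothetical (fico, int_rate) pairs standing in for two neighbouring minority rows.
a, b = [650.0, 0.15], [670.0, 0.13]
new_point = interpolate(a, b, rng)
print(new_point)
```

Under that assumption, the training set held 5614 fully paid loans and 1090 loans that were not fully paid, and SMOTE added 4524 synthetic minority rows.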

10 Decision Tree Model

10.0.1 Building the Decision Tree

Code
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_smote, y_smote)
tree_preds = tree.predict(X_test)

10.0.2 Evaluating the Model

Code
print(classification_report(y_test, tree_preds))
sns.heatmap(confusion_matrix(y_test, tree_preds), cmap="Blues", annot=True, fmt=".0f")
plt.title("Confusion Matrix - Decision Tree")
plt.show()
              precision    recall  f1-score   support

           0       0.86      0.76      0.80      2431
           1       0.18      0.30      0.23       443

    accuracy                           0.69      2874
   macro avg       0.52      0.53      0.52      2874
weighted avg       0.75      0.69      0.72      2874

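Decision Trees are intuitive models that split the data into branches to make predictions. They are useful for understanding which features are most important in predicting outcomes. Here the tree reaches an overall accuracy of 0.69, but on the not-fully-paid class (label 1) it attains a precision of only 0.18 and a recall of 0.30.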
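
To show what "splitting the data into branches" means in practice, the sketch below evaluates candidate thresholds on a single feature with Gini impurity, the default split criterion of scikit-learn's `DecisionTreeClassifier`. The eight FICO scores and outcomes are hypothetical values chosen for illustration, and the helper names `gini` and `split_impurity` are our own.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical FICO scores and outcomes (1 = not fully paid); illustration only.
fico = [640, 655, 670, 690, 710, 730, 750, 780]
not_paid = [1, 1, 0, 1, 0, 0, 0, 0]

# Candidate thresholds: midpoints between consecutive sorted values.
candidates = [(a + b) / 2 for a, b in zip(fico, fico[1:])]
best = min(candidates, key=lambda t: split_impurity(fico, not_paid, t))
print(best, split_impurity(fico, not_paid, best))
```

A tree repeats this search over every feature at every node and keeps the split with the lowest weighted impurity. In this toy example the best threshold is 700, which leaves a pure right branch.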


11 Random Forest Model

11.0.1 Building the Random Forest

Code
rand = RandomForestClassifier(n_estimators=100, random_state=42)
rand.fit(X_smote, y_smote)
rand_pred = rand.predict(X_test)

11.0.2 Evaluating the Model

Code
print(classification_report(y_test, rand_pred))
sns.heatmap(confusion_matrix(y_test, rand_pred), cmap="Blues", annot=True, fmt=".0f")
plt.title("Confusion Matrix - Random Forest")
plt.show()
              precision    recall  f1-score   support

           0       0.86      0.89      0.87      2431
           1       0.26      0.22      0.24       443

    accuracy                           0.78      2874
   macro avg       0.56      0.55      0.56      2874
weighted avg       0.77      0.78      0.78      2874

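Random Forests build multiple decision trees and aggregate their predictions for a more robust result. Averaging over many trees reduces the variance of a single tree, which generally lowers overfitting and improves accuracy. Here the forest reaches an accuracy of 0.78, with a precision of 0.26 and a recall of 0.22 on the not-fully-paid class.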
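
The idea of building many trees and aggregating them can be sketched in plain Python. This is a deliberately simplified illustration under stated assumptions: each "tree" is a one-split stump, each stump sees a bootstrap sample and one randomly chosen feature, and the ensemble uses a hard majority vote. scikit-learn's `RandomForestClassifier` instead grows full trees, samples feature subsets at every split, and averages predicted class probabilities. All data and function names below are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        # Rule: predict 1 (not fully paid) when the feature value is <= t.
        errors = sum((r[feature] <= t) != bool(y) for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, t)
    return feature, best[1]

def predict_stump(stump, row):
    feature, threshold = stump
    return int(row[feature] <= threshold)

def fit_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical (fico, log_annual_inc) rows; 1 = not fully paid.
rows = [(640, 10.2), (655, 10.5), (670, 10.9), (690, 10.4),
        (710, 11.1), (730, 11.3), (750, 11.0), (780, 11.6)]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

forest = fit_forest(rows, labels, n_trees=25, rng=random.Random(42))
low_risk = predict_forest(forest, (800, 12.0))
high_risk = predict_forest(forest, (600, 9.0))
print(low_risk, high_risk)
```

Because each stump is trained on a different resample, individual stumps disagree on borderline cases, and the vote smooths out those differences.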

11.0.3 Model Comparison

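The Random Forest model outperforms the Decision Tree on overall accuracy (0.78 versus 0.69) and on precision for the not-fully-paid class (0.26 versus 0.18). The Decision Tree, however, achieves higher recall on that class (0.30 versus 0.22), and the class-1 F1-scores are nearly identical (0.24 versus 0.23). The Random Forest is therefore the better choice when false alarms are costly, while neither model yet identifies most of the loans that were not fully repaid.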
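
The comparison can be checked directly from the two classification reports. The sketch below copies the reported class-1 precision and recall, recomputes F1 as their harmonic mean, and adds one reference point that is our own addition: the accuracy of a trivial classifier that labels every test loan as fully paid. Because the inputs are rounded to two decimals, the recomputed F1 values can differ from the reports in the last digit.

```python
# Values copied from the two classification reports above (class 1 = not fully paid).
reports = {
    "Decision Tree": {"accuracy": 0.69, "precision_1": 0.18, "recall_1": 0.30},
    "Random Forest": {"accuracy": 0.78, "precision_1": 0.26, "recall_1": 0.22},
}
support = {"fully_paid": 2431, "not_fully_paid": 443}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, m in reports.items():
    print(f"{name}: accuracy={m['accuracy']:.2f}, "
          f"F1(class 1)={f1(m['precision_1'], m['recall_1']):.3f}")

# Accuracy of a trivial classifier that labels every test loan as fully paid.
baseline_accuracy = support["fully_paid"] / sum(support.values())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline is about 0.85, above both models' accuracy, so accuracy alone is a weak yardstick on this imbalanced test set. Precision and recall on the not-fully-paid class are more informative.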


12 Conclusion

In this project, we applied machine learning models to predict loan repayment behavior using historical data from LendingClub.com. Using borrower attributes such as FICO scores, interest rates, and debt-to-income ratios, we developed Decision Tree and Random Forest models to classify loan repayment outcomes. To address the class imbalance in the dataset, where the majority of loans were fully repaid, we used SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples for the minority class in the training data.

Our analysis showed that the Random Forest model achieved higher accuracy and precision than the Decision Tree model, while the Decision Tree achieved higher recall on loans that were not fully repaid. Recall on that class remained at or below 0.30 for both models, so there is clear room for improvement before such models could guide lending decisions on their own. Even so, the project illustrates how predictive analytics can support risk assessment in peer-to-peer lending and help investors make more informed decisions.

Future enhancements could include more advanced algorithms such as Gradient Boosting, additional features, and time-series analysis of trends in borrower behavior. Refining the models along these lines can contribute to more robust strategies for credit risk management and lending decisions (Muddana and Vinayakam 2024; James et al. 2013).

12.0.1 Future Work

  • Experiment with more advanced techniques like Gradient Boosting.
  • Perform feature selection to reduce model complexity.
  • Explore time-series analysis for predicting trends in loan repayments.

References

James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.