Executive Summary

This project explores the use of machine learning models to predict loan repayment behavior using historical data from LendingClub.com. Focusing on the years 2007 to 2010, the analysis aims to help investors assess borrower risk and make more informed lending decisions. The dataset includes borrower profiles, loan characteristics, and repayment outcomes, allowing us to identify key factors that influence loan repayment. We employ Decision Trees and Random Forests to classify whether a borrower is likely to repay their loan in full. To address the class imbalance inherent in financial data, we apply the Synthetic Minority Over-sampling Technique (SMOTE) to the training data so that both classes are equally represented when the models are fitted. On the held-out test set, the Random Forest model achieves higher overall accuracy than the Decision Tree (0.78 versus 0.69) and higher precision on loans that were not fully paid, while the Decision Tree recalls more of those loans (0.30 versus 0.22). The results illustrate the practical application of predictive analytics in credit risk assessment, with implications for investment strategies in peer-to-peer lending.

1 Background

In this project, we delve into peer-to-peer lending data from LendingClub.com, a financial platform that connects borrowers directly with individual investors. The primary goal of this analysis is to develop predictive models that can help investors make better decisions by assessing the likelihood of a borrower fully repaying their loan. By leveraging historical data from Lending Club, we aim to build classification models that identify factors influencing loan repayment behavior.

The dataset we analyze spans the years 2007 to 2010, a significant period before Lending Club went public. This timeframe is particularly important, as it captures data during the company’s early growth phase, reflecting the risk profiles of borrowers and lending criteria at that time. The period also covers the 2008 financial crisis and its immediate aftermath, so the data reflect consumer credit behavior under unusual economic stress.

We will utilize machine learning techniques, specifically Decision Trees and Random Forests, to classify whether borrowers successfully repaid their loans. By applying these models, we aim to uncover patterns in the lending data that indicate repayment likelihood. This project not only demonstrates the application of advanced classification algorithms but also highlights the practical value of using predictive analytics in financial decision-making.

Our ultimate objective is to create models that assist investors in identifying trustworthy borrowers, thereby maximizing returns and minimizing the risk of default. The insights gained through this analysis have broader implications for improving lending strategies, credit risk assessment, and investment decisions in the peer-to-peer lending space.

2 Key Insights

The dataset contains several factors that plausibly influence the likelihood of loan repayment, such as credit scores, income levels, and loan purposes. We explore these features in depth to build predictive models that classify borrowers by risk profile.

3 Loading the Packages Used in the Analysis

We load the Python packages required for data manipulation, visualization, and modeling:

Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from imblearn.over_sampling import SMOTE

4 Data

For this project, we are exploring publicly available data from LendingClub.com. Lending Club connects borrowers with investors who are interested in funding loans. The dataset used includes lending data from 2007 to 2010 and is cleaned of any missing values.

4.0.1 Data Dictionary

The dataset contains several variables describing the loan, the borrower’s financial profile, and repayment status. Key variables include credit.policy, purpose, int.rate, and fico. The target variable for prediction is not.fully.paid, which equals 1 when a loan was not fully repaid and 0 otherwise.

5 Exploratory Data Analysis

We start by loading the dataset and exploring its structure:

Code

lending = pd.read_csv('loan_data.csv')
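# regex=False treats "." as a literal dot; with regex matching it would match every character
lending.columns = lending.columns.str.replace(".", "_", regex=False)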
lending.head()

	credit_policy	purpose	int_rate	installment	log_annual_inc	dti	fico	days_with_cr_line	revol_bal	revol_util	inq_last_6mths	delinq_2yrs
0	1	debt_consolidation	0.1189	829.10	11.350407	19.48	737	5639.958333	28854	52.1	0	0
1	1	credit_card	0.1071	228.22	11.082143	14.29	707	2760.000000	33623	76.7	0	0
2	1	debt_consolidation	0.1357	366.86	10.373491	11.63	682	4710.000000	3511	25.6	1	0
3	1	debt_consolidation	0.1008	162.34	11.350407	8.10	712	2699.958333	33667	73.2	1	0
4	1	credit_card	0.1426	102.92	11.299732	14.97	667	4066.000000	4740	39.5	0	1

5.0.1 Checking for Missing Values

Code

sns.heatmap(lending.isnull(), cmap="coolwarm")
plt.title("Heatmap of Missing Values")
plt.show()

The dataset has no missing values, which is ideal for our analysis focused on predictive modeling.
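A heatmap can hide a handful of missing cells in a frame of several thousand rows, so a numeric count is a useful cross-check. The sketch below uses a small made-up frame (`toy`, with one value removed on purpose) rather than the real data; on the actual dataset the equivalent call would be `lending.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the lending data; one value is missing on purpose.
toy = pd.DataFrame({
    "fico": [737, 707, np.nan, 712],
    "int_rate": [0.1189, 0.1071, 0.1357, 0.1008],
})

# Count missing cells per column, then in total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing cells: {total_missing}")
```

A total of zero on the real data would confirm what the heatmap suggests.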

6 Data Visualization

6.0.1 FICO Scores Distribution by Credit Policy

Code

sns.displot(x="fico", hue="credit_policy", data=lending, palette="rocket", kde=True)
plt.title("Distribution of FICO Scores by Credit Policy")
plt.show()

6.0.2 FICO Scores Distribution by Loan Payment Status

Code

sns.displot(x="fico", hue="not_fully_paid", data=lending, palette="rocket", kde=True)
plt.title("Distribution of FICO Scores by Payment Status")
plt.show()

6.0.3 Loan Purpose vs. Payment Status

Code

plt.figure(figsize=(12, 8))
sns.countplot(x="purpose", hue="not_fully_paid", data=lending, palette="mako")
plt.title("Loan Purpose vs Payment Status")
plt.show()

6.0.4 FICO Score vs. Interest Rate

Code

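g = sns.jointplot(x="fico", y="int_rate", data=lending, joint_kws={"color":"purple"}, marginal_kws={"color":"purple"})
# jointplot returns a JointGrid, so the title is set on the figure rather than with plt.title
g.figure.suptitle("FICO Score vs Interest Rate", y=1.02)
plt.show()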

We observe that lower FICO scores are associated with higher interest rates, reflecting higher risk.
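The visual impression can be backed with a correlation coefficient. As a small illustration, the sketch below computes the Pearson correlation on only the five rows shown in the data preview earlier; the frame name `preview` is ours, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# The five rows shown in the data preview (fico and int_rate columns only).
preview = pd.DataFrame({
    "fico": [737, 707, 682, 712, 667],
    "int_rate": [0.1189, 0.1071, 0.1357, 0.1008, 0.1426],
})

# Pearson correlation; a negative value means higher scores go with lower rates.
r = preview["fico"].corr(preview["int_rate"])
print(f"Pearson r on the preview rows: {r:.2f}")
```

On the full dataset the same check would be `lending["fico"].corr(lending["int_rate"])`.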

7 Feature Engineering

We convert the categorical purpose variable into dummy variables:

Code

cat = pd.get_dummies(lending["purpose"], drop_first=True, prefix="purpose")
lending = pd.concat([lending.drop("purpose", axis=1), cat], axis=1)
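To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up column with three purposes. The frame `toy` is hypothetical, and `dtype=int` is added here only so the output prints as 0/1.

```python
import pandas as pd

# Hypothetical toy column with three loan purposes.
toy = pd.DataFrame(
    {"purpose": ["credit_card", "debt_consolidation", "educational", "credit_card"]}
)

# drop_first=True drops the alphabetically first category ("credit_card"),
# which becomes the baseline encoded as zeros in every remaining column.
dummies = pd.get_dummies(toy["purpose"], drop_first=True, prefix="purpose", dtype=int)
print(dummies)
```

Dropping one level avoids a redundant column, since the baseline category is implied when all other dummies are zero.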

8 Machine Learning Models

8.1 Train-Test Split

We start by separating the target variable (not_fully_paid) from the features:

Code

X = lending.drop("not_fully_paid", axis=1)
y = lending["not_fully_paid"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
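The split above is purely random. With an imbalanced target, one common option is to pass `stratify=y` so that the class ratio is preserved in both parts. The sketch below shows this on made-up arrays (`X_toy`, `y_toy`); it is a suggestion, not what the analysis in this report does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 84 zeros and 16 ones.
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=101, stratify=y_toy
)

# With stratification, both parts keep the same share of the positive class.
print(y_tr.mean(), y_te.mean())
```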

9 Handling Class Imbalance Using SMOTE

Many real-world datasets, particularly in finance, have imbalanced class distributions. We use SMOTE (Synthetic Minority Over-sampling Technique) to balance the training set; the test set is left untouched so that evaluation reflects the original class distribution:

Code

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f'Old data had {y_train.shape[0]} rows; new data has {y_smote.shape[0]} rows.')

Old data had 6704 rows; new data has 11228 rows.
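SMOTE builds each synthetic minority row by taking a minority sample, choosing one of its nearest minority-class neighbours, and placing a new point at a random position on the line segment between the two. The following is a minimal NumPy sketch of that interpolation step only. The function name `smote_like_samples` and the toy points are illustrative, and this is not the `imblearn` implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class points with two features.
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
synthetic = smote_like_samples(X_min, n_new=5)
print(synthetic.shape)
```

Because every synthetic row is a blend of two real minority rows, the new points stay inside the region already occupied by the minority class. Assuming SMOTE's default of equalising the two classes, the reported 11228 rows correspond to 5614 rows per class, that is, 4524 synthetic rows added to 1090 original minority rows.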

10 Decision Tree Model

10.0.1 Building the Decision Tree

Code

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_smote, y_smote)
tree_preds = tree.predict(X_test)

10.0.2 Evaluating the Model

Code

print(classification_report(y_test, tree_preds))
sns.heatmap(confusion_matrix(y_test, tree_preds), cmap="Blues", annot=True, fmt=".0f")
plt.title("Confusion Matrix - Decision Tree")
plt.show()

              precision    recall  f1-score   support

           0       0.86      0.76      0.80      2431
           1       0.18      0.30      0.23       443

    accuracy                           0.69      2874
   macro avg       0.52      0.53      0.52      2874
weighted avg       0.75      0.69      0.72      2874

Decision Trees are intuitive models that split the data into branches to make predictions. They are useful for understanding which features are most important in predicting outcomes.
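One way to read feature importance off a fitted tree is scikit-learn's `feature_importances_` attribute. The sketch below uses synthetic data from `make_classification`, with hypothetical column names `f0` to `f4`, so the numbers say nothing about the lending data; on the model above the equivalent would be `pd.Series(tree.feature_importances_, index=X.columns)`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names f0..f4 are hypothetical.
X_arr, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

toy_tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Impurity-based importances are normalised to sum to 1; a larger value means
# the feature accounted for more of the total impurity reduction.
importances = pd.Series(toy_tree.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```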

11 Random Forest Model

11.0.1 Building the Random Forest

Code

rand = RandomForestClassifier(n_estimators=100, random_state=42)
rand.fit(X_smote, y_smote)
rand_pred = rand.predict(X_test)

11.0.2 Evaluating the Model

Code

print(classification_report(y_test, rand_pred))
sns.heatmap(confusion_matrix(y_test, rand_pred), cmap="Blues", annot=True, fmt=".0f")
plt.title("Confusion Matrix - Random Forest")
plt.show()

              precision    recall  f1-score   support

           0       0.86      0.89      0.87      2431
           1       0.26      0.22      0.24       443

    accuracy                           0.78      2874
   macro avg       0.56      0.55      0.56      2874
weighted avg       0.77      0.78      0.78      2874

Random Forests build multiple decision trees and aggregate their predictions for a more robust result. This model generally performs better than a single decision tree by reducing overfitting and improving accuracy.
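The aggregation step can be checked directly. In scikit-learn, a random forest's predicted class probabilities are the average of the probabilities predicted by its individual trees. The sketch below verifies this on synthetic data; `X_toy`, `y_toy`, and the choice of 25 trees are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, unrelated to the lending dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_toy, y_toy)

# Average the class probabilities predicted by each individual tree.
per_tree = np.stack([t.predict_proba(X_toy) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

# Compare the manual average with the forest's own probabilities.
matches = np.allclose(averaged, forest.predict_proba(X_toy))
print(matches)
```

Averaging over many trees, each grown on a different bootstrap sample, is what smooths out the idiosyncrasies of any single tree.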

11.0.3 Model Comparison

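The Random Forest has higher overall accuracy than the Decision Tree (0.78 versus 0.69) and higher precision for class 1, loans not fully paid (0.26 versus 0.18). The Decision Tree, however, has higher recall for that class (0.30 versus 0.22), and the class 1 F1-scores are almost the same (0.24 versus 0.23). The Random Forest is the better choice when overall accuracy and fewer false alarms matter most, but neither model identifies the majority of unpaid loans.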
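The F1-score is the harmonic mean of precision and recall, so the class 1 values in the two reports can be re-derived from the printed precision and recall. The sketch below does that arithmetic; the helper name `f1` is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (not fully paid) figures copied from the two classification reports.
reported = {
    "Decision Tree": {"precision": 0.18, "recall": 0.30},
    "Random Forest": {"precision": 0.26, "recall": 0.22},
}

for name, m in reported.items():
    print(f"{name}: F1 = {f1(m['precision'], m['recall']):.3f}")
```

Because the inputs are already rounded to two decimals, the results (about 0.225 and 0.238) agree with the reported 0.23 and 0.24 only up to rounding. The gap between the two models on this metric is about one hundredth.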

12 Conclusion

In this project, we used machine learning models to predict loan repayment behavior from historical LendingClub.com data. Using borrower attributes such as FICO scores, interest rates, and debt-to-income ratios, we trained Decision Tree and Random Forest classifiers to predict loan repayment outcomes. Because most loans in the dataset were fully repaid, we applied SMOTE (Synthetic Minority Over-sampling Technique) to the training data, generating synthetic minority-class samples so that both classes were equally represented when fitting the models. We did not fit a baseline without SMOTE, so its effect on performance is not measured here.

On the test set, the Random Forest model achieved higher overall accuracy than the Decision Tree (0.78 versus 0.69) and higher precision on loans that were not fully paid (0.26 versus 0.18), but lower recall on that class (0.22 versus 0.30). The F1-scores for the class were nearly identical (0.24 versus 0.23). Both models therefore miss most defaults, and the choice between them depends on whether an investor weighs missed defaults or false alarms more heavily. Even so, the project illustrates how predictive analytics can support risk assessment in peer-to-peer lending and help investors make more informed decisions.

Future enhancements could include exploring more advanced algorithms like Gradient Boosting or incorporating additional features to further improve model accuracy. Additionally, using time-series analysis to assess trends in borrower behavior could provide even deeper insights into the factors affecting loan repayment. By refining these models, we can contribute to more robust strategies for credit risk management and lending decisions (Muddana and Vinayakam 2024; James et al. 2013).

12.0.1 Future Work

- Experiment with more advanced techniques like Gradient Boosting.
- Perform feature selection to reduce model complexity.
- Explore time-series analysis for predicting trends in loan repayments.

References

James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.

Title: Predicting Loan Repayment: Leveraging Decision Trees and Random Forest Models on LendingClub Data Using Python
Subtitle: Independent Data Analysis Project
Author: John Karuitha, PhD (ORCID 0000-0002-8204-7034; jkaruitha@karu.ac.ke)
Affiliations: Karatina University, Department of Business and Economics, Karatina, Kenya; University of the Witwatersrand, School of Construction Economics & Management, Johannesburg, South Africa
Keywords: Data Analysis, Python, Pandas, Seaborn, Numpy, Machine Learning, K-Nearest Neighbors (KNN), Scikit-learn, Classification