This project explores the use of machine learning models to predict loan repayment behavior using historical data from LendingClub.com. Focusing on the years 2007 to 2010, the analysis aims to help investors assess borrower risk and make more informed lending decisions. The dataset includes borrower profiles, loan characteristics, and repayment outcomes, allowing us to identify key factors that influence loan repayment. We employ Decision Trees and Random Forests to classify whether a borrower is likely to repay their loan in full. To address the class imbalance inherent in financial data, we apply the Synthetic Minority Over-sampling Technique (SMOTE), which improves model performance by balancing the dataset. Our findings reveal that the Random Forest model outperforms the Decision Tree model, achieving higher accuracy and recall rates. The results demonstrate the practical application of predictive analytics in enhancing credit risk assessment, with implications for better investment strategies in peer-to-peer lending.
In this project, we delve into peer-to-peer lending data from LendingClub.com, a financial platform that connects borrowers directly with individual investors. The primary goal of this analysis is to develop predictive models that can help investors make better decisions by assessing the likelihood of a borrower fully repaying their loan. By leveraging historical data from Lending Club, we aim to build classification models that identify factors influencing loan repayment behavior.
The dataset we analyze spans the years 2007 to 2010, a significant period before Lending Club went public. This timeframe is particularly important, as it captures data during the company’s early growth phase, reflecting the risk profiles of borrowers and lending criteria at that time. The analysis is set against the backdrop of the 2008 financial crisis and its aftermath, which makes it a useful window into consumer credit behavior under economic stress.
We will utilize machine learning techniques, specifically Decision Trees and Random Forests, to classify whether borrowers successfully repaid their loans. By applying these models, we aim to uncover patterns in the lending data that indicate repayment likelihood. This project not only demonstrates the application of advanced classification algorithms but also highlights the practical value of using predictive analytics in financial decision-making.
Our ultimate objective is to create models that assist investors in identifying trustworthy borrowers, thereby maximizing returns and minimizing the risk of default. The insights gained through this analysis have broader implications for improving lending strategies, credit risk assessment, and investment decisions in the peer-to-peer lending space.
2 Key Insights
The dataset reveals various factors that influence the likelihood of loan repayment, such as credit scores, income levels, and loan purposes. We will explore these features in depth to build predictive models that classify borrowers by risk profile.
3 Loading the Packages Used in the Analysis
We load the Python packages required for data manipulation, visualization, and modeling:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
```
4 Data
For this project, we are exploring publicly available data from LendingClub.com. Lending Club connects borrowers with investors who are interested in funding loans. The dataset used includes lending data from 2007 to 2010 and is cleaned of any missing values.
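4.0.1 Data Dictionary
The dataset contains several variables, including information about the loan, the borrower’s financial profile, and repayment status. Key variables include credit.policy, purpose, int.rate, and fico, along with not.fully.paid, which is the target variable for prediction. The periods in these column names are replaced with underscores when the data is loaded, so the code refers to credit_policy, int_rate, not_fully_paid, and so on.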
5 Exploratory Data Analysis
We start by loading the dataset and exploring its structure:
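As a minimal sketch of this loading step, the block below reads a CSV and replaces the periods in the column names with underscores. The three-row in-memory CSV is a hypothetical stand-in for `loan_data.csv`, and its values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for loan_data.csv; the values
# are invented for illustration and are not taken from the real dataset.
raw_csv = io.StringIO(
    "credit.policy,purpose,int.rate,fico,not.fully.paid\n"
    "1,debt_consolidation,0.1189,737,0\n"
    "1,credit_card,0.1071,707,0\n"
    "0,small_business,0.1496,667,1\n"
)

lending = pd.read_csv(raw_csv)

# regex=False states explicitly that "." is a literal period, not a regex
# wildcard; the default for this parameter has changed across pandas versions.
lending.columns = lending.columns.str.replace(".", "_", regex=False)

print(lending.columns.tolist())
print(lending.dtypes)
```

Passing `regex=False` is a defensive choice rather than a requirement: it keeps the rename behaving the same way regardless of the installed pandas version.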
8 Machine Learning Models
8.1 Train-Test Split
We separate the target variable (not_fully_paid) from the features and hold out 30% of the rows as a test set:
```python
X = lending.drop("not_fully_paid", axis=1)
y = lending["not_fully_paid"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
```
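The split above is purely random. With an imbalanced target, a common variant is a stratified split, which keeps the class ratio the same in the training and test partitions. The sketch below illustrates that variant on synthetic data; the arrays `X_demo` and `y_demo` and the 16% minority share are assumptions made for illustration, not properties of the lending data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(101)

# Synthetic stand-in for the lending features and target (assumption):
# 1,000 rows, of which 160 are labelled 1 ("not fully paid").
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 160 + [0] * 840)

# stratify=y_demo asks for the same class proportions in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

print(f"minority share, full data: {y_demo.mean():.3f}")
print(f"minority share, train:     {y_tr.mean():.3f}")
print(f"minority share, test:      {y_te.mean():.3f}")
```

Whether stratification changes the results of this analysis is not something the sketch can show. It only demonstrates that the option exists and what it preserves.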
9 Handling Class Imbalance Using SMOTE
Many real-world datasets, particularly in finance, have imbalanced class distributions. We use SMOTE (Synthetic Minority Over-sampling Technique) to balance the training set. The test set is left untouched, so evaluation still reflects the original class distribution:
```python
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f'Old data had {y_train.shape[0]} rows; new data has {y_smote.shape[0]} rows.')
```
Old data had 6704 rows; new data has 11228 rows.
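To make the resampling step less of a black box, here is a simplified illustration of the idea behind SMOTE: each synthetic row is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a hand-rolled sketch in NumPy, not the imblearn implementation. The function name `smote_like_sample` and the sample values are hypothetical.

```python
import numpy as np


def smote_like_sample(X_min, k=2, n_new=5, seed=42):
    """Generate n_new synthetic rows by interpolating between minority points.

    Simplified illustration of the SMOTE idea; not imblearn's implementation.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        neighbour = rng.choice(neighbours[base])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbour] - X_min[base])
    return synthetic


# Hypothetical minority-class rows as (fico, int_rate); values are invented.
X_minority = np.array([[660.0, 0.15], [672.0, 0.14], [690.0, 0.13], [705.0, 0.12]])
new_rows = smote_like_sample(X_minority)
print(new_rows.round(3))
```

Assuming SMOTE's default behaviour of oversampling the minority class until both classes are the same size, the 11,228 rows reported above imply 5,614 rows per class, that is, 1,090 original minority rows plus 4,524 synthetic ones.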
10 Decision Tree Model
10.0.1 Building the Decision Tree
```python
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_smote, y_smote)
tree_preds = tree.predict(X_test)
```
Decision Trees are intuitive models that split the data into branches to make predictions. They are useful for understanding which features are most important in predicting outcomes.
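To illustrate how a tree chooses where to branch, the sketch below scores candidate splits by Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names `gini` and `split_impurity` and the six borrowers are hypothetical and exist only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature, threshold):
    """Size-weighted Gini impurity after splitting on rows[feature] <= threshold."""
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical borrowers (values invented); label 1 means "not fully paid".
rows = [{"fico": 650}, {"fico": 665}, {"fico": 700},
        {"fico": 720}, {"fico": 745}, {"fico": 760}]
labels = [1, 1, 0, 1, 0, 0]

print(f"impurity before split: {gini(labels):.3f}")
for threshold in (665, 700, 720):
    score = split_impurity(rows, labels, "fico", threshold)
    print(f"fico <= {threshold}: weighted impurity {score:.3f}")
```

A tree learner evaluates many such candidate thresholds across all features and keeps the one with the lowest weighted impurity, then repeats the process within each branch.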
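11 Random Forest Model
Random Forests build many decision trees, each on a bootstrap sample of the training data, and aggregate their predictions for a more robust result. A forest typically generalizes better than a single decision tree because averaging many trees reduces the variance that causes a lone tree to overfit.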
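The aggregation step can be checked directly. In scikit-learn, a `RandomForestClassifier` averages the class probabilities predicted by its individual trees rather than counting hard votes. The sketch below verifies this on synthetic data from `make_classification`, which is an assumption standing in for the lending features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the lending features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is exposed in forest.estimators_; average their class
# probabilities by hand and compare with the forest's own output.
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict_proba(X_demo)))
```

The predicted class is then the one with the highest averaged probability.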
11.0.3 Model Comparison
We observe that the Random Forest model outperforms the Decision Tree in terms of accuracy, precision, and recall. The F1-score for the Random Forest is significantly higher, making it a more reliable choice for predicting loan repayment.
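When comparing models on imbalanced data, it helps to see how the metrics named above are derived from a confusion matrix, because accuracy and recall can point in opposite directions. The sketch below uses hypothetical counts, not the results of this analysis. The function name `minority_class_metrics` is also an assumption.

```python
def minority_class_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, NOT the results of this analysis.
# Positive class = 1 ("not fully paid"); 100 repaid loans and 20 unpaid loans.
always_repaid = minority_class_metrics(tn=100, fp=0, fn=20, tp=0)
some_model = minority_class_metrics(tn=90, fp=10, fn=15, tp=5)

for name, scores in (("always predict repaid", always_repaid), ("some model", some_model)):
    summary = ", ".join(f"{metric}={value:.3f}" for metric, value in scores.items())
    print(f"{name}: {summary}")
```

In this made-up example the trivial baseline has the higher accuracy yet catches no defaults at all, which is why recall and F1 on the "not fully paid" class are the more informative figures for an investor.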
12 Conclusion
In this project, we successfully leveraged machine learning models to predict loan repayment behavior using historical data from LendingClub.com. By focusing on key borrower attributes such as FICO scores, interest rates, and debt-to-income ratios, we developed predictive models using Decision Trees and Random Forests to classify loan repayment outcomes. To address the class imbalance in the dataset—where the majority of loans were fully repaid—we utilized SMOTE (Synthetic Minority Over-sampling Technique), which significantly improved model performance by generating synthetic samples for the minority class.
Our analysis revealed that the Random Forest model outperformed the Decision Tree model, achieving better accuracy, precision, and recall. This superior performance demonstrates the effectiveness of ensemble methods in handling complex datasets with imbalanced classes. The insights gained from this project underscore the value of using predictive analytics for enhancing risk assessment in peer-to-peer lending, ultimately helping investors make more informed decisions.
Future enhancements could include exploring more advanced algorithms like Gradient Boosting or incorporating additional features to further improve model accuracy. Additionally, using time-series analysis to assess trends in borrower behavior could provide even deeper insights into the factors affecting loan repayment. By refining these models, we can contribute to more robust strategies for credit risk management and lending decisions (Muddana and Vinayakam 2024; James et al. 2013).
12.0.1 Future Work
- Experiment with more advanced techniques like Gradient Boosting.
- Perform feature selection to reduce model complexity.
- Explore time-series analysis for predicting trends in loan repayments.
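As a starting point for the first item, a gradient boosting model can be fitted with the same scikit-learn interface used for the other classifiers. The sketch below runs on synthetic, imbalanced data from `make_classification`. The data, the class weights, and the hyperparameters are all assumptions for illustration, and no claim is made about how such a model would perform on the lending data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the lending data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

# Hyperparameters are illustrative, not tuned values.
boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
boost.fit(X_tr, y_tr)
boost_preds = boost.predict(X_te)

print(classification_report(y_te, boost_preds))
```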
References
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.
Source Code
---
title: "**Predicting Loan Repayment: Leveraging Decision Trees and Random Forest Models on LendingClub Data Using Python**"
subtitle: "*Independent Data Analysis Project*"
author:
  - name: John Karuitha, PhD
    orcid: 0000-0002-8204-7034
    email: jkaruitha@karu.ac.ke
    affiliations:
      - name: Karatina University, Department of Business and Economics
        city: Karatina
        state: Kenya
        postal-code: 10101
        url: https://www.rpubs.com/Karuitha
      - name: University of the Witwatersrand, School of Construction Economics & Management
        city: Johannesburg
        state: South Africa
        postal-code: 2000
        url: https://www.linkedin.com/in/Karuitha
date: today
date-modified: last-modified
date-format: long
abstract-title: "Executive Summary"
abstract: |
  This project explores the use of machine learning models to predict loan repayment behavior using historical data from LendingClub.com. Focusing on the years 2007 to 2010, the analysis aims to help investors assess borrower risk and make more informed lending decisions. The dataset includes borrower profiles, loan characteristics, and repayment outcomes, allowing us to identify key factors that influence loan repayment. We employ **Decision Trees** and **Random Forests** to classify whether a borrower is likely to repay their loan in full. To address the class imbalance inherent in financial data, we apply the **Synthetic Minority Over-sampling Technique (SMOTE)**, which improves model performance by balancing the dataset. Our findings reveal that the Random Forest model outperforms the Decision Tree model, achieving higher accuracy and recall rates. The results demonstrate the practical application of predictive analytics in enhancing credit risk assessment, with implications for better investment strategies in peer-to-peer lending.
keywords:
  - Data Analysis
  - Python
  - Pandas
  - Seaborn
  - Numpy
  - Machine Learning
  - K-Nearest Neighbors (KNN)
  - Scikit-learn
  - Classification
bibliography: bibliography.bib
format:
  html:
    toc: true
    toc-depth: 3
    toc-title: "Contents"
    fontsize: 1.2em
    number-sections: true
    number-depth: 3
    code-fold: true
    code-tools: true
    link-external-icon: true
    theme: lux
    css: styles.css
    html-math-method: katex
    fig-align: center
    smooth-scroll: true
    toc-location: left
    title-block-banner: "Untitled.jpg"
    title-block-banner-color: black
    header-includes: |
      <link rel="icon" type="image/png" href="favicon.png">
execute:
  echo: true
  warning: false
  message: false
  cache: true
---

# **Background**

In this project, we delve into **peer-to-peer lending** data from LendingClub.com, a financial platform that connects borrowers directly with individual investors. The primary goal of this analysis is to develop predictive models that can help investors make better decisions by assessing the likelihood of a borrower fully repaying their loan. By leveraging historical data from Lending Club, we aim to build classification models that identify factors influencing loan repayment behavior.

The dataset we analyze spans the years **2007 to 2010**, a significant period before Lending Club went public. This timeframe is particularly important, as it captures data during the company’s early growth phase, reflecting the risk profiles of borrowers and lending criteria at that time. The analysis is set against the backdrop of the 2008 financial crisis and its aftermath, which makes it a useful window into consumer credit behavior under economic stress.

We will utilize **machine learning techniques**, specifically **Decision Trees** and **Random Forests**, to classify whether borrowers successfully repaid their loans. By applying these models, we aim to uncover patterns in the lending data that indicate repayment likelihood. This project not only demonstrates the application of advanced classification algorithms but also highlights the practical value of using predictive analytics in financial decision-making.

Our ultimate objective is to create models that assist investors in identifying trustworthy borrowers, thereby maximizing returns and minimizing the risk of default. The insights gained through this analysis have broader implications for improving lending strategies, credit risk assessment, and investment decisions in the peer-to-peer lending space.

---

# **Key Insights**

The dataset reveals various factors that influence the likelihood of loan repayment, such as credit scores, income levels, and loan purposes. We will explore these features in-depth to build predictive models that effectively classify borrowers based on their risk profiles.

---

# **Loading the Packages Used in the Analysis**

We load the Python packages required for data manipulation, visualization, and modeling:

```{python}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
```

---

# **Data**

For this project, we are exploring publicly available data from [LendingClub.com](https://www.lendingclub.com). Lending Club connects borrowers with investors who are interested in funding loans. The dataset used includes lending data from **2007 to 2010** and is cleaned of any missing values.

### **Data Dictionary**

The dataset contains several variables, including information about the loan, the borrower’s financial profile, and repayment status. Key variables include `credit.policy`, `purpose`, `int.rate`, `fico`, and `not.fully.paid`, which will be used as the target variable for prediction.

---

# **Exploratory Data Analysis**

We start by loading the dataset and exploring its structure:

```{python}
lending = pd.read_csv('loan_data.csv')
lending.columns = lending.columns.str.replace(".", "_")
lending.head()
```

### **Checking for Missing Values**

```{python}
sns.heatmap(lending.isnull(), cmap="coolwarm")
plt.title("Heatmap of Missing Values")
plt.show()
```

The dataset has no missing values, which is ideal for our analysis focused on predictive modeling.

---

# **Data Visualization**

### **FICO Scores Distribution by Credit Policy**

```{python}
sns.displot(x="fico", hue="credit_policy", data=lending, palette="rocket", kde=True)
plt.title("Distribution of FICO Scores by Credit Policy")
plt.show()
```

### **FICO Scores Distribution by Loan Payment Status**

```{python}
sns.displot(x="fico", hue="not_fully_paid", data=lending, palette="rocket", kde=True)
plt.title("Distribution of FICO Scores by Payment Status")
plt.show()
```

### **Loan Purpose vs. Payment Status**

```{python}
plt.figure(figsize=(12, 8))
sns.countplot(x="purpose", hue="not_fully_paid", data=lending, palette="mako")
plt.title("Loan Purpose vs Payment Status")
plt.show()
```

### **FICO Score vs. Interest Rate**

```{python}
sns.jointplot(x="fico", y="int_rate", data=lending, joint_kws={"color":"purple"}, marginal_kws={"color":"purple"})
plt.title("FICO Score vs Interest Rate")
plt.show()
```

We observe that lower FICO scores are associated with higher interest rates, reflecting higher risk.

---

# **Feature Engineering**

We convert the categorical `purpose` variable into dummy variables:

```{python}
cat = pd.get_dummies(lending["purpose"], drop_first=True, prefix="purpose")
lending = pd.concat([lending.drop("purpose", axis=1), cat], axis=1)
```

---

# **Machine Learning Models**

## **Train-Test Split**

We start by separating the target variable (`not_fully_paid`) from the features:

```{python}
X = lending.drop("not_fully_paid", axis=1)
y = lending["not_fully_paid"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
```

---

# **Handling Class Imbalance Using SMOTE**

Many real-world datasets, particularly in finance, have imbalanced class distributions. We use **SMOTE (Synthetic Minority Over-sampling Technique)** to balance our dataset:

```{python}
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f'Old data had {y_train.shape[0]} rows; new data has {y_smote.shape[0]} rows.')
```

---

# **Decision Tree Model**

### **Building the Decision Tree**

```{python}
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_smote, y_smote)
tree_preds = tree.predict(X_test)
```

### **Evaluating the Model**

```{python}
print(classification_report(y_test, tree_preds))
sns.heatmap(confusion_matrix(y_test, tree_preds), cmap="Blues", annot=True, fmt=".0f")
plt.title("Confusion Matrix - Decision Tree")
plt.show()
```

**Decision Trees** are intuitive models that split the data into branches to make predictions. They are useful for understanding which features are most important in predicting outcomes.

---

# **Random Forest Model**

### **Building the Random Forest**

```{python}
rand = RandomForestClassifier(n_estimators=100, random_state=42)
rand.fit(X_smote, y_smote)
rand_pred = rand.predict(X_test)
```

### **Evaluating the Model**

```{python}
print(classification_report(y_test, rand_pred))
sns.heatmap(confusion_matrix(y_test, rand_pred), cmap="Blues", annot=True, fmt=".0f")
plt.title("Confusion Matrix - Random Forest")
plt.show()
```

**Random Forests** build multiple decision trees and aggregate their predictions for a more robust result. This model generally performs better than a single decision tree by reducing overfitting and improving accuracy.

### **Model Comparison**

We observe that the **Random Forest model outperforms the Decision Tree** in terms of accuracy, precision, and recall. The F1-score for the Random Forest is significantly higher, making it a more reliable choice for predicting loan repayment.

---

# **Conclusion**

In this project, we successfully leveraged machine learning models to predict loan repayment behavior using historical data from LendingClub.com. By focusing on key borrower attributes such as FICO scores, interest rates, and debt-to-income ratios, we developed predictive models using **Decision Trees** and **Random Forests** to classify loan repayment outcomes. To address the class imbalance in the dataset, where the majority of loans were fully repaid, we utilized **SMOTE (Synthetic Minority Over-sampling Technique)**, which significantly improved model performance by generating synthetic samples for the minority class.

Our analysis revealed that the **Random Forest model** outperformed the Decision Tree model, achieving better accuracy, precision, and recall. This superior performance demonstrates the effectiveness of ensemble methods in handling complex datasets with imbalanced classes. The insights gained from this project underscore the value of using predictive analytics for enhancing risk assessment in peer-to-peer lending, ultimately helping investors make more informed decisions.

Future enhancements could include exploring more advanced algorithms like **Gradient Boosting** or incorporating additional features to further improve model accuracy. Additionally, using time-series analysis to assess trends in borrower behavior could provide even deeper insights into the factors affecting loan repayment. By refining these models, we can contribute to more robust strategies for credit risk management and lending decisions [@muddana2024python; @james2013introduction].

### **Future Work**

- Experiment with more advanced techniques like **Gradient Boosting**.
- Perform feature selection to reduce model complexity.
- Explore time-series analysis for predicting trends in loan repayments.

---

# **References** {-}