This section presents descriptive statistics for the dataset. The raw dataset contains 27,003 loans and 48 variables; the table below describes a subset of them.
| Variable Name | Description |
|---|---|
| loan_amnt | The total amount of the loan applied for by the borrower. |
| term | The duration of the loan in months (e.g., 36 or 60 months). |
| int_rate | The interest rate on the loan, expressed as a percentage. |
| installment | The monthly payment owed by the borrower if the loan is funded. |
| grade | Loan grade assigned by the lending institution, typically from A to G. |
| sub_grade | More granular breakdown of the loan grade (e.g., A1, A2, B1). |
| emp_title | The job title supplied by the borrower. |
| emp_length | The length of employment in years. |
| home_ownership | The home ownership status provided by the borrower (e.g., Rent, Own). |
| annual_inc | The self-reported annual income provided by the borrower. |
| verification_status | Indicates if income was verified by the lender. |
| issue_d | The month and year when the loan was funded. |
| loan_status | Current status of the loan (e.g., Fully Paid, Charged Off). |
| purpose | A category provided by the borrower for the loan request. |
| title | The loan title provided by the borrower. |
| dti | Debt-to-income ratio calculated using the borrower’s monthly obligations. |
| earliest_cr_line | The month the borrower’s earliest reported credit line was opened. |
| open_acc | The number of open credit lines in the borrower’s credit file. |
| pub_rec | Number of derogatory public records. |
| revol_bal | Total credit revolving balance. |
| revol_util | Revolving line utilization rate, or the amount of credit used relative to all available revolving credit. |
| total_acc | The total number of credit lines currently in the borrower’s credit file. |
| initial_list_status | The initial listing status of the loan. |
| application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers. |
| mort_acc | Number of mortgage accounts. |
| pub_rec_bankruptcies | Number of public record bankruptcies. |
Code
# Load necessary libraries
library(tidyverse)   # For data manipulation and visualization
library(lubridate)   # For handling date variables
library(caret)       # For splitting data and confusion matrix
library(rpart)       # For decision tree modeling
library(rpart.plot)  # For visualizing decision trees
library(pROC)        # For ROC-AUC calculations and plots

# Load the dataset
loan_data <- read.csv("loan_defaults.csv")  # Combined dataset

# Descriptive stats of the dataset
summary(loan_data)
id member_id loan_amnt funded_amnt
Min. : 54734 Min. : 70699 Min. : 500 Min. : 500
1st Qu.: 512663 1st Qu.: 661776 1st Qu.: 5300 1st Qu.: 5175
Median : 657052 Median : 840072 Median : 9750 Median : 9600
Mean : 676301 Mean : 842247 Mean :11088 Mean :10816
3rd Qu.: 827102 3rd Qu.:1035238 3rd Qu.:15000 3rd Qu.:15000
Max. :1077430 Max. :1314167 Max. :35000 Max. :35000
funded_amnt_inv term int_rate installment
Min. : 0 Length:27003 Length:27003 Min. : 15.69
1st Qu.: 5000 Class :character Class :character 1st Qu.: 165.81
Median : 8731 Mode :character Mode :character Median : 277.57
Mean :10251 Mean : 323.38
3rd Qu.:14025 3rd Qu.: 427.00
Max. :35000 Max. :1302.69
grade sub_grade emp_title emp_length
Length:27003 Length:27003 Length:27003 Length:27003
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
home_ownership annual_inc verification_status issue_d
Length:27003 Min. : 4000 Length:27003 Length:27003
Class :character 1st Qu.: 40000 Class :character Class :character
Mode :character Median : 59000 Mode :character Mode :character
Mean : 68778
3rd Qu.: 82000
Max. :6000000
loan_status url desc purpose
Min. :0.0000 Length:27003 Length:27003 Length:27003
1st Qu.:0.0000 Class :character Class :character Class :character
Median :0.0000 Mode :character Mode :character Mode :character
Mean :0.1473
3rd Qu.:0.0000
Max. :1.0000
title zip_code addr_state dti
Length:27003 Length:27003 Length:27003 Min. : 0.00
Class :character Class :character Class :character 1st Qu.: 8.13
Mode :character Mode :character Mode :character Median :13.35
Mean :13.26
3rd Qu.:18.52
Max. :29.99
delinq_2yrs earliest_cr_line inq_last_6mths mths_since_last_delinq
Min. : 0.0000 Length:27003 Min. :0.0000 Min. : 0.0
1st Qu.: 0.0000 Class :character 1st Qu.:0.0000 1st Qu.: 18.0
Median : 0.0000 Mode :character Median :1.0000 Median : 34.0
Mean : 0.1492 Mean :0.8711 Mean : 35.7
3rd Qu.: 0.0000 3rd Qu.:1.0000 3rd Qu.: 52.0
Max. :11.0000 Max. :8.0000 Max. :120.0
NA's :17395
mths_since_last_record open_acc pub_rec revol_bal
Min. : 0.00 Min. : 2.000 Min. :0.00000 Min. : 0
1st Qu.: 8.00 1st Qu.: 6.000 1st Qu.:0.00000 1st Qu.: 3647
Median : 90.00 Median : 9.000 Median :0.00000 Median : 8808
Mean : 68.99 Mean : 9.269 Mean :0.05585 Mean : 13341
3rd Qu.:104.00 3rd Qu.:12.000 3rd Qu.:0.00000 3rd Qu.: 17008
Max. :129.00 Max. :44.000 Max. :4.00000 Max. :149588
NA's :25065
revol_util total_acc out_prncp out_prncp_inv total_pymnt
Length:27003 Min. : 2.00 Min. :0 Min. :0 Min. : 0
Class :character 1st Qu.:13.00 1st Qu.:0 1st Qu.:0 1st Qu.: 5514
Mode :character Median :20.00 Median :0 Median :0 Median : 9690
Mean :22.08 Mean :0 Mean :0 Mean :11874
3rd Qu.:29.00 3rd Qu.:0 3rd Qu.:0 3rd Qu.:16112
Max. :81.00 Max. :0 Max. :0 Max. :58564
total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.000
1st Qu.: 5019 1st Qu.: 4500 1st Qu.: 641.7 1st Qu.: 0.000
Median : 9064 Median : 8000 Median : 1294.6 Median : 0.000
Mean :11278 Mean : 9657 Mean : 2118.8 Mean : 1.364
3rd Qu.:15337 3rd Qu.:13135 3rd Qu.: 2672.3 3rd Qu.: 0.000
Max. :58564 Max. :35000 Max. :23563.7 Max. :165.690
recoveries collection_recovery_fee last_pymnt_d
Min. : 0.00 Min. : 0.00 Length:27003
1st Qu.: 0.00 1st Qu.: 0.00 Class :character
Median : 0.00 Median : 0.00 Mode :character
Mean : 96.44 Mean : 12.29
3rd Qu.: 0.00 3rd Qu.: 0.00
Max. :29623.35 Max. :7002.19
last_pymnt_amnt last_credit_pull_d pub_rec_bankruptcies
Min. : 0.0 Length:27003 Min. :0.000
1st Qu.: 215.9 Class :character 1st Qu.:0.000
Median : 571.3 Mode :character Median :0.000
Mean : 2754.9 Mean :0.044
3rd Qu.: 3459.6 3rd Qu.:0.000
Max. :36115.2 Max. :2.000
NA's :503
Data Pre-Processing and Transformation
Data cleaning involved converting the interest rate and term to numeric format, parsing the issue date, encoding categorical variables as factors, and engineering one new feature (the loan-to-income ratio).
Code
# Preprocess the data

# Convert interest rate to numeric (remove the % sign and convert to numeric)
loan_data$int_rate <- as.numeric(gsub("%", "", loan_data$int_rate))

# Convert term to numeric (remove " months" and convert to numeric)
loan_data$term <- as.numeric(gsub(" months", "", loan_data$term))

# Parse the issue date (month-year) into a date-time value
loan_data$issue_d <- parse_date_time(loan_data$issue_d, orders = "my")

# Encode categorical variables as factors
loan_data$grade <- as.factor(loan_data$grade)
loan_data$home_ownership <- as.factor(loan_data$home_ownership)
loan_data$verification_status <- as.factor(loan_data$verification_status)

# Feature engineering: create loan-to-income ratio
loan_data$loan_to_income <- loan_data$loan_amnt / loan_data$annual_inc
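The string conversions and the ratio can be tried on a small hand-made data frame before running them on the full file. The sketch below uses only base R. The `toy` data frame and its values are invented for illustration, and the assumption that raw values look like "10.65%" and " 36 months" is inferred from the gsub() patterns in the preprocessing code, not from an inspection of the file.

```r
# Hypothetical toy rows mimicking the assumed raw string formats
toy <- data.frame(
  int_rate   = c("10.65%", "15.27%"),
  term       = c(" 36 months", " 60 months"),
  loan_amnt  = c(5000, 2500),
  annual_inc = c(24000, 30000),
  stringsAsFactors = FALSE
)

# Strip the percent sign, then convert to numeric
toy$int_rate <- as.numeric(gsub("%", "", toy$int_rate))

# Strip the " months" suffix; as.numeric() tolerates the leading space
toy$term <- as.numeric(gsub(" months", "", toy$term))

# Loan-to-income ratio, mirroring the feature-engineering step
toy$loan_to_income <- toy$loan_amnt / toy$annual_inc

# Any NA here would signal a value that did not match the assumed format
stopifnot(!anyNA(toy$int_rate), !anyNA(toy$term))
print(toy)
```

Running the same `anyNA()` checks on `loan_data` after preprocessing would show whether any raw value failed to convert.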
Logistic Regression Model Training
Code
# Split data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(loan_data$loan_status, p = 0.8, list = FALSE)
train_set <- loan_data[train_index, ]
test_set <- loan_data[-train_index, ]

# Logistic Regression Model
logistic_model <- glm(
  loan_status ~ loan_amnt + int_rate + dti + term + grade + home_ownership +
    verification_status + loan_to_income,
  data = train_set, family = binomial
)

# Summary of the logistic regression model
summary(logistic_model)
# Predict on the test set
logistic_probs <- predict(logistic_model, newdata = test_set, type = "response")
test_set$logistic_pred <- ifelse(logistic_probs > 0.5, 1, 0)

# Ensure levels align for confusion matrix
test_set$logistic_pred <- factor(test_set$logistic_pred, levels = c(0, 1))
test_set$loan_status <- factor(test_set$loan_status, levels = c(0, 1))

# Confusion matrix for logistic regression
logistic_confusion <- confusionMatrix(test_set$logistic_pred, test_set$loan_status)
print(logistic_confusion)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 4625  755
         1    9   11
Accuracy : 0.8585
95% CI : (0.8489, 0.8677)
No Information Rate : 0.8581
P-Value [Acc > NIR] : 0.4785
Kappa : 0.0209
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.99806
Specificity : 0.01436
Pos Pred Value : 0.85967
Neg Pred Value : 0.55000
Prevalence : 0.85815
Detection Rate : 0.85648
Detection Prevalence : 0.99630
Balanced Accuracy : 0.50621
'Positive' Class : 0
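Because caret treats class 0 (non-default) as the positive class here, the reported sensitivity describes how well non-defaults are recognised, while specificity describes how well defaults are recognised. A minimal base-R sketch, using only the four counts printed in the confusion matrix, makes this explicit. The variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the logistic regression confusion matrix
# (rows = prediction, columns = reference); matrix() fills column by column
cm <- matrix(c(4625, 9, 755, 11), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / all loans
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # non-defaults correctly predicted
specificity <- cm["1", "1"] / sum(cm[, "1"])  # defaults correctly predicted
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

Read this way, the model flags only 11 of the 766 actual defaults in the test set at the 0.5 threshold, which is why accuracy is almost identical to the no-information rate.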
Code
# ROC and AUC for logistic regression
logistic_roc <- roc(as.numeric(as.character(test_set$loan_status)), logistic_probs)
logistic_auc <- auc(logistic_roc)
cat("Logistic Regression AUC:", logistic_auc, "\n")
Logistic Regression AUC: 0.6946519
Code
plot(logistic_roc, main ="ROC Curve for Logistic Regression", col ="blue")
Decision Tree Model Training
Code
# Decision Tree Model
decision_tree_model <- rpart(
  loan_status ~ loan_amnt + int_rate + dti + term + grade + home_ownership +
    verification_status + loan_to_income,
  data = train_set, method = "class"
)

# Predict on the test set using decision tree
tree_probs <- predict(decision_tree_model, newdata = test_set, type = "prob")
test_set$tree_pred <- ifelse(tree_probs[, 2] > 0.5, 1, 0)

# Ensure levels align for confusion matrix
test_set$tree_pred <- factor(test_set$tree_pred, levels = c(0, 1))

# Confusion matrix for decision tree
tree_confusion <- confusionMatrix(test_set$tree_pred, test_set$loan_status)
print(tree_confusion)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 4634  766
         1    0    0
Accuracy : 0.8581
95% CI : (0.8486, 0.8674)
No Information Rate : 0.8581
P-Value [Acc > NIR] : 0.5096
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 1.0000
Specificity : 0.0000
Pos Pred Value : 0.8581
Neg Pred Value : NaN
Prevalence : 0.8581
Detection Rate : 0.8581
Detection Prevalence : 1.0000
Balanced Accuracy : 0.5000
'Positive' Class : 0
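The tree assigns every test loan to class 0, so its accuracy is simply the share of non-defaults in the test set, which is the no-information rate reported in the output. Using the counts from the confusion matrix:

```latex
\text{Accuracy} = \frac{4634 + 0}{4634 + 766 + 0 + 0} = \frac{4634}{5400} \approx 0.8581,
\qquad
\text{Specificity} = \frac{0}{766} = 0,
\qquad
\text{Neg Pred Value} = \frac{0}{0 + 0} \;\text{(undefined, reported as NaN)}
```

The NaN arises because no loan was predicted as class 1, so the denominator of the negative predictive value is zero.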
Code
# ROC and AUC for decision tree
tree_roc <- roc(as.numeric(as.character(test_set$loan_status)), tree_probs[, 2])
tree_auc <- auc(tree_roc)
cat("Decision Tree AUC:", tree_auc, "\n")
Decision Tree AUC: 0.5
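An AUC of exactly 0.5 is what results when every observation receives the same predicted probability. That is consistent with, though not proof of, a tree that made no splits; the all-zero prediction column in the confusion matrix points the same way. The base-R sketch below illustrates the effect with a rank-based (Mann-Whitney) AUC. `auc_rank` is a hypothetical helper written for this illustration and is not part of pROC, and the labels and scores are made up.

```r
# Rank-based (Mann-Whitney) AUC; tied scores receive average ranks
auc_rank <- function(labels, scores) {
  r     <- rank(scores)  # ties.method = "average" is the default
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 0, 1, 0, 1)  # made-up outcomes

# Same score for everyone, as a tree without splits would produce
constant_scores <- rep(0.15, length(labels))
auc_constant <- auc_rank(labels, constant_scores)

# Scores that rank both positives above every negative
separating_scores <- c(0.1, 0.2, 0.1, 0.9, 0.3, 0.8)
auc_separating <- auc_rank(labels, separating_scores)

print(c(constant = auc_constant, separating = auc_separating))
```

Inspecting the fitted tree, for example with `print(decision_tree_model)` or `rpart.plot(decision_tree_model)`, would show whether it contains any splits.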
Code
plot(tree_roc, main ="ROC Curve for Decision Tree", col ="red")
Compare Model Performance
Code
# Compare Performancecat("\nComparison of Models:\n")
cat("Decision Tree Accuracy:", tree_confusion$overall["Accuracy"], "\n")
Decision Tree Accuracy: 0.8581481
Code
if (logistic_auc > tree_auc) {
  cat("Logistic Regression performs better based on AUC.")
} else if (logistic_auc < tree_auc) {
  cat("Decision Tree performs better based on AUC.")
} else {
  cat("Both models perform equally well based on AUC.")
}