Descriptive Analysis

This section presents descriptive statistics for the dataset. The raw dataset contains 27,003 loans and 48 variables; the main variables are described below.

Variable Name         Description
loan_amnt             The total amount of the loan applied for by the borrower.
term                  The duration of the loan in months (e.g., 36 or 60 months).
int_rate              The interest rate on the loan, expressed as a percentage.
installment           The monthly payment owed by the borrower if the loan is funded.
grade                 Loan grade assigned by the lending institution, typically from A to G.
sub_grade             More granular breakdown of the loan grade (e.g., A1, A2, B1).
emp_title             The job title supplied by the borrower.
emp_length            The length of employment in years.
home_ownership        The home ownership status provided by the borrower (e.g., Rent, Own).
annual_inc            The self-reported annual income provided by the borrower.
verification_status   Indicates if income was verified by the lender.
issue_d               The month and year when the loan was funded.
loan_status           Current status of the loan (e.g., Fully Paid, Charged Off).
purpose               A category provided by the borrower for the loan request.
title                 The loan title provided by the borrower.
dti                   Debt-to-income ratio calculated using the borrower’s monthly obligations.
earliest_cr_line      The month the borrower’s earliest reported credit line was opened.
open_acc              The number of open credit lines in the borrower’s credit file.
pub_rec               Number of derogatory public records.
revol_bal             Total credit revolving balance.
revol_util            Revolving line utilization rate, or the amount of credit used relative to all available revolving credit.
total_acc             The total number of credit lines currently in the borrower’s credit file.
initial_list_status   The initial listing status of the loan.
application_type      Indicates whether the loan is an individual application or a joint application with two co-borrowers.
mort_acc              Number of mortgage accounts.
pub_rec_bankruptcies  Number of public record bankruptcies.
Code
# Load necessary libraries
library(tidyverse)  # For data manipulation and visualization
library(lubridate)  # For handling date variables
library(caret)      # For splitting data and confusion matrix
library(rpart)      # For decision tree modeling
library(rpart.plot) # For visualizing decision trees
library(pROC)       # For ROC-AUC calculations and plots

# Load the dataset
loan_data <- read.csv("loan_defaults.csv")  # Combined dataset

# Descriptive stats of the dataset
summary(loan_data)
       id            member_id         loan_amnt      funded_amnt   
 Min.   :  54734   Min.   :  70699   Min.   :  500   Min.   :  500  
 1st Qu.: 512663   1st Qu.: 661776   1st Qu.: 5300   1st Qu.: 5175  
 Median : 657052   Median : 840072   Median : 9750   Median : 9600  
 Mean   : 676301   Mean   : 842247   Mean   :11088   Mean   :10816  
 3rd Qu.: 827102   3rd Qu.:1035238   3rd Qu.:15000   3rd Qu.:15000  
 Max.   :1077430   Max.   :1314167   Max.   :35000   Max.   :35000  
                                                                    
 funded_amnt_inv     term             int_rate          installment     
 Min.   :    0   Length:27003       Length:27003       Min.   :  15.69  
 1st Qu.: 5000   Class :character   Class :character   1st Qu.: 165.81  
 Median : 8731   Mode  :character   Mode  :character   Median : 277.57  
 Mean   :10251                                         Mean   : 323.38  
 3rd Qu.:14025                                         3rd Qu.: 427.00  
 Max.   :35000                                         Max.   :1302.69  
                                                                        
    grade            sub_grade          emp_title          emp_length       
 Length:27003       Length:27003       Length:27003       Length:27003      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 home_ownership       annual_inc      verification_status   issue_d         
 Length:27003       Min.   :   4000   Length:27003        Length:27003      
 Class :character   1st Qu.:  40000   Class :character    Class :character  
 Mode  :character   Median :  59000   Mode  :character    Mode  :character  
                    Mean   :  68778                                         
                    3rd Qu.:  82000                                         
                    Max.   :6000000                                         
                                                                            
  loan_status         url                desc             purpose         
 Min.   :0.0000   Length:27003       Length:27003       Length:27003      
 1st Qu.:0.0000   Class :character   Class :character   Class :character  
 Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.1473                                                           
 3rd Qu.:0.0000                                                           
 Max.   :1.0000                                                           
                                                                          
    title             zip_code          addr_state             dti       
 Length:27003       Length:27003       Length:27003       Min.   : 0.00  
 Class :character   Class :character   Class :character   1st Qu.: 8.13  
 Mode  :character   Mode  :character   Mode  :character   Median :13.35  
                                                          Mean   :13.26  
                                                          3rd Qu.:18.52  
                                                          Max.   :29.99  
                                                                         
  delinq_2yrs      earliest_cr_line   inq_last_6mths   mths_since_last_delinq
 Min.   : 0.0000   Length:27003       Min.   :0.0000   Min.   :  0.0         
 1st Qu.: 0.0000   Class :character   1st Qu.:0.0000   1st Qu.: 18.0         
 Median : 0.0000   Mode  :character   Median :1.0000   Median : 34.0         
 Mean   : 0.1492                      Mean   :0.8711   Mean   : 35.7         
 3rd Qu.: 0.0000                      3rd Qu.:1.0000   3rd Qu.: 52.0         
 Max.   :11.0000                      Max.   :8.0000   Max.   :120.0         
                                                       NA's   :17395         
 mths_since_last_record    open_acc         pub_rec          revol_bal     
 Min.   :  0.00         Min.   : 2.000   Min.   :0.00000   Min.   :     0  
 1st Qu.:  8.00         1st Qu.: 6.000   1st Qu.:0.00000   1st Qu.:  3647  
 Median : 90.00         Median : 9.000   Median :0.00000   Median :  8808  
 Mean   : 68.99         Mean   : 9.269   Mean   :0.05585   Mean   : 13341  
 3rd Qu.:104.00         3rd Qu.:12.000   3rd Qu.:0.00000   3rd Qu.: 17008  
 Max.   :129.00         Max.   :44.000   Max.   :4.00000   Max.   :149588  
 NA's   :25065                                                             
  revol_util          total_acc       out_prncp out_prncp_inv  total_pymnt   
 Length:27003       Min.   : 2.00   Min.   :0   Min.   :0     Min.   :    0  
 Class :character   1st Qu.:13.00   1st Qu.:0   1st Qu.:0     1st Qu.: 5514  
 Mode  :character   Median :20.00   Median :0   Median :0     Median : 9690  
                    Mean   :22.08   Mean   :0   Mean   :0     Mean   :11874  
                    3rd Qu.:29.00   3rd Qu.:0   3rd Qu.:0     3rd Qu.:16112  
                    Max.   :81.00   Max.   :0   Max.   :0     Max.   :58564  
                                                                             
 total_pymnt_inv total_rec_prncp total_rec_int     total_rec_late_fee
 Min.   :    0   Min.   :    0   Min.   :    0.0   Min.   :  0.000   
 1st Qu.: 5019   1st Qu.: 4500   1st Qu.:  641.7   1st Qu.:  0.000   
 Median : 9064   Median : 8000   Median : 1294.6   Median :  0.000   
 Mean   :11278   Mean   : 9657   Mean   : 2118.8   Mean   :  1.364   
 3rd Qu.:15337   3rd Qu.:13135   3rd Qu.: 2672.3   3rd Qu.:  0.000   
 Max.   :58564   Max.   :35000   Max.   :23563.7   Max.   :165.690   
                                                                     
   recoveries       collection_recovery_fee last_pymnt_d      
 Min.   :    0.00   Min.   :   0.00         Length:27003      
 1st Qu.:    0.00   1st Qu.:   0.00         Class :character  
 Median :    0.00   Median :   0.00         Mode  :character  
 Mean   :   96.44   Mean   :  12.29                           
 3rd Qu.:    0.00   3rd Qu.:   0.00                           
 Max.   :29623.35   Max.   :7002.19                           
                                                              
 last_pymnt_amnt   last_credit_pull_d pub_rec_bankruptcies
 Min.   :    0.0   Length:27003       Min.   :0.000       
 1st Qu.:  215.9   Class :character   1st Qu.:0.000       
 Median :  571.3   Mode  :character   Median :0.000       
 Mean   : 2754.9                      Mean   :0.044       
 3rd Qu.: 3459.6                      3rd Qu.:0.000       
 Max.   :36115.2                      Max.   :2.000       
                                      NA's   :503         
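
The summary reports missing values in three columns: 17,395 for mths_since_last_delinq, 25,065 for mths_since_last_record, and 503 for pub_rec_bankruptcies (roughly 64%, 93%, and 2% of the 27,003 rows). A minimal sketch of how such counts could be tabulated per column is shown below. It runs on a small hypothetical data frame named toy with made-up values, not on loan_data itself.

```r
# Hypothetical toy data standing in for loan_data (values are made up).
toy <- data.frame(
  loan_amnt = c(5000, 12000, 8000, 20000),
  mths_since_last_delinq = c(NA, 34, NA, NA),
  pub_rec_bankruptcies = c(0, NA, 0, 1)
)

# Count and share of missing values per column, most-missing first.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)
na_table <- data.frame(n_missing = na_count, share = na_share)
na_table <- na_table[order(-na_table$n_missing), ]
print(na_table)
```

Applied to the real data, the same two lines (colSums and the division by nrow) would be run on loan_data in place of toy.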

Data Pre-Processing and Transformation

Data cleaning involved converting variables to numeric format, parsing dates, encoding categorical variables as factors, and engineering one new feature (the loan-to-income ratio).

Code
# Preprocess the data
# Convert interest rate to numeric (remove the % sign and convert to numeric)
loan_data$int_rate <- as.numeric(gsub("%", "", loan_data$int_rate))

# Convert term to numeric (remove " months" and convert to numeric)
loan_data$term <- as.numeric(gsub(" months", "", loan_data$term))

# Parse the issue date (month-year strings) into a date-time;
# parse_date_time() returns POSIXct. earliest_cr_line is left as character here.
loan_data$issue_d <- parse_date_time(loan_data$issue_d, orders = "my")

# Encode categorical variables as factors
loan_data$grade <- as.factor(loan_data$grade)
loan_data$home_ownership <- as.factor(loan_data$home_ownership)
loan_data$verification_status <- as.factor(loan_data$verification_status)

# Feature engineering: Create loan-to-income ratio
loan_data$loan_to_income <- loan_data$loan_amnt / loan_data$annual_inc
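
The same conversion pattern could be applied to other text-formatted columns, such as revol_util, which the summary still shows as character. The sketch below is illustrative only: pct_to_numeric is a hypothetical helper, and the string formats in toy (for example "83.7%" and " 36 months") are assumptions about the raw file, not values taken from it.

```r
# Hypothetical helper: strip a trailing "%" and convert to numeric.
pct_to_numeric <- function(x) {
  as.numeric(sub("%$", "", trimws(x)))
}

# Toy values in the assumed raw format (not taken from the dataset).
toy <- data.frame(
  int_rate = c("10.65%", " 15.27%", "7.9%"),
  revol_util = c("83.7%", "9.4%", NA),
  term = c(" 36 months", " 60 months", " 36 months"),
  stringsAsFactors = FALSE
)

toy$int_rate <- pct_to_numeric(toy$int_rate)
toy$revol_util <- pct_to_numeric(toy$revol_util)
# Keep only the digits of the term, then convert to numeric.
toy$term <- as.numeric(gsub("[^0-9]", "", toy$term))

str(toy)
# Check how many values are NA after conversion; the single NA in
# revol_util was already missing in the input.
colSums(is.na(toy))
```

Comparing NA counts before and after a conversion is a quick way to catch strings that as.numeric() could not parse.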

Logistic Regression Model Training

Code
# Split data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(loan_data$loan_status, p = 0.8, list = FALSE)
train_set <- loan_data[train_index, ]
test_set <- loan_data[-train_index, ]

# Logistic Regression Model
logistic_model <- glm(loan_status ~ loan_amnt + int_rate + dti + term + grade + home_ownership + verification_status + loan_to_income,
                      data = train_set, family = binomial)

# Summary of the logistic regression model
summary(logistic_model)

Call:
glm(formula = loan_status ~ loan_amnt + int_rate + dti + term + 
    grade + home_ownership + verification_status + loan_to_income, 
    family = binomial, data = train_set)

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -4.767e+00  1.640e-01 -29.070  < 2e-16 ***
loan_amnt                          -3.807e-05  3.820e-06  -9.966  < 2e-16 ***
int_rate                            1.477e-01  1.931e-02   7.648 2.03e-14 ***
dti                                 4.654e-03  3.061e-03   1.520    0.128    
term                                2.215e-02  2.040e-03  10.857  < 2e-16 ***
gradeB                              1.245e-01  9.624e-02   1.293    0.196    
gradeC                              1.259e-01  1.355e-01   0.929    0.353    
gradeD                              8.758e-02  1.737e-01   0.504    0.614    
gradeE                             -4.861e-02  2.107e-01  -0.231    0.818    
gradeF                             -1.440e-01  2.554e-01  -0.564    0.573    
gradeG                             -7.661e-02  3.117e-01  -0.246    0.806    
home_ownershipNONE                 -9.278e+00  1.125e+02  -0.082    0.934    
home_ownershipOTHER                 4.840e-02  4.189e-01   0.116    0.908    
home_ownershipOWN                  -1.964e-02  7.808e-02  -0.252    0.801    
home_ownershipRENT                 -4.836e-02  4.394e-02  -1.101    0.271    
verification_statusSource Verified -7.551e-02  5.189e-02  -1.455    0.146    
verification_statusVerified        -2.721e-02  5.169e-02  -0.526    0.599    
loan_to_income                      2.707e+00  2.090e-01  12.952  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 18161  on 21602  degrees of freedom
Residual deviance: 16839  on 21585  degrees of freedom
AIC: 16875

Number of Fisher Scoring iterations: 10
Code
# Predict on the test set
logistic_probs <- predict(logistic_model, newdata = test_set, type = "response")
test_set$logistic_pred <- ifelse(logistic_probs > 0.5, 1, 0)

# Ensure levels align for confusion matrix
test_set$logistic_pred <- factor(test_set$logistic_pred, levels = c(0, 1))
test_set$loan_status <- factor(test_set$loan_status, levels = c(0, 1))

# Confusion matrix for logistic regression
logistic_confusion <- confusionMatrix(test_set$logistic_pred, test_set$loan_status)
print(logistic_confusion)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4625  755
         1    9   11
                                          
               Accuracy : 0.8585          
                 95% CI : (0.8489, 0.8677)
    No Information Rate : 0.8581          
    P-Value [Acc > NIR] : 0.4785          
                                          
                  Kappa : 0.0209          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.99806         
            Specificity : 0.01436         
         Pos Pred Value : 0.85967         
         Neg Pred Value : 0.55000         
             Prevalence : 0.85815         
         Detection Rate : 0.85648         
   Detection Prevalence : 0.99630         
      Balanced Accuracy : 0.50621         
                                          
       'Positive' Class : 0               
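
Because the positive class in this output is 0, the reported sensitivity describes how well the model identifies class 0, not class 1. The sketch below recomputes recall, precision, and F1 for class 1 directly from the four counts in the matrix above. It assumes that loan_status = 1 marks a default, which the source does not state explicitly.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(4625, 9, 755, 11), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference = c("0", "1")))

# Treat class 1 as the class of interest (assumed to mean default).
tp <- cm["1", "1"]  # predicted 1, actually 1
fn <- cm["0", "1"]  # predicted 0, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0

recall <- tp / (tp + fn)
precision <- tp / (tp + fp)
f1 <- 2 * precision * recall / (precision + recall)

round(c(recall = recall, precision = precision, f1 = f1), 4)
# recall 0.0144, precision 0.5500, f1 0.0280
```

The recall of 0.0144 matches the Specificity line in the output, and the precision of 0.55 matches Neg Pred Value. At the 0.5 threshold the model flags only 20 of 5,400 test loans as class 1. Passing positive = "1" to caret's confusionMatrix() would report these figures under their usual names.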
                                          
Code
# ROC and AUC for logistic regression
logistic_roc <- roc(as.numeric(as.character(test_set$loan_status)), logistic_probs)
logistic_auc <- auc(logistic_roc)
cat("Logistic Regression AUC:", logistic_auc, "\n")
Logistic Regression AUC: 0.6946519 
Code
plot(logistic_roc, main = "ROC Curve for Logistic Regression", col = "blue")

Decision Tree Model Training

Code
# Decision Tree Model
decision_tree_model <- rpart(loan_status ~ loan_amnt + int_rate + dti + term + grade + home_ownership + verification_status + loan_to_income,
                             data = train_set, method = "class")

# Predict on the test set using decision tree
tree_probs <- predict(decision_tree_model, newdata = test_set, type = "prob")
test_set$tree_pred <- ifelse(tree_probs[, 2] > 0.5, 1, 0)

# Ensure levels align for confusion matrix
test_set$tree_pred <- factor(test_set$tree_pred, levels = c(0, 1))

# Confusion matrix for decision tree
tree_confusion <- confusionMatrix(test_set$tree_pred, test_set$loan_status)
print(tree_confusion)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4634  766
         1    0    0
                                          
               Accuracy : 0.8581          
                 95% CI : (0.8486, 0.8674)
    No Information Rate : 0.8581          
    P-Value [Acc > NIR] : 0.5096          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.0000          
         Pos Pred Value : 0.8581          
         Neg Pred Value :    NaN          
             Prevalence : 0.8581          
         Detection Rate : 0.8581          
   Detection Prevalence : 1.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : 0               
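
The tree predicts class 0 for every test loan. With a rare positive class, one common cause is that no split reduces error enough to clear rpart's default complexity threshold (cp = 0.01), so the tree may be left with only a root node. The source does not show whether that happened here. The sketch below illustrates one possible remedy on simulated data: equal class priors passed through the parms argument. The data frame sim and its coefficients are invented for illustration and are unrelated to the loan data.

```r
library(rpart)

set.seed(1)
n <- 5000
x <- rnorm(n)
# Simulated rare binary outcome driven by x (not the loan data).
y <- factor(rbinom(n, 1, plogis(-2.5 + 1.2 * x)), levels = c(0, 1))
sim <- data.frame(x = x, y = y)

# Default settings: a split is kept only if it improves the fit
# by at least cp = 0.01.
fit_default <- rpart(y ~ x, data = sim, method = "class")

# Equal class priors make errors on the rare class count as much as
# errors on the common class, which encourages splits.
fit_prior <- rpart(y ~ x, data = sim, method = "class",
                   parms = list(prior = c(0.5, 0.5)))

# Number of nodes in each tree (1 means root only, i.e. no splits).
c(default = nrow(fit_default$frame), prior = nrow(fit_prior$frame))

# Predicted versus actual classes under the reweighted tree.
table(Prediction = predict(fit_prior, sim, type = "class"),
      Reference = sim$y)
```

Reweighting trades overall accuracy for recall on the rare class, so the result should be judged on class-specific metrics or AUC rather than accuracy alone.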
                                          
Code
# ROC and AUC for decision tree
tree_roc <- roc(as.numeric(as.character(test_set$loan_status)), tree_probs[, 2])
tree_auc <- auc(tree_roc)
cat("Decision Tree AUC:", tree_auc, "\n")
Decision Tree AUC: 0.5 
Code
plot(tree_roc, main = "ROC Curve for Decision Tree", col = "red")

Compare Model Performance

Code
# Compare Performance
cat("\nComparison of Models:\n")

Comparison of Models:
Code
cat("Logistic Regression Accuracy:", logistic_confusion$overall["Accuracy"], "\n")
Logistic Regression Accuracy: 0.8585185 
Code
cat("Decision Tree Accuracy:", tree_confusion$overall["Accuracy"], "\n")
Decision Tree Accuracy: 0.8581481 
Code
if (logistic_auc > tree_auc) {
  cat("Logistic Regression performs better based on AUC.")
} else if (logistic_auc < tree_auc) {
  cat("Decision Tree performs better based on AUC.")
} else {
  cat("Both models perform equally well based on AUC.")
}
Logistic Regression performs better based on AUC.
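
The accuracy figures are easier to interpret next to the no-information rate, the accuracy obtained by always predicting class 0. The sketch below gathers the reported numbers into one table. The counts and AUC values are copied from the outputs above, and the column name accuracy_lift is chosen here for illustration.

```r
# Test-set size and counts taken from the confusion matrices above.
n_test <- 5400
nir <- 4634 / n_test  # share of class 0 in the test set

results <- data.frame(
  model = c("Logistic Regression", "Decision Tree"),
  accuracy = c(4625 + 11, 4634 + 0) / n_test,
  auc = c(0.6946519, 0.5)
)

# Accuracy gain over always predicting the majority class.
results$accuracy_lift <- results$accuracy - nir

# Round only the numeric columns for display.
results[, -1] <- round(results[, -1], 4)
results[order(-results$auc), ]
```

Logistic regression classifies only two more test loans correctly than the majority-class baseline, and the tree classifies none more, so AUC is the more informative of the two comparisons here.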