Description:

A Bad Debt or Loan is a monetary amount owed to a creditor that is unlikely to be paid and, or which the creditor is not willing to take action to collect because of various reasons, often due to the debtor not having the money to pay. Iam trying to build a bad loan model that can be used by the investors to easily decide whether to finance the borrower for new loans. I will be using Machine Learning Random Forest or some other classification algorithms.

Libraries:

Loading the required libraries

library(RMySQL)

## Loading required package: DBI

library(DBI)
library(ggplot2)
library(mongolite)
library(RMySQL)
library(DBI)
library(ggplot2)
library(rjson)
library(knitr)
library(stringr)
library(rpart.plot)

## Loading required package: rpart

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(ggplot2)
library(caret)

## Loading required package: lattice

library(rpart)
library(ROSE)

## Loaded ROSE 0.0-3

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(ipred)

Data Description:

Loan data of Lending Club(https://www.lendingclub.com/info/download-data.action) from 2007-2011. I will import the .csv file and use it to build model.

I will also use unemployment rate data from Bureau of Labor Statistics( https://data.bls.gov/map/MapToolServlet).I will load the data to MongoDB.

Data preparation:

The documentation published in the “Data Dictionary”on the Lending Club website was very helpful in understanding and knowing the variables and their description.

The loans data:

loans <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607-Final-Project/master/lending_club_loansd.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
dim(loans)

## [1] 38652   144

The unemployment data:

unemp <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607-Final-Project/master/Unemp_rate_new1.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
unemp

##    addr_state  AL  AK  AZ  AR   CA  CO  CT  DE  DC  FL  GA  HI  ID  IL  IN
## 1 un_emp_rate 8.3 7.4 8.8 7.8 11.2 8.2 8.3 7.2 9.9 9.1 9.8 6.8 7.8 9.4 8.8
##    IA  KS  KY  LA  ME MD  MA  MI  MN  MS  MO  MT  NE   NV  NH  NJ  NM  NY
## 1 5.3 6.1 8.6 7.5 7.7  7 6.8 9.4 5.9 9.4 7.6 6.4 4.2 12.3 5.4 9.2 7.6 8.6
##    NC  ND OH  OK  OR  PA   RI SC  SD  TN  TX UT  VT  VA  WA  WV  WI  WY
## 1 9.8 3.3  8 5.6 9.2 7.8 11.2 10 4.4 8.1 7.2  6 5.2 6.4 8.8 7.5 7.3 5.6

Writing to MongoDB:

Writing the unemployment data to the MongoDB

c=mongo(collection="unemp", db="upemp")
c$drop()
c$insert(unemp)

## List of 5
##  $ nInserted  : num 1
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

alldata <- c$find('{}')
alldata

##    addr_state  AL  AK  AZ  AR   CA  CO  CT  DE  DC  FL  GA  HI  ID  IL  IN
## 1 un_emp_rate 8.3 7.4 8.8 7.8 11.2 8.2 8.3 7.2 9.9 9.1 9.8 6.8 7.8 9.4 8.8
##    IA  KS  KY  LA  ME MD  MA  MI  MN  MS  MO  MT  NE   NV  NH  NJ  NM  NY
## 1 5.3 6.1 8.6 7.5 7.7  7 6.8 9.4 5.9 9.4 7.6 6.4 4.2 12.3 5.4 9.2 7.6 8.6
##    NC  ND OH  OK  OR  PA   RI SC  SD  TN  TX UT  VT  VA  WA  WV  WI  WY
## 1 9.8 3.3  8 5.6 9.2 7.8 11.2 10 4.4 8.1 7.2  6 5.2 6.4 8.8 7.5 7.3 5.6

alldata1 <- gather(alldata, "addr_state", "un_emp_rate")
head(alldata1)

##   addr_state un_emp_rate
## 1         AL         8.3
## 2         AK         7.4
## 3         AZ         8.8
## 4         AR         7.8
## 5         CA        11.2
## 6         CO         8.2

Merge the loans data and unemployment data:

loans_data <- merge(loans, alldata1, by="addr_state", all.x=TRUE)
count(loans_data)

## # A tibble: 1 x 1
##       n
##   <int>
## 1 38652

Cleaning the data with mostly na values:

#remove fields that are mostly NA
pc <- sapply(loans_data, function(x) {
  t1 <- 1 - sum(is.na(x)) / length(x)
  t1 < .8
})
df <- loans_data[,pc==FALSE]
head(df)

##   addr_state loan_amnt funded_amnt funded_amnt_inv       term int_rate
## 1         AK     13200       13200           13200  60 months    12.69
## 2         AK     18750       18750           18650  36 months    10.38
## 3         AK      2000        2000            2000  60 months    16.77
## 4         AK      4000        4000            4000  36 months     8.90
## 5         AK      7650        7650            7650  36 months    19.29
## 6         AK     20000       20000           20000  60 months    14.65
##   installment grade sub_grade                           emp_title
## 1      298.25     B        B5          Bartlett Regional Hospital
## 2      608.37     B        B1 Bristol Bay Area Health Corporation
## 3       49.46     E        E2                        US Air Force
## 4      127.02     A        A5                                 ATT
## 5      281.55     E        E4                      Cape Fox Lodge
## 6      472.14     C        C3                 Todd Communications
##   emp_length home_ownership annual_inc verification_status issue_d
## 1    3 years           RENT      45000        Not Verified  11-Dec
## 2    2 years           RENT      70000     Source Verified  10-Sep
## 3  10+ years           RENT      78000     Source Verified  11-Mar
## 4    2 years       MORTGAGE     112000     Source Verified  11-Oct
## 5    4 years           RENT      39000        Not Verified  11-Sep
## 6    5 years       MORTGAGE      55000     Source Verified  11-Oct
##   loan_status pymnt_plan            purpose            title zip_code
## 1  Fully Paid          n debt_consolidation           E LOAN    998xx
## 2  Fully Paid          n debt_consolidation      Bristol Bay    995xx
## 3  Fully Paid          n debt_consolidation   March '11 Loan    995xx
## 4  Fully Paid          n   home_improvement    Rental repair    998xx
## 5 Charged Off          n        credit_card         debt cos    999xx
## 6  Fully Paid          n   home_improvement Basement Remodel    995xx
##     dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec
## 1 22.93           0            2-Dec              0        5       0
## 2 16.58           0           Nov-97              3       16       0
## 3 24.40           2           Dec-95              1       13       0
## 4  3.19           0            2-Sep              0        5       2
## 5 12.28           0            1-Mar              0        6       0
## 6  4.06           0           Jan-96              1        3       0
##   revol_bal revol_util total_acc initial_list_status out_prncp
## 1      8241       40.2        13                   f         0
## 2     11923       63.1        32                   f         0
## 3     35078       91.6        51                   f         0
## 4       554       39.6        12                   f         0
## 5      8064       99.6         9                   f         0
## 6      9256       51.9        11                   f         0
##   out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int
## 1             0   17559.940        17559.94        13200.00       4359.94
## 2             0   20479.426        20370.21        18750.00       1729.43
## 3             0    2578.851         2578.85         2000.00        578.85
## 4             0    4572.426         4572.43         4000.00        572.43
## 5             0    2343.740         2343.74         1164.94        805.77
## 6             0   23943.238        23943.24        20000.00       3943.24
##   total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d
## 1                  0       0.00                    0.00       15-Sep
## 2                  0       0.00                    0.00       12-Feb
## 3                  0       0.00                    0.00       13-Apr
## 4                  0       0.00                    0.00       14-Oct
## 5                  0     373.03                    3.54       12-Apr
## 6                  0       0.00                    0.00       13-May
##   last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med
## 1         4459.14             16-Aug                          0
## 2          629.34             16-Nov                          0
## 3         1445.49             17-Oct                          0
## 4          132.66             17-Dec                          0
## 5          281.55             16-Oct                          0
## 6        15919.10             18-Feb                          0
##   policy_code application_type acc_now_delinq chargeoff_within_12_mths
## 1           1       Individual              0                        0
## 2           1       Individual              0                        0
## 3           1       Individual              0                        0
## 4           1       Individual              0                        0
## 5           1       Individual              0                        0
## 6           1       Individual              0                        0
##   delinq_amnt pub_rec_bankruptcies tax_liens hardship_flag
## 1           0                    0         0             N
## 2           0                    0         0             N
## 3           0                    0         0             N
## 4           0                    0         0             N
## 5           0                    0         0             N
## 6           0                    0         0             N
##   disbursement_method debt_settlement_flag debt_settlement_flag_date
## 1                Cash                    N                          
## 2                Cash                    N                          
## 3                Cash                    N                          
## 4                Cash                    N                          
## 5                Cash                    N                          
## 6                Cash                    N                          
##   settlement_status settlement_date un_emp_rate
## 1                                           7.4
## 2                                           7.4
## 3                                           7.4
## 4                                           7.4
## 5                                           7.4
## 6                                           7.4

colnames(df)

##  [1] "addr_state"                 "loan_amnt"                 
##  [3] "funded_amnt"                "funded_amnt_inv"           
##  [5] "term"                       "int_rate"                  
##  [7] "installment"                "grade"                     
##  [9] "sub_grade"                  "emp_title"                 
## [11] "emp_length"                 "home_ownership"            
## [13] "annual_inc"                 "verification_status"       
## [15] "issue_d"                    "loan_status"               
## [17] "pymnt_plan"                 "purpose"                   
## [19] "title"                      "zip_code"                  
## [21] "dti"                        "delinq_2yrs"               
## [23] "earliest_cr_line"           "inq_last_6mths"            
## [25] "open_acc"                   "pub_rec"                   
## [27] "revol_bal"                  "revol_util"                
## [29] "total_acc"                  "initial_list_status"       
## [31] "out_prncp"                  "out_prncp_inv"             
## [33] "total_pymnt"                "total_pymnt_inv"           
## [35] "total_rec_prncp"            "total_rec_int"             
## [37] "total_rec_late_fee"         "recoveries"                
## [39] "collection_recovery_fee"    "last_pymnt_d"              
## [41] "last_pymnt_amnt"            "last_credit_pull_d"        
## [43] "collections_12_mths_ex_med" "policy_code"               
## [45] "application_type"           "acc_now_delinq"            
## [47] "chargeoff_within_12_mths"   "delinq_amnt"               
## [49] "pub_rec_bankruptcies"       "tax_liens"                 
## [51] "hardship_flag"              "disbursement_method"       
## [53] "debt_settlement_flag"       "debt_settlement_flag_date" 
## [55] "settlement_status"          "settlement_date"           
## [57] "un_emp_rate"

Now we have complete data in a dataframe, so started with the analysis:

levels(loans_data$loan_status)

## NULL

table(df$loan_status)

## 
## Charged Off  Fully Paid 
##        5493       33159

We are going to take only the data with status as “Charged Off” or “Fully Paid”

df <- subset(df, loan_status %in% c("Charged Off", "Fully Paid" ))
table(df$loan_status)

## 
## Charged Off  Fully Paid 
##        5493       33159

bad_id <- c("Charged Off")

df$bad_loans <- ifelse(df$loan_status %in% bad_id, 1,
                         ifelse(df$loan_status=="", NA,
                                0))
table(df$loan_status)

## 
## Charged Off  Fully Paid 
##        5493       33159

table(df$bad_loans)

## 
##     0     1 
## 33159  5493

Exploratory Analysis:

Checking loan status against the loan grades. Most of the lower grade loans failed to pay back.

table(df$loan_status, df$grade)

##              
##                   A     B     C     D     E     F     G
##   Charged Off   581  1392  1313  1089   707   313    98
##   Fully Paid   9152 10313  6590  4096  2084   710   214

ggplot(df, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Similarly loan status against the unemployment rate.

table(df$loan_status, df$un_emp_rate)

##              
##                4.2  4.4  5.2  5.3  5.4  5.6  5.9    6  6.1  6.4  6.8    7
##   Charged Off    3   12    5    0   24   41   78   40   31  179  181  159
##   Fully Paid     2   51   46    5  146  327  518  213  236 1265 1295  859
##              
##                7.2  7.3  7.4  7.5  7.6  7.7  7.8    8  8.1  8.2  8.3  8.6
##   Charged Off  318   61   15   71  142    0  206  147    2   94  145  533
##   Fully Paid  2427  386   64  523  707    2 1508 1043   13  669 1022 3498
##              
##                8.8  9.1  9.2  9.4  9.8  9.9   10 11.2 12.3
##   Charged Off  244  485  343  296  320   15   66 1130  107
##   Fully Paid  1440 2308 1902 1909 1814  191  389 6004  377

ggplot(df, aes(x = un_emp_rate))+ geom_histogram(aes(fill = un_emp_rate)) + facet_wrap(~loan_status, ncol = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finding out the numeric columns:

We need to find all the numeric data that we can utilize to decide on the predictors, so iam doing numeric analysis. Ploting the graph for clear understanding.

numeric_cols <- sapply(df, is.numeric)
df.lng <- melt(df[,numeric_cols], id ="bad_loans")
head(df.lng)

##   bad_loans  variable value
## 1         0 loan_amnt 13200
## 2         0 loan_amnt 18750
## 3         0 loan_amnt  2000
## 4         0 loan_amnt  4000
## 5         1 loan_amnt  7650
## 6         0 loan_amnt 20000

p <- ggplot(aes(x=value, group=bad_loans, colour=factor(bad_loans)), data=df.lng)
p + geom_density() +
  facet_wrap(~variable, scales="free")

## Warning: Removed 797 rows containing non-finite values (stat_density).

Removing outliers:

We need to remove the outliers so that it should not impact by making wrong predictions.

#Removing annual_inc outliers

summary(df$annual_inc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4000   40800   59450   68985   82404 6000000

inc_outliers <- which(df$annual_inc > 1000000) 
df <- df[-inc_outliers,]

Now the data is ready for model development. First i will try logistic regression and then if not satisfied we can use some of the advanced models like Random forest.

Building the Models

#Sampling the data
set.seed(123)

sample <- runif(nrow(df)) > 0.70
train <- df[sample==FALSE,]
test <- df[sample==TRUE,]

table(train$bad_loans)

## 
##     0     1 
## 23291  3948

#Building the logistic regression model

logistic_regressor <- glm(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti +
revol_bal + revol_util + total_acc + un_emp_rate, family = "binomial", data = train)
summary(logistic_regressor)

## 
## Call:
## glm(formula = bad_loans ~ loan_amnt + int_rate + installment + 
##     annual_inc + dti + revol_bal + revol_util + total_acc + un_emp_rate, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2595  -0.5964  -0.4754  -0.3575   3.7732  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.939e+00  1.300e-01 -30.307  < 2e-16 ***
## loan_amnt    5.636e-05  5.964e-06   9.449  < 2e-16 ***
## int_rate     1.435e-01  5.779e-03  24.832  < 2e-16 ***
## installment -1.904e-03  2.182e-04  -8.725  < 2e-16 ***
## annual_inc  -7.256e-06  6.606e-07 -10.985  < 2e-16 ***
## dti         -1.405e-03  3.089e-03  -0.455 0.649208    
## revol_bal    3.086e-06  1.467e-06   2.103 0.035450 *  
## revol_util   2.693e-03  8.094e-04   3.327 0.000877 ***
## total_acc    3.092e-03  1.840e-03   1.680 0.092880 .  
## un_emp_rate  6.133e-02  1.103e-02   5.559 2.71e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22545  on 27238  degrees of freedom
## Residual deviance: 21189  on 27229  degrees of freedom
## AIC: 21209
## 
## Number of Fisher Scoring iterations: 5

#Predicting on test data
pred <- predict(logistic_regressor, newdata = test, type = "response")
summary(pred)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0003209 0.0827812 0.1274705 0.1439276 0.1837348 0.5560589

pred_co <- ifelse(pred > 0.8, 1,0) 
table(test$bad_loans,pred_co )

##    pred_co
##        0
##   0 9856
##   1 1544

pred_t <- prediction(pred_co,test$bad_loans)
perfm <- performance(pred_t, "tpr", "fpr")

#Printing AUC Value
perfm <- performance(pred_t, "auc")
print(perfm@y.values[[1]])

## [1] 0.5

#ROC curve
roc.curve(test$bad_loans, pred_co,col="red", main="The ROC-curve for Model")

## Area under the curve (AUC): 0.500

text(0.6,0.2,paste("AUC=0.5"))

The AUC is just 0.5 so we need to resample the data so that we can have balanced samples. Iam improving the sampling by balancing between good and bad loans.

#Improving the sampling by balancing between good and bad
improved_train <- ROSE(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti +
revol_bal + revol_util + total_acc + un_emp_rate, data = train, seed = 1)$data
table(improved_train$bad_loans)

## 
##     0     1 
## 13721 13518

#Building new logistic regression model
improved_regressor <- glm(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti +
revol_bal + revol_util + total_acc + un_emp_rate, family = "binomial", data = improved_train)
summary(improved_regressor)

## 
## Call:
## glm(formula = bad_loans ~ loan_amnt + int_rate + installment + 
##     annual_inc + dti + revol_bal + revol_util + total_acc + un_emp_rate, 
##     family = "binomial", data = improved_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0128  -1.0986  -0.6291   1.1082   2.7041  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.037e+00  8.381e-02 -24.302  < 2e-16 ***
## loan_amnt    1.860e-05  2.373e-06   7.840 4.50e-15 ***
## int_rate     1.247e-01  3.594e-03  34.689  < 2e-16 ***
## installment -5.879e-04  8.498e-05  -6.918 4.58e-12 ***
## annual_inc  -4.285e-06  3.158e-07 -13.569  < 2e-16 ***
## dti          3.750e-03  1.861e-03   2.015   0.0439 *  
## revol_bal    8.999e-07  8.519e-07   1.056   0.2908    
## revol_util   2.741e-03  4.800e-04   5.710 1.13e-08 ***
## total_acc   -5.471e-04  1.109e-03  -0.493   0.6219    
## un_emp_rate  5.515e-02  7.148e-03   7.715 1.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37760  on 27238  degrees of freedom
## Residual deviance: 35519  on 27229  degrees of freedom
## AIC: 35539
## 
## Number of Fisher Scoring iterations: 4

#Making predictions on test set

improved_pred <- predict(improved_regressor, newdata = test, type="response")
hist(improved_pred)

#Evaluating new model
roc.curve(test$bad_loans, improved_pred, col="dark red", main="The ROC-curve for Improved Model")

## Area under the curve (AUC): 0.674

text(0.6,0.2,paste("AUC=0.672"))

I will now try Random forest to see if i can make better predictions.

#Using random forest
rf <- randomForest(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti  + revol_bal + revol_util + total_acc 
+ un_emp_rate, type="classification", data=improved_train, importance=TRUE, na.action=na.omit)

## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

rf_pred <- predict(rf, newdata = test, type="response")
hist(rf_pred)

#Evaluating new model
roc.curve(test$bad_loans, rf_pred, col="dark red", main="The ROC-curve for Improved Model")

## Area under the curve (AUC): 0.667

text(0.6,0.2,paste("AUC=0.667"))

Conclusions:

I have developed a model using logistic regression and Random Forest to predict if a borrower will repay the loan based on historical data provided by Lending Club and to help investors when deciding which investment strategy to choose.

I think both Logistic and Random Forest Model have similar outcomes. I would say the data is very imbalanced, so very difficult to predict with high accuracy.

Data607 Final Project

Ritesh Lohiya

May 10, 2018