A Bad Debt or Loan is a monetary amount owed to a creditor that is unlikely to be paid and, or which the creditor is not willing to take action to collect because of various reasons, often due to the debtor not having the money to pay. Iam trying to build a bad loan model that can be used by the investors to easily decide whether to finance the borrower for new loans. I will be using Machine Learning Random Forest or some other classification algorithms.
Loading the required libraries
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(ggplot2)
library(mongolite)
library(RMySQL)
library(DBI)
library(ggplot2)
library(rjson)
library(knitr)
library(stringr)
library(rpart.plot)
## Loading required package: rpart
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
library(caret)
## Loading required package: lattice
library(rpart)
library(ROSE)
## Loaded ROSE 0.0-3
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ipred)
Loan data of Lending Club(https://www.lendingclub.com/info/download-data.action) from 2007-2011. I will import the .csv file and use it to build model.
I will also use unemployment rate data from Bureau of Labor Statistics( https://data.bls.gov/map/MapToolServlet).I will load the data to MongoDB.
The documentation published in the “Data Dictionary”on the Lending Club website was very helpful in understanding and knowing the variables and their description.
loans <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607-Final-Project/master/lending_club_loansd.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
dim(loans)
## [1] 38652 144
unemp <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607-Final-Project/master/Unemp_rate_new1.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
unemp
## addr_state AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN
## 1 un_emp_rate 8.3 7.4 8.8 7.8 11.2 8.2 8.3 7.2 9.9 9.1 9.8 6.8 7.8 9.4 8.8
## IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY
## 1 5.3 6.1 8.6 7.5 7.7 7 6.8 9.4 5.9 9.4 7.6 6.4 4.2 12.3 5.4 9.2 7.6 8.6
## NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
## 1 9.8 3.3 8 5.6 9.2 7.8 11.2 10 4.4 8.1 7.2 6 5.2 6.4 8.8 7.5 7.3 5.6
Writing the unemployment data to the MongoDB
c=mongo(collection="unemp", db="upemp")
c$drop()
c$insert(unemp)
## List of 5
## $ nInserted : num 1
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
alldata <- c$find('{}')
alldata
## addr_state AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN
## 1 un_emp_rate 8.3 7.4 8.8 7.8 11.2 8.2 8.3 7.2 9.9 9.1 9.8 6.8 7.8 9.4 8.8
## IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY
## 1 5.3 6.1 8.6 7.5 7.7 7 6.8 9.4 5.9 9.4 7.6 6.4 4.2 12.3 5.4 9.2 7.6 8.6
## NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
## 1 9.8 3.3 8 5.6 9.2 7.8 11.2 10 4.4 8.1 7.2 6 5.2 6.4 8.8 7.5 7.3 5.6
alldata1 <- gather(alldata, "addr_state", "un_emp_rate")
head(alldata1)
## addr_state un_emp_rate
## 1 AL 8.3
## 2 AK 7.4
## 3 AZ 8.8
## 4 AR 7.8
## 5 CA 11.2
## 6 CO 8.2
loans_data <- merge(loans, alldata1, by="addr_state", all.x=TRUE)
count(loans_data)
## # A tibble: 1 x 1
## n
## <int>
## 1 38652
#remove fields that are mostly NA
pc <- sapply(loans_data, function(x) {
t1 <- 1 - sum(is.na(x)) / length(x)
t1 < .8
})
df <- loans_data[,pc==FALSE]
head(df)
## addr_state loan_amnt funded_amnt funded_amnt_inv term int_rate
## 1 AK 13200 13200 13200 60 months 12.69
## 2 AK 18750 18750 18650 36 months 10.38
## 3 AK 2000 2000 2000 60 months 16.77
## 4 AK 4000 4000 4000 36 months 8.90
## 5 AK 7650 7650 7650 36 months 19.29
## 6 AK 20000 20000 20000 60 months 14.65
## installment grade sub_grade emp_title
## 1 298.25 B B5 Bartlett Regional Hospital
## 2 608.37 B B1 Bristol Bay Area Health Corporation
## 3 49.46 E E2 US Air Force
## 4 127.02 A A5 ATT
## 5 281.55 E E4 Cape Fox Lodge
## 6 472.14 C C3 Todd Communications
## emp_length home_ownership annual_inc verification_status issue_d
## 1 3 years RENT 45000 Not Verified 11-Dec
## 2 2 years RENT 70000 Source Verified 10-Sep
## 3 10+ years RENT 78000 Source Verified 11-Mar
## 4 2 years MORTGAGE 112000 Source Verified 11-Oct
## 5 4 years RENT 39000 Not Verified 11-Sep
## 6 5 years MORTGAGE 55000 Source Verified 11-Oct
## loan_status pymnt_plan purpose title zip_code
## 1 Fully Paid n debt_consolidation E LOAN 998xx
## 2 Fully Paid n debt_consolidation Bristol Bay 995xx
## 3 Fully Paid n debt_consolidation March '11 Loan 995xx
## 4 Fully Paid n home_improvement Rental repair 998xx
## 5 Charged Off n credit_card debt cos 999xx
## 6 Fully Paid n home_improvement Basement Remodel 995xx
## dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec
## 1 22.93 0 2-Dec 0 5 0
## 2 16.58 0 Nov-97 3 16 0
## 3 24.40 2 Dec-95 1 13 0
## 4 3.19 0 2-Sep 0 5 2
## 5 12.28 0 1-Mar 0 6 0
## 6 4.06 0 Jan-96 1 3 0
## revol_bal revol_util total_acc initial_list_status out_prncp
## 1 8241 40.2 13 f 0
## 2 11923 63.1 32 f 0
## 3 35078 91.6 51 f 0
## 4 554 39.6 12 f 0
## 5 8064 99.6 9 f 0
## 6 9256 51.9 11 f 0
## out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int
## 1 0 17559.940 17559.94 13200.00 4359.94
## 2 0 20479.426 20370.21 18750.00 1729.43
## 3 0 2578.851 2578.85 2000.00 578.85
## 4 0 4572.426 4572.43 4000.00 572.43
## 5 0 2343.740 2343.74 1164.94 805.77
## 6 0 23943.238 23943.24 20000.00 3943.24
## total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d
## 1 0 0.00 0.00 15-Sep
## 2 0 0.00 0.00 12-Feb
## 3 0 0.00 0.00 13-Apr
## 4 0 0.00 0.00 14-Oct
## 5 0 373.03 3.54 12-Apr
## 6 0 0.00 0.00 13-May
## last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med
## 1 4459.14 16-Aug 0
## 2 629.34 16-Nov 0
## 3 1445.49 17-Oct 0
## 4 132.66 17-Dec 0
## 5 281.55 16-Oct 0
## 6 15919.10 18-Feb 0
## policy_code application_type acc_now_delinq chargeoff_within_12_mths
## 1 1 Individual 0 0
## 2 1 Individual 0 0
## 3 1 Individual 0 0
## 4 1 Individual 0 0
## 5 1 Individual 0 0
## 6 1 Individual 0 0
## delinq_amnt pub_rec_bankruptcies tax_liens hardship_flag
## 1 0 0 0 N
## 2 0 0 0 N
## 3 0 0 0 N
## 4 0 0 0 N
## 5 0 0 0 N
## 6 0 0 0 N
## disbursement_method debt_settlement_flag debt_settlement_flag_date
## 1 Cash N
## 2 Cash N
## 3 Cash N
## 4 Cash N
## 5 Cash N
## 6 Cash N
## settlement_status settlement_date un_emp_rate
## 1 7.4
## 2 7.4
## 3 7.4
## 4 7.4
## 5 7.4
## 6 7.4
colnames(df)
## [1] "addr_state" "loan_amnt"
## [3] "funded_amnt" "funded_amnt_inv"
## [5] "term" "int_rate"
## [7] "installment" "grade"
## [9] "sub_grade" "emp_title"
## [11] "emp_length" "home_ownership"
## [13] "annual_inc" "verification_status"
## [15] "issue_d" "loan_status"
## [17] "pymnt_plan" "purpose"
## [19] "title" "zip_code"
## [21] "dti" "delinq_2yrs"
## [23] "earliest_cr_line" "inq_last_6mths"
## [25] "open_acc" "pub_rec"
## [27] "revol_bal" "revol_util"
## [29] "total_acc" "initial_list_status"
## [31] "out_prncp" "out_prncp_inv"
## [33] "total_pymnt" "total_pymnt_inv"
## [35] "total_rec_prncp" "total_rec_int"
## [37] "total_rec_late_fee" "recoveries"
## [39] "collection_recovery_fee" "last_pymnt_d"
## [41] "last_pymnt_amnt" "last_credit_pull_d"
## [43] "collections_12_mths_ex_med" "policy_code"
## [45] "application_type" "acc_now_delinq"
## [47] "chargeoff_within_12_mths" "delinq_amnt"
## [49] "pub_rec_bankruptcies" "tax_liens"
## [51] "hardship_flag" "disbursement_method"
## [53] "debt_settlement_flag" "debt_settlement_flag_date"
## [55] "settlement_status" "settlement_date"
## [57] "un_emp_rate"
Now we have complete data in a dataframe, so started with the analysis:
levels(loans_data$loan_status)
## NULL
table(df$loan_status)
##
## Charged Off Fully Paid
## 5493 33159
We are going to take only the data with status as “Charged Off” or “Fully Paid”
df <- subset(df, loan_status %in% c("Charged Off", "Fully Paid" ))
table(df$loan_status)
##
## Charged Off Fully Paid
## 5493 33159
bad_id <- c("Charged Off")
df$bad_loans <- ifelse(df$loan_status %in% bad_id, 1,
ifelse(df$loan_status=="", NA,
0))
table(df$loan_status)
##
## Charged Off Fully Paid
## 5493 33159
table(df$bad_loans)
##
## 0 1
## 33159 5493
Checking loan status against the loan grades. Most of the lower grade loans failed to pay back.
table(df$loan_status, df$grade)
##
## A B C D E F G
## Charged Off 581 1392 1313 1089 707 313 98
## Fully Paid 9152 10313 6590 4096 2084 710 214
ggplot(df, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similarly loan status against the unemployment rate.
table(df$loan_status, df$un_emp_rate)
##
## 4.2 4.4 5.2 5.3 5.4 5.6 5.9 6 6.1 6.4 6.8 7
## Charged Off 3 12 5 0 24 41 78 40 31 179 181 159
## Fully Paid 2 51 46 5 146 327 518 213 236 1265 1295 859
##
## 7.2 7.3 7.4 7.5 7.6 7.7 7.8 8 8.1 8.2 8.3 8.6
## Charged Off 318 61 15 71 142 0 206 147 2 94 145 533
## Fully Paid 2427 386 64 523 707 2 1508 1043 13 669 1022 3498
##
## 8.8 9.1 9.2 9.4 9.8 9.9 10 11.2 12.3
## Charged Off 244 485 343 296 320 15 66 1130 107
## Fully Paid 1440 2308 1902 1909 1814 191 389 6004 377
ggplot(df, aes(x = un_emp_rate))+ geom_histogram(aes(fill = un_emp_rate)) + facet_wrap(~loan_status, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We need to find all the numeric data that we can utilize to decide on the predictors, so iam doing numeric analysis. Ploting the graph for clear understanding.
numeric_cols <- sapply(df, is.numeric)
df.lng <- melt(df[,numeric_cols], id ="bad_loans")
head(df.lng)
## bad_loans variable value
## 1 0 loan_amnt 13200
## 2 0 loan_amnt 18750
## 3 0 loan_amnt 2000
## 4 0 loan_amnt 4000
## 5 1 loan_amnt 7650
## 6 0 loan_amnt 20000
p <- ggplot(aes(x=value, group=bad_loans, colour=factor(bad_loans)), data=df.lng)
p + geom_density() +
facet_wrap(~variable, scales="free")
## Warning: Removed 797 rows containing non-finite values (stat_density).
We need to remove the outliers so that it should not impact by making wrong predictions.
#Removing annual_inc outliers
summary(df$annual_inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4000 40800 59450 68985 82404 6000000
inc_outliers <- which(df$annual_inc > 1000000)
df <- df[-inc_outliers,]
Now the data is ready for model development. First i will try logistic regression and then if not satisfied we can use some of the advanced models like Random forest.
#Sampling the data
set.seed(123)
sample <- runif(nrow(df)) > 0.70
train <- df[sample==FALSE,]
test <- df[sample==TRUE,]
table(train$bad_loans)
##
## 0 1
## 23291 3948
#Building the logistic regression model
logistic_regressor <- glm(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti +
revol_bal + revol_util + total_acc + un_emp_rate, family = "binomial", data = train)
summary(logistic_regressor)
##
## Call:
## glm(formula = bad_loans ~ loan_amnt + int_rate + installment +
## annual_inc + dti + revol_bal + revol_util + total_acc + un_emp_rate,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2595 -0.5964 -0.4754 -0.3575 3.7732
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.939e+00 1.300e-01 -30.307 < 2e-16 ***
## loan_amnt 5.636e-05 5.964e-06 9.449 < 2e-16 ***
## int_rate 1.435e-01 5.779e-03 24.832 < 2e-16 ***
## installment -1.904e-03 2.182e-04 -8.725 < 2e-16 ***
## annual_inc -7.256e-06 6.606e-07 -10.985 < 2e-16 ***
## dti -1.405e-03 3.089e-03 -0.455 0.649208
## revol_bal 3.086e-06 1.467e-06 2.103 0.035450 *
## revol_util 2.693e-03 8.094e-04 3.327 0.000877 ***
## total_acc 3.092e-03 1.840e-03 1.680 0.092880 .
## un_emp_rate 6.133e-02 1.103e-02 5.559 2.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22545 on 27238 degrees of freedom
## Residual deviance: 21189 on 27229 degrees of freedom
## AIC: 21209
##
## Number of Fisher Scoring iterations: 5
#Predicting on test data
pred <- predict(logistic_regressor, newdata = test, type = "response")
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0003209 0.0827812 0.1274705 0.1439276 0.1837348 0.5560589
pred_co <- ifelse(pred > 0.8, 1,0)
table(test$bad_loans,pred_co )
## pred_co
## 0
## 0 9856
## 1 1544
pred_t <- prediction(pred_co,test$bad_loans)
perfm <- performance(pred_t, "tpr", "fpr")
#Printing AUC Value
perfm <- performance(pred_t, "auc")
print(perfm@y.values[[1]])
## [1] 0.5
#ROC curve
roc.curve(test$bad_loans, pred_co,col="red", main="The ROC-curve for Model")
## Area under the curve (AUC): 0.500
text(0.6,0.2,paste("AUC=0.5"))
The AUC is just 0.5 so we need to resample the data so that we can have balanced samples. Iam improving the sampling by balancing between good and bad loans.
#Improving the sampling by balancing between good and bad
improved_train <- ROSE(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti +
revol_bal + revol_util + total_acc + un_emp_rate, data = train, seed = 1)$data
table(improved_train$bad_loans)
##
## 0 1
## 13721 13518
#Building new logistic regression model
improved_regressor <- glm(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti +
revol_bal + revol_util + total_acc + un_emp_rate, family = "binomial", data = improved_train)
summary(improved_regressor)
##
## Call:
## glm(formula = bad_loans ~ loan_amnt + int_rate + installment +
## annual_inc + dti + revol_bal + revol_util + total_acc + un_emp_rate,
## family = "binomial", data = improved_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0128 -1.0986 -0.6291 1.1082 2.7041
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.037e+00 8.381e-02 -24.302 < 2e-16 ***
## loan_amnt 1.860e-05 2.373e-06 7.840 4.50e-15 ***
## int_rate 1.247e-01 3.594e-03 34.689 < 2e-16 ***
## installment -5.879e-04 8.498e-05 -6.918 4.58e-12 ***
## annual_inc -4.285e-06 3.158e-07 -13.569 < 2e-16 ***
## dti 3.750e-03 1.861e-03 2.015 0.0439 *
## revol_bal 8.999e-07 8.519e-07 1.056 0.2908
## revol_util 2.741e-03 4.800e-04 5.710 1.13e-08 ***
## total_acc -5.471e-04 1.109e-03 -0.493 0.6219
## un_emp_rate 5.515e-02 7.148e-03 7.715 1.21e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37760 on 27238 degrees of freedom
## Residual deviance: 35519 on 27229 degrees of freedom
## AIC: 35539
##
## Number of Fisher Scoring iterations: 4
#Making predictions on test set
improved_pred <- predict(improved_regressor, newdata = test, type="response")
hist(improved_pred)
#Evaluating new model
roc.curve(test$bad_loans, improved_pred, col="dark red", main="The ROC-curve for Improved Model")
## Area under the curve (AUC): 0.674
text(0.6,0.2,paste("AUC=0.672"))
I will now try Random forest to see if i can make better predictions.
#Using random forest
rf <- randomForest(bad_loans ~ loan_amnt + int_rate + installment + annual_inc + dti + revol_bal + revol_util + total_acc
+ un_emp_rate, type="classification", data=improved_train, importance=TRUE, na.action=na.omit)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rf_pred <- predict(rf, newdata = test, type="response")
hist(rf_pred)
#Evaluating new model
roc.curve(test$bad_loans, rf_pred, col="dark red", main="The ROC-curve for Improved Model")
## Area under the curve (AUC): 0.667
text(0.6,0.2,paste("AUC=0.667"))
I have developed a model using logistic regression and Random Forest to predict if a borrower will repay the loan based on historical data provided by Lending Club and to help investors when deciding which investment strategy to choose.
I think both Logistic and Random Forest Model have similar outcomes. I would say the data is very imbalanced, so very difficult to predict with high accuracy.