Introduction

For this data science project, I downloaded the Lending Club loan data from Lending Club's website. The goal of the project is to predict whether a loan will default or not, given various information about the loan.

Loading Libraries and Data

library(plyr)        # load plyr before dplyr/tidyverse so its verbs do not mask dplyr's
library(tidyverse)
library(lubridate)
library(corrplot)
library(zoo)
library(magrittr)
library(ROSE)
library(caTools)
library(ROCR)
library(caret)
library(randomForest)
setwd("~/Documents/Data Mining Slides/R and associated")
loandata <- read_csv("LoanLatest.csv")

Data Cleaning

Dataset Structure

dim(loandata)
## [1] 42540   137
head(loandata)
## # A tibble: 6 × 137
##      id member_id loan_amnt funded_amnt funded_amnt_inv      term int_rate
##   <chr>     <chr>     <int>       <int>           <dbl>     <chr>    <chr>
## 1  <NA>      <NA>      5000        5000            4975 36 months   10.65%
## 2  <NA>      <NA>      2500        2500            2500 60 months   15.27%
## 3  <NA>      <NA>      2400        2400            2400 36 months   15.96%
## 4  <NA>      <NA>     10000       10000           10000 36 months   13.49%
## 5  <NA>      <NA>      3000        3000            3000 60 months   12.69%
## 6  <NA>      <NA>      5000        5000            5000 36 months    7.90%
## # ... with 130 more variables: installment <dbl>, grade <chr>,
## #   sub_grade <chr>, emp_title <chr>, emp_length <chr>,
## #   home_ownership <chr>, annual_inc <dbl>, verification_status <chr>,
## #   issue_d <chr>, loan_status <chr>, pymnt_plan <chr>, url <chr>,
## #   desc <chr>, purpose <chr>, title <chr>, zip_code <chr>,
## #   addr_state <chr>, dti <dbl>, delinq_2yrs <int>,
## #   earliest_cr_line <chr>, inq_last_6mths <int>,
## #   mths_since_last_delinq <int>, mths_since_last_record <int>,
## #   open_acc <int>, pub_rec <int>, revol_bal <int>, revol_util <chr>,
## #   total_acc <int>, initial_list_status <chr>, out_prncp <int>,
## #   out_prncp_inv <int>, total_pymnt <dbl>, total_pymnt_inv <dbl>,
## #   total_rec_prncp <dbl>, total_rec_int <dbl>, total_rec_late_fee <dbl>,
## #   recoveries <dbl>, collection_recovery_fee <dbl>, last_pymnt_d <chr>,
## #   last_pymnt_amnt <dbl>, next_pymnt_d <chr>, last_credit_pull_d <chr>,
## #   collections_12_mths_ex_med <int>, mths_since_last_major_derog <chr>,
## #   policy_code <int>, application_type <chr>, annual_inc_joint <chr>,
## #   dti_joint <chr>, verification_status_joint <chr>,
## #   acc_now_delinq <int>, tot_coll_amt <chr>, tot_cur_bal <chr>,
## #   open_acc_6m <chr>, open_il_6m <chr>, open_il_12m <chr>,
## #   open_il_24m <chr>, mths_since_rcnt_il <chr>, total_bal_il <chr>,
## #   il_util <chr>, open_rv_12m <chr>, open_rv_24m <chr>, max_bal_bc <chr>,
## #   all_util <chr>, total_rev_hi_lim <chr>, inq_fi <chr>,
## #   total_cu_tl <chr>, inq_last_12m <chr>, acc_open_past_24mths <chr>,
## #   avg_cur_bal <chr>, bc_open_to_buy <chr>, bc_util <chr>,
## #   chargeoff_within_12_mths <int>, delinq_amnt <int>,
## #   mo_sin_old_il_acct <chr>, mo_sin_old_rev_tl_op <chr>,
## #   mo_sin_rcnt_rev_tl_op <chr>, mo_sin_rcnt_tl <chr>, mort_acc <chr>,
## #   mths_since_recent_bc <chr>, mths_since_recent_bc_dlq <chr>,
## #   mths_since_recent_inq <chr>, mths_since_recent_revol_delinq <chr>,
## #   num_accts_ever_120_pd <chr>, num_actv_bc_tl <chr>,
## #   num_actv_rev_tl <chr>, num_bc_sats <chr>, num_bc_tl <chr>,
## #   num_il_tl <chr>, num_op_rev_tl <chr>, num_rev_accts <chr>,
## #   num_rev_tl_bal_gt_0 <chr>, num_sats <chr>, num_tl_120dpd_2m <chr>,
## #   num_tl_30dpd <chr>, num_tl_90g_dpd_24m <chr>,
## #   num_tl_op_past_12m <chr>, pct_tl_nvr_dlq <chr>,
## #   percent_bc_gt_75 <chr>, pub_rec_bankruptcies <int>, tax_liens <int>,
## #   ...

There are 42,540 observations with 137 variables. Many of these 137 variables are irrelevant for our objective of default prediction. Removing them requires a solid understanding of the domain and of each variable, so I used Lending Club's data dictionary along with the Orchard platform's explanations of how consumer lending works to identify the variables worth dropping. Links:

LC Dictionary - https://help.lendingclub.com/hc/en-us/articles/216127307-Data-Dictionaries

Orchard Platform - https://www.orchardplatform.com/blog/credit-variables-explained-inquiries-in-the-last-6-months/
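
Before hand-picking columns, a quick look at how sparsely populated each variable is also helps; a minimal sketch (not part of the original selection logic) using the loandata tibble loaded above:

# Proportion of missing values per column, highest first; columns that are
# (almost) entirely NA are obvious candidates for removal.
na_share <- sort(colMeans(is.na(loandata)), decreasing = TRUE)
head(na_share, 10)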

loandata1 <- loandata %>% 
              select(loan_status, issue_d, loan_amnt, emp_title, emp_length, verification_status, 
                            home_ownership, annual_inc, purpose, inq_last_6mths, desc,
                            open_acc, pub_rec, revol_util, dti, total_acc, delinq_2yrs, 
                            earliest_cr_line, mths_since_last_delinq)

Now let’s have a look at the dataset with all those irrelevant variables removed.

head(loandata1)
## # A tibble: 6 × 19
##   loan_status issue_d loan_amnt                emp_title emp_length
##         <chr>   <chr>     <int>                    <chr>      <chr>
## 1  Fully Paid  Dec-11      5000                     <NA>  10+ years
## 2 Charged Off  Dec-11      2500                    Ryder   < 1 year
## 3  Fully Paid  Dec-11      2400                     <NA>  10+ years
## 4  Fully Paid  Dec-11     10000      AIR RESOURCES BOARD  10+ years
## 5  Fully Paid  Dec-11      3000 University Medical Group     1 year
## 6  Fully Paid  Dec-11      5000     Veolia Transportaton    3 years
## # ... with 14 more variables: verification_status <chr>,
## #   home_ownership <chr>, annual_inc <dbl>, purpose <chr>,
## #   inq_last_6mths <int>, desc <chr>, open_acc <int>, pub_rec <int>,
## #   revol_util <chr>, dti <dbl>, total_acc <int>, delinq_2yrs <int>,
## #   earliest_cr_line <chr>, mths_since_last_delinq <int>

Target Variable

The target variable default is derived from loan_status: loans with the status "Charged Off" are coded as 1 and all other loans as 0.
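
Before recoding, it helps to see which statuses actually occur in the file (a quick check; the exact categories depend on the vintage of the download):

# Raw loan statuses before collapsing them into a binary target
table(loandata1$loan_status, useNA = "ifany")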

loandata2 <- loandata1 %>% 
              mutate(default = ifelse(loan_status=="Charged Off",1,0))

The earliest credit line and loan issue dates are given as month and year (e.g. "Dec-11"). To make them usable for prediction, we convert them to proper dates with the as.yearmon function from the 'zoo' package.

loandata2$earliest_cr_line <- as.Date(as.yearmon(loandata2$earliest_cr_line, "%b-%y"))
loandata2$issue_d <- as.Date(as.yearmon(loandata2$issue_d, "%b-%y"))
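
As a quick illustration of the format string (a small check, assuming the zoo package is loaded as above):

# "Dec-11" is read as December 2011 and anchored to the first of the month
as.Date(as.yearmon("Dec-11", "%b-%y"))   # -> "2011-12-01"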

Feature Engineering - Age of the Credit Account

These dates cannot be used as they are, so we calculate a new variable: the difference between the loan issue date and the opening of the earliest credit line. This indirectly gives the age of the borrower's credit history, which is a crucial factor in assessing creditworthiness; even for credit cards, the age of the credit account feeds into credit scores.

loandata3 <- loandata2 %>%
              mutate(time_history = issue_d - earliest_cr_line) 
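
The resulting time_history column is a difftime measured in days. A quick summary (a small check, not in the original write-up) shows its scale, both in days and in approximate years:

# Credit history length in days, and the same values in rough years
summary(as.numeric(loandata3$time_history))
summary(as.numeric(loandata3$time_history) / 365.25)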

Cleaning utilization percentage

revol_util indicates how much of the available revolving credit is being used, expressed as a percentage. It is read in as a character string with a trailing '%', so we strip the symbol and convert the result to a numeric variable.

head(loandata3$revol_util)
## [1] "83.70%" "9.40%"  "98.50%" "21%"    "53.90%" "28.30%"
loandata4 <- loandata3 %>%
  mutate(revol_util = as.numeric(sub("%","", revol_util)) )
head(loandata4$revol_util)
## [1] 83.7  9.4 98.5 21.0 53.9 28.3

Feature Engineering - Presence of Employer Title and Other Details

Three variables - employer title (emp_title), the loan description (desc), and months since last delinquency (mths_since_last_delinq) - can serve as useful predictors but have to be coded differently. For example, employer title can be reduced to an indicator of whether the borrower provided one or not. In the same vein, the loan description and delinquency fields can be turned into indicators based on whether they are NA.

loandata5 <- loandata4 %>%
  mutate(empdetailgiven = as.numeric(!is.na(emp_title)),
         reasonforloan = as.numeric(!is.na(desc)), 
         previousdefault = as.numeric(!is.na(mths_since_last_delinq)) ) 
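
A quick tally of how often these three fields are missing (a supplementary check) shows how much information the NA/non-NA split actually carries:

# Number of missing values in the three recoded fields
colSums(is.na(loandata4[c("emp_title", "desc", "mths_since_last_delinq")]))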

Recoding home ownership and employment length variables

Other variables such as home ownership and employment length are also reformatted in suitable ways: the rare "NONE" home-ownership category is collapsed into "OTHER", and employment length is converted to a continuous numeric, with "< 1 year" coded as zero. A separate flag, emplengthgiven, marks the rows where employment length was reported as "n/a"; these are also coded as zero years.

loandata6 <- loandata5 %>%
  mutate(emplengthgiven = ifelse(emp_length == "n/a", 1, 0),   # 1 = employment length not reported
         emp_length = ifelse(emp_length == "< 1 year" | emp_length == "n/a", 0, emp_length),
         emp_length = as.numeric(gsub("\\D", "", emp_length)), # keep only the digits, e.g. "10+ years" -> 10
         home_ownership = ifelse(home_ownership == "NONE", "OTHER", home_ownership))
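
To confirm the recoding worked as intended, a quick before-and-after comparison (a small check, not in the original code):

# Raw employment-length labels, the numeric version, and the collapsed home-ownership levels
table(loandata5$emp_length, useNA = "ifany")
summary(loandata6$emp_length)
table(loandata6$home_ownership)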

Final Aggregated dataset

Having transformed and recoded the variables, we now assemble the final dataset by selecting the predictors we intend to model with, including the engineered columns.

loandata7 <- loandata6 %>% 
         select(default, loan_amnt, reasonforloan, empdetailgiven, emplengthgiven, emp_length, verification_status, 
         home_ownership, annual_inc, purpose, time_history, inq_last_6mths, 
         open_acc, pub_rec, revol_util, dti, total_acc, delinq_2yrs, previousdefault)

Formatting factors and numerics

Now we format the variables into factors and numerics to enable modeling. I am using the mutate_each_ function from dplyr (since superseded by mutate_at()/across()), which does a neat job of factorizing many columns at once.

factorcolumns <- c("default", "reasonforloan", "empdetailgiven", "emplengthgiven",
                   "verification_status", "home_ownership", "purpose", "previousdefault")
loandata7 %<>% mutate_each_(funs(factor(.)), factorcolumns)
loandata7$time_history <- as.numeric(loandata7$time_history)
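
A quick verification of the resulting column types (a small check added here):

# Confirm which columns are now factors and which are numeric
sapply(loandata7, class)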

Data Exploration

First, I want to check the distribution of the loan amount, one of the most important features in our dataset.

Loan Amount Histogram

Loanamountdist <- ggplot(data=loandata7, aes(x=loan_amnt)) + 
  geom_histogram(aes(y=..density..),
                 col='black', 
                 fill='dodgerblue1', 
                 alpha=0.3) +
  geom_density(adjust=3)

print(Loanamountdist + theme(plot.title=element_text(face="bold")) + ggtitle('Distribution of the loan amounts'))
## Warning: Removed 5 rows containing non-finite values (stat_bin).
## Warning: Removed 5 rows containing non-finite values (stat_density).

The distribution is right-skewed, as we would normally expect for this kind of variable. A large share of the loans are for amounts below 10,000, which fits the company's customer base.
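
To put numbers on that impression (a small check; the exact quantiles depend on the data vintage):

# Numeric summary of the loan amounts and the share of loans under 10,000
summary(loandata7$loan_amnt)
mean(loandata7$loan_amnt < 10000, na.rm = TRUE)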

Segmented Bar Chart - Home Ownership

Next, home ownership is another potentially important predictor of default. Intuition suggests that homeowners would be good borrowers, while people renting or paying off a mortgage would be riskier. But does the data say the same? Let's see.

homedist <- ggplot(data=loandata7, aes(x=home_ownership, fill=default)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)), position='stack', alpha=0.5) + scale_y_continuous(labels=scales::percent)

print(homedist + theme(plot.title=element_text(face="bold")) + ggtitle('Home Ownership vs Loan Default'))

The data backs this up. In fact, borrowers with a mortgage have fared worse than those living in rented homes.
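
A row-wise proportion table (a numeric companion to the stacked bar chart, added here) makes the comparison explicit:

# Default rate within each home-ownership category
round(prop.table(table(loandata7$home_ownership, loandata7$default), margin = 1), 3)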

Age of the Credit Lines

Now let's move on to another interesting variable: the age of the credit account. Let's compare it with the default target.

# A handful of earliest_cr_line values parse into the future (the two-digit "%y"
# year maps 00-68 to 20xx), making the difference negative, hence abs().
loandata7$time_history <- abs(loandata7$time_history)

CreditAge <- ggplot(data=loandata7, aes(x=time_history, fill=default)) + 
  geom_histogram(alpha=0.5) 

print(CreditAge + theme(plot.title=element_text(face="bold")) + ggtitle('Credit Account Age vs Default'))
## Warning: Removed 34 rows containing non-finite values (stat_bin).

The two distributions have a similar shape: both are right-skewed and centred around 5,000 days of credit history, with most of the mass to the left of that point. Below roughly 5,000 days, loans appear somewhat more likely to default, and above it somewhat less likely, so it seems safer to lend to borrowers with older credit accounts.
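
A rough numeric check of that reading (added here; 5,000 days is roughly 13.7 years):

# Default rate for loans with less vs. at least 5,000 days of credit history
tapply(loandata7$default == "1", loandata7$time_history >= 5000, mean)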

Statistical Tests

Chi-square test for all combinations of factor variables

A chi-square test is used to find relationships between the factor variables. It gives us insight both into variables with high predictive capacity and into redundant factor variables. We use the combn function to generate all pairs of factor columns, and feed the result into plyr's adply with an anonymous function that runs the chi-square test on each pair.

loandata9 <- loandata7[factorcolumns]
combos <- combn(ncol(loandata9),2)   # every pair of factor columns, one pair per matrix column
adply(
          combos, 
          2,                         # apply over the columns of 'combos', i.e. over pairs
          function(x) {
                          test <- chisq.test(collect(loandata9[,x[1]])[[1]],collect(loandata9[,x[2]])[[1]])
                          out <- data.frame(
                                                "Column 1" = colnames(loandata9)[x[1]],
                                                "Column 2" = colnames(loandata9)[x[2]],
                                                "Chi.Square" = round(test$statistic, 3),
                                                "df" = test$parameter,
                                                "p.value" = round(test$p.value,3)
                                            )
                          return(out)
                      }
      )
## Warning in chisq.test(collect(loandata9[, x[1]])[[1]],
## collect(loandata9[, : Chi-squared approximation may be incorrect

## Warning in chisq.test(collect(loandata9[, x[1]])[[1]],
## collect(loandata9[, : Chi-squared approximation may be incorrect

## Warning in chisq.test(collect(loandata9[, x[1]])[[1]],
## collect(loandata9[, : Chi-squared approximation may be incorrect
##    X1            Column.1            Column.2 Chi.Square df p.value
## 1   1             default       reasonforloan      1.708  1   0.191
## 2   2             default      empdetailgiven     65.657  1   0.000
## 3   3             default      emplengthgiven     52.788  1   0.000
## 4   4             default verification_status    113.533  2   0.000
## 5   5             default      home_ownership     26.350  3   0.000
## 6   6             default             purpose    340.543 13   0.000
## 7   7             default     previousdefault      5.032  1   0.025
## 8   8       reasonforloan      empdetailgiven      8.098  1   0.004
## 9   9       reasonforloan      emplengthgiven     47.069  1   0.000
## 10 10       reasonforloan verification_status    336.098  2   0.000
## 11 11       reasonforloan      home_ownership     78.365  3   0.000
## 12 12       reasonforloan             purpose    210.613 13   0.000
## 13 13       reasonforloan     previousdefault     13.818  1   0.000
## 14 14      empdetailgiven      emplengthgiven  15512.733  1   0.000
## 15 15      empdetailgiven verification_status     39.282  2   0.000
## 16 16      empdetailgiven      home_ownership    290.282  3   0.000
## 17 17      empdetailgiven             purpose    707.496 13   0.000
## 18 18      empdetailgiven     previousdefault      5.280  1   0.022
## 19 19      emplengthgiven verification_status     61.131  2   0.000
## 20 20      emplengthgiven      home_ownership    249.392  3   0.000
## 21 21      emplengthgiven             purpose    125.937 13   0.000
## 22 22      emplengthgiven     previousdefault     10.985  1   0.001
## 23 23 verification_status      home_ownership    483.423  6   0.000
## 24 24 verification_status             purpose    828.491 26   0.000
## 25 25 verification_status     previousdefault     84.943  2   0.000
## 26 26      home_ownership             purpose   2653.870 39   0.000
## 27 27      home_ownership     previousdefault      6.479  3   0.091
## 28 28             purpose     previousdefault    135.779 13   0.000

We use this information to remove factor variables that do not have a strong relationship with the target variable default.
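
For instance, reasonforloan is the only factor whose test against default is not significant at the 5% level (p = 0.191 in the table above). A single pairwise test (a small sketch that should reproduce that row) looks like this:

# Chi-square test for one pair: the target vs. the loan-description indicator
chisq.test(table(loandata9$default, loandata9$reasonforloan))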

Correlation and multicollinearity

Next, we compute the correlations between the numeric variables. This gives us insight into multicollinearity.

numcol <- sapply(loandata7,is.numeric)
pearsoncor <- cor(loandata7[numcol], use="complete.obs")
corrplot(pearsoncor, method = "number")

From the corrplot we can see that open_acc and total_acc are the only pair of variables with a meaningfully high correlation.
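
As a numeric cross-check (a small addition using the correlation matrix computed above), we can extract the strongest off-diagonal correlation directly:

# Locate the largest correlation between two different numeric variables
# (each pair appears twice because the matrix is symmetric)
cormat <- pearsoncor
diag(cormat) <- 0
max(abs(cormat))
which(abs(cormat) == max(abs(cormat)), arr.ind = TRUE)

Now let us have a look at the distributions of these two variables.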

openaccplot <- ggplot(loandata7, aes(open_acc)) + geom_histogram(col='black',fill='dodgerblue1', alpha=0.3) 

print(openaccplot + theme(plot.title=element_text(face="bold")) + ggtitle('Histogram of number of open accounts'))
## Warning: Removed 34 rows containing non-finite values (stat_bin).

totaccplot <- ggplot(loandata7, aes(total_acc)) + geom_histogram(col='black',fill='dodgerblue1', alpha=0.3)

print(totaccplot + theme(plot.title=element_text(face="bold")) + ggtitle('Histogram of total number of accounts')) 
## Warning: Removed 34 rows containing non-finite values (stat_bin).

ANOVA test

This is done to determine whether open_acc or total_acc has the stronger relationship with the target default variable. The formula-style plot calls below produce boxplots of each variable split by default status.

plot(open_acc ~ default, data=loandata7)

plot(total_acc ~ default, data=loandata7)

results <- aov(open_acc ~ default, data=loandata7)
summary(results)
##                Df Sum Sq Mean Sq F value Pr(>F)   
## default         1    167  167.50   8.287  0.004 **
## Residuals   42504 859134   20.21                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 34 observations deleted due to missingness
results2 <- aov(total_acc ~ default, data=loandata7)
summary(results2)
##                Df  Sum Sq Mean Sq F value   Pr(>F)    
## default         1    2913  2913.1   21.69 3.22e-06 ***
## Residuals   42504 5709473   134.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 34 observations deleted due to missingness

From the results, the total accounts variable has the stronger relationship with default (a larger F value and a smaller p-value), so we keep total_acc and drop open_acc to remove the multicollinearity.

index <- which(colnames(loandata7)=="open_acc")
loandata10 <- loandata7[,-c(index)]

Sampling

Our dataset is highly imbalanced: most observations have not defaulted, while relatively few have. We can see this from the following result.

table(loandata10$default)
## 
##     0     1 
## 36865  5670

There are 5,670 defaulted loans (1) and 36,865 non-defaulted loans (0).

ROSE package and undersampling

We undersample the majority class with the 'ROSE' package to balance the dataset, asking for N = 11,340 observations in total (twice the 5,670 defaults) so that the two classes come out roughly even.

balanceddata <- ovun.sample(default ~ . , 
                            data=loandata10, method = "under", N = 11340, seed = 1)$data
table(balanceddata$default)
## 
##    0    1 
## 5686 5654

Now both the default and non-default classes have roughly the same number of rows, about 5,600 each.

Test and Train Split

set.seed(101) 
sample <- sample.split(balanceddata$default, SplitRatio = .75)
train <- subset(balanceddata, sample == TRUE)
test  <- subset(balanceddata, sample == FALSE)

We have split the loan dataset 75/25 into training and test sets, with sample.split stratifying on the default variable so both sets keep the same class balance.
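
A quick check (added here) that the stratified split preserved both the class balance and the expected 75/25 proportions:

# Class balance and sizes of the training and test sets
prop.table(table(train$default))
prop.table(table(test$default))
dim(train); dim(test)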

Modeling

Logistic Regression

logisticreg <- glm(default~., family=binomial(link='logit'), data = train)
actualvalues <- test$default
pred <- predict(logisticreg,test, type ='response')
pred <- ifelse(pred > 0.5,1,0)
compare <- data.frame(actual = actualvalues, predicted = pred)
misClasificError <- mean(pred != actualvalues)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.585684062059238"

Random Forest

rfmodel <- randomForest(default ~ loan_amnt + reasonforloan + empdetailgiven + emplengthgiven
                        + verification_status + home_ownership + annual_inc + purpose
                        + previousdefault + pub_rec, data=train, importance = TRUE, ntree = 200, na.action = na.omit)
predrf <- predict(rfmodel, test, type='class')
rfoutput <- confusionMatrix(predrf, test$default)   # predictions first, then the reference labels
paste0(rfoutput$overall[1])
## [1] "0.563822284908322"

Comparison of our classifiers

ROC curve

# Note: 'pred' holds the thresholded 0/1 predictions from the logistic model,
# so its ROC curve is built from hard class labels rather than probabilities.
val <- as.numeric(paste0(pred))
predObj <- prediction(val,actualvalues)
rocObj <- performance(predObj, measure="tpr", x.measure="fpr") 
aucObj <- performance(predObj, measure="auc")


predrfprob <- predict(rfmodel, test, type = "prob")
val2 <- as.numeric(paste0(predrfprob[,2]))
predObj2 <- prediction(val2, actualvalues)
rocObj2 <- performance(predObj2, measure="tpr", x.measure="fpr") 
aucObj2 <- performance(predObj2, measure="auc")

plot(rocObj, col = "blue", lwd = 1, main = "ROC Curves")
plot(rocObj2, add = TRUE, col = "red")
abline(a=0, b=1)

The ROC curve is used to compare classifiers. The graph above plots the true positive rate (sensitivity) against the false positive rate (which equals 1 - specificity). The area under the curve (AUC) summarizes this in a single number: the larger the area, the better the model's discriminative performance. From the graph we can see that the random forest model performs better than the logistic regression model.
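
The AUC objects were computed above but never printed; extracting them (a small addition) gives the precise comparison. Note that the logistic AUC here reflects the thresholded 0/1 predictions from earlier, which understates what the model could achieve with its raw probabilities.

# Area under the ROC curve for the logistic model (hard 0/1 predictions)
# and for the random forest (class-1 probabilities)
unlist(aucObj@y.values)
unlist(aucObj2@y.values)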