For this data science project, I have downloaded the Lending Club data from the lending club’s website. The goal of this project is to predict whether loans will default or not, given various information about the loans.
library(tidyverse)
library(lubridate)
library(corrplot)
library(zoo)
library(plyr)
library(magrittr)
library(ROSE)
library(caTools)
library(ROCR)
library(caret)
library(randomForest)
setwd("~/Documents/Data Mining Slides/R and associated")
loandata <- read_csv("LoanLatest.csv")
dim(loandata)
## [1] 42540 137
head(loandata)
## # A tibble: 6 × 137
## id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate
## <chr> <chr> <int> <int> <dbl> <chr> <chr>
## 1 <NA> <NA> 5000 5000 4975 36 months 10.65%
## 2 <NA> <NA> 2500 2500 2500 60 months 15.27%
## 3 <NA> <NA> 2400 2400 2400 36 months 15.96%
## 4 <NA> <NA> 10000 10000 10000 36 months 13.49%
## 5 <NA> <NA> 3000 3000 3000 60 months 12.69%
## 6 <NA> <NA> 5000 5000 5000 36 months 7.90%
## # ... with 130 more variables: installment <dbl>, grade <chr>,
## # sub_grade <chr>, emp_title <chr>, emp_length <chr>,
## # home_ownership <chr>, annual_inc <dbl>, verification_status <chr>,
## # issue_d <chr>, loan_status <chr>, pymnt_plan <chr>, url <chr>,
## # desc <chr>, purpose <chr>, title <chr>, zip_code <chr>,
## # addr_state <chr>, dti <dbl>, delinq_2yrs <int>,
## # earliest_cr_line <chr>, inq_last_6mths <int>,
## # mths_since_last_delinq <int>, mths_since_last_record <int>,
## # open_acc <int>, pub_rec <int>, revol_bal <int>, revol_util <chr>,
## # total_acc <int>, initial_list_status <chr>, out_prncp <int>,
## # out_prncp_inv <int>, total_pymnt <dbl>, total_pymnt_inv <dbl>,
## # total_rec_prncp <dbl>, total_rec_int <dbl>, total_rec_late_fee <dbl>,
## # recoveries <dbl>, collection_recovery_fee <dbl>, last_pymnt_d <chr>,
## # last_pymnt_amnt <dbl>, next_pymnt_d <chr>, last_credit_pull_d <chr>,
## # collections_12_mths_ex_med <int>, mths_since_last_major_derog <chr>,
## # policy_code <int>, application_type <chr>, annual_inc_joint <chr>,
## # dti_joint <chr>, verification_status_joint <chr>,
## # acc_now_delinq <int>, tot_coll_amt <chr>, tot_cur_bal <chr>,
## # open_acc_6m <chr>, open_il_6m <chr>, open_il_12m <chr>,
## # open_il_24m <chr>, mths_since_rcnt_il <chr>, total_bal_il <chr>,
## # il_util <chr>, open_rv_12m <chr>, open_rv_24m <chr>, max_bal_bc <chr>,
## # all_util <chr>, total_rev_hi_lim <chr>, inq_fi <chr>,
## # total_cu_tl <chr>, inq_last_12m <chr>, acc_open_past_24mths <chr>,
## # avg_cur_bal <chr>, bc_open_to_buy <chr>, bc_util <chr>,
## # chargeoff_within_12_mths <int>, delinq_amnt <int>,
## # mo_sin_old_il_acct <chr>, mo_sin_old_rev_tl_op <chr>,
## # mo_sin_rcnt_rev_tl_op <chr>, mo_sin_rcnt_tl <chr>, mort_acc <chr>,
## # mths_since_recent_bc <chr>, mths_since_recent_bc_dlq <chr>,
## # mths_since_recent_inq <chr>, mths_since_recent_revol_delinq <chr>,
## # num_accts_ever_120_pd <chr>, num_actv_bc_tl <chr>,
## # num_actv_rev_tl <chr>, num_bc_sats <chr>, num_bc_tl <chr>,
## # num_il_tl <chr>, num_op_rev_tl <chr>, num_rev_accts <chr>,
## # num_rev_tl_bal_gt_0 <chr>, num_sats <chr>, num_tl_120dpd_2m <chr>,
## # num_tl_30dpd <chr>, num_tl_90g_dpd_24m <chr>,
## # num_tl_op_past_12m <chr>, pct_tl_nvr_dlq <chr>,
## # percent_bc_gt_75 <chr>, pub_rec_bankruptcies <int>, tax_liens <int>,
## # ...
There are 42,540 observations with 137 variables. Many of those 137 variables are irrelevant to our objective of default prediction. A thorough understanding of the domain and of each variable is needed to decide which ones to remove, so I used Lending Club's data dictionary as well as the Orchard Platform's explanations of how the lending world works to identify and drop the irrelevant variables. Links:
LC Dictionary - https://help.lendingclub.com/hc/en-us/articles/216127307-Data-Dictionaries
Orchard Platform - https://www.orchardplatform.com/blog/credit-variables-explained-inquiries-in-the-last-6-months/
loandata1 <- loandata %>%
select(loan_status, issue_d, loan_amnt, emp_title, emp_length, verification_status,
home_ownership, annual_inc, purpose, inq_last_6mths, desc,
open_acc, pub_rec, revol_util, dti, total_acc, delinq_2yrs,
         earliest_cr_line, mths_since_last_delinq)
Now let's have a look at the dataset with all those irrelevant variables removed.
head(loandata1)
## # A tibble: 6 × 19
## loan_status issue_d loan_amnt emp_title emp_length
## <chr> <chr> <int> <chr> <chr>
## 1 Fully Paid Dec-11 5000 <NA> 10+ years
## 2 Charged Off Dec-11 2500 Ryder < 1 year
## 3 Fully Paid Dec-11 2400 <NA> 10+ years
## 4 Fully Paid Dec-11 10000 AIR RESOURCES BOARD 10+ years
## 5 Fully Paid Dec-11 3000 University Medical Group 1 year
## 6 Fully Paid Dec-11 5000 Veolia Transportaton 3 years
## # ... with 14 more variables: verification_status <chr>,
## # home_ownership <chr>, annual_inc <dbl>, purpose <chr>,
## # inq_last_6mths <int>, desc <chr>, open_acc <int>, pub_rec <int>,
## # revol_util <chr>, dti <dbl>, total_acc <int>, delinq_2yrs <int>,
## # earliest_cr_line <chr>, mths_since_last_delinq <int>
The target variable, default, is derived from loan_status: it is coded 1 for loans that were charged off and 0 otherwise.
loandata2 <- loandata1 %>%
  mutate(default = ifelse(loan_status=="Charged Off",1,0))
The earliest credit line and issue dates are given as month and year. To use these variables for prediction, we convert them into proper dates using the as.yearmon() function from the 'zoo' package.
loandata2$earliest_cr_line <- as.Date(as.yearmon(loandata2$earliest_cr_line, "%b-%y"))
loandata2$issue_d <- as.Date(as.yearmon(loandata2$issue_d, "%b-%y"))
These dates cannot be used as they are, so we calculate a new variable: the difference between the loan issue date and the earliest credit line. This indirectly gives the age of the borrower's credit history, which is a crucial factor in assessing credit; even for credit cards, the age of the credit account is taken into account when computing credit scores.
loandata3 <- loandata2 %>%
  mutate(time_history = issue_d - earliest_cr_line)
revol_util indicates the borrower's revolving credit utilization and is stored as a percentage string. To make it usable in the analysis, we strip the '%' character and convert it to a numeric variable.
head(loandata3$revol_util)
## [1] "83.70%" "9.40%" "98.50%" "21%" "53.90%" "28.30%"
loandata4 <- loandata3 %>%
mutate(revol_util = as.numeric(sub("%","", revol_util)) )
head(loandata4$revol_util)
## [1] 83.7 9.4 98.5 21.0 53.9 28.3
Three variables - employer title (emp_title), the loan description (desc), and months since last delinquency (mths_since_last_delinq) - can serve as useful predictors, but they have to be coded differently. For example, employer title can be reduced to an indicator of whether a title was provided or not. In the same vein, the loan description and delinquency history can be turned into indicator variables based on whether the value is missing (NA).
loandata5 <- loandata4 %>%
mutate(empdetailgiven = as.numeric(!is.na(emp_title)),
reasonforloan = as.numeric(!is.na(desc)),
         previousdefault = as.numeric(!is.na(mths_since_last_delinq)) )
Other variables such as home ownership and employment length are also reformatted in suitable ways: the rare "NONE" home-ownership category is merged into "OTHER", and employment length is converted to a continuous numeric value after handling the missing entries, with lengths of less than a year coded as zero.
loandata6 <- loandata5 %>%
mutate(emplengthgiven = ifelse(emp_length == "n/a", 1, 0),
emp_length = ifelse(emp_length == "< 1 year" | emp_length == "n/a", 0, emp_length),
emp_length = as.numeric(gsub("\\D", "", emp_length)),
    home_ownership = ifelse(home_ownership == "NONE", "OTHER", home_ownership))
Having handled the missing values and transformed and recoded our variables, we now gather them, including the computed columns, into a new dataset.
loandata7 <- loandata6 %>%
select(default, loan_amnt, reasonforloan, empdetailgiven, emplengthgiven, emp_length, verification_status,
home_ownership, annual_inc, purpose, time_history, inq_last_6mths,
         open_acc, pub_rec, revol_util, dti, total_acc, delinq_2yrs, previousdefault)
Next we convert the variables into factors and numerics to enable modelling. I am using the (now-deprecated) mutate_each_() function from dplyr, which does a neat job of converting many columns to factors at once.
factorcolumns <- c("default", "reasonforloan", "empdetailgiven", "emplengthgiven",
"verification_status", "home_ownership", "purpose", "previousdefault")
loandata7 %<>% mutate_each_(funs(factor(.)), factorcolumns)
loandata7$time_history <- as.numeric(loandata7$time_history)
First, I want to check the distribution of the loan amount, one of the most important features in our dataset.
Loanamountdist <- ggplot(data=loandata7, aes(x=loan_amnt)) +
geom_histogram(aes(y=..density..),
col='black',
fill='dodgerblue1',
alpha=0.3) +
geom_density(adjust=3)
print(Loanamountdist + theme(plot.title=element_text(face="bold")) + ggtitle('Distribution of the loan amounts'))
## Warning: Removed 5 rows containing non-finite values (stat_bin).
## Warning: Removed 5 rows containing non-finite values (stat_density).
The distribution is right-skewed, as we would normally expect for this kind of variable. Most of the loans are for amounts below 10,000, which is consistent with the company's customer base.
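As a quick numerical check of that claim (not part of the original output), the share of loans below the 10,000 mark can be computed directly:
# Proportion of loans with an amount under 10,000; NAs are excluded.
mean(loandata7$loan_amnt < 10000, na.rm = TRUE)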
Next, the home ownership variable is another potentially important predictor of default. Intuition says that homeowners should be good borrowers, while people renting or paying a mortgage would be riskier. But does the data say the same? Let's see.
homedist <- ggplot(data=loandata7, aes(x=home_ownership, fill=default)) +
  geom_bar(aes(y = (..count..)/sum(..count..)), position='stack', alpha=0.5) + scale_y_continuous(labels=scales::percent)
print(homedist + theme(plot.title=element_text(face="bold")) + ggtitle('Home Ownership vs Loan Default'))
The data backs up this intuition; in fact, as we would expect, borrowers with a mortgage have done worse than those living in rented homes.
Now let's move on to another interesting variable: the age of the borrower's credit history. Let us compare it with the default target.
loandata7$time_history <- abs(loandata7$time_history)
CreditAge <- ggplot(data=loandata7, aes(x=time_history, fill=default)) +
  geom_histogram(alpha=0.5)
print(CreditAge + theme(plot.title=element_text(face="bold")) + ggtitle('Credit Account Age vs Default'))
## Warning: Removed 34 rows containing non-finite values (stat_bin).
The overall shape of the distribution looks similar for defaulted and non-defaulted loans: centred around 5,000 days (roughly 13-14 years), right-skewed, with higher frequencies to the left of the centre. Below roughly 5,000 days of credit history, loans appear to have a noticeably higher chance of default, and conversely, above 5,000 days the probability of default looks lower. So it seems safer to lend money to borrowers with older credit accounts.
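A rough numerical check of this reading is sketched below; the bucket boundaries are my own choice, and the 5,000-day cut comes from eyeballing the plot. The dplyr verbs are spelled out with the dplyr:: prefix because plyr, attached earlier, masks mutate() and summarise().
# Default rate by credit-age bucket (time_history is in days)
loandata7 %>%
  dplyr::filter(!is.na(time_history)) %>%
  dplyr::mutate(age_bucket = cut(time_history, breaks = c(0, 2500, 5000, 7500, Inf))) %>%
  dplyr::group_by(age_bucket) %>%
  dplyr::summarise(default_rate = mean(default == "1"), n = dplyr::n())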
A chi-squared test is used to look for relationships between the factor variables. It gives us insight into which variables have high predictive capacity and which factor variables are redundant. We use the combn() function to generate all pairwise combinations of the factor columns and feed them into plyr's adply() with an anonymous function that runs the chi-squared test.
loandata9 <- loandata7[factorcolumns]
combos <- combn(ncol(loandata9),2)
adply(
combos,
2,
function(x) {
test <- chisq.test(collect(loandata9[,x[1]])[[1]],collect(loandata9[,x[2]])[[1]])
out <- data.frame(
"Column 1" = colnames(loandata9)[x[1]],
"Column 2" = colnames(loandata9)[x[2]],
"Chi.Square" = round(test$statistic, 3),
"df" = test$parameter,
"p.value" = round(test$p.value,3)
)
return(out)
}
)
## Warning in chisq.test(collect(loandata9[, x[1]])[[1]],
## collect(loandata9[, : Chi-squared approximation may be incorrect
## Warning in chisq.test(collect(loandata9[, x[1]])[[1]],
## collect(loandata9[, : Chi-squared approximation may be incorrect
## Warning in chisq.test(collect(loandata9[, x[1]])[[1]],
## collect(loandata9[, : Chi-squared approximation may be incorrect
## X1 Column.1 Column.2 Chi.Square df p.value
## 1 1 default reasonforloan 1.708 1 0.191
## 2 2 default empdetailgiven 65.657 1 0.000
## 3 3 default emplengthgiven 52.788 1 0.000
## 4 4 default verification_status 113.533 2 0.000
## 5 5 default home_ownership 26.350 3 0.000
## 6 6 default purpose 340.543 13 0.000
## 7 7 default previousdefault 5.032 1 0.025
## 8 8 reasonforloan empdetailgiven 8.098 1 0.004
## 9 9 reasonforloan emplengthgiven 47.069 1 0.000
## 10 10 reasonforloan verification_status 336.098 2 0.000
## 11 11 reasonforloan home_ownership 78.365 3 0.000
## 12 12 reasonforloan purpose 210.613 13 0.000
## 13 13 reasonforloan previousdefault 13.818 1 0.000
## 14 14 empdetailgiven emplengthgiven 15512.733 1 0.000
## 15 15 empdetailgiven verification_status 39.282 2 0.000
## 16 16 empdetailgiven home_ownership 290.282 3 0.000
## 17 17 empdetailgiven purpose 707.496 13 0.000
## 18 18 empdetailgiven previousdefault 5.280 1 0.022
## 19 19 emplengthgiven verification_status 61.131 2 0.000
## 20 20 emplengthgiven home_ownership 249.392 3 0.000
## 21 21 emplengthgiven purpose 125.937 13 0.000
## 22 22 emplengthgiven previousdefault 10.985 1 0.001
## 23 23 verification_status home_ownership 483.423 6 0.000
## 24 24 verification_status purpose 828.491 26 0.000
## 25 25 verification_status previousdefault 84.943 2 0.000
## 26 26 home_ownership purpose 2653.870 39 0.000
## 27 27 home_ownership previousdefault 6.479 3 0.091
## 28 28 purpose previousdefault 135.779 13 0.000
We use this information to remove the factor variables that do not have a strong relationship with the target variable default.
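As a sketch of how the table above could be turned into such a rule (the 0.05 cutoff is my assumption, not something fixed in the text), the tests against default can be re-run on their own and the weakly associated factors flagged:
# p-values of each factor variable against default; values above the cutoff flag weak
# predictors (reasonforloan is the only one above 0.05 in the table above).
pvals <- sapply(setdiff(factorcolumns, "default"), function(col) {
  chisq.test(loandata9$default, loandata9[[col]])$p.value
})
names(pvals)[pvals > 0.05]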
Next, we find the correlation between the numeric variables. This will give us insight into multicollinearity.
numcol <- sapply(loandata7,is.numeric)
pearsoncor <- cor(loandata7[numcol], use="complete.obs")
corrplot(pearsoncor, method = "number")
From the corrplot we can see that open_acc and total_acc are the only two variables showing any meaningful degree of multicollinearity. Let us have a look at the distribution of these two.
openaccplot <- ggplot(loandata7, aes(open_acc)) + geom_histogram(col='black',fill='dodgerblue1', alpha=0.3)
print(openaccplot + theme(plot.title=element_text(face="bold")) + ggtitle('Histogram of number of open accounts'))
## Warning: Removed 34 rows containing non-finite values (stat_bin).
totaccplot <- ggplot(loandata7, aes(total_acc)) + geom_histogram(col='black',fill='dodgerblue1', alpha=0.3)
print(totaccplot + theme(plot.title=element_text(face="bold")) + ggtitle('Histogram of total number of accounts'))
## Warning: Removed 34 rows containing non-finite values (stat_bin).
Next, we determine whether total_acc or open_acc has the stronger relationship with the default target, using box plots and a one-way ANOVA for each variable.
plot(open_acc ~ default, data=loandata7)
plot(total_acc ~ default, data=loandata7)
results <- aov(open_acc ~ default, data=loandata7)
summary(results)
## Df Sum Sq Mean Sq F value Pr(>F)
## default 1 167 167.50 8.287 0.004 **
## Residuals 42504 859134 20.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 34 observations deleted due to missingness
results2 <- aov(total_acc ~ default, data=loandata7)
summary(results2)
## Df Sum Sq Mean Sq F value Pr(>F)
## default 1 2913 2913.1 21.69 3.22e-06 ***
## Residuals 42504 5709473 134.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 34 observations deleted due to missingness
From the results, it can be seen that the total number of accounts (total_acc) has the stronger relationship with the default variable, so we keep total_acc and drop open_acc to remove the multicollinearity.
index <- which(colnames(loandata7)=="open_acc")
loandata10 <- loandata7[,-c(index)]
Our dataset is highly imbalanced: very few observations are defaults, while the vast majority are not. We can see this from the following table.
table(loandata10$default)
##
## 0 1
## 36865 5670
We can see that there are 5,670 rows in default (1) and 36,865 rows not in default (0).
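Expressed as proportions (a quick check, not in the original output), only about 13% of the loans are defaults:
# Class proportions: roughly 87% non-default vs 13% default
round(prop.table(table(loandata10$default)), 3)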
We are performing undersampling using the ‘ROSE’ package to balance our dataset.
balanceddata <- ovun.sample(default ~ . ,
data=loandata10, method = "under", N = 11340, seed = 1)$data
table(balanceddata$default)
##
## 0 1
## 5686 5654
Now the default and non-default classes contain roughly equal numbers of rows, around 5,650-5,700 each.
set.seed(101)
sample <- sample.split(balanceddata$default, SplitRatio = .75)
train <- subset(balanceddata, sample == TRUE)
test <- subset(balanceddata, sample == FALSE)
We split the balanced dataset into training and test sets using a 75/25 ratio.
logisticreg <- glm(default~., family=binomial(link='logit'), data = train)
actualvalues <- test$default
pred <- predict(logisticreg,test, type ='response')
pred <- ifelse(pred > 0.5,1,0)
compare <- data.frame(actual = actualvalues, predicted = pred)
misClasificError <- mean(pred != actualvalues)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.585684062059238"
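For a fuller picture than overall accuracy alone, the logistic model's predictions can also be cross-tabulated against the actual outcomes with caret's confusionMatrix() (a sketch; the factor() conversion is needed because pred is numeric at this point):
# Confusion matrix for the logistic model on the test set
confusionMatrix(factor(pred, levels = levels(actualvalues)), actualvalues)
Next, a random forest is fitted on the same training data.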
rfmodel <- randomForest(default ~ loan_amnt + reasonforloan + empdetailgiven + emplengthgiven
+ verification_status + home_ownership + annual_inc + purpose
+ previousdefault + pub_rec, data=train, importance = TRUE, ntree = 200, na.action = na.omit)
predrf <- predict(rfmodel, test, type='class')
rfoutput <- confusionMatrix(predrf, test$default)
paste0(rfoutput$overall[1])
## [1] "0.563822284908322"
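Since the forest was grown with importance = TRUE, the relative importance of the predictors can also be inspected, although this is not shown in the original write-up:
# Mean decrease in accuracy / Gini for each predictor in the random forest
varImpPlot(rfmodel, main = "Random forest variable importance")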
# ROC inputs for the logistic model (built from the thresholded 0/1 predictions)
val <- as.numeric(paste0(pred))
predObj <- prediction(val,actualvalues)
rocObj <- performance(predObj, measure="tpr", x.measure="fpr")
aucObj <- performance(predObj, measure="auc")
# ROC inputs for the random forest, using the predicted probability of class "1"
predrfprob <- predict(rfmodel, test, type = "prob")
val2 <- as.numeric(paste0(predrfprob[,2]))
predObj2 <- prediction(val2, actualvalues)
rocObj2 <- performance(predObj2, measure="tpr", x.measure="fpr")
aucObj2 <- performance(predObj2, measure="auc")
plot(rocObj, col = "blue", lwd = 1, main = "ROC Curves")
plot(rocObj2, add = TRUE, col = "red")
abline(a=0, b=1)
The ROC curve is used to compare classifiers. The plot above shows the true positive rate (sensitivity) against the false positive rate (1 - specificity). The area under the curve (AUC) summarises this trade-off in a single number: the larger the area, the better the model's discriminative performance. From the graph we can see that the random forest model performs better than the logistic regression model.
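The AUC values themselves can be pulled out of the ROCR performance objects created above (a sketch; the numeric results are not reproduced here):
# Extract the area under the curve for each model
auc_logit <- unlist(aucObj@y.values)   # logistic regression (thresholded predictions)
auc_rf <- unlist(aucObj2@y.values)     # random forest (class probabilities)
c(logistic = auc_logit, random_forest = auc_rf)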