We are testing to see if the default rates for homeowners is significantly different than the default rates for all loans. Additionally, algorithms using decision trees were created based on various features (eg. demographic, geographic, income, fico etc) to predict default rates. The purpose of this study is to create a credit profile for borrowers to mitigate default risk.
There are 235,629 loans for 2013/2014.
Data was downloaded from LendingClub’s portfolio of consumer loans and includes complete loan data for all loans issued for 2013/2014, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. https://www.lendingclub.com/info/download-data.action. Data was also downloaded from Zip-Codes.com which offers demographic information by zip code for the nation. Data was downloaded in csv formats which required tidying and data transformation within R.
This is an observational study based on real market data provided by LendingClub.
Default/categorical; Default Rate/numerical.
Homeownership/categorical; Homeownership Rate/numerical.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
#Load data
lendclub <- read.csv("https://raw.githubusercontent.com/vskrelja/Final/master/lendingclub_zipcodes.csv",, header=TRUE)
dim(lendclub)
## [1] 235629 24
str(lendclub)
## 'data.frame': 235629 obs. of 24 variables:
## $ id : int 57167 300390 361542 367050 377140 378843 384151 404548 406664 454052 ...
## $ term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ intrate : num 0.1699 0.0649 0.0699 0.0712 0.0649 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 4 1 1 1 1 3 4 1 3 1 ...
## $ homeownership : Factor w/ 4 levels "ANY","MORTGAGE",..: 4 4 2 2 2 2 2 2 3 2 ...
## $ annualinc : num 70000 165000 250000 100000 102000 ...
## $ issued : Factor w/ 12 levels "1/14/2015","10/14/2015",..: 11 4 4 2 4 4 3 3 10 4 ...
## $ loanstatus : Factor w/ 7 levels "ChargedOff","Current",..: 1 2 2 2 2 2 2 1 2 2 ...
## $ title : Factor w/ 1909 levels "#2","1creditcardpaymentamonth",..: 1301 660 128 979 660 660 1265 507 1839 660 ...
## $ zipcode : int 10000 7000 14000 98100 33000 1000 12800 7400 19100 6400 ...
## $ addrstate : Factor w/ 49 levels "AK","AL","AR",..: 33 30 33 46 10 20 33 30 37 7 ...
## $ dti : num 10.5 4.45 3.25 19.13 12.4 ...
## $ ficorangelow : int 660 715 820 715 765 665 720 675 680 730 ...
## $ totalpymnt : num 2718 2212 5176 6918 5379 ...
## $ totalpymntinv : num 2688 2212 5176 6918 5379 ...
## $ lastficorangelow : int 535 755 770 685 820 630 640 550 690 725 ...
## $ zipcode.1 : int 10000 7000 14000 98100 33000 1000 12800 7400 19100 6400 ...
## $ averagehousevalue : num 577344 381556 112468 230747 157541 ...
## $ incomeperhousehold : num 64981 78559 50772 37182 35802 ...
## $ state : Factor w/ 50 levels "","AK","AL","AR",..: 34 31 34 47 11 21 34 31 38 8 ...
## $ numberofbusinesses : num 1889 530 141 537 574 ...
## $ numberofemployees : num 37033 7523 1960 8750 5663 ...
## $ businessannualpayroll: num 3589065 404292 71002 557913 203134 ...
## $ populationestimate : num 28908 19042 6651 15096 24757 ...
x<-data.frame(lendclub)
#Overall Default Rate
levels(x$loanstatus)
## [1] "ChargedOff" "Current" "Default"
## [4] "FullyPaid" "InGracePeriod" "Late(16-30days)"
## [7] "Late(31-120days)"
round(100*nrow(x[x$loanstatus=="ChargedOff"|x$loanstatus=="Default"|x$loanstatus=="Late(31-120days)",])/nrow(x),2)
## [1] 3.95
#Homeowner Default Rate
levels(x$homeownership)
## [1] "ANY" "MORTGAGE" "OWN" "RENT"
homeowners<-x[x$homeownership=="OWN"|x$homeownership=="MORTGAGE",]
round(100*nrow(homeowners[homeowners$loanstatus=="ChargedOff"|homeowners$loanstatus=="Default"|homeowners$loanstatus=="Late(31-120days)",])/nrow(homeowners),2)
## [1] 3.49
#Distribution of Default rate in the overall population
population_defaults <- rep(NA, 10000)
for(i in 1:10000){
samp <- sample(x$loanstatus, 1000)
population_defaults[i] <- 100*length(samp[samp=="ChargedOff"|samp=="Default"|samp=="Late(31-120days)"])/1000
}
#Distribution of Default rate in homeowners
homeowners_defaults <- rep(NA, 10000)
for(i in 1:10000){
samp <- sample(homeowners$loanstatus, 1000)
homeowners_defaults[i] <- 100*length(samp[samp=="ChargedOff"|samp=="Default"|samp=="Late(31-120days)"])/1000
}
#Mean and SD of defaults rates in the over all population
mean(population_defaults); sd(population_defaults);
## [1] 3.93893
## [1] 0.617833
#Mean and SD of defaults rates in homeowners
mean(homeowners_defaults); sd(homeowners_defaults);
## [1] 3.48932
## [1] 0.5775459
t.test(homeowners_defaults,population_defaults)
##
## Welch Two Sample t-test
##
## data: homeowners_defaults and population_defaults
## t = -53.162, df = 19908, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4661872 -0.4330328
## sample estimates:
## mean of x mean of y
## 3.48932 3.93893
#Plot the ditributon of homeowner defaults vs population defaults
a<-data.frame('defaults'=homeowners_defaults)
a$sample<-'homeowners'
b<-data.frame('defaults'=population_defaults)
b$sample<-'population'
combined<-rbind(a,b)
ggplot(combined, aes(defaults, fill = sample)) + geom_density(alpha = 0.1) + geom_vline(xintercept = mean(homeowners_defaults),color='red') + geom_vline(xintercept = mean(population_defaults),color='green')
As is evident from the two Sample t-test, the NULL hypothesis (no difference between population defaults and homeowners defaults) can be rejected at 95% confidence level, since the confidence interval does not include 0. The analysis shows that on average the Homeowner default rates on consumer loans can be 0.44% to 0.48% lower than the overall rate at a 95% condifence level. The lower defaults in Homeowners can also be seen in the default rate density plots of Homeowner vs the overall population.
library(caret)
## Warning: package 'caret' was built under R version 3.2.2
## Loading required package: lattice
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.2
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(e1071)
## Warning: package 'e1071' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(party)
## Warning: package 'party' was built under R version 3.2.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.2.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.2.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.2.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.2.3
#Remove loans with NA's
x<-x[complete.cases(x),]
dim(x)
## [1] 235220 24
#Create new categorical variable called indefault for prediction
x$indefault<-ifelse(x$loanstatus=="ChargedOff"|x$loanstatus=="Default"|x$loanstatus=="Late(31-120days)","default","nondefault")
x$indefault<-as.factor(x$indefault)
#Subset with relevant fields (independent demographic, geographic, income, fico variables)
x_pred<-x[,c("term","homeownership","annualinc","addrstate","dti","ficorangelow","averagehousevalue","incomeperhousehold","numberofbusinesses","numberofemployees","businessannualpayroll","populationestimate","indefault")]
#Since the overall data is very imbalanced (<5% Defaults, 9,275/235,220 observations = 3.9%), we will try "under-sampling" the much larger non-default category.
num_defaults<-length(x_pred$indefault[x_pred$indefault=="default"])
x_nondefaults<-x_pred[x_pred$indefault=="nondefault",]
x_pred<-rbind(x_pred[x_pred$indefault=="default",],sample_n(x_nondefaults,num_defaults))
#Partition Data (creates training set and test set)
inTrain <- createDataPartition(x_pred$indefault, p = 3/4)[[1]]
training <- x_pred[ inTrain,]
testing <- x_pred[-inTrain,]
#R Regression Tree
fit_rpart <- train(indefault~.,method='rpart',data=training)
#Fancy Decision Tree Plot
par(mar=c(2,2,1,1))
fancyRpartPlot(fit_rpart$finalModel)
#Confusion Matrix on Testing set: Regression Tree
pred_rpart <- predict(fit_rpart, testing)
confusionMatrix(pred_rpart, testing$indefault, positive = 'default')
## Confusion Matrix and Statistics
##
## Reference
## Prediction default nondefault
## default 1021 793
## nondefault 1297 1525
##
## Accuracy : 0.5492
## 95% CI : (0.5347, 0.5636)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.133e-11
##
## Kappa : 0.0984
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4405
## Specificity : 0.6579
## Pos Pred Value : 0.5628
## Neg Pred Value : 0.5404
## Prevalence : 0.5000
## Detection Rate : 0.2202
## Detection Prevalence : 0.3913
## Balanced Accuracy : 0.5492
##
## 'Positive' Class : default
##
Result from Regression Tree: We get an accuracy of over 55% which is an improvement over the default accuracy of 50% (No Information Rate) in the balanced dataset we used for prediction. But given how undesirable False Negatives are in this context, ie predicting bad loans (990) as good, we need to increase the Sensitivity of our prediction. Next we try Random Forest.
#R Random Forest
fit_rf <- randomForest(training$indefault~.,training,importance=TRUE)
#Confusion Matrix on Testing set: Random Forest
pred_rf <- predict(fit_rf, testing)
confusionMatrix(pred_rf, testing$indefault, positive = 'default')
## Confusion Matrix and Statistics
##
## Reference
## Prediction default nondefault
## default 2231 2196
## nondefault 87 122
##
## Accuracy : 0.5075
## 95% CI : (0.493, 0.522)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1554
##
## Kappa : 0.0151
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.96247
## Specificity : 0.05263
## Pos Pred Value : 0.50395
## Neg Pred Value : 0.58373
## Prevalence : 0.50000
## Detection Rate : 0.48123
## Detection Prevalence : 0.95492
## Balanced Accuracy : 0.50755
##
## 'Positive' Class : default
##
head(getTree(fit_rf, 1))
## left daughter right daughter split var split point status prediction
## 1 2 3 5 19.26 1 0
## 2 4 5 6 687.50 1 0
## 3 6 7 3 74251.00 1 0
## 4 8 9 2 2.00 1 0
## 5 10 11 12 26204.57 1 0
## 6 12 13 6 712.50 1 0
varImpPlot(fit_rf)
Result from Random Forest and Conclusion: Random Forest is helpful in increasing the Sensitivity (96.5%). Given our primary objective is to predict and avoid Defaults, RF works very well, but given the very low Specificity (4%) the model is of limited use. Overall RF gives us a very marginal improvement in accuracy over the No Information Rate of 50%. Further work needs to be carried out using Bagging and/or Boosting.