607 Data Acquisition and Management Final

Research question

We are testing to see if the default rates for homeowners is significantly different than the default rates for all loans. Additionally, algorithms using decision trees were created based on various features (eg. demographic, geographic, income, fico etc) to predict default rates. The purpose of this study is to create a credit profile for borrowers to mitigate default risk.

Cases

There are 235,629 loans for 2013/2014.

Data collection

Data was downloaded from LendingClub’s portfolio of consumer loans and includes complete loan data for all loans issued for 2013/2014, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. https://www.lendingclub.com/info/download-data.action. Data was also downloaded from Zip-Codes.com which offers demographic information by zip code for the nation. Data was downloaded in csv formats which required tidying and data transformation within R.

Type of study

This is an observational study based on real market data provided by LendingClub.

Response

Default/categorical; Default Rate/numerical.

Explanatory

Homeownership/categorical; Homeownership Rate/numerical.

Summary statistics and Analysis: Are Default Rates for homeowners significantly different than default rates for all loans?

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.2

#Load data
lendclub <- read.csv("https://raw.githubusercontent.com/vskrelja/Final/master/lendingclub_zipcodes.csv",, header=TRUE)
dim(lendclub)

## [1] 235629     24

str(lendclub)

## 'data.frame':    235629 obs. of  24 variables:
##  $ id                   : int  57167 300390 361542 367050 377140 378843 384151 404548 406664 454052 ...
##  $ term                 : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ intrate              : num  0.1699 0.0649 0.0699 0.0712 0.0649 ...
##  $ grade                : Factor w/ 7 levels "A","B","C","D",..: 4 1 1 1 1 3 4 1 3 1 ...
##  $ homeownership        : Factor w/ 4 levels "ANY","MORTGAGE",..: 4 4 2 2 2 2 2 2 3 2 ...
##  $ annualinc            : num  70000 165000 250000 100000 102000 ...
##  $ issued               : Factor w/ 12 levels "1/14/2015","10/14/2015",..: 11 4 4 2 4 4 3 3 10 4 ...
##  $ loanstatus           : Factor w/ 7 levels "ChargedOff","Current",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ title                : Factor w/ 1909 levels "#2","1creditcardpaymentamonth",..: 1301 660 128 979 660 660 1265 507 1839 660 ...
##  $ zipcode              : int  10000 7000 14000 98100 33000 1000 12800 7400 19100 6400 ...
##  $ addrstate            : Factor w/ 49 levels "AK","AL","AR",..: 33 30 33 46 10 20 33 30 37 7 ...
##  $ dti                  : num  10.5 4.45 3.25 19.13 12.4 ...
##  $ ficorangelow         : int  660 715 820 715 765 665 720 675 680 730 ...
##  $ totalpymnt           : num  2718 2212 5176 6918 5379 ...
##  $ totalpymntinv        : num  2688 2212 5176 6918 5379 ...
##  $ lastficorangelow     : int  535 755 770 685 820 630 640 550 690 725 ...
##  $ zipcode.1            : int  10000 7000 14000 98100 33000 1000 12800 7400 19100 6400 ...
##  $ averagehousevalue    : num  577344 381556 112468 230747 157541 ...
##  $ incomeperhousehold   : num  64981 78559 50772 37182 35802 ...
##  $ state                : Factor w/ 50 levels "","AK","AL","AR",..: 34 31 34 47 11 21 34 31 38 8 ...
##  $ numberofbusinesses   : num  1889 530 141 537 574 ...
##  $ numberofemployees    : num  37033 7523 1960 8750 5663 ...
##  $ businessannualpayroll: num  3589065 404292 71002 557913 203134 ...
##  $ populationestimate   : num  28908 19042 6651 15096 24757 ...

x<-data.frame(lendclub)

#Overall Default Rate
levels(x$loanstatus)

## [1] "ChargedOff"       "Current"          "Default"         
## [4] "FullyPaid"        "InGracePeriod"    "Late(16-30days)" 
## [7] "Late(31-120days)"

round(100*nrow(x[x$loanstatus=="ChargedOff"|x$loanstatus=="Default"|x$loanstatus=="Late(31-120days)",])/nrow(x),2)

## [1] 3.95

#Homeowner Default Rate
levels(x$homeownership)

## [1] "ANY"      "MORTGAGE" "OWN"      "RENT"

homeowners<-x[x$homeownership=="OWN"|x$homeownership=="MORTGAGE",]
round(100*nrow(homeowners[homeowners$loanstatus=="ChargedOff"|homeowners$loanstatus=="Default"|homeowners$loanstatus=="Late(31-120days)",])/nrow(homeowners),2)

## [1] 3.49

#Distribution of Default rate in the overall population
population_defaults <- rep(NA, 10000)
for(i in 1:10000){
samp <- sample(x$loanstatus, 1000)
population_defaults[i] <- 100*length(samp[samp=="ChargedOff"|samp=="Default"|samp=="Late(31-120days)"])/1000
}

#Distribution of Default rate in homeowners
homeowners_defaults <- rep(NA, 10000)
for(i in 1:10000){
samp <- sample(homeowners$loanstatus, 1000)
homeowners_defaults[i] <- 100*length(samp[samp=="ChargedOff"|samp=="Default"|samp=="Late(31-120days)"])/1000
}

#Mean and SD of defaults rates in the over all population
mean(population_defaults); sd(population_defaults);

## [1] 3.93893

## [1] 0.617833

#Mean and SD of defaults rates in homeowners
mean(homeowners_defaults); sd(homeowners_defaults);

## [1] 3.48932

## [1] 0.5775459

t.test(homeowners_defaults,population_defaults)

## 
##  Welch Two Sample t-test
## 
## data:  homeowners_defaults and population_defaults
## t = -53.162, df = 19908, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4661872 -0.4330328
## sample estimates:
## mean of x mean of y 
##   3.48932   3.93893

#Plot the ditributon of homeowner defaults vs population defaults
a<-data.frame('defaults'=homeowners_defaults)
a$sample<-'homeowners'
b<-data.frame('defaults'=population_defaults)
b$sample<-'population'
combined<-rbind(a,b)

ggplot(combined, aes(defaults, fill = sample)) + geom_density(alpha = 0.1) + geom_vline(xintercept = mean(homeowners_defaults),color='red') + geom_vline(xintercept = mean(population_defaults),color='green')

Conclusion: Homeowners have lower default rates

As is evident from the two Sample t-test, the NULL hypothesis (no difference between population defaults and homeowners defaults) can be rejected at 95% confidence level, since the confidence interval does not include 0. The analysis shows that on average the Homeowner default rates on consumer loans can be 0.44% to 0.48% lower than the overall rate at a 95% condifence level. The lower defaults in Homeowners can also be seen in the default rate density plots of Homeowner vs the overall population.

Predictive Algorithm for default rates using decision Trees; preparing the data. Given a loan, we are trying to predict if a borrowing is going to default using borrower’s geographic, demographic, income and fico data.

library(caret)

## Warning: package 'caret' was built under R version 3.2.2

## Loading required package: lattice

library(rpart)

## Warning: package 'rpart' was built under R version 3.2.2

library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.2.2

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.2.2

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

library(rattle)

## Warning: package 'rattle' was built under R version 3.2.2

## Rattle: A free graphical interface for data mining with R.
## Version 4.0.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(e1071)

## Warning: package 'e1071' was built under R version 3.2.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.2

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:randomForest':
## 
##     combine
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(party)

## Warning: package 'party' was built under R version 3.2.3

## Loading required package: grid
## Loading required package: mvtnorm

## Warning: package 'mvtnorm' was built under R version 3.2.3

## Loading required package: modeltools

## Warning: package 'modeltools' was built under R version 3.2.3

## Loading required package: stats4
## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 3.2.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.2.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 3.2.3

#Remove loans with NA's
x<-x[complete.cases(x),]
dim(x)

## [1] 235220     24

#Create new categorical variable called indefault for prediction
x$indefault<-ifelse(x$loanstatus=="ChargedOff"|x$loanstatus=="Default"|x$loanstatus=="Late(31-120days)","default","nondefault")
x$indefault<-as.factor(x$indefault)

#Subset with relevant fields (independent demographic, geographic, income, fico variables)
x_pred<-x[,c("term","homeownership","annualinc","addrstate","dti","ficorangelow","averagehousevalue","incomeperhousehold","numberofbusinesses","numberofemployees","businessannualpayroll","populationestimate","indefault")]

#Since the overall data is very imbalanced (<5% Defaults, 9,275/235,220 observations = 3.9%), we will try "under-sampling" the much larger non-default category.
num_defaults<-length(x_pred$indefault[x_pred$indefault=="default"])
x_nondefaults<-x_pred[x_pred$indefault=="nondefault",]
x_pred<-rbind(x_pred[x_pred$indefault=="default",],sample_n(x_nondefaults,num_defaults))

#Partition Data (creates training set and test set)
inTrain <- createDataPartition(x_pred$indefault, p = 3/4)[[1]]
training <- x_pred[ inTrain,]
testing <- x_pred[-inTrain,]

Regression Tree

#R Regression Tree
fit_rpart <- train(indefault~.,method='rpart',data=training)
  
#Fancy Decision Tree Plot
par(mar=c(2,2,1,1))
fancyRpartPlot(fit_rpart$finalModel)

#Confusion Matrix on Testing set: Regression Tree
pred_rpart <- predict(fit_rpart, testing)
confusionMatrix(pred_rpart, testing$indefault, positive = 'default')

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   default nondefault
##   default       1021        793
##   nondefault    1297       1525
##                                           
##                Accuracy : 0.5492          
##                  95% CI : (0.5347, 0.5636)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.133e-11       
##                                           
##                   Kappa : 0.0984          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4405          
##             Specificity : 0.6579          
##          Pos Pred Value : 0.5628          
##          Neg Pred Value : 0.5404          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2202          
##    Detection Prevalence : 0.3913          
##       Balanced Accuracy : 0.5492          
##                                           
##        'Positive' Class : default         
##

Result from Regression Tree: We get an accuracy of over 55% which is an improvement over the default accuracy of 50% (No Information Rate) in the balanced dataset we used for prediction. But given how undesirable False Negatives are in this context, ie predicting bad loans (990) as good, we need to increase the Sensitivity of our prediction. Next we try Random Forest.

Random Forest

#R Random Forest
fit_rf <- randomForest(training$indefault~.,training,importance=TRUE)

#Confusion Matrix on Testing set: Random Forest
pred_rf <- predict(fit_rf, testing)
confusionMatrix(pred_rf, testing$indefault, positive = 'default')

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   default nondefault
##   default       2231       2196
##   nondefault      87        122
##                                         
##                Accuracy : 0.5075        
##                  95% CI : (0.493, 0.522)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 0.1554        
##                                         
##                   Kappa : 0.0151        
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.96247       
##             Specificity : 0.05263       
##          Pos Pred Value : 0.50395       
##          Neg Pred Value : 0.58373       
##              Prevalence : 0.50000       
##          Detection Rate : 0.48123       
##    Detection Prevalence : 0.95492       
##       Balanced Accuracy : 0.50755       
##                                         
##        'Positive' Class : default       
##

head(getTree(fit_rf, 1))

##   left daughter right daughter split var split point status prediction
## 1             2              3         5       19.26      1          0
## 2             4              5         6      687.50      1          0
## 3             6              7         3    74251.00      1          0
## 4             8              9         2        2.00      1          0
## 5            10             11        12    26204.57      1          0
## 6            12             13         6      712.50      1          0

varImpPlot(fit_rf)

Result from Random Forest and Conclusion: Random Forest is helpful in increasing the Sensitivity (96.5%). Given our primary objective is to predict and avoid Defaults, RF works very well, but given the very low Specificity (4%) the model is of limited use. Overall RF gives us a very marginal improvement in accuracy over the No Information Rate of 50%. Further work needs to be carried out using Bagging and/or Boosting.