Project Motivation

Background

Peer-to-peer networks are changing many businesses, including the personal and business lending industry. Lending Club (LC), along with its competitors, matches people who want to borrow and lend online without the need for banks. Through the loan underwriting process, LC collects a wealth of information about borrowers, which it uses to determine their creditworthiness and the interest rate they should pay. As a result, lenders only have to decide on the risk they want to take (i.e., the creditworthiness of the borrower) and the amount and duration of the loan. Despite this, a small percentage of loans across all types of borrowers default, meaning the borrowers fail to repay their obligation. This is a negative outcome that all lenders want to minimize.

Description Field

One of the fields available for analysis is the “Description” of the loan. In this field, borrowers have the opportunity to describe the need for their loan in more detail and to answer potential questions from lenders.

Project Objectives

LC’s numerous categorical and continuous variables have been used to build models that identify which borrowers are least likely to default. Little analysis has been done on the description field to determine whether it is a good indicator of the likelihood of default.

In this project, I apply three types of analysis to determine if this field can be used to predict future defaults. The analysis will focus on:

  1. Character Counts
  2. Word Frequency
  3. Classification Methods

I use visualization techniques to describe the data or the results of the analysis.

Loading Required Libraries

library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(tm)
library(RTextTools)
library(SnowballC)
library(maps)
library(wordcloud)
library(RCurl)

Loading the Lending Club Data from Github

The LC data has been posted to GitHub in CSV format.

url <- getURL("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Final%20Project/lcloans.csv")
loans <- read.csv(text = url, stringsAsFactors = FALSE)

Subsetting the Data Set

For this analysis only the “Fully Paid” and “Charged Off” values will be used.

#Subsetting the loans data into three variables: loan status, loan description and description length
loans1 <- select(loans, loan_status, desc, Lenth)

#Filtering the loan data into "Fully Paid" and "Charged Off" loans with a non-empty description.
ln <- filter(loans1, loan_status %in% c("Fully Paid", "Charged Off"), desc != "")

head(ln)
##   loan_status
## 1  Fully Paid
## 2 Charged Off
## 3  Fully Paid
## 4  Fully Paid
## 5 Charged Off
## 6 Charged Off
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             desc
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Borrower added on 12/22/11 > I need to upgrade my business technologies.<br>
## 2   Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike. I only need this money because the deal im looking at is to good to pass up.<br><br>  Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces<br>
## 3                                                                                                                                                                                                                                                                                                                                                                                                                             Borrower added on 12/21/11 > to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.<br>
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Borrower added on 12/16/11 > Downpayment for a car.<br>
## 5                                                                                                                                                                                                                                                                               Borrower added on 12/21/11 > I own a small home-based judgment collection business. I have 5 years experience collecting debts. I am now going from a home office to a small office. I also plan to buy a small debt portfolio (eg. $10K for $1M of debt) <br>My score is not A+ because I own my home and have no mortgage.<br>
## 6                                                                                                                        Borrower added on 12/16/11 > I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br>  Borrower added on 12/20/11 > $1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>
##   Lenth
## 1    78
## 2   590
## 3   180
## 4    57
## 5   322
## 6   473

Map of Lending Club’s Charged Offs

map <- select(loans, loan_status, addr_state, All.Caps)
map <- filter(map, loan_status == "Charged Off")

states <- map_data("state")

state.names <- unlist(sapply(map$addr_state, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]))

map$addr_state <- tolower(state.names)
colnames(map)[3] <- "region"

state.counts <- data.frame(table(map$addr_state))
colnames(state.counts) <- c("region", "Num.Loans")

result <- merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]

p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75)

print(p)
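
The abbreviation-to-name step above relies on pattern matching with grep. As an alternative, here is a minimal sketch of an exact lookup with match(), assuming addr_state holds two-letter codes. The helper name abb_to_region is hypothetical and is not part of the original analysis. Unlike the code above, it maps only "DC" to the District of Columbia and leaves any other unknown code as NA.

```r
# Hypothetical helper: map two-letter state codes to the lowercase region
# names used by map_data("state"). state.abb and state.name ship with base R.
abb_to_region <- function(abb) {
  full <- state.name[match(abb, state.abb)]   # exact match; NA when not a state
  full[is.na(full) & abb == "DC"] <- "District of Columbia"
  tolower(full)
}

abb_to_region(c("NY", "CA", "DC"))
```

An exact lookup avoids accidental partial matches and makes unknown codes visible as NA instead of silently assigning them to one region.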

Character Counts

Number of Characters in Description

## From Factor to Numeric Transformation

ln[ ,"loan_status"] <- as.numeric(as.factor(ln$loan_status)) #factor levels sort alphabetically: 1 = Charged Off, 2 = Fully Paid
lnc <- ln
lnc_paid <- filter(lnc, lnc$loan_status == 2)
lnc_off <- filter(lnc, lnc$loan_status == 1)
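
The numeric codes used above come from the alphabetical order of the factor levels. The following sketch illustrates this on a made-up status vector (not the Lending Club data); the lookup name code_of is hypothetical.

```r
# Toy status vector (made-up values, not the Lending Club data)
status <- c("Fully Paid", "Charged Off", "Fully Paid")

f <- as.factor(status)
levels(f)                 # levels are sorted alphabetically
codes <- as.numeric(f)    # integer position of each value in levels(f)

# A named lookup keeps the mapping explicit
code_of <- setNames(seq_along(levels(f)), levels(f))
code_of
```

Printing the lookup once makes it easier to check that filters such as loan_status == 1 select the intended group.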

Distribution of Characters

Boxplot

boxplot(Lenth ~ loan_status, data = lnc)

Histogram of Character Frequencies

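hist(lnc_paid$Lenth, breaks = 10, col="#CCCCFF", freq=FALSE, main = "Fully Paid", xlab = "Characters in description")
hist(lnc_off$Lenth, breaks = 10, col="#CCCCFF", freq=FALSE, main = "Charged Off", xlab = "Characters in description")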

Character Summary Statistics

by(lnc$Lenth, lnc$loan_status, length)
## lnc$loan_status: 1
## [1] 3776
## -------------------------------------------------------- 
## lnc$loan_status: 2
## [1] 21437
by(lnc$Lenth, lnc$loan_status, summary)
## lnc$loan_status: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     4.0   140.0   272.0   435.5   519.0  3989.0      26 
## -------------------------------------------------------- 
## lnc$loan_status: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0   143.0   290.0   429.7   550.0  3988.0     183
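
The group summaries above can be reproduced on a small scale. This is a minimal sketch on made-up rows, assuming the Lenth column holds the number of characters in desc; the column name n_char is hypothetical.

```r
# Toy data frame (made-up rows); loan_status uses 1 = Charged Off, 2 = Fully Paid
toy <- data.frame(
  loan_status = c(1, 1, 2, 2, 2),
  desc = c("car repair", "medical bills", "debt consolidation loan",
           "pay off credit cards", NA),
  stringsAsFactors = FALSE
)

# Character count per description; nchar() returns NA for a missing string
toy$n_char <- nchar(toy$desc)

# Median length per group, ignoring missing values
med <- tapply(toy$n_char, toy$loan_status, median, na.rm = TRUE)
med
```

Passing na.rm = TRUE matters here because, as the NA's column above shows, some lengths are missing in both groups.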

Creating a Corpus, DocumentTermMatrix and Term Frequencies

Scrubbing the Description Field

desc1 <- str_replace_all(ln$desc, "Borrower added on ", "")
desc1[6]
## [1] "  12/16/11 > I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br>  12/20/11 > $1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>"
desc1 <- str_replace_all(desc1, "  \\d{2}/\\d{2}/\\d{2} > ", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br>$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>"
desc1 <- str_replace_all(desc1, "\\d{2}/\\d{2}/\\d{2} > ", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br>$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>"
desc1 <- str_replace_all(desc1, "<br>", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house."
desc1 <- str_replace_all(desc1, "<br/>", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house."
desc1 <- str_replace_all(desc1, "[:punct:]+", " ")
desc1[6]
## [1] "I m trying to build up my credit history  I live with my brother and have no car payment or credit cards  I am in community college and work full time  Im going to use the money to make some repairs around the house and get some maintenance done on my car $1000 down only $4375 to go  Thanks to everyone that invested so far  looking forward to surprising my brother with the fixes around the house "
desc1 <- str_replace_all(desc1, "[$[:digit:]+]", "")
desc1[6]
## [1] "I m trying to build up my credit history  I live with my brother and have no car payment or credit cards  I am in community college and work full time  Im going to use the money to make some repairs around the house and get some maintenance done on my car  down only  to go  Thanks to everyone that invested so far  looking forward to surprising my brother with the fixes around the house "
desc1 <- str_replace_all(desc1, "  ", " ")
desc1[6]
## [1] "I m trying to build up my credit history I live with my brother and have no car payment or credit cards I am in community college and work full time Im going to use the money to make some repairs around the house and get some maintenance done on my car down only to go Thanks to everyone that invested so far looking forward to surprising my brother with the fixes around the house "
#Binding clean description to df
ln[ ,"desc"] <- NULL
ln[ ,"desc"] <- desc1
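
The scrubbing steps above can be bundled into a single function. This is a sketch using base R gsub() rather than stringr, and it is not the code that produced the output shown. The helper name clean_desc is hypothetical. It differs in two small ways: line-break tags are replaced by a space so adjacent words do not fuse, and the POSIX class [[:punct:]] also removes the dollar sign.

```r
# Hypothetical helper that bundles the scrubbing steps into one function.
# Uses base R gsub() in place of stringr::str_replace_all().
clean_desc <- function(x) {
  x <- gsub("Borrower added on ", "", x, fixed = TRUE)
  x <- gsub("\\d{2}/\\d{2}/\\d{2} > ", "", x)   # date stamps such as "12/16/11 > "
  x <- gsub("<br\\s*/?>", " ", x)               # <br>, <br/> and <br />
  x <- gsub("[[:punct:]]+", " ", x)             # punctuation runs become one space
  x <- gsub("[[:digit:]]+", "", x)              # drop numbers
  x <- gsub("\\s+", " ", x)                     # collapse repeated whitespace
  trimws(x)
}

clean_desc("  Borrower added on 12/16/11 > Downpayment for a car.<br>")
```

Keeping the steps in one function means the same cleaning can be applied to any new description text without repeating each replacement.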

Training and Classification Set

ln1 <- DataframeSource(ln)
corpus2 <- Corpus(ln1)

#Transforming the corpus: lowercase, stop word removal, stemming, whitespace stripping
corpus2 <- tm_map(corpus2, content_transformer(tolower))
as.character(corpus2[[6]])
## [1] "1"                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "473"                                                                                                                                                                                                                                                                                                                                                                                           
## [3] "i m trying to build up my credit history i live with my brother and have no car payment or credit cards i am in community college and work full time im going to use the money to make some repairs around the house and get some maintenance done on my car down only to go thanks to everyone that invested so far looking forward to surprising my brother with the fixes around the house "
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
as.character(corpus2[[6]])
## [1] "1"                                                                                                                                                                                                                                                                                                   
## [2] "473"                                                                                                                                                                                                                                                                                                 
## [3] " m trying  build   credit history  live   brother    car payment  credit cards    community college  work full time im going  use  money  make  repairs around  house  get  maintenance done   car    go thanks  everyone  invested  far looking forward  surprising  brother   fixes around  house "
corpus2 <- tm_map(corpus2, stemDocument)
as.character(corpus2[[6]])
## [1] "1"                                                                                                                                                                                                                                                                       
## [2] "473"                                                                                                                                                                                                                                                                     
## [3] " m tri  build   credit histori  live   brother    car payment  credit card    communiti colleg  work full time im go  use  money  make  repair around  hous  get  mainten done   car    go thank  everyon  invest  far look forward  surpris  brother   fix around  hous"
corpus2 <- tm_map(corpus2, stripWhitespace)
as.character(corpus2[[6]])
## [1] "1"                                                                                                                                                                                                                                     
## [2] "473"                                                                                                                                                                                                                                   
## [3] " m tri build credit histori live brother car payment credit card communiti colleg work full time im go use money make repair around hous get mainten done car go thank everyon invest far look forward surpris brother fix around hous"
corpus2 <- tm_map(corpus2, PlainTextDocument) #coerce documents to PlainTextDocument before building the document-term matrix

as.character(corpus2[[6]])
## [1] "1"                                                                                                                                                                                                                                     
## [2] "473"                                                                                                                                                                                                                                   
## [3] " m tri build credit histori live brother car payment credit card communiti colleg work full time im go use money make repair around hous get mainten done car go thank everyon invest far look forward surpris brother fix around hous"
#Creating a Document Term Matrix
dtm <- DocumentTermMatrix(corpus2)

#Removing sparse terms
dtm <- removeSparseTerms(dtm, 1-(20/length(corpus2)))

#Term frequencies
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
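
The threshold 1-(20/length(corpus2)) keeps, roughly, the terms that appear in more than about 20 documents. The sketch below shows the idea on a made-up count matrix in base R. It assumes that removeSparseTerms keeps only terms whose share of empty documents is below the threshold, which is how I read the tm documentation; the 0.5 cutoff is chosen only for illustration.

```r
# Toy document-term counts (made-up): 5 documents x 3 terms
m <- matrix(c(1, 0, 2, 1, 1,    # "loan"   appears in 4 documents
              0, 0, 1, 0, 0,    # "boat"   appears in 1 document
              1, 1, 0, 0, 1),   # "credit" appears in 3 documents
            nrow = 5,
            dimnames = list(NULL, c("loan", "boat", "credit")))

doc_freq <- colSums(m > 0)          # number of documents containing each term
sparsity <- 1 - doc_freq / nrow(m)  # share of documents where the term is absent

# Keep terms whose sparsity is below a threshold, here 0.5
kept <- names(sparsity)[sparsity < 0.5]
kept
```

Expressing the cutoff as a document count divided by the corpus size, as the code above does, keeps the filter comparable across corpora of different sizes.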

Charged Off

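#Corpus for the Charged Off group
ln_off <- filter(ln, loan_status == 1)
ln1_off <- DataframeSource(ln_off)
corpus_off <- Corpus(ln1_off)
corpus_off <- tm_map(corpus_off, content_transformer(tolower)) #lowercase first so capitalized stop words are removed
corpus_off <- tm_map(corpus_off, removeWords, stopwords("english"))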
corpus_off <- tm_map(corpus_off, stemDocument)
corpus_off <- tm_map(corpus_off, stripWhitespace)
corpus_off <- tm_map(corpus_off, PlainTextDocument)

#Document Term Matrix
dtm_off <- DocumentTermMatrix(corpus_off)

#Removing sparse terms
dtm_off <- removeSparseTerms(dtm_off, 1-(20/length(corpus_off)))

#Term frequencies for Charged Off
freq_off <- sort(colSums(as.matrix(dtm_off)), decreasing = TRUE)

Fully Paid

ln_paid <- filter(ln, loan_status == 2)
#Sampling 3,776 Fully Paid loans (with replacement) to match the size of the Charged Off group
ln_paid <- sample_n(ln_paid, 3776, replace = TRUE)

ln1_paid <- DataframeSource(ln_paid)

corpus_paid <- Corpus(ln1_paid)
corpus_paid <- tm_map(corpus_paid, content_transformer(tolower))
corpus_paid <- tm_map(corpus_paid, removeWords, stopwords("english"))
corpus_paid <- tm_map(corpus_paid, stemDocument)
corpus_paid <- tm_map(corpus_paid, stripWhitespace)
corpus_paid <- tm_map(corpus_paid, PlainTextDocument)

#DocumentTermMatrix
dtm_paid <- DocumentTermMatrix(corpus_paid)

#Removing sparse terms
dtm_paid <- removeSparseTerms(dtm_paid, 1-(20/length(corpus_paid)))

#Frequency of Paid
freq_paid <- sort(colSums(as.matrix(dtm_paid)), decreasing = TRUE)

Frequency and Association Analysis

Word Frequencies

#Entire Corpus
findFreqTerms(dtm, lowfreq = 2000)
##  [1] "abl"      "account"  "also"     "alway"    "amount"   "back"    
##  [7] "balanc"   "bill"     "borrow"   "budget"   "busi"     "can"     
## [13] "car"      "card"     "club"     "compani"  "consolid" "credit"  
## [19] "current"  "debt"     "employ"   "expens"   "financi"  "free"    
## [25] "full"     "fund"     "get"      "good"     "great"    "help"    
## [31] "high"     "home"     "hous"     "incom"    "interest" "invest"  
## [37] "job"      "just"     "last"     "late"     "lend"     "like"    
## [43] "loan"     "look"     "lower"    "make"     "money"    "month"   
## [49] "much"     "need"     "never"    "new"      "now"      "one"     
## [55] "paid"     "pay"      "payment"  "person"   "plan"     "purchas" 
## [61] "rate"     "save"     "stabl"    "start"    "take"     "thank"   
## [67] "time"     "two"      "use"      "want"     "well"     "will"    
## [73] "work"     "year"
#Fully Paid Group
findFreqTerms(dtm_paid, lowfreq = 1000)
##  [1] "card"     "consolid" "credit"   "debt"     "interest" "job"     
##  [7] "loan"     "month"    "pay"      "payment"  "rate"     "thank"   
## [13] "time"     "use"      "will"     "work"     "year"
#Charged Off Group
findFreqTerms(dtm_off, lowfreq = 1000)
##  [1] "bill"     "card"     "consolid" "credit"   "debt"     "get"     
##  [7] "help"     "interest" "job"      "loan"     "month"    "one"     
## [13] "pay"      "payment"  "thank"    "time"     "use"      "will"    
## [19] "work"     "year"

Associations

#Fully Paid Group
findAssocs(dtm_paid, "card", corlimit = 0.3)
## $card
##   credit      pay interest  payment     rate    month   balanc     debt 
##     0.81     0.53     0.41     0.41     0.39     0.38     0.37     0.36 
##     loan 
##     0.30
#Charged Off Group
findAssocs(dtm_off, "bill", corlimit = 0.2)
## $bill
##   pay  time medic alway month   job 
##  0.33  0.28  0.23  0.22  0.21  0.20
#Fully Paid Group
findAssocs(dtm_paid, "good", corlimit = 0.2)
## $good
## borrow credit    job  month    pay  stand candid   loan   year 
##   0.35   0.27   0.26   0.24   0.24   0.23   0.22   0.22   0.20
#Charged Off Group
findAssocs(dtm_off, "good", corlimit = 0.2)
## $good
##   borrow      job     time   credit     make    stabl maintain    stand 
##     0.36     0.29     0.26     0.24     0.24     0.23     0.22     0.22 
##     year   candid     loan     plan 
##     0.22     0.21     0.21     0.21
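
The association scores above are correlations between term counts across documents. The following sketch reproduces the idea with base R cor() on a made-up matrix. It is an illustration of the concept, not the tm implementation, and the term counts are invented.

```r
# Toy document-term counts (made-up): 6 documents x 3 terms
m <- cbind(
  card   = c(2, 0, 1, 0, 3, 0),
  credit = c(2, 0, 1, 0, 2, 1),
  boat   = c(0, 1, 0, 2, 0, 0)
)

# Pearson correlation of "card" with every other term across documents
assoc <- cor(m)[, "card"]
assoc <- sort(assoc[names(assoc) != "card"], decreasing = TRUE)

# Keep terms at or above a lower limit, mirroring corlimit
assoc[assoc >= 0.3]
```

A high score means two terms tend to appear in the same descriptions, which is why "credit" sits at the top of the list for "card" above.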

Frequency Diagrams

Charged Off

wf_off <-data.frame(word=names(freq_off), freq=freq_off) 
subset(wf_off, freq > 1000) %>% 
  ggplot(aes(word, freq)) + 
  geom_bar(stat ="identity") + 
  theme(axis.text.x = element_text(angle=10, hjust = 1))

Fully Paid

wf_paid <-data.frame(word=names(freq_paid), freq=freq_paid)
subset(wf_paid, freq > 1000) %>% 
  ggplot(aes(word, freq)) + 
  geom_bar(stat ="identity") + 
  theme(axis.text.x = element_text(angle=10, hjust = 1))

Wordclouds

Classification Set Wordcloud

wordcloud(names(freq), freq, min.freq = 1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Fully Paid Wordcloud

wordcloud(names(freq_paid), freq_paid, min.freq = 1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Charged Off Wordcloud

wordcloud(names(freq_off), freq_off, min.freq = 1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Classification of Loan Status

Creating a Container

container <- create_container(dtm, labels = ln$loan_status, trainSize =1:2000, testSize = 2001:3000, virgin = FALSE)

Training Set

SVM <- train_model(container, "SVM")
MAXENT <- train_model(container, "MAXENT")
GLMNET <- train_model(container, "GLMNET")

Classification

SVMC <- classify_model(container, SVM)
MAXENTC <- classify_model(container, MAXENT)
GLMNETC <- classify_model(container, GLMNET)

Performance & Summaries

analytics <- create_analytics(container, cbind(SVMC, MAXENTC, GLMNETC))

topic_summary <- analytics@label_summary
alg_summary <- analytics@algorithm_summary
ens_summary <- analytics@ensemble_summary
doc_summary <- analytics@document_summary

print(topic_summary)
##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 1                186                  40                   269
## 2                814                 960                   731
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 1            21.50538             144.62366                      4.301075
## 2           117.93612              89.80344                     96.068796
##   PCT_CORRECTLY_CODED_PROBABILITY
## 1                        28.49462
## 2                        73.46437
print(alg_summary)
##   SVM_PRECISION SVM_RECALL SVM_FSCORE GLMNET_PRECISION GLMNET_RECALL
## 1          1.00       0.01       0.02             0.19          0.09
## 2          0.81       1.00       0.90             0.81          0.91
##   GLMNET_FSCORE MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 1          0.12                 0.20              0.29              0.24
## 2          0.86                 0.82              0.73              0.77
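
To make the precision, recall and F-score columns above easier to read, here is a minimal sketch that computes them for one class from made-up labels. The helper name class_metrics is hypothetical and the labels are not taken from the Lending Club data.

```r
# Toy labels (made-up): 1 = Charged Off, 2 = Fully Paid
actual    <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
predicted <- c(1, 2, 2, 2, 2, 2, 2, 2, 1, 2)

# Hypothetical helper: precision, recall and F-score for one class
class_metrics <- function(actual, predicted, class) {
  tp <- sum(predicted == class & actual == class)   # true positives
  fp <- sum(predicted == class & actual != class)   # false positives
  fn <- sum(predicted != class & actual == class)   # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(precision = precision,
    recall    = recall,
    fscore    = 2 * precision * recall / (precision + recall))
}

class_metrics(actual, predicted, class = 1)
```

This also explains the SVM row for class 1 above: a model that labels almost every loan as Fully Paid can reach a precision of 1.00 on the few loans it flags while its recall stays near zero.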

Conclusion

Character Counts

In the character count comparison, the median number of characters was higher for Fully Paid loans than for Charged Off loans (290 vs. 272). Both groups had a large number of outliers, but the Fully Paid group appeared to have more. The means point the other way (429.7 for Fully Paid vs. 435.5 for Charged Off), so the difference between the groups is small. Although I am generalizing, I can imagine that the more people write in the description, the more likely they are to repay.

Word Frequency

The word frequencies of the Fully Paid and Charged Off groups looked very similar overall. One exception was that the stem “bill” came up as a top word in the Charged Off group but not in the Fully Paid group, given the same frequency threshold (at least 1,000 occurrences). Association analysis shows that this word is correlated with “medic” (0.23), which probably reflects medical bills.

Classification Methods

I had high hopes that the classification methods would reasonably predict which loans would be likely to default. Overall I was very disappointed. Of the three methods I used, MAXENT achieved the highest recall for the Charged Off class, at only 29%.

Recall (Charged Off class)
SVM = 1%
GLMNET = 9%
MAXENT = 29%