Background
Peer-to-peer networks are revolutionizing many businesses, including the personal and business lending industry. Lending Club (LC), along with its competitors, efficiently matches people who want to borrow with people who want to lend online, without the need for banks. Through the loan underwriting process, LC collects a wealth of information about borrowers, which it uses to determine their creditworthiness and the interest rate they should pay. As a result, lenders only have to focus on the risk they want to take (i.e., the creditworthiness of the borrower) and the amount and duration of the loan. Despite this, a small percentage of loans across all types of borrowers default, meaning the borrowers fail to pay their obligations. This is a negative outcome that all lenders want to minimize.
Description Field
One of the fields available for analysis is the “Description” of the loan. In this field, borrowers have the opportunity to describe the need for their loan in more detail and answer potential questions from lenders.
Project Objectives
LC’s numerous categorical and continuous variables have been used to build models that identify which borrowers are least likely to default. Little analysis has been done on the description field to determine whether it is a good indicator of the likelihood of default.
In this project, I apply three types of analysis to determine whether this field can be used to predict future defaults. The analysis focuses on the number of characters in the description, word frequencies and associations, and text classification methods.
I use visualization techniques to describe the data or the results of the analysis.
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(tm)
library(RTextTools)
library(SnowballC)
library(maps)
library(wordcloud)
library(RCurl)
LC data has been posted to GitHub in CSV format.
url <- getURL("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Final%20Project/lcloans.csv")
loans <- read.csv(text = url, stringsAsFactors = FALSE)
For this analysis, only loans with a status of “Fully Paid” or “Charged Off” are used, and loans with an empty description are dropped.
#Subsetting the loans data into three variables: loan status, loan description and description length
loans1 <- select(loans, loan_status, desc, Lenth)
#Keeping only "Fully Paid" and "Charged Off" loans that have a non-empty description
ln <- filter(loans1, loan_status %in% c("Fully Paid", "Charged Off"), desc != "")
head(ln)
## loan_status
## 1 Fully Paid
## 2 Charged Off
## 3 Fully Paid
## 4 Fully Paid
## 5 Charged Off
## 6 Charged Off
## desc
## 1 Borrower added on 12/22/11 > I need to upgrade my business technologies.<br>
## 2 Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike. I only need this money because the deal im looking at is to good to pass up.<br><br> Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces<br>
## 3 Borrower added on 12/21/11 > to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.<br>
## 4 Borrower added on 12/16/11 > Downpayment for a car.<br>
## 5 Borrower added on 12/21/11 > I own a small home-based judgment collection business. I have 5 years experience collecting debts. I am now going from a home office to a small office. I also plan to buy a small debt portfolio (eg. $10K for $1M of debt) <br>My score is not A+ because I own my home and have no mortgage.<br>
## 6 Borrower added on 12/16/11 > I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br> Borrower added on 12/20/11 > $1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>
## Lenth
## 1 78
## 2 590
## 3 180
## 4 57
## 5 322
## 6 473
#Counting charged-off loans by state
map <- select(loans, loan_status, addr_state)
map <- filter(map, loan_status == "Charged Off")
states <- map_data("state")
#Translating state abbreviations into lower-case state names; abbreviations missing from state.abb (such as DC) become "district of columbia"
idx <- match(map$addr_state, state.abb)
map$addr_state <- tolower(ifelse(is.na(idx), "District of Columbia", state.name[idx]))
state.counts <- data.frame(table(map$addr_state))
colnames(state.counts) <- c("region", "Num.Loans")
#Joining the counts to the state polygons and restoring the drawing order
result <- merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75)
print(p)
Number of Characters in Description
## From Factor to Numeric Transformation
#as.factor orders the levels alphabetically, so "Charged Off" becomes 1 and "Fully Paid" becomes 2
ln[ ,"loan_status"] <- as.numeric(as.factor(ln$loan_status))
lnc <- ln
lnc_paid <- filter(lnc, loan_status == 2)
lnc_off <- filter(lnc, loan_status == 1)
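The numeric codes follow the alphabetical order of the factor levels, which is why the Charged Off group ends up as 1 and the Fully Paid group as 2. A minimal sketch with a toy vector (the `status` values are illustrative and are not rows from the loans data):

```r
# Toy vector standing in for ln$loan_status (illustrative values only)
status <- c("Fully Paid", "Charged Off", "Fully Paid")

# as.factor() sorts the levels alphabetically; as.numeric() returns the level index
status_factor <- as.factor(status)
status_levels <- levels(status_factor)
status_codes <- as.numeric(status_factor)

print(status_levels)
print(status_codes)
```

Keeping the original labels in a separate column would make later filters such as `loan_status == 1` easier to read.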
Boxplot
boxplot(Lenth ~ loan_status, data = lnc)
Histogram of Character Frequencies
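hist(lnc_paid$Lenth, breaks = 10, col="#CCCCFF", freq=FALSE, main = "Fully Paid", xlab = "Characters in description")
hist(lnc_off$Lenth, breaks = 10, col="#CCCCFF", freq=FALSE, main = "Charged Off", xlab = "Characters in description")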
Character Summary Statistics
by(lnc$Lenth, lnc$loan_status, length)
## lnc$loan_status: 1
## [1] 3776
## --------------------------------------------------------
## lnc$loan_status: 2
## [1] 21437
by(lnc$Lenth, lnc$loan_status, summary)
## lnc$loan_status: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.0 140.0 272.0 435.5 519.0 3989.0 26
## --------------------------------------------------------
## lnc$loan_status: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 143.0 290.0 429.7 550.0 3988.0 183
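Whether a gap between two medians is more than noise could be checked with a rank-based test such as the Wilcoxon rank-sum test, which does not assume normally distributed lengths. The sketch below uses two small made-up vectors (`len_off` and `len_paid` are hypothetical, not the Lending Club data) only to show the call:

```r
# Hypothetical description lengths for two groups (not taken from the loans data)
len_off <- c(100, 150, 200, 250, 300)
len_paid <- c(400, 450, 500, 550, 600)

# Wilcoxon rank-sum test: are the two samples shifted relative to each other?
wt <- wilcox.test(len_off, len_paid)

print(wt$statistic)
print(wt$p.value)
```

On the real data, the equivalent call would presumably be `wilcox.test(Lenth ~ loan_status, data = lnc)`; with ties and missing values R falls back to a normal approximation.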
Scrubbing the Description Field
desc1 <- str_replace_all(ln$desc, "Borrower added on ", "")
desc1[6]
## [1] " 12/16/11 > I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br> 12/20/11 > $1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>"
desc1 <- str_replace_all(desc1, " \\d{2}/\\d{2}/\\d{2} > ", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br>$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>"
desc1 <- str_replace_all(desc1, "\\d{2}/\\d{2}/\\d{2} > ", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.<br><br>$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house.<br>"
desc1 <- str_replace_all(desc1, "<br>", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house."
desc1 <- str_replace_all(desc1, "<br/>", "")
desc1[6]
## [1] "I'm trying to build up my credit history. I live with my brother and have no car payment or credit cards. I am in community college and work full time. Im going to use the money to make some repairs around the house and get some maintenance done on my car.$1000 down only $4375 to go. Thanks to everyone that invested so far, looking forward to surprising my brother with the fixes around the house."
desc1 <- str_replace_all(desc1, "[:punct:]+", " ")
desc1[6]
## [1] "I m trying to build up my credit history I live with my brother and have no car payment or credit cards I am in community college and work full time Im going to use the money to make some repairs around the house and get some maintenance done on my car $1000 down only $4375 to go Thanks to everyone that invested so far looking forward to surprising my brother with the fixes around the house "
desc1 <- str_replace_all(desc1, "[$[:digit:]+]", "")
desc1[6]
## [1] "I m trying to build up my credit history I live with my brother and have no car payment or credit cards I am in community college and work full time Im going to use the money to make some repairs around the house and get some maintenance done on my car down only to go Thanks to everyone that invested so far looking forward to surprising my brother with the fixes around the house "
desc1 <- str_replace_all(desc1, " +", " ")
desc1[6]
## [1] "I m trying to build up my credit history I live with my brother and have no car payment or credit cards I am in community college and work full time Im going to use the money to make some repairs around the house and get some maintenance done on my car down only to go Thanks to everyone that invested so far looking forward to surprising my brother with the fixes around the house "
#Binding clean description to df
ln[ ,"desc"] <- NULL
ln[ ,"desc"] <- desc1
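The scrubbing steps can also be chained into a single helper. The sketch below is one possible version written with base R regular expressions rather than stringr; `clean_desc` is a hypothetical name, and base R's `[[:punct:]]` class may treat symbols such as `$` differently from stringr's ICU engine, so the result is not guaranteed to match the output above character for character.

```r
# Hypothetical helper that chains the scrubbing steps using base R only
clean_desc <- function(x) {
  x <- gsub("Borrower added on ", "", x, fixed = TRUE)       # fixed prefix
  x <- gsub("\\d{2}/\\d{2}/\\d{2} > ", "", x, perl = TRUE)   # date stamps
  x <- gsub("<br\\s*/?>", " ", x, perl = TRUE)               # HTML line breaks
  x <- gsub("[[:punct:]]+", " ", x)                          # punctuation
  x <- gsub("[$[:digit:]]+", "", x)                          # dollar signs and digits
  x <- gsub("\\s+", " ", x, perl = TRUE)                     # repeated whitespace
  trimws(x)                                                  # leading/trailing spaces
}

# Made-up description in the same format as the desc field
raw <- "Borrower added on 12/16/11 > I'm trying to build credit.<br><br> Borrower added on 12/20/11 > $1000 down only $4375 to go.<br>"
cleaned <- clean_desc(raw)
print(cleaned)
```

Replacing `<br>` with a space instead of an empty string keeps the last word of one entry from fusing with the first word of the next.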
Training and Classification Set
#Creating the corpus and converting the text to lower case
ln1 <- DataframeSource(ln)
corpus2 <- Corpus(ln1)
corpus2 <- tm_map(corpus2, content_transformer(tolower))
as.character(corpus2[[6]])
## [1] "1"
## [2] "473"
## [3] "i m trying to build up my credit history i live with my brother and have no car payment or credit cards i am in community college and work full time im going to use the money to make some repairs around the house and get some maintenance done on my car down only to go thanks to everyone that invested so far looking forward to surprising my brother with the fixes around the house "
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
as.character(corpus2[[6]])
## [1] "1"
## [2] "473"
## [3] " m trying build credit history live brother car payment credit cards community college work full time im going use money make repairs around house get maintenance done car go thanks everyone invested far looking forward surprising brother fixes around house "
corpus2 <- tm_map(corpus2, stemDocument)
as.character(corpus2[[6]])
## [1] "1"
## [2] "473"
## [3] " m tri build credit histori live brother car payment credit card communiti colleg work full time im go use money make repair around hous get mainten done car go thank everyon invest far look forward surpris brother fix around hous"
corpus2 <- tm_map(corpus2, stripWhitespace)
as.character(corpus2[[6]])
## [1] "1"
## [2] "473"
## [3] " m tri build credit histori live brother car payment credit card communiti colleg work full time im go use money make repair around hous get mainten done car go thank everyon invest far look forward surpris brother fix around hous"
corpus2 <- tm_map(corpus2, PlainTextDocument) #coercing each document back to a plain text document before building the term matrix
as.character(corpus2[[6]])
## [1] "1"
## [2] "473"
## [3] " m tri build credit histori live brother car payment credit card communiti colleg work full time im go use money make repair around hous get mainten done car go thank everyon invest far look forward surpris brother fix around hous"
#Creating a Document Term Matrix
dtm <- DocumentTermMatrix(corpus2)
#Removing sparse terms (keeping terms that appear in more than about 20 documents)
dtm <- removeSparseTerms(dtm, 1-(20/length(corpus2)))
#Term frequencies across the entire corpus
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
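The threshold `1-(20/length(corpus2))` expresses a document-count cutoff as a sparsity: a term is kept when the share of documents that do not contain it is below the threshold, which works out to terms found in more than roughly 20 documents. A minimal base R sketch of the same idea on a toy matrix (the matrix, the cutoff of 2, and all variable names are assumptions for illustration):

```r
# Toy document-term counts: 5 documents (rows) by 3 terms (columns)
toy_dtm <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    3, 1, 0,
    1, 0, 0,
    0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("loan", "medic", "car"))
)

min_docs <- 2                               # hypothetical cutoff, analogous to the 20 used in the report
doc_freq <- colSums(toy_dtm > 0)            # number of documents containing each term
sparsity <- 1 - doc_freq / nrow(toy_dtm)    # share of documents without the term
threshold <- 1 - min_docs / nrow(toy_dtm)   # same form as 1 - (20 / length(corpus2))

# Keep terms that occur in more than min_docs documents
kept_terms <- names(doc_freq)[doc_freq > min_docs]
print(sparsity)
print(threshold)
print(kept_terms)
```

Filtering on the document count directly avoids floating-point edge cases when a term sits exactly on the cutoff.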
Charged Off
#Corpus
ln_off <- filter(ln, loan_status == 1)
ln1_off <- DataframeSource(ln_off)
corpus_off <- Corpus(ln1_off)
corpus_off <- tm_map(corpus_off, content_transformer(tolower))
corpus_off <- tm_map(corpus_off, removeWords, stopwords("english"))
corpus_off <- tm_map(corpus_off, stemDocument)
corpus_off <- tm_map(corpus_off, stripWhitespace)
corpus_off <- tm_map(corpus_off, PlainTextDocument)
#Document Term Matrix
dtm_off <- DocumentTermMatrix(corpus_off)
#Removing sparse terms
dtm_off <- removeSparseTerms(dtm_off, 1-(20/length(corpus_off)))
#Frequency of Charged Off
freq_off <- sort(colSums(as.matrix(dtm_off)), decreasing = TRUE)
Fully Paid
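ln_paid <- filter(ln, loan_status == 2)
#Downsampling the Fully Paid group (with replacement) to 3,776 loans, the size of the Charged Off group
ln_paid <- sample_n(ln_paid, 3776, replace = TRUE)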
ln1_paid <- DataframeSource(ln_paid)
corpus_paid <- Corpus(ln1_paid)
corpus_paid <- tm_map(corpus_paid, content_transformer(tolower))
corpus_paid <- tm_map(corpus_paid, removeWords, stopwords("english"))
corpus_paid <- tm_map(corpus_paid, stemDocument)
corpus_paid <- tm_map(corpus_paid, stripWhitespace)
corpus_paid <- tm_map(corpus_paid, PlainTextDocument)
#Document Term Matrix
dtm_paid <- DocumentTermMatrix(corpus_paid)
#Removing sparse terms
dtm_paid <- removeSparseTerms(dtm_paid, 1-(20/length(corpus_paid)))
#Frequency of Fully Paid
freq_paid <- sort(colSums(as.matrix(dtm_paid)), decreasing = TRUE)
Word Frequencies
#Entire Corpus
findFreqTerms(dtm, lowfreq = 2000)
## [1] "abl" "account" "also" "alway" "amount" "back"
## [7] "balanc" "bill" "borrow" "budget" "busi" "can"
## [13] "car" "card" "club" "compani" "consolid" "credit"
## [19] "current" "debt" "employ" "expens" "financi" "free"
## [25] "full" "fund" "get" "good" "great" "help"
## [31] "high" "home" "hous" "incom" "interest" "invest"
## [37] "job" "just" "last" "late" "lend" "like"
## [43] "loan" "look" "lower" "make" "money" "month"
## [49] "much" "need" "never" "new" "now" "one"
## [55] "paid" "pay" "payment" "person" "plan" "purchas"
## [61] "rate" "save" "stabl" "start" "take" "thank"
## [67] "time" "two" "use" "want" "well" "will"
## [73] "work" "year"
#Fully Paid Group
findFreqTerms(dtm_paid, lowfreq = 1000)
## [1] "card" "consolid" "credit" "debt" "interest" "job"
## [7] "loan" "month" "pay" "payment" "rate" "thank"
## [13] "time" "use" "will" "work" "year"
#Charged Off Group
findFreqTerms(dtm_off, lowfreq = 1000)
## [1] "bill" "card" "consolid" "credit" "debt" "get"
## [7] "help" "interest" "job" "loan" "month" "one"
## [13] "pay" "payment" "thank" "time" "use" "will"
## [19] "work" "year"
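The two group lists can be compared directly with set operations. The sketch below copies the stems from the two `findFreqTerms` outputs above into vectors (`top_paid` and `top_off` are names chosen for this example) and reports the stems unique to each group. Because the Fully Paid group is a random sample, the lists may differ slightly from run to run.

```r
# Stems with a frequency of at least 1000 in the Fully Paid sample (copied from the output above)
top_paid <- c("card", "consolid", "credit", "debt", "interest", "job",
              "loan", "month", "pay", "payment", "rate", "thank",
              "time", "use", "will", "work", "year")

# Stems with a frequency of at least 1000 in the Charged Off group (copied from the output above)
top_off <- c("bill", "card", "consolid", "credit", "debt", "get",
             "help", "interest", "job", "loan", "month", "one",
             "pay", "payment", "thank", "time", "use", "will",
             "work", "year")

only_off <- setdiff(top_off, top_paid)    # frequent only among Charged Off loans
only_paid <- setdiff(top_paid, top_off)   # frequent only among Fully Paid loans
print(only_off)
print(only_paid)
```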
Associations
#Fully Paid Group
findAssocs(dtm_paid, "card", corlimit = 0.3)
## $card
## credit pay interest payment rate month balanc debt
## 0.81 0.53 0.41 0.41 0.39 0.38 0.37 0.36
## loan
## 0.30
#Charged Off Group
findAssocs(dtm_off, "bill", corlimit = 0.2)
## $bill
## pay time medic alway month job
## 0.33 0.28 0.23 0.22 0.21 0.20
#Fully Paid Group
findAssocs(dtm_paid, "good", corlimit = 0.2)
## $good
## borrow credit job month pay stand candid loan year
## 0.35 0.27 0.26 0.24 0.24 0.23 0.22 0.22 0.20
#Charged Off Group
findAssocs(dtm_off, "good", corlimit = 0.2)
## $good
## borrow job time credit make stabl maintain stand
## 0.36 0.29 0.26 0.24 0.24 0.23 0.22 0.22
## year candid loan plan
## 0.22 0.21 0.21 0.21
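The association scores are, per the tm documentation, correlations between the per-document counts of two terms. A minimal sketch of that calculation with base R's `cor()` on a toy matrix, assuming a Pearson correlation (the counts below are illustrative and not taken from the loans data):

```r
# Toy document-term matrix: 4 documents (rows) by 2 stems (columns)
toy <- cbind(bill  = c(1, 0, 1, 0),
             medic = c(1, 0, 0, 0))

# Pearson correlation between the two columns of counts
r <- cor(toy[, "bill"], toy[, "medic"])
print(round(r, 2))
```

A score near 0.2, like several of those reported above, indicates that the two stems co-occur only loosely.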
Frequency Diagrams
Charged Off
wf_off <- data.frame(word=names(freq_off), freq=freq_off)
subset(wf_off, freq > 1000) %>%
ggplot(aes(word, freq)) +
geom_bar(stat ="identity") +
theme(axis.text.x = element_text(angle=10, hjust = 1))
Fully Paid
wf_paid <- data.frame(word=names(freq_paid), freq=freq_paid)
subset(wf_paid, freq > 1000) %>%
ggplot(aes(word, freq)) +
geom_bar(stat ="identity") +
theme(axis.text.x = element_text(angle=10, hjust = 1))
Classification Set Wordcloud
wordcloud(names(freq), freq, min.freq = 1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
Fully Paid Wordcloud
wordcloud(names(freq_paid), freq_paid, min.freq = 1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
Charged Off Wordcloud
wordcloud(names(freq_off), freq_off, min.freq = 1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
Creating a Container
container <- create_container(dtm, labels = ln$loan_status, trainSize =1:2000, testSize = 2001:3000, virgin = FALSE)
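The container trains on rows 1 to 2,000 and tests on rows 2,001 to 3,000 in file order. If the file is sorted in some way (for example by issue date), shuffling the rows first may give a more representative split. The sketch below shows one way to build a shuffled row order; the seed is arbitrary, and `n_docs` assumes the 3,776 + 21,437 loans counted earlier.

```r
set.seed(607)                           # arbitrary seed, for reproducibility only
n_docs <- 3776 + 21437                  # number of loans, per the counts reported earlier
shuffled <- sample(n_docs)              # random permutation of the row numbers

train_rows <- shuffled[1:2000]          # rows to place in positions 1:2000
test_rows <- shuffled[2001:3000]        # rows to place in positions 2001:3000
new_order <- c(train_rows, test_rows)   # index for reordering the matrix rows and the labels

print(head(new_order))
```

With this index, `dtm[new_order, ]` and `ln$loan_status[new_order]` should be usable in `create_container` with the same `trainSize` and `testSize` ranges.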
Training Set
SVM <- train_model(container, "SVM")
MAXENT <- train_model(container, "MAXENT")
GLMNET <- train_model(container, "GLMNET")
Classification
SVMC <- classify_model(container, SVM)
MAXENTC <- classify_model(container, MAXENT)
GLMNETC <- classify_model(container, GLMNET)
Performance & Summaries
analytics <- create_analytics(container, cbind(SVMC, MAXENTC, GLMNETC))
topic_summary <- analytics@label_summary
alg_summary <- analytics@algorithm_summary
ens_summary <- analytics@ensemble_summary
doc_summary <- analytics@document_summary
print(topic_summary)
## NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 1 186 40 269
## 2 814 960 731
## PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 1 21.50538 144.62366 4.301075
## 2 117.93612 89.80344 96.068796
## PCT_CORRECTLY_CODED_PROBABILITY
## 1 28.49462
## 2 73.46437
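The PCT_ columns appear to be the predicted counts divided by the manually coded counts, which is why they can exceed 100%. This reading is inferred from the numbers in the table rather than from the RTextTools documentation; the arithmetic below reproduces the first two percentage columns.

```r
# Counts for labels 1 (Charged Off) and 2 (Fully Paid), copied from the label summary above
manually_coded <- c(186, 814)
consensus_coded <- c(40, 960)
probability_coded <- c(269, 731)

# Predicted counts as a percentage of the true counts for each label
pct_consensus <- 100 * consensus_coded / manually_coded
pct_probability <- 100 * probability_coded / manually_coded

print(round(pct_consensus, 5))
print(round(pct_probability, 5))
```

The manually coded counts also show the class imbalance in the test set: 814 of the 1,000 test loans are Fully Paid, so always predicting Fully Paid would already be right 81.4% of the time.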
print(alg_summary)
## SVM_PRECISION SVM_RECALL SVM_FSCORE GLMNET_PRECISION GLMNET_RECALL
## 1 1.00 0.01 0.02 0.19 0.09
## 2 0.81 1.00 0.90 0.81 0.91
## GLMNET_FSCORE MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 1 0.12 0.20 0.29 0.24
## 2 0.86 0.82 0.73 0.77
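The F-score columns are consistent with the harmonic mean of precision and recall. A short check for label 1 (Charged Off), assuming the balanced F1 definition and using the rounded values from the table above (`f_score` is a helper defined here for illustration):

```r
# Harmonic mean of precision and recall (balanced F1)
f_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Precision and recall for label 1 (Charged Off), copied from the algorithm summary above
precision <- c(SVM = 1.00, GLMNET = 0.19, MAXENT = 0.20)
recall <- c(SVM = 0.01, GLMNET = 0.09, MAXENT = 0.29)

f1 <- round(f_score(precision, recall), 2)
print(f1)
```

The SVM row shows why precision alone is misleading here: every loan it flagged was a true default, but it flagged almost none of them.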
Character Counts
In the character count comparison, the median number of characters was higher for Fully Paid loans than for Charged Off loans (290 vs. 272), although the means were nearly identical (429.7 vs. 435.5). Both groups had a large number of outliers; the Fully Paid group appeared to have more, which is not surprising given that it is more than five times larger (21,437 vs. 3,776 loans). Although I am generalizing, I can imagine that the more people write in the description, the more likely they are to pay, but the difference observed here is small.
Word Frequency
The word frequencies of the Fully Paid and Charged Off groups looked very similar overall. One exception was that the stem “bill” came up as a top word in the Charged Off group but not in the Fully Paid group, given the set criterion (a frequency of at least 1,000). Association analysis shows that this stem is correlated with “medic” (0.23) in the Charged Off group, which probably reflects medical bills.
Classification Methods
I had high hopes that the classification methods would reasonably predict which loans would be likely to default. Overall, I was very disappointed. Of the three methods I used, the highest recall for the Charged Off class was only 29%, from MAXENT.
Recall (Charged Off)
SVM = 1%
GLMNET = 9%
MAXENT = 29%