Lending Club connects people who need money (borrowers) with people who have money (investors). An investor you would want to invest in people who showed a profile of having a high probability of paying back. This is a model that will predict this. This data is from before they even went public. I used the lending data from 2007-2010 and classified and predict whether or not the borrower paid back their loan in full. The csv file is in this repository.
Here are what the columns represent:
- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise
- purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”)
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates. (risky <- high int.rate)
- installment: The monthly installments owed by the borrower if the loan is funded
- log.annual.inc: The natural log of the self-reported annual income of the borrower
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income)
- fico: The FICO credit score of the borrower
- days.with.cr.line: The number of days the borrower has had a credit line
- revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle)
- revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available)
- inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years
- pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments)
Goal = our purpose is to analyze the data to find new insights hidden in our data.
We will import and open the data to get a better understanding about the problem
data <- read.csv("loan_data.csv")
head(data)Firstly, we need to answer some basic questions:
1. How many fully paid data and not fully paid data?
2, Is there any correlation between the columns that makes someone cannot full pay the loan?
3. What kind of informations given which tend to fully paid and not fully paid?
If the difference value between not fully paid and fully paid is significant, there might be something wrong with either the lending company or the borrower.
Before going to the questions, we need to see the data from a big frame by using “summary()” and “str()”. We also need to check if there is any null value in our data. If there is any (>5%), we will have to do an imputation to fill the missing. But, if there is a column with missing values as its majority or the column will not give any significant effect, we can drop them.
summary(data)## credit.policy purpose int.rate installment
## Min. :0.000 Length:9578 Min. :0.0600 Min. : 15.67
## 1st Qu.:1.000 Class :character 1st Qu.:0.1039 1st Qu.:163.77
## Median :1.000 Mode :character Median :0.1221 Median :268.95
## Mean :0.805 Mean :0.1226 Mean :319.09
## 3rd Qu.:1.000 3rd Qu.:0.1407 3rd Qu.:432.76
## Max. :1.000 Max. :0.2164 Max. :940.14
## log.annual.inc dti fico days.with.cr.line
## Min. : 7.548 Min. : 0.000 Min. :612.0 Min. : 179
## 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0 1st Qu.: 2820
## Median :10.929 Median :12.665 Median :707.0 Median : 4140
## Mean :10.932 Mean :12.607 Mean :710.8 Mean : 4561
## 3rd Qu.:11.291 3rd Qu.:17.950 3rd Qu.:737.0 3rd Qu.: 5730
## Max. :14.528 Max. :29.960 Max. :827.0 Max. :17640
## revol.bal revol.util inq.last.6mths delinq.2yrs
## Min. : 0 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 3187 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 8596 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 16914 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 18250 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :1207359 Max. :119.0 Max. :33.000 Max. :13.0000
## pub.rec not.fully.paid
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.06212 Mean :0.1601
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :5.00000 Max. :1.0000
str(data)## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : chr "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
We need to change “purpose” to be category type, because there are only 7 values which appear repeatedly. We change the data type so we can get more insight about the correlation between “purpose” and the other columns.
data$purpose <- as.factor(data$purpose)Check if there any missing value in our data.
sum(is.na(data))## [1] 0
Based on the result, there is no missing value in our data. SO, we are ready to do data exploration.
In order to find insghts in our data, we can use visualization. Before visualizing our data, we need to import libraries needed for data visualization.
library(ggplot2)
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We want to see a relation between “credit.policy” and “not.fully.paid”
numtab <- data[, -c(2)]
corcreditpaid <- aggregate(fico ~ credit.policy + not.fully.paid, data = numtab, FUN = "length")
names(corcreditpaid)[names(corcreditpaid) == 'fico'] <- "total"
corcreditpaidVisualize the corcreditpaid table in barplot
corcreditpaid$credit.policy <- as.factor(corcreditpaid$credit.policy)
corcreditpaid$not.fully.paid <- as.factor(corcreditpaid$not.fully.paid)
corcreditpaid <- corcreditpaid %>%
group_by(credit.policy) %>%
mutate(percent = total/sum(total)) %>%
ungroup() %>%
mutate(percent = round(percent, 2)*100,
percent = paste(percent, "%"))
ggplot(data = corcreditpaid, aes(fill = not.fully.paid, x = credit.policy, y = total)) +
geom_col(position = "dodge") +
geom_text(aes(label = percent), position = position_dodge(width = 1)) +
labs(x = "Credit Policy", y = "Number of Borrower")More than 75% borrowers who pass the credit policy fully paid the loan and almost 75% borrowers who don’t pass the credit policy succeeded to fully paid the loan. We can assume that the credit policy don’t play a significant role to predict whether the borrower full paid the loan or not. To ensure our statement, we can see the correlation between “credit.policy” and “not.fully.paid”.
data$credit.policy <- as.numeric(data$credit.policy)
data$not.fully.paid <- as.numeric(data$not.fully.paid)
cor(data$credit.policy, data$not.fully.paid)## [1] -0.1581192
The correlation value is -0.1581192. Now we are sure that our statement is correct.
Insight 1: There are 87% people who passed the credit policy could pay their loan in full and there are 72% who failed to pass the credit policy could pay their loan in full. Hence, the credit policy status does not give strong effect in predicting the not fully paid status
We want to see the data distribution based on Interest Rate and Not Fully Paid Status.
ggplot(numtab, aes(x = int.rate, y = not.fully.paid, color = int.rate)) +
geom_point() + geom_count() +
labs(title = "Scatter Plot Interest Rate and Not Fully Paid Status",
subtitle = "0 for Fully Paid and 1 for Not Fully Paid",
x = "Interest Rate",
y = "Not Fully Paid Status",
size = "Number of Data",
color = "Interest Rate"
)Insight 2: The majority of interest rate is below 10% and the borrowers can pay their loan in full.
Compare Interest Rate between borrowers who paid their loan in full and did not paid in full.
intrate0 <- numtab[numtab$fully.paid == 0, c(2, 13)]
intrate1 <- numtab[numtab$fully.paid == 1, c(2, 13)]numtab$not.fully.paid <- as.factor(numtab$not.fully.paid)
ggplot(data = numtab, aes(x = not.fully.paid, y = int.rate)) +
geom_boxplot() +
labs(title = "Boxplot of Not Fully Paid Based On Interest Rate",
x = "Not Fully Paid",
y = "interest Rate")Intrerest rate of not fully paid (1) =
- MIn = 0.0705
- 1st Q = 0.1154
- 2nd Q (median) = 0.1316
- 3rd Q = 0.1482
- Max = 0.2164
Intrerest rate of not fully paid (0) =
- Min = 0.0600
- 1st Q = 0.0996
- 2nd Q (median) = 0.1218
- 3rd Q = 0.1387
- Max = 0.2121
Insight 3: The median of people who did not paid their loan in full is higher than the ones who full paid.
To make a good prediction model, we have to make sure that we use enough amount of parameter so that there will be no noise. Noise that we have may affect our prediction, making it less accurate. To cope the problem, we need to see if there any unrelated data in our dataset.
numtab$credit.policy <- as.numeric(numtab$credit.policy)
numtab$fico <- as.numeric(numtab$fico)
numtab$revol.bal <- as.numeric(numtab$revol.bal)
numtab$inq.last.6mths <- as.numeric(numtab$inq.last.6mths)
numtab$delinq.2yrs <- as.numeric(numtab$delinq.2yrs)
numtab$pub.rec <- as.numeric(numtab$pub.rec)
numtab$not.fully.paid <- as.numeric(numtab$not.fully.paid)
as.data.frame(cor(numtab))From the table above, there is no column that has high correlation value with “not.fully.paid” column. But there are some columns that have high correlation value such as “fico” with “int.rate” and “fico” with “revol.util”. Thus, there might be some hidden information between them that will help us to predict fully paid status. Further analysis needs to be done for building prediction model.