For this project, we will explore data from LendingClub.com, a platform that connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to lend to borrowers whose profile suggests a high probability of paying you back. We will try to build a model that helps predict this.
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(ggthemes)
library(e1071)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caTools)
Let’s open the dataset.
setwd("~/R/R-Course-HTML-Notes/R-for-Data-Science-and-Machine-Learning/Training Exercises/Machine Learning Projects/CSV files for ML Projects")
df <- read.csv("loan_data.csv")
# Here is what the columns represent:
#credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
#purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
#int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
#installment: The monthly installments owed by the borrower if the loan is funded.
#log.annual.inc: The natural log of the self-reported annual income of the borrower.
#dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
#fico: The FICO credit score of the borrower. FICO scores range between 300 and 850. In general, scores above 650 indicate a very good credit history. In contrast, individuals with scores below 620 often find it difficult to obtain financing at favorable rates.
#days.with.cr.line: The number of days the borrower has had a credit line.
#revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
#revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
#inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
#delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
#pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
#not.fully.paid: 1 if the borrower did not pay back the loan in full, and 0 otherwise.
Let’s check the summary and structure of the dataset.
summary(df)
## credit.policy purpose int.rate
## Min. :0.000 all_other :2331 Min. :0.0600
## 1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
## Median :1.000 debt_consolidation:3957 Median :0.1221
## Mean :0.805 educational : 343 Mean :0.1226
## 3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
## Max. :1.000 major_purchase : 437 Max. :0.2164
## small_business : 619
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
## 1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
## Median :268.95 Median :10.929 Median :12.665 Median :707.0
## Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
## 3rd Qu.:432.76 3rd Qu.:11.291 3rd Qu.:17.950 3rd Qu.:737.0
## Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
##
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.6 1st Qu.: 0.000
## Median : 4140 Median : 8596 Median : 46.3 Median : 1.000
## Mean : 4561 Mean : 16914 Mean : 46.8 Mean : 1.577
## 3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 70.9 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :119.0 Max. :33.000
##
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.0000 Median :0.00000 Median :0.0000
## Mean : 0.1637 Mean :0.06212 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :13.0000 Max. :5.00000 Max. :1.0000
##
str(df)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
The last four columns and the first one can be converted to categorical data.
df$credit.policy <- factor(df$credit.policy)
df$inq.last.6mths <- factor(df$inq.last.6mths)
df$delinq.2yrs <- factor(df$delinq.2yrs)
df$pub.rec <- factor(df$pub.rec)
df$not.fully.paid <- factor(df$not.fully.paid)
Let’s check if there are NA’s.
any(is.na(df))
## [1] FALSE
Nice. Let’s move on to some exploratory analysis.
We will use the ggplot2 package to visualize the data.
df %>% ggplot(aes(fico)) + geom_histogram(aes(fill=not.fully.paid), col='black', bins=40, alpha=0.4) + scale_fill_manual(values=c('green','red')) + theme_bw()
As expected, borrowers with a high FICO score tend to pay back their loans. Between 660 and 750 there are more people who haven’t fully paid yet.
df %>% ggplot(aes(purpose)) + geom_bar(aes(fill=purpose), alpha=0.8) + theme_bw() + theme(axis.text.x = element_blank())
Almost 4,000 borrowers (3,957) asked for money for debt consolidation, far more than for the second most common purpose (all_other, with 2,331).
df %>% ggplot(aes(purpose)) + geom_bar(aes(fill=not.fully.paid), col='black', position="dodge", alpha=0.5) + theme_bw() + theme(axis.text.x = element_text(angle=45, hjust=1)) + scale_fill_manual(values=c("green","red"))
Let’s check if there is a trend between FICO score and the int.rate.
df %>% ggplot(aes(int.rate, fico)) + geom_point(aes(col=not.fully.paid), alpha=0.6) + theme_bw()
That makes a lot of sense: borrowers with a low FICO score are judged less likely to pay back their loans, so they are assigned higher interest rates.
df %>% ggplot(aes(credit.policy, fico)) + geom_boxplot(aes(fill=credit.policy), alpha=0.7) + theme_bw()
Borrowers who meet the credit policy tend to have higher FICO scores.
df %>% ggplot(aes(delinq.2yrs, not.fully.paid)) + geom_jitter(alpha=0.6) + theme_bw()
Let’s go ahead and build the Support Vector Machine (SVM) model.
First of all, let’s split the data into training and test datasets. We will use the sample.split function from the caTools package.
sample <- sample.split(df$not.fully.paid, SplitRatio = 0.7)
test <- subset(df, sample == FALSE)
train <- subset(df, sample == TRUE)
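The split above is random and no seed is set, so the exact rows in each subset change from run to run. The following is a minimal sketch, on hypothetical labels rather than the loan data, of how a seed makes the split repeatable and how to check that sample.split keeps the class proportions similar in both subsets. The seed value and the vector `y` are assumptions made only for illustration.

```r
library(caTools)

set.seed(101)  # arbitrary seed, only to make the split repeatable

# Hypothetical labels with roughly the same imbalance as not.fully.paid
y <- factor(c(rep(0, 840), rep(1, 160)))

# TRUE marks the rows assigned to the training set
in_train <- sample.split(y, SplitRatio = 0.7)

# Class proportions in the training and test subsets
prop.table(table(y[in_train]))
prop.table(table(y[!in_train]))
```

Because the split is done within each class, both subsets should show about the same share of 0 and 1 labels as the full vector.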
Let’s build the model.
model <- svm(not.fully.paid~., data=train)
summary(model)
##
## Call:
## svm(formula = not.fully.paid ~ ., data = train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2865
##
## ( 1792 1073 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Let’s predict the values from the test dataset.
predicted.values <- predict(model, test[1:13])
table(predicted.values, test$not.fully.paid)
##
## predicted.values 0 1
## 0 2413 460
## 1 0 0
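To read this table more easily, here is a small sketch that rebuilds the confusion matrix from the counts printed above and derives two summary numbers from it. Treating class 1 (not fully paid) as the class of interest is an assumption made for this illustration, and the names `cm`, `accuracy` and `recall` are only illustrative.

```r
# Confusion matrix copied from the output above
# (rows: predicted class, columns: actual class)
cm <- matrix(c(2413, 0, 460, 0), nrow = 2,
             dimnames = list(predicted = c("0", "1"),
                             actual    = c("0", "1")))

# Share of test loans classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of loans that were not fully paid and were flagged as such
recall <- cm["1", "1"] / sum(cm[, "1"])

accuracy
recall
```

The accuracy is about 0.84, but that is exactly what always predicting class 0 would give, and the recall for class 1 is 0.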
We got some bad results: the model predicts class 0 (fully paid) for every loan in the test set, so it never flags a borrower who did not pay in full. The default gamma and cost parameters are probably a poor fit for this data. We have to tune the model, looking for better parameters.
#svm.tuned <- tune(svm, train.x=not.fully.paid~., data=train, kernel='radial', ranges=list(cost=c(10,100), gamma=c(0.1,1)))
We could use this call to find the most appropriate values for gamma and cost, but it takes a long time to run. Instead, we are going to use gamma equal to 0.1 and cost equal to 200.
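One way to see what the grid search returns without waiting for the full run is to try it on a small data set first. The sketch below uses a hypothetical toy data frame (`toy`, with made-up predictors `x1` and `x2`), not the loan data, and 5-fold cross-validation instead of the default to keep it short. The cost and gamma grids are the ones from the commented-out call.

```r
library(e1071)

set.seed(42)  # arbitrary seed for a repeatable toy example

# Hypothetical toy data: two numeric predictors and a binary label
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, 1, 0))

# Grid search over cost and gamma with 5-fold cross-validation
svm.tuned <- tune(svm, y ~ ., data = toy, kernel = "radial",
                  ranges = list(cost = c(10, 100), gamma = c(0.1, 1)),
                  tunecontrol = tune.control(cross = 5))

svm.tuned$best.parameters  # cost/gamma pair with the lowest CV error
svm.tuned$performances     # CV error for every combination tried
```

On the real training set the same call could be tried on a random subset of rows first, which would give a rough idea of promising values at a fraction of the cost.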
model <- svm(not.fully.paid~., data=train, gamma=0.1, cost=200)
predicted.values <- predict(model, test[1:13])
table(predicted.values, test$not.fully.paid)
##
## predicted.values 0 1
## 0 2143 372
## 1 270 88
As always, whether the model is good or not will depend on the objectives of the study. Depending on the application, you may be willing to accept more type I errors (false positives) or more type II errors (false negatives).
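As a rough illustration of that trade-off, the two error rates can be computed from the second confusion matrix. This sketch assumes that class 1 (not fully paid) is treated as the positive class, so a type I error is a repaid loan flagged as risky and a type II error is an unpaid loan that was missed. The names `cm2`, `type1.rate` and `type2.rate` are only illustrative.

```r
# Confusion matrix of the second model, copied from the output above
# (rows: predicted class, columns: actual class)
cm2 <- matrix(c(2143, 270, 372, 88), nrow = 2,
              dimnames = list(predicted = c("0", "1"),
                              actual    = c("0", "1")))

# Type I error rate: fully paid loans wrongly flagged as not fully paid
type1.rate <- cm2["1", "0"] / sum(cm2[, "0"])

# Type II error rate: loans not fully paid that the model missed
type2.rate <- cm2["0", "1"] / sum(cm2[, "1"])

type1.rate
type2.rate
```

An investor who mainly wants to avoid defaults would focus on lowering the type II rate, even if that means rejecting more borrowers who would have paid.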