Support Vector Machines Project
For this project we will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile shows a high probability of paying you back. We will try to create a model that helps predict this.
Lending Club had a very interesting year in 2016, so let’s check out some of their data and keep the context in mind. This data is from before they even went public.
We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here or just use the CSV already provided. It’s recommended that you use the provided CSV, as it has been cleaned of NA values.
Here is what the columns represent:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
Data
# read.csv() is part of base R, so no extra package is needed to load the file
loans <- read.csv('loan_data.csv')
head(loans)
## credit.policy purpose int.rate installment log.annual.inc dti
## 1 1 debt_consolidation 0.1189 829.10 11.35041 19.48
## 2 1 credit_card 0.1071 228.22 11.08214 14.29
## 3 1 debt_consolidation 0.1357 366.86 10.37349 11.63
## 4 1 debt_consolidation 0.1008 162.34 11.35041 8.10
## 5 1 credit_card 0.1426 102.92 11.29973 14.97
## 6 1 credit_card 0.0788 125.13 11.90497 16.98
## fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 1 737 5639.958 28854 52.1 0 0
## 2 707 2760.000 33623 76.7 0 0
## 3 682 4710.000 3511 25.6 1 0
## 4 712 2699.958 33667 73.2 1 0
## 5 667 4066.000 4740 39.5 0 1
## 6 727 6120.042 50807 51.0 0 0
## pub.rec not.fully.paid
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
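The encodings described in the column list can be checked against the first row printed above. This is only a small sketch: the two values are copied by hand from that row, and the variable names are made up for the illustration.

```r
# Values copied from the first row of head(loans) above
int.rate <- 0.1189
log.annual.inc <- 11.35041

# int.rate is stored as a proportion, so multiply by 100 for a percentage
int.rate.percent <- int.rate * 100

# log.annual.inc is a natural log, so exp() recovers the reported income
annual.income <- exp(log.annual.inc)

print(int.rate.percent)      # 11.89
print(round(annual.income))  # 85000
```

So the first borrower pays about 11.89% interest and reported an annual income of roughly 85,000.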
Checking the summary of the loans
summary(loans)
## credit.policy purpose int.rate installment
## Min. :0.000 Length:9578 Min. :0.0600 Min. : 15.67
## 1st Qu.:1.000 Class :character 1st Qu.:0.1039 1st Qu.:163.77
## Median :1.000 Mode :character Median :0.1221 Median :268.95
## Mean :0.805 Mean :0.1226 Mean :319.09
## 3rd Qu.:1.000 3rd Qu.:0.1407 3rd Qu.:432.76
## Max. :1.000 Max. :0.2164 Max. :940.14
## log.annual.inc dti fico days.with.cr.line
## Min. : 7.548 Min. : 0.000 Min. :612.0 Min. : 179
## 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0 1st Qu.: 2820
## Median :10.929 Median :12.665 Median :707.0 Median : 4140
## Mean :10.932 Mean :12.607 Mean :710.8 Mean : 4561
## 3rd Qu.:11.291 3rd Qu.:17.950 3rd Qu.:737.0 3rd Qu.: 5730
## Max. :14.528 Max. :29.960 Max. :827.0 Max. :17640
## revol.bal revol.util inq.last.6mths delinq.2yrs
## Min. : 0 Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 3187 1st Qu.: 22.6 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 8596 Median : 46.3 Median : 1.000 Median : 0.0000
## Mean : 16914 Mean : 46.8 Mean : 1.577 Mean : 0.1637
## 3rd Qu.: 18250 3rd Qu.: 70.9 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :1207359 Max. :119.0 Max. :33.000 Max. :13.0000
## pub.rec not.fully.paid
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.06212 Mean :0.1601
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :5.00000 Max. :1.0000
Structure of the loans
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : chr "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
Converting some features into factors
loans$inq.last.6mths <- factor(loans$inq.last.6mths)
loans$delinq.2yrs <- factor(loans$delinq.2yrs)
loans$pub.rec <- factor(loans$pub.rec)
loans$not.fully.paid <- factor(loans$not.fully.paid)
loans$credit.policy <- factor(loans$credit.policy)
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ purpose : chr "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : Factor w/ 28 levels "0","1","2","3",..: 1 1 2 2 1 1 1 1 2 2 ...
## $ delinq.2yrs : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 2 1 1 1 1 1 ...
## $ pub.rec : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 2 1 1 1 ...
## $ not.fully.paid : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
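The one-column-at-a-time conversion above can also be written as a single step over a vector of column names. The sketch below uses a small made-up data frame (`toy` and `factor.cols` are hypothetical names, not part of the loan data) so that it runs on its own.

```r
# Hypothetical toy data standing in for a few columns of `loans`
toy <- data.frame(
  fico = c(737, 707, 682, 712),
  pub.rec = c(0, 0, 1, 0),
  not.fully.paid = c(0, 0, 1, 1)
)

# Columns to treat as categorical
factor.cols <- c("pub.rec", "not.fully.paid")

# Convert every listed column to a factor in one assignment
toy[factor.cols] <- lapply(toy[factor.cols], factor)

str(toy)
```

Keep in mind that turning a count such as inq.last.6mths into a factor creates one level per distinct value (28 in the output above), so whether to treat such counts as categorical or numeric is a modelling choice.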
EDA
library(ggplot2)
pl <- ggplot(loans,aes(fico))+geom_histogram(aes(fill=not.fully.paid),color='black',bins = 40,alpha=0.5)
pl+scale_fill_manual(values = c("#FFDB6D","#C4961A"))+theme_bw()
Barplot representation
pl <- ggplot(loans,aes(factor(purpose)))+geom_bar(aes(fill=not.fully.paid),position='dodge')
# theme_bw() is a complete theme, so apply it first; otherwise it resets the rotated axis labels
pl+theme_bw()+theme(axis.text.x =element_text(angle = 90,size = 10,vjust = 0.5))
# alpha is a fixed setting, not a mapped variable, so it belongs outside aes()
ggplot(loans,aes(int.rate,fico))+geom_point(aes(color=not.fully.paid),alpha=0.3)+theme_bw()
Building a model
library(caTools)
set.seed(101)
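# is.train is TRUE for rows assigned to the training set (avoids masking base::sample)
is.train <- sample.split(loans$not.fully.paid,SplitRatio = 0.70)
train <- subset(loans,is.train==TRUE)
test <- subset(loans,is.train==FALSE)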
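sample.split() is given the label column so that both subsets keep roughly the same share of each class. The base R sketch below illustrates that idea of a stratified split on made-up labels; it is not the caTools implementation, and `y`, `train.idx` and `is.train` are names chosen only for this example.

```r
set.seed(101)

# Hypothetical labels with an imbalance similar to not.fully.paid
y <- factor(c(rep(0, 840), rep(1, 160)))

# Stratified split: draw 70% of the row indices within each class
train.idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))
is.train <- seq_along(y) %in% train.idx

# Class proportions should match in the training and test subsets
print(prop.table(table(y[is.train])))
print(prop.table(table(y[!is.train])))
```

With a plain random split the minority class could end up under-represented in one of the subsets, which matters when only about 16% of loans are not fully paid.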
Function to call SVM
library(e1071)
model <- svm(formula=not.fully.paid~.,data = train)
summary(model)
##
## Call:
## svm(formula = not.fully.paid ~ ., data = train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2849
##
## ( 1776 1073 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Prediction
predicted.svm <- predict(model,test[1:13])
table(predicted.svm,test$not.fully.paid)
##
## predicted.svm 0 1
## 0 2413 460
## 1 0 0
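The table above is easier to interpret once it is turned into a few summary numbers. The sketch below re-enters the counts by hand (the object `cm` is introduced only for this illustration) and computes accuracy, recall for the "not fully paid" class, and the accuracy of always predicting the majority class.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
cm <- matrix(c(2413, 0, 460, 0), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Share of all test loans that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of loans that were actually not fully paid and were caught
recall.not.paid <- cm["1", "1"] / sum(cm[, "1"])

# Accuracy of a rule that always predicts the most common actual class
baseline <- max(colSums(cm)) / sum(cm)

print(round(accuracy, 4))  # 0.8399
print(recall.not.paid)     # 0
print(round(baseline, 4))  # 0.8399
```

The accuracy of about 84% looks reasonable, but it equals the majority-class baseline and the recall is 0: the model labels every loan as fully paid, so it has not learned to identify risky borrowers.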
Tuning the Model
# Both cost and gamma must be inside ranges so that tune() searches over every combination
tuned.model <- tune(svm,train.x = not.fully.paid~., data = train,kernel='radial',ranges = list(cost=c(1,10),gamma=c(0.1,1)))
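The object returned by tune() records the error estimate for each parameter combination, which is how values for cost and gamma can be chosen. The sketch below shows how to inspect it; as an assumption made only to keep the example small and quick, it uses the built-in iris data rather than the loan data, and `tuned` and `best.model` are illustrative names.

```r
library(e1071)
set.seed(101)

# Built-in iris data as a small stand-in for the loan data
tuned <- tune(svm, Species ~ ., data = iris, kernel = "radial",
              ranges = list(cost = c(1, 10), gamma = c(0.1, 1)))

# One row per cost/gamma combination with its cross-validated error
print(tuned$performances)

# The combination with the lowest error, and the model refit with it
print(tuned$best.parameters)
best.model <- tuned$best.model
```

Applied to the loan data, the same fields on tuned.model would show which cost and gamma performed best on the grid; calling summary(tuned.model) prints a similar overview.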
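Final model
# cost and gamma are taken from the tuning grid
final.model <- svm(not.fully.paid~.,data = train,cost=10,gamma=0.1)
# Predict with final.model; the original call passed `model` here, so the table recorded below repeats the untuned result
predicted.values <- predict(final.model,test[1:13])
table(predicted.values,test$not.fully.paid)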
##
## predicted.values 0 1
## 0 2413 460
## 1 0 0