Support Vector Machines Project

For this project we will explore publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile shows a high probability of paying you back. We will try to create a model that helps predict this.

Lending Club had a very interesting year in 2016, so let’s check out some of their data and keep the context in mind. This data is from before they even went public.

We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here or just use the csv already provided. It’s recommended you use the csv provided as it has been cleaned of NA values.

Here are what the columns represent:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Data

# read.csv() is base R, so the readr package is not needed here
loans <- read.csv('loan_data.csv')
head(loans)
##   credit.policy            purpose int.rate installment log.annual.inc   dti
## 1             1 debt_consolidation   0.1189      829.10       11.35041 19.48
## 2             1        credit_card   0.1071      228.22       11.08214 14.29
## 3             1 debt_consolidation   0.1357      366.86       10.37349 11.63
## 4             1 debt_consolidation   0.1008      162.34       11.35041  8.10
## 5             1        credit_card   0.1426      102.92       11.29973 14.97
## 6             1        credit_card   0.0788      125.13       11.90497 16.98
##   fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 1  737          5639.958     28854       52.1              0           0
## 2  707          2760.000     33623       76.7              0           0
## 3  682          4710.000      3511       25.6              1           0
## 4  712          2699.958     33667       73.2              1           0
## 5  667          4066.000      4740       39.5              0           1
## 6  727          6120.042     50807       51.0              0           0
##   pub.rec not.fully.paid
## 1       0              0
## 2       0              0
## 3       0              0
## 4       0              0
## 5       0              0
## 6       0              0
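As a quick sanity check on how the columns are encoded, the first row of the output above can be converted back to everyday units. This is a minimal sketch using values copied from that row; the variable names are illustrative only and are not part of the original analysis.

```r
# Values copied from row 1 of head(loans)
log_annual_inc <- 11.35041
int_rate <- 0.1189

# log.annual.inc is a natural log, so exp() recovers the income
annual_income <- exp(log_annual_inc)   # roughly 85,000

# int.rate is stored as a proportion; multiply by 100 for a percentage
int_rate_pct <- int_rate * 100

print(round(annual_income))
print(int_rate_pct)
```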

Checking the summary of the loans

summary(loans)
##  credit.policy     purpose             int.rate       installment    
##  Min.   :0.000   Length:9578        Min.   :0.0600   Min.   : 15.67  
##  1st Qu.:1.000   Class :character   1st Qu.:0.1039   1st Qu.:163.77  
##  Median :1.000   Mode  :character   Median :0.1221   Median :268.95  
##  Mean   :0.805                      Mean   :0.1226   Mean   :319.09  
##  3rd Qu.:1.000                      3rd Qu.:0.1407   3rd Qu.:432.76  
##  Max.   :1.000                      Max.   :0.2164   Max.   :940.14  
##  log.annual.inc        dti              fico       days.with.cr.line
##  Min.   : 7.548   Min.   : 0.000   Min.   :612.0   Min.   :  179    
##  1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0   1st Qu.: 2820    
##  Median :10.929   Median :12.665   Median :707.0   Median : 4140    
##  Mean   :10.932   Mean   :12.607   Mean   :710.8   Mean   : 4561    
##  3rd Qu.:11.291   3rd Qu.:17.950   3rd Qu.:737.0   3rd Qu.: 5730    
##  Max.   :14.528   Max.   :29.960   Max.   :827.0   Max.   :17640    
##    revol.bal         revol.util    inq.last.6mths    delinq.2yrs     
##  Min.   :      0   Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.:   3187   1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median :   8596   Median : 46.3   Median : 1.000   Median : 0.0000  
##  Mean   :  16914   Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
##  3rd Qu.:  18250   3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
##  Max.   :1207359   Max.   :119.0   Max.   :33.000   Max.   :13.0000  
##     pub.rec        not.fully.paid  
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000  
##  Mean   :0.06212   Mean   :0.1601  
##  3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :5.00000   Max.   :1.0000
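The mean of not.fully.paid in the summary deserves a second look, because it describes the class balance. A rough sketch using only the figures reported above (9578 rows, mean 0.1601) shows the accuracy a model would reach by always predicting "fully paid"; the variable names are illustrative.

```r
# Figures taken from the summary(loans) output
n_loans <- 9578
share_not_paid <- 0.1601

# Approximate number of loans that were not fully paid
approx_not_paid <- round(n_loans * share_not_paid)

# Accuracy of a trivial model that always predicts class 0 (fully paid)
baseline_accuracy <- 1 - share_not_paid

print(approx_not_paid)
print(baseline_accuracy)
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.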

Structure of the loans data

str(loans)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : chr  "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...

Converting some features into factors

loans$inq.last.6mths <- factor(loans$inq.last.6mths)
loans$delinq.2yrs <- factor(loans$delinq.2yrs)
loans$pub.rec <- factor(loans$pub.rec)
loans$not.fully.paid <- factor(loans$not.fully.paid)
loans$credit.policy <- factor(loans$credit.policy)
str(loans)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ purpose          : chr  "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : Factor w/ 28 levels "0","1","2","3",..: 1 1 2 2 1 1 1 1 2 2 ...
##  $ delinq.2yrs      : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ pub.rec          : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 2 1 1 1 ...
##  $ not.fully.paid   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
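The same kind of conversion can be written once for a whole set of columns. The sketch below uses a small made-up data frame (toy and factor_cols are hypothetical names, and the values are not from the loans data) to show the pattern with lapply().

```r
# Hypothetical toy data, only to illustrate the pattern
toy <- data.frame(
  credit.policy = c(1, 0, 1),
  pub.rec       = c(0, 1, 0),
  fico          = c(700, 650, 720)
)

# Columns that should be treated as categorical
factor_cols <- c("credit.policy", "pub.rec")

# Convert all of them in one step; other columns are left untouched
toy[factor_cols] <- lapply(toy[factor_cols], factor)

str(toy)
```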

EDA

library(ggplot2)
pl <- ggplot(loans,aes(fico))+geom_histogram(aes(fill=not.fully.paid),color='black',bins = 40,alpha=0.5)
pl+scale_fill_manual(values = c("#FFDB6D","#C4961A"))+theme_bw()

Barplot representation

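pl <- ggplot(loans,aes(factor(purpose)))+geom_bar(aes(fill=not.fully.paid),position='dodge')
# theme_bw() is a complete theme, so apply it first; added afterwards it would reset the axis text settings
pl+theme_bw()+theme(axis.text.x =element_text(angle = 90,size = 10,vjust = 0.5))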

# alpha is a fixed value, so it belongs outside aes()
ggplot(loans,aes(int.rate,fico))+geom_point(aes(color=not.fully.paid),alpha=0.3)+theme_bw()

Building a model

library(caTools)
set.seed(101)
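# Split 70/30 on the target; is.train is TRUE for training rows
# (named is.train to avoid masking base R's sample() function)
is.train <- sample.split(loans$not.fully.paid,SplitRatio = 0.70)
train <- subset(loans,is.train)
test <- subset(loans,!is.train)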
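Because sample.split() is given the target column, it splits each class separately, so the train and test sets should keep roughly the same class ratio. A minimal sketch on made-up labels (y and keep are hypothetical names; the 84/16 mix only imitates the imbalance seen in the summary) shows how this can be checked.

```r
library(caTools)

set.seed(101)

# Hypothetical labels: 84 of class 0 and 16 of class 1
y <- factor(rep(c(0, 1), times = c(84, 16)))

# Logical vector: TRUE for rows assigned to the training set
keep <- sample.split(y, SplitRatio = 0.70)

# Class proportions in each part should be close to the original 84/16 mix
print(prop.table(table(y[keep])))
print(prop.table(table(y[!keep])))
```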

Training the SVM model

library(e1071)
model <- svm(formula=not.fully.paid~.,data = train)
summary(model)
## 
## Call:
## svm(formula = not.fully.paid ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  2849
## 
##  ( 1776 1073 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Prediction

predicted.svm <- predict(model,test[1:13])
table(predicted.svm,test$not.fully.paid)
##              
## predicted.svm    0    1
##             0 2413  460
##             1    0    0
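The confusion matrix above can be turned into summary metrics. The sketch below rebuilds the table from the printed counts (conf, accuracy and recall_not_paid are illustrative names) and shows that the model never predicts class 1, so its accuracy is just the share of fully paid loans in the test set.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual)
conf <- matrix(c(2413, 0, 460, 0), nrow = 2,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Overall accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(conf)) / sum(conf)

# Recall for class 1: share of not-fully-paid loans that were detected
recall_not_paid <- conf["1", "1"] / sum(conf[, "1"])

print(round(accuracy, 4))
print(recall_not_paid)
```

An accuracy near 84% with zero recall on class 1 means the model adds nothing over always predicting "fully paid".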

Tuning the Model

# Both cost and gamma must be inside ranges to be part of the grid search
tuned.model <- tune(svm,train.x = not.fully.paid~., data = train,kernel='radial',ranges = list(cost=c(1,10),gamma=c(0.1,1)))
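A tune() result can be inspected before settling on parameters. The sketch below runs the same kind of grid search on a two-class subset of the built-in iris data rather than on the loans data (toy and toy.tune are hypothetical names), simply because it runs quickly; the fields it reads are the ones e1071 stores on the returned object.

```r
library(e1071)

set.seed(101)

# Two-class toy problem from the built-in iris data (not the loans data)
toy <- droplevels(subset(iris, Species != "setosa"))

# Grid search over cost and gamma with the default cross-validation
toy.tune <- tune(svm, Species ~ ., data = toy, kernel = "radial",
                 ranges = list(cost = c(1, 10), gamma = c(0.1, 1)))

# Best parameter combination and its cross-validated error
print(toy.tune$best.parameters)
print(toy.tune$best.performance)

# The model refit with the best parameters, ready for predict()
best <- toy.tune$best.model
```

Calling summary() on the tuning object prints the error for every combination in the grid, which helps show whether the best value sits on the edge of the search range.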

Final model

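final.model <- svm(not.fully.paid~.,data = train,cost=10,gamma=0.1)
# Predict with final.model (the tuned fit), not the untuned model
predicted.values <- predict(final.model,test[1:13])
table(predicted.values,test$not.fully.paid)
##                 
## predicted.values    0    1
##                0 2413  460
##                1    0    0

The table shown here was produced with the untuned model passed to predict(), which is why it is identical to the earlier confusion matrix; it does not yet reflect final.model.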