1. Introduction

For this project, we will explore data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to lend to people whose profile suggests a high probability of paying you back. We will try to build a model that helps predict this.

library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(ggthemes)
library(e1071)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caTools)

Let’s open the dataset.

setwd("~/R/R-Course-HTML-Notes/R-for-Data-Science-and-Machine-Learning/Training Exercises/Machine Learning Projects/CSV files for ML Projects")
df <- read.csv("loan_data.csv")
# Here is what the columns represent:
#credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
#purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
#int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
#installment: The monthly installments owed by the borrower if the loan is funded.
#log.annual.inc: The natural log of the self-reported annual income of the borrower.
#dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
#fico: The FICO credit score of the borrower. FICO scores range between 300 and 850. In general, scores above 650 indicate a very good credit history. In contrast, individuals with scores below 620 often find it difficult to obtain financing at favorable rates.
#days.with.cr.line: The number of days the borrower has had a credit line.
#revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
#revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
#inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
#delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
#pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
#not.fully.paid: 1 if the borrower did not pay back the loan in full, and 0 otherwise. This is the outcome we want to predict.
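
Two of these columns are stored on transformed scales, so converting them back helps when reading the summaries that follow. The sketch below uses illustrative values (they are not rows from loan_data.csv) to turn a log income back into dollars and a proportional rate into a percentage.

```r
# Illustrative values only; these are not taken from loan_data.csv
log_income <- 10.93   # a value on the log.annual.inc scale
rate       <- 0.11    # a value on the int.rate scale

annual_income <- exp(log_income)   # undo the natural log
rate_percent  <- rate * 100        # proportion -> percent

annual_income   # roughly 56,000
rate_percent
```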

Let’s check the summary and structure of the dataset.

summary(df)
##  credit.policy                 purpose        int.rate     
##  Min.   :0.000   all_other         :2331   Min.   :0.0600  
##  1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039  
##  Median :1.000   debt_consolidation:3957   Median :0.1221  
##  Mean   :0.805   educational       : 343   Mean   :0.1226  
##  3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407  
##  Max.   :1.000   major_purchase    : 437   Max.   :0.2164  
##                  small_business    : 619                   
##   installment     log.annual.inc        dti              fico      
##  Min.   : 15.67   Min.   : 7.548   Min.   : 0.000   Min.   :612.0  
##  1st Qu.:163.77   1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0  
##  Median :268.95   Median :10.929   Median :12.665   Median :707.0  
##  Mean   :319.09   Mean   :10.932   Mean   :12.607   Mean   :710.8  
##  3rd Qu.:432.76   3rd Qu.:11.291   3rd Qu.:17.950   3rd Qu.:737.0  
##  Max.   :940.14   Max.   :14.528   Max.   :29.960   Max.   :827.0  
##                                                                    
##  days.with.cr.line   revol.bal         revol.util    inq.last.6mths  
##  Min.   :  179     Min.   :      0   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 2820     1st Qu.:   3187   1st Qu.: 22.6   1st Qu.: 0.000  
##  Median : 4140     Median :   8596   Median : 46.3   Median : 1.000  
##  Mean   : 4561     Mean   :  16914   Mean   : 46.8   Mean   : 1.577  
##  3rd Qu.: 5730     3rd Qu.:  18250   3rd Qu.: 70.9   3rd Qu.: 2.000  
##  Max.   :17640     Max.   :1207359   Max.   :119.0   Max.   :33.000  
##                                                                      
##   delinq.2yrs         pub.rec        not.fully.paid  
##  Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.00000   Median :0.0000  
##  Mean   : 0.1637   Mean   :0.06212   Mean   :0.1601  
##  3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :13.0000   Max.   :5.00000   Max.   :1.0000  
## 
str(df)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...

The first column and the last four can be converted to categorical data (factors).

df$credit.policy <- factor(df$credit.policy)
df$inq.last.6mths <- factor(df$inq.last.6mths)
df$delinq.2yrs <- factor(df$delinq.2yrs)
df$pub.rec <- factor(df$pub.rec)
df$not.fully.paid <- factor(df$not.fully.paid)
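
The same conversion can be written once for a vector of column names instead of one line per column. This is a minimal sketch on a made-up data frame (`toy` and `cat_cols` are hypothetical names, and the values are invented); it is an alternative spelling, not a change to what the code above does.

```r
# Toy data frame standing in for df; column names follow the text,
# the values are made up for illustration
toy <- data.frame(
  credit.policy  = c(1, 0, 1),
  fico           = c(737, 707, 682),
  pub.rec        = c(0, 0, 1),
  not.fully.paid = c(0, 1, 0)
)

# Columns to treat as categorical
cat_cols <- c("credit.policy", "pub.rec", "not.fully.paid")

# lapply() returns a list of factors, assigned back column by column
toy[cat_cols] <- lapply(toy[cat_cols], factor)

str(toy)
```

Columns not listed in `cat_cols`, such as `fico` here, keep their numeric type.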

Let’s check whether there are any missing values (NAs).

any(is.na(df))
## [1] FALSE
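
When the check returns TRUE, it helps to know which columns are affected. A small sketch, assuming a made-up data frame with a few missing values (`toy` is hypothetical; the loan data itself has none):

```r
# Made-up data with two missing values, for illustration only
toy <- data.frame(
  fico       = c(737, NA, 682),
  revol.util = c(52.1, 76.7, NA),
  dti        = c(19.5, 14.3, 11.6)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the columns that actually contain NAs
na_counts[na_counts > 0]
```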

Nice. Let’s move on to some exploratory analysis.

2. Exploratory Data Analysis (EDA)

We will use the ggplot2 package to visualize the data.

df %>% ggplot(aes(fico)) + geom_histogram(aes(fill=not.fully.paid), col='black', bins=40, alpha=0.4) + scale_fill_manual(values=c('green','red')) + theme_bw()

As noted earlier, borrowers with a high FICO score tend to pay back their loans. Loans that have not been fully paid are concentrated between FICO scores of about 660 and 750.

df %>% ggplot(aes(purpose)) + geom_bar(aes(fill=purpose), alpha=0.8) + theme_bw() + theme(axis.text.x = element_blank())

Almost 4,000 borrowers (3,957) asked for money for debt consolidation, far more than the second most common purpose, all_other (2,331).

df %>% ggplot(aes(purpose)) + geom_bar(aes(fill=not.fully.paid), col='black', position="dodge", alpha=0.5) + theme_bw() + theme(axis.text.x = element_text(angle=45, hjust=1)) + scale_fill_manual(values=c("green","red"))
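
The dodged bars compare raw counts, which is hard to read when the purposes differ so much in size. A table of row-wise proportions puts every purpose on the same scale. This is a sketch on made-up data (`toy` is hypothetical and its values are invented, so the shares below say nothing about the real loans):

```r
# Made-up loans, for illustration only
toy <- data.frame(
  purpose = c("credit_card", "credit_card", "small_business",
              "small_business", "small_business", "educational"),
  not.fully.paid = factor(c(0, 0, 1, 0, 1, 0))
)

# Cross-tabulate purpose (rows) against outcome (columns)
counts <- table(toy$purpose, toy$not.fully.paid)

# margin = 1 divides each row by its own total, so every row sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Run against the real data frame, the column for class 1 would give the share of loans not fully paid within each purpose.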

Let’s check if there is a trend between FICO score and the int.rate.

df %>% ggplot(aes(int.rate, fico)) + geom_point(aes(col=not.fully.paid), alpha=0.6) + theme_bw()

That makes sense: borrowers with a low FICO score are judged less likely to pay back their loans, so they are charged higher interest rates.

df %>% ggplot(aes(credit.policy, fico)) + geom_boxplot(aes(fill=credit.policy), alpha=0.7) + theme_bw()

Borrowers who meet the credit underwriting criteria (credit.policy = 1) tend to have higher FICO scores.

df %>% ggplot(aes(delinq.2yrs, not.fully.paid)) + geom_jitter(alpha=0.6) + theme_bw()

3. Building the model

Let’s go ahead and build the Support Vector Machine (SVM) model.

First of all, let’s split the data into training and test sets. We will use the sample.split function from the caTools package.

# TRUE marks rows for the training set (70%), FALSE rows for the test set
sample <- sample.split(df$not.fully.paid, SplitRatio = 0.7)
test <- subset(df, sample == FALSE)
train <- subset(df, sample == TRUE)
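
sample.split keeps the ratio of the class labels roughly the same in both parts, which matters here because only about 16% of the loans are in class 1. The base-R sketch below illustrates that idea on a made-up outcome vector; it is not caTools' implementation, and `y`, `train_idx` and `in_train` are hypothetical names. Because the split is random, a call to set.seed() beforehand makes it repeatable.

```r
set.seed(101)  # arbitrary seed, used only to make the sketch repeatable

# Made-up binary outcome: 84 zeros and 16 ones
y <- factor(rep(c(0, 1), times = c(84, 16)))

# Draw 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

# Logical vector: TRUE for training rows, FALSE for test rows
in_train <- seq_along(y) %in% train_idx

prop.table(table(y[in_train]))    # class shares in the training part
prop.table(table(y[!in_train]))   # class shares in the test part
```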

Let’s build the model.

model <- svm(not.fully.paid~., data=train)
summary(model)
## 
## Call:
## svm(formula = not.fully.paid ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  2865
## 
##  ( 1792 1073 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Let’s predict the values from the test dataset.

predicted.values <- predict(model, test[1:13])
table(predicted.values, test$not.fully.paid)
##                 
## predicted.values    0    1
##                0 2413  460
##                1    0    0

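These are poor results: the model predicts class 0 for every test observation, so it never flags a loan that was not fully paid. This suggests that the default gamma and cost values are not suited to this data. We have to tune the model to look for better parameters.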

#svm.tuned <- tune(svm, train.x=not.fully.paid~., data=train, kernel='radial', ranges=list(cost=c(10,100), gamma=c(0.1,1)))

The tune function would search for the most appropriate values of gamma and cost to improve our model, but it takes a long time to run. Instead, we are going to use gamma equal to 0.1 and cost equal to 200.

model <- svm(not.fully.paid~., data=train, gamma=0.1, cost=200)
predicted.values <- predict(model, test[1:13])
table(predicted.values, test$not.fully.paid)
##                 
## predicted.values    0    1
##                0 2143  372
##                1  270   88

As always, whether the model is good enough depends on the objectives of the study. You may be more willing to accept type I errors (false positives) than type II errors (false negatives), or the other way around.
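
The two confusion matrices can be put side by side by computing error rates from them. This sketch re-enters the counts printed above by hand and treats class 1 (not fully paid) as the positive class; `cm_rates` is a hypothetical helper written for this illustration.

```r
# Counts copied from the two tables above:
# rows = predicted class, columns = actual class
labels <- list(predicted = c("0", "1"), actual = c("0", "1"))
default_cm <- matrix(c(2413, 0, 460, 0),    nrow = 2, dimnames = labels)
tuned_cm   <- matrix(c(2143, 270, 372, 88), nrow = 2, dimnames = labels)

# Error rates with class 1 as the positive class
cm_rates <- function(cm) {
  c(
    accuracy            = sum(diag(cm)) / sum(cm),
    false_positive_rate = cm["1", "0"] / sum(cm[, "0"]),  # type I
    false_negative_rate = cm["0", "1"] / sum(cm[, "1"])   # type II
  )
}

round(rbind(default = cm_rates(default_cm),
            tuned   = cm_rates(tuned_cm)), 3)
```

The default model has the higher accuracy (about 84% against about 78%) only because it always predicts the majority class. The tuned model gives up some accuracy in order to catch some of the loans that were not fully paid.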