Data Context

One industry in which machine learning is having a significant impact is financial lending. In this data analysis project, I will use data from LendingClub, a well-known peer-to-peer lending platform based in San Francisco, California, to build statistical models that use debtor attributes to predict the grades given to loan applications by the Lending Club.

Data Source

  1. The training data is the LendingClub historical data on loans and chargeoffs from 2013 through 2014. (175 MB)
  2. The test data is LendingClub applications that were received in 2015. (309 MB)
  3. the data dictionary which explains the meanings of the different fields in the files. Link: https://www.lendingclub.com/info/download-data.action

Code Proper

Part 1. Data Loading and Clean Up:

Loading in the data needed for this analysis; clearing the first row and the empty rows in the bottom

library(readr)
getwd()
## [1] "C:/Users/fzy20/Downloads"
set.seed(1)
#data loading
loan_data = read_csv(skip = 1, "C:/Users/fzy20/Downloads/LoanStats3c.csv")
loan_data = loan_data[!is.na(loan_data$grade),]
#row structuring
loan_data = loan_data[0:-2,]

Part 2. Descriptive statistics:

Instead of predicting the absolute loan amount I decided to make this a classification problem to predict the loan grades in a binary variable. Using the training data, I created a binary variable called highgrade that takes the value 1 when a loan has received an “A” or “B” grade and 0 otherwise

Furthermore I want to answer a couple of questions:

  1. What is the proportion of loans in the training data that receive either an “A” or “B” grade?

  2. determining whether the differences in the proportion of loans that are highgrade is significantly different depending upon:

  • Whether the debtor is above or below the median income level
  • Whether the loan request is above or below the median loan amount
  • Whether the debtor rents their home or not

Creating the highgrade variable:

#variable generation
loan_data$highgrade <- as.integer(loan_data$grade %in% c("A","B"))

Checking the proportion of high grade loans; there about 42% of invididual who received a high grade loan

# Load
library(wesanderson)
par(lwd = 4)
freq = as.data.frame(table(loan_data$highgrade))$Freq
lbls = c("lower grade", "high grade")
pct <- round(freq/sum(freq)*100)
pct_lbls <- paste(pct,"%",sep="") # ad % to labels 
pie(freq, labels = pct_lbls, main = "Pie Chart highgrade", col = wes_palette(n=length(freq), name="GrandBudapest1"), border = "white")
legend("topright", lbls, cex = 0.8,
   fill = wes_palette(n=length(freq), name="GrandBudapest1"))

The exact proportion is shown below:

table(loan_data$highgrade)[2][[1]]/nrow(loan_data)
## [1] 0.4160898

Next, we will use the t.test to examine three things: 1. level fo loans depending on the income level of the debtor - There is a significant difference between the proportion of individual receiving high grade loans for individuals above median income level and below income level; in fact the difference sum up to about 9 percent on average in favor of those with a higher income

t.test(loan_data[loan_data$annual_inc > median(loan_data$annual_inc),]$highgrade,
       loan_data[loan_data$annual_inc <= median(loan_data$annual_inc),]$highgrade)
## 
##  Welch Two Sample t-test
## 
## data:  loan_data[loan_data$annual_inc > median(loan_data$annual_inc),  and loan_data[loan_data$annual_inc <= median(loan_data$annual_inc),     ]$highgrade and     ]$highgrade
## t = 45.046, df = 231178, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.08739910 0.09535065
## sample estimates:
## mean of x mean of y 
## 0.4642396 0.3728647
  1. level fo loans depending on the loan amount of the debtor
  • There is a significant difference between the proportion of individual receiving high grade loans for individuals loaning above the median loan amount and below; in fact the difference sum up to about 7 percent on average in favor of those asking for a smaller loan
t.test(loan_data[loan_data$loan_amnt > median(loan_data$loan_amnt),]$highgrade,
       loan_data[loan_data$loan_amnt <= median(loan_data$loan_amnt),]$highgrade)
## 
##  Welch Two Sample t-test
## 
## data:  loan_data[loan_data$loan_amnt > median(loan_data$loan_amnt),  and loan_data[loan_data$loan_amnt <= median(loan_data$loan_amnt),     ]$highgrade and     ]$highgrade
## t = -34.087, df = 235539, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07302278 -0.06508180
## sample estimates:
## mean of x mean of y 
## 0.3814803 0.4505326
  1. level fo loans depending on the home status of the debtor
  • There is a significant difference between the proportion of individual receiving high grade loans for individuals who rent their homes or not; in fact the difference sum up to about 3 percent on average in favor of those who do not rent their homes
t.test(loan_data[loan_data$home_ownership %in% c("RENT"),]$highgrade,
       loan_data[!(loan_data$home_ownership %in% c("RENT")),]$highgrade)
## 
##  Welch Two Sample t-test
## 
## data:  loan_data[loan_data$home_ownership %in% c("RENT"), ]$highgrade and loan_data[!(loan_data$home_ownership %in% c("RENT")), ]$highgrade
## t = -14.684, df = 199441, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03450148 -0.02637568
## sample estimates:
## mean of x mean of y 
## 0.3976242 0.4280627

These all point to the fact that loans are more in favor of those of a better social class

Part 3. Build a logistic regression:

Proceeding to build a logistic regression model to predict the kind of loans a person will get; most predictors seem quite significant except for the home ownership predictor.

#eliminating unwanted data
loan_data <-loan_data[!(loan_data$home_ownership == ""),]
#model building
log_model <- glm("highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose", data = loan_data,family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#printing data summary
summary(log_model)
## 
## Call:
## glm(formula = "highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose", 
##     family = binomial, data = loan_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.9499  -0.7030   1.1244   2.6029  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         8.188e+00  2.666e+01   0.307   0.7588
## annual_inc                          8.547e-06  1.216e-07  70.262  < 2e-16
## home_ownershipMORTGAGE             -8.055e+00  2.666e+01  -0.302   0.7626
## home_ownershipOWN                  -8.071e+00  2.666e+01  -0.303   0.7621
## home_ownershipRENT                 -8.180e+00  2.666e+01  -0.307   0.7590
## loan_amnt                          -3.895e-05  6.762e-07 -57.601  < 2e-16
## verification_statusSource Verified -6.533e-01  1.090e-02 -59.925  < 2e-16
## verification_statusVerified        -9.497e-01  1.245e-02 -76.260  < 2e-16
## purposecredit_card                  8.271e-01  4.978e-02  16.616  < 2e-16
## purposedebt_consolidation          -8.010e-02  4.925e-02  -1.626   0.1039
## purposehome_improvement            -3.269e-01  5.256e-02  -6.218 5.02e-10
## purposehouse                       -2.032e+00  1.385e-01 -14.673  < 2e-16
## purposemajor_purchase              -1.265e-01  5.963e-02  -2.121   0.0339
## purposemedical                     -1.177e+00  7.063e-02 -16.659  < 2e-16
## purposemoving                      -2.159e+00  1.037e-01 -20.814  < 2e-16
## purposeother                       -1.173e+00  5.481e-02 -21.394  < 2e-16
## purposerenewable_energy            -2.306e+00  3.299e-01  -6.990 2.74e-12
## purposesmall_business              -1.844e+00  8.677e-02 -21.252  < 2e-16
## purposevacation                    -1.294e+00  8.797e-02 -14.712  < 2e-16
## purposewedding                     -4.688e-01  7.629e-01  -0.614   0.5389
##                                       
## (Intercept)                           
## annual_inc                         ***
## home_ownershipMORTGAGE                
## home_ownershipOWN                     
## home_ownershipRENT                    
## loan_amnt                          ***
## verification_statusSource Verified ***
## verification_statusVerified        ***
## purposecredit_card                 ***
## purposedebt_consolidation             
## purposehome_improvement            ***
## purposehouse                       ***
## purposemajor_purchase              *  
## purposemedical                     ***
## purposemoving                      ***
## purposeother                       ***
## purposerenewable_energy            ***
## purposesmall_business              ***
## purposevacation                    ***
## purposewedding                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 319981  on 235626  degrees of freedom
## Residual deviance: 290584  on 235607  degrees of freedom
## AIC: 290624
## 
## Number of Fisher Scoring iterations: 6
#prediciting result
loan_data$pred_result <- predict(log_model,loan_data,type= "response")

Choosing the best threshold for the top rating; as seen from the histogram most of these model have an accruracy of about 0.58 percent. hmmm..

func_1 <- function(num, loan_data){
  pred <- as.integer(loan_data$pred_result >= num)
  return(sum(pred==loan_data$highgrade)/nrow(loan_data))
}
num <- 1:100
num <- c(num/100)
result = sapply(num,function(i) func_1(i,loan_data))
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
hist(result, breaks = 20,col = wes_palette(n=1, name="GrandBudapest1"),labels = TRUE)

Now i will pick the top threshold(0.49) and make the final prediction

threshold = num[result == max(result)]
loan_data$pred_final <- as.integer(loan_data$pred_result >= threshold)

Let us check the predictive rating of our model; 0.66, not bad! For a frame of reference, the model that assigns 0 to all rows has an accuracy of: 0.58 For another frame of reference, the model that assigns random 0 and 1 to all rows has an accuracy of: 0.50

result = data.frame("form"= c("logistic regression","all zeroes","random numbers"), "predictive_value"= c(sum(loan_data$pred_final==loan_data$highgrade)/nrow(loan_data),                                                    sum(rep(0,235627)==loan_data$highgrade)/nrow(loan_data),sum(sample(c(1,0),235627,replace =TRUE)==loan_data$highgrade)/nrow(loan_data)))
ggplot(result,aes(x = form, y = predictive_value)) + geom_col(fill = wes_palette(n=3, name="GrandBudapest1")) + geom_text(aes(label = predictive_value),hjust = 1.02,size = 5, color = "white") + theme_minimal() + coord_flip()

Part 4. Build a classification tree:

To make another comparison, I will procedd to building a decision tree classification model to predict the kind of loans a person will get

library(rpart)
library(rattle)                 # Fancy tree plot
library(rpart.plot)             # Enhanced tree plots
#model construction
fit = rpart("highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose", data = loan_data,method = "class")

Find below the graphical representation of the model; the model returned seems quite simple; in fact only three predictors are used: purpose; verfiication status and annual income

fancyRpartPlot(fit,cex = 0.7,palettes = "OrRd")

Now we shall predict our dataset using our classifcation model and check for its accuracy; the accuracy returned 0.65, slightly lower than our logistic model

loan_data$pred_final_rf <- predict(fit,loan_data,type="class")

result = data.frame("form"= c("decision tree","all zeroes","random numbers"), "predictive_value"= c(sum(loan_data$pred_final_rf==loan_data$highgrade)/nrow(loan_data),                                                    sum(rep(0,235627)==loan_data$highgrade)/nrow(loan_data),sum(sample(c(1,0),235627,replace =TRUE)==loan_data$highgrade)/nrow(loan_data)))
ggplot(result,aes(x = form, y = predictive_value)) + geom_col(fill = wes_palette(n=3, name="GrandBudapest1")) + geom_text(aes(label = predictive_value),hjust = 1.02,size = 5, color = "white") + theme_minimal() + coord_flip()

Part 5. Performance on the test data:

Load the test data and clear the bottom two rows

loan_test <- read.csv("C:/Users/fzy20/Downloads/LoanStats3d.csv", skip = 1)
loan_test = loan_test[0:-2,]

Predict the value of the loan test data using the model we constructed earlier

#creating variable
loan_test$highgrade <- as.integer(loan_test$grade %in% c("A","B"))
#remove educational level
loan_test <- loan_test[!(loan_test$purpose %in% c("educational")),]
#remove unecessary variable
loan_test <-loan_test[!(loan_test$purpose == ""),]
#variable correction
loan_test$purpose <- as.factor(as.character(loan_test$purpose))
#predictions
loan_test$pred_result <- predict(log_model,loan_test,type= "response")
loan_test$pred_final <- as.integer(loan_test$pred_result >= threshold)
loan_test$pred_final_rf <- predict(fit,loan_test,type="class")

evaluating results of the models with benchmarks; it is quite interesting that both the decision tree and the logistic regression actually have similar performance, with logistic regression slightly higher

result = data.frame("form"= c("logistic regression","decision tree","all zeroes","random numbers"), "predictive_value"= c(sum(loan_test$pred_final==loan_test$highgrade)/nrow(loan_test),sum(loan_test$pred_final_rf==loan_test$highgrade)/nrow(loan_test), sum(rep(0,421092)==loan_test$highgrade)/nrow(loan_test),sum(sample(c(1,0),421092,replace = TRUE)==loan_test$highgrade)/nrow(loan_test)))
ggplot(result,aes(x = form, y = predictive_value)) + geom_col(fill = wes_palette(n=4, name="GrandBudapest1")) + geom_text(aes(label = predictive_value),hjust = 1.02,size = 5, color = "white") + theme_minimal() + coord_flip()

Part 6. Measure of performance:

Instead of stopping at the pure accuracy I also decided to test the recall and precision of our model. Recall(True Positive/True Positive + False Negative) and Power/Precision(True Positive/True Positive + False Positive) are both important metrics to really determine how good a model is. Specifically recall in this case refers to how how many individual who actually received high grade loans were classified as so, while precision refers to how many of those that we classified as high grade loan receivers actually received high grade loans.

Evaluating power of logistic regression model

sum(loan_test[loan_test$pred_final == 1,]$highgrade == 1)/sum(loan_test$pred_final == 1)
## [1] 0.6552654

Evaluating recall of logistic regression model

sum(loan_test[loan_test$pred_final == 1,]$highgrade == 1)/sum(loan_test$highgrade == 1)
## [1] 0.4798917

Evaluating power of the classification tree

sum(loan_test[loan_test$pred_final_rf == 1,]$highgrade == 1)/sum(loan_test$pred_final_rf == 1)
## [1] 0.6113031

Evaluating recall of the classification tree

sum(loan_test[loan_test$pred_final_rf == 1,]$highgrade == 1)/sum(loan_test$highgrade == 1)
## [1] 0.4993008