Data 606 Final Project

David Quarshie

December 9, 2017

Part 1 - Introduction

Lending Club (LC) is a peer-to-peer lending platform that allows borrowers to submit loan requests and lets investors fund those loans. Borrowers can ask for up to $40K, and according to LC they can be funded within days. Based on a borrower’s information (income, credit history, debt, etc.), the loan is given a grade from A (best) to G (worst). Investors can use this grade to decide whether they would like to help fund the loan.

In this study we examine whether the average amount funded differs by loan grade. Just like bank loans, LC loans carry higher interest payments as the loan grade worsens. Knowing that, should a borrower with a good grade ask for more or less money? If there is no difference in the amount funded by grade, should a borrower with a grade of G even worry about the grade? We will try to answer these questions here.

Part 2 - Data

Data Collection

LC posts all of its funded and rejected loan information on its site (https://www.lendingclub.com/info/download-data.action). For this study we will look at the funded loan data, which contains details about the borrower, including annual income, state, job, and loan amount.

Cases

Each case is a single funded loan. We will take a sample of 1,000 loans from 2014.
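
The report does not say how the 1,000 loans were drawn. One way to do it, assuming a simple random sample without replacement, is sketched below. The helper name sample_loans, the seed value, and the loans_2014 data frame are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): draw a reproducible
# simple random sample of n rows without replacement.
sample_loans <- function(loans, n = 1000, seed = 606) {
  stopifnot(nrow(loans) >= n)   # cannot sample more rows than exist
  set.seed(seed)                # arbitrary seed so the same rows are drawn each run
  loans[sample.int(nrow(loans), n), , drop = FALSE]
}

# Assumed usage: loans_2014 would hold the full 2014 funded-loan file
# loangrades <- sample_loans(loans_2014)
```

Fixing the seed means the same rows are selected each time the report is knit, so the tables and tests further down stay reproducible.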

Variables

We will focus on the fields grade and funded_amnt_inv, the amount that investors have committed to the loan.
Grade: the explanatory variable, which is categorical
Funded Amount: the response variable, which is numerical

Type of Study and Scope of Inference

This is an observational study, so we will not be able to establish causal links between the variables. LC lends throughout the U.S., so our population of interest could be anyone looking to borrow or lend money in the United States.

library(ggplot2)
library(dplyr)
library(DT)

# Load the sampled loans; read.csv() already returns a data frame
loangrades <- read.csv("loangrades.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

# Preview the first 10 rows
datatable(head(loangrades, 10))

Part 3 - Exploratory Data Analysis

Loan Grade Summary

summary(loangrades)
##    loan_amnt      funded_amnt    funded_amnt_inv    grade          
##  Min.   : 1000   Min.   : 1000   Min.   :    0   Length:1000       
##  1st Qu.: 6000   1st Qu.: 5900   1st Qu.: 5000   Class :character  
##  Median : 9900   Median : 9600   Median : 8991   Mode  :character  
##  Mean   :11188   Mean   :10900   Mean   :10327                     
##  3rd Qu.:15000   3rd Qu.:15000   3rd Qu.:14000                     
##  Max.   :35000   Max.   :35000   Max.   :35000

Looking at the summary of the loangrades table we see that the mean of the funded amount by investors (funded_amnt_inv) is $10.3K. That’s less than the mean of the total loan funded amount (funded_amnt) which is $10.9K and less than the average requested amount (loan_amnt) which is $11.2K.

# Per-grade loan counts and funding statistics
loangrades_grp <- loangrades %>%
  group_by(grade) %>%
  summarise(
    loans = n(),
    mean_funded = mean(funded_amnt_inv),
    sd_funded = sd(funded_amnt_inv),
    max_funded = max(funded_amnt_inv),
    mean_funding_diff = mean(loan_amnt - funded_amnt_inv)
  )

datatable(loangrades_grp)

When we group the loans by grade, we see that grade A has the most loans while grade G has the fewest. Comparing the amount funded to the amount requested, grade G loans show the smallest average difference.

ggplot(loangrades,aes(x=funded_amnt_inv)) + geom_histogram(aes(fill=grade),bins=5) + ggtitle("Funded Amount Histogram")

Our histogram shows that most loans are funded at around $10K, with a right tail that extends to the $35K maximum.

Part 4 - Inference

Hypothesis

\[ H_0: \mu_A = \mu_B = \mu_C = \mu_D = \mu_E = \mu_F = \mu_G \]

H0: There is no difference in the average funded amounts between loan grades

HA: At least one loan grade has a different average funded amount

Check conditions for ANOVA

ggplot(loangrades,aes(x=grade,y=funded_amnt_inv)) + geom_boxplot(aes(fill=grade)) + xlab("") +  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

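ANOVA requires independent observations, approximately normal distributions within each group, and roughly equal variability across groups. We treat the sampled loans as independent, and the boxplots show distributions that are close enough to normal for us to proceed with the ANOVA test.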
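
As a numeric complement to the boxplots, the sketch below reports the size and standard deviation of each group plus the ratio of the largest to the smallest standard deviation. The helper name check_anova_conditions is hypothetical, and the idea that a ratio well above 2 is a warning sign is a common rule of thumb rather than something taken from this analysis.

```r
# Hypothetical helper: per-group sizes and spreads for judging the
# roughly-equal-variability condition of ANOVA.
check_anova_conditions <- function(df, response = "funded_amnt_inv", group = "grade") {
  y <- df[[response]]
  g <- factor(df[[group]])
  sds <- tapply(y, g, sd)                 # standard deviation within each group
  list(
    n        = tapply(y, g, length),      # number of observations per group
    sd       = sds,
    sd_ratio = max(sds, na.rm = TRUE) / min(sds, na.rm = TRUE)
  )
}

# Assumed usage with the data frame loaded earlier:
# check_anova_conditions(loangrades)
```

Small groups deserve extra care here, because a standard deviation estimated from only a handful of loans is itself very noisy.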

ANOVA Test

anova <- aov(loangrades$funded_amnt_inv ~ loangrades$grade)
summary(anova)
##                   Df    Sum Sq   Mean Sq F value   Pr(>F)    
## loangrades$grade   6 3.938e+09 656266473   14.48 6.88e-16 ***
## Residuals        993 4.500e+10  45319714                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value of 6.88e-16, far below any conventional significance level, we reject H0 and conclude that the average funded amount differs for at least one loan grade.
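
To see where the F value comes from, it can be recomputed by hand from the ANOVA table above: F is the mean square for grade divided by the residual mean square, and the p-value is the upper tail of the F distribution with 6 and 993 degrees of freedom. The variable names below are illustrative; the numbers are copied from the table.

```r
# Values copied from the ANOVA table above
mean_sq_grade    <- 656266473   # Mean Sq for grade (between groups)
mean_sq_residual <- 45319714    # Mean Sq for residuals (within groups)
df_grade    <- 6                # 7 grades - 1
df_residual <- 993              # 1000 loans - 7 grades

f_stat  <- mean_sq_grade / mean_sq_residual
p_value <- pf(f_stat, df_grade, df_residual, lower.tail = FALSE)

round(f_stat, 2)  # 14.48, matching the F value in the table
p_value           # upper-tail probability; compare with Pr(>F) in the table
```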

Tukey Honest Significant Difference

thsd <- TukeyHSD(anova, ordered=TRUE)
thsd
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = loangrades$funded_amnt_inv ~ loangrades$grade)
## 
## $`loangrades$grade`
##            diff         lwr       upr     p adj
## B-A  1885.05254   240.02845  3530.077 0.0130194
## C-A  1913.34057    14.88406  3811.797 0.0467810
## D-A  3527.17222  1433.26487  5621.080 0.0000157
## E-A  5875.03106  3129.12666  8620.935 0.0000000
## F-A  7186.80087  2682.25198 11691.350 0.0000569
## G-A 13899.04127  6285.61364 21512.469 0.0000018
## C-B    28.28802 -1816.06539  1872.641 1.0000000
## D-B  1642.11968  -402.86209  3687.101 0.2114425
## E-B  3989.97852  1281.19772  6698.759 0.0002992
## F-B  5301.74833   819.73281  9783.764 0.0089409
## G-B 12013.98872  4413.87145 19614.106 0.0000699
## D-C  1613.83165  -640.05016  3867.713 0.3442363
## E-C  3961.69049  1091.92983  6831.451 0.0009607
## F-C  5273.46030   692.35826  9854.562 0.0123568
## G-C 11985.70070  4326.73109 19644.670 0.0000871
## E-D  2347.85884  -654.77952  5350.497 0.2400224
## F-D  3659.62865 -1005.86241  8325.120 0.2365462
## G-D 10371.86905  2662.12675 18081.611 0.0014680
## F-E  1311.76981 -3680.45914  6303.999 0.9871724
## G-E  8024.01020   112.26771 15935.753 0.0443046
## G-F  6712.24040 -1968.00378 15392.485 0.2524488

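Tukey’s Honest Significant Difference method compares every pair of grades. The diff column gives the difference in average funded amount, lwr and upr give the 95% family-wise confidence interval for that difference, and p adj gives the adjusted p-value. At the 0.05 level, grade A differs from every other grade and grade G differs from every grade except F, while neighboring pairs such as C-B, D-C, E-D, F-E and G-F do not differ significantly.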
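
With 21 pairwise comparisons, it can help to filter the Tukey table down to the pairs that are significant. The sketch below does that with a small helper; the name significant_pairs and the default cutoff of 0.05 are assumptions made for illustration.

```r
# Hypothetical helper: keep only the pairs whose adjusted p-value is below alpha.
significant_pairs <- function(tukey, term = 1, alpha = 0.05) {
  tab <- tukey[[term]]                       # matrix with columns diff, lwr, upr, p adj
  tab[tab[, "p adj"] < alpha, , drop = FALSE]
}

# Assumed usage with the thsd object created above:
# significant_pairs(thsd)
```

Using drop = FALSE keeps the result a matrix even when only one pair (or none) passes the cutoff, so later code can rely on the same column names.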

Part 5 - Conclusion

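The ANOVA and Tukey results indicate that the grade of a Lending Club loan is associated with the amount investors fund: in our sample, the average funded amount rises as the grade moves from A to G. Because this is an observational study, we cannot say that the grade itself causes the difference. One possible explanation is that lower-grade loans carry higher interest rates and so offer a higher return, which may make investors more inclined to fund them. However, investors must keep things like charge-off rates in mind when making these decisions.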

Lending Club’s data is very interesting and useful. With the performance of previous loans available to investors, many more tests like the one done here can help investors know what to look for before funding borrowers. Tests looking at specific locations and income levels may help someone find the best area to invest in. By combining statistics and probability with data analysis, we can not only help others looking for funding but also make good investments ourselves.