knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
The data that I use for this analysis (loan.csv) is a stripped down version of LendingClub’s complete loan data for loans issued between 2007 and 2015. LendingClub, a lending firm started in 2007, manages crowdfunding between disparate borrowers and investors. In this analysis, I propose that useful risk and return estimates can be ascertained from LendingClub’s public historical data.
## [1] 887379
## [1] "member_id" "loan_amnt"
## [3] "funded_amnt" "funded_amnt_inv"
## [5] "term" "int_rate"
## [7] "installment" "grade"
## [9] "sub_grade" "emp_length"
## [11] "home_ownership" "annual_inc"
## [13] "verification_status" "issue_d"
## [15] "loan_status" "purpose"
## [17] "addr_state" "dti"
## [19] "delinq_2yrs" "earliest_cr_line"
## [21] "inq_last_6mths" "mths_since_last_delinq"
## [23] "open_acc" "pub_rec"
## [25] "revol_bal" "revol_util"
## [27] "total_acc" "out_prncp"
## [29] "out_prncp_inv" "total_pymnt"
## [31] "total_pymnt_inv" "total_rec_prncp"
## [33] "total_rec_int" "total_rec_late_fee"
## [35] "recoveries" "collection_recovery_fee"
## [37] "last_pymnt_d" "last_pymnt_amnt"
## [39] "last_credit_pull_d"
LendingClub uses a formula called Net Annualized Return which measures the rate of return on the principle invested over the lifetime of the investment. Unfortunately, this formula relies on payment history that we don’t have access to in this data set, so I’m not able to fully implement this. I’m also aware of an approach to calculating ROI that accounts for the expected default rate of a given loan. Since I’m only concerned with the observed liklihood that a given borrower will default, I will simply look at how much was loaned versus how much was paid back. The total payment variable that I use includes the principle, effective interest payments, and fees associate with a particular loan.
I’m only interested in historical loans for this analysis, so loans that haven’t matured yet can be ignored. In this step, I’m taking only the subset of loans with 36 month terms that should have matured between 2007 and 2012. 5 year loans and loans that were issued in the later years are probably not mature yet and should include a disproportionate number of defaulted loans which will skew our data. Removing these loans will give me a clearer, more accurate picture of how loans really performed.
## [1] 74866
LendingClub was founded in 2007 and began growing rapidly thereafter as shown in the histograms for the number of loans issued each year. 74,866 36 month loans were issued between 2007 and 2012. The volume of these loans has increased over time. Loan issuance seems to be slower in the first half of the year, but pick up in the second half. It’s also interesting to see the dip in the growth of loan issuance in 2008, during the height of the Great Recession in the US. Perhaps this is due to a reflexive tightening of credit standards in the face of the housing crash. I’d like to further explore how temporal factors played a role in lending volume and loan outcomes.
## clmn
## A B C D E F G
## 0.27 0.35 0.22 0.12 0.03 0.01 0.00
LendingClub assigns grades to loans based on an internal formula that looks at indicators of the applicant’s credit risk. Higher-grade loans (i.e. A,B,C) indicate better credit and lower risk while lower grade loans (i.e. E, F, G) indicate the opposite. Looking at the distribution of loans by subgrade shows that the vast majority of loans are A, B, or C grade (87%), while grades E, F and G are far less prevelant(4%). Interest rates increase and become more variable as the loan grade gets lower. This interest rate adjustment with respects to loan grade is meant to account for the greater risk and volatility associated with each loan. I’d like to explore this idea of risk in later sections.
## clmn
## Charged Off Fully Paid
## 0.13 0.87
To determine the extent that certain factors have on ROI (return on investment), I will focus on the loans that are either Fully Paid or Charged Off, as those loans are already matured and can be analyzed retrospectively. 13% of these loans are Charged Off and 87% were Fully Paid.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1896 40000 57000 67251 80000 7141778 4
The vast majority of borrowers have been employed for at least 10 years and the average income falls bewteen $40k and $80k with a mean of $67,251. This reveals a mostly middle-class cohort of borrowers. I’m assuming that those with more work history and higher income have a better liklihood of repaying their loans. That assumption will need to be tested later.
## clmn
## car credit_card debt_consolidation
## 0.02 0.18 0.51
## educational home_improvement house
## 0.01 0.06 0.01
## major_purchase medical moving
## 0.04 0.02 0.01
## other renewable_energy small_business
## 0.08 0.00 0.03
## vacation wedding
## 0.01 0.02
A plurality of borrowers are renters and the most popular reason for seeking the loan seems to be for debt consolidation. Borrowers who either own their homes or aren’t burdened with a mortgage and want to consolidate debt don’t strike me as high risk. On the other hand, if the borrower had a burdensome mortgage and needed a loan to make a major purchase or to pay off debt, that seems like it would be higher risk. I would have to look at debt profiles to get a better idea of how risky a borrower would be.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0938 0.1484 0.1501 0.2038 0.3499
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.3340 0.5590 0.5339 0.7510 1.1900 127
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.02632 0.35714 0.46875 0.49354 0.61111 1.75000 29
The average debt-to-income ratio is a reasonable 15%. There doesn’t seem to be much skew, considering that the median is so close to the mean. More than 20% debt-to-income starts to be considered problematic according to some experts. The mean credit utilization, however, is 53% which means the average borrower is using most of their revolving credit at a time when they are seeking the loan. This is higher than the 30% recommended by many experts. There is again a large spread for credit utilization, but not much skew. The percentage of open accounts to total accounts seems to be around 50% which doesn’t seem problematic.
According to a 2016 report from NerdWallet, U.S. households with higher incomes tend to have higher debt numbers, but that debt tends to make up a smaller percentage of its annual income. So, the high credit utilization numbers alone don’t necessarily indicate financial instability (or irresponsibility). I think this gives me a good enough idea of the kind of borrowers we have. I may want to drill down later into other personal finance metrics that might be of interest.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 0 236.03 514.20519 650.39 18858.99
## 2 0 0 0.00 16.17264 0.00 27750.00
## V1
## 1 13.36414
## 2 11.86130
## clmn
## Charged Off
## 1
## clmn
## Charged Off Fully Paid
## 0.02 0.98
## A B C D E F G
## 1 0.15 0.33 0.27 0.18 0.05 0.02 0.01
## 2 0.29 0.36 0.21 0.11 0.03 0.01 0.00
## V1
## 1 0.26095558
## 2 0.07212291
There are two sections in the distribution of ROI; one at least 0.0 ROI and one less than 0.0 ROI. The starkest differences between these sections were the following: - All of the loans with negative ROI defaulted while almost all of the loans with positive ROI were fully paid (98%)
Related to the above point, loans that ended up with lower ROI’s involved a greater recovery effort ($514 vs. $16) and slightly higher interest rates (13.4% vs. 11.9%)
Grades are slightly higher for loans that were fully paid (makes sense); twice as many A grade loans
A lot more variability in ROI for loans that defaulted compared to loans that were fully paid
A lot of loans had near 0% returns - contributing to the large spike at 0.0 ROI
The sections on the left represents borrowers who defaulted, while the section on the right represents borrowers who fully paid their loans. Both these cohorts settle around their respective means. I think the larger variability in ROI on the left may be associated with higher risk. The interesting thing to keep in mind is the role of interset rates and “recoveries” (e.g. late fees, collections fees, etc.) and how they relate to profitability and risk: borrowers who are considered riskier will likely have higher interest rates and pay more fees and will, therefore, yield a higher-than-average ROI if they do make their payments. If they default, however, then at least the higher interset rates and fees will have padded the rate of loss to some degree. On the other hand, borrowers who are perceived as less risky may pay less in interest and fees, but they are expected to actually pay more reliably.
I think this plot is crucial as it reveals the well known risk vs. reward theme in lending. This theme will probably reappear in other relationships in the dataset.
How does time of year relate to loan volume and outcome?
We can further explore how “seasons” effect loan profitability by doing a time series analysis on ROI.
How did people with poor credit metrics compare to people with good credit metrics?
From looking at some borrower statistics, we can see that our cohort of borrowers seem relatively well-established being at least in their 30’s, owning a home, and having an annual income of about ($40-$80). When looking at their credit statistics, the distributions for debt-to-income ratio and revolving credit usage (relative to income) seemed pretty stable, averaging at 15% and 53% respectively. Were they paying off their loans? How much more in interest were they paying than people with good credit metrics?
What features tell you the most about profitability?
I’d like to look into which borrower metrics generally have the biggest influence on ROI.
To what degree do interest and fees contribute to profitability?
We’ve seen the stark net_profit in recovery rates and variability in the ROI distribution, how does that relationship look for different cohorts in the dataset?
I want to zoom in and take a look at the following graphs that I thought were interesting:
dti vs. ROI
ROI vs. recoveries
annual income vs. ROI
interest rates vs. ROI
loan amount vs. ROI
credit indicators vs. ROI
Takeaways from these plots:
You can see the generic ROI distribution expressing itself in the two sections at the top and bottom in the roi vs. dti plot. The dense band at the top represents borrowers who largely paid their loans and loans at the bottom represents borrowers who defaulted
Most loans that required recoveries ended up with negative ROI
There’s more variability in ROI for borrowers with lower incomes while borrowers with higher incomes were more likely to have positive ROI
As with the ROI vs. dti plot, I’m assuming that the top band of the interest rate vs. ROI plot represent borrowers who paid their loans (moreover) while the lower portion represents those that didn’t. As I expected, there was a clear positive correlation between interest rates and ROI for borrowers who paid their loans, but for those who didn’t the ROI was negative and didn’t seem to correlate much with interest rates
Amount of recoveries seemed to have had the strongest correlation with profitability (-0.296) while dti had the weakest (0.005)
Interestingly, there’s a positive correlation (.02) between number of past defaults and roi. Other indicators of poor credit don’t necesarily indicate unprofitable borrowers
Takeaways from the plots:
Loans that were charged off obviously had lower ROI and more variability as we saw in the generic ROI distribution
Lower grades tend to have higher mean ROIs, but more variability - the risk vs. reward relationship again
## # A tibble: 12 x 6
## issue_mnth.ordered roi_mean roi_median roi_std amount n
## <fctr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Jan 0.07326420 0.14 0.2383801 10708.844 4243
## 2 Feb 0.08357041 0.14 0.2322338 11211.739 4204
## 3 Mar 0.08085510 0.14 0.2360110 10815.965 4748
## 4 Apr 0.08376145 0.14 0.2324391 10416.659 5131
## 5 May 0.07874272 0.14 0.2376360 10455.192 4979
## 6 Jun 0.08384364 0.14 0.2377086 9839.805 5500
## 7 Jul 0.08599412 0.15 0.2399038 10123.531 6468
## 8 Aug 0.09007333 0.14 0.2326443 10420.840 7091
## 9 Sep 0.09294382 0.15 0.2304946 10462.032 7796
## 10 Oct 0.09166013 0.15 0.2353048 10704.910 8156
## 11 Nov 0.09431923 0.15 0.2318452 10902.411 8439
## 12 Dec 0.09919739 0.15 0.2194553 12152.700 8111
## ls.select_months$issue_mnth: Apr
## clmn
## car credit_card debt_consolidation
## 0.03 0.17 0.46
## educational home_improvement house
## 0.01 0.07 0.01
## major_purchase medical moving
## 0.05 0.02 0.01
## other renewable_energy small_business
## 0.10 0.00 0.03
## vacation wedding
## 0.01 0.02
## --------------------------------------------------------
## ls.select_months$issue_mnth: Dec
## clmn
## car credit_card debt_consolidation
## 0.02 0.21 0.54
## educational home_improvement house
## 0.00 0.05 0.01
## major_purchase medical moving
## 0.03 0.01 0.01
## other renewable_energy small_business
## 0.07 0.00 0.03
## vacation wedding
## 0.01 0.01
## --------------------------------------------------------
## ls.select_months$issue_mnth: Jan
## clmn
## car credit_card debt_consolidation
## 0.03 0.18 0.49
## educational home_improvement house
## 0.01 0.04 0.01
## major_purchase medical moving
## 0.04 0.02 0.01
## other renewable_energy small_business
## 0.08 0.00 0.05
## vacation wedding
## 0.01 0.02
## --------------------------------------------------------
## ls.select_months$issue_mnth: Oct
## clmn
## car credit_card debt_consolidation
## 0.02 0.19 0.54
## educational home_improvement house
## 0.00 0.06 0.01
## major_purchase medical moving
## 0.03 0.02 0.01
## other renewable_energy small_business
## 0.07 0.00 0.03
## vacation wedding
## 0.01 0.01
##
## Pearson's product-moment correlation
##
## data: ls.roi_by_month$roi_mean and ls.roi_by_month$n
## t = 6.978, df = 10, p-value = 3.816e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7059602 0.9750495
## sample estimates:
## cor
## 0.9108353
There is a strong positive correlation between loan volume and loan profitability by month. My immediate guess is that borrowers have different reasons for seeking loans in the early months than they do in the later months. These different reasons may account for the different profitabilities seen in the plot.
Looking at the purpose of loans in the earlier months (January - June), we can see that there are slightly less debt consolidation loans than in later months (July - December) - this coincides with ROI. Perhaps there is something about borrowers who are seeking a loan for debt consildation that makes them apply later in the year and causes them to pay more reliably. Other factors could be at play or it could also be part of an internal business strategy at LendingClub - it’s hard to tell. Otherwise, I don’t really see another explanation since all other metrics of interest (median, std, loan amount) show almost no variation from month to month.
I imagine there must be some relationship between loan purpose and profitibility, though.
## ls$purpose: car
## [1] 0.08086957
## --------------------------------------------------------
## ls$purpose: credit_card
## [1] 0.1128816
## --------------------------------------------------------
## ls$purpose: debt_consolidation
## [1] 0.09160097
## --------------------------------------------------------
## ls$purpose: educational
## [1] 0.03980247
## --------------------------------------------------------
## ls$purpose: home_improvement
## [1] 0.08305586
## --------------------------------------------------------
## ls$purpose: house
## [1] 0.050033
## --------------------------------------------------------
## ls$purpose: major_purchase
## [1] 0.08813616
## --------------------------------------------------------
## ls$purpose: medical
## [1] 0.05596038
## --------------------------------------------------------
## ls$purpose: moving
## [1] 0.06023333
## --------------------------------------------------------
## ls$purpose: other
## [1] 0.06385369
## --------------------------------------------------------
## ls$purpose: renewable_energy
## [1] 0.02662069
## --------------------------------------------------------
## ls$purpose: small_business
## [1] 0.02097334
## --------------------------------------------------------
## ls$purpose: vacation
## [1] 0.06728358
## --------------------------------------------------------
## ls$purpose: wedding
## [1] 0.09181502
For each reason for seeking a loan, profitibility trend is more or less the same as it is generally: borrowers who default and borrowers who pay their loans off form two maxima in the plot with defaulted loans (on the left) having more variable ROI than fully paid loans (on the right). There are slight differences in the mean ROI for each purpose. Small business loans appear to be the least profitable (0.02 rate of return), while loans for credit card debt tend to be most profitable (0.11 rate of return). Are these different profitabilities due to differences in interest rates, default rates, financial stability of borrowers, or some other factors?
In understanding profitability, it’s not necessarily the case that the borrowers with the best credit profiles or the lowest loan amounts will end up being the most profitable borrowers. Profitability involves a mixture of factors. I was a little surprised that borrowers who seemed to have worst credit indicators ended up being more profitable as a group, but since higher risk can translate into higher reward, I suppose it would make sense in that regard. The most interesting trend that I noticed in the plots was the difference in the ROI distribution between borrowers who paid and those who didn’t. The sensitivity of ROI to factors like debt, income, and interest rate is inherent in this graph and deserves more focus.
To what degree do interest and recovery payments account for overall profitability? Look at how interest played a role for different features.
How does ROI relate to net profit? High ROI means more money was earned relative to the amount that was invested. This doens’t tell us about how much actual profit was gained for each loan. Relating this to net profit will tell us to what degree was a loan profitable and by amount. We should also see how this relates to the overall portfolio.
##
## Pearson's product-moment correlation
##
## data: ls$int_vs_total and ls$roi
## t = -77.735, df = 74838, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2799501 -0.2666917
## sample estimates:
## cor
## -0.2733339
##
## Pearson's product-moment correlation
##
## data: ls$rec_vs_total and ls$roi
## t = -173.19, df = 74838, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5399946 -0.5297653
## sample estimates:
## cor
## -0.5348995
##
## Pearson's product-moment correlation
##
## data: ls$cost_vs_income and ls$roi
## t = -6.2652, df = 74860, p-value = 3.743e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03005118 -0.01573191
## sample estimates:
## cor
## -0.02289272
The share of interest and recovery payments had a moderate negative correlation with ROI Generally, I think more high recovery fees indicate a greater likelihood of default. The role of interest rates seems to have a modest impact on profitability, but we’ll have to look at the actual amount that was gained/lost to get a more accurate picture of profitbility.
## [1] 75922254
We can see from the plots that there are much more loans between 0 and .4 ROI with a net gain than there are between 0 and -1 with a net loss, leading to the $76+ million net profit. Loans with higher ROIs appear to be more sensitive to net profit than those with lower ROIs. We’ve seen a similar relationship with interest rates and ROI where loans with higher ROIs were more sensitive to changes in interest rates.
The above plots remind us that the magnitude of the ROI alone will not tell us everthing we need to know about profitability. We need to consider how much profit each loan yields and how many of such loans are in the portfolio.
Generally, higher interest rates yield higher ROIs, and therefore, higher net profits. The vast majority of loans carry a 10%-15% interest rate, so most of the revenue in the portfolio comes from that group of loans. As mentioned before, higher interest rates usually indicate riskier borrowers, so the ROI will vary more widely for higher interest rates, since they will include more borrowers who defaulted on one extreme along with borrowers who paid their high interest payments and were especially profitable on the other extreme.
The second set of line graphs shows how the ROI range gets wider as interest rates get higher. A useful thing to notice in these plots is the area between the lines on the far right where additional profit is gained between the different interest rates. This area can be adjusted to balance risk and reward.
Recall that borrowers who were seeking to pay down credit card or reconsolidate their debt were more profitable than those who were seeking loans related to small businesses or renewable energy. Also, loans issued near the end of the year seemed to be more profitable than those at the beginning. We should should see how much interest rates plays into these differences.
In the first set of graphs, we can see, again, how loans with higher interest rates have higher net profit and ROI. For the loan purpose plots, loans for credit cards and debt reconsolidation seem to include more high interest loans that would boost the profitability for those loans. This tend can also be seen when comparing earlier months to later months in the issue month plots.
This section honed in on the role of interest rates in profitability and looked further into the relationship between ROI and profitability for different cohorts of borrowers. The plots in this section taught me some new things but mainly just reinforced what I had already seen in previous sections:
Recoveries have a moderate impact on ROI
Higher ROIs yield more profit
Distribution of loans and relationship between ROI and profit must be considered
Higher interest rates yield higher ROIs – and, thus, more net profit
Overall portfolio has more middle tier interest rates (10%-15%)
Loans with higher ROIs were more sensitive to changes in interest rates
I suspect that overall profit can be boosted by adding more high interest loans and padding them with more middle interest loans; perhaps this is why cohorts such as December loans and credit card loans, with more robust distributions, can afford to add riskier high interest loans and, thus, achieve higher ROIs. I didn’t look to much into the debt profile of these loans, but I think the higher risk loans would include more borrowers with poorer credit metrics (high dti, more public records, deliquencies, etc.) based on what we’ve seen in previous plots. These risks can be balanced by assigning them to certain grades and subgrades and managing the distribution of these grades in the portfolio.
This plot shows the general ROI distribution in the dataset. We can see that the unprofitable loans on the left are less frequent and cover a wider range of ROIs while the profitable loans on the right a far more frequent and cover a smaller range of ROIs. This plot is significant because it shows the fundamental balance of ROI in the dataset.
This plot shows how loans tend to have greater returns later in the year. This plot was significant because of how stark it is. I think it also reveals how a higher volume of loans can translate into better returns, probably because you have more flexibility in the risk vs. reward balance when you have more loans.
This multivariate plot shows the relationship between ROI and net profit for different interest rates. This plot is significant because it supports the notion that Fully Paid loans are more sensitive to changes in interest rates and that higher interest loans are more profitable, relatively and nominally.
I think this particular dataset was way more challenging than I had expected. The initial problem was deciding how to clean and transform te data based on what I wanted to know. Of course, if I wanted to know new things I would often go back and transform the data again - it was a cyclical process rather than linear. Another part of the challenge was wrapping my head around the large number of variables at my disposal. It’s tempting to want to look at a permutation of all the relationships that could exist, but I had to narrow my scope and accept that I might not catch everything in one swoop.
I did enjoy the process of actually learning about the lending process and some of the intricacies of lending under the hood. This type of knowledge might be materially useful in the future. I also enjoyed using the R language and RStudio. Coming from developing with C in a text editor, I’m really being spoiled with the remarkable convenience of the platform (very Python-esque).
For future work, I could look more into how geography plays into the risk vs. reward paradigm. Eventually, understanding how to boost profitability from the investor side would be very useful.