Peer-to-peer lending platform industry is thriving in recent years. Thousnads of investors are making profit through these platforms; thousnads of borrowers are getting money more easily. Although these platforms provide credit score and basic information of borrowers to ensure these lending tradings are in a safety trading environment, there are still thousnads of people under risk of losing money.
Even Prosper, a leading financial peer-to-peer lending platform company, still suffered credit risk issues before. But after their reconstruction and new credit system launched, the credit risk has been improved. It’s leading me want to find out stories behind the scenes. Here I will explore this Prosper dataset and try to find out some patterns behind borrowers properties, different Rating type, and how they link to default loan or completed loan.
The Prosper dataset(last updated on 03/11/2014) contains four kinds of variable categories during 2005 to 2014:
Loan Status : The status of the loan list such as Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue.
Borrower Data : Basic properties about borrowers such as credit score, occupation, employment status, etc.
Loan Data : Basic properties about the loan such as length of the loan(term), Borrower APR, etc.
Credit Risk Matrices : Matrices measured the risk of loans, such as Credit grade, Prosper Score, bank card utilization, etc.
What properties and risk factors are connected to defaulted loan? Completed loan status?
## [1] 113937 81
The dataset contains 81 variables and 113,979 observation for each loan list data.
First, let’s take a look in the Loan Status variables:
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
We see that there are 10 loan status. Observe that there are six kinds of PastDue classified by different number of due days, since I only care about whether a loan is under high risk or not, to be more specific, I combined loans under 121 days of PastDue into one variable: PastDue.
Besides, since the definition of Chargedoff is— a loan reaches 121 days past due. So we can see there are still 16 observations in over 120 days past due catagory, so I combined the Past Due (>120 days) into Chargedoff variable.
Furthermore, the number of observations in FinalPaymentInProgress and Cancelled are very small, and based on their definitions, the loan of FinalPaymentInProgress is almost completed, the Cancelled loan will not be listed in the future, so I classified both of them into Completed.
The combination result becomes as follows:
##
## Completed Current PastDue Chargedoff Defaulted
## 38284 56576 2051 12008 5018
And the bar chart of loan statust:
Furthermore, I create a variable CompletedOrRisk to label a loan is high risk(including pastdue, chargeoff, default) or completed to simplify the status variable, because that’s the two categories that I only care about. And I created a sub set sub_loan which only contains the CompletedOrRisk in Completed and HighRisk by removing the Current loans.
## [1] 57361 82
There 57,361 observation left in the subset.
Group it by loans before 2009 and after 2009:
Here we can see that:
The proportion of current loan is about 50%, meaning that there are about half of loan data that we are not sure whether they are defaulted or completed.
There are about 34% completed loan.
About 17% of charge off loan and defaulted loan in the dataset, which means in the duration of the dataset it roughly have about 17% high risk loan that investor may loss money.
Among the loans which have status in Completed or HighRisk, the proporation of high risk loans have decreased after 2009 from about 37% to 30%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
Above table and graph show the borrow rate is about 20% on average in the whole data set, the range is about 5%-35%, with slightly right-skewed except that it has a spike around 32%.
Zooming in the borrower rate with each loan status:
## sub_loan$CompletedOrRisk: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1745 0.1865 0.2511 0.4975
## --------------------------------------------------------
## sub_loan$CompletedOrRisk: HighRisk
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1750 0.2390 0.2320 0.2900 0.4975
On average, the borrower rate with high risk loan are around 23%, with spikes around 32%; and the rate with completed loan are around 19% with a slightly right-skewed, and spikes around 32% as well. In general, the completed loans have lower average borrower interest rate than high risks’.
I wonder what kind of loan cause such a 32% of borrower rate exists in both completed and high risk loans. I guess maybe such a high rate occurred in specific year because interest rate is highly sensitive with different financial condition in different year.
##
## 0.3177 0.35 0.3199 0.29 0.2699
## 3672 1905 1651 1508 1319
The table above shows that the most frequent interest rate is 31.77% with 3672 counts, the second is 35% with 1905 counts, and if we group the interest rate by year:
Here we can see that the interest rate around 32% appeared mostly in 2011 and 2012 with roughly 2000, 3600 counts respectively. And apparently, we can see the same spiking pattern around 30 % and 35% in 2007 and 2008 as well.
I flipped through the Prosper 2012 Annual Report, there listed a table shows borrower interest rate is based on the Prosper Rating. For rating the highest risk, HR, the corresponding interest rate was ranging from 30.79% to 31.77%.
Great. Here I assume that these spiked interest rates in 2007, 2008, 2011 and 2012 were loans which be rated as HR or high risk credit. It makes sense because in these years there were global crisis occurred and it probably lead to increase the number of risker due to these rating are based on credit records of them in these year.
I also want to check out if the borrower rate is classified by different credit rating: good rating with lower borrower rate; bad rating with higher borrower rate.
First, I observe the subset with loan in both completed and high risk, excluding current loan in each year:
We can see that there are just few numbers completed or high risk loan in 2014.
Next, I group the interest rate with different Prosper Rating in each year:
We can see the rating color distribution are align with different borrower rate and has a significant color spectrum across each borrower rates– lower interest rates are with good credit ratings, and higher interest rates are with worse ratings. We can see the spiked-rate are primary in pink color (which represents HR rating).
The result makes sense because just as mentioned before, the borrower interest rate partially determined by [Prosper Rating] (https://prosper.zendesk.com/hc/en-us/articles/210013213-How-does-Prosper-set-interest-rates-).
How about different loan status in different scope of borrower rates?
We can see that some high risk loans distributed evenly at lower interest rate area before 2009, and after 2009 it has been slightly improved with more moving to the right sides. But this is not very significant.
Another interesting thing is, the most spiked interest rate, which is around 32%, is with evenly distributed of completed loans and high risk loans, meaning that even if one loan is rated as highest risk, there still has the chance that the loan can be completed.
Highlights in this section:
The borrower interest rate has clearly patterns across different credit rating and somewhat determined by Credit Grade and Prosper Rating.
Seems like the Prosper Rating has better ability to measure loan risk than previous matrix, but not very significant.
In such a huge dataset contains so many variables of credit risk matrices, I will start from the most directly matrix for investors which they can see directly in the platform.
Before 2009, the major credit risk matrix displayed for investors was Credit Grades, which was based on the borrower’s credit score from an independent credit reporting agency. But the loan performance on Prosper were not very well at that time. After temporarily shut down asked by SEC and restructuring, Prosper launched new credit risk matrix displayed since July 2009 — Prosper Rating, which regarded as stricter credit guidelines for borrowers.
In the previous section: Borrower Rate, I found that it Seems like the Prosper Rating has better ability to measure loan risk than previous matrix of credit score from agency.
Here I will explore both of Credit Grade and Prosper Rating to see if they have differences.
According to Prosper’s Form S-1 filed in October 2007, the Credit Grade was based on Experian Scorex PLUS, which was provided by Experian— one of the leading credit reporting agencies in the United States. It ranges from a high of 900 to a low of 300, which a higher number is indicative of greater likelihood that the borrower will pay his or her credit obligations. Prosper’s seven letter credit grades correspond to seven Experian Scorex PLUS scores tiers, with more favorable credit grades assigned to the higher scores. (NC means ‘No Score’)
| Credit Grade | AA | A | B | C | D | E | HR |
|---|---|---|---|---|---|---|---|
| Scorex PLUS Credit Score | 760 and up | 720-759 | 680-719 | 640-679 | 600-639 | 560-599 | 520-559 |
| Source: Prosper Form S-1 |
The other matrix available after 2009, Prosper Rating, launched by Prosper’s own system in 2009, were measured in AA, A, B, C, D, E, HR from low risk to high as well. Following are the tables of the both rating. (The first is CreditGrade, the second is ProsperRating)
##
## A AA B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
##
## A AA B C D E HR
## 29084 14551 5372 15581 18345 14274 9795 6935
From the graphs above, the borrower credit graded before 2009 were primarily at C and D which were medium-to-high credit risk level but with fat tails on two sides. And after 2009, we can see the distribution presents a similar bell-shaped center at c but with much thinner tails on two sides. Seems like Prosper conduct more strict credit auditing causing Prosper Rating on the “AA” category as well as the “HR” decrease.
On average, borrowers credit level on Prosper was distributed somewhat like bell shape with most of borrower in medium risk level.
How about the ability of loan risk measuring before and after 2009?
Completed loans before 2009 primarily distributed at C or better with around 41% of all loans, while after 2009 it distributed more evenly with
rating C or better decreased to 38%, mostly decreased in AA and increased in D.
High risk loans totally decreased compared to loans before 2009, as we have shown in previous section, while loans rated in D and HR still increased. Besides, loans rated in C has decreased most from 7.04% before 2009 to 4.7% after 2009.
According above, I think the ability of Prosper Rating has improved on detecting the high risk loans. Or we can say it seems like Prosper conduct more strict credit auditing on Prosper Rating to reduce good loans to a more medium risk levels.
It leads me wonder what’s the factor improved in the Prosper Rating compared to the previous credit grade? If I can find some patterns, maybe it’s the key factor that associated with the different risk level of loans.
According to this page, Prosper Rating is determined by Estimated Loss Rates, and this Estimated Loss Rates is determined by two scores: 1) a custom Prosper Score and 2) Credit score from a consumer reporting agency (like Experian). So I will investigate more in Prosper Score and Credit score to see how they make Prosper Rating more accurate than Credit Grade.
According to Prosper website, Prosper Score was built using historical Prosper data to assess the risk of Prosper borrower listings. It ranges from 1 to 11, with 11 being the lowest risk, to 1 being the highest risk.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
Graph above shows Prosper Score has a bell-shaped distribution spiking on Score with 4,6,7, and fewer counts with scores in both low and high risk.
Group Prosper Score with each loan status, we can see the Prosper score distributed a left skewed shape in completed loan, which means completed loans primarily locate in good rating. However, Prosper score distributed a bell-shaped in high risk loan. Compared to Prosper Rating, seems like Prosper Score presents a less ability to detect the high risk loans.
Since Prosper Score is just one of component in Prosper Rating. To see the difference, I compared the Prosper Score graph to the previous Prosper Rating graph:
We can see that Prosper Score distributed a left-skewed shape, while Prosper Rating distributed more evenly. It seems like there are some more strict measuring standard that be added in the less-strict of Prosper Score and make Prosper Rating to be more strict .
Group with each status of loans:
Observe that the both distributions of completed loan, it seems like Prosper using more stricter of adjusting way causing some loans rated in 6-10(medium to good loans) of Prosper Score to move to “D” to “HR” of Prosper Rating(more bad rate).
And in high risk loan we can see after adjustment, the Prosper Rating presents more good ability in detecting the high risk loans with bell-shaped to left-skewed shape.
The other component of Prosper Rating is credit score from a reporting agency. In this data set, I think the variables related to this kind of scores were CreditScoreRangeLower and CreditScoreRangeUpper. I create a new variable CreditScoreAverage by averaging both of the two variables.
Let’s see the distribution of CreditScoreAverage. Basically, before 2009 Prosper does not allow individuals with an Experian Scorex PLUS credit score below 520 to post listings on the Platform. And after 2009, Prosper made the Credit Score have the minimum threshold up to 640, but in some cases they allowed scores minimum value to 600 if borrower previously completed a Prosper loan. So I divided the graph into two period and limit the min value of the score on x-axis to 510 and 630 to exclude the outliers of special cases.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
We can see the CreditScoreAverage distributed as right-skewed after 2009 with most of counts in 610 to 670, and a roughly bell shape before 2009 with most of counts in 670 to 710. We can see that the overall Average Credit Score after 2009 was apparently rated higher compared to loans before 2009.
Subseting the Average Credit Score with loan status in Completed and HighRisk:
From the graphs above:
We can see that the whole distribution CreditScoreAverage moves to right after 2009 because the required Credit Score has the minimum threshold from 520 to 640. That’s why it makes the Prosper Rating more strict than Prosper Score.
Compare to the distribution of Completed and HighRisk loans, we can see that they have nearly similar distributions with right-skewed shape in both two periods.
It turns out: If Prosper only uses Credit Score for auditing, under the condition of more strict assessing after 2009(higher threshold), the credit score of overall borrowers’ loans at that time will primarily located in high-risk tiers, even the loans would have the high probability to complete. However, since Prosper combined Prosper Score as well, it makes Prosper Rating present much better measuring ability and appear a much better discriminating between completed and high risk loans.
Let’s make a brief summary. After 2009, Prosper apply the Prosper Score to make credit measuring have more discrimination, under the condition of more strict assessing standard on bureau score threshold after 2009.
So what’s the major elements of Prosper Score? I flipped through Prosper annual report from 2010 and 2013, I found some information about Prosper Score:
Prosper Score is built to estimate the likelihood that a loan will go 61+ days past due. Unlike credit score obtained from a credit reporting agency is based on a much broader population, Prosper Score is based on a more precise picture from a smaller lending platform subset.
Interest. I infer that if Prosper just measure the borrower credit score by traditional bureau agency, in fact it is just a similar measuring way like lending from a bank or other officially lending institution. Prosper Score consider the borrower behaviors that is unique among platform population. Maybe such a custom credit score is more suitable for the lending platform market, because it is specifically measured by the Prosper borrower and applicant population. Because we know the lending platform offering borrower a additional platform when he/she can’t borrow from a bank, which measure credit score in more strict way. But such the lending platform spreads the risk across many investors. The measuring way may be different.
Hence, I search what’s the major elements of Prosper Score. I found some different sources that Prosper Score was composed by different set of elements over time. Like the website or this one. I am not going explore all the related elements in order to avoid making the report too long. Instead, I will choose some variable I think important which also coved by these variable lists from these sources.
First start with InquiriesLast6Months, It measures the number of credits inquiries of borrowers in the past six months at the time the credit profile was pulled.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 1.435 2.000 105.000 697
The distribution present right-skewed, most of the inquiries in past 6 months occur mostly occur ant 0 time with gradually fewer counts when past 6 months inquiries increase.
Subset InquiriesLast6Months into different loan status, it shows both have right-skew distribution, but the shape in high risk loan is flatter.
Next, let’s check out BankcardUtilization. It measures borrower’s total amount of credit limit that’s being used. The lower the ratio is, the more financial liquidity of one’s has.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
The maximum value of the ratio is reasonably at 1, it seems like there are some unreasonable outliers which ate over one, so I limit the x-axis to one on the graph.
From the graph above, most of the borrower have the ratio around 0.6 to 0.9, and with median and mean in 0.6, 0.56 respectively. It seems like the borrowers on Prosper are medium or heavy credit card user.
We can see the distribution of BankcardUtilization in completed loan is flatter, while in high risk loan it primarily concentrated in the high ratio area with increasing shape across the BankcardUtilization.
Now, the last one is DelinquenciesLast7Years, which measures the number of delinquencies in the past 7 years at the time the credit profile was pulled.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
We can see that at least 50% borrowers have no delinquency in the past 7 years. Borrower with delinquency were primarily at 1 to 3 times.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Transformed the long tail data to better understand the distribution with log10, it presents a bell-shaped distribution centered around 0.6 to 0.7. Seems like borrowers in the data set are medium to heavy credit card users.
The graph above shows that the distribution in each loan status have no significant difference with bell-shapes except that distribution in high risk loan is more flatter.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.00 1.00 1.42 2.00 8.00 91852
##
## 0 1 2 3 4 5 6 7 8
## 1 15538 4540 1447 417 104 29 8 1
There are 91,852 loan lists with no previous Prosper Loans record which accounted about 80.6% of all loans. In other words, there are about 20% of loan lists that have the previous Prosper Loans records. And in the loans with Prosper Loans records, most of the number of records are one time.
If we see “no previous Prosper Loan” and “has previous Prosper Loan” as two groups, we can see in “has previous Prosper Loan” group, the completed loan has slightly higher percentage than “no previous Prosper Loan” group.
Cool, I think that’s why Prosper takes the previous Prosper Loan record into consideration.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1627 2930 4127 23451 91852
## [1] 16142 84
There are 91,852 of NA values, corresponding the number of NA in Total Prosper Loans. In the loans with Prosper Loan records, above table shows there are still about 16,142 loans with outstanding Prosper Principal which are not completed.
Putting all things together, we can see that borrowers has previous Prosper Loan records but without outstanding Prosper Principal are having the lowest proportion of high risk loans around 30.32%.
The interesting things is, borrowers has previous Prosper Loan record as well as outstanding Prosper Principle are having lower proportion of high risk loans than borrowers with no previous Prosper Loan Record.
This matrix measures borrower’s credit score changes at the time the credit profile was pulled. This will be the change relative to the borrower’s last Prosper loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -209.00 -35.00 -3.00 -3.22 25.00 286.00 95009
We can see that the distributions appear a bell shape distribution, with mean around -3 and range around 500.
The graph above shows that the distribution in each loan status have no significant difference with bell-shapes.
Here I also choose the variable which I think important corresponding to the element of Prosper Score.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.167 5.583 8.006 11.417 62.917 7625
The duration of borrower’s employment is primarily at 0-10 years, the duration mean is about 8 years and the median is about 5.6 years.
The distribution in each loan status has the similar shape.
The borrowers’ income range were between $25,000 to $75,000 . Besides, the number of the higher income range (with 75,000+) is even more than the lower income range borrower (bellow 25,000). It shows that people use Prosper platform are primary middle class or some rich.
We can see the distribution in bad loan also has the similar patterns with completed loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
Transformed the long tail data to better understand the distribution of stated monthly income.
Above graph shows that the distribution of stated monthly income appears a normal shape centered around 7,500.
## sub_loan$CompletedOrRisk: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2917 4417 5330 6583 618548
## --------------------------------------------------------
## sub_loan$CompletedOrRisk: HighRisk
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3750 4550 5667 208333
These distributions appear similar, while their mean stated incomes have difference: the mean of borrower income with completed loan is 5,330, and the mean of borrower income with high risk loan is 4,550.
There are 113,979 loan lists in the data set with 81 features from 2005 Nov to 2014 Mar. These 81 variables can be classified into four categories: loan status, borrower data, loan data, credit risk matrices.
For my questions of interest, the main features are as follows:
Loan Status: A loan is completed or high risk.
Credit Risk Matrix: Credit Grade, Prosper Rating, Prosper Score, Credit Score Range Lower and Credit Score Range Upper.
The main credit rating applied on Prosper before July 2009 was Credit Grade, which was based on the borrower’s credit score from an independent credit reporting agency.
After July 2009 Prosper used Prosper Rating, which had two major components: Prosper Score and Credit score from a consumer reporting agency.
I am interest in What risk factors are connected to high risk loan or completed loan? And how Prosper improved their credit risk matrix? Features in Credit Risk Matrix listed above may the potential features associated with Loan Status.
Some related components about Prosper Rating and Prosper Score such as: Inquiries Last 6 Months, Bank Card Utilization, Delinquencies Last 7 Years, Debt To IncomeRatio, ProsperPrincipalOutstanding, Total Prosper Loans, ScorexChangeAtTimeOfListing.
The others are properties about borrower such as Employment Status, Length of Employment, and Stated Monthly Income.
CompletedOrRisk: I created CompletedOrRisk to simplify the original LoanStatus variable which contains up to 7 levels. The CompletedOrRisk contains Completed and HighRisk.
Credit Score Average: The variable is derived from Credit Score Range Lower and Credit Score Range Upper, which is equal to the average of Credit Score Range Lower and Credit Score Range Upper. The variable is measured as a component of Prosper Rating.
I noticed both of completed loan and high risk loan have unusual peak of Borrower Interest Rate around 32%. So I classified the Borrower Interest Rate across each year and I found the peak around 32% primary occurred in 2012 and they were primary rated as “HR” in Prosper Rating.
I log-transformed the right skewed Debt To IncomeRatio and Stated Monthly Income distributions. The transformed distribution for both are appearing bell-shape and center around 0.6 to 0.7 and 7,500 respectively.
In the previous section, we found Prosper Rating seems conduct a better credit risk measuring way for Prosepr lending platform. Prosper Rating is composed of Prosper Score and Credit Bureau Score, and the Prosper Score plays a key role to make the lending platform rating have more discrimination compared to Credit Bureau Score itself.
In this section, I will investigate the relationship between Prosper Score with other features to see what’s the factor are emphasized in Prosper Score.
First, let’s investigate the relationship between Credit Grade, Prosper Rating and each status of loan. I want to check again if Prosper Rating and Credit Score can detect both of the bad and good loans.
We can see the percentage of high risk loan appears an inverse relationship with both of Prosper Rating and Credit Grade. The lower percentage of high risk loan, the better the rating is. Besides, we can see that the whole high risk loan actually decrease after Prosper Rating was launched.
Let’s dive into components of Prosper Rating. First, look at CreditScoreAverage and ProsperRating.
Although ProsperRating is category variable, I calculate the mean of CreditScoreAverage across each Prosper Rating and then drew a line to see more clearer relationship. We can see the distribution appears a concave-up incremental curve. And If we just look at the jittered scatter plot, the relationship between CreditScoreAverage and ProsperRating does not very clear, the variance of CreditScoreAverage in each Prosper Rating is kind of large.
Investigate ProsperRating and ProsperScore above, the distribution appears a slightly positive shape, and the variance of Prosper Score in each Prosper Rating is more concentrated.
Comparing ProsperScore to Credit Score Average, we can see the distribution appears a concave-up incremental curve as well, while the variance of CreditScoreAverage in each Prosper Score is kind of large.
Putting all three together, we can see Credit Score Average variated widely across each Prosper Ranting and Prosper Score, although Credit Score Average has incremental threshold as Prosper Score increase overall. But I think Prosper put more weights of Prosper Score than Credit Score Average on their own Prosper Rating model.
Before any deeply exploration of relationship between two variables, I want to start with a correlation table between the variables from previous sections with corrplot to get a quick overview.
Basically, the borrower rate tends to have negative relationship with Credit Score Average, Prosper Score Prosper Rating. The more one borrower rate, then the higher the credit risk is overall. And we can see Prosper Rating have strong negative relationship with borrower rate, it makes sense because borrower rate is determined by Prosper Rating. In the graph bellow it shows the mean and median of Borrower Rate appears a inverse relationship with Prosper Rating.
Besides, from the matrix table we can see Prosper Rating have strong relationship with Prosper Score than Credit Score Average. It shows again that maybe Prosper put more weight on Prosper Score than Credit Score Average to model of Prosper Rating.
Furthermore, Prosper Score shows week relationship with Credit Score Average, Inquiries Last 6 Months, BankcardUtilization and log_DebtToIncomeRatio.
On the next stage, I will explore the relationship between Prosper Score with other feature factors.
The trend between Prosper Score and number of inquiries last 6 months presents long tail, so I transformed the InquiriesLast6Months with log10, and we can see in the boxplot the mean of InquiriesLast6Months presents a more linear-inverse relationship with Prosper Score. In the scatter plot we can see that the relationship is not clear, most of the loans located in the Score of 4-8 with InquiriesLast6Months at 1-4 times, but the trend is still present with the decreasing upper bond of InquiriesLast6Months as Prosper Score increase.
After transforming the BankcardUtilization with square root in y-axis, we can see the trend is clear. The distribution of BankcardUtilization move toward to down as Prosper Score increases, with around 0.5 to 0.8 in Score 1-6 and spread out down to 0.25 as Prosper Score increases.
The trend between Prosper Score and delinquencies last 7 years is not very clear across Score 1 to 7. But we can see over 50% of the bad tier Prosper Score loans have the largest mean and variance of delinquencies. And most of the good Score tier loans are mainly at 0 to 2 delinquencies.
Here I transform DebtToIncomeRatio with log10. We can see there presents a trend that DebtToIncomeRatio decreases as Prosper Score increase.
We can see that Prosper Score doesn’t have clear relationship with TotalProsperLoans. The distribution appears a parabola shape.
If we focus on Total Prosper Loans over than zero:
We can see that there are still no clear relationship, but the number of TotalProsperLoans significant increase in the good Prosper Score tier(10 and 11).
We can see that Prosper Score has no clear relationship with ProsperPrincipalOutstanding. We can see that over 50% of borrowers who rated as Score 1 and 2 (bad tier) have previous Prosper principal outstanding.
If we exclude zero value of ProsperPrincipalOutstanding:
The result shows that there is no significant difference of ProsperPrincipalOutstanding across Prosper Score in 1 to 10, but the interesting thing is the Prosper Score of 11 have the highest ProsperPrincipalOutstanding mean and median, and over 50% of borrower with
Prosper Score 11 have ProsperPrincipalOutstanding roughly between 5000 to 10000. However, in the scatter plot, we can see the sample of loans which has ProsperPrincipalOutstanding with Score 11 are very small.
The trend between Prosper Score and ScorexChangeAtTimeOfListing is clear, with increasing trend as Score increases.
We can see that Prosper Score doesn’t have clear relationship with EmploymentStatusDuration in years.
We can see most of lower level of Prosper Score in 1 to 6 are primarily located on $25,000 to 75,000 of income range. And most of higher level of Prosper Score in 8 to 10 are primarily located on $75,000+ area, but the variance is large.
We can see the trend between Prosper Score and StatedMonthlyIncome is clear, with increasing trend as Score increases.
There is a clear inverse relationship between Prosper Rating and proportion of high risk loan. Better levels of Prosper Rating tend to occur more often at lower percentage of high risk loans.
And Prosper Rating tend to have more stronger relationship with Prosper Score than Credit Average Score in the respect of linear relationship. The variance of CreditScoreAverage in each Prosper Rating is larger than Prosper Score across Prosper Rating.
Prosper Score tend to have inverse trend with InquiriesLast6Months, BankcardUtilization, DebtToIncomeRatio. The better levels of Prosper Score, the less levels of each factor.
Prosper Score tend to have positive trend with ScorexChangeAtTimeOfListing, IncomeRange and StatedMonthlyIncome. The more better levels of Prosper Score, the more levels of each factor.
I was impressed that Scorex Change At Time Of Listing has very clear trend with Prosper Score since ScorexChangeAtTimeOfListing only takes account the change of Credit Score relative to the borrower’s last Prosper loan, not a longer period of Credit Score.
The Prosper Rating is positively and strongly correlated with Borrower Rate since Prosper Rating determines the Borrower Rate on the Prosper model. In matrix table and scatter plot, they show Prosper Rating have strong relationship with Prosper Score than Credit Score Average. So I look closer at the relationship between Prosepr Score and other factors, I found BankCardUtilization(-) and DebtToIncomeRatio(-) have strongest inverse trend with Prosper Score. ScorexChangeAtTimeOfListing and StatedMonthlyIncome have strongest positive trend with Prosper Score.
The other factor has more weak inverse relationship with Prosper Score includes InquiriesLast6Months.
In previous section we found some patterns between Prosper Score and risk matrices as well as borrower factors. In this section I will add the factor of loan status to do more wider exploration.
First start with relationship between Prosper Score, Prosper Rating and Loan Status. On the upper graph, we can see there is a clear color spectrum in the Prosper Score & Prosper Rating graph. It seems like the distributions of high risk/completed loans in Prosper Rating and Prosper Score are consistent. High risk loans tend to locate in the lower-left area which are the areas of low-tier Rating and Score. Completed loans tend to locate in the upper side area which stands for the good rating loans.
Look closer at the second graph, we can see that the distributions of completed loans are wide across ProsperRating holding Prosper Score constant. While high risk loans distribute more concentrate along with a diagonal. Seems like the measurement is more consistent in the high risk loans.
Since Prosper Rating correlated strongly with Borrower Rate that were seen earlier, I want to explore the trend between Borrower Rate vs. Prosper Score vs. LoanStatus.
We can see that there is a clear trend between Prosper Score, Borrower Rate and Loan Status. Borrower rate decreases as Prosper Score increases, and the scope of borrower rate moves toward down as Prosper Score increases. Holding Prosper Score constant, from the color spectrum, we can see high risk loans tend to have higher borrower rate and distributed in the 1-6 of Prosper Score. The plot below shows that completed loans tend to locate at bottom right side with Score of 6-10, but there are still have some loans have highest rate across Score 1-6. And we can see most of the high risk loans are the loans with highest rate and with Score 1-6. We can see that the distribution is more variated in Completed loans than high risk loans across each Prosper Score.
Again, we see that Completed loans have the high variated borrower rate holding Prosper Score constant, while most of high risk loans are with Prosper Score 1-6 and with borrower rate around 0.3, which implies that they are more consistent between Prosper Score and Prosper Rating than completed loan.
Again, we can see that high risk loans tend to concentrate on the lower Prosper Rating, lower Prosper Score and lower borrower rate area, while completed loans are located more discrete.
Let’s take a closer look at relationship of some factors between completed loans and high risk loans.
On higher tier of Prosper Score, high risk loans tend to have higher Inquiries in last 6 months holding Prosper Score constant. But the trend is not very clear on the low-tier Prosper Score.
We can see that the trend is very clear. High risk loans tend to have lower Prosper Score of 1 to 5 with higher BankcardUtilization of 0.5 to 1. Completed loans tend to have higher Prosper Score of 8 to 10 with more border range of BankcardUtilization of 0.25 to 1.
Still, we can see the distribution of completed loans and high risk loans in the respect of Prosper Score VS. DelinquenciesLast7Years are similar. Most of the both loans are with Score 4-8 and with lower level of DelinquenciesLast7Years.
We can see there is a clear trend between Prosper Score and Debt to Income Ratio in the respect to status of loan. High risk loans tend to have lower Prosper Score at 1 to 6 and with 0.5 to 1 of Debt to Income Ratio. Completed loans tend to have higher level of Prosper Score at 6-10 as well as have more variated Debt to Income Ratio ranged 0.1 to 0.6.
We can see high risk loans tend to have lower Prosper Score around 1 to 4; completed loans tend to have score around 5 to 10. But the times of TotalProsperLoans has no difference between completed loans and high risk loans.
We can see high risk loans tend to have lower Prosper Score around 1 to 5; completed loans tend to have score around 5 to 10. And the times of TotalProsperLoans has some trend between completed loans and high risk loans. High risk loans tend to have higher ProsperPrincipalOutstanding and completed loans tend to have lower ProsperPrincipalOutstanding holding Prosper Score constant.
We can see that the trend is very clear. High risk loans tend to have lower Prosper Score of 1 to 6 and lower ScorexChangeAtTimeOfListing of -100 to 50. Completed loans tend to have higher Prosper Score of 6 to 10 and higher ScorexChangeAtTimeOfListing of -50 to 100.
We can see that the trend is clear. High risk loans tend to have lower Prosper Score of 1 to 6 with lower IncomeRange of $0 to $50,000. Completed loans tend to have higher Prosper Score of 6 to 10 and higher IncomeRange of $50,000+.
We can see that the trend is clear. High risk loans tend to have lower Prosper Score of 1 to 6 with lower StatedMonthlyIncome of $0 to $5,000. Completed loans tend to have higher Prosper Score of 6 to 10 and higher IncomeRange of $5,000 to $15,000.
Since DebtToIncomeRatio, IncomeRange and StatedMonthlyIncome have the similar property and distribution, I will pick DebtToIncomeRatio for further investigation.
These plots which have trends suggest that we can build a Logistic model and use those variables in the model to predict the probability of one loan would be completed or high risk.
I create a target variable called Status which has two values. “1” for HighRisk in LoanStatus, the other is “0” for Completed.
I also choose 4 features which I found they have trends with Status of loans from previous section, and make those 4 as the input variables of the model. I also replace the NAs with median. These 4 features are BankcardUtilization, DebtToIncomeRatio, ProsperPrincipalOutstanding, ScorexChangeAtTimeOfListing.
First let’s check out the NA values and fill them with median.
## Status BankcardUtilization
## 0 7604
## DebtToIncomeRatio ProsperPrincipalOutstanding
## 4456 0
## ScorexChangeAtTimeOfListing
## 0
##
## Call:
## glm(formula = Status ~ ., family = binomial(link = "logit"),
## data = ds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7923 -0.9097 -0.8484 1.4265 2.0566
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.487e-01 1.806e-02 -52.521 <2e-16 ***
## BankcardUtilization 3.896e-01 2.762e-02 14.106 <2e-16 ***
## DebtToIncomeRatio 1.261e-01 1.234e-02 10.221 <2e-16 ***
## ProsperPrincipalOutstanding 1.870e-06 5.088e-06 0.368 0.713
## ScorexChangeAtTimeOfListing -5.348e-03 4.241e-04 -12.609 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72962 on 57360 degrees of freedom
## Residual deviance: 72434 on 57356 degrees of freedom
## AIC: 72444
##
## Number of Fisher Scoring iterations: 4
We can see three of four variables are statistically significant. The negative coefficient for ScorexChangeAtTimeOfListing predictor suggests that all other variables being equal, the higher ScorexChangeAtTimeOfListing of loans are less likely to be high risk. The positive coefficients for other predictors suggest that the higher BankcardUtilization and DebtToIncomeRatio of loans are more likely to be high risk.
Further transform them for interpretation of odds:
## (Intercept) BankcardUtilization
## 0.3872305 1.4764469
## DebtToIncomeRatio ProsperPrincipalOutstanding
## 1.1344478 1.0000019
## ScorexChangeAtTimeOfListing
## 0.9946667
## (Intercept) BankcardUtilization
## -6.127695e+01 4.764469e+01
## DebtToIncomeRatio ProsperPrincipalOutstanding
## 1.344478e+01 1.870164e-04
## ScorexChangeAtTimeOfListing
## -5.333273e-01
Holding other factors constant, for a point increase in BankcardUtilization, we expect to see a 4.76% increase in the odds of being a high risk loan.
Holding other factors constant, for a point increase in DebtToIncomeRatio, we expect to see a 1.34% increase in the odds of being a high risk loan.
Holding other factors constant, for a point increase in ScorexChangeAtTimeOfListing, we expect to see a 5.33% decrease in the odds of being a high risk loan.
I want to see will CreditScoreAverage have the predict ability? So I take CreditScoreAverage be the only one feature of the model:
## Status CreditScoreAverage
## 0 591
##
## Call:
## glm(formula = Status ~ ., family = binomial(link = "logit"),
## data = ds2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7408 -0.9179 -0.7329 1.2846 2.0511
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7952916 0.0854498 44.41 <2e-16 ***
## CreditScoreAverage -0.0066346 0.0001262 -52.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72962 on 57360 degrees of freedom
## Residual deviance: 69819 on 57359 degrees of freedom
## AIC: 69823
##
## Number of Fisher Scoring iterations: 4
## (Intercept) CreditScoreAverage
## 44.4912092 0.9933874
## (Intercept) CreditScoreAverage
## 4349.1209220 -0.6612611
CreditScoreAverage, we expect to see a 0.66% decrease in the odds of being a high risk loan.Let’s check out the ProsperScore. Take ProsperScore be the only one feature of the model:
## Status ProsperScore
## 0 29084
##
## Call:
## glm(formula = Status ~ ., family = binomial(link = "logit"),
## data = ds3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1375 -0.8977 -0.6991 1.3446 2.3429
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.09473 0.07823 -1.211 0.225965
## ProsperScore2 -0.12526 0.09327 -1.343 0.179264
## ProsperScore3 -0.29761 0.09155 -3.251 0.001151 **
## ProsperScore4 -0.29021 0.08632 -3.362 0.000774 ***
## ProsperScore5 -0.40735 0.08548 -4.766 1.88e-06 ***
## ProsperScore6 -0.60611 0.08468 -7.158 8.20e-13 ***
## ProsperScore7 -0.75552 0.08677 -8.708 < 2e-16 ***
## ProsperScore8 -1.18973 0.08574 -13.877 < 2e-16 ***
## ProsperScore9 -1.75075 0.09324 -18.777 < 2e-16 ***
## ProsperScore10 -2.58353 0.12619 -20.473 < 2e-16 ***
## ProsperScore11 -2.33302 0.60734 -3.841 0.000122 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34419 on 28276 degrees of freedom
## Residual deviance: 32571 on 28266 degrees of freedom
## AIC: 32593
##
## Number of Fisher Scoring iterations: 5
## (Intercept) ProsperScore2 ProsperScore3 ProsperScore4 ProsperScore5
## 0.90962099 0.88226398 0.74259344 0.74810904 0.66541442
## ProsperScore6 ProsperScore7 ProsperScore8 ProsperScore9 ProsperScore10
## 0.54546885 0.46976401 0.30430256 0.17364392 0.07550733
## ProsperScore11
## 0.09700226
## (Intercept) ProsperScore2 ProsperScore3 ProsperScore4 ProsperScore5
## -9.037901 -11.773602 -25.740656 -25.189096 -33.458558
## ProsperScore6 ProsperScore7 ProsperScore8 ProsperScore9 ProsperScore10
## -45.453115 -53.023599 -69.569744 -82.635608 -92.449267
## ProsperScore11
## -90.299774
We can see for a one level of Prosper Score increase, we expect to see the probability of being a high risk loan decrease holding other factors constant.
Let’s plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier.
First start with the model 1 which with 4 significant variables:
## [1] 0.555076
The AUC ratio for this Logistic Regression Model is 0.6345.
And for the model 2 only with CreditScoreAverage to predict the loan status:
## [1] 0.6355906
The AUC ratio for this Logistic Regression Model is 0.6355.
The last is for the model 3 only with ProsperScore to predict the loan status:
## [1] 0.6488394
The AUC ratio for this Logistic Regression Model is 0.6488.
In this section, I investigate the relationship between status of loans and each related factors across each Prosepr Score. Most of the time, there is a clear trend between completed loans and high risk loan across Prosper Score. High risk loans primarily located at lower tier of Prosper Score; and Completed loans primarily located at higher tier of Prosepr Score holding other factor constant.
Holding Prosper Score constant, BankCardUtilization, DebtToIncomeRatio,ProsperPrincipalOutstanding, ScorexChangeAtTimeOfListing, IncomeRange and StatedMonthlyIncome has clear trend with Status of loans. From the plot sections, we can see high risk loans tend to have higher BankCardUtilization, DebtToIncomeRatio and ProsperPrincipalOutstanding, and have lower ScorexChangeAtTimeOfListing, IncomeRange and StatedMonthlyIncome.
These results suggest that I can build a Logistic regression model and use those variables in the model to predict the probability of one loan would be completed or high risk. The results of the model are summarized below.
The most surprising pattern I found is Completed loans always have high BankCardUtilization across each Prosper Score. Generally, I think higher tier of Prosper Score like 8 to 10 would have lower range of BankCardUtilization, but in these plots it shows loans which are higher tier of Prosper Score are also dominating the higher level of BankCardUtilization.
Yes, I created 3 Logistic Regression model with the dataset. We can see all three models have some predictive power for loan quality, but the power is weak. And we can see these three models have the similar predictive powers with AUC ratio of 0.55 to 0.64. The lowest AUC is of the model one which contains the four features I picked; and the other two model with AUC around 0.64 are with the features Prosper Score and CreditScoreAverage. Seems like directly using these two risk metrices is enough than using other four metrices at a time to predict a loan be high risk or not.
I think there are still some rooms to improve these abilities.
The percentage of High Risk loan appears an inverse relationship with both of Prosper Rating and Credit Grade. The lower percentage of High Risk loan, the better the Prosper Rating is. And we can see that the whole high risk loan actually decrease after Prosper Rating was launched.
The trend between ProsperRating and ProsperScore appear a slightly positive shape, and the variance of Prosper Score in each Prosper Rating is more concentrated. Compared to CreditScoreAverage, the variance of CreditScoreAverage in each Prosper Rating is kind of large. Seems like Prosper put more weights of Prosper Score than Credit Score Average on their own Prosper Rating model.
I pick four features that look like having trends between these features and Prosper Score and Loan Status from the graphs above. Holding Prosper Score constant, high risk loans tend to have lower ScorexChangeAtTimeOfListing and higher Bank Card Utilization, DebtToInconeRatio and ProsperPrincipalOutstanding. It shows that I can construct a Logistic Regression to predit the likelihood of a loan would be high risk or completed by using Status(1 for high risk and 0 for completed loan) as target variable and these 4 feature as the predictor variables.
This Prosper loan data set roughly contains four kinds of loan data information — Loan Status, Borrower Data, Loan Information Data and Credit Risk Matrices during 2005 to 2014. I started by understanding some stories about Prosper and exploring the individual variables related to my initial question of interest — What factors are connected to defaulted/completed loan? And then I found Prosper have improve their credit measurement system by Prosper Rating from only using Credit Score before. And I continued to make observations on plots, then I make assumption that maybe Prosper put more weight of Prosper Prosper Score than Credit Score in their Prosper Rating. I kept exploring features related to Prosper Score and relationship between Prosper Score and these features and Loan Status. Eventually, I pick features that I observed that having some trends with Loan Status and Prosper Score. I created a Logistic model to predict the likelihood of a loan would be high risk or completed using these features.
I found Bank Card Utilization and Debt To Income Ratio have strongest inverse trend with Prosper Score. ScoreX Change At Time Of Listing and Stated Monthly Income have strongest positive trend with Prosper Score. And after investigating the relationship between status of loans and each related factor across each Prosepr Score, I found high risk loans tend to have higher BankCardUtilization, DebtToIncomeRatio, ProsperPrincipalOutstanding, and have lower ScorexChangeAtTimeOfListing, IncomeRange and StatedMonthlyIncome, holding Prosper Score constant.
Finally, I picked all these features above except IncomeRange and StatedMonthlyIncome to build a Logistic Regression model to predict likelihood of one loan would be completed or high risk. And I also built two other models: one only contains CreditScoreAverage as predictor variable; the other use Prosper Score as predictor variable. These three models present some predictive power for loan quality, but the power is weak. For model with four features, BankCardUtilization, DebtToIncomeRatio and ScorexChangeAtTimeOfListing are the significant variables. The results suggest that they have association of higher Bank Card Utilization, higher Debt To Income Ratio, and lower Scorex Change of the borrower with the probability of being high risk. The other two models only consider the predictor of credit matrices–CreditScoreAverage and Prosper Score, and the results suggest that both predictors are significant and have higher AUC ratio than previous model. Seems like directly using these two risk metrices is enough than using other four metrices at a time to predict a loan be high risk or not.
I think there are still some of limitations that cause such a not good model:
Data duration: The duration of the data set is from 2006 to 2014, and it covers the duration of financial crisis, which may cause the quality of loans and borrower condition be unstable. And this may cause the predict model be more inaccurate if I want to evaluate the loan status of today’s market.
Data subset and cleaning: Since the data contains so many categories of data type like borrower quality or time series data, it is appropriate to conduct a data subset to have more precise data content. For example, we know there is a financial crisis in 2008 which affects the whole borrower condition and quality of loans, maybe it is more appropriate to subset this data set by post-crisis years. Besides, I also didn’t consider the outlier data in each feature to avoid irrelevant information, I just filled NA values with median in these three models.
Still, I was impressed that I can use such abundant information of loan platform data set and use R and RStudio to perform basic data exploration. Through the exploration process I do find some interesting pattern like the trend between loan status and ScorexChangeAtTimeOfListing. Unlike using Credit Score directly, it kind amazed me that using the change of Borrower’s credit score relative to the borrower’s last Prosper loan can get a not bad predict ability. It makes me think maybe using the data information about the platform loans has more reference value than directly using official credit data, if we want to evaluate a peer to peer loan market not a traditional borrowing channel.