Introduction
Peer-to-peer lending commonly abbreviated as P2PL is the practice of lending money to unrelated individuals, or “peers”, without going through a traditional financial intermediary such as a bank. The modern peer-to-peer lending industry in US started in February 2006 with the launch of Prosper [1].
Prosper has a transaction-based business model, in which the company collects revenue by taking a fee on its customers’ transactions. Borrowers who receive a loan pay an origination fee of 0.5-4.5% depending on the borrower’s Prosper Rating, and investors pay a 1% annual servicing fee. Prosper offers unsecured personal loans for anywhere from $2,000 to $35,000. Prosper’s loans are issued at fixed rates for terms of three or five years (36 or 60 months in this dataset).
In addition to credit scores, ratings, and histories, investors can consider borrowers’ personal loan descriptions, endorsements from friends, and community affiliations. Prosper handles the servicing of the loan and collects and distributes borrower payments and interest back to the loan investors [2].
I am new P2PL but it seems have the similar concept of Asia’s popular offline lending tradition between family and friends which has been around for centuries. How are people in the US using P2PL platforms? How reliable or risky to invest in P2PL? How to make better return? These really interest me. So let’s take a look of the loan data from Prosper.
The whole analysis is divided into 5 parts:
- Univariate Plotting & Analysis
- Bivariate Plotting & Analysis
- Multivariate Plotting & Analysis
- Final Plots and Summary
- Reflection
I’m using the Prospr loan dataset provided for the community. You can find variable definitions at [Prosper Loan Data - Variable Definitions], and how to ready Prosper listing at [How to Read a Loan Listing].
Univariate Plots Section
In the univariate section, I’ll focus more on data distribution, and the trend revealed by analyzing and comparing distribution. Most of the plots will be histograms or frequency polygons. Please note that you will see many y-axes labeled with count in this section. If not otherwise specified, it means the number of listings under the range or category on x-axis. And if not specified, y-axes labeled with diversity means the proportion of the range/category on x-axis within its own group.
Listing Creation
Let’s use numbers to understand more how people used Prosper.
Listings Over Time
First I want to know the listing history and trend over time.
As you can see from the table, this dataset contains listings created from November 2005 to March 2014. It started slightly earlier than the official inception, but from the plot, the actual growth did start in early 2006. However, listings dropped suddenly in late 2008, and remained quiet for more than half of a year. What happened? After a quick search I found that, due to SEC’s regulation activities, Prosper halted its operation for 9 months [3] to obtain the securities registration. After that, Prosper resumed its growth but encountered a glitch at the end of 2012. It should relate to the software issue that for several days no payments were credited to investors [4]. Never underestimate the importance of system robustness!
You might ask, here I’m showing the listing creation history not loan originations, why the glitch related to investor features still showed up in the plot? To answer this, let’s see some key variables in the next section.
Summary of Listing Creation Date
| 2005-11-09 20:44:28 |
2008-09-19 10:02:14 |
2012-06-16 12:37:19 |
2011-07-09 08:30:35 |
2013-09-09 19:40:48 |
2014-03-10 12:20:53 |

First 5 Key Variables of Listings / Loans
This dataset has 113937 oberservations over 81 variables. Below I plotted some tables for the summaries of 5 key variables: \(ListingCreationDate\), \(LoanOriginationDate\), \(LoanOriginalAmount\), \(Term\) and \(LoanStatus\):
- No \(LoanOriginationDate\) has the value NA. So in this dataset only listings that had been funded and originated are listed.
- Loan amount ranged from 1000 to 35000.
- There were 3 term options (in months): 12, 36, 60.
- Many loans were already closed with one of the statuses: Cancelled, Completed, Defaulted and Chargedoff. Overall about 16~17% were delinquent, about 15% were lost.
Summary of ListingCreationDate, LoanOriginationDate and LoanOriginalAmount
| Min. :2005-11-09 20:44:28 |
Min. :2005-11-15 |
Min. : 1000 |
| 1st Qu.:2008-09-19 10:02:14 |
1st Qu.:2008-10-02 |
1st Qu.: 4000 |
| Median :2012-06-16 12:37:19 |
Median :2012-06-26 |
Median : 6500 |
| Mean :2011-07-09 08:30:35 |
Mean :2011-07-21 |
Mean : 8337 |
| 3rd Qu.:2013-09-09 19:40:48 |
3rd Qu.:2013-09-18 |
3rd Qu.:12000 |
| Max. :2014-03-10 12:20:53 |
Max. :2014-03-12 |
Max. :35000 |
Stats of Term
| 12 |
1614 |
1.42% |
| 36 |
87778 |
77.04% |
| 60 |
24545 |
21.54% |
Stats of LoanStatus
| Cancelled |
5 |
0.00% |
Ok |
| Current |
56576 |
49.66% |
Ok |
| FinalPaymentInProgress |
205 |
0.18% |
Ok |
| Completed |
38074 |
33.42% |
Ok |
| Past Due (1-15 days) |
806 |
0.71% |
PastDue |
| Past Due (16-30 days) |
265 |
0.23% |
PastDue |
| Past Due (31-60 days) |
363 |
0.32% |
PastDue |
| Past Due (61-90 days) |
313 |
0.27% |
PastDue |
| Past Due (91-120 days) |
304 |
0.27% |
PastDue |
| Past Due (>120 days) |
16 |
0.01% |
PastDue |
| Defaulted |
5018 |
4.40% |
Loss |
| Chargedoff |
11992 |
10.53% |
Loss |
Stats of LoanStatus by DelinquentStatus
| Loss |
17010 |
14.93% |
| Ok |
94860 |
83.26% |
| PastDue |
2067 |
1.81% |
I want to point out why now you don’t see 12-month term available on prosper.com. Because this option only lived from 2008 to 2013. 60-month term was most recently added in 2010. Overall you can see 36-month term has been the most popular as it has the highest count per year.
| 12 |
2008-04-28 |
2013-04-02 |
1614 |
1800 |
4.93 |
327.38 |
| 36 |
2005-11-10 |
2014-03-10 |
87778 |
3042 |
8.33 |
10537.58 |
| 60 |
2010-07-02 |
2014-03-10 |
24545 |
1347 |
3.69 |
6651.76 |
Listing Categories
Since Prosper offers low rates for personal loans, I wonder for what purposes did people borrow? Apparently, Prosper’s low rates was a no-brainer for people who bore debts of higher rates. Debt consolidation is more than 6 times higher than the next recognizable category, home improvement.

Loan Original Amounts
How much did borrowers request? From the plots below we can see there is a roughly right-skewed distribution, with a few exceptions. Those are: 4000, 5000, 10000, 15000, 20000, 25000 and 35000 (with a binwidth of 500). So did borrowers prefer rounded numbers? Let me drill down to see the loans under $11,000. Borrower did prefer 1000, 1500, 2000 and so on (with a binwidth of 250). Interesting!

Amounts by Categories
Knowing the distribution of categories and loan amounts, I’ also curious about how much was borrowed for each category. As shown in the total and average plots, debt consolidation is still leading. So much debt out there. Prosper knew there is this consolidation market.


Borrower Demographics
And who were borrowing? Let’s see some demographics in the following sections.
Borrower States
Borrowers came from 51 states (including DC). Maybe because Prosper is located in San Francisco, California has the leading number of borrowers. About double of Texas, New York or Florida.
From the map, these is no geographical proximity among the top 5 states. Although these 5 all have big cities, there should be other reasons for California to distinguish itself from others. Maybe FinTech development?


Borrower Income Statistics
How much did borrowers made? The verifiable income can be a repay indicator. The distribution of verifiable income seems usual. 0-income was mostly not verifiable.

Investors
After understanding our borrowers, let’s look into investors.
Investors per Loan
How did investors fund a loan with others? From the plots, there could be hundreds of investors chipped in for one loan but many loans were funded by only one investor.

Amount per Investor per Loan
And how much on average did investors invest in one loan? Here I’m using this formula:
\(AvgInvestAmount = LoanOriginalAmount * PercentFunded / Investors\)
Investors also preferred rounded numbers. But this could be caused by that borrowers liked rounded numbers and many loans were funded by only one investor. This will require more investigation.

Then I drill down to the tallest bin, $0 - $500. Prosper sets minimum investment unit to $25. And we can see investors did take this advantage to diversify the portfolios. But this pattern could be partly caused by auto-invest as picking loans to only invest $50, $75… can be time consuming. This will require more investigation too.

For my curiosity, has diversity improved on Prosper over the time? From the plot below it looks like investors started with low diversity, became careful and highly diversified (at around 2009 relaunch) and then stayed moderate. But again this could be partly caused by Prosper’s rules, features and etc.

Borrower Credibilities / Risks
To better know the risks of P2PL, let’s dig into more technical details such as credit scores, risk ratings and risk scores. As we know Prosper has undergone a relaunch in 2009. Besides SEC registration, a new risk underwriting model also went live at that time. Hence for most analyses onward, we’ll need to divide the dataset into 2 underwriting groups: Before July 2009 & Since July 2009. And from the numbers below we can tell about 1/4 of the listings were created with old underwriting model.
Stats of Loans, Grouped by Underwriting
| Before July 2009 |
29084 |
25.53% |
| Since July 2009 |
84853 |
74.47% |
FICO Credit Scores
First, FICO credit scores were pulled when the listing were created. What was FICO score distribution of the 2 different underwriting models? From the table and plot below we can see scores under 600 were no longer allowed on Prosper with the new underwriting model. Also the distribution moved to the better side.
Counts of Each Credit Score Range
| 880-899 |
10 |
17 |
| 860-879 |
90 |
122 |
| 840-859 |
169 |
398 |
| 820-839 |
367 |
1042 |
| 800-819 |
537 |
2107 |
| 780-799 |
919 |
3705 |
| 760-779 |
1353 |
5253 |
| 740-759 |
1396 |
7871 |
| 720-739 |
1889 |
11034 |
| 700-719 |
1861 |
13610 |
| 680-699 |
2473 |
14019 |
| 660-679 |
2233 |
14133 |
| 640-659 |
3350 |
8849 |
| 620-639 |
2519 |
1653 |
| 600-619 |
2562 |
1040 |
| 580-599 |
1125 |
NA |
| 560-579 |
1357 |
NA |
| 540-559 |
1474 |
NA |
| 520-539 |
1593 |
NA |
| 500-519 |
554 |
NA |
| 480-499 |
346 |
NA |
| 460-479 |
141 |
NA |
| 440-459 |
36 |
NA |
| 420-439 |
5 |
NA |
| 360-379 |
1 |
NA |
| 0-19 |
133 |
NA |
| NA |
591 |
NA |

Prosper Ratings & Scores
Every listing was assigned 1-2 scores to indicate risk level. Before 2009 relaunch, Prosper used \(CreditGrade\). How did it work? From the plot below it looks like quite a number of borrowers had AA and A grades, which contradicted to the credit score distribution, especially when there were more bad scores. And yes, people did complain about it [5].

Disregard the old model, let’s take a look of the new risk assessments after 2009 relaunch, Prosper Rating & Prosper Score [6]. Has them become more reasonable? Now the mechanism seems more conservative as these 2 do align better with the credit score distribution.

Prosper Rating is the final risk determination based on Credit Score, Prosper Score and some other factors [7]. So let’s see how Credit Score, Prosper Score and Rating are related to each other.
From the green plot, you can see many borrowers had credit scores ranged from mid-600s to mid-700s; fair-good scores. They were given Prosper Score 2 to 8 more often. And from purple plot,they were given Prosper Rating E to A. The best Prosper Score 11 was not granted to credit scores lower than 640, nor those higher than 860? Best Prosper Rating AA was granted to those who had credit scores higher than 680. From the blue plot of Prosper Rating and Prosper Score, you can see a better looking distribution where the majority have moved lower-risk-ish.
PS. These plots are bivariate but I want to have them closer to related topic.

Previous Prosper Loans
Previous records with Prosper are also officially used as risk indicators. Here the data is not separated by underwriting model. Counts of \(TotalProsperLoans\) show that many listings were created by new borrowers (the NA’s). This variable may not help too much either.
Counts of Total Prosper Loans
| Count |
1 |
15538 |
4540 |
1447 |
417 |
104 |
29 |
8 |
1 |
91852 |
113937 |
Investment Return
After credibility, let’s dig into anther very important perspective for the investors, the return estimation. Since it won’t make sense to continue with old underwriting model, from this point on I’ll mostly use the subset of listings that were created after 2009 relaunch.
Lender Yield & Estimated Return
Lender yield is equal to the interest rate on the loan less the servicing fee. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. How did they vary? Below you can see \(EstimatedReturn\) is more normally distributed, and \(LenderYield\) has a tall exception at around 0.3 (binwidth = 0.01). That seems related to high risk,but we know risk levels were more normally distributed after 2009 relaunch. Why this tall bin? I’ll do more analysis later.
Another thing is there were negative estimated return rates? People really funded those loans? Interesting. Something to be explored later too.

Estimated Return, Effective Yield and Loss
Let’s see how estimations are related. Again the formula:
\(LenderYield = BorrowerRate - 1\)
\(EstimatedReturn = EstimatedEffectiveYield - EstimatedLoss\)
So it looks like est yield and lender yield are more in sync. But the higher effective yield rates were offset by loss rate and probably wouldn’t happen. Looking at these 4 plots I think the base of estimation should be est loss. Let me keep this in mind and explore more later.

Actual Return (Rough)
After estimation, it would be interesting to take a quick look of the reality. I’m using this formula to calculate the actual return (investors are aware of not getting the optimistic yield):
\(ActualReturn = (LP_CustomerPayments + LP_ServiceFees + LP_CollectionFees) / (LoanOriginalAmount * PercentFunded) - 1\)
This is an interesting pattern. You can see there are 2 right-skewed distributions overlapped, with a clear cut at 0. Grouped it by loan status, and it’s proved to have been caused by this concern. But do notice that sometimes when the loans were defaulted or chargedoff, investors might still get positive return. And sometimes the loans were completed with loss.


Days to Fund Loans
Before ending univariate plotting, I want to check how quickly a loan was funded.
Since we just explored borrower background, behavior, credibility, risk and estimation. I’m now interested in knowing how did the variables influence investors’ decision making? Days taken to fund a loan can be an indicator, because prospective loans souled be funded faster. I’m using this formula to calculate the days needed to fund the loan:
\(FundingDays = LoanOriginationDate - ListingCreationDate\)
Th’s histogram shows that many loans were funded in 2 weeks. It’s a good sign that people did invest quickly.

Univariate Analysis
Structure of the dataset
This dataset has 113937 oberservations over 81 variables. These variable can be grouped into 7 caterogies:
- Listing info
- Borrower demographics
- Risk level
- Return estimation
- Loan status
- Prosper history (optional)
- Social profile
Each oberservation stands for an funded and originated listing. The first listing was created on 2005-11-10, and the last listing was created on 2014-03-10. There is this clear crater for Prosper’s 9-month halt in 2008-2009. And after 2009 relaunch new variables for risk level and return estimation were added, these are: \(EstimatedEffectiveYield\), \(EstimatedLoss\), \(EstimatedReturn\), \(ProsperRating (numeric)\), \(ProsperRating (Alpha)\) and \(ProsperScore\). So far I’ve converted and used factor, numeric, datetime variable types.
Main feature(s) of interest
I position myself as an investor and I want to know more about the borrowers and the patterns to avoid loss. So features of listing info, risk level, return estimation and loan status are more of my interest. These are (sample values from listings after 2009 relaunch):
Listing Info
- \(Term\): 12, 36, 60
- \(LoanOriginalAmount\): 1000 - 35000
- \(MonthlyLoanPayment\): 0 - 2251.51
Risk Level (low - high)
- \(ProsperRating\): AA, A, B, C, D, E, HR
- \(ProsperScore\): 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
- \(CreditScoreRange\): 880-899, 860-879, 840-859, 820-839, 800-819, 780-799, 760-779, 740-759, 720-739, 700-719, 680-699, 660-679, 640-659, 620-639, 600-619, 580-599, 560-579, 540-559, 520-539, 500-519, 480-499, 460-479, 440-459, 420-439, 360-379, 0-19 (combined, range lower - range higher)
Return Estimation
- \(LenderYield\): 0.03 - 0.34
- \(EstimatedEffectiveYield\): -0.1827 - 0.3199
- \(EstimatedLoss\): 0.0049 - 0.366
- \(EstimatedReturn\): -0.1827 - 0.2837
Loan Status
- \(LoanStatus\): Cancelled, Current, FinalPaymentInProgress, Completed, Past Due (1-15 days), Past Due (16-30 days), Past Due (31-60 days), Past Due (61-90 days), Past Due (91-120 days), Past Due (>120 days), Defaulted, Chargedoff
- \(LP_CustomerPayments\): -2.3499 - 3.73691610^{4}
- \(LP_NetPrincipalLoss\): -504.41 - 2.510^{4}
Other features in the dataset that will help support the investigation into the feature(s) of interest
Features related to repay ability such as \(StatedMonthlyIncome\). And some features relatead to credit score such as \(CurrentDelinquencies\), \(RevolvingCreditBalance\), \(BankcardUtilization\) and etc.
New variables createad from existing variables in the dataset
- To better understand how much investors liked to put into one loan I created \(AvgInvestAmount = LoanOriginalAmount * PercentFunded / Investors\).
- To compare actual retun with estimation I created \(ActualReturn = (LP_CustomerPayments + LP_ServiceFees + LP_CollectionFees) / (LoanOriginalAmount * PercentFunded) - 1\).
- To see how fast the loans were funded I created \(FundingDays = LoanOriginationDate - ListingCreationDate\).
- Also the implicit one for reporting convenience, \(CreditScoreRange\), which is a factor compound of \(CreditScoreRangeLower\) and \(CreditScoreRangeUpper\).
Bivariate Plots Section
After seeing data distribution, let’s move to see more of the relationship bwtween variables. I will use grouped grequency polygons, box plots or scatter plots. Please note that you will see y-axes labeled with density in this section. If not otherwise specified, it means the proportion of the range/category on x-axis within its own group.
Days to Fund Loans, Grouped
At the end of univariate section I plotted the histogram of FundingDays:
\(FundingDays = LoanOriginationDate - ListingCreationDate\)
And now I’m plotting to see how funding days varied with different variables. I’m using loans created after 2009 relaunch. But I’m making an exception for “Funding Days by Years” to use all the loans because it would be informative to see the trend over both underwriting models.
- Plot by years has the trend from moderate to slow to fast. Especially in 2013 and 2014, many loans were funded on the first day (span in between these 2 grey dash lines).
- By terms, no distinct variation.
- By Prosper Ratings, mid-range were funded slightly faster.
- By Prosper Scores, interestingly, both high and low risk were funded faster.
- By credit scores, also interestingly, high risk were funded faster.
- By lender yield, mid-range were funded faster.
- By return rates, also mid-range were funded faster.
- By monthly payment, the higher the payment, the slower to be funded.
It seems that estimated return rate, monthly payment, credit score, Prosper Score and Prosper Rating (ordered by variance) are the more important factors for investors.
PS. This section is rendered differently with shiny. In shiny you can use a dropdown to switch among plots. But shinyapps.io is not very stable. If you would like, you can enable shiny indicator in the R code and run this locally to see.








Monthly Loan Payment
So we know that funding days varied by monthly loan payment. Let’s look into it more.
Loan Amount, Payment and Income by Terms
Monthly payment is a math result of loan amount, rate and term. How are these related? From the box plots we can see higher loan amount on average will be repaid during a longer period (as expected). Majority monthly payment didn’t vary a lot, but 12-month had more outliers.

Stats of Box Plots
|
Min. : 1000 |
Min. : 0.0 |
|
1st Qu.: 4000 |
1st Qu.: 157.3 |
|
Median : 7500 |
Median : 251.9 |
|
Mean : 9083 |
Mean : 291.9 |
|
3rd Qu.:13500 |
3rd Qu.: 388.4 |
|
Max. :35000 |
Max. :2251.5 |
But how much out of the income did borrowers use to replay the loan? I used this formula to calculate the monthly loan payment to verified income ratio:
\(PaymentIncomeRatio = MonthlyLoanPayment / StatedMonthlyIncome\) (verifiable only)
Here you can’t see the mean of 12-month term; it’s beyond the upper plotting limit. So some 12-month borowers were aggressive to repay. But for the majority, 60-month borrowers took out slightly higher percentage from income to repay. It’s still not very clear here, but could the ratio differences lead to default?

Stats of Box Plots
|
Min. : 1000 |
Min. : 0.0 |
Min. : 0.000 |
|
1st Qu.: 4000 |
1st Qu.: 157.3 |
1st Qu.: 0.031 |
|
Median : 7500 |
Median : 251.9 |
Median : 0.052 |
|
Mean : 9083 |
Mean : 291.9 |
Mean : 0.099 |
|
3rd Qu.:13500 |
3rd Qu.: 388.4 |
3rd Qu.: 0.078 |
|
Max. :35000 |
Max. :2251.5 |
Max. :679.760 |
|
NA |
NA |
NA’s :7343 |
Loan Amount, Payment and Income by Closed Statuses
To answer if higher monthly payment to income ratio lead to default, I need to first extract closed loans from the dataset. There are 4 statuses considered as closed: Cancelled, Completed, Defaulted and Chargedoff. I’ll just use the later 3, and I’ll group Defaulted and Chargedoff into Defaulted as both had loss associated. So it resulted in 2 closed statuses: Completed and Defaulted.
Among closed loans created after 2009 relaunch, about 1/4 were defaulted. Please note that this number is higher than overall percentage shown in the very first section. It’s because I’m only considering closed loans here. Many current loans are excluded.
Status of Closed Loans, Since July 2009
| Completed |
19664 |
75.62% |
| Defaulted |
6341 |
24.38% |
Now I have the 2 closed statuses setup, let me do the box plots again by the closed statuses. As you can see, defaulted loans had slightly higher monthly payment to income ratio, but overall quite overlapped. Any other variables that could cause default?

Stats of Payment Income Ratio by Closed Status
| Completed |
0 |
0.0642511 |
0.0392282 |
0.0749999 |
192.31200 |
| Defaulted |
0 |
0.0763314 |
0.0490803 |
0.1620291 |
74.50586 |
Estimated Yield, Loss and Return
Obviously estimated return influenced investors’ decision making a lot, let’s see if we can find more insights from estimated loss, yield and return that were computed with Prosper’s proprietary algorithms.
Estimated Loss - Pairs
As mentioned earlier, I feel that \(EstimatedLoss\) should be the base estimation. And then rate, yield, return and etc were derived consequently. How was est loss computed? Let me do some pair plots to see how it is related to other variables.
Notice that I try to use numeric values here and most of the variables directly come from the dataset. Only \(PaymentIncomeRatio\) is calculated with this formula: \(PaymentIncomeRatio = MonthlyLoanPayment / StatedMonthlyIncome\) (verifiable only).
As expected, Prosper Rating is highly correlated with est loss, as they are hens and eggs. Borrower rate is very correlated with est loss; I hope this proves my thinking that rate was derived from est loss. 2 other more crrelated variables (other than risk levels) are \(LoanOriginalAmount\) and \(MonthlyLoanPayment\).




Were estimated loss based on loan amount / monthly payment?
Again the correlation coefficients:
\(LoanOriginalAmount\) vs. \(EstimatedLoss\): -0.4298904 \(MonthlyLoanPayment\) vs. \(EstimatedLoss\): -0.3287027
Besides that they are both negatively correlated to est loss, it’s also interesting to see these 2 observations:
- \(LoanOriginalAmount\) shows many vertical lines for the same amounts
- \(MonthlyLoanPayment\) follows coefficient linear formula when the payment goes higher
I’ll need to include more variables to better understand these patterns. I’ll do it in multivariate section soon.
PS. Notice that the brown line is linear smooth line.

Estimated Yield, Loss and Return Together
Remember Prosper’s formula: \(EstimatedReturn = EstimatedEffectiveYield - EstimatedLoss\). Let me plot to see the relationship.
When yield and loss were plotted together, 2 types of patterns were shown:
- Tiers from upper left to lower right
- Lnear-exponential lines located in first quadrant.
Notice this diagonal line stands for the edge that there is 0 estimated return. So those dots scatter over the upper left area actually mean negative estimated returns.
Despite the 2 patterns, these 2 variables have pretty high correlation score: \(EstimatedEffectiveYield\) vs. \(EstimatedLoss\) = 0.7981346 (I think est yield was derived from est loss with some model).
I’m more interesting in what’s going on to cause these 2 patterns. Let me explore more with multivariate plots, after the bivariate summary.
PS. Notice that the brown line is linear smooth line.
PS. Notice that the grey line is ab line that stands for equal value from x and y.

Bivariate Analysis
Relationships observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset.
All the estimated rates and borrower rates are highly correlated. In my opinion I think they should all be derived from est loss. Est loss is based on Prosper Rating and they are unsurprisingly highly correlated.
Prosper Rating is calculated base on Prosper Score and FICO Credit Score. These 3 are very correlated, so this means other factors didn’t weight as much in final risk assessment. But Prosper Rating is more correlated with Prosper Score than Credit Score. In the risk tile plots (heat maps) section when plotting Prosper Rating with Prosper Score, the overall risk looks lower than plotting with FICO Credit Score.
Monthly payment to verifiable income ratio seems correlated to default. But it’s not very distinguishable.
Interesting relationships between the other features.
At the very first of this section I plotted to see which features influenced investors’ decision making more. I want to learn from seasoned investors’ experiences too. Estimated return rate played a very important role. Although I think it’s derived from Prosper Rating but it caused more variance than Prosper Rating. But surprisingly borrowers with lower credit scores were funded faster.
Strongest relationship found.
It’s Estimated Loss vs. Prosper Rating vs. Borrower APR. But I think it’s because APR is a derivative of est loss and Prosper Rating. And est loss and Prosper Rating go hand-in-hand. Other than all those related rates, loan amount has stronger relationship with est loss, but negatively. Probably we should lend to those who borrow more.
Multivariate Plots Section
We have just identified some interesting patterns in bivariate section. Let’s add more variables to dissect. Here I will mostly scatter plots and heat maps, and some others to assist the analysis. Please note that you will see y-axes labeled with count in this section. If not otherwise specified, it means the number of listings of the range/category on x-axis.
The 2 Patterns
Continue with the last bivariate topic, let’s try to find out why there are 2 patterns in the Estimated Effective Yield vs. Estimated Loss scatter plot. What does this indicate?
Estimated Effective Yield vs. Estimated Loss, by Categorical Features
Prosper’s formula: \(EstimatedReturn = EstimatedEffectiveYield - EstimatedLoss\).
Here I’m trying to associate \(EstimatedEffectiveYield\) and \(EstimatedLoss\) with: Closed Status (Completed vs. Defaulted), Credit Score Ranges, Prosper Scores, Prosper Ratings, Terms, Years to see if anything is causing the patterns. For the plots below I used closed loans created after 2009 relaunch only. As I want to compare Closed Status and to maintain the same base for the rest.
Let me briefly review how I extracted closed loans from the dataset. There are 4 statuses considered as closed: Cancelled, Completed, Defaulted and Chargedoff. I’ll just use the later 3, and I’ll group Defaulted and Chargedoff into Defaulted as both had loss associated. So it resulted in 2 closed statuses: Completed and Defaulted.
Among closed loans created after 2009 relaunch, about 1/4 were defaulted. Please note that this number is higher than overall average shown in the very first section. It’s because I’m only considering closed loans here. Many current loans are excluded.
Status of Closed Loans, Since July 2009
| Completed |
19664 |
75.62% |
| Defaulted |
6341 |
24.38% |
From the plotd below:
- With Closed Status, it seems some defaulted loans happened along the right most curve.
- Some Credit Score Ranges are slightly along the curves too.
- Prosper Scores and
- even more Prosper Ratings cut the loss rates horizontally.
- When Terms were applied, it’s interesting to see the tierd pattern is mostly consist of 36-month loans.
- It’s clearly something was done specifically to 36-month term to made the estimated return rates look better (non-negative). And then I definitely needed to check with Years to see when was this done.
- Interestingly, there is a clear cut by Since 2011.
PS. This section is rendered differently with shiny. In shiny you can use a dropdown to switch among plots. But shinyapps.io is not very stable. If you would like, you can enable shiny indicator in the R code and run this locally to see.
PS. Notice that the brown line is linear smooth line.
PS. Notice that the grey line is ab line that stands for equal value from x and y.
Estimated Effective Yield vs. Estimated Loss
, Grouped by Selected Variable
, Closed Loans, Since July 2009







A quick sumaries. Amont the loans created after 2009 relaunch (closed and open), about 3/4 were created in 2011 or beyond. In 2011 the new algorithms were deployed to eliminate negative estimated returns for 36-month term.
Closed and Open Loans, Grouped by Estimation, Since July 2009
| Jul 2009 - 2010 |
19150 |
22.57% |
| 2011 - 2014 Mar |
65703 |
77.43% |
Estimated Loss, Revisited with 2011 Estimation Cohort
Now we’ve discovered the 2011 estimation update. Let’s focus on this cohort as this is what intestors are dealing with currently. How is this cohort doing? Let’s see some numbers with the loans (closed and open) of this 2011 cohort. And of course most 36-60-month loans are still open, as the dataset is cut off on 2014-03-12. Notice that 12-month term was no longer available after 2013-04-02, so most were already closed.
Loan Status Percentage, Closed and Open, Grouped by Term, 2011 - 2014 Mar
| Current |
5.95% |
81.14% |
83.78% |
| FinalPaymentInProgress |
0.96% |
0.29% |
0.16% |
| Completed |
88.10% |
11.55% |
8.77% |
| Past Due (1-15 days) |
0.29% |
1.05% |
1.01% |
| Past Due (16-30 days) |
0.19% |
0.34% |
0.36% |
| Past Due (31-60 days) |
0.10% |
0.48% |
0.50% |
| Past Due (61-90 days) |
NA |
0.41% |
0.43% |
| Past Due (91-120 days) |
0.10% |
0.37% |
0.43% |
| Past Due (>120 days) |
NA |
0.01% |
0.03% |
| Defaulted |
0.19% |
0.58% |
0.61% |
| Chargedoff |
4.13% |
3.78% |
3.92% |
| Total |
100.00% |
100.00% |
100.00% |
And let’s revisit loan amount / payment vs. estimated loss with 2011 estimation cohort. I only use the loans (all terms, closed only) to see if there is any specific pattern related to defaulted cases and to maintain a common base for all the plots.
- With Closed Status, it seems slightly more higher-risk loans defaulted.
- Prosper Ratings again cuts the loss rates horizontally. Notice that with this plot we may guess high risk borrowers tended to borrow less. This can explain why \(LoanOriginalAmount\) and \(EstimatedLoss\) have negative correlation score.
- When Terms were applied, it’s interesting that for similar monthly payment of same term, the est loss rate follow a linear pattern.
- With Years, since most of the loans are still open so there is not much to tell.
PS. This section is rendered differently with shiny. In shiny you can use a dropdown to switch among plots. But shinyapps.io is not very stable. If you would like, you can enable shiny indicator in the R code and run this locally to see.
PS. Notice that the brown line is linear smooth line.
Loan Amount / Monthly Payment vs. Estimated Loss,
Closed Loans, All Terms, Since 2011




Don’t Want to Lose
I think now we better undetand how est loss rates were correlated, and we discovered the different cohorts of Prosper Marketplace. Let’s see if there is anything we can do to furthur avoid loss.
The 2011 36-60 Cohort
To make the analysis more relative to current situation, I’ll only use closed loans created with 2011 estimation update. Closed loans have final status revealed for adjustment. And these loans are of 36-60-month terms because 12-month term is no longer available.
Let’s see how is the final status like for this cohort. There are 9675 loans in this cohort. From the tables below we can tell as of 2014-03-12, 60-month loans were more prone to default. And overall about 30% loans were defaulted. This is so far the highest default percentage. But it’s because open loans are excluded, and we are only comparing closed 36-60-month loans (less completed loans than 12-month).
Closed Status Statistics, Grouped by Term, 2011 - 2014 Mar
| 36 |
72.62% |
27.38% |
100.00% |
| 60 |
65.96% |
34.04% |
100.00% |
Closed Status Statistics, as Grid, 2011 - 2014 Mar
| 36 |
49.27% |
18.57% |
67.84% |
| 60 |
21.21% |
10.95% |
32.16% |
| Total |
70.48% |
29.52% |
100.00% |
Default Indicators
To discover any additional variable related to unexpected default, we need to first find a proper default indicator.
Loan First Defaulted Cycle
In the dataset there is \(LoanFirstDefaultedCycleNumber\). Also I created this \(DefaultedPoint = LoanFirstDefaultedCycleNumber / Term\) to apply common benchmark to see which one work better.
These 2 plots show similar distribution. But since only defaulted loans have the cycle values, only about 30% of 9675 loans can be used for analysis.

Actual Loss Rate
Another straightforward indicator can be actual loss rate. I use the formula:
\(ActualLoss = LP_NetPrincipalLoss / (LoanOriginalAmount * PercentFunded)\)
\(LossDiff = ActualLoss - EstimatedLoss\)
From the plots, both contain completed and defaulted cases and have distinguishable distribution. I think I’ll choose \(ActualLoss\) as it has a more parallel indication to \(EstimatedLoss\).

Actual Loss Rate - Pairs
So again I want to see how the actual loss correlate to other variables with pair plots. But unfortunately, there is hardly any correlated variable. But this didn’t disappoint me too much, because if Prosper already figured this out they could have updated their estimation models accordingly before me.



Actual Default Statistics
But still let’s do some more statistics so that furture investors can refer to. Again this will be based on 2011 cohort, closed loans, 36-60-month terms.
Credit Score, Prosper Score, Prosper Rating and Defaulted Rate
First let’s see how default loans distribute among risk levels. Here beause I’ll need to calculate average Prosper Ratings so I’m using the numeric values provided in the dataset. And the mapping is:
0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA.
In the first plot, the number in each cell stands for average Proper Rating in numeric format. The color code stands for percentage of loans that were actually defaulted within that bucket. So when you see a new listing, you can map the risk levels and check the actual default rate.
In the second plot, the number in each cell stands for the permille of all the loans (completed and defaulted) that fell into that bucket. So when you see a new listing, you can map to see how often this risk combination occurred. And how many other listings can be available for investment if you can accept the actual defaulted rate.


Loan Amount, Estimated Loss, Prosper Rating and Defaulted Rate
Earlier we know \(LoanOriginalAmount\) also played a key role in estimation. So let’s try again with \(LoanOriginalAmount\) and \(EstimatedLoss\).
In the first plot, the number in each cell stands for average Proper Rating in numeric format. The color code stands for percentage of loans that were actually defaulted within that bucket. So when you see a new listing, you can map amount and estimation, and check the actual default rate.
In the second plot, the number in each cell stands for the permille of all the loans (completed and defaulted) that fell into that bucket. So when you see a new listing, you can map to see how often this amount/estimation combination occurred. And how many other listings can be available for investment if you can accept the actual defaulted rate.
Again notice that at the bottom right corner, a lot of high risk borrowers defaulted on small amounts?


Material Events
So far we have discovered several material events that affected borrowers and investors. Let’s do a quick review with the overtime history to see how much they had affected the platform. It looks like the 2011 estimation update might have helped on the growth.
So my plotting is ending here. After multivariate analysis, I’ll pick 3 representative plots for the summary.

Multivariate Analysis
Relationships observed in this part of the investigation. Features that strengthened each other in terms of looking at feature(s) of interest.
One thing strengthened is the correlation between \(LoanOrigialAmount\) and \(EstimatedLoss\). From the pair plot in bivariate section we got this counterintuitive negative correlation: more loan, less risk. But with multivariate analysis I found that high risk crowd might tend to borrow less or they were just restricted. This kind of makes sense as credit card system does similar thing to new holders.
Interesting or surprising interactions between features.
Most interesting discovery would be the 2011 estimation update. By plotting \(EstimatedEffecticeYield\) and \(EstimatedLoss\) on scatter plots with other variables, the update to eliminate negative \(EstimatedReturn\) was revealed. That also makes sense. I doubt general public would like to invest in negative loans.
Final Plots and Summary
Plot One

Description One
How quickly a loan was funded can be an indicator of the general perception of the loan’s repay projection. Prospective loans souled be funded faster. I’m using this formula to calculate the days needed to fund the loan:
\(FundingDays = LoanOriginationDate - ListingCreationDate\)
This frequency polygon plot shows the over time trend. During Proper’s inception time many loans were funded in a week. Later it’s slowed down especially in 2009 when Prosper was undergone its halt and relaunch. After the relaunch the loans were funded faster each year. In the most recent 2 years (2013 and 2014), many loans were funded on the first day (bin between the 2 dash lines).
In the dataset it was not clearly stated how much of the investment was done manually or by Prosper’s “Quick Invest” (Automated Investing with Quick Invest) which was introduced in the 4th quarter of 2011. Nevertheless, this plot can still give investors the idea that nowadays they need to act very quick.
Plot Two
PS. Notice that the brown line is linear smooth line.
PS. Notice that the grey line is ab line that stands for equal value from x and y.

Description Two
This multivariate scatter plot is one of the plotting series to find the cause the 2 patterns of relationship of estimated effective yield and estimated loss which both were provided by Prosper:
- Tiers from upper left to lower right
- Lineat and exponential lines in first quadrant
I picked this one because it represents the most important turning point of the whole series and had helped me narrow down further in the analyses afterwards.
At first I thought the interlacing patterns could provide additional info on what loans had defaulted. But when I tried applying \(ListingYear\) as the third variable I found the substantial updates to make estimated return (est yield - est loss) positive numbers. Again the grey ab line stands for the points that x and y have the same value, and est return is zero. As you can see after 2011 the points scatter below the grey ab line. Intuitively I thought that removing negative est return numbers could have changes investors’ perception of the loans. More importantly investors should not compare really old etimations with current actual results. So in the next heat maps I’m using the loans created with the 2011 estimation updates to show investors the more up-to-date actual loss comparison to estimated loss. Which I thik will be more helpful.
I’m plotting more of the whole series below as a reference. One additional thing to mention is that the updates were majorly made for 36-month term.

Plot Three

Description Three
The heat map uses the loans created with the latest estimation updates (2011 cohort), which was discovered with previous plot series. Here are the key numbers:
Summary of Actual Defaulted Rate
| Max Defaulted Rate |
44.43% |
186.56 |
[2000,5300] |
(0.156,0.172] |
| Median Defaulted Rate |
16.67% |
5.58 |
(15200,18500] |
(0.0239,0.0404] |
| Median Defaulted Rate |
16.07% |
5.79 |
(18500,21800] |
(0.0239,0.0404] |
| Min Defaulted Rate |
0.00% |
1.24 |
(8600,11900] |
(0.139,0.156] |
| Min Defaulted Rate |
0.00% |
0.41 |
(28400,31700] |
(0.0239,0.0404] |
| Min Defaulted Rate |
0.00% |
0.31 |
(31700,35000] |
(0.0569,0.0734] |
| Min Defaulted Rate |
0.00% |
0.31 |
(31700,35000] |
(0.0239,0.0404] |
| Min Defaulted Rate |
0.00% |
0.21 |
(31700,35000] |
(0.0404,0.0569] |
| Min Defaulted Rate |
0.00% |
0.21 |
(25100,28400] |
(0.0239,0.0404] |
| Min Defaulted Rate |
0.00% |
0.10 |
(28400,31700] |
(0.0404,0.0569] |
| Min Defaulted Rate |
0.00% |
0.10 |
(25100,28400] |
(0.0404,0.0569] |
| Min Defaulted Rate |
0.00% |
0.10 |
(31700,35000] |
[0.0074,0.0239] |
Summary of Loan Count Permil
| Max Count Permil |
44.43% |
186.56 |
[2000,5300] |
(0.156,0.172] |
| Median Count Permil |
12.36% |
9.20 |
(8600,11900] |
[0.0074,0.0239] |
| Median Count Permil |
12.90% |
9.61 |
(11900,15200] |
[0.0074,0.0239] |
| Min Count Permil |
0.00% |
0.10 |
(28400,31700] |
(0.0404,0.0569] |
| Min Count Permil |
0.00% |
0.10 |
(25100,28400] |
(0.0404,0.0569] |
| Min Count Permil |
0.00% |
0.10 |
(31700,35000] |
[0.0074,0.0239] |
I’m using est loss at x-axis because \(EstimatedReturn = EstimatedEffectiveYield - EstimatedLoss\), est loss is the key to drag est return down and est loss and Prosper Rating are hand-in-hand. Without extra info other than what the dataset provided, no other indicators to defaulted loans were found (if Prosper knew they could have built a better feature or estimation model). But these heat map can still give future investors an overview of the most up-to-date reality.
The heat map is supported by the 2 tables that the most defaulted group is also the group most loans belong too. Investors probably should avoid loans in the group of est loss (0.156,0.172] and amount [2000,5300]. While this group borrowed little, they had been most likely to default. This heat map also roughly affirms the counterintuitive result that riskier loans were of smaller amount. Even for the same est loss bucket there is roughly a trend that when more amount was borrowed, it’s less likely to be defaulted.
Reflection
I started with exploring the distribution of individual variables. It’s a very good approach to understand how borrowers and investors are using this platform. Such as the discovery of debt consolidation as the leading category makes me to disregard some sayings about not to invest in this category. Debt consolidation is what Prosper mainly for. With the new underwriting model, risk leaves are more normally distributed. However, lender yield that is correlated to risk didn’t seem as even. This led to my exploration later in the multivariate analysis.
Also I was intrigued by the discovery of that California has the leading number of borrowers and started thinking about FinTech development. For business strategists, findings like these can help to better start a new venture with the right extreme users.
As I moved to bivariate analysis, I found it helpful in discovering trends and patterns. First I tried to see how existing investors decided what to invest by analyzing days to fund the loans. And interestingly I found that even estimated return should be derived from risk level, it still played a more important role. Based on these findings, I plotted more scatter plots to see the relationship among these estimation variables. And I discovered the interesting interlacing patterns from Yield-Loss plot. The very distinguishable patterns led to an important part of multivariate analysis.
At first I thought I would discover default insights from the interlacing Yield-Loss patterns. But after including variables that investors care about as the third concern I instead discovered Proper’s product updates. Not saying it did no help but via the process I got reminded to apply key variables to Amount-Loss plots and analyze on why higher loan amounts were associated with lower loss rates. Without additional data, I have yet to build the model to avoid loss. But with the current default status reference tables I created for future investors, I also strengthened the discovery of the relationship between loan amount and loss rate.
One thing I want to mention is that I needed to keep narrowing down the dataset to eliminate outdated or irrelevant information. Such as old underwriting model, 12-month term which is no longer available and etc. When I have more time and there is more data collected I may want to do the analysis to compare these cohorts, with addotional data like national demographics, economic events and etc. So that I can provide some suggestions to improve this P2PL model.
Overall I’d say it’s majorly an experience to analyze how people used the platform and how the platform evolved. I could see Prosper as a new FinTech company had evolved to improve the platform performance.
Regarding R and RStudio, I found it’s a really straightforward tool for quantitative analysis. I’d want to later allocate some time to re-organize the chunks for reading and executing efficiency. Also shiny package didn’t produce exactly the same static rendering as Knit HTML. And shinyapps.io could throw more errors than other environments. Maybe later I’ll try other interactive packages and deployments to improve the quality of the final report.
Thanks for checking. Please leave some comments! - Meilan Ou