This report explores a loan dataset that comes from Prosper which is a marketplace lending platform. More specifically, Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $7 billion in funded loans. Borrowers request personal loans on Prosper and investors (individual or institutional) can fund them. Investors can consider borrowers’ credit scores, ratings, and histories and the category of the loan. Prosper handles the servicing of the loan and collects and distributes borrower payments and interest back to the loan investors.
The data set contains almost 114,000 loans and 81 variables, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, the latest payment information among others. A full list of the variables can be found here.
Let’s firstly check the distribution of loan origination through the years.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 23 6213 11557 11263 2206 5530 11442 19556 35413 10734
## [1] "2005-11-09" "2014-03-10"
## [1] 113937
The dataset contains 113.937 loans from November 2005 till March 2014. Most loans were originated in 2013. If we zoom in a little bit in our diagram…
…we can notice that there is a gap in the data in late 2008. There is a clear explanation about this issue. In the wikipedia PROSPER article we are informed that “the SEC imposed a cease and desist order on Prosper” from the end of November 2008 till July 2009.
But what is the status of each loan in the dataset?
##
## Cancelled Chargedoff Completed
## 0.0 10.5 33.4
## Current Defaulted FinalPaymentInProgress
## 49.7 4.4 0.2
## Past Due (1-15 days) Past Due (>120 days) Past Due (16-30 days)
## 0.7 0.0 0.2
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0.3 0.3 0.3
Almost half the loans are still in progress and 1/3 of loans are completed.
Let’s examine the credit rating of loans when they went live. To measure the credit rating we can use the Credit Grade variable. This variable is populated only for listings before 2009 (before the SEC cease). The Credit Grade is expressed in an alphabetical grade scale where AA is the best possible credit rating for a loan taker and HR is the worst possible rating (NC means No Credit).
##
## A AA B C D E HR NC <NA>
## 3315 3509 4389 5649 5153 3289 3508 141 84984
Most loan takers before 2009 had a credit rating of C and D. Note that we removed NA’s from the graph because essenially NA’s in the Credit Grade variable correspond to loan takers after July 2009.
As for the loan takers after July 2009 we are going to use the ProsperRating (Alpha) variable. The ProsperRating is a proprieatary system developped by PROSPER after the SEC cease that “allows potential investors to easily consider a loan application’s level of risk because the rating represents an estimated average annualized loss rate range to the investor.” So it seems reasonable to use it for measuring the credit risk of each loan taker as it fulfils the same role as the Credit Grade variable.
##
## A AA B C D E HR <NA>
## 14551 5372 15581 18345 14274 9795 6935 29084
The Prosper Rating follows a similar grading scale as the Credit Rating.
Most loan takers after 2009 had a credit rating of B and C and few loan takers had a credit rating of AA.
What is the most common term of a PROSPER loan?
##
## 12 36 60
## 1.4 77.0 21.5
77% of all loans has a 36 months term in the PROSPER marketplace.
Now lets check the distribution of loan amounts:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
##
## [0,5e+03] (5e+03,1e+04] (1e+04,1.5e+04] (1.5e+04,2e+04]
## 45.5 27.3 17.6 5.0
## (2e+04,2.5e+04] (2.5e+04,3e+04] (3e+04,3.5e+04]
## 4.0 0.2 0.4
##
## 4000 15000 10000 5000 2000 3000 25000 20000 1000 2500
## 14333 12407 11106 6990 6067 5749 3630 3291 3206 2992
## [1] 949894347
The minimum loan amount is $1000 and the maximum amount $35000. The mean loan amount is $8337 and the median is $6500. 45.5% of loans are loans between $1000 and $4000 and only 0.4% of loans are for amounts above $30000. The most common loan amount (the mode of the distribution) is $4000. As a result the distribution of loan amounts is skewed to the right. It is also interesting to note that the distribution exhibits many other peaks at $15.000, $10.000, $20.000 and $25.000 which is normal because usually people tend to borrow in multiple of thousand. Total value of loan amounts funded by the PROSPER market place in the examined dataset is approximately $950 million.
But what is the reason specified by loan takers for asking the loan?
## NA Debt Consolidation Home Improvement
## 16965 58308 7433
## Business Personal Loan Student Use
## 7189 2395 756
## Auto Other Baby&Adoption
## 2572 10494 199
## Boat Cosmetic Procedure Engagement Ring
## 85 91 217
## Green Loans Household Expenses Large Purchases
## 59 1996 876
## Medical/Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
Most loan takers use the PROSPER platform for debt consolidation. Many people don’t specify the category of their loan (NA option). The third most popular listing selection by loan takers is “other”. “Home improvement” and “Business” come in fourth and fifth place.
What is the occupation of loan takers?
## [1] 68
##
## Other Professional Computer Programmer
## 25.1 12.0 3.9
## Executive Teacher Administrative Assistant
## 3.8 3.3 3.2
## Analyst Sales - Commission Accountant/CPA
## 3.2 3.0 2.8
## Clerical
## 2.8
## [1] 86.2
There are 68 different occupations selected in the dataset by the loan takers. Most popular categories by a large margin are “Other” and “Professional” which are not terribly useful for our analysis since they are too generic. Computer Programmer is the third most common selection as can be seen from the graph, Executive came in the fourth place and Teacher in the fifth. 86.2% of answers about Occupation fall in the top 1/3 of answers. The barplot of top 1/3 of anwers is depicted in the graph above.
The income range of loan takers could probably be a useful variable to explore:
## IncomeRange count_n perc
## 1 $25,000-49,999 32192 28.3
## 2 $50,000-74,999 31050 27.3
## 3 $100,000+ 17337 15.2
## 4 $75,000-99,999 16916 14.8
## 5 Not displayed 7741 6.8
## 6 $1-24,999 7274 6.4
## 7 Not employed 806 0.7
## 8 $0 621 0.5
Most people in the dataset have an income that ranges between $25.000 and $75.000. According to this wikipedia article only the top 9.15% of U.S population had a personal income larger than $100.000 in 2016. So, it is somehow strange that the third most common income range category for loan takers is high income people (+$100.000).
To conclude this section we are going to plot three more variables that may be useful in the analysis that will follow: Recommendations, InvestmentFromFriendsCount and Investors.
## [1] 96.3
## [1] 3.7
##
## 0 1 2 3 4 5 6 7 8 9 14 16 18 19 21
## 0.0 82.6 13.3 2.5 0.6 0.3 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.0
## 24 39
## 0.0 0.0
Only 3.7% of loan takers had a recommendation when they posted their listing. From those people the large majority (3.516 or 82.6%) had only one recommendation.
## [1] 98.1
## [1] 1.9
##
## 0 1 2 3 4 5 6 7 8 9 13 15 20 33
## 0.0 86.1 10.1 1.9 0.7 0.4 0.2 0.1 0.1 0.2 0.0 0.0 0.0 0.0
Less than 2% of loan takers had friends investing on their loan. The most common friend count was one person (86.1%).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
##
## 1 2 3 4 5 8 10 6 9 11
## 24.41 1.22 0.87 0.73 0.66 0.66 0.64 0.63 0.63 0.63
##
## [0,100] (100,200] (200,300] (300,400]
## 81470 19372 7790 3119
## (400,500] (500,600] (600,700] (700,800]
## 1374 545 181 53
## (800,900] (900,1e+03] (1e+03,1.1e+03] (1.1e+03,1.2e+03]
## 26 3 3 1
##
## [0,100] (100,200] (200,300] (300,400]
## 71.504 17.002 6.837 2.737
## (400,500] (500,600] (600,700] (700,800]
## 1.206 0.478 0.159 0.047
## (800,900] (900,1e+03] (1e+03,1.1e+03] (1.1e+03,1.2e+03]
## 0.023 0.003 0.003 0.001
Almost 1/4 of loans were funded by a single investor. The distribution of investor number per loan is heavily skewed to the right (note that 1 investor per loan category is omitted, otherwise other categories wouldn’t be visible). 71.5% of loans were funded by less than 100 investors and only 4 loans had more than 1000 investors.
As it was mentioned in the beginning of this project the data set contains 113.937 loans and 81 variables, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, the latest payment information among others.
It is interesting to explore which variables can help potential investors avoid loan takers that most likely will default on their loan. Also, I am curious to check if the behavior and characterics of loan takers changed after the financial crisis of 2008 in the PROSPER platform.
It is not possible to explore all the variables in this dataset due to time constraints, so I chose 12 supporting variables that were presented in the univariate plots section in order to conduct my anlysis against the features of interest, namely:
I created four new variables in the bivariate plots section. The first one classifies loans as defaulted and non-defaulted. The second one groups listings by year of initiation. The third one combines the CreditGrade and the ProsperRating…Alpha. variables into a single variable. And the last one splits the Investors variable in ten categories.
In the multivariate plots section I created a variable that divides the dataset in two time periods. The first period contains listings up until 2008. The second one covers listings from 2009 and onwards.
The LoanOriginalAmount distibution is heavily skewed to the right, which is not what I expected. I originally thought that loan amounts would be centered around $5.000 - $10.0000 not below $5.000 as they actually do.
I changed the data type of the following variables:
from numeric to factor to better visualize their distribution.
We will create a new variable that classifies loans as defaulted and non defaulted. We are going to use the LoanStatus variable to achive that. As a reminder the LoanStatus variable with its loan type counts is depicted below.
## [1] 55084
We will name the new variable as Defaulted and it will take only two values: either “Yes”" or “No”. We will include to the “Yes” category the Defaulted and Chargedoff loans from the LoanStatus variable and to the “No” category the Completed loans of the same variable. We will omit the other categories from the LoanStatus variable because we don’t know their outcome yet (if the loan will default or not).
As a consequence we will drop the data for all loans that don’t belong to the aforementioned categories. The total number of loans that we are going to work with is 55.084. Let’s have a first look at the percentage of defaulted loans in the newly created dataset:
3 out of 10 loans defaulted in the selected dataset. This is a really high percentage of defaulted loans. Lets plot the selected examined variables in the univariate plots section against the “Yes” and “No” categories.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 23 6209 11557 11262 2206 5489 7700 8032 2544 62
It seems that loans that were initiated in 2006 and 2007 had a higher chance of default than loans in other years. As a note keep in mind that for 2005 and 2014 there are few loan listings in the dataset (23 and 62 respectively).
Now lets check the credit risk of loan takers against the Defaulted variable. We will combine the CreditGrade and the ProsperRating…Alpha. variables in one new variable called Credit_Risk. As a reminder CreditGrade was applicable for loans for the pre-2009 period and ProsperRating…Alpha. for loans after July 2009. So combining them let us analyse the credit risk for the whole period covered by our dataset.
Ok so it is pretty clear that the lower the Credit Risk rating the higher the probability of default of a loan taker.
Is loan duration related to defaults?
## Cell Contents
## |-------------------------|
## | Count |
## | Column Percent |
## |-------------------------|
##
## ========================================================
## PLD_def_or_no$Term
## PLD_def_or_no$Defaulted 12 36 60 Total
## --------------------------------------------------------
## No 1450 34200 2424 38074
## 94.6% 68.6% 65.6%
## --------------------------------------------------------
## Yes 82 15656 1272 17010
## 5.4% 31.4% 34.4%
## --------------------------------------------------------
## Total 1532 49856 3696 55084
## 2.8% 90.5% 6.7%
## ========================================================
We can see that 12 months listings have a significant lower default rate than 36 and 60 months listings. Keep in mind that 12 months loans represented only 2.8% of loans in the dataset.
Let’s see if there is a relation between the loan amount and the default rate.
## # A tibble: 2 x 5
## Defaulted median mean standard_dev IQR
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 No 4500 6189.093 5087.737 5450
## 2 Yes 4500 6424.842 5494.314 5100
It’s not likely that loan amount plays any significant role in the likelihood of default according to the above graph and summary statistics. Both violin plots looks pretty similar and their distribution is mainly concentrated below $5.000. The only difference that we can easily spot is that there are no loans above $25.000 that had defaulted.
Let’s visualize which listing category had the biggest percentage of default:
Green loans, loans that have undifined loan category and loans for Household Expenses had the higher rate of default.
Following the same procedure as above for occupation categories produces the following three graphs:
Realtors and Nurse’s Aides had almost a 50% probabibilty of defaulting on their loans! Technical and Community College students came in the third and fourth place with probabibility of default around the 45% mark. No Judge had ever defaulted (but there are only two in the dataset).
What about income range of loan takers?
No surprises here. The higher the income the lower the probability of default (except for the $0 category, which we will explore it later on).
Checking if the number of recommendations the borrower had and the number of friends that made an investment in the loan at the time the listing was created are related somehow to the default ratio:
This is not what someone might expect. Logically larger number of recommendations or investment from friends should produce a lower rate of default. This is not the case in the above two graphs, but nevertheless we should not forget that there are few observations for counts above 2 in both graphs. So it’s seems prudent to not make any inference about those two variables.
So let’s check if the number of investors reveal any useful insights for the probability of default:
In the above chart we splitted the investor count data into approximately equal frequency bins to better visualize it. It seems that as the number of investors surpasses 25 the rate of default goes down. It increases again a little bit after 244 investors which seems quite odd.
Now it’s time to have a quick look at possible relationships between the supporting varibles of our analysis. Let’s see if Investors count is related to Income Range:
We can spot a trend here. The higher the Income Range of the loan taker the higher the number of people willing to invest in a loan, which of course makes sense. But clearly the $0 income range category seems like an outlier..
Is loan amount related to IncomeRange?
## # A tibble: 8 x 2
## IncomeRange mean
## <fctr> <dbl>
## 1 $100,000+ 9711.120
## 2 $75,000-99,999 7719.072
## 3 $0 7421.954
## 4 $50,000-74,999 6564.272
## 5 $25,000-49,999 5285.641
## 6 Not displayed 5171.352
## 7 Not employed 4579.519
## 8 $1-24,999 4164.764
Αs the Income Range gets higher so does the loan amount. But we can see that people with $0 income managed to borrow higher amounts on average than people in the income range $1 to $74,999 which is totally unexpected. As a first guess we could think that the $0 category is mainly populated by students that use PROSPER to get student loans. Let’s see if that’s the case:
Clearly that’s not the case. Let’s check the year of initiation of those loans:
Almost all these loans where initiated in the first years of PROSPER operation. I believe that the Credit_Risk variable can help us a little bit more here. Let’s visualize the credit risk for the $0 and the $50,000-74,999:
The top 3 credit risk ratings of loan takers with $0 income were AA, A and B! But loan takers with income of $50,000-74,999 had clearly worse ratings…Checking the number of loan takers in each loan category:
##
## $0 $100,000+ $1-24,999 $25,000-49,999 $50,000-74,999
## 606 6064 4571 16343 12789
## $75,000-99,999 Not displayed Not employed
## 6442 7737 532
The number of loan takers in the $0 income category is really low compared to the other income range categories. If we check again the listing category of these loans….
## ListingCategory..numeric. n perc
## 1 NA 183 30.2
## 2 Business 154 25.4
## 3 Personal Loan 105 17.3
## 4 Debt Consolidation 91 15.0
## 5 Student Use 28 4.6
## 6 Other 27 4.5
## 7 Home Improvement 15 2.5
## 8 Auto 3 0.5
…..we can see that the top two categories are loans that are listed as NA and Business loans. So I think these $0 income loans were initiated on behalf of small companies that needed some quick financing and this was probably allowed in the initial years of PROSPER operation. This probably explains the high ratings, the high mean amount per loan too and the high number of investors willing to invest on these loans. But this is just a hypothesis and we should take it with a grain of salt.
Finally to conclude this section let’s have a quick glance in the graph below that depicts the relation between the loan amount and the number of investors:
## [1] 0.68
It seems that there is a fairly positive relationship between the two variables. Their correlation coefficient is 0.68.
I explored the relationships of the Defaulted variable, which is one of the features of interest, with 10 other variables. I found some non-surprising results:
All the relationships above look visually strong, but if I had to chose one I would say that the realtion between the Credit Risk variable and the probability of default stands out.
As we said before it is interesting to see if loan takers before the financial crisis of 2008 had different behaviour than those after 2008. The dataset contains 29.051 loans for the period 2005-2008 and 26.033 for the period 2009-2014. So the data split is almost even (53% - 47%).
More that 1/3 loans in 2005-2008 defaulted in the PROSPER platform. PROSPER managed to decrease the default ratio below 1/4 in 2009-2014.
It is clear from the above graph that loans created in the aftermath of the Lehman Brothers bankruptcy had a substantially lower default rate across all credit risk categories than loans in the period 2009-2014.
The above graph reveals that income range was negatively correlated with the default rate for loans after 2009. For loans before the financial crisis the correlation is present but it is much smaller.
The same pattern as with the income range is spotted to a lesser extent with the number of investors. More than 68 investors per loan signifies a lower rate of default for the second period. This relation holds true for all categories in the first period up until 244 investors.
In the above table we can see the loan takers occupations which had the biggest improvement in terms of default rate percentage in the period 2009-2014, when compared to 2005-2008. In order to exclude categories with few observations we had included only professions for which there are at least 50 default cases in both periods. Loan takers from the construction industry had the biggest improvement, which makes sense since the construction industry was one of the most severely hit industries by the financial crisis. The other category that is worth mentioning is the Executive occupatiom which sound really vague considering the fact that almost half the loan takers with this occupation title defaulted on their loan during 2005-2008.
In the above graph we can notice that PROSPER added several loan categories after 2009, but excluded the personal loan category from its listings.
The above graph reveals that defaulted loans in period 2005-2008 had a larger range of listing amounts in most income categories. This pattern isn’t visible in the period 2009-2014.
Splitting the data in two time periods revealed some interesting relationships between the examined variables. The rate of default was substantially lower in 2009-2014 compared to 2005-2008. Default rates were lower across all credit risk categories compared to the period 2005-2008 and income range of loan takers seemed to infuence a lot more the deafult rate after 2009. Also we discovered that PROSPER added new listing categories after 2009.
When plotting loan amount vs income range for both periods it was somewhat unexpected that defaulted loans in the first period had a larger range of listing amounts in most income categories. We will try to explain this observation in the next section.
I believe the above graph is self-explanatory. It shows that PROSPER made its credit risk system stricter from 2009 and onwards. Despite the abysmal performance of loan takers in 2005-2008, the graph shows that PROSPER managed to correctly seperate loan takers into different credit risk categories in 2005-2008 because the default rate exhibits a ladder pattern as in 2009-2014. But the main problem in the period 2005-2008 is that PROSPER failed to exclude certain loan takers from taking a loan in the first place. It is not reasonable to allow people taking a loan if their probability of default is close or above 50% (i.e. categories E and HR in 2005-2008). This is of course an ex post info but given the fact that the first period covers 3 years I believe there was enough time for PROSPER to adjust its credit risk policy. To PROSPER defence we can say that when a company starts operations it is inevidable that some things may go south. We shouldn’t also forget that there is always a need for new customers for newly established companies. Furthermore the financial crisis that hit the world economy in 2007-2008 certainly played it’s role to the far than stellar performance of loan takers in 2005-2008.
Plz note that in the above graph we dropped the NC (No Credit) and NA (Not Available) credit risk categories for better representing the data (they contain only 272 observations, 0.5% of all observations).
Several things can be noticed in the above graph. First of all it seems that after 2009 there was no option for a loan taker to not display his/her income. Probably this happened because during the period 2005-2008 this income range category had the highest rate of default. Secondly the default ratio is highly correlated with income range in 2009-20014. This is not the case in 2005-2008 where the drop in default rate as income range increases not so visually strong. We can make two hypothesis for this observation: either the effect of the financial crisis hit all income categories to the same extent (which I think is highly unlikely) or PROSPER didn’t have a mechanism to verify the stated income of loan takers and loan takers took advantage of it. I think this last hypothesis is further supported by the fact that the “Not employed” category had the smallest default rate in 2005-2008 (which doesn’t make sense), but in 2009-2014 it had the largest one.
In the last plot we can easily spot a trend: as income range grows so does the loan amount. But the $0 category seems to brake this trend because it’s distribution of loan amounts lies above the income category $1-25k category. As it was analysed during the bivariate plots section the $0 income loans are few, they are mainly concentrated in 2007 and 2008 and they are probably loans on behalf of small companies, so we can ignore them in this graph.
Another issue that is visible is that the distribution of defaulted loans during 2005-2008 is more spread out than non defaulted loans across several income categories. This isn’t something to be expected. In general every bank that lends money takes heavily into consideration the income of each loan taker before giving a loan. This isn’t inline with what is seen in the graph of 2005-2008. We should note that this tendancy was evaporated in the period 2009-2014.
A final observation about the above graph is that as income range increase loan range amount increseases also.
Exploring the PROSPER loan dataset was really interesting. I discovered many insights about the peer-to-peer lending which I wasn’t aware of. In fact this type of lending in Europe is not at all common. The main difficulties that I encountered during this project were related to the variable selection that I did. I wanted to explore the categorical variables of this dataset mainly because I haven’t been exposed to this type of data analysis in the past and learning to do so would be useful.
The most common pattern throught out this dataset exploration is that PROSPER had a difficult time during the first years of it’s operation and a part of it can be attributed to the financial crisis that unfolded after 2008. Nevertheless, it seems that their credit rating system wasn’t at all optimal because more than 1/3 of their loans defaulted during 2005-2008. So it is not strange that the SEC imposed a cease and desist order on PROSPER from the end of November 2008 till July 2009. Moving after the SEC order PROSPER managed to fix its credit rating system and the default rates were normalised to fairly acceptable rates.
To this end adding a model that forecasts which loans will default on the PROSPER platform is certainly feasible. We could use the variables that were explored in this project and seemed to be related with the default ratio. This is something that I will definetely try to do in the future.