Prosper is America’s first peer-to-peer lending marketplace, with more than 2 million members and over $2,000,000,000 in funded loans. Here I analyze the data Prosper makes available to the public (last updated on March 11th, 2014), which contains all the listings and loans ever created, with 81 variables on each loan/listing. As a potential amateur investor, I will explore the borrower market (including demographic segmentation and beyond) and try to let the data tell some behind-the-scenes stories about the borrowers, as well as about Prosper’s performance in terms of the volume of listings by year and by area.
Analyze the loan business at Prosper to understand the loan market by year, by loan usage and by state.
Explore the borrower market (including demographic segmentation and credit score) and identify the factors that influence the Prosper credit score.
Evaluate the lender market (investment frequency) and find the risk factors for lender yield.
This data set contains 113,937 loans with 81 variables on each loan; the names of all 81 variables are listed below. Because I set stringsAsFactors=FALSE when loading the data, all the variables are either numeric or character strings for now. To make the analysis more meaningful, I may change the class of some variables later. Besides the information on borrower, lender and listing, the 81 variables also include 5 identification IDs for each listing. For simplicity, only one unique ID will be kept in the analysis.
## [1] 113937 81
## Observations: 113,937
## Variables: 81
## $ ListingKey (chr) "1021339766868145413AB3B",...
## $ ListingNumber (int) 193129, 1209647, 81716, 65...
## $ ListingCreationDate (chr) "2007-08-26 19:09:29.26300...
## $ CreditGrade (chr) "C", "", "HR", "", "", "",...
## $ Term (int) 36, 36, 36, 36, 36, 60, 36...
## $ LoanStatus (chr) "Completed", "Current", "C...
## $ ClosedDate (chr) "2009-08-14 00:00:00", "",...
## $ BorrowerAPR (dbl) 0.16516, 0.12016, 0.28269,...
## $ BorrowerRate (dbl) 0.1580, 0.0920, 0.2750, 0....
## $ LenderYield (dbl) 0.1380, 0.0820, 0.2400, 0....
## $ EstimatedEffectiveYield (dbl) NA, 0.07960, NA, 0.08490, ...
## $ EstimatedLoss (dbl) NA, 0.0249, NA, 0.0249, 0....
## $ EstimatedReturn (dbl) NA, 0.05470, NA, 0.06000, ...
## $ ProsperRating..numeric. (int) NA, 6, NA, 6, 3, 5, 2, 4, ...
## $ ProsperRating..Alpha. (chr) "", "A", "", "A", "D", "B"...
## $ ProsperScore (dbl) NA, 7, NA, 9, 4, 10, 2, 4,...
## $ ListingCategory..numeric. (int) 0, 2, 0, 16, 2, 1, 1, 2, 7...
## $ BorrowerState (chr) "CO", "CO", "GA", "GA", "M...
## $ Occupation (chr) "Other", "Professional", "...
## $ EmploymentStatus (chr) "Self-employed", "Employed...
## $ EmploymentStatusDuration (int) 2, 44, NA, 113, 44, 82, 17...
## $ IsBorrowerHomeowner (chr) "True", "False", "False", ...
## $ CurrentlyInGroup (chr) "True", "False", "True", "...
## $ GroupKey (chr) "", "", "783C3371218786870...
## $ DateCreditPulled (chr) "2007-08-26 18:41:46.78000...
## $ CreditScoreRangeLower (int) 640, 680, 480, 800, 680, 7...
## $ CreditScoreRangeUpper (int) 659, 699, 499, 819, 699, 7...
## $ FirstRecordedCreditLine (chr) "2001-10-11 00:00:00", "19...
## $ CurrentCreditLines (int) 5, 14, NA, 5, 19, 21, 10, ...
## $ OpenCreditLines (int) 4, 14, NA, 5, 19, 17, 7, 6...
## $ TotalCreditLinespast7years (int) 12, 29, 3, 29, 49, 49, 20,...
## $ OpenRevolvingAccounts (int) 1, 13, 0, 7, 6, 13, 6, 5, ...
## $ OpenRevolvingMonthlyPayment (dbl) 24, 389, 0, 115, 220, 1410...
## $ InquiriesLast6Months (int) 3, 3, 0, 0, 1, 0, 0, 3, 1,...
## $ TotalInquiries (dbl) 3, 5, 1, 1, 9, 2, 0, 16, 6...
## $ CurrentDelinquencies (int) 2, 0, 1, 4, 0, 0, 0, 0, 0,...
## $ AmountDelinquent (dbl) 472, 0, NA, 10056, 0, 0, 0...
## $ DelinquenciesLast7Years (int) 4, 0, 0, 14, 0, 0, 0, 0, 0...
## $ PublicRecordsLast10Years (int) 0, 1, 0, 0, 0, 0, 0, 1, 0,...
## $ PublicRecordsLast12Months (int) 0, 0, NA, 0, 0, 0, 0, 0, 0...
## $ RevolvingCreditBalance (dbl) 0, 3989, NA, 1444, 6193, 6...
## $ BankcardUtilization (dbl) 0.00, 0.21, NA, 0.04, 0.81...
## $ AvailableBankcardCredit (dbl) 1500, 10266, NA, 30754, 69...
## $ TotalTrades (dbl) 11, 29, NA, 26, 39, 47, 16...
## $ TradesNeverDelinquent..percentage. (dbl) 0.81, 1.00, NA, 0.76, 0.95...
## $ TradesOpenedLast6Months (dbl) 0, 2, NA, 0, 2, 0, 0, 0, 1...
## $ DebtToIncomeRatio (dbl) 0.17, 0.18, 0.06, 0.15, 0....
## $ IncomeRange (chr) "$25,000-49,999", "$50,000...
## $ IncomeVerifiable (chr) "True", "True", "True", "T...
## $ StatedMonthlyIncome (dbl) 3083.3333, 6125.0000, 2083...
## $ LoanKey (chr) "E33A3400205839220442E84",...
## $ TotalProsperLoans (int) NA, NA, NA, NA, 1, NA, NA,...
## $ TotalProsperPaymentsBilled (int) NA, NA, NA, NA, 11, NA, NA...
## $ OnTimeProsperPayments (int) NA, NA, NA, NA, 11, NA, NA...
## $ ProsperPaymentsLessThanOneMonthLate (int) NA, NA, NA, NA, 0, NA, NA,...
## $ ProsperPaymentsOneMonthPlusLate (int) NA, NA, NA, NA, 0, NA, NA,...
## $ ProsperPrincipalBorrowed (dbl) NA, NA, NA, NA, 11000, NA,...
## $ ProsperPrincipalOutstanding (dbl) NA, NA, NA, NA, 9947.90, N...
## $ ScorexChangeAtTimeOfListing (int) NA, NA, NA, NA, NA, NA, NA...
## $ LoanCurrentDaysDelinquent (int) 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LoanFirstDefaultedCycleNumber (int) NA, NA, NA, NA, NA, NA, NA...
## $ LoanMonthsSinceOrigination (int) 78, 0, 86, 16, 6, 3, 11, 1...
## $ LoanNumber (int) 19141, 134815, 6466, 77296...
## $ LoanOriginalAmount (int) 9425, 10000, 3001, 10000, ...
## $ LoanOriginationDate (chr) "2007-09-12 00:00:00", "20...
## $ LoanOriginationQuarter (chr) "Q3 2007", "Q1 2014", "Q1 ...
## $ MemberKey (chr) "1F3E3376408759268057EDA",...
## $ MonthlyLoanPayment (dbl) 330.43, 318.93, 123.32, 32...
## $ LP_CustomerPayments (dbl) 11396.1400, 0.0000, 4186.6...
## $ LP_CustomerPrincipalPayments (dbl) 9425.00, 0.00, 3001.00, 40...
## $ LP_InterestandFees (dbl) 1971.1400, 0.0000, 1185.63...
## $ LP_ServiceFees (dbl) -133.18, 0.00, -24.20, -10...
## $ LP_CollectionFees (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LP_GrossPrincipalLoss (dbl) 0.00, 0.00, 0.00, 0.00, 0....
## $ LP_NetPrincipalLoss (dbl) 0.00, 0.00, 0.00, 0.00, 0....
## $ LP_NonPrincipalRecoverypayments (dbl) 0.00, 0.00, 0.00, 0.00, 0....
## $ PercentFunded (dbl) 1.0000, 1.0000, 1.0000, 1....
## $ Recommendations (int) 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ InvestmentFromFriendsCount (int) 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ InvestmentFromFriendsAmount (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Investors (int) 258, 1, 41, 158, 20, 1, 1,...
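As a rough illustration of the loading step described above, here is a minimal base R sketch. The inline CSV is a tiny made-up stand-in for the real Prosper file, and both the choice of which five columns count as the identification IDs and the decision to keep ListingNumber are my assumptions, not something the analysis states.

```r
# Tiny made-up stand-in for the Prosper CSV (the real file is not read here).
csv_text <- "ListingKey,ListingNumber,LoanKey,LoanNumber,MemberKey,LoanStatus,BorrowerRate
A1,193129,K1,19141,M1,Completed,0.158
B2,1209647,K2,134815,M2,Current,0.092"

# stringsAsFactors = FALSE keeps text columns as character vectors.
loans <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Assumption: keep ListingNumber as the single ID and drop the other four.
id_cols <- c("ListingKey", "LoanKey", "LoanNumber", "MemberKey")
loans <- loans[, !(names(loans) %in% id_cols)]

str(loans)
```

With the real data, the same pattern would apply after replacing `text = csv_text` with the path to the downloaded file.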
The first plot shows that from 2006 to 2007 the number of listings created almost doubled. The number dropped a little in 2008 and then decreased dramatically in 2009. My guess is that the global financial crisis was the principal culprit.
Let us look at the multiple plots for more detail. Prosper grew from November 2005 until May 2008. Because of the financial crisis, from May 2008 to the end of that year the listing count fell from almost 2,000 to 0. The bad economic situation lasted 8 months. Three years later, in 2012, the number of listings created returned to a level similar to that before the global financial crisis. In 2012 and 2013 the business boomed, and by the data cut-off date the listing count had exceeded 5,500.
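The yearly and monthly listing counts behind plots like these can be obtained by bucketing ListingCreationDate. The sketch below uses a handful of made-up timestamps in the same format as the data; it shows one way to do the counting, not necessarily the code used for the plots.

```r
# Made-up timestamps in the same format as ListingCreationDate.
created <- c("2007-08-26 19:09:29.263000000",
             "2007-11-02 08:15:00.000000000",
             "2008-05-14 10:00:00.000000000",
             "2013-12-01 12:30:00.000000000",
             "2014-02-27 08:28:07.900000000")

# Keep only the first 10 characters (YYYY-MM-DD) before parsing as a date.
created_date <- as.Date(substr(created, 1, 10))

# Count listings per calendar year and per calendar month.
by_year  <- table(format(created_date, "%Y"))
by_month <- table(format(created_date, "%Y-%m"))

by_year
```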
From the plot above, we can clearly see that almost 50% of people borrow from Prosper for Debt Consolidation; the second and third largest categories are Not Available and Other. That is a little surprising: about 25% of people prefer not to say how they will use the loan. Home Improvement and Business each account for about 6%.
The two plots above show the geographic distribution of borrowers. They clearly show that the top 4 states by size of borrower market are CA, TX, NY and FL, with 14,717, 6,842, 6,729 and 6,720 borrowers respectively. The pie chart shows that these four states account for 13.6%, 6.33%, 6.23% and 6.22% of the whole market respectively. Is a state’s rank in the loan market associated with whether that state is open to investors?
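Counts and percentage shares per state of this kind can be computed with `table()` and `prop.table()`. The state vector below is made up for illustration, and dropping listings with an empty BorrowerState is my assumption about how blanks were handled.

```r
# Made-up BorrowerState values; "" stands for a listing with no state recorded.
state <- c("CA", "CA", "CA", "TX", "TX", "NY", "FL", "CO", "GA", "")

# Assumption: listings without a state are excluded from the shares.
state <- state[state != ""]

# Borrower count per state, largest first, and each state's share in percent.
counts <- sort(table(state), decreasing = TRUE)
share  <- round(100 * prop.table(counts), 1)

head(counts, 2)
head(share, 2)
```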
The Prosper website shows which states are open to investors. Light blue marks a state that is open to investors, and black marks a state that is closed to investors. Comparing this map with the previous plot, “Distribution of borrowers”, no clear correlation was found. For example, TX is currently not open to investors, yet TX is the second biggest loan market. Other factors must be shaping the loan market. In general, people who spend more money may live in economically developed regions. To test this hypothesis, we can plot the average GDP of each state in the United States from 2011 to 2014.
The GDP figures for U.S. states from 2011 to 2014 come from the Bureau of Economic Analysis. In this plot, the top 4 states by GDP are CA, TX, NY and FL, the same four states that rank highest in Prosper’s loan market. The plot broadly supports my hypothesis that people living in economically developed regions are more likely to borrow from Prosper.
So far we have analyzed how each factor (time, loan category and state) affects the loan business at Prosper on its own. Does the interaction of any two factors influence business development? First, let us check the business development from 2005 to 2014 in each state.
As the small multiple plots above show, the yearly trend in listings created in most states is similar to the overall trend explained before. Interestingly, it seems that the borrower markets in Iowa, Maine and North Dakota were shut down completely after 2008, despite the growing trend in listings created from 2006 to 2008. Also, the borrower market did not start until 2007 in the District of Columbia, 2008 in Nevada and Rhode Island, and 2009 in South Dakota.
Second, did people borrow from Prosper for similar reasons from 2006 to 2014, or did the reasons change over the years? To answer this question, let us check the plot below. The categories ‘Other’ and ‘Not Available’ were excluded from the data because they give us no information about loan usage.
The plot above shows that debt consolidation, business and home improvement were the top 3 reasons for borrowing from Prosper from 2007 to 2014. The personal loan category stopped being offered in 2008, and the student loan category stopped being used in 2010. From 2007 to 2011, people borrowed to deal with major matters in their lives, such as personal loans, student loans, buying a car or boat, or baby adoption. After 2011, people borrowed for more personal and more varied reasons, such as cosmetic procedures, engagement rings, green loans, household expenses and large purchases.
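A usage-by-year breakdown that leaves out ‘Other’ and ‘Not Available’ can be sketched as a cross-tabulation. The rows below are made up, and the code-to-label mapping for ListingCategory..numeric. reflects my reading of Prosper’s variable dictionary (only a few codes are shown), so it should be checked against the official definitions.

```r
# Made-up listings: creation year and ListingCategory..numeric. code.
toy <- data.frame(
  year     = c(2007, 2007, 2008, 2012, 2013, 2013),
  category = c(1, 0, 4, 1, 7, 2)
)

# Assumed mapping from numeric code to label (partial, to be verified).
labels <- c("0" = "Not Available", "1" = "Debt Consolidation",
            "2" = "Home Improvement", "3" = "Business",
            "4" = "Personal Loan", "7" = "Other")
toy$usage <- unname(labels[as.character(toy$category)])

# Drop the two uninformative categories, then cross-tabulate year by usage.
kept <- toy[!(toy$usage %in% c("Not Available", "Other")), ]
usage_by_year <- table(kept$year, kept$usage)

usage_by_year
```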
Because the Prosper rating, which is Prosper’s own credit score, is only available from July 2009 to 2014, we restrict the following analysis to borrowers from that period. First, we check basic information about the borrowers: occupation, employment status, home ownership, employment status duration and income range.
All the data are from after July 2009, the date from which ProsperScore is available. The top left plot above shows that almost 80% of borrowers are employed. Surprisingly, some people without employment also got loans from Prosper; presumably these borrowers have other income or assets. The top right plot tells us that over 50% of borrowers are homeowners. The majority of borrowers have a yearly income between $25,000 and $100,000. This suggests that people with a yearly income below $25,000 find it hard to borrow money from Prosper.
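Restricting the data to the period in which ProsperScore exists might look like the sketch below. The rows are made up, and the exact cut-off of 2009-07-01 is an assumption, since the text only says July 2009.

```r
# Made-up listings with creation timestamp and ProsperScore.
toy <- data.frame(
  ListingCreationDate = c("2007-08-26 19:09:29", "2009-07-20 10:00:00",
                          "2012-03-01 09:30:00", "2014-02-27 08:28:07"),
  ProsperScore = c(NA, 7, 4, 9),
  stringsAsFactors = FALSE
)

# Parse the date part of the timestamp.
created <- as.Date(substr(toy$ListingCreationDate, 1, 10))

# Assumed cut-off date; also drop rows where ProsperScore is missing.
cutoff <- as.Date("2009-07-01")
recent <- toy[created >= cutoff & !is.na(toy$ProsperScore), ]

nrow(recent)
```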
I subset the borrowers with employment status listed as “Employed”, “Full-time”, “Part-time” and “Self-employed” and plot a histogram of their length of employment. Using years as the unit of time, there is a very clear downward trend: the number of people who take out loans falls as the length of employment increases. My guess is that the longer one works, the more one earns and therefore the more one saves. With more savings in the bank, people are less likely to borrow.
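The subsetting and unit conversion described here can be sketched as follows. The rows are made up, and I am assuming that EmploymentStatusDuration is recorded in months, which is why it is divided by 12; the one-year bins are likewise an illustrative choice.

```r
# Made-up borrowers with employment status and duration.
toy <- data.frame(
  EmploymentStatus = c("Employed", "Full-time", "Retired",
                       "Self-employed", "Part-time", "Not employed"),
  EmploymentStatusDuration = c(2, 44, 120, 113, 17, 5),
  stringsAsFactors = FALSE
)

# Keep only the four working statuses named in the text.
working <- c("Employed", "Full-time", "Part-time", "Self-employed")
emp <- toy[toy$EmploymentStatus %in% working, ]

# Assumption: the duration is in months, so divide by 12 to get years.
emp$years <- emp$EmploymentStatusDuration / 12

# Histogram counts in one-year bins: [0,1), [1,2), ..., [9,10).
bins <- table(cut(emp$years, breaks = 0:10, right = FALSE))

bins
```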
After excluding the missing values of Prosper rating, let us look at the distributions of Prosper score and loan original amount. From the histogram of Prosper score, we can see that most borrowers have a Prosper score between 2.5 and 9. The first peak for loan original amount is at $4,000, the second at $15,000 and the third at $10,000. What features affect the Prosper score and the loan amount? Let us see in the following analysis.
The Prosper score/rating of a borrower is a unique and really important feature that is surely of interest to both borrowers and investors. I’d like to see how the Prosper rating relates to other features of the listings. To get a quick overview, I will start with ggpairs and dig deeper into any findings of interest later on.
Because of the limitations of my computer, 1,000 observations were randomly selected from the loan data for this analysis. The result may not be very precise, but it gives us the rough trend between variables. 11 variables with important general meaning were selected from the dataset. We can clearly see that ProsperScore is positively related to Investors and CreditScoreRangeLower, but negatively related to BorrowerRate, EstimatedLoss and DebtToIncomeRatio. From the boxplots, we can also see that ProsperScore is related to IncomeRange, EmploymentStatus and ProsperRating..Alpha.. Surprisingly, EmploymentStatusDuration does not correlate with ProsperScore.
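The sampling step can be sketched in base R as below. The data are synthetic, with a negative relationship between the two columns built in by construction, so the resulting correlation only illustrates the mechanics and says nothing about the real Prosper data. The seed and the population size are arbitrary. On the real sample, a call such as `GGally::ggpairs()` on the chosen columns would produce the pairs plot.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Synthetic stand-ins for EstimatedLoss and ProsperScore (not real data).
n     <- 5000
loss  <- runif(n, 0.01, 0.30)
score <- 10 - 25 * loss + rnorm(n, sd = 1)
loans <- data.frame(EstimatedLoss = loss, ProsperScore = score)

# Draw 1,000 rows at random, without replacement.
idx         <- sample(nrow(loans), 1000)
loan_sample <- loans[idx, ]

# Pearson correlation on the sample, ignoring missing values pairwise.
r <- cor(loan_sample$ProsperScore, loan_sample$EstimatedLoss,
         use = "pairwise.complete.obs")
r
```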
These findings are very meaningful. If a borrower’s loan is funded by more investors, it suggests that investors trust the borrower’s information, and such borrowers tend to have a higher ProsperScore/rating. Borrowers with a higher salary tend to have a higher ProsperScore/rating. A borrower with a higher credit score from another credit agency will get a higher score at Prosper. A higher DebtToIncomeRatio means a worse financial situation, so such borrowers are prone to a lower Prosper score. A borrower with a higher EstimatedLoss will get a lower ProsperScore. BorrowerRate is highly correlated with EstimatedLoss, so it can be removed from the following analysis.
To analyze the correlations found above more deeply on the whole dataset, multivariate analysis will be used to explore the risk factors that affect the Prosper score and, ultimately, the loan amount for borrowers.
Income range is a very important risk factor for the Prosper rating/score. After changing the order of “Not employed” and “$0”, a very interesting step-wise graphic shows up. We can claim that borrowers in a higher income range tend to have a better Prosper rating; however, borrowers who were not employed when they sent out the loan request have a higher Prosper rating than those who had zero income.
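Reordering the income categories amounts to setting the factor levels explicitly. In the sketch below the values are made up, only a few levels are shown, and placing “$0” before “Not employed” is my reading of the reordering described here; the full set of IncomeRange levels in the data may differ.

```r
# Made-up IncomeRange values.
income <- c("$25,000-49,999", "$0", "Not employed",
            "$1-24,999", "$25,000-49,999")

# Assumed level order, lowest first (partial list, for illustration).
lvls <- c("$0", "Not employed", "$1-24,999", "$25,000-49,999")

# An ordered factor makes plots and tables follow this order.
income_f <- factor(income, levels = lvls, ordered = TRUE)

table(income_f)
```

Because the factor is ordered, comparisons such as `income_f[3] > income_f[2]` follow the chosen level order rather than alphabetical order.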
Employment status is another risk factor for the Prosper rating/score. After ordering employment status from not employed to full-time, another step-wise graphic shows up. We can claim that borrowers with a stable employment status tend to have a better Prosper rating. The “Other” and “Self-employed” groups seem to get similar Prosper ratings. Retired borrowers, however, get AA and A Prosper ratings more easily than employed borrowers overall.
From the analysis above, CreditScoreRangeLower is positively related to the Prosper score. This plot is used to verify that the Prosper score is highly related to the Prosper rating. We can clearly see that the Prosper rating basically corresponds to the Prosper score, whatever the lower bound of the credit score: a better rating category goes with a better Prosper score. However, CreditScoreRangeLower affects the different levels of Prosper rating in different ways, so we cannot simply say that a higher lower bound of the credit score gives a better Prosper rating. The Prosper rating does not show a simple dependence on CreditScoreRangeLower.
This plot combines ProsperScore, Prosper rating and EstimatedLoss. The colors represent the levels of Prosper rating, and the ball size represents the Prosper score. The x axis is EstimatedLoss in reverse order. From this plot we can clearly read off the criterion for classifying the Prosper rating. The threshold for classifying a borrower into rating AA is an EstimatedLoss of 0.02, and most borrowers at the AA level get a Prosper score above 4. EstimatedLoss and Prosper rating together affect the Prosper score: a borrower with a lower EstimatedLoss and a higher Prosper rating gets a better Prosper score.
To sum up the findings, income range, employment status, EstimatedLoss and the lower bound of the credit score are the main risk factors in the calculation of the Prosper score/rating.
LenderYield is the most important factor in attracting lenders to fund loans on Prosper. We already know that the Prosper score/rating is one of the main risk factors for LenderYield; besides it, what other factors affect LenderYield? The variables that are strongly related to the Prosper score/rating should also be related to LenderYield. For consistency, the data only cover July 2007 to 2014. First, we check LenderYield.
There are 22 cases in which the lenders did not make any profit and even lost some of their own money. But at 22 out of 113,937, the odds are really low. Most yields range from 0.05 to 0.35. Surprisingly, the highest peak in the graphic is around 0.31, which paints a really rosy picture for anyone considering a similar investment.
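To put the 22 cases in proportion, 22 out of 113,937 is roughly 0.02% of all loans. The sketch below repeats that arithmetic and shows how loss-making cases can be counted; the yield vector is made up, and treating a negative LenderYield as a loss is my assumption about how the 22 cases were identified.

```r
# Share of loss-making cases quoted in the text: 22 of 113,937 loans.
loss_share <- 22 / 113937
round(100 * loss_share, 3)  # as a percentage

# Made-up LenderYield values to show the counting step.
yield <- c(0.138, 0.082, 0.24, -0.01, 0.31, 0.0, 0.31)

# Assumption: a negative yield means the lender lost money.
n_loss <- sum(yield < 0, na.rm = TRUE)
n_loss
```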
To look further for the risk factors for LenderYield, ggpairs is used here again. For computing speed, 1,000 observations are randomly selected from the loan data.
From the analysis above, we can easily see that LoanOriginalAmount and CreditScoreRangeLower are negatively related to LenderYield, while EstimatedEffectiveYield and BorrowerRate are positively related to LenderYield. These findings make sense. A borrower with a lower credit score range has worse credit and a lower repayment capacity. If lenders lend more money to a borrower, they take on more risk. Conversely, if the BorrowerRate is higher, the borrower pays more for the loan, so the lender gets more money. If the EstimatedEffectiveYield is higher, the lender yield will be higher. Surprisingly, PercentFunded and Recommendations do not correlate with LenderYield. As a general principle, the more people recommend a borrower, the more confident the lender can be of being paid on time. The same idea applies to PercentFunded. Let us check those variables more closely in the whole data.
We can easily see from the left plot that most borrowers do not have any recommendations, and from the right plot that most listings are 100% funded. This explains why Recommendations and PercentFunded do not relate to LenderYield.
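A variable that takes almost the same value for every listing has very little variation left to correlate with anything. One quick check is the share of listings at the dominant value, sketched below on made-up vectors that stand in for Recommendations and PercentFunded.

```r
# Made-up values standing in for Recommendations and PercentFunded.
recommendations <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
percent_funded  <- c(1, 1, 1, 1, 1, 0.85, 1, 1, 1, 1)

# Share of listings with no recommendations and share fully funded.
share_zero <- mean(recommendations == 0)
share_full <- mean(percent_funded == 1)

c(no_recommendations = share_zero, fully_funded = share_full)
```

When both shares are close to 1 on the real data, a weak correlation with LenderYield is what one would expect.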
Though we know the Prosper rating is highly related to LenderYield, the plot below tells us how they are related to each other. Higher risk really does mean higher return.
From all the analysis in this part, we can conclude that lender yield is strongly positively related to EstimatedEffectiveYield and BorrowerRate, and negatively related to Prosper rating, loan original amount and the lower bound of the credit score range. All investors should remember that all investments are risky.
The ggplot, rMaps and plotly packages make R a powerful visualization tool. While it is easy to get a graphic of any kind, it is not easy to get a good and informative graphic with a reasonable type, scale, legend, etc.
Since I have never been a borrower or an investor in a loan, it took me a lot of time just to understand the meaning of all the variables in the list of 81. But the good thing about the Prosper dataset is that I was able to find detailed information for both the borrower side and the lender side on their website. With a better understanding of how Prosper works, I even started to think about becoming an investor in the future when I have extra money to invest. I guess that is also my biggest takeaway from doing the project: I feel that I am now armed with the knowledge and confidence to explore data, even in fields that I am not very familiar with.
To further enrich the analysis, I would like to build a regression model using the available variables to predict the Prosper rating for future borrowers. Credit score models are at the core of credit scoring companies, and given how big that business is, I want to explore this myself using the great tools I learned in this course.
rCharts with DataMaps: http://rcharts.io/viewer/?6735051#.VLrFkmREVFK
Tips for rMaps and rCharts: http://daisukeichikawa.blogspot.com/2014/03/tips-for-rmaps-and-rcharts.html
Prosper: https://www.prosper.com/welcome/how-it-works/
What Is in a Credit Score: https://www.creditkarma.com/article/credit-score-factors
Multiple graphs on one page: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
United States GDP 2011-2014: http://www.bea.gov/newsreleases/regional/gdp_state/gsp_newsrelease.htm