Univariate Plots Section

Since I’m going to run a lot of histograms, I am creating a function to make it a little easier on myself.

First I’m wondering about Loan Originations by Quarter…

The first plot did not display in chronological order along the x-axis. To fix that, I will change the Loan Origination Quarter to display in date series order.

And re-plot the histogram The result shows a big dip in listings from Q4 2008 into 2009-10. This time period coincides with the collapse of Lehman Brothers and the ensuing fallout in the global financial system. It took almost four years for listing levels to reach the Q2 2008 levels. Even though Prosper is an alternative to traditional loan models, it appears that its business was not immune to the global economic crisis. I’m now wondering if the economic crisis caused Prosper to change how it does business. Perhaps Prosper’s credit policies were much looser before the credit crisis? I’ll file that question away for later. After all, it appears that Prosper only established its Prosper Rating and Prosper Score in July 2009.

Now let’s look at Loan Status counts:

Overall, it appears that a large majority of loans are either Completed or Current, though there are also a large number of Charged-off and Defaulted (non-performing). A little later I will want to look at loan performance based on origination vintage.

Look at Prosper’s risk ratings of the loans:

The first plot is out of order. We need to re-order the levels to show a progression of more risky to less risky, left to right, and re-plot the histogram.

This is more or less normally distributed (ignoring the blank entries from before 2009). What does the Prosper Loan Score look like?

This field is also more or less normally distributed. We can conclude that, overall, risk ratings are relatively normal across the sample.

Geographic distribution of the loans by state:

This histogram shows the highest listing counts in CA, TX, NY, FL and IL, which tracks with the rankings of top US states by population. In other words, geographically, the Prosper listings are distributed similarly to the population of the US. The states with the highest percentage of loans per 1,000 residents are GA, MD, OR, and IL.

I have some code that transforms Listing Categories into readable names.

Distribution of Listing Categories: By far, most loans are for “Debt Consolidation.”

How much do borrowers borrow? The loan amounts skew relatively small, with the median of $6,500 less than the mean of $8,337. Based on the past two histograms, a typical loan is for consolidating less than $10,000 of debt (probably credit card debt).

What do their monthly payments look like? The median monthly payment is $217, less than the mean of $272, so the distribution skews toward smaller monthly payments. No payment is more than $2,252, but that appears to be a signficant outlier. There are very few payments above $1,000.

What interest rates are borrowers paying? Interest rates are pretty high, averaging 19.28%. There are also material spikes at higher interest rates over 30%. Since most of these loans are for debt consolidation, these high rates must still be better than the interest rates the borrowers would have to pay to credit card companies. Or, perhaps the borrowers see value in freeing up the borrowing limit on their credit cards, even if it means paying high rates to Prosper lenders.

Listings by income range:

The first plot was out of order. We need to re-order the levels to see the bins in numerical order left to right.

And re-plot the histogram:

The income ranges show that the vast majority of borrowers make less than $75,000 per year, shedding some more light on why their credit cards are stretched and they see the need to consolidate debt.

What borrowers do for a living: The majority put “Other” for their occupation. That is not too helpful. The largest non-Other category is “Professional,” which is also another unhelpful, generic catch-all. But perhaps we can compare the loan performance of some of the other borrower professions down the road. #INTERESTING

Are borrowers homeowners?

## False  True 
## 56459 57478

About half own homes–and have mortgage debt–and half do not.

Let’s look at risk, in the form of Debt-to-Income ratio: There are definitely some outliers in this sample. I wonder what those records look like. Could they be some sort of data error?

## [1] 333

Since the Debt-to-Income Ratio is calculated based on credit report data, it appears to be impossible to recreate this number by calculating it from other fields in this data set. I do, however, want to see what it looks like when we remove outliers from the sample.

The vast majority of Debt-to-Income Ratios are less than 0.5. Excluding the outliers on the high side on the right, the data is close to having the shape of a normal distribution, though it is skewed slightly left.

Now let’s take a look at Credit Scores: The credit scores are normally distributed, with means of 704.6 for the Upper and 685.6 for the Lower. I would suspect there is a strong correlation between credit score and Prosper’s risk rating for each listing; this is something to look a little later.

Now I want to inspect some of the “social” aspects of Prosper’s listings. For example, let’s take a look at Recommendations. Very few borrowers have any recommendations at the time the listing is created.

What about investments from friends? Very few listings get investments from the borrower’s friends. The vast majority get exactly $0, as evidenced by the long bar at the left of the histogram (on a log 10 scale). At least one loan got the max of $25K entirely from friends.

Here’s a question that I don’t think we have enough data to answer: does a friend invest more or less than a typical investor?

What percentage of listings get funded? Almost all listings are 100% funded (y axis is log 10).

Do borrowers get more than one Prosper loan? A material percentage of borrowers are on their second, third, or fourth loan with Prosper at the time of listing. It would be interesting to see how these loans perform vs. first-time borrowers’ loans. To do that, we would have to risk-adjust that loan performance.

How many on-time payments have repeat borrowers made when applying for their next loans? The mean is under two years, and the median is 15 months. So most Prosper borrowers who get a second (or third, etc.) Prosper loan do not wait very long to do so.

What about LATE payments for their prosper loans?

What are the terms for each loan (the length of the loan in months)? Loan terms are either 12, 36, or 60 months, with the vast majority being 36-months.

Univariate Analysis

What is the structure of your dataset?

The Prosper loan dataset contains 113,937 observations of 81 variables. The observations refer to loan listings on Prosper.com from late 2005 until 2014, and various characteristics of those loans. The data seems “tidy,” according to Hadley Wickham’s definition: the variable names are not variables themselves, so there is not much work required in the way of “tidying” the data.

What is/are the main feature(s) of interest in your dataset?

The histogram of Loan Origination Quarter shows a big dip in listings from Q4 2008 into 2009-10. This time period coincides with the (A) collapse of Lehman Brothers and the ensuing fallout in the global financial system, and (B) Prosper’s decision to register with the SEC. It appears that some combination of A and B caused Prosper to change how it does business. It will be interesting to take a look at how Prosper’s business changed over time.

Another feature that looks interesting is Loan Status, in that some loans have performed while others have defaulted or been charged off (what I will call “non-performing”). It will be interesting to look at loan status/performance for different Occupations, Loan Origination Quarters, Prior Borrowers, and other variables. Also, with Credit Score being a proxy for risk, it will be interesting to see how loans with different Credit Scores have performed.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

My main investigation will be into how Prosper’s credit policies and borrower characteristics changed over time as its business model and the larger lending environment evolved. So having the Loan Origination Quarter is critical. And being able to see how much Prosper’s lenders earned or lost on each loan with the Payments, Fees, and Loss fields could prove helpful in illuminating what is going on with these loans.

Did you create any new variables from existing variables in the dataset?

I did not see a need to at this juncture of the analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I spent most of my clean-up efforts on re-ordering default factor levels to make more intuitive sense when displayed in a histogram. For example, Loan Origination Quarter should really display in chronological order. And Prosper rating levels should show in order from most risky to least risky.

The main “unusual” distribution that stood out was the number of loans originated in each quarter, and how that number dipped to zero in early 2009.

Bivariate Plots Section

Let’s start by taking a look at credit scores over time. I am creating a new dataframe to capture minimum credit score data grouped by loan origination quarter. It is clear that Prosper changed its policies from one that had no minimum credit score to one that had very strict lower limits. That lower limit rose over time from 520 in Q2 2007 to 600 in Q3 2009 (after the credit crisis) to 640 most recently.

Let’s look at how loans performed over time. First, if a listing’s LoanStatus is “Chargedoff” or “Defaulted”, then I will consider it non-performing for the purpose of determining the percentage of loans in a given time period that are non-performing.

I am creating a new dataframe to capture loan performance percentages by quarter, and plotting that data over time:

Now I want to see, by quarter, the breakout of performing and non performing loans.

Now I have noticed that I’m not accounting for something: the more recent loans that haven’t fully matured are going to give me an artificially low count of non-performing loans. Those loans haven’t had enough of a chance to go bad yet. So now I’m going to create a new field in the original “listings” dataframe that indicates whether or not a loans should have closed by now, whether or not it has performed. For more recent loans, only the 12-month terms will have closed. For older loans, the 36-month terms will have closed. I am referring these as loan “cohorts,” and grouping them together by comparing their months since origination with the length of the term.

And now I will create a similar loan performance histogram, but using this new “CohortLoansClosed” field to filter the data. The histogram above is a better representation of performing vs. non-performing loan counts, because it only contains loans whose entire cohort has closed, either through maturation or non-performance.

The results show that, starting in Q1 2006, under Prosper’s original business model, the percentage of non-performing loans went from 27.0% and rose to a peak of 42.9% in Q4 2006, before dropping somewhat until the “quiet period” of Q1 2009, when Prosper had no listings. After the quiet period, the percentage of non-performing loans was much lower, though slowly increasing. The non-performing loan percentage peaked again at 21.8% in Q2 2011, before dropping again.

However, the results described above overstate the non-performing loan percentage for more recent loans. Since newer loans have not fully matured, it is possible more of them could start to default or be charged off before their maturation date, so it’s hard to trust the more recent numbers.

Now I’m going to create a new field called MonthsBeforeDefault to look more in-depth at how long it took loans to stop performing. For “non-performing” loans only, I calculate the amount that customers (borrowers) have paid on each loan, divided by their monthly loan payment, to arrive at the number of months that they paid back their loan.

## [1] 11

I suppose, technically, this is a univariate plot, but I needed to have done the “non-performing” analysis to get to this point. As it turns out, the data is skewed towards fewer months. Thus, if someone is going to default on his or her loan, they will decide to do so relatively quickly. The median is just 10 months.

Now I’m looking for “first-payment defaults”: people who never intended to pay their loan back Essentially, these are fraudulent loans.

## [1] 85

A total of 85 borrowers failed to make two payments on their loan. Wow!

Here’s a question that has arisen in my mind: how much money do investors make and have they made over time? I start by summarizing loan amounts, customer payments, losses, and fees from the data by quarter for loans whose cohorts have closed. In this way, I can determine the overall return on the money lent by Prosper’s borrowers by loan origination vintage. A very interesting insight has emerged from the data. Prior to 2009, with the exception of the first quarter of loan originations in Q4 2005, Prosper’s investors/lenders lost money overall. Returns ranged as low as almost -5%. Only when Prosper altered its business model in Q3 2009 did investors start to earn positive returns.

By the way, I removed Q2 2009 from the above analysis becuase the losses so pronounced. It appears something was going on with these loans. Nobody made payments on these loans. Seems like we should disregard them from most analysis, because there was something unusual going on that quarter related to the SEC’s quiet period.

A Narrative is Emerging

In initial, pre-2009 period, loans were too risky based on credit scores.
Therefore, lenders/investors/consumers doing the lending didn’t make any money. In fact, they lost money.
So Prosper switched its business model to a more regulated one (which means, what, exactly?), tightened its credit guidelines and risk profiles, and lenders/investors started making money
With that, Prosper has increased in listing volume and overall money lent

How predictive is Prosper Score?

Now let’s look at Prosper Score vs. Loan Performance. Do listings that score worse up front actually perform worse in reality? How predictive are certain “knowns” at the start of the process?

## 
##  Welch Two Sample t-test
## 
## data:  listings[listings$LoanPerformance == "NonPerforming", "ProsperScore"] and listings[listings$LoanPerformance == "Performing", "ProsperScore"]
## t = -19.859, df = 7644.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6210403 -0.5094495
## sample estimates:
## mean of x mean of y 
##  5.427062  5.992307

Non-performing loans have a significantly lower average Prosper Score than performing loans.

When you separate loans to homeowners from those to non-homeowners, Performing loans to homeowners have significantly higher Prosper scores than non-performing loans. However, loans to non-homeowners have roughly identical Prosper scores, whether the loan performs or not.

Let’s do the same analysis on all Prosper loans… Non-performing loans consistently have lower Prosper Scores than performing loans, at each level of Total Prosper Loans.

What about looking at credit scores?

## 
##  Welch Two Sample t-test
## 
## data:  listings[listings$LoanPerformance == "NonPerforming", "CreditScoreRangeUpper"] and listings[listings$LoanPerformance == "Performing", "CreditScoreRangeUpper"]
## t = -73.616, df = 19389, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -53.99225 -51.19164
## sample estimates:
## mean of x mean of y 
##  659.7876  712.3795

The means are substantially different for credit scores between Performing and Non-Performing loans. Therfore, the higher the credit score, the more likely the loan is to perform.

Let’s look at how various factors affect Prosper’s alphanumeric Rating While there appears to be some relationship between a low DTI ratio and a good Prosper Rating (AA, A, B), for riskier loans the mean DTI is roughly the same across the board. Clearly, other factors influence the Prosper rating.

Again, at lower-risk Prosper Ratings (AA, A, B), the mean high credit score does appear to be significantly higher than those of higher-risk Prosper Ratings. However, with higher-risk Prosper Ratings (HR, E, D), the credit score does not appear as correlated. HR (“high-risk”) Prosper Ratings have higher average high credit scores than do the next (less risky) level up, E. So, again, other factors besides just DTI and credit score factor into the Prosper Rating calculation.

Let’s create a new field in listings that tells us if it is before or after the Q1 2009 quiet period. Then we can compare what happened before that time with what happened afterwards more easily.

## 
##  Welch Two Sample t-test
## 
## data:  listings[listings$LoanPerformance == "NonPerforming", "DebtToIncomeRatio"] and listings[listings$LoanPerformance == "Performing", "DebtToIncomeRatio"]
## t = 11.104, df = 16792, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06998372 0.09998741
## sample estimates:
## mean of x mean of y 
## 0.3484210 0.2634354

The T-test indicates that the sample means are significantly different.

How do various inputs affect Prosper Score? There are some interesting plots here about the relationship between credit score, DTI ratio, and Prosper Score, but I’m now thinking this is a dead end. I am not going to be able to reverse-engineer the Prosper Score or Prosper Rating calculations, so I am not sure what this is telling me beyond common sense.

One thing we can see is that, overall, a risky Prosper Rating does correlate to a higher rate of non-performing loans, and less risky ratings show fewer non-performing loans in a stair-step fashion.

Let’s take a look at interest rates over time… So rates have gotten significantly higher over time, though they have been falling recently. I’m not sure what to do with this, so this looks like another dead end.

Let’s take a look at average max credit scores over time… The plot above is very interesting. Average upper credit scores increase significantly, especially starting with Q3 2009, when Prosper changed its business practices. And they have stayed high ever since. That corresponds with the earlier chart that shows investor returns being below zero in the pre-2009 era, and above zero since. Juxtaposing these two charts will help tell the Prosper story, making for an effective final plot.

Does debt-to-income ratio follow a similar pattern pre- and post-2009? DTI ratios appear much more variable (though only sometimes higher) in the pre-2009 era than afterwards. This looks like another area in which Prosper tightened up its lending standards.

I am interested in looking into the “verifiable” incomes that borrowers stated as part of their Prosper loan application, and I would like to see that broken out by Occupation. How do those compare with loan performance numbers?

## Source: local data frame [68 x 6]
## 
##                    Occupation NonPerforming Performing PctNonPerf Total
## 1                      Doctor            40        454 0.08097166   494
## 2                    Attorney            49        997 0.04684512  1046
## 3                       Judge             0         22 0.00000000    22
## 4                   Executive           495       3816 0.11482255  4311
## 5                     Dentist             2         66 0.02941176    68
## 6                  Pharmacist            20        237 0.07782101   257
## 7                    Investor            28        186 0.13084112   214
## 8  Pilot - Private/Commercial             8        191 0.04020101   199
## 9         Engineer - Chemical            22        203 0.09777778   225
## 10                  Principal            39        273 0.12500000   312
## ..                        ...           ...        ...        ...   ...
## Variables not shown: MeanIncome (dbl)

The resulting chart shows an Occupation’s average income and the percentage at which their loans fail to perform. With this information, we can start to see which Occupations are the best, and which are the worst, at paying back their loans–given how much money they make.

When we plot Percent Non-Performing loans vs. Mean Income by profession and include a best-fit plot, we can see that it borrowers in lower-income Occupations generally have a higher percentage of non-performing loans. This trend makes sense; Occupations with higher incomes have more ability to pay back their loans. The most interesting data points on the above plot are those that sit far above or below the trend line (many of which are labeled). Those occupations represent those that are either “more reliable” or “less reliable,” when compared to other occupations making similar incomes.

How do loans to prior Prosper borrowers perform? I’m not sure I see anything interesting here. I think it’s a dead end.

Again, I don’t really see any interesting trends emerging from the data. I am going to consider this another dead end.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Whereas investors lost money in the pre-2009 era, they made money starting in 2009. This is largely due to better quality borrowers as indicated by the increased average credit scores on the loans.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, the relationship between income and the percentage of non-performing loans based on occupation is a relatively strong one and one that can help us evaluate loans in the future. If an occupation has a potentially volatile income stream, it makes that borrower’s loan more risky.

What was the strongest relationship you found?

The relationship between credit score and loan non-performance is strong. The mean high credit score of loans that perform is significantly higher, as shown with a student t-test, than that of loans that do not perform. You also see this relationship when you look at average credit scores per quarter, and the performance of the loans in that quarter, and plot that over time.

Multivariate Plots Section

I want to use the credit score buckets to see how lender returns break out over time.

I want to see how the count of loans breaks out by the credit score buckets. This plot shows the high number of loans with the max credit score below 650 (in salmon/pink) that were mixed into the loan pool up to and including Q3 2008. Since then, the percentage of lower-credit-score loans is much lower, which has helped push the average upper credit score over 700. About half of the loans in the latest quarter are between 650 and 700, and the other half are above 700.

The plots below continue with the non-performing loan percentages from the bivariate plots section. This time the quarterly loan performance results are broken out by more variables, including Term and the Prosper Rating (alpha). It stands to reason that, in most cases, the higher-risk loans tend to have higher percentages of non-performing loans. The shorter 12-month terms appear to perform better than their 36-month counterparts, at least in the limited data sample that we have. It makes sense: with only 12 months of payments, their are fewer things that can go wrong and disrupt a borrower’s income stream.

Looking at quarterly loan non-performance by Prosper Rating is illustrative. The data is a lot chunkier by quarter, and lower-risk loans don’t always outperform their higher-risk counterparts.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I was able to illustrate that low-credit-score loans, those below 650, made up a substantial percentage of the loan pool in the pre-2009 era, and that group was almost completely wiped out in the era since. That helps explain the improved loan performance in the post-2009 era.

Were there any interesting or surprising interactions between features?

I think it’s interesting that loan performance is a little more volatile on a quarterly basis than it is overall. Overall, HR loans show the most defaults and charge-offs. However, from quarter to quarter, E-, D-, or even B-rated loans sometimes perform worse that the so-called “high risk” HR loans.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not see the need to create any models with this data set.

Final Plots and Summary

Plot One

Description One

This plot illustrates the history of Prosper’s business model transition and how it vastly improved the company’s business. We can see that in the pre-2009 period, the average prosper borrower’s highest credit score was under 700 and in some cases under 650, as indicated by the blue line. Loans originated during that same pre-2009 period cost investors money overall, as shown by the red bars illustrating negative returns. So it is clear why Prosper saw the need to change how it does business; if investors were going to continue to lose money, the company would not be in business very long.

In Q3 2009, after the company registered with the SEC and went through a “quiet period,” the average high credit score rose above 700, and has remained there ever since. Annualized investor returns have flipped into positive territory. So with the 2009 business model transition, Prosper was able to improve the overall risk profile of its loans, resulting in a shift from negative to positive returns for the Prosper members investing in the loans. As a result, loan volume has increase substantially in recent years (as shown in other graphs above), meaning Prosper has discovered a formula for sustained profitability and growth.

Plot Two

Description Two

This is a plot of the percentage of non-performing loans–loans in a cohort that has “closed” (all loans have reached full maturity) and that have either been charged off or defaulted–vs. mean income, where the loans are grouped by the borrower’s stated occupation. It allows us to see the affect of income on ability to pay back a loan.

It’s logical to think that the borrowers who make more money will be less likely to stop paying off their loans. So the downward slope of the trend curve is to be expected.

The more interesting part of this analysis is looking at the outliers–those points furthest from the trend line. I have labeled the group below the trend line as “More Reliable Occupations.” These are the occupations that have paid back their loans at a higher rate than their income would suggest, when compared to other occupations at or near the same income. My use of the word “reliable” is not a value judgement. It could be that these people have more stable income streams than other occupations. And perhaps college seniors, graduate students, and teacher’s aides transition into the workforce during the term of their loan, making their verifiable income at the time the loan was originated effectively too low.

On the area above the curve, where I have put the label “Less Reliable Occupations,” sit those occupations who stop paying their loans at a higher rate than others at or near the same level of income. These folks perhaps have more volatile income streams that were negatively affected by the economic downturn related to the 2008 credit crisis. Realtors, construction workers, commissioned salespeople, investors, and executives all tend to have compensation that varies with the economy. And homemakers, by definition, live in a household with multiple adults and one income, which is often more susceptible to changes in economic fortune.

My take-away with this plot is that, when evaluating investing in a Prosper loan, it is important to consider the occupation of the borrower and how exposed it is to changes in the macroeconomic environment.

Plot Three

Description Three

As shown in the top plot, Prosper’s risk rating system is fairly predictive in identifying the group of loans that has a higher chance of default. Indeed, over 25% of “high-risk” HR loans have failed to perform, while less than 5% of “low-risk” AA loans have defaulted (or been charged off). The risk ratings in between decline in an almost linear fashion going down the scale of riskiness.

However, looking at the loans on a quarterly basis shows that the Prosper Rating is not necessarily as predictive at smaller sample sizes. There are many quarters in which E-, D-, or sometimes even B-rated loans end up with higher non-performance rates than HR loans. The data is “chunkier” and more idiosyncratic from quarter to quarter, as opposed to over the whole sample size.

The relatively low ratings for the loans originated in Q2 2011 and later likely have to do with the fact that most of those loans have not reached full maturity, and are therefore not included in this analysis. Since the Prosper data is from 2014, only those loans with 12-month terms have matured; the 36- and 60-month-term loans have not reached maturity. It appears that 12-month-term loans perform somewhat better than their longer-term counterparts.

Reflection

In this analysis, I ran into difficulties from the sheer size of the data set. Not so much from the 114K observations, but from the 81 variables. There were so many different ways to slice and dice the data that it was easy to get lost at times. It also prompted me to ask some questions of the data that probably can’t be answered. For example, I wanted to reverse engineer the Prosper Score and Prosper Ratings algorithms, which I later learned wasn’t really possible.

I found success in creating sub-dataframes of aggregated variables, typically over time, using the dplyr framework. These were useful in getting the data into more manageable chunks for analysis and visualization.

I would like to see a more mature data set of the “new” Prosper–the era since they changed their business model in Q2 2009. There would be more information about loan performance because more of the loan cohorts will have closed. That will enable better analysis of the risks and factors that go into creating a profitable online lending business. Ideally, that data set would be through several economic and interest rate cycles to bring exogenous, macroeconomic factors into the mix, making for a richer understaning of loan performance in various scenarios. Obviously, it will take some time for Prosper to build up such an asset.

Udacity Data Analysis Project 4: Prosper Loan Data

David Smith

December 18, 2015