Project 3 - Exploring Prosper Loan Data by Li Liang

I chose the Prosper loan data for project 3. I wanted to explore the P2P lender’s data to find insights related to the borrowers and the loans they broker. At quick glance, there are over 100k rows of loans and 81 variables. I initally choose way too many variables (30+) and through exploratory analysis, whittled it down to 20 variables and even created a few variables. Let’s begin.

##  ListingCategory..numeric. EmploymentStatusDuration CreditScoreRangeLower
##  1      :58308             Min.   :  0.00           Min.   :  0.0        
##  0      :16965             1st Qu.: 26.00           1st Qu.:660.0        
##  7      :10494             Median : 67.00           Median :680.0        
##  2      : 7433             Mean   : 96.07           Mean   :685.6        
##  3      : 7189             3rd Qu.:137.00           3rd Qu.:720.0        
##  6      : 2572             Max.   :755.00           Max.   :880.0        
##  (Other):10976             NA's   :7625             NA's   :591          
##  CreditScoreRangeUpper CurrentCreditLines DebtToIncomeRatio
##  Min.   : 19.0         Min.   : 0.00      Min.   : 0.000   
##  1st Qu.:679.0         1st Qu.: 7.00      1st Qu.: 0.140   
##  Median :699.0         Median :10.00      Median : 0.220   
##  Mean   :704.6         Mean   :10.32      Mean   : 0.276   
##  3rd Qu.:739.0         3rd Qu.:13.00      3rd Qu.: 0.320   
##  Max.   :899.0         Max.   :59.00      Max.   :10.010   
##  NA's   :591           NA's   :7604       NA's   :8554     
##  StatedMonthlyIncome BorrowerState                      Occupation   
##  Min.   :      0     CA     :14717   Other                   :28617  
##  1st Qu.:   3200     TX     : 6842   Professional            :13628  
##  Median :   4667     NY     : 6729   Computer Programmer     : 4478  
##  Mean   :   5608     FL     : 6720   Executive               : 4311  
##  3rd Qu.:   6825     IL     : 5921   Teacher                 : 3759  
##  Max.   :1750003            : 5515   Administrative Assistant: 3688  
##                      (Other):67493   (Other)                 :55456  
##       EmploymentStatus IsBorrowerHomeowner         IncomeRange   
##  Employed     :67322   False:56459         $25,000-49,999:32192  
##  Full-time    :26355   True :57478         $50,000-74,999:31050  
##  Self-employed: 6134                       $100,000+     :17337  
##  Not available: 5347                       $75,000-99,999:16916  
##  Other        : 3806                       Not displayed : 7741  
##               : 2255                       $1-24,999     : 7274  
##  (Other)      : 2718                       (Other)       : 1427  
##  IncomeVerifiable  BorrowerAPR       BorrowerRate    Term      
##  False:  8669     Min.   :0.00653   Min.   :0.0000   12: 1614  
##  True :105268     1st Qu.:0.15629   1st Qu.:0.1340   36:87778  
##                   Median :0.20976   Median :0.1840   60:24545  
##                   Mean   :0.21883   Mean   :0.1928             
##                   3rd Qu.:0.28381   3rd Qu.:0.2500             
##                   Max.   :0.51229   Max.   :0.4975             
##                   NA's   :25                                   
##                  LoanStatus    LoanOriginalAmount
##  Current              :56576   Min.   : 1000     
##  Completed            :38074   1st Qu.: 4000     
##  Chargedoff           :11992   Median : 6500     
##  Defaulted            : 5018   Mean   : 8337     
##  Past Due (1-15 days) :  806   3rd Qu.:12000     
##  Past Due (31-60 days):  363   Max.   :35000     
##  (Other)              : 1108                     
##           LoanOriginationDate MonthlyLoanPayment
##  2014-01-22 00:00:00:   491   Min.   :   0.0    
##  2013-11-13 00:00:00:   490   1st Qu.: 131.6    
##  2014-02-19 00:00:00:   439   Median : 217.7    
##  2013-10-16 00:00:00:   434   Mean   : 272.5    
##  2014-01-28 00:00:00:   339   3rd Qu.: 371.6    
##  2013-09-24 00:00:00:   316   Max.   :2251.5    
##  (Other)            :111428
## 'data.frame':    113937 obs. of  20 variables:
##  $ ListingCategory..numeric.: Factor w/ 21 levels "0","1","2","3",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ CurrentCreditLines       : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable         : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ Term                     : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ MonthlyLoanPayment       : num  330 319 123 321 564 ...

Univariate Plots Section

Loan amount exhibits positive skew, mean 8300 > median 6500.

Monthly loan payments exhibits positive skew, mean 272 > median 217.

Credit ranges are normally distributed with mean/med = 685/680. It seem they have common intervals and are bunched as such. Note: I only use CreditScoreRangeLower throughout this analysis. CreditScoreRangeHigher has very similar distribution but 20 points higher.

Interest rate are normally distributed mean/med = .19 However, there is a spike around .31. I’ll take a closer look in bivarate section.

DI ratio exhibits slight positive skew mean .27 > median .22. There are more people with DI < 0.5.

Monthly income exhibits positive skew mean 5600 > median 4666.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0    637.5   1062.0   1225.0   1622.0 171000.0     8554

I created the variable TotalMonthlyDebt to estimate the monthly debt at inception of loan by multiplying the DI ratio by the stated monthly income. The resulting distribution is positively skewed.

Employment duration exhibits positive skew mean 96 > median 67. It seems that more people that are employed shorter durations.

Current credit lines has a slight positive skew and the mean/med = 10.

I decomposed loan origination date using lubridate and plotted by month and year. I observed seasonality in the loans by month by during the holiday season. It is also interesting in the loans by year plot that the loans dropped off in 2009. I would guess that is a supply side shortage due to the Financial Crisis. What do you think? (Hint: I’m wrong.)

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Most loans in the whole data set are completed and current. Later I will break it down by year.

## [1] "# of Loans by Income Range"
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806
## [1] "# of Loans by Term (months)"
##    12    36    60 
##  1614 87778 24545

Income range is mostly in the middle from $25k-75k with mostly 3 year loans.

Graph of loan term by year. Dominated by 3 year loans.

## Source: local data frame [21 x 3]
## 
##    ListingCategory..numeric. DollarTotal CountTotal
## 1                          1   577736197      58308
## 2                          0   106096621      16965
## 3                          3    64175191       7189
## 4                          7    62035775      10494
## 5                          2    60148466       7433
## 6                          6    12861665       2572
## 7                          4    10913226       2395
## 8                         13    10549832       1996
## 9                         15     9929936       1522
## 10                        14     7684187        876
## ..                       ...         ...        ...

Debt consolidation by far is the leading loan category in number of loans and $ amount. But it seems there are a lot of loans not classified. More on this in the multivarate section.

Loan status broken down by year you can see more texture: most defaults occurred in 2006-2008.

I created the variable DefaultedPercent (# of defaulted loans / # total loans) to show default rates over each year. Prosper default rates seems to have decreased dramatically over the years.

I selected the top 5 states. California dominates this category.

Top 5 Occupations but data seems incomplete.

## [1] "Is the borrower a homeowner? "
## False  True 
## 56459 57478

Borrowers are mostly employed and about half are homeowners.

Univariate Analysis

What is the structure of your dataset?

There are 113,937 loans in the dataset. I choose with 20 features out of 81, listed below. 10 being continuous variables and 10 being discrete variables or factors.

Continuous Variables:

EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines DebtToIncomeRatio StatedMonthlyIncome BorrowerAPR BorrowerRate LoanOriginalAmount MonthlyLoanPayment

Discrete Variables:

BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner IncomeRange IncomeVerifiable Term ListingCategory LoanStatus LoanOriginationDate

Most loans have terms of 3 years and are under $10k (with the median being 6500). Most have monthly payments of usually < $500 (median of $217) and paying the median interest rate of 18%. Out of 17 categories of loans, the majority of the loans are classified under debt consolidation (51%) Borrowers tend to have credit scores in the IQR of 660-720 (using creditscorerangelower). The borrower median monthly income is 4667, 28% earns $25-50k and 27% earns $50-75K. 88% are employed in some way (employed, full-time or self-employed).

What is/are the main feature(s) of interest in your dataset?

There are a number of features that will give color to the profile of the loan and borrower. I think the main ones for borrowers are credit score, debt to income ratio, income and the ones for the loans are loan amount, interest rate, and term.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Loan category and status can help determine how the loans are being used and what loans are current, defaulted, delinquent etc.

Did you create any new variables from existing variables in the dataset?

I created TotalMonthlyDebt to estimate the total monthly debt at inception of loan by multiplying the DI ratio by the stated monthly income. The distribution is positively skewed. I broke out the loan origination date in LoanYear, LoanMonth using lubridate. This really helped with yearly analysis and seasonality. I also created DefaultedPercent to show the default rates over each year.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I manually set two numeric categorical variables as a factor so when I plotted them it would look discrete: Term and ListingCategory..numeric.

Bivariate Plots Section

Correlation matrix of all variables discrete and continuous. Next few scatterplots will explore the stronger relationships in greater depth.

Positive correlation R^2 of .47. You have more debt as you increase the # of credit lines or vice-versa.

Negative correlation R^2 of -.46. Makes sense the interest rate is lower for more creditworthy borrowers.

Monthly income and total monthly debt has a positive relationship of R^2 of .36.

Slight negative correlation R^2 of -.33. Seems counterintuitive, lower interest rate on larger loans.

Slight positive correlation R^2 of -.33. The larger the loan, the higher the credit score.

I plotted total monthly debt vs monthly income facet wrapped by IsBorrowerHomeowner. The homeowners have a higher dispersion whereas the non-homeowners are more concentrated under $5000 income.

A closer look at the spike in BorrowerRate

Taking a closer look at spike in the histogram of BorrowerRate, these high interest loans were made mainly for debt consolidation in 2011-2012.

And the credit score ranges are shown, the higher the borrower rate (cost) the worse the credit rating.

Mean loans amounts dips in 2009 and increased in 2013.

For comparison, aggregate loans amounts dips in 2009 and increases sharply in 2013, from $150M to $300M.

Investigating further, I graphed every loan by date, it shows a gap in service between November 2008 and July 2009. After doing some research on Wikipedia, there was a cease and desist from the SEC regarding the P2P lending model. The data also only goes until March 2014.

Shows the seasonality of increased average loan amounts and variance during the holiday seasons and a pull back during the spring and summer months.

  1. Debt consolidation and 8. Baby/Adoption are the category with the top amounts. ~$9k on average.

Employed and fulltime on average take out larger loans.

Larger incomes take out larger loans on average.

As the term is longer, the loan amounts are larger.

But not necessarily more expensive. The mean interest rate moves up slightly at the 5 year term and the variance is smaller.

5 year borrowers seem to be more credit-worthy on average and also the variance is smaller.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • There’s a strong positive relationship between open credit lines and total monthly debt.

  • There’s a negative relationship between interest rates and loan amount, the larger the loan, the lower the rate on average. That was mostly due to them having higher credit scores.

  • Monthly income and monthly debt have a positive relationship and levels off after a certain point.

  • Average loan amounts and variance are seasonal (higher during holidays)

  • People who are employed and fulltime on average take out larger loans than other groups.

  • There is a strong positive relationship between term and loan amount (longer the term, the larger the loan)

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I observed that in 2008-2009 there was a large dip in loan origination. After further investigation, it was related to a cease and desist order by the SEC regarding the nature of the P2P lending model. But after they opened again, there has been impressive growth topping out at $300 million in loans in 2013 and close to $150 million by March of 2014 (partial year)

What was the strongest relationship you found?

The strongest relationship I found was between credit score and interest rate, with R^2 .46. This makes sense since credit score is a rating of the credit-worthiness of the borrower and that should be directly related to the cost of borrowing (interest rate) Also, between open credit lines and total monthly deb with R^2 of .47.

Multivariate Plots Section

The borrowers with high credit scores are on the bottom right. They generally have lower interest rates and larger loan amounts. I subsetted only credit scores from 660 (1st quartile) and above for better visual presentation.

I normalized the loan amounts by multiplying by their credit score and dividing by 850 (the best achievable score). The results do not differ to much from just using the loan amounts.

In this plot, I divided MonthlyLoanPayment by StatedMonthlyIncome see what the debt coverage ratio would be by year and faceted by IncomeRange. It seems most of the borrowers with lower income ranges take on larger loans and all income ranges have an uptick in the most recent years.

Comparing DI ratio, most of the borrowers seem to have DI ratio close to 20-30% with a uptick in the most recent years. The $100k + income range have noticeably lower DI ratio at around 15-20%. The variance is reduced as incomes increase.

Looking at Income Ranges it looks like the uptick in loans in 2012-2013 grew primarily in $50-75k, $75-100k and $100k+.

I bucketed categories by year and you can clearly see that it looks like pre-2008 they did not capture the loan category and debt consolidation dominated the cateogory every year subsequently.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When looking at the loan amounts vs cost (interest rate), the credit scores demarcated borrowers by credit worthiness. I looked at loan amounts (normalized by credit score) vs income ranges and saw that the higher the income, the larger the loans amounts on average.

Were there any interesting or surprising interactions between features?

Investigating further, I looked at DI ratio and debt coverage ratio and observed the higher the income, the lower the percentage of debt. Additionally, it appears the platform after a few years of maturity is beginning to lend to slight more indebted borrowers in the income ranges of $50-75k, $75-100k and $100k+. Looking at the debt categories, it seemed like they didn’t collect this information before 2007. But the recent uptick of lending is mostly in debt consolidation.


Final Plots and Summary

Plot One

Description One

The borrowers with high credit scores are in red on the left. They generally have lower interest rates and larger loan amounts. I subsetted only credit scores from 660 (1st quartile) and above for better visual presentation. And this shows that as the lending platform matured, the overal risk exposure increased. In 2014, much more blue (credit score ~700) borrowers.

Plot Two

Description Two

Looking at DI (debt to income) ratio, it illustrates a picture of increased loan sizing, especially at lower borrowing rates (2013-2014) as the platform matured and risk managment improves, the default rates were decreasing (graphed previously).

Plot Three

Description Three

I subsetted only deliquent loans (total count: 2067) to look at their at exposure to more risky (delinquent) loans. It appears the majority of their delinquencies are clustered in 2012-2013, 1-15days past due and the borrowers have income ranges $25k-50k and $50-75k.


Reflection

The data set had nearly 114,000 loans from Nov 2005 - March 2014. Over the course of those years, Prosper has made almost $1 trillion dollars in loans ($949,894,347 to be exact). The difficulties I had with the data mainly stemmed from understanding the variables and then selecting the appropriate ones to analyze. I had to do an initial exploration of the data and even then I probably selected a bit too many. I initally wanted to create a model to see if I could model loans likely to be delinquent or default but I realize solving classification isn’t until the machine learning course. I would love to see if a model could be learned and created. From the data I selected, I saw them struggle initially with their first few years of loans. However, their business grew very quickly after the cease and desist by the SEC in 2009. Their main loan exposures are in debt consolidation and I’m sure the low interest rate environment has helped the supply of lenders looking for yield. This was a great learning experience about the P2P lending model and I’m glad Udacity gave me the opportunity access to this data.