Prosper Loans Analysis

Adam Tolnay

Prosper Loans is a peer-to-peer crowdsourced loan company, which connects borrowers with willing investors, and takes a small fee. Peer-to-peer lending often attracts borrowers who are unable to qualify for traditional bank loans, so default rates have been historically high, but standards have since become stricter and the minimum credit score is now 640. Prosper was started in 2005, and the data set covers the time period 2005 to 2014, the first 9 years of the business. It contains 113,937 observations of 81 variables. I chose to investigate 11 of them:

Loan origination date: The date the loan originated.
Loan original amount: The origination amount of the loan.
Term: The length of the loan expressed in months.
Loan status: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue.
Borrower monthly income: The monthly income the borrower stated at the time the listing was created.
Credit score: The value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. Only lower and upper limits provided.
Debt to income ratio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is capped at 10.01.
Is borrower homeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
Delinquencies in the last 7 years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
Occupation: The Occupation selected by the Borrower at the time they created the listing.

##  [1] "LoanOriginationDate"     "LoanOriginalAmount"     
##  [3] "StatedMonthlyIncome"     "CreditScoreRangeLower"  
##  [5] "CreditScoreRangeUpper"   "Term"                   
##  [7] "LoanStatus"              "DebtToIncomeRatio"      
##  [9] "IsBorrowerHomeowner"     "DelinquenciesLast7Years"
## [11] "Occupation"
##       LoanOriginationDate LoanOriginalAmount StatedMonthlyIncome
##  1/22/2014 0:00 :   491   Min.   : 1000      Min.   :      0    
##  11/13/2013 0:00:   490   1st Qu.: 4000      1st Qu.:   3200    
##  2/19/2014 0:00 :   439   Median : 6500      Median :   4667    
##  10/16/2013 0:00:   434   Mean   : 8337      Mean   :   5608    
##  1/28/2014 0:00 :   339   3rd Qu.:12000      3rd Qu.:   6825    
##  9/24/2013 0:00 :   316   Max.   :35000      Max.   :1750003    
##  (Other)        :111428                                         
##  CreditScoreRangeLower CreditScoreRangeUpper      Term      
##  Min.   :  0.0         Min.   : 19.0         Min.   :12.00  
##  1st Qu.:660.0         1st Qu.:679.0         1st Qu.:36.00  
##  Median :680.0         Median :699.0         Median :36.00  
##  Mean   :685.6         Mean   :704.6         Mean   :40.83  
##  3rd Qu.:720.0         3rd Qu.:739.0         3rd Qu.:36.00  
##  Max.   :880.0         Max.   :899.0         Max.   :60.00  
##  NA's   :591           NA's   :591                          
##                  LoanStatus    DebtToIncomeRatio IsBorrowerHomeowner
##  Current              :56576   Min.   : 0.000    Mode :logical      
##  Completed            :38074   1st Qu.: 0.140    FALSE:56459        
##  Chargedoff           :11992   Median : 0.220    TRUE :57478        
##  Defaulted            : 5018   Mean   : 0.276                       
##  Past Due (1-15 days) :  806   3rd Qu.: 0.320                       
##  Past Due (31-60 days):  363   Max.   :10.010                       
##  (Other)              : 1108   NA's   :8554                         
##  DelinquenciesLast7Years                    Occupation   
##  Min.   : 0.000          Other                   :28617  
##  1st Qu.: 0.000          Professional            :13628  
##  Median : 0.000          Computer Programmer     : 4478  
##  Mean   : 4.155          Executive               : 4311  
##  3rd Qu.: 3.000          Teacher                 : 3759  
##  Max.   :99.000          Administrative Assistant: 3688  
##  NA's   :990             (Other)                 :55456

Univariate Plots:

# Extracting loan origination year from loan date
loan_data$LoanOriginationYear <- 
  as.numeric(format(as.Date(loan_data$LoanOriginationDate,'%m/%d/%Y'), '%Y'))

The number of loans dropped off shortly after the financial crisis, but then continued growing. The data did not include the full year for 2014, which is why the histogram drops again in 2014. A Google search showed that the business continued to grow year over year.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Mean loan amount is $8338, and the median is $6500. Prosper makes mostly small loans, and investors can decide whether to lend based on borrower information. Loans are capped at $35000, but even so, the distribution is skewed to the right. There are spikes at multiples of 5000, which are common loan amounts. The minimum loan is set at $1000.

# Creating a new variable YearlyIncome by multiplying monthly income by 12
loan_data$YearlyIncome <- loan_data$StatedMonthlyIncome * 12

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    38404    56000    67296    81900 21000035

The maximum reported yearly income was over 20 million, and there were several other outliers, but the majority of the data was between 0 and 300000 so the graph is zoomed in on this part. Mean income is $67284 and the median is $38400. The graph is skewed to the right as is expected of income distributions. There are small spikes at multiples of 10000, which are commonly reported incomes. Some borrowers are unemployed, and reported yearly incomes of 0.

An alternative is to log transform YearlyIncome, but this doesn’t reveal much more:

Credit scores:

# CreditScore is the average of the lower and upper limits for credit score given in the data set.
loan_data$CreditScore <- (loan_data$CreditScoreRangeUpper + loan_data$CreditScoreRangeLower) / 2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

The majority of borrowers have credit scores around 700. A minimum credit score of 600 was introduced in 2009, and this was later raised to 640 in 2014. Credit scores below 600 and 640, must have been from before 2009 and 2014 respectively. Credit scores were not available for 569 borrowers, and 133 borrowers had credit scores in the 0 to 19 range, which is not reasonable. These were replaced with NA. I will take a closer look at credit scores later.

# 9.5 is the average of 0 and 19
loan_data$CreditScore[loan_data$CreditScore == 9.5] <- NA

Moving on to term:

Only 1 year, 3 year, and 5 year loan terms are available. The most popular type is a 3 year loan, and 1 year loans are rare, which was surprising. In the bivariate section, we will see why this is the case.

There are about as many current loans as there have been loans in the past. This is impressive and indicates that the business is growing. However, there have been problems with a significant portion of loans. This is the graph I will be focusing on later in the analysis, by finding which variables are correlated with delinquency.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Mean debt to income ratio is .276, and the median is .22. The domain is limited to (0, 1). 800 people have a debt to income ratio higher than 1, and these are likely due to low incomes as opposed to high debt. The distribution is skewed to the right.

This graph shows that about half of borrowers are homeowners.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990

Most people have no delinquencies, but there are a few unbelievable outliers. It’s surprising that anyone has more than 10 delinquencies.

This graph shows the distribution of borrower occupations.

Univariate Summary

During univariate analysis, I created 3 new variables: LoanOriginationYear, YearlyIncome, and CreditScore. Prosper has been growing since its inception, with a dip in loans written after the financial crisis. The median income of borrowers is $56000, with a median loan amount of $6500, mostly for 3 year loans. Median credit scores were 689.5, however minimum credit scores of 600 and 640 were introduced in 2009 and 2014 respectively, so moving forward the median credit score might be higher.

Bivariate Plots

First, I created a scatterplot matrix to identify variables that would be interesting to investigate more closely.

None of the variables are highly correlated, and only a few of them have any significant correlation.

Pairs of variables that have higher correlations are:
Term and loan origination amount
Term and loan origination year
Loan origination amount and loan origination year
Credit score and loan origination year
Credit score and loan original amount
Credit score and delinquencies in the last 7 years

Here is a closer look at these pairs of variables.

It’s unclear which variable is causing this relationship, but nonetheless term and loan amount are positively correlated. Longer terms correspond to higher loan amounts. In fact, the 75th percentile of loan amounts for 1 year loans is well below the 25th percentile for 5 year loans. This may indicate the 1 year loans on Prosper are often used for smaller, more impulsive purchases.

The positive correlation coefficient of .34 between term and loan origination year might not indicate a meaningful relationship between the two variables. However, no loans of term 1 year and 5 years originated before 2010, so these options might have only been introduced in 2010. In fact, the data set contains almost all 3 year loans in 2010, and only 3 year loans prior to 2010:

Next, I looked at the correlation between loan year and loan original amount:

Median loan amounts have been increasing over the years. It also looks like a new minimum loan amount was introduced in 2011, and the maximum loan amount was raised from $25,000 to $35,000 in 2013. Median loan amounts also have a local maximum around the time of the financial crisis when it was cheap to borrow, and a local minimum in 2009 when interest rates were higher. The multivariate section will show that loans written in 2007 and 2008 defaulted at much higher rates than in other years.

It’s odd that credit scores are not continuous, but they are the averages of upper and lower limits. It’s unclear whether people with higher credit scores tend to borrow more, or people who want to borrow more must have a higher credit score to attract investors, but there is a meaningful positive correlation here. It could also be that higher incomes are correlated with higher credit scores. Let’s see.

The correlation coefficient is only .109. Income does not correlate with credit score as strongly as I thought. It’s interesting that credit scores increase with income up until about 100,000, but then level off. Maybe the relationship will be clearer with a scatterplot of delinquencies in the last 7 years and credit score.

Once again, this is not the graph I expected, and it’s not clear why it’s capped at 100 delinquencies. However, with the conditional means plotted, it’s clear that there’s a negative correlation. The correlation coefficient is -.263, which is still not as high as I would expect it to be. I find it hard to believe that the average credit score for someone with 100 delinquencies in the past 7 years is over 600, so that makes me skeptical of those points.

Next, I looked at how the creditworthiness of borrowers has evolved over time:

The data from 2005 is strange. No data points from 2005 have any credit score entries.

As with incomes, it looks like credit scores have increased over time, although they have leveled off and even decreased since 2009. The minimum credit score was raised to 600 in 2009, and raised again to 640 in 2014. It’s interesting that despite the stricter credit requirement in 2014, the median credit score dropped. I wonder why that is? Nonetheless, total loans made have been growing despite stricter requirements, so Prosper has been attracting many higher quality borrowers.

The data from 2005 looks inaccurate, with borrowers having a median income of 100,000. Going into Excel, there are only 22 data points from 2005, so I decide to drop them from the data set. Prosper loans was founded in 2005, so these might be test transactions, or they might be inaccurate.

# Dropping 22 data points from 2005
loan_data = loan_data[!loan_data$LoanOriginationYear == 2005,]

Besides 2005, it looks like the median yearly income of borrowers has been increasing over time, from about 45,000 per year in 2006 to 70,000 per year in 2014. Some of this can be accounted for by inflation, but using a CPI calculator, $45000 in 2006 would be about equal in purchasing power to $53000 in 2014. Therefore, it does look like inflation adjusted incomes of borrowers have been increasing.

I was also curious which professions had the lowest and highest average credit scores. Lowest:

## loan_data$Occupation
##         Student - Technical School        Student - Community College 
##                           640.7500                           641.6429 
##         Student - College Freshman           Student - College Senior 
##                           645.1098                           672.6915 
##                   Flight Attendant                  Military Enlisted 
##                           678.9309                           681.8585 
##                       Nurse's Aide                         Bus Driver 
##                           682.9012                           683.5506 
##                     Teacher's Aide Student - College Graduate Student 
##                           685.5870                           688.5204

Highest:

## loan_data$Occupation
##      Engineer - Electrical                  Principal 
##                   716.3800                   716.5513 
##                  Professor                    Dentist 
##                   717.4354                   718.6176 
## Pilot - Private/Commercial                   Attorney 
##                   718.6457                   718.7161 
##                   Investor                     Doctor 
##                   724.5467                   727.1113 
##                 Pharmacist                      Judge 
##                   729.6556                   734.9545

It looks like more established professionals have higher credit scores, at least within the Prosper data set.

Next, I created a new Boolean variable called DelinquentBorrowers, which is true if there was a problem with repayment of the loan. Delinquency is defined as someone who owes an overdue debt, so DelinquentBorrowers should be true for loans that are defaulted, charged off, or past due.

loan_data$DelinquentBorrower <- ifelse(loan_data$LoanStatus == "Defaulted" |
                                       loan_data$LoanStatus == "Chargedoff" |
                                       loan_data$LoanStatus == "Past Due (1-15 days)" |
                                       loan_data$LoanStatus == "Past Due (16-30 days)"|
                                       loan_data$LoanStatus == "Past Due (31-60 days)"|
                                       loan_data$LoanStatus == "Past Due (61-90 days)" |
                                       loan_data$LoanStatus == "Past Due (91-120 days)"|
                                       loan_data$LoanStatus == "Past Due (>120 days)",
                                       1, 0)

Delinquency rates have decreased significantly over time. This is partly because of higher minimum credit score requirements, and partly because recent loans are more likely to still be in good standing.

## loan_data$DelinquentBorrowerCat: Good Standing
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7000    8683   12500   35000 
## -------------------------------------------------------- 
## loan_data$DelinquentBorrowerCat: Delinquent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4500    6624    8800   35000

The shapes of the graphs look similar, with delinquent borrowers having a slightly higher density for small loans less than $5000. In fact the conditional medians are $7000 for borrowers in good standing, and only $4500 for delinquent borrowers. This is surprising, as the expected relationship would be that higher loans are more difficult to pay back. Perhaps this would be true if there were no lending limit, but with a $40000 limit, borrowers looking for a small, quick loan might be more likely to run into trouble paying it back.

## loan_data$DelinquentBorrowerCat: Good Standing
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    40000    59040    69835    85000 21000035 
## -------------------------------------------------------- 
## loan_data$DelinquentBorrowerCat: Delinquent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   30000   45000   54601   68000 2500000

The graphs look similar, but the median yearly income for a borrower in good standing is $59040, while for a delinquent borrower it is $45000.

I will try frequency polygons to get a better comparison of densities.

As expected, credit scores for borrowers in good standing are higher.

This graph shows that loans from 2007 and 2008 had significantly higher rates of delinquency than in other years. Most people who borrowed in 2013 and 2014 are still in good standing, but they have had less time to fall behind on their payments, so the relationship in later years is unclear.

Debt to income ratio is capped at 10.01, but this graph only shows ratios from 0 to 1. Many of the higher debt to income ratios are the result of people having low (or no) incomes, as opposed to a lot of debt. The graphs look very similar, although debt to income ratios for delinquent borrowers skew slightly more to the right. The relationship between debt to income ratios and delinquent borrowers will become more apparent in the multivariate section.

This graph shows percents, rather than counts. It looks like 3 year loans have the highest delinquency rates. However, from previous graphs I know that defaults peaked in 2007 and 2008, and 1 year and 5 year loans were not introduced until 2010. Many of the delinquent 3 year loans are from 2007 and 2008. which could explain this result.

Homeowners have slightly lower delinquency rates.

Finally, the top 10 least and most trustworthy professions, by proportion of borrowers in good standing.

Least trustworthy:

##  Student - Technical School Student - Community College 
##                   0.5625000                   0.5714286 
## Student - College Sophomore  Student - College Freshman 
##                   0.5797101                   0.6341463 
##                     Realtor                   Homemaker 
##                   0.6408840                   0.7000000 
##    Student - College Junior                    Clerical 
##                   0.7142857                   0.7534766 
##             Waiter/Waitress                             
##                   0.7637615                   0.7680875

Most trustworthy:

##                 Nurse (RN)                     Clergy 
##                  0.8951386                  0.8979592 
##                     Doctor                  Scientist 
##                  0.9109312                  0.9274194 
##           Military Officer Pilot - Private/Commercial 
##                  0.9306358                  0.9346734 
##               Psychologist                    Dentist 
##                  0.9379310                  0.9411765 
##                   Attorney                      Judge 
##                  0.9445507                  0.9545455

Bivariate Summary

Many of the variables I am investigating are not highly correlated. Credit scores and yearly incomes have been increasing over time, partly because minimum credit scores have increased from 600 starting in 2009, to 640 in 2014. Credit scores and yearly incomes are positively correlated, which explains why incomes have also been increasing over time.

In this section, I created a new Boulean variable called DelinquentBorrower, which is true if the borrower is late on payments, or his debt has been charged off or defaulted. 22 data points from 2005 were dropped because they seemed inaccurate.

Delinquency rates are correlated with credit scores, loan amounts, yearly incomes, term of the loan, home ownership, and loan origination year. The relationship delinquency rates have with debt to equity ratios is unclear from the bivariate plots. The higher rates of delinquency in 2007 and 2008, when only 3 year loans were available, calls into question the relationship with loan term. I am not too interested in the relationship delinquency rates have with loan origination year, because it has no predictive power.

Multivariate Plots

Debt to income ratios of delinquent borrowers have been significantly higher than their counterparts in good standing in every year. Although faceting a graph of debt to income ratios by delinquency status did not reveal much, this graph reveals that the debt to income ratio is significant. Here is a closer look:

Debt to income ratios for delinquent borrowers are slightly higher, but this chart does not reveal much more. One oddity is that the debt to income ratios of delinquent borrowers in 2014 seem significantly higher. However, most loans from 2014 are still in good standing, so the sample size is small.

Similarly, yearly income has been significantly higher for borrowers in good standing in every year.

Next, I performed a logit regression to classify delinquent borrowers. These are the original variables I investigated, and whether they will be included in the model:

Loan origination date: No. Loan origination year was correlated with delinquency rates, but it has no predictive power. Including it would better classify loans in the data set, but it can’t be used to make predictions. Economic data would be more interesting to include.
Loan original amount: Yes. It was surprising that loans for lower amounts corresponded to higher delinquency rates, but it does seem to have predictive power for the prosper loans data set.
Term: No. Only 3 year loans were offered before 2010, so it would skew the result.
Loan status: This is the dependent variable.
Borrower monthly income: Yes. Both the bivariate and multivariate plots showed that yearly income was correlated with delinquency rates.
Credit score: No. Credit scores are correlated with delinquency rates, but they are also highly correlated with the other variables.
Debt to income ratio: Yes. The multivariate plots confirmed that debt to income ratio correlated with delinquency rates.
Is borrower homeowner: Yes. Bivariate plots suggested that homeowners have slightly lower delinquency rates. Delinquencies in the last 7 years: Yes. Bivariate plots showed previous delinquencies were correlated with future delinquencies.
Occupation: No. Occupation is correlated with income, which has more predictive power. I also don’t want dozens of dummy variables in this particular model.

So I will include LoanOriginalAmount, YearlyIncome, DebtToIncomeRatio, IsBorrowerHomeowner, and DelinquenciesInLast7Years.

## 
## Call:
## glm(formula = DelinquentBorrowerCat ~ ., family = binomial(link = "logit"), 
##     data = loan_data[c("LoanOriginalAmount", "YearlyIncome", 
##         "DebtToIncomeRatio", "IsBorrowerHomeowner", "DelinquenciesLast7Years", 
##         "DelinquentBorrowerCat")])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9291  -0.6480  -0.5599  -0.4182   5.8443  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.010e+00  2.007e-02 -50.342   <2e-16 ***
## LoanOriginalAmount      -4.599e-05  1.761e-06 -26.109   <2e-16 ***
## YearlyIncome            -6.199e-06  2.878e-07 -21.541   <2e-16 ***
## DebtToIncomeRatio        1.395e-01  1.187e-02  11.750   <2e-16 ***
## IsBorrowerHomeownerTRUE  4.171e-03  1.779e-02   0.234    0.815    
## DelinquenciesLast7Years  1.419e-02  7.026e-04  20.201   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93104  on 104504  degrees of freedom
## Residual deviance: 90037  on 104499  degrees of freedom
##   (9410 observations deleted due to missingness)
## AIC: 90049
## 
## Number of Fisher Scoring iterations: 5

IsBorrowerHomeowner is not significant so I will remove it from the model.

## 
## Call:
## glm(formula = loan_data$DelinquentBorrowerCat ~ ., family = binomial(link = "logit"), 
##     data = loan_data[c("LoanOriginalAmount", "YearlyIncome", 
##         "DebtToIncomeRatio", "DelinquenciesLast7Years")])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9297  -0.6481  -0.5598  -0.4183   5.8379  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.009e+00  1.974e-02  -51.15   <2e-16 ***
## LoanOriginalAmount      -4.596e-05  1.756e-06  -26.17   <2e-16 ***
## YearlyIncome            -6.183e-06  2.795e-07  -22.12   <2e-16 ***
## DebtToIncomeRatio        1.396e-01  1.186e-02   11.77   <2e-16 ***
## DelinquenciesLast7Years  1.419e-02  7.017e-04   20.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93104  on 104504  degrees of freedom
## Residual deviance: 90037  on 104500  degrees of freedom
##   (9410 observations deleted due to missingness)
## AIC: 90047
## 
## Number of Fisher Scoring iterations: 5

The predicted probability of delinquency is decreasing in LoanOriginalAmount and YearlyIncome, and increasing in DebtToIncomeRatio and DelinquenciesLast7Years.

The predicted probability of default for someone who takes out a $10000 loan, makes $50000 a year, has a debt to income ratio of .2, and no delinquencies in the last 7 years is:

new.df <- data.frame(LoanOriginalAmount = 10000, 
                     YearlyIncome = 50000, 
                     DebtToIncomeRatio = .2, 
                     DelinquenciesLast7Years = 0)
predict(model, new.df, type = "response")
##         1 
## 0.1480198

Multivariate Summary

I did not find many interesting relationships using multivariate plots. However, multivariate plots confirmed that debt to income ratio and yearly income are correlated with delinquency status.

A logit model was able to capture relationships found during the analysis.

Final Plots

Yearly income, debt to income ratio, delinquencies last 7 years, and loan original amount explain some of the variation in delinquency status.

Reflection

The data set contained 81 variables, which was overwhelming at first. It took me a while to settle on which ones I would investigate, and I chose the ones that might have the most predictive power for defaults. However, this led to confirming rather obvious relationships between variables during my analysis.

I learned a lot and got a lot of practice with R, and I am interested in its other functionalities. I initially wanted to do things like reverse engineer credit scores, and create a very rigorous statistical model, but these turned out to be too ambitious.

Instead of fitting a binomial model, I considered fitting an ordinal model with 3 levels: “good standing”, “past due” and “defaulted or charged off.” However, the categories are not well defined: some of the “completed” or “current” loans could have been past due, or even charged off in the past, so I stuck with a simpler model. I also did not know what to do with current loans. They are included with loans that are in good standing, but many of them are recent. I considered dropping the years 2013 and 2014 from the data set, but decided against it.

I would be interested to see how closely my model’s predictions correlate with credit scores, and I wonder what other variables or techniques are used for credit scores. In the future I would be interested in adding more variables to a model, and using more data.