Prosper Loans is a peer-to-peer crowdsourced loan company, which connects borrowers with willing investors, and takes a small fee. Peer-to-peer lending often attracts borrowers who are unable to qualify for traditional bank loans, so default rates have been historically high, but standards have since become stricter and the minimum credit score is now 640. Prosper was started in 2005, and the data set covers the time period 2005 to 2014, the first 9 years of the business. It contains 113,937 observations of 81 variables. I chose to investigate 11 of them:
Loan origination date: The date the loan originated.
Loan original amount: The origination amount of the loan.
Term: The length of the loan expressed in months.
Loan status: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue.
Borrower monthly income: The monthly income the borrower stated at the time the listing was created.
Credit score: The value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. Only lower and upper limits provided.
Debt to income ratio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is capped at 10.01.
Is borrower homeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
Delinquencies in the last 7 years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
Occupation: The Occupation selected by the Borrower at the time they created the listing.
## [1] "LoanOriginationDate" "LoanOriginalAmount"
## [3] "StatedMonthlyIncome" "CreditScoreRangeLower"
## [5] "CreditScoreRangeUpper" "Term"
## [7] "LoanStatus" "DebtToIncomeRatio"
## [9] "IsBorrowerHomeowner" "DelinquenciesLast7Years"
## [11] "Occupation"
## LoanOriginationDate LoanOriginalAmount StatedMonthlyIncome
## 1/22/2014 0:00 : 491 Min. : 1000 Min. : 0
## 11/13/2013 0:00: 490 1st Qu.: 4000 1st Qu.: 3200
## 2/19/2014 0:00 : 439 Median : 6500 Median : 4667
## 10/16/2013 0:00: 434 Mean : 8337 Mean : 5608
## 1/28/2014 0:00 : 339 3rd Qu.:12000 3rd Qu.: 6825
## 9/24/2013 0:00 : 316 Max. :35000 Max. :1750003
## (Other) :111428
## CreditScoreRangeLower CreditScoreRangeUpper Term
## Min. : 0.0 Min. : 19.0 Min. :12.00
## 1st Qu.:660.0 1st Qu.:679.0 1st Qu.:36.00
## Median :680.0 Median :699.0 Median :36.00
## Mean :685.6 Mean :704.6 Mean :40.83
## 3rd Qu.:720.0 3rd Qu.:739.0 3rd Qu.:36.00
## Max. :880.0 Max. :899.0 Max. :60.00
## NA's :591 NA's :591
## LoanStatus DebtToIncomeRatio IsBorrowerHomeowner
## Current :56576 Min. : 0.000 Mode :logical
## Completed :38074 1st Qu.: 0.140 FALSE:56459
## Chargedoff :11992 Median : 0.220 TRUE :57478
## Defaulted : 5018 Mean : 0.276
## Past Due (1-15 days) : 806 3rd Qu.: 0.320
## Past Due (31-60 days): 363 Max. :10.010
## (Other) : 1108 NA's :8554
## DelinquenciesLast7Years Occupation
## Min. : 0.000 Other :28617
## 1st Qu.: 0.000 Professional :13628
## Median : 0.000 Computer Programmer : 4478
## Mean : 4.155 Executive : 4311
## 3rd Qu.: 3.000 Teacher : 3759
## Max. :99.000 Administrative Assistant: 3688
## NA's :990 (Other) :55456
# Extracting loan origination year from loan date
loan_data$LoanOriginationYear <-
as.numeric(format(as.Date(loan_data$LoanOriginationDate,'%m/%d/%Y'), '%Y'))
The number of loans dropped off shortly after the financial crisis, but then continued growing. The data did not include the full year for 2014, which is why the histogram drops again in 2014. A Google search showed that the business continued to grow year over year.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Mean loan amount is $8338, and the median is $6500. Prosper makes mostly small loans, and investors can decide whether to lend based on borrower information. Loans are capped at $35000, but even so, the distribution is skewed to the right. There are spikes at multiples of 5000, which are common loan amounts. The minimum loan is set at $1000.
# Creating a new variable YearlyIncome by multiplying monthly income by 12
loan_data$YearlyIncome <- loan_data$StatedMonthlyIncome * 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 38404 56000 67296 81900 21000035
The maximum reported yearly income was over 20 million, and there were several other outliers, but the majority of the data was between 0 and 300000 so the graph is zoomed in on this part. Mean income is $67284 and the median is $38400. The graph is skewed to the right as is expected of income distributions. There are small spikes at multiples of 10000, which are commonly reported incomes. Some borrowers are unemployed, and reported yearly incomes of 0.
An alternative is to log transform YearlyIncome, but this doesn’t reveal much more:
Credit scores:
# CreditScore is the average of the lower and upper limits for credit score given in the data set.
loan_data$CreditScore <- (loan_data$CreditScoreRangeUpper + loan_data$CreditScoreRangeLower) / 2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
The majority of borrowers have credit scores around 700. A minimum credit score of 600 was introduced in 2009, and this was later raised to 640 in 2014. Credit scores below 600 and 640, must have been from before 2009 and 2014 respectively. Credit scores were not available for 569 borrowers, and 133 borrowers had credit scores in the 0 to 19 range, which is not reasonable. These were replaced with NA. I will take a closer look at credit scores later.
# 9.5 is the average of 0 and 19
loan_data$CreditScore[loan_data$CreditScore == 9.5] <- NA
Moving on to term:
Only 1 year, 3 year, and 5 year loan terms are available. The most popular type is a 3 year loan, and 1 year loans are rare, which was surprising. In the bivariate section, we will see why this is the case.
There are about as many current loans as there have been loans in the past. This is impressive and indicates that the business is growing. However, there have been problems with a significant portion of loans. This is the graph I will be focusing on later in the analysis, by finding which variables are correlated with delinquency.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Mean debt to income ratio is .276, and the median is .22. The domain is limited to (0, 1). 800 people have a debt to income ratio higher than 1, and these are likely due to low incomes as opposed to high debt. The distribution is skewed to the right.
This graph shows that about half of borrowers are homeowners.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
Most people have no delinquencies, but there are a few unbelievable outliers. It’s surprising that anyone has more than 10 delinquencies.
This graph shows the distribution of borrower occupations.
During univariate analysis, I created 3 new variables: LoanOriginationYear, YearlyIncome, and CreditScore. Prosper has been growing since its inception, with a dip in loans written after the financial crisis. The median income of borrowers is $56000, with a median loan amount of $6500, mostly for 3 year loans. Median credit scores were 689.5, however minimum credit scores of 600 and 640 were introduced in 2009 and 2014 respectively, so moving forward the median credit score might be higher.
First, I created a scatterplot matrix to identify variables that would be interesting to investigate more closely.
None of the variables are highly correlated, and only a few of them have any significant correlation.
Pairs of variables that have higher correlations are:
Term and loan origination amount
Term and loan origination year
Loan origination amount and loan origination year
Credit score and loan origination year
Credit score and loan original amount
Credit score and delinquencies in the last 7 years
Here is a closer look at these pairs of variables.
It’s unclear which variable is causing this relationship, but nonetheless term and loan amount are positively correlated. Longer terms correspond to higher loan amounts. In fact, the 75th percentile of loan amounts for 1 year loans is well below the 25th percentile for 5 year loans. This may indicate the 1 year loans on Prosper are often used for smaller, more impulsive purchases.
The positive correlation coefficient of .34 between term and loan origination year might not indicate a meaningful relationship between the two variables. However, no loans of term 1 year and 5 years originated before 2010, so these options might have only been introduced in 2010. In fact, the data set contains almost all 3 year loans in 2010, and only 3 year loans prior to 2010:
Next, I looked at the correlation between loan year and loan original amount:
Median loan amounts have been increasing over the years. It also looks like a new minimum loan amount was introduced in 2011, and the maximum loan amount was raised from $25,000 to $35,000 in 2013. Median loan amounts also have a local maximum around the time of the financial crisis when it was cheap to borrow, and a local minimum in 2009 when interest rates were higher. The multivariate section will show that loans written in 2007 and 2008 defaulted at much higher rates than in other years.
It’s odd that credit scores are not continuous, but they are the averages of upper and lower limits. It’s unclear whether people with higher credit scores tend to borrow more, or people who want to borrow more must have a higher credit score to attract investors, but there is a meaningful positive correlation here. It could also be that higher incomes are correlated with higher credit scores. Let’s see.
The correlation coefficient is only .109. Income does not correlate with credit score as strongly as I thought. It’s interesting that credit scores increase with income up until about 100,000, but then level off. Maybe the relationship will be clearer with a scatterplot of delinquencies in the last 7 years and credit score.
Once again, this is not the graph I expected, and it’s not clear why it’s capped at 100 delinquencies. However, with the conditional means plotted, it’s clear that there’s a negative correlation. The correlation coefficient is -.263, which is still not as high as I would expect it to be. I find it hard to believe that the average credit score for someone with 100 delinquencies in the past 7 years is over 600, so that makes me skeptical of those points.
Next, I looked at how the creditworthiness of borrowers has evolved over time:
The data from 2005 is strange. No data points from 2005 have any credit score entries.
As with incomes, it looks like credit scores have increased over time, although they have leveled off and even decreased since 2009. The minimum credit score was raised to 600 in 2009, and raised again to 640 in 2014. It’s interesting that despite the stricter credit requirement in 2014, the median credit score dropped. I wonder why that is? Nonetheless, total loans made have been growing despite stricter requirements, so Prosper has been attracting many higher quality borrowers.
The data from 2005 looks inaccurate, with borrowers having a median income of 100,000. Going into Excel, there are only 22 data points from 2005, so I decide to drop them from the data set. Prosper loans was founded in 2005, so these might be test transactions, or they might be inaccurate.
# Dropping 22 data points from 2005
loan_data = loan_data[!loan_data$LoanOriginationYear == 2005,]
Besides 2005, it looks like the median yearly income of borrowers has been increasing over time, from about 45,000 per year in 2006 to 70,000 per year in 2014. Some of this can be accounted for by inflation, but using a CPI calculator, $45000 in 2006 would be about equal in purchasing power to $53000 in 2014. Therefore, it does look like inflation adjusted incomes of borrowers have been increasing.
I was also curious which professions had the lowest and highest average credit scores. Lowest:
## loan_data$Occupation
## Student - Technical School Student - Community College
## 640.7500 641.6429
## Student - College Freshman Student - College Senior
## 645.1098 672.6915
## Flight Attendant Military Enlisted
## 678.9309 681.8585
## Nurse's Aide Bus Driver
## 682.9012 683.5506
## Teacher's Aide Student - College Graduate Student
## 685.5870 688.5204
Highest:
## loan_data$Occupation
## Engineer - Electrical Principal
## 716.3800 716.5513
## Professor Dentist
## 717.4354 718.6176
## Pilot - Private/Commercial Attorney
## 718.6457 718.7161
## Investor Doctor
## 724.5467 727.1113
## Pharmacist Judge
## 729.6556 734.9545
It looks like more established professionals have higher credit scores, at least within the Prosper data set.
Next, I created a new Boolean variable called DelinquentBorrowers, which is true if there was a problem with repayment of the loan. Delinquency is defined as someone who owes an overdue debt, so DelinquentBorrowers should be true for loans that are defaulted, charged off, or past due.
loan_data$DelinquentBorrower <- ifelse(loan_data$LoanStatus == "Defaulted" |
loan_data$LoanStatus == "Chargedoff" |
loan_data$LoanStatus == "Past Due (1-15 days)" |
loan_data$LoanStatus == "Past Due (16-30 days)"|
loan_data$LoanStatus == "Past Due (31-60 days)"|
loan_data$LoanStatus == "Past Due (61-90 days)" |
loan_data$LoanStatus == "Past Due (91-120 days)"|
loan_data$LoanStatus == "Past Due (>120 days)",
1, 0)
Delinquency rates have decreased significantly over time. This is partly because of higher minimum credit score requirements, and partly because recent loans are more likely to still be in good standing.
## loan_data$DelinquentBorrowerCat: Good Standing
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7000 8683 12500 35000
## --------------------------------------------------------
## loan_data$DelinquentBorrowerCat: Delinquent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4500 6624 8800 35000
The shapes of the graphs look similar, with delinquent borrowers having a slightly higher density for small loans less than $5000. In fact the conditional medians are $7000 for borrowers in good standing, and only $4500 for delinquent borrowers. This is surprising, as the expected relationship would be that higher loans are more difficult to pay back. Perhaps this would be true if there were no lending limit, but with a $40000 limit, borrowers looking for a small, quick loan might be more likely to run into trouble paying it back.
## loan_data$DelinquentBorrowerCat: Good Standing
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 40000 59040 69835 85000 21000035
## --------------------------------------------------------
## loan_data$DelinquentBorrowerCat: Delinquent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 30000 45000 54601 68000 2500000
The graphs look similar, but the median yearly income for a borrower in good standing is $59040, while for a delinquent borrower it is $45000.
I will try frequency polygons to get a better comparison of densities.
As expected, credit scores for borrowers in good standing are higher.
This graph shows that loans from 2007 and 2008 had significantly higher rates of delinquency than in other years. Most people who borrowed in 2013 and 2014 are still in good standing, but they have had less time to fall behind on their payments, so the relationship in later years is unclear.
Debt to income ratio is capped at 10.01, but this graph only shows ratios from 0 to 1. Many of the higher debt to income ratios are the result of people having low (or no) incomes, as opposed to a lot of debt. The graphs look very similar, although debt to income ratios for delinquent borrowers skew slightly more to the right. The relationship between debt to income ratios and delinquent borrowers will become more apparent in the multivariate section.
This graph shows percents, rather than counts. It looks like 3 year loans have the highest delinquency rates. However, from previous graphs I know that defaults peaked in 2007 and 2008, and 1 year and 5 year loans were not introduced until 2010. Many of the delinquent 3 year loans are from 2007 and 2008. which could explain this result.
Homeowners have slightly lower delinquency rates.
Finally, the top 10 least and most trustworthy professions, by proportion of borrowers in good standing.
Least trustworthy:
## Student - Technical School Student - Community College
## 0.5625000 0.5714286
## Student - College Sophomore Student - College Freshman
## 0.5797101 0.6341463
## Realtor Homemaker
## 0.6408840 0.7000000
## Student - College Junior Clerical
## 0.7142857 0.7534766
## Waiter/Waitress
## 0.7637615 0.7680875
Most trustworthy:
## Nurse (RN) Clergy
## 0.8951386 0.8979592
## Doctor Scientist
## 0.9109312 0.9274194
## Military Officer Pilot - Private/Commercial
## 0.9306358 0.9346734
## Psychologist Dentist
## 0.9379310 0.9411765
## Attorney Judge
## 0.9445507 0.9545455
Many of the variables I am investigating are not highly correlated. Credit scores and yearly incomes have been increasing over time, partly because minimum credit scores have increased from 600 starting in 2009, to 640 in 2014. Credit scores and yearly incomes are positively correlated, which explains why incomes have also been increasing over time.
In this section, I created a new Boulean variable called DelinquentBorrower, which is true if the borrower is late on payments, or his debt has been charged off or defaulted. 22 data points from 2005 were dropped because they seemed inaccurate.
Delinquency rates are correlated with credit scores, loan amounts, yearly incomes, term of the loan, home ownership, and loan origination year. The relationship delinquency rates have with debt to equity ratios is unclear from the bivariate plots. The higher rates of delinquency in 2007 and 2008, when only 3 year loans were available, calls into question the relationship with loan term. I am not too interested in the relationship delinquency rates have with loan origination year, because it has no predictive power.
Debt to income ratios of delinquent borrowers have been significantly higher than their counterparts in good standing in every year. Although faceting a graph of debt to income ratios by delinquency status did not reveal much, this graph reveals that the debt to income ratio is significant. Here is a closer look:
Debt to income ratios for delinquent borrowers are slightly higher, but this chart does not reveal much more. One oddity is that the debt to income ratios of delinquent borrowers in 2014 seem significantly higher. However, most loans from 2014 are still in good standing, so the sample size is small.
Similarly, yearly income has been significantly higher for borrowers in good standing in every year.
Next, I performed a logit regression to classify delinquent borrowers. These are the original variables I investigated, and whether they will be included in the model:
Loan origination date: No. Loan origination year was correlated with delinquency rates, but it has no predictive power. Including it would better classify loans in the data set, but it can’t be used to make predictions. Economic data would be more interesting to include.
Loan original amount: Yes. It was surprising that loans for lower amounts corresponded to higher delinquency rates, but it does seem to have predictive power for the prosper loans data set.
Term: No. Only 3 year loans were offered before 2010, so it would skew the result.
Loan status: This is the dependent variable.
Borrower monthly income: Yes. Both the bivariate and multivariate plots showed that yearly income was correlated with delinquency rates.
Credit score: No. Credit scores are correlated with delinquency rates, but they are also highly correlated with the other variables.
Debt to income ratio: Yes. The multivariate plots confirmed that debt to income ratio correlated with delinquency rates.
Is borrower homeowner: Yes. Bivariate plots suggested that homeowners have slightly lower delinquency rates. Delinquencies in the last 7 years: Yes. Bivariate plots showed previous delinquencies were correlated with future delinquencies.
Occupation: No. Occupation is correlated with income, which has more predictive power. I also don’t want dozens of dummy variables in this particular model.
So I will include LoanOriginalAmount, YearlyIncome, DebtToIncomeRatio, IsBorrowerHomeowner, and DelinquenciesInLast7Years.
##
## Call:
## glm(formula = DelinquentBorrowerCat ~ ., family = binomial(link = "logit"),
## data = loan_data[c("LoanOriginalAmount", "YearlyIncome",
## "DebtToIncomeRatio", "IsBorrowerHomeowner", "DelinquenciesLast7Years",
## "DelinquentBorrowerCat")])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9291 -0.6480 -0.5599 -0.4182 5.8443
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.010e+00 2.007e-02 -50.342 <2e-16 ***
## LoanOriginalAmount -4.599e-05 1.761e-06 -26.109 <2e-16 ***
## YearlyIncome -6.199e-06 2.878e-07 -21.541 <2e-16 ***
## DebtToIncomeRatio 1.395e-01 1.187e-02 11.750 <2e-16 ***
## IsBorrowerHomeownerTRUE 4.171e-03 1.779e-02 0.234 0.815
## DelinquenciesLast7Years 1.419e-02 7.026e-04 20.201 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93104 on 104504 degrees of freedom
## Residual deviance: 90037 on 104499 degrees of freedom
## (9410 observations deleted due to missingness)
## AIC: 90049
##
## Number of Fisher Scoring iterations: 5
IsBorrowerHomeowner is not significant so I will remove it from the model.
##
## Call:
## glm(formula = loan_data$DelinquentBorrowerCat ~ ., family = binomial(link = "logit"),
## data = loan_data[c("LoanOriginalAmount", "YearlyIncome",
## "DebtToIncomeRatio", "DelinquenciesLast7Years")])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9297 -0.6481 -0.5598 -0.4183 5.8379
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.009e+00 1.974e-02 -51.15 <2e-16 ***
## LoanOriginalAmount -4.596e-05 1.756e-06 -26.17 <2e-16 ***
## YearlyIncome -6.183e-06 2.795e-07 -22.12 <2e-16 ***
## DebtToIncomeRatio 1.396e-01 1.186e-02 11.77 <2e-16 ***
## DelinquenciesLast7Years 1.419e-02 7.017e-04 20.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93104 on 104504 degrees of freedom
## Residual deviance: 90037 on 104500 degrees of freedom
## (9410 observations deleted due to missingness)
## AIC: 90047
##
## Number of Fisher Scoring iterations: 5
The predicted probability of delinquency is decreasing in LoanOriginalAmount and YearlyIncome, and increasing in DebtToIncomeRatio and DelinquenciesLast7Years.
The predicted probability of default for someone who takes out a $10000 loan, makes $50000 a year, has a debt to income ratio of .2, and no delinquencies in the last 7 years is:
new.df <- data.frame(LoanOriginalAmount = 10000,
YearlyIncome = 50000,
DebtToIncomeRatio = .2,
DelinquenciesLast7Years = 0)
predict(model, new.df, type = "response")
## 1
## 0.1480198
I did not find many interesting relationships using multivariate plots. However, multivariate plots confirmed that debt to income ratio and yearly income are correlated with delinquency status.
A logit model was able to capture relationships found during the analysis.
Yearly income, debt to income ratio, delinquencies last 7 years, and loan original amount explain some of the variation in delinquency status.
The data set contained 81 variables, which was overwhelming at first. It took me a while to settle on which ones I would investigate, and I chose the ones that might have the most predictive power for defaults. However, this led to confirming rather obvious relationships between variables during my analysis.
I learned a lot and got a lot of practice with R, and I am interested in its other functionalities. I initially wanted to do things like reverse engineer credit scores, and create a very rigorous statistical model, but these turned out to be too ambitious.
Instead of fitting a binomial model, I considered fitting an ordinal model with 3 levels: “good standing”, “past due” and “defaulted or charged off.” However, the categories are not well defined: some of the “completed” or “current” loans could have been past due, or even charged off in the past, so I stuck with a simpler model. I also did not know what to do with current loans. They are included with loans that are in good standing, but many of them are recent. I considered dropping the years 2013 and 2014 from the data set, but decided against it.
I would be interested to see how closely my model’s predictions correlate with credit scores, and I wonder what other variables or techniques are used for credit scores. In the future I would be interested in adding more variables to a model, and using more data.