Prosper Loans is a peer-to-peer crowdsourced loan company that matches borrowers with willing investors. The data set contains 113,937 observations of 81 variables, and 10 are selected for analysis. After exploratory data analysis, a regression is run to predict the probability of default for new borrowers.
Variables:
Loan origination date: The date the loan originated.
Loan original amount: The amount of the loan.
Term: The term of the loan expressed in months.
Loan status: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, or PastDue.
Borrower monthly income: The monthly income the borrower stated at the time the listing was created.
Credit score: The value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. Only lower and upper limits provided.
Debt to income ratio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is capped at 10.01.
IsBorrowerHomeowner: A borrower is classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
Delinquencies in the last 7 years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
## [1] "LoanOriginationDate" "LoanOriginalAmount"
## [3] "StatedMonthlyIncome" "CreditScoreRangeLower"
## [5] "CreditScoreRangeUpper" "Term"
## [7] "LoanStatus" "DebtToIncomeRatio"
## [9] "IsBorrowerHomeowner" "DelinquenciesLast7Years"
## LoanOriginationDate LoanOriginalAmount StatedMonthlyIncome
## 1/22/2014 0:00 : 491 Min. : 1000 Min. : 0
## 11/13/2013 0:00: 490 1st Qu.: 4000 1st Qu.: 3200
## 2/19/2014 0:00 : 439 Median : 6500 Median : 4667
## 10/16/2013 0:00: 434 Mean : 8337 Mean : 5608
## 1/28/2014 0:00 : 339 3rd Qu.:12000 3rd Qu.: 6825
## 9/24/2013 0:00 : 316 Max. :35000 Max. :1750003
## (Other) :111428
## CreditScoreRangeLower CreditScoreRangeUpper Term
## Min. : 0.0 Min. : 19.0 Min. :12.00
## 1st Qu.:660.0 1st Qu.:679.0 1st Qu.:36.00
## Median :680.0 Median :699.0 Median :36.00
## Mean :685.6 Mean :704.6 Mean :40.83
## 3rd Qu.:720.0 3rd Qu.:739.0 3rd Qu.:36.00
## Max. :880.0 Max. :899.0 Max. :60.00
## NA's :591 NA's :591
## LoanStatus DebtToIncomeRatio IsBorrowerHomeowner
## Current :56576 Min. : 0.000 Mode :logical
## Completed :38074 1st Qu.: 0.140 FALSE:56459
## Chargedoff :11992 Median : 0.220 TRUE :57478
## Defaulted : 5018 Mean : 0.276
## Past Due (1-15 days) : 806 3rd Qu.: 0.320
## Past Due (31-60 days): 363 Max. :10.010
## (Other) : 1108 NA's :8554
## DelinquenciesLast7Years
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 4.155
## 3rd Qu.: 3.000
## Max. :99.000
## NA's :990
# Extracting loan origination year from loan date
loan_data$LoanOriginationYear <-
  as.numeric(format(as.Date(loan_data$LoanOriginationDate, format = '%m/%d/%Y'), '%Y'))
The data set covers the time period 2005 to 2014. Prosper was created in 2005, so the first 10 years of operation are covered.
The number of loans dropped off after the financial crisis, but then continued to grow. The data did not include the full year for 2014, so the histogram drops again in 2014. A Google search showed that the business continued to grow year over year.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Prosper makes mostly small loans: the mean loan amount is $8337, and the median is $6500. The minimum loan is set at $1000, and loans are capped at $35000, but even so the distribution is skewed to the right. There are spikes at multiples of 5000, which are common loan amounts.
# Creating a new variable YearlyIncome
loan_data$YearlyIncome <- loan_data$StatedMonthlyIncome * 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 38404 56000 67296 81900 21000035
The maximum reported yearly income is over 20 million, and there are several other outliers, but the majority of incomes are between 0 and 300000, so the graph is zoomed in on this range. Mean income is $67296 and the median is $56000. The distribution is skewed to the right, as is expected of income distributions. There are small spikes at multiples of 10000, which are commonly reported incomes, and some borrowers are unemployed, reporting yearly incomes of 0.
An alternative is to log transform YearlyIncome, but this doesn’t reveal much more:
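A minimal sketch of such a transform is shown here. The `+ 1` offset and the toy income vector are assumptions made for illustration, since the handling of zero incomes in the original plot is not shown.

```r
# Toy incomes (made up for illustration), including a zero and an extreme outlier.
yearly_income <- c(0, 38400, 56000, 81900, 21000035)

# log10(0) is -Inf, so add 1 before transforming to keep zero incomes finite.
log_income <- log10(yearly_income + 1)

print(round(log_income, 2))
```

The offset keeps unemployed borrowers in the plot instead of dropping them as infinite values.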
Credit scores:
# CreditScore is the average of the lower and upper limits for credit score
loan_data$CreditScore <- (loan_data$CreditScoreRangeUpper + loan_data$CreditScoreRangeLower) / 2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
The majority of borrowers have credit scores around 700. Peer-to-peer lending sometimes attracts borrowers who are unable to qualify for traditional bank loans, so default rates tend to be higher, but Prosper Loans has introduced stricter credit requirements to counteract this. The minimum credit score was set at 600 in 2009, and raised to 640 in 2014. Credit scores below 600 and 640 must therefore date from before 2009 and 2014 respectively. Credit scores are not available for 591 borrowers, and 133 borrowers have credit scores in the 0 to 19 range, which is not reasonable. These are replaced with NA.
# 9.5 is the average of 0 and 19
loan_data$CreditScore[loan_data$CreditScore == 9.5] <- NA
Moving on to term:
Only 1 year, 3 year, and 5 year loan terms are available. 3 year loans are the most popular, and 1 year loans are rare.
There are about as many current loans as there have been loans in the past, which is impressive and indicates that the business is growing. However, there have been problems with a significant portion of loans.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Mean debt to income ratio is .276, and the median is .22. The plot is limited to ratios between 0 and 1, although 800 people have a debt to income ratio higher than 1, mostly due to low incomes as opposed to high debt. The distribution is skewed to the right.
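The count of borrowers with a ratio above 1 could be checked along these lines. This is a sketch on a made-up vector standing in for `loan_data$DebtToIncomeRatio`; only the use of `na.rm = TRUE` matters here, since the real column contains 8554 missing values.

```r
# Toy debt to income ratios (made up); NA marks a missing value.
dti <- c(0.14, 0.22, NA, 0.32, 1.5, 10.01)

# Comparisons with NA yield NA, so drop them before counting.
n_above_one <- sum(dti > 1, na.rm = TRUE)
share_above_one <- mean(dti > 1, na.rm = TRUE)

print(n_above_one)
print(share_above_one)
```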
About half of borrowers are homeowners.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
Most people have no delinquencies, but there are a few outliers. It’s surprising that anyone has more than 10 delinquencies.
During univariate analysis, 3 new variables were created: LoanOriginationYear, YearlyIncome, and CreditScore. Prosper has been growing since its inception, with loans dipping only after the financial crisis. The median income of borrowers is $56000, and the median loan amount is $6500, mostly for 3 year loans. The median credit score was 689.5. However, minimum credit scores of 600 and 640 were introduced in 2009 and 2014 respectively, so credit scores might be higher going forward.
First, a scatterplot matrix is created to identify pairs of variables that might be interesting to investigate more closely.
Pairs of variables with higher correlations are:
Term and loan original amount
Term and loan origination year
Loan original amount and loan origination year
Credit score and loan origination year
Credit score and loan original amount
Credit score and delinquencies in the last 7 years
A closer look at these pairs of variables follows.
Term and loan amount are positively correlated. The 75th percentile of loan amount for 1 year loans is well below the 25th percentile for 5 year loans. This could indicate that 1 year loans on Prosper are used for smaller, more impulsive purchases.
The positive correlation coefficient of .34 between term and loan origination year is unlikely to indicate a meaningful relationship between the two variables. However, no 1 year or 5 year loans originated before 2010, so these options might have been introduced in 2010. In fact, the data set contains almost exclusively 3 year loans in 2010, and only 3 year loans prior to 2010:
Loan year and loan original amount:
Median loan amounts have been increasing over the years. A new minimum loan amount of $2000 was introduced in 2011, and the maximum loan amount was raised from $25,000 to $35,000 in 2013. Loan amounts have a local maximum around the time of the financial crisis. The multivariate section will show that loans written in 2007 and 2008 defaulted at higher rates than in other years.
There is a positive correlation between credit scores and loan amount. It’s unclear whether people with higher credit scores tend to borrow more, or people who want to borrow more need higher credit scores to attract investors.
With the conditional means plotted, it’s clear that credit score and delinquencies in the last 7 years are negatively correlated. The correlation coefficient is -.263. It’s surprising that the average credit score for someone with nearly 100 delinquencies is over 600.
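Conditional means and a correlation coefficient of this kind can be computed as sketched here. The data frame is made up for illustration, and `aggregate` and `cor` are one reasonable choice rather than necessarily the calls used in the original analysis.

```r
# Toy data (made up): credit score tends to fall as delinquencies rise.
toy <- data.frame(
  DelinquenciesLast7Years = c(0, 0, 1, 1, 5, 5, 20, NA),
  CreditScore             = c(760, 740, 720, 700, 680, 660, 640, 700)
)

# Mean credit score at each delinquency count (rows with NA are dropped).
cond_means <- aggregate(CreditScore ~ DelinquenciesLast7Years,
                        data = toy, FUN = mean)

# Pearson correlation using only rows where both values are present.
r <- cor(toy$DelinquenciesLast7Years, toy$CreditScore, use = "complete.obs")

print(cond_means)
print(r)
```

Conditional means at high delinquency counts are based on very few borrowers, which is worth keeping in mind when reading the right-hand end of such a plot.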
Next, I looked at how the creditworthiness of borrowers has evolved over time:
The data from 2005 is strange. No data points from 2005 have any credit score entries.
As with incomes, it looks like credit scores have increased over time, although they have leveled off and even decreased since 2009. The minimum credit score was raised to 600 in 2009, and raised again to 640 in 2014. It’s interesting that despite the stricter credit requirement in 2014, median credit scores dropped. Nonetheless, total loans made have been growing despite stricter requirements, so Prosper has been able to attract higher quality borrowers.
The data from 2005 looks inaccurate, with borrowers having a median income of 100,000. Prosper Loans was founded in 2005 and there are only 22 data points from that year, so they might be test transactions. They are dropped from the data set.
# Dropping 22 data points from 2005
loan_data <- loan_data[loan_data$LoanOriginationYear != 2005, ]
Besides 2005, it looks like the median yearly income of borrowers has been increasing over time, from about 45,000 per year in 2006 to 70,000 per year in 2014. Some of this can be accounted for by inflation, but using a CPI calculator, $45000 in 2006 would be about equal in purchasing power to $53000 in 2014. Therefore, inflation adjusted incomes of borrowers have been increasing.
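The inflation adjustment can be reproduced with a short calculation. The CPI index values used here are approximate annual averages and are an assumption of this sketch; they should be checked against the official series before the result is relied on.

```r
# Approximate annual-average CPI-U index values (assumed for illustration).
cpi_2006 <- 201.6
cpi_2014 <- 236.7

# Median borrower incomes read off the plot, as quoted in the text.
income_2006 <- 45000
income_2014 <- 70000

# Express the 2006 income in 2014 dollars.
income_2006_in_2014_dollars <- income_2006 * cpi_2014 / cpi_2006

# Real growth of the median income between 2006 and 2014.
real_growth <- income_2014 / income_2006_in_2014_dollars - 1

print(round(income_2006_in_2014_dollars))
print(round(real_growth, 2))
```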
Next, a new Boolean variable called DelinquentBorrower is created, which should be true if there has been a problem with repayment of the loan.
There are some problems with defining delinquency in this data set, as the categories are not well defined. Only the current status of the loan is given: some of the “completed” or “current” loans could have been past due or even charged off. Additional data on loan history would make the Boolean variable more accurate, but current status will have to do. Furthermore, current loans are included with loans that are in good standing even though they may become delinquent before being repaid. I decided to go with a simple definition of delinquency:
delinquent_statuses <- c("Defaulted", "Chargedoff",
                         "Past Due (1-15 days)", "Past Due (16-30 days)",
                         "Past Due (31-60 days)", "Past Due (61-90 days)",
                         "Past Due (91-120 days)", "Past Due (>120 days)")
# 1 if the current loan status indicates a repayment problem, 0 otherwise
loan_data$DelinquentBorrower <- ifelse(loan_data$LoanStatus %in% delinquent_statuses, 1, 0)
Delinquency rates have decreased significantly over time. This is likely due in part to higher minimum credit score requirements, and in part because recent loans have had less time to become delinquent.
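A sketch of how yearly delinquency rates could be computed from the 0/1 flag follows. The data frame is made up, and the labelled factor mirrors the `DelinquentBorrowerCat` column that appears in later output; its exact construction is not shown in the report, so the version here is an assumption.

```r
# Toy data (made up) standing in for loan_data.
toy <- data.frame(
  LoanOriginationYear = c(2007, 2007, 2008, 2008, 2013, 2013, 2013, 2013),
  DelinquentBorrower  = c(1, 1, 1, 0, 0, 0, 1, 0)
)

# Labelled factor version of the flag (assumed construction).
toy$DelinquentBorrowerCat <- factor(toy$DelinquentBorrower,
                                    levels = c(0, 1),
                                    labels = c("Good Standing", "Delinquent"))

# The mean of a 0/1 flag within each year is that year's delinquency rate.
rate_by_year <- tapply(toy$DelinquentBorrower, toy$LoanOriginationYear, mean)

print(rate_by_year)
```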
## loan_data$DelinquentBorrowerCat: Good Standing
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7000 8683 12500 35000
## --------------------------------------------------------
## loan_data$DelinquentBorrowerCat: Delinquent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4500 6624 8800 35000
The graphs look similar, with delinquent borrowers having a slightly higher density for small loans less than $5000. In fact the conditional medians are $7000 for borrowers in good standing, and only $4500 for delinquent borrowers. This is surprising, as the expected relationship would be that higher loans are more difficult to pay back. Perhaps this would be true if there were no lending limit, but with a $35000 limit, borrowers looking for a small, quick loan might be more likely to run into trouble paying it back.
## loan_data$DelinquentBorrowerCat: Good Standing
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 40000 59040 69835 85000 21000035
## --------------------------------------------------------
## loan_data$DelinquentBorrowerCat: Delinquent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 30000 45000 54601 68000 2500000
The graphs look similar, but the median yearly income for a borrower in good standing is $59040, while for a delinquent borrower it is $45000.
Loans originating in 2007 and 2008 had significantly higher rates of delinquency than in other years. Most people who borrowed in 2013 and 2014 are still in good standing, but they have had less time to fall behind on their payments, so the relationship in later years is unclear.
Debt to income ratio is capped at 10.01, but this graph only shows ratios from 0 to 1. The graphs look similar, although debt to income ratios for delinquent borrowers skew slightly more to the right.
This graph shows proportions rather than counts. 3 year loans have the highest delinquency rates. However, previous graphs show that defaults peaked in 2007 and 2008, and 1 year and 5 year loans were not introduced until 2010. Many of the delinquent 3 year loans are from 2007 and 2008, which would explain this result.
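One way to check whether the term effect is driven by the 2007 and 2008 cohorts is to compare delinquency proportions by term over all years with the same proportions restricted to years in which every term was offered. The sketch below uses made-up data, and the 2010 cutoff follows the observation above about when 1 year and 5 year loans first appear.

```r
# Toy data (made up) standing in for loan_data.
toy <- data.frame(
  Term                = c(36, 36, 36, 36, 12, 36, 60, 60),
  LoanOriginationYear = c(2007, 2007, 2008, 2008, 2011, 2011, 2012, 2012),
  DelinquentBorrower  = c(1, 1, 1, 0, 0, 0, 1, 0)
)

# Row-wise proportions: share of loans in each status within each term.
overall <- prop.table(table(toy$Term, toy$DelinquentBorrower), margin = 1)

# Same comparison restricted to years when all three terms were offered.
recent <- toy[toy$LoanOriginationYear >= 2010, ]
recent_rates <- tapply(recent$DelinquentBorrower, recent$Term, mean)

print(overall)
print(recent_rates)
```

If the ranking of terms changes once the earlier years are excluded, the apparent term effect is more likely a cohort effect.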
Homeowners have slightly lower delinquency rates.
The 22 data points from 2005 were dropped, and a new Boolean variable called DelinquentBorrower was created, defined to be true if the current status of the loan is delinquent.
Credit scores and yearly incomes have increased over time, likely in part because minimum credit scores have been increasing. Delinquency rates are correlated with credit scores, loan amounts, yearly incomes, term of the loan, home ownership, and loan origination year. The relationship between delinquency rates and debt to income ratios is unclear from the bivariate plots. The higher rates of delinquency in 2007 and 2008, when only 3 year loans were available, call into question the relationship with loan term.
Debt to income ratios of delinquent borrowers have been significantly higher than their counterparts in good standing in every year. Although the bivariate section did not show a clear relationship between debt to income ratios and loan standing, this graph reveals that debt to income ratio is significant.
Similarly, yearly income has been higher for borrowers in good standing in every year.
Next, a logit regression is used to predict delinquent borrowers. The reasoning for including each variable in the model is explained here:
Loan origination date: No. Loan origination year was correlated with delinquency rates, but has no predictive power for new borrowers.
Loan original amount: Yes. Surprisingly, smaller loans were more likely to default.
Term: No. Only 3 year loans were offered before 2010, so it would skew the result.
Borrower monthly income: Yes. Both the bivariate and multivariate plots showed that income was correlated with delinquency rates.
Credit score: No. Credit scores are correlated with delinquency rates, but they are also highly correlated with the other variables.
Debt to income ratio: Yes. The multivariate plots confirmed that debt to income ratio is correlated with delinquency.
Is borrower homeowner: Yes. Bivariate plots suggested that homeowners have slightly lower delinquency rates.
Delinquencies in the last 7 years: Yes. Bivariate plots showed previous delinquencies are correlated with future delinquencies.
##
## Call:
## glm(formula = DelinquentBorrowerCat ~ ., family = binomial(link = "logit"),
## data = loan_data[c("LoanOriginalAmount", "YearlyIncome",
## "DebtToIncomeRatio", "IsBorrowerHomeowner", "DelinquenciesLast7Years",
## "DelinquentBorrowerCat")])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9291 -0.6480 -0.5599 -0.4182 5.8443
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.010e+00 2.007e-02 -50.342 <2e-16 ***
## LoanOriginalAmount -4.599e-05 1.761e-06 -26.109 <2e-16 ***
## YearlyIncome -6.199e-06 2.878e-07 -21.541 <2e-16 ***
## DebtToIncomeRatio 1.395e-01 1.187e-02 11.750 <2e-16 ***
## IsBorrowerHomeownerTRUE 4.171e-03 1.779e-02 0.234 0.815
## DelinquenciesLast7Years 1.419e-02 7.026e-04 20.201 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93104 on 104504 degrees of freedom
## Residual deviance: 90037 on 104499 degrees of freedom
## (9410 observations deleted due to missingness)
## AIC: 90049
##
## Number of Fisher Scoring iterations: 5
IsBorrowerHomeowner is not significant so it is removed from the model.
##
## Call:
## glm(formula = loan_data$DelinquentBorrowerCat ~ ., family = binomial(link = "logit"),
## data = loan_data[c("LoanOriginalAmount", "YearlyIncome",
## "DebtToIncomeRatio", "DelinquenciesLast7Years")])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9297 -0.6481 -0.5598 -0.4183 5.8379
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.009e+00 1.974e-02 -51.15 <2e-16 ***
## LoanOriginalAmount -4.596e-05 1.756e-06 -26.17 <2e-16 ***
## YearlyIncome -6.183e-06 2.795e-07 -22.12 <2e-16 ***
## DebtToIncomeRatio 1.396e-01 1.186e-02 11.77 <2e-16 ***
## DelinquenciesLast7Years 1.419e-02 7.017e-04 20.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93104 on 104504 degrees of freedom
## Residual deviance: 90037 on 104500 degrees of freedom
## (9410 observations deleted due to missingness)
## AIC: 90047
##
## Number of Fisher Scoring iterations: 5
The predicted probability of delinquency is decreasing in LoanOriginalAmount and YearlyIncome, and increasing in DebtToIncomeRatio and DelinquenciesLast7Years.
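To make the coefficients concrete, the sketch below plugs the estimates from the second model summary into the inverse logit for a hypothetical borrower at the sample medians reported earlier (a $6500 loan, $56000 yearly income, a debt to income ratio of .22, and no delinquencies). The borrower profile is an illustrative assumption; only the coefficients come from the output above.

```r
# Coefficients copied from the second model summary above.
coefs <- c(
  intercept               = -1.009,
  LoanOriginalAmount      = -4.596e-05,
  YearlyIncome            = -6.183e-06,
  DebtToIncomeRatio       =  1.396e-01,
  DelinquenciesLast7Years =  1.419e-02
)

# Hypothetical borrower at the sample medians; the leading 1 multiplies the intercept.
borrower <- c(1, 6500, 56000, 0.22, 0)

# Linear predictor on the log-odds scale, then inverse logit: 1 / (1 + exp(-x)).
linear_predictor <- sum(coefs * borrower)
p_delinquent <- plogis(linear_predictor)

# Multiplicative change in the odds of delinquency for a $1000 larger loan.
odds_ratio_per_1000 <- exp(1000 * coefs[["LoanOriginalAmount"]])

print(round(p_delinquent, 3))
print(round(odds_ratio_per_1000, 3))
```

For this profile the predicted probability of delinquency is about 0.16, and each additional $1000 borrowed multiplies the odds of delinquency by roughly 0.96, holding the other variables fixed.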
Prosper Loans is a peer-to-peer lending company offering 1-, 3-, and 5-year loans of between $2000 and $35000. The data set covers the time period 2005 to 2014, the first 10 years that Prosper Loans was in business. The number of loans originating in each year has been growing steadily during that time, dropping off only after the financial crisis.
Some changes were made during this time period. 1- and 5-year loans were only introduced in 2010, the minimum loan amount increased from $1000 to $2000 in 2011, and the maximum loan amount increased from $25000 to $35000 in 2013. Most importantly, a minimum credit score requirement of 600 was set in 2009, and increased to 640 in 2014. This led to an increase in borrower quality. Median credit scores increased from 600 in 2006 to 700 by 2014, borrower incomes increased, and rates of delinquency were cut in half.
Exploratory data analysis showed that some of the variables were correlated with delinquency status. The average debt to income ratio of delinquent borrowers was higher than that of borrowers in good standing in every year. Furthermore, smaller, shorter-term loans were delinquent at higher rates, which is counterintuitive.
A logit model was used to predict probabilities of delinquency for new borrowers.