I chose the Prosper loan data for project 3. I wanted to explore the P2P lender’s data to find insights related to the borrowers and the loans they broker. At quick glance, there are over 100k rows of loans and 81 variables. I initally choose way too many variables (30+) and through exploratory analysis, whittled it down to 20 variables and even created a few variables. Let’s begin.
## ListingCategory..numeric. EmploymentStatusDuration CreditScoreRangeLower
## 1 :58308 Min. : 0.00 Min. : 0.0
## 0 :16965 1st Qu.: 26.00 1st Qu.:660.0
## 7 :10494 Median : 67.00 Median :680.0
## 2 : 7433 Mean : 96.07 Mean :685.6
## 3 : 7189 3rd Qu.:137.00 3rd Qu.:720.0
## 6 : 2572 Max. :755.00 Max. :880.0
## (Other):10976 NA's :7625 NA's :591
## CreditScoreRangeUpper CurrentCreditLines DebtToIncomeRatio
## Min. : 19.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:679.0 1st Qu.: 7.00 1st Qu.: 0.140
## Median :699.0 Median :10.00 Median : 0.220
## Mean :704.6 Mean :10.32 Mean : 0.276
## 3rd Qu.:739.0 3rd Qu.:13.00 3rd Qu.: 0.320
## Max. :899.0 Max. :59.00 Max. :10.010
## NA's :591 NA's :7604 NA's :8554
## StatedMonthlyIncome BorrowerState Occupation
## Min. : 0 CA :14717 Other :28617
## 1st Qu.: 3200 TX : 6842 Professional :13628
## Median : 4667 NY : 6729 Computer Programmer : 4478
## Mean : 5608 FL : 6720 Executive : 4311
## 3rd Qu.: 6825 IL : 5921 Teacher : 3759
## Max. :1750003 : 5515 Administrative Assistant: 3688
## (Other):67493 (Other) :55456
## EmploymentStatus IsBorrowerHomeowner IncomeRange
## Employed :67322 False:56459 $25,000-49,999:32192
## Full-time :26355 True :57478 $50,000-74,999:31050
## Self-employed: 6134 $100,000+ :17337
## Not available: 5347 $75,000-99,999:16916
## Other : 3806 Not displayed : 7741
## : 2255 $1-24,999 : 7274
## (Other) : 2718 (Other) : 1427
## IncomeVerifiable BorrowerAPR BorrowerRate Term
## False: 8669 Min. :0.00653 Min. :0.0000 12: 1614
## True :105268 1st Qu.:0.15629 1st Qu.:0.1340 36:87778
## Median :0.20976 Median :0.1840 60:24545
## Mean :0.21883 Mean :0.1928
## 3rd Qu.:0.28381 3rd Qu.:0.2500
## Max. :0.51229 Max. :0.4975
## NA's :25
## LoanStatus LoanOriginalAmount
## Current :56576 Min. : 1000
## Completed :38074 1st Qu.: 4000
## Chargedoff :11992 Median : 6500
## Defaulted : 5018 Mean : 8337
## Past Due (1-15 days) : 806 3rd Qu.:12000
## Past Due (31-60 days): 363 Max. :35000
## (Other) : 1108
## LoanOriginationDate MonthlyLoanPayment
## 2014-01-22 00:00:00: 491 Min. : 0.0
## 2013-11-13 00:00:00: 490 1st Qu.: 131.6
## 2014-02-19 00:00:00: 439 Median : 217.7
## 2013-10-16 00:00:00: 434 Mean : 272.5
## 2014-01-28 00:00:00: 339 3rd Qu.: 371.6
## 2013-09-24 00:00:00: 316 Max. :2251.5
## (Other) :111428
## 'data.frame': 113937 obs. of 20 variables:
## $ ListingCategory..numeric.: Factor w/ 21 levels "0","1","2","3",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
Loan amount exhibits positive skew, mean 8300 > median 6500.
Monthly loan payments exhibits positive skew, mean 272 > median 217.
Credit ranges are normally distributed with mean/med = 685/680. It seem they have common intervals and are bunched as such. Note: I only use CreditScoreRangeLower throughout this analysis. CreditScoreRangeHigher has very similar distribution but 20 points higher.
Interest rate are normally distributed mean/med = .19 However, there is a spike around .31. I’ll take a closer look in bivarate section.
DI ratio exhibits slight positive skew mean .27 > median .22. There are more people with DI < 0.5.
Monthly income exhibits positive skew mean 5600 > median 4666.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 637.5 1062.0 1225.0 1622.0 171000.0 8554
I created the variable TotalMonthlyDebt to estimate the monthly debt at inception of loan by multiplying the DI ratio by the stated monthly income. The resulting distribution is positively skewed.
Employment duration exhibits positive skew mean 96 > median 67. It seems that more people that are employed shorter durations.
Current credit lines has a slight positive skew and the mean/med = 10.
I decomposed loan origination date using lubridate and plotted by month and year. I observed seasonality in the loans by month by during the holiday season. It is also interesting in the loans by year plot that the loans dropped off in 2009. I would guess that is a supply side shortage due to the Financial Crisis. What do you think? (Hint: I’m wrong.)
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Most loans in the whole data set are completed and current. Later I will break it down by year.
## [1] "# of Loans by Income Range"
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
## [1] "# of Loans by Term (months)"
## 12 36 60
## 1614 87778 24545
Income range is mostly in the middle from $25k-75k with mostly 3 year loans.
Graph of loan term by year. Dominated by 3 year loans.
## Source: local data frame [21 x 3]
##
## ListingCategory..numeric. DollarTotal CountTotal
## 1 1 577736197 58308
## 2 0 106096621 16965
## 3 3 64175191 7189
## 4 7 62035775 10494
## 5 2 60148466 7433
## 6 6 12861665 2572
## 7 4 10913226 2395
## 8 13 10549832 1996
## 9 15 9929936 1522
## 10 14 7684187 876
## .. ... ... ...
Debt consolidation by far is the leading loan category in number of loans and $ amount. But it seems there are a lot of loans not classified. More on this in the multivarate section.
Loan status broken down by year you can see more texture: most defaults occurred in 2006-2008.
I created the variable DefaultedPercent (# of defaulted loans / # total loans) to show default rates over each year. Prosper default rates seems to have decreased dramatically over the years.
I selected the top 5 states. California dominates this category.
Top 5 Occupations but data seems incomplete.
## [1] "Is the borrower a homeowner? "
## False True
## 56459 57478
Borrowers are mostly employed and about half are homeowners.
There are 113,937 loans in the dataset. I choose with 20 features out of 81, listed below. 10 being continuous variables and 10 being discrete variables or factors.
Continuous Variables:
EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines DebtToIncomeRatio StatedMonthlyIncome BorrowerAPR BorrowerRate LoanOriginalAmount MonthlyLoanPayment
Discrete Variables:
BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner IncomeRange IncomeVerifiable Term ListingCategory LoanStatus LoanOriginationDate
Most loans have terms of 3 years and are under $10k (with the median being 6500). Most have monthly payments of usually < $500 (median of $217) and paying the median interest rate of 18%. Out of 17 categories of loans, the majority of the loans are classified under debt consolidation (51%) Borrowers tend to have credit scores in the IQR of 660-720 (using creditscorerangelower). The borrower median monthly income is 4667, 28% earns $25-50k and 27% earns $50-75K. 88% are employed in some way (employed, full-time or self-employed).
There are a number of features that will give color to the profile of the loan and borrower. I think the main ones for borrowers are credit score, debt to income ratio, income and the ones for the loans are loan amount, interest rate, and term.
Loan category and status can help determine how the loans are being used and what loans are current, defaulted, delinquent etc.
I created TotalMonthlyDebt to estimate the total monthly debt at inception of loan by multiplying the DI ratio by the stated monthly income. The distribution is positively skewed. I broke out the loan origination date in LoanYear, LoanMonth using lubridate. This really helped with yearly analysis and seasonality. I also created DefaultedPercent to show the default rates over each year.
I manually set two numeric categorical variables as a factor so when I plotted them it would look discrete: Term and ListingCategory..numeric.
Correlation matrix of all variables discrete and continuous. Next few scatterplots will explore the stronger relationships in greater depth.
Positive correlation R^2 of .47. You have more debt as you increase the # of credit lines or vice-versa.
Negative correlation R^2 of -.46. Makes sense the interest rate is lower for more creditworthy borrowers.
Monthly income and total monthly debt has a positive relationship of R^2 of .36.
Slight negative correlation R^2 of -.33. Seems counterintuitive, lower interest rate on larger loans.
Slight positive correlation R^2 of -.33. The larger the loan, the higher the credit score.
I plotted total monthly debt vs monthly income facet wrapped by IsBorrowerHomeowner. The homeowners have a higher dispersion whereas the non-homeowners are more concentrated under $5000 income.
Taking a closer look at spike in the histogram of BorrowerRate, these high interest loans were made mainly for debt consolidation in 2011-2012.
And the credit score ranges are shown, the higher the borrower rate (cost) the worse the credit rating.
Mean loans amounts dips in 2009 and increased in 2013.
For comparison, aggregate loans amounts dips in 2009 and increases sharply in 2013, from $150M to $300M.
Investigating further, I graphed every loan by date, it shows a gap in service between November 2008 and July 2009. After doing some research on Wikipedia, there was a cease and desist from the SEC regarding the P2P lending model. The data also only goes until March 2014.
Shows the seasonality of increased average loan amounts and variance during the holiday seasons and a pull back during the spring and summer months.
Employed and fulltime on average take out larger loans.
Larger incomes take out larger loans on average.
As the term is longer, the loan amounts are larger.
But not necessarily more expensive. The mean interest rate moves up slightly at the 5 year term and the variance is smaller.
5 year borrowers seem to be more credit-worthy on average and also the variance is smaller.
There’s a strong positive relationship between open credit lines and total monthly debt.
There’s a negative relationship between interest rates and loan amount, the larger the loan, the lower the rate on average. That was mostly due to them having higher credit scores.
Monthly income and monthly debt have a positive relationship and levels off after a certain point.
Average loan amounts and variance are seasonal (higher during holidays)
People who are employed and fulltime on average take out larger loans than other groups.
There is a strong positive relationship between term and loan amount (longer the term, the larger the loan)
I observed that in 2008-2009 there was a large dip in loan origination. After further investigation, it was related to a cease and desist order by the SEC regarding the nature of the P2P lending model. But after they opened again, there has been impressive growth topping out at $300 million in loans in 2013 and close to $150 million by March of 2014 (partial year)
The strongest relationship I found was between credit score and interest rate, with R^2 .46. This makes sense since credit score is a rating of the credit-worthiness of the borrower and that should be directly related to the cost of borrowing (interest rate) Also, between open credit lines and total monthly deb with R^2 of .47.
The borrowers with high credit scores are on the bottom right. They generally have lower interest rates and larger loan amounts. I subsetted only credit scores from 660 (1st quartile) and above for better visual presentation.
I normalized the loan amounts by multiplying by their credit score and dividing by 850 (the best achievable score). The results do not differ to much from just using the loan amounts.
In this plot, I divided MonthlyLoanPayment by StatedMonthlyIncome see what the debt coverage ratio would be by year and faceted by IncomeRange. It seems most of the borrowers with lower income ranges take on larger loans and all income ranges have an uptick in the most recent years.
Comparing DI ratio, most of the borrowers seem to have DI ratio close to 20-30% with a uptick in the most recent years. The $100k + income range have noticeably lower DI ratio at around 15-20%. The variance is reduced as incomes increase.
Looking at Income Ranges it looks like the uptick in loans in 2012-2013 grew primarily in $50-75k, $75-100k and $100k+.
I bucketed categories by year and you can clearly see that it looks like pre-2008 they did not capture the loan category and debt consolidation dominated the cateogory every year subsequently.
When looking at the loan amounts vs cost (interest rate), the credit scores demarcated borrowers by credit worthiness. I looked at loan amounts (normalized by credit score) vs income ranges and saw that the higher the income, the larger the loans amounts on average.
Investigating further, I looked at DI ratio and debt coverage ratio and observed the higher the income, the lower the percentage of debt. Additionally, it appears the platform after a few years of maturity is beginning to lend to slight more indebted borrowers in the income ranges of $50-75k, $75-100k and $100k+. Looking at the debt categories, it seemed like they didn’t collect this information before 2007. But the recent uptick of lending is mostly in debt consolidation.
The borrowers with high credit scores are in red on the left. They generally have lower interest rates and larger loan amounts. I subsetted only credit scores from 660 (1st quartile) and above for better visual presentation. And this shows that as the lending platform matured, the overal risk exposure increased. In 2014, much more blue (credit score ~700) borrowers.
Looking at DI (debt to income) ratio, it illustrates a picture of increased loan sizing, especially at lower borrowing rates (2013-2014) as the platform matured and risk managment improves, the default rates were decreasing (graphed previously).
I subsetted only deliquent loans (total count: 2067) to look at their at exposure to more risky (delinquent) loans. It appears the majority of their delinquencies are clustered in 2012-2013, 1-15days past due and the borrowers have income ranges $25k-50k and $50-75k.
The data set had nearly 114,000 loans from Nov 2005 - March 2014. Over the course of those years, Prosper has made almost $1 trillion dollars in loans ($949,894,347 to be exact). The difficulties I had with the data mainly stemmed from understanding the variables and then selecting the appropriate ones to analyze. I had to do an initial exploration of the data and even then I probably selected a bit too many. I initally wanted to create a model to see if I could model loans likely to be delinquent or default but I realize solving classification isn’t until the machine learning course. I would love to see if a model could be learned and created. From the data I selected, I saw them struggle initially with their first few years of loans. However, their business grew very quickly after the cease and desist by the SEC in 2009. Their main loan exposures are in debt consolidation and I’m sure the low interest rate environment has helped the supply of lenders looking for yield. This was a great learning experience about the P2P lending model and I’m glad Udacity gave me the opportunity access to this data.