By Toufik Kannab
Prosper is a peer-to-peer lending platform, with over $10 billion in funded loans, where individuals can either invest in personal loans or request to borrow money. On Prosper, borrowers list loan requests between $2,000 and $35,000 and individual investors invest as little as $25 in each loan listing they select. Prosper handles the servicing of the loan on behalf of the matched borrowers and investors. Prosper Marketplace is backed by leading investors including Sequoia Capital, Francisco Partners, Institutional Venture Partners, and Credit Suisse NEXT Fund.
This report will explore a loan dataset containing 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, the latest payment information and many others.
For the purpose of this project and after reviewing the 81 variables, the analysis will be limited to the following variables:
## [1] "ListingKey" "ListingCreationDate"
## [3] "Term" "LoanStatus"
## [5] "BorrowerRate" "ProsperRating..numeric."
## [7] "ProsperRating..Alpha." "ProsperScore"
## [9] "ListingCategory..numeric." "BorrowerState"
## [11] "Occupation" "EmploymentStatus"
## [13] "IsBorrowerHomeowner" "CreditScoreRangeLower"
## [15] "CreditScoreRangeUpper" "IncomeVerifiable"
## [17] "AvailableBankcardCredit" "DebtToIncomeRatio"
## [19] "LoanKey" "MemberKey"
## [21] "IncomeRange" "LoanOriginalAmount"
## [23] "Investors"
As we start our exploration, we will check for missing and duplicated data as applicable.
I reviewed the dataset structure and the variable definitions and found 4 key-related variables: ListingKey, GroupKey, LoanKey, and MemberKey; while GroupKey and MemberKey seem to have a distinct purpose and definition, ListingKey and LoanKey have a similar definition, so let’s check if there are any listings where ListingKey and LoanKey match:
## [1] ListingKey ListingCreationDate
## [3] Term LoanStatus
## [5] BorrowerRate ProsperRating..numeric.
## [7] ProsperRating..Alpha. ProsperScore
## [9] ListingCategory..numeric. BorrowerState
## [11] Occupation EmploymentStatus
## [13] IsBorrowerHomeowner CreditScoreRangeLower
## [15] CreditScoreRangeUpper IncomeVerifiable
## [17] AvailableBankcardCredit DebtToIncomeRatio
## [19] LoanKey MemberKey
## [21] IncomeRange LoanOriginalAmount
## [23] Investors
## <0 rows> (or 0-length row.names)
Based on the output above, ListingKey and LoanKey have different values, but before determining which one to use, let’s check for NA empty values and duplication.
## No. of NA or empty values found in the variable (ListingKey): 0
## No. of NA or empty values found in the variable (LoanKey): 0
## No. of duplicate values found in the variable (ListingKey): 827
## No. of duplicate values found in the variable (LoanKey): 827
According to the definitions provided with the dataset, both ListingKey and LoanKey should be unique, however, as per the output above, I found a total of 827 duplicates; after isolating and reviewing duplicated rows, I found that for each duplicated key, all the other variables have duplicated values (including LoanKey) except for ProsperScore. For the purpose of this project, I will exclude them from the dataset.
Looking at the variable definition, Member keys are not necessarily unique, but let’s check if there are any duplicate borrowers (borrowers who have more than one loan in the database):
## No. of duplicate values found in the variable (MemberKey): 0
As per the output above, it looks like MemberKey doesn’t have duplicate values.
For the purpose of this project, I will create a new variable called LoanCreationYear based on ListingCreationDate. Let’s plot the listings per the new variable to see the listings distribution by year.
Looking at the plot above, we notice that the loan dataset spans between 2005 and 2014. Moreover, we also notice that the number of loans prior to 2010 is significantly less than 2010 - 2014. Also, from the dataset documentation, we notice there are a number of variabes ONLY applicable to loans created after July 13, 2009, including: ProsperRating (numeric), ProsperRating (Alpha), ProsperScore, and since I’m planning to use these variables later in my analysis, I will exclude the loans created prior to July 13, 2009 to avoid NA or missing values. Let’s re-plot the chart above after excluding the listings (loans) created prior to July 13, 2009:
## [1] 83155
After excluding the loans, we are left with 83,155 loans which is a reasonable number to perform the analysis.
## # A tibble: 3 x 3
## Term n freq
## <int> <int> <dbl>
## 1 36 57611 0.693
## 2 60 23931 0.288
## 3 12 1613 0.0194
Looking at the output abvove, we notice the majority of the loans (69.3%) have a 36-month term duration, where (29%) have a 60-month duration, and only about (2%) have a 12-month duration.
## # A tibble: 11 x 3
## LoanStatus n freq
## <fct> <int> <dbl>
## 1 Current 54928 0.661
## 2 Completed 19638 0.236
## 3 Chargedoff 5336 0.0642
## 4 Defaulted 1005 0.0121
## 5 Past Due (1-15 days) 794 0.00955
## 6 Past Due (31-60 days) 359 0.00432
## 7 Past Due (61-90 days) 309 0.00372
## 8 Past Due (91-120 days) 304 0.00366
## 9 Past Due (16-30 days) 265 0.00319
## 10 FinalPaymentInProgress 201 0.00242
## 11 Past Due (>120 days) 16 0.000192
Looking at the output above, we notice around (66%) of the loans have an active status or Current, (23.6%) are Completed, (6.4%) are Chargedoff and only (1.2%) are Defaulted. Since Loan Delinquency (LoanStatus) will play a major role in this analysis, I thought it might be useful to spend some time understanding the different categories within this variable, more specifically: Defaulted, ChargedOff and PastDue; below is a summary of the information I’ve gathered:
Loan Delinquency: it is a situation in which a borrower misses its due date for a single scheduled payment for a form of financing, like student loans, mortgages, credit card balances or automobile loans. Delinquency occurs as soon as a borrower misses a payment. Investopedia
Default Loan: the failure to pay interest or principal on a loan or security when due. Default occurs when a debtor is unable to meet the legal obligation of debt repayment, and it also refers to cases in which one party fails to perform on a futures contract as required by an exchange. Investopedia
Charge-off: A charge-off is a debt, (e.g. a credit card), that is deemed unlikely to be collected by the creditor because the borrower has become substantially delinquent after a period of time. However, a charge-off does not mean a write-off of the debt entirely. A charge-off usually occurs when the creditor has deemed an outstanding debt is uncollectible; this typically follows 180 days or six months of non-payment. In addition, debt payments that fall below the required minimum payment for the period will also be charged off if the debtor does not make up for the shortfall. The creditor crosses off the consumer’s debt as uncollectible and marks it on the consumer’s credit report as a charge-off. Investopedia.
According to federal regulations, installment loans (like student loans) must be charged-off after being delinquent for 120, while revolving credit accounts (i.e, credit cards) must be charged-off after they’ve been delinquent for 180 days. BadCredit.org
Delinquency vs. Default: Delinquency and default are both loan terms representing different degrees of the same problem: missing payments. A loan becomes delinquent when you make payments late (even by one day) or miss a regular installment payment or payments. On the other hand, Loan default is much more serious, changing the nature of your borrowing relationship with the lender, and with other potential lenders as well. Investopedia
Looking at the variable LoanStatus, although there is no category called Delinquent, we can identify which loands are delinquent based on the assumption that ANY delay will be considered a Delinquency, by definition. To achieve that, I created a new variable called DelinquentList which takes one of the following values:
## # A tibble: 2 x 3
## DelinquentList n freq
## <chr> <int> <dbl>
## 1 Good Standing 75032 0.902
## 2 Delinquent 8123 0.0977
Looking at the output above, we notice the greatest majority of loans (90.23%) are in Good Standing status where (9.77%) are Delinquent. Later on, we will use the new variable DelinquentList to break down other variables and identify any possible correlations between those variables and loan delinquency.
## Min. 1st Qu. Median Mean 3rd Qu. Max. Std. Dev
## 0.0400000 0.1359000 0.1897000 0.1965692 0.2579000 0.3600000 0.0700000
## # A tibble: 10 x 3
## BorrowerRate n freq
## <dbl> <int> <dbl>
## 1 0.318 3672 0.0442
## 2 0.320 1645 0.0198
## 3 0.270 1304 0.0157
## 4 0.110 890 0.0107
## 5 0.350 802 0.00964
## 6 0.210 760 0.00914
## 7 0.260 750 0.00902
## 8 0.158 742 0.00892
## 9 0.130 730 0.00878
## 10 0.220 729 0.00877
Looking on the output above, we notice that the median and mean are close (18.97% to 19.66%); also, there are 3,672 borrowers (or 3.2 %) that have received an interest rate of (31.77 %) which is way far from the mean and median, this peak looks interesting and may deserve a deeper investigation.
## # A tibble: 7 x 3
## ProsperRating..numeric. n freq
## <int> <int> <dbl>
## 1 1 6901 0.0830
## 2 2 9639 0.116
## 3 3 14070 0.169
## 4 4 17862 0.215
## 5 5 15166 0.182
## 6 6 14237 0.171
## 7 7 5280 0.0635
According to the definition, ProsperRating (numeric) is a rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. However, the definition doesn’t provide sufficient information as to what this rating actually represents or how it is calcuated/determined. As we progress, we will try to identify any possible correlations between this rating and the other variables.
## # A tibble: 11 x 3
## ProsperScore n freq
## <dbl> <int> <dbl>
## 1 1. 917 0.0110
## 2 2. 5632 0.0677
## 3 3. 7456 0.0897
## 4 4. 12300 0.148
## 5 5. 9661 0.116
## 6 6. 12064 0.145
## 7 7. 10392 0.125
## 8 8. 11875 0.143
## 9 9. 6813 0.0819
## 10 10. 4637 0.0558
## 11 11. 1408 0.0169
Looking at the output above, the distribution looks relatively normal; the top 3 proportions of loans have received a score of 4, 6 and 8 respectively.
According to the definition, ProsperScore is a custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. However, the output above shows (1,408) loans with a score of 11. For the purpose of this project and after reviewing the documentation, I will replace the 11’s with the mean value of ProsperScore which is 6 (the mean is 5.9, I will round it up to 6). After replacing the errored values, let’s check the values again:
## # A tibble: 10 x 3
## ProsperScore n freq
## <dbl> <int> <dbl>
## 1 1. 917 0.0110
## 2 2. 5632 0.0677
## 3 3. 7456 0.0897
## 4 4. 12300 0.148
## 5 5. 9661 0.116
## 6 6. 13472 0.162
## 7 7. 10392 0.125
## 8 8. 11875 0.143
## 9 9. 6813 0.0819
## 10 10. 4637 0.0558
Per review of the definition and the values asssigned to variable ListingCategory..numeric., I will create a new variable called NewListingCategory to include the actual categories.
## # A tibble: 20 x 3
## NewListingCategory n freq
## <chr> <int> <dbl>
## 1 Debt Consolidation 51849 0.624
## 2 Other 9127 0.110
## 3 Home Improvement 6713 0.0807
## 4 Business 5236 0.0630
## 5 Auto 2229 0.0268
## 6 Household Expenses 1978 0.0238
## 7 Medical/Dental 1493 0.0180
## 8 Taxes 879 0.0106
## 9 Large Purchases 851 0.0102
## 10 Vacation 760 0.00914
## 11 Wedding Loans 753 0.00906
## 12 Motorcycle 304 0.00366
## 13 Student Use 274 0.00330
## 14 Engagement Ring 211 0.00254
## 15 Baby&Adoption 193 0.00232
## 16 Cosmetic Procedure 91 0.00109
## 17 Boat 85 0.00102
## 18 Green Loans 57 0.000685
## 19 RV 52 0.000625
## 20 Not Available 20 0.000241
Looking at the output above, we notice the largest category (62.24%) is Debit Consolidation, where the smallest category is Not Available with only 20 loans.
## # A tibble: 10 x 3
## BorrowerState n freq
## <fct> <int> <dbl>
## 1 CA 10524 0.127
## 2 NY 5709 0.0687
## 3 TX 5523 0.0664
## 4 FL 5302 0.0638
## 5 IL 4168 0.0501
## 6 OH 3308 0.0398
## 7 GA 3284 0.0395
## 8 VA 2725 0.0328
## 9 NJ 2657 0.0320
## 10 PA 2634 0.0317
Looking at the output above, we notice there are loans from the 48 states, as expected, the state with the largest proportion (12.66%) is California, where the state with the smallest proportion is Wyoming with 123 loans.
## # A tibble: 10 x 3
## Occupation n freq
## <fct> <int> <dbl>
## 1 Other 20942 0.252
## 2 Professional 10354 0.125
## 3 Executive 3407 0.0410
## 4 Computer Programmer 3165 0.0381
## 5 Teacher 2828 0.0340
## 6 Analyst 2687 0.0323
## 7 Administrative Assistant 2686 0.0323
## 8 Accountant/CPA 2526 0.0304
## 9 Sales - Commission 2288 0.0275
## 10 Skilled Labor 2134 0.0257
## No. of NA or empty values in the variable (Occupation): 1218
Looking at the results above, the largest proportion of borrowers (25.11%) reported Other as occupation, next comes Professional with (11.96%). These results lead us to think of a number of scenarios, one scenario is that the majority of borrowers don’t probably have one specific job or role; or they are not willing to report their actual job, or perhaps they have an occupation not on list, which is kind of unlikely to be the case as the list quite comprehensive. Also, I also found 1,218 loans that don’t have the borrower’s occupation included.
## # A tibble: 7 x 3
## EmploymentStatus n freq
## <fct> <int> <dbl>
## 1 Employed 65900 0.792
## 2 Full-time 7925 0.0953
## 3 Self-employed 4377 0.0526
## 4 Other 3681 0.0443
## 5 Not employed 649 0.00780
## 6 Retired 367 0.00441
## 7 Part-time 256 0.00308
Looking at the output above, the largest proportion of borrowers (79.25%) are Employed where only (0.3%) have a Part-time job.
## # A tibble: 2 x 3
## IsBorrowerHomeowner n freq
## <fct> <int> <dbl>
## 1 True 44014 0.529
## 2 False 39141 0.471
Looking at the output above, we notice that about 53% of the borrowers are home owners.
To analyze the credit score, I will create a new variable called AvgCreditScore which is the average value of two variables: CreditScoreRangeUpper and CreditScoreRangeLower.
## Min. 1st Qu. Median Mean 3rd Qu. Max. Std. Dev
## 609.5000 669.5000 709.5000 709.0601 729.5000 889.5000 47.2700
## # A tibble: 10 x 3
## AvgCreditScore n freq
## <dbl> <int> <dbl>
## 1 610. 1038 0.0125
## 2 630. 1653 0.0199
## 3 650. 8656 0.104
## 4 670. 13762 0.165
## 5 690. 13672 0.164
## 6 710. 13307 0.160
## 7 730. 10791 0.130
## 8 750. 7760 0.0933
## 9 770. 5176 0.0622
## 10 790. 3672 0.0442
Looking at the results above, The average value of the AvgCreditScore is (709.1) with a minimum of (609.5) and a maximum of (889.5). According to a (fico.com) blog post, the percentage of consumers scoring in the range (800 and up) has steadily increased since 2010 after the bottoming out of the economy between 2009 - 2010; the data presented in the post was based on a snapshot of millions of US consumers’ credit data that ranges between 2005 - 2017, the post shows how the average credit score has increased from (686) in October 2009 to (696) in October 2015, and later hit the (700) threshold by April 2017. By comparing the figures included in this post with our dataset, I can reasonably assume that our dataset is consistent with what was found in the post; I’ve also noticed, based on the histogram above, that the distribution of the average credit score is relatively normal, which is expected, with a longer right tail. Another interesting observation I’ve noticed is that there are spaces of the same size (20 points) between the bins, I’m not quite sure why this is the case, it could be related to the way credit scores are fed to the loan system, but this is something to be considered when conducting a more in-depth analysis of the credit score variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max. Std. Dev
## 0.00 1138.00 4565.00 11399.61 13900.00 498374.00 18660.96
## # A tibble: 10 x 3
## AvailableBankcardCredit n freq
## <dbl> <int> <dbl>
## 1 0. 2407 0.0289
## 2 500. 221 0.00266
## 3 300. 101 0.00121
## 4 1000. 98 0.00118
## 5 2000. 89 0.00107
## 6 3000. 75 0.000902
## 7 1500. 72 0.000866
## 8 2500. 63 0.000758
## 9 5000. 55 0.000661
## 10 76. 50 0.000601
Looking at the ouput above, we notice the distribution of the variable AvailableBankcardCredit is significantly right-skewed;
## Min. 1st Qu. Median Mean 3rd Qu.
## 0.0000000 0.1500000 0.2200000 0.2586333 0.3200000
## Max. NA's Std. Dev
## 10.0100000 7135.0000000 NA
## # A tibble: 10 x 3
## DebtToIncomeRatio n freq
## <dbl> <int> <dbl>
## 1 NA 7135 0.0858
## 2 0.180 3156 0.0380
## 3 0.220 2950 0.0355
## 4 0.170 2707 0.0326
## 5 0.140 2662 0.0320
## 6 0.210 2512 0.0302
## 7 0.160 2502 0.0301
## 8 0.200 2477 0.0298
## 9 0.190 2405 0.0289
## 10 0.150 2400 0.0289
## No. of NA or empty values in the variable (DebtToIncomeRatio): 7135
Looking at the output above, the distribution of the variable DebtToIncomeRatio is right skewed with a mean of (25.9%). We also notice that the largest proportions (8.5%) or (7,135) loans don’t have a value for DebtToIncomeRatio, where (3.8%) of the loans have a DebtToIncomeRatio of (18%).
## # A tibble: 7 x 3
## IncomeRange n freq
## <fct> <int> <dbl>
## 1 $50,000-74,999 25040 0.301
## 2 $25,000-49,999 23686 0.285
## 3 $100,000+ 14914 0.179
## 4 $75,000-99,999 14232 0.171
## 5 $1-24,999 4589 0.0552
## 6 Not employed 649 0.00780
## 7 $0 45 0.000541
Looking at the output above, we notice that the largest propotion of borrowers (30%) have an income range of (50K-75K), where the smallest group (0.5%) have reported zero dollar income.
## # A tibble: 2 x 3
## IncomeVerifiable n freq
## <fct> <int> <dbl>
## 1 True 75983 0.914
## 2 False 7172 0.0862
Looking at the output above, we notice the majority of borrowers (91.38%) have provided income verification documents.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9039 13500 35000
## # A tibble: 10 x 3
## LoanOriginalAmount n freq
## <int> <int> <dbl>
## 1 4000 12985 0.156
## 2 15000 11118 0.134
## 3 10000 9526 0.115
## 4 2000 4544 0.0546
## 5 5000 4152 0.0499
## 6 3000 3411 0.0410
## 7 20000 2815 0.0339
## 8 25000 2706 0.0325
## 9 7000 2254 0.0271
## 10 7500 2162 0.0260
Looking at the output above, the average original loan amount is ($ 9,039). To better understnad the distribution of loan amounts, I adjusted the histogram binwidth to (100). As a result, we notice there are specific loan amounts that are more common than others, for example ($ 4,000) has the highest proportion (15.6%), ($ 15,000) comes next with (13.3%); We can also see an outlier at the value $35,000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 33.00 69.07 98.00 1189.00
## # A tibble: 10 x 3
## Investors n freq
## <int> <int> <dbl>
## 1 1 26206 0.315
## 2 2 1178 0.0142
## 3 3 848 0.0102
## 4 4 686 0.00825
## 5 5 609 0.00732
## 6 8 606 0.00729
## 7 6 565 0.00679
## 8 7 542 0.00652
## 9 9 542 0.00652
## 10 10 519 0.00624
Looking at the output above, the average number of investors per loan is (69) investors; the largest proportion of loans (%31.5) have one investor. We also notice that the maximum number of investors for one loan is (1,189) , which is an interesting number, so I pulled up the loans with a number of investors greater than 1000 and found (2) loans with a loan original amount of ($ 25,000) and (2) loans with a loan original amount of ($ 20,000)
## Investors LoanOriginalAmount
## 1 1024 25000
## 2 1035 25000
## 3 1011 20000
## 4 1189 20000
The dataset includes 113,937 observations and 81 variables. After excluding loans with duplicate keys and loans created pior to July 13, 2009, the new number of loans is 83,155.
There are two main features to be investigated:
DelinquentList based on the variable LoanStatus and the information collected from the external references included above.To perform my analysis, I’ve limited the dataset to the following variables:
“ListingKey”, “ListingCreationDate”, “Term”, “LoanStatus”, “BorrowerRate”, “ProsperRating..numeric.”, “ProsperRating..Alpha.”, “ProsperScore”,“ListingCategory..numeric.”,“BorrowerState”, “Occupation”, “EmploymentStatus”, “IsBorrowerHomeowner”, “CreditScoreRangeLower”, “CreditScoreRangeUpper”, “IncomeVerifiable”,“AvailableBankcardCredit”, “DebtToIncomeRatio”, “LoanKey”, “MemberKey”,“IncomeRange”, “LoanOriginalAmount”,“Investors”
Yes, as I mentioned above, I created the following variables: LoanCreationYear, DelinquentList, AvgCreditScore, NewListingCategory
I also created a new function mySummary to add standard deviation to the output of the built-in summary function.
The dataset is well-structured, therefore, I haven’t performed any major operations to adjust the structure, however, I’ve made a number of minor formatting changes. For example, I changed the format of ListingCreationDate from character to DateTime.
Also, I updated the ProserScore column to replace 11’s with 6’s (the mean score) as the score takes values between 1 to 10, per the definition.
As far as unusual distributions, I found the following: - AvailableBankcardCredit is significantly right-skewed. - DebtToIncomeRatio is right skewed. - LoanOriginalAmount is relatively right-skewed with a good proportion of outliers. - Investors is significantly right-skewed.
I also found some interesting Data Peaks & valleys. For example:
Term: (70%) of the loans have a 36 month term duration.DelinquentList: (90%) of the loans have a Good Standing status, which indicates a healthy economy.BorrowerRate:: (4.4 %) of the loans have an interest rate of (31.8 %) which can be considered high when compared to the mean (19.6%).ListingCategory: (62.4%) of the loans reported Debit Consolidation.BorrowerState: (12.7%) of the loans belong to borrowers from California.Occupation: (25%) reported Other as an Occupation.As far as missing values, I identified the following:
DebtToIncomeRatio: I’ve found 7,135 missing value which seems unusual. Unfortunately, there is no further explanation provided in the dataset documentation as to why these values are missing. While it might be on option to populate the missing rows with a calcuated DebtToIncomeRatio using the available debt and income information in the dataset, it wouldn’t be easy to determine the total amount of debt since borrowers might have other loans with different financial institutions not included in the dataset, so it won’t be an efficient choice, and therefore I excluded those loans from the dataset.Occupation: we found 1,218 missing values.In this section, I will mainly focus on two main features: Loan Delinquency (DelinquentList) and Interest Rate (BorrowerRate), however, I may extend my analysis to include other features, as applicable,
## var2 Delinquent Good Standing
## var1
## 2009 15.29 84.71
## 2010 16.89 83.11
## 2011 21.39 78.61
## 2012 17.18 82.82
## 2013 3.05 96.95
## 2014 0.07 99.93
Looking at the output above, we notice the percentage of Good Standing loans has significantly increased in 2014 (99.93%) as opposed to (78.61%) in 2011.
## var2 Delinquent Good Standing
## var1
## 12 5.52 94.48
## 36 10.71 89.29
## 60 7.79 92.21
Looking at the output above, it looks like loans with 12-month duration tend to perform slightly better than loans with 36- and 60-month duations;
Looking at the plot above, we notice the distribution of the
BorrowerRate for Good standing loans experiences more fluctuations than Delinquent loans; we also notice the count of both Good standing and Delinquent loans experiances a significant peak at BorrowRate = 0.31799. This is an interesting observation and can be worth further analysis.
## var2 Delinquent Good Standing
## var1
## 1 24.21 75.79
## 2 17.09 82.91
## 3 14.78 85.22
## 4 7.03 92.97
## 5 5.41 94.59
## 6 3.83 96.17
## 7 1.99 98.01
Looking at the output above, and as expected, the proportion of Good Standing loans increases as the ProsperRating increases (positive correlation).
## var2 Delinquent Good Standing
## var1
## 1 33.04 66.96
## 2 11.75 88.25
## 3 9.56 90.44
## 4 9.77 90.23
## 5 13.68 86.32
## 6 10.30 89.70
## 7 9.45 90.55
## 8 8.46 91.54
## 9 6.47 93.53
## 10 2.29 97.71
Looking at the output above; similar to what we found with ProsperRating..numeric., the proportion of Good Standing loans increases as the ProsperScore increases. However, for loans with scores of 4 and 5, As the score increases, the proportion of Good Standing loans decreases, which leads us to think there might be another feature that could have a stronger impact on loan Delinquency than impact ProsperScore has on loan Delinquency. This is an interesting observation and can be worth further analysis.
## var2 Delinquent Good Standing
## var1
## Auto 13.01 86.99
## Baby&Adoption 8.29 91.71
## Boat 5.88 94.12
## Business 16.39 83.61
## Cosmetic Procedure 14.29 85.71
## Debt Consolidation 7.65 92.35
## Engagement Ring 5.21 94.79
## Green Loans 19.30 80.70
## Home Improvement 11.80 88.20
## Household Expenses 14.66 85.34
## Large Purchases 8.93 91.07
## Medical/Dental 12.93 87.07
## Motorcycle 5.26 94.74
## Not Available 5.00 95.00
## Other 14.56 85.44
## RV 3.85 96.15
## Student Use 15.69 84.31
## Taxes 8.76 91.24
## Vacation 9.34 90.66
## Wedding Loans 8.37 91.63
Looking at the output above, the category RV has the highest proportion (96.15%) of Good Standing Loans, where in Debt Consoliation, the percentage of good standing loans is (92.35%)
From the plot above, we notice the distributions of the DebtToIncomeRatio are right-skewed for both Good Standing and Delinquent loans in the range between 0 and 0.8, however, things become less clear above 0.8. To get a better understanding of the distributions above 0.8, I will transform the data using log10() function, as follows:
After transforming the data, we notice DebtToIncomeRatio has a similar behaviour for Delinquent and Good standing loans with minor fluctuations.
## # A tibble: 10 x 6
## BorrowerState n mean median `25%` `75%`
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AL 1165 0.209 0.203 0.149 0.270
## 2 MS 667 0.208 0.204 0.147 0.264
## 3 SD 182 0.208 0.198 0.153 0.268
## 4 AR 761 0.207 0.205 0.152 0.260
## 5 ID 385 0.205 0.198 0.144 0.268
## 6 LA 817 0.205 0.194 0.146 0.268
## 7 KY 870 0.204 0.198 0.147 0.261
## 8 TN 1518 0.204 0.198 0.145 0.264
## 9 MO 1759 0.204 0.198 0.142 0.270
## 10 MI 2572 0.204 0.198 0.145 0.262
Looking at the output above, we notice that Alabama (AL) has the highest average interest rate of (20.9%) where District of Columbia (DC) has the lowest average interest rate of (17.7%); I also found that California (CA), which has the highest loan proportion, has an average interest rate of (19%) which is quite close to the overall average interest rate;
## # A tibble: 20 x 6
## NewListingCategory n mean median `25%` `75%`
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Not Available 20 0.239 0.243 0.213 0.296
## 2 Cosmetic Procedure 91 0.226 0.245 0.166 0.290
## 3 Household Expenses 1978 0.224 0.229 0.168 0.292
## 4 Other 9127 0.216 0.221 0.148 0.287
## 5 Auto 2229 0.214 0.218 0.145 0.285
## 6 Student Use 274 0.212 0.201 0.110 0.289
## 7 Business 5236 0.210 0.210 0.141 0.279
## 8 Medical/Dental 1493 0.210 0.210 0.150 0.270
## 9 Vacation 760 0.208 0.208 0.146 0.279
## 10 RV 52 0.207 0.214 0.146 0.272
## 11 Taxes 879 0.207 0.208 0.151 0.257
## 12 Wedding Loans 753 0.206 0.208 0.151 0.258
## 13 Green Loans 57 0.204 0.197 0.133 0.300
## 14 Home Improvement 6713 0.201 0.198 0.136 0.264
## 15 Motorcycle 304 0.198 0.191 0.141 0.256
## 16 Engagement Ring 211 0.196 0.198 0.144 0.246
## 17 Large Purchases 851 0.192 0.184 0.130 0.249
## 18 Debt Consolidation 51849 0.188 0.180 0.133 0.243
## 19 Baby&Adoption 193 0.187 0.177 0.122 0.247
## 20 Boat 85 0.174 0.164 0.125 0.226
Looking at the output above, we notice that the category Not Available has the highest average interest rate of (23.9%), unfortunately, that doesn’t tell us much information; but the interesting thing is the category Cosmetic Procedure which has a mean of (22.6%); this gives an idea of how these procedures are increasinlgy becoming a priority in peoples’ lives over other necessities, such as education, health, or business projects.
## # A tibble: 7 x 6
## EmploymentStatus n mean median `25%` `75%`
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Not employed 649 0.261 0.286 0.215 0.318
## 2 Retired 367 0.217 0.220 0.140 0.293
## 3 Other 3681 0.215 0.210 0.158 0.277
## 4 Self-employed 4377 0.213 0.208 0.149 0.285
## 5 Part-time 256 0.213 0.226 0.120 0.290
## 6 Full-time 7925 0.199 0.199 0.107 0.272
## 7 Employed 65900 0.193 0.184 0.136 0.250
Looking at the output above, we notice the category Not employed has the highest average interest rate of 26%, where the category Employed has the lowest average interest rate of 19.3%, which is expected.
##
## Pearson's product-moment correlation
##
## data: loan_updated$LoanOriginalAmount and loan_updated$BorrowerRate
## t = -131.03, df = 83153, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4193118 -0.4080445
## sample estimates:
## cor
## -0.413694
Looking at the plot above, we see a negative correlation between the original loan amount and the interest rate which makes sense as larger loan amounts are more likely to receive lower interest rates than smaller loan amounts; to confirm our observation, I performed a correlation test and got (-0.33) which indicates a weak negative correlation between the two variables.
To build on what we found in the correlation test above and provide some intuition on which variables have a strong positive or negative correlation (i.e. above 0.5, below -0.5), let’s plot the correlation matrix using Pearson correlation test
## Warning in ggcorr(data = loan_updated, method = c("pairwise", "pearson"), :
## data in column(s) 'ListingKey', 'ListingCreationDate', 'LoanStatus',
## 'ProsperRating..Alpha.', 'BorrowerState', 'Occupation', 'EmploymentStatus',
## 'IsBorrowerHomeowner', 'IncomeVerifiable', 'LoanKey', 'MemberKey',
## 'IncomeRange', 'LoanCreationYear', 'DelinquentList', 'NewListingCategory'
## are not numeric and were ignored
Looking at the matrix above, we notice the following:
Credit Score and Available Bank Card Credit: we notice a fairly strong postive correlation (coeff: 0.5) between credit score and available bank card credit.
Prosper Score & Prosper Rating: we notice a strong positive correlation (coeff: 0.7) between ProsperScore and ProsperRating which is expected.
Credit Score & Prosper Rating: we notice a strong positive correlation (coeff: 0.5) between credit score and Prosper Rating which is also expected.
Prosper Score & Borrower Rate: we notice a strong negative correlation (coeff: -0.6) between Prosper score and Borrower Rate which is expected.
Credit Score & Borrower Rate: we notice a farily strong negative correlation (coeff: -0.5) between Credit score and Borrower Rate which is expected.
Prosper Rating & Borrower Rate: we notice a strong negative correlation (coeff: -1) between Prosper Rating and Borrower Rate which is expected.
In the bivariate analysis, we analyzed two types of relationships between variables:
Categorical / Categorical: in this section, I analyzed the relationship between Loan Delinquency and a number of variables and found the following:
Loan Delinquency (by ProsperScore): the proportion of Good Standing loans increases as the ProsperRating increases, however, for loans with scores of 4 and 5, the proportion of Delinquent loans goes up, which leads us to think there might be another feature that could have a stronger correlation with loan Delinquency than the correlation between ProsperScore and loan Delinuency.
Loan Delinquency (by Listing Category): the category RV has the highest proportion of Good Standing Loans of (96.15%), where in Debt Consoliation, the percentage of good standing loans is (92.35%).
Categorical / Numerical: in this section, we analyzed how the values of numerical variables (mainly: Interest Rate and Original Loan Amount) are distributed when broken down by specific categorical variables, as follows:
Numerical / Numerical:
Interest Rate vs. Debt-To-Income Ratio: Unexpectedly, the data showed a very weak or no corrolation between the two variables as someone would expect borrowers with higher DebtToIncomeRatio to be charged higher interest rates, however, after applying Pearson’s corrolation test, the returned corrolation co-efficient was (0.06).
Interest Rate Vs. LoanOriginalAmount: here we found a relatively negative correlation between the original loan amount and the interest rate with a corrolation co-efficient of (-0.33).
In this section, I will continue to focus on interest rate, loan delinquency but I will add additional variables, such as prosperScore, loan original amount and debtToIncomeRatio and build on the analysis performed in the first two sections.
In this section, I will analyze the relationship between the average interest rate and the average credit score (the custom variable we created) and show how this relationship may vary depending on loan status (Good Standing vs. Delinquent).
Looking at the output above, we notice that for both Good Standing and Delinquent loans, there is a negative correlation between the mean interest rate and the borrower’s average credit score, however, Good Standing loans show a steady decrease in the mean interest rate as the average credit score goes up, where in Delinquent loans, we see more fluctuations especially for borrowers with high credit scores, as we can see, as the average credit score goes above 850, the mean interest rate for Delinquent loans jumps up which may seem unusual. This interesting observation may be an area for further analysis.
As we have seen in the Bivariate Analysis (plot 5: Loan Delinquency vs. ProsperScore), there is a negative correlation between ProsperRating and loan Delinquency, however, this correlation unexpectedly changed to positive for loans with Prosper socres of 3, 4 and 5. In this section, I will analyze the relationship between Loan Delinquency, ProposerScore and a number of variables, in an attempt to find an explanation for this observation.
Looking at the output above, for loans with the scores 3, 4, and 5, the count of both good standing and delinquent loans experiences a significant spike at 0.3177 and 0.3199; we saw this observation before in the bivariate section; to take things further, let’s isolate the loans with a ProsperScore of 3, 4, or 5, and a BorrowerRate of 0.3177 and 0.3199
## [1] 4448
Looking at the output above, we notice there are 4,448 loans that fit our criteria. Now, let’s plot the correlation matrix for these loans:
## Warning in ggcorr(data = score_BR_df, method = c("pairwise", "pearson"), :
## data in column(s) 'ListingKey', 'ListingCreationDate', 'LoanStatus',
## 'ProsperRating..Alpha.', 'BorrowerState', 'Occupation', 'EmploymentStatus',
## 'IsBorrowerHomeowner', 'IncomeVerifiable', 'LoanKey', 'MemberKey',
## 'IncomeRange', 'LoanCreationYear', 'DelinquentList', 'NewListingCategory'
## are not numeric and were ignored
## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero
Looking at the matrix above, we don’t see any difference from the original correlation matrix. Therefore, more in-depth analysis is needed to find an explanation of what caused this spike in the borrowerRate variable. However, for the purpose of this project, we will limit our analysis to this extent.
In this section, I will analyze the relationship between average DebtToIncome Ratio and Credit Score based on loan status
Looking at the plot above, we notice the type relationship between Average DebtToIncomeRation and average Credit score varies based on the credit score range. For borrowers with credit scores between 600 - 750, we see a fairly positive correlation between Debt2Income and Credit score, however, as the credit score goes above 750 we notice a negative correlation for the most part. We also notice that the mean DebtToIncome Ratio spikes above 0.45 for the credit scores between 825 and 835 and then drops signficiantly below 0.15
In this section, I will analyze the relationship between average Loan Original Amount and Credit Score and how this relationship may vary depending on the income range and loan status.
Looking at the plot grid above, while different income levels show differet behaviour, we may fairly say that for the most part, there may be a weak positive corroaltion between the average loan amount and the average credit score for both good standing and delinquent loans with few deviations, for example: in the income range 50K-75K, we noticed the average loan amount dropped significantly for borrowers with a credit score of 830 and above. On the other hand, for borrowers with an Income Range 100K+, the average loan amount spikes up especially for those with Delinquent loans.
In this section, I will zoom in on two main variables: loan delinquency (loan status) and interest rate by exploring the following relationships:
Here I summarized the data to find out how the average interest rate varies depending on the average credit score by loan status. This is a common aspect to look at when applying for loans or credit cards and the goal here is to see if Delinquent loans have a different behaviour from good standing loans, and as expected, there is negative correlation between credit score and interest rate. In other words, higher credit score means lower interest rate. However, for delinquent loans, I observed some fluctuations and unusual behaviour where the interest rate increased as the credit score went above 850 which. This is area to be considered for further investigation.
In this section, I tried to find an explanation of the observation I found in the Bivariate Analysis section where the proportion delinquent loans increased as the PropserScore increased from 3 to 5. By isolating the loans with these scores 3,4,5 and ploting the correlation matrix, I wasn’t able to identify any clear relationship between the ProsperScore and the variables, however, I found that for loans with the scores 3, 4, and 5, the count of both good standing and delinquent loans experiences a significant spike at interest rates 0.3177 and 0.3199.
Here I found that for borrowers with credit scores between 600 - 750, we see a fairly positive correlation between Debt2Income and Credit score, however, as the credit score goes above 750 we notice a negative correlation for the most part. I also found that the mean DebtToIncome Ratio spikes above 0.45 for the credit scores between 825 and 835 and then drops signficiantly below 0.15.
In this analysis, we found a weak positive corroaltion between the average loan amount and the average credit score for both good standing and delinquent loans with few deviations, for example: in the income range 50K-75K, we noticed the average loan amount dropped significantly for borrowers with a credit score of 830 and above. On the other hand, for borrowers with an Income Range 100K+, the average loan amount spikes up especially for those with Delinquent loans.
For the most part, all the observations found are consistent with our general understanding of the loan process. However, as I mentioned above, there is some unusual behaviour for the delinquent loans that may be interesting for further analysis. Please refer to the answer of the question above for more information.
In the plot above, we found that the distribution of the BorrowerRate for Good standing loans experiences more fluctuations than Delinquent loans; we also notice the count of both Good standing and Delinquent loans experiances a significant peak at BorrowRate = 0.31799.
In this plot, we observed a negative correlation between the original loan amount and the interest rate which makes sense as larger loan amounts are more likely to receive lower interest rates than smaller loan amounts; to confirm our observation, I performed a correlation test and got (-0.33) which indicates a weak negative correlation between the two variables.
Looking at the output above, we notice that for both Good Standing and Delinquent loans, there is a negative correlation between the mean interest rate and the borrower’s average credit score, however, Good Standing loans show a steady decrease in the mean interest rate as the average credit score goes up, where in Delinquent loans, we see more fluctuations especially for borrowers with high credit scores, as we can see, as the average credit score goes above 850, the mean interest rate for Delinquent loans jumps up which may seem unusual.
The analysis above aimed at exploring the Prosper Loan dataset. I focused 2 key variables: Interest Rate and Loan Delinquency. The purpose of the analysis was to provide a preliminary guidance to help borrowers and investors make informative decisions. The majority of the observations is consistent with the general understanding of the loan process. There was a number of unusual observations that were pointed out in the analysis that could be investigated further. There was no major chanllenges in performing the analysis. Some of the minor challenges faced: - missing values - duplicated values - erroneous values - Lack of detailed definitions for a number of variables.
Throughtout the analysis, I created a number of variables and functions to facilitate the analysis.
Beside the comments above, some of the areas for future improvement may include:
ListingCategory, Occupation, EmploymentStatushttps://en.wikipedia.org/wiki/Prosper_Marketplace https://www.prosper.com/ https://www.investopedia.com/ask/answers/09/difference-between-yields-and-interest-rate.asp https://www.investopedia.com/ask/answers/062315/what-are-differences-between-delinquency-and-default.asp https://www.investopedia.com/terms/r/revolving-account.asp https://www.investopedia.com/terms/t/trade-line.asp https://en.wikipedia.org/wiki/Pearson_correlation_coefficient