Loan Data from Prosper (Exploratory Data Analysis)

By Toufik Kannab


Introduction:

Prosper is a peer-to-peer lending platform, with over $10 billion in funded loans, where individuals can either invest in personal loans or request to borrow money. On Prosper, borrowers list loan requests between $2,000 and $35,000 and individual investors invest as little as $25 in each loan listing they select. Prosper handles the servicing of the loan on behalf of the matched borrowers and investors. Prosper Marketplace is backed by leading investors including Sequoia Capital, Francisco Partners, Institutional Venture Partners, and Credit Suisse NEXT Fund.

This report will explore a loan dataset containing 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, the latest payment information and many others.

  • To download the dataset documentation, please click here

  • To download the dataset, please click here.

  • To view the code, please click here

Univariate Plots Section

For the purpose of this project and after reviewing the 81 variables, the analysis will be limited to the following variables:

##  [1] "ListingKey"                "ListingCreationDate"      
##  [3] "Term"                      "LoanStatus"               
##  [5] "BorrowerRate"              "ProsperRating..numeric."  
##  [7] "ProsperRating..Alpha."     "ProsperScore"             
##  [9] "ListingCategory..numeric." "BorrowerState"            
## [11] "Occupation"                "EmploymentStatus"         
## [13] "IsBorrowerHomeowner"       "CreditScoreRangeLower"    
## [15] "CreditScoreRangeUpper"     "IncomeVerifiable"         
## [17] "AvailableBankcardCredit"   "DebtToIncomeRatio"        
## [19] "LoanKey"                   "MemberKey"                
## [21] "IncomeRange"               "LoanOriginalAmount"       
## [23] "Investors"

Preliminary Exploration:

As we start our exploration, we will check for missing and duplicated data as applicable.

1. Dataset Keys

I reviewed the dataset structure and the variable definitions and found 4 key-related variables: ListingKey, GroupKey, LoanKey, and MemberKey; while GroupKey and MemberKey seem to have a distinct purpose and definition, ListingKey and LoanKey have a similar definition, so let’s check if there are any listings where ListingKey and LoanKey match:

##  [1] ListingKey                ListingCreationDate      
##  [3] Term                      LoanStatus               
##  [5] BorrowerRate              ProsperRating..numeric.  
##  [7] ProsperRating..Alpha.     ProsperScore             
##  [9] ListingCategory..numeric. BorrowerState            
## [11] Occupation                EmploymentStatus         
## [13] IsBorrowerHomeowner       CreditScoreRangeLower    
## [15] CreditScoreRangeUpper     IncomeVerifiable         
## [17] AvailableBankcardCredit   DebtToIncomeRatio        
## [19] LoanKey                   MemberKey                
## [21] IncomeRange               LoanOriginalAmount       
## [23] Investors                
## <0 rows> (or 0-length row.names)

Based on the output above, ListingKey and LoanKey have different values, but before determining which one to use, let’s check for NA empty values and duplication.

## No. of NA or empty values found in the variable (ListingKey): 0
## No. of NA or empty values found in the variable (LoanKey): 0
## No. of duplicate values found in the variable (ListingKey): 827
## No. of duplicate values found in the variable (LoanKey): 827

According to the definitions provided with the dataset, both ListingKey and LoanKey should be unique, however, as per the output above, I found a total of 827 duplicates; after isolating and reviewing duplicated rows, I found that for each duplicated key, all the other variables have duplicated values (including LoanKey) except for ProsperScore. For the purpose of this project, I will exclude them from the dataset.

2. Member Keys

Looking at the variable definition, Member keys are not necessarily unique, but let’s check if there are any duplicate borrowers (borrowers who have more than one loan in the database):

## No. of duplicate values found in the variable (MemberKey): 0

As per the output above, it looks like MemberKey doesn’t have duplicate values.

3. Loans Creation Date

For the purpose of this project, I will create a new variable called LoanCreationYear based on ListingCreationDate. Let’s plot the listings per the new variable to see the listings distribution by year.

Looking at the plot above, we notice that the loan dataset spans between 2005 and 2014. Moreover, we also notice that the number of loans prior to 2010 is significantly less than 2010 - 2014. Also, from the dataset documentation, we notice there are a number of variabes ONLY applicable to loans created after July 13, 2009, including: ProsperRating (numeric), ProsperRating (Alpha), ProsperScore, and since I’m planning to use these variables later in my analysis, I will exclude the loans created prior to July 13, 2009 to avoid NA or missing values. Let’s re-plot the chart above after excluding the listings (loans) created prior to July 13, 2009:

## [1] 83155

After excluding the loans, we are left with 83,155 loans which is a reasonable number to perform the analysis.

4. Loan Term:

## # A tibble: 3 x 3
##    Term     n   freq
##   <int> <int>  <dbl>
## 1    36 57611 0.693 
## 2    60 23931 0.288 
## 3    12  1613 0.0194

Looking at the output abvove, we notice the majority of the loans (69.3%) have a 36-month term duration, where (29%) have a 60-month duration, and only about (2%) have a 12-month duration.

5. LoanStatus:

## # A tibble: 11 x 3
##    LoanStatus                 n     freq
##    <fct>                  <int>    <dbl>
##  1 Current                54928 0.661   
##  2 Completed              19638 0.236   
##  3 Chargedoff              5336 0.0642  
##  4 Defaulted               1005 0.0121  
##  5 Past Due (1-15 days)     794 0.00955 
##  6 Past Due (31-60 days)    359 0.00432 
##  7 Past Due (61-90 days)    309 0.00372 
##  8 Past Due (91-120 days)   304 0.00366 
##  9 Past Due (16-30 days)    265 0.00319 
## 10 FinalPaymentInProgress   201 0.00242 
## 11 Past Due (>120 days)      16 0.000192

Looking at the output above, we notice around (66%) of the loans have an active status or Current, (23.6%) are Completed, (6.4%) are Chargedoff and only (1.2%) are Defaulted. Since Loan Delinquency (LoanStatus) will play a major role in this analysis, I thought it might be useful to spend some time understanding the different categories within this variable, more specifically: Defaulted, ChargedOff and PastDue; below is a summary of the information I’ve gathered:

  • Loan Delinquency: it is a situation in which a borrower misses its due date for a single scheduled payment for a form of financing, like student loans, mortgages, credit card balances or automobile loans. Delinquency occurs as soon as a borrower misses a payment. Investopedia

  • Default Loan: the failure to pay interest or principal on a loan or security when due. Default occurs when a debtor is unable to meet the legal obligation of debt repayment, and it also refers to cases in which one party fails to perform on a futures contract as required by an exchange. Investopedia

  • Charge-off: A charge-off is a debt, (e.g. a credit card), that is deemed unlikely to be collected by the creditor because the borrower has become substantially delinquent after a period of time. However, a charge-off does not mean a write-off of the debt entirely. A charge-off usually occurs when the creditor has deemed an outstanding debt is uncollectible; this typically follows 180 days or six months of non-payment. In addition, debt payments that fall below the required minimum payment for the period will also be charged off if the debtor does not make up for the shortfall. The creditor crosses off the consumer’s debt as uncollectible and marks it on the consumer’s credit report as a charge-off. Investopedia.

According to federal regulations, installment loans (like student loans) must be charged-off after being delinquent for 120, while revolving credit accounts (i.e, credit cards) must be charged-off after they’ve been delinquent for 180 days. BadCredit.org

Delinquency vs. Default: Delinquency and default are both loan terms representing different degrees of the same problem: missing payments. A loan becomes delinquent when you make payments late (even by one day) or miss a regular installment payment or payments. On the other hand, Loan default is much more serious, changing the nature of your borrowing relationship with the lender, and with other potential lenders as well. Investopedia

Looking at the variable LoanStatus, although there is no category called Delinquent, we can identify which loands are delinquent based on the assumption that ANY delay will be considered a Delinquency, by definition. To achieve that, I created a new variable called DelinquentList which takes one of the following values:

  • Delinquent: when a listing (loan) status is one of the following: Chargedoff, Defaulted, Past Due (1-15 days), Past Due (31-60 days), Past Due (61-90 days), Past Due (91-120 days), or Past Due (16-30 days).
  • Good Standing: when a listing status is one of the following: Completed, Current, FinalPaymentInProgress, and Canceled (after excluding the loans created prior to July 13, 2009, I didn’t find any loans that have the status Canceled).

## # A tibble: 2 x 3
##   DelinquentList     n   freq
##   <chr>          <int>  <dbl>
## 1 Good Standing  75032 0.902 
## 2 Delinquent      8123 0.0977

Looking at the output above, we notice the greatest majority of loans (90.23%) are in Good Standing status where (9.77%) are Delinquent. Later on, we will use the new variable DelinquentList to break down other variables and identify any possible correlations between those variables and loan delinquency.

6. Borrower Rate

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.  Std. Dev 
## 0.0400000 0.1359000 0.1897000 0.1965692 0.2579000 0.3600000 0.0700000
## # A tibble: 10 x 3
##    BorrowerRate     n    freq
##           <dbl> <int>   <dbl>
##  1        0.318  3672 0.0442 
##  2        0.320  1645 0.0198 
##  3        0.270  1304 0.0157 
##  4        0.110   890 0.0107 
##  5        0.350   802 0.00964
##  6        0.210   760 0.00914
##  7        0.260   750 0.00902
##  8        0.158   742 0.00892
##  9        0.130   730 0.00878
## 10        0.220   729 0.00877

Looking on the output above, we notice that the median and mean are close (18.97% to 19.66%); also, there are 3,672 borrowers (or 3.2 %) that have received an interest rate of (31.77 %) which is way far from the mean and median, this peak looks interesting and may deserve a deeper investigation.

7. Prosper Rating

## # A tibble: 7 x 3
##   ProsperRating..numeric.     n   freq
##                     <int> <int>  <dbl>
## 1                       1  6901 0.0830
## 2                       2  9639 0.116 
## 3                       3 14070 0.169 
## 4                       4 17862 0.215 
## 5                       5 15166 0.182 
## 6                       6 14237 0.171 
## 7                       7  5280 0.0635

According to the definition, ProsperRating (numeric) is a rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. However, the definition doesn’t provide sufficient information as to what this rating actually represents or how it is calcuated/determined. As we progress, we will try to identify any possible correlations between this rating and the other variables.

8. Prosper Score

## # A tibble: 11 x 3
##    ProsperScore     n   freq
##           <dbl> <int>  <dbl>
##  1           1.   917 0.0110
##  2           2.  5632 0.0677
##  3           3.  7456 0.0897
##  4           4. 12300 0.148 
##  5           5.  9661 0.116 
##  6           6. 12064 0.145 
##  7           7. 10392 0.125 
##  8           8. 11875 0.143 
##  9           9.  6813 0.0819
## 10          10.  4637 0.0558
## 11          11.  1408 0.0169

Looking at the output above, the distribution looks relatively normal; the top 3 proportions of loans have received a score of 4, 6 and 8 respectively.

According to the definition, ProsperScore is a custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. However, the output above shows (1,408) loans with a score of 11. For the purpose of this project and after reviewing the documentation, I will replace the 11’s with the mean value of ProsperScore which is 6 (the mean is 5.9, I will round it up to 6). After replacing the errored values, let’s check the values again:

## # A tibble: 10 x 3
##    ProsperScore     n   freq
##           <dbl> <int>  <dbl>
##  1           1.   917 0.0110
##  2           2.  5632 0.0677
##  3           3.  7456 0.0897
##  4           4. 12300 0.148 
##  5           5.  9661 0.116 
##  6           6. 13472 0.162 
##  7           7. 10392 0.125 
##  8           8. 11875 0.143 
##  9           9.  6813 0.0819
## 10          10.  4637 0.0558

8. Listing Category:

Per review of the definition and the values asssigned to variable ListingCategory..numeric., I will create a new variable called NewListingCategory to include the actual categories.

## # A tibble: 20 x 3
##    NewListingCategory     n     freq
##    <chr>              <int>    <dbl>
##  1 Debt Consolidation 51849 0.624   
##  2 Other               9127 0.110   
##  3 Home Improvement    6713 0.0807  
##  4 Business            5236 0.0630  
##  5 Auto                2229 0.0268  
##  6 Household Expenses  1978 0.0238  
##  7 Medical/Dental      1493 0.0180  
##  8 Taxes                879 0.0106  
##  9 Large Purchases      851 0.0102  
## 10 Vacation             760 0.00914 
## 11 Wedding Loans        753 0.00906 
## 12 Motorcycle           304 0.00366 
## 13 Student Use          274 0.00330 
## 14 Engagement Ring      211 0.00254 
## 15 Baby&Adoption        193 0.00232 
## 16 Cosmetic Procedure    91 0.00109 
## 17 Boat                  85 0.00102 
## 18 Green Loans           57 0.000685
## 19 RV                    52 0.000625
## 20 Not Available         20 0.000241

Looking at the output above, we notice the largest category (62.24%) is Debit Consolidation, where the smallest category is Not Available with only 20 loans.

9. BorrowerState:

## # A tibble: 10 x 3
##    BorrowerState     n   freq
##    <fct>         <int>  <dbl>
##  1 CA            10524 0.127 
##  2 NY             5709 0.0687
##  3 TX             5523 0.0664
##  4 FL             5302 0.0638
##  5 IL             4168 0.0501
##  6 OH             3308 0.0398
##  7 GA             3284 0.0395
##  8 VA             2725 0.0328
##  9 NJ             2657 0.0320
## 10 PA             2634 0.0317

Looking at the output above, we notice there are loans from the 48 states, as expected, the state with the largest proportion (12.66%) is California, where the state with the smallest proportion is Wyoming with 123 loans.

10. Occupation:

## # A tibble: 10 x 3
##    Occupation                   n   freq
##    <fct>                    <int>  <dbl>
##  1 Other                    20942 0.252 
##  2 Professional             10354 0.125 
##  3 Executive                 3407 0.0410
##  4 Computer Programmer       3165 0.0381
##  5 Teacher                   2828 0.0340
##  6 Analyst                   2687 0.0323
##  7 Administrative Assistant  2686 0.0323
##  8 Accountant/CPA            2526 0.0304
##  9 Sales - Commission        2288 0.0275
## 10 Skilled Labor             2134 0.0257
## No. of NA or empty values in the variable (Occupation): 1218

Looking at the results above, the largest proportion of borrowers (25.11%) reported Other as occupation, next comes Professional with (11.96%). These results lead us to think of a number of scenarios, one scenario is that the majority of borrowers don’t probably have one specific job or role; or they are not willing to report their actual job, or perhaps they have an occupation not on list, which is kind of unlikely to be the case as the list quite comprehensive. Also, I also found 1,218 loans that don’t have the borrower’s occupation included.

11. Employement Status:

## # A tibble: 7 x 3
##   EmploymentStatus     n    freq
##   <fct>            <int>   <dbl>
## 1 Employed         65900 0.792  
## 2 Full-time         7925 0.0953 
## 3 Self-employed     4377 0.0526 
## 4 Other             3681 0.0443 
## 5 Not employed       649 0.00780
## 6 Retired            367 0.00441
## 7 Part-time          256 0.00308

Looking at the output above, the largest proportion of borrowers (79.25%) are Employed where only (0.3%) have a Part-time job.

12. Home Owners:

## # A tibble: 2 x 3
##   IsBorrowerHomeowner     n  freq
##   <fct>               <int> <dbl>
## 1 True                44014 0.529
## 2 False               39141 0.471

Looking at the output above, we notice that about 53% of the borrowers are home owners.

13. Average Credit Score:

To analyze the credit score, I will create a new variable called AvgCreditScore which is the average value of two variables: CreditScoreRangeUpper and CreditScoreRangeLower.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. Std. Dev 
## 609.5000 669.5000 709.5000 709.0601 729.5000 889.5000  47.2700
## # A tibble: 10 x 3
##    AvgCreditScore     n   freq
##             <dbl> <int>  <dbl>
##  1           610.  1038 0.0125
##  2           630.  1653 0.0199
##  3           650.  8656 0.104 
##  4           670. 13762 0.165 
##  5           690. 13672 0.164 
##  6           710. 13307 0.160 
##  7           730. 10791 0.130 
##  8           750.  7760 0.0933
##  9           770.  5176 0.0622
## 10           790.  3672 0.0442

Looking at the results above, The average value of the AvgCreditScore is (709.1) with a minimum of (609.5) and a maximum of (889.5). According to a (fico.com) blog post, the percentage of consumers scoring in the range (800 and up) has steadily increased since 2010 after the bottoming out of the economy between 2009 - 2010; the data presented in the post was based on a snapshot of millions of US consumers’ credit data that ranges between 2005 - 2017, the post shows how the average credit score has increased from (686) in October 2009 to (696) in October 2015, and later hit the (700) threshold by April 2017. By comparing the figures included in this post with our dataset, I can reasonably assume that our dataset is consistent with what was found in the post; I’ve also noticed, based on the histogram above, that the distribution of the average credit score is relatively normal, which is expected, with a longer right tail. Another interesting observation I’ve noticed is that there are spaces of the same size (20 points) between the bins, I’m not quite sure why this is the case, it could be related to the way credit scores are fed to the loan system, but this is something to be considered when conducting a more in-depth analysis of the credit score variables.

14. Available Bank card Credit

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.  Std. Dev 
##      0.00   1138.00   4565.00  11399.61  13900.00 498374.00  18660.96
## # A tibble: 10 x 3
##    AvailableBankcardCredit     n     freq
##                      <dbl> <int>    <dbl>
##  1                      0.  2407 0.0289  
##  2                    500.   221 0.00266 
##  3                    300.   101 0.00121 
##  4                   1000.    98 0.00118 
##  5                   2000.    89 0.00107 
##  6                   3000.    75 0.000902
##  7                   1500.    72 0.000866
##  8                   2500.    63 0.000758
##  9                   5000.    55 0.000661
## 10                     76.    50 0.000601

Looking at the ouput above, we notice the distribution of the variable AvailableBankcardCredit is significantly right-skewed;

15. DebtToIncomeRatio:

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
##    0.0000000    0.1500000    0.2200000    0.2586333    0.3200000 
##         Max.         NA's     Std. Dev 
##   10.0100000 7135.0000000           NA
## # A tibble: 10 x 3
##    DebtToIncomeRatio     n   freq
##                <dbl> <int>  <dbl>
##  1            NA      7135 0.0858
##  2             0.180  3156 0.0380
##  3             0.220  2950 0.0355
##  4             0.170  2707 0.0326
##  5             0.140  2662 0.0320
##  6             0.210  2512 0.0302
##  7             0.160  2502 0.0301
##  8             0.200  2477 0.0298
##  9             0.190  2405 0.0289
## 10             0.150  2400 0.0289
## No. of NA or empty values in the variable (DebtToIncomeRatio): 7135

Looking at the output above, the distribution of the variable DebtToIncomeRatio is right skewed with a mean of (25.9%). We also notice that the largest proportions (8.5%) or (7,135) loans don’t have a value for DebtToIncomeRatio, where (3.8%) of the loans have a DebtToIncomeRatio of (18%).

16. Income Range:

## # A tibble: 7 x 3
##   IncomeRange        n     freq
##   <fct>          <int>    <dbl>
## 1 $50,000-74,999 25040 0.301   
## 2 $25,000-49,999 23686 0.285   
## 3 $100,000+      14914 0.179   
## 4 $75,000-99,999 14232 0.171   
## 5 $1-24,999       4589 0.0552  
## 6 Not employed     649 0.00780 
## 7 $0                45 0.000541

Looking at the output above, we notice that the largest propotion of borrowers (30%) have an income range of (50K-75K), where the smallest group (0.5%) have reported zero dollar income.

17. Income Verifiable:

## # A tibble: 2 x 3
##   IncomeVerifiable     n   freq
##   <fct>            <int>  <dbl>
## 1 True             75983 0.914 
## 2 False             7172 0.0862

Looking at the output above, we notice the majority of borrowers (91.38%) have provided income verification documents.

18. Loan Original Amount:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9039   13500   35000
## # A tibble: 10 x 3
##    LoanOriginalAmount     n   freq
##                 <int> <int>  <dbl>
##  1               4000 12985 0.156 
##  2              15000 11118 0.134 
##  3              10000  9526 0.115 
##  4               2000  4544 0.0546
##  5               5000  4152 0.0499
##  6               3000  3411 0.0410
##  7              20000  2815 0.0339
##  8              25000  2706 0.0325
##  9               7000  2254 0.0271
## 10               7500  2162 0.0260

Looking at the output above, the average original loan amount is ($ 9,039). To better understnad the distribution of loan amounts, I adjusted the histogram binwidth to (100). As a result, we notice there are specific loan amounts that are more common than others, for example ($ 4,000) has the highest proportion (15.6%), ($ 15,000) comes next with (13.3%); We can also see an outlier at the value $35,000.

19. Investors

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00   33.00   69.07   98.00 1189.00
## # A tibble: 10 x 3
##    Investors     n    freq
##        <int> <int>   <dbl>
##  1         1 26206 0.315  
##  2         2  1178 0.0142 
##  3         3   848 0.0102 
##  4         4   686 0.00825
##  5         5   609 0.00732
##  6         8   606 0.00729
##  7         6   565 0.00679
##  8         7   542 0.00652
##  9         9   542 0.00652
## 10        10   519 0.00624

Looking at the output above, the average number of investors per loan is (69) investors; the largest proportion of loans (%31.5) have one investor. We also notice that the maximum number of investors for one loan is (1,189) , which is an interesting number, so I pulled up the loans with a number of investors greater than 1000 and found (2) loans with a loan original amount of ($ 25,000) and (2) loans with a loan original amount of ($ 20,000)

##   Investors LoanOriginalAmount
## 1      1024              25000
## 2      1035              25000
## 3      1011              20000
## 4      1189              20000

Univariate Analysis

What is the structure of your dataset?

The dataset includes 113,937 observations and 81 variables. After excluding loans with duplicate keys and loans created pior to July 13, 2009, the new number of loans is 83,155.

What is/are the main feature(s) of interest in your dataset?

There are two main features to be investigated:

  1. Loan Delinquency: I’m interested in analyzing how other variables contribute to the loan delinquency rate. This analysis will provide some guidance on the contributing factors to loan delinquency and ultimately help investors select investments with a reasonable risk/return ratio. In order to do that, I created a new variable DelinquentList based on the variable LoanStatus and the information collected from the external references included above.
  2. Interest Rate: Here, I’m interested in analyzing the contributing factors to the loan interest rate and ultimately provide some guidance to help borrowers make more informative decisions when applying for loans.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

To perform my analysis, I’ve limited the dataset to the following variables:

“ListingKey”, “ListingCreationDate”, “Term”, “LoanStatus”, “BorrowerRate”, “ProsperRating..numeric.”, “ProsperRating..Alpha.”, “ProsperScore”,“ListingCategory..numeric.”,“BorrowerState”, “Occupation”, “EmploymentStatus”, “IsBorrowerHomeowner”, “CreditScoreRangeLower”, “CreditScoreRangeUpper”, “IncomeVerifiable”,“AvailableBankcardCredit”, “DebtToIncomeRatio”, “LoanKey”, “MemberKey”,“IncomeRange”, “LoanOriginalAmount”,“Investors”

Did you create any new variables from existing variables in the dataset?

Yes, as I mentioned above, I created the following variables: LoanCreationYear, DelinquentList, AvgCreditScore, NewListingCategory

I also created a new function mySummary to add standard deviation to the output of the built-in summary function.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The dataset is well-structured, therefore, I haven’t performed any major operations to adjust the structure, however, I’ve made a number of minor formatting changes. For example, I changed the format of ListingCreationDate from character to DateTime.

Also, I updated the ProserScore column to replace 11’s with 6’s (the mean score) as the score takes values between 1 to 10, per the definition.

As far as unusual distributions, I found the following: - AvailableBankcardCredit is significantly right-skewed. - DebtToIncomeRatio is right skewed. - LoanOriginalAmount is relatively right-skewed with a good proportion of outliers. - Investors is significantly right-skewed.

I also found some interesting Data Peaks & valleys. For example:

  • Term: (70%) of the loans have a 36 month term duration.
  • DelinquentList: (90%) of the loans have a Good Standing status, which indicates a healthy economy.
  • BorrowerRate:: (4.4 %) of the loans have an interest rate of (31.8 %) which can be considered high when compared to the mean (19.6%).
  • ListingCategory: (62.4%) of the loans reported Debit Consolidation.
  • BorrowerState: (12.7%) of the loans belong to borrowers from California.
  • Occupation: (25%) reported Other as an Occupation.

As far as missing values, I identified the following:

  • DebtToIncomeRatio: I’ve found 7,135 missing value which seems unusual. Unfortunately, there is no further explanation provided in the dataset documentation as to why these values are missing. While it might be on option to populate the missing rows with a calcuated DebtToIncomeRatio using the available debt and income information in the dataset, it wouldn’t be easy to determine the total amount of debt since borrowers might have other loans with different financial institutions not included in the dataset, so it won’t be an efficient choice, and therefore I excluded those loans from the dataset.
  • Occupation: we found 1,218 missing values.

Bivariate Plots Section

In this section, I will mainly focus on two main features: Loan Delinquency (DelinquentList) and Interest Rate (BorrowerRate), however, I may extend my analysis to include other features, as applicable,

I. Loan Delinquency

1. Loan Delinquency vs. Loan Creation Year

##      var2 Delinquent Good Standing
## var1                              
## 2009           15.29         84.71
## 2010           16.89         83.11
## 2011           21.39         78.61
## 2012           17.18         82.82
## 2013            3.05         96.95
## 2014            0.07         99.93

Looking at the output above, we notice the percentage of Good Standing loans has significantly increased in 2014 (99.93%) as opposed to (78.61%) in 2011.

2. Loan Delinquency vs. Term

##      var2 Delinquent Good Standing
## var1                              
## 12              5.52         94.48
## 36             10.71         89.29
## 60              7.79         92.21

Looking at the output above, it looks like loans with 12-month duration tend to perform slightly better than loans with 36- and 60-month duations;

3. Loan Delinquency vs. Borrower Rate

Looking at the plot above, we notice the distribution of the BorrowerRate for Good standing loans experiences more fluctuations than Delinquent loans; we also notice the count of both Good standing and Delinquent loans experiances a significant peak at BorrowRate = 0.31799. This is an interesting observation and can be worth further analysis.

4. Loan Delinquency vs. ProsperRating

##      var2 Delinquent Good Standing
## var1                              
## 1              24.21         75.79
## 2              17.09         82.91
## 3              14.78         85.22
## 4               7.03         92.97
## 5               5.41         94.59
## 6               3.83         96.17
## 7               1.99         98.01

Looking at the output above, and as expected, the proportion of Good Standing loans increases as the ProsperRating increases (positive correlation).

5. Loan Delinquency vs. Prospoer Score

##      var2 Delinquent Good Standing
## var1                              
## 1              33.04         66.96
## 2              11.75         88.25
## 3               9.56         90.44
## 4               9.77         90.23
## 5              13.68         86.32
## 6              10.30         89.70
## 7               9.45         90.55
## 8               8.46         91.54
## 9               6.47         93.53
## 10              2.29         97.71

Looking at the output above; similar to what we found with ProsperRating..numeric., the proportion of Good Standing loans increases as the ProsperScore increases. However, for loans with scores of 4 and 5, As the score increases, the proportion of Good Standing loans decreases, which leads us to think there might be another feature that could have a stronger impact on loan Delinquency than impact ProsperScore has on loan Delinquency. This is an interesting observation and can be worth further analysis.

6. Loan Delinquency vs. Listing Category

##                    var2 Delinquent Good Standing
## var1                                            
## Auto                         13.01         86.99
## Baby&Adoption                 8.29         91.71
## Boat                          5.88         94.12
## Business                     16.39         83.61
## Cosmetic Procedure           14.29         85.71
## Debt Consolidation            7.65         92.35
## Engagement Ring               5.21         94.79
## Green Loans                  19.30         80.70
## Home Improvement             11.80         88.20
## Household Expenses           14.66         85.34
## Large Purchases               8.93         91.07
## Medical/Dental               12.93         87.07
## Motorcycle                    5.26         94.74
## Not Available                 5.00         95.00
## Other                        14.56         85.44
## RV                            3.85         96.15
## Student Use                  15.69         84.31
## Taxes                         8.76         91.24
## Vacation                      9.34         90.66
## Wedding Loans                 8.37         91.63

Looking at the output above, the category RV has the highest proportion (96.15%) of Good Standing Loans, where in Debt Consoliation, the percentage of good standing loans is (92.35%)

7. Loan Delinquency vs. Debt-To-InCome Ratio

From the plot above, we notice the distributions of the DebtToIncomeRatio are right-skewed for both Good Standing and Delinquent loans in the range between 0 and 0.8, however, things become less clear above 0.8. To get a better understanding of the distributions above 0.8, I will transform the data using log10() function, as follows:

After transforming the data, we notice DebtToIncomeRatio has a similar behaviour for Delinquent and Good standing loans with minor fluctuations.

II. Interest Rate (BorrowerRate)

1. Interest Rate vs State

## # A tibble: 10 x 6
##    BorrowerState     n  mean median `25%` `75%`
##    <fct>         <int> <dbl>  <dbl> <dbl> <dbl>
##  1 AL             1165 0.209  0.203 0.149 0.270
##  2 MS              667 0.208  0.204 0.147 0.264
##  3 SD              182 0.208  0.198 0.153 0.268
##  4 AR              761 0.207  0.205 0.152 0.260
##  5 ID              385 0.205  0.198 0.144 0.268
##  6 LA              817 0.205  0.194 0.146 0.268
##  7 KY              870 0.204  0.198 0.147 0.261
##  8 TN             1518 0.204  0.198 0.145 0.264
##  9 MO             1759 0.204  0.198 0.142 0.270
## 10 MI             2572 0.204  0.198 0.145 0.262

Looking at the output above, we notice that Alabama (AL) has the highest average interest rate of (20.9%) where District of Columbia (DC) has the lowest average interest rate of (17.7%); I also found that California (CA), which has the highest loan proportion, has an average interest rate of (19%) which is quite close to the overall average interest rate;

2. Interest Rate vs Listing Category

## # A tibble: 20 x 6
##    NewListingCategory     n  mean median `25%` `75%`
##    <chr>              <int> <dbl>  <dbl> <dbl> <dbl>
##  1 Not Available         20 0.239  0.243 0.213 0.296
##  2 Cosmetic Procedure    91 0.226  0.245 0.166 0.290
##  3 Household Expenses  1978 0.224  0.229 0.168 0.292
##  4 Other               9127 0.216  0.221 0.148 0.287
##  5 Auto                2229 0.214  0.218 0.145 0.285
##  6 Student Use          274 0.212  0.201 0.110 0.289
##  7 Business            5236 0.210  0.210 0.141 0.279
##  8 Medical/Dental      1493 0.210  0.210 0.150 0.270
##  9 Vacation             760 0.208  0.208 0.146 0.279
## 10 RV                    52 0.207  0.214 0.146 0.272
## 11 Taxes                879 0.207  0.208 0.151 0.257
## 12 Wedding Loans        753 0.206  0.208 0.151 0.258
## 13 Green Loans           57 0.204  0.197 0.133 0.300
## 14 Home Improvement    6713 0.201  0.198 0.136 0.264
## 15 Motorcycle           304 0.198  0.191 0.141 0.256
## 16 Engagement Ring      211 0.196  0.198 0.144 0.246
## 17 Large Purchases      851 0.192  0.184 0.130 0.249
## 18 Debt Consolidation 51849 0.188  0.180 0.133 0.243
## 19 Baby&Adoption        193 0.187  0.177 0.122 0.247
## 20 Boat                  85 0.174  0.164 0.125 0.226

Looking at the output above, we notice that the category Not Available has the highest average interest rate of (23.9%), unfortunately, that doesn’t tell us much information; but the interesting thing is the category Cosmetic Procedure which has a mean of (22.6%); this gives an idea of how these procedures are increasinlgy becoming a priority in peoples’ lives over other necessities, such as education, health, or business projects.

3. Interest Rate vs Employement Status

## # A tibble: 7 x 6
##   EmploymentStatus     n  mean median `25%` `75%`
##   <fct>            <int> <dbl>  <dbl> <dbl> <dbl>
## 1 Not employed       649 0.261  0.286 0.215 0.318
## 2 Retired            367 0.217  0.220 0.140 0.293
## 3 Other             3681 0.215  0.210 0.158 0.277
## 4 Self-employed     4377 0.213  0.208 0.149 0.285
## 5 Part-time          256 0.213  0.226 0.120 0.290
## 6 Full-time         7925 0.199  0.199 0.107 0.272
## 7 Employed         65900 0.193  0.184 0.136 0.250

Looking at the output above, we notice the category Not employed has the highest average interest rate of 26%, where the category Employed has the lowest average interest rate of 19.3%, which is expected.

4. Interest Rate vs LoanOriginalAmount

## 
##  Pearson's product-moment correlation
## 
## data:  loan_updated$LoanOriginalAmount and loan_updated$BorrowerRate
## t = -131.03, df = 83153, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4193118 -0.4080445
## sample estimates:
##       cor 
## -0.413694

Looking at the plot above, we see a negative correlation between the original loan amount and the interest rate which makes sense as larger loan amounts are more likely to receive lower interest rates than smaller loan amounts; to confirm our observation, I performed a correlation test and got (-0.33) which indicates a weak negative correlation between the two variables.

5. Corrolation Coefficient Matrix

To build on what we found in the correlation test above and provide some intuition on which variables have a strong positive or negative correlation (i.e. above 0.5, below -0.5), let’s plot the correlation matrix using Pearson correlation test

## Warning in ggcorr(data = loan_updated, method = c("pairwise", "pearson"), :
## data in column(s) 'ListingKey', 'ListingCreationDate', 'LoanStatus',
## 'ProsperRating..Alpha.', 'BorrowerState', 'Occupation', 'EmploymentStatus',
## 'IsBorrowerHomeowner', 'IncomeVerifiable', 'LoanKey', 'MemberKey',
## 'IncomeRange', 'LoanCreationYear', 'DelinquentList', 'NewListingCategory'
## are not numeric and were ignored

Looking at the matrix above, we notice the following:

  • Credit Score and Available Bank Card Credit: we notice a fairly strong postive correlation (coeff: 0.5) between credit score and available bank card credit.

  • Prosper Score & Prosper Rating: we notice a strong positive correlation (coeff: 0.7) between ProsperScore and ProsperRating which is expected.

    • Credit Score & Prosper Rating: we notice a strong positive correlation (coeff: 0.5) between credit score and Prosper Rating which is also expected.

    • Prosper Score & Borrower Rate: we notice a strong negative correlation (coeff: -0.6) between Prosper score and Borrower Rate which is expected.

    • Credit Score & Borrower Rate: we notice a farily strong negative correlation (coeff: -0.5) between Credit score and Borrower Rate which is expected.

    • Prosper Rating & Borrower Rate: we notice a strong negative correlation (coeff: -1) between Prosper Rating and Borrower Rate which is expected.

Bivariate Analysis

Summary:

In the bivariate analysis, we analyzed two types of relationships between variables:

  1. Categorical / Categorical: in this section, I analyzed the relationship between Loan Delinquency and a number of variables and found the following:

    • Loan Delinquency (by Loan Creation Year): the percentage of Good Standing loans has significantly increased in 2014 (99.93%) after reaching its lowest value (78.61%) in 2011.
    • Loan Delinquency (by Term): the loans with 12 month term duation tend to perform better than loans with 36- and 60- month term duation;
    • Loan Delinquency (by ProsperRating): the proportion of Good Standing loans increases as the ProsperRating increases.
    • Loan Delinquency (by ProsperScore): the proportion of Good Standing loans increases as the ProsperRating increases, however, for loans with scores of 4 and 5, the proportion of Delinquent loans goes up, which leads us to think there might be another feature that could have a stronger correlation with loan Delinquency than the correlation between ProsperScore and loan Delinuency.

    • Loan Delinquency (by Listing Category): the category RV has the highest proportion of Good Standing Loans of (96.15%), where in Debt Consoliation, the percentage of good standing loans is (92.35%).

  2. Categorical / Numerical: in this section, we analyzed how the values of numerical variables (mainly: Interest Rate and Original Loan Amount) are distributed when broken down by specific categorical variables, as follows:

    • Borrower Rate (by Loan Delinquency): we noticed the distribution of the BorrowerRate for Good standing loans experiances more fluctuations than for Delinquent loans; we also found that the count of both Good standing and Delinquent loans peaks at BorrowRate = 0.32 or (32%).
    • Debt-To-InCome Ratio (by Loan Delinquency): DebtToIncomeRatio has a right-skewed distribution across both Delinquent and Good standing loans.
    • Interest Rate (by State): Alabama (AL) has the highest average interest rate of (20.9%) where District of Columbia (DC) has the lowest average interest rate of around (17.7%); We also found that California (CA), which has the highest loan proportion has an average interest rate of (19%).
    • Interest Rate (by Listing Category): here we noticed that Not Available has the highest average interest rate of 23.9%, where Boat has the lowest average interest rate of 17.4%.
    • Interest Rate (by Employment Status): the category Not employed has the highest average interest rate of 26%, where the category Employed has the lowest average interest rate of 19.3%.
  3. Numerical / Numerical:

    • Interest Rate vs. Debt-To-Income Ratio: Unexpectedly, the data showed a very weak or no corrolation between the two variables as someone would expect borrowers with higher DebtToIncomeRatio to be charged higher interest rates, however, after applying Pearson’s corrolation test, the returned corrolation co-efficient was (0.06).

    • Interest Rate Vs. LoanOriginalAmount: here we found a relatively negative correlation between the original loan amount and the interest rate with a corrolation co-efficient of (-0.33).

Multivariate Plots Section

In this section, I will continue to focus on interest rate, loan delinquency but I will add additional variables, such as prosperScore, loan original amount and debtToIncomeRatio and build on the analysis performed in the first two sections.

1. Average Interest Rate by Average Credit Score & Delinquency

In this section, I will analyze the relationship between the average interest rate and the average credit score (the custom variable we created) and show how this relationship may vary depending on loan status (Good Standing vs. Delinquent).

Looking at the output above, we notice that for both Good Standing and Delinquent loans, there is a negative correlation between the mean interest rate and the borrower’s average credit score, however, Good Standing loans show a steady decrease in the mean interest rate as the average credit score goes up, where in Delinquent loans, we see more fluctuations especially for borrowers with high credit scores, as we can see, as the average credit score goes above 850, the mean interest rate for Delinquent loans jumps up which may seem unusual. This interesting observation may be an area for further analysis.

2. Interest rate by Prosper Score & Loan Delinquency

As we have seen in the Bivariate Analysis (plot 5: Loan Delinquency vs. ProsperScore), there is a negative correlation between ProsperRating and loan Delinquency, however, this correlation unexpectedly changed to positive for loans with Prosper socres of 3, 4 and 5. In this section, I will analyze the relationship between Loan Delinquency, ProposerScore and a number of variables, in an attempt to find an explanation for this observation.

Looking at the output above, for loans with the scores 3, 4, and 5, the count of both good standing and delinquent loans experiences a significant spike at 0.3177 and 0.3199; we saw this observation before in the bivariate section; to take things further, let’s isolate the loans with a ProsperScore of 3, 4, or 5, and a BorrowerRate of 0.3177 and 0.3199

## [1] 4448

Looking at the output above, we notice there are 4,448 loans that fit our criteria. Now, let’s plot the correlation matrix for these loans:

## Warning in ggcorr(data = score_BR_df, method = c("pairwise", "pearson"), :
## data in column(s) 'ListingKey', 'ListingCreationDate', 'LoanStatus',
## 'ProsperRating..Alpha.', 'BorrowerState', 'Occupation', 'EmploymentStatus',
## 'IsBorrowerHomeowner', 'IncomeVerifiable', 'LoanKey', 'MemberKey',
## 'IncomeRange', 'LoanCreationYear', 'DelinquentList', 'NewListingCategory'
## are not numeric and were ignored
## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero

Looking at the matrix above, we don’t see any difference from the original correlation matrix. Therefore, more in-depth analysis is needed to find an explanation of what caused this spike in the borrowerRate variable. However, for the purpose of this project, we will limit our analysis to this extent.

3. DebtToIncomeRatio by Average Credit Score and Loan Status

In this section, I will analyze the relationship between average DebtToIncome Ratio and Credit Score based on loan status

Looking at the plot above, we notice the type relationship between Average DebtToIncomeRation and average Credit score varies based on the credit score range. For borrowers with credit scores between 600 - 750, we see a fairly positive correlation between Debt2Income and Credit score, however, as the credit score goes above 750 we notice a negative correlation for the most part. We also notice that the mean DebtToIncome Ratio spikes above 0.45 for the credit scores between 825 and 835 and then drops signficiantly below 0.15

4. Average Loan Original Amount by Average Credit Score, Delinquency and Income Range

In this section, I will analyze the relationship between average Loan Original Amount and Credit Score and how this relationship may vary depending on the income range and loan status.

Looking at the plot grid above, while different income levels show differet behaviour, we may fairly say that for the most part, there may be a weak positive corroaltion between the average loan amount and the average credit score for both good standing and delinquent loans with few deviations, for example: in the income range 50K-75K, we noticed the average loan amount dropped significantly for borrowers with a credit score of 830 and above. On the other hand, for borrowers with an Income Range 100K+, the average loan amount spikes up especially for those with Delinquent loans.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this section, I will zoom in on two main variables: loan delinquency (loan status) and interest rate by exploring the following relationships:

  1. Average Interest Rate by Average Credit Score & Delinquency:

Here I summarized the data to find out how the average interest rate varies depending on the average credit score by loan status. This is a common aspect to look at when applying for loans or credit cards and the goal here is to see if Delinquent loans have a different behaviour from good standing loans, and as expected, there is negative correlation between credit score and interest rate. In other words, higher credit score means lower interest rate. However, for delinquent loans, I observed some fluctuations and unusual behaviour where the interest rate increased as the credit score went above 850 which. This is area to be considered for further investigation.

  1. Interest rate by Prosper Score & Loan Delinquency:

In this section, I tried to find an explanation of the observation I found in the Bivariate Analysis section where the proportion delinquent loans increased as the PropserScore increased from 3 to 5. By isolating the loans with these scores 3,4,5 and ploting the correlation matrix, I wasn’t able to identify any clear relationship between the ProsperScore and the variables, however, I found that for loans with the scores 3, 4, and 5, the count of both good standing and delinquent loans experiences a significant spike at interest rates 0.3177 and 0.3199.

  1. DebtToIncomeRatio by Average Credit Score and Loan Status

Here I found that for borrowers with credit scores between 600 - 750, we see a fairly positive correlation between Debt2Income and Credit score, however, as the credit score goes above 750 we notice a negative correlation for the most part. I also found that the mean DebtToIncome Ratio spikes above 0.45 for the credit scores between 825 and 835 and then drops signficiantly below 0.15.

  1. Average Loan Original Amount by Average Credit Score, Delinquency and Income Range

In this analysis, we found a weak positive corroaltion between the average loan amount and the average credit score for both good standing and delinquent loans with few deviations, for example: in the income range 50K-75K, we noticed the average loan amount dropped significantly for borrowers with a credit score of 830 and above. On the other hand, for borrowers with an Income Range 100K+, the average loan amount spikes up especially for those with Delinquent loans.

Were there any interesting or surprising interactions between features?

For the most part, all the observations found are consistent with our general understanding of the loan process. However, as I mentioned above, there is some unusual behaviour for the delinquent loans that may be interesting for further analysis. Please refer to the answer of the question above for more information.

Final Plots and Summary

Plot One: Borrower Rate Distribution (by Loan Status)

Description One

In the plot above, we found that the distribution of the BorrowerRate for Good standing loans experiences more fluctuations than Delinquent loans; we also notice the count of both Good standing and Delinquent loans experiances a significant peak at BorrowRate = 0.31799.

Plot Two: Interest Rate vs LoanOriginalAmount

Description Two

In this plot, we observed a negative correlation between the original loan amount and the interest rate which makes sense as larger loan amounts are more likely to receive lower interest rates than smaller loan amounts; to confirm our observation, I performed a correlation test and got (-0.33) which indicates a weak negative correlation between the two variables.

Plot Three: Average Interest Rate Vs Average Credit Score (by Loan Status)

Description Three

Looking at the output above, we notice that for both Good Standing and Delinquent loans, there is a negative correlation between the mean interest rate and the borrower’s average credit score, however, Good Standing loans show a steady decrease in the mean interest rate as the average credit score goes up, where in Delinquent loans, we see more fluctuations especially for borrowers with high credit scores, as we can see, as the average credit score goes above 850, the mean interest rate for Delinquent loans jumps up which may seem unusual.

Reflection

The analysis above aimed at exploring the Prosper Loan dataset. I focused 2 key variables: Interest Rate and Loan Delinquency. The purpose of the analysis was to provide a preliminary guidance to help borrowers and investors make informative decisions. The majority of the observations is consistent with the general understanding of the loan process. There was a number of unusual observations that were pointed out in the analysis that could be investigated further. There was no major chanllenges in performing the analysis. Some of the minor challenges faced: - missing values - duplicated values - erroneous values - Lack of detailed definitions for a number of variables.

Throughtout the analysis, I created a number of variables and functions to facilitate the analysis.

Beside the comments above, some of the areas for future improvement may include: