Abstract

Prosper is America’s first peer-to-peer lending marketplace, with more than 2 million members and over $2,000,000,000 in funded loans. Here we use the data available to the public (last updated on March 11th, 2014) from Prosper, which contains all the listings and loans ever created with 81 variables on each loan/listing, to do some data analysis. As a potential amateur investor, I will explore the borrower market (including the demographic segmentation and beyond) and try to let data tell some ‘behind scene’ stories about the borrowers, also the performance of Prosper in terms of the volume of listings by year and by area.

Analysis

Overview of Dataset

Browse Data

This data set contains 113,937 loans with 81 variables on each loan.

## [1] 113937     81

Here are the names of all the 81 variables:

##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

Let’s take a glimpse of all the variables. Since I set stringsAsFactor=FALSE when load the data. All the varibles are either numeric or string for now. But to make more sense of the analysis, I may change the class of some variables later.

## Classes 'tbl_df', 'tbl' and 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : chr  "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : chr  "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
##  $ CreditGrade                        : chr  "C" "" "HR" "" ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : chr  "Completed" "Current" "Completed" "Current" ...
##  $ ClosedDate                         : chr  "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : chr  "" "A" "" "A" ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : chr  "CO" "CO" "GA" "GA" ...
##  $ Occupation                         : chr  "Other" "Professional" "Other" "Skilled Labor" ...
##  $ EmploymentStatus                   : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : chr  "True" "False" "False" "True" ...
##  $ CurrentlyInGroup                   : chr  "True" "False" "True" "False" ...
##  $ GroupKey                           : chr  "" "" "783C3371218786870A73D20" "" ...
##  $ DateCreditPulled                   : chr  "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : chr  "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable                   : chr  "True" "True" "True" "True" ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : chr  "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter             : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ MemberKey                          : chr  "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Exploratory Univariate Analysis

There’s a borrower behind every loan, I’ll start with exploring some variables of the characteristics of the borrowers.

First, let’s find out how much borrowers earn annually.

The income of most borrowers ranges from $25,000 - $74,999. To my surprise, there are 621 borrowers with zero income successfully got the loan. Although it’s not clearly stated in the dataset, My guess is here ‘the income’ means the income from employment rather than from all sources. Maybe exploring more information about those borrowers would make it clearer that why they can get the loan.


When browse the website of ‘Prosper’ as a prospective lender, one feature - high ‘Prosper Rating’ (usually ‘AA’ or ‘A’) - of all listings in ‘Featured Loan Listings’ caught my eyes, which makes me curious about what’s the ‘Prosper Rating’ of borrowers in this dataset.

## ymax not defined: adjusting position using y instead

Except for NAs, the distribution of the ordinal variable has a bell-like shape. ‘C’ is the most frequent rating in our data and the highest (AA) and the lowest (HR) rating are less common comparing with other ratings in between. I was surprised to see so many missing values. But it seems that Prosper Rating only applicable for loans originated after July 2009. So taking a glimpse of when were the listings created maybe a reasonable thing to do next.

table(loan$ListingCreationDate < ymd("2009-07-01"),
      dnn = "Number of Listings Created Prior to 07/01/09")
## Number of Listings Created Prior to 07/01/09
## FALSE  TRUE 
## 84853 29084
table(loan$ListingCreationDate < ymd("2009-07-01"),
      dnn = "Percentage of Listings Created Prior to 07/01/09") /
    length(loan$ListingCreationDate)
## Percentage of Listings Created Prior to 07/01/09
##     FALSE      TRUE 
## 0.7447361 0.2552639
range(loan$ListingCreationDate)
## [1] "2005-11-09 20:44:28 UTC" "2014-03-10 12:20:53 UTC"

The number of listings prior to 2009-07-01 is 29084, which is exactly the same with the number of missing values in Prosper Rating variable. And the earliest listing was created at 2005-11-09 20:44:28 and the latest one was at 2014-03-10 12:20:53.


Another thing I found in ‘Featured Loan Listings’ on the website was all listings there were of 3-year long. Let’s see if 3-year is really the most popular choice in terms of the length of the loans.

## Frequecy Table for Term of the loans
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113937 
## 
##  
##           |        12 |        36 |        60 | 
##           |-----------|-----------|-----------|
##           |      1614 |     87778 |     24545 | 
##           |     0.014 |     0.770 |     0.215 | 
##           |-----------|-----------|-----------|
## 
## 
## 
## 

There’re only three options for the length of the loan - 1 year, 3 years and 5 years. Only about 1.4% of the borrowers chose 1 year and 21.5% chose 5 years. With 77% of people chose 3 years, it surely is the most popular choice of all.


I started my first job about half year ago and the idea of borrowing a loan has never come to me so far. So I’m curious about when does a person get a loan in terms of the length of their working status.

I subset the borrowers with employment status listed as “Employed”, “Full-time”, “Part-time” and “Self-employed” and get the histogram of the length of employment for them. Using year as the unit of time, there’s a very clear descending trend of the number of people who borrow loans with the increase of the length of employment. My guess is the longer one works, the more one earns and therefore more savings. With more saving in the bank, people are less likely to borrow loans.


Next, let’s take a look at the distribution of the percentage of a consumer’s monthly gross income that goes toward paying debts.

summary(loan$DebtToIncomeRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

I see a huge gap between the 3rd quartile and the maximum and a simple calculation tells me they are 31 times away! Let’s draw the histogram to get a better understanding of DIT(Debt to Income) Ratio.

ggplot(aes(x=DebtToIncomeRatio), data=loan) + 
    geom_histogram(fill='#05DBF2', color='black') +
    labs(title=expression(paste("Debt to Income Ratio")), 
         x='Debt / Income')

quantile(loan$DebtToIncomeRatio, 0.99, na.rm=T)
##  99% 
## 0.86
ggplot(aes(x=DebtToIncomeRatio), 
       data=subset(loan,DebtToIncomeRatio<=1)) + 
    geom_histogram(binwidth=0.01, 
                   fill='#05DBF2', color='black') +
    labs(title=expression(
        paste("Debt to Income Ratio (for ",
              DTI <= 1,")")), 
         x='Debt / Income')

After limiting the ratio to less or equal to 1, we include more than \(99\%\) of the data but get a much finer graphic. The distribution has a bell shape and is right-skewed. And if we call the function quantile(loan$DebtToIncomeRatio, 0.95, na.rm=T), it tells us \(95\%\) of the borrowers keep their DTI \(\leq 1\). Of course, besides paying the debt people get life to live and it needs money.


Let’s see what people borrow money for.

## ymax not defined: adjusting position using y instead

Majority of people get the personal loans for ‘debt consolidation’. ‘Home improvement’ and ‘Business’ are also important purpose for the loans. Another thing I found very interesting is that about the same amount of people got loans for ‘student use’, ‘Vacation’ and ‘Wedding’. It seems like that education, preparation for getting married and getaway from home are all equally worth investing.


I want to know if one needs recommendations to successfully receive the loan.

## If the borrower has one or 
##       more recommendations
##  FALSE   TRUE 
## 109678   4259

The majority of borrowers didn’t have a recommendation at all.


Prosper is a peer-to-peer lending business. After learning a few things about the borrowers, I decide to explore some lenders related variables to see how lenders benefit from investing in loans.

summary(loan$LenderYield)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925
ggplot(aes(x=LenderYield), data=loan) + 
    geom_histogram(binwidth=0.005, 
                   fill='#05DBF2', color='black') +
    scale_x_continuous(breaks=seq(0, 0.5, 0.05)) + 
    labs(title="The Lender Yield on the Loan")

There are 22 cases that the lenders didn’t get any profit and even lost some of their own money. But 22 out of 113937, the odd is really low. Most yields range from 0.05 to 0.35. Surprisingly, the highest peak in the graphic is around 0.31, which shows a really rosy picture if one considers about doing the similar investment.


How many lenders usually ‘share’ the risk the loan together?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00
## Only one investor
## FALSE  TRUE 
## 86123 27814

About 24.4% of loans only has one investor. For the rest of listings, there are multiple investors.


Exploratory Bivariate Analysis

With 24.4% of loans only has one investor, I want to know if the only lender is usually a friend of the borrower?

## if the single investor is 
##            a friend of the borrower
##     0     1 
## 27732    82

No, there’re not many friend-to-friend loan going on through Prosper.


Let’s take a look at which part of the country the borrowers live.

State Number of borrowers
CA 14717
TX 6842
NY 6729
FL 6720
IL 5921
5515
GA 5008
OH 4197
MI 3593
VA 3278
NJ 3097
NC 3084
WA 3048
PA 2972
MD 2821
MO 2615
MN 2318
MA 2242
CO 2210
IN 2078
AZ 1901
WI 1842
OR 1817
TN 1737
AL 1679
CT 1627
SC 1122
NV 1090
KS 1062
KY 983
OK 971
LA 954
UT 877
AR 855
MS 787
NE 674
ID 599
NH 551
NM 472
RI 435
HI 409
WV 391
DC 382
MT 330
DE 300
VT 207
AK 200
SD 189
IA 186
WY 150
ME 101
ND 52

The number of borrowers of different states varies a lot and California, Texas, New York and Florida are the top 4 on the list of state with big borrower market. While it’s hard to take all the states into consideration and especially to visualize the result all together, I will choose CA, TX and NY, the three states with largest amount of borrowers, as representatives for the later geographical analysis.


In the previous analysis, I find most borrowers keep their debt to income ratio (DTI) at a reasonable low level (50% of borrowers keep their DTI lower than 0.22 and 95% have their DTI 0.5 or less), however, there are some ‘dangerous players’ out there decided to use nearly all of their monthly (or even more) income to pay the loan. I want to explore some features of the borrower together with DTI to see if there are some common features within the ‘ordinary players’ and the ‘dangerous players’.
Also, Prosper Rate of the borrower is a unique and really important feature that surely of interest of both borrowers and investors. I’d like to see what’s the relationship between Prosper Rate and some other features of the listings.
To get a overview quickly, I will start with ggpairs and dig in deeper for any findings of my interest later on.


Who gives every buck they earn (or even more) to pay back the loan?

ggplot(aes(x=DebtToIncomeRatio), data=loan) +
    geom_density(aes(color = IncomeRange, 
                     fill  = IncomeRange), 
                 alpha = 0.5) +
    facet_wrap(~IncomeRange, ncol=2, 
               scales = "free")

with(loan, by(DebtToIncomeRatio, IncomeRange, 
              summary))
## IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   0.160   0.295   3.328  10.010  10.010     728 
## -------------------------------------------------------- 
## IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     621 
## -------------------------------------------------------- 
## IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.190   0.320   0.737   0.500  10.010     913 
## -------------------------------------------------------- 
## IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1700  0.2600  0.2789  0.3600  7.9000    2311 
## -------------------------------------------------------- 
## IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1600  0.2300  0.2457  0.3200 10.0100    1690 
## -------------------------------------------------------- 
## IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1400  0.2000  0.2137  0.2800  2.5500     901 
## -------------------------------------------------------- 
## IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1200  0.1700  0.1806  0.2300 10.0100    1266 
## -------------------------------------------------------- 
## IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.090   0.160   0.297   0.260  10.010     124

According to the information online, the borrower can choose their DIT anywhere between 0 and 10.01. But putting a large portion of (even an amount exceeds) one’s income every month to pay back the loan for 1, 3 or even 5 years continuously is extremely hard, if not a mission imporssible. But for borrowers who were not employed when they created the listings, more than a quarter of those people decided to pay the amount of money that is ten times of their monthly income for the loan (even though many of them keep the DTI less than 1 just as those from the other groups)! 90% of people with $1-24,999 annual income keep their DIT lower than 0.9, 90% from $25,000-49,999 group have DTI of 0.47 or lower, $75,000-99,999 with 0.35 or lower and $100,000+ of 0.3 and 0.4 for those income range are not displayed. When comparing other quartiles and mean of DTI throughout all the income groups with numeric label (from $1-24,999 to $100,000+), we see the similar descending trend.
Interestingly, 2.55 is the limit of DTI for the borrowers with income of $75,000-99,999, which is so much lower than that of any other group. Maybe we can conclude that borrowers with this range of income are more stable and less likely to risk, at least financially. Besides $75,000-99,999 group, $25,000-49,999 group is the only other group didn’t push their DTI limit to 10.01, but still as high as 7.9. For all the other groups, at least some people within each group decided to borrow the amount of money that requires them to give back about their 10-month income for monthly payment! It’s surprising to see there is no data for the zero income group at first, but after a second thought I understand that technically everyone in this group have DTI as infinite! So Prosper put all their DTI as null.

Does Higher Risk Really Mean Higher Return?

ggplot(aes(x=ProsperRating..Alpha., y=LenderYield), 
       data=loan) +
    geom_boxplot(aes(fill = ProsperRating..Alpha.)) +
    guides(fill = guide_legend(title = "ProsperRating"))

with(loan, by(LenderYield, 
              ProsperRating..Alpha., 
              summary))
## ProsperRating..Alpha.: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03000 0.05990 0.06790 0.06911 0.07450 0.20000 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0398  0.0890  0.1019  0.1029  0.1139  0.2050 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0593  0.1314  0.1409  0.1444  0.1539  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0795  0.1665  0.1814  0.1844  0.1999  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1057  0.2187  0.2392  0.2364  0.2525  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1379  0.2612  0.2825  0.2833  0.3049  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1679  0.3034  0.3077  0.3073  0.3077  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: NA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1170  0.1600  0.1730  0.2224  0.4925

I create a new variable here called YearsSinceFirstCreditLine, which show the duration from the time of FirstRecordedCreditLine to the date the dataset last updated (03/01/2014) for each borrower. I want to know if person with longer history of credit record has higher ProsperRating.

For Credit Score, average age of open credit lines is a key credit-influencing factors. So I was expecting to see the similar effects of FirstRecordedCreditLine(The date the first credit line was opened) on ProsperRating. To my surprise, the length of time since the borrower opened the first credit line makes no difference on the what ProsperRating he/she can get. The conditional distributions of FirstRecordedCreditLine are almost identical one another.


If FirstRecordedCreditLine is not a key factor of ProsperRating, let’s take California, Texas and New York as examples to see if there is a geographic difference of ProsperRating among the three states.

## [1] "The Percentage of Prosper Rates in California"
## 
##          A         AA          B          C          D          E 
## 0.18371899 0.07963944 0.18539169 0.21038937 0.15045070 0.11411579 
##         HR 
## 0.07629402
## [1] "The Percentage of Prosper Rates in Texas"
## 
##          A         AA          B          C          D          E 
## 0.18583600 0.06549521 0.18743344 0.22080227 0.15317714 0.10614129 
##         HR 
## 0.08111466
## [1] "The Percentage of Prosper Rates in New York"
## 
##          A         AA          B          C          D          E 
## 0.17556468 0.07135524 0.18069815 0.21081451 0.16940452 0.11447639 
##         HR 
## 0.07768652

From the stacked bar chart and the following percentage information in detail, there is no significant difference of ProsperRating of borrowers live in CA, TX and NY. But a statistical test would help us understand to difference (if there is any) better.

After changing the order of “Not employed” and “$0”, a very interesting step-wise graphic shows up. We can claim that Borrowers of higher income range tend to have better Prosper Rate, however, borrowers who were not employed when they send out the loan request have higher Prosper Rate than those who had zero income.

Exploratory Multivariate Analysis

When we look at the proper rate vs income range separately for three states: CA, TX and NY, the only obvious difference among the three states and with the nationwise data lies in the no income ($0) group. Half of the borrowers in CA without income were of the rate HR and half D, while all of the no income borrowers in NY of rate E (the average rate of CA from the same group) and TX’s no income borrowers held a much higher rate of B. The pie charts tell us that California has more high-end borrowers (from $100,000+ group) comparing with the other two. While the state of New York has higher portion of borrowers of $1-24,999 and $25,000-49,999 income group.


Whether the business of Prosper has been growing year by year?

## ymax not defined: adjusting position using y instead

From 2006 to 2007, the number of listings created almost doubled. But the number dropped a little in the year of 2008 and decreased dramatically in the following year - 2009. My guess is that the global financial crisis was the principal culprit. After 2009, the business started to recover steadily and by the year of 2011 the business got back to the size before the global financial crisis in terms of the number of listings created per year. And the following 2012 and 2013, the business bloomed really well.

Seeing from the small multiples, the trend of the listings created each year of most states are similar to the overall trend as I explained in the last paragraph. Interestingly, it seems that the borrower markets were shut down completely after 2008 in Iowa, Maine and North Dakota in despite of the growing trend of listings created from 2006 to 2008. Also, the borrower market didn’t start until 2007 in District of Columbia, 2008 in Nevada and Rhode Island and 2009 for South Dakota.


State Average Monthly Income of borrowers
CT 7394.697
DC 6822.252
NJ 6807.020
MD 6516.032
VA 6508.574
NY 6225.881
HI 6221.880
MA 6190.190
CA 6184.829
TX 6134.473
AK 5880.618
NH 5855.356
DE 5754.263
OR 5662.285
CO 5647.536
IL 5590.820
NV 5589.204
NM 5545.596
RI 5533.777
GA 5460.784
NC 5425.283
PA 5415.842
FL 5399.216
WA 5379.550
AZ 5347.032
WY 5300.478
LA 5274.877
SC 5230.936
KS 5148.678
IN 5094.876
TN 5051.212
WV 4971.739
OH 4964.927
MI 4932.191
OK 4931.019
KY 4929.388
MS 4923.627
WI 4920.912
MN 4912.896
AR 4851.891
VT 4817.739
NE 4796.570
4766.317
AL 4759.906
UT 4747.760
MO 4710.877
SD 4323.247
MT 4235.047
ID 4182.105
IA 3714.397
ME 3635.320
ND 3280.018

On the list of average monthly income by state, the top 5 are Connecticut of $7394.697/month, Watshington D.C of $6822.252/month, New Jersey $6807.020/month, Maryland $6516.032/month and Virginia $6508.574/month. Interestingly, all of them are on the East Coast. But keep in mind, those are average income of the borrowers of Prosper not a census result.


Final Plots and Summary

Plot One

Description One

Just as we have seen in the previous section, higher risk is usually associated with higher return. Also, although most borrowers keep their debt to income ratio low, but those with worse Prosper Rating are more likely to put higher debt to income ratio listing up. That’s why we see an upper triangular shape with the cluster of most points.


Plot Two

The Listing to Population Ratio (per 10,000 capita) for each State

States are Open to Investors on Prosper: (from Prosper Website)

Invester_Map

Description Two

In the bivariate analysis section, I found that California, Texas, New York and Florida are the top 4 on the list of state with big borrower market. Meanwhile, I found a map(shown above) shows the states are open to the investors on the Prosper website. I was wondering if whether a state opens to the investors or not relates to the size of borrower market, so I put the online map into the analysis for comparison purpose. However, I didn’t see a clear relation between those two maps. The fact that California, Texas, New York and Florida – the four states with biggest borrower market – are also the top 4 states with largest population makes me thinking if the listing to population ratio tells a different story from the absolute number of the listings. So I pulled the data of estimated population of each state for July 2014 from Wikipedia and then got a new map. Comparing the map of listing to population ratio to the one shows whether a state opens to the investors, it seems that those states are open to investors usually have higher listing to population ratio than those that don’t.


Plot Three

Description Three

In the univariate analysis part, we took a look at the length of employment for those who had a job. To further the analysis, I decided to explore why they borrow the loan for different types of employees, namely Full-time, Part-time, Self-employed and Employed, seperately.

Although there are so many differences for the four bar plots above, the first thing drew my attention was the huge difference of the data that not available. In the part-time group, if we take NA as an independent category, it is the highest of all. And NA is also the second largest in the full-time group. While in the self-employed group, NA is still the third largest category but relatively small comparing with the top two - Debt Consolidation and Business and for the employed group there are almost no NA at all! We can almost surely that the missing value (or NA values) among the four employment group are not random but informative missing! It would be an interesting research topic to explore of social science and/or psychological science. Secondly, we see Debt Consolidation are the dominant reasons of all four group regardless of those NAs. Next, business is the second biggest reason for the self-employed borrowers - borrowing money to help growing one’s own business sounds important and reasonable, and student use for the part-time workers. I remember when I was in graduate school, the income from my part-time job (TA) mostly went to school or study use. Another interesting finding is comparing to the other three groups those part-time worker & borrowers borrow money for less kinds of intentional usages – such as loan for babys or boats are totally out of their consideration.


Reflection

Using ggplot, rMaps and some other packages makes R such a powerful visualization tool. While it’s easy to get a graphic of any kind, it’s not easy to get a good and informative graphic with reasonable type, scale, legend etc.

Since I have never been a borrower or a investor of a loan, it took me a lot of time just to understand the meaning of all the variables in the list of 81. But the good thing about the Prosper dataset is I was able to find detailed information for both the borrower side and the lender side from their website. With better understanding of how Prosper works, I even started to think about being an investor in the future when I have extra money to invest. I guess that is also the biggest takeaway for me for doing the project, I feel like that I’m now armed with the knowledge and confidence to explore data, even for those in some fields that I don’t quite familiar with.

To achieve the perfect looking I was after, I changed the ichoropleth function in rMaps package a little bit and defined a customized version ichoropleth.o. I was really excited that I understood the drill of the function and was able to customize it to meet my own expectation.

For further enriching the analysis, I would like a build a regression model using the available variables to predict the Prosper Rating for the future borrowers. As the models of credit score are the core of the Credit Score Companies and how big the business is for those companies, I want to explore myself using the great tool I learned in this course.

One last thing I want to mention but doesn’t really show up in this analysis part is that when I pulled the population by state data from Wikipedia, google spreadsheet helped a lot! IMPORTHTML gives people an elegant and easy way to export tables and/or list from the webpage.

Reproducibility

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin14.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] reshape_0.8.5      GGally_0.5.0       RColorBrewer_1.1-2
## [4] lubridate_1.3.3    knitr_1.9          gmodels_2.15.4.1  
## [7] dplyr_0.4.1        rMaps_0.1.1        ggplot2_1.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1    codetools_0.2-10  colorspace_1.2-4 
##  [4] DBI_0.3.1         digest_0.6.8      evaluate_0.5.5   
##  [7] formatR_1.0       gdata_2.13.3      gtable_0.1.2     
## [10] gtools_3.4.1      htmltools_0.2.6   labeling_0.3     
## [13] lattice_0.20-29   lazyeval_0.1.10   magrittr_1.5     
## [16] MASS_7.3-37       memoise_0.2.1     munsell_0.4.2    
## [19] parallel_3.1.2    plyr_1.8.1        proto_0.3-10     
## [22] rCharts_0.4.5     Rcpp_0.11.3       reshape2_1.4.1   
## [25] RJSONIO_1.3-0     rmarkdown_0.5.0.1 scales_0.2.4     
## [28] stringr_0.6.2     tools_3.1.2       whisker_0.3-2    
## [31] yaml_2.1.13