Prosper is America’s first peer-to-peer lending marketplace, with more than 2 million members and over $2,000,000,000 in funded loans. Here we use the data available to the public (last updated on March 11th, 2014) from Prosper, which contains all the listings and loans ever created with 81 variables on each loan/listing, to do some data analysis. As a potential amateur investor, I will explore the borrower market (including the demographic segmentation and beyond) and try to let data tell some ‘behind scene’ stories about the borrowers, also the performance of Prosper in terms of the volume of listings by year and by area.
This data set contains 113,937 loans with 81 variables on each loan.
## [1] 113937 81
Here are the names of all the 81 variables:
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
Let’s take a glimpse of all the variables. Since I set stringsAsFactor=FALSE when load the data. All the varibles are either numeric or string for now. But to make more sense of the analysis, I may change the class of some variables later.
## Classes 'tbl_df', 'tbl' and 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : chr "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : chr "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
## $ CreditGrade : chr "C" "" "HR" "" ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ ClosedDate : chr "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : chr "" "A" "" "A" ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : chr "CO" "CO" "GA" "GA" ...
## $ Occupation : chr "Other" "Professional" "Other" "Skilled Labor" ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ CurrentlyInGroup : chr "True" "False" "True" "False" ...
## $ GroupKey : chr "" "" "783C3371218786870A73D20" "" ...
## $ DateCreditPulled : chr "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : chr "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : chr "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : chr "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : chr "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
## $ LoanOriginationQuarter : chr "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
## $ MemberKey : chr "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
There’s a borrower behind every loan, I’ll start with exploring some variables of the characteristics of the borrowers.
First, let’s find out how much borrowers earn annually.
The income of most borrowers ranges from $25,000 - $74,999. To my surprise, there are 621 borrowers with zero income successfully got the loan. Although it’s not clearly stated in the dataset, My guess is here ‘the income’ means the income from employment rather than from all sources. Maybe exploring more information about those borrowers would make it clearer that why they can get the loan.
When browse the website of ‘Prosper’ as a prospective lender, one feature - high ‘Prosper Rating’ (usually ‘AA’ or ‘A’) - of all listings in ‘Featured Loan Listings’ caught my eyes, which makes me curious about what’s the ‘Prosper Rating’ of borrowers in this dataset.
## ymax not defined: adjusting position using y instead
Except for NAs, the distribution of the ordinal variable has a bell-like shape. ‘C’ is the most frequent rating in our data and the highest (AA) and the lowest (HR) rating are less common comparing with other ratings in between. I was surprised to see so many missing values. But it seems that Prosper Rating only applicable for loans originated after July 2009. So taking a glimpse of when were the listings created maybe a reasonable thing to do next.
table(loan$ListingCreationDate < ymd("2009-07-01"),
dnn = "Number of Listings Created Prior to 07/01/09")
## Number of Listings Created Prior to 07/01/09
## FALSE TRUE
## 84853 29084
table(loan$ListingCreationDate < ymd("2009-07-01"),
dnn = "Percentage of Listings Created Prior to 07/01/09") /
length(loan$ListingCreationDate)
## Percentage of Listings Created Prior to 07/01/09
## FALSE TRUE
## 0.7447361 0.2552639
range(loan$ListingCreationDate)
## [1] "2005-11-09 20:44:28 UTC" "2014-03-10 12:20:53 UTC"
The number of listings prior to 2009-07-01 is 29084, which is exactly the same with the number of missing values in Prosper Rating variable. And the earliest listing was created at 2005-11-09 20:44:28 and the latest one was at 2014-03-10 12:20:53.
Another thing I found in ‘Featured Loan Listings’ on the website was all listings there were of 3-year long. Let’s see if 3-year is really the most popular choice in terms of the length of the loans.
## Frequecy Table for Term of the loans
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 113937
##
##
## | 12 | 36 | 60 |
## |-----------|-----------|-----------|
## | 1614 | 87778 | 24545 |
## | 0.014 | 0.770 | 0.215 |
## |-----------|-----------|-----------|
##
##
##
##
There’re only three options for the length of the loan - 1 year, 3 years and 5 years. Only about 1.4% of the borrowers chose 1 year and 21.5% chose 5 years. With 77% of people chose 3 years, it surely is the most popular choice of all.
I started my first job about half year ago and the idea of borrowing a loan has never come to me so far. So I’m curious about when does a person get a loan in terms of the length of their working status.
I subset the borrowers with employment status listed as “Employed”, “Full-time”, “Part-time” and “Self-employed” and get the histogram of the length of employment for them. Using year as the unit of time, there’s a very clear descending trend of the number of people who borrow loans with the increase of the length of employment. My guess is the longer one works, the more one earns and therefore more savings. With more saving in the bank, people are less likely to borrow loans.
Next, let’s take a look at the distribution of the percentage of a consumer’s monthly gross income that goes toward paying debts.
summary(loan$DebtToIncomeRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
I see a huge gap between the 3rd quartile and the maximum and a simple calculation tells me they are 31 times away! Let’s draw the histogram to get a better understanding of DIT(Debt to Income) Ratio.
ggplot(aes(x=DebtToIncomeRatio), data=loan) +
geom_histogram(fill='#05DBF2', color='black') +
labs(title=expression(paste("Debt to Income Ratio")),
x='Debt / Income')
quantile(loan$DebtToIncomeRatio, 0.99, na.rm=T)
## 99%
## 0.86
ggplot(aes(x=DebtToIncomeRatio),
data=subset(loan,DebtToIncomeRatio<=1)) +
geom_histogram(binwidth=0.01,
fill='#05DBF2', color='black') +
labs(title=expression(
paste("Debt to Income Ratio (for ",
DTI <= 1,")")),
x='Debt / Income')
After limiting the ratio to less or equal to 1, we include more than \(99\%\) of the data but get a much finer graphic. The distribution has a bell shape and is right-skewed. And if we call the function quantile(loan$DebtToIncomeRatio, 0.95, na.rm=T), it tells us \(95\%\) of the borrowers keep their DTI \(\leq 1\). Of course, besides paying the debt people get life to live and it needs money.
Let’s see what people borrow money for.
## ymax not defined: adjusting position using y instead
Majority of people get the personal loans for ‘debt consolidation’. ‘Home improvement’ and ‘Business’ are also important purpose for the loans. Another thing I found very interesting is that about the same amount of people got loans for ‘student use’, ‘Vacation’ and ‘Wedding’. It seems like that education, preparation for getting married and getaway from home are all equally worth investing.
I want to know if one needs recommendations to successfully receive the loan.
## If the borrower has one or
## more recommendations
## FALSE TRUE
## 109678 4259
The majority of borrowers didn’t have a recommendation at all.
Prosper is a peer-to-peer lending business. After learning a few things about the borrowers, I decide to explore some lenders related variables to see how lenders benefit from investing in loans.
summary(loan$LenderYield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
ggplot(aes(x=LenderYield), data=loan) +
geom_histogram(binwidth=0.005,
fill='#05DBF2', color='black') +
scale_x_continuous(breaks=seq(0, 0.5, 0.05)) +
labs(title="The Lender Yield on the Loan")
There are 22 cases that the lenders didn’t get any profit and even lost some of their own money. But 22 out of 113937, the odd is really low. Most yields range from 0.05 to 0.35. Surprisingly, the highest peak in the graphic is around 0.31, which shows a really rosy picture if one considers about doing the similar investment.
How many lenders usually ‘share’ the risk the loan together?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
## Only one investor
## FALSE TRUE
## 86123 27814
About 24.4% of loans only has one investor. For the rest of listings, there are multiple investors.
With 24.4% of loans only has one investor, I want to know if the only lender is usually a friend of the borrower?
## if the single investor is
## a friend of the borrower
## 0 1
## 27732 82
No, there’re not many friend-to-friend loan going on through Prosper.
Let’s take a look at which part of the country the borrowers live.
| State | Number of borrowers |
|---|---|
| CA | 14717 |
| TX | 6842 |
| NY | 6729 |
| FL | 6720 |
| IL | 5921 |
| 5515 | |
| GA | 5008 |
| OH | 4197 |
| MI | 3593 |
| VA | 3278 |
| NJ | 3097 |
| NC | 3084 |
| WA | 3048 |
| PA | 2972 |
| MD | 2821 |
| MO | 2615 |
| MN | 2318 |
| MA | 2242 |
| CO | 2210 |
| IN | 2078 |
| AZ | 1901 |
| WI | 1842 |
| OR | 1817 |
| TN | 1737 |
| AL | 1679 |
| CT | 1627 |
| SC | 1122 |
| NV | 1090 |
| KS | 1062 |
| KY | 983 |
| OK | 971 |
| LA | 954 |
| UT | 877 |
| AR | 855 |
| MS | 787 |
| NE | 674 |
| ID | 599 |
| NH | 551 |
| NM | 472 |
| RI | 435 |
| HI | 409 |
| WV | 391 |
| DC | 382 |
| MT | 330 |
| DE | 300 |
| VT | 207 |
| AK | 200 |
| SD | 189 |
| IA | 186 |
| WY | 150 |
| ME | 101 |
| ND | 52 |
The number of borrowers of different states varies a lot and California, Texas, New York and Florida are the top 4 on the list of state with big borrower market. While it’s hard to take all the states into consideration and especially to visualize the result all together, I will choose CA, TX and NY, the three states with largest amount of borrowers, as representatives for the later geographical analysis.
In the previous analysis, I find most borrowers keep their debt to income ratio (DTI) at a reasonable low level (50% of borrowers keep their DTI lower than 0.22 and 95% have their DTI 0.5 or less), however, there are some ‘dangerous players’ out there decided to use nearly all of their monthly (or even more) income to pay the loan. I want to explore some features of the borrower together with DTI to see if there are some common features within the ‘ordinary players’ and the ‘dangerous players’.
Also, Prosper Rate of the borrower is a unique and really important feature that surely of interest of both borrowers and investors. I’d like to see what’s the relationship between Prosper Rate and some other features of the listings.
To get a overview quickly, I will start with ggpairs and dig in deeper for any findings of my interest later on.
Who gives every buck they earn (or even more) to pay back the loan?
ggplot(aes(x=DebtToIncomeRatio), data=loan) +
geom_density(aes(color = IncomeRange,
fill = IncomeRange),
alpha = 0.5) +
facet_wrap(~IncomeRange, ncol=2,
scales = "free")
with(loan, by(DebtToIncomeRatio, IncomeRange,
summary))
## IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 0.160 0.295 3.328 10.010 10.010 728
## --------------------------------------------------------
## IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 621
## --------------------------------------------------------
## IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.020 0.190 0.320 0.737 0.500 10.010 913
## --------------------------------------------------------
## IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1700 0.2600 0.2789 0.3600 7.9000 2311
## --------------------------------------------------------
## IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1600 0.2300 0.2457 0.3200 10.0100 1690
## --------------------------------------------------------
## IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1400 0.2000 0.2137 0.2800 2.5500 901
## --------------------------------------------------------
## IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1200 0.1700 0.1806 0.2300 10.0100 1266
## --------------------------------------------------------
## IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.090 0.160 0.297 0.260 10.010 124
According to the information online, the borrower can choose their DIT anywhere between 0 and 10.01. But putting a large portion of (even an amount exceeds) one’s income every month to pay back the loan for 1, 3 or even 5 years continuously is extremely hard, if not a mission imporssible. But for borrowers who were not employed when they created the listings, more than a quarter of those people decided to pay the amount of money that is ten times of their monthly income for the loan (even though many of them keep the DTI less than 1 just as those from the other groups)! 90% of people with $1-24,999 annual income keep their DIT lower than 0.9, 90% from $25,000-49,999 group have DTI of 0.47 or lower, $75,000-99,999 with 0.35 or lower and $100,000+ of 0.3 and 0.4 for those income range are not displayed. When comparing other quartiles and mean of DTI throughout all the income groups with numeric label (from $1-24,999 to $100,000+), we see the similar descending trend.
Interestingly, 2.55 is the limit of DTI for the borrowers with income of $75,000-99,999, which is so much lower than that of any other group. Maybe we can conclude that borrowers with this range of income are more stable and less likely to risk, at least financially. Besides $75,000-99,999 group, $25,000-49,999 group is the only other group didn’t push their DTI limit to 10.01, but still as high as 7.9. For all the other groups, at least some people within each group decided to borrow the amount of money that requires them to give back about their 10-month income for monthly payment! It’s surprising to see there is no data for the zero income group at first, but after a second thought I understand that technically everyone in this group have DTI as infinite! So Prosper put all their DTI as null.
Does Higher Risk Really Mean Higher Return?
ggplot(aes(x=ProsperRating..Alpha., y=LenderYield),
data=loan) +
geom_boxplot(aes(fill = ProsperRating..Alpha.)) +
guides(fill = guide_legend(title = "ProsperRating"))
with(loan, by(LenderYield,
ProsperRating..Alpha.,
summary))
## ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03000 0.05990 0.06790 0.06911 0.07450 0.20000
## --------------------------------------------------------
## ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0398 0.0890 0.1019 0.1029 0.1139 0.2050
## --------------------------------------------------------
## ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0593 0.1314 0.1409 0.1444 0.1539 0.3400
## --------------------------------------------------------
## ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0795 0.1665 0.1814 0.1844 0.1999 0.3400
## --------------------------------------------------------
## ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1057 0.2187 0.2392 0.2364 0.2525 0.3400
## --------------------------------------------------------
## ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1379 0.2612 0.2825 0.2833 0.3049 0.3400
## --------------------------------------------------------
## ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1679 0.3034 0.3077 0.3073 0.3077 0.3400
## --------------------------------------------------------
## ProsperRating..Alpha.: NA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1170 0.1600 0.1730 0.2224 0.4925
I create a new variable here called YearsSinceFirstCreditLine, which show the duration from the time of FirstRecordedCreditLine to the date the dataset last updated (03/01/2014) for each borrower. I want to know if person with longer history of credit record has higher ProsperRating.
For Credit Score, average age of open credit lines is a key credit-influencing factors. So I was expecting to see the similar effects of FirstRecordedCreditLine(The date the first credit line was opened) on ProsperRating. To my surprise, the length of time since the borrower opened the first credit line makes no difference on the what ProsperRating he/she can get. The conditional distributions of FirstRecordedCreditLine are almost identical one another.
If FirstRecordedCreditLine is not a key factor of ProsperRating, let’s take California, Texas and New York as examples to see if there is a geographic difference of ProsperRating among the three states.
## [1] "The Percentage of Prosper Rates in California"
##
## A AA B C D E
## 0.18371899 0.07963944 0.18539169 0.21038937 0.15045070 0.11411579
## HR
## 0.07629402
## [1] "The Percentage of Prosper Rates in Texas"
##
## A AA B C D E
## 0.18583600 0.06549521 0.18743344 0.22080227 0.15317714 0.10614129
## HR
## 0.08111466
## [1] "The Percentage of Prosper Rates in New York"
##
## A AA B C D E
## 0.17556468 0.07135524 0.18069815 0.21081451 0.16940452 0.11447639
## HR
## 0.07768652
From the stacked bar chart and the following percentage information in detail, there is no significant difference of ProsperRating of borrowers live in CA, TX and NY. But a statistical test would help us understand to difference (if there is any) better.
After changing the order of “Not employed” and “$0”, a very interesting step-wise graphic shows up. We can claim that Borrowers of higher income range tend to have better Prosper Rate, however, borrowers who were not employed when they send out the loan request have higher Prosper Rate than those who had zero income.
When we look at the proper rate vs income range separately for three states: CA, TX and NY, the only obvious difference among the three states and with the nationwise data lies in the no income ($0) group. Half of the borrowers in CA without income were of the rate HR and half D, while all of the no income borrowers in NY of rate E (the average rate of CA from the same group) and TX’s no income borrowers held a much higher rate of B. The pie charts tell us that California has more high-end borrowers (from $100,000+ group) comparing with the other two. While the state of New York has higher portion of borrowers of $1-24,999 and $25,000-49,999 income group.
Whether the business of Prosper has been growing year by year?
## ymax not defined: adjusting position using y instead
From 2006 to 2007, the number of listings created almost doubled. But the number dropped a little in the year of 2008 and decreased dramatically in the following year - 2009. My guess is that the global financial crisis was the principal culprit. After 2009, the business started to recover steadily and by the year of 2011 the business got back to the size before the global financial crisis in terms of the number of listings created per year. And the following 2012 and 2013, the business bloomed really well.
Seeing from the small multiples, the trend of the listings created each year of most states are similar to the overall trend as I explained in the last paragraph. Interestingly, it seems that the borrower markets were shut down completely after 2008 in Iowa, Maine and North Dakota in despite of the growing trend of listings created from 2006 to 2008. Also, the borrower market didn’t start until 2007 in District of Columbia, 2008 in Nevada and Rhode Island and 2009 for South Dakota.
| State | Average Monthly Income of borrowers |
|---|---|
| CT | 7394.697 |
| DC | 6822.252 |
| NJ | 6807.020 |
| MD | 6516.032 |
| VA | 6508.574 |
| NY | 6225.881 |
| HI | 6221.880 |
| MA | 6190.190 |
| CA | 6184.829 |
| TX | 6134.473 |
| AK | 5880.618 |
| NH | 5855.356 |
| DE | 5754.263 |
| OR | 5662.285 |
| CO | 5647.536 |
| IL | 5590.820 |
| NV | 5589.204 |
| NM | 5545.596 |
| RI | 5533.777 |
| GA | 5460.784 |
| NC | 5425.283 |
| PA | 5415.842 |
| FL | 5399.216 |
| WA | 5379.550 |
| AZ | 5347.032 |
| WY | 5300.478 |
| LA | 5274.877 |
| SC | 5230.936 |
| KS | 5148.678 |
| IN | 5094.876 |
| TN | 5051.212 |
| WV | 4971.739 |
| OH | 4964.927 |
| MI | 4932.191 |
| OK | 4931.019 |
| KY | 4929.388 |
| MS | 4923.627 |
| WI | 4920.912 |
| MN | 4912.896 |
| AR | 4851.891 |
| VT | 4817.739 |
| NE | 4796.570 |
| 4766.317 | |
| AL | 4759.906 |
| UT | 4747.760 |
| MO | 4710.877 |
| SD | 4323.247 |
| MT | 4235.047 |
| ID | 4182.105 |
| IA | 3714.397 |
| ME | 3635.320 |
| ND | 3280.018 |
On the list of average monthly income by state, the top 5 are Connecticut of $7394.697/month, Watshington D.C of $6822.252/month, New Jersey $6807.020/month, Maryland $6516.032/month and Virginia $6508.574/month. Interestingly, all of them are on the East Coast. But keep in mind, those are average income of the borrowers of Prosper not a census result.
Just as we have seen in the previous section, higher risk is usually associated with higher return. Also, although most borrowers keep their debt to income ratio low, but those with worse Prosper Rating are more likely to put higher debt to income ratio listing up. That’s why we see an upper triangular shape with the cluster of most points.
In the bivariate analysis section, I found that California, Texas, New York and Florida are the top 4 on the list of state with big borrower market. Meanwhile, I found a map(shown above) shows the states are open to the investors on the Prosper website. I was wondering if whether a state opens to the investors or not relates to the size of borrower market, so I put the online map into the analysis for comparison purpose. However, I didn’t see a clear relation between those two maps. The fact that California, Texas, New York and Florida – the four states with biggest borrower market – are also the top 4 states with largest population makes me thinking if the listing to population ratio tells a different story from the absolute number of the listings. So I pulled the data of estimated population of each state for July 2014 from Wikipedia and then got a new map. Comparing the map of listing to population ratio to the one shows whether a state opens to the investors, it seems that those states are open to investors usually have higher listing to population ratio than those that don’t.
In the univariate analysis part, we took a look at the length of employment for those who had a job. To further the analysis, I decided to explore why they borrow the loan for different types of employees, namely Full-time, Part-time, Self-employed and Employed, seperately.
Although there are so many differences for the four bar plots above, the first thing drew my attention was the huge difference of the data that not available. In the part-time group, if we take NA as an independent category, it is the highest of all. And NA is also the second largest in the full-time group. While in the self-employed group, NA is still the third largest category but relatively small comparing with the top two - Debt Consolidation and Business and for the employed group there are almost no NA at all! We can almost surely that the missing value (or NA values) among the four employment group are not random but informative missing! It would be an interesting research topic to explore of social science and/or psychological science. Secondly, we see Debt Consolidation are the dominant reasons of all four group regardless of those NAs. Next, business is the second biggest reason for the self-employed borrowers - borrowing money to help growing one’s own business sounds important and reasonable, and student use for the part-time workers. I remember when I was in graduate school, the income from my part-time job (TA) mostly went to school or study use. Another interesting finding is comparing to the other three groups those part-time worker & borrowers borrow money for less kinds of intentional usages – such as loan for babys or boats are totally out of their consideration.
Using ggplot, rMaps and some other packages makes R such a powerful visualization tool. While it’s easy to get a graphic of any kind, it’s not easy to get a good and informative graphic with reasonable type, scale, legend etc.
Since I have never been a borrower or a investor of a loan, it took me a lot of time just to understand the meaning of all the variables in the list of 81. But the good thing about the Prosper dataset is I was able to find detailed information for both the borrower side and the lender side from their website. With better understanding of how Prosper works, I even started to think about being an investor in the future when I have extra money to invest. I guess that is also the biggest takeaway for me for doing the project, I feel like that I’m now armed with the knowledge and confidence to explore data, even for those in some fields that I don’t quite familiar with.
To achieve the perfect looking I was after, I changed the ichoropleth function in rMaps package a little bit and defined a customized version ichoropleth.o. I was really excited that I understood the drill of the function and was able to customize it to meet my own expectation.
For further enriching the analysis, I would like a build a regression model using the available variables to predict the Prosper Rating for the future borrowers. As the models of credit score are the core of the Credit Score Companies and how big the business is for those companies, I want to explore myself using the great tool I learned in this course.
One last thing I want to mention but doesn’t really show up in this analysis part is that when I pulled the population by state data from Wikipedia, google spreadsheet helped a lot! IMPORTHTML gives people an elegant and easy way to export tables and/or list from the webpage.
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin14.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] reshape_0.8.5 GGally_0.5.0 RColorBrewer_1.1-2
## [4] lubridate_1.3.3 knitr_1.9 gmodels_2.15.4.1
## [7] dplyr_0.4.1 rMaps_0.1.1 ggplot2_1.0.0
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 codetools_0.2-10 colorspace_1.2-4
## [4] DBI_0.3.1 digest_0.6.8 evaluate_0.5.5
## [7] formatR_1.0 gdata_2.13.3 gtable_0.1.2
## [10] gtools_3.4.1 htmltools_0.2.6 labeling_0.3
## [13] lattice_0.20-29 lazyeval_0.1.10 magrittr_1.5
## [16] MASS_7.3-37 memoise_0.2.1 munsell_0.4.2
## [19] parallel_3.1.2 plyr_1.8.1 proto_0.3-10
## [22] rCharts_0.4.5 Rcpp_0.11.3 reshape2_1.4.1
## [25] RJSONIO_1.3-0 rmarkdown_0.5.0.1 scales_0.2.4
## [28] stringr_0.6.2 tools_3.1.2 whisker_0.3-2
## [31] yaml_2.1.13