Prosper Data Exploration by Anand Sharma

For the Prosper data analysis I explored only 13 of the 81 available variables. Details on them follow:

Univariate Plots Section

## [1] 90831    81

## 
##  factor integer numeric 
##      20      30      31

## [1] 90831    12

## 'data.frame':    90831 obs. of  12 variables:
##  $ CreditGrade          : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                 : int  36 36 36 36 36 60 36 36 36 60 ...
##  $ BorrowerRate         : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ Rating               : int  NA 6 NA 6 3 5 2 4 7 4 ...
##  $ Category             : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 2 ...
##  $ Occupation           : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 22 ...
##  $ IsBorrowerHomeowner  : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 1 ...
##  $ OpenRevolvingAccounts: int  1 13 0 7 6 13 6 5 12 4 ...
##  $ StatedMonthlyIncome  : num  3083 6125 2083 2875 9583 ...
##  $ Investors            : int  258 1 41 158 20 1 1 1 1 19 ...
##  $ LatePayment          : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ CreditScore          : num  650 690 490 810 690 ...

##   CreditGrade Term BorrowerRate Rating           Category    Occupation
## 1           C   36       0.1580     NA      Not Available         Other
## 2               36       0.0920      6   Home Improvement  Professional
## 3          HR   36       0.2750     NA      Not Available         Other
## 4               36       0.0974      6         Motorcycle Skilled Labor
## 5               36       0.2085      3   Home Improvement     Executive
## 6               60       0.1314      5 Debt Consolidation  Professional
##   IsBorrowerHomeowner OpenRevolvingAccounts StatedMonthlyIncome Investors
## 1                True                     1            3083.333       258
## 2               False                    13            6125.000         1
## 3               False                     0            2083.333        41
## 4                True                     7            2875.000       158
## 5                True                     6            9583.333        20
## 6                True                    13            8333.333         1
##   LatePayment CreditScore
## 1          NA       649.5
## 2          NA       689.5
## 3          NA       489.5
## 4          NA       809.5
## 5           0       689.5
## 6          NA       749.5

##   CreditGrade         Term        BorrowerRate        Rating     
##         :67003   Min.   :12.00   Min.   :0.0000   Min.   :1.000  
##  C      : 4474   1st Qu.:36.00   1st Qu.:0.1350   1st Qu.:3.000  
##  D      : 4188   Median :36.00   Median :0.1830   Median :4.000  
##  B      : 3500   Mean   :41.07   Mean   :0.1926   Mean   :4.073  
##  HR     : 3148   3rd Qu.:36.00   3rd Qu.:0.2499   3rd Qu.:5.000  
##  AA     : 2893   Max.   :60.00   Max.   :0.4975   Max.   :7.000  
##  (Other): 5625                                    NA's   :23895  
##                Category                   Occupation   
##  Debt Consolidation:47562   Other              :22931  
##  Not Available     :14398   Professional       :10368  
##  Other             : 7903   Executive          : 3400  
##  Business          : 5406   Computer Programmer: 3301  
##  Home Improvement  : 5242                      : 3190  
##  Personal Loan     : 1851   Teacher            : 3002  
##  (Other)           : 8469   (Other)            :44639  
##  IsBorrowerHomeowner OpenRevolvingAccounts StatedMonthlyIncome
##  False:45292         Min.   : 0.000        Min.   :      0    
##  True :45539         1st Qu.: 4.000        1st Qu.:   3167    
##                      Median : 6.000        Median :   4600    
##                      Mean   : 7.002        Mean   :   5587    
##                      3rd Qu.: 9.000        3rd Qu.:   6750    
##                      Max.   :51.000        Max.   :1750003    
##                                                               
##    Investors        LatePayment     CreditScore   
##  Min.   :   1.00   Min.   : 0.00   Min.   :  9.5  
##  1st Qu.:   1.00   1st Qu.: 0.00   1st Qu.:669.5  
##  Median :  39.00   Median : 0.00   Median :689.5  
##  Mean   :  78.35   Mean   : 0.04   Mean   :695.2  
##  3rd Qu.: 112.00   3rd Qu.: 0.00   3rd Qu.:729.5  
##  Max.   :1189.00   Max.   :21.00   Max.   :889.5  
##                    NA's   :82251   NA's   :510

The summaries above show the dimensions and structure of the data.
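Output of this shape typically comes from a handful of base R calls. A minimal sketch, assuming a small stand-in data frame (called loans here purely for illustration; the three columns copy the first rows printed above):

```r
# Toy stand-in for the loan data; the real data frame has 81 columns.
loans <- data.frame(
  CreditGrade  = factor(c("C", "", "HR")),
  Term         = c(36L, 36L, 60L),
  BorrowerRate = c(0.158, 0.092, 0.275)
)

# Rows and columns.
print(dim(loans))

# Number of columns of each class (factor / integer / numeric).
class_counts <- table(sapply(loans, class))
print(class_counts)

# Compact structure, first rows, and per-column summaries.
str(loans)
print(head(loans))
print(summary(loans))
```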

CreditGrade: The credit rating that was assigned at the time the listing went live. Applicable to listings from the pre-2009 period. In the summary above, 67003 records have no grade. Among graded loans, C and D are the most numerous. There are also 510 records with no credit score.

Term: The length of the loan expressed in months. Interestingly, there are no records with 24- or 48-month terms. The majority of borrowers chose the 36-month term, and the 12-month term was the least popular option.

BorrowerRate: The borrower’s interest rate for this loan. Most people borrow at between 10% and 20%, with the mean at 19.26%.

table((df$Rating))

Rating: The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009. It’s a pretty normal distribution, with the peak at 4 (15224 loans).

ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans. This one is really interesting: the biggest reason people take a loan is to repay other debt (Debt Consolidation). The second and third largest categories are not real categories in themselves, i.e. Not Available and Other.

Occupation: The occupation selected by the borrower at the time they created the listing. I chose a bar chart here because it shows the distribution across occupations quite clearly. Setting aside the catch-all Other and Professional, the occupations that appear most often in the summary above are Executive (3400) and Computer Programmer (3301). Not so surprisingly, Judges apply least often. Their applications are not even 50% of those from Students, the second least frequent occupation.

## False  True 
## 45292 45539

IsBorrowerHomeOwner: A borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. The split is nearly even between those who own a home (45539) and those who don’t (45292). It will be interesting to observe the relationship of this variable with income and rating.

OpenRevolvingAccounts: Number of open revolving accounts at the time the credit profile was pulled. The median borrower has 6 open revolving accounts (the mean is about 7) from which they can borrow. This is a very important attribute that can affect people’s ability to repay a loan.

StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created. The pattern is as expected. I removed the top and bottom 5% of records so that they would not skew the distribution. Some records had an income of 0, and these were removed as well. After trimming, the average income is about 5000 USD.
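The trimming step can be sketched with quantile(). This is an illustration under assumptions rather than the code used for the plot: the income vector is simulated (log-normal, plus the zero and 1.75 million extremes seen in the summary), and the names income, bounds, trimmed and log_income are hypothetical.

```r
set.seed(1)

# Simulated stand-in for StatedMonthlyIncome: right-skewed, with zeros and one extreme value.
income <- c(0, 0, rlnorm(1000, meanlog = log(4600), sdlog = 0.6), 1750003)

# 5th and 95th percentiles of the raw values.
bounds <- quantile(income, probs = c(0.05, 0.95), na.rm = TRUE)

# Keep strictly positive incomes that fall inside the two cut-offs.
trimmed <- income[income > 0 & income >= bounds[1] & income <= bounds[2]]

# log10 scale, as used later for the density plots.
log_income <- log10(trimmed)

print(summary(trimmed))
```

Dropping zeros before taking log10 matters because log10(0) is -Inf, which would break a histogram or density estimate.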

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00   39.00   78.35  112.00 1189.00

Investors: The number of investors that funded the loan. The mean number of investors per loan is about 78, but the distribution is right-skewed: the median is 39 and the maximum is 1189.

Univariate Analysis

What is the structure of your dataset?

The loan data has 81 variables and 90831 observations, with 20 factor, 30 integer and 31 numeric variables.

What is/are the main feature(s) of interest in your dataset?

ProsperRating is the main feature of interest in the dataset.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

CreditScore, ListingCategory, OpenRevolvingAccounts and StatedMonthlyIncome are some of the important features that can provide real insight.

Did you create any new variables from existing variables in the dataset?

Yes, I took the average of CreditScoreRangeLower and CreditScoreRangeUpper to create the CreditScore variable.
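A minimal sketch of that derivation, assuming a toy data frame (named loans here for illustration) whose range bounds are chosen so that the midpoints match the first CreditScore values printed earlier:

```r
# Toy stand-in for the two source columns; the bounds are illustrative values.
loans <- data.frame(
  CreditScoreRangeLower = c(640, 680, 480, NA),
  CreditScoreRangeUpper = c(659, 699, 499, NA)
)

# Midpoint of the reported range; an NA in either bound propagates to the result.
loans$CreditScore <- (loans$CreditScoreRangeLower + loans$CreditScoreRangeUpper) / 2

print(loans$CreditScore)  # 649.5 689.5 489.5 NA
```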

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, there were a couple of them. For example, most loans had a 36-month term, and there was not a single loan with a 24- or 48-month term. The second is the reason people take a loan, which can be read from ListingCategory: the biggest reason is to repay other debt (Debt Consolidation), and the second and third largest categories are not real categories in themselves, i.e. Not Available and Other. Then, in Occupation, Executives and Computer Programmers are the most frequent applicants once Other and Professional are set aside, whereas Judges apply least often. Judges are not even 50% of Students, the second least frequent occupation.

I renamed two variables (ProsperRating and ListingCategory) to make the names more concise and meaningful. I also created factor levels for the ListingCategory variable to make it easier to understand and manipulate.
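One way this tidying could look in base R is sketched below. The original column names ProsperRating..numeric. and ListingCategory..numeric. are an assumption about how the raw file was read in, and loans is a toy data frame. The labels follow the code list given for ListingCategory earlier.

```r
# Toy rows; the raw column names are assumed, not confirmed by the report.
loans <- data.frame(
  ProsperRating..numeric.   = c(6, 3, NA, 5),
  ListingCategory..numeric. = c(2, 16, 0, 1)
)

# Shorter, more meaningful names.
names(loans)[names(loans) == "ProsperRating..numeric."]   <- "Rating"
names(loans)[names(loans) == "ListingCategory..numeric."] <- "Category"

# Labels for codes 0 to 20, in code order.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans", "Household Expenses",
  "Large Purchases", "Medical/Dental", "Motorcycle", "RV", "Taxes",
  "Vacation", "Wedding Loans"
)

# Map the numeric codes to labelled factor levels.
loans$Category <- factor(loans$Category, levels = 0:20, labels = category_labels)

print(loans)
```

Passing levels = 0:20 explicitly keeps all 21 categories as levels even when some do not occur in the data.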

Bivariate Plots Section

For the Category vs Rating distribution I chose a boxplot, as it reveals the distribution both across categories and within each category. Boat has the highest rating, as it is a premium segment: these loans are usually taken by very wealthy borrowers, who rarely falter or become delinquent. In contrast, Auto gets the lowest rating on average.

with(df, cor.test(Rating, CreditScore))

There is a positive correlation of 53.22% between Rating and CreditScore, but it is not a strong one. This surprised me because I was expecting a strong positive relationship between them.
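Because Rating contains many NA values, it helps to know that cor.test() silently drops incomplete pairs before computing the estimate. A small sketch on toy data (loans is a stand-in; the first rows copy the values printed earlier and the rest are made up for illustration):

```r
# Toy data: two of the ten ratings are missing.
loans <- data.frame(
  Rating      = c(NA, 6, NA, 6, 3, 5, 2, 4, 7, 4),
  CreditScore = c(649.5, 689.5, 489.5, 809.5, 689.5,
                  749.5, 689.5, 709.5, 829.5, 709.5)
)

# Pearson correlation test; only the 8 complete pairs are used.
result <- with(loans, cor.test(Rating, CreditScore))

print(result$estimate)   # sample correlation
print(result$parameter)  # degrees of freedom = complete pairs - 2
```

The degrees of freedom reported by the test are therefore a quick way to see how many rows actually entered the calculation.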

This distribution shows that homeowners get better ratings than those who don’t own a home. Interestingly, the mean rating for both groups remains at 4.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and Rating
## t = -830.36, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9553976 -0.9540569
## sample estimates:
##        cor 
## -0.9547321

with(df, cor.test(BorrowerRate, Rating))

Rating shows a negative correlation with Rate. As Rating increases, Rate goes down consistently. The correlation is -0.9547, so Rate explains about 91% of the variance in Rating (the square of the correlation, roughly 0.91).
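The share of variance explained by a linear fit is the square of the Pearson correlation, not the correlation itself. A quick check using the estimate printed by cor.test above; the simulated x and y are illustrative assumptions, not Prosper data:

```r
# Estimate reported by cor.test(BorrowerRate, Rating).
r <- -0.9547321
r_squared <- r^2
print(round(r_squared, 4))  # 0.9115

# Sanity check on simulated data: cor()^2 equals R-squared from a simple linear model.
set.seed(42)
x <- runif(200, 0.05, 0.35)
y <- 8 - 20 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)
print(all.equal(cor(x, y)^2, summary(fit)$r.squared))
```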

Of course, people with more disposable income have higher ratings. An interesting observation here is the spread of income across rating levels. For the lowest rating (1) income is most widely spread, whereas for the highest rating (7) it is concentrated.

We removed Other and Professional from Occupation because too many people chose those categories and they skewed the entire distribution. With them included, the plot gave no insight into which specific occupations are more likely to make late payments, or how late. Interestingly, Analysts (36 months) are the ones who fall behind on their EMI for the longest, followed by Teachers (30 months).
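A hedged sketch of that filtering step, using a toy data frame (loans and specific are illustrative names, and the LatePayment values are made up):

```r
# Toy data with the two catch-all occupations and some missing late-payment counts.
loans <- data.frame(
  Occupation  = factor(c("Other", "Analyst", "Professional", "Teacher", "Analyst")),
  LatePayment = c(NA, 3, 0, 2, NA)
)

# Drop the catch-all occupations and rows without a late-payment value.
specific <- subset(
  loans,
  !(Occupation %in% c("Other", "Professional")) & !is.na(LatePayment)
)

# Remove the now-empty factor levels so they do not appear on the plot axis.
specific$Occupation <- droplevels(specific$Occupation)

print(table(specific$Occupation))
```

Without droplevels(), the filtered-out occupations would still show up as empty categories in tables and plots.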

## df$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2611    3859    4525    5658  394400 
## -------------------------------------------------------- 
## df$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4000    5583    6643    7917 1750000

Homeowners clearly earn more, which may be why they have better Prosper Ratings than those who don’t own a home. The difference in mean income is $2118.
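Grouped summaries like the one above can be produced with by(), and the gap in means with tapply(). The sketch below uses a toy data frame (loans, with made-up incomes), then checks the reported gap from the two printed means:

```r
# Toy data; incomes are illustrative only.
loans <- data.frame(
  IsBorrowerHomeowner = factor(c("False", "True", "True", "False", "True")),
  StatedMonthlyIncome = c(3000, 6000, 7000, 4000, 5000)
)

# Per-group five-number summary plus mean.
print(by(loans$StatedMonthlyIncome, loans$IsBorrowerHomeowner, summary))

# Difference in group means.
group_means <- tapply(loans$StatedMonthlyIncome, loans$IsBorrowerHomeowner, mean)
gap <- unname(group_means["True"] - group_means["False"])
print(gap)  # 2500 for the toy data

# Gap implied by the means printed in the report.
print(6643 - 4525)  # 2118
```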

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Almost none of them has a strong relationship with the Prosper rating. The interest rate, however, shows a strong relationship with it. With a correlation of -0.9547, the borrower’s interest rate explains about 91% of the variance in Prosper rating.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. In the Occupation and LatePayment relationship I found that Analysts are the most likely to miss an EMI. Also, people who own a home have higher incomes than people who don’t. The gap between their mean incomes is $2118.

What was the strongest relationship you found?

The strongest relationship is between interest rate and rating, which have a strong negative linear relationship.

Multivariate Plots Section

We want to observe the density of income across ratings. As the income data is highly skewed, I removed the top 5% and bottom 5% to get a real sense of the existing distribution. As expected, rating 1 (lowest) has the lowest peak, at 1.8, whereas rating 7 (highest) has a peak at 2.3.

Borrowers with better ratings, even when they make a late payment, usually clear it after the first month. This shows up as a high peak at 1 month for 7-rated loans, followed by an immediate fall. Lower-rated loans do not show this pattern and have a much more spread-out distribution.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Monthly income and number of late payments both affected the Rating. Monthly income has a positive correlation with Rating, although a weak one (about 0.087), and late payments have a negative correlation with Rating.

Were there any interesting or surprising interactions between features?

Yes. As expected, the highest peak in the number of investors was for the highest rating (7), but interestingly the second-highest peak was for the worst rating (1).

Final Plots and Summary

Plot One

Description One

I chose a histogram since income is a numerical variable and a histogram shows its distribution. Salary income is expected to be highly right-skewed, since fewer and fewer people have higher salaries. After excluding the outliers and applying a log10 scale to the income, we get a tidier, roughly normal distribution.

Since the raw data is right-skewed, I use quantiles to describe it. We see that the minimum monthly salary is 0, which is unexpected. Income is stated manually by the borrower, so some users probably preferred not to fill it in and it defaulted to zero. The other thing is that while most users in the interquartile range earn in the thousands, there is an income as high as 1.75 million attached to a loan of 4000 dollars. A person with this income would hardly need such a small loan, and since this is human input, there’s a high chance this person is not being honest. Because of this distribution and these outliers, I chose a log10 scale to bring it closer to a normal distribution.

Plot Two

Description Two

Looking at the statistics, Prosper Rating is strongly correlated with the borrower’s interest rate, with a negative linear relationship. This is the highest correlation among all the features. It suggests that as the interest rate decreases, the Prosper Rating gets a higher grade. With a correlation of -0.95 (an R^2 of about 0.91), about 91% of the variance in Prosper rating can be explained by the borrower’s interest rate in the data.

I use a box plot and split the charts by Prosper Rating. The reason is that I want to see how skewed the distribution of borrowers’ interest rates is across Prosper Ratings. Looking at the chart, the lowest Prosper rating has the highest median interest rate. It has no outliers above Q3. Meanwhile the highest rating has the smallest IQR of all the ratings. This suggests that the interest rate for the highest rating falls within only a small range. If we observe summary statistics of interest rate at highest rating,

Plot Three

Description Three

I chose a density plot to see where the distribution is centered as Prosper Rating increases. While it’s not a very distinct trend, we see that the center of the distribution shifts towards the right as the rating goes higher. We can’t see a normal-looking distribution like this unless we exclude values outside the 5% and 95% quantiles and also log10 scale the income.

## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome and Rating
## t = 22.654, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07970669 0.09474270
## sample estimates:
##        cor 
## 0.08722966

Reflection

The dataset is from Prosper loans, where people could request loans by listing them on the website and filling in all of the required fields. Among the 13 features I’m exploring, the columns specified by the borrowers are homeowner, income, and occupation. Income is expected to be right-skewed, so I applied a log10 scale to the income and also excluded the outliers. The feature of interest is Prosper Rating, that is, which features contribute to the Prosper rating system. The rating is given immediately after the loan is listed. So among all of the features, I selected the borrower’s interest rate, homeowner, listing category, borrower’s income, number of late payments, number of recommendations, the loan’s term, and the occupation of the borrower. Among all of these features, the interest rate is the most strongly correlated, with a lower interest rate corresponding to a higher Prosper rating.

Occupation and Listing Category should be features for predicting Prosper rating. But this is hard since both have many categorical levels. People with more interesting jobs could have higher ratings. But too many borrowers have Other or Professional as their occupation, which are hard to define. While listing category could also play an important role, it also has too many categorical levels.