Univariate Analysis on Prosper Loan Data by Lujing Chen

Data set introduction

The dataset is individual loan data set provided by the P2P lending company, Prosper, in 2014. The dataset contains 81 variables and 113,937 loans. Variables include loaner’s income characteristics, loaner’s delinquencies history, each loan’s information (e.g. amount, interest rate, term), etc.

## [1] 113937     81

Since the dataset has 81 variables, I decided to choose ten important one for detail analysis.I narrowed down the predictive variables by excluding, all-unique, all-same, and more than 50% missing value, these three kinds.

First, I checked the variables has a unique value for each loan. Four variables were deleted, 76 variables left.Second, I removed those variables have more than 50% missing value. Then there are 67 variables left. Third, I checked the variables has an identical value for each loan. There are no such variables. Fourth, I deleted those variables that are unavailable when a new loaner comes, except the “loan status” and its alternative “loan current day delinquent.” In the end, 38 variables left.

Since Credit Score Range upper and lower is the max and min value provided by a consumer credit rating agencies, average of this two varaibles will present a better picture of the credit score. Therefore, I created a new varaible “CreditScoreRangeAvg”.

Here are the 37 left varaibles

##  [1] "CreditGrade"                       
##  [2] "Term"                              
##  [3] "LoanStatus"                        
##  [4] "ClosedDate"                        
##  [5] "ListingCategory..numeric."         
##  [6] "BorrowerState"                     
##  [7] "Occupation"                        
##  [8] "EmploymentStatus"                  
##  [9] "EmploymentStatusDuration"          
## [10] "IsBorrowerHomeowner"               
## [11] "CurrentlyInGroup"                  
## [12] "DateCreditPulled"                  
## [13] "FirstRecordedCreditLine"           
## [14] "CurrentCreditLines"                
## [15] "OpenCreditLines"                   
## [16] "TotalCreditLinespast7years"        
## [17] "OpenRevolvingAccounts"             
## [18] "OpenRevolvingMonthlyPayment"       
## [19] "InquiriesLast6Months"              
## [20] "TotalInquiries"                    
## [21] "CurrentDelinquencies"              
## [22] "AmountDelinquent"                  
## [23] "DelinquenciesLast7Years"           
## [24] "PublicRecordsLast10Years"          
## [25] "PublicRecordsLast12Months"         
## [26] "RevolvingCreditBalance"            
## [27] "BankcardUtilization"               
## [28] "AvailableBankcardCredit"           
## [29] "TotalTrades"                       
## [30] "TradesNeverDelinquent..percentage."
## [31] "TradesOpenedLast6Months"           
## [32] "DebtToIncomeRatio"                 
## [33] "IncomeRange"                       
## [34] "IncomeVerifiable"                  
## [35] "StatedMonthlyIncome"               
## [36] "LoanCurrentDaysDelinquent"         
## [37] "CreditScoreRangeAvg"

Then, I am going to explore 10 independent variables and two dependent varaibles in the univaraite plot section.

Those ten varaibles are

Income Characteristic Variables: StatedMonthlyIncome, EmploymentStatusDuration, DebtToIncomeRatio, ListingCategory, IncomeVerifiable

Credit Record Characteristic Variables: CreditScoreRangeAvg, InquiriesLast6Months, PublicRecordsLast10Years, CurrentDelinquencies, AvailableBankcardCredit

Univariate Plots Section

Dependent Vraibles: Loan Status and Loan Current Days Delinquent

The first challenges here is if I want to investigae the deliquencies or default , I have to define it. Because whether past due and charge off really matters.

“In the United States, Federal regulations require creditors to charge-off installment loans after 120 days of delinquency, while revolving credit accounts must be charged-off after 180 days.”–Wikipedia

Therefore, I grouped Current/FinalPaymentInProgress/Completed into ‘Not Defaulted’; grouped all the past due within 120 days into “Past Due”; grouped Past Due (>120 days)/Chargedoff/Defaulted into “Defaulted.” I named this new variable as LoanStatus_New.

Furthermore, I created another new variable “LoanStatus_B,” which has a value of “1” if the correspondent loan status is not paid on time and a value of “0.” vice versa.

Loan Status (Original)

There are 11 levels of Loan Status. The Current and Completed account for the 83% of the population. However, the bars like 10.5% charged off and 4.4% defaulted are not a healthy signal for the whole portfolio.

LoanStatus (New) and (Binary)

From the chart above, we can tell the default rate for the whole Prosper portfolio is 16.7%, which is three time of the Delinquency Rates on Loans and Leases at Commercial Banks at 2014 (4.3%-5.3%) https://www.federalreserve.gov/releases/chargeoff/delallsa.htm

Loan Current Days Delinquent

The first histogram shows an extreme positively skewed, meaning most of the loans are healthy with a long tail of bad loans. The second histogram shows all the delinquent loan’s past due day distribution. There are two peaks. One is near the 120 days; the other is near 2000 days. Since the loan will be hard to collect once the delinquent days is longer than 120 days and Prosper treat loans after four missed payments (i.e. 120 days) as “Charge Off.” The first peak is the reason an observable “charged off” bar in the Loan Status bar plot.

Stated Monthly Income

The distribution of the whole population is extreme positively skewed, due to several big outliers. Since the monthly income variable’s third quartile is $6,828 and the IQR is 3,628, any data point above $10456 is an outlier. After excluding the outliers, the distribution is normally distributed.

## ------------------------------------------------------------------------- 
## Stated Monthly Income distribution
## 
##          length          n        NAs     unique         0s       mean
##         113'932    113'932          0     13'501      1'393  5'608.157
##                     100.0%       0.0%                  1.2%           
##                                                                       
##             .05        .10        .25     median        .75        .90
##       1'541.667  2'108.375  3'200.708  4'666.667  6'828.417  9'750.000
##                                                                       
##           range         sd      vcoef        mad        IQR       skew
##   1'750'002.917  7'478.629      1.334  2'594.550  3'627.708    125.005
##                                                                       
##       meanCI
##    5'564.731
##    5'651.583
##             
##          .95
##   12'250.000
##             
##         kurt
##   26'820.952
##             
## lowest : 0.0 (1'393), 0.083 (251), 0.25, 0.833, 1.417
## highest: 416'666.667, 466'666.667, 483'333.333, 618'547.833, 1'750'002.917

## Warning: Removed 8682 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

Employment Status Duration

The work duration is positively skewed too. The median working years is six years, and the mean is eight years. But some people have 63-year work experience in the population. Most of the people have less than ten-year work duration. As the box plot shown, there are several outliers with more than 20 years duration.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.167   5.583   8.006  11.420  62.920    7621

## Warning: Removed 7621 rows containing non-finite values (stat_bin).

## Warning: Removed 7621 rows containing non-finite values (stat_boxplot).

Debt To Income Ratio

This variable is capped at 10.01. That’s the reason it has a small peak at the point 10. 272 loaners have an extremely high debt level, higher than 10. The whole population has median of 22% (debt) and a mean of 28% (debt). And most of people has a debt to income ratio less than 50%.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

## Warning: Removed 8554 rows containing non-finite values (stat_bin).

Listing Catergory (Reason for the loan)

Since there are 21 kinds of reasons in the raw dataset, I decided to use the word cloud to visualize the data.

## Warning in wordcloud(loan_descriptions.corpus, max.words = 100,
## random.order = FALSE, : consolidation could not be fit on page. It will not
## be plotted.

From the word cloud visualizaiton, most common reason of the loan is to repay debt. The bar chart below presents the numeric percentage of each categories. THe top three reason is Debt Consolidation, Not Available and Others.

The bar plot shows the exact percentage of each category. THe top five reasons are Debt Consolidation, Not Available, Other, Home Improvement and Business.

IncomeVerifiable

There are only two values, verified - True and not verified - False for this Variable. And it is obvious that most of loaner’s income are verifiable.

CreditScoreRangeAvg

I used the average of CreditScoreRangeUpper and CreditScoreRangeLower to present a balanced picture of the individual credit score. Ths histogram is normally distributed with a mean score of 695. the box plot shows there are a few low score outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     590

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Warning: Removed 590 rows containing non-finite values (stat_count).

## Warning: Removed 590 rows containing non-finite values (stat_boxplot).

Inquiries in the Last 6 Months

The Inquiries Last 6 Months is highly skewed, with 50% “0” value, and 25% “1” value. The median is 1 and mean is 1.4. Most of the people’s credit file is pulled less than two times in last six months. However there 372 loans have more than ten times credit pull in six months. Such high frequency can be viewed as a sign of potential “credit seeking behavior.” There are 30 loaners have more than 20 times credit pulled, which might indicate a cash urgency and difficulty of getting the loan for these loaners.

## ------------------------------------------------------------------------- 
## loan_data$InquiriesLast6Months (integer)
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   113'932  113'236    696      50  50'004  1.44    1.42
##              99.4%   0.6%           43.9%          1.45
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    1.00    2.00  4.00    5.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##    105.00     2.44   1.70    1.48    2.00  5.66   88.93
##                                                        
## lowest : 0 (50'004), 1 (28'619), 2 (14'432), 3 (7'697), 4 (4'297)
## highest: 52, 53, 63, 97, 105

Public Records Last 10 Years

Based on the definition from Prosper, the number of public records is on the borrower’s credit report. Public records include bankruptcies, liens, and judgments. 75% the loaners have no public record in last ten years. But there are almost 700 loaners with more than three public record in the population.

## ------------------------------------------------------------------------- 
## loan_data$PublicRecordsLast10Years (integer)
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   113'932  113'236    696      25  85'801  0.31    0.31
##              99.4%   0.6%           75.3%          0.32
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    0.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##     38.00     0.73   2.33    0.00    0.00  7.96  201.65
##                                                        
## lowest : 0 (85'801), 1 (22'833), 2 (3'011), 3 (893), 4 (345)
## highest: 22, 25, 30, 34, 38

Current Delinquencies

Current Delinquencies is the number of accounts of the borrower that are currently late. This includes accounts with charged-off balances. The CurrentDelinquencies is highly skewed too, with 20% zero value, a median of zero and mean of 0.9. There are more than 1000 loaners are late for payment more than ten times.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.592   0.000  83.000     696

## [1] 17247

## Warning: Removed 696 rows containing non-finite values (stat_bin).

Available Bankcard Credit

The distribution is postitive skewed, with a median of 4100

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     880    4100   11210   13180  646300    7540

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Univariate Analysis

What is the structure of your dataset?

There are 113,937 loans in the data set with 81 features. I picked up ten key varibles run the univariate analysis.

Income Characteristic Variables: StatedMonthlyIncome, EmploymentStatusDuration, DebtToIncomeRatio, Occupation, IncomeVerifiable

Credit Record Characteristic Variables: CreditScoreRangeAvg, InquiriesLast6Months, PublicRecordsLast10Years, CurrentDelinquencies, AvailableBankcardCredit

Categorical variables like IncomeVerifiable is ordered factor variables with the following levels. (worst) ——> (best) IncomeVerifiable:“False”, “True”

What is/are the main feature(s) of interest in your dataset?

Most numeric variables are positively skewed.

What other features in the dataset do you think will help support your

analysis? There are a lot of zero value in some of the variables. Maybe regroup them as delinquent happened and not happened as categorical variables may bring some insight in the further investigation

Further statistics testing is needed to have a more detailed picture. ### Did you create any new variables from existing variables in the dataset? I created the average of CreditScoreRangeUpper and CreditScoreRangeLower to present a balanced picture of the individual credit score.

Of the features you investigated, were there any unusual distributions?

Most of the distributions are positively skewed and have extreme outliners. I did not use the log() scale since I am going to use logistic regression model instead of the linear model for prediction.