The dataset is individual loan data set provided by the P2P lending company, Prosper, in 2014. The dataset contains 81 variables and 113,937 loans. Variables include loaner’s income characteristics, loaner’s delinquencies history, each loan’s information (e.g. amount, interest rate, term), etc.
## [1] 113937 81
Since the dataset has 81 variables, I decided to choose ten important one for detail analysis.I narrowed down the predictive variables by excluding, all-unique, all-same, and more than 50% missing value, these three kinds.
First, I checked the variables has a unique value for each loan. Four variables were deleted, 76 variables left.Second, I removed those variables have more than 50% missing value. Then there are 67 variables left. Third, I checked the variables has an identical value for each loan. There are no such variables. Fourth, I deleted those variables that are unavailable when a new loaner comes, except the “loan status” and its alternative “loan current day delinquent.” In the end, 38 variables left.
Since Credit Score Range upper and lower is the max and min value provided by a consumer credit rating agencies, average of this two varaibles will present a better picture of the credit score. Therefore, I created a new varaible “CreditScoreRangeAvg”.
Here are the 37 left varaibles
## [1] "CreditGrade"
## [2] "Term"
## [3] "LoanStatus"
## [4] "ClosedDate"
## [5] "ListingCategory..numeric."
## [6] "BorrowerState"
## [7] "Occupation"
## [8] "EmploymentStatus"
## [9] "EmploymentStatusDuration"
## [10] "IsBorrowerHomeowner"
## [11] "CurrentlyInGroup"
## [12] "DateCreditPulled"
## [13] "FirstRecordedCreditLine"
## [14] "CurrentCreditLines"
## [15] "OpenCreditLines"
## [16] "TotalCreditLinespast7years"
## [17] "OpenRevolvingAccounts"
## [18] "OpenRevolvingMonthlyPayment"
## [19] "InquiriesLast6Months"
## [20] "TotalInquiries"
## [21] "CurrentDelinquencies"
## [22] "AmountDelinquent"
## [23] "DelinquenciesLast7Years"
## [24] "PublicRecordsLast10Years"
## [25] "PublicRecordsLast12Months"
## [26] "RevolvingCreditBalance"
## [27] "BankcardUtilization"
## [28] "AvailableBankcardCredit"
## [29] "TotalTrades"
## [30] "TradesNeverDelinquent..percentage."
## [31] "TradesOpenedLast6Months"
## [32] "DebtToIncomeRatio"
## [33] "IncomeRange"
## [34] "IncomeVerifiable"
## [35] "StatedMonthlyIncome"
## [36] "LoanCurrentDaysDelinquent"
## [37] "CreditScoreRangeAvg"
Then, I am going to explore 10 independent variables and two dependent varaibles in the univaraite plot section.
Those ten varaibles are
Income Characteristic Variables: StatedMonthlyIncome, EmploymentStatusDuration, DebtToIncomeRatio, ListingCategory, IncomeVerifiable
Credit Record Characteristic Variables: CreditScoreRangeAvg, InquiriesLast6Months, PublicRecordsLast10Years, CurrentDelinquencies, AvailableBankcardCredit
Dependent Vraibles: Loan Status and Loan Current Days Delinquent
The first challenges here is if I want to investigae the deliquencies or default , I have to define it. Because whether past due and charge off really matters.
“In the United States, Federal regulations require creditors to charge-off installment loans after 120 days of delinquency, while revolving credit accounts must be charged-off after 180 days.”–Wikipedia
Therefore, I grouped Current/FinalPaymentInProgress/Completed into ‘Not Defaulted’; grouped all the past due within 120 days into “Past Due”; grouped Past Due (>120 days)/Chargedoff/Defaulted into “Defaulted.” I named this new variable as LoanStatus_New.
Furthermore, I created another new variable “LoanStatus_B,” which has a value of “1” if the correspondent loan status is not paid on time and a value of “0.” vice versa.
There are 11 levels of Loan Status. The Current and Completed account for the 83% of the population. However, the bars like 10.5% charged off and 4.4% defaulted are not a healthy signal for the whole portfolio.
From the chart above, we can tell the default rate for the whole Prosper portfolio is 16.7%, which is three time of the Delinquency Rates on Loans and Leases at Commercial Banks at 2014 (4.3%-5.3%) https://www.federalreserve.gov/releases/chargeoff/delallsa.htm
The first histogram shows an extreme positively skewed, meaning most of the loans are healthy with a long tail of bad loans. The second histogram shows all the delinquent loan’s past due day distribution. There are two peaks. One is near the 120 days; the other is near 2000 days. Since the loan will be hard to collect once the delinquent days is longer than 120 days and Prosper treat loans after four missed payments (i.e. 120 days) as “Charge Off.” The first peak is the reason an observable “charged off” bar in the Loan Status bar plot.
The distribution of the whole population is extreme positively skewed, due to several big outliers. Since the monthly income variable’s third quartile is $6,828 and the IQR is 3,628, any data point above $10456 is an outlier. After excluding the outliers, the distribution is normally distributed.
## -------------------------------------------------------------------------
## Stated Monthly Income distribution
##
## length n NAs unique 0s mean
## 113'932 113'932 0 13'501 1'393 5'608.157
## 100.0% 0.0% 1.2%
##
## .05 .10 .25 median .75 .90
## 1'541.667 2'108.375 3'200.708 4'666.667 6'828.417 9'750.000
##
## range sd vcoef mad IQR skew
## 1'750'002.917 7'478.629 1.334 2'594.550 3'627.708 125.005
##
## meanCI
## 5'564.731
## 5'651.583
##
## .95
## 12'250.000
##
## kurt
## 26'820.952
##
## lowest : 0.0 (1'393), 0.083 (251), 0.25, 0.833, 1.417
## highest: 416'666.667, 466'666.667, 483'333.333, 618'547.833, 1'750'002.917
## Warning: Removed 8682 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
The work duration is positively skewed too. The median working years is six years, and the mean is eight years. But some people have 63-year work experience in the population. Most of the people have less than ten-year work duration. As the box plot shown, there are several outliers with more than 20 years duration.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.167 5.583 8.006 11.420 62.920 7621
## Warning: Removed 7621 rows containing non-finite values (stat_bin).
## Warning: Removed 7621 rows containing non-finite values (stat_boxplot).
This variable is capped at 10.01. That’s the reason it has a small peak at the point 10. 272 loaners have an extremely high debt level, higher than 10. The whole population has median of 22% (debt) and a mean of 28% (debt). And most of people has a debt to income ratio less than 50%.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
## Warning: Removed 8554 rows containing non-finite values (stat_bin).
Since there are 21 kinds of reasons in the raw dataset, I decided to use the word cloud to visualize the data.
## Warning in wordcloud(loan_descriptions.corpus, max.words = 100,
## random.order = FALSE, : consolidation could not be fit on page. It will not
## be plotted.
From the word cloud visualizaiton, most common reason of the loan is to repay debt. The bar chart below presents the numeric percentage of each categories. THe top three reason is Debt Consolidation, Not Available and Others.
The bar plot shows the exact percentage of each category. THe top five reasons are Debt Consolidation, Not Available, Other, Home Improvement and Business.
There are only two values, verified - True and not verified - False for this Variable. And it is obvious that most of loaner’s income are verifiable.
I used the average of CreditScoreRangeUpper and CreditScoreRangeLower to present a balanced picture of the individual credit score. Ths histogram is normally distributed with a mean score of 695. the box plot shows there are a few low score outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 590
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 590 rows containing non-finite values (stat_count).
## Warning: Removed 590 rows containing non-finite values (stat_boxplot).
The Inquiries Last 6 Months is highly skewed, with 50% “0” value, and 25% “1” value. The median is 1 and mean is 1.4. Most of the people’s credit file is pulled less than two times in last six months. However there 372 loans have more than ten times credit pull in six months. Such high frequency can be viewed as a sign of potential “credit seeking behavior.” There are 30 loaners have more than 20 times credit pulled, which might indicate a cash urgency and difficulty of getting the loan for these loaners.
## -------------------------------------------------------------------------
## loan_data$InquiriesLast6Months (integer)
##
## length n NAs unique 0s mean meanCI
## 113'932 113'236 696 50 50'004 1.44 1.42
## 99.4% 0.6% 43.9% 1.45
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 1.00 2.00 4.00 5.00
##
## range sd vcoef mad IQR skew kurt
## 105.00 2.44 1.70 1.48 2.00 5.66 88.93
##
## lowest : 0 (50'004), 1 (28'619), 2 (14'432), 3 (7'697), 4 (4'297)
## highest: 52, 53, 63, 97, 105
Based on the definition from Prosper, the number of public records is on the borrower’s credit report. Public records include bankruptcies, liens, and judgments. 75% the loaners have no public record in last ten years. But there are almost 700 loaners with more than three public record in the population.
## -------------------------------------------------------------------------
## loan_data$PublicRecordsLast10Years (integer)
##
## length n NAs unique 0s mean meanCI
## 113'932 113'236 696 25 85'801 0.31 0.31
## 99.4% 0.6% 75.3% 0.32
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 38.00 0.73 2.33 0.00 0.00 7.96 201.65
##
## lowest : 0 (85'801), 1 (22'833), 2 (3'011), 3 (893), 4 (345)
## highest: 22, 25, 30, 34, 38
Current Delinquencies is the number of accounts of the borrower that are currently late. This includes accounts with charged-off balances. The CurrentDelinquencies is highly skewed too, with 20% zero value, a median of zero and mean of 0.9. There are more than 1000 loaners are late for payment more than ten times.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.592 0.000 83.000 696
## [1] 17247
## Warning: Removed 696 rows containing non-finite values (stat_bin).
The distribution is postitive skewed, with a median of 4100
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646300 7540
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are 113,937 loans in the data set with 81 features. I picked up ten key varibles run the univariate analysis.
Income Characteristic Variables: StatedMonthlyIncome, EmploymentStatusDuration, DebtToIncomeRatio, Occupation, IncomeVerifiable
Credit Record Characteristic Variables: CreditScoreRangeAvg, InquiriesLast6Months, PublicRecordsLast10Years, CurrentDelinquencies, AvailableBankcardCredit
Categorical variables like IncomeVerifiable is ordered factor variables with the following levels. (worst) ——> (best) IncomeVerifiable:“False”, “True”
Most numeric variables are positively skewed.
analysis? There are a lot of zero value in some of the variables. Maybe regroup them as delinquent happened and not happened as categorical variables may bring some insight in the further investigation
Further statistics testing is needed to have a more detailed picture. ### Did you create any new variables from existing variables in the dataset? I created the average of CreditScoreRangeUpper and CreditScoreRangeLower to present a balanced picture of the individual credit score.
Most of the distributions are positively skewed and have extreme outliners. I did not use the log() scale since I am going to use logistic regression model instead of the linear model for prediction.