The dataset contains individual loan data provided by the P2P lending company Prosper in 2014. It has 81 variables and 113,937 loans. Variables include the borrower’s income characteristics, the borrower’s delinquency history, each loan’s information (e.g. amount, interest rate, term), etc.
## [1] 113937 81
First, take a look at the Loan Status counts for the default and non-default groups.
## Defaulted Not Default
## 17026 96906
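Counts like the ones above can be produced with `dim()` and `table()`. The following is a minimal sketch on toy data: the data frame `loans`, the column `DefaultGroup`, and the rule that maps statuses to "Defaulted" are assumptions for illustration, since the text does not state how the two groups were derived.

```r
# Toy stand-in for the loan data; the status values and the grouping rule
# below are assumptions made for illustration only.
loans <- data.frame(
  LoanStatus = c("Completed", "Chargedoff", "Current", "Defaulted", "Current")
)

# Hypothetical rule: treat charged-off and defaulted loans as "Defaulted".
bad_statuses <- c("Chargedoff", "Defaulted")
loans$DefaultGroup <- ifelse(loans$LoanStatus %in% bad_statuses,
                             "Defaulted", "Not Default")

# Number of rows and columns, then the count of loans in each group.
print(dim(loans))
status_counts <- table(loans$DefaultGroup)
print(status_counts)
```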
I am going to explore 10 independent variables against the dependent variable (Loan Status) in the bivariate plot section.
Those ten variables are:
Credit Record Characteristic Variables: InquiriesLast6Months, CreditScoreRangeAvg, PublicRecordsLast10Years, CurrentDelinquencies, AvailableBankcardCredit
Income Characteristic Variables: StatedMonthlyIncome, EmploymentStatusDuration, DebtToIncomeRatio, ListingCategory, IncomeVerifiable
I am going to explore the relationship between each independent variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
I started with the variable InquiriesLast6Months. I first summarized it with descriptive statistics and then created a two-way frequency table to look at its distribution in the default and non-default groups.
## [1] "descriptive statistics for Defaulted Group"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 2.989 4.000 105.000 233
## [1] "descriptive statistics for Not Defaulted Group"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 1.164 2.000 63.000 463
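The table below gives the proportion of each InquiriesLast6Months value within each group (each row sums to roughly 1). The default group has a higher proportion than the non-default group at every value from 2 to 13 inquiries; only at 0 and 1 inquiries is the non-default proportion higher. Beyond 13 inquiries, both proportions round to 0.00.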
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## Defaulted 0.25 0.22 0.15 0.10 0.07 0.05 0.04 0.03 0.02 0.02 0.01 0.01
## Not Default 0.47 0.26 0.12 0.06 0.03 0.02 0.01 0.01 0.00 0.00 0.00 0.00
##
## 12 13 14 15 16 17 18 19 20 21 22 23
## Defaulted 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## 24 25 26 27 28 29 30 31 32 33 34 35
## Defaulted 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## 36 37 38 40 41 42 44 46 50 52 53 63
## Defaulted 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## 97 105 Sum
## Defaulted 0.00 0.00 0.99
## Not Default 0.00 0.00 0.98
The table below gives the share of each group at each InquiriesLast6Months value (each column sums to 1). The non-default group dominates for 0 to 8 inquiries, and the default group dominates for 9 to 27. Because the count at each value above 27 is less than 10 for the two groups combined, which is very small, the proportions there carry little meaning.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## Defaulted 0.09 0.13 0.17 0.22 0.27 0.32 0.38 0.45 0.48 0.53 0.53 0.57
## Not Default 0.91 0.87 0.83 0.78 0.73 0.68 0.62 0.55 0.52 0.47 0.47 0.43
## Sum 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## 12 13 14 15 16 17 18 19 20 21 22 23
## Defaulted 0.61 0.62 0.62 0.62 0.66 0.67 0.62 0.67 0.83 0.80 0.73 0.67
## Not Default 0.39 0.38 0.38 0.38 0.34 0.33 0.38 0.33 0.17 0.20 0.27 0.33
## Sum 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## 24 25 26 27 28 29 30 31 32 33 34 35
## Defaulted 0.81 0.79 0.71 0.62 0.88 1.00 0.75 0.50 1.00 1.00 1.00 0.75
## Not Default 0.19 0.21 0.29 0.38 0.12 0.00 0.25 0.50 0.00 0.00 0.00 0.25
## Sum 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## 36 37 38 40 41 42 44 46 50 52 53 63
## Defaulted 0.00 1.00 0.50 1.00 0.00 1.00 1.00 1.00 1.00 0.00 1.00 0.00
## Not Default 1.00 0.00 0.50 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00
## Sum 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
## 97 105
## Defaulted 1.00 1.00
## Not Default 0.00 0.00
## Sum 1.00 1.00
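The two proportion tables above differ only in the margin along which the counts are normalized. A minimal sketch with base R's `prop.table()` and `addmargins()`, using synthetic values rather than the Prosper data (the vectors `status` and `inquiries` are made up for illustration):

```r
# Synthetic data for illustration only; not the Prosper values.
status    <- c("Defaulted", "Defaulted", "Defaulted", "Not Default",
               "Not Default", "Not Default", "Not Default", "Not Default")
inquiries <- c(0, 2, 2, 0, 0, 0, 1, 2)

# Two-way table of raw counts: rows are groups, columns are inquiry counts.
counts <- table(status, inquiries)

# margin = 1: each status row sums to 1 (distribution of inquiries per group).
within_status <- prop.table(counts, margin = 1)

# margin = 2: each inquiry-count column sums to 1 (share of each group).
within_inquiries <- prop.table(counts, margin = 2)

# Append a "Sum" column or row to confirm which direction adds up to 1.
print(round(addmargins(within_status, margin = 2), 2))
print(round(addmargins(within_inquiries, margin = 1), 2))
```

The first printout answers "how are inquiries distributed inside each group?", while the second answers "at a given number of inquiries, which group makes up the larger share?".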
The test result shows that InquiriesLast6Months and Loan Status have a statistically significant relationship.
##
## Pearson's Chi-squared test
##
## data: InquiriesLast6Months
## X-squared = 8795.3, df = 49, p-value < 2.2e-16
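A test of this kind can be run by passing a two-way count table to `chisq.test()`. The sketch below uses synthetic data, and the Poisson means as well as the pooling of high inquiry counts into one top category are assumptions of this example, not steps taken in the analysis above:

```r
set.seed(1)  # synthetic data; the numbers are illustrative, not Prosper's
n <- 2000
status <- sample(c("Defaulted", "Not Default"), n, replace = TRUE,
                 prob = c(0.15, 0.85))

# Assumed generating rule: defaulted loans draw from a higher Poisson mean.
inquiries <- rpois(n, lambda = ifelse(status == "Defaulted", 3, 1))

# Pool rare high values into a top category so that no cell is too sparse.
inquiries_capped <- pmin(inquiries, 5)

# Pearson chi-squared test of independence on the two-way count table.
counts <- table(status, inquiries_capped)
result <- chisq.test(counts)
print(result)
```

The degrees of freedom equal (rows - 1) x (columns - 1), which is how 2 groups and 50 distinct inquiry values give df = 49 in the output above.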
Since the distribution of InquiriesLast6Months is highly positively skewed, I excluded the outliers, i.e. the values lying outside the interquartile range, for further visualization. As shown in the interleaved histograms and density plots, the non-default group has a positively skewed distribution and a much lower mean. In the density plot, both groups peak at 0. However, the non-default group then drops off sharply, whereas the default group has a much smoother density. Therefore, people in the non-default group are more likely to have 0, 1, or 2 inquiries in the last 6 months.
The boxplot below shows the same result as the histogram and density plot above: the default group has a more spread-out distribution and a larger mean.
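The exact outlier rule used before plotting is not spelled out. One common reading is Tukey's fences, which keep values within 1.5 times the interquartile range of the quartiles. A minimal sketch under that assumption, with a made-up vector `inquiries`:

```r
# Synthetic, right-skewed values standing in for InquiriesLast6Months.
inquiries <- c(0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 40, NA)

# Assumed rule (Tukey's fences): keep values within 1.5 * IQR of the quartiles.
q <- quantile(inquiries, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_width  <- q[[2]] - q[[1]]
fence_low  <- q[[1]] - 1.5 * iqr_width
fence_high <- q[[2]] + 1.5 * iqr_width

# Drop missing values and anything outside the fences.
keep <- !is.na(inquiries) & inquiries >= fence_low & inquiries <= fence_high
trimmed <- inquiries[keep]
print(trimmed)
```

Trimming of this kind is only for readability of the plots; the degrees of freedom in the test output below are consistent with the tests using all non-missing values.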
I am going to design a hypothesis test on inquiries in the last 6 months across the two groups in the next analysis: do the default and non-default groups differ on the InquiriesLast6Months variable?
I obtained a p-value smaller than 0.05 from the F test for homogeneity of variance, so I conclude that the two variances are not homogeneous. Therefore, the equal-variance assumption of the t-test is violated, and I chose the alternative, the “Mann-Whitney-Wilcoxon Test”.
##
## F test to compare two variances
##
## data: newV_d and newV_n
## F = 5.0667, num df = 16792, denom df = 96442, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 4.951077 5.185993
## sample estimates:
## ratio of variances
## 5.066689
##
## Wilcoxon rank sum test with continuity correction
##
## data: newV_d and newV_n
## W = 1086200000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
At the .05 significance level, I conclude that the InquiriesLast6Months values of the default and non-default groups come from nonidentical populations.
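The two-step procedure above (variance check first, then the choice of location test) can be sketched with `var.test()` and `wilcox.test()`. The vector names `newV_d` and `newV_n` follow the test output; the synthetic Poisson samples and the explicit `if` branch are assumptions of this example:

```r
set.seed(42)  # synthetic samples; only the workflow mirrors the text
newV_d <- rpois(300, lambda = 3)    # stand-in for the default group
newV_n <- rpois(1700, lambda = 1)   # stand-in for the non-default group

# Step 1: F test for equality of the two variances.
f_res <- var.test(newV_d, newV_n)

# Step 2: if the variances differ, use the rank-based Mann-Whitney-Wilcoxon
# test; otherwise a pooled-variance t-test would be an option.
if (f_res$p.value < 0.05) {
  loc_res <- wilcox.test(newV_d, newV_n)
} else {
  loc_res <- t.test(newV_d, newV_n, var.equal = TRUE)
}

print(f_res)
print(loc_res)
```

Note that the F test itself assumes normally distributed samples, so with count data as skewed as this one the rank-based test is a reasonable choice regardless of the variance check.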