The dataset is the individual loan dataset provided by the P2P lending company Prosper in 2014. It contains 81 variables and 113,937 loans. Variables include each borrower's income characteristics and delinquency history, and each loan's details (e.g. amount, interest rate, term).
## [1] 113937 81
I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
I started with the variable DebtToIncomeRatio (DTI), defined as:

DTI = total monthly debt payments / gross monthly income

I computed descriptive statistics and a two-way frequency table to look at the distribution of DebtToIncomeRatio in the default and non-default groups.

The maximum of DebtToIncomeRatio is 10.01 in both groups because Prosper assigns the value 10.01 to every borrower with a DebtToIncomeRatio greater than 10. The non-default group's mean DebtToIncomeRatio (0.263) is smaller than the default group's (0.3484), while the medians are both 0.22.
## [1] "descriptive statistics for Defaulted Group"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1400 0.2200 0.3484 0.3300 10.0100 1498
## [1] "descriptive statistics for Not Defaulted Group"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.150 0.220 0.263 0.310 10.010 7056
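To make the definition and the 10.01 cap concrete, here is a minimal Python sketch (not the code used to produce the output above). The function name `debt_to_income` and the treatment of missing or non-positive income as a missing value are assumptions for illustration; only the formula and the cap at 10.01 come from the text.

```python
DTI_CAP = 10.01  # value the text says is assigned to ratios greater than 10


def debt_to_income(monthly_debt, gross_monthly_income):
    """Return the debt-to-income ratio, capped at DTI_CAP.

    Returns None when an input is missing or income is not positive
    (an assumption; the source does not say how such cases are handled).
    """
    if monthly_debt is None or gross_monthly_income is None:
        return None
    if gross_monthly_income <= 0:
        return None
    ratio = monthly_debt / gross_monthly_income
    # Ratios above 10 are all recorded as the same sentinel value.
    return DTI_CAP if ratio > 10 else ratio


# Hypothetical borrowers, for illustration only.
print(debt_to_income(500, 2000))    # ordinary ratio
print(debt_to_income(50000, 1000))  # capped ratio
```

Because of the cap, the maximum of 10.01 in both groups says only that each group contains at least one borrower with a ratio above 10.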
I grouped DebtToIncomeRatio into three categories based on an online definition of good, caution, and danger levels:

Good: <15%; Caution: 15%-20%; Danger: >20%

Source: https://www.clearpoint.org/blog/what-is-a-good-debt-to-income-ratio-anyway/
## x freq
## 1 Good 57092
## 2 Caution 18237
## 3 Danger 30049
## 4 Unknown 8554
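The grouping can be sketched in Python as follows. This is an illustration under assumptions, not the original code: the function names are hypothetical, the handling of the exact boundary values 0.15 and 0.20 is my choice, and whether the counts above were produced with exactly these cutoffs is not confirmed by the text.

```python
from collections import Counter


def dti_category(dti):
    """Map a debt-to-income ratio (as a fraction) to a category label."""
    if dti is None:
        return "Unknown"   # missing ratios get their own category
    if dti < 0.15:
        return "Good"
    if dti <= 0.20:        # assumption: both boundaries count as Caution
        return "Caution"
    return "Danger"


def category_counts(ratios):
    """Count how many ratios fall into each category."""
    return Counter(dti_category(r) for r in ratios)


# Hypothetical ratios, for illustration only.
print(category_counts([0.05, 0.15, 0.18, 0.35, None]))
```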
As shown in the table below, the distribution across DebtToIncomeRatio categories is similar within the default and non-default groups (each row sums to 1, up to rounding).
##
## Good Caution Danger Unknown Sum
## Defaulted 0.48 0.15 0.28 0.09 1.00
## Not Default 0.50 0.16 0.26 0.07 0.99
As shown in the table below, the not-default group dominates every DebtToIncomeRatio category (each column sums to 1), and its share is nearly the same across categories, ranging from 0.82 to 0.86.
##
## Good Caution Danger Unknown
## Defaulted 0.14 0.14 0.16 0.18
## Not Default 0.86 0.86 0.84 0.82
## Sum 1.00 1.00 1.00 1.00
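The two proportion tables answer different questions: the first normalizes within each loan-status group, the second within each DebtToIncomeRatio category. A minimal Python sketch of both normalizations follows. The function names and the small table of counts are hypothetical; the underlying cross-tabulated counts are not given in the text.

```python
def row_proportions(table):
    """Divide each cell by its row total, so every row sums to 1."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: n / total for col, n in cells.items()}
    return result


def column_proportions(table):
    """Divide each cell by its column total, so every column sums to 1."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


# Made-up counts, for illustration only.
toy = {
    "Defaulted":   {"Good": 20, "Danger": 30},
    "Not Default": {"Good": 80, "Danger": 70},
}
print(row_proportions(toy))
print(column_proportions(toy))
```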
The test result shows that the DebtToIncomeRatio category and Loan Status have a statistically significant relationship.
##
## Pearson's Chi-squared test
##
## data: DebtToIncomeRatio
## X-squared = 111.38, df = 3, p-value < 2.2e-16
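For reference, the Pearson chi-squared statistic compares each observed count with the count expected under independence (row total times column total, divided by the grand total). The Python sketch below computes the statistic and its degrees of freedom. The function name and the toy counts are hypothetical, and no p-value is computed; the original cross-tabulated counts are not shown in the text.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a table of counts.

    `observed` is a list of rows, each a list of non-negative counts.
    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Made-up 2 x 2 counts, for illustration only.
print(chi_squared([[20, 30], [80, 70]]))
```

A table with two loan-status rows and four category columns has (2 - 1) × (4 - 1) = 3 degrees of freedom, which matches `df = 3` in the output above.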
Since the distribution of DebtToIncomeRatio is highly positively skewed, I excluded the outliers, i.e. the values that fall outside the interquartile-range bounds, before further visualization. As shown in the interleaved histograms and density plots, the non-default group has a positively skewed distribution and a much lower mean. In the density plot, both groups peak at about 17%.
The boxplot below shows the same pattern as the histogram and density plot above: the default group has the larger mean.
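The outlier exclusion used before plotting can be sketched in Python as below. The text does not state the exact rule, so this sketch assumes the common Tukey fences (1.5 × IQR beyond the quartiles), and the function name `iqr_filter` is hypothetical.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is an assumption (Tukey's rule); the source only says that
    values outside the interquartile range were excluded.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical ratios, for illustration only; 10.01 mimics the capped value.
sample = [0.10, 0.15, 0.20, 0.22, 0.25, 0.30, 0.33, 10.01]
print(iqr_filter(sample))
```

Missing values would need to be dropped before calling the function, since `statistics.quantiles` expects numeric input.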
Next, I am going to design a hypothesis test on Inquiries in the last 6 months across the two groups.
I obtained a p-value smaller than 0.05 from the F test for homogeneity of variances, so I reject the hypothesis that the two variances are equal. The equal-variance assumption of the t-test is therefore violated, and I chose the Mann-Whitney-Wilcoxon test as an alternative.
##
## F test to compare two variances
##
## data: newV_d and newV_n
## F = 4.2522, num df = 15527, denom df = 89849, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 4.151400 4.356316
## sample estimates:
## ratio of variances
## 4.252183
##
## Wilcoxon rank sum test with continuity correction
##
## data: newV_d and newV_n
## W = 699730000, p-value = 0.5411
## alternative hypothesis: true location shift is not equal to 0
At the .05 significance level, I fail to reject the null hypothesis that Inquiries in the last 6 months follows the same distribution in the default and non-default groups, since the p-value is 0.5411.
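The two test statistics reported above can be illustrated with a short Python sketch. This is not the code behind the output, and it computes only the statistics, not the p-values. The function names and sample values are hypothetical. The W statistic is computed as the number of (x, y) pairs with x greater than y, counting ties as one half, which is how the Wilcoxon rank sum statistic is commonly defined.

```python
import statistics


def variance_ratio(x, y):
    """F statistic: ratio of the two sample variances."""
    return statistics.variance(x) / statistics.variance(y)


def rank_sum_w(x, y):
    """Count pairs with x_i > y_j, counting ties as 0.5.

    This direct pairwise loop is quadratic in the sample sizes; it is
    meant to show the definition, not to scale to the full dataset.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


# Made-up inquiry counts, for illustration only.
defaulted = [1, 2, 3]
not_defaulted = [2, 4]
print(variance_ratio(defaulted, not_defaulted))
print(rank_sum_w(defaulted, not_defaulted))
```

When the two groups have similar distributions, W is close to half the number of pairs, i.e. `len(x) * len(y) / 2`.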