The dataset is individual loan data provided by the P2P lending company Prosper in 2014. It contains 81 variables and 113,937 loans. Variables include borrowers' income characteristics, borrowers' delinquency history, each loan's information (e.g. amount, interest rate, term), etc.
## [1] 113937 81
I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
This time the variable under examination is PublicRecordsLast10Years. I computed descriptive statistics and created two-way frequency tables to look at the distribution of PublicRecordsLast10Years in the default and not-default groups.
## [1] "descriptive statistics for Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   0.000   0.000   0.445   1.000  30.000     233
## [1] "descriptive statistics for Not Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.0000  0.0000  0.0000  0.2896  0.0000 38.0000     463
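The per-group summaries above can be reproduced along the following lines. This is a minimal sketch on made-up toy data, not the author's code: the data frame `loans` and the grouping column `Group` are hypothetical names, and only `PublicRecordsLast10Years` comes from the report.

```r
# Toy data standing in for the Prosper loans; the values are made up.
# `loans` and `Group` are hypothetical names used only for this sketch.
loans <- data.frame(
  PublicRecordsLast10Years = c(0, 1, 0, 3, NA, 0, 0, 1, 0, NA),
  Group = c(rep("Defaulted", 5), rep("Not Default", 5))
)

# Split the variable by group and summarise each piece.
# summary() reports missing values in a separate "NA's" entry.
stats_by_group <- lapply(
  split(loans$PublicRecordsLast10Years, loans$Group),
  summary
)

print(stats_by_group)
```

Using `split()` with `lapply()` keeps one named summary per group, so individual statistics can be pulled out by name, for example `stats_by_group[["Defaulted"]][["Mean"]]`.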
As shown in the table below, which gives the distribution of PublicRecordsLast10Years within each group (each row sums to 1), both groups are concentrated at small values: more than 90% of the values are 0 or 1.
##
##                  0    1    2    3    4    5    6    7    8    9   10   11
##   Defaulted   0.70 0.22 0.05 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##   Not Default 0.77 0.20 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
##                 12   13   14   15   16   17   20   21   22   25   30   34
##   Defaulted   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##   Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
##                 38  Sum
##   Defaulted   0.00 1.00
##   Not Default 0.00 1.00
As shown in the table below, which gives the share of each group at every value of PublicRecordsLast10Years (each column sums to 1), the not-default group dominates for values from 0 to 8, although the defaulted share rises from 0.14 at 0 records to 0.43 at 6. Since the total count of loans with PublicRecordsLast10Years over 9 across the two groups is less than 10, which is very small, the proportions there carry little meaning.
##
##                  0    1    2    3    4    5    6    7    8    9   10   11
##   Defaulted   0.14 0.16 0.25 0.33 0.33 0.34 0.43 0.33 0.29 0.73 0.50 0.43
##   Not Default 0.86 0.84 0.75 0.67 0.67 0.66 0.57 0.67 0.71 0.27 0.50 0.57
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
##                 12   13   14   15   16   17   20   21   22   25   30   34
##   Defaulted   0.25 1.00 0.50 0.67 0.20 1.00 0.00 0.00 1.00 0.00 1.00 0.00
##   Not Default 0.75 0.00 0.50 0.33 0.80 0.00 1.00 1.00 0.00 1.00 0.00 1.00
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##
##                 38
##   Defaulted   0.00
##   Not Default 1.00
##   Sum         1.00
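The two proportion tables above differ only in which margin is normalised. The sketch below shows one way to build both with base R on made-up toy data. The vectors `records` and `status` are hypothetical stand-ins for PublicRecordsLast10Years and the default indicator, and the report does not show how the author produced the tables.

```r
# Toy data; the values are made up and do not come from the Prosper dataset.
records <- c(0, 0, 0, 1, 2, 0, 0, 0, 0, 1)
status  <- c(rep("Defaulted", 5), rep("Not Default", 5))

# Two-way table of counts: one row per status, one column per record count.
counts <- table(status, records)

# Distribution of record counts within each status group.
# margin = 1 normalises each row; the added "Sum" column is 1 for every row.
within_group <- addmargins(prop.table(counts, margin = 1), margin = 2)

# Share of each status group within each record count.
# margin = 2 normalises each column; the added "Sum" row is 1 for every column.
within_value <- addmargins(prop.table(counts, margin = 2), margin = 1)

print(round(within_group, 2))
print(round(within_value, 2))
```

Reading the first table answers "how are record counts distributed among defaulted loans?", while the second answers "among loans with a given record count, what share defaulted?".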
The test result shows that PublicRecordsLast10Years and Loan Status have a statistically significant relationship (Pearson's chi-squared test, p-value < 2.2e-16).
##
## Pearson's Chi-squared test
##
## data: PublicRecordsLast10Years
## X-squared = 874.32, df = 24, p-value < 2.2e-16
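For reference, a chi-squared test of independence on a two-way table of counts can be run as sketched below. The counts are made up and the table is collapsed to three columns for brevity. The degrees of freedom equal (rows - 1) × (columns - 1), which is why the 2 × 25 table in this report gives df = 24.

```r
# Made-up counts for illustration only; not the Prosper data.
counts <- matrix(
  c(70, 22, 8,
    77, 20, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    status  = c("Defaulted", "Not Default"),
    records = c("0", "1", "2+")
  )
)

# Pearson's chi-squared test of independence between rows and columns.
test <- chisq.test(counts)
print(test)

# Degrees of freedom: (2 - 1) * (3 - 1) = 2 for this toy table.
print(test$parameter)

# Expected counts under independence; very small values here would make
# the chi-squared approximation less reliable.
print(test$expected)
```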
Since the distribution of PublicRecordsLast10Years is highly positively skewed, I excluded the outliers, i.e. values outside the interquartile range, for the visualizations that follow. As shown in the interleaved histograms and density plots, the default group has a positively skewed distribution and a higher mean. In the density plot, both groups peak at 0. However, the non-default group drops sharply after the peak, whereas the default group has a much smoother probability density. Note that the majority of PublicRecordsLast10Years values are zero, which makes the shape of the distribution less meaningful.
The boxplot below shows the same result as the above histogram and density plot. The default group has a more spread out distribution and larger mean.
The F test for homogeneity of variances returned a p-value smaller than 0.05, so I reject the hypothesis that the two variances are equal. The equal-variance assumption of the t-test is therefore violated, and I chose the Mann-Whitney-Wilcoxon test as the alternative.
##
## F test to compare two variances
##
## data: newV_d and newV_n
## F = 2.1567, num df = 16792, denom df = 96442, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.107471 2.207466
## sample estimates:
## ratio of variances
## 2.156682
##
## Wilcoxon rank sum test with continuity correction
##
## data: newV_d and newV_n
## W = 866810000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
At the 0.05 significance level, I conclude that the distributions of PublicRecordsLast10Years differ between the default and non-default groups.
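The two tests used in this step can be sketched as follows. The data are simulated Poisson counts, not Prosper loans, and the vector names `defaulted` and `not_defaulted` are hypothetical (the report's output uses `newV_d` and `newV_n`, whose construction is not shown).

```r
set.seed(1)  # reproducible toy data

# Simulated counts standing in for the two groups; the rates are arbitrary.
defaulted     <- rpois(200, lambda = 0.45)
not_defaulted <- rpois(400, lambda = 0.29)

# F test for equality of variances.
# Degrees of freedom are n - 1 for each sample.
f_test <- var.test(defaulted, not_defaulted)
print(f_test)

# Mann-Whitney-Wilcoxon rank sum test, which does not rely on the
# equal-variance assumption of the t-test.
w_test <- wilcox.test(defaulted, not_defaulted)
print(w_test)
```

With samples of this size, `wilcox.test()` uses the normal approximation with continuity correction, which matches the method name printed in the output above.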