The dataset is loan-level data released in 2014 by Prosper, a peer-to-peer (P2P) lending company. It contains 113,937 loans and 81 variables. The variables cover each borrower's income characteristics and delinquency history, as well as each loan's details (e.g., amount, interest rate, term).
## [1] 113937 81
I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
This time I explore the variable AvailableBankcardCredit. I start with descriptive statistics for the defaulted and non-defaulted groups, then use two-way frequency tables to look at how AvailableBankcardCredit is distributed in each group.
## [1] "descriptive statistics for Defaulted Group"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##        0      203     1786     7128     7651   364300     3066
## [1] "descriptive statistics for Not Defaulted Group"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##        0     1067     4540    11830    14140   646300     4474
The table below gives, within each loan-status group, the share of loans in each AvailableBankcardCredit class (each row sums to 1). The non-default group is spread almost evenly across the four classes, whereas the default group's share rises as AvailableBankcardCredit falls, reaching 0.42 in the Low class. This may indicate that borrowers who defaulted were more pressed for cash.
##
##               High Upper Middle Lower Middle  Low  Sum
##   Defaulted   0.14         0.19         0.25 0.42 1.00
##   Not Default 0.24         0.24         0.28 0.24 1.00
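The table below gives, within each AvailableBankcardCredit class, the share of defaulted and non-defaulted loans (each column sums to 1). Non-defaulted loans dominate every class, but their share slips from 0.92 in the High class to 0.88 in the Lower Middle class and then drops to 0.79 in the Low class. Correspondingly, the default share climbs from 0.08 to 0.12 and then jumps to 0.21 in the Low class.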
##
##               High Upper Middle Lower Middle  Low
##   Defaulted   0.08         0.11         0.12 0.21
##   Not Default 0.92         0.89         0.88 0.79
##   Sum         1.00         1.00         1.00 1.00
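The two tables above normalize the same counts in two different directions, which is easy to mix up. The following is a minimal Python sketch of that difference, not the R code that produced the output above. The counts are hypothetical round numbers chosen for illustration (they are not the Prosper counts), and the function names `within_status` and `within_class` are made up for this example.

```python
# Hypothetical counts, NOT the Prosper data.
# Rows: loan status. Columns: High, Upper Middle, Lower Middle, Low.
counts = {
    "Defaulted":   [10, 20, 30, 40],
    "Not Default": [90, 80, 70, 60],
}


def within_status(table):
    """Divide each cell by its row total, so every row sums to 1."""
    return {
        status: [cell / sum(row) for cell in row]
        for status, row in table.items()
    }


def within_class(table):
    """Divide each cell by its column total, so every column sums to 1."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        status: [cell / total for cell, total in zip(row, col_totals)]
        for status, row in table.items()
    }


by_status = within_status(counts)  # comparable to the first table's layout
by_class = within_class(counts)    # comparable to the second table's layout
```

The first normalization answers "where do defaulted loans sit across credit classes?", while the second answers "how likely is default within a given credit class?".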
The chi-squared test result shows a statistically significant association between AvailableBankcardCredit class and Loan Status (p-value < 2.2e-16).
##
## Pearson's Chi-squared test
##
## data: AvailableBankcardCreditgrop
## X-squared = 2143.5, df = 3, p-value < 2.2e-16
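To make the reported statistic less of a black box, here is a minimal Python sketch of how a Pearson chi-squared statistic is computed from a two-way table of counts. It is not the R code behind the output above, the counts are hypothetical (not the Prosper counts), and `chi_squared` is a name invented for this example. It computes only the statistic and the degrees of freedom, not the p-value.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a 2-D table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if status and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical counts: 2 loan statuses x 4 credit classes.
toy_counts = [
    [10, 20, 30, 40],  # Defaulted
    [90, 80, 70, 60],  # Not Default
]
toy_stat, toy_df = chi_squared(toy_counts)
```

For a table with 2 statuses and 4 credit classes the degrees of freedom are (2 - 1) × (4 - 1) = 3, which matches the `df = 3` in the output above.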
Since the distribution of AvailableBankcardCredit is positively skewed, I exclude outliers, identified using the interquartile range, before further visualization. As the interleaved histograms and density plots show, the non-default group has a positively skewed distribution and a higher mean.
The boxplot below agrees with the histograms and density plots above: the non-default group has a more spread-out distribution and a larger mean.
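The exact outlier rule is not spelled out here. One common interquartile-range rule drops values more than 1.5 × IQR below the first quartile or above the third quartile. The Python sketch below illustrates that rule under this assumption: the multiplier `k=1.5`, the function name `drop_iqr_outliers`, and the toy values are all illustrative, and the quartile method may differ from the one used for the figures.

```python
import statistics


def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumption."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Toy, right-skewed values (not Prosper data): one extreme value on the right.
toy_credit = [1, 2, 3, 4, 5, 6, 7, 8, 1000]
trimmed = drop_iqr_outliers(toy_credit)
```

With a positively skewed variable, a rule like this mostly trims the long right tail, which keeps histograms and boxplots readable.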
The F test for homogeneity of variances gives a p-value below 0.05, so I reject the hypothesis that the two variances are equal. The equal-variance assumption of the standard two-sample t-test is therefore violated, so I use the nonparametric Mann-Whitney-Wilcoxon test instead.
##
## F test to compare two variances
##
## data: newV_d and newV_n
## F = 0.55389, num df = 13959, denom df = 92431, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5401903 0.5680801
## sample estimates:
## ratio of variances
## 0.5538941
##
## Wilcoxon rank sum test with continuity correction
##
## data: newV_d and newV_n
## W = 482630000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
At the 0.05 significance level, I conclude that AvailableBankcardCredit in the default and non-default groups comes from nonidentical populations.
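To show what the two reported statistics measure, here is a minimal Python sketch that computes a ratio of sample variances (the F statistic) and the rank-sum W statistic in its pair-counting form. It is not the R code that produced the output above, it does not compute p-values, and the function names and toy samples are invented for this example.

```python
import statistics


def variance_ratio(x, y):
    """F statistic: sample variance of x divided by sample variance of y."""
    return statistics.variance(x) / statistics.variance(y)


def rank_sum_w(x, y):
    """W statistic: pairs (xi, yj) with xi > yj, counting ties as one half.

    This equals the sum of the ranks of x in the pooled sample minus
    n_x * (n_x + 1) / 2. The double loop is O(n_x * n_y), fine for a sketch.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


# Toy samples (not Prosper data).
toy_default = [1, 2, 3]
toy_nondefault = [2, 3, 4, 5]
toy_f = variance_ratio(toy_default, toy_nondefault)
toy_w = rank_sum_w(toy_default, toy_nondefault)
```

Read this way, the output above is easier to interpret. The degrees of freedom imply samples of 13,960 and 92,432 loans, so there are about 1.29 billion pairs, and W = 482,630,000 (as rounded in the output) is roughly 37% of them. In other words, a defaulted loan's AvailableBankcardCredit exceeds a non-defaulted loan's in only about 37% of pairs, against the 50% expected if there were no location shift. Likewise, F = 0.55 says the defaulted group's variance is a little over half that of the non-default group.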