The dataset is an individual-loan dataset provided by the P2P lending company Prosper in 2014. It contains 81 variables and 113,937 loans. Variables include each borrower’s income characteristics, each borrower’s delinquency history, each loan’s details (e.g. amount, interest rate, term), etc.
## [1] 113937 81
I am going to explore 10 independent variables and the dependent variable (Loan Status) in the bivariate plot section.
I am going to explore the relationship between each independent variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
This time I am going to explore the variable CreditScoreRangeAvg. I created a two-way frequency table to look at the distribution of CreditScoreRangeAvg in the default and non-default groups.
The first quartile, median, mean, and third quartile of the non-default group are all higher than those of the default group.
## [1] "descriptive statistics for Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     9.5   609.5   649.5   650.3   709.5   869.5     174
## [1] "descriptive statistics for Not Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     9.5   669.5   709.5   702.9   729.5   889.5     416
I grouped the credit scores into four groups, as follows, to see whether there is any interesting pattern.
300-629: Bad credit
630-689: Fair credit, also called “average credit”
690-719: Good credit
720 and up: Excellent credit
Source: https://www.nerdwallet.com/blog/finance/credit-score-ranges-and-how-to-improve/
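The grouping rule can be sketched in a few lines. This is a minimal Python illustration, not the code used for this report: the function name `credit_group` is hypothetical, and the handling of scores below 300 (left unclassified here) is an assumption, since the text does not say how such values were treated. Because CreditScoreRangeAvg takes half-point values such as 649.5, the sketch compares against the lower bound of the next band rather than the upper bound of the current one.

```python
def credit_group(score):
    """Map an average credit score to one of the four bands listed above."""
    if score is None or score < 300:
        # Assumption: missing scores and scores below 300 stay unclassified.
        return None
    if score < 630:
        return "Bad credit"
    if score < 690:
        return "average credit"
    if score < 720:
        return "Good credit"
    return "Excellent credit"


# Example with values that appear in the summary statistics above.
for s in (609.5, 649.5, 709.5, 729.5):
    print(s, credit_group(s))
```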
As shown in the table below, where each row sums to 1, every credit group makes up a larger share of the non-default group than of the default group, except the Bad credit group, which accounts for 37% of defaulted loans but only 9% of non-defaulted loans.
##
##               Excellent credit Good credit average credit Bad credit  Sum
##   Defaulted               0.19        0.09           0.35       0.37 1.00
##   Not Default             0.36        0.14           0.41       0.09 1.00
As shown in the table below, where each column sums to 1, the share of non-defaulted loans declines slowly from 92% in the Excellent credit group to 87% in the average credit group, while the share of defaulted loans jumps from 13% in the average credit group to 43% in the Bad credit group.
##
##               Excellent credit Good credit average credit Bad credit
##   Defaulted               0.08        0.10           0.13       0.43
##   Not Default             0.92        0.90           0.87       0.57
##   Sum                     1.00        1.00           1.00       1.00
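The two tables answer different questions: normalising by row gives the mix of credit groups within each loan status, while normalising by column gives the default rate within each credit group. The following is a minimal Python sketch of both normalisations. The helper name `proportions` is hypothetical, and the counts are made up for illustration; they are not the Prosper counts.

```python
def proportions(table, by="row"):
    """Normalise a table of counts so that each row (or each column) sums to 1."""
    if by == "row":
        return [[cell / sum(row) for cell in row] for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / total for cell, total in zip(row, col_totals)]
            for row in table]


# Hypothetical counts, NOT the Prosper data.
# Rows: Defaulted, Not Default.
# Columns: Excellent, Good, average, Bad credit.
counts = [[20, 10, 30, 40],
          [180, 90, 270, 60]]

within_status = proportions(counts, by="row")     # each row sums to 1
within_group = proportions(counts, by="column")   # each column sums to 1
```

With these made-up counts, `within_group[0]` holds the default rate of each credit group, which is the quantity the second table reports.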
The test result shows that CreditScoreRangeAvg and Loan Status have a statistically significant relationship.
##
## Pearson's Chi-squared test
##
## data: creditgrop
## X-squared = 11019, df = 3, p-value < 2.2e-16
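For readers who want to see what the statistic measures, the sketch below computes Pearson's chi-squared statistic and its degrees of freedom for a contingency table by hand. It is an illustration in Python rather than the code behind the output above; the function name `chi_squared` and the counts are hypothetical. Instead of computing a p-value, it compares the statistic with 7.815, the standard critical value of the chi-squared distribution with 3 degrees of freedom at the 0.05 level.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if loan status and credit group were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts, NOT the Prosper data (rows: Defaulted, Not Default).
counts = [[20, 10, 30, 40],
          [180, 90, 270, 60]]

stat, df = chi_squared(counts)
CRITICAL_DF3_ALPHA_05 = 7.815  # chi-squared critical value for df = 3, alpha = 0.05
reject_independence = df == 3 and stat > CRITICAL_DF3_ALPHA_05
```

A 2 x 4 table always has 3 degrees of freedom, which matches the `df = 3` reported above.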
The distribution of CreditScoreRangeAvg is approximately normal.
As shown in the interleaved histograms and density plots, the default group has an approximately normal distribution, with a much lower mean and a lower peak in the density plot than the non-default group.
The stacked density plot shows an even split at around 550: the default group dominates the part below 550, and the non-default group dominates the part above it.
The boxplot below shows the same result as the histogram and density plot above: the default group has a more spread-out distribution and a lower mean credit score.
Next, I am going to design a hypothesis test on Inquiries in the last 6 months across the two groups.
The F test for homogeneity of variance gave a p-value smaller than 0.05, so I conclude that the two variances are not equal. The equal-variance assumption of the t-test is therefore violated, so I chose the Mann-Whitney-Wilcoxon test as an alternative.
##
## F test to compare two variances
##
## data: newV_d and newV_n
## F = 2.3665, num df = 16851, denom df = 96489, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.312568 2.422122
## sample estimates:
## ratio of variances
## 2.366485
##
## Wilcoxon rank sum test with continuity correction
##
## data: newV_d and newV_n
## W = 500010000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
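The two statistics reported above can be reproduced by hand on a small sample. The following Python sketch is illustrative only: the function names and the two short samples are hypothetical and do not come from the Prosper data. It computes the ratio of sample variances (the F statistic) and the W statistic, defined as the number of pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half.

```python
from statistics import variance


def variance_ratio(x, y):
    """F statistic: sample variance of x divided by sample variance of y."""
    return variance(x) / variance(y)


def wilcoxon_w(x, y):
    """W statistic: pairs with x > y count 1, tied pairs count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x for b in y)


# Hypothetical inquiry counts, NOT the Prosper data.
defaulted = [0, 1, 2, 3, 6]
not_defaulted = [0, 0, 1, 1, 2]

f_stat = variance_ratio(defaulted, not_defaulted)
w_stat = wilcoxon_w(defaulted, not_defaulted)
```

The pairwise loop is quadratic in the sample sizes, so it only suits small examples; for samples as large as the ones in this report a rank-based computation would be the practical choice.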
At the 0.05 significance level, I conclude that the numbers of inquiries in the last 6 months for the default and non-default groups come from non-identical populations.