Variable Analysis “DebtToIncomeRatio”

Data set introduction

The dataset is an individual-loan data set provided by the P2P lending company Prosper in 2014. It contains 81 variables and 113,937 loans. The variables cover borrowers’ income characteristics, borrowers’ delinquency history, and each loan’s details (e.g. amount, interest rate, term).

## [1] 113937     81

I am going to explore the relationship between each independent variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.

I started with the variable DebtToIncomeRatio (DTI), defined as: DTI = total monthly debt payments / gross monthly income. I created a two-way frequency table to look at the distribution of DebtToIncomeRatio in the defaulted and not-defaulted groups.

Descriptive statistics on “DebtToIncomeRatio” by group

The maximum of “DebtToIncomeRatio” is 10.01 in both groups because Prosper assigns the value 10.01 to every borrower whose DebtToIncomeRatio is greater than 10. The mean of DebtToIncomeRatio is smaller in the not-defaulted group (0.263) than in the defaulted group (0.348), while the medians are the same (0.22).

## [1] "descriptive statistics for Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1400  0.2200  0.3484  0.3300 10.0100    1498
## [1] "descriptive statistics for Not Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.150   0.220   0.263   0.310  10.010    7056
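As a worked illustration of the definition and of the 10.01 cap described above, here is a minimal Python sketch. It is not the code used for this report (the output shown here comes from R); the function name `debt_to_income_ratio` and the choice to return `None` when income is missing or zero are my own assumptions.

```python
def debt_to_income_ratio(monthly_debt_payments, gross_monthly_income,
                         cap=10.0, capped_value=10.01):
    """DTI = total monthly debt payments / gross monthly income.

    Ratios greater than `cap` are recorded as `capped_value`, mirroring the
    10.01 convention described in the text. Returning None for a missing or
    zero income is an assumption made for this sketch only.
    """
    if gross_monthly_income is None or gross_monthly_income == 0:
        return None
    ratio = monthly_debt_payments / gross_monthly_income
    return capped_value if ratio > cap else ratio


# Hypothetical borrower: 1,500 of monthly debt payments on 5,000 gross income.
print(debt_to_income_ratio(1500, 5000))
# Hypothetical borrower whose ratio (12) exceeds the cap and is recorded as 10.01.
print(debt_to_income_ratio(60000, 5000))
```

Because of the cap, the reported maximum of 10.01 marks "greater than 10" rather than an observed ratio.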

I grouped DebtToIncomeRatio into three categories based on an online definition of good, caution, and danger levels. Good: <15%; Caution: 15%-20%; Danger: >20%. Loans with a missing DebtToIncomeRatio are labeled Unknown. Source: https://www.clearpoint.org/blog/what-is-a-good-debt-to-income-ratio-anyway/
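The grouping rule can be sketched in a few lines of Python. This is only an illustration using the cutoffs stated above, not the R code behind the table that follows. The function name `dti_category`, the treatment of the exact boundary values (15% and 20% both fall in Caution), and the mapping of missing values to Unknown are assumptions.

```python
from collections import Counter


def dti_category(dti):
    """Map a DebtToIncomeRatio (as a fraction, e.g. 0.22) to a category.

    Assumed boundaries: Good < 0.15, Caution 0.15 to 0.20 inclusive,
    Danger > 0.20. None or NaN is treated as Unknown.
    """
    if dti is None or dti != dti:  # NaN is the only value not equal to itself
        return "Unknown"
    if dti < 0.15:
        return "Good"
    if dti <= 0.20:
        return "Caution"
    return "Danger"


# Hypothetical ratios, for illustration only (not taken from the Prosper data).
sample = [0.05, 0.14, 0.15, 0.18, 0.20, 0.22, 0.35, 10.01, None]
print(Counter(dti_category(v) for v in sample))
```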

##         x  freq
## 1    Good 57092
## 2 Caution 18237
## 3  Danger 30049
## 4 Unknown  8554

Proportion of each DebtToIncomeRatio category within each loan-status group

As shown in the table below, where each row sums to 1 (up to rounding), the distribution of DebtToIncomeRatio categories is similar in the defaulted and not-defaulted groups.

##              
##               Good Caution Danger Unknown  Sum
##   Defaulted   0.48    0.15   0.28    0.09 1.00
##   Not Default 0.50    0.16   0.26    0.07 0.99

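Proportion of each loan-status group within each DebtToIncomeRatio category

As shown in the table below, where each column sums to 1, not-defaulted loans dominate every category (82% to 86%). The defaulted share is nearly constant across categories, rising only slightly from 14% in Good and Caution to 16% in Danger and 18% in Unknown.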

##              
##               Good Caution Danger Unknown
##   Defaulted   0.14    0.14   0.16    0.18
##   Not Default 0.86    0.86   0.84    0.82
##   Sum         1.00    1.00   1.00    1.00
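The two proportion tables above normalise the same two-way count table in two different directions. The Python sketch below shows the idea. The counts are hypothetical (the underlying count table is not shown in this report), and the function names are my own.

```python
def within_status(counts):
    """Divide each cell by its loan-status total (each status sums to 1)."""
    return {
        status: {cat: n / sum(row.values()) for cat, n in row.items()}
        for status, row in counts.items()
    }


def within_category(counts):
    """Divide each cell by its DTI-category total (each category sums to 1)."""
    totals = {}
    for row in counts.values():
        for cat, n in row.items():
            totals[cat] = totals.get(cat, 0) + n
    return {
        status: {cat: n / totals[cat] for cat, n in row.items()}
        for status, row in counts.items()
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = {
    "Defaulted":   {"Good": 140, "Danger": 160},
    "Not Default": {"Good": 860, "Danger": 840},
}
print(within_status(counts))    # distribution of categories inside each status
print(within_category(counts))  # share of each status inside each category
```

The first view answers "what does the category mix look like among defaulted loans?", while the second answers "what fraction of loans in a category defaulted?".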

Let’s run a simple test of association between DebtToIncomeRatio and Loan Status

The chi-squared test result shows a statistically significant relationship between the DebtToIncomeRatio category and Loan Status (p-value < 2.2e-16).

## 
##  Pearson's Chi-squared test
## 
## data:  DebtToIncomeRatio
## X-squared = 111.38, df = 3, p-value < 2.2e-16
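To make the test above concrete, here is a minimal Python sketch of the Pearson chi-squared statistic for a two-way count table, plus the upper-tail probability for 3 degrees of freedom (the df of a 2 x 4 table). It is an illustration rather than the R call that produced the output above. The small example table is hypothetical, and the closed-form tail formula used here holds only for df = 3.

```python
import math


def chi_squared_statistic(observed):
    """Return (statistic, df) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


def chi2_upper_tail_df3(x):
    """P(X > x) for a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical 2 x 2 table, only to show how the statistic is computed.
print(chi_squared_statistic([[10, 20], [20, 10]]))

# Tail probability for the statistic reported above (X-squared = 111.38, df = 3).
print(chi2_upper_tail_df3(111.38))
```

The second call returns a value far below 2.2e-16, which is consistent with the p-value printed in the test output.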

Let’s visualize DebtToIncomeRatio and compare the two groups.

Since the distribution of DebtToIncomeRatio is highly positively skewed, I excluded the values flagged as outliers by the interquartile-range rule before plotting. As shown in the interleaved histograms and density plots, the not-defaulted group has a positively skewed distribution and a much lower mean. In the density plot, both groups peak at a value of about 17%.
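The outlier exclusion can be sketched as follows. The exact rule used for the plots is not spelled out here, so this Python sketch assumes the common Tukey convention of dropping values more than 1.5 x IQR below the first quartile or above the third quartile. The function name, the multiplier `k`, and the quantile method are assumptions, and the sample values are hypothetical.

```python
import statistics


def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences, assumed rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical ratios including one capped value of 10.01.
ratios = [0.10, 0.15, 0.20, 0.22, 0.25, 0.30, 0.33, 10.01]
print(drop_iqr_outliers(ratios))
```

With a strongly right-skewed variable such as this one, the fences mostly trim the long upper tail, which keeps the histogram and density plots readable.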

The boxplot below shows the same result as the above histogram and density plot. The default group has a larger mean.

In the next analysis, I am going to design a hypothesis test on Inquiries in the last 6 months across the two groups.

What’s the result of the two independent samples t-test on the Debt To Income Ratio variable?

I obtained a p-value smaller than 0.05 from the test for homogeneity of variances (F test), so I reject the hypothesis that the two variances are equal. The equal-variance assumption of the t-test is therefore violated, and I chose the Mann-Whitney-Wilcoxon test as an alternative.

## 
##  F test to compare two variances
## 
## data:  newV_d and newV_n
## F = 4.2522, num df = 15527, denom df = 89849, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  4.151400 4.356316
## sample estimates:
## ratio of variances 
##           4.252183
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  newV_d and newV_n
## W = 699730000, p-value = 0.5411
## alternative hypothesis: true location shift is not equal to 0
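For readers who want to see what the two statistics above measure, here is a minimal Python sketch. It is not the R code behind the output. It computes the variance ratio F with its degrees of freedom, and the rank-sum statistic W with a two-sided p-value from the normal approximation with continuity and tie corrections. The function names and the tiny samples are hypothetical, and the approximation is meant for illustration on small inputs only.

```python
import math
import statistics


def variance_ratio(x, y):
    """Return (F, numerator df, denominator df) with F = var(x) / var(y)."""
    return statistics.variance(x) / statistics.variance(y), len(x) - 1, len(y) - 1


def rank_sum_test(x, y):
    """Return (W, two-sided p) using mid-ranks and a normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] is one group of tied values
        mid_rank = (i + 1 + j) / 2      # average of the ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    w = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    diff = w - n1 * n2 / 2
    correction = 0.5 if diff > 0 else (-0.5 if diff < 0 else 0.0)
    z = (diff - correction) / sigma
    return w, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical samples, for illustration only.
print(variance_ratio([1, 2, 3, 4], [1, 3, 5, 7]))
print(rank_sum_test([1, 2, 3], [4, 5, 6]))
```

W ranges from 0 to n1 x n2 and sits near n1 x n2 / 2 when neither sample tends to rank above the other, so a W close to that midpoint goes with a large p-value.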

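At the .05 significance level, I fail to reject the null hypothesis that the DebtToIncomeRatio distributions of the defaulted and not-defaulted groups are identical, since the p-value is 0.5411. In other words, this test gives no evidence of a location shift between the two groups.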