Variable Analysis “PublicRecordsLast10Years”

Data set introduction

The dataset is an individual loan data set provided by the P2P lending company Prosper in 2014. It contains 81 variables and 113,937 loans. Variables include the borrower’s income characteristics, the borrower’s delinquency history, each loan’s information (e.g. amount, interest rate, term), etc.

## [1] 113937     81

I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.

This time the variable under analysis is PublicRecordsLast10Years. I created a two-way frequency table to look at the distribution of PublicRecordsLast10Years in both the default and non-default groups.

Descriptive statistics on “PublicRecordsLast10Years” by group

## [1] "descriptive statistics for Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.445   1.000  30.000     233
## [1] "descriptive statistics for Not Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.2896  0.0000 38.0000     463

Based on col proportion

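As shown in the table below, within each loan-status group the values of PublicRecordsLast10Years are concentrated at small values: more than 90% of the loans in both groups have a value of 0 or 1 (0.92 for the defaulted group and 0.97 for the non-default group).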

##              
##                  0    1    2    3    4    5    6    7    8    9   10   11
##   Defaulted   0.70 0.22 0.05 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##   Not Default 0.77 0.20 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##              
##                 12   13   14   15   16   17   20   21   22   25   30   34
##   Defaulted   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##   Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##              
##                 38  Sum
##   Defaulted   0.00 1.00
##   Not Default 0.00 1.00

Based on row proportion

As shown in the table below, the non-default group dominates at every PublicRecordsLast10Years value from 0 to 8, although the defaulted share rises from 0.14 at a value of 0 to 0.43 at a value of 6. Because the counts for PublicRecordsLast10Years values above 9 are very small in both groups (fewer than 10 loans), the proportions there carry little meaning.

##              
##                  0    1    2    3    4    5    6    7    8    9   10   11
##   Defaulted   0.14 0.16 0.25 0.33 0.33 0.34 0.43 0.33 0.29 0.73 0.50 0.43
##   Not Default 0.86 0.84 0.75 0.67 0.67 0.66 0.57 0.67 0.71 0.27 0.50 0.57
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##              
##                 12   13   14   15   16   17   20   21   22   25   30   34
##   Defaulted   0.25 1.00 0.50 0.67 0.20 1.00 0.00 0.00 1.00 0.00 1.00 0.00
##   Not Default 0.75 0.00 0.50 0.33 0.80 0.00 1.00 1.00 0.00 1.00 0.00 1.00
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##              
##                 38
##   Defaulted   0.00
##   Not Default 1.00
##   Sum         1.00
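The two tables above normalize the same two-way frequency table in two directions. The report's output appears to come from R; as a language-neutral illustration, here is a minimal Python sketch (standard library only) of both normalizations. The records in `records` are hypothetical toy data, not the Prosper loans, and the helper names `within_group` and `within_value` are made up for this example.

```python
from collections import Counter

# Hypothetical toy records: (loan_status, public_records). NOT the Prosper data.
records = [
    ("Defaulted", 0), ("Defaulted", 0), ("Defaulted", 1), ("Defaulted", 2),
    ("Not Default", 0), ("Not Default", 0), ("Not Default", 0),
    ("Not Default", 0), ("Not Default", 1), ("Not Default", 1),
]

counts = Counter(records)                      # two-way frequency table
groups = sorted({g for g, _ in records})       # loan-status groups
values = sorted({v for _, v in records})       # distinct public-record counts


def within_group(counts, groups, values):
    """Share of each value inside one group (each group sums to 1)."""
    out = {}
    for g in groups:
        total = sum(counts[(g, v)] for v in values)
        out[g] = {v: counts[(g, v)] / total for v in values}
    return out


def within_value(counts, groups, values):
    """Share of each group at one value (each value sums to 1)."""
    out = {}
    for v in values:
        total = sum(counts[(g, v)] for g in groups)
        out[v] = {g: counts[(g, v)] / total for g in groups}
    return out


by_group = within_group(counts, groups, values)   # like the first table
by_value = within_value(counts, groups, values)   # like the second table

for g in groups:
    print(g, {v: round(p, 2) for v, p in by_group[g].items()})
for v in values:
    print(v, {g: round(p, 2) for g, p in by_value[v].items()})
```

The first normalization answers "how are the values distributed within a group?", the second answers "at a given value, what share of loans defaulted?". The second is the one that becomes unstable when a value has only a handful of loans.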

Let’s run a simple test on PublicRecordsLast10Years and Loan Status

The test result shows that PublicRecordsLast10Years and Loan Status have a statistically significant relationship (p-value < 2.2e-16).

## 
##  Pearson's Chi-squared test
## 
## data:  PublicRecordsLast10Years
## X-squared = 874.32, df = 24, p-value < 2.2e-16
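To make the statistic above less of a black box, here is a minimal Python sketch (standard library only) of how a Pearson chi-squared statistic and its degrees of freedom are computed from a two-way count table. The counts in `toy` are hypothetical, not the Prosper counts, and the function name `chi_squared` is illustrative. The sketch does not compute a p-value, since the standard library has no chi-squared distribution.

```python
def chi_squared(table):
    """Return (Pearson chi-squared statistic, degrees of freedom) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows are loan-status groups, columns are record counts 0, 1, 2.
toy = [[10, 20, 10],
       [30, 40, 10]]
stat, df = chi_squared(toy)
print(round(stat, 3), df)

# Consistency check on the reported output: 2 groups and the 25 distinct
# values shown as table columns give (2 - 1) * (25 - 1) degrees of freedom.
reported_df = (2 - 1) * (25 - 1)
print(reported_df)
```

The last lines show that the reported `df = 24` matches a table with 2 loan-status groups and 25 distinct PublicRecordsLast10Years values. Many of those columns hold very few loans, so the chi-squared approximation should be read with some caution.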

Let’s do some visualization on PublicRecordsLast10Years and compare the two groups.

Since the distribution of PublicRecordsLast10Years is highly positively skewed, I excluded the outliers, identified using the interquartile range, before further visualization. As shown in the interleaved histograms and density plots, the default group has a positively skewed distribution and a higher mean. In the density plot, both groups peak at a value of 0. However, the density of the non-default group drops off sharply, whereas that of the default group declines more smoothly. One caveat about PublicRecordsLast10Years is that the majority of its values are zero, which makes the shape of the distribution less informative.
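The outlier exclusion described above is not spelled out in the text. One common reading is the 1.5 × IQR rule, sketched below in Python (standard library only). Both the multiplier `k=1.5` and the function name `iqr_filter` are assumptions made for illustration, not details taken from the report, and the lists are hypothetical toy data.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical toy data: mostly zeros with one extreme value.
toy = [0, 0, 0, 0, 0, 0, 1, 1, 2, 30]
print(iqr_filter(toy))          # the extreme value 30 is dropped

# If at least three quarters of the values are 0, Q1 = Q3 = 0 and the IQR is 0,
# so this rule keeps only the zeros.
mostly_zero = [0] * 8 + [1, 2]
print(iqr_filter(mostly_zero))
```

The second call illustrates why zero-heavy variables are awkward for this rule: a group whose third quartile is 0, as in the "Not Defaulted" summary above, would lose every non-zero value if the rule were applied to that group alone.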

The boxplot below shows the same result as the histogram and density plot above. The default group has a more spread-out distribution and a larger mean.

What’s the result of the two independent samples t-test on the PublicRecordsLast10Years variable?

The test for homogeneity of variances (F test) returned a p-value smaller than 0.05, so I reject the hypothesis that the two variances are equal. The equal-variance assumption of the t-test is therefore violated, and I chose the non-parametric alternative, the Mann-Whitney-Wilcoxon test.

## 
##  F test to compare two variances
## 
## data:  newV_d and newV_n
## F = 2.1567, num df = 16792, denom df = 96442, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  2.107471 2.207466
## sample estimates:
## ratio of variances 
##           2.156682
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  newV_d and newV_n
## W = 866810000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

At the 0.05 significance level, I conclude that the PublicRecordsLast10Years distributions of the default and non-default groups are not identical.
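To show what the two statistics in the output above measure, here is a minimal Python sketch (standard library only). `variance_ratio` mirrors the F statistic, and `rank_sum_w` follows the convention used by R's `wilcox.test`, where W is the rank sum of the first sample minus n(n+1)/2. The function names and toy lists are hypothetical. Reading `newV_d` as the default group is an assumption based on the variable name.

```python
import statistics


def variance_ratio(x, y):
    """Ratio of sample variances, the quantity the F test reports."""
    return statistics.variance(x) / statistics.variance(y)


def rank_sum_w(x, y):
    """W = (sum of midranks of x in the pooled sample) - n_x * (n_x + 1) / 2."""
    pooled = sorted(x + y)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j for ties
        i = j
    return sum(ranks[v] for v in x) - len(x) * (len(x) + 1) / 2


# Hypothetical toy samples with many ties, like a zero-heavy count variable.
x = [0, 1, 2]
y = [0, 0, 1]
print(rank_sum_w(x, y))

# Reading the reported output: the F-test df imply the two sample sizes,
# and W / (n1 * n2) estimates P(first-sample value > second-sample value),
# with ties counted as one half.
n_first, n_second = 16792 + 1, 96442 + 1
w_reported = 866810000                     # W as printed (rounded)
prob_superiority = w_reported / (n_first * n_second)
print(round(prob_superiority, 3))
```

Under these assumptions, the reported W corresponds to roughly a 0.535 chance that a randomly chosen loan from the first sample has more public records than one from the second, with ties split evenly. The difference is statistically significant but modest in size.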