Variable Analysis “CurrentDelinquencies”

Data set introduction

The data set contains individual loan records published by Prosper, a P2P lending company, in 2014. It has 81 variables and 113,937 loans. The variables cover each borrower’s income characteristics and delinquency history, as well as each loan’s details (e.g. amount, interest rate, term).

## [1] 113937     81

I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.

I started with the variable CurrentDelinquencies. I created a two-way frequency table to look at the distribution of CurrentDelinquencies in the default and non-default groups.

Descriptive statistics on “CurrentDelinquencies” by group

## [1] "descriptive statistics for Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   1.601   1.000  83.000     233
## [1] "descriptive statistics for Not Defaulted Group"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.4163  0.0000 51.0000     463
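
A minimal sketch of how per-group summaries like the ones above could be produced in R. The `loans` data frame and its six rows are hypothetical stand-ins for the Prosper data, and the column name `Status` is an assumption.

```r
# Hypothetical toy data; the real analysis would use the Prosper loan data frame.
loans <- data.frame(
  Status = c("Defaulted", "Defaulted", "Defaulted",
             "Not Default", "Not Default", "Not Default"),
  CurrentDelinquencies = c(0, 3, NA, 0, 0, 1)
)

# summary() reports the quartiles, the mean, and the NA count for each group.
by_group <- tapply(loans$CurrentDelinquencies, loans$Status, summary)
print(by_group)

# Group means with missing values dropped explicitly.
group_means <- tapply(loans$CurrentDelinquencies, loans$Status, mean, na.rm = TRUE)
print(group_means)
```

Dropping the missing values explicitly keeps the means comparable with the `summary()` output, which also ignores NA's when computing the mean.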

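Based on column proportion

As shown in the table below, for every value from 1 to 11 the share of loans with that many CurrentDelinquencies is higher in the default group than in the non-default group. Only at 0 is the non-default group’s share higher (0.82 versus 0.62). From 12 upward, both groups round to 0.00. The Sum column falls short of 1 because it adds up the rounded cell values.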

##              
##                  0    1    2    3    4    5    6    7    8    9   10   11
##   Defaulted   0.62 0.13 0.06 0.04 0.03 0.02 0.02 0.01 0.01 0.01 0.01 0.01
##   Not Default 0.82 0.10 0.03 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00
##              
##                 12   13   14   15   16   17   18   19   20   21   22   23
##   Defaulted   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##   Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##              
##                 24   25   26   27   28   30   31   32   33   35   36   37
##   Defaulted   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##   Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##              
##                 39   40   41   45   50   51   57   59   64   82   83  Sum
##   Defaulted   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.97
##   Not Default 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98
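
The within-group proportions, and the reason a Sum column can fall short of 1, can be sketched with `prop.table()` and `addmargins()`. The `status` and `delinq` vectors below are hypothetical toy data, not the Prosper records.

```r
# Hypothetical toy data standing in for the loan records.
status <- c(rep("Defaulted", 7), rep("Not Default", 7))
delinq <- c(0, 0, 0, 0, 1, 2, 3,  0, 0, 0, 0, 0, 0, 1)

counts <- table(status, delinq)

# margin = 1 divides each count by its row (group) total,
# so every row of `within_group` sums to exactly 1.
within_group <- prop.table(counts, margin = 1)

# Rounding first and summing afterwards can drift away from 1 ...
sum_of_rounded <- addmargins(round(within_group, 2), margin = 2)
# ... whereas summing first and rounding afterwards cannot.
rounded_sum <- round(addmargins(within_group, margin = 2), 2)

print(sum_of_rounded)
print(rounded_sum)
```

Adding the margin before rounding is the safer order when the table is meant to show that each group's shares add up to 1.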

Based on row proportion

As shown in the table below, the non-default group dominates the row proportion for CurrentDelinquencies values from 0 to 7. From 8 to 23 the default group generally has the larger share, with exceptions at 9 and 17 and an even split at 10. Each value above 23 has fewer than 15 loans across the two groups combined, which is very small, so the row proportions there carry little meaning.

##              
##                  0    1    2    3    4    5    6    7    8    9   10   11
##   Defaulted   0.12 0.19 0.24 0.32 0.35 0.42 0.43 0.43 0.51 0.49 0.50 0.59
##   Not Default 0.88 0.81 0.76 0.68 0.65 0.58 0.57 0.57 0.49 0.51 0.50 0.41
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##              
##                 12   13   14   15   16   17   18   19   20   21   22   23
##   Defaulted   0.55 0.57 0.63 0.66 0.66 0.45 0.76 0.57 0.78 0.68 0.62 0.89
##   Not Default 0.45 0.43 0.37 0.34 0.34 0.55 0.24 0.43 0.22 0.32 0.38 0.11
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##              
##                 24   25   26   27   28   30   31   32   33   35   36   37
##   Defaulted   0.67 1.00 0.75 0.50 0.80 0.50 0.83 0.80 1.00 1.00 0.50 1.00
##   Not Default 0.33 0.00 0.25 0.50 0.20 0.50 0.17 0.20 0.00 0.00 0.50 0.00
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
##              
##                 39   40   41   45   50   51   57   59   64   82   83
##   Defaulted   1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00
##   Not Default 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00
##   Sum         1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
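
A short sketch of the row-proportion view, together with a check on how many loans stand behind each share. The toy vectors and the cutoff `min_n` are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data standing in for the loan records.
status <- c("Defaulted", "Not Default", "Not Default", "Not Default",
            "Defaulted", "Not Default", "Defaulted")
delinq <- c(0, 0, 0, 0, 1, 1, 9)

counts <- table(status, delinq)

# margin = 2 divides each count by its column total: the share of each
# group among loans with the same number of delinquencies.
share_by_value <- prop.table(counts, margin = 2)

# Column totals show how many loans stand behind each share.
n_by_value <- colSums(counts)
min_n <- 2  # arbitrary illustrative cutoff, not taken from the analysis
sparse_values <- names(n_by_value)[n_by_value < min_n]

print(round(share_by_value, 2))
print(sparse_values)
```

Listing the sparse values next to the proportions makes it easier to see which shares of 0.00 or 1.00 rest on only a handful of loans.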

Let’s do a simple test on CurrentDelinquencies and Loan Status

The test result shows that CurrentDelinquencies and Loan Status have a statistically significant relationship.

## 
##  Pearson's Chi-squared test
## 
## data:  CurrentDelinquencies
## X-squared = 6066.1, df = 46, p-value < 2.2e-16
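
With this many sparse columns, the large-sample chi-squared approximation may be unreliable. The sketch below uses hypothetical toy counts to inspect the expected cell counts and to request a Monte Carlo p-value through the `simulate.p.value` argument of `chisq.test()`.

```r
set.seed(1)  # simulated p-values depend on the random seed

# Hypothetical toy data with a long, sparse right tail.
status <- rep(c("Defaulted", "Not Default"), times = c(40, 160))
delinq <- c(rep(c(0, 1, 2, 5), times = c(20, 10, 8, 2)),
            rep(c(0, 1, 2, 5), times = c(130, 20, 9, 1)))

counts <- table(status, delinq)

# Expected counts under independence; values below about 5 are the usual
# warning sign for the chi-squared approximation.
expected <- suppressWarnings(chisq.test(counts))$expected
print(round(expected, 1))

# Monte Carlo p-value, which does not rely on the large-sample approximation.
mc_test <- chisq.test(counts, simulate.p.value = TRUE, B = 2000)
print(mc_test)
```

Another common option, not shown here, is to pool the rare high values into a single category before testing.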

Let’s do some visualization on CurrentDelinquencies, and compare the two groups.

The visualization results for the variable CurrentDelinquencies are similar to those for PublicRecordsLast10Years, except for the stacked density plot.

The distribution of CurrentDelinquencies is highly positively skewed, with most values equal to zero.

As shown in the interleaved histograms and density plots, the default group has a less positively skewed distribution and a much higher mean. In the density plot, both groups peak at 0. However, the non-default group’s density drops sharply after that, whereas the default group’s density declines much more smoothly. Therefore, borrowers in the non-default group are more likely to have 0, 1, or 2 CurrentDelinquencies.

According to the stacked density plot, the default group’s proportion is steadily increasing.

The boxplot below shows the same result as the histogram and density plot above. The default group has a more spread-out distribution and a larger mean. The mean lies outside the box, which shows that the data are so skewed, with so many zero values, that the mean is dragged to the right by the large positive tail.
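
The remark about the mean lying outside the box can be checked numerically. This is a minimal sketch on a hypothetical zero-inflated vector `x`, using `boxplot.stats()` to read off the hinges without drawing a plot.

```r
# Hypothetical zero-inflated counts: mostly zeros plus a few large values.
x <- c(rep(0, 80), rep(1, 10), rep(2, 5), 10, 20, 30, 40, 50)

# The five boxplot statistics: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
box <- boxplot.stats(x)$stats
upper_hinge <- box[4]  # upper edge of the box
m <- mean(x)

# TRUE when the mean lies above the box, as happens with a heavy right tail.
mean_above_box <- m > upper_hinge
print(c(mean = m, upper_hinge = upper_hinge, median = median(x)))
print(mean_above_box)
```

Because at least three quarters of the toy values are zero, the whole box collapses onto 0 while a few large values pull the mean above it.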

What’s the result of the two independent samples t-test on the CurrentDelinquencies variable?

The variance homogeneity test (F test) gave a p-value smaller than 0.05, so I reject the hypothesis that the two variances are equal. The equal-variance assumption of the two-sample t-test is therefore violated, and I chose the Mann-Whitney-Wilcoxon test as an alternative.

## 
##  F test to compare two variances
## 
## data:  newV_d and newV_n
## F = 6.655, num df = 16792, denom df = 96442, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  6.503126 6.811684
## sample estimates:
## ratio of variances 
##            6.65498
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  newV_d and newV_n
## W = 989320000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
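
A sketch of the two tests shown above, run on simulated data. The names `newV_d` and `newV_n` are reused from the output, but the values here are hypothetical draws from a negative binomial distribution, not the Prosper data.

```r
set.seed(42)

# Hypothetical skewed count data for two groups (not the Prosper data).
newV_d <- rnbinom(300, size = 0.3, mu = 1.6)  # stands in for the defaulted group
newV_n <- rnbinom(900, size = 0.3, mu = 0.4)  # stands in for the non-defaulted group

# F test for equality of variances. It assumes normal data, so with counts
# this skewed it is only a rough screen.
f_res <- var.test(newV_d, newV_n)

# Rank-based two-sample test that does not assume normality.
w_res <- wilcox.test(newV_d, newV_n)

print(f_res$p.value)
print(w_res$p.value)
```

Since the F test is itself sensitive to non-normality, a rank-based comparison is a reasonable choice for data this skewed regardless of the F test's outcome.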

At the .05 significance level, I conclude that the CurrentDelinquencies values of the default and non-default groups come from nonidentical populations.