Variable Analysis “AvailableBankcardCredit”

Data set introduction

The data set contains individual loan records provided by Prosper, a peer-to-peer (P2P) lending company, in 2014. It has 113,937 loans and 81 variables. The variables cover each borrower’s income characteristics and delinquency history, as well as each loan’s details (e.g. amount, interest rate, term).

## [1] 113937     81

I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.

This time I explore the variable AvailableBankcardCredit. I created a two-way frequency table to look at the distribution of AvailableBankcardCredit in the defaulted and not-defaulted groups.

Descriptive statistics on “AvailableBankcardCredit” by group

## [1] "descriptive statistics for Defaulted Group"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     203    1786    7128    7651  364300    3066

## [1] "descriptive statistics for Not Defaulted Group"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    1067    4540   11830   14140  646300    4474
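Per-group summaries like the two above can be produced by splitting the variable on loan status and calling `summary()` on each piece. The snippet below is a minimal sketch on invented toy values; the data frame `loans` and its column `Status` are assumed names, not taken from the original analysis.

```r
# Toy data standing in for the Prosper loans; all values are invented.
loans <- data.frame(
  Status = c("Defaulted", "Defaulted", "Defaulted",
             "Not Default", "Not Default", "Not Default"),
  AvailableBankcardCredit = c(0, 1500, NA, 800, 5000, 12000)
)

# Split the credit values by loan status, then summarise each group.
# summary() reports min, quartiles, median, mean, max, and the NA count.
by_status <- lapply(split(loans$AvailableBankcardCredit, loans$Status), summary)
print(by_status)
```

Keeping the NA count in the summary matters here, since both groups in the real data have several thousand missing values.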

Proportions within each loan status (rows sum to 1)

As shown in the table below, loans in the non-default group are spread almost evenly across the four AvailableBankcardCredit groups (0.24 to 0.28 each). In the default group the share rises as AvailableBankcardCredit falls, from 0.14 in the High group to 0.42 in the Low group, which may indicate that these borrowers were short of cash.

##              
##               High Upper Middle Lower Middle  Low  Sum
##   Defaulted   0.14         0.19         0.25 0.42 1.00
##   Not Default 0.24         0.24         0.28 0.24 1.00

Proportions within each credit group (columns sum to 1)

As shown in the table below, non-defaulted loans make up the majority of every AvailableBankcardCredit group. Their share is fairly stable across the High, Upper Middle, and Lower Middle groups (0.92 to 0.88) and then drops to 0.79 in the Low group, where the share of defaulted loans jumps from 0.12 to 0.21.

##              
##               High Upper Middle Lower Middle  Low
##   Defaulted   0.08         0.11         0.12 0.21
##   Not Default 0.92         0.89         0.88 0.79
##   Sum         1.00         1.00         1.00 1.00
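The two proportion tables differ only in which margin is normalised. The sketch below shows one way to build both with `prop.table()` and `addmargins()`. It runs on simulated data, and the four groups are cut at the overall quartiles, which is an assumption: the binning rule actually used for High, Upper Middle, Lower Middle, and Low is not stated.

```r
set.seed(1)

# Simulated stand-in for the loan data; names and values are illustrative only.
status <- factor(rep(c("Defaulted", "Not Default"), times = c(100, 300)))
credit <- c(rexp(100, rate = 1 / 4000), rexp(300, rate = 1 / 10000))

# Assumption: four groups cut at the overall quartiles.
breaks <- quantile(credit, probs = seq(0, 1, 0.25))
group <- cut(credit, breaks = breaks, include.lowest = TRUE,
             labels = c("Low", "Lower Middle", "Upper Middle", "High"))

tab <- table(status, group)
row_prop <- prop.table(tab, margin = 1)  # each loan-status row sums to 1
col_prop <- prop.table(tab, margin = 2)  # each credit-group column sums to 1

# Append the totals so the normalised margin is visible in the printout.
print(round(addmargins(row_prop, margin = 2), 2))  # adds a Sum column
print(round(addmargins(col_prop, margin = 1), 2))  # adds a Sum row
```

Normalising by row answers "where do defaulted loans sit?", while normalising by column answers "how risky is each credit group?".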

Let’s run a simple test of independence between AvailableBankcardCredit and Loan Status

The chi-squared test shows a statistically significant relationship between AvailableBankcardCredit and Loan Status (p-value < 2.2e-16).

## 
##  Pearson's Chi-squared test
## 
## data:  AvailableBankcardCreditgrop
## X-squared = 2143.5, df = 3, p-value < 2.2e-16
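A test of this kind can be run by passing a two-way table of counts to `chisq.test()`. The counts below are hypothetical and chosen only to make the example easy to check by hand; they are not the Prosper figures.

```r
# Hypothetical counts: loan status (rows) by credit group (columns).
counts <- matrix(c( 30,  40,  50, 80,
                   120, 110, 100, 70),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(
                   Status = c("Defaulted", "Not Default"),
                   Group  = c("High", "Upper Middle", "Lower Middle", "Low")))

# Pearson's chi-squared test of independence; df = (2 - 1) * (4 - 1) = 3.
result <- chisq.test(counts)
print(result)

# Expected counts under independence, useful for seeing which cells deviate.
print(round(result$expected, 1))
```

With a 2 x 4 table the test has 3 degrees of freedom, which matches the `df = 3` reported in the output above.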

Let’s visualize AvailableBankcardCredit and compare the two groups.

Since the distribution of AvailableBankcardCredit is positively skewed, I excluded outliers, identified from the interquartile range, before plotting. As shown in the interleaved histograms and density plots, the non-default group has a positively skewed distribution and a higher mean.

The boxplot below shows the same result as the histogram and density plot above: the non-default group has a more spread-out distribution and a larger mean.
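The exact outlier cutoff is not given, so the following is only a sketch of one common choice: Tukey's fences at 1.5 times the interquartile range beyond the quartiles. The data are simulated, and the names `loans`, `status`, and `credit` are placeholders.

```r
set.seed(42)

# Simulated, right-skewed stand-in for AvailableBankcardCredit.
loans <- data.frame(
  status = rep(c("Defaulted", "Not Default"), times = c(200, 600)),
  credit = c(rexp(200, rate = 1 / 4000), rexp(600, rate = 1 / 10000))
)

# Assumption: Tukey fences at 1.5 * IQR; the original rule may differ.
q <- quantile(loans$credit, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[[2]] - q[[1]])
keep <- !is.na(loans$credit) &
  loans$credit >= q[[1]] - fence &
  loans$credit <= q[[2]] + fence
trimmed <- loans[keep, ]

# Side-by-side boxplots of the trimmed values, one box per loan status.
boxplot(credit ~ status, data = trimmed, ylab = "AvailableBankcardCredit")
```

Trimming like this is for display only; dropping the long right tail changes the group means, so it is safer to run the hypothesis tests on the untrimmed values.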

What’s the result of the two independent samples t-test on the AvailableBankcardCredit variable?

The test for homogeneity of variance (F test) returned a p-value smaller than 0.05, so I reject the hypothesis that the two group variances are equal. The equal-variance assumption of the two-sample t-test is therefore violated, and I chose the Mann-Whitney-Wilcoxon test as an alternative.

## 
##  F test to compare two variances
## 
## data:  newV_d and newV_n
## F = 0.55389, num df = 13959, denom df = 92431, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5401903 0.5680801
## sample estimates:
## ratio of variances 
##          0.5538941

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  newV_d and newV_n
## W = 482630000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

At the .05 significance level, I conclude that the AvailableBankcardCredit values of the default and non-default groups come from non-identical populations.
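The two-step procedure above, a variance check followed by a rank-based comparison, can be sketched with `var.test()` and `wilcox.test()`. The vectors `credit_d` and `credit_n` are simulated stand-ins for the defaulted and non-defaulted values (the output above calls them `newV_d` and `newV_n`), so the numbers printed here will not match the report.

```r
set.seed(7)

# Simulated stand-ins for the two groups; rates and sizes are arbitrary.
credit_d <- rexp(150, rate = 1 / 4000)   # defaulted
credit_n <- rexp(450, rate = 1 / 10000)  # not defaulted

# Step 1: F test for equality of the two variances.
f_res <- var.test(credit_d, credit_n)
print(f_res)

# Step 2: Mann-Whitney-Wilcoxon rank sum test, which does not rely on
# equal variances or normality.
w_res <- wilcox.test(credit_d, credit_n)
print(w_res)
```

The F test is itself sensitive to non-normality, so on strongly skewed data like this its result is best read as a rough indication rather than a firm verdict.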