Variable Analysis “ListingCategory”

Data set introduction

The data set contains individual loan records published by the P2P lending company Prosper in 2014. It has 81 variables and 113,937 loans. The variables cover each borrower’s income characteristics and delinquency history, as well as each loan’s details (e.g. amount, interest rate, term).

## [1] 113937     81

I am going to explore the relationship between each independent variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.

I start with the variable ListingCategory. I created a two-way frequency table to look at the distribution of ListingCategory in the defaulted and not-defaulted groups.

Descriptive statistics on “ListingCategory” by group

## [1] "descriptive statistics for Defaulted Group"
##      Baby&Adoption         Motorcycle                 RV 
##                 13                  9                  1 
##              Taxes           Vacation      Wedding Loans 
##                 54                 53                 46 
##               Auto               Boat           Business 
##                329                  3               1410 
## Cosmetic Procedure Debt Consolidation    Engagement Ring 
##                 10               4707                  6 
##        Green Loans   Home Improvement Household Expenses 
##                  9                812                228 
##   Large Purchases      Medical/Dental      Not Available 
##                 46                131               6695 
##              Other      Personal Loan        Student Use 
##               1494                775                195
## [1] "descriptive statistics for Not Defaulted Group"
##      Baby&Adoption         Motorcycle                 RV 
##                186                295                 51 
##              Taxes           Vacation      Wedding Loans 
##                831                715                725 
##               Auto               Boat           Business 
##               2243                 82               5779 
## Cosmetic Procedure Debt Consolidation    Engagement Ring 
##                 81              53600                211 
##        Green Loans   Home Improvement Household Expenses 
##                 50               6621               1768 
##   Large Purchases      Medical/Dental      Not Available 
##                830               1391              10266 
##              Other      Personal Loan        Student Use 
##               9000               1620                561

Proportions within each loan status group (each row sums to 1)

The table below gives the share of each listing category within each loan status group. The “Not Available” category accounts for 39% of defaulted loans but only 11% of non-defaulted loans, a share nearly four times as large. More than half (55%) of non-defaulted loans were taken out for debt consolidation, compared with only 28% of defaulted loans.

##              
##                Baby&Adoption  Motorcycle   RV  Taxes  Vacation
##   Defaulted             0.00        0.00 0.00   0.00      0.00
##   Not Default           0.00        0.00 0.00   0.01      0.01
##              
##                Wedding Loans Auto Boat Business Cosmetic Procedure
##   Defaulted             0.00 0.02 0.00     0.08               0.00
##   Not Default           0.01 0.02 0.00     0.06               0.00
##              
##               Debt Consolidation Engagement Ring Green Loans
##   Defaulted                 0.28            0.00        0.00
##   Not Default               0.55            0.00        0.00
##              
##               Home Improvement Household Expenses Large Purchases 
##   Defaulted               0.05               0.01             0.00
##   Not Default             0.07               0.02             0.01
##              
##               Medical/Dental Not Available Other Personal Loan Student Use
##   Defaulted             0.01          0.39  0.09          0.05        0.01
##   Not Default           0.01          0.11  0.09          0.02        0.01
##              
##                Sum
##   Defaulted   0.99
##   Not Default 1.00
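
The headline shares in this table can be checked by hand from the counts in the frequency table. The following is a minimal Python sketch, not the code behind the original output (which appears to come from R). The two group totals are obtained by summing the 21 category counts of each group, and all variable names are illustrative.

```python
# Counts copied from the frequency table: (defaulted, not defaulted).
counts = {
    "Not Available": (6695, 10266),
    "Debt Consolidation": (4707, 53600),
}

# Group totals, obtained by summing all 21 categories within each group.
total_defaulted = 17026
total_not_defaulted = 96906

shares = {}
for category, (defaulted, not_defaulted) in counts.items():
    # Share of this category within each loan status group.
    shares[category] = (
        defaulted / total_defaulted,
        not_defaulted / total_not_defaulted,
    )

for category, (share_def, share_not) in shares.items():
    print(f"{category:20s} defaulted={share_def:.2f} "
          f"not defaulted={share_not:.2f} ratio={share_def / share_not:.2f}")
```

The ratio column makes the "nearly four times" comparison for “Not Available” explicit, since it divides the unrounded shares rather than the two-decimal values shown in the table.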

Proportions within each listing category (each column sums to 1)

The table below gives the split between defaulted and non-defaulted loans within each listing category. Non-defaulted loans dominate every category. The share of defaulted loans is noticeably higher than elsewhere in three categories: Not Available (39%), Personal Loan (32%), and Student Use (26%).

##              
##                Baby&Adoption  Motorcycle   RV  Taxes  Vacation
##   Defaulted             0.07        0.03 0.02   0.06      0.07
##   Not Default           0.93        0.97 0.98   0.94      0.93
##   Sum                   1.00        1.00 1.00   1.00      1.00
##              
##                Wedding Loans Auto Boat Business Cosmetic Procedure
##   Defaulted             0.06 0.13 0.04     0.20               0.11
##   Not Default           0.94 0.87 0.96     0.80               0.89
##   Sum                   1.00 1.00 1.00     1.00               1.00
##              
##               Debt Consolidation Engagement Ring Green Loans
##   Defaulted                 0.08            0.03        0.15
##   Not Default               0.92            0.97        0.85
##   Sum                       1.00            1.00        1.00
##              
##               Home Improvement Household Expenses Large Purchases 
##   Defaulted               0.11               0.11             0.05
##   Not Default             0.89               0.89             0.95
##   Sum                     1.00               1.00             1.00
##              
##               Medical/Dental Not Available Other Personal Loan Student Use
##   Defaulted             0.09          0.39  0.14          0.32        0.26
##   Not Default           0.91          0.61  0.86          0.68        0.74
##   Sum                   1.00          1.00  1.00          1.00        1.00
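
The within-category default share is simply defaulted / (defaulted + not defaulted) for each category. As a hedged illustration in Python (again a sketch rather than the original R code, with illustrative names), the snippet below recomputes it for a handful of categories taken from the frequency table and ranks them.

```python
# Counts copied from the frequency table: (defaulted, not defaulted).
counts = {
    "Not Available": (6695, 10266),
    "Personal Loan": (775, 1620),
    "Student Use": (195, 561),
    "Business": (1410, 5779),
    "Debt Consolidation": (4707, 53600),
}

# Share of defaulted loans within each listing category.
default_share = {
    category: defaulted / (defaulted + not_defaulted)
    for category, (defaulted, not_defaulted) in counts.items()
}

# Rank categories from the highest default share to the lowest.
ranked = sorted(default_share, key=default_share.get, reverse=True)
for category in ranked:
    print(f"{category:20s} {default_share[category]:.2f}")
```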

Let’s visualize ListingCategory and compare the two groups.

Since listing category and default status are both categorical variables, I chose a heatmap to visualize them. The visualization agrees with the frequency tables above.

Let’s run a simple test of association between ListingCategory and Loan Status.

The test result shows that ListingCategory and Loan Status have a statistically significant relationship.

## 
##  Pearson's Chi-squared test
## 
## data:  ListingCategory
## X-squared = 11429, df = 20, p-value < 2.2e-16
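
To show where the statistic comes from, here is a minimal Python sketch that recomputes Pearson's chi-squared statistic from the 21 x 2 frequency table. It is an independent re-derivation under stated assumptions, not the original R call: the function names are illustrative, no continuity correction is applied, and the p-value helper uses a closed form that holds only for an even number of degrees of freedom (20 here).

```python
import math

# (category, defaulted, not defaulted), copied from the frequency table.
COUNTS = [
    ("Baby&Adoption", 13, 186), ("Motorcycle", 9, 295), ("RV", 1, 51),
    ("Taxes", 54, 831), ("Vacation", 53, 715), ("Wedding Loans", 46, 725),
    ("Auto", 329, 2243), ("Boat", 3, 82), ("Business", 1410, 5779),
    ("Cosmetic Procedure", 10, 81), ("Debt Consolidation", 4707, 53600),
    ("Engagement Ring", 6, 211), ("Green Loans", 9, 50),
    ("Home Improvement", 812, 6621), ("Household Expenses", 228, 1768),
    ("Large Purchases", 46, 830), ("Medical/Dental", 131, 1391),
    ("Not Available", 6695, 10266), ("Other", 1494, 9000),
    ("Personal Loan", 775, 1620), ("Student Use", 195, 561),
]


def chi_square_independence(rows):
    """Return (statistic, degrees of freedom) for an r x 2 count table."""
    col_totals = [sum(r[1] for r in rows), sum(r[2] for r in rows)]
    grand_total = sum(col_totals)
    statistic = 0.0
    for _, defaulted, not_defaulted in rows:
        row_total = defaulted + not_defaulted
        for observed, col_total in zip((defaulted, not_defaulted), col_totals):
            # Expected count if category and loan status were independent.
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-squared variable (even dof only)."""
    if dof % 2 != 0:
        raise ValueError("this closed form requires an even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


statistic, dof = chi_square_independence(COUNTS)
p_value = chi2_sf_even_dof(statistic, dof)
print(f"X-squared = {statistic:.0f}, df = {dof}, p-value = {p_value:.3g}")
```

With a statistic this large the p-value underflows to zero in double precision, which is consistent with the reported `p-value < 2.2e-16`.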

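What are the Weights of Evidence (WOE) for the variable “ListingCategory”?

The definition and calculation of WOE are as follows: “The Weight of Evidence or WoE value is a widely used measure of the ‘strength’ of a grouping for separating good and bad risk (default). It is computed from the basic odds ratio: (Distribution of not default Outcomes) / (Distribution of default Outcomes)”. Note that the table below uses the opposite orientation: its GOODS column holds the defaulted counts and its BADS column the non-defaulted counts, and WOE is the natural log of PCT_G / PCT_B. A positive WOE therefore marks a category that is over-represented among defaulted loans.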

http://support.sas.com/documentation/cdl/en/prochp/66704/HTML/default/viewer.htm#prochp_hpbin_details02.htm

##                   CAT GOODS  BADS TOTAL    PCT_G   PCT_B    WOE         IV
## 18      Not Available  6695 10266 16961 0.393222 0.10594  1.312 0.37678023
## 11 Debt Consolidation  4707 53600 58307 0.276460 0.55311 -0.693 0.19185898
## 20      Personal Loan   775  1620  2395 0.045519 0.01672  1.002 0.02884982
## 9            Business  1410  5779  7189 0.082815 0.05964  0.328 0.00761117
## 14   Home Improvement   812  6621  7433 0.047692 0.06832 -0.360 0.00741730
## 16   Large Purchases     46   830   876 0.002702 0.00857 -1.154 0.00676493
## 4               Taxes    54   831   885 0.003172 0.00858 -0.995 0.00537477
## 6       Wedding Loans    46   725   771 0.002702 0.00748 -1.019 0.00486830
## 2          Motorcycle     9   295   304 0.000529 0.00304 -1.751 0.00440416
## 17     Medical/Dental   131  1391  1522 0.007694 0.01435 -0.624 0.00415305
## 21        Student Use   195   561   756 0.011453 0.00579  0.682 0.00386440
## 5            Vacation    53   715   768 0.003113 0.00738 -0.863 0.00368100
## 12    Engagement Ring     6   211   217 0.000352 0.00218 -1.821 0.00332344
## 15 Household Expenses   228  1768  1996 0.013391 0.01824 -0.309 0.00150089
## 1       Baby&Adoption    13   186   199 0.000764 0.00192 -0.922 0.00106546
## 8                Boat     3    82    85 0.000176 0.00085 -1.569 0.00105127
## 3                  RV     1    51    52 0.000059 0.00053 -2.193 0.00102525
## 7                Auto   329  2243  2572 0.019323 0.02315 -0.181 0.00069005
## 19              Other  1494  9000 10494 0.087748 0.09287 -0.057 0.00029095
## 10 Cosmetic Procedure    10    81    91 0.000587 0.00084 -0.353 0.00008770
## 13        Green Loans     9    50    59 0.000529 0.00052  0.024 0.00000031
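
The WOE and IV columns can be reproduced from the frequency counts alone. The Python sketch below is an illustration under stated assumptions rather than the original R code: it treats default as the event (matching the GOODS column, whose counts equal the Defaulted group), takes WOE as the natural log of the two distribution shares, and uses illustrative variable names.

```python
import math

# (category, defaulted, not defaulted), copied from the frequency table.
COUNTS = [
    ("Baby&Adoption", 13, 186), ("Motorcycle", 9, 295), ("RV", 1, 51),
    ("Taxes", 54, 831), ("Vacation", 53, 715), ("Wedding Loans", 46, 725),
    ("Auto", 329, 2243), ("Boat", 3, 82), ("Business", 1410, 5779),
    ("Cosmetic Procedure", 10, 81), ("Debt Consolidation", 4707, 53600),
    ("Engagement Ring", 6, 211), ("Green Loans", 9, 50),
    ("Home Improvement", 812, 6621), ("Household Expenses", 228, 1768),
    ("Large Purchases", 46, 830), ("Medical/Dental", 131, 1391),
    ("Not Available", 6695, 10266), ("Other", 1494, 9000),
    ("Personal Loan", 775, 1620), ("Student Use", 195, 561),
]

total_defaulted = sum(d for _, d, _ in COUNTS)
total_not_defaulted = sum(n for _, _, n in COUNTS)

woe = {}
iv = {}
for category, defaulted, not_defaulted in COUNTS:
    pct_g = defaulted / total_defaulted          # PCT_G in the table
    pct_b = not_defaulted / total_not_defaulted  # PCT_B in the table
    woe[category] = math.log(pct_g / pct_b)
    iv[category] = (pct_g - pct_b) * woe[category]

# The variable's IV is the sum over its categories.
total_iv = sum(iv.values())

for category in sorted(iv, key=iv.get, reverse=True)[:5]:
    print(f"{category:20s} WOE={woe[category]:7.3f} IV={iv[category]:.5f}")
print(f"Total IV = {total_iv:.4f}")
```

Because each category's IV is a product of two terms with the same sign, every contribution is non-negative, so sparse categories with extreme WOE values (such as RV) still add little to the total.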

IV (Information value) of categorical variable Listing Category

“IV = (perc.Good - perc.Bad) * WOE. The IV of the categorical variables is the sum of information value of its individual categories.” http://r-statistics.co/Information-Value-With-R.html

## [1] 0.6547
## attr(,"howgood")
## [1] "Highly Predictive"