The dataset is loan-level data provided by the peer-to-peer (P2P) lending company Prosper in 2014. It contains 113,937 loans and 81 variables. The variables include each borrower's income characteristics and delinquency history, and each loan's details (e.g. amount, interest rate, term).
## [1] 113937 81
I am going to explore the relationship between each explanatory variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
I started with the variable ListingCategory. I created a two-way frequency table to look at the distribution of ListingCategory in the defaulted and not-defaulted groups.
## [1] "descriptive statistics for Defaulted Group"
## Baby&Adoption Motorcycle RV
## 13 9 1
## Taxes Vacation Wedding Loans
## 54 53 46
## Auto Boat Business
## 329 3 1410
## Cosmetic Procedure Debt Consolidation Engagement Ring
## 10 4707 6
## Green Loans Home Improvement Household Expenses
## 9 812 228
## Large Purchases Medical/Dental Not Available
## 46 131 6695
## Other Personal Loan Student Use
## 1494 775 195
## [1] "descriptive statistics for Not Defaulted Group"
## Baby&Adoption Motorcycle RV
## 186 295 51
## Taxes Vacation Wedding Loans
## 831 715 725
## Auto Boat Business
## 2243 82 5779
## Cosmetic Procedure Debt Consolidation Engagement Ring
## 81 53600 211
## Green Loans Home Improvement Household Expenses
## 50 6621 1768
## Large Purchases Medical/Dental Not Available
## 830 1391 10266
## Other Personal Loan Student Use
## 9000 1620 561
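A two-way frequency table of this kind can be built by counting each (category, status) pair. The following is a minimal Python sketch of that idea, not the code that produced the output above; the `loans` records are toy values invented for illustration, and `two_way_table` is a hypothetical helper name.

```python
from collections import Counter

# Toy records (hypothetical, NOT taken from the Prosper data):
# each tuple is (listing category, loan status).
loans = [
    ("Auto", "Defaulted"),
    ("Auto", "Not Defaulted"),
    ("Business", "Defaulted"),
    ("Business", "Not Defaulted"),
    ("Business", "Not Defaulted"),
    ("Taxes", "Not Defaulted"),
]

# Count how often each (category, status) pair occurs.
counts = Counter(loans)


def two_way_table(pair_counts):
    """Return {status: {category: count}}, filling absent pairs with 0."""
    categories = sorted({cat for cat, _ in pair_counts})
    statuses = sorted({status for _, status in pair_counts})
    return {
        s: {c: pair_counts.get((c, s), 0) for c in categories}
        for s in statuses
    }


table = two_way_table(counts)
for status, row in table.items():
    print(status, row)
```

Filling missing pairs with zero keeps every status row the same length, which matters for rare categories that may have no defaulted loans at all.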
The table below gives the distribution of listing categories within each loan-status group (each row sums to 1, up to rounding). The share of “Not Available” listings in the defaulted group (39%) is roughly three and a half times that in the non-defaulted group (11%). Conversely, more than half of the non-defaulted loans (55%) were used for debt consolidation, compared with only 28% of the defaulted loans.
##
## Baby&Adoption Motorcycle RV Taxes Vacation
## Defaulted 0.00 0.00 0.00 0.00 0.00
## Not Default 0.00 0.00 0.00 0.01 0.01
##
## Wedding Loans Auto Boat Business Cosmetic Procedure
## Defaulted 0.00 0.02 0.00 0.08 0.00
## Not Default 0.01 0.02 0.00 0.06 0.00
##
## Debt Consolidation Engagement Ring Green Loans
## Defaulted 0.28 0.00 0.00
## Not Default 0.55 0.00 0.00
##
## Home Improvement Household Expenses Large Purchases
## Defaulted 0.05 0.01 0.00
## Not Default 0.07 0.02 0.01
##
## Medical/Dental Not Available Other Personal Loan Student Use
## Defaulted 0.01 0.39 0.09 0.05 0.01
## Not Default 0.01 0.11 0.09 0.02 0.01
##
## Sum
## Defaulted 0.99
## Not Default 1.00
The table below gives the share of defaulted and non-defaulted loans within each listing category (each column sums to 1). Non-defaulted loans are the majority in every category. The defaulted share is highest in Not Available (39%), Personal Loan (32%), and Student Use (26%).
##
## Baby&Adoption Motorcycle RV Taxes Vacation
## Defaulted 0.07 0.03 0.02 0.06 0.07
## Not Default 0.93 0.97 0.98 0.94 0.93
## Sum 1.00 1.00 1.00 1.00 1.00
##
## Wedding Loans Auto Boat Business Cosmetic Procedure
## Defaulted 0.06 0.13 0.04 0.20 0.11
## Not Default 0.94 0.87 0.96 0.80 0.89
## Sum 1.00 1.00 1.00 1.00 1.00
##
## Debt Consolidation Engagement Ring Green Loans
## Defaulted 0.08 0.03 0.15
## Not Default 0.92 0.97 0.85
## Sum 1.00 1.00 1.00
##
## Home Improvement Household Expenses Large Purchases
## Defaulted 0.11 0.11 0.05
## Not Default 0.89 0.89 0.95
## Sum 1.00 1.00 1.00
##
## Medical/Dental Not Available Other Personal Loan Student Use
## Defaulted 0.09 0.39 0.14 0.32 0.26
## Not Default 0.91 0.61 0.86 0.68 0.74
## Sum 1.00 1.00 1.00 1.00 1.00
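The two proportion tables differ only in the denominator: the first divides each count by its loan-status group total, the second by its listing-category total. The Python sketch below illustrates this with a few counts copied from the frequency tables above. The two group totals are my own sums of the 21 category counts, and the function names are hypothetical.

```python
# (defaulted, not defaulted) counts copied from the frequency tables above.
counts = {
    "Debt Consolidation": (4707, 53600),
    "Not Available": (6695, 10266),
    "Personal Loan": (775, 1620),
    "Student Use": (195, 561),
}

# Group totals obtained by summing all 21 categories in those tables.
TOTAL_DEFAULTED = 17026
TOTAL_NOT_DEFAULTED = 96906


def share_within_group(category):
    """Share of a category inside each loan-status group."""
    defaulted, not_defaulted = counts[category]
    return defaulted / TOTAL_DEFAULTED, not_defaulted / TOTAL_NOT_DEFAULTED


def default_rate_within_category(category):
    """Share of defaulted loans inside one listing category."""
    defaulted, not_defaulted = counts[category]
    return defaulted / (defaulted + not_defaulted)


for category in counts:
    in_defaulted, in_not_defaulted = share_within_group(category)
    rate = default_rate_within_category(category)
    print(f"{category:20s} {in_defaulted:.2f} {in_not_defaulted:.2f} {rate:.2f}")
```

Reading both views side by side helps avoid a common mix-up: a category can make up a large share of all defaults (Debt Consolidation) while still having a low default rate of its own.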
Since listing category and default status are both categorical variables, I chose a heatmap to visualize them. The visualization is consistent with the frequency tables above.
The chi-squared test result shows that ListingCategory and Loan Status have a statistically significant relationship (p < 2.2e-16).
##
## Pearson's Chi-squared test
##
## data: ListingCategory
## X-squared = 11429, df = 20, p-value < 2.2e-16
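For reference, the Pearson chi-squared statistic compares each observed count with the count expected if listing category and loan status were independent. The Python sketch below recomputes the statistic by hand from the frequency tables above. It is an illustration under the assumption that the test was run on that same 21 x 2 table, not the code behind the reported output, and `chi_squared` is a hypothetical helper.

```python
# (defaulted, not defaulted) counts per listing category, from the tables above.
counts = {
    "Baby&Adoption": (13, 186),
    "Motorcycle": (9, 295),
    "RV": (1, 51),
    "Taxes": (54, 831),
    "Vacation": (53, 715),
    "Wedding Loans": (46, 725),
    "Auto": (329, 2243),
    "Boat": (3, 82),
    "Business": (1410, 5779),
    "Cosmetic Procedure": (10, 81),
    "Debt Consolidation": (4707, 53600),
    "Engagement Ring": (6, 211),
    "Green Loans": (9, 50),
    "Home Improvement": (812, 6621),
    "Household Expenses": (228, 1768),
    "Large Purchases": (46, 830),
    "Medical/Dental": (131, 1391),
    "Not Available": (6695, 10266),
    "Other": (1494, 9000),
    "Personal Loan": (775, 1620),
    "Student Use": (195, 561),
}


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            # Expected count under independence of row and column.
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_squared(counts)
print(f"X-squared = {stat:.0f}, df = {dof}")
```

With 21 categories and 2 statuses the degrees of freedom are (21 - 1) * (2 - 1) = 20, matching the output above. Most of the statistic comes from the Not Available and Debt Consolidation rows, where observed and expected defaults differ the most.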
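The definition and calculation of WOE are as follows: “The Weight of Evidence or WoE value is a widely used measure of the ‘strength’ of a grouping for separating good and bad risk (default). It is computed from the basic odds ratio: (Distribution of not default Outcomes) / (Distribution of default Outcomes)”. Note that in the output below the GOODS column holds the defaulted counts and the BADS column the non-defaulted counts, and WOE is the natural log of PCT_G / PCT_B. A positive WOE therefore marks a category that is over-represented among defaulted loans.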
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 18 Not Available 6695 10266 16961 0.393222 0.10594 1.312 0.37678023
## 11 Debt Consolidation 4707 53600 58307 0.276460 0.55311 -0.693 0.19185898
## 20 Personal Loan 775 1620 2395 0.045519 0.01672 1.002 0.02884982
## 9 Business 1410 5779 7189 0.082815 0.05964 0.328 0.00761117
## 14 Home Improvement 812 6621 7433 0.047692 0.06832 -0.360 0.00741730
## 16 Large Purchases 46 830 876 0.002702 0.00857 -1.154 0.00676493
## 4 Taxes 54 831 885 0.003172 0.00858 -0.995 0.00537477
## 6 Wedding Loans 46 725 771 0.002702 0.00748 -1.019 0.00486830
## 2 Motorcycle 9 295 304 0.000529 0.00304 -1.751 0.00440416
## 17 Medical/Dental 131 1391 1522 0.007694 0.01435 -0.624 0.00415305
## 21 Student Use 195 561 756 0.011453 0.00579 0.682 0.00386440
## 5 Vacation 53 715 768 0.003113 0.00738 -0.863 0.00368100
## 12 Engagement Ring 6 211 217 0.000352 0.00218 -1.821 0.00332344
## 15 Household Expenses 228 1768 1996 0.013391 0.01824 -0.309 0.00150089
## 1 Baby&Adoption 13 186 199 0.000764 0.00192 -0.922 0.00106546
## 8 Boat 3 82 85 0.000176 0.00085 -1.569 0.00105127
## 3 RV 1 51 52 0.000059 0.00053 -2.193 0.00102525
## 7 Auto 329 2243 2572 0.019323 0.02315 -0.181 0.00069005
## 19 Other 1494 9000 10494 0.087748 0.09287 -0.057 0.00029095
## 10 Cosmetic Procedure 10 81 91 0.000587 0.00084 -0.353 0.00008770
## 13 Green Loans 9 50 59 0.000529 0.00052 0.024 0.00000031
“IV = (perc.Good - perc.Bad) * WOE. The IV of the categorical variables is the sum of information value of its individual categories.” Source: http://r-statistics.co/Information-Value-With-R.html
## [1] 0.6547
## attr(,"howgood")
## [1] "Highly Predictive"
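The WOE and IV columns can be checked by hand from the counts. The Python sketch below reproduces two rows of the WOE table, assuming (as the table's numbers imply) that PCT_G is a category's share of defaulted loans and PCT_B its share of non-defaulted loans. The group totals are my own sums of the 21 category counts, and `woe_and_iv` is a hypothetical helper, not a function from the package used for the original output.

```python
import math

# Group totals obtained by summing the GOODS and BADS columns of the WOE table.
TOTAL_GOODS = 17026  # defaulted loans (the GOODS column)
TOTAL_BADS = 96906   # non-defaulted loans (the BADS column)

# (GOODS, BADS) counts copied from the WOE table.
rows = {
    "Not Available": (6695, 10266),
    "Debt Consolidation": (4707, 53600),
}


def woe_and_iv(goods, bads):
    """Return (WOE, IV contribution) for a single category."""
    pct_g = goods / TOTAL_GOODS
    pct_b = bads / TOTAL_BADS
    woe = math.log(pct_g / pct_b)   # natural log of the distribution ratio
    iv = (pct_g - pct_b) * woe      # this category's share of the total IV
    return woe, iv


for category, (goods, bads) in rows.items():
    woe, iv = woe_and_iv(goods, bads)
    print(f"{category:20s} WOE = {woe:.3f}  IV = {iv:.5f}")
```

Each IV contribution is non-negative because (PCT_G - PCT_B) and WOE always share the same sign. Summing the contributions over all 21 categories gives the variable's total IV, and these two rows alone account for most of the 0.6547 reported above.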