========================================================
The dataset consists of individual loan records published in 2014 by Prosper, a peer-to-peer (P2P) lending company. It contains 113,937 loans and 81 variables. The variables cover each borrower's income characteristics and delinquency history, as well as each loan's details (e.g. amount, interest rate, term).
## [1] 113937 81
I manually selected ten predictor variables based on my own judgement.
Ideally, events and non-events would each make up about half of the data. Default and fraud data sets are naturally imbalanced, however, because only a small share of people default or commit fraud. The class counts below show that this dataset is no exception: 17,026 loans fall in class 1 against 96,906 in class 0, so only about 15% of the loans are events. One remedy is to build the model on a sample in which the two classes appear in equal proportions.
##
## 0 1
## 96906 17026
I create the training and test samples by drawing the two classes in equal proportions. In other words, the training data set contains the same number of “0” and “1” observations.
## [1] "StatedMonthlyIncome" "EmploymentStatusDuration"
## [3] "DebtToIncomeRatio" "ListingCategory"
## [5] "IncomeVerifiable" "CreditScoreRangeAvg"
## [7] "InquiriesLast6Months" "PublicRecordsLast10Years"
## [9] "CurrentDelinquencies" "AvailableBankcardCredit"
## [11] "LoanStatus_B"
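The balanced sampling described above can be sketched as follows. This is a minimal illustration in Python rather than the R used for the analysis, and it is not the original code: the function name `balanced_train_indices`, the 70% training fraction, the seed, and the toy labels are all assumptions made for the example.

```python
import random

def balanced_train_indices(labels, train_frac=0.7, seed=42):
    """Return shuffled row indices containing equally many 0s and 1s."""
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    # Size both class samples from the minority class so they can be matched.
    n_per_class = int(train_frac * min(len(ones), len(zeros)))
    train = rng.sample(ones, n_per_class) + rng.sample(zeros, n_per_class)
    rng.shuffle(train)
    return train

# Toy labels (made up) with roughly the 85/15 split seen in the class counts.
labels = [0] * 85 + [1] * 15
train_idx = balanced_train_indices(labels)
train_labels = [labels[i] for i in train_idx]
print(train_labels.count(0), train_labels.count(1))
```

Undersampling the majority class this way discards many "0" rows, which is the price paid for a balanced training set.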
Although some modelling functions impute missing values with the mean, median or mode by default, doing so increases the time needed to build the model. To avoid that wait, I impute the missing values beforehand. Numeric variables are filled with the median, because most of them are highly skewed and the mean would be distorted by outliers and fat tails. Factor variables are filled with the mode (the most frequent value).
##
## FALSE TRUE
## 254045 8151
##
## FALSE
## 262196
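The median and mode imputation described above can be sketched as follows. This is a minimal Python illustration rather than the original R code; the helper name `impute` and both toy columns are assumptions made for the example.

```python
from statistics import median, mode

def impute(values, numeric=True):
    """Replace None with the median (numeric) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Toy columns (made up): one skewed numeric column, one factor column.
income = [2500.0, 3100.0, None, 4200.0, 250000.0]
status = ["Employed", None, "Employed", "Retired"]

print(impute(income))                  # gap filled with the median
print(impute(status, numeric=False))   # gap filled with the most frequent level
```

The single extreme income barely moves the median, whereas it would pull the mean far above the typical value. That is the reason given above for preferring the median.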
## VARS IV
## 6 CreditScoreRangeAvg 0.6594
## 7 InquiriesLast6Months 0.4521
## 10 AvailableBankcardCredit 0.4338
## 9 CurrentDelinquencies 0.3070
## 4 EmploymentStatusDuration 0.2712
## 3 StatedMonthlyIncome 0.2064
## 5 DebtToIncomeRatio 0.0573
## 8 PublicRecordsLast10Years 0.0520
## 2 IncomeVerifiable 0.0054
## 1 ListingCategory 0.0000
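For readers unfamiliar with the IV column, the sketch below shows one common way to compute the information value of a categorical predictor from weight of evidence (WOE). It is a Python illustration, not the code that produced the table above. The function name `information_value`, the toy data, and the choice to skip categories with an empty class are assumptions, and a numeric variable would first have to be binned.

```python
import math
from collections import Counter

def information_value(categories, labels):
    """IV = sum over categories of (share of 1s - share of 0s) * WOE."""
    ones = Counter(c for c, y in zip(categories, labels) if y == 1)
    zeros = Counter(c for c, y in zip(categories, labels) if y == 0)
    total_ones, total_zeros = sum(ones.values()), sum(zeros.values())
    iv = 0.0
    for cat in set(categories):
        p1 = ones[cat] / total_ones    # share of all 1s in this category
        p0 = zeros[cat] / total_zeros  # share of all 0s in this category
        if p1 == 0 or p0 == 0:
            continue  # assumption: skip categories where WOE is undefined
        woe = math.log(p1 / p0)
        iv += (p1 - p0) * woe
    return iv

# Toy data (made up): "low" is mostly 1s, "high" is mostly 0s.
categories = ["low"] * 4 + ["high"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(round(information_value(categories, labels), 4))

# A predictor distributed identically in both classes carries no information.
flat = ["a", "b", "a", "b", "a", "b", "a", "b"]
print(information_value(flat, [1, 1, 0, 0, 1, 1, 0, 0]))
```

An IV of zero, as reported for ListingCategory, means the variable's distribution is the same in both classes under whatever binning was used.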
##
## Call:
## glm(formula = LoanStatus_B ~ CreditScoreRangeAvg + InquiriesLast6Months +
## CurrentDelinquencies + EmploymentStatusDuration + StatedMonthlyIncome,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6405 -0.9885 0.0000 1.0683 5.8277
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 4.583753711 0.178807930 25.635
## CreditScoreRangeAvg -0.006670617 0.000258096 -25.845
## InquiriesLast6Months 0.231902681 0.007444451 31.151
## CurrentDelinquencies 0.119722417 0.009373462 12.772
## EmploymentStatusDuration -0.001600556 0.000169243 -9.457
## StatedMonthlyIncome -0.000080159 0.000004513 -17.760
## Pr(>|z|)
## (Intercept) <0.0000000000000002 ***
## CreditScoreRangeAvg <0.0000000000000002 ***
## InquiriesLast6Months <0.0000000000000002 ***
## CurrentDelinquencies <0.0000000000000002 ***
## EmploymentStatusDuration <0.0000000000000002 ***
## StatedMonthlyIncome <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33044 on 23835 degrees of freedom
## Residual deviance: 28086 on 23830 degrees of freedom
## AIC: 28098
##
## Number of Fisher Scoring iterations: 5
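To make the fitted coefficients concrete, the sketch below applies the logistic formula to one borrower. It is a Python illustration: the coefficients are copied from the summary above, while the borrower's values and the function name `predict_probability` are hypothetical.

```python
import math

# Coefficients copied from the glm summary above.
COEF = {
    "(Intercept)": 4.583753711,
    "CreditScoreRangeAvg": -0.006670617,
    "InquiriesLast6Months": 0.231902681,
    "CurrentDelinquencies": 0.119722417,
    "EmploymentStatusDuration": -0.001600556,
    "StatedMonthlyIncome": -0.000080159,
}

def predict_probability(borrower):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    log_odds = COEF["(Intercept)"] + sum(COEF[k] * v for k, v in borrower.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical borrower; none of these values come from the data set.
borrower = {
    "CreditScoreRangeAvg": 700,
    "InquiriesLast6Months": 1,
    "CurrentDelinquencies": 0,
    "EmploymentStatusDuration": 60,
    "StatedMonthlyIncome": 5000,
}
p = predict_probability(borrower)
print(round(p, 3))

# exp(coefficient) is the odds ratio for a one-unit increase in the predictor.
odds_ratio_inquiry = math.exp(COEF["InquiriesLast6Months"])
print(round(odds_ratio_inquiry, 3))
```

Because the model was trained on a sample with equal numbers of "0" and "1", the predicted probabilities are relative to a 50% base rate. They are likely to overstate the default rate in the full, imbalanced population.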
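Sensitivity, also known as the true positive rate or recall, is the proportion of events (ones) that the model predicts correctly at a given probability cutoff. The table below lists sensitivity against one minus specificity (the false positive rate) for cutoffs from 1.00 down to 0.00. Plotted against each other, these pairs trace the ROC curve.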
## One_minus_specificity sensitivity Threshold
## 1 0.000000000 0.00000000 1.00
## 2 0.002129712 0.03797964 0.98
## 3 0.004118228 0.06049334 0.96
## 4 0.006106744 0.08183242 0.94
## 5 0.008318821 0.10630384 0.92
## 6 0.011048619 0.12881754 0.90
## 7 0.013884313 0.14996085 0.88
## 8 0.016943569 0.16836335 0.86
## 9 0.020344049 0.18989820 0.84
## 10 0.024097520 0.21104150 0.82
## 11 0.028451075 0.23198904 0.80
## 12 0.032769332 0.25215348 0.78
## 13 0.037487645 0.26977291 0.76
## 14 0.042782510 0.29072044 0.74
## 15 0.048724526 0.31166797 0.72
## 16 0.055996141 0.33281128 0.70
## 17 0.064044336 0.35297572 0.68
## 18 0.072445522 0.37255286 0.66
## 19 0.082788158 0.39506656 0.64
## 20 0.095213442 0.41699295 0.62
## 21 0.109768438 0.44303054 0.60
## 22 0.126488445 0.47376664 0.58
## 23 0.146644232 0.50939702 0.56
## 24 0.170094602 0.54209084 0.54
## 25 0.198286817 0.58183242 0.52
## 26 0.230973785 0.61922475 0.50
## 27 0.269838095 0.65935787 0.48
## 28 0.314632654 0.69812060 0.46
## 29 0.363310114 0.73610023 0.44
## 30 0.413987857 0.77368833 0.42
## 31 0.467713089 0.80579483 0.40
## 32 0.524615240 0.84103367 0.38
## 33 0.580387819 0.86589663 0.36
## 34 0.636442792 0.89291308 0.34
## 35 0.692038876 0.91503524 0.32
## 36 0.742281263 0.93089272 0.30
## 37 0.789758554 0.94753328 0.28
## 38 0.832741093 0.95947533 0.26
## 39 0.869169765 0.97004699 0.24
## 40 0.901315480 0.97866092 0.22
## 41 0.928213395 0.98492561 0.20
## 42 0.949875277 0.99079875 0.18
## 43 0.965995199 0.99393109 0.16
## 44 0.977502706 0.99706343 0.14
## 45 0.986021556 0.99804229 0.12
## 46 0.991575281 0.99882537 0.10
## 47 0.994846331 0.99902114 0.08
## 48 0.996834847 0.99960846 0.06
## 49 0.998093849 0.99980423 0.04
## 50 0.998964560 0.99980423 0.02
## 51 1.000000000 1.00000000 0.00
## 52 1.000000000 1.00000000 -0.02
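Each row of the table above can be computed from the test labels and the predicted probabilities. The Python sketch below shows the calculation for a single cutoff. The function name `roc_point`, the toy labels and scores, and the use of `>=` at the cutoff are assumptions; the original R code may treat ties differently.

```python
def roc_point(labels, scores, threshold):
    """Return (1 - specificity, sensitivity) when score >= threshold predicts 1."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)               # actual events
    negatives = len(labels) - positives   # actual non-events
    return fp / negatives, tp / positives

# Toy test set (made up): true classes and predicted probabilities.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.75, 0.5, 0.25):
    one_minus_specificity, sensitivity = roc_point(labels, scores, threshold)
    print(threshold, one_minus_specificity, sensitivity)
```

Lowering the cutoff raises sensitivity and the false positive rate together, the same trade-off the table shows from top to bottom.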
Resources: http://r-statistics.co/Information-Value-With-R.html