Modeling on Prosper Loan data - Logistic Regression

================================================================================

Data set introduction

The data set contains individual loan records published by the P2P lending company Prosper in 2014. It has 81 variables and 113,937 loans. The variables cover each borrower’s income characteristics and delinquency history, as well as each loan’s details (e.g. amount, interest rate, term).

## [1] 113937     81

I manually picked ten variables, based on my judgement.

Check for correlated variables

The strongest correlation among the selected variables is between credit score and available bank card credit: 0.45, which is only moderate. I therefore kept all the independent variables.

## [1] "the correlation between CreditScoreRange and AvailableBankcardCredit"
## [1] 0.4532574
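
A minimal sketch of how such a check could be run. It assumes the numeric predictors sit in a data frame, here called `loans_num` (a hypothetical name) and filled with simulated values rather than the Prosper data. It computes all pairwise correlations and reports the strongest off-diagonal pair.

```r
# Simulated stand-in for the numeric predictors (not the Prosper data).
set.seed(1)
n <- 500
credit_score <- rnorm(n, mean = 700, sd = 60)
loans_num <- data.frame(
  CreditScoreRangeAvg     = credit_score,
  AvailableBankcardCredit = 50 * (credit_score - 700) + rnorm(n, sd = 5000),
  InquiriesLast6Months    = rpois(n, lambda = 1)
)

# Pairwise Pearson correlations, using only complete rows.
cor_mat <- cor(loans_num, use = "complete.obs")

# Blank out the diagonal, then locate the strongest absolute correlation.
diag(cor_mat) <- NA
idx <- which(abs(cor_mat) == max(abs(cor_mat), na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest_pair  <- c(rownames(cor_mat)[idx["row"]], colnames(cor_mat)[idx["col"]])
strongest_value <- cor_mat[idx["row"], idx["col"]]

print(strongest_pair)
print(round(strongest_value, 3))
```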

Check Class bias

Ideally, events and non-events would each make up about half of the data. Default data, like fraud data, is naturally imbalanced, because only a small share of borrowers default or commit fraud. This data set is clearly class-biased: only about 15% of the loans are coded 1. One solution is to build the model on a sample that contains the two classes in equal proportions.

## 
##     0     1 
## 96906 17026
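
The size of the imbalance can be read off the counts directly. The short sketch below copies the two counts from the table and computes the class shares, together with the accuracy that a trivial "always predict the majority class" rule would reach. Reading 0 as "no default" and 1 as "default" is an assumption based on the surrounding text.

```r
# Class counts copied from the table above (assumed: 0 = no default, 1 = default).
counts <- c("0" = 96906, "1" = 17026)

# Share of each class in the data.
shares <- counts / sum(counts)
print(round(shares, 4))

# Accuracy of a trivial rule that always predicts the majority class.
majority_baseline <- max(shares)
print(round(majority_baseline, 4))
```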

Create Training and Test Samples

I built the training sample with equal class proportions: it contains the same number of “0” and “1” cases. The remaining loans form the testing sample.
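
One way such a balanced split could be drawn is sketched below on simulated data. The data frame `loans`, the predictor `x` and the 70% sampling fraction are assumptions made for illustration; only the outcome name `LoanStatus_B` comes from the model output in this report.

```r
# Simulated stand-in for the loan data (not the Prosper data).
set.seed(42)
loans <- data.frame(
  LoanStatus_B = rbinom(2000, size = 1, prob = 0.15),
  x            = rnorm(2000)
)

ones  <- loans[loans$LoanStatus_B == 1, ]
zeros <- loans[loans$LoanStatus_B == 0, ]

# Draw the same number of rows from each class for training.
n_train   <- floor(0.7 * nrow(ones))   # 70% of the minority class (assumed)
ones_idx  <- sample(seq_len(nrow(ones)),  n_train)
zeros_idx <- sample(seq_len(nrow(zeros)), n_train)

train <- rbind(ones[ones_idx, ], zeros[zeros_idx, ])

# Everything not used for training becomes the test set,
# which therefore keeps an imbalanced class mix.
test <- rbind(ones[-ones_idx, ], zeros[-zeros_idx, ])

print(table(train$LoanStatus_B))
print(table(test$LoanStatus_B))
```

With this design the training set is balanced by construction, while the test set contains an even larger share of zeros than the full data.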

Check missing values

Although some modeling functions can impute missing values with the mean, median or mode by default, doing so increases the time needed to build the model. To avoid that wait, I imputed the missing values beforehand: numeric variables with the median (most variables are highly skewed, so the mean would be distorted by outliers and fat tails) and factor variables with the mode (the most frequent value).

## 
##  FALSE   TRUE 
## 254045   8151

Impute missing values
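
The median/mode imputation described above could look roughly like the following base-R sketch. The helper names `stat_mode` and `impute_simple` and the toy data are hypothetical; the original code is not shown in this report.

```r
# Most frequent value of a vector, ignoring missing entries.
stat_mode <- function(x) {
  tab <- table(x)            # table() drops NA by default
  names(tab)[which.max(tab)]
}

# Replace NA by the median (numeric columns) or the mode (factor columns).
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else if (is.factor(df[[col]])) {
      df[[col]][missing] <- stat_mode(df[[col]])
    }
  }
  df
}

# Toy example (values are made up).
toy <- data.frame(
  income   = c(3000, NA, 4500, 100000),
  category = factor(c("Auto", "Other", NA, "Other"))
)
print(table(is.na(toy)))     # count missing vs. non-missing cells

imputed <- impute_simple(toy)
print(imputed)
```

In the toy data the single extreme income would pull the mean far above the typical value, which is the reason given above for preferring the median.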

Build Logit Models and Predict

## 
## Call:
## glm(formula = LoanStatus_B ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4501  -0.9048   0.0000   0.9717   4.8943  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                        3.710e+00  4.237e-01   8.755  < 2e-16 ***
## StatedMonthlyIncome               -5.938e-05  4.570e-06 -12.992  < 2e-16 ***
## EmploymentStatusDuration          -9.242e-04  1.741e-04  -5.308 1.11e-07 ***
## DebtToIncomeRatio                  1.487e-01  3.637e-02   4.087 4.36e-05 ***
## ListingCategory Motorcycle        -4.734e-01  5.765e-01  -0.821 0.411536
## ListingCategory RV                -1.061e+01  7.942e+01  -0.134 0.893734
## ListingCategory Taxes              4.546e-02  4.148e-01   0.110 0.912724
## ListingCategory Vacation          -6.821e-02  4.122e-01  -0.165 0.868561
## ListingCategory Wedding Loans     -3.231e-02  4.147e-01  -0.078 0.937898
## ListingCategoryAuto                4.956e-01  3.756e-01   1.319 0.187085
## ListingCategoryBoat                2.160e-01  7.995e-01   0.270 0.787007
## ListingCategoryBusiness            1.034e+00  3.678e-01   2.810 0.004948 **
## ListingCategoryCosmetic Procedure  9.525e-01  7.492e-01   1.271 0.203564
## ListingCategoryDebt Consolidation  8.276e-02  3.647e-01   0.227 0.820498
## ListingCategoryEngagement Ring    -1.153e+00  7.273e-01  -1.586 0.112799
## ListingCategoryGreen Loans         3.388e-01  6.853e-01   0.494 0.621032
## ListingCategoryHome Improvement    5.061e-01  3.685e-01   1.374 0.169591
## ListingCategoryHousehold Expenses  5.147e-01  3.801e-01   1.354 0.175683
## ListingCategoryLarge Purchases    -3.189e-01  4.254e-01  -0.750 0.453496
## ListingCategoryMedical/Dental      1.484e-01  3.873e-01   0.383 0.701623
## ListingCategoryNot Available       1.292e+00  3.661e-01   3.531 0.000414 ***
## ListingCategoryOther               7.005e-01  3.669e-01   1.910 0.056190 .
## ListingCategoryPersonal Loan       1.334e+00  3.742e-01   3.565 0.000364 ***
## ListingCategoryStudent Use         1.015e+00  3.940e-01   2.575 0.010029 *
## IncomeVerifiableTrue              -5.456e-01  5.427e-02 -10.053  < 2e-16 ***
## CreditScoreRangeAvg               -5.561e-03  2.968e-04 -18.736  < 2e-16 ***
## InquiriesLast6Months               1.843e-01  7.635e-03  24.135  < 2e-16 ***
## PublicRecordsLast10Years          -9.861e-03  1.951e-02  -0.506 0.613165
## CurrentDelinquencies               8.891e-02  9.435e-03   9.424  < 2e-16 ***
## AvailableBankcardCredit           -4.739e-06  1.054e-06  -4.496 6.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33044  on 23835  degrees of freedom
## Residual deviance: 26816  on 23806  degrees of freedom
## AIC: 26876
## 
## Number of Fisher Scoring iterations: 10
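
The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio: the factor by which the odds of the outcome change for a one-unit increase in the predictor, holding the others fixed. The sketch below applies this to a few estimates copied from the summary; treating `LoanStatus_B = 1` as "default" is an assumption based on the surrounding text.

```r
# Estimates copied from the summary above (log-odds scale).
coefs <- c(
  InquiriesLast6Months = 1.843e-01,
  CurrentDelinquencies = 8.891e-02,
  IncomeVerifiableTrue = -5.456e-01,
  CreditScoreRangeAvg  = -5.561e-03
)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# A one-point change in credit score is tiny; a 50-point change is easier to read.
or_50_points <- exp(50 * coefs[["CreditScoreRangeAvg"]])
print(round(or_50_points, 3))
```

Values above 1 indicate higher odds of the outcome coded 1, values below 1 indicate lower odds. With a fitted model object, the same numbers could be obtained with `exp(coef(fit))`, where `fit` is a hypothetical name for the `glm` result.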

In-sample validation using the confusion matrix

I calculated the predicted probabilities for the training set and converted them to 0/1 using a cutoff of 0.5.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10481  5949
##          1  1437  5969
##                                          
##                Accuracy : 0.6901         
##                  95% CI : (0.6842, 0.696)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3803         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8794         
##             Specificity : 0.5008         
##          Pos Pred Value : 0.6379         
##          Neg Pred Value : 0.8060         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4397         
##    Detection Prevalence : 0.6893         
##       Balanced Accuracy : 0.6901         
##                                          
##        'Positive' Class : 0              
## 
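
The headline statistics can be recomputed by hand from the four cell counts, which makes it clearer what each one measures. The sketch below copies the counts from the table above. Note that the report lists class 0 as the 'positive' class, so sensitivity here is the share of actual 0 cases that were predicted as 0.

```r
# Counts copied from the in-sample confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(10481, 1437, 5949, 5969), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
# Class 0 is the 'positive' class in the report above.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```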

Out-of-sample validation using the confusion matrix

I calculated the predicted probabilities for the testing set and converted them to 0/1 using a cutoff of 0.5.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 74460  2557
##          1 10528  2551
##                                           
##                Accuracy : 0.8548          
##                  95% CI : (0.8524, 0.8571)
##     No Information Rate : 0.9433          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2167          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8761          
##             Specificity : 0.4994          
##          Pos Pred Value : 0.9668          
##          Neg Pred Value : 0.1950          
##              Prevalence : 0.9433          
##          Detection Rate : 0.8265          
##    Detection Prevalence : 0.8548          
##       Balanced Accuracy : 0.6878          
##                                           
##        'Positive' Class : 0               
## 
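
The in-sample and out-of-sample reports show nearly identical sensitivity and specificity but very different accuracy. One way to see why, offered as an illustration rather than as part of the original analysis: accuracy is the prevalence-weighted average of sensitivity and specificity, and class 0 makes up 50% of the training set but 94.33% of the testing set. The helper name `accuracy_from` is hypothetical.

```r
# Accuracy as a prevalence-weighted average of sensitivity and specificity.
accuracy_from <- function(sensitivity, specificity, prevalence) {
  sensitivity * prevalence + specificity * (1 - prevalence)
}

# Rounded values copied from the two confusion-matrix reports above.
acc_train <- accuracy_from(0.8794, 0.5008, prevalence = 0.5000)
acc_test  <- accuracy_from(0.8761, 0.4994, prevalence = 0.9433)

print(round(c(train = acc_train, test = acc_test), 4))
```

Because the inputs are rounded, the results match the reported accuracies only approximately. The higher test accuracy therefore reflects the class mix rather than a better fit, and it remains below the no-information rate of 0.9433 reported above.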

ROC Curve and AUC of training set

ROC Curve and AUC of testing set
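
The plots themselves are not reproduced in this text. As a rough sketch of what goes into them, the base-R code below computes ROC points and the AUC from labels and predicted probabilities. It runs on simulated scores, not on the model's output, and the helper names `auc_rank` and `roc_points` are hypothetical; the plotting package used in the original analysis is not known.

```r
# Simulated labels and predicted probabilities (not the model's output).
set.seed(7)
y    <- rbinom(1000, size = 1, prob = 0.3)
prob <- plogis(rnorm(1000, mean = ifelse(y == 1, 1, -1)))

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case scores higher than a randomly chosen negative one.
auc_rank <- function(y, score) {
  r     <- rank(score)            # average ranks handle ties
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# ROC points: true and false positive rates over all observed cutoffs.
roc_points <- function(y, score) {
  cutoffs <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(cutoffs, function(k) mean(score[y == 1] >= k))
  fpr <- sapply(cutoffs, function(k) mean(score[y == 0] >= k))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

roc <- roc_points(y, prob)
print(round(auc_rank(y, prob), 3))
# plot(roc$fpr, roc$tpr, type = "l"); abline(0, 1, lty = 2)  # draws the curve
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.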

Pseudo R^2

##            llh        llhNull             G2       McFadden           r2ML 
## -13407.8250158 -16521.8561958   6228.0623601      0.1884795      0.2299409 
##           r2CU 
##      0.3065879
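
These statistics are all functions of the two log-likelihoods, so they can be recomputed by hand as a check. The sketch below uses the standard formulas for the likelihood-ratio statistic, McFadden's measure, the maximum-likelihood (Cox-Snell) measure and the Cragg-Uhler (Nagelkerke) measure. The sample size `n` is not printed in the output; it is inferred here from the null-deviance degrees of freedom in the model summary (23835 + 1).

```r
# Log-likelihoods copied from the output above.
llh      <- -13407.8250158   # fitted model
llh_null <- -16521.8561958   # intercept-only model
n        <- 23836            # training rows: null-deviance df (23835) + 1

g2       <- -2 * (llh_null - llh)                 # likelihood-ratio statistic
mcfadden <- 1 - llh / llh_null
r2_ml    <- 1 - exp(-g2 / n)                      # maximum-likelihood (Cox-Snell)
r2_cu    <- r2_ml / (1 - exp(2 * llh_null / n))   # Cragg-Uhler (Nagelkerke)

print(round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4))
```

The same relationship links this output to the model summary: `-2 * llh` is the residual deviance and `-2 * llh_null` the null deviance.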

Resources: http://r-statistics.co/Information-Value-With-R.html https://rpubs.com/jpmurillo/153750