Introduction

A regression model with output between 0 and 1.

probability(default=1/x1,x2,…xm) = 1/(1+ e^ -(b0 + b1x1 + …+ bmxm))

                               = e ^(b0 + b1*x1 + ...+ bm*xm)
                                 ------------------------------
                                 1+ e^ (b0 + b1*x1 + ...+ bm*xm)

here

 x1,x2,x3...xm are independant variables

 b0,b1,b2...bm are parameters to be estimated

 b0 + b1*x1 + ...+ bm*xm is linear predicator

For non-default: probability(non-default=0/x1,x2,…xm)

                                          e ^(b0 + b1*x1 + ...+ bm*xm)
                              =  1 -    ------------------------------------
                                        1+ e^ (b0 + b1*x1 + ...+ bm*xm) bm*xm))

Odd ratio

           probability(default=1/x1,x2,...xm)
        =  -------------------------------------  =  e ^ (b0 + b1*x1 + ...+ bm*xm)
          probability(non-default=0/x1,x2,...xm)

Interpertation :

             if variable x1 goes up by 1 --> The odds are multiplied by e^(b1)
             
             if b1 < 0 --> e^(b1) < 1    --> The odds decrease as x1 increases
             
             if b1 > 0 --> e^(b1) > 1    --> The odds increase as x1 increases

We have different link functions : logit,Probit and cloglog functions.

Loan data

head(loan_data,5)

##   loan_status loan_amnt grade home_ownership annual_inc age emp_cat
## 1           0      5000     B           RENT      24000  33    0-15
## 2           0      2400     C           RENT      12252  31   15-30
## 3           0     10000     C           RENT      49200  24    0-15
## 4           0      5000     A           RENT      36000  39    0-15
## 5           0      3000     E           RENT      48000  24    0-15
##    ir_cat
## 1    8-11
## 2 Missing
## 3 11-13.5
## 4 Missing
## 5 Missing

str(loan_data)

## 'data.frame':    29091 obs. of  8 variables:
##  $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ annual_inc    : num  24000 12252 49200 36000 48000 ...
##  $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
##  $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...

Divide the data into train and test

Dimensions of training_set and test_set

dim(training_set)

## [1] 21818     8

dim(test_set)

## [1] 7273    8

table(loan_data$ir_cat)

## 
##     0-8 11-13.5   13.5+    8-11 Missing 
##    7130    6953    6002    6230    2776

Buliding Logistic model with one variable

log_model_cat <- glm(loan_status ~ ir_cat, family="binomial",data=training_set)
summary(log_model_cat)

## 
## Call:
## glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6290  -0.5408  -0.4427  -0.3287   2.4271  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.89136    0.06118 -47.262  < 2e-16 ***
## ir_cat11-13.5  1.04292    0.07336  14.216  < 2e-16 ***
## ir_cat13.5+    1.37168    0.07240  18.946  < 2e-16 ***
## ir_cat8-11     0.61769    0.07923   7.796 6.40e-15 ***
## ir_catMissing  0.75880    0.09395   8.077 6.66e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15353  on 21817  degrees of freedom
## Residual deviance: 14896  on 21813  degrees of freedom
## AIC: 14906
## 
## Number of Fisher Scoring iterations: 5

Buliding Logistic model with mutiple variable

log_model_multi <- glm(loan_status ~ age + ir_cat + grade + loan_amnt +
                      annual_inc , family = "binomial", data = training_set)
summary(log_model_multi)

## 
## Call:
## glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt + 
##     annual_inc, family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1080  -0.5458  -0.4402  -0.3324   3.5461  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.268e+00  1.209e-01 -18.757  < 2e-16 ***
## age           -7.900e-03  3.659e-03  -2.159 0.030832 *  
## ir_cat11-13.5  6.071e-01  1.244e-01   4.880 1.06e-06 ***
## ir_cat13.5+    5.807e-01  1.382e-01   4.202 2.65e-05 ***
## ir_cat8-11     3.849e-01  1.109e-01   3.470 0.000521 ***
## ir_catMissing  3.851e-01  1.217e-01   3.163 0.001560 ** 
## gradeB         3.013e-01  9.971e-02   3.022 0.002509 ** 
## gradeC         5.739e-01  1.139e-01   5.037 4.74e-07 ***
## gradeD         8.481e-01  1.294e-01   6.555 5.58e-11 ***
## gradeE         1.010e+00  1.564e-01   6.457 1.07e-10 ***
## gradeF         1.444e+00  2.211e-01   6.530 6.59e-11 ***
## gradeG         1.910e+00  3.751e-01   5.091 3.55e-07 ***
## loan_amnt     -9.934e-08  3.940e-06  -0.025 0.979884    
## annual_inc    -6.610e-06  7.290e-07  -9.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15353  on 21817  degrees of freedom
## Residual deviance: 14690  on 21804  degrees of freedom
## AIC: 14718
## 
## Number of Fisher Scoring iterations: 5

Buliding another model with age and ir_cat

log_model_small <- glm(formula = loan_status ~ age + ir_cat, family = "binomial", 
    data = training_set) 
summary(log_model_small)

## 
## Call:
## glm(formula = loan_status ~ age + ir_cat, family = "binomial", 
##     data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6559  -0.5487  -0.4545  -0.3334   2.5423  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.55681    0.11739 -21.781  < 2e-16 ***
## age           -0.01220    0.00368  -3.316 0.000913 ***
## ir_cat11-13.5  1.04458    0.07338  14.236  < 2e-16 ***
## ir_cat13.5+    1.37384    0.07242  18.971  < 2e-16 ***
## ir_cat8-11     0.61909    0.07925   7.812 5.63e-15 ***
## ir_catMissing  0.76189    0.09398   8.107 5.17e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15353  on 21817  degrees of freedom
## Residual deviance: 14884  on 21812  degrees of freedom
## AIC: 14896
## 
## Number of Fisher Scoring iterations: 5

Predicting the probability of default and look at range of the objects “predictions_all_small”

predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")
range(predictions_all_small)

## [1] 0.0342983 0.1916655

Change the code below to construct a logistic regression model using all available predictors in the data set and predict it and look at range

log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
predictions_all_full <- predict(log_model_full,test_set,type="response")
range(predictions_all_full)

## [1] 1.497067e-06 5.239165e-01

Evaluting the model

The probabilities lies in between zero and one.So need to assign a zero or one based on cut-off value

Classification accuracy=(TP+TN)/(TP+FP+TN+FN)

Sensitivity=TP(TP+FN)

Specificity=TN/(TN+FP)

ROC curve : x = 1 -specificity and y= sensitive is used to compare different models

AUC : used to tell how much area under curve will comparing different models.

Make a binary predictions-vector using a cut-off of 15% and confusion matrix

table(test_set$loan_status,pred_cutoff_15)

##    pred_cutoff_15
##        0    1
##   0 5058 1444
##   1  477  294

Need to find the optimal cut-off value

Let us suppose that acceptance rate is 80%

cutoff_logit <- quantile(predictions_all_full,0.8)
bin_pred_logit <- ifelse(predictions_all_full > cutoff_logit,1,0)

#Obtain the actual default status for the accepted loans
accepted_status_logit <- test_set$loan_status[bin_pred_logit == 0]

#Bad rate
sum(accepted_status_logit)/length(accepted_status_logit)

## [1] 0.08817463

function of different acceptance rate which is multiples of 5%

Apply the strategy function to predictions_all_full

strategy_logit <- strategy_bank(predictions_all_full)
print(strategy_logit$table)

##       accept_rate cutoff bad_rate
##  [1,]        1.00 0.5239   0.1060
##  [2,]        0.95 0.2105   0.0989
##  [3,]        0.90 0.1852   0.0943
##  [4,]        0.85 0.1710   0.0911
##  [5,]        0.80 0.1586   0.0882
##  [6,]        0.75 0.1475   0.0856
##  [7,]        0.70 0.1371   0.0807
##  [8,]        0.65 0.1282   0.0772
##  [9,]        0.60 0.1198   0.0729
## [10,]        0.55 0.1131   0.0703
## [11,]        0.50 0.1063   0.0646
## [12,]        0.45 0.0996   0.0617
## [13,]        0.40 0.0918   0.0602
## [14,]        0.35 0.0834   0.0554
## [15,]        0.30 0.0722   0.0564
## [16,]        0.25 0.0621   0.0522
## [17,]        0.20 0.0566   0.0474
## [18,]        0.15 0.0523   0.0412
## [19,]        0.10 0.0476   0.0385
## [20,]        0.05 0.0407   0.0385
## [21,]        0.00 0.0000   0.0000

Logistic Regression

Mahesh

May 11, 2017