A regression model with output between 0 and 1.
probability(default=1/x1,x2,…xm) = 1/(1+ e^ -(b0 + b1x1 + …+ bmxm))
= e ^(b0 + b1*x1 + ...+ bm*xm)
------------------------------
1+ e^ (b0 + b1*x1 + ...+ bm*xm)
here
x1,x2,x3...xm are independant variables
b0,b1,b2...bm are parameters to be estimated
b0 + b1*x1 + ...+ bm*xm is linear predicator
For non-default: probability(non-default=0/x1,x2,…xm)
e ^(b0 + b1*x1 + ...+ bm*xm)
= 1 - ------------------------------------
1+ e^ (b0 + b1*x1 + ...+ bm*xm) bm*xm))
Odd ratio
probability(default=1/x1,x2,...xm)
= ------------------------------------- = e ^ (b0 + b1*x1 + ...+ bm*xm)
probability(non-default=0/x1,x2,...xm)
Interpertation :
if variable x1 goes up by 1 --> The odds are multiplied by e^(b1)
if b1 < 0 --> e^(b1) < 1 --> The odds decrease as x1 increases
if b1 > 0 --> e^(b1) > 1 --> The odds increase as x1 increases
We have different link functions : logit,Probit and cloglog functions.
head(loan_data,5)
## loan_status loan_amnt grade home_ownership annual_inc age emp_cat
## 1 0 5000 B RENT 24000 33 0-15
## 2 0 2400 C RENT 12252 31 15-30
## 3 0 10000 C RENT 49200 24 0-15
## 4 0 5000 A RENT 36000 39 0-15
## 5 0 3000 E RENT 48000 24 0-15
## ir_cat
## 1 8-11
## 2 Missing
## 3 11-13.5
## 4 Missing
## 5 Missing
str(loan_data)
## 'data.frame': 29091 obs. of 8 variables:
## $ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
## $ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
## $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
## $ annual_inc : num 24000 12252 49200 36000 48000 ...
## $ age : int 33 31 24 39 24 28 22 22 28 22 ...
## $ emp_cat : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...
## $ ir_cat : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...
dim(training_set)
## [1] 21818 8
dim(test_set)
## [1] 7273 8
table(loan_data$ir_cat)
##
## 0-8 11-13.5 13.5+ 8-11 Missing
## 7130 6953 6002 6230 2776
log_model_cat <- glm(loan_status ~ ir_cat, family="binomial",data=training_set)
summary(log_model_cat)
##
## Call:
## glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6290 -0.5408 -0.4427 -0.3287 2.4271
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.89136 0.06118 -47.262 < 2e-16 ***
## ir_cat11-13.5 1.04292 0.07336 14.216 < 2e-16 ***
## ir_cat13.5+ 1.37168 0.07240 18.946 < 2e-16 ***
## ir_cat8-11 0.61769 0.07923 7.796 6.40e-15 ***
## ir_catMissing 0.75880 0.09395 8.077 6.66e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15353 on 21817 degrees of freedom
## Residual deviance: 14896 on 21813 degrees of freedom
## AIC: 14906
##
## Number of Fisher Scoring iterations: 5
log_model_multi <- glm(loan_status ~ age + ir_cat + grade + loan_amnt +
annual_inc , family = "binomial", data = training_set)
summary(log_model_multi)
##
## Call:
## glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt +
## annual_inc, family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1080 -0.5458 -0.4402 -0.3324 3.5461
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.268e+00 1.209e-01 -18.757 < 2e-16 ***
## age -7.900e-03 3.659e-03 -2.159 0.030832 *
## ir_cat11-13.5 6.071e-01 1.244e-01 4.880 1.06e-06 ***
## ir_cat13.5+ 5.807e-01 1.382e-01 4.202 2.65e-05 ***
## ir_cat8-11 3.849e-01 1.109e-01 3.470 0.000521 ***
## ir_catMissing 3.851e-01 1.217e-01 3.163 0.001560 **
## gradeB 3.013e-01 9.971e-02 3.022 0.002509 **
## gradeC 5.739e-01 1.139e-01 5.037 4.74e-07 ***
## gradeD 8.481e-01 1.294e-01 6.555 5.58e-11 ***
## gradeE 1.010e+00 1.564e-01 6.457 1.07e-10 ***
## gradeF 1.444e+00 2.211e-01 6.530 6.59e-11 ***
## gradeG 1.910e+00 3.751e-01 5.091 3.55e-07 ***
## loan_amnt -9.934e-08 3.940e-06 -0.025 0.979884
## annual_inc -6.610e-06 7.290e-07 -9.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15353 on 21817 degrees of freedom
## Residual deviance: 14690 on 21804 degrees of freedom
## AIC: 14718
##
## Number of Fisher Scoring iterations: 5
log_model_small <- glm(formula = loan_status ~ age + ir_cat, family = "binomial",
data = training_set)
summary(log_model_small)
##
## Call:
## glm(formula = loan_status ~ age + ir_cat, family = "binomial",
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6559 -0.5487 -0.4545 -0.3334 2.5423
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.55681 0.11739 -21.781 < 2e-16 ***
## age -0.01220 0.00368 -3.316 0.000913 ***
## ir_cat11-13.5 1.04458 0.07338 14.236 < 2e-16 ***
## ir_cat13.5+ 1.37384 0.07242 18.971 < 2e-16 ***
## ir_cat8-11 0.61909 0.07925 7.812 5.63e-15 ***
## ir_catMissing 0.76189 0.09398 8.107 5.17e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15353 on 21817 degrees of freedom
## Residual deviance: 14884 on 21812 degrees of freedom
## AIC: 14896
##
## Number of Fisher Scoring iterations: 5
predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")
range(predictions_all_small)
## [1] 0.0342983 0.1916655
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
predictions_all_full <- predict(log_model_full,test_set,type="response")
range(predictions_all_full)
## [1] 1.497067e-06 5.239165e-01
The probabilities lies in between zero and one.So need to assign a zero or one based on cut-off value
Classification accuracy=(TP+TN)/(TP+FP+TN+FN)
Sensitivity=TP(TP+FN)
Specificity=TN/(TN+FP)
ROC curve : x = 1 -specificity and y= sensitive is used to compare different models
AUC : used to tell how much area under curve will comparing different models.
table(test_set$loan_status,pred_cutoff_15)
## pred_cutoff_15
## 0 1
## 0 5058 1444
## 1 477 294
Let us suppose that acceptance rate is 80%
cutoff_logit <- quantile(predictions_all_full,0.8)
bin_pred_logit <- ifelse(predictions_all_full > cutoff_logit,1,0)
#Obtain the actual default status for the accepted loans
accepted_status_logit <- test_set$loan_status[bin_pred_logit == 0]
#Bad rate
sum(accepted_status_logit)/length(accepted_status_logit)
## [1] 0.08817463
strategy_logit <- strategy_bank(predictions_all_full)
print(strategy_logit$table)
## accept_rate cutoff bad_rate
## [1,] 1.00 0.5239 0.1060
## [2,] 0.95 0.2105 0.0989
## [3,] 0.90 0.1852 0.0943
## [4,] 0.85 0.1710 0.0911
## [5,] 0.80 0.1586 0.0882
## [6,] 0.75 0.1475 0.0856
## [7,] 0.70 0.1371 0.0807
## [8,] 0.65 0.1282 0.0772
## [9,] 0.60 0.1198 0.0729
## [10,] 0.55 0.1131 0.0703
## [11,] 0.50 0.1063 0.0646
## [12,] 0.45 0.0996 0.0617
## [13,] 0.40 0.0918 0.0602
## [14,] 0.35 0.0834 0.0554
## [15,] 0.30 0.0722 0.0564
## [16,] 0.25 0.0621 0.0522
## [17,] 0.20 0.0566 0.0474
## [18,] 0.15 0.0523 0.0412
## [19,] 0.10 0.0476 0.0385
## [20,] 0.05 0.0407 0.0385
## [21,] 0.00 0.0000 0.0000