Predict Bankruptcy using Logistic regression and variable selection methods

0 Loading libraries

library(dplyr)
library(ggplot2)
library(fields)
library(verification)

1. Introduction

The bankruptcy data we are dealing with here has 5436 observations and 13 variables. The variable of interest is DLRSN which is coded as 0/1 where 1 means bankruptcy and 0 non-bankruptcy. There are 10 variables which will help us predict if there is a chance of bankruptcy or not. In this project we are using logistic regression along with different variable selection methods and cross validation to have a good out of sample accuracy.

2. Data Dictionary

Table 1: Data Dictionary

Variable	Description	Type
FYEAR	Fiscal Year	Numerical
R1	Working Capital/Total Asset	Numerical
R2	Retained Earnings/Total Asset	Numerical
R3	Earnings Before Interest & Tax/Total Asset	Numerical
R4	Market Capital / Total Liability	Numerical
R5	SALE/Total Asset	Numerical
R6	Total Liability/Total Asset	Numerical
R7	Current Asset/Current Liability	Numerical
R8	Net Income/Total Asset	Numerical
R9	Log(SALE)	Numerical
R10	Log(Market Cap)	Numerical
DLRSN	Coded as 1 = Bankruptcy, 0=Non-Bankruptcy	Binary

3. Summary of the data

setwd('C:/Users/anany/Desktop/BANA/STUDY/Semester 2/Data Mining 1/Hw/3')

bankruptcy_data <- read.csv('bankruptcy.csv')
glimpse(bankruptcy_data)

## Observations: 5,436
## Variables: 13
## $ FYEAR <int> 1999, 1999, 1999, 1994, 1999, 1999, 1999, 1999, 1987, 19...
## $ DLRSN <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CUSIP <fctr> 00036020, 00036110, 00037520, 00078110, 00079X10, 00086...
## $ R1    <dbl> 0.3071395, 0.7607365, -0.5135961, -0.4661293, 2.0234223,...
## $ R2    <dbl> 0.8870057, 0.5924934, 0.3376148, 0.3707467, 0.2148764, 0...
## $ R3    <dbl> 1.64768076, 0.45300275, 0.29901455, 0.49606665, 0.182594...
## $ R4    <dbl> -0.19915760, -0.36989014, -0.02907996, -0.37342897, 6.69...
## $ R5    <dbl> 1.092963959, 0.186153945, -0.432604675, -0.267423988, -1...
## $ R6    <dbl> -0.31328867, 0.03961907, 0.82999324, 0.97779888, -1.5058...
## $ R7    <dbl> -0.196793318, 0.327497319, -0.707786127, -0.610975074, 2...
## $ R8    <dbl> 1.20676281, 0.42841806, 0.47615328, 0.45680985, 0.287374...
## $ R9    <dbl> 0.28247091, 1.10696521, 2.17917545, 0.15195105, -0.98644...
## $ R10   <dbl> 0.158896251, 0.793442755, 2.484584500, 0.047788514, 0.79...

bankruptcy_data$DLRSN <- as.factor(bankruptcy_data$DLRSN)
summary(bankruptcy_data)

##      FYEAR      DLRSN         CUSIP            R1         
##  Min.   :1980   0:4660   00036020:   1   Min.   :-4.3828  
##  1st Qu.:1999   1: 776   00036110:   1   1st Qu.:-0.7501  
##  Median :1999            00037520:   1   Median :-0.2220  
##  Mean   :1998            00078110:   1   Mean   :-0.2352  
##  3rd Qu.:1999            00079X10:   1   3rd Qu.: 0.4821  
##  Max.   :1999            00086T10:   1   Max.   : 2.0234  
##                          (Other) :5430                    
##        R2                R3                 R4                 R5         
##  Min.   :-2.2418   Min.   :-2.06423   Min.   :-0.42712   Min.   :-1.3639  
##  1st Qu.:-1.0805   1st Qu.:-1.07887   1st Qu.:-0.38752   1st Qu.:-0.8748  
##  Median : 0.1337   Median : 0.06858   Median :-0.30754   Median :-0.3508  
##  Mean   :-0.2915   Mean   :-0.24411   Mean   : 0.23903   Mean   :-0.1338  
##  3rd Qu.: 0.5103   3rd Qu.: 0.50070   3rd Qu.: 0.02746   3rd Qu.: 0.2878  
##  Max.   : 1.4854   Max.   : 2.14246   Max.   : 6.69536   Max.   : 4.0362  
##                                                                           
##        R6                  R7                 R8         
##  Min.   :-1.505889   Min.   :-1.23340   Min.   :-2.2082  
##  1st Qu.:-0.656827   1st Qu.:-0.77996   1st Qu.:-1.0054  
##  Median : 0.003493   Median :-0.43205   Median : 0.2053  
##  Mean   : 0.195707   Mean   :-0.09612   Mean   :-0.2272  
##  3rd Qu.: 0.608376   3rd Qu.: 0.17578   3rd Qu.: 0.5289  
##  Max.   : 5.110424   Max.   : 2.87648   Max.   : 2.0006  
##                                                          
##        R9                R10         
##  Min.   :-2.76356   Min.   :-2.2140  
##  1st Qu.:-0.66606   1st Qu.:-0.6413  
##  Median : 0.06219   Median : 0.1230  
##  Mean   : 0.02538   Mean   : 0.1806  
##  3rd Qu.: 0.81215   3rd Qu.: 0.9878  
##  Max.   : 2.17918   Max.   : 2.4846  
##

dim(bankruptcy_data)

## [1] 5436   13

attach(bankruptcy_data)

barplot(table(DLRSN), xlab = "DLRSN", ylab = "# of observations")

Our variable of interest DLRSN has an imbalanced proportion of 0s and 1s with 0(14.27%) and 1(85.73%)

4. Plots of DLRSN vs the dependent variables

plot1 <- function(A){
  ggplot(bankruptcy_data,aes(x=as.name(A),fill=DLRSN))+geom_histogram(position="fill",binwidth=1) +xlab(A)
}
lapply(names(bankruptcy_data[,-c(1,2,3)]),plot1)

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

From the plots above we noticed the following:

% of DLRSN = 1 decreases as R1 increase thus, R1 looks moderately correlated with DLRSN
Same is observed for R2, R3, R4 and R7 3.??? % of DLRSN = 1 increases as R5 and R6 increases 4.??? For R8 the % of DLRSN = 1 is the maximum at the ends where R8 is on the lower side or the higher side

5. Splitting the data into test and train data

We first split the data into test and train data with 20% of the data and 80% of the data randomly. Thus, the train data that we would be using to make the model has 4348 number of observations. On checking the distribution of 1’s and 0’s in the DLRSN in the train data we see that it has 85% 1’s and rest 0’s which is like the whole data (unbiased sample).

set.seed(12384128)
subset <- sample(nrow(bankruptcy_data), nrow(bankruptcy_data) * 0.8)
bankruptcy.train = bankruptcy_data[subset,-c(1,3) ]
bankruptcy.test = bankruptcy_data[-subset,-c(1,3)]
table(bankruptcy.train$DLRSN)

## 
##    0    1 
## 3725  623

nrow(bankruptcy.train)

## [1] 4348

6. Model Building

1. Full Model Then we created the full model with all the dependent variables - R1, R2, R3, R4, R5, R6, R7, R8, R9 and R10. Let’s name it Model0. Summary of the model is as below:

bankruptcy.glm0 <- glm(DLRSN ~ ., family = binomial, bankruptcy.train)
summary(bankruptcy.glm0)

## 
## Call:
## glm(formula = DLRSN ~ ., family = binomial, data = bankruptcy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3291  -0.4595  -0.2408  -0.1072   3.3704  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.56141    0.07912 -32.375  < 2e-16 ***
## R1           0.16055    0.07439   2.158 0.030914 *  
## R2           0.57219    0.07932   7.214 5.44e-13 ***
## R3          -0.49094    0.10063  -4.879 1.07e-06 ***
## R4          -0.25690    0.13417  -1.915 0.055527 .  
## R5           0.03667    0.05492   0.668 0.504363    
## R6           0.23935    0.05590   4.282 1.86e-05 ***
## R7          -0.43555    0.10535  -4.134 3.56e-05 ***
## R8          -0.29604    0.08414  -3.519 0.000434 ***
## R9           0.26435    0.10642   2.484 0.012991 *  
## R10         -1.43413    0.09826 -14.596  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3573.0  on 4347  degrees of freedom
## Residual deviance: 2423.5  on 4337  degrees of freedom
## AIC: 2445.5
## 
## Number of Fisher Scoring iterations: 7

AIC(bankruptcy.glm0) #2445.495

## [1] 2445.495

BIC(bankruptcy.glm0) #2515.647

## [1] 2515.647

#Misclassification and confusion matrix
cutoff=1/16

prob.glm0.insample <- predict(bankruptcy.glm0, type = "response")
predicted.glm0.insample <- prob.glm0.insample > cutoff
predicted.glm0.insample <- as.numeric(predicted.glm0.insample)

#confusion matrix
table(bankruptcy.train$DLRSN, predicted.glm0.insample, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth    0    1
##     0 2262 1463
##     1   56  567

From the P value all the variable looks significant except R5. Also, AIC of the model is 2445.5 and BIC of the full model is 2515.65. Taking cutoff = 1/16 based on the cost ratio of 15:1 we found the misclassification rate and the confusion matrix. Misclassification Rate = 0.3493

2. Stepwise AIC Next, we performed a stepwise regression using AIC as the criteria. The final model with the lowest AIC was obtained by removing the variable R5

Using both backward and forward stepwise regression we got the model without R5 which resulted in the smallest AIC. The AIC and BIC of the model is 2443.9 and 2507.7 respectively Taking cutoff = 1/16 based on the cost ratio of 15:1 we found the misclassification rate and the confusion matrix. Misclassification Rate = 0.3475

null.model <- glm(DLRSN ~ 1, family = binomial, bankruptcy.train)
summary(null.model)

## 
## Call:
## glm(formula = DLRSN ~ 1, family = binomial, data = bankruptcy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5562  -0.5562  -0.5562  -0.5562   1.9713  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.78828    0.04329  -41.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3573  on 4347  degrees of freedom
## Residual deviance: 3573  on 4347  degrees of freedom
## AIC: 3575
## 
## Number of Fisher Scoring iterations: 4

#stepwise AIC

bankruptcy.glm.step.aic<-step(null.model, scope=list(lower=null.model, upper=bankruptcy.glm0), k = 2, direction="forward")

## Start:  AIC=3575.02
## DLRSN ~ 1
## 
##        Df Deviance    AIC
## + R10   1   2691.2 2695.2
## + R6    1   3216.2 3220.2
## + R8    1   3251.7 3255.7
## + R3    1   3259.2 3263.2
## + R1    1   3308.0 3312.0
## + R4    1   3350.5 3354.5
## + R7    1   3370.1 3374.1
## + R9    1   3382.9 3386.9
## + R2    1   3405.0 3409.0
## + R5    1   3464.9 3468.9
## <none>      3573.0 3575.0
## 
## Step:  AIC=2695.18
## DLRSN ~ R10
## 
##        Df Deviance    AIC
## + R7    1   2610.4 2616.4
## + R6    1   2616.8 2622.8
## + R4    1   2623.4 2629.4
## + R8    1   2637.3 2643.3
## + R9    1   2641.5 2647.5
## + R1    1   2644.6 2650.6
## + R3    1   2652.6 2658.6
## + R5    1   2663.5 2669.5
## <none>      2691.2 2695.2
## + R2    1   2690.2 2696.2
## 
## Step:  AIC=2616.43
## DLRSN ~ R10 + R7
## 
##        Df Deviance    AIC
## + R8    1   2575.3 2583.3
## + R9    1   2577.9 2585.9
## + R3    1   2581.4 2589.4
## + R4    1   2586.3 2594.3
## + R6    1   2588.2 2596.2
## + R5    1   2595.6 2603.6
## + R2    1   2600.9 2608.9
## <none>      2610.4 2616.4
## + R1    1   2610.2 2618.2
## 
## Step:  AIC=2583.26
## DLRSN ~ R10 + R7 + R8
## 
##        Df Deviance    AIC
## + R2    1   2503.5 2513.5
## + R9    1   2511.7 2521.7
## + R4    1   2528.0 2538.0
## + R5    1   2556.7 2566.7
## + R6    1   2565.0 2575.0
## <none>      2575.3 2583.3
## + R1    1   2573.8 2583.8
## + R3    1   2574.5 2584.5
## 
## Step:  AIC=2513.48
## DLRSN ~ R10 + R7 + R8 + R2
## 
##        Df Deviance    AIC
## + R6    1   2475.3 2487.3
## + R9    1   2477.2 2489.2
## + R4    1   2481.0 2493.0
## + R5    1   2490.3 2502.3
## + R3    1   2490.6 2502.6
## <none>      2503.5 2513.5
## + R1    1   2503.4 2515.4
## 
## Step:  AIC=2487.31
## DLRSN ~ R10 + R7 + R8 + R2 + R6
## 
##        Df Deviance    AIC
## + R9    1   2458.1 2472.1
## + R3    1   2461.0 2475.0
## + R4    1   2461.4 2475.4
## + R1    1   2463.0 2477.0
## + R5    1   2469.3 2483.3
## <none>      2475.3 2487.3
## 
## Step:  AIC=2472.13
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9
## 
##        Df Deviance    AIC
## + R3    1   2433.7 2449.7
## + R1    1   2452.5 2468.5
## + R4    1   2453.0 2469.0
## <none>      2458.1 2472.1
## + R5    1   2457.9 2473.9
## 
## Step:  AIC=2449.7
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3
## 
##        Df Deviance    AIC
## + R1    1   2428.2 2446.2
## + R4    1   2428.7 2446.7
## <none>      2433.7 2449.7
## + R5    1   2433.6 2451.6
## 
## Step:  AIC=2446.22
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3 + R1
## 
##        Df Deviance    AIC
## + R4    1   2423.9 2443.9
## <none>      2428.2 2446.2
## + R5    1   2428.2 2448.2
## 
## Step:  AIC=2443.94
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3 + R1 + R4
## 
##        Df Deviance    AIC
## <none>      2423.9 2443.9
## + R5    1   2423.5 2445.5

summary(bankruptcy.glm.step.aic) #-R5

## 
## Call:
## glm(formula = DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3 + R1 + 
##     R4, family = binomial, data = bankruptcy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3385  -0.4595  -0.2406  -0.1076   3.3482  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.56443    0.07893 -32.489  < 2e-16 ***
## R10         -1.46141    0.08943 -16.341  < 2e-16 ***
## R7          -0.43875    0.10512  -4.174 2.99e-05 ***
## R8          -0.29527    0.08411  -3.511 0.000447 ***
## R2           0.56995    0.07926   7.191 6.45e-13 ***
## R6           0.24550    0.05511   4.455 8.40e-06 ***
## R9           0.30008    0.09203   3.261 0.001111 ** 
## R3          -0.49316    0.10067  -4.899 9.63e-07 ***
## R1           0.16234    0.07432   2.184 0.028942 *  
## R4          -0.23959    0.12977  -1.846 0.064851 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3573.0  on 4347  degrees of freedom
## Residual deviance: 2423.9  on 4338  degrees of freedom
## AIC: 2443.9
## 
## Number of Fisher Scoring iterations: 7

bankruptcy.glm.step.aic$anova

##     Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1        NA         NA      4347   3573.020 3575.020
## 2  + R10 -1 881.842602      4346   2691.178 2695.178
## 3   + R7 -1  80.751812      4345   2610.426 2616.426
## 4   + R8 -1  35.162331      4344   2575.263 2583.263
## 5   + R2 -1  71.779729      4343   2503.484 2513.484
## 6   + R6 -1  28.169211      4342   2475.314 2487.314
## 7   + R9 -1  17.185392      4341   2458.129 2472.129
## 8   + R3 -1  24.428733      4340   2433.700 2449.700
## 9   + R1 -1   5.476481      4339   2428.224 2446.224
## 10  + R4 -1   4.284533      4338   2423.939 2443.939

AIC(bankruptcy.glm.step.aic)

## [1] 2443.939

BIC(bankruptcy.glm.step.aic)

## [1] 2507.714

#for stepwise model --AIC
prob.glm1.insample <- predict(bankruptcy.glm.step.aic, type = "response")
predicted.glm1.insample <- as.numeric(prob.glm1.insample > cutoff)


#confusion matrix
table(bankruptcy.train$DLRSN, predicted.glm1.insample, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth    0    1
##     0 2269 1456
##     1   55  568

#misclassification Rate
mean(bankruptcy.train$DLRSN != predicted.glm1.insample)

## [1] 0.3475161

3. Stepwise BIC regression Then we performed a stepwise regression using the BIC criteria.

Using stepwise BIC regression, we got a model without R1, R4 and R5. AIC and BIC of this model are 2449.7 and 2500 .7 respectively Taking cutoff = 1/16 based on the cost ratio of 15:1 we found the misclassification rate and the confusion matrix. Misclassification Rate = 0.3498

bankruptcy.glm.step.bic <- step(null.model, k = log(nrow(bankruptcy.train)),scope=list(lower=null.model, upper=bankruptcy.glm0), direction="forward")

## Start:  AIC=3581.4
## DLRSN ~ 1
## 
##        Df Deviance    AIC
## + R10   1   2691.2 2707.9
## + R6    1   3216.2 3233.0
## + R8    1   3251.7 3268.4
## + R3    1   3259.2 3276.0
## + R1    1   3308.0 3324.7
## + R4    1   3350.5 3367.2
## + R7    1   3370.1 3386.9
## + R9    1   3382.9 3399.7
## + R2    1   3405.0 3421.8
## + R5    1   3464.9 3481.7
## <none>      3573.0 3581.4
## 
## Step:  AIC=2707.93
## DLRSN ~ R10
## 
##        Df Deviance    AIC
## + R7    1   2610.4 2635.6
## + R6    1   2616.8 2641.9
## + R4    1   2623.4 2648.5
## + R8    1   2637.3 2662.4
## + R9    1   2641.5 2666.6
## + R1    1   2644.6 2669.7
## + R3    1   2652.6 2677.7
## + R5    1   2663.5 2688.6
## <none>      2691.2 2707.9
## + R2    1   2690.2 2715.4
## 
## Step:  AIC=2635.56
## DLRSN ~ R10 + R7
## 
##        Df Deviance    AIC
## + R8    1   2575.3 2608.8
## + R9    1   2577.9 2611.4
## + R3    1   2581.4 2614.9
## + R4    1   2586.3 2619.8
## + R6    1   2588.2 2621.8
## + R5    1   2595.6 2629.1
## + R2    1   2600.9 2634.5
## <none>      2610.4 2635.6
## + R1    1   2610.2 2643.7
## 
## Step:  AIC=2608.77
## DLRSN ~ R10 + R7 + R8
## 
##        Df Deviance    AIC
## + R2    1   2503.5 2545.4
## + R9    1   2511.7 2553.6
## + R4    1   2528.0 2569.9
## + R5    1   2556.7 2598.6
## + R6    1   2565.0 2606.9
## <none>      2575.3 2608.8
## + R1    1   2573.8 2615.7
## + R3    1   2574.5 2616.4
## 
## Step:  AIC=2545.37
## DLRSN ~ R10 + R7 + R8 + R2
## 
##        Df Deviance    AIC
## + R6    1   2475.3 2525.6
## + R9    1   2477.2 2527.5
## + R4    1   2481.0 2531.2
## + R5    1   2490.3 2540.5
## + R3    1   2490.6 2540.9
## <none>      2503.5 2545.4
## + R1    1   2503.4 2553.7
## 
## Step:  AIC=2525.58
## DLRSN ~ R10 + R7 + R8 + R2 + R6
## 
##        Df Deviance    AIC
## + R9    1   2458.1 2516.8
## + R3    1   2461.0 2519.7
## + R4    1   2461.4 2520.1
## + R1    1   2463.0 2521.7
## <none>      2475.3 2525.6
## + R5    1   2469.3 2528.0
## 
## Step:  AIC=2516.77
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9
## 
##        Df Deviance    AIC
## + R3    1   2433.7 2500.7
## <none>      2458.1 2516.8
## + R1    1   2452.5 2519.5
## + R4    1   2453.0 2520.0
## + R5    1   2457.9 2525.0
## 
## Step:  AIC=2500.72
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3
## 
##        Df Deviance    AIC
## <none>      2433.7 2500.7
## + R1    1   2428.2 2503.6
## + R4    1   2428.7 2504.1
## + R5    1   2433.6 2509.0

summary(bankruptcy.glm.step.bic) #-R1, -R4,-R5

## 
## Call:
## glm(formula = DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3, family = binomial, 
##     data = bankruptcy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2954  -0.4603  -0.2482  -0.1192   3.1673  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.53520    0.07603 -33.344  < 2e-16 ***
## R10         -1.55630    0.08144 -19.109  < 2e-16 ***
## R7          -0.34996    0.07617  -4.594 4.34e-06 ***
## R8          -0.27542    0.08432  -3.266  0.00109 ** 
## R2           0.57785    0.07837   7.373 1.67e-13 ***
## R6           0.18177    0.04235   4.292 1.77e-05 ***
## R9           0.42434    0.08289   5.119 3.06e-07 ***
## R3          -0.49841    0.10107  -4.932 8.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3573.0  on 4347  degrees of freedom
## Residual deviance: 2433.7  on 4340  degrees of freedom
## AIC: 2449.7
## 
## Number of Fisher Scoring iterations: 6

bankruptcy.glm.step.bic$anova

##    Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1       NA        NA      4347   3573.020 3581.398
## 2 + R10 -1 881.84260      4346   2691.178 2707.932
## 3  + R7 -1  80.75181      4345   2610.426 2635.558
## 4  + R8 -1  35.16233      4344   2575.263 2608.773
## 5  + R2 -1  71.77973      4343   2503.484 2545.371
## 6  + R6 -1  28.16921      4342   2475.314 2525.579
## 7  + R9 -1  17.18539      4341   2458.129 2516.771
## 8  + R3 -1  24.42873      4340   2433.700 2500.720

AIC(bankruptcy.glm.step.bic)

## [1] 2449.7

BIC(bankruptcy.glm.step.bic)

## [1] 2500.72

#for stepwise model - BIC
prob.glm2.insample <- predict(bankruptcy.glm.step.bic, type = "response")
predicted.glm2.insample <- as.numeric(prob.glm2.insample > cutoff)


#confusion matrix
table(bankruptcy.train$DLRSN, predicted.glm2.insample, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth    0    1
##     0 2259 1466
##     1   55  568

#misclassification Rate
mean(bankruptcy.train$DLRSN != predicted.glm2.insample)

## [1] 0.349816

bankruptcy.glm.step.bic

## 
## Call:  glm(formula = DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3, family = binomial, 
##     data = bankruptcy.train)
## 
## Coefficients:
## (Intercept)          R10           R7           R8           R2  
##     -2.5352      -1.5563      -0.3500      -0.2754       0.5779  
##          R6           R9           R3  
##      0.1818       0.4243      -0.4984  
## 
## Degrees of Freedom: 4347 Total (i.e. Null);  4340 Residual
## Null Deviance:       3573 
## Residual Deviance: 2434  AIC: 2450

7. Final Model selection

We plotted the ROC curves for all the 3 models and found out the AIC for the same. We see that there is not a major difference between the AUC values for the 3 models.

 #AUC
#roc_curve$roc.vol #0.8905805

#compare 2 models
roc.plot(x = bankruptcy.train$DLRSN == "1", pred = cbind(prob.glm0.insample, prob.glm1.insample,prob.glm2.insample), 
         legend = TRUE, leg.text = c("Full Model", "Model selected by AIC( -R1)", "Model selected by BIC( -R4,-R5,-R1)"))$roc.vol

##      Model      Area       p.value binorm.area
## 1 Model  1 0.8755280 1.090043e-198          NA
## 2 Model  2 0.8754276 1.388023e-198          NA
## 3 Model  3 0.8737647 7.526347e-197          NA

Table: Comparing the three models

Models| Variables| AIC| BIC| AUC| Misclassification Rate ————- | —————— | —————— ————- | —————— | —————— Full |Model| All |2445.49 |2515.6| 0.876| 0.3493 Model selected by AIC| -R1| 2443.94| 2507.7| 0.875| 0.3475 Model selected by BIC| -R1-R4-R5| 2449.7| 2500.7| 0.874| 0.3498

All the models have almost same AIC, AUC and misclassification rate. As the model with BIC has the least number of variables and comparable AIC, AUC and misclassification rate, I have decided to go ahead with the model without R1, R4 and R5. As these variables R4 and R5 do not add much information to the model I have decided to go ahead with the

8. Searching the optimal cut-off probability

Now, that we have the final model we want to find out the optimal cut-off probability. Based on our initial knowledge we defined out cutoff as 1/16. In this section we are trying to find out the optimal cutoff using an asymmetric cost function and then doing a grid search to find the best cutoff. The assumption behind the asymmetric cost function is that predicting a bankruptcy case as non-bankruptcy is 15 times more costlier than the vice-versa.

## Search for optimal cut-off probability
# define the searc grid from 0.01 to 0.99
searchgrid = seq(0.01, 0.99, 0.01)

# result is a 99x2 matrix, the 1st col stores the cut-off p, the 2nd column
# stores the cost
result = cbind(searchgrid, NA)

# in the cost function, both r and pi are vectors, r=truth, pi=predicted
# probability
cost2 <- function(r, pi) {
  weight1 = 15
  weight0 = 1
  c1 = (r == 1) & (pi < pcut)  #logical vector - true if actual 1 but predict 0
  c0 = (r == 0) & (pi > pcut)  #logical vecotr - true if actual 0 but predict 1
  return(mean(weight1 * c1 + weight0 * c0))
}

for (i in 1:length(searchgrid)) {
  pcut <- result[i, 1]
  # assign the cost to the 2nd col
  result[i, 2] <- cost2(bankruptcy.train$DLRSN, predict(bankruptcy.glm.step.bic, type = "response"))
}
plot(result, ylab = "Cost in Training Set")

#find the optimal cut-off prob.
searchgrid[which(result[,2]==min(result[,2]))]

## [1] 0.06

#0.06
#1.04 is the cost

The minimum cost was obtained at 0.06 which is approximately the cutoff we had applied = 1/16.

9. Out of sample performance

After predicting the DSLRN on the 20% testing data. We see the performance is as below:

###########################
#OUT OF SAMPLE DATA
###########################
cutoff=1/16
test = bankruptcy.test[-1]

prob.glm.outsample <- predict(bankruptcy.glm.step.bic,new =test, type = "response")
predicted.glm.outsample <- prob.glm.outsample > cutoff
predicted.glm.outsample <- as.numeric(predicted.glm.outsample)

#checking the auc
roc.plot(bankruptcy.test$DLRSN=="1", prob.glm.outsample)$roc.vol #0.8894691

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8894691 3.076907e-54          NA

mean(bankruptcy.test$DLRSN!=predicted.glm.outsample) #0.3465074

## [1] 0.3465074

pcut = 1/16
#asymetric classification rate
cost2(bankruptcy.test$DLRSN, predict(bankruptcy.glm.step.bic, new = bankruptcy.test[-1],type = "response")) #0.411

## [1] 0.4108456

cost2(bankruptcy.train$DLRSN, predict(bankruptcy.glm.step.bic,type = "response")) #0.527

## [1] 0.5269089

table(bankruptcy.test$DLRSN, predicted.glm.outsample, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth   0   1
##     0 563 372
##     1   5 148

Table: Out of Sample Performance Metric| Value ————- | —————— Out of Sample AUC| 0. 8895 Out of Sample AMR| 0.411

We see that the AUC is 0.8895 which is above the industry standard of 0.7. Thus, the model has a good prediction power The confusion matrix for the out of sample prediction is as below:

We can see from the confusion matrix above that only 5 bankruptcy cases where predicted as non-bankruptcy out of total number of cases 153 which is pretty good.

10. Out of sample performance as compared to in-sample performance

Out of sample Performance vs In sample Performance
AUC	Cost(AMR)
In-sample training data (80%)	0.874 (>0.7)	0.527
Out of sample testing data (20%)	0.8895 (>0.7)	0.411

Comparing the AUC with the threshold of 0.7 our final model is performed quite well both on the training data and the test data. We see that the model performance is better for test data as compared to training with a lower cost function as well as a lower AUC.