library(dplyr)
library(ggplot2)
library(fields)
library(verification)
The bankruptcy data we are dealing with here has 5436 observations and 13 variables. The variable of interest is DLRSN which is coded as 0/1 where 1 means bankruptcy and 0 non-bankruptcy. There are 10 variables which will help us predict if there is a chance of bankruptcy or not. In this project we are using logistic regression along with different variable selection methods and cross validation to have a good out of sample accuracy.
Table 1: Data Dictionary
| Variable | Description | Type |
|---|---|---|
| FYEAR | Fiscal Year | Numerical |
| R1 | Working Capital/Total Asset | Numerical |
| R2 | Retained Earnings/Total Asset | Numerical |
| R3 | Earnings Before Interest & Tax/Total Asset | Numerical |
| R4 | Market Capital / Total Liability | Numerical |
| R5 | SALE/Total Asset | Numerical |
| R6 | Total Liability/Total Asset | Numerical |
| R7 | Current Asset/Current Liability | Numerical |
| R8 | Net Income/Total Asset | Numerical |
| R9 | Log(SALE) | Numerical |
| R10 | Log(Market Cap) | Numerical |
| DLRSN | Coded as 1 = Bankruptcy, 0=Non-Bankruptcy | Binary |
setwd('C:/Users/anany/Desktop/BANA/STUDY/Semester 2/Data Mining 1/Hw/3')
bankruptcy_data <- read.csv('bankruptcy.csv')
glimpse(bankruptcy_data)
## Observations: 5,436
## Variables: 13
## $ FYEAR <int> 1999, 1999, 1999, 1994, 1999, 1999, 1999, 1999, 1987, 19...
## $ DLRSN <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CUSIP <fctr> 00036020, 00036110, 00037520, 00078110, 00079X10, 00086...
## $ R1 <dbl> 0.3071395, 0.7607365, -0.5135961, -0.4661293, 2.0234223,...
## $ R2 <dbl> 0.8870057, 0.5924934, 0.3376148, 0.3707467, 0.2148764, 0...
## $ R3 <dbl> 1.64768076, 0.45300275, 0.29901455, 0.49606665, 0.182594...
## $ R4 <dbl> -0.19915760, -0.36989014, -0.02907996, -0.37342897, 6.69...
## $ R5 <dbl> 1.092963959, 0.186153945, -0.432604675, -0.267423988, -1...
## $ R6 <dbl> -0.31328867, 0.03961907, 0.82999324, 0.97779888, -1.5058...
## $ R7 <dbl> -0.196793318, 0.327497319, -0.707786127, -0.610975074, 2...
## $ R8 <dbl> 1.20676281, 0.42841806, 0.47615328, 0.45680985, 0.287374...
## $ R9 <dbl> 0.28247091, 1.10696521, 2.17917545, 0.15195105, -0.98644...
## $ R10 <dbl> 0.158896251, 0.793442755, 2.484584500, 0.047788514, 0.79...
bankruptcy_data$DLRSN <- as.factor(bankruptcy_data$DLRSN)
summary(bankruptcy_data)
## FYEAR DLRSN CUSIP R1
## Min. :1980 0:4660 00036020: 1 Min. :-4.3828
## 1st Qu.:1999 1: 776 00036110: 1 1st Qu.:-0.7501
## Median :1999 00037520: 1 Median :-0.2220
## Mean :1998 00078110: 1 Mean :-0.2352
## 3rd Qu.:1999 00079X10: 1 3rd Qu.: 0.4821
## Max. :1999 00086T10: 1 Max. : 2.0234
## (Other) :5430
## R2 R3 R4 R5
## Min. :-2.2418 Min. :-2.06423 Min. :-0.42712 Min. :-1.3639
## 1st Qu.:-1.0805 1st Qu.:-1.07887 1st Qu.:-0.38752 1st Qu.:-0.8748
## Median : 0.1337 Median : 0.06858 Median :-0.30754 Median :-0.3508
## Mean :-0.2915 Mean :-0.24411 Mean : 0.23903 Mean :-0.1338
## 3rd Qu.: 0.5103 3rd Qu.: 0.50070 3rd Qu.: 0.02746 3rd Qu.: 0.2878
## Max. : 1.4854 Max. : 2.14246 Max. : 6.69536 Max. : 4.0362
##
## R6 R7 R8
## Min. :-1.505889 Min. :-1.23340 Min. :-2.2082
## 1st Qu.:-0.656827 1st Qu.:-0.77996 1st Qu.:-1.0054
## Median : 0.003493 Median :-0.43205 Median : 0.2053
## Mean : 0.195707 Mean :-0.09612 Mean :-0.2272
## 3rd Qu.: 0.608376 3rd Qu.: 0.17578 3rd Qu.: 0.5289
## Max. : 5.110424 Max. : 2.87648 Max. : 2.0006
##
## R9 R10
## Min. :-2.76356 Min. :-2.2140
## 1st Qu.:-0.66606 1st Qu.:-0.6413
## Median : 0.06219 Median : 0.1230
## Mean : 0.02538 Mean : 0.1806
## 3rd Qu.: 0.81215 3rd Qu.: 0.9878
## Max. : 2.17918 Max. : 2.4846
##
dim(bankruptcy_data)
## [1] 5436 13
attach(bankruptcy_data)
barplot(table(DLRSN), xlab = "DLRSN", ylab = "# of observations")
Our variable of interest DLRSN has an imbalanced proportion of 0s and 1s with 0(14.27%) and 1(85.73%)
plot1 <- function(A){
ggplot(bankruptcy_data,aes(x=as.name(A),fill=DLRSN))+geom_histogram(position="fill",binwidth=1) +xlab(A)
}
lapply(names(bankruptcy_data[,-c(1,2,3)]),plot1)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
From the plots above we noticed the following:
We first split the data into test and train data with 20% of the data and 80% of the data randomly. Thus, the train data that we would be using to make the model has 4348 number of observations. On checking the distribution of 1’s and 0’s in the DLRSN in the train data we see that it has 85% 1’s and rest 0’s which is like the whole data (unbiased sample).
set.seed(12384128)
subset <- sample(nrow(bankruptcy_data), nrow(bankruptcy_data) * 0.8)
bankruptcy.train = bankruptcy_data[subset,-c(1,3) ]
bankruptcy.test = bankruptcy_data[-subset,-c(1,3)]
table(bankruptcy.train$DLRSN)
##
## 0 1
## 3725 623
nrow(bankruptcy.train)
## [1] 4348
1. Full Model Then we created the full model with all the dependent variables - R1, R2, R3, R4, R5, R6, R7, R8, R9 and R10. Let’s name it Model0. Summary of the model is as below:
bankruptcy.glm0 <- glm(DLRSN ~ ., family = binomial, bankruptcy.train)
summary(bankruptcy.glm0)
##
## Call:
## glm(formula = DLRSN ~ ., family = binomial, data = bankruptcy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3291 -0.4595 -0.2408 -0.1072 3.3704
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.56141 0.07912 -32.375 < 2e-16 ***
## R1 0.16055 0.07439 2.158 0.030914 *
## R2 0.57219 0.07932 7.214 5.44e-13 ***
## R3 -0.49094 0.10063 -4.879 1.07e-06 ***
## R4 -0.25690 0.13417 -1.915 0.055527 .
## R5 0.03667 0.05492 0.668 0.504363
## R6 0.23935 0.05590 4.282 1.86e-05 ***
## R7 -0.43555 0.10535 -4.134 3.56e-05 ***
## R8 -0.29604 0.08414 -3.519 0.000434 ***
## R9 0.26435 0.10642 2.484 0.012991 *
## R10 -1.43413 0.09826 -14.596 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3573.0 on 4347 degrees of freedom
## Residual deviance: 2423.5 on 4337 degrees of freedom
## AIC: 2445.5
##
## Number of Fisher Scoring iterations: 7
AIC(bankruptcy.glm0) #2445.495
## [1] 2445.495
BIC(bankruptcy.glm0) #2515.647
## [1] 2515.647
#Misclassification and confusion matrix
cutoff=1/16
prob.glm0.insample <- predict(bankruptcy.glm0, type = "response")
predicted.glm0.insample <- prob.glm0.insample > cutoff
predicted.glm0.insample <- as.numeric(predicted.glm0.insample)
#confusion matrix
table(bankruptcy.train$DLRSN, predicted.glm0.insample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 2262 1463
## 1 56 567
From the P value all the variable looks significant except R5. Also, AIC of the model is 2445.5 and BIC of the full model is 2515.65. Taking cutoff = 1/16 based on the cost ratio of 15:1 we found the misclassification rate and the confusion matrix. Misclassification Rate = 0.3493
2. Stepwise AIC Next, we performed a stepwise regression using AIC as the criteria. The final model with the lowest AIC was obtained by removing the variable R5
Using both backward and forward stepwise regression we got the model without R5 which resulted in the smallest AIC. The AIC and BIC of the model is 2443.9 and 2507.7 respectively Taking cutoff = 1/16 based on the cost ratio of 15:1 we found the misclassification rate and the confusion matrix. Misclassification Rate = 0.3475
null.model <- glm(DLRSN ~ 1, family = binomial, bankruptcy.train)
summary(null.model)
##
## Call:
## glm(formula = DLRSN ~ 1, family = binomial, data = bankruptcy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5562 -0.5562 -0.5562 -0.5562 1.9713
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.78828 0.04329 -41.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3573 on 4347 degrees of freedom
## Residual deviance: 3573 on 4347 degrees of freedom
## AIC: 3575
##
## Number of Fisher Scoring iterations: 4
#stepwise AIC
bankruptcy.glm.step.aic<-step(null.model, scope=list(lower=null.model, upper=bankruptcy.glm0), k = 2, direction="forward")
## Start: AIC=3575.02
## DLRSN ~ 1
##
## Df Deviance AIC
## + R10 1 2691.2 2695.2
## + R6 1 3216.2 3220.2
## + R8 1 3251.7 3255.7
## + R3 1 3259.2 3263.2
## + R1 1 3308.0 3312.0
## + R4 1 3350.5 3354.5
## + R7 1 3370.1 3374.1
## + R9 1 3382.9 3386.9
## + R2 1 3405.0 3409.0
## + R5 1 3464.9 3468.9
## <none> 3573.0 3575.0
##
## Step: AIC=2695.18
## DLRSN ~ R10
##
## Df Deviance AIC
## + R7 1 2610.4 2616.4
## + R6 1 2616.8 2622.8
## + R4 1 2623.4 2629.4
## + R8 1 2637.3 2643.3
## + R9 1 2641.5 2647.5
## + R1 1 2644.6 2650.6
## + R3 1 2652.6 2658.6
## + R5 1 2663.5 2669.5
## <none> 2691.2 2695.2
## + R2 1 2690.2 2696.2
##
## Step: AIC=2616.43
## DLRSN ~ R10 + R7
##
## Df Deviance AIC
## + R8 1 2575.3 2583.3
## + R9 1 2577.9 2585.9
## + R3 1 2581.4 2589.4
## + R4 1 2586.3 2594.3
## + R6 1 2588.2 2596.2
## + R5 1 2595.6 2603.6
## + R2 1 2600.9 2608.9
## <none> 2610.4 2616.4
## + R1 1 2610.2 2618.2
##
## Step: AIC=2583.26
## DLRSN ~ R10 + R7 + R8
##
## Df Deviance AIC
## + R2 1 2503.5 2513.5
## + R9 1 2511.7 2521.7
## + R4 1 2528.0 2538.0
## + R5 1 2556.7 2566.7
## + R6 1 2565.0 2575.0
## <none> 2575.3 2583.3
## + R1 1 2573.8 2583.8
## + R3 1 2574.5 2584.5
##
## Step: AIC=2513.48
## DLRSN ~ R10 + R7 + R8 + R2
##
## Df Deviance AIC
## + R6 1 2475.3 2487.3
## + R9 1 2477.2 2489.2
## + R4 1 2481.0 2493.0
## + R5 1 2490.3 2502.3
## + R3 1 2490.6 2502.6
## <none> 2503.5 2513.5
## + R1 1 2503.4 2515.4
##
## Step: AIC=2487.31
## DLRSN ~ R10 + R7 + R8 + R2 + R6
##
## Df Deviance AIC
## + R9 1 2458.1 2472.1
## + R3 1 2461.0 2475.0
## + R4 1 2461.4 2475.4
## + R1 1 2463.0 2477.0
## + R5 1 2469.3 2483.3
## <none> 2475.3 2487.3
##
## Step: AIC=2472.13
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9
##
## Df Deviance AIC
## + R3 1 2433.7 2449.7
## + R1 1 2452.5 2468.5
## + R4 1 2453.0 2469.0
## <none> 2458.1 2472.1
## + R5 1 2457.9 2473.9
##
## Step: AIC=2449.7
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3
##
## Df Deviance AIC
## + R1 1 2428.2 2446.2
## + R4 1 2428.7 2446.7
## <none> 2433.7 2449.7
## + R5 1 2433.6 2451.6
##
## Step: AIC=2446.22
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3 + R1
##
## Df Deviance AIC
## + R4 1 2423.9 2443.9
## <none> 2428.2 2446.2
## + R5 1 2428.2 2448.2
##
## Step: AIC=2443.94
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3 + R1 + R4
##
## Df Deviance AIC
## <none> 2423.9 2443.9
## + R5 1 2423.5 2445.5
summary(bankruptcy.glm.step.aic) #-R5
##
## Call:
## glm(formula = DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3 + R1 +
## R4, family = binomial, data = bankruptcy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3385 -0.4595 -0.2406 -0.1076 3.3482
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.56443 0.07893 -32.489 < 2e-16 ***
## R10 -1.46141 0.08943 -16.341 < 2e-16 ***
## R7 -0.43875 0.10512 -4.174 2.99e-05 ***
## R8 -0.29527 0.08411 -3.511 0.000447 ***
## R2 0.56995 0.07926 7.191 6.45e-13 ***
## R6 0.24550 0.05511 4.455 8.40e-06 ***
## R9 0.30008 0.09203 3.261 0.001111 **
## R3 -0.49316 0.10067 -4.899 9.63e-07 ***
## R1 0.16234 0.07432 2.184 0.028942 *
## R4 -0.23959 0.12977 -1.846 0.064851 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3573.0 on 4347 degrees of freedom
## Residual deviance: 2423.9 on 4338 degrees of freedom
## AIC: 2443.9
##
## Number of Fisher Scoring iterations: 7
bankruptcy.glm.step.aic$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 4347 3573.020 3575.020
## 2 + R10 -1 881.842602 4346 2691.178 2695.178
## 3 + R7 -1 80.751812 4345 2610.426 2616.426
## 4 + R8 -1 35.162331 4344 2575.263 2583.263
## 5 + R2 -1 71.779729 4343 2503.484 2513.484
## 6 + R6 -1 28.169211 4342 2475.314 2487.314
## 7 + R9 -1 17.185392 4341 2458.129 2472.129
## 8 + R3 -1 24.428733 4340 2433.700 2449.700
## 9 + R1 -1 5.476481 4339 2428.224 2446.224
## 10 + R4 -1 4.284533 4338 2423.939 2443.939
AIC(bankruptcy.glm.step.aic)
## [1] 2443.939
BIC(bankruptcy.glm.step.aic)
## [1] 2507.714
#for stepwise model --AIC
prob.glm1.insample <- predict(bankruptcy.glm.step.aic, type = "response")
predicted.glm1.insample <- as.numeric(prob.glm1.insample > cutoff)
#confusion matrix
table(bankruptcy.train$DLRSN, predicted.glm1.insample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 2269 1456
## 1 55 568
#misclassification Rate
mean(bankruptcy.train$DLRSN != predicted.glm1.insample)
## [1] 0.3475161
3. Stepwise BIC regression Then we performed a stepwise regression using the BIC criteria.
Using stepwise BIC regression, we got a model without R1, R4 and R5. AIC and BIC of this model are 2449.7 and 2500 .7 respectively Taking cutoff = 1/16 based on the cost ratio of 15:1 we found the misclassification rate and the confusion matrix. Misclassification Rate = 0.3498
bankruptcy.glm.step.bic <- step(null.model, k = log(nrow(bankruptcy.train)),scope=list(lower=null.model, upper=bankruptcy.glm0), direction="forward")
## Start: AIC=3581.4
## DLRSN ~ 1
##
## Df Deviance AIC
## + R10 1 2691.2 2707.9
## + R6 1 3216.2 3233.0
## + R8 1 3251.7 3268.4
## + R3 1 3259.2 3276.0
## + R1 1 3308.0 3324.7
## + R4 1 3350.5 3367.2
## + R7 1 3370.1 3386.9
## + R9 1 3382.9 3399.7
## + R2 1 3405.0 3421.8
## + R5 1 3464.9 3481.7
## <none> 3573.0 3581.4
##
## Step: AIC=2707.93
## DLRSN ~ R10
##
## Df Deviance AIC
## + R7 1 2610.4 2635.6
## + R6 1 2616.8 2641.9
## + R4 1 2623.4 2648.5
## + R8 1 2637.3 2662.4
## + R9 1 2641.5 2666.6
## + R1 1 2644.6 2669.7
## + R3 1 2652.6 2677.7
## + R5 1 2663.5 2688.6
## <none> 2691.2 2707.9
## + R2 1 2690.2 2715.4
##
## Step: AIC=2635.56
## DLRSN ~ R10 + R7
##
## Df Deviance AIC
## + R8 1 2575.3 2608.8
## + R9 1 2577.9 2611.4
## + R3 1 2581.4 2614.9
## + R4 1 2586.3 2619.8
## + R6 1 2588.2 2621.8
## + R5 1 2595.6 2629.1
## + R2 1 2600.9 2634.5
## <none> 2610.4 2635.6
## + R1 1 2610.2 2643.7
##
## Step: AIC=2608.77
## DLRSN ~ R10 + R7 + R8
##
## Df Deviance AIC
## + R2 1 2503.5 2545.4
## + R9 1 2511.7 2553.6
## + R4 1 2528.0 2569.9
## + R5 1 2556.7 2598.6
## + R6 1 2565.0 2606.9
## <none> 2575.3 2608.8
## + R1 1 2573.8 2615.7
## + R3 1 2574.5 2616.4
##
## Step: AIC=2545.37
## DLRSN ~ R10 + R7 + R8 + R2
##
## Df Deviance AIC
## + R6 1 2475.3 2525.6
## + R9 1 2477.2 2527.5
## + R4 1 2481.0 2531.2
## + R5 1 2490.3 2540.5
## + R3 1 2490.6 2540.9
## <none> 2503.5 2545.4
## + R1 1 2503.4 2553.7
##
## Step: AIC=2525.58
## DLRSN ~ R10 + R7 + R8 + R2 + R6
##
## Df Deviance AIC
## + R9 1 2458.1 2516.8
## + R3 1 2461.0 2519.7
## + R4 1 2461.4 2520.1
## + R1 1 2463.0 2521.7
## <none> 2475.3 2525.6
## + R5 1 2469.3 2528.0
##
## Step: AIC=2516.77
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9
##
## Df Deviance AIC
## + R3 1 2433.7 2500.7
## <none> 2458.1 2516.8
## + R1 1 2452.5 2519.5
## + R4 1 2453.0 2520.0
## + R5 1 2457.9 2525.0
##
## Step: AIC=2500.72
## DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3
##
## Df Deviance AIC
## <none> 2433.7 2500.7
## + R1 1 2428.2 2503.6
## + R4 1 2428.7 2504.1
## + R5 1 2433.6 2509.0
summary(bankruptcy.glm.step.bic) #-R1, -R4,-R5
##
## Call:
## glm(formula = DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3, family = binomial,
## data = bankruptcy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2954 -0.4603 -0.2482 -0.1192 3.1673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.53520 0.07603 -33.344 < 2e-16 ***
## R10 -1.55630 0.08144 -19.109 < 2e-16 ***
## R7 -0.34996 0.07617 -4.594 4.34e-06 ***
## R8 -0.27542 0.08432 -3.266 0.00109 **
## R2 0.57785 0.07837 7.373 1.67e-13 ***
## R6 0.18177 0.04235 4.292 1.77e-05 ***
## R9 0.42434 0.08289 5.119 3.06e-07 ***
## R3 -0.49841 0.10107 -4.932 8.16e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3573.0 on 4347 degrees of freedom
## Residual deviance: 2433.7 on 4340 degrees of freedom
## AIC: 2449.7
##
## Number of Fisher Scoring iterations: 6
bankruptcy.glm.step.bic$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 4347 3573.020 3581.398
## 2 + R10 -1 881.84260 4346 2691.178 2707.932
## 3 + R7 -1 80.75181 4345 2610.426 2635.558
## 4 + R8 -1 35.16233 4344 2575.263 2608.773
## 5 + R2 -1 71.77973 4343 2503.484 2545.371
## 6 + R6 -1 28.16921 4342 2475.314 2525.579
## 7 + R9 -1 17.18539 4341 2458.129 2516.771
## 8 + R3 -1 24.42873 4340 2433.700 2500.720
AIC(bankruptcy.glm.step.bic)
## [1] 2449.7
BIC(bankruptcy.glm.step.bic)
## [1] 2500.72
#for stepwise model - BIC
prob.glm2.insample <- predict(bankruptcy.glm.step.bic, type = "response")
predicted.glm2.insample <- as.numeric(prob.glm2.insample > cutoff)
#confusion matrix
table(bankruptcy.train$DLRSN, predicted.glm2.insample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 2259 1466
## 1 55 568
#misclassification Rate
mean(bankruptcy.train$DLRSN != predicted.glm2.insample)
## [1] 0.349816
bankruptcy.glm.step.bic
##
## Call: glm(formula = DLRSN ~ R10 + R7 + R8 + R2 + R6 + R9 + R3, family = binomial,
## data = bankruptcy.train)
##
## Coefficients:
## (Intercept) R10 R7 R8 R2
## -2.5352 -1.5563 -0.3500 -0.2754 0.5779
## R6 R9 R3
## 0.1818 0.4243 -0.4984
##
## Degrees of Freedom: 4347 Total (i.e. Null); 4340 Residual
## Null Deviance: 3573
## Residual Deviance: 2434 AIC: 2450
We plotted the ROC curves for all the 3 models and found out the AIC for the same. We see that there is not a major difference between the AUC values for the 3 models.
#AUC
#roc_curve$roc.vol #0.8905805
#compare 2 models
roc.plot(x = bankruptcy.train$DLRSN == "1", pred = cbind(prob.glm0.insample, prob.glm1.insample,prob.glm2.insample),
legend = TRUE, leg.text = c("Full Model", "Model selected by AIC( -R1)", "Model selected by BIC( -R4,-R5,-R1)"))$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8755280 1.090043e-198 NA
## 2 Model 2 0.8754276 1.388023e-198 NA
## 3 Model 3 0.8737647 7.526347e-197 NA
Table: Comparing the three models
Models| Variables| AIC| BIC| AUC| Misclassification Rate ————- | —————— | —————— ————- | —————— | —————— Full |Model| All |2445.49 |2515.6| 0.876| 0.3493 Model selected by AIC| -R1| 2443.94| 2507.7| 0.875| 0.3475 Model selected by BIC| -R1-R4-R5| 2449.7| 2500.7| 0.874| 0.3498
All the models have almost same AIC, AUC and misclassification rate. As the model with BIC has the least number of variables and comparable AIC, AUC and misclassification rate, I have decided to go ahead with the model without R1, R4 and R5. As these variables R4 and R5 do not add much information to the model I have decided to go ahead with the
Now, that we have the final model we want to find out the optimal cut-off probability. Based on our initial knowledge we defined out cutoff as 1/16. In this section we are trying to find out the optimal cutoff using an asymmetric cost function and then doing a grid search to find the best cutoff. The assumption behind the asymmetric cost function is that predicting a bankruptcy case as non-bankruptcy is 15 times more costlier than the vice-versa.
## Search for optimal cut-off probability
# define the searc grid from 0.01 to 0.99
searchgrid = seq(0.01, 0.99, 0.01)
# result is a 99x2 matrix, the 1st col stores the cut-off p, the 2nd column
# stores the cost
result = cbind(searchgrid, NA)
# in the cost function, both r and pi are vectors, r=truth, pi=predicted
# probability
cost2 <- function(r, pi) {
weight1 = 15
weight0 = 1
c1 = (r == 1) & (pi < pcut) #logical vector - true if actual 1 but predict 0
c0 = (r == 0) & (pi > pcut) #logical vecotr - true if actual 0 but predict 1
return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
pcut <- result[i, 1]
# assign the cost to the 2nd col
result[i, 2] <- cost2(bankruptcy.train$DLRSN, predict(bankruptcy.glm.step.bic, type = "response"))
}
plot(result, ylab = "Cost in Training Set")
#find the optimal cut-off prob.
searchgrid[which(result[,2]==min(result[,2]))]
## [1] 0.06
#0.06
#1.04 is the cost
The minimum cost was obtained at 0.06 which is approximately the cutoff we had applied = 1/16.
After predicting the DSLRN on the 20% testing data. We see the performance is as below:
###########################
#OUT OF SAMPLE DATA
###########################
cutoff=1/16
test = bankruptcy.test[-1]
prob.glm.outsample <- predict(bankruptcy.glm.step.bic,new =test, type = "response")
predicted.glm.outsample <- prob.glm.outsample > cutoff
predicted.glm.outsample <- as.numeric(predicted.glm.outsample)
#checking the auc
roc.plot(bankruptcy.test$DLRSN=="1", prob.glm.outsample)$roc.vol #0.8894691
## Model Area p.value binorm.area
## 1 Model 1 0.8894691 3.076907e-54 NA
mean(bankruptcy.test$DLRSN!=predicted.glm.outsample) #0.3465074
## [1] 0.3465074
pcut = 1/16
#asymetric classification rate
cost2(bankruptcy.test$DLRSN, predict(bankruptcy.glm.step.bic, new = bankruptcy.test[-1],type = "response")) #0.411
## [1] 0.4108456
cost2(bankruptcy.train$DLRSN, predict(bankruptcy.glm.step.bic,type = "response")) #0.527
## [1] 0.5269089
table(bankruptcy.test$DLRSN, predicted.glm.outsample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 563 372
## 1 5 148
Table: Out of Sample Performance Metric| Value ————- | —————— Out of Sample AUC| 0. 8895 Out of Sample AMR| 0.411
We see that the AUC is 0.8895 which is above the industry standard of 0.7. Thus, the model has a good prediction power The confusion matrix for the out of sample prediction is as below:
We can see from the confusion matrix above that only 5 bankruptcy cases where predicted as non-bankruptcy out of total number of cases 153 which is pretty good.
| AUC | Cost(AMR) | |
|---|---|---|
| In-sample training data (80%) | 0.874 (>0.7) | 0.527 |
| Out of sample testing data (20%) | 0.8895 (>0.7) | 0.411 |
Comparing the AUC with the threshold of 0.7 our final model is performed quite well both on the training data and the test data. We see that the model performance is better for test data as compared to training with a lower cost function as well as a lower AUC.