Predicting Bankruptcy using Logistic Regression Model

Bankruptcy Data

Bankruptcy dataset is a dataset which contains the financial information and the bankruptcy status of the companies for specific years.This dataset is a subset of Man Xu’s project containing the financial information of the companies It contains 5436 observations with 13 variables.Variable CUSIP contains unique code to identify the company, variable FR determines the fiscal year, where approximately 85% of the records belong to year 1999. Variable DLRSN is the Bankruptcy/Non Bankruptcy flag, where 1 stands for Bankruptcy while 0 stands for Non Bankrupt companies. Variables R1 to R10 contain financial information which will be used while building a logistic regression model.

#Reading Dataset
bankruptcy.data <- read.csv("C:/Users/Devanshu/Downloads/bankruptcy.csv",header = TRUE, sep = "," )
head(bankruptcy.data)

##   FYEAR DLRSN    CUSIP         R1        R2        R3          R4
## 1  1999     0 00036020  0.3071395 0.8870057 1.6476808 -0.19915760
## 2  1999     0 00036110  0.7607365 0.5924934 0.4530028 -0.36989014
## 3  1999     0 00037520 -0.5135961 0.3376148 0.2990145 -0.02907996
## 4  1994     1 00078110 -0.4661293 0.3707467 0.4960667 -0.37342897
## 5  1999     0 00079X10  2.0234223 0.2148764 0.1825948  6.69536040
## 6  1999     0 00086T10  0.9074985 0.3868797 0.4778914 -0.34715872
##           R5          R6          R7        R8         R9         R10
## 1  1.0929640 -0.31328867 -0.19679332 1.2067628  0.2824709  0.15889625
## 2  0.1861539  0.03961907  0.32749732 0.4284181  1.1069652  0.79344276
## 3 -0.4326047  0.82999324 -0.70778613 0.4761533  2.1791755  2.48458450
## 4 -0.2674240  0.97779888 -0.61097507 0.4568098  0.1519511  0.04778851
## 5 -1.1483381 -1.50588927  2.87647687 0.2873749 -0.9864421  0.79107673
## 6  1.4079893 -0.48390230  0.07025944 0.5278110  0.5024659 -0.16464802

Exploratory Data Analysis

The first step before performing any modeling is to look at the high-level summary of a dataset, understand the variables and look for any abnormalities or hidden facts present in the data.

We look at summary statistics and distribution of the observations of each variable.

summary(bankruptcy.data)

##      FYEAR          DLRSN             CUSIP            R1         
##  Min.   :1980   Min.   :0.0000   00036020:   1   Min.   :-4.3828  
##  1st Qu.:1999   1st Qu.:0.0000   00036110:   1   1st Qu.:-0.7501  
##  Median :1999   Median :0.0000   00037520:   1   Median :-0.2220  
##  Mean   :1998   Mean   :0.1428   00078110:   1   Mean   :-0.2352  
##  3rd Qu.:1999   3rd Qu.:0.0000   00079X10:   1   3rd Qu.: 0.4821  
##  Max.   :1999   Max.   :1.0000   00086T10:   1   Max.   : 2.0234  
##                                  (Other) :5430                    
##        R2                R3                 R4                 R5         
##  Min.   :-2.2418   Min.   :-2.06423   Min.   :-0.42712   Min.   :-1.3639  
##  1st Qu.:-1.0805   1st Qu.:-1.07887   1st Qu.:-0.38752   1st Qu.:-0.8748  
##  Median : 0.1337   Median : 0.06858   Median :-0.30754   Median :-0.3508  
##  Mean   :-0.2915   Mean   :-0.24411   Mean   : 0.23903   Mean   :-0.1338  
##  3rd Qu.: 0.5103   3rd Qu.: 0.50070   3rd Qu.: 0.02746   3rd Qu.: 0.2878  
##  Max.   : 1.4854   Max.   : 2.14246   Max.   : 6.69536   Max.   : 4.0362  
##                                                                           
##        R6                  R7                 R8         
##  Min.   :-1.505889   Min.   :-1.23340   Min.   :-2.2082  
##  1st Qu.:-0.656827   1st Qu.:-0.77996   1st Qu.:-1.0054  
##  Median : 0.003493   Median :-0.43205   Median : 0.2053  
##  Mean   : 0.195707   Mean   :-0.09612   Mean   :-0.2272  
##  3rd Qu.: 0.608376   3rd Qu.: 0.17578   3rd Qu.: 0.5289  
##  Max.   : 5.110424   Max.   : 2.87648   Max.   : 2.0006  
##                                                          
##        R9                R10         
##  Min.   :-2.76356   Min.   :-2.2140  
##  1st Qu.:-0.66606   1st Qu.:-0.6413  
##  Median : 0.06219   Median : 0.1230  
##  Mean   : 0.02538   Mean   : 0.1806  
##  3rd Qu.: 0.81215   3rd Qu.: 0.9878  
##  Max.   : 2.17918   Max.   : 2.4846  
##

There are no missing values in the data set.

We now check the distributions of all independent variables.

par(mfrow = c(3,4))
i <- 4
for (i in 4:13) 
  {
hist((bankruptcy.data[,i]), main = paste("Distibution of ", colnames(bankruptcy.data[i])), xlab = colnames(bankruptcy.data[i]))
}

R9 and R10 seem to be close to a normal distribution and nothing concrete can be commented about the others. R2, R3 and R8 have heavy tails.

Let us see what is the distribution of the bankruptcy indicator in the dataset.

par(mfrow = c(1,1))
barplot(table(bankruptcy.data$DLRSN), main = "Frequency of Bankruptcy Flag")

Number of Non- Bankrupt Companies are way more than the Bankrupt ones in the data set.

Now let us check the correlation between independent variables and the bankruptcy flag.

library(GGally)
ggcorr(bankruptcy.data[,-3], label = TRUE)

We can see that R10 has the highest correlation with the bankruptcy flag with the correlation coefficient as -0.425 followed by R6 with a correlation coefficient of 0.331. We see a high correlation between R8 and R3 and moderate correlation among other variables.

Splitting dataset into Training and Test data

We first split our dataset into training and test data set. We randomly select 80% of the observations as a part of the training data set and 20% of the observations as the test data set. The model will be built on the training data set and will be tested on the test data set.

#Sampling for training and testing datasets
set.seed(12420053)
train.ind <- sample(seq_len(nrow(bankruptcy.data)), size = floor(0.80*nrow(bankruptcy.data)))
bankruptcy.train <- bankruptcy.data[train.ind,]
bankruptcy.test <- bankruptcy.data[-train.ind,]

Model creation using Logistic Regression

Since the response variable is qualitative i.e. we classify the company as bankrupt or not, we use the Logistic Regression technique for model creation.

Model selection using the BIC criterion

We build two models. First model is the null model containing only the bankruptcy indicator and no predictors. The second model is the full model containing all predictors.

Now we select the best model using the stepwise selection method - with BIC criterion.

library(MASS)
model.full <- glm(DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10, family = "binomial" ,data = bankruptcy.train)
model.null <- glm(DLRSN ~ 1, family = "binomial" ,data = bankruptcy.train)
step.BIC <- step(model.null, scope = list(lower = model.null, upper = model.full), direction = "both", k = log(nrow(bankruptcy.train)))

summary(step.BIC)

## 
## Call:
## glm(formula = DLRSN ~ R10 + R7 + R9 + R3 + R2 + R6 + R8, family = "binomial", 
##     data = bankruptcy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1913  -0.4505  -0.2360  -0.1086   3.1897  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.59128    0.07875 -32.903  < 2e-16 ***
## R10         -1.63097    0.08298 -19.655  < 2e-16 ***
## R7          -0.33386    0.07715  -4.327 1.51e-05 ***
## R9           0.47607    0.08271   5.756 8.60e-09 ***
## R3          -0.52465    0.09957  -5.269 1.37e-07 ***
## R2           0.56774    0.07837   7.244 4.35e-13 ***
## R6           0.17677    0.04328   4.084 4.42e-05 ***
## R8          -0.28079    0.08376  -3.352 0.000802 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3605.1  on 4347  degrees of freedom
## Residual deviance: 2398.3  on 4340  degrees of freedom
## AIC: 2414.3
## 
## Number of Fisher Scoring iterations: 6

The model with BIC criterion has 7 predictors namely R2, R3, R6, R7, R8, R9 and R10.

In-Sample performance of the model

Once we have the final logistic regression model, we will plot the ROC curve to get the area under the curve. The ROC curve is plot of Sensitivity (True Positives) vs Specificity (False Positives). Higher area under the curve, better is the model performance.

final.model <- glm(DLRSN ~ R2 + R3 + R6 + R7 + R8 + R9 + R10, family = "binomial" ,data = bankruptcy.train)
library(fields)
library(verification)
prob.insample <- predict(final.model, type = "response")
roc.plot(bankruptcy.train$DLRSN == "1", prob.insample)$roc.vol

##      Model     Area       p.value binorm.area
## 1 Model  1 0.881367 2.865588e-207          NA

The above figure shows the ROC Curve for the training Dataset and the Area under the Curve (AUC) obtained is 0.8813.

For obtaining the misclassification rate, we need to decide the threshold for categorizing the bankruptcy status. By using the cost function, we will plot a graph of Cost of Training data vs threshold value and choose the value with lowest cost as the threshold for classifying the bankruptcy status. The ratio of weightage for False Positive and False Negative is 1:15. Here we are trying to say that wrongly predicting the companies which will go bankrupt as Non-Bankrupt is more serious and risky than predicting the vis versa. Therefore, it is important to reduce the False Negatives.

#Optimal Cutoff
searchgrid = seq(0.01, 0.99, 0.01)
result = cbind(searchgrid, NA)

cost1 <- function(r, pi) {
  weight1 = 15
  weight0 = 1
  c1 = (r == 1) & (pi < cutoff)  #logical vector - true if actual 1 but predict 0
  c0 = (r == 0) & (pi > cutoff)  #logical vecotr - true if actual 0 but predict 1
  return(mean(weight1 * c1 + weight0 * c0))
}

for (i in 1:length(searchgrid)) {
  cutoff <- result[i, 1]
  # assign the cost to the 2nd col
  result[i, 2] <- cost1(bankruptcy.train$DLRSN, predict(final.model, type = "response"))
}
plot(result, ylab = "Cost in Training Set", main = "Asymmetric Cost Function")

opt.cutoff <- searchgrid[which(result[,2]==min(result[,2]))]
opt.cutoff

## [1] 0.06

The optimal cutoff probability comes out to be 0.06.

We now calculate the misclassification rate of the training dataset using this threshold value.

bankruptcy.train$pred <-  ifelse(prob.insample>opt.cutoff,1,0)
table(pred = bankruptcy.train$pred , true = bankruptcy.train$DLRSN)

##     true
## pred    0    1
##    0 2256   46
##    1 1460  586

mean(bankruptcy.train$pred != bankruptcy.train$DLRSN)

## [1] 0.3463661

The misclassification rate obtained for training dataset is 0.3463.This means our model is around 66% accurate in predicting the bankruptcy status.

Out-of-Sample Performance of the Model

Now, let us test the performance of the model on the test dataset. The 20% of the data sampled initially will be used as the test dataset. To gauge the model performance on the test data, we will predict the bankruptcy status for the companies present in the test dataset using the Final Logistic Regression Model built above. We will check for the asymmetric misclassification rate and the Area under the Curve for test dataset. The test dataset contains 1088 observations.

Let us plot the ROC Curve for test dataset.

prob.outsample <- predict(final.model, bankruptcy.test, type = "response")
roc.plot(bankruptcy.test$DLRSN == "1", prob.outsample)$roc.vol

## Warning in roc.plot.default(bankruptcy.test$DLRSN == "1", prob.outsample):
## Large amount of unique predictions used as thresholds. Consider specifying
## thresholds.

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8562927 1.475403e-43          NA

The Area Under the Curve (AUC) for the test dataset is 0.8562.

We now calculate the asymmetric classification rate for the test data set. It is to be noted that we will be using the value of cut-off probability as 0.06(the value we had derived using the cost function).

bankruptcy.test$pred <-  ifelse(prob.outsample>opt.cutoff,1,0)
table(pred = bankruptcy.test$pred , true = bankruptcy.test$DLRSN)

##     true
## pred   0   1
##    0 573  14
##    1 371 130

mean(bankruptcy.test$pred != bankruptcy.test$DLRSN)

## [1] 0.3538603

The misclassification rate obtained for the test dataset is 0.3538. This means our model shows nearly 65% accuracy in predicting the bankruptcy status for observations in the test dataset.

Conclusion

The misclassification rate and AUC (Area Under Curve) is close to each other for the training as well as the test dataset. The AUC for training dataset is slightly higher than the test data while the misclassification rate for the test data is slightly higher than the training dataset.

From the above process, we can say that the Logistic Regression model built has a decent performance for training as well as test dataset and is able to predict the bankruptcy status of the companies with an accuracy of around 65%.