Bankruptcy dataset is a dataset which contains the financial information and the bankruptcy status of the companies for specific years.This dataset is a subset of Man Xu’s project containing the financial information of the companies It contains 5436 observations with 13 variables.Variable CUSIP contains unique code to identify the company, variable FR determines the fiscal year, where approximately 85% of the records belong to year 1999. Variable DLRSN is the Bankruptcy/Non Bankruptcy flag, where 1 stands for Bankruptcy while 0 stands for Non Bankrupt companies. Variables R1 to R10 contain financial information which will be used while building a logistic regression model.
#Reading Dataset
bankruptcy.data <- read.csv("C:/Users/Devanshu/Downloads/bankruptcy.csv",header = TRUE, sep = "," )
head(bankruptcy.data)
## FYEAR DLRSN CUSIP R1 R2 R3 R4
## 1 1999 0 00036020 0.3071395 0.8870057 1.6476808 -0.19915760
## 2 1999 0 00036110 0.7607365 0.5924934 0.4530028 -0.36989014
## 3 1999 0 00037520 -0.5135961 0.3376148 0.2990145 -0.02907996
## 4 1994 1 00078110 -0.4661293 0.3707467 0.4960667 -0.37342897
## 5 1999 0 00079X10 2.0234223 0.2148764 0.1825948 6.69536040
## 6 1999 0 00086T10 0.9074985 0.3868797 0.4778914 -0.34715872
## R5 R6 R7 R8 R9 R10
## 1 1.0929640 -0.31328867 -0.19679332 1.2067628 0.2824709 0.15889625
## 2 0.1861539 0.03961907 0.32749732 0.4284181 1.1069652 0.79344276
## 3 -0.4326047 0.82999324 -0.70778613 0.4761533 2.1791755 2.48458450
## 4 -0.2674240 0.97779888 -0.61097507 0.4568098 0.1519511 0.04778851
## 5 -1.1483381 -1.50588927 2.87647687 0.2873749 -0.9864421 0.79107673
## 6 1.4079893 -0.48390230 0.07025944 0.5278110 0.5024659 -0.16464802
The first step before performing any modeling is to look at the high-level summary of a dataset, understand the variables and look for any abnormalities or hidden facts present in the data.
We look at summary statistics and distribution of the observations of each variable.
summary(bankruptcy.data)
## FYEAR DLRSN CUSIP R1
## Min. :1980 Min. :0.0000 00036020: 1 Min. :-4.3828
## 1st Qu.:1999 1st Qu.:0.0000 00036110: 1 1st Qu.:-0.7501
## Median :1999 Median :0.0000 00037520: 1 Median :-0.2220
## Mean :1998 Mean :0.1428 00078110: 1 Mean :-0.2352
## 3rd Qu.:1999 3rd Qu.:0.0000 00079X10: 1 3rd Qu.: 0.4821
## Max. :1999 Max. :1.0000 00086T10: 1 Max. : 2.0234
## (Other) :5430
## R2 R3 R4 R5
## Min. :-2.2418 Min. :-2.06423 Min. :-0.42712 Min. :-1.3639
## 1st Qu.:-1.0805 1st Qu.:-1.07887 1st Qu.:-0.38752 1st Qu.:-0.8748
## Median : 0.1337 Median : 0.06858 Median :-0.30754 Median :-0.3508
## Mean :-0.2915 Mean :-0.24411 Mean : 0.23903 Mean :-0.1338
## 3rd Qu.: 0.5103 3rd Qu.: 0.50070 3rd Qu.: 0.02746 3rd Qu.: 0.2878
## Max. : 1.4854 Max. : 2.14246 Max. : 6.69536 Max. : 4.0362
##
## R6 R7 R8
## Min. :-1.505889 Min. :-1.23340 Min. :-2.2082
## 1st Qu.:-0.656827 1st Qu.:-0.77996 1st Qu.:-1.0054
## Median : 0.003493 Median :-0.43205 Median : 0.2053
## Mean : 0.195707 Mean :-0.09612 Mean :-0.2272
## 3rd Qu.: 0.608376 3rd Qu.: 0.17578 3rd Qu.: 0.5289
## Max. : 5.110424 Max. : 2.87648 Max. : 2.0006
##
## R9 R10
## Min. :-2.76356 Min. :-2.2140
## 1st Qu.:-0.66606 1st Qu.:-0.6413
## Median : 0.06219 Median : 0.1230
## Mean : 0.02538 Mean : 0.1806
## 3rd Qu.: 0.81215 3rd Qu.: 0.9878
## Max. : 2.17918 Max. : 2.4846
##
There are no missing values in the data set.
We now check the distributions of all independent variables.
par(mfrow = c(3,4))
i <- 4
for (i in 4:13)
{
hist((bankruptcy.data[,i]), main = paste("Distibution of ", colnames(bankruptcy.data[i])), xlab = colnames(bankruptcy.data[i]))
}
R9 and R10 seem to be close to a normal distribution and nothing concrete can be commented about the others. R2, R3 and R8 have heavy tails.
Let us see what is the distribution of the bankruptcy indicator in the dataset.
par(mfrow = c(1,1))
barplot(table(bankruptcy.data$DLRSN), main = "Frequency of Bankruptcy Flag")
Number of Non- Bankrupt Companies are way more than the Bankrupt ones in the data set.
Now let us check the correlation between independent variables and the bankruptcy flag.
library(GGally)
ggcorr(bankruptcy.data[,-3], label = TRUE)
We can see that R10 has the highest correlation with the bankruptcy flag with the correlation coefficient as -0.425 followed by R6 with a correlation coefficient of 0.331. We see a high correlation between R8 and R3 and moderate correlation among other variables.
We first split our dataset into training and test data set. We randomly select 80% of the observations as a part of the training data set and 20% of the observations as the test data set. The model will be built on the training data set and will be tested on the test data set.
#Sampling for training and testing datasets
set.seed(12420053)
train.ind <- sample(seq_len(nrow(bankruptcy.data)), size = floor(0.80*nrow(bankruptcy.data)))
bankruptcy.train <- bankruptcy.data[train.ind,]
bankruptcy.test <- bankruptcy.data[-train.ind,]
Since the response variable is qualitative i.e. we classify the company as bankrupt or not, we use the Logistic Regression technique for model creation.
We build two models. First model is the null model containing only the bankruptcy indicator and no predictors. The second model is the full model containing all predictors.
Now we select the best model using the stepwise selection method - with BIC criterion.
library(MASS)
model.full <- glm(DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10, family = "binomial" ,data = bankruptcy.train)
model.null <- glm(DLRSN ~ 1, family = "binomial" ,data = bankruptcy.train)
step.BIC <- step(model.null, scope = list(lower = model.null, upper = model.full), direction = "both", k = log(nrow(bankruptcy.train)))
summary(step.BIC)
##
## Call:
## glm(formula = DLRSN ~ R10 + R7 + R9 + R3 + R2 + R6 + R8, family = "binomial",
## data = bankruptcy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1913 -0.4505 -0.2360 -0.1086 3.1897
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.59128 0.07875 -32.903 < 2e-16 ***
## R10 -1.63097 0.08298 -19.655 < 2e-16 ***
## R7 -0.33386 0.07715 -4.327 1.51e-05 ***
## R9 0.47607 0.08271 5.756 8.60e-09 ***
## R3 -0.52465 0.09957 -5.269 1.37e-07 ***
## R2 0.56774 0.07837 7.244 4.35e-13 ***
## R6 0.17677 0.04328 4.084 4.42e-05 ***
## R8 -0.28079 0.08376 -3.352 0.000802 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3605.1 on 4347 degrees of freedom
## Residual deviance: 2398.3 on 4340 degrees of freedom
## AIC: 2414.3
##
## Number of Fisher Scoring iterations: 6
The model with BIC criterion has 7 predictors namely R2, R3, R6, R7, R8, R9 and R10.
Once we have the final logistic regression model, we will plot the ROC curve to get the area under the curve. The ROC curve is plot of Sensitivity (True Positives) vs Specificity (False Positives). Higher area under the curve, better is the model performance.
final.model <- glm(DLRSN ~ R2 + R3 + R6 + R7 + R8 + R9 + R10, family = "binomial" ,data = bankruptcy.train)
library(fields)
library(verification)
prob.insample <- predict(final.model, type = "response")
roc.plot(bankruptcy.train$DLRSN == "1", prob.insample)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.881367 2.865588e-207 NA
The above figure shows the ROC Curve for the training Dataset and the Area under the Curve (AUC) obtained is 0.8813.
For obtaining the misclassification rate, we need to decide the threshold for categorizing the bankruptcy status. By using the cost function, we will plot a graph of Cost of Training data vs threshold value and choose the value with lowest cost as the threshold for classifying the bankruptcy status. The ratio of weightage for False Positive and False Negative is 1:15. Here we are trying to say that wrongly predicting the companies which will go bankrupt as Non-Bankrupt is more serious and risky than predicting the vis versa. Therefore, it is important to reduce the False Negatives.
#Optimal Cutoff
searchgrid = seq(0.01, 0.99, 0.01)
result = cbind(searchgrid, NA)
cost1 <- function(r, pi) {
weight1 = 15
weight0 = 1
c1 = (r == 1) & (pi < cutoff) #logical vector - true if actual 1 but predict 0
c0 = (r == 0) & (pi > cutoff) #logical vecotr - true if actual 0 but predict 1
return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
cutoff <- result[i, 1]
# assign the cost to the 2nd col
result[i, 2] <- cost1(bankruptcy.train$DLRSN, predict(final.model, type = "response"))
}
plot(result, ylab = "Cost in Training Set", main = "Asymmetric Cost Function")
opt.cutoff <- searchgrid[which(result[,2]==min(result[,2]))]
opt.cutoff
## [1] 0.06
The optimal cutoff probability comes out to be 0.06.
We now calculate the misclassification rate of the training dataset using this threshold value.
bankruptcy.train$pred <- ifelse(prob.insample>opt.cutoff,1,0)
table(pred = bankruptcy.train$pred , true = bankruptcy.train$DLRSN)
## true
## pred 0 1
## 0 2256 46
## 1 1460 586
mean(bankruptcy.train$pred != bankruptcy.train$DLRSN)
## [1] 0.3463661
The misclassification rate obtained for training dataset is 0.3463.This means our model is around 66% accurate in predicting the bankruptcy status.
Now, let us test the performance of the model on the test dataset. The 20% of the data sampled initially will be used as the test dataset. To gauge the model performance on the test data, we will predict the bankruptcy status for the companies present in the test dataset using the Final Logistic Regression Model built above. We will check for the asymmetric misclassification rate and the Area under the Curve for test dataset. The test dataset contains 1088 observations.
Let us plot the ROC Curve for test dataset.
prob.outsample <- predict(final.model, bankruptcy.test, type = "response")
roc.plot(bankruptcy.test$DLRSN == "1", prob.outsample)$roc.vol
## Warning in roc.plot.default(bankruptcy.test$DLRSN == "1", prob.outsample):
## Large amount of unique predictions used as thresholds. Consider specifying
## thresholds.
## Model Area p.value binorm.area
## 1 Model 1 0.8562927 1.475403e-43 NA
The Area Under the Curve (AUC) for the test dataset is 0.8562.
We now calculate the asymmetric classification rate for the test data set. It is to be noted that we will be using the value of cut-off probability as 0.06(the value we had derived using the cost function).
bankruptcy.test$pred <- ifelse(prob.outsample>opt.cutoff,1,0)
table(pred = bankruptcy.test$pred , true = bankruptcy.test$DLRSN)
## true
## pred 0 1
## 0 573 14
## 1 371 130
mean(bankruptcy.test$pred != bankruptcy.test$DLRSN)
## [1] 0.3538603
The misclassification rate obtained for the test dataset is 0.3538. This means our model shows nearly 65% accuracy in predicting the bankruptcy status for observations in the test dataset.
The misclassification rate and AUC (Area Under Curve) is close to each other for the training as well as the test dataset. The AUC for training dataset is slightly higher than the test data while the misclassification rate for the test data is slightly higher than the training dataset.
From the above process, we can say that the Logistic Regression model built has a decent performance for training as well as test dataset and is able to predict the bankruptcy status of the companies with an accuracy of around 65%.