This report details a full analysis of firm financial data for the purpose of predicting bankruptcy two years in advance.

The data consists of 66 firms that went bankrupt sometime between 1970 and 1982. For each of these failed firms, a healthy firm of approximately the same size and from the same industry was chosen for comparison. For all 132 firms, the dataset uses 24 financial ratios to measure firm health.

The Dataset

Here is a list of the financial variables along with their abbreviations and definitions:

NO: firm ID number
D: Healthy or Failed firm (0 = Failed, 1 = Healthy)
YR: Year of bankruptcy

The 24 ratios are built from the following quantities: ASSETS (total assets), CASH (cash), CFFO (cash flow from operations), COGS (cost of goods sold), CURASS (current assets), CURDEBT (current debt), DEBTS (total debt), INC (income), INCDEP (income plus depreciation), INV (inventory), REC (receivables), SALES (sales), and WCFO (working capital from operations).

R1: CASH / CURDEBT
R2: CASH / SALES
R3: CASH / ASSETS
R4: CASH / DEBTS
R5: CFFO / SALES
R6: CFFO / ASSETS
R7: CFFO / DEBTS
R8: COGS / INV
R9: CURASS / CURDEBT
R10: CURASS / SALES
R11: CURASS / ASSETS
R12: CURDEBT / DEBTS
R13: INC / SALES
R14: INC / ASSETS
R15: INC / DEBTS
R16: INCDEP / SALES
R17: INCDEP / ASSETS
R18: INCDEP / DEBTS
R19: SALES / REC
R20: SALES / ASSETS
R21: ASSETS / DEBTS
R22: WCFO / SALES
R23: WCFO / ASSETS
R24: WCFO / DEBTS

Preliminary Exploration

First, I’ll explore the data to gain a preliminary understanding of which variables might be important. I’ll begin by looking at a correlation plot to see which variables correlate most strongly with our target variable, D.

setwd('C:/Users/danny/Downloads')

bank.df <- read.csv('Bankruptcy.csv', header = TRUE)
# begin with a correlation plot of D and the 24 ratios
cols <- c('D', paste0('R', 1:24))

suppressWarnings(library(ggcorrplot))

ggcorrplot(cor(bank.df[,cols]), ggtheme = theme_classic, outline.color = 'white', 
           colors = c("#6D9EC1", "white", "#E46726"), lab = TRUE)

The variables with the highest correlations with D are:

R9 (CURASS / CURDEBT): with a coefficient of 0.47
R17 (INCDEP / ASSETS): with a coefficient of 0.47
R23 (WCFO / ASSETS): with a coefficient of 0.46
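These values can be double-checked by sorting the correlations with D directly; a quick sketch:

# rank the ratios by absolute correlation with D and show the top three
d.cor <- cor(bank.df[, cols])['D', -1]   # row for D, dropping its self-correlation
head(sort(abs(d.cor), decreasing = TRUE), 3)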

Since these three variables are the most correlated with bankruptcy, I’ll focus on each of them in a separate boxplot.

library(ggplot2)
# Boxplot - R9 vs D
ggplot(aes(x= D, y= R9, fill = factor(D)), data = bank.df[,cols]) +
  geom_jitter(alpha = .25) +
  geom_boxplot(alpha = .5, aes(group = D)) +
  labs(x= 'Healthy or Failed Firm (0 = Failed, 1 = Healthy)',
       y= 'R9 (CURASS / CURDEBT)') +
  scale_fill_discrete(name="Healthy or Failed Firm?\n\n0 = Failed\n1 = Healthy") +
  ggtitle('R9 (CURASS / CURDEBT) grouped by D (Healthy or Failed Firm) Boxplot')

Bankruptcy seems to be associated with lower values of R9 (CURASS / CURDEBT).

# Boxplot - R17 vs D
ggplot(aes(x= D, y= R17, fill = factor(D)), data = bank.df[,cols]) +
  geom_jitter(alpha = .25) +
  geom_boxplot(alpha = .5, aes(group = D)) +
  labs(x= 'Healthy or Failed Firm (0 = Failed, 1 = Healthy)',
       y= 'R17 (INCDEP / ASSETS)') +
  scale_fill_discrete(name="Healthy or Failed Firm?\n\n0 = Failed\n1 = Healthy") +
  ggtitle('R17 (INCDEP / ASSETS) grouped by D (Healthy or Failed Firm) Boxplot')

Bankruptcy seems to be associated with lower values of R17 (INCDEP / ASSETS).

# Boxplot - R23 vs D
ggplot(aes(x= D, y= R23, fill = factor(D)), data = bank.df[,cols]) +
  geom_jitter(alpha = .25) +
  geom_boxplot(alpha = .5, aes(group = D)) +
  labs(x= 'Healthy or Failed Firm (0 = Failed, 1 = Healthy)',
       y= 'R23 (WCFO / ASSETS)') +
  scale_fill_discrete(name="Healthy or Failed Firm?\n\n0 = Failed\n1 = Healthy") +
  ggtitle('R23 (WCFO / ASSETS) grouped by D (Healthy or Failed Firm) Boxplot')

Bankruptcy seems to be associated with lower values of R23 (WCFO / ASSETS).

Principal Component Analysis

Next, I’ll use principal component analysis (PCA) to assess whether there are groups of variables that convey the same information, and how important that information is.

# make a cols vector of all 24 predictor financial ratios for PCA
cols <- paste0('R', 1:24)

pcs <- prcomp(bank.df[,cols], scale. = T)
summary(pcs)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.9846 1.7708 1.6492 1.47268 1.30930 1.04699
## Proportion of Variance 0.3712 0.1307 0.1133 0.09037 0.07143 0.04567
## Cumulative Proportion  0.3712 0.5018 0.6151 0.70551 0.77694 0.82262
##                            PC7    PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.02900 0.9625 0.89910 0.69731 0.51962 0.48177
## Proportion of Variance 0.04412 0.0386 0.03368 0.02026 0.01125 0.00967
## Cumulative Proportion  0.86674 0.9053 0.93902 0.95928 0.97053 0.98020
##                           PC13    PC14   PC15    PC16    PC17    PC18
## Standard deviation     0.39314 0.33854 0.2498 0.21756 0.17159 0.15977
## Proportion of Variance 0.00644 0.00478 0.0026 0.00197 0.00123 0.00106
## Cumulative Proportion  0.98664 0.99142 0.9940 0.99599 0.99722 0.99828
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.12564 0.09732 0.08479 0.06974 0.04889 0.03954
## Proportion of Variance 0.00066 0.00039 0.00030 0.00020 0.00010 0.00007
## Cumulative Proportion  0.99894 0.99933 0.99963 0.99984 0.99993 1.00000

PC1 and PC2 together account for 50.18% of the total variability.

The first five principal components account for 77.7%.
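A scree plot of the cumulative variance makes this taper easy to see; a minimal sketch using the pcs object above:

# cumulative proportion of variance explained by successive components
cum.var <- cumsum(pcs$sdev^2) / sum(pcs$sdev^2)
plot(cum.var, type = 'b', xlab = 'Principal component',
     ylab = 'Cumulative proportion of variance', ylim = c(0, 1))
abline(h = 0.777, lty = 2)  # the first 5 PCs reach about 77.7%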

Stepwise Regression

Before I move on to the various classifiers I’ll use to try to predict bankruptcy, I’ll run stepwise regression to see whether some predictors can be dropped.

cols <- c('D', paste0('R', 1:24))

## partitioning into training (60%) and validation (40%)
set.seed(1)

train.rows <- sample(rownames(bank.df), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.rows,]

valid.rows <- setdiff(rownames(bank.df), train.rows)
valid.df <- bank.df[valid.rows,]

# run logistic regression
logit.reg <- glm(D ~ ., data = train.df[,cols], family = "binomial")
options(scipen=999)

# stepwise regression -- backward
bank.lm.step.back <- step(logit.reg, direction = 'backward')
summary(bank.lm.step.back) # check which variables it dropped
## 
## Call:
## glm(formula = D ~ R1 + R4 + R5 + R8 + R9 + R10 + R13, family = "binomial", 
##     data = train.df[, cols])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.87133  -0.40126   0.00423   0.39459   2.07313  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -7.66478    2.51602  -3.046  0.00232 **
## R1          -9.41290    3.97121  -2.370  0.01777 * 
## R4          11.30753    6.46450   1.749  0.08026 . 
## R5          13.08825    5.55260   2.357  0.01842 * 
## R8           0.06296    0.04125   1.526  0.12695   
## R9           5.24899    1.75314   2.994  0.00275 **
## R10         -5.28893    2.17079  -2.436  0.01483 * 
## R13         49.52674   16.80252   2.948  0.00320 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.98  on 78  degrees of freedom
## Residual deviance:  46.26  on 71  degrees of freedom
## AIC: 62.26
## 
## Number of Fisher Scoring iterations: 8
# stepwise regression -- forward
# (starting from the full model, forward selection has nothing to add, so it
# returns the full 24-predictor model unchanged)
bank.lm.step.forward <- step(logit.reg, direction = 'forward')
summary(bank.lm.step.forward) # check which variables it dropped (none)
## 
## Call:
## glm(formula = D ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + 
##     R10 + R11 + R12 + R13 + R14 + R15 + R16 + R17 + R18 + R19 + 
##     R20 + R21 + R22 + R23 + R24, family = "binomial", data = train.df[, 
##     cols])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7648  -0.2132   0.0000   0.2771   2.3818  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -3.390819   6.775068  -0.500   0.6167  
## R1          -19.650633  11.101399  -1.770   0.0767 .
## R2           18.452872  36.877939   0.500   0.6168  
## R3          -42.330638  56.532125  -0.749   0.4540  
## R4           44.949386  36.307994   1.238   0.2157  
## R5           28.993741  20.050087   1.446   0.1482  
## R6            0.396395  64.210888   0.006   0.9951  
## R7          -11.665523  34.841143  -0.335   0.7378  
## R8            0.096131   0.062827   1.530   0.1260  
## R9           10.857686   6.686559   1.624   0.1044  
## R10          -6.014019   5.424448  -1.109   0.2676  
## R11         -13.096096  18.420252  -0.711   0.4771  
## R12           7.828232  16.355506   0.479   0.6322  
## R13          33.064908  60.030458   0.551   0.5818  
## R14         -20.190030 113.662651  -0.178   0.8590  
## R15          19.949880  78.742003   0.253   0.8000  
## R16         -23.158619  45.504278  -0.509   0.6108  
## R17         -77.761187 130.356922  -0.597   0.5508  
## R18          64.310392  96.878790   0.664   0.5068  
## R19          -0.003571   0.022080  -0.162   0.8715  
## R20          -0.101112   0.747634  -0.135   0.8924  
## R21          -6.511287   7.992917  -0.815   0.4153  
## R22           2.270397  48.570072   0.047   0.9627  
## R23         114.181228 159.848002   0.714   0.4750  
## R24         -63.563853 114.476286  -0.555   0.5787  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.981  on 78  degrees of freedom
## Residual deviance:  40.646  on 54  degrees of freedom
## AIC: 90.646
## 
## Number of Fisher Scoring iterations: 10
# stepwise regression -- both
bank.lm.step.both <- step(logit.reg, direction = 'both')
summary(bank.lm.step.both) # check which variables it dropped
## 
## Call:
## glm(formula = D ~ R1 + R4 + R5 + R8 + R9 + R10 + R13, family = "binomial", 
##     data = train.df[, cols])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.87133  -0.40126   0.00423   0.39459   2.07313  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -7.66478    2.51602  -3.046  0.00232 **
## R1          -9.41290    3.97121  -2.370  0.01777 * 
## R4          11.30753    6.46450   1.749  0.08026 . 
## R5          13.08825    5.55260   2.357  0.01842 * 
## R8           0.06296    0.04125   1.526  0.12695   
## R9           5.24899    1.75314   2.994  0.00275 **
## R10         -5.28893    2.17079  -2.436  0.01483 * 
## R13         49.52674   16.80252   2.948  0.00320 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.98  on 78  degrees of freedom
## Residual deviance:  46.26  on 71  degrees of freedom
## AIC: 62.26
## 
## Number of Fisher Scoring iterations: 8

Now that I’ve completed the three types of stepwise selection, I’ll compare the accuracy of all three models to see which one is best.

library(forecast)

# use each of these models separately to predict the validation set.
bank.lm.step.back.pred <- predict(bank.lm.step.back, valid.df[,cols])

bank.lm.step.forward.pred <- predict(bank.lm.step.forward, valid.df[,cols])

bank.lm.step.both.pred <- predict(bank.lm.step.both, valid.df[,cols])

# Compare accuracy of all 3 methods
accuracy(bank.lm.step.back.pred, valid.df[,cols]$D)
##                 ME     RMSE      MAE MPE MAPE
## Test set 0.4637688 9.891104 6.367956 NaN  Inf
accuracy(bank.lm.step.forward.pred, valid.df[,cols]$D)
##                  ME     RMSE      MAE MPE MAPE
## Test set -0.2593213 11.74749 8.437461 NaN  Inf
accuracy(bank.lm.step.both.pred, valid.df[,cols]$D)
##                 ME     RMSE      MAE MPE MAPE
## Test set 0.4637688 9.891104 6.367956 NaN  Inf

Backward and both-direction selection generate the same seven-predictor model (R1, R4, R5, R8, R9, R10, R13).

The forward run, because it started from the already-full model, had nothing to add and simply returned the full 24-predictor model.

The backward/both model has the smaller RMSE (9.89 vs. 11.75) and MAE (6.37 vs. 8.44) on the validation set, so the model produced by backward elimination is the better of the two; it drops 17 of the 24 predictors.
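One caveat on the comparison above: predict() on a glm returns predictions on the log-odds scale by default, so these error measures compare log-odds against the 0/1 labels (and MPE/MAPE are NaN/Inf because the actual values include zeros). A minimal sketch of scoring on the probability scale instead, using type = "response":

# score the backward-selection model as probabilities rather than log-odds
back.prob <- predict(bank.lm.step.back, valid.df[, cols], type = "response")
accuracy(back.prob, valid.df[, cols]$D)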

Model creation, prediction, and assessment

In this section I will use various classifiers to predict firm bankruptcy and assess the predictive performance of each model on a validation partition.

Logistic Regression

Since I already created a logistic model in the previous section, I just need to test its accuracy.

library(caret)
# testing logit.reg on the training partition
# (predict() here returns log-odds, so the 0.5 cutoff below corresponds to a
# probability cutoff of about 0.62, not 0.5)
pred <- predict(logit.reg, train.df[, cols])

confusionMatrix(factor(ifelse(pred > 0.5, 1, 0)), factor(train.df[, cols]$D), positive = '0')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 32  6
##          1  2 39
##                                           
##                Accuracy : 0.8987          
##                  95% CI : (0.8102, 0.9553)
##     No Information Rate : 0.5696          
##     P-Value [Acc > NIR] : 0.0000000001589 
##                                           
##                   Kappa : 0.7964          
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9412          
##             Specificity : 0.8667          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.4304          
##          Detection Rate : 0.4051          
##    Detection Prevalence : 0.4810          
##       Balanced Accuracy : 0.9039          
##                                           
##        'Positive' Class : 0               
## 
pred <- predict(logit.reg, valid.df[, cols])

confusionMatrix(factor(ifelse(pred > 0.5, 1, 0)), factor(valid.df[, cols]$D), positive = '0')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  3
##          1  7 18
##                                           
##                Accuracy : 0.8113          
##                  95% CI : (0.6803, 0.9056)
##     No Information Rate : 0.6038          
##     P-Value [Acc > NIR] : 0.001046        
##                                           
##                   Kappa : 0.6182          
##                                           
##  Mcnemar's Test P-Value : 0.342782        
##                                           
##             Sensitivity : 0.7812          
##             Specificity : 0.8571          
##          Pos Pred Value : 0.8929          
##          Neg Pred Value : 0.7200          
##              Prevalence : 0.6038          
##          Detection Rate : 0.4717          
##    Detection Prevalence : 0.5283          
##       Balanced Accuracy : 0.8192          
##                                           
##        'Positive' Class : 0               
## 

The logistic regression model has an accuracy of 81.13% on the validation data.

One other thing I can do is check the summary of the logistic model to see which variables (if any) have low p-values.

summary(logit.reg)
## 
## Call:
## glm(formula = D ~ ., family = "binomial", data = train.df[, cols])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7648  -0.2132   0.0000   0.2771   2.3818  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -3.390819   6.775068  -0.500   0.6167  
## R1          -19.650633  11.101399  -1.770   0.0767 .
## R2           18.452872  36.877939   0.500   0.6168  
## R3          -42.330638  56.532125  -0.749   0.4540  
## R4           44.949386  36.307994   1.238   0.2157  
## R5           28.993741  20.050087   1.446   0.1482  
## R6            0.396395  64.210888   0.006   0.9951  
## R7          -11.665523  34.841143  -0.335   0.7378  
## R8            0.096131   0.062827   1.530   0.1260  
## R9           10.857686   6.686559   1.624   0.1044  
## R10          -6.014019   5.424448  -1.109   0.2676  
## R11         -13.096096  18.420252  -0.711   0.4771  
## R12           7.828232  16.355506   0.479   0.6322  
## R13          33.064908  60.030458   0.551   0.5818  
## R14         -20.190030 113.662651  -0.178   0.8590  
## R15          19.949880  78.742003   0.253   0.8000  
## R16         -23.158619  45.504278  -0.509   0.6108  
## R17         -77.761187 130.356922  -0.597   0.5508  
## R18          64.310392  96.878790   0.664   0.5068  
## R19          -0.003571   0.022080  -0.162   0.8715  
## R20          -0.101112   0.747634  -0.135   0.8924  
## R21          -6.511287   7.992917  -0.815   0.4153  
## R22           2.270397  48.570072   0.047   0.9627  
## R23         114.181228 159.848002   0.714   0.4750  
## R24         -63.563853 114.476286  -0.555   0.5787  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.981  on 78  degrees of freedom
## Residual deviance:  40.646  on 54  degrees of freedom
## AIC: 90.646
## 
## Number of Fisher Scoring iterations: 10

No variable is significant at the 0.05 level.

Only one variable has a p-value below 0.10:

R1 (CASH / CURDEBT): p-value of 0.0767

The uniformly large standard errors are a hint of multicollinearity among the 24 ratios, which is consistent with the correlated groups seen in the correlation plot and PCA.
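For reference, this check can be done programmatically straight from the model summary; a small sketch:

# pull the coefficient table and keep rows with p-values below 0.10
coefs <- summary(logit.reg)$coefficients
coefs[coefs[, 'Pr(>|z|)'] < 0.10, , drop = FALSE]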

Neural Network

Below is the code I used to create and assess a neural network model.

bank.df <- read.csv('Bankruptcy.csv', header = TRUE)


cols <- paste0('R', 1:24)
cols2 <- c('D', paste0('R', 1:24))

# create new df with only variables of interest
bank2.df <- bank.df[,cols2]

## partitioning into training (60%) and validation (40%)
set.seed(1)

train.rows <- sample(row.names(bank2.df), dim(bank2.df)[1]*0.6)
train.df <- bank2.df[train.rows,]

valid.rows <- setdiff(row.names(bank2.df), train.rows)
valid.df <- bank2.df[valid.rows,]


# initialize normalized training, validation data, complete data frames to originals
train.norm.df <- train.df
valid.norm.df <- valid.df
bank.norm.df <- bank2.df


# use preProcess() from the caret package to normalize 
norm.values <- preProcess(train.df[, cols2], method=c("range"))
train.norm.df[, cols2] <- predict(norm.values, train.df[, cols2])
valid.norm.df[, cols2] <- predict(norm.values, valid.df[, cols2])
bank.norm.df[, cols2] <- predict(norm.values, bank2.df[, cols2])

#install.packages('neuralnet')
library(neuralnet)

# Fit a neural network model to the data: a single hidden layer with 2 nodes
set.seed(1)  # seed before neuralnet() so the random starting weights are reproducible
nn <- neuralnet(D ~ ., data = train.norm.df[, cols2], linear.output = FALSE, hidden = 2)

predict_trainNN1 <- compute(nn, train.norm.df[,cols])
prob <- predict_trainNN1$net.result
pred <- ifelse(prob>0.5, 1, 0)
confusionMatrix(factor(pred),factor(bank.df[train.rows,]$D), positive = '0')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  2
##          1  3 43
##                                             
##                Accuracy : 0.9367            
##                  95% CI : (0.8584, 0.9791)  
##     No Information Rate : 0.5696            
##     P-Value [Acc > NIR] : 0.0000000000002984
##                                             
##                   Kappa : 0.8704            
##                                             
##  Mcnemar's Test P-Value : 1                 
##                                             
##             Sensitivity : 0.9118            
##             Specificity : 0.9556            
##          Pos Pred Value : 0.9394            
##          Neg Pred Value : 0.9348            
##              Prevalence : 0.4304            
##          Detection Rate : 0.3924            
##    Detection Prevalence : 0.4177            
##       Balanced Accuracy : 0.9337            
##                                             
##        'Positive' Class : 0                 
## 
# 93.67% accurate on training (2 nodes, the run shown above)
# 93.67% accurate on training (5 nodes, separate run; output not shown)



# single hidden layer with 2 nodes; validation data
set.seed(1)

predict_validNN1 <- compute(nn, valid.norm.df[,cols])
prob <- predict_validNN1$net.result
pred <- ifelse(prob>0.5, 1, 0)
confusionMatrix(factor(pred),factor(bank.df[valid.rows,]$D), positive = '0')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  2
##          1  9 19
##                                           
##                Accuracy : 0.7925          
##                  95% CI : (0.6589, 0.8916)
##     No Information Rate : 0.6038          
##     P-Value [Acc > NIR] : 0.00285         
##                                           
##                   Kappa : 0.5897          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.7188          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9200          
##          Neg Pred Value : 0.6786          
##              Prevalence : 0.6038          
##          Detection Rate : 0.4340          
##    Detection Prevalence : 0.4717          
##       Balanced Accuracy : 0.8118          
##                                           
##        'Positive' Class : 0               
## 
# 79.25% accurate (2 nodes, the run shown above)
# 84.91% accurate (5 nodes, separate run; output not shown)

1 hidden layer, 2 nodes (run shown above):

train: 93.67%

valid: 79.25%

1 hidden layer, 5 nodes (separate run, output not shown):

train: 93.67%

valid: 84.91%
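For completeness, the 5-node run referenced above can be reproduced along these lines (a sketch; results depend on the random weight initialization):

# single hidden layer with 5 nodes; same normalized partitions as above
set.seed(1)
nn5 <- neuralnet(D ~ ., data = train.norm.df[, cols2], linear.output = FALSE, hidden = 5)
prob5 <- compute(nn5, valid.norm.df[, cols])$net.result
confusionMatrix(factor(ifelse(prob5 > 0.5, 1, 0)), factor(valid.df$D), positive = '0')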

k-nearest neighbors algorithm

My final models will be built using k-NN.

cols <- paste0('R', 1:24)

cols2 <- c('D', paste0('R', 1:24))

bank.df <- read.csv('Bankruptcy.csv', header = TRUE)

## partitioning into training (60%) and validation (40%)
set.seed(1)

train.rows <- sample(rownames(bank.df), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.rows,]

valid.rows <- setdiff(rownames(bank.df), train.rows)
valid.df <- bank.df[valid.rows,]


library(caret)

# re-initialize the normalized data frames to the new partitions
# (otherwise this section would silently reuse the ones from the neural network section)
train.norm.df <- train.df
valid.norm.df <- valid.df
bank.norm.df <- bank.df

# use preProcess() from the caret package to normalize (z-scores this time)
norm.values <- preProcess(train.df[, cols], method = c("center", "scale"))
train.norm.df[, cols] <- predict(norm.values, train.df[, cols])
valid.norm.df[, cols] <- predict(norm.values, valid.df[, cols])
bank.norm.df[, cols] <- predict(norm.values, bank.df[, cols])

library(FNN)
# k = 1: for each validation case, find its single nearest training neighbor
# (named knn1 so it doesn't overwrite the neural network object nn)
knn1 <- knn(train = train.norm.df[, cols], test = valid.norm.df[, cols],
            cl = train.norm.df[, 'D'], k = 1)
row.names(train.df)[attr(knn1, "nn.index")]
##  [1] "90"  "112" "112" "86"  "39"  "34"  "112" "106" "33"  "125" "119"
## [12] "25"  "43"  "18"  "28"  "112" "35"  "86"  "29"  "43"  "76"  "86" 
## [23] "34"  "20"  "14"  "69"  "40"  "7"   "39"  "34"  "43"  "125" "126"
## [34] "86"  "120" "93"  "70"  "69"  "92"  "51"  "117" "74"  "92"  "97" 
## [45] "97"  "85"  "48"  "99"  "85"  "103" "84"  "79"  "68"
# initialize a data frame with two columns: k, and accuracy.
accuracy.df <- data.frame(k = seq(1, 14, 1), accuracy = rep(0, 14))
# compute knn for different k on validation.
for(i in 1:14) {
  knn.pred <- knn(train.norm.df[, cols], valid.norm.df[, cols],
                  cl = train.norm.df[, 'D'], k = i)
  accuracy.df[i, 2] <- confusionMatrix(knn.pred, factor(valid.norm.df[, 'D']), positive = '0')$overall[1]
}
accuracy.df
##     k  accuracy
## 1   1 0.6981132
## 2   2 0.7735849
## 3   3 0.8113208
## 4   4 0.8301887
## 5   5 0.7735849
## 6   6 0.7735849
## 7   7 0.7358491
## 8   8 0.7547170
## 9   9 0.7735849
## 10 10 0.7547170
## 11 11 0.7735849
## 12 12 0.7924528
## 13 13 0.7735849
## 14 14 0.8113208
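Plotting accuracy against k makes the pattern easier to eyeball; a quick sketch:

# validation accuracy as a function of k
plot(accuracy.df$k, accuracy.df$accuracy, type = 'b',
     xlab = 'k (number of neighbors)', ylab = 'Validation accuracy')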
# Note: k = 4 gives the highest validation accuracy (83.02%); I'll proceed with k = 7


# Show the confusion matrix for the validation data that results from using k = 7
knn.pred <- knn(train.norm.df[, cols], valid.norm.df[, cols],
                cl = train.norm.df[, 'D'], k = 7)
confusionMatrix(knn.pred, factor(valid.norm.df[, 'D']), positive = '0')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 22  4
##          1 10 17
##                                           
##                Accuracy : 0.7358          
##                  95% CI : (0.5967, 0.8474)
##     No Information Rate : 0.6038          
##     P-Value [Acc > NIR] : 0.03165         
##                                           
##                   Kappa : 0.4738          
##                                           
##  Mcnemar's Test P-Value : 0.18145         
##                                           
##             Sensitivity : 0.6875          
##             Specificity : 0.8095          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.6296          
##              Prevalence : 0.6038          
##          Detection Rate : 0.4151          
##    Detection Prevalence : 0.4906          
##       Balanced Accuracy : 0.7485          
##                                           
##        'Positive' Class : 0               
## 
# Accuracy of 73.58%

The k-NN model with k = 7 has an accuracy of 73.58% on the validation set (k = 4 scored higher, at 83.02%, in the scan above).

Conclusions

Below are the models I created along with their validation accuracies:

Logistic regression (all 24 predictors): 81.13%
Neural network (1 hidden layer, 2 nodes): 79.25%
k-NN (k = 7): 73.58%

Of the three runs shown above, the logistic regression model was the most accurate on the validation partition, though both alternatives showed headroom: the 5-node neural network run reached 84.91%, and k-NN reached 83.02% at k = 4.