The data consist of 66 failed firms that went bankrupt sometime between 1970 and 1982. For each failed firm, a healthy firm of approximately the same size from the same industry was chosen for comparison. For all 132 firms, the dataset records 24 financial ratios used to measure firm health.
Here is a list of the identifier variables, the abbreviations used in the ratio definitions, and the 24 financial ratios:
NO: firm ID number
D: Healthy or Failed Firm (0 = Failed, 1 = Healthy)
YR: Year of bankruptcy
Abbreviations: ASSETS = total assets; CASH = cash; CFFO = cash flow from operations; COGS = cost of goods sold; CURASS = current assets; CURDEBT = current debt; DEBTS = total debt; INC = income; INCDEP = income plus depreciation; INV = inventory; REC = receivables; SALES = sales; WCFO = working capital from operations
R1 (CASH / CURDEBT)
R2 (CASH / SALES)
R3 (CASH / ASSETS)
R4 (CASH / DEBTS)
R5 (CFFO / SALES)
R6 (CFFO / ASSETS)
R7 (CFFO / DEBTS)
R8 (COGS / INV)
R9 (CURASS / CURDEBT)
R10 (CURASS / SALES)
R11 (CURASS / ASSETS)
R12 (CURDEBT / DEBTS)
R13 (INC / SALES)
R14 (INC / ASSETS)
R15 (INC / DEBTS)
R16 (INCDEP / SALES)
R17 (INCDEP / ASSETS)
R18 (INCDEP / DEBTS)
R19 (SALES / REC)
R20 (SALES / ASSETS)
R21 (ASSETS / DEBTS)
R22 (WCFO / SALES)
R23 (WCFO / ASSETS)
R24 (WCFO / DEBTS)
First, I’ll explore the data to gain a preliminary understanding of which variables might be important. I’ll begin with a correlation plot to see which variables correlate most strongly with the target variable ‘D’.
setwd('C:/Users/danny/Downloads')
bank.df <- read.csv('Bankruptcy.csv', header = TRUE)
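# quick sanity check: D should split into 66 failed (0) and 66 healthy (1) firms
table(bank.df$D)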
# begin with a corr plot
cols <- c('D', paste0('R', 1:24))  # D plus the 24 financial ratios
suppressWarnings(library(ggcorrplot))
ggcorrplot(cor(bank.df[,cols]), ggtheme = theme_classic, outline.color = 'white',
colors = c("#6D9EC1", "white", "#E46726"), lab = TRUE)
The variables with the highest correlations with D are:
R9 (CURASS / CURDEBT): coefficient of 0.47
R17 (INCDEP / ASSETS): coefficient of 0.47
R23 (WCFO / ASSETS): coefficient of 0.46
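The same ranking can be pulled straight out of the correlation matrix computed above; a small sketch using the same cols vector:
# correlations of each ratio with D, sorted by absolute value
d.cor <- cor(bank.df[, cols])['D', -1]
round(sort(abs(d.cor), decreasing = TRUE)[1:5], 2)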
Since these three variables are the most correlated with bankruptcy, I’ll examine each of them in a separate boxplot.
library(ggplot2)
# Boxplot - R9 vs D
ggplot(bank.df[, cols], aes(x = D, y = R9, fill = factor(D))) +
  geom_jitter(alpha = .25) +
  geom_boxplot(alpha = .5, aes(group = D)) +
  labs(x = 'Healthy or Failed Firm (0 = Failed, 1 = Healthy)',
       y = 'R9 (CURASS / CURDEBT)') +
  scale_fill_discrete(name="Healthy or Failed Firm?\n\n0 = Failed\n1 = Healthy") +
  ggtitle('R9 (CURASS / CURDEBT) grouped by D (Healthy or Failed Firm) Boxplot')
Bankruptcy seems to be associated with lower values of R9 (CURASS / CURDEBT)
# Boxplot - R17 vs D
ggplot(bank.df[, cols], aes(x = D, y = R17, fill = factor(D))) +
  geom_jitter(alpha = .25) +
  geom_boxplot(alpha = .5, aes(group = D)) +
  labs(x = 'Healthy or Failed Firm (0 = Failed, 1 = Healthy)',
       y = 'R17 (INCDEP / ASSETS)') +
  scale_fill_discrete(name="Healthy or Failed Firm?\n\n0 = Failed\n1 = Healthy") +
  ggtitle('R17 (INCDEP / ASSETS) grouped by D (Healthy or Failed Firm) Boxplot')
Bankruptcy seems to be associated with lower values of R17 (INCDEP / ASSETS)
# Boxplot - R23 vs D
ggplot(bank.df[, cols], aes(x = D, y = R23, fill = factor(D))) +
  geom_jitter(alpha = .25) +
  geom_boxplot(alpha = .5, aes(group = D)) +
  labs(x = 'Healthy or Failed Firm (0 = Failed, 1 = Healthy)',
       y = 'R23 (WCFO / ASSETS)') +
  scale_fill_discrete(name="Healthy or Failed Firm?\n\n0 = Failed\n1 = Healthy") +
  ggtitle('R23 (WCFO / ASSETS) grouped by D (Healthy or Failed Firm) Boxplot')
Bankruptcy seems to be associated with lower values of R23 (WCFO / ASSETS)
Next I’ll use PCA to assess whether there are groups of variables that convey the same information and how important that information is.
# Make cols vector of all predictor financial ratio variables for PCA
cols <- paste0('R', 1:24)  # all 24 predictor ratios
pcs <- prcomp(bank.df[,cols], scale. = T)
summary(pcs)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.9846 1.7708 1.6492 1.47268 1.30930 1.04699
## Proportion of Variance 0.3712 0.1307 0.1133 0.09037 0.07143 0.04567
## Cumulative Proportion 0.3712 0.5018 0.6151 0.70551 0.77694 0.82262
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.02900 0.9625 0.89910 0.69731 0.51962 0.48177
## Proportion of Variance 0.04412 0.0386 0.03368 0.02026 0.01125 0.00967
## Cumulative Proportion 0.86674 0.9053 0.93902 0.95928 0.97053 0.98020
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.39314 0.33854 0.2498 0.21756 0.17159 0.15977
## Proportion of Variance 0.00644 0.00478 0.0026 0.00197 0.00123 0.00106
## Cumulative Proportion 0.98664 0.99142 0.9940 0.99599 0.99722 0.99828
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.12564 0.09732 0.08479 0.06974 0.04889 0.03954
## Proportion of Variance 0.00066 0.00039 0.00030 0.00020 0.00010 0.00007
## Cumulative Proportion 0.99894 0.99933 0.99963 0.99984 0.99993 1.00000
PC1 and PC2 account for 50.19% of the total variability.
The first five principal components account for 77.7% of the total variability.
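To see which ratios load together on the leading components (the groups of variables mentioned above), the loadings and a scree plot can be inspected; a minimal sketch using the pcs object fitted above:
# ratios with the largest loadings on PC1, alongside their PC2 loadings
round(pcs$rotation[order(abs(pcs$rotation[, 1]), decreasing = TRUE), 1:2], 3)
# scree plot of the variance explained by each component
screeplot(pcs, npcs = 24, type = 'lines', main = 'Variance explained by each component')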
Before I move on to the various classifiers I’ll use to try to predict bankruptcy, I’ll use stepwise regression to see whether some predictors can be dropped.
cols <- c('D', paste0('R', 1:24))
## partitioning into training (60%) and validation (40%)
set.seed(1)
train.rows <- sample(rownames(bank.df), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.rows,]
valid.rows <- setdiff(rownames(bank.df), train.rows)
valid.df <- bank.df[valid.rows,]
# run logistic regression
logit.reg <- glm(D ~ ., data = train.df[,cols], family = "binomial")
options(scipen=999)
# stepwise regression -- backward
bank.lm.step.back <- step(logit.reg, direction = 'backward')
summary(bank.lm.step.back) # check which variables it dropped
##
## Call:
## glm(formula = D ~ R1 + R4 + R5 + R8 + R9 + R10 + R13, family = "binomial",
## data = train.df[, cols])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.87133 -0.40126 0.00423 0.39459 2.07313
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.66478 2.51602 -3.046 0.00232 **
## R1 -9.41290 3.97121 -2.370 0.01777 *
## R4 11.30753 6.46450 1.749 0.08026 .
## R5 13.08825 5.55260 2.357 0.01842 *
## R8 0.06296 0.04125 1.526 0.12695
## R9 5.24899 1.75314 2.994 0.00275 **
## R10 -5.28893 2.17079 -2.436 0.01483 *
## R13 49.52674 16.80252 2.948 0.00320 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107.98 on 78 degrees of freedom
## Residual deviance: 46.26 on 71 degrees of freedom
## AIC: 62.26
##
## Number of Fisher Scoring iterations: 8
# stepwise regression -- forward
bank.lm.step.forward <- step(logit.reg, direction = 'forward')
summary(bank.lm.step.forward) # check which variables it dropped
##
## Call:
## glm(formula = D ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 +
## R10 + R11 + R12 + R13 + R14 + R15 + R16 + R17 + R18 + R19 +
## R20 + R21 + R22 + R23 + R24, family = "binomial", data = train.df[,
## cols])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7648 -0.2132 0.0000 0.2771 2.3818
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.390819 6.775068 -0.500 0.6167
## R1 -19.650633 11.101399 -1.770 0.0767 .
## R2 18.452872 36.877939 0.500 0.6168
## R3 -42.330638 56.532125 -0.749 0.4540
## R4 44.949386 36.307994 1.238 0.2157
## R5 28.993741 20.050087 1.446 0.1482
## R6 0.396395 64.210888 0.006 0.9951
## R7 -11.665523 34.841143 -0.335 0.7378
## R8 0.096131 0.062827 1.530 0.1260
## R9 10.857686 6.686559 1.624 0.1044
## R10 -6.014019 5.424448 -1.109 0.2676
## R11 -13.096096 18.420252 -0.711 0.4771
## R12 7.828232 16.355506 0.479 0.6322
## R13 33.064908 60.030458 0.551 0.5818
## R14 -20.190030 113.662651 -0.178 0.8590
## R15 19.949880 78.742003 0.253 0.8000
## R16 -23.158619 45.504278 -0.509 0.6108
## R17 -77.761187 130.356922 -0.597 0.5508
## R18 64.310392 96.878790 0.664 0.5068
## R19 -0.003571 0.022080 -0.162 0.8715
## R20 -0.101112 0.747634 -0.135 0.8924
## R21 -6.511287 7.992917 -0.815 0.4153
## R22 2.270397 48.570072 0.047 0.9627
## R23 114.181228 159.848002 0.714 0.4750
## R24 -63.563853 114.476286 -0.555 0.5787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107.981 on 78 degrees of freedom
## Residual deviance: 40.646 on 54 degrees of freedom
## AIC: 90.646
##
## Number of Fisher Scoring iterations: 10
# stepwise regression -- both
bank.lm.step.both <- step(logit.reg, direction = 'both')
summary(bank.lm.step.both) # check which variables it dropped
##
## Call:
## glm(formula = D ~ R1 + R4 + R5 + R8 + R9 + R10 + R13, family = "binomial",
## data = train.df[, cols])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.87133 -0.40126 0.00423 0.39459 2.07313
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.66478 2.51602 -3.046 0.00232 **
## R1 -9.41290 3.97121 -2.370 0.01777 *
## R4 11.30753 6.46450 1.749 0.08026 .
## R5 13.08825 5.55260 2.357 0.01842 *
## R8 0.06296 0.04125 1.526 0.12695
## R9 5.24899 1.75314 2.994 0.00275 **
## R10 -5.28893 2.17079 -2.436 0.01483 *
## R13 49.52674 16.80252 2.948 0.00320 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107.98 on 78 degrees of freedom
## Residual deviance: 46.26 on 71 degrees of freedom
## AIC: 62.26
##
## Number of Fisher Scoring iterations: 8
Now that I’ve completed the three stepwise searches, I’ll compare the accuracy of all three resulting models to see which one is best.
library(forecast)
# use each of these models separately to predict the validation set.
bank.lm.step.back.pred <- predict(bank.lm.step.back, valid.df[,cols])
bank.lm.step.forward.pred <- predict(bank.lm.step.forward, valid.df[,cols])
bank.lm.step.both.pred <- predict(bank.lm.step.both, valid.df[,cols])
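# note: these predictions are on the logit (link) scale, so accuracy() below measures
# numeric error in log-odds units rather than classification accuracy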
# Compare accuracy of all 3 methods
accuracy(bank.lm.step.back.pred, valid.df[,cols]$D)
## ME RMSE MAE MPE MAPE
## Test set 0.4637688 9.891104 6.367956 NaN Inf
accuracy(bank.lm.step.forward.pred, valid.df[,cols]$D)
## ME RMSE MAE MPE MAPE
## Test set -0.2593213 11.74749 8.437461 NaN Inf
accuracy(bank.lm.step.both.pred, valid.df[,cols]$D)
## ME RMSE MAE MPE MAPE
## Test set 0.4637688 9.891104 6.367956 NaN Inf
Backward elimination and the ‘both’ direction generate the same reduced model (R1, R4, R5, R8, R9, R10, R13).
Because step() was started from the full model with no scope, the forward search has nothing to add and simply returns the full 24-predictor model, dropping nothing.
On the validation set the backward/both model has the lower RMSE (9.89 vs. 11.75), although the forward (full) model’s ME is closer to 0.
Judged by RMSE, the reduced model from backward elimination is the better candidate.
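For reference, a forward search is normally started from the intercept-only model with the full model as the upper scope; a sketch of that setup, using the same train.df and cols as above:
# forward selection from the intercept-only model, with all 24 ratios as the upper scope
null.mod <- glm(D ~ 1, data = train.df[, cols], family = 'binomial')
full.fml <- as.formula(paste('D ~', paste(setdiff(cols, 'D'), collapse = ' + ')))
fwd <- step(null.mod, scope = list(lower = ~ 1, upper = full.fml), direction = 'forward')
summary(fwd)  # predictors are added one at a time by AIC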
In this section I will use various classifiers to predict firm bankruptcy and assess each model’s predictive performance on a validation partition.
Since I already created a logistic regression model in the previous section, I just need to test its accuracy.
library(caret)
# testing logit.reg
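# note: predict() without type = 'response' returns values on the logit (link) scale,
# so the 0.5 cutoff below is applied to log-odds rather than probabilities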
pred <- predict(logit.reg, train.df[, cols])
confusionMatrix(factor(ifelse(pred > 0.5, 1, 0)), factor(train.df[, cols]$D), positive = '0')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 32 6
## 1 2 39
##
## Accuracy : 0.8987
## 95% CI : (0.8102, 0.9553)
## No Information Rate : 0.5696
## P-Value [Acc > NIR] : 0.0000000001589
##
## Kappa : 0.7964
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9412
## Specificity : 0.8667
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.9512
## Prevalence : 0.4304
## Detection Rate : 0.4051
## Detection Prevalence : 0.4810
## Balanced Accuracy : 0.9039
##
## 'Positive' Class : 0
##
pred <- predict(logit.reg, valid.df[, cols])
confusionMatrix(factor(ifelse(pred > 0.5, 1, 0)), factor(valid.df[, cols]$D), positive = '0')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 3
## 1 7 18
##
## Accuracy : 0.8113
## 95% CI : (0.6803, 0.9056)
## No Information Rate : 0.6038
## P-Value [Acc > NIR] : 0.001046
##
## Kappa : 0.6182
##
## Mcnemar's Test P-Value : 0.342782
##
## Sensitivity : 0.7812
## Specificity : 0.8571
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.7200
## Prevalence : 0.6038
## Detection Rate : 0.4717
## Detection Prevalence : 0.5283
## Balanced Accuracy : 0.8192
##
## 'Positive' Class : 0
##
The logistic regression model has an accuracy of 89.87% on the training data and 81.13% on the validation data.
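For comparison, the same check can be run on the probability scale with the usual 0.5 probability cutoff; a sketch (the counts may differ slightly from the matrix above, which thresholds the log-odds at 0.5):
prob.valid <- predict(logit.reg, valid.df[, cols], type = 'response')
confusionMatrix(factor(ifelse(prob.valid > 0.5, 1, 0)), factor(valid.df[, cols]$D), positive = '0')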
One other thing I can do is check the summary of the logistic model to see which variables (if any) have low p-values.
summary(logit.reg)
##
## Call:
## glm(formula = D ~ ., family = "binomial", data = train.df[, cols])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7648 -0.2132 0.0000 0.2771 2.3818
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.390819 6.775068 -0.500 0.6167
## R1 -19.650633 11.101399 -1.770 0.0767 .
## R2 18.452872 36.877939 0.500 0.6168
## R3 -42.330638 56.532125 -0.749 0.4540
## R4 44.949386 36.307994 1.238 0.2157
## R5 28.993741 20.050087 1.446 0.1482
## R6 0.396395 64.210888 0.006 0.9951
## R7 -11.665523 34.841143 -0.335 0.7378
## R8 0.096131 0.062827 1.530 0.1260
## R9 10.857686 6.686559 1.624 0.1044
## R10 -6.014019 5.424448 -1.109 0.2676
## R11 -13.096096 18.420252 -0.711 0.4771
## R12 7.828232 16.355506 0.479 0.6322
## R13 33.064908 60.030458 0.551 0.5818
## R14 -20.190030 113.662651 -0.178 0.8590
## R15 19.949880 78.742003 0.253 0.8000
## R16 -23.158619 45.504278 -0.509 0.6108
## R17 -77.761187 130.356922 -0.597 0.5508
## R18 64.310392 96.878790 0.664 0.5068
## R19 -0.003571 0.022080 -0.162 0.8715
## R20 -0.101112 0.747634 -0.135 0.8924
## R21 -6.511287 7.992917 -0.815 0.4153
## R22 2.270397 48.570072 0.047 0.9627
## R23 114.181228 159.848002 0.714 0.4750
## R24 -63.563853 114.476286 -0.555 0.5787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107.981 on 78 degrees of freedom
## Residual deviance: 40.646 on 54 degrees of freedom
## AIC: 90.646
##
## Number of Fisher Scoring iterations: 10
Not many variables have low p-values (none is below 0.05).
Only one variable has a p-value below 0.10:
R1 (CASH / CURDEBT): p-value of 0.0767
The next smallest are R9 (0.1044) and R8 (0.1260).
Below is the code I used to create and assess a Neural Network model.
bank.df <- read.csv('Bankruptcy.csv', header = TRUE)
cols <- paste0('R', 1:24)
cols2 <- c('D', cols)
# create new df with only variables of interest
bank2.df <- bank.df[,cols2]
## partitioning into training (60%) and validation (40%)
set.seed(1)
train.rows <- sample(row.names(bank2.df), dim(bank2.df)[1]*0.6)
train.df <- bank2.df[train.rows,]
valid.rows <- setdiff(row.names(bank2.df), train.rows)
valid.df <- bank2.df[valid.rows,]
# initialize normalized training, validation data, complete data frames to originals
train.norm.df <- train.df
valid.norm.df <- valid.df
bank.norm.df <- bank2.df
# use preProcess() from the caret package to normalize
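# (method = 'range' rescales each column to [0, 1]; keeping the inputs on a common, bounded scale helps the network's sigmoid units train)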
norm.values <- preProcess(train.df[, cols2], method=c("range"))
train.norm.df[, cols2] <- predict(norm.values, train.df[, cols2])
valid.norm.df[, cols2] <- predict(norm.values, valid.df[, cols2])
bank.norm.df[, cols2] <- predict(norm.values, bank2.df[, cols2])
#install.packages('neuralnet')
library(neuralnet)
# Fit a neural network model to the data, using a single hidden layer with 2 nodes
# single hidden layer with 2 nodes; training data
nn <- neuralnet(D ~., data = train.norm.df[,cols2], linear.output = FALSE, hidden = 2)
set.seed(1)
predict_trainNN1 <- compute(nn, train.norm.df[,cols])
prob <- predict_trainNN1$net.result
pred <- ifelse(prob>0.5, 1, 0)
confusionMatrix(factor(pred),factor(bank.df[train.rows,]$D), positive = '0')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 2
## 1 3 43
##
## Accuracy : 0.9367
## 95% CI : (0.8584, 0.9791)
## No Information Rate : 0.5696
## P-Value [Acc > NIR] : 0.0000000000002984
##
## Kappa : 0.8704
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9118
## Specificity : 0.9556
## Pos Pred Value : 0.9394
## Neg Pred Value : 0.9348
## Prevalence : 0.4304
## Detection Rate : 0.3924
## Detection Prevalence : 0.4177
## Balanced Accuracy : 0.9337
##
## 'Positive' Class : 0
##
# 93.67% accurate on the training data (2 nodes)
# (a 5-node variant was also run; its results are summarized below)
# single hidden layer with 2 nodes; validation data
set.seed(1)
predict_validNN1 <- compute(nn, valid.norm.df[,cols])
prob <- predict_validNN1$net.result
pred <- ifelse(prob>0.5, 1, 0)
confusionMatrix(factor(pred),factor(bank.df[valid.rows,]$D), positive = '0')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 2
## 1 9 19
##
## Accuracy : 0.7925
## 95% CI : (0.6589, 0.8916)
## No Information Rate : 0.6038
## P-Value [Acc > NIR] : 0.00285
##
## Kappa : 0.5897
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.7188
## Specificity : 0.9048
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.6786
## Prevalence : 0.6038
## Detection Rate : 0.4340
## Detection Prevalence : 0.4717
## Balanced Accuracy : 0.8118
##
## 'Positive' Class : 0
##
# 79.25% accurate on the validation data (2 nodes)
1 layer, 2 nodes:
train: 93.67%
valid: 79.25%
1 layer, 5 nodes (separate run, not shown):
train: 98.73%
valid: 81.13%
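The 5-node variant referenced above can be reproduced with the same setup; a sketch (the exact numbers depend on the random weight initialization):
# same single-hidden-layer network, but with 5 nodes
set.seed(1)
nn5 <- neuralnet(D ~ ., data = train.norm.df[, cols2], linear.output = FALSE, hidden = 5)
pred5 <- ifelse(compute(nn5, valid.norm.df[, cols])$net.result > 0.5, 1, 0)
confusionMatrix(factor(pred5), factor(bank.df[valid.rows, ]$D), positive = '0')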
My final model will be built using k-NN.
cols <- paste0('R', 1:24)
cols2 <- c('D', cols)
bank.df <- read.csv('Bankruptcy.csv', header = TRUE)
## partitioning into training (60%) and validation (40%)
set.seed(1)
train.rows <- sample(rownames(bank.df), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.rows,]
valid.rows <- setdiff(rownames(bank.df), train.rows)
valid.df <- bank.df[valid.rows,]
library(caret)
# re-initialize the normalized data frames from the new partitions
train.norm.df <- train.df
valid.norm.df <- valid.df
bank.norm.df <- bank.df
# use preProcess() from the caret package to normalize (z-scores this time)
norm.values <- preProcess(train.df[, cols], method=c("center", "scale"))
train.norm.df[, cols] <- predict(norm.values, train.df[, cols])
valid.norm.df[, cols] <- predict(norm.values, valid.df[, cols])
bank.norm.df[, cols] <- predict(norm.values, bank.df[, cols])
library(FNN)
nn <- knn(train = train.norm.df[, cols], test = valid.norm.df[, cols],
cl = train.norm.df[, 'D'], k = 1)
row.names(train.df)[attr(nn, "nn.index")]
## [1] "90" "112" "112" "86" "39" "34" "112" "106" "33" "125" "119"
## [12] "25" "43" "18" "28" "112" "35" "86" "29" "43" "76" "86"
## [23] "34" "20" "14" "69" "40" "7" "39" "34" "43" "125" "126"
## [34] "86" "120" "93" "70" "69" "92" "51" "117" "74" "92" "97"
## [45] "97" "85" "48" "99" "85" "103" "84" "79" "68"
# initialize a data frame with two columns: k, and accuracy.
accuracy.df <- data.frame(k = seq(1, 14, 1), accuracy = rep(0, 14))
# compute knn for different k on validation.
for(i in 1:14) {
knn.pred <- knn(train.norm.df[, cols], valid.norm.df[, cols],
cl = train.norm.df[, 'D'], k = i)
accuracy.df[i, 2] <- confusionMatrix(knn.pred, factor(valid.norm.df[, 'D']), positive = '0')$overall[1]
}
accuracy.df
## k accuracy
## 1 1 0.6981132
## 2 2 0.7735849
## 3 3 0.8113208
## 4 4 0.8301887
## 5 5 0.7735849
## 6 6 0.7735849
## 7 7 0.7358491
## 8 8 0.7547170
## 9 9 0.7735849
## 10 10 0.7547170
## 11 11 0.7735849
## 12 12 0.7924528
## 13 13 0.7735849
## 14 14 0.8113208
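A quick plot of this table makes the relationship between k and validation accuracy easier to see; a minimal base-R sketch:
# validation accuracy as a function of k
plot(accuracy.df$k, accuracy.df$accuracy, type = 'b', xlab = 'k', ylab = 'Validation accuracy')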
# The accuracy table peaks at k = 4 (83.02%); the matrix below uses k = 7
# Show the confusion matrix for the validation data using k = 7
knn.pred <- knn(train.norm.df[, cols], valid.norm.df[, cols],
cl = train.norm.df[, 'D'], k = 7)
confusionMatrix(knn.pred, factor(valid.norm.df[, 'D']), positive = '0')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22 4
## 1 10 17
##
## Accuracy : 0.7358
## 95% CI : (0.5967, 0.8474)
## No Information Rate : 0.6038
## P-Value [Acc > NIR] : 0.03165
##
## Kappa : 0.4738
##
## Mcnemar's Test P-Value : 0.18145
##
## Sensitivity : 0.6875
## Specificity : 0.8095
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.6296
## Prevalence : 0.6038
## Detection Rate : 0.4151
## Detection Prevalence : 0.4906
## Balanced Accuracy : 0.7485
##
## 'Positive' Class : 0
##
# Accuracy of 73.58%
The k-NN model with k = 7 has a validation accuracy of 73.58%; the best k in the table above is k = 4, at 83.02%.
Below are the models I created along with their validation accuracies.
Logistic regression (81.13%)
Neural network (79.25% with 2 nodes; 81.13% with 5 nodes)
k-NN (83.02% at its best k of 4)
Using the best k from the table, the k-nearest neighbors model is the most accurate of the three.