By the end of this activity, students should be able to
Train and tune neural networks in R,
Change model architecture and choose an appropriate loss and metric functions.
Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or modifying R codes - then you modify the R codes here. In either case, you can discuss your work with the instructor.
We will work with companies bankruptcy data (it comes from Poland)
Source: https://www.kaggle.com/c/companies-bankruptcy-forecast/data
First we will load libraries and the data into R. Note that libraries that are not installed will be installed automatically. You can use this method of package installation in future - it will help to run your R codes on another computer.
Below are the dimensions of the data
library(tidyverse)
library(neuralnet)
library(caret)
B <- read.csv("bankruptcy_Train.csv")
Below are variable names
names(B)
## [1] "Attr1" "Attr2" "Attr3" "Attr4" "Attr5" "Attr6" "Attr7" "Attr8"
## [9] "Attr9" "Attr10" "Attr11" "Attr12" "Attr13" "Attr14" "Attr15" "Attr16"
## [17] "Attr17" "Attr18" "Attr19" "Attr20" "Attr21" "Attr22" "Attr23" "Attr24"
## [25] "Attr25" "Attr26" "Attr27" "Attr28" "Attr29" "Attr30" "Attr31" "Attr32"
## [33] "Attr33" "Attr34" "Attr35" "Attr36" "Attr37" "Attr38" "Attr39" "Attr40"
## [41] "Attr41" "Attr42" "Attr43" "Attr44" "Attr45" "Attr46" "Attr47" "Attr48"
## [49] "Attr49" "Attr50" "Attr51" "Attr52" "Attr53" "Attr54" "Attr55" "Attr56"
## [57] "Attr57" "Attr58" "Attr59" "Attr60" "Attr61" "Attr62" "Attr63" "Attr64"
## [65] "class"
We will randomly split the data into 50% training and 50% test sets.
set.seed(2021)
random_choice <- runif(nrow(B)) < 0.5
ind_train <- which(random_choice)
train_data <- B %>% slice(ind_train)
test_data <- B %>% slice(-ind_train)
cat("Records in the training set are", head(ind_train),"...\n")
## Records in the training set are 1 4 8 11 16 21 ...
Note that names are not informative, but the variables themselves are meaningful (look at the data source). Below are proportions of the two classes.
B %>% group_by(class) %>%
summarise(N = n())
## `summarise()` ungrouping output (override with `.groups` argument)
Here, 1 means bancrupcy and 0 means non-bankrupcy. The dataset is highly imbalanced.
Create a smaller dataset with only 1000 records. It should include all the 203 bankrupt cases from the original dataset and a random sample of 797 non-bankrupt cases. Then split this small dataset into 50% training and 50% test sets. Overwrite train_data
and test_data
.
set.seed(2021)
positive_cases <- B %>% filter(class == 1)
mini_data <- B %>%
filter(class == 0) %>%
sample_n(797) %>%
rbind(positive_cases)
dim(mini_data)
## [1] 1000 65
Now we have the following distribution of labels in the mini data:
mini_data %>% group_by(class) %>%
summarise(N = n())
## `summarise()` ungrouping output (override with `.groups` argument)
And now we will randomly split the data into 50% training and 50% test sets.
set.seed(2021)
ind_train <- sample(1:nrow(mini_data), round(nrow(mini_data) / 2))
train_data <- mini_data %>% slice(ind_train)
test_data <- mini_data %>% slice(-ind_train)
cat("Training data dim =", dim(train_data),"\n")
## Training data dim = 500 65
cat("Test data dim =", dim(test_data),"\n")
## Test data dim = 500 65
Distribution of labels in the training data:
train_data %>% group_by(class) %>%
summarise(N = n())
## `summarise()` ungrouping output (override with `.groups` argument)
Distribution of labels in the test data:
test_data %>% group_by(class) %>%
summarise(N = n())
## `summarise()` ungrouping output (override with `.groups` argument)
Here we will predict whether a company went bankrupt. We will first train a neural net model with 3 hidden layers of dimensions 5, 3, 2 respectively. The error function should be the cross-entropy and we set linear.output = FALSE
so that activation in the last layer is sigmoid.
Recall that sigmoid acvitation in the last layer is used for binary classification problems. For regression problems, we use the option linear.output = TRUE
.
set.seed(75)
mod_nn <- neuralnet(class ~ ., data = train_data,
hidden = c(5, 3, 2),
lifesign = "full",
err.fct = "ce",
linear.output = FALSE)
## hidden: 5, 3, 2 thresh: 0.01 rep: 1/1 steps: 1000 min thresh: 0.0273642064631657
## 2000 min thresh: 0.0273642064631657
## 3000 min thresh: 0.0164723058399647
## 3887 error: 7.02182 time: 4.06 secs
plot(mod_nn)
The number of parameters of our neural networks should be
65*5 + 6*3 + 4*2 + 3*1
## [1] 354
Below are final matrices of weights and biases in each layer:
mod_nn$weights
## [[1]]
## [[1]][[1]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.60431108 0.68271629 -1.68102448 0.10300801 -1.51406271
## [2,] -1.06814494 1.95303756 -11.55001799 0.83730389 -1.61019587
## [3,] -8.24378989 1.03696840 4.06564979 -0.68283487 0.71578591
## [4,] -0.93854257 -0.86519817 -3.49041571 0.87206125 1.20296788
## [5,] 1.11222983 0.17218177 3.42998460 -0.51838491 -0.52616810
## [6,] -12.05591690 -4.49777852 17.25645000 -65.30130418 -13.73932891
## [7,] -13.20465252 -2.00185027 18.45226031 7.02536620 -1.82481276
## [8,] 0.34716558 -2.12535742 14.91749262 -11.62606710 8.81724391
## [9,] -0.48749604 -0.49447813 -1.35788140 -2.31845667 1.52993632
## [10,] -0.04635635 0.54009596 -1.12919717 0.08042225 0.91240288
## [11,] 4.39096667 0.75387391 -1.78951125 8.45285849 -1.01439395
## [12,] 2.21022786 -0.67110322 -8.44072692 2.08826126 -1.58237335
## [13,] -1.61320415 -2.40080531 1.47251566 -3.62469716 -1.30680242
## [14,] 1.26215423 0.54935387 -0.12136892 0.03493443 1.82499008
## [15,] 0.76583498 -2.73376996 14.79645780 -10.50499480 9.68905445
## [16,] 2.10286295 1.21255775 11.73990971 2.50434295 -0.26175602
## [17,] -1.76206096 -0.29220730 -1.15900287 0.58490394 -1.18207328
## [18,] 0.80746281 0.58848660 -0.94810611 0.41989631 -1.20889701
## [19,] -1.17779919 0.37832435 4.80638546 -5.85219444 1.66661681
## [20,] -5.46886214 -1.11741643 2.20880414 -3.18537231 1.42981097
## [21,] -4.30576433 -18.02033332 -5.68465875 4.15019649 -4.04020949
## [22,] -1.17654704 -1.98448309 1.06700474 -0.83192537 0.74580970
## [23,] 0.58117408 0.47285467 -2.05741758 -9.45558392 1.95740693
## [24,] -2.76070738 -15.20142208 -18.31079414 1.83131671 -7.10026058
## [25,] -1.66149373 -5.96140687 9.43258480 -9.68449147 -7.01906442
## [26,] 5.06804154 3.78091268 2.47820412 0.61808175 -2.07313620
## [27,] -1.85877195 1.88751385 -4.70759441 1.87570260 -0.82571912
## [28,] -0.68069247 -0.71726878 1.67058882 -1.03350689 1.66832572
## [29,] -3.31938556 -2.49955040 0.14502271 -0.86621725 -1.16613786
## [30,] -0.90765780 0.04668382 -0.14674730 0.02370514 1.22630408
## [31,] -7.58426516 8.61365477 9.67616416 17.25925324 -19.87714097
## [32,] -24.36725999 -1.40476121 -0.10857375 -1.80977530 13.46246349
## [33,] 24.55116913 -21.65222752 -9.27120968 -18.98581840 15.99103544
## [34,] 0.35777047 -0.40155997 -9.32150354 2.01135695 -1.49046812
## [35,] 7.60869823 -2.91738343 -0.28180684 -0.29935786 0.95347572
## [36,] -0.68303122 1.42027361 3.75362358 -0.52109222 -2.36196715
## [37,] -1.34485446 0.51617938 1.77978451 1.03060304 -0.27933469
## [38,] 5.89860647 -3.94330053 -0.22496420 0.57792858 5.80855612
## [39,] -9.46427398 -0.89405278 14.12893978 2.11135717 -7.60599249
## [40,] 29.39506970 7.21070848 -15.02469585 -7.97080221 -7.84059562
## [41,] -2.70953230 -1.34592024 -2.08488616 -6.58300859 4.05400574
## [42,] -0.14431322 -1.09059109 -25.32436385 20.61447937 -0.62477123
## [43,] -9.75415117 6.84485781 8.92795145 -33.17669657 0.71217913
## [44,] -17.26737808 -2.19905768 -5.45492691 -3.22973066 -0.04349737
## [45,] -25.88749130 2.68811813 13.63980369 -16.08253999 -8.69241661
## [46,] 12.90872915 19.60117283 154.61295486 0.72283501 -36.94489586
## [47,] -0.79018601 1.22099565 11.63825961 -0.88756005 -0.80190073
## [48,] 7.44221683 -2.38016713 -3.51942727 -0.40152390 -6.58641677
## [49,] 2.21532370 0.79440746 5.56485529 0.13134037 1.79725620
## [50,] -19.97397376 -0.03754804 -14.64877842 -2.62815517 1.60601717
## [51,] -1.65656737 1.72308425 -4.13798691 -0.98969978 -1.97637214
## [52,] 0.12847827 1.06130699 -0.46435743 -2.40182458 -1.17160836
## [53,] 20.91730416 -3.87852324 0.47663238 -13.88658041 -5.00751557
## [54,] -3.78024190 11.02248284 0.04938895 -4.31093768 -1.55332140
## [55,] -10.31933334 0.54299494 5.40950876 -2.19158660 -0.73349757
## [56,] 0.70913466 0.36132069 5.17360822 2.35041135 3.59597501
## [57,] 18.58635415 -22.38511933 0.93033507 -15.17833885 1.71628892
## [58,] 19.32769517 -4.34667607 -62.07805468 79.07583539 1.23534065
## [59,] -0.55916293 0.20776035 1.03957689 0.64660630 -0.02218940
## [60,] 12.69212201 -8.64764643 -31.41648615 18.41535283 2.09853907
## [61,] -12.73747345 -14.24531516 10.24107617 19.61613163 -5.57700639
## [62,] -0.43325491 1.72539003 -3.94519773 -4.06060859 0.85945324
## [63,] -1.12273181 -4.55878236 1.87631281 -2.20239473 2.60055843
## [64,] -0.17664511 0.19157779 -0.76082948 1.44985424 -0.50108855
## [65,] 9.08345543 0.72288851 4.96778550 1.29116292 10.88466997
##
## [[1]][[2]]
## [,1] [,2] [,3]
## [1,] 1.286222 0.7020539 1.406864
## [2,] -3.846439 -0.2545404 -2.710771
## [3,] 3.917818 3.7666929 4.062724
## [4,] -2.018072 -3.5255266 -1.306198
## [5,] 2.983298 3.6883900 2.234086
## [6,] -2.865854 -5.3831612 -2.211112
##
## [[1]][[3]]
## [,1] [,2]
## [1,] -27.47565 -3.603100
## [2,] 56.98622 12.928058
## [3,] 114.10043 17.875314
## [4,] 76.89708 7.359352
##
## [[1]][[4]]
## [,1]
## [1,] 10.27186
## [2,] 41.99026
## [3,] -58.25023
The number of parameters can be also extracted from these matrices as follows:
mod_nn$weights[[1]] %>%
sapply(dim) %>%
apply(2, prod) %>%
sum
## [1] 354
And we can also see it as follows:
length(mod_nn$result.matrix) - 3
## [1] 354
The activation function is
mod_nn$act.fct
## function (x)
## {
## 1/(1 + exp(-x))
## }
## <bytecode: 0x0000000027168570>
## <environment: 0x0000000027168e68>
## attr(,"type")
## [1] "logistic"
The loss function is
mod_nn$err.fct
## function (x, y)
## {
## -(y * log(x) + (1 - y) * log(1 - x))
## }
## <bytecode: 0x0000000027164420>
## <environment: 0x0000000027165290>
## attr(,"type")
## [1] "ce"
Now we will construct test predictions and report the confusion matrix.
preds <- as.factor(0 + (predict(mod_nn, test_data) > 0.5))
confusionMatrix(preds, as.factor(test_data$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 364 79
## 1 26 31
##
## Accuracy : 0.79
## 95% CI : (0.7516, 0.8249)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.3163
##
## Kappa : 0.2601
##
## Mcnemar's Test P-Value : 3.881e-07
##
## Sensitivity : 0.9333
## Specificity : 0.2818
## Pos Pred Value : 0.8217
## Neg Pred Value : 0.5439
## Prevalence : 0.7800
## Detection Rate : 0.7280
## Detection Prevalence : 0.8860
## Balanced Accuracy : 0.6076
##
## 'Positive' Class : 0
##
What is the test accuracy of the baseline model that always predicts 0?
sum(test_data$class == 0) / nrow(test_data)
## [1] 0.78
What happens if we increase the number of layers and layer dimensions? Let’s try 7 hidden layers with 30 units in each layer.
set.seed(75)
mod_nn_deep <- neuralnet(class ~ ., data = train_data,
hidden = rep(30, 7),
lifesign = "full",
err.fct = "ce",
linear.output = FALSE)
## hidden: 30, 30, 30, 30, 30, 30, 30 thresh: 0.01 rep: 1/1 steps: 121 error: 0.00979 time: 2.15 secs
preds <- as.factor(0 + (predict(mod_nn_deep, test_data) > 0.5) )
confusionMatrix(preds, as.factor(test_data$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 335 63
## 1 55 47
##
## Accuracy : 0.764
## 95% CI : (0.7243, 0.8006)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.8210
##
## Kappa : 0.2939
##
## Mcnemar's Test P-Value : 0.5193
##
## Sensitivity : 0.8590
## Specificity : 0.4273
## Pos Pred Value : 0.8417
## Neg Pred Value : 0.4608
## Prevalence : 0.7800
## Detection Rate : 0.6700
## Detection Prevalence : 0.7960
## Balanced Accuracy : 0.6431
##
## 'Positive' Class : 0
##
Why didn’t the model’s accuracy improve?
Answer our model may be overfitting
We can tune the model using the train
method from the library caret
. Note that only models with 3 hidden layers can be trained by this method.
set.seed(75)
tune.grid.neuralnet <- expand.grid(
layer1 = c(50, 20, 10),
layer2 = c(20, 10, 5),
layer3 = c(3, 2, 1)
)
mod_nn_2 <- train(
class ~ . ,
data = train_data,
method = "neuralnet",
linear.output = FALSE,
tuneGrid = tune.grid.neuralnet, # cannot pass parameter hidden directly!!
metric = "RMSE",
trControl = trainControl(method = "cv", number = 5))
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
mod_nn_2
## Neural Network
##
## 500 samples
## 64 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## layer1 layer2 layer3 RMSE Rsquared MAE
## 10 5 1 0.3998825 0.2012422 0.1763252
## 10 5 2 0.3971417 0.2111815 0.1744492
## 10 5 3 0.4084575 0.1730876 0.1855241
## 10 10 1 0.4081001 0.2023376 0.1916990
## 10 10 2 0.4224462 0.1427129 0.1948755
## 10 10 3 0.3903390 0.2373030 0.1674458
## 10 20 1 0.4004492 0.1997220 0.1790735
## 10 20 2 0.4199324 0.1654852 0.2000652
## 10 20 3 0.4203653 0.1387490 0.1909716
## 20 5 1 0.3704361 0.2592923 0.1631674
## 20 5 2 0.3870812 0.2257079 0.1731324
## 20 5 3 0.3773181 0.2458773 0.1612272
## 20 10 1 0.4267131 0.1422676 0.2091904
## 20 10 2 0.4209697 0.1740845 0.2043676
## 20 10 3 0.3814126 0.2146065 0.1693697
## 20 20 1 0.3805500 0.2309660 0.1697271
## 20 20 2 0.3855011 0.2253153 0.1711371
## 20 20 3 0.3982365 0.1918417 0.1831975
## 50 5 1 0.3972125 0.1992303 0.1834211
## 50 5 2 0.3983526 0.1987116 0.1848707
## 50 5 3 0.3758235 0.2507362 0.1673165
## 50 10 1 0.3976205 0.1808487 0.1848926
## 50 10 2 0.3855857 0.2323172 0.1732149
## 50 10 3 0.3984896 0.1865720 0.1832599
## 50 20 1 0.3891976 0.2129353 0.1796962
## 50 20 2 0.3764137 0.2548640 0.1655929
## 50 20 3 0.4018003 0.1794333 0.1874106
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were layer1 = 20, layer2 = 5 and layer3 = 1.
The optimal combination of layer dimensions is
mod_nn_2$bestTune
Now let us find accuracy of this fine-tuned model
preds <- as.factor(0 + (predict(mod_nn_2, test_data) > 0.5) )
confusionMatrix(preds, as.factor(test_data$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 344 62
## 1 46 48
##
## Accuracy : 0.784
## 95% CI : (0.7453, 0.8193)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.4396
##
## Kappa : 0.336
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.8821
## Specificity : 0.4364
## Pos Pred Value : 0.8473
## Neg Pred Value : 0.5106
## Prevalence : 0.7800
## Detection Rate : 0.6880
## Detection Prevalence : 0.8120
## Balanced Accuracy : 0.6592
##
## 'Positive' Class : 0
##
What is fundamentally wrong with the method described above? Why didn’t we get the best possible accuracy? How do we train the model correctly?
Answer: the metric should be accuracy, not the root-mean-squared-error. Unfortunately, the library caret
does not allow training classification neuralnet
models - that’s why we pretended that our model is regression when we tuned it with the train
method. We can retrain the model with the correct error function and the combination of layer dimensions that we just found as follows:
set.seed(75)
mod_nn_final <- neuralnet(class ~ ., data = train_data,
hidden = c(10, 5, 1),
lifesign = "full",
err.fct = "ce",
linear.output = FALSE)
## hidden: 10, 5, 1 thresh: 0.01 rep: 1/1 steps: 280 error: 42.86785 time: 0.39 secs
preds <- as.factor(0 + (predict(mod_nn_final, test_data) > 0.5))
confusionMatrix(preds, as.factor(test_data$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 352 62
## 1 38 48
##
## Accuracy : 0.8
## 95% CI : (0.7622, 0.8342)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.15237
##
## Kappa : 0.3677
##
## Mcnemar's Test P-Value : 0.02145
##
## Sensitivity : 0.9026
## Specificity : 0.4364
## Pos Pred Value : 0.8502
## Neg Pred Value : 0.5581
## Prevalence : 0.7800
## Detection Rate : 0.7040
## Detection Prevalence : 0.8280
## Balanced Accuracy : 0.6695
##
## 'Positive' Class : 0
##