Info about the activity

Objectives

By the end of this activity, students should be able to

Train and tune neural networks in R,
Change model architecture and choose an appropriate loss and metric functions.

Mode

Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or modifying R codes - then you modify the R codes here. In either case, you can discuss your work with the instructor.

Data

We will work with companies bankruptcy data (it comes from Poland)

Source: https://www.kaggle.com/c/companies-bankruptcy-forecast/data

Loading data into R

First we will load libraries and the data into R. Note that libraries that are not installed will be installed automatically. You can use this method of package installation in future - it will help to run your R codes on another computer.

Below are the dimensions of the data

library(tidyverse)
library(neuralnet)
library(caret)

B <- read.csv("bankruptcy_Train.csv")

Below are variable names

names(B)

##  [1] "Attr1"  "Attr2"  "Attr3"  "Attr4"  "Attr5"  "Attr6"  "Attr7"  "Attr8" 
##  [9] "Attr9"  "Attr10" "Attr11" "Attr12" "Attr13" "Attr14" "Attr15" "Attr16"
## [17] "Attr17" "Attr18" "Attr19" "Attr20" "Attr21" "Attr22" "Attr23" "Attr24"
## [25] "Attr25" "Attr26" "Attr27" "Attr28" "Attr29" "Attr30" "Attr31" "Attr32"
## [33] "Attr33" "Attr34" "Attr35" "Attr36" "Attr37" "Attr38" "Attr39" "Attr40"
## [41] "Attr41" "Attr42" "Attr43" "Attr44" "Attr45" "Attr46" "Attr47" "Attr48"
## [49] "Attr49" "Attr50" "Attr51" "Attr52" "Attr53" "Attr54" "Attr55" "Attr56"
## [57] "Attr57" "Attr58" "Attr59" "Attr60" "Attr61" "Attr62" "Attr63" "Attr64"
## [65] "class"

We will randomly split the data into 50% training and 50% test sets.

set.seed(2021)
random_choice <- runif(nrow(B)) < 0.5
ind_train <- which(random_choice)

train_data <- B %>% slice(ind_train)
test_data <- B %>% slice(-ind_train)

cat("Records in the training set are", head(ind_train),"...\n")

## Records in the training set are 1 4 8 11 16 21 ...

Note that names are not informative, but the variables themselves are meaningful (look at the data source). Below are proportions of the two classes.

B %>% group_by(class) %>%
  summarise(N = n())

## `summarise()` ungrouping output (override with `.groups` argument)

Here, 1 means bancrupcy and 0 means non-bankrupcy. The dataset is highly imbalanced.

Question 1

Create a smaller dataset with only 1000 records. It should include all the 203 bankrupt cases from the original dataset and a random sample of 797 non-bankrupt cases. Then split this small dataset into 50% training and 50% test sets. Overwrite train_data and test_data.

set.seed(2021)
positive_cases <- B %>% filter(class == 1)
mini_data <- B %>% 
  filter(class == 0) %>%
  sample_n(797) %>%
  rbind(positive_cases)

dim(mini_data)

## [1] 1000   65

Now we have the following distribution of labels in the mini data:

mini_data %>% group_by(class) %>%
  summarise(N = n())

## `summarise()` ungrouping output (override with `.groups` argument)

And now we will randomly split the data into 50% training and 50% test sets.

set.seed(2021)
ind_train <- sample(1:nrow(mini_data), round(nrow(mini_data) / 2))

train_data <- mini_data %>% slice(ind_train)
test_data <- mini_data %>% slice(-ind_train)

cat("Training data dim =", dim(train_data),"\n")

## Training data dim = 500 65

cat("Test data dim =", dim(test_data),"\n")

## Test data dim = 500 65

Distribution of labels in the training data:

train_data %>% group_by(class) %>%
  summarise(N = n())

## `summarise()` ungrouping output (override with `.groups` argument)

Distribution of labels in the test data:

test_data %>% group_by(class) %>%
  summarise(N = n())

## `summarise()` ungrouping output (override with `.groups` argument)

Neural Net

Here we will predict whether a company went bankrupt. We will first train a neural net model with 3 hidden layers of dimensions 5, 3, 2 respectively. The error function should be the cross-entropy and we set linear.output = FALSE so that activation in the last layer is sigmoid.

Recall that sigmoid acvitation in the last layer is used for binary classification problems. For regression problems, we use the option linear.output = TRUE.

set.seed(75)

mod_nn <- neuralnet(class ~  ., data = train_data, 
                                    hidden = c(5, 3, 2),
                    lifesign = "full",
                    err.fct = "ce",
                    linear.output = FALSE)

## hidden: 5, 3, 2    thresh: 0.01    rep: 1/1    steps:    1000    min thresh: 0.0273642064631657
##                                                          2000    min thresh: 0.0273642064631657
##                                                          3000    min thresh: 0.0164723058399647
##                                                          3887    error: 7.02182  time: 4.06 secs

plot(mod_nn)

Looking inside the model

The number of parameters of our neural networks should be

65*5 + 6*3 + 4*2 + 3*1

## [1] 354

Below are final matrices of weights and biases in each layer:

mod_nn$weights

## [[1]]
## [[1]][[1]]
##               [,1]         [,2]         [,3]         [,4]         [,5]
##  [1,]  -0.60431108   0.68271629  -1.68102448   0.10300801  -1.51406271
##  [2,]  -1.06814494   1.95303756 -11.55001799   0.83730389  -1.61019587
##  [3,]  -8.24378989   1.03696840   4.06564979  -0.68283487   0.71578591
##  [4,]  -0.93854257  -0.86519817  -3.49041571   0.87206125   1.20296788
##  [5,]   1.11222983   0.17218177   3.42998460  -0.51838491  -0.52616810
##  [6,] -12.05591690  -4.49777852  17.25645000 -65.30130418 -13.73932891
##  [7,] -13.20465252  -2.00185027  18.45226031   7.02536620  -1.82481276
##  [8,]   0.34716558  -2.12535742  14.91749262 -11.62606710   8.81724391
##  [9,]  -0.48749604  -0.49447813  -1.35788140  -2.31845667   1.52993632
## [10,]  -0.04635635   0.54009596  -1.12919717   0.08042225   0.91240288
## [11,]   4.39096667   0.75387391  -1.78951125   8.45285849  -1.01439395
## [12,]   2.21022786  -0.67110322  -8.44072692   2.08826126  -1.58237335
## [13,]  -1.61320415  -2.40080531   1.47251566  -3.62469716  -1.30680242
## [14,]   1.26215423   0.54935387  -0.12136892   0.03493443   1.82499008
## [15,]   0.76583498  -2.73376996  14.79645780 -10.50499480   9.68905445
## [16,]   2.10286295   1.21255775  11.73990971   2.50434295  -0.26175602
## [17,]  -1.76206096  -0.29220730  -1.15900287   0.58490394  -1.18207328
## [18,]   0.80746281   0.58848660  -0.94810611   0.41989631  -1.20889701
## [19,]  -1.17779919   0.37832435   4.80638546  -5.85219444   1.66661681
## [20,]  -5.46886214  -1.11741643   2.20880414  -3.18537231   1.42981097
## [21,]  -4.30576433 -18.02033332  -5.68465875   4.15019649  -4.04020949
## [22,]  -1.17654704  -1.98448309   1.06700474  -0.83192537   0.74580970
## [23,]   0.58117408   0.47285467  -2.05741758  -9.45558392   1.95740693
## [24,]  -2.76070738 -15.20142208 -18.31079414   1.83131671  -7.10026058
## [25,]  -1.66149373  -5.96140687   9.43258480  -9.68449147  -7.01906442
## [26,]   5.06804154   3.78091268   2.47820412   0.61808175  -2.07313620
## [27,]  -1.85877195   1.88751385  -4.70759441   1.87570260  -0.82571912
## [28,]  -0.68069247  -0.71726878   1.67058882  -1.03350689   1.66832572
## [29,]  -3.31938556  -2.49955040   0.14502271  -0.86621725  -1.16613786
## [30,]  -0.90765780   0.04668382  -0.14674730   0.02370514   1.22630408
## [31,]  -7.58426516   8.61365477   9.67616416  17.25925324 -19.87714097
## [32,] -24.36725999  -1.40476121  -0.10857375  -1.80977530  13.46246349
## [33,]  24.55116913 -21.65222752  -9.27120968 -18.98581840  15.99103544
## [34,]   0.35777047  -0.40155997  -9.32150354   2.01135695  -1.49046812
## [35,]   7.60869823  -2.91738343  -0.28180684  -0.29935786   0.95347572
## [36,]  -0.68303122   1.42027361   3.75362358  -0.52109222  -2.36196715
## [37,]  -1.34485446   0.51617938   1.77978451   1.03060304  -0.27933469
## [38,]   5.89860647  -3.94330053  -0.22496420   0.57792858   5.80855612
## [39,]  -9.46427398  -0.89405278  14.12893978   2.11135717  -7.60599249
## [40,]  29.39506970   7.21070848 -15.02469585  -7.97080221  -7.84059562
## [41,]  -2.70953230  -1.34592024  -2.08488616  -6.58300859   4.05400574
## [42,]  -0.14431322  -1.09059109 -25.32436385  20.61447937  -0.62477123
## [43,]  -9.75415117   6.84485781   8.92795145 -33.17669657   0.71217913
## [44,] -17.26737808  -2.19905768  -5.45492691  -3.22973066  -0.04349737
## [45,] -25.88749130   2.68811813  13.63980369 -16.08253999  -8.69241661
## [46,]  12.90872915  19.60117283 154.61295486   0.72283501 -36.94489586
## [47,]  -0.79018601   1.22099565  11.63825961  -0.88756005  -0.80190073
## [48,]   7.44221683  -2.38016713  -3.51942727  -0.40152390  -6.58641677
## [49,]   2.21532370   0.79440746   5.56485529   0.13134037   1.79725620
## [50,] -19.97397376  -0.03754804 -14.64877842  -2.62815517   1.60601717
## [51,]  -1.65656737   1.72308425  -4.13798691  -0.98969978  -1.97637214
## [52,]   0.12847827   1.06130699  -0.46435743  -2.40182458  -1.17160836
## [53,]  20.91730416  -3.87852324   0.47663238 -13.88658041  -5.00751557
## [54,]  -3.78024190  11.02248284   0.04938895  -4.31093768  -1.55332140
## [55,] -10.31933334   0.54299494   5.40950876  -2.19158660  -0.73349757
## [56,]   0.70913466   0.36132069   5.17360822   2.35041135   3.59597501
## [57,]  18.58635415 -22.38511933   0.93033507 -15.17833885   1.71628892
## [58,]  19.32769517  -4.34667607 -62.07805468  79.07583539   1.23534065
## [59,]  -0.55916293   0.20776035   1.03957689   0.64660630  -0.02218940
## [60,]  12.69212201  -8.64764643 -31.41648615  18.41535283   2.09853907
## [61,] -12.73747345 -14.24531516  10.24107617  19.61613163  -5.57700639
## [62,]  -0.43325491   1.72539003  -3.94519773  -4.06060859   0.85945324
## [63,]  -1.12273181  -4.55878236   1.87631281  -2.20239473   2.60055843
## [64,]  -0.17664511   0.19157779  -0.76082948   1.44985424  -0.50108855
## [65,]   9.08345543   0.72288851   4.96778550   1.29116292  10.88466997
## 
## [[1]][[2]]
##           [,1]       [,2]      [,3]
## [1,]  1.286222  0.7020539  1.406864
## [2,] -3.846439 -0.2545404 -2.710771
## [3,]  3.917818  3.7666929  4.062724
## [4,] -2.018072 -3.5255266 -1.306198
## [5,]  2.983298  3.6883900  2.234086
## [6,] -2.865854 -5.3831612 -2.211112
## 
## [[1]][[3]]
##           [,1]      [,2]
## [1,] -27.47565 -3.603100
## [2,]  56.98622 12.928058
## [3,] 114.10043 17.875314
## [4,]  76.89708  7.359352
## 
## [[1]][[4]]
##           [,1]
## [1,]  10.27186
## [2,]  41.99026
## [3,] -58.25023

The number of parameters can be also extracted from these matrices as follows:

mod_nn$weights[[1]] %>% 
  sapply(dim) %>%
  apply(2, prod) %>%
  sum

## [1] 354

And we can also see it as follows:

length(mod_nn$result.matrix) - 3

## [1] 354

The activation function is

mod_nn$act.fct

## function (x) 
## {
##     1/(1 + exp(-x))
## }
## <bytecode: 0x0000000027168570>
## <environment: 0x0000000027168e68>
## attr(,"type")
## [1] "logistic"

The loss function is

mod_nn$err.fct

## function (x, y) 
## {
##     -(y * log(x) + (1 - y) * log(1 - x))
## }
## <bytecode: 0x0000000027164420>
## <environment: 0x0000000027165290>
## attr(,"type")
## [1] "ce"

Models’ accuracy

Now we will construct test predictions and report the confusion matrix.

preds <- as.factor(0 + (predict(mod_nn, test_data) > 0.5))
confusionMatrix(preds, as.factor(test_data$class))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 364  79
##          1  26  31
##                                           
##                Accuracy : 0.79            
##                  95% CI : (0.7516, 0.8249)
##     No Information Rate : 0.78            
##     P-Value [Acc > NIR] : 0.3163          
##                                           
##                   Kappa : 0.2601          
##                                           
##  Mcnemar's Test P-Value : 3.881e-07       
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.2818          
##          Pos Pred Value : 0.8217          
##          Neg Pred Value : 0.5439          
##              Prevalence : 0.7800          
##          Detection Rate : 0.7280          
##    Detection Prevalence : 0.8860          
##       Balanced Accuracy : 0.6076          
##                                           
##        'Positive' Class : 0               
##

Question 2

What is the test accuracy of the baseline model that always predicts 0?

sum(test_data$class == 0) / nrow(test_data)

## [1] 0.78

Going deeper

What happens if we increase the number of layers and layer dimensions? Let’s try 7 hidden layers with 30 units in each layer.

set.seed(75)

mod_nn_deep <- neuralnet(class ~  ., data = train_data, 
                                    hidden = rep(30, 7),
                    lifesign = "full",
                    err.fct = "ce",
                    linear.output = FALSE)

## hidden: 30, 30, 30, 30, 30, 30, 30    thresh: 0.01    rep: 1/1    steps:     121 error: 0.00979  time: 2.15 secs

preds <- as.factor(0 + (predict(mod_nn_deep, test_data) > 0.5) )
confusionMatrix(preds, as.factor(test_data$class))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 335  63
##          1  55  47
##                                           
##                Accuracy : 0.764           
##                  95% CI : (0.7243, 0.8006)
##     No Information Rate : 0.78            
##     P-Value [Acc > NIR] : 0.8210          
##                                           
##                   Kappa : 0.2939          
##                                           
##  Mcnemar's Test P-Value : 0.5193          
##                                           
##             Sensitivity : 0.8590          
##             Specificity : 0.4273          
##          Pos Pred Value : 0.8417          
##          Neg Pred Value : 0.4608          
##              Prevalence : 0.7800          
##          Detection Rate : 0.6700          
##    Detection Prevalence : 0.7960          
##       Balanced Accuracy : 0.6431          
##                                           
##        'Positive' Class : 0               
##

Question 3

Why didn’t the model’s accuracy improve?

Answer our model may be overfitting

Tuninig the model

We can tune the model using the train method from the library caret. Note that only models with 3 hidden layers can be trained by this method.

set.seed(75)

tune.grid.neuralnet <- expand.grid(
  layer1 = c(50, 20, 10),
  layer2 = c(20, 10, 5),
  layer3 = c(3, 2, 1)
)


mod_nn_2 <- train(
    class ~ . , 
    data = train_data,
    method = "neuralnet",
    linear.output = FALSE, 
    tuneGrid = tune.grid.neuralnet, # cannot pass parameter hidden directly!!
    metric = "RMSE",
    trControl = trainControl(method = "cv", number = 5))

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

mod_nn_2

## Neural Network 
## 
## 500 samples
##  64 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   layer1  layer2  layer3  RMSE       Rsquared   MAE      
##   10       5      1       0.3998825  0.2012422  0.1763252
##   10       5      2       0.3971417  0.2111815  0.1744492
##   10       5      3       0.4084575  0.1730876  0.1855241
##   10      10      1       0.4081001  0.2023376  0.1916990
##   10      10      2       0.4224462  0.1427129  0.1948755
##   10      10      3       0.3903390  0.2373030  0.1674458
##   10      20      1       0.4004492  0.1997220  0.1790735
##   10      20      2       0.4199324  0.1654852  0.2000652
##   10      20      3       0.4203653  0.1387490  0.1909716
##   20       5      1       0.3704361  0.2592923  0.1631674
##   20       5      2       0.3870812  0.2257079  0.1731324
##   20       5      3       0.3773181  0.2458773  0.1612272
##   20      10      1       0.4267131  0.1422676  0.2091904
##   20      10      2       0.4209697  0.1740845  0.2043676
##   20      10      3       0.3814126  0.2146065  0.1693697
##   20      20      1       0.3805500  0.2309660  0.1697271
##   20      20      2       0.3855011  0.2253153  0.1711371
##   20      20      3       0.3982365  0.1918417  0.1831975
##   50       5      1       0.3972125  0.1992303  0.1834211
##   50       5      2       0.3983526  0.1987116  0.1848707
##   50       5      3       0.3758235  0.2507362  0.1673165
##   50      10      1       0.3976205  0.1808487  0.1848926
##   50      10      2       0.3855857  0.2323172  0.1732149
##   50      10      3       0.3984896  0.1865720  0.1832599
##   50      20      1       0.3891976  0.2129353  0.1796962
##   50      20      2       0.3764137  0.2548640  0.1655929
##   50      20      3       0.4018003  0.1794333  0.1874106
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were layer1 = 20, layer2 = 5 and layer3 = 1.

The optimal combination of layer dimensions is

mod_nn_2$bestTune

Now let us find accuracy of this fine-tuned model

preds <- as.factor(0 + (predict(mod_nn_2, test_data) > 0.5) )
confusionMatrix(preds, as.factor(test_data$class))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 344  62
##          1  46  48
##                                           
##                Accuracy : 0.784           
##                  95% CI : (0.7453, 0.8193)
##     No Information Rate : 0.78            
##     P-Value [Acc > NIR] : 0.4396          
##                                           
##                   Kappa : 0.336           
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.8821          
##             Specificity : 0.4364          
##          Pos Pred Value : 0.8473          
##          Neg Pred Value : 0.5106          
##              Prevalence : 0.7800          
##          Detection Rate : 0.6880          
##    Detection Prevalence : 0.8120          
##       Balanced Accuracy : 0.6592          
##                                           
##        'Positive' Class : 0               
##

Question 4

What is fundamentally wrong with the method described above? Why didn’t we get the best possible accuracy? How do we train the model correctly?

Answer: the metric should be accuracy, not the root-mean-squared-error. Unfortunately, the library caret does not allow training classification neuralnet models - that’s why we pretended that our model is regression when we tuned it with the train method. We can retrain the model with the correct error function and the combination of layer dimensions that we just found as follows:

set.seed(75)

mod_nn_final <- neuralnet(class ~  ., data = train_data, 
                                    hidden = c(10, 5, 1),
                    lifesign = "full",
                    err.fct = "ce",
                    linear.output = FALSE)

## hidden: 10, 5, 1    thresh: 0.01    rep: 1/1    steps:     280   error: 42.86785 time: 0.39 secs

preds <- as.factor(0 + (predict(mod_nn_final, test_data) > 0.5))
confusionMatrix(preds, as.factor(test_data$class))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 352  62
##          1  38  48
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.7622, 0.8342)
##     No Information Rate : 0.78            
##     P-Value [Acc > NIR] : 0.15237         
##                                           
##                   Kappa : 0.3677          
##                                           
##  Mcnemar's Test P-Value : 0.02145         
##                                           
##             Sensitivity : 0.9026          
##             Specificity : 0.4364          
##          Pos Pred Value : 0.8502          
##          Neg Pred Value : 0.5581          
##              Prevalence : 0.7800          
##          Detection Rate : 0.7040          
##    Detection Prevalence : 0.8280          
##       Balanced Accuracy : 0.6695          
##                                           
##        'Positive' Class : 0               
##

Class 5, notebook 2 — fully connected neural networks