Info about the activity

Objectives

By the end of this activity, students should be able to

  1. Train and tune neural networks in R Torch.

Today we will be working with a fairly new library, torch for R, first released in September 2020. The downside is that online tutorials are still scarce. Here are some links for your reference:

Note that torch sits on top of a C++ library, so you may have to install Visual C++ for it to work (you will need to do this if RStudio crashes when you run this notebook). This is no different from other deep learning frameworks - all of them are built on top of a C++ library.

An advantage of torch is the ease of installation compared to, say, keras for R - the latter merely provides R syntax for calling Python functions and hence requires a full Python installation.
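
For reference, installing torch is a two-step process. The sketch below is a typical pattern; depending on the package version, the C++ backend may also be downloaded automatically the first time the package is loaded.

install.packages("torch")   # installs the R package
library(torch)
install_torch()             # downloads the C++ backend (libtorch) on first use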

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer - type your answer right in this document; others require modifying R code - then modify the code here. In either case, you can discuss your work with the instructor.

Data

We will work with company bankruptcy data from Poland.

Source: https://www.kaggle.com/c/companies-bankruptcy-forecast/data

Loading data into R

First we will load the libraries and the data into R. Note that libraries that are not yet installed can be installed automatically before loading (a pattern worth adopting - it helps when you run your R code on another computer).

The data are read into the data frame B; you can check its dimensions with dim(B).

library(tidyverse)
library(torch)

B <- read.csv("bankruptcy_Train.csv") 
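
The chunk above assumes both packages are already installed. A minimal sketch of the install-if-missing pattern mentioned above, using only base R:

required <- c("tidyverse", "torch")
missing_pkgs <- setdiff(required, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
invisible(lapply(required, library, character.only = TRUE))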

Below are the variable names:

names(B)
##  [1] "Attr1"  "Attr2"  "Attr3"  "Attr4"  "Attr5"  "Attr6"  "Attr7"  "Attr8" 
##  [9] "Attr9"  "Attr10" "Attr11" "Attr12" "Attr13" "Attr14" "Attr15" "Attr16"
## [17] "Attr17" "Attr18" "Attr19" "Attr20" "Attr21" "Attr22" "Attr23" "Attr24"
## [25] "Attr25" "Attr26" "Attr27" "Attr28" "Attr29" "Attr30" "Attr31" "Attr32"
## [33] "Attr33" "Attr34" "Attr35" "Attr36" "Attr37" "Attr38" "Attr39" "Attr40"
## [41] "Attr41" "Attr42" "Attr43" "Attr44" "Attr45" "Attr46" "Attr47" "Attr48"
## [49] "Attr49" "Attr50" "Attr51" "Attr52" "Attr53" "Attr54" "Attr55" "Attr56"
## [57] "Attr57" "Attr58" "Attr59" "Attr60" "Attr61" "Attr62" "Attr63" "Attr64"
## [65] "class"

Note that the names are not informative, but the variables themselves are meaningful (see the data source). Below are the counts of the two classes.

B %>% group_by(class) %>%
  summarise(N = n())

Here, 1 means bankruptcy and 0 means non-bankruptcy. The dataset is highly imbalanced.
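
If you prefer proportions to raw counts, a one-line sketch using dplyr's count():

B %>% count(class) %>% mutate(prop = n / sum(n))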

Subsetting

To speed things up, we will create a smaller dataset with only 1000 records: all 203 bankrupt cases from the original dataset plus a random sample of 797 non-bankrupt cases. Then we will split this smaller dataset into 50% training and 50% test sets.

set.seed(2021)
positive_cases <- B %>% filter(class == 1)
mini_data <- B %>% 
  filter(class == 0) %>%
  sample_n(797) %>%
  rbind(positive_cases)

dim(mini_data)
## [1] 1000   65

Now we have the following distribution of labels in the mini data:

mini_data %>% group_by(class) %>%
  summarise(N = n())

GPU

We will first check whether your computer is able to train deep learning models on a GPU (which would be much faster than training on the CPU).

cuda_is_available()
## [1] FALSE

If the output is TRUE, your computer has a CUDA-capable GPU and you can train your deep learning models faster.
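
If you do have a GPU, the usual pattern is to pick a device once and move the model and the tensors to it. A sketch is below; the objects model and x_train are created later in this notebook, so the last two lines are shown commented out.

device <- if (cuda_is_available()) torch_device("cuda") else torch_device("cpu")
# model   <- model$to(device = device)
# x_train <- x_train$to(device = device)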

Training

Training and test data

Now we will create the training and test datasets for our neural network model. We need to convert the data frame to a matrix and then to torch tensors.

set.seed(42)
# 50/50 split: half of the rows go to training, the rest to testing
train_ind <- sample(1:nrow(mini_data), size = floor(0.5 * nrow(mini_data)))
test_ind <- setdiff(1:nrow(mini_data), train_ind)

# Predictors and labels as float tensors
x_train = mini_data %>%
  slice(train_ind) %>%
  select(-class) %>%
  as.matrix %>% 
  torch_tensor(dtype = torch_float())
  
y_train = mini_data %>% slice(train_ind) %>% 
  pull(class) %>%
  torch_tensor(dtype = torch_float())

x_test = mini_data %>%
  slice(test_ind) %>%
  select(-class) %>%
  as.matrix %>% 
  torch_tensor(dtype = torch_float())

y_test = mini_data %>% slice(test_ind) %>% 
  pull(class) %>% 
  torch_tensor(dtype = torch_float())
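
As a quick sanity check, you can look at the tensor shapes (the expected values follow from the 50/50 split of 1000 rows and the 64 predictors):

dim(x_train)   # expected: 500  64
dim(y_train)   # expected: 500
dim(x_test)    # expected: 500  64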

Defining our model

We will construct a fully connected neural net with three hidden layers of sizes 20, 10, and 5. The input layer has 64 units (one per predictor) and the output layer is a single sigmoid unit.

model = nn_sequential(

  # Layer 1
  nn_linear(64, 20),
  nn_relu(), 

  # Layer 2
  nn_linear(20, 10),
  nn_relu(),

  # Layer 3
  nn_linear(10, 5),
  nn_relu(),
  
  # Output layer
  nn_linear(5, 1),
  nn_sigmoid()
)

model
## An `nn_module` containing 1,571 parameters.
## 
## -- Modules ---------------------------------------------------------------------
## * 0: <nn_linear> #1,300 parameters
## * 1: <nn_relu> #0 parameters
## * 2: <nn_linear> #210 parameters
## * 3: <nn_relu> #0 parameters
## * 4: <nn_linear> #55 parameters
## * 5: <nn_relu> #0 parameters
## * 6: <nn_linear> #6 parameters
## * 7: <nn_sigmoid> #0 parameters
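
The total of 1,571 parameters is exactly what we expect for fully connected layers: each nn_linear(in, out) contributes (in + 1) * out parameters, counting the bias. A quick check:

(c(64, 20, 10, 5) + 1) * c(20, 10, 5, 1)        # 1300  210   55    6  (matches the modules above)
sum((c(64, 20, 10, 5) + 1) * c(20, 10, 5, 1))   # 1571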

The model has already been initialized with random weights and can already produce predictions. For example, here is its prediction for the first training observation:

x_train[1, ] %>% model()
## torch_tensor
##  0.5367
## [ CPUFloatType{1} ]

Question 1

Can you figure out how to print our model’s parameters?

model$parameters
## $`0.weight`
## torch_tensor
## Columns 1 to 10-0.0614 -0.0276 -0.1080  0.0288 -0.0788  0.0618  0.1108 -0.1107  0.0586 -0.1184
##  0.0918 -0.0238  0.1045 -0.0265 -0.0278  0.0360  0.0603  0.0189  0.0587 -0.1156
## -0.0323  0.0987 -0.1220  0.0808 -0.0257 -0.0000 -0.0463 -0.0997 -0.0005  0.1240
##  0.0633 -0.0606  0.0670 -0.0682 -0.0007 -0.0778  0.0068  0.0206 -0.0261 -0.1045
##  0.0297 -0.0851 -0.1143 -0.0933 -0.0687 -0.0071 -0.0966  0.0924  0.1036 -0.0397
##  0.0755  0.0264  0.0422 -0.0002 -0.0232 -0.1122  0.0892 -0.0841 -0.0555 -0.0448
##  0.0594  0.0917  0.0978  0.0934  0.0799 -0.0632  0.0896  0.0294  0.1105  0.0208
## -0.1115  0.0920  0.0049 -0.1233 -0.0255  0.0659 -0.1078 -0.0626 -0.0473  0.1224
## -0.0741 -0.0128 -0.0797 -0.0613  0.0299 -0.1198  0.0612  0.0429  0.0265  0.0123
##  0.0013 -0.1186  0.0455  0.0195  0.0070  0.0060 -0.0347 -0.0180 -0.0911  0.0178
##  0.1141  0.0247 -0.0020  0.0911 -0.1018 -0.0926 -0.0112 -0.0511  0.0324 -0.0701
##  0.0061 -0.1035 -0.0333  0.0766 -0.0681 -0.0968 -0.0183 -0.0521  0.0382  0.1099
## -0.0419 -0.0712 -0.0997 -0.0471  0.0364  0.0880  0.1027  0.0800  0.0742  0.0092
## -0.0520  0.0068  0.1245  0.0162  0.0201  0.0587  0.0711  0.0877  0.0585 -0.0096
## -0.1135 -0.0502 -0.0832  0.0864  0.0815 -0.0460  0.0141  0.0417  0.1220 -0.0949
## -0.0411 -0.0162  0.0345 -0.0701  0.0104 -0.0500 -0.0114 -0.0888 -0.1038  0.0543
##  0.0942  0.1203 -0.0172  0.0363 -0.0316  0.0740 -0.1003 -0.0132 -0.0954  0.0355
## -0.1242 -0.1100  0.1065 -0.0191 -0.0812 -0.0259  0.0078 -0.0891 -0.0110  0.0740
## -0.0953  0.0667 -0.0529  0.1231 -0.0000  0.0331 -0.0093  0.0072  0.0736 -0.0939
## -0.0275 -0.1147  0.1088 -0.1211 -0.0197 -0.0979 -0.0352  0.0371  0.0040  0.1082
## 
## Columns 11 to 20 0.0556 -0.0450  0.0744  0.0948 -0.0768  0.0811  0.0700 -0.0810  0.0912 -0.0653
##  0.0477 -0.1007 -0.0260  0.1194 -0.1026 -0.0930  0.0928  0.0778 -0.1203  0.0215
##  0.0168 -0.0030 -0.0715  0.1036  0.0517  0.0643  0.0791  0.1245  0.1154  0.0874
## -0.0980 -0.1178 -0.0340 -0.0424  0.0535  0.0565  0.1012  0.0085 -0.0912 -0.0103
## -0.0689 -0.0513 -0.0896 -0.1244 -0.0313  0.0966  0.0328  0.0090 -0.0019 -0.0651
##  0.0984  0.1206  0.0575 -0.0300  0.0883 -0.0646  0.0152 -0.0071 -0.0362 -0.0698
## -0.0800 -0.0961 -0.0923  0.0673  0.0939 -0.0938  0.1049 -0.0573 -0.0599 -0.1147
##  0.0634 -0.0337 -0.0100 -0.0746  0.1047 -0.0363  0.0069  0.0141  0.0297  0.0112
## -0.0484  0.0251  0.1085  0.0775  0.0377  0.0614 -0.1025 -0.0752  0.0653 -0.0801
## ... [the output was truncated (use n=-1 to disable)]
## [ CPUFloatType{20,64} ]
## 
## $`0.bias`
## torch_tensor
##  0.0444
## -0.0373
## -0.0771
##  0.0416
## -0.0475
## -0.0762
## -0.1225
## -0.1181
##  0.0338
##  0.1208
## -0.0525
## -0.0673
## -0.1114
## -0.0704
##  0.1208
## -0.0283
## -0.0275
##  0.1100
##  0.1221
##  0.0194
## [ CPUFloatType{20} ]
## 
## $`2.weight`
## torch_tensor
## Columns 1 to 10 0.0559 -0.1051  0.0256  0.0968 -0.0110 -0.0242 -0.0668  0.0575  0.1299  0.0079
##  0.1115  0.1834 -0.0080 -0.1481 -0.1623 -0.1591 -0.0764 -0.0709  0.0851  0.1660
##  0.0503  0.0107  0.1913 -0.2037  0.1864 -0.0456 -0.1913 -0.1340  0.0570  0.0953
##  0.0055  0.1491  0.1190 -0.1151  0.1749 -0.1689  0.0552  0.2024 -0.0626  0.0792
## -0.0184  0.1827 -0.0251 -0.0367 -0.1553 -0.0939 -0.1651  0.0969  0.1532 -0.0310
##  0.1063 -0.1179 -0.0283 -0.1359 -0.0619 -0.0610 -0.0534 -0.0438  0.0456  0.0860
## -0.1573 -0.1906 -0.0850  0.1656  0.1991 -0.1004  0.0656  0.0955  0.0476 -0.1961
##  0.1947  0.1931 -0.1129  0.1401 -0.0581  0.1023  0.1809  0.1276  0.2086  0.1749
##  0.0068  0.1760 -0.0973  0.0478  0.1123  0.1475 -0.0563  0.0276 -0.0054  0.0995
##  0.0931  0.1374 -0.0640  0.2155  0.1860  0.1494  0.0262  0.0964 -0.1837  0.0984
## 
## Columns 11 to 20 0.2195 -0.1399  0.0496 -0.0596  0.1642  0.0085 -0.1241  0.1667  0.1219 -0.1642
## -0.1702  0.2015  0.1554  0.0141  0.1920  0.1104  0.0538 -0.1481  0.1870 -0.2197
##  0.1806 -0.1595 -0.0970  0.2043  0.0297 -0.0439  0.0956 -0.0126 -0.2172 -0.0246
##  0.0536  0.0003 -0.1941  0.1994  0.0973  0.1458  0.2200 -0.0388 -0.0264  0.0296
## -0.0059  0.0104 -0.0013  0.0140 -0.2213 -0.1287  0.1280 -0.1055  0.1875  0.2132
##  0.0504 -0.0251  0.0300  0.0469 -0.1555 -0.0754 -0.1030  0.2207  0.1750  0.0756
## -0.0160  0.0983 -0.0783 -0.0965 -0.0195 -0.0837 -0.0191 -0.1879 -0.0028  0.0932
##  0.1101  0.1475  0.1871 -0.0913  0.1297 -0.1858 -0.1432 -0.1936  0.0207  0.0932
##  0.0513 -0.2216  0.1819 -0.0111  0.0474  0.1820  0.2115  0.0633  0.0887 -0.1623
##  0.1983  0.1329 -0.1581  0.0658 -0.2195 -0.0284 -0.0301  0.0745 -0.1812  0.0357
## [ CPUFloatType{10,20} ]
## 
## $`2.bias`
## torch_tensor
##  0.1767
##  0.1269
## -0.0840
##  0.0260
##  0.0447
## -0.0952
##  0.1863
##  0.0859
## -0.1471
## -0.0617
## [ CPUFloatType{10} ]
## 
## $`4.weight`
## torch_tensor
##  0.1568 -0.1291  0.0845  0.1061 -0.1384 -0.3145  0.1253 -0.1603  0.1922 -0.0751
##  0.1032 -0.0791 -0.0854  0.1170 -0.0943  0.0935 -0.2901 -0.1593 -0.2617  0.1574
## -0.1167  0.1025 -0.2414  0.1522 -0.3130  0.2712  0.2435  0.2740 -0.2665  0.1782
## -0.1110 -0.0042 -0.2040 -0.2218 -0.1078 -0.0279  0.2415  0.0070 -0.0007 -0.1326
## -0.0392  0.1222 -0.1233  0.2063 -0.2100 -0.3080 -0.1415  0.0757  0.2534  0.1831
## [ CPUFloatType{5,10} ]
## 
## $`4.bias`
## torch_tensor
## -0.0264
## -0.0868
##  0.2882
##  0.1277
## -0.1038
## [ CPUFloatType{5} ]
## 
## $`6.weight`
## torch_tensor
## -0.0115  0.3051  0.3714 -0.0222  0.2972
## [ CPUFloatType{1,5} ]
## 
## $`6.bias`
## torch_tensor
## 0.01 *
##  2.4307
## [ CPUFloatType{1} ]

Settings for training

Now we will define the optimizer, the number of epochs (each epoch is one pass through the entire training set), and the loss function (binary cross-entropy, often called the criterion):

optimizer <- optim_adam(model$parameters)
epochs <- 400
loss_function <- nn_bce_loss()  
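
We did not pass a learning rate, so optim_adam uses its default (lr = 0.001). If training turns out to be too slow or unstable, you can set it explicitly; the line below is a sketch (the value 0.01 is just an example) and is commented out so that the results in this notebook stay reproducible.

# optimizer <- optim_adam(model$parameters, lr = 0.01)   # larger steps than the default 0.001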

Loss and accuracy

We will now write a function that computes the loss and accuracy of our model. The first step is to convert the class probabilities computed by our model into predicted classes. Here is how we do it:

predicted_classes <- function(pred_prob, threshold = 0.5) {
  (pred_prob > threshold) %>% as.numeric
}

x_test %>% model %>% predicted_classes %>% head
## [1] 1 1 1 1 1 1

Then we will need a function that computes accuracy of predictions:

acc <- function(x, y) {
  (x == y) %>% as.numeric %>% mean
}

x_test %>% model %>% predicted_classes %>% acc(y_test)
## [1] 0.192
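
Accuracy alone can be misleading on imbalanced data. As an aside (not part of the original activity), here is a quick cross-tabulation of predicted versus actual classes with base table():

table(
  predicted = x_test %>% model %>% predicted_classes,
  actual = as.numeric(y_test)
)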

And here is our validation function. For now it computes only the training loss and training accuracy (the test values are left as NA - see Question 2). It is convenient to package the answer as a data frame.

validation <- function(x_train, y_train, x_test, y_test, epoch_no = 0) {
  y_train_pred <- model(x_train) 
  train_loss <- loss_function(y_train_pred, y_train) %>% as.numeric
  test_loss <- NA
  train_acc <- y_train_pred %>% predicted_classes %>% acc(y_train)
  test_acc <- NA
  data.frame(value =  c(train_loss, test_loss, train_acc, test_acc),
             type = c("loss", "loss", "accuracy", "accuracy"),
             dataset = c("train", "test", "train", "test"),
             epoch = epoch_no)
}

val_df <- validation(x_train, y_train, x_test, y_test)
val_df

Question 2

Update the validation function so that it also computes test loss and test accuracy.

validation <- function(x_train, y_train, x_test, y_test, epoch_no = 0) {
  y_train_pred <- model(x_train) 
  y_test_pred <- model(x_test)
  train_loss <- loss_function(y_train_pred, y_train) %>% as.numeric
  test_loss <- loss_function(y_test_pred, y_test) %>% as.numeric
  train_acc <- y_train_pred %>% predicted_classes %>% acc(y_train)
  test_acc <- y_test_pred %>% predicted_classes %>% acc(y_test)
  data.frame(value =  c(train_loss, test_loss, train_acc, test_acc),
             type = c("loss", "loss", "accuracy", "accuracy"),
             dataset = c("train", "test", "train", "test"),
             epoch = epoch_no)
}

val_df <- validation(x_train, y_train, x_test, y_test)
val_df

Training

Here, we explicitly implement (full-batch) gradient descent. Each epoch is one pass through the entire training set. On every 20th epoch, we print the training loss and training accuracy.

for(i in 1:epochs){
  optimizer$zero_grad()

  y_pred = model(x_train)
  loss = loss_function(y_pred, y_train)
  loss$backward() # this is gradient calculation
  optimizer$step() # this is one step of optimized gradient descent

  # Here, we find and print the training accuracy
  if(i %% 20 == 0){
    val_df <- validation(x_train, y_train, x_test, y_test)
    cat(" Epoch:", i,"Loss: ", val_df$value[1]," Accuracy:", val_df$value[3],"\n")
  }
}
##  Epoch: 20 Loss:  0.7121405  Accuracy: 0.248 
##  Epoch: 40 Loss:  0.6895846  Accuracy: 0.582 
##  Epoch: 60 Loss:  0.6711981  Accuracy: 0.802 
##  Epoch: 80 Loss:  0.6497552  Accuracy: 0.806 
##  Epoch: 100 Loss:  0.6130636  Accuracy: 0.81 
##  Epoch: 120 Loss:  0.5507026  Accuracy: 0.812 
##  Epoch: 140 Loss:  0.4742361  Accuracy: 0.826 
##  Epoch: 160 Loss:  0.4181295  Accuracy: 0.824 
##  Epoch: 180 Loss:  0.3822955  Accuracy: 0.852 
##  Epoch: 200 Loss:  0.3530053  Accuracy: 0.856 
##  Epoch: 220 Loss:  0.3272815  Accuracy: 0.856 
##  Epoch: 240 Loss:  0.3038602  Accuracy: 0.862 
##  Epoch: 260 Loss:  0.2822583  Accuracy: 0.87 
##  Epoch: 280 Loss:  0.2618349  Accuracy: 0.878 
##  Epoch: 300 Loss:  0.2414936  Accuracy: 0.898 
##  Epoch: 320 Loss:  0.2206193  Accuracy: 0.914 
##  Epoch: 340 Loss:  0.2005509  Accuracy: 0.924 
##  Epoch: 360 Loss:  0.1818673  Accuracy: 0.93 
##  Epoch: 380 Loss:  0.1633163  Accuracy: 0.934 
##  Epoch: 400 Loss:  0.1453451  Accuracy: 0.944

Question 3

Rewrite the code above so that, instead of printing training accuracy and loss on every 20th epoch, it records training and test loss and accuracy and plots them against the epoch number at the end of training.

We will need to re-initialize our model first (otherwise, gradient descent will continue from the current parameters).

initialize(model)
## An `nn_module` containing 1,571 parameters.
## 
## -- Modules ---------------------------------------------------------------------
## * 0: <nn_linear> #1,300 parameters
## * 1: <nn_relu> #0 parameters
## * 2: <nn_linear> #210 parameters
## * 3: <nn_relu> #0 parameters
## * 4: <nn_linear> #55 parameters
## * 5: <nn_relu> #0 parameters
## * 6: <nn_linear> #6 parameters
## * 7: <nn_sigmoid> #0 parameters
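
If initialize() does not reset the weights in your version of the package, an equivalent approach is simply to rebuild the model from scratch and re-create the optimizer so that it tracks the new parameter tensors. A sketch, using the same architecture as above:

model <- nn_sequential(
  nn_linear(64, 20), nn_relu(),
  nn_linear(20, 10), nn_relu(),
  nn_linear(10, 5),  nn_relu(),
  nn_linear(5, 1),   nn_sigmoid()
)
optimizer <- optim_adam(model$parameters)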

Answer:

val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
  optimizer$zero_grad()

  y_pred <- model(x_train)
  loss <- loss_function(y_pred, y_train)
  loss$backward() # this is gradient calculation
  optimizer$step() # this is one step of optimized gradient descent

  # Here, we will calculate and save the metrics
  if(i %% 20 == 0){
    cat("Epoch ", i, "\n")
    val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
      rbind(val_df)
  }
}
## Epoch  20 
## Epoch  40 
## Epoch  60 
## Epoch  80 
## Epoch  100 
## Epoch  120 
## Epoch  140 
## Epoch  160 
## Epoch  180 
## Epoch  200 
## Epoch  220 
## Epoch  240 
## Epoch  260 
## Epoch  280 
## Epoch  300 
## Epoch  320 
## Epoch  340 
## Epoch  360 
## Epoch  380 
## Epoch  400
val_df

And now we can plot it:

val_df %>% 
  ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
  facet_grid(. ~type) +
  geom_line()

Discussion

Note that the training loss goes down throughout - this is what we expect from full-batch gradient descent with a small enough step size (with mini-batch gradient descent, which is the normal procedure in practice, the training loss can fluctuate from epoch to epoch). Training accuracy goes up as well, which is usually the case. However, the test loss starts increasing at some point and the test accuracy stops improving - this is a sign of overfitting.
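
The mini-batch procedure mentioned above is not used in this activity, but for reference here is a minimal sketch. The batch size of 64 is an arbitrary choice for illustration; model, optimizer, loss_function, x_train, and y_train are the objects defined earlier, and metric collection is omitted for brevity.

n_train <- dim(x_train)[1]
batch_size <- 64
for (epoch in 1:epochs) {
  perm <- sample(n_train)              # shuffle the training rows each epoch
  for (start in seq(1, n_train, by = batch_size)) {
    idx <- perm[start:min(start + batch_size - 1, n_train)]
    optimizer$zero_grad()
    loss <- loss_function(model(x_train[idx, ]), y_train[idx])
    loss$backward()
    optimizer$step()                   # one update per mini-batch instead of per epoch
  }
}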

Regularization

In practice, we usually want some form of regularization. We will add dropout after layer 1 and include \(l_2\) regularization (weight decay) in our Adam optimizer:

model = nn_sequential(

  # Layer 1
  nn_linear(64, 20),
  nn_relu(), 
  nn_dropout(p = 0.2),
  
  # Layer 2
  nn_linear(20, 10),
  nn_relu(),

  # Layer 3
  nn_linear(10, 5),
  nn_relu(),
  
  # Output layer
  nn_linear(5, 1),
  nn_sigmoid()
)

optimizer <- optim_adam(model$parameters, weight_decay = 0.1)
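
One subtlety worth knowing: dropout stays active whenever the model is in training mode, including inside our validation() calls. The training loop below does not toggle this, so treat the following as an optional refinement - a sketch of the metric block with the mode switched around the measurement:

  if (i %% 20 == 0) {
    model$eval()      # inference mode: dropout is switched off while measuring
    val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
      rbind(val_df)
    model$train()     # back to training mode for the next epoch
  }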

And now we will do our training again:

val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
  optimizer$zero_grad()

  y_pred <- model(x_train)
  loss <- loss_function(y_pred, y_train)
  loss$backward() # this is gradient calculation
  optimizer$step() # this is one step of optimized gradient descent

  # Here, we will calculate and save the metrics
  if(i %% 20 == 0){
    cat("Epoch ", i, "\n")
    val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
      rbind(val_df)
  }
}
## Epoch  20 
## Epoch  40 
## Epoch  60 
## Epoch  80 
## Epoch  100 
## Epoch  120 
## Epoch  140 
## Epoch  160 
## Epoch  180 
## Epoch  200 
## Epoch  220 
## Epoch  240 
## Epoch  260 
## Epoch  280 
## Epoch  300 
## Epoch  320 
## Epoch  340 
## Epoch  360 
## Epoch  380 
## Epoch  400
val_df %>% 
  ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
  facet_grid(. ~type) +
  geom_line()

Question 4

Reduce the values of the two regularization parameters (the dropout probability and the weight decay). Experiment with them so that there is no more overfitting but the test accuracy is still decent.

model = nn_sequential(
  # Layer 1
  nn_linear(64, 20),
  nn_dropout(p = 0.08),
  nn_relu(), 

  # Layer 2
  nn_linear(20, 10),
  nn_relu(),

  # Layer 3
  nn_linear(10, 5),
  nn_relu(),
  
  # Output layer
  nn_linear(5, 1),
  nn_sigmoid()
) 

optimizer <- optim_adam(model$parameters, weight_decay = 0.00005)

val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
  optimizer$zero_grad()

  y_pred <- model(x_train)
  loss <- loss_function(y_pred, y_train)
  loss$backward() # this is gradient calculation
  optimizer$step() # this is one step of optimized gradient descent

  # Here, we will calculate and save the metrics
  if(i %% 20 == 0){
    cat("Epoch ", i, "\n")
    val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
      rbind(val_df)
  }
}
## Epoch  20 
## Epoch  40 
## Epoch  60 
## Epoch  80 
## Epoch  100 
## Epoch  120 
## Epoch  140 
## Epoch  160 
## Epoch  180 
## Epoch  200 
## Epoch  220 
## Epoch  240 
## Epoch  260 
## Epoch  280 
## Epoch  300 
## Epoch  320 
## Epoch  340 
## Epoch  360 
## Epoch  380 
## Epoch  400
val_df %>% 
  ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
  facet_grid(. ~type) +
  geom_line()