By the end of this activity, students should be able to build, train, evaluate, and regularize a simple neural network for binary classification using the torch library in R.
Torch
Today we will be working with a very new library; it was released in September 2020. The downside is a lack of online tutorials. Here are some links for your reference:
https://blogs.rstudio.com/ai/posts/2020-09-29-introducing-torch-for-r/
https://anderfernandez.com/en/blog/how-to-create-neural-networks-with-torch-in-r/
Note that Torch sits on top of a C++ library, and for it to work you may have to install Visual C++ (you will need to do this if RStudio crashes when you run this notebook). This is no different from other deep learning frameworks - all of them are built on top of some C++ library.
The advantage of Torch is its ease of installation compared to, say, keras for R - the latter merely provides syntax to run Python functions from within R and hence requires a full installation of Python.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or a modification of the R code (modify the code here). In either case, you can discuss your work with the instructor.
We will work with company bankruptcy data from Poland.
Source: https://www.kaggle.com/c/companies-bankruptcy-forecast/data
First we will load the libraries and the data into R. Note that libraries that are not yet installed will be installed automatically; you can reuse this method of package installation in the future, since it makes your R code easier to run on another computer.
Below are the dimensions of the data.
# Install any missing packages automatically, then load them
for (p in c("tidyverse", "torch")) if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
library(tidyverse)
library(torch)
B <- read.csv("bankruptcy_Train.csv")
dim(B)
Below are variable names
names(B)
## [1] "Attr1" "Attr2" "Attr3" "Attr4" "Attr5" "Attr6" "Attr7" "Attr8"
## [9] "Attr9" "Attr10" "Attr11" "Attr12" "Attr13" "Attr14" "Attr15" "Attr16"
## [17] "Attr17" "Attr18" "Attr19" "Attr20" "Attr21" "Attr22" "Attr23" "Attr24"
## [25] "Attr25" "Attr26" "Attr27" "Attr28" "Attr29" "Attr30" "Attr31" "Attr32"
## [33] "Attr33" "Attr34" "Attr35" "Attr36" "Attr37" "Attr38" "Attr39" "Attr40"
## [41] "Attr41" "Attr42" "Attr43" "Attr44" "Attr45" "Attr46" "Attr47" "Attr48"
## [49] "Attr49" "Attr50" "Attr51" "Attr52" "Attr53" "Attr54" "Attr55" "Attr56"
## [57] "Attr57" "Attr58" "Attr59" "Attr60" "Attr61" "Attr62" "Attr63" "Attr64"
## [65] "class"
Note that the names are not informative, but the variables themselves are meaningful (see the data source). Below are the counts and proportions of the two classes.
B %>% group_by(class) %>%
  summarise(N = n()) %>%
  mutate(proportion = N / sum(N))
Here, 1 means bankruptcy and 0 means non-bankruptcy. The dataset is highly imbalanced.
To speed things up, we will create a smaller dataset with only 1000 records. It will include all 203 bankrupt cases from the original dataset and a random sample of 797 non-bankrupt cases. Later we will split this smaller dataset into 50% training and 50% test sets.
set.seed(2021)
positive_cases <- B %>% filter(class == 1)
mini_data <- B %>%
filter(class == 0) %>%
sample_n(797) %>%
rbind(positive_cases)
dim(mini_data)
## [1] 1000 65
Now we have the following distribution of labels in the mini data:
mini_data %>% group_by(class) %>%
summarise(N = n())
We will first check whether your computer is able to train deep learning models on a GPU (this is much faster than training on a CPU).
cuda_is_available()
## [1] FALSE
If the output is TRUE, your computer has a GPU and you can train your deep learning models faster.
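If it were TRUE, the usual pattern would be to create a device object and move the model and the data tensors onto it. Below is a hedged sketch of that pattern using torch's torch_device() function and the $to() method; it is not needed here, since cuda_is_available() returned FALSE.
device <- if (cuda_is_available()) torch_device("cuda") else torch_device("cpu")
# Once the model and tensors exist (they are created below), they could be moved:
# model   <- model$to(device = device)
# x_train <- x_train$to(device = device)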
Now we will create the training and test datasets for our neural network model. We need to convert our data frame to a matrix and then to torch tensors.
set.seed(42)
train_ind <- sample(1:nrow(mini_data), size = floor(0.5 * nrow(mini_data)))
test_ind <- setdiff(1:nrow(mini_data), train_ind)
# Training predictors and labels, converted to float tensors
x_train <- mini_data %>%
  slice(train_ind) %>%
  select(-class) %>%
  as.matrix %>%
  torch_tensor(dtype = torch_float())
y_train <- mini_data %>% slice(train_ind) %>%
  pull(class) %>%
  torch_tensor(dtype = torch_float())
# Test predictors and labels
x_test <- mini_data %>%
  slice(test_ind) %>%
  select(-class) %>%
  as.matrix %>%
  torch_tensor(dtype = torch_float())
y_test <- mini_data %>% slice(test_ind) %>%
  pull(class) %>%
  torch_tensor(dtype = torch_float())
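As a quick sanity check, dim() is defined for torch tensors, so we can confirm that the training tensors have the expected shapes (500 observations of 64 predictors, and 500 labels):
dim(x_train)   # expected: 500 64
dim(y_train)   # expected: 500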
We will construct a fully connected neural net with 3 hidden layers of dimensions 20, 10, and 5.
model = nn_sequential(
# Layer 1
nn_linear(64, 20),
nn_relu(),
# Layer 2
nn_linear(20, 10),
nn_relu(),
# Layer 3
nn_linear(10, 5),
nn_relu(),
# Output layer
nn_linear(5, 1),
nn_sigmoid()
)
model
## An `nn_module` containing 1,571 parameters.
##
## -- Modules ---------------------------------------------------------------------
## * 0: <nn_linear> #1,300 parameters
## * 1: <nn_relu> #0 parameters
## * 2: <nn_linear> #210 parameters
## * 3: <nn_relu> #0 parameters
## * 4: <nn_linear> #55 parameters
## * 5: <nn_relu> #0 parameters
## * 6: <nn_linear> #6 parameters
## * 7: <nn_sigmoid> #0 parameters
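The parameter count is easy to verify by hand: each nn_linear(in, out) layer has in × out weights plus out biases, so
(64*20 + 20) + (20*10 + 10) + (10*5 + 5) + (5*1 + 1)   # = 1300 + 210 + 55 + 6 = 1571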
The model has already been initialized with some random weights and is capable of producing predictions. For example, here is its prediction for the first training observation:
x_train[1, ] %>% model()
## torch_tensor
## 0.5367
## [ CPUFloatType{1} ]
Can you figure out how to print our model’s parameters?
model$parameters
## $`0.weight`
## torch_tensor
## Columns 1 to 10 -0.0614 -0.0276 -0.1080 0.0288 -0.0788 0.0618 0.1108 -0.1107 0.0586 -0.1184
## 0.0918 -0.0238 0.1045 -0.0265 -0.0278 0.0360 0.0603 0.0189 0.0587 -0.1156
## -0.0323 0.0987 -0.1220 0.0808 -0.0257 -0.0000 -0.0463 -0.0997 -0.0005 0.1240
## 0.0633 -0.0606 0.0670 -0.0682 -0.0007 -0.0778 0.0068 0.0206 -0.0261 -0.1045
## 0.0297 -0.0851 -0.1143 -0.0933 -0.0687 -0.0071 -0.0966 0.0924 0.1036 -0.0397
## 0.0755 0.0264 0.0422 -0.0002 -0.0232 -0.1122 0.0892 -0.0841 -0.0555 -0.0448
## 0.0594 0.0917 0.0978 0.0934 0.0799 -0.0632 0.0896 0.0294 0.1105 0.0208
## -0.1115 0.0920 0.0049 -0.1233 -0.0255 0.0659 -0.1078 -0.0626 -0.0473 0.1224
## -0.0741 -0.0128 -0.0797 -0.0613 0.0299 -0.1198 0.0612 0.0429 0.0265 0.0123
## 0.0013 -0.1186 0.0455 0.0195 0.0070 0.0060 -0.0347 -0.0180 -0.0911 0.0178
## 0.1141 0.0247 -0.0020 0.0911 -0.1018 -0.0926 -0.0112 -0.0511 0.0324 -0.0701
## 0.0061 -0.1035 -0.0333 0.0766 -0.0681 -0.0968 -0.0183 -0.0521 0.0382 0.1099
## -0.0419 -0.0712 -0.0997 -0.0471 0.0364 0.0880 0.1027 0.0800 0.0742 0.0092
## -0.0520 0.0068 0.1245 0.0162 0.0201 0.0587 0.0711 0.0877 0.0585 -0.0096
## -0.1135 -0.0502 -0.0832 0.0864 0.0815 -0.0460 0.0141 0.0417 0.1220 -0.0949
## -0.0411 -0.0162 0.0345 -0.0701 0.0104 -0.0500 -0.0114 -0.0888 -0.1038 0.0543
## 0.0942 0.1203 -0.0172 0.0363 -0.0316 0.0740 -0.1003 -0.0132 -0.0954 0.0355
## -0.1242 -0.1100 0.1065 -0.0191 -0.0812 -0.0259 0.0078 -0.0891 -0.0110 0.0740
## -0.0953 0.0667 -0.0529 0.1231 -0.0000 0.0331 -0.0093 0.0072 0.0736 -0.0939
## -0.0275 -0.1147 0.1088 -0.1211 -0.0197 -0.0979 -0.0352 0.0371 0.0040 0.1082
##
## Columns 11 to 20 0.0556 -0.0450 0.0744 0.0948 -0.0768 0.0811 0.0700 -0.0810 0.0912 -0.0653
## 0.0477 -0.1007 -0.0260 0.1194 -0.1026 -0.0930 0.0928 0.0778 -0.1203 0.0215
## 0.0168 -0.0030 -0.0715 0.1036 0.0517 0.0643 0.0791 0.1245 0.1154 0.0874
## -0.0980 -0.1178 -0.0340 -0.0424 0.0535 0.0565 0.1012 0.0085 -0.0912 -0.0103
## -0.0689 -0.0513 -0.0896 -0.1244 -0.0313 0.0966 0.0328 0.0090 -0.0019 -0.0651
## 0.0984 0.1206 0.0575 -0.0300 0.0883 -0.0646 0.0152 -0.0071 -0.0362 -0.0698
## -0.0800 -0.0961 -0.0923 0.0673 0.0939 -0.0938 0.1049 -0.0573 -0.0599 -0.1147
## 0.0634 -0.0337 -0.0100 -0.0746 0.1047 -0.0363 0.0069 0.0141 0.0297 0.0112
## -0.0484 0.0251 0.1085 0.0775 0.0377 0.0614 -0.1025 -0.0752 0.0653 -0.0801
## ... [the output was truncated (use n=-1 to disable)]
## [ CPUFloatType{20,64} ]
##
## $`0.bias`
## torch_tensor
## 0.0444
## -0.0373
## -0.0771
## 0.0416
## -0.0475
## -0.0762
## -0.1225
## -0.1181
## 0.0338
## 0.1208
## -0.0525
## -0.0673
## -0.1114
## -0.0704
## 0.1208
## -0.0283
## -0.0275
## 0.1100
## 0.1221
## 0.0194
## [ CPUFloatType{20} ]
##
## $`2.weight`
## torch_tensor
## Columns 1 to 10 0.0559 -0.1051 0.0256 0.0968 -0.0110 -0.0242 -0.0668 0.0575 0.1299 0.0079
## 0.1115 0.1834 -0.0080 -0.1481 -0.1623 -0.1591 -0.0764 -0.0709 0.0851 0.1660
## 0.0503 0.0107 0.1913 -0.2037 0.1864 -0.0456 -0.1913 -0.1340 0.0570 0.0953
## 0.0055 0.1491 0.1190 -0.1151 0.1749 -0.1689 0.0552 0.2024 -0.0626 0.0792
## -0.0184 0.1827 -0.0251 -0.0367 -0.1553 -0.0939 -0.1651 0.0969 0.1532 -0.0310
## 0.1063 -0.1179 -0.0283 -0.1359 -0.0619 -0.0610 -0.0534 -0.0438 0.0456 0.0860
## -0.1573 -0.1906 -0.0850 0.1656 0.1991 -0.1004 0.0656 0.0955 0.0476 -0.1961
## 0.1947 0.1931 -0.1129 0.1401 -0.0581 0.1023 0.1809 0.1276 0.2086 0.1749
## 0.0068 0.1760 -0.0973 0.0478 0.1123 0.1475 -0.0563 0.0276 -0.0054 0.0995
## 0.0931 0.1374 -0.0640 0.2155 0.1860 0.1494 0.0262 0.0964 -0.1837 0.0984
##
## Columns 11 to 20 0.2195 -0.1399 0.0496 -0.0596 0.1642 0.0085 -0.1241 0.1667 0.1219 -0.1642
## -0.1702 0.2015 0.1554 0.0141 0.1920 0.1104 0.0538 -0.1481 0.1870 -0.2197
## 0.1806 -0.1595 -0.0970 0.2043 0.0297 -0.0439 0.0956 -0.0126 -0.2172 -0.0246
## 0.0536 0.0003 -0.1941 0.1994 0.0973 0.1458 0.2200 -0.0388 -0.0264 0.0296
## -0.0059 0.0104 -0.0013 0.0140 -0.2213 -0.1287 0.1280 -0.1055 0.1875 0.2132
## 0.0504 -0.0251 0.0300 0.0469 -0.1555 -0.0754 -0.1030 0.2207 0.1750 0.0756
## -0.0160 0.0983 -0.0783 -0.0965 -0.0195 -0.0837 -0.0191 -0.1879 -0.0028 0.0932
## 0.1101 0.1475 0.1871 -0.0913 0.1297 -0.1858 -0.1432 -0.1936 0.0207 0.0932
## 0.0513 -0.2216 0.1819 -0.0111 0.0474 0.1820 0.2115 0.0633 0.0887 -0.1623
## 0.1983 0.1329 -0.1581 0.0658 -0.2195 -0.0284 -0.0301 0.0745 -0.1812 0.0357
## [ CPUFloatType{10,20} ]
##
## $`2.bias`
## torch_tensor
## 0.1767
## 0.1269
## -0.0840
## 0.0260
## 0.0447
## -0.0952
## 0.1863
## 0.0859
## -0.1471
## -0.0617
## [ CPUFloatType{10} ]
##
## $`4.weight`
## torch_tensor
## 0.1568 -0.1291 0.0845 0.1061 -0.1384 -0.3145 0.1253 -0.1603 0.1922 -0.0751
## 0.1032 -0.0791 -0.0854 0.1170 -0.0943 0.0935 -0.2901 -0.1593 -0.2617 0.1574
## -0.1167 0.1025 -0.2414 0.1522 -0.3130 0.2712 0.2435 0.2740 -0.2665 0.1782
## -0.1110 -0.0042 -0.2040 -0.2218 -0.1078 -0.0279 0.2415 0.0070 -0.0007 -0.1326
## -0.0392 0.1222 -0.1233 0.2063 -0.2100 -0.3080 -0.1415 0.0757 0.2534 0.1831
## [ CPUFloatType{5,10} ]
##
## $`4.bias`
## torch_tensor
## -0.0264
## -0.0868
## 0.2882
## 0.1277
## -0.1038
## [ CPUFloatType{5} ]
##
## $`6.weight`
## torch_tensor
## -0.0115 0.3051 0.3714 -0.0222 0.2972
## [ CPUFloatType{1,5} ]
##
## $`6.bias`
## torch_tensor
## 0.01 *
## 2.4307
## [ CPUFloatType{1} ]
Now we will define the optimizer, the number of epochs (an epoch is one pass through the entire training set), and the loss function (often called the criterion).
optimizer <- optim_adam(model$parameters)
epochs <- 400
loss_function <- nn_bce_loss()
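nn_bce_loss() implements binary cross-entropy, \(-\frac{1}{n}\sum_{i=1}^{n}\big(y_i\log p_i + (1-y_i)\log(1-p_i)\big)\), which expects probabilities in (0, 1) - this is why the network ends with a sigmoid. Here is a small sketch with made-up values showing that torch's loss matches the formula (the specific numbers are purely illustrative):
p <- c(0.9, 0.2, 0.6)                        # hypothetical predicted probabilities
y <- c(1, 0, 1)                              # hypothetical true labels
nn_bce_loss()(torch_tensor(p), torch_tensor(y))
-mean(y * log(p) + (1 - y) * log(1 - p))     # should give the same number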
We will now write a function that computes the loss and accuracy of our model. The first step is to convert the class probabilities produced by the model into predicted classes. Here is how we do it:
predicted_classes <- function(pred_prob, threshold = 0.5) {
(pred_prob > threshold) %>% as.numeric
}
x_test %>% model %>% predicted_classes %>% head
## [1] 1 1 1 1 1 1
Then we will need a function that computes the accuracy of the predictions:
acc <- function(x, y) {
(x == y) %>% as.numeric %>% mean
}
x_test %>% model %>% predicted_classes %>% acc(y_test)
## [1] 0.192
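For context, the untrained model predicts class 1 for every test observation (see above), so 0.192 is just the fraction of bankrupt cases in the test set. The trivial classifier that always predicts 0 would already achieve the complementary accuracy:
mean(as.numeric(y_test) == 0)   # accuracy of always predicting 0; roughly 0.8 here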
And here is our validation function. For now it computes only the training loss and training accuracy; the test entries are left as NA. It is convenient to package the answer as a data frame.
validation <- function(x_train, y_train, x_test, y_test, epoch_no = 0) {
y_train_pred <- model(x_train)
train_loss <- loss_function(y_train_pred, y_train) %>% as.numeric
test_loss <- NA
train_acc <- y_train_pred %>% predicted_classes %>% acc(y_train)
test_acc <- NA
data.frame(value = c(train_loss, test_loss, train_acc, test_acc),
type = c("loss", "loss", "accuracy", "accuracy"),
dataset = c("train", "test", "train", "test"),
epoch = epoch_no)
}
val_df <- validation(x_train, y_train, x_test, y_test)
val_df
Update the validation function so that it also computes test loss and test accuracy.
validation <- function(x_train, y_train, x_test, y_test, epoch_no = 0) {
y_train_pred <- model(x_train)
y_test_pred <- model(x_test)
train_loss <- loss_function(y_train_pred, y_train) %>% as.numeric
test_loss <- loss_function(y_test_pred, y_test) %>% as.numeric
train_acc <- y_train_pred %>% predicted_classes %>% acc(y_train)
test_acc <- y_test_pred %>% predicted_classes %>% acc(y_test)
data.frame(value = c(train_loss, test_loss, train_acc, test_acc),
type = c("loss", "loss", "accuracy", "accuracy"),
dataset = c("train", "test", "train", "test"),
epoch = epoch_no)
}
val_df <- validation(x_train, y_train, x_test, y_test)
val_df
Here, we explicitly implement gradient descent. Every epoch is one pass through the entire training set. Every 20th epoch, we print the training accuracy and training loss.
for(i in 1:epochs){
optimizer$zero_grad()
y_pred = model(x_train)
loss = loss_function(y_pred, y_train)
loss$backward() # this is gradient calculation
optimizer$step() # this is one step of optimized gradient descent
# Here, we find and print the training accuracy
if(i %% 20 == 0){
val_df <- validation(x_train, y_train, x_test, y_test)
cat(" Epoch:", i,"Loss: ", val_df$value[1]," Accuracy:", val_df$value[3],"\n")
}
}
## Epoch: 20 Loss: 0.7121405 Accuracy: 0.248
## Epoch: 40 Loss: 0.6895846 Accuracy: 0.582
## Epoch: 60 Loss: 0.6711981 Accuracy: 0.802
## Epoch: 80 Loss: 0.6497552 Accuracy: 0.806
## Epoch: 100 Loss: 0.6130636 Accuracy: 0.81
## Epoch: 120 Loss: 0.5507026 Accuracy: 0.812
## Epoch: 140 Loss: 0.4742361 Accuracy: 0.826
## Epoch: 160 Loss: 0.4181295 Accuracy: 0.824
## Epoch: 180 Loss: 0.3822955 Accuracy: 0.852
## Epoch: 200 Loss: 0.3530053 Accuracy: 0.856
## Epoch: 220 Loss: 0.3272815 Accuracy: 0.856
## Epoch: 240 Loss: 0.3038602 Accuracy: 0.862
## Epoch: 260 Loss: 0.2822583 Accuracy: 0.87
## Epoch: 280 Loss: 0.2618349 Accuracy: 0.878
## Epoch: 300 Loss: 0.2414936 Accuracy: 0.898
## Epoch: 320 Loss: 0.2206193 Accuracy: 0.914
## Epoch: 340 Loss: 0.2005509 Accuracy: 0.924
## Epoch: 360 Loss: 0.1818673 Accuracy: 0.93
## Epoch: 380 Loss: 0.1633163 Accuracy: 0.934
## Epoch: 400 Loss: 0.1453451 Accuracy: 0.944
Rewrite the code above so that, instead of printing the training accuracy and loss every 20th epoch, it stores training and test accuracy and loss and plots them against the epoch number at the end of training.
We will need to re-initialize our model first (otherwise, gradient descent will proceed from the current parameters).
initialize(model)
## An `nn_module` containing 1,571 parameters.
##
## -- Modules ---------------------------------------------------------------------
## * 0: <nn_linear> #1,300 parameters
## * 1: <nn_relu> #0 parameters
## * 2: <nn_linear> #210 parameters
## * 3: <nn_relu> #0 parameters
## * 4: <nn_linear> #55 parameters
## * 5: <nn_relu> #0 parameters
## * 6: <nn_linear> #6 parameters
## * 7: <nn_sigmoid> #0 parameters
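If initialize() does not reset the weights in your version of torch, an equivalent alternative (a sketch using only code that already appears above) is to re-create the module, which draws fresh random weights, and to re-create the optimizer so that it tracks the new parameters:
model <- nn_sequential(
  nn_linear(64, 20), nn_relu(),
  nn_linear(20, 10), nn_relu(),
  nn_linear(10, 5),  nn_relu(),
  nn_linear(5, 1),   nn_sigmoid()
)
optimizer <- optim_adam(model$parameters)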
Answer:
val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
optimizer$zero_grad()
y_pred <- model(x_train)
loss <- loss_function(y_pred, y_train)
loss$backward() # this is gradient calculation
optimizer$step() # this is one step of optimized gradient descent
# Here, we will calculate and save the metrics
if(i %% 20 == 0){
cat("Epoch ", i, "\n")
val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
rbind(val_df)
}
}
## Epoch 20
## Epoch 40
## Epoch 60
## Epoch 80
## Epoch 100
## Epoch 120
## Epoch 140
## Epoch 160
## Epoch 180
## Epoch 200
## Epoch 220
## Epoch 240
## Epoch 260
## Epoch 280
## Epoch 300
## Epoch 320
## Epoch 340
## Epoch 360
## Epoch 380
## Epoch 400
val_df
And now we can plot it:
val_df %>%
ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
facet_grid(. ~type) +
geom_line()
Note that the training loss goes down throughout training - this is essentially guaranteed by gradient descent (unless we do gradient descent in mini-batches, which is the normal procedure in practice). Training accuracy goes up throughout as well, which is usually the case. However, the test loss starts increasing at some point and the test accuracy stops improving - this is a sign of overfitting.
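For reference, here is a minimal sketch of what mini-batch training could look like with this model. It assumes torch's tensor_dataset() and dataloader() helpers and the coro package for iterating over batches; it is not part of the activity, just an illustration of the parenthetical above (running it would continue training the current model).
library(coro)
train_ds <- tensor_dataset(x_train, y_train)
train_dl <- dataloader(train_ds, batch_size = 50, shuffle = TRUE)
for (i in 1:epochs) {
  coro::loop(for (batch in train_dl) {   # one pass over all mini-batches per epoch
    optimizer$zero_grad()
    loss <- loss_function(model(batch[[1]]), batch[[2]])
    loss$backward()
    optimizer$step()
  })
}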
In practice, we usually want to apply some form of regularization. We will add a dropout layer after layer 1 and include \(l_2\) regularization (weight decay) in our Adam optimizer:
model = nn_sequential(
# Layer 1
nn_linear(64, 20),
nn_relu(),
nn_dropout(p = 0.2),
# Layer 2
nn_linear(20, 10),
nn_relu(),
# Layer 3
nn_linear(10, 5),
nn_relu(),
# Output Layer,
nn_linear(5, 1),
nn_sigmoid()
)
optimizer <- optim_adam(model$parameters, weight_decay = 0.1)
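One subtlety worth noting: once the model contains a dropout layer, calling model() inside validation() still applies dropout, so the reported metrics are a bit noisy. A hedged refinement is to switch the model into evaluation mode before computing metrics and back into training mode afterwards (torch modules have $eval() and $train() methods for this):
model$eval()    # disables dropout while we compute the metrics
val_df <- validation(x_train, y_train, x_test, y_test)
model$train()   # re-enables dropout before we continue training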
And now we will do our training again:
val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
optimizer$zero_grad()
y_pred <- model(x_train)
loss <- loss_function(y_pred, y_train)
loss$backward() # this is gradient calculation
optimizer$step() # this is one step of optimized gradient descent
# Here, we will calculate and save the metrics
if(i %% 20 == 0){
cat("Epoch ", i, "\n")
val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
rbind(val_df)
}
}
## Epoch 20
## Epoch 40
## Epoch 60
## Epoch 80
## Epoch 100
## Epoch 120
## Epoch 140
## Epoch 160
## Epoch 180
## Epoch 200
## Epoch 220
## Epoch 240
## Epoch 260
## Epoch 280
## Epoch 300
## Epoch 320
## Epoch 340
## Epoch 360
## Epoch 380
## Epoch 400
val_df %>%
ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
facet_grid(. ~type) +
geom_line()
Reduce the values of the two regularization parameters. Experiment with them so that there is no longer any overfitting, but the test accuracy is still decent.
model = nn_sequential(
# Layer 1
nn_linear(64, 20),
nn_dropout(p = 0.08),
nn_relu(),
# Layer 2
nn_linear(20, 10),
nn_relu(),
# Layer 3
nn_linear(10, 5),
nn_relu(),
# Output Layer,
nn_linear(5, 1),
nn_sigmoid()
)
optimizer <- optim_adam(model$parameters, weight_decay = 0.00005)
val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
optimizer$zero_grad()
y_pred <- model(x_train)
loss <- loss_function(y_pred, y_train)
loss$backward() # this is gradient calculation
optimizer$step() # this is one step of optimized gradient descent
# Here, we will calculate and save the metrics
if(i %% 20 == 0){
cat("Epoch ", i, "\n")
val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
rbind(val_df)
}
}
## Epoch 20
## Epoch 40
## Epoch 60
## Epoch 80
## Epoch 100
## Epoch 120
## Epoch 140
## Epoch 160
## Epoch 180
## Epoch 200
## Epoch 220
## Epoch 240
## Epoch 260
## Epoch 280
## Epoch 300
## Epoch 320
## Epoch 340
## Epoch 360
## Epoch 380
## Epoch 400
val_df %>%
ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
facet_grid(. ~type) +
geom_line()