A tour of torch for R: Create your own Dataset

Unless the data you’re working with ships with torch, you have to wrap it in a Dataset.

1 torch Dataset objects

A Dataset is an R6 object that knows how to iterate over data.

While a Dataset may have any number of methods – each responsible for some aspect of pre-processing logic, for example – just three are required (a minimal sketch follows this list):

  • initialize(), to pre-process and store the data;

  • .getitem(i), to pick the item at position i, and

  • .length(), to indicate to the DataLoader how many items it has.
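
For instance, a minimal Dataset wrapping a plain numeric vector might look like the following sketch (toy_dataset is just an illustrative name):

# A toy Dataset wrapping a plain numeric vector
toy_dataset <- dataset(
  name = "toy_dataset",
  
  # store the data as a torch tensor
  initialize = function(x){
    self$x <- torch_tensor(x)
  },
  
  # return the item at position i
  .getitem = function(i){
    self$x[i]
  },
  
  # report how many items there are
  .length = function(){
    self$x$size()[1]
  }
)

ds <- toy_dataset(c(1.5, 2.5, 3.5))
length(ds)   # 3
ds[2]        # a scalar Float tensor holding 2.5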

Let’s explore some penguins data!

1.1 Penguins

library(tidyverse)
library(torch)
library(palmerpenguins)

penguins %>% 
  glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex               <fct> male, female, female, NA, female, male, female, male~
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

Predictors are of two different types: categorical and continuous.

  • Continuous features of R type double can be fed to torch without further ado: we use them directly to initialize a torch tensor, which will be of type Float (see the brief example after this list).

  • torch needs all data in numerical form, so vectors of type character first have to become factors, which can then be treated as numeric by extracting their level codes.
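
As a quick illustration of the first point, a vector of doubles maps straight to a Float tensor:

# doubles become a Float tensor right away
torch_tensor(c(39.1, 39.5, 40.3))   # [ CPUFloatType{3} ]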

1.2 Categorical data in deep learning

We have two options:

  • One-hot encode them, or

  • Leave them as they are and have the network map each discrete value to a multidimensional continuous representation, a.k.a. an embedding.

Embedding modules expect their inputs to be of type Long.

A tensor created from an R value will have the correct type if we make sure it’s an integer. Why? Because torch maps R integers to its Long (64-bit integer) type, while doubles become Float:

as.factor("one") %>% 
  as.numeric() %>% 
  as.integer() %>% 
  torch_tensor()
torch_tensor
 1
[ CPULongType{1} ]

1.3 A dataset for penguins

In initialize(), we prepare and store the data:

  • continuous features go into a torch tensor of type Float;

  • categorical features (and the species target) go into torch tensors of type Long.

Important

torch_tensor() does not handle data frames or tibbles, so each group of variables first has to be converted to a matrix.

Splitting the data into continuous, categorical and outcome variables may look very Pythonish, but it makes sense precisely because torch does not take tibbles: each group has to be pulled out and converted separately.

penguins_dataset <- dataset(
  name = "penguins_dataset",
  
  # Preprocess data
  initialize = function(df){
    
    df <- df %>% 
      drop_na()
    
    # Continuous features (x_cont)
    # don't need much transformation
    x_cont <- df %>%
      select(-species) %>% 
      select(where(is.numeric)) %>% 
      as.matrix()
    self$x_cont <- torch_tensor(x_cont)
    
    
    # Categorical features (x_cat)
    # need to be factors then integers
    self$x_cat <- df %>% 
      select(-species) %>% 
      select(where(is.factor)) %>% 
      mutate(across(everything(), as.integer)) %>%
      as.matrix() %>% 
      torch_tensor()
    
    # target data y
    # categorical ---> factor then integer
    self$y <- df %>% 
      select(species) %>% 
      mutate(across(everything(), as.integer)) %>% 
      as.matrix() %>% 
      torch_tensor()
  },
  
  
  # Get item to pick data at position i
  .getitem = function(i){
    list(x_cont = self$x_cont[i, ], x_cat = self$x_cat[i, ], y = self$y[i])
  },
  
  # Length
  .length = function(){
    self$y$size()[1]
  }
  
)

Let’s see if this behaves like we want it to. We randomly split the data into training and validation sets and query their respective lengths:

set.seed(2056)
library(rsample)

# Split specification
penguins_split <- initial_split(penguins, prop = 0.73)

# Extract train and test set
train <- training(penguins_split)
test <- testing(penguins_split)


train_ds <- penguins_dataset(train)
valid_ds <- penguins_dataset(test)

length(train_ds)
[1] 243
length(valid_ds)
[1] 90

We can index into Datasets in an R-like way:

train_ds[1]
$x_cont
torch_tensor
   37.5000
   18.5000
  199.0000
 4475.0000
 2009.0000
[ CPUFloatType{5} ]

$x_cat
torch_tensor
 2
 2
[ CPULongType{2} ]

$y
torch_tensor
 1
[ CPULongType{1} ]

We then use the Datasets to instantiate DataLoaders:

A DataLoader needs to know where to get the data – namely, from the Dataset it gets passed – as well as how many items should go in a batch. Optionally, it can return data in random order (shuffle = TRUE).

train_dl <- train_ds %>% 
  dataloader(batch_size = 16, shuffle = TRUE)

valid_dl <- valid_ds %>% 
  dataloader(batch_size = 16, shuffle = FALSE)
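
Before building the network, we can peek at a single batch to see what the DataLoader delivers; dataloader_make_iter() and dataloader_next() give us one batch without writing a full loop:

# draw one batch from the training DataLoader
train_iter <- dataloader_make_iter(train_dl)
batch <- dataloader_next(train_iter)

# batch-sized tensors: continuous features, categorical features, and targets
batch$x_cont$shape   # 16  5
batch$x_cat$shape    # 16  2
batch$y$shape        # 16  1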

1.4 Classifying penguins - the network

We just heard that embedding layers work with a datatype that’s different from most other neural network layers. It is therefore convenient to have them work in a space of their own, that is, put them into a dedicated container.

Here we define a specialized module that has one embedding layer for every categorical feature. It gets passed the cardinalities of the respective features, and creates an nn_embedding() for each of them.

When called, it iterates over its submodules, lets them do their work, and returns the concatenated output.

embedding_module <- nn_module(
  
  # Initialize module parameters: one embedding layer per categorical feature
  initialize = function(cardinalities){
    self$embeddings <- nn_module_list(
      lapply(cardinalities,
             function(x) nn_embedding(num_embeddings = x, embedding_dim = ceiling(x / 2)))
    )
  },
  
  # Transformation applied to the input
  forward = function(x){
    # empty list, one slot per embedding layer
    embedded <- vector(mode = "list", length = length(self$embeddings))
    for (i in 1:length(self$embeddings)) {
      embedded[[i]] <- self$embeddings[[i]](x[, i])
    }
    
    # concatenate the per-feature embeddings column-wise
    torch_cat(embedded, dim = 2)
  }
)

That will take care of representing our embeddings. The code chunk below illustrates what the output of the embedding layer will look like:

# Convert categorical variables to torch tensors
x_cat <- penguins %>% drop_na() %>% 
      select(-species) %>% 
      select(where(is.factor)) %>% 
      mutate(across(everything(), as.integer)) %>%
      as.matrix() %>% 
      torch_tensor()


# Create a list of embedding modules, one per categorical feature
cardinalities <- c(length(levels(penguins$island)), length(levels(penguins$sex)))
embeddings <- nn_module_list(
  lapply(cardinalities,
         function(x) nn_embedding(num_embeddings = x, embedding_dim = ceiling(x / 2)))
)


# Empty list with one slot per embedding module, to make concatenation easy
embedded <- vector(mode = "list", length = length(embeddings))

# Populate list vector with embeddings for different categories
# Return embeddings for island input
embedded[[1]] <- embeddings[[1]](x_cat[, 1])
# Return embeddings for sex input
embedded[[2]] <- embeddings[[2]](x_cat[, 2])

# concatenate
torch_cat(embedded, dim = 2)
torch_tensor
 0.9309 -0.7050  1.8705
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.8705
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.8705
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.8705
 0.9309 -0.7050  1.8705
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.8705
 0.9309 -0.7050  1.1274
 0.9309 -0.7050  1.8705
-0.2280 -1.6256  1.1274
-0.2280 -1.6256  1.8705
-0.2280 -1.6256  1.1274
-0.2280 -1.6256  1.8705
-0.2280 -1.6256  1.8705
-0.2280 -1.6256  1.1274
-0.2280 -1.6256  1.8705
-0.2280 -1.6256  1.1274
-0.2280 -1.6256  1.1274
-0.2280 -1.6256  1.8705
-1.0926 -3.1703  1.1274
-1.0926 -3.1703  1.8705
-1.0926 -3.1703  1.1274
-1.0926 -3.1703  1.8705
-1.0926 -3.1703  1.1274
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{333,3} ][ grad_fn = <CatBackward0> ]

Next, let’s create the network for training.

The top-level module has three submodules: said embedding_module and two linear layers.

The first linear layer, fc1, takes the concatenation of the embedding output and the continuous features, computes an affine transformation, and passes its result on to the output layer. output then has three units, one for every possible target class.

The activation function we apply to the raw scores, nnf_log_softmax(), composes two operations: the popular-in-deep-learning softmax normalization and taking the logarithm. That way, we end up with the format expected by nnf_nll_loss(), the loss function that computes the negative log likelihood (NLL) loss between inputs and targets.
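
As a quick, illustrative aside (with made-up scores and targets): exponentiating a row of nnf_log_softmax() output gives probabilities that sum to 1, and nnf_nll_loss() on those log-probabilities matches nnf_cross_entropy() on the raw scores.

# two observations, three classes of raw scores
scores <- torch_randn(2, 3)
log_probs <- nnf_log_softmax(scores, dim = 2)

# each row of exp(log_probs) sums to 1
torch_exp(log_probs)$sum(dim = 2)

# NLL loss on log-probabilities equals cross-entropy on the raw scores
targets <- torch_tensor(c(1, 3), dtype = torch_long())
nnf_nll_loss(log_probs, targets)
nnf_cross_entropy(scores, targets)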

net <- nn_module(
  "penguin_net",
  
  # Initialize model states
  initialize = function(
    cardinalities,
    n_cont,
    fc_dim,
    output_dim){
    
    self$embedder <- embedding_module(cardinalities)
    # in_features accounts for the embedding dimensions plus the continuous features
    self$fc1 <- nn_linear(
      in_features = sum(purrr::map_dbl(cardinalities, ~ ceiling(.x / 2))) + n_cont,
      out_features = fc_dim
    )
    
    self$output <- nn_linear(fc_dim, output_dim)
    
  },
  
  # Transformation that will be performed on data
  forward = function(x_cont, x_cat){
    embedded <- self$embedder(x_cat)
    
    # concatenate embeddings and continuous features (cast to Float so dtypes match)
    all <- torch_cat(list(embedded, x_cont$to(dtype = torch_float())), dim = 2)
    
    all %>% 
      self$fc1() %>% 
      nnf_relu() %>% 
      self$output() %>% 
      nnf_log_softmax(dim = 2) # log-probabilities: exp() of each row sums to 1
    
  }
)

Let’s instantiate the top-level module:

model <- net(
  cardinalities = c(length(levels(penguins$island)), length(levels(penguins$sex))),
  n_cont = 5,
  fc_dim = 32,
  output_dim = 3
)
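
As a quick sanity check, we can run one batch through the untrained model; the output should have one row per observation and one column of log-probabilities per species:

check_batch <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
model(check_batch$x_cont, check_batch$x_cat)$shape   # 16  3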

And we’re ready for training!

1.5 Model training

# Create an optimizer and tell it which parameters it is supposed to work on.
optimizer <- optim_adam(model$parameters, lr = 0.01)
optimizer
<optim_adam>
  Inherits from: <torch_Optimizer>
  Public:
    add_param_group: function (param_group) 
    clone: function (deep = FALSE) 
    defaults: list
    initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08, 
    load_state_dict: function (state_dict) 
    param_groups: list
    state: State, R6
    state_dict: function () 
    step: function (closure = NULL) 
    zero_grad: function () 
  Private:
    step_helper: function (closure, loop_fun) 

Training loop

DataLoaders can be iterated over using coro::loop() combined with a for loop. We need coro::loop() because batches are computed lazily, only when they are actually used, which keeps memory usage low.

set.seed(2056)
for (epoch in 1:20){
  model$train()
  train_losses <- c()
  
  coro::loop(for (batch in train_dl){
    
    # Prevent accumulation of gradients
    optimizer$zero_grad()
    output <- model(batch$x_cont, batch$x_cat)
    loss <- nnf_nll_loss(output, batch$y$squeeze_())
    
    # Backpropagate to compute gradients
    loss$backward()
    # Update weights
    optimizer$step()
    
    train_losses <- c(train_losses, loss$item())
    
    
  })
  
  model$eval()
  valid_losses <- c()
  
  coro::loop(for (batch in valid_dl){
    
    
    output <- model(batch$x_cont, batch$x_cat)
    loss <- nnf_nll_loss(output, batch$y$squeeze_())
    valid_losses <- c(valid_losses, loss$item())
  })
  
  cat(sprintf("Loss at epoch %d: training: %3.3f, validation: %3.3f\n", epoch, mean(train_losses), mean(valid_losses)))
  
}
Loss at epoch 1: training: 120.791, validation: 81.105
Loss at epoch 2: training: 27.289, validation: 15.232
Loss at epoch 3: training: 14.928, validation: 12.079
Loss at epoch 4: training: 12.071, validation: 13.160
Loss at epoch 5: training: 10.891, validation: 3.534
Loss at epoch 6: training: 5.411, validation: 7.769
Loss at epoch 7: training: 6.924, validation: 3.426
Loss at epoch 8: training: 3.698, validation: 1.647
Loss at epoch 9: training: 2.497, validation: 5.548
Loss at epoch 10: training: 3.201, validation: 0.837
Loss at epoch 11: training: 3.311, validation: 1.361
Loss at epoch 12: training: 3.046, validation: 3.993
Loss at epoch 13: training: 3.295, validation: 9.792
Loss at epoch 14: training: 5.729, validation: 11.132
Loss at epoch 15: training: 10.798, validation: 13.195
Loss at epoch 16: training: 7.994, validation: 7.162
Loss at epoch 17: training: 8.621, validation: 6.510
Loss at epoch 18: training: 9.442, validation: 5.797
Loss at epoch 19: training: 6.904, validation: 10.492
Loss at epoch 20: training: 3.276, validation: 3.353