Unless the data you're working with ships with torch, you have to wrap it in a `Dataset`.
1 torch Dataset objects
A Dataset is an R6 object that knows how to iterate over data.
While a Dataset may have any number of methods – each responsible for some aspect of pre-processing logic, for example – just three methods are required:
- `initialize()`, to pre-process and store the data;
- `.getitem(i)`, to pick the item at position `i`; and
- `.length()`, to indicate to the `DataLoader` how many items it has.
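As a minimal sketch of that contract, here is a toy `Dataset` wrapping a plain numeric vector (just an illustration, not part of the penguins example that follows):

```r
library(torch)

toy_dataset <- dataset(
  name = "toy_dataset",

  # store the data as a tensor
  initialize = function(x) {
    self$x <- torch_tensor(x)
  },

  # pick the item at position i
  .getitem = function(i) {
    self$x[i]
  },

  # report how many items there are
  .length = function() {
    self$x$size()[1]
  }
)

ds <- toy_dataset(c(1.5, 2.5, 3.5))
length(ds)  # 3
ds[2]       # the second element, as a tensor
```

1.1 Penguins

Let's explore some penguins data!

```r
library(tidyverse)
library(torch)
library(palmerpenguins)

penguins %>% glimpse()
```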
Predictors come in two types, categorical and continuous.

- Continuous features of R type `double` may be fed to torch without further ado: we just use them directly to initialize a torch tensor, which will be of type `Float` (see the quick check right after this list).
- torch needs all data to be in numerical form, so vectors of type `character` first need to become factors, which can then be treated as numeric via level extraction.
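For example, with a few made-up measurements:

```r
# A double vector is turned into a Float tensor as-is
x <- torch_tensor(c(39.1, 39.5, 40.3))
x$dtype
```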
1.2 Categorical data in deep learning
We have two options:

- One-hot encode them.
- Leave them as they are and have the network map each discrete value to a multidimensional continuous representation, a.k.a. an embedding.
Embedding modules expect their inputs to be of type `Long`. A tensor created from an R value will have that type if we make sure the value is an `integer`: R doubles become `Float` tensors, while R integers become `Long` tensors, which is exactly what an embedding layer needs:
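```r
as.factor("one") %>%
  as.numeric() %>%
  as.integer() %>%
  torch_tensor()
```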
1.3 A dataset for penguins

In `initialize()`, we prepare and store the data: continuous features become `Float` tensors, while categorical features (and the target) become `Long` tensors.

Note that `torch_tensor()` does not handle data frames or tibbles. That is why we split the data into continuous features, categorical features, and the outcome, convert each piece to a matrix, and only then create tensors; it is the same pattern you would use in PyTorch. Here is the full `Dataset`:
```r
penguins_dataset <- dataset(
  name = "penguins_dataset",

  # Preprocess and store the data
  initialize = function(df) {
    df <- df %>% drop_na()

    # Continuous features (x_cont) don't need much transformation
    x_cont <- df %>%
      select(-species) %>%
      select(where(is.numeric)) %>%
      as.matrix()
    self$x_cont <- torch_tensor(x_cont)

    # Categorical features (x_cat) need to be factors, then integers
    self$x_cat <- df %>%
      select(-species) %>%
      select(where(is.factor)) %>%
      mutate(across(everything(), as.integer)) %>%
      as.matrix() %>%
      torch_tensor()

    # Target y: categorical ---> factor, then integer
    self$y <- df %>%
      select(species) %>%
      mutate(across(everything(), as.integer)) %>%
      as.matrix() %>%
      torch_tensor()
  },

  # Pick the item at position i
  .getitem = function(i) {
    list(x_cont = self$x_cont[i, ],
         x_cat = self$x_cat[i, ],
         y = self$y[i])
  },

  # Number of items
  .length = function() {
    self$y$size()[1]
  }
)
```
Let’s see if this behaves like we want it to. We randomly split the data into training and validation sets and query their respective lengths:
```r
set.seed(2056)
library(rsample)

# Split specification
penguins_split <- initial_split(penguins, prop = 0.73)

# Extract train and test set
train <- training(penguins_split)
test <- testing(penguins_split)

train_ds <- penguins_dataset(train)
valid_ds <- penguins_dataset(test)

length(train_ds)
length(valid_ds)
```
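We can index into a `Dataset` in an R-like way:

```r
train_ds[1]
```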
We then use the Datasets to instantiate DataLoaders:
A `DataLoader` needs to know where to get the data (namely, from the `Dataset` it gets passed) as well as how many items should go in a batch. Optionally, it can return the data in random order (`shuffle = TRUE`).
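```r
train_dl <- train_ds %>% dataloader(batch_size = 16, shuffle = TRUE)
valid_dl <- valid_ds %>% dataloader(batch_size = 16, shuffle = FALSE)
```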
1.4 Classifying penguins - the network

We just heard that embedding layers work with a datatype that's different from most other neural network layers. It is therefore convenient to have them work in a space of their own, that is, to put them into a dedicated container.
Here we define a specialized module that has one embedding layer for every categorical feature. It gets passed the cardinalities of the respective features, and creates an nn_embedding() for each of them.
When called, it iterates over its submodules, lets them do their work, and returns the concatenated output.
```r
embedding_module <- nn_module(

  # initialize module parameters
  initialize = function(cardinalities) {
    self$embeddings <- nn_module_list(
      lapply(cardinalities,
             function(x) nn_embedding(num_embeddings = x,
                                      embedding_dim = ceiling(x / 2)))
    )
  },

  # transformation applied to the input
  forward = function(x) {
    # collect the per-feature embeddings in a list, then concatenate
    embedded <- vector(mode = "list", length = length(self$embeddings))
    for (i in seq_along(self$embeddings)) {
      embedded[[i]] <- self$embeddings[[i]](x[, i])
    }
    torch_cat(embedded, dim = 2)
  }
)
```
That will take care of representing our embeddings. The code chunk below illustrates what the output of the embedding layer will look like:
```r
# Convert categorical variables to torch tensors
x_cat <- penguins %>%
  drop_na() %>%
  select(-species) %>%
  select(where(is.factor)) %>%
  mutate(across(everything(), as.integer)) %>%
  as.matrix() %>%
  torch_tensor()

# Create a list of embedding modules, one per categorical feature
cardinalities <- c(length(levels(penguins$island)), length(levels(penguins$sex)))
embeddings <- nn_module_list(
  lapply(cardinalities,
         function(x) nn_embedding(num_embeddings = x,
                                  embedding_dim = ceiling(x / 2)))
)

# Empty list to collect the embedded features; this makes them easy to concatenate
embedded <- vector(mode = "list", length = length(embeddings))

# Embeddings for the island input
embedded[[1]] <- embeddings[[1]](x_cat[, 1])
# Embeddings for the sex input
embedded[[2]] <- embeddings[[2]](x_cat[, 2])

# Concatenate
torch_cat(embedded, dim = 2)
```
The top-level module has three submodules: said embedding_module and two linear layers.
The first linear layer takes the concatenation of the embedded categorical features and the continuous features and applies an affine transformation; after a ReLU, its result is passed to the output layer. `output` then has three units, one for every possible target class.
The activation function we apply to the raw aggregation, nnf_log_softmax(), composes two operations: the popular-in-deep-learning softmax normalization and taking the logarithm. Like that, we end up with the format expected by nnf_nll_loss(), the loss function that computes the negative log likelihood (NLL) loss between inputs and targets.
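To see what that means in practice, here is a tiny, made-up example: arbitrary raw scores for two observations and three classes (remember that on the R side, class indices are 1-based):

```r
# raw, unnormalized scores: 2 observations x 3 classes
scores <- torch_tensor(rbind(c(2.0, 0.5, -1.0),
                             c(0.1, 1.2,  0.3)))

# log_softmax turns every row into log-probabilities
log_probs <- nnf_log_softmax(scores, dim = 2)
log_probs$exp()$sum(dim = 2)  # exponentiated, each row sums to 1

# nll_loss picks the log-probability of the true class and negates it (averaged over the batch)
targets <- torch_tensor(c(1L, 2L))
nnf_nll_loss(log_probs, targets)
```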
```r
net <- nn_module(
  "penguin_net",

  # Initialize model states
  initialize = function(cardinalities, n_cont, fc_dim, output_dim) {
    self$embedder <- embedding_module(cardinalities)
    # in_features: total embedding dimensions plus the continuous features
    self$fc1 <- nn_linear(
      in_features = sum(purrr::map(cardinalities, function(x) ceiling(x / 2)) %>% unlist()) + n_cont,
      fc_dim
    )
    self$output <- nn_linear(fc_dim, output_dim)
  },

  # Transformation performed on the data
  forward = function(x_cont, x_cat) {
    embedded <- self$embedder(x_cat)
    # keep the continuous features as Float so they match the embeddings' dtype
    all <- torch_cat(list(embedded, x_cont$to(dtype = torch_float())), dim = 2)
    all %>%
      self$fc1() %>%
      nnf_relu() %>%
      self$output() %>%
      # log-probabilities over the three classes; exponentiated, each row sums to 1
      nnf_log_softmax(dim = 2)
  }
)
```
Let’s instantiate the top-level module:
```r
model <- net(
  cardinalities = c(length(levels(penguins$island)), length(levels(penguins$sex))),
  n_cont = 5,
  fc_dim = 32,
  output_dim = 3
)
```
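As a quick sanity check, we can draw a single batch using torch's `dataloader_make_iter()` and `dataloader_next()` helpers and pass it through the untrained model. The output should have one row per observation in the batch and three columns, one per class:

```r
b <- train_dl %>%
  dataloader_make_iter() %>%
  dataloader_next()

model(b$x_cont, b$x_cat)$size()  # batch_size, 3
```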
And we’re ready for training!
1.5 Model training
```r
# Create an optimizer and tell it which parameters it is supposed to work on
optimizer <- optim_adam(model$parameters, lr = 0.01)
optimizer
```
```
<optim_adam>
  Inherits from: <torch_Optimizer>
  Public:
    add_param_group: function (param_group)
    clone: function (deep = FALSE)
    defaults: list
    initialize: function (params, lr = 0.001, betas = c(0.9, 0.999), eps = 1e-08,
    load_state_dict: function (state_dict)
    param_groups: list
    state: State, R6
    state_dict: function ()
    step: function (closure = NULL)
    zero_grad: function ()
  Private:
    step_helper: function (closure, loop_fun)
```
Training loop
Dataloaders can be iterated on using the coro::loop() function combined with a for loop. The reason we need coro::loop() is that batches in dataloaders are only computed when they are actually used, to avoid large memory usage.
```r
set.seed(2056)

for (epoch in 1:20) {

  model$train()
  train_losses <- c()

  coro::loop(for (batch in train_dl) {
    # Prevent accumulation of gradients
    optimizer$zero_grad()
    output <- model(batch$x_cont, batch$x_cat)
    loss <- nnf_nll_loss(output, batch$y$squeeze_())
    # Backpropagate to compute gradients
    loss$backward()
    # Update weights
    optimizer$step()
    train_losses <- c(train_losses, loss$item())
  })

  model$eval()
  valid_losses <- c()

  coro::loop(for (batch in valid_dl) {
    output <- model(batch$x_cont, batch$x_cat)
    loss <- nnf_nll_loss(output, batch$y$squeeze_())
    valid_losses <- c(valid_losses, loss$item())
  })

  cat(sprintf("Loss at epoch %d: training: %3.3f, validation: %3.3f\n",
              epoch, mean(train_losses), mean(valid_losses)))
}
```
```
Loss at epoch 1: training: 120.791, validation: 81.105
Loss at epoch 2: training: 27.289, validation: 15.232
Loss at epoch 3: training: 14.928, validation: 12.079
Loss at epoch 4: training: 12.071, validation: 13.160
Loss at epoch 5: training: 10.891, validation: 3.534
Loss at epoch 6: training: 5.411, validation: 7.769
Loss at epoch 7: training: 6.924, validation: 3.426
Loss at epoch 8: training: 3.698, validation: 1.647
Loss at epoch 9: training: 2.497, validation: 5.548
Loss at epoch 10: training: 3.201, validation: 0.837
Loss at epoch 11: training: 3.311, validation: 1.361
Loss at epoch 12: training: 3.046, validation: 3.993
Loss at epoch 13: training: 3.295, validation: 9.792
Loss at epoch 14: training: 5.729, validation: 11.132
Loss at epoch 15: training: 10.798, validation: 13.195
Loss at epoch 16: training: 7.994, validation: 7.162
Loss at epoch 17: training: 8.621, validation: 6.510
Loss at epoch 18: training: 9.442, validation: 5.797
Loss at epoch 19: training: 6.904, validation: 10.492
Loss at epoch 20: training: 3.276, validation: 3.353
```