library(torch)
library(torchvision)
A tour of Torch
These are my notes on getting up to speed with torch in R, based on the following resources:
1 torch components
1.1 Creating tensors
Tensors are the core data structures in torch.
1.1.1 From R objects
Tensors can be created from R atomic vectors, matrices and arrays using the torch_tensor function:
# Create torch tensor from atomic vector
x <- torch_tensor(c(1, 2, 3))
x
torch_tensor
1
2
3
[ CPUFloatType{3} ]
Seems default tensor configuration is by rows?
# Create torch tensor from matrix
m <- matrix(runif(6), nrow = 2)
x <- torch_tensor(m)
x
torch_tensor
0.3777 0.3414 0.2290
0.7923 0.0225 0.9320
[ CPUFloatType{2,3} ]
# Create torch tensors from arrays
a <- array(runif(16), dim = c(4, 2, 2))
x <- torch_tensor(a)
x
torch_tensor
(1,.,.) =
0.6295 0.3177
0.0594 0.7283
(2,.,.) =
0.2989 0.5853
0.9285 0.7687
(3,.,.) =
0.9246 0.7718
0.2538 0.4100
(4,.,.) =
0.4965 0.4471
0.0952 0.5655
[ CPUFloatType{4,2,2} ]
1.1.2 Using initialization functions
Tensors can also be created using torch initialization functions.
# Return a tensor filled with values drawn from a standard normal distribution
x <- torch_randn(5, 2, 3)
x
torch_tensor
(1,.,.) =
0.2484 0.5517 1.6350
-0.8673 0.5252 -1.5049
(2,.,.) =
0.5737 -0.2363 -0.3382
-0.7096 0.8415 -1.1913
(3,.,.) =
-0.8613 0.8600 0.4103
-0.1573 -1.9381 0.7452
(4,.,.) =
0.1448 -1.4085 2.6813
1.0972 2.9855 -1.3764
(5,.,.) =
0.7437 -0.5917 -0.5778
-0.0745 -0.0062 0.3855
[ CPUFloatType{5,2,3} ]
torch_zeros(5)
torch_tensor
0
0
0
0
0
[ CPUFloatType{5} ]
1.1.3 Converting back to R
torch provides methods such as as.array, as.matrix, as.numeric and as.integer to convert tensors back to R objects.
# as.array is the most general method and allows
# converting any type of tensor
x <- torch_randn(2, 2)
as.array(x)
[,1] [,2]
[1,] -0.4641779 0.7707540
[2,] -0.2247133 -0.4308816
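To convince myself that the conversion goes both ways, here is a minimal round-trip sketch. The matrix values and the names `m`, `tt` and `back` are my own illustrative choices; torch stores the values as 32-bit floats by default, so I picked small integers that survive the trip exactly.

```r
library(torch)

# A small R matrix with values that 32-bit floats represent exactly
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# R -> torch: the tensor keeps the 2 x 3 shape
tt <- torch_tensor(m)

# torch -> R: as.array() gives back an ordinary R array
back <- as.array(tt)

dim(back)
all.equal(m, back)
```

For values that are not exactly representable in 32-bit floats, the round trip should only be expected to agree up to float precision.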
1.2 Tensor attributes
data type
device
dimensions
requires_grad
1.2.1 Accessing attributes
Tensor attributes can be accessed using the $ operator:
x <- torch_randn(2, 2)
# Access data type
x$dtype
torch_Float
# Access device
x$device
torch_device(type='cpu')
# Access dimensions
x$shape
[1] 2 2
# Require gradient
x$requires_grad
[1] FALSE
1.2.2 Modifying attributes
Default tensor attributes can be modified when creating the tensors, or later using the $to method.
# Modify tensor during creation
x <- torch_randn(2, 2, dtype = torch_float64())
x
torch_tensor
-0.1781 0.8839
-0.6321 1.4523
[ CPUDoubleType{2,2} ]
# Modify using $to
x <- torch_randn(2, 2)
x <- x$to(dtype = torch_float16())
x
torch_tensor
-1.3828 1.0049
0.5552 -0.4260
[ CPUHalfType{2,2} ]
1.2.3 CUDA devices
Moving between devices is also done with the $to method, but only CPU devices are available on all systems. Moving between devices is an important operation because tensor operations run on the device where the tensor is located, so if you want to use the fast GPU implementations, you need to move tensors to the CUDA device. A common pattern in torch is to create a device object at the beginning of your script and reuse it as you create and move tensors. For example:
# Create device
device <- if (cuda_is_available()) "cuda" else "cpu"
# Equivalent alternative using ifelse()
device <- ifelse(cuda_is_available(), "cuda", "cpu")
x <- torch_randn(2, 2, device = device)
x
torch_tensor
-1.2370 0.3609
-0.8836 0.1419
[ CPUFloatType{2,2} ]
y <- x$to(device = device)
y
torch_tensor
-1.2370 0.3609
-0.8836 0.1419
[ CPUFloatType{2,2} ]
1.3 Indexing tensors
Indexing tensors in torch is very similar to indexing vectors, matrices and arrays in R, with one important difference when using negative indexes.
In torch, negative indexes don’t remove the element; instead, selection counts from the end, which is the more commonly needed behaviour.
x <- torch_tensor(1:5)
x
torch_tensor
1
2
3
4
5
[ CPULongType{5} ]
Index tensors
# Take the first element
x[1]
torch_tensor
1
[ CPULongType{} ]
# Negative index from last
x[-1]
torch_tensor
5
[ CPULongType{} ]
# Select first 3 elements
x[1:3]
torch_tensor
1
2
3
[ CPULongType{3} ]
# Selecting from 3rd element to last using N
x[3:N]
torch_tensor
3
4
5
[ CPULongType{3} ]
# Select the last 2 elements
x[-2:N]
torch_tensor
4
5
[ CPULongType{2} ]
# Select using a boolean tensor
x[x > 2]
torch_tensor
3
4
5
[ CPULongType{3} ]
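Because the negative-index difference is easy to trip over, here is a small side-by-side sketch of base R and torch on the same values. The vector `v` and the result names are my own; I use floating-point values so that as.numeric gives a plain R number.

```r
library(torch)

v <- c(10, 20, 30, 40, 50)

# Base R: a negative index drops that position
r_result <- v[-1]                    # 20 30 40 50

# torch: a negative index counts from the end
x <- torch_tensor(v)
torch_result <- as.numeric(x[-1])    # 50

r_result
torch_result
```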
1.3.1 Multidimensional selections
When indexing a tensor with multiple dimensions, you can use dimension-specific indices separated by commas, just like in R. For example:
x <- torch_randn(2, 2, 3)
x
torch_tensor
(1,.,.) =
-0.5067 -2.2206 -0.5674
-3.0301 -0.4091 -0.8339
(2,.,.) =
-2.1815 1.5919 -0.7555
0.5166 -0.3398 -1.1716
[ CPUFloatType{2,2,3} ]
# Select the first element of the first two dimensions, keeping all of the third
x[1, 1, ]
torch_tensor
-0.5067
-2.2206
-0.5674
[ CPUFloatType{3} ]
# Select everything from a dimension using empty argument
x[, , 1]
torch_tensor
-0.5067 -3.0301
-2.1815 0.5166
[ CPUFloatType{2,2} ]
# or
x[.., 1]
torch_tensor
-0.5067 -3.0301
-2.1815 0.5166
[ CPUFloatType{2,2} ]
# You can also add a new dimension using the newaxis sugar
x[.., newaxis]
torch_tensor
(1,1,.,.) =
-0.5067
-2.2206
-0.5674
(2,1,.,.) =
-2.1815
1.5919
-0.7555
(1,2,.,.) =
-3.0301
-0.4091
-0.8339
(2,2,.,.) =
0.5166
-0.3398
-1.1716
[ CPUFloatType{2,2,3,1} ]
# By default, when you select a single element from a
# dimension, that dimension is dropped;
# you can change this behavior by setting drop = FALSE
x[1, ..]
torch_tensor
-0.5067 -2.2206 -0.5674
-3.0301 -0.4091 -0.8339
[ CPUFloatType{2,3} ]
# Subset assignment is also supported
x[1, 1, 1] <- 0
x[1, 1, 1]
torch_tensor
0
[ CPUFloatType{} ]
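The notes mention drop = FALSE without showing it, so here is a minimal sketch comparing the two shapes. The names `dropped` and `kept` are mine, and I am assuming the drop argument works with empty index slots as it does in base R.

```r
library(torch)

x <- torch_randn(2, 2, 3)

# Default: the dimension we took a single element from disappears
dropped <- x[1, , ]

# drop = FALSE keeps it as a dimension of size 1
kept <- x[1, , , drop = FALSE]

dropped$shape   # 2 3
kept$shape      # 1 2 3
```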
1.4 Array Computation
torch provides more than 200 functions and methods that operate on tensors. They range from mathematical operations to utilities for reshaping and modifying tensors.
Most operations have both CPU and GPU backends, and torch will use the backend corresponding to the tensor device.
See some examples below:
x <- c(1, 2, 3) %>%
  torch_tensor()
# Subtract 1 from every element (torch_sub subtracts `other`, scaled by `alpha`)
x %>%
  torch_sub(1)
torch_tensor
0
1
2
[ CPUFloatType{3} ]
# many torch_* functions have a corresponding tensor method
x$sub(1)
torch_tensor
0
1
2
[ CPUFloatType{3} ]
x %>%
  torch_exp() %>%
  torch_log()
torch_tensor
1
2
3
[ CPUFloatType{3} ]
Full documentation: https://torch.mlverse.org/docs/reference/index.html#section-mathematical-operations-on-tensors
1.4.1 Reduction functions
x <- rbind(c(1, 2, 3), 4:6) %>%
  torch_tensor()
x
torch_tensor
1 2 3
4 5 6
[ CPUFloatType{2,3} ]
# Sum of all elements in the input tensor
x$sum()
torch_tensor
21
[ CPUFloatType{} ]
# Reduce the first dimension, i.e. collapse the rows:
# the result holds one sum per column
x %>%
  torch_sum(dim = 1)
torch_tensor
5
7
9
[ CPUFloatType{3} ]
# Reduce the 2nd dimension, i.e. collapse the columns:
# the result holds one sum per row
x %>%
  torch_sum(dim = 2)
torch_tensor
6
15
[ CPUFloatType{2} ]
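To settle my own confusion about which dimension gets reduced, this sketch compares torch_sum with base R's colSums and rowSums on the same matrix. The names `m`, `by_col` and `by_row` are mine; the keepdim argument is shown only to illustrate that the reduced dimension can be kept with size 1.

```r
library(torch)

m <- rbind(c(1, 2, 3), 4:6)
x <- torch_tensor(m)

# dim = 1 collapses the rows: same numbers as colSums()
by_col <- as.numeric(torch_sum(x, dim = 1))   # 5 7 9
colSums(m)

# dim = 2 collapses the columns: same numbers as rowSums()
by_row <- as.numeric(torch_sum(x, dim = 2))   # 6 15
rowSums(m)

# keepdim = TRUE keeps the reduced dimension with size 1
torch_sum(x, dim = 1, keepdim = TRUE)$shape   # 1 3
```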
1.4.2 Broadcasting
Broadcasting allows tensors of different shapes to be combined in binary/arithmetic operations.
# Simplest broadcasting example
torch_tensor(c(1, 2, 3)) + 1
torch_tensor
2
3
4
[ CPUFloatType{3} ]
# Adding a (3,2) matrix to a (2) vector
torch_ones(3, 2) + torch_tensor(c(1, 2))
torch_tensor
2 3
2 3
2 3
[ CPUFloatType{3,2} ]
torch_ones(2, 3) + torch_tensor(c(1, 2, 3))
torch_tensor
2 3 4
2 3 4
[ CPUFloatType{2,3} ]
# Danger, Will Robinson: a (10,1) tensor plus a (10) vector broadcasts to (10,10)
torch_ones(10, 1) + torch_tensor(rep(1, 10))
torch_tensor
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
[ CPUFloatType{10,10} ]
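One way to avoid the accidental (10,10) result is to make the shapes match explicitly before adding. This is only a sketch: the names `a`, `b`, `accidental` and `aligned` are mine, and it relies on `$unsqueeze()` taking a 1-based position in R torch.

```r
library(torch)

a <- torch_ones(10, 1)
b <- torch_tensor(rep(1, 10))

# Shapes (10,1) and (10) broadcast to (10,10)
accidental <- a + b

# Give b an explicit second dimension so both are (10,1)
aligned <- a + b$unsqueeze(2)

accidental$shape   # 10 10
aligned$shape      # 10 1
```

Checking `$shape` on both operands before an arithmetic operation is a cheap guard against this kind of silent blow-up.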
1.5 What’s Autograd?
Autograd allows torch to compute exact derivatives of tensor operations with minimal code changes. It’s the central feature for making torch useful for training neural network models.
Suppose we have the operation y = x^3 and we want to compute the derivative \frac{dy}{dx} at the point x = 2. Since \frac{dy}{dx} = 3x^2, the answer is 12. See: wolfram
# Create `x` that requires gradient
x <- torch_tensor(2, requires_grad = TRUE)
# Compute `y` as usual
y <- x^3
# Call backward, which does the actual computation of gradients
y$backward()
# Extract the derivative
x$grad
torch_tensor
12
[ CPUFloatType{1} ]
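The same mechanism works when x holds several values, as long as the result passed to $backward() is a single number. A minimal sketch with my own example function, y = sum of x_i^2, whose gradient is 2x:

```r
library(torch)

x <- torch_tensor(c(1, 2, 3), requires_grad = TRUE)

# Reduce to a scalar so that backward() has a single value to differentiate
y <- (x^2)$sum()
y$backward()

# d/dx_i of sum(x^2) is 2 * x_i
grad <- as.numeric(x$grad)
grad   # 2 4 6
```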
1.5.1 Disabling autograd
It can be useful to disable autograd for a few operations, e.g. when performing inference on a model:
x <- torch_tensor(2, requires_grad = TRUE)
with_no_grad({
  y <- x^3
})
# Fails since there is no operation tracked by autograd
# y$backward()
1.5.2 A slightly more advanced example
Suppose now that we have a function f(x) = 3x^2 - 2x and we want to find its minimum using gradient descent. We define this function in R with:
f <- function(x){
  3*x^2 - 2*x
}
Next we are going to use the gradient descent algorithm to find its minimum. Two details matter here:
We must disable autograd (using with_no_grad) when updating weights, because we don’t want autograd to track the weight update and back-propagate through it later. The update must happen in-place on the weight tensor, otherwise this tensor is no longer a leaf tensor and torch can no longer back-propagate gradients for it.
We must manually erase the gradients that we just used to update the tensors, usually by setting them to zero. This must also happen in-place, for example with x$grad$zero_(). By default torch accumulates gradients after backward, and in general we want to start the gradient accumulation afresh after each weight update.
The torch optim_* functions spare us from having to remember all these details.
# Define learning rate
lr <- 0.1
# Number of iterations
num_iter <- 20
# Start with a random number
x <- torch_randn(1, requires_grad = TRUE)
for (i in seq_len(num_iter)){
  y <- f(x)
  y$backward()
  with_no_grad({
    x$sub_(lr*x$grad)
    x$grad$zero_()
  })
}
x
torch_tensor
0.3333
[ CPUFloatType{1} ][ requires_grad = TRUE ]
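The gradient accumulation mentioned above is easiest to believe when seen directly. This sketch reuses y = x^3 at x = 2 (derivative 12) and calls $backward() twice without zeroing in between; the variable names `g1`, `g2` and `g3` are mine.

```r
library(torch)

x <- torch_tensor(2, requires_grad = TRUE)

# First backward pass: gradient is 3 * 2^2 = 12
y <- x^3
y$backward()
g1 <- x$grad$item()

# Second backward pass without zeroing: the new 12 is added to the old one
y <- x^3
y$backward()
g2 <- x$grad$item()

# Zero the gradient in place, as the training loop does
x$grad$zero_()
g3 <- x$grad$item()

c(g1, g2, g3)   # 12 24 0
```

Without the zero_() call inside the loop, each update would use the running sum of all past gradients instead of the current one.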
If we make a plot of this function, we can easily see that this value corresponds to the minimum:
library(tidyverse)
df <- tibble::tibble(x = as.numeric(x), y = f(x))
df %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point(color = "red", size = 2) +
  stat_function(fun = f) +
  xlim(-2, 2)
1.6 Optimizers
torch optimizers are torch’s abstraction to encapsulate weight-updating logic.
First, let’s come back to the example we used in the autograd section, where we manually updated the weights using the simplest form of the gradient descent algorithm.
The logic for updating the weights is wrapped in the with_no_grad block.
# Start with a random number
x <- torch_randn(1, requires_grad = TRUE)
for (i in seq_len(num_iter)){
  y <- f(x)
  y$backward()
  # ------ -> updating the weights
  with_no_grad({
    x$sub_(lr*x$grad)
    x$grad$zero_()
  })
  # ------ <- updating the weights
}
x
torch_tensor
0.3333
[ CPUFloatType{1} ][ requires_grad = TRUE ]
We can rewrite this piece using the packaged torch optimizers.
# Learning rate
lr <- 0.1
# Number of iterations
num_iter <- 20
# Starting point
x <- torch_randn(1, requires_grad = TRUE)
# We create the optimizer and the first argument is a list
# of weights we want to optimize. It also takes a
# learning rate argument
optimizer <- optim_sgd(x, lr = lr)
for (i in seq_len(num_iter)){
  # Refresh the grad attribute of all parameters
  optimizer$zero_grad()
  y <- f(x)
  y$backward()
  # Perform one update step for all parameters
  optimizer$step()
}
x
torch_tensor
0.3333
[ CPUFloatType{1} ][ requires_grad = TRUE ]
Note that we no longer need to wrap our calls in with_no_grad, nor manually perform the update step with an in-place operation. The optimizer takes care of all of these implementation details. We still need to compute the loss and use the $backward() method to populate the grad attribute of each parameter.
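To check that the optimizer really does the same thing as the manual update, this sketch runs both from the same fixed starting point (my choice of 1 instead of a random number, so the runs are comparable). The names `x_manual` and `x_optim` are mine, and I assume plain optim_sgd without momentum performs the same subtraction as the manual step.

```r
library(torch)

f <- function(x) 3*x^2 - 2*x
lr <- 0.1

# Same fixed starting point for both versions
x_manual <- torch_tensor(1, requires_grad = TRUE)
x_optim  <- torch_tensor(1, requires_grad = TRUE)
optimizer <- optim_sgd(list(x_optim), lr = lr)

for (i in seq_len(20)) {
  # Manual update
  y <- f(x_manual)
  y$backward()
  with_no_grad({
    x_manual$sub_(lr * x_manual$grad)
    x_manual$grad$zero_()
  })

  # Optimizer update
  optimizer$zero_grad()
  y <- f(x_optim)
  y$backward()
  optimizer$step()
}

# Both should sit near the analytic minimum at x = 1/3
x_manual$item()
x_optim$item()
```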
1.7 Neural network modules
In torch all layers and models are called neural network modules, or nn_modules for short.
Deep learning models can be thought of as functions that operate on tensors, but these functions have a special technical feature: they have state (weights and other parameters) that changes during training.
1.7.1 Implemented nn_modules
torch provides implementations of many of the most common neural network layers, like convolutional, recurrent, pooling and activation layers, as well as common loss functions.
# Apply a linear module and define structure
linear <- nn_linear(in_features = 10, out_features = 1)
linear
An `nn_module` containing 11 parameters.
-- Parameters ------------------------------------------------------------------
* weight: Float [1:1, 1:10]
* bias: Float [1:1]
x <- torch_randn(3, 10)
linear(x)
torch_tensor
-0.2091
0.5430
-0.7450
[ CPUFloatType{3,1} ][ grad_fn = <AddmmBackward0> ]
Instances of nn_modules also have useful methods, for example to inspect their parameters or move them to a different device.
# List parameters
str(linear$parameters)
List of 2
$ weight:Float [1:1, 1:10]
$ bias :Float [1:1]
# Access individual parameters
linear$weight
torch_tensor
0.1418 0.1971 0.3020 0.1634 -0.2996 -0.1990 -0.1604 0.0243 -0.1020 -0.2254
[ CPUFloatType{1,10} ][ requires_grad = TRUE ]
linear$bias
torch_tensor
0.01 *
1.7467
[ CPUFloatType{1} ][ requires_grad = TRUE ]
# Moves the parameters to the specified device
linear$to(device = "cpu")
Find list of modules: Function reference
1.7.2 Custom nn_modules
To build a custom nn_module one requires 2 functions:
initialize: The initialize method is used to initialize model parameters and has access to the self object, which can be used to share state between methods.
forward: The forward method describes the transformations that the nn_module is going to perform on input data.
# Create a Linear nn_module
Linear <- nn_module(
  # Initialize model parameters
  initialize = function(in_features, out_features){
    # nn_parameter() tells nn_module that w and b are parameters
    self$w <- nn_parameter(torch_randn(in_features, out_features))
    self$b <- nn_parameter(torch_zeros(out_features))
  },
  # Describe the transformation applied to the data
  forward = function(input){
    # Matrix multiplication
    torch_mm(input, self$w) + self$b
  }
)
# Create an instance of it
lin <- Linear(in_features = 10, out_features = 1)
lin
An `nn_module` containing 11 parameters.
-- Parameters ------------------------------------------------------------------
* w: Float [1:10, 1:1]
* b: Float [1:1]
We now have an instance of the Linear module that is called lin. We can use this instance to actually perform the linear model computation on a tensor. We call the instance like an R function, and it delegates to the forward method that we defined earlier.
x <- torch_randn(3, 10)
lin(x)
torch_tensor
1.4604
-0.5456
-0.9156
[ CPUFloatType{3,1} ][ grad_fn = <AddBackward0> ]
1.7.3 Combining multiple modules
nn_modules can also include sub-modules, and this is what allows us to write models using the same abstraction that we use to write layers.
For example, let’s build a multi-layer perceptron module with a ReLU activation.
# MLP with ReLU activation
nn_mlp <- nn_module(
  # Initialize model states
  initialize = function(in_features, hidden_features, out_features){
    self$fc1 <- nn_linear(in_features, hidden_features)
    self$relu <- nn_relu()
    self$fc2 <- nn_linear(hidden_features, out_features)
  },
  # Define transformations that will be performed
  forward = function(input){
    input %>%
      self$fc1() %>%
      self$relu() %>%
      self$fc2()
  }
)
mlp <- nn_mlp(in_features = 10, hidden_features = 5, out_features = 1)
mlp
An `nn_module` containing 61 parameters.
-- Modules ---------------------------------------------------------------------
* fc1: <nn_linear> #55 parameters
* relu: <nn_relu> #0 parameters
* fc2: <nn_linear> #6 parameters
# Calling the model
x <- torch_randn(3, 10)
mlp(x)
torch_tensor
0.1746
0.0258
0.1690
[ CPUFloatType{3,1} ][ grad_fn = <AddmmBackward0> ]
In torch there’s no difference between modules and models, i.e., an nn_module can be as low-level as a ReLU activation, or as high-level as a ResNet model.
1.7.4 Sequential modules
When the forward method in the nn_module just calls the submodules in sequence, as in the previous example, one can use the nn_sequential container to skip writing the forward method:
mlp <- nn_sequential(
  nn_linear(10, 5),
nn_relu(),
nn_linear(5, 1)
)
mlp
An `nn_module` containing 61 parameters.
-- Modules ---------------------------------------------------------------------
* 0: <nn_linear> #55 parameters
* 1: <nn_relu> #0 parameters
* 2: <nn_linear> #6 parameters
1.7.5 Functional API
Most nn_* modules have an nnf_* counterpart, for example, nnf_relu() and nn_relu().
Sometimes the functional API is more convenient, especially if the module counterpart does not include parameters, because it allows you to avoid initializing the module.
I didn’t fully grasp this yet, so more reading on it later.
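As far as I can tell, the difference is only in how the operation is called: the module has to be created before it can be applied, while the functional version is applied directly. A minimal sketch on a made-up input (the names `relu`, `out_module` and `out_functional` are mine):

```r
library(torch)

x <- torch_tensor(c(-1, 0, 2))

# Module API: create the module first, then call it like a function
relu <- nn_relu()
out_module <- as.numeric(relu(x))

# Functional API: call the function directly, nothing to initialize
out_functional <- as.numeric(nnf_relu(x))

out_module       # 0 0 2
out_functional   # 0 0 2
```

Since ReLU has no parameters, nothing is lost by skipping the module; for layers with weights, such as nn_linear, the module is what stores them.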
1.7.6 Example: training a linear model
Let’s use everything we have learned so far to train a linear model on simulated data. First, let’s simulate a data set.
We will generate a matrix with 100 observations of 3 variables, all randomly generated from the standard normal distribution. The response tensor will be generated using the equation y = 0.5 + 2 x_1 - 3 x_2 + x_3 + \text{noise}, where the noise is a small amount sampled from a normal distribution with mean 0 and standard deviation 0.1.
# Generate a matrix with 100 observations of 3 variables
x <- torch_randn(100, 3)
# Equation for output tensor
y <- 0.5 + 2*x[,1] - 3*x[,2] + x[,3] + torch_randn(100)/10
# Reshape y to be 100 x 1
y <- y[, newaxis]
y
torch_tensor
-0.8791
2.6720
0.8301
-0.7848
-0.7519
-2.3814
2.4995
4.6559
-5.5641
-5.9122
-3.0375
-2.3074
-1.5036
-1.0461
1.4571
2.5013
-4.6902
0.6530
-7.4685
3.2246
0.8506
3.3885
-3.6155
2.2077
-0.9626
2.2580
1.1011
3.4453
1.3866
1.7551
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{100,1} ]
We now define our model and optimizer:
# Define model: MLP
# model <- nn_sequential(
# nn_linear(in_features = 3, out_features = 32),
# nn_relu(),
# nn_linear(in_features = 32, out_features = 1)
# )
model <- nn_linear(in_features = 3, out_features = 1)
model
An `nn_module` containing 4 parameters.
-- Parameters ------------------------------------------------------------------
* weight: Float [1:1, 1:3]
* bias: Float [1:1]
# Define optimizer that implements SGD
opt <- optim_sgd(model$parameters, lr = 0.1)
opt
<optim_sgd>
Inherits from: <torch_Optimizer>
Public:
add_param_group: function (param_group)
clone: function (deep = FALSE)
defaults: list
initialize: function (params, lr = optim_required(), momentum = 0, dampening = 0,
load_state_dict: function (state_dict)
param_groups: list
state: State, R6
state_dict: function ()
step: function (closure = NULL)
zero_grad: function ()
Private:
step_helper: function (closure, loop_fun)
Training loop
# Training loop to see whether we can recover the weights used in the simulation
for (iter in 1:10){
  # Refresh the grad attribute of all parameters
  opt$zero_grad()
  pred <- model(x)
  # nnf_mse_loss takes the prediction first, then the target
  loss <- nnf_mse_loss(pred, y)
  # Calculate the gradients (back-propagation)
  loss$backward()
  # Use the optimizer to update model parameters
  opt$step()
  cat("Loss at step ", iter, ": ", loss$item(), "\n")
}
Loss at step 1 : 16.37103
Loss at step 2 : 10.025
Loss at step 3 : 6.223922
Loss at step 4 : 3.91603
Loss at step 5 : 2.495534
Loss at step 6 : 1.609513
Loss at step 7 : 1.049808
Loss at step 8 : 0.6920332
Loss at step 9 : 0.4608567
Loss at step 10 : 0.3100319
The idiom of zeroing gradients is here to stay: values stored in grad fields accumulate, so whenever we’re done using them, we need to zero them out before reuse.
Finally, we can inspect the fitted parameter values. They should be close to the values we used to simulate the data (weights 2, -3 and 1, bias 0.5); after only 10 steps they have not fully converged yet.
model$weight
torch_tensor
1.7861 -2.6981 0.7968
[ CPUFloatType{1,3} ][ requires_grad = TRUE ]
model$bias
torch_tensor
0.2601
[ CPUFloatType{1} ][ requires_grad = TRUE ]
Save the model for inference using torch_save; saveRDS doesn’t work correctly for torch models.
# # Finally save model
# torch_save(model, "model.pt")
#
# # To reload model
# torch_load("model.pt")
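A runnable version of the save and reload step, as a sketch only: it uses an untrained stand-in nn_linear and a temporary file (both my choices, so nothing is written into the project), and checks that the reloaded model gives the same predictions on some made-up input `new_x`.

```r
library(torch)

# Stand-in for the trained model (untrained here, for illustration only)
model <- nn_linear(in_features = 3, out_features = 1)

# Save to a temporary file and load it back
path <- tempfile(fileext = ".pt")
torch_save(model, path)
reloaded <- torch_load(path)

# Inference does not need gradients, so wrap it in with_no_grad
new_x <- torch_randn(5, 3)
with_no_grad({
  pred_original <- as.numeric(model(new_x))
  pred_reloaded <- as.numeric(reloaded(new_x))
})

all.equal(pred_original, pred_reloaded)
```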
1.8 Datasets and dataloaders
torch_dataset is the object representing data in torch.
1.8.1 Custom datasets
A new torch_dataset can be created using the dataset function, which requires the following 3 functions as arguments:
initialize: takes inputs for dataset initialization.
.getitem: takes a single integer as input and returns an observation of the dataset.
.length: returns the total number of observations.
# Custom torch_dataset
mydataset <- dataset(
  initialize = function(n_rows, n_cols){
    self$x <- torch_randn(n_rows, n_cols)
    self$y <- torch_randn(n_rows)
  },
  # We subset the previously initialized x and y using index provided
  .getitem = function(index){
    list(self$x[index, ], self$y[index])
  },
  # Number of rows by looking at the initialized tensor x
  .length = function(){
    self$x$shape[1]
  }
)
The dataset function creates a definition of how to initialize a dataset, get elements from it and compute its length. Initialize the dataset and start extracting elements from it:
# Initialize
ds <- mydataset(n_rows = 10, n_cols = 3)
# length
length(ds)
[1] 10
# Extract first observation
ds[1]
[[1]]
torch_tensor
-0.3693
0.0837
-0.5006
[ CPUFloatType{3} ]
[[2]]
torch_tensor
-2.22254
[ CPUFloatType{} ]
# or equivalently
ds$.getitem(1)
[[1]]
torch_tensor
-0.3693
0.0837
-0.5006
[ CPUFloatType{3} ]
[[2]]
torch_tensor
-2.22254
[ CPUFloatType{} ]
1.8.2 Common patterns:
The dataset() function allows us to define data loading and pre-processing in a very flexible way. We can implement the dataset in whatever way works best for our problem.
See:
1.8.3 Dataloaders
Dataloaders are torch’s abstraction used to iterate over datasets in batches, and optionally shuffle and prepare data in parallel.
A dataloader is created by passing a dataset instance to the dataloader() function:
library(torchvision)
# Taking the validation dataset
mnist <- mnist_dataset(root = "data-raw/mnist", download = TRUE, train = FALSE)
# Data loader
dl <- dataloader(mnist, batch_size = 32, shuffle = TRUE)
# Number of batches we can extract from dataloader
length(dl)
[1] 313
length() returns the number of batches we can extract from the dataloader.
Dataloaders can be iterated on using the coro::loop() function combined with a for loop. The reason we need coro::loop() is that batches in dataloaders are only computed when they are actually used, to avoid large memory usage.
total <- 0
coro::loop(for (batch in dl){
  total <- total + batch$x$shape[1]
})
total
[1] 10000
You can think of a dataloader as an object similar to an R list, with the important difference that the elements are not actually computed yet; they get computed every time you loop through it.
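The same batching can be tried without downloading MNIST by putting a small custom dataset behind a dataloader. This is a sketch under my own assumptions: `toy_dataset` is a hypothetical dataset with random data, its .getitem returns a named list so that batches expose `batch$x` and `batch$y`, and the coro package is installed.

```r
library(torch)

# Hypothetical toy dataset with random features and targets
toy_dataset <- dataset(
  name = "toy_dataset",
  initialize = function(n_rows, n_cols) {
    self$x <- torch_randn(n_rows, n_cols)
    self$y <- torch_randn(n_rows)
  },
  # Named list so that each batch has $x and $y
  .getitem = function(index) {
    list(x = self$x[index, ], y = self$y[index])
  },
  .length = function() {
    self$x$shape[1]
  }
)

ds <- toy_dataset(n_rows = 10, n_cols = 3)
dl <- dataloader(ds, batch_size = 4)

# 10 rows in batches of 4 gives 3 batches (4 + 4 + 2 rows)
n_batches <- length(dl)

# Count the rows seen across all batches
total <- 0
coro::loop(for (batch in dl) {
  total <- total + batch$x$shape[1]
})

n_batches   # 3
total       # 10
```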