A tutorial from Quantitative Science Consulting
This tutorial will show you how to train your first neural
network in R. I am glossing over the mathematical details and focusing
on implementation so that you can follow this process with your own
data. We are using the incredibly powerful
torch package
for R, which allows you to build, train, and deploy neural networks
in R.
In this tutorial, we will predict the price of diamonds based on some of their features (e.g., carat and size). Price is a continuous response variable (or “target”), and you could use the same code to predict any continuous target in your own dataset (e.g., height, distance).
# for manipulating data
library(dplyr)
# for parallel processing
library(purrr)
# for neural network functionality
library(torch)
You may be prompted to install additional software for torch. Type Yes.
Here we will only consider continuous features; neural networks with categorical features are more advanced, so let’s keep it simple for now. In a future tutorial, I’ll show you how to include categorical features. If you want to use your own data, this is the only step you need to change. Specifically, change the raw_data object, specify the column names of the continuous features, and specify the target name.
url_data <- "https://raw.githubusercontent.com/mikeyEcology/Dada-Analysis-on-Diamonds-Dataset/main/diamonds.csv"
raw_data <- url_data |>
url() |>
vroom::vroom()
continuous_features <- c("carat", "depth", "table", "x", "y", "z")
target_name <- "price"
In this example, we are using data on diamonds and predicting their price from some of their features. These data are from Kaggle; you can read more details and get feature definitions here.
Many statistical and machine learning methods perform better if the data are standardized and scaled. For neural networks this is very important: failing to scale can lead to the problem of vanishing or exploding gradients. Fortunately, base R has a function to scale our covariates. Note that we will need to unscale the output of our model to get real-world predictions.
features <- raw_data |>
dplyr::select(dplyr::all_of(continuous_features)) |>
scale() |>
# scale converts the data into an unfortunate format. Make it back to df
tibble::as_tibble()
target <- raw_data |>
dplyr::select(dplyr::all_of(target_name)) |>
scale() |>
tibble::as_tibble()
train_val <- dplyr::bind_cols(features, target)
After scaling these features, we expect them to have a mean of 0 and a standard deviation of 1. Our values might differ slightly from 0 and 1, but they should be close.
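The code that produced the summary below is not shown here; one quick way to check the column means and standard deviations is:
# check that each column is (approximately) centered and scaled
# (a minimal sketch; the original check code is not shown)
colMeans(train_val)
apply(train_val, 2, sd)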
## carat depth table x y
## 3.563583e-14 5.518015e-13 -3.419117e-14 1.151112e-13 8.160402e-14
## z price
## -2.502553e-13 -6.337301e-15
## carat depth table x y z price
## 1 1 1 1 1 1 1
You should always plot your data before fitting models! There are many ways to look at your data and that is the subject of another tutorial. We’ll just plot each variable against the other ones.
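The original plotting code is not shown; a minimal sketch using the base R pairs function would be:
# pairwise scatterplots of all scaled features and the target
# (a sketch; the original plotting code is not shown)
pairs(train_val, pch = ".", col = "grey30")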
You might be unfamiliar with this type of data construct in R. Object-oriented programming is less common in R, and it can take some getting used to.
dset <- torch::dataset(
name = "dset",
initialize = function(indices) {
data <- self$prepare_data(train_val[indices, ])
self$x <- data[[1]]
self$y <- data[[2]]
},
.getitem = function(i) {
x <- self$x[i, ]
y <- self$y[i, ]
list(x, y)
},
.length = function() {
dim(self$y)[1]
},
prepare_data = function(input) {
feature_cols <- input |>
dplyr::select(dplyr::all_of(continuous_features)) |>
as.matrix()
target_col <- input |>
dplyr::select(dplyr::all_of(target_name)) |>
as.matrix()
list(
torch_tensor(feature_cols),
torch_tensor(target_col)
)
}
)
One portion of the data (in this case 80%) will be used to train the network. The remainder of the data will be used to validate the net: to determine how good the network is at predicting new data.
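The code that creates train_indices and valid_indices is not shown in the original; a minimal 80/20 random split (the seed and sampling approach are assumptions) could look like this:
# randomly assign 80% of the rows to training and the remaining 20% to validation
# (assumed split code; the original is not shown)
set.seed(123)
n_rows <- nrow(train_val)
train_indices <- sample(seq_len(n_rows), size = floor(0.8 * n_rows))
valid_indices <- setdiff(seq_len(n_rows), train_indices)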
Dataloaders are special structures for torch (and other deep learning software) that allow the data to be loaded in batches. To set up a dataloader, we first need to set up a Dataset (using the dataset class defined above).
# Bigger batch sizes train faster; the general consensus is that this is not a hyperparameter worth tuning.
batch_size <- 256
train_ds <- dset(train_indices)
train_dl <- train_ds |>
dataloader(batch_size = batch_size, shuffle = TRUE)
valid_ds <- dset(valid_indices)
valid_dl <- valid_ds |>
dataloader(batch_size = batch_size, shuffle = FALSE)
The model structure is also defined using Object Oriented
Programming. It will take some getting used to, but it is very
customizable. We are building a NN with one hidden layer and
fc1_dim neurons in that layer. Our output layer has only 1
neuron because this is regression (there is one value predicted by the
network).
See comments in code for details
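The definition of the net module itself is not shown here; below is a minimal sketch consistent with the call net(num_numerical, fc1_dim) that follows (the ReLU activation is an assumption):
# a minimal module: input features -> one hidden layer -> one output neuron
# (a sketch; the original module definition is not shown, and the ReLU activation is assumed)
net <- torch::nn_module(
  "net",
  initialize = function(num_numerical, fc1_dim) {
    # fully connected layer from the input features to the hidden layer
    self$fc1 <- nn_linear(num_numerical, fc1_dim)
    # single output neuron for the regression target
    self$output <- nn_linear(fc1_dim, 1)
  },
  forward = function(x) {
    x |>
      self$fc1() |>
      nnf_relu() |>
      self$output()
  }
)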
# This is the number of neurons in the hidden layer of the neural network
fc1_dim <- 32
# The number of features is the number of neurons in the input layer
num_numerical <- length(continuous_features)
# build this neural net
model <- net(num_numerical, fc1_dim)
Send the model to a Graphics Processing Unit (GPU) if you have one.
This is very important for bigger models; a GPU speeds up computation. If you would rather not use your GPU, set device <- "cpu". If you are using the GPU, always send the model there before specifying the optimizer (below).
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
Define the hyperparameters. These are options that you can “tune” (adjust) to try to improve model fit.
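The original hyperparameter code is not shown; the values below are assumptions (ten epochs matches the output printed later, and Adam matches the ?optim_adam reference at the end of the tutorial, while the loss function and learning rate are guesses):
# assumed hyperparameters; the original values are not shown
num_epochs <- 10                  # passes through the training data
learning_rate <- 0.001            # optimizer step size (Adam's default)
optimizer <- optim_adam(model$parameters, lr = learning_rate)
loss_func <- nnf_mse_loss         # mean squared error, a standard regression loss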
Training and validation are conducted in this one loop
for (epoch in 1:num_epochs) {
model$train()
train_losses <- c()
coro::loop(for (b in train_dl) {
optimizer$zero_grad()
output <- model(b[[1]]$to(device = device))
loss <- loss_func(output, b[[2]]$to(dtype = torch_float(), device = device))
loss$backward()
optimizer$step()
train_losses <- c(train_losses, loss$item())
})
model$eval()
valid_losses <- c()
coro::loop(for (b in valid_dl) {
output <- model(b[[1]]$to(device = device))
loss <- loss_func(output, b[[2]]$to(dtype = torch_float(), device = device))
valid_losses <- c(valid_losses, loss$item())
})
cat(sprintf("Epoch %d: training loss: %3f, validation loss: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}
## Epoch 1: training loss: 0.908324, validation loss: 0.758647
## Epoch 2: training loss: 0.684765, validation loss: 0.658025
## Epoch 3: training loss: 0.624438, validation loss: 0.621939
## Epoch 4: training loss: 0.599572, validation loss: 0.603768
## Epoch 5: training loss: 0.585223, validation loss: 0.592853
## Epoch 6: training loss: 0.576741, validation loss: 0.585594
## Epoch 7: training loss: 0.570447, validation loss: 0.580454
## Epoch 8: training loss: 0.566900, validation loss: 0.576667
## Epoch 9: training loss: 0.563283, validation loss: 0.573795
## Epoch 10: training loss: 0.560574, validation loss: 0.571567
# switch the model to evaluation mode before making predictions
model$eval()
# predict the target for the validation dataset
target_pred <- c()
coro::loop(for (b in valid_dl) {
output <- model(b[[1]]$to(device = device))
# convert the batch of predictions from a tensor to a plain R vector and append it
target_pred <- c(target_pred, as.numeric(as_array(output)))
})
# get the observed target from the validation dataset
# we use `as_array` to convert from tensors to standard R data structures
target_obs <- as_array(valid_ds$y)
Recall that we scaled the data before fitting the model. We need to unscale the observed and predicted values to see them on their original scale.
# get mean and std dev from original data (these were used when scaling by the `scale` function)
raw_target <- raw_data |>
dplyr::select(dplyr::all_of(target_name)) |>
dplyr::pull()
std_dev <- sd(raw_target)
mn <- mean(raw_target)
# unscale
y_obs <- (target_obs * std_dev) + mn
y_pred <- (target_pred * std_dev) + mn
plot(
y_obs, y_pred, main = "Model evaluation",
xlab = "Observed target value",
ylab = "Predicted target value",
axes = FALSE, pch = 16
)
segments(
x0=0, y0=0, x1=max(y_obs), y1=max(y_obs), lty = 2, lwd = 4, col = "red"
)
axis(1)
axis(2, las = 2)
Correlation between observed and predicted:
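The original code for this check is not shown; a simple Pearson correlation between the unscaled observed and predicted values (a sketch, which may not reproduce both printed values) is:
# correlation between observed and predicted prices
# (a sketch; the original correlation code is not shown)
cor(y_obs, y_pred)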
## [,1]
## [1,] 0.8930453
## [,1]
## [1,] 0.9595739
You can also use your other favorite methods to compare observed to predicted values.
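For example, a root mean squared error in the original price units could be computed like this (an illustration, not part of the original tutorial code):
# RMSE in the original price units
# (an illustration; not part of the original tutorial)
sqrt(mean((y_obs - y_pred)^2))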
This is my first data science tutorial. Is it helpful for you? Please reach out on Twitter or LinkedIn to tell me what types of tutorials would be most helpful.
Follow QSC on LinkedIn to stay up to date with the latest tutorials.
Look at the help for the optimizer (?optim_adam) and play with the options of this function. What happens if you change the number of neurons in the hidden layer (fc1_dim)?