Scaling is an important step before applying some machine learning models such as logistic regression or neural networks. As the data set grows, scaling takes more and more time.

In R, the function base::scale is available to perform scaling, but it can be time and RAM consuming. To make scaling and some other data preparation steps faster, the dataPreparation package was developed. This package offers functions that are fast, easy to use and robust. It relies on the power of data.table and some computational tricks.

In this article we are going to benchmark base::scale and dataPreparation::fastScale functions when applying scaling to a train and a test set.

Build a train and test set

For this demonstration, we are going to build a train and a test set. Each set holds random values stored in a data.table with 100 columns. Train contains 80 000 lines and test 20 000 lines.

# Packages used throughout this article
library(data.table)
library(dataPreparation)
library(microbenchmark)
n_cols = 100
n_lines_train = 80000
n_lines_test = 20000
train = as.data.table(matrix(runif(n_lines_train * n_cols), nrow = n_lines_train, ncol = n_cols))
test = as.data.table(matrix(runif(n_lines_test * n_cols), nrow = n_lines_test, ncol = n_cols))

Scaling with scale

Using scale, one needs to first scale the train set, then retrieve the center and scale values in order to apply them to the test set.

In order to make reproducible tests, we send a copy of train and test to avoid modifying them.

# Scale train set
microbenchmark(
  train_scaled <- scale(copy(train))
)
# Unit: milliseconds
#                                expr      min       lq    mean   median
#  train_scaled <- scale(copy(train)) 529.4791 576.6235 624.192 625.5126
#        uq      max neval
#  665.3654 752.8822   100
# Retrieve scaling values
center <- attr(train_scaled, "scaled:center")
sd <- attr(train_scaled, "scaled:scale")
# Scale test set
microbenchmark(
   test_scaled <- scale(copy(test), center = center, scale = sd)
)
# Unit: milliseconds
#                                                           expr      min
#  test_scaled <- scale(copy(test), center = center, scale = sd) 33.94018
#        lq     mean   median      uq      max neval
#  35.91891 43.44076 39.58881 48.2702 114.7943   100

As one can see in those benchmarks, scale was about 15 times slower on train than on test (medians of roughly 626 ms versus 40 ms), even though train is only 4 times bigger.
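To make explicit what "retrieve center and scale to apply them to the test set" means, here is a minimal base R sketch on small matrices. The names train_small, test_small, center_small and scale_small are illustrative and not part of the benchmark above. It checks that scale with explicit center and scale arguments gives the same result as subtracting the train means and dividing by the train standard deviations by hand.

```r
# Small illustrative matrices (hypothetical data, not the benchmark sets)
set.seed(42)
train_small <- matrix(runif(20), nrow = 5, ncol = 4)
test_small  <- matrix(runif(8),  nrow = 2, ncol = 4)

# Scale train and retrieve the values that scale() computed
train_small_scaled <- scale(train_small)
center_small <- attr(train_small_scaled, "scaled:center")  # column means of train
scale_small  <- attr(train_small_scaled, "scaled:scale")   # column sds of train

# Apply the train values to test, once by hand and once through scale()
manual    <- sweep(sweep(test_small, 2, center_small, "-"), 2, scale_small, "/")
via_scale <- scale(test_small, center = center_small, scale = scale_small)

# Compare values only; scale() attaches extra attributes to its result
same_result <- isTRUE(all.equal(manual, via_scale, check.attributes = FALSE))
print(same_result)
```

The important point is that the test set is never used to compute the means or standard deviations; it only receives the values learned on train.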

Scaling with fastScale

Using fastScale in a data science project is a bit more intuitive: first compute the scales (mean and sd for each variable), then apply them to train and test.

# Build scaling values
scales <- build_scales(train, verbose = FALSE)
# Apply on train set
microbenchmark(
  train_scaled <- fastScale(copy(train), scales = scales, verbose = FALSE)
)
# Unit: milliseconds
#                                                                      expr
#  train_scaled <- fastScale(copy(train), scales = scales, verbose = FALSE)
#      min       lq     mean   median       uq      max neval
#  30.5678 34.13303 54.92997 39.79589 45.73551 119.2982   100
# Apply on test set
microbenchmark(
  test_scaled <- fastScale(copy(test), scales = scales, verbose = FALSE)
)
# Unit: milliseconds
#                                                                    expr
#  test_scaled <- fastScale(copy(test), scales = scales, verbose = FALSE)
#       min       lq     mean   median       uq      max neval
#  8.444013 8.911639 9.583864 9.251124 9.880742 18.06961   100

As one can see in those benchmarks, fastScale was about 5 times slower on train than on test; note that the scales are computed beforehand by build_scales, so only their application is timed here. With this specific number of lines and columns, fastScale performs the computation at least 4 times faster than scale.
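One reason in-place approaches can be fast is that data.table lets you overwrite columns by reference instead of copying the whole table, which is also why the benchmarks above pass copy(train) and copy(test). The sketch below illustrates that idea with data.table::set on a tiny table. It is an assumption-laden illustration, not the actual implementation of fastScale, and the names dt, original, col_means and col_sds are hypothetical.

```r
library(data.table)

# Tiny illustrative table (hypothetical data)
set.seed(42)
dt <- as.data.table(matrix(runif(12), nrow = 4, ncol = 3))
original <- copy(dt)  # untouched copy, kept for comparison

# Per-column means and standard deviations
# (illustrative only; build_scales returns its own structure)
col_means <- sapply(dt, mean)
col_sds   <- sapply(dt, sd)

# Overwrite each column by reference: no copy of the whole table is made
for (j in seq_along(dt)) {
  set(dt, j = j, value = (dt[[j]] - col_means[[j]]) / col_sds[[j]])
}

# dt now holds scaled values, while original still holds the raw ones
matches_scale <- isTRUE(all.equal(as.matrix(dt),
                                  scale(as.matrix(original)),
                                  check.attributes = FALSE))
print(matches_scale)
```

Because dt is modified in place, running the loop a second time would scale already scaled data; working on a copy, as the benchmarks do, keeps repeated runs comparable.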

Benchmark with growing number of lines

Performing this comparison with different numbers of lines, one can draw the following graph:

As one can see, the bigger your data set is, the more worthwhile it is to use fastScale instead of scale.
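A comparison across sizes can be set up with a simple loop. The sketch below times base::scale only, using base R's system.time, with sizes and a column count chosen arbitrarily so that it runs quickly; they are not the values behind the graph. Timing fastScale in the same loop would follow the same pattern, provided dataPreparation is installed.

```r
# Hypothetical sizes, kept small so the sketch runs quickly
set.seed(1)
n_cols_demo <- 10
sizes <- c(1000, 5000, 10000)

# Time scale() once per size and keep the elapsed time in seconds
timings <- sapply(sizes, function(n) {
  m <- matrix(runif(n * n_cols_demo), nrow = n, ncol = n_cols_demo)
  system.time(scale(m))[["elapsed"]]
})

results <- data.frame(n_lines = sizes, elapsed_sec = timings)
print(results)
```

A single system.time call per size is noisy; for numbers worth plotting, repeating each measurement with microbenchmark, as done earlier in this article, is the safer choice.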

Conclusion

We presented here a benchmark for one function of the dataPreparation package. A few more functions are available that are also fast and easy to use. So if you liked it, please go check the package documentation (by installing the package or on CRAN).

We hope that this package is helpful and that it helps you prepare your data faster.

If you would like to give us feedback, report issues, or suggest features for this package, please tell us on GitHub. And if you want to contribute, please don’t hesitate to contact us.