Scaling is an important step before applying some machine learning models such as logistic regression or neural networks. As the data set grows, scaling takes more and more time.
In R, the function base::scale is available to perform scaling, but it can be time and RAM consuming. To make scaling and some other data preparation steps faster, the dataPreparation package has been developed. This package offers functions that are fast, easy to use and robust. It relies on the power of data.table and some computational tricks.
In this article we are going to benchmark the base::scale and dataPreparation::fastScale functions when applying scaling to a train and a test set.
For this demonstration, we are going to build a train and a test set. Both sets hold uniformly distributed random values stored in a data.table of 100 columns. Train contains 80,000 lines and test 20,000 lines.
# Packages used throughout this article
library(data.table)
library(dataPreparation)
library(microbenchmark)
n_cols <- 100
n_lines_train <- 80000
n_lines_test <- 20000
train <- as.data.table(matrix(runif(n_lines_train * n_cols), nrow = n_lines_train, ncol = n_cols))
test <- as.data.table(matrix(runif(n_lines_test * n_cols), nrow = n_lines_test, ncol = n_cols))
Using scale, one needs to scale the train set first, then retrieve the center and scale values in order to apply them to the test set.
To keep the tests reproducible, we pass a copy of train and test to each call so that the original sets are never modified.
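The need for a copy comes from data.table's by-reference semantics. Here is a minimal sketch of that behaviour; `center_in_place` is a hypothetical helper written for this illustration and is not part of any package.

```r
library(data.table)

dt <- data.table(x = c(1, 2, 3))

# Hypothetical helper: centers column x by reference using `:=`
center_in_place <- function(d) {
  d[, x := x - mean(x)]
  invisible(d)
}

# Passing a copy leaves the original table untouched
centered_copy <- center_in_place(copy(dt))
unchanged_with_copy <- identical(dt$x, c(1, 2, 3))

# Passing the table itself modifies it in place
center_in_place(dt)
changed_without_copy <- identical(dt$x, c(-1, 0, 1))
```

Without the copy, every repetition of a benchmark would run on data already transformed by the previous repetition.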
# Scale train set
microbenchmark(
train_scaled <- scale(copy(train))
)
# Unit: milliseconds
# expr min lq mean median
# train_scaled <- scale(copy(train)) 529.4791 576.6235 624.192 625.5126
# uq max neval
# 665.3654 752.8822 100
# Retrieve scaling values
center <- attr(train_scaled, "scaled:center")
sd <- attr(train_scaled, "scaled:scale")
# Scale test set
microbenchmark(
test_scaled <- scale(copy(test), center = center, scale = sd)
)
# Unit: milliseconds
# expr min
# test_scaled <- scale(copy(test), center = center, scale = sd) 33.94018
# lq mean median uq max neval
# 35.91891 43.44076 39.58881 48.2702 114.7943 100
As one can see in those benchmarks, scale was about 15 times slower on the train set than on the test set (median of roughly 626 ms versus 40 ms), even though the train set is only 4 times bigger.
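To make the reuse of the train statistics concrete, here is a small sketch on toy matrices (the names `train_m`, `test_m` and `spread` are made up for this example). It checks that calling scale with the stored attributes is the same as subtracting the train means and dividing by the train standard deviations by hand.

```r
set.seed(42)
train_m <- matrix(runif(20), nrow = 5, ncol = 4)
test_m  <- matrix(runif(8),  nrow = 2, ncol = 4)

# Scale train and retrieve the statistics stored as attributes
train_s <- scale(train_m)
center  <- attr(train_s, "scaled:center")  # column means of train
spread  <- attr(train_s, "scaled:scale")   # column standard deviations of train

# Apply the train statistics to test
test_s <- scale(test_m, center = center, scale = spread)

# Same computation written out by hand
manual <- sweep(sweep(test_m, 2, center, "-"), 2, spread, "/")
same_result <- isTRUE(all.equal(manual, test_s, check.attributes = FALSE))
```

Note that the test columns are not expected to have mean 0 and standard deviation 1 after this step, since the statistics come from train only.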
Using fastScale in a data science project is a bit more intuitive: first compute the scales (mean and standard deviation of each variable), then apply those scales to the train and test sets.
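The build-then-apply pattern can be sketched in plain data.table as follows. This is only an illustration under our own assumptions, not the package's implementation: `compute_scales` and `apply_scales` are hypothetical helpers, and the real functions may work quite differently internally.

```r
library(data.table)

# Hypothetical helper: mean and standard deviation of every column
compute_scales <- function(dt) {
  lapply(dt, function(col) list(mean = mean(col), sd = sd(col)))
}

# Hypothetical helper: scale every column by reference with set()
apply_scales <- function(dt, scales) {
  for (col in names(scales)) {
    # set() overwrites the column in place, without copying the whole table
    set(dt, j = col, value = (dt[[col]] - scales[[col]]$mean) / scales[[col]]$sd)
  }
  invisible(dt)
}

set.seed(1)
toy_train <- data.table(a = runif(8), b = runif(8))
toy_test  <- data.table(a = runif(4), b = runif(4))

toy_scales <- compute_scales(toy_train)  # statistics come from train only
apply_scales(toy_train, toy_scales)
apply_scales(toy_test, toy_scales)
```

Computing the statistics once and storing them in a separate object means the same values can be applied to any number of sets without recomputing anything.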
# Build scaling values
scales <- build_scales(train, verbose = FALSE)
# Apply on train set
microbenchmark(
train_scaled <- fastScale(copy(train), scales = scales, verbose = FALSE)
)
# Unit: milliseconds
# expr
# train_scaled <- fastScale(copy(train), scales = scales, verbose = FALSE)
# min lq mean median uq max neval
# 30.5678 34.13303 54.92997 39.79589 45.73551 119.2982 100
# Apply on test set
microbenchmark(
test_scaled <- fastScale(copy(test), scales = scales, verbose = FALSE)
)
# Unit: milliseconds
# expr
# test_scaled <- fastScale(copy(test), scales = scales, verbose = FALSE)
# min lq mean median uq max neval
# 8.444013 8.911639 9.583864 9.251124 9.880742 18.06961 100
As one can see in those benchmarks, fastScale was about 5 times slower on the train set than on the test set. And with this specific number of lines and columns, fastScale performs the scaling at least 4 times faster than scale.
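The ratios can be recomputed directly from the median timings reported in the benchmark outputs above. The snippet below only does that arithmetic; the variable names are ours.

```r
# Median timings in milliseconds, copied from the benchmark outputs above
median_ms <- c(
  scale_train = 625.5126,
  scale_test  = 39.58881,
  fast_train  = 39.79589,
  fast_test   = 9.251124
)

ratios <- c(
  scale_train_vs_test = median_ms[["scale_train"]] / median_ms[["scale_test"]],
  fast_train_vs_test  = median_ms[["fast_train"]]  / median_ms[["fast_test"]],
  scale_vs_fast_train = median_ms[["scale_train"]] / median_ms[["fast_train"]],
  scale_vs_fast_test  = median_ms[["scale_test"]]  / median_ms[["fast_test"]]
)

print(round(ratios, 1))
```

Keep in mind that these figures come from a single machine and a single run of 100 evaluations, so they should be read as orders of magnitude rather than exact speed-ups.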
Performing this comparison with different numbers of lines, one can draw the following graph:
As one can see, the bigger your data set is, the more interesting it is to use fastScale instead of scale.
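For readers who want to run a comparison across data set sizes themselves, here is one possible setup. It is a rough sketch, not the script used for this article: `time_scale` is a hypothetical helper, the sizes are arbitrary, and only base::scale is timed. The same loop could be wrapped around fastScale to obtain the second curve.

```r
# Hypothetical helper: median elapsed seconds of scale() on an
# n_rows x n_cols matrix of random values
time_scale <- function(n_rows, n_cols = 20, reps = 3) {
  m <- matrix(runif(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(scale(m))[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Arbitrary sizes chosen for illustration
sizes <- c(1000, 10000, 100000)
timings <- data.frame(
  n_rows = sizes,
  median_seconds = vapply(sizes, time_scale, numeric(1))
)
print(timings)
```

system.time is used here to keep the sketch dependency-free; microbenchmark, as used in the rest of the article, gives more precise measurements for short computations.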
We presented here a benchmark for one function of the dataPreparation package. There are a few more available that are fast and easy to use. So if you liked it, please go check the package documentation (by installing it or on CRAN).
We hope that this package is helpful and that it helps you prepare your data faster.
If you would like to give us some feedback, report some issues or add some features to this package, please tell us on GitHub. Also, if you want to contribute, please don’t hesitate to contact us.