2024-05-28

Performant R Code

Many books have been written about writing faster code in R and other languages.

Our goal here is to cover what I view as the most common and useful tips and approaches for writing high-performing R code. These are laid out roughly in the order you should try them, roughly increasing in complexity:

  • Don’t Do It
  • Know The Language
  • Do Less
  • Do Smarter
  • Do With More

Don’t Do It

Premature optimization

“Premature optimization is the root of all evil” is one of the most famous quotes in computer science.

Premature optimization is a phrase used to describe a situation where a programmer lets performance considerations affect the design of a piece of code. This can result in a design that is not as clean as it could have been or code that is incorrect, because the code is complicated by the optimization and the programmer is distracted by optimizing.

Wikipedia

There’s no better optimization…

…than not running code at all. Be aware of tools like targets for computationally demanding analysis projects.

“The package skips costly runtime for tasks that are already up to date, orchestrates the necessary computation with implicit parallel computing, and abstracts files as R objects.”
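
For flavor, here is a minimal sketch of what a targets pipeline file (_targets.R) might look like; the input file and modeling steps are hypothetical:

# _targets.R: a hypothetical minimal pipeline
library(targets)

list(
    tar_target(raw_data, read.csv("measurements.csv")),  # hypothetical input file
    tar_target(fit, lm(y ~ x, data = raw_data)),          # hypothetical model
    tar_target(fit_summary, summary(fit))
)

Running tar_make() then builds each target in turn, skipping any that are already up to date.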

Know The Language

Use R’s vectorisation when possible

Loops in R are not necessarily slow…but whenever possible, make use of R’s built-in vectorisation (its ability to operate on all the elements of a vector at once).

a <- c(1, 1, 2, 3, 5, 8)
a + 1
## [1] 2 2 3 4 6 9

Use R’s vectorisation when possible

Loops in R are not necessarily slow…but whenever possible, make use of R’s built-in vectorisation (its ability to operate on all the elements of a vector at once).

x <- 1:100 # example data

# Naive, non-vectorised approach
lsum <- 0
for(i in 1:length(x)) {
    lsum <- lsum + log(x[i])
}

# The clearer, faster R approach
lsum <- sum(log(x))

“This is speaking R with a C accent—a strong accent” (The R Inferno)

Use R’s vectorisation when possible

Timing difference:

## Unit: nanoseconds
##               expr  min   lq  mean median   uq     max neval
##        loop(1:100) 2910 2990 15300   3030 3300 1200000   100
##  vectorised(1:100)  615  820  5450    943 1020  443000   100

(We’re using the microbenchmark package to quantify the speed of different approaches.)

Note: vectorisation is most efficient with single, long vectors. Think clearly about when your data needs to be grouped or nested, and when it doesn’t.

Use dedicated functions and packages

Specialized functions such as colSums(), rowSums(), colMeans(), and rowMeans() are fast.

# Create a large matrix and test different approaches
library(microbenchmark)
x <- cbind(x1 = rnorm(1e4), x2 = rnorm(1e4))

mbm <- microbenchmark(rowSums(x), 
                      x[,1] + x[,2],
                      apply(x, 1, sum))
print(mbm, signif = 3)
## Unit: microseconds
##              expr     min      lq   mean  median     uq    max neval
##        rowSums(x)    7.95    8.98   11.2    9.68   12.7   53.8   100
##   x[, 1] + x[, 2]   17.50   19.00   23.7   19.80   27.6   73.0   100
##  apply(x, 1, sum) 4930.00 5060.00 5570.0 5700.00 5860.0 9200.0   100

Use the best tool

Years ago a colleague and I wrote the RCMIP5 package for handling and summarizing CMIP5 data in R.

It was super slow.

Then we discovered fast, specialized tools such as CDO, Panoply, and NetCDF.

Don’t reinvent the wheel; use the best (hopefully, freely available and open source) tool for the job.

Do Less

Don’t carry unnecessary data

Subset/filter data before computing on it; otherwise, you’re doing unnecessary work.

Here we compute the mean price by color for “Ideal” diamonds:

library(ggplot2) # for 'diamonds'
library(dplyr)
postfilter <- function() {
    diamonds %>% 
        group_by(cut, color) %>% 
        summarise(mean(price), .groups = "drop") %>%
        filter(cut == "Ideal")
}

Don’t carry unnecessary data

Subset/filter data before computing on it; otherwise, you’re doing unnecessary work.

Same operation, but first we isolate just the data we’re interested in:

prefilter <- function() {
    diamonds %>% 
        filter(cut == "Ideal") %>% 
        group_by(color) %>% 
        summarise(mean(price))
}

Don’t carry unnecessary data

Subset/filter data before computing on it; otherwise, you’re doing unnecessary work.

Timing difference:

## Unit: milliseconds
##          expr  min   lq mean median   uq   max neval
##  postfilter() 1.43 1.46 1.73   1.48 1.51 11.40   100
##   prefilter() 1.28 1.32 1.43   1.34 1.38  2.88   100

This difference will get worse with larger data frames; carrying less data will also reduce memory usage.
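
These timings can be reproduced with a call like the following, once both functions are defined and microbenchmark is loaded:

library(microbenchmark)
microbenchmark(postfilter(), prefilter())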

Don’t do unnecessary things

Move any unnecessary computations outside of loops or repeatedly-called functions.

For example, here we repeatedly calculate avg inside the loop:

very_slow_average <- function(x) {
    sm <- 0
    for(i in seq_along(x)) {
        sm <- sm + x[i]
        avg <- sm / i
    }
    avg
}

Don’t do unnecessary things

Move any unnecessary computations outside of loops or repeatedly-called functions.

Here we calculate avg only once, after the loop is finished:

slow_average <- function(x) {
    sm <- 0
    for(i in x) {
        sm <- sm + i
    }
    sm / length(x)
}

Don’t do unnecessary things

Move any unnecessary computations outside of loops or repeatedly-called functions.

Timing difference:

## Unit: microseconds
##                      expr  min   lq  mean median   uq     max neval
##  very_slow_average(1:100) 3.61 3.65 14.30   3.65 3.71 1070.00   100
##       slow_average(1:100) 1.11 1.15  8.67   1.15 1.19  752.00   100
##               mean(1:100) 1.19 1.23  1.38   1.27 1.31    6.93   100

Wait, why isn’t there a bigger difference between slow_average() and R’s built-in mean()?!?

R’s JIT compiler

Most of the slow_average() calls weren’t so slow, thanks to R’s Just-In-Time (JIT) compiler.

An (old) example from Dirk Eddelbuettel:

h <- function(n, x=1) for (i in 1:n) x=(1+x)^(-1)
print(h)
## function(n, x=1) for (i in 1:n) x=(1+x)^(-1)

Printing h shows it to be a function, just as we expect.

R’s JIT compiler

h(1e6) # run h() with a large n
print(h)
## function(n, x=1) for (i in 1:n) x=(1+x)^(-1)
## <bytecode: 0x110f7fad0>

That’s changed! Behind the scenes R has replaced our function with a compiled version of it. This compiled version is usually much faster.

R is very smart about when to compile functions (technically, closures), and you don’t usually have to think about it. But it can mean that time-intensive functions run faster upon re-use.
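
You can also invoke the byte-code compiler yourself via the base compiler package; a small sketch:

library(compiler)

# Compile a function explicitly rather than waiting for the JIT
slow_average_cmp <- cmpfun(slow_average)

# enableJIT() controls the automatic compilation level
# (0 disables it; 3, the default in recent R, compiles most code)
old_level <- enableJIT(3)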

Do Smarter

Understand memory allocation

When we call rbind() (and similar operations), R computes how much memory the new object needs, allocates that on the heap, copies everything to the new location, and frees the old objects. This is expensive.

Creating a data frame with 100 copies of iris, calling rbind() each time:

rbind_in_loop <- function() {
    out <- data.frame()
    for(i in 1:100) out <- rbind(out, iris)
    out
}

Understand memory allocation

When we call rbind() (and similar operations), R computes how much memory the new object needs, allocates that on the heap, copies everything to the new location, and frees the old objects. This is expensive.

Creating a list of the 100 data frames and then calling rbind() once:

use_a_list <- function() {
    out <- list()
    for(i in 1:100) out[[i]] <- iris
    do.call("rbind", out) # or dplyr::bind_rows()
}

Understand memory allocation

When we call rbind() (and similar operations), R computes how much memory the new object needs, allocates that on the heap, copies everything to the new location, and frees the old objects. This is expensive.

Timing difference:

## Unit: milliseconds
##             expr   min    lq  mean median    uq  max neval
##  rbind_in_loop() 35.70 38.10 42.80  39.40 43.30 78.0   100
##     use_a_list()  4.36  4.66  5.61   4.92  5.73 30.7   100

So use_a_list() is roughly 8x faster (comparing medians)…and it’s much more memory efficient.
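
One way to check the memory side of that claim is a benchmarking tool that tracks allocations; a sketch using the bench package, which reports memory allocated alongside timings:

library(bench)

# check = FALSE: we only care about the time and memory columns here
bench::mark(rbind_in_loop(), use_a_list(), check = FALSE)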

Understand what triggers a copy

It’s not always obvious what triggers an (expensive) object copy.

df <- data.frame(x = 1:2, y = 3:4)

library(lobstr)
ref(df) # print location of each component of df
## █ [1:0x115553488] <df[,2]> 
## ├─x = [2:0x107d34ca8] <int> 
## └─y = [3:0x107d34d88] <int>

We see that the two columns of df are stored at locations

  • 0x107d34ca8 (the x column), and
  • 0x107d34d88 (the y column)

Understand what triggers a copy

Adding a column does not trigger a copy, as R only has to create a vector for the new column:

df$z <- "A"
obj_addrs(df)
## [1] "0x107d34ca8" "0x107d34d88" "0x115716288"

Adding a row, however, means that every column has to be copied to a new object (and so the memory addresses change). This is slow.

df[3,] <- c(-1, -1, "B")
obj_addrs(df)
## [1] "0x1105cd570" "0x110623e08" "0x12440c7e8"

Understand your algorithms

A clueless prime number algorithm:

slow_prime <- function(n) {
    if(n == 2) return(TRUE)
    for(i in 2:(n-1)) {
        if(n %% i == 0) return(FALSE)
    }
    return(TRUE)
}

slow_prime(7417)
## [1] TRUE
slow_prime(7418)
## [1] FALSE

Understand your algorithms

A slightly less naive prime number algorithm:

less_slow_prime <- function(n) {
    if(n < 4) return(TRUE)             # 2 and 3 are prime (2:sqrt(3) would wrongly include 1)
    for(i in 2:sqrt(n)) {              # <----
        if(n %% i == 0) return(FALSE)
    }
    return(TRUE)
}

less_slow_prime(7417)
## [1] TRUE
less_slow_prime(7418)
## [1] FALSE

Understand your algorithms

Timing difference:

## Unit: microseconds
##                expr min     lq   mean median     uq    max neval
##       slow_prime(n) 405 425.00 484.00 439.00 479.00 1380.0   100
##  less_slow_prime(n)   5   5.29   5.85   5.49   5.94   12.9   100
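
You could push this further still, for example by testing only odd divisors after ruling out 2; a rough sketch:

faster_prime <- function(n) {
    if(n < 4) return(n > 1)           # 2 and 3 are prime; 0 and 1 are not
    if(n %% 2 == 0) return(FALSE)     # rule out even numbers once
    i <- 3
    while(i * i <= n) {               # only odd divisors up to sqrt(n)
        if(n %% i == 0) return(FALSE)
        i <- i + 2
    }
    TRUE
}

This halves the number of candidate divisors again.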

Understand the R Profiler

Typically, 10% of code consumes 90% of the time. How do you find the slow part of your code?

One option is to insert lots of timing statements… but the more powerful approach is utils::Rprof().

Profiling works by sampling (writing out) the call stack at regular intervals as the code executes.

# Log profiling information to a tempfile
Rprof(tmp <- tempfile())
# Run the 100-copies-of-iris example from a few slides back
x <- rbind_in_loop()
# Stop profiling
Rprof()

Understand the R Profiler

The profiling data makes it clear that almost all of rbind_in_loop()’s time is spent in the rbind() call; that’s the expensive step.

summaryRprof(tmp)$by.total
##                   total.time total.pct self.time self.pct
## "rbind_in_loop"          0.12    100.00      0.00     0.00
## "rbind"                  0.12    100.00      0.00     0.00
## "[<-.factor"             0.08     66.67      0.08    66.67
## "[<-"                    0.08     66.67      0.00     0.00
## "as.vector"              0.02     16.67      0.02    16.67
## "vapply"                 0.02     16.67      0.02    16.67
## "as.vector.factor"       0.02     16.67      0.00     0.00

Understand the R Profiler

RStudio has built-in support for using and visualizing code profiling.
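
RStudio’s profiler is built on the profvis package, which can also be called directly; a minimal sketch:

library(profvis)

# Profile the same example and view an interactive flame graph
profvis(rbind_in_loop())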

Do With More

Parallelize locally

For tasks that are embarrassingly parallel and compute-bound, R’s built-in parallel package can be a life-changer.

Use mclapply() to split a job across the different cores of your local machine:

expensive_job <- function(x) Sys.sleep(2)

# Standard approach using lapply()
serial_approach <- function() lapply(1:4, expensive_job)

# Parallelized version using mclapply()
library(parallel)
parallel_approach <- function() mclapply(1:4, 
                                         expensive_job, 
                                         mc.cores = 4)

Parallelize locally

For tasks that are embarrassingly parallel and compute-bound, R’s built-in parallel package can be a life-changer.

Timing difference:

## Unit: seconds
##                 expr  min   lq mean median   uq  max neval
##    serial_approach() 7.89 7.89 7.89   7.89 7.89 7.89     1
##  parallel_approach() 1.99 1.99 1.99   1.99 1.99 1.99     1

Parallelize locally

For tasks that are embarrassingly parallel and compute-bound, R’s built-in parallel package can be a life-changer.

Important caveats:

  • There can be many subtleties with running parallelized code
  • Things are harder on Windows, where mclapply()’s fork-based approach isn’t available (see the socket-cluster sketch after this list)
  • If your jobs take lots of memory, you will need to think carefully about how that will play out
  • There is a small computational cost to sending jobs to new cores, so evaluate how ‘big’ (expensive) jobs are and how they should be split up.
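
For the Windows caveat above, the socket-cluster interface in parallel is the usual workaround; a rough sketch (the cluster size here is arbitrary):

library(parallel)

# Socket clusters work on all platforms, including Windows
cl <- makeCluster(4)
results <- parLapply(cl, 1:4, expensive_job)
stopCluster(cl)   # always release the workers when done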

Rcpp

Thanks to the Rcpp package it’s straightforward to call C++ code from R; because C++ is a compiled, low-level language, that code will usually run significantly faster.

Consider an R function to test whether a number is odd:

isOddR <- function(num = 10L) { 
    result <- (num %% 2L == 1L) 
    return(result)
}

isOddR(42L)
## [1] FALSE

Rcpp

We can write a C++ version of this function and call it from R:

library(Rcpp)
cppFunction("
bool isOddCpp(int num = 10) {
   bool result = (num % 2 == 1);
   return result;
}")

isOddCpp
## function (num = 10L) 
## .Call(<pointer: 0x103814780>, num)
isOddCpp(42L)
## [1] FALSE

For nontrivial computations this approach can be much more efficient, and it lets you use high-performance C++ libraries for specialized tasks.

Note: Windows users will have to install Rtools.

Rcpp

C++ code also has access to R functions:

evalCpp("Rcpp::rnorm(3)")
## [1] -1.0546128 -0.2999353 -0.8682443

For more information, see the Rcpp documentation and Eddelbuettel and Balamuta (2018).

HPC

Cluster and high performance computing (HPC) resources provide maximal speedup for many kinds of jobs, but there can be lots of subtleties and pain points: getting access, queue wait times, using MPI, handling networked filesystems, shared libraries, implementing post-processing, etc.

There’s a full example of using R on HPC at /compass/fme200002/r_cluster_example/:

. /etc/profile.d/modules.bash
module purge
module load r

# Change directory into the example
cd /compass/fme200002/r_cluster_example

# R script to run 
EXAMPLE_SCRIPT="/compass/fme200002/r_cluster_example/example.R"

# run script with the slurm array index as the only argument to the script 
Rscript $EXAMPLE_SCRIPT $SLURM_ARRAY_TASK_ID
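
On the R side, example.R can read that array index from its command-line arguments; a hypothetical sketch of the pattern:

# Hypothetical sketch of how example.R might use the Slurm array index
args <- commandArgs(trailingOnly = TRUE)
task_id <- as.integer(args[1])

# Use the task ID to select this job's share of the work,
# e.g. one input file per array task (directory and function are hypothetical)
input_files <- list.files("inputs", full.names = TRUE)
result <- process_file(input_files[task_id])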

Conclusion

Evaluate costs and benefits

Remember, premature optimization is the root of all evil!

Resources