Performance pitfalls of rbind()

One of the worst performance anti-patterns in R is the repeated use of rbind() within loops. This is unfortunately a very easy trap to fall into if you are trying to construct a data frame out of parts.

The problem

Say we are trying to do analysis on data from the last 28 days. The data is stored separately by day, so we need to retrieve each day's data, manipulate it, and then combine the results.

First, let's create a function to retrieve the data for a given day. For simplicity, this example always returns the same dataset, but in a real example the data could come from a database, a URL, etc.

The data we will use for this example is from the palmerpenguins package and is pretty small.

get_data <- function(day) {
  as.data.frame(palmerpenguins::penguins)
}

dplyr::glimpse(get_data(1))

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

A common approach to this problem is to retrieve each day's data in a for loop and use rbind() to combine the results.

f0 <- function(days) {
  results <- data.frame()

  for (d in days) {
    # Get data
    day_data <- get_data(d)

    # Process data
    # ...

    # Store data
    results <- rbind(results, day_data)
  }
  results
}

However, if you use this code and increase the number of days, you will observe that it takes increasingly long to run.

library(bench)
bench_time(f0(1:10))

process    real 
 15.5ms  15.6ms

bench_time(f0(1:100))

process    real 
  559ms   559ms

bench_time(f0(1:200))

process    real 
  2.25s   2.25s

This is true even though the data size of the final result is still relatively small.

dplyr::glimpse(f0(1:200))

Rows: 68,800
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

In fact, the running time grows roughly quadratically with the number of iterations, because the total number of rows copied is \(\approx num\_iterations \times final\_data\_size / 2\).

bnch <- bench::press(
  num_days = c(1, 2, 4, 8, 16, 32, 64, 128),
  {
    days <- seq_len(num_days)
    bench::mark(f0(days))
  }
)

library(ggplot2)

bnch |> 
  ggplot() + 
    aes(num_days, as.numeric(min)) +
    geom_line(linetype = "dashed") + 
    geom_point() + 
    labs(title = "Performance of repeated rbind()", x = "Number of days", y = "Seconds")

The reason for this poor performance is that rbind() does nothing clever. Every time you call it, it simply copies all the data from its arguments into a new object and returns that object. This means that on each iteration of the loop you are copying all of the previously collected data again.

We can see this by adding some simple logging to our previous example.

say <- function(...) {
  rlang::inform(glue::glue(..., .envir = parent.frame()))
}

f0.5 <- function(days) {
  results <- data.frame()

  for (d in days) {
    # Get data
    day_data <- get_data(d)

    # Process data
    # ...

    # Store data
    results <- rbind(results, day_data)
    say("day: {d} | Rows copied: {row_count}", row_count = nrow(results) - nrow(day_data))
  }
  results
}

res <- f0.5(1:10)

day: 1 | Rows copied: 0
day: 2 | Rows copied: 344
day: 3 | Rows copied: 688
day: 4 | Rows copied: 1032
day: 5 | Rows copied: 1376
day: 6 | Rows copied: 1720
day: 7 | Rows copied: 2064
day: 8 | Rows copied: 2408
day: 9 | Rows copied: 2752
day: 10 | Rows copied: 3096
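
The logged counts can be totalled to see how the copying adds up. The sketch below assumes, as in this example, that every day returns the same 344 rows; the variable names are illustrative only.

```r
rows_per_day <- 344L  # rows returned by get_data() in this example
num_days <- 10L

# Rows already in `results` that rbind() has to copy on each iteration
rows_copied <- rows_per_day * (seq_len(num_days) - 1L)

# Total rows copied over the whole loop
total_copied <- sum(rows_copied)
total_copied
#> [1] 15480

# The approximation from the text: num_iterations * final_data_size / 2
final_rows <- rows_per_day * num_days
num_days * final_rows / 2
#> [1] 17200
```

For 200 days the same sum is 344 * 200 * 199 / 2 = 6,845,600 rows copied to build a result of only 68,800 rows.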

Solution 1 - `do.call(rbind)`

The first way we could address this issue is to change our strategy slightly. Instead of calling rbind() on each iteration of the loop, we put the results into a list as we go and then combine them all at once with do.call(rbind, results) outside the loop. Note: if you know the total number of iterations in advance, it is fastest to pre-allocate the list at its final size.

f1 <- function(days) {
  results <- vector("list", length(days))
  i <- 1
  for (d in days) {
    # Get data
    day_data <- get_data(d)

    # Process data
    # ...

    # Store data
    results[[i]] <- day_data
    i <- i + 1
  }
  do.call(rbind, results)
}
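
If the loop body does not depend on earlier iterations, the same pattern can be written with lapply(), which builds the list for you. This is only a sketch: `get_data_demo()` and `f1_lapply()` are hypothetical names, and the built-in iris dataset stands in for the penguins data so the example runs without extra packages.

```r
# Hypothetical stand-in for get_data(): always returns the built-in iris data
get_data_demo <- function(day) {
  datasets::iris
}

f1_lapply <- function(days) {
  # lapply() returns a list with one data frame per day
  results <- lapply(days, function(d) {
    day_data <- get_data_demo(d)

    # Process data
    # ...

    day_data
  })

  # Combine everything in a single rbind() call
  do.call(rbind, results)
}

res <- f1_lapply(1:3)
dim(res)
#> [1] 450   5
```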

Moving the rbind() call out of the loop rescues much of the performance, and the running time no longer grows quadratically.

bnch <- bench::press(
  num_days = c(1, 2, 4, 8, 16, 32, 64, 128),
  {
    days <- seq_len(num_days)
    bench::mark(
      "rbind" = f0(days),
      "do.call(rbind)" = f1(days)
    )
  }
)

bnch |> 
  ggplot() + 
    aes(num_days, as.numeric(min), group = as.character(expression), color = expression) +
    geom_line(linetype = "dashed") + 
    geom_point() + 
    labs(
      title = "Performance of repeated rbind() and do.call(rbind)",
      x = "Number of days",
      y = "Seconds"
    )

Solutions 2 and 3 - `dplyr::bind_rows()` and `data.table::rbindlist()`

Solution 1 is already a large improvement, but the CRAN packages dplyr and data.table provide even more efficient ways of combining a list of data frames: dplyr::bind_rows() and data.table::rbindlist(). These two functions are essentially drop-in replacements for the do.call(rbind) line in f1, though they return tibble and data.table objects respectively rather than plain data frames. Fortunately, you can convert both back to a regular data.frame with as.data.frame() at essentially no performance cost.

f2 <- function(days) {
  results <- vector("list", length(days))
  i <- 1
  for (d in days) {
    # Get data
    day_data <- get_data(d)

    # Process data
    # ...

    # Store data
    results[[i]] <- day_data
    i <- i + 1
  }
  as.data.frame(dplyr::bind_rows(results))
}

f3 <- function(days) {
  results <- vector("list", length(days))
  i <- 1
  for (d in days) {
    # Get data
    day_data <- get_data(d)

    # Process data
    # ...

    # Store data
    results[[i]] <- day_data
    i <- i + 1
  }
  as.data.frame(data.table::rbindlist(results))
}
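
As a quick sanity check that the three approaches are interchangeable, the sketch below combines a few tiny data frames each way and compares the results. The `parts` list and its values are made up for illustration, and the example assumes dplyr and data.table are installed.

```r
# Three small data frames standing in for three days of data (made-up values)
parts <- lapply(1:3, function(d) {
  data.frame(day = d, value = c(d * 10, d * 10 + 1))
})

base_res  <- do.call(rbind, parts)
dplyr_res <- as.data.frame(dplyr::bind_rows(parts))
dt_res    <- as.data.frame(data.table::rbindlist(parts))

# After as.data.frame(), all three are plain data frames of the same shape
class(dplyr_res)
class(dt_res)
dim(base_res)

# The combined column values match across the three approaches
identical(base_res$value, dplyr_res$value)
identical(base_res$value, dt_res$value)
```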

Both of these functions work in very similar ways. Their main advantages over do.call(rbind) are that they are smarter about pre-allocating the correct amount of space for the final result, and that they do the expensive part of the operation in C rather than in R code. This results in a moderate additional speed improvement.

bnch <- bench::press(
  num_days = c(1, 2, 4, 8, 16, 32, 64, 128),
  {
    days <- seq_len(num_days)
    bench::mark(
      "do.call(rbind)" = f1(days),
      "dplyr::bind_rows" = f2(days),
      "data.table::rbindlist" = f3(days)
    )
  }
)

bnch |> 
  ggplot() + 
    aes(num_days, as.numeric(min), group = as.character(expression), color = expression) +
    geom_line(linetype = "dashed") + 
    geom_point() + 
    labs(
      title = "Performance of do.call(rbind), bind_rows() and rbindlist()",
      x = "Number of days",
      y = "Seconds"
    )

The performance of dplyr::bind_rows() and data.table::rbindlist() is largely the same. On this dataset rbindlist() tends to be marginally faster, but uses \(\approx 2\times\) as much memory.

library(dplyr)

bnch |>
  mutate("fun" = as.character(expression)) |>
  select(fun, num_days, min, mem_alloc)

# A tibble: 24 × 4
   fun                   num_days      min mem_alloc
   <chr>                    <dbl> <bch:tm> <bch:byt>
 1 do.call(rbind)               1  364.6µs    77.2KB
 2 dplyr::bind_rows             1  298.8µs   150.2KB
 3 data.table::rbindlist        1  272.3µs     3.1MB
 4 do.call(rbind)               2  639.4µs   148.6KB
 5 dplyr::bind_rows             2  464.6µs    27.2KB
 6 data.table::rbindlist        2  410.6µs    95.3KB
 7 do.call(rbind)               4    1.2ms   333.8KB
 8 dplyr::bind_rows             4  782.5µs    54.1KB
 9 data.table::rbindlist        4  799.3µs   157.6KB
10 do.call(rbind)               8    2.4ms   873.4KB
# … with 14 more rows

Solution 4 - `purrr::map_dfr()`

For many cases, including this one, you can reduce the amount of bookkeeping code needed by using purrr::map_dfr(), which handles the allocation of the list and the iteration for you.

f4 <- function(days) {
  purrr::map_dfr(days,
    function(d) {
      # Get data
      day_data <- get_data(d)

      # Process data
      # ...

      day_data
    }
  )
}

The performance of map_dfr() is essentially equivalent to dplyr::bind_rows() as it uses bind_rows() under the hood.
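
As of purrr 1.0.0, map_dfr() is marked as superseded in favour of purrr::map() followed by purrr::list_rbind(). The sketch below shows that form, assuming purrr 1.0.0 or later and R 4.1 or later for the native pipe. `get_data_demo()` and `f4_list_rbind()` are hypothetical names, and the built-in iris dataset stands in for the penguins data.

```r
# Hypothetical stand-in for get_data(): always returns the built-in iris data
get_data_demo <- function(day) {
  datasets::iris
}

f4_list_rbind <- function(days) {
  days |>
    purrr::map(function(d) {
      # Get data
      day_data <- get_data_demo(d)

      # Process data
      # ...

      day_data
    }) |>
    # Row-bind the list of data frames in one step
    purrr::list_rbind()
}

res <- f4_list_rbind(1:3)
dim(res)
#> [1] 450   5
```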

Conclusion

If you have code that uses rbind() to repeatedly combine data frames, it can be a significant performance bottleneck.

You should convert it to store the results in a list and then combine them with do.call(rbind, results), dplyr::bind_rows() or data.table::rbindlist(). Alternatively, restructuring your code so it can be used with purrr::map_dfr() is another good solution.
