One of the worst performance anti-patterns in R is the repeated use of rbind() within loops. This is unfortunately a very easy trap to fall into if you are trying to construct a data frame out of parts.
The problem
Say we want to analyze data from the last 28 days. The data is stored separately by day, so we need to retrieve each day's data, manipulate it, and then combine the results.
First, let's create a function to retrieve the data for a given day. In this example we always return the same dataset for simplicity, but in a real example the data could come from a database, URL, etc.
The data we will use for this example is from the palmerpenguins package and is pretty small.
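The definition of get_data() is not shown here. A minimal sketch, assuming it ignores its argument and returns the penguins data frame from palmerpenguins, might look like the following. The fallback to the built-in iris data is purely an assumption, added so the sketch runs even where palmerpenguins is not installed.

```r
# Hypothetical stand-in for the retrieval function described above.
# Assumption: a real version would use `day` to query a database or URL.
get_data <- function(day) {
  if (requireNamespace("palmerpenguins", quietly = TRUE)) {
    palmerpenguins::penguins
  } else {
    # Fallback so the sketch runs without palmerpenguins installed
    datasets::iris
  }
}

day_data <- get_data(1)
dim(day_data)
```

Every call returns the same data, so the cost measured in the benchmarks comes from combining the pieces rather than from retrieving them.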
A common approach to solve this problem is to do this in a for loop, and use rbind() to combine the results.
f0 <- function(days) {
  results <- data.frame()
  for (d in days) {
    # Get data
    day_data <- get_data(d)
    # Process data
    # ...
    # Store data
    results <- rbind(results, day_data)
  }
  results
}
However, if you use this code and increase the number of days, you will observe that it takes increasingly long to run.
library(bench)
bench_time(f0(1:10))
process real
15.5ms 15.6ms
bench_time(f0(1:100))
process real
559ms 559ms
bench_time(f0(1:200))
process real
2.25s 2.25s
This is true even though the data size of the final result is still relatively small.
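In fact the running time is roughly quadratic in the number of iterations: the total number of rows copied is \(\approx num\_iterations * final\_data\_size / 2\), and the final data size itself grows with the number of iterations.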
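To see where this estimate comes from, here is a short derivation. It assumes every day returns the same number of rows; the symbols \(n\) (number of iterations) and \(r\) (rows per day) are introduced here for illustration. On iteration \(k\), rbind() has to copy the \((k-1)\,r\) rows accumulated so far, so the total number of previously stored rows copied is

\[
\sum_{k=1}^{n} (k-1)\,r \;=\; \frac{n(n-1)}{2}\,r \;\approx\; \frac{n \cdot (n\,r)}{2} \;=\; \frac{num\_iterations \cdot final\_data\_size}{2}.
\]

Doubling the number of days therefore roughly quadruples the work, which is consistent with the jump from 559ms at 100 days to 2.25s at 200 days in the timings above.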
bnch <- bench::press(
  num_days = c(1, 2, 4, 8, 16, 32, 64, 128),
  {
    days <- seq_len(num_days)
    bench::mark(f0(days))
  }
)

library(ggplot2)
bnch |>
  ggplot() +
  aes(num_days, as.numeric(min)) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(
    title = "Performance of repeated rbind()",
    x = "Number of days",
    y = "Seconds"
  )
The reason for this poor performance is that rbind() is very dumb. Every time you call it, it simply allocates a new object, copies all the data from its arguments into it, and returns the result. This means that on each iteration of the loop you are copying all of the previous data again.
We can see this by adding some simple logging to our previous example.
say <- function(...) {
  rlang::inform(glue::glue(..., .envir = parent.frame()))
}

f0.5 <- function(days) {
  results <- data.frame()
  for (d in days) {
    # Get data
    day_data <- get_data(d)
    # Process data
    # ...
    # Store data
    results <- rbind(results, day_data)
    say("day: {d} | Rows copied: {row_count}", row_count = nrow(results) - nrow(day_data))
  }
  results
}

res <- f0.5(1:10)
Solution 1 - do.call(rbind)
The first way to address this issue is to change our strategy slightly. Instead of calling rbind() on each iteration of the loop, we put the results into a list as we go, and then combine them all at once with do.call(rbind, ...) outside the loop. Note: if you know the total number of iterations up front, it is fastest to pre-allocate the list to that size.
f1 <- function(days) {
  results <- vector("list", length(days))
  i <- 1
  for (d in days) {
    # Get data
    day_data <- get_data(d)
    # Process data
    # ...
    # Store data
    results[[i]] <- day_data
    i <- i + 1
  }
  do.call(rbind, results)
}
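When the number of iterations is not known up front, the same pattern still applies: grow the list rather than the data frame, and combine once at the end. The sketch below is a hypothetical illustration; collect_until_empty() and make_source() are made-up names, and the stopping rule (the source returning NULL) is an assumption.

```r
# Hypothetical data source: returns a small data frame per call,
# then NULL once n_chunks pieces have been handed out.
make_source <- function(n_chunks) {
  i <- 0
  function() {
    if (i >= n_chunks) return(NULL)
    i <<- i + 1
    data.frame(day = i, value = c(1, 2, 3))
  }
}

collect_until_empty <- function(next_chunk) {
  results <- list()
  repeat {
    chunk <- next_chunk()
    if (is.null(chunk)) break
    # Append to the list; the stored data frames are not re-combined here
    results[[length(results) + 1]] <- chunk
  }
  # Combine once, outside the loop
  do.call(rbind, results)
}

combined <- collect_until_empty(make_source(5))
nrow(combined)  # 15 rows: 5 chunks of 3 rows each
```

Appending to a list this way is generally much cheaper than calling rbind() on every iteration, because the rows themselves are only combined once.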
This rescues much of the performance, and the running time no longer grows quadratically.
bnch <- bench::press(
  num_days = c(1, 2, 4, 8, 16, 32, 64, 128),
  {
    days <- seq_len(num_days)
    bench::mark(
      "rbind" = f0(days),
      "do.call(rbind)" = f1(days)
    )
  }
)

bnch |>
  ggplot() +
  aes(num_days, as.numeric(min), group = as.character(expression), color = expression) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(
    title = "Performance of repeated rbind() and do.call(rbind)",
    x = "Number of days",
    y = "Seconds"
  )
Solution 2 and 3 - dplyr::bind_rows() and data.table::rbindlist()
Solution 1 is already a large improvement, but there are even more efficient ways of combining our list of data frames, provided by the CRAN packages dplyr and data.table: dplyr::bind_rows() and data.table::rbindlist(). These two functions are essentially drop-in replacements for the do.call(rbind) line in f1, though they return tibble and data.table objects rather than data.frames. Fortunately, you can convert both of these to a regular data.frame at negligible cost using as.data.frame().
f2 <- function(days) {
  results <- vector("list", length(days))
  i <- 1
  for (d in days) {
    # Get data
    day_data <- get_data(d)
    # Process data
    # ...
    # Store data
    results[[i]] <- day_data
    i <- i + 1
  }
  as.data.frame(dplyr::bind_rows(results))
}

f3 <- function(days) {
  results <- vector("list", length(days))
  i <- 1
  for (d in days) {
    # Get data
    day_data <- get_data(d)
    # Process data
    # ...
    # Store data
    results[[i]] <- day_data
    i <- i + 1
  }
  as.data.frame(data.table::rbindlist(results))
}
Both of these functions work in very similar ways. Their main advantage over do.call(rbind) is that they are smarter about pre-allocating the correct amount of space for the final result, and they do the expensive part of the operation in C rather than in R code. This results in a moderate additional speed improvement.
bnch <- bench::press(
  num_days = c(1, 2, 4, 8, 16, 32, 64, 128),
  {
    days <- seq_len(num_days)
    bench::mark(
      "do.call(rbind)" = f1(days),
      "dplyr::bind_rows" = f2(days),
      "data.table::rbindlist" = f3(days)
    )
  }
)

bnch |>
  ggplot() +
  aes(num_days, as.numeric(min), group = as.character(expression), color = expression) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(
    title = "Performance of do.call(rbind), bind_rows() and rbindlist()",
    x = "Number of days",
    y = "Seconds"
  )
The performance of dplyr::bind_rows() and data.table::rbindlist() is largely the same. On this dataset rbindlist() tends to be marginally faster, but uses \(\approx 2\times\) as much memory.
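The pre-allocation idea mentioned above can be illustrated in plain R. This is only a conceptual sketch, not how bind_rows() or rbindlist() are actually implemented (they do this work in C and handle many more cases). bind_preallocated() is a hypothetical name, and the sketch assumes every piece has the same columns and that all columns are numeric.

```r
# Illustrative only: "work out the final size, allocate once, then fill".
# Assumptions: identical column names in every piece, numeric columns only.
bind_preallocated <- function(pieces) {
  n_rows <- vapply(pieces, nrow, integer(1))
  total <- sum(n_rows)
  col_names <- names(pieces[[1]])

  # Allocate each output column once, at its final length
  out <- lapply(col_names, function(nm) numeric(total))
  names(out) <- col_names

  # Work out which rows of the output each piece occupies
  end <- cumsum(n_rows)
  start <- end - n_rows + 1

  # Copy each piece into its slot in the pre-allocated columns
  for (i in seq_along(pieces)) {
    if (n_rows[i] == 0) next
    idx <- start[i]:end[i]
    for (nm in col_names) {
      out[[nm]][idx] <- pieces[[i]][[nm]]
    }
  }
  as.data.frame(out)
}

pieces <- list(
  data.frame(x = c(1, 2), y = c(10, 20)),
  data.frame(x = 3, y = 30)
)
combined <- bind_preallocated(pieces)
```

The point of the sketch is the order of operations: the final size is known before any data is moved, so the result never has to be rebuilt as more pieces arrive.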
For many cases, including this one, you can reduce the amount of bookkeeping code needed by using purrr::map_dfr(), which handles the allocation of the list and the iteration for you.
f4 <- function(days) {
  purrr::map_dfr(
    days,
    function(d) {
      # Get data
      day_data <- get_data(d)
      # Process data
      # ...
      day_data
    }
  )
}
The performance of map_dfr() is essentially equivalent to dplyr::bind_rows() as it uses bind_rows() under the hood.
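Note that map_dfr() is marked as superseded as of purrr 1.0.0, in favor of purrr::map() followed by purrr::list_rbind(). The same shape can also be written with no extra packages. The sketch below uses base R's lapply() with do.call(rbind, ...); get_day() is a hypothetical stand-in for get_data() so that the example is self-contained.

```r
# Hypothetical stand-in for get_data(): a tiny data frame per day
get_day <- function(d) {
  data.frame(day = d, value = c(0.5, 1.5))
}

f_lapply <- function(days) {
  # lapply() allocates the list and handles the iteration
  pieces <- lapply(days, function(d) {
    day_data <- get_day(d)
    # Process data
    # ...
    day_data
  })
  # Combine once, outside the loop
  do.call(rbind, pieces)
}

res <- f_lapply(1:3)

# With purrr >= 1.0.0 the combining step could instead be written as
# purrr::list_rbind(purrr::map(days, ...))
```

This keeps the bookkeeping out of your own code in the same way map_dfr() does, while the final combining step can be swapped for any of the functions benchmarked above.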
Conclusion
If you have code that uses rbind() to repeatedly combine data frames, it can be a significant performance bottleneck.
You should convert it to store the results in a list and then combine them with do.call(rbind, ...), dplyr::bind_rows() or data.table::rbindlist(). Alternatively, tweaking your code so it can be used with purrr::map_dfr() is another good solution.