Parallel Processing in R

Setup

Install necessary packages

In this lesson, we’re going to use the tidyverse and nycflights13 packages. They’ll both provide us with data, and we’ll use some tidyverse functions in both our setup and in the task we’ll parallelize.

If you don’t have them installed yet, run the following:

install.packages(c("tidyverse", "nycflights13"))

Load packages

Once the packages are installed on your system, you can load them (as well as the base parallel package) into our working environment.

library(tidyverse)
library(parallel)
library(nycflights13)

Make data explicit in workspace

We’re going to use the diamonds and flights data sets today. They come from the packages we loaded in, so they’re technically already in our environment. If you want to see them in the Environment pane in RStudio, though, you can use the data() function to make them visible.

We’re using these because they’re reasonably large without being overwhelming, they come from commonly downloaded R packages, and they have categorical variables that we can use to quickly create a large number of smaller data sets for practicing parallelization.

data(diamonds)
data(flights)

Practicing with diamonds

We’re going to start off with our diamonds data, which has about 54,000 observations.

head(diamonds)
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

Splitting our data

We’ll be practicing “embarrassingly parallel” tasks today, which encompass any task that can be easily split into chunks to be handled more-or-less independently from one another. Our task will be “plot all of these data sets.” We’re keeping it simple because we want to focus on learning parallelization as a concept rather than having learners trip up on the code.

To prepare, we’re going to turn diamonds into a list of data sets by splitting it along the clarity, cut, and color variables.

diamond_list <- diamonds |>
  group_split(clarity, cut, color)

That gives us 276 different data frames to plot. In a real-world situation, it’s easy enough to imagine receiving a few hundred files that need to be put through the same processing or analysis steps.

Let’s create a simple scatter plot of our first set using ggplot2, just to make sure our “analysis” is working.

ggplot(diamond_list[[1]], aes(x = carat, y = price)) +
  geom_point()

Activity 1

Split the flights data frame into a list of data frames based on the destination variable dest. Try creating a scatter plot of the first item to show the relationship between departure delay (dep_delay) and arrival delay (arr_delay).

Show solution
flights_list <- flights |>
  group_split(dest)

ggplot(flights_list[[1]], aes(x = dep_delay, y = arr_delay)) + 
  geom_point()

Plotting each item (serially)

Now we know the plot code works. The next step is to write some code that will plot each of our subsets, and save them to our disk.

I’m going to save all them as “image.png,” which will overwrite with each iteration. In the real world, we’d write the function in such a way that the file name would be meaningful and distinct.

lapply(diamond_list, function(i) {
  ggplot(i, aes(x = carat, y = price)) +
    geom_point()
  
  ggsave("image.png", width = 7, height = 5)
})

Timing your code

That took a little bit! Let’s see exactly how long, though. We’re going to wrap the lapply() in system.time() to get a measurement.

system.time(
  lapply(diamond_list, function(i) {
    ggplot(i, aes(x = carat, y = price)) +
      geom_point()
    
    ggsave("image.png", width = 7, height = 7)
  })
)
   user  system elapsed 
 23.807   0.301  24.133 

Parallelization

That took a decent amount of time again, but now we know exactly how long! Let’s try to speed that up by parallelizing the code. The good news is that this is actually not a difficult operation, and we can re-use most of the code we’ve got already.

Detecting cores

First, let’s see how many cores we have available to us in our current environment. This will help us figure out the kinds of resources we can (or want to) dedicate to our task.

detectCores()
[1] 15

Windows

The process can differ a bit at this point depending on your operating system. The first approach we’ll go through wil work on any operating system, but Windows users must use this. MacOS and Linux users have a second option that’s a bit simpler.

First, we need to decide how many cores we’re going to utilize, and set them up into a cluster. Then, we need to send instructions to each core to load the packages we want and provide them access to the data they’ll use.

cl <- makePSOCKcluster(8)
clusterEvalQ(cl, library(tidyverse))
clusterExport(cl, "diamond_list")

We’re going to copy and paste the lapply() we used before, but now change it to parLapply(). We also need to give it a new initial agrument, which is the name of our cluster (in this case cl).

parLapply(cl, diamond_list, function(i) {
  ggplot(i, aes(x = carat, y = price)) +
    geom_point() 
  
  ggsave("image.png", width = 7, height = 7)
})

Activity 2

Time your parallelized code using system.time(). How much faster does it run than the serial version?

Show solution
system.time(
  parLapply(cl, diamond_list, function(i) {
    ggplot(i, aes(x = carat, y = price)) +
      geom_point()
    
    ggsave("image.png", width = 7, height = 7)
  })
)
   user  system elapsed 
  0.003   0.007   3.659 

Stop the cluster

The final thing we’ll do in this version is to stop the cluster.

stopCluster(cl)

MacOS and Linux

For MacOS and Linux, the code is a bit simpler. We don’t need to worry about creating the cluster or sending commands or data to the cores. Instead, we can just use the mclapply() function instead of lapply(). We’re going to give it an additional argument, mc.cores, and R will handle the rest.

system.time(
  mclapply(diamond_list, mc.cores = 8, function(i) {
    ggplot(i, aes(x = carat, y = price)) +
      geom_point() 
    
    ggsave("image.png", width = 7, height = 7)
  })
)
   user  system elapsed 
 21.305   0.429   3.718 

Activity 3

Try running serial and parallel versions of the plotting and image-saving code we used for the flights data, but on the entire flights_list. How much time does parallelizing save?

Show solution
system.time(
  lapply(flights_list, function(i) {
    ggplot(i, aes(x = dep_delay, y = arr_delay)) + 
      geom_point()    
    
    ggsave("./image.png", width = 7, height = 7)
  })
)
   user  system elapsed 
 11.879   0.166  12.060 
Show solution
system.time(
  mclapply(flights_list, mc.cores = 8, function(i) {
    ggplot(i, aes(x = dep_delay, y = arr_delay)) + 
      geom_point()
    
    ggsave("./image.png", width = 7, height = 7)
  })
)
   user  system elapsed 
 10.106   0.367   1.826