Parallel in R and Cloud Computing

Introduction

Parallel computing has become very important in data science. The amount of data we need to digest is tremendously large, and parallel computing lets us process it faster by utilising every computing resource we have in our arsenal.

These days, personal computers come with multiple cores. As this article will show, the more cores (workers) we use for a task that can be split up, the sooner we generally get the result. This saves data scientists time and makes them more productive (more models run and tested).

This article briefly covers basic computer architecture (CPUs and cores), parallelising in R, and how to set up RStudio in the cloud.

Computer Architecture

In this article, we will only talk about CPUs and cores. CPU stands for central processing unit; it is the chip on the motherboard where the computation happens. Cores are the processing units inside the CPU. To visualise, please see the image below.

CPU and Core Units

Each core can work on one computational task at a time, so computing multiple tasks at once on several cores saves time and increases productivity. Most modern laptops come with at least four cores.

This is what we call parallelising: utilising multiple cores to work through a set of computational tasks.

Parallelising in R

By default, R runs serially: it uses only one core / thread. For example, the function sum() processes the whole dataset on a single core. If we could instead use four cores, each summing a quarter of the dataset, and add the four subtotals at the end, we could get the result much faster.
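
The split-and-add idea can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the data is a plain numeric vector: the four chunks are still summed one after another here, and the names x, chunks and subtotals are made up for the example. splitIndices comes from the parallel package that ships with R.

```r
# Illustrative data: the numbers 1 to 1,000,000 stored as doubles
x <- as.numeric(1:1000000)

# Split the positions of x into four roughly equal chunks
chunks <- parallel::splitIndices(length(x), 4)

# Sum each chunk separately; in a parallel run each chunk would go to its own core
subtotals <- vapply(chunks, function(idx) sum(x[idx]), numeric(1))

# Adding the four subtotals gives the same answer as summing everything at once
total <- sum(subtotals)
total == sum(x)
```

The point of the sketch is that the four partial sums do not depend on each other, which is what makes the work safe to hand out to separate cores.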

There are three packages you have to know to do parallel computing in R.

parallel
doParallel
foreach

Let’s load these packages into our environment.

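# Install any package that is missing, then attach it.
# Note: parallel ships with R itself, so it never needs installing.
for (package in c("parallel", "doParallel", "foreach")) {
  # requireNamespace() checks whether the package is installed without attaching it
  if (!requireNamespace(package, quietly = TRUE)) {
    install.packages(package)
  }
  # .packages() lists the packages that are currently attached
  if (!package %in% .packages()) {
    library(package, character.only = TRUE)
  }
}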

## Loading required package: foreach

## Loading required package: iterators

These packages do not export many functions compared to the tidyverse. Let’s list all the functions in each package.

ls("package:parallel")

##  [1] "clusterApply"        "clusterApplyLB"      "clusterCall"        
##  [4] "clusterEvalQ"        "clusterExport"       "clusterMap"         
##  [7] "clusterSetRNGStream" "clusterSplit"        "detectCores"        
## [10] "getDefaultCluster"   "makeCluster"         "makeForkCluster"    
## [13] "makePSOCKcluster"    "mc.reset.stream"     "mcaffinity"         
## [16] "mccollect"           "mclapply"            "mcMap"              
## [19] "mcmapply"            "mcparallel"          "nextRNGStream"      
## [22] "nextRNGSubStream"    "parApply"            "parCapply"          
## [25] "parLapply"           "parLapplyLB"         "parRapply"          
## [28] "parSapply"           "parSapplyLB"         "pvec"               
## [31] "setDefaultCluster"   "splitIndices"        "stopCluster"

ls("package:doParallel")

## [1] "registerDoParallel"  "stopImplicitCluster"

ls("package:foreach")

##  [1] "%:%"                "%do%"               "%dopar%"           
##  [4] "accumulate"         "foreach"            "getDoParName"      
##  [7] "getDoParRegistered" "getDoParVersion"    "getDoParWorkers"   
## [10] "getDoSeqName"       "getDoSeqRegistered" "getDoSeqVersion"   
## [13] "getDoSeqWorkers"    "getErrorIndex"      "getErrorValue"     
## [16] "getexports"         "getResult"          "makeAccum"         
## [19] "registerDoSEQ"      "setDoPar"           "setDoSeq"          
## [22] "times"              "when"

How do I know how many cores I have?

To check how many cores you have, use the detectCores function from the parallel package.

detectCores()

## [1] 4

The 4 is the number of cores R detects on this machine (by default, detectCores counts logical cores, which can be more than the physical ones). The more cores you have, the more processing power you can utilise.
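
As a small sketch of how the core count might be turned into a worker count, assuming you want to keep one core free for the operating system: the variable names below are hypothetical, and the logical = FALSE option is only honoured on some platforms, so the two counts may come out the same on yours.

```r
library(parallel)

logical_cores  <- detectCores()                 # default: logical cores
physical_cores <- detectCores(logical = FALSE)  # physical cores, where the platform supports it

# detectCores() can return NA if the count cannot be determined,
# so fall back to a single worker in that case
n_workers <- if (is.na(logical_cores)) 1 else max(1, logical_cores - 1)
n_workers
```

Taking max(1, ...) keeps the result usable on a single-core machine, where subtracting one would otherwise leave zero workers.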

lapply and mclapply

Let’s look at an example of a single-core calculation and a multicore calculation (parallelism) using lapply and mclapply.

In this example we will create a function to convert Fahrenheit to Celsius and apply it to a large set of readings: 10 million elements.

x <- c(10000000:1)
fahrenheit_to_celcius <- function(F_temp) {
  C_temp <- ((F_temp - 32) * (5 / 9))
  return(C_temp)
}

Now, let’s run lapply. We will use the system.time function to measure the elapsed time (in seconds) that each calculation takes.

system.time(lapply(x,fahrenheit_to_celcius))

##    user  system elapsed 
##  13.420   0.433  16.085
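
If you want to keep the timing rather than just print it, the value returned by system.time can be stored and indexed by name. The sketch below assumes a much smaller input (100,000 readings) so that it runs quickly, and the names x_small, f_to_c, timing and elapsed_seconds are made up for the illustration.

```r
# Smaller illustrative input so the sketch finishes quickly
x_small <- 1:100000

# Same conversion as in the article, written as a one-liner
f_to_c <- function(F_temp) (F_temp - 32) * (5 / 9)

# system.time() returns a named numeric vector of timings in seconds
timing <- system.time(lapply(x_small, f_to_c))

# "elapsed" is the wall-clock time, the number that matters when comparing runs
elapsed_seconds <- timing[["elapsed"]]
elapsed_seconds
```

User and system are CPU times, while elapsed is what you would measure with a stopwatch, so elapsed is the fairest figure for comparing a serial run with a parallel one.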

Next, let’s do it in parallel. Note: the mc.cores argument sets how many cores to use for the calculation. It is good practice to leave one core for general operating system tasks and dedicate the rest to the calculation. Also note that mclapply relies on forking, so mc.cores greater than 1 is not supported on Windows.

system.time(mclapply(x,fahrenheit_to_celcius,mc.cores= 3))

##    user  system elapsed 
##  17.577   4.275   9.565

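Compare the elapsed times: mclapply with three cores finished in 9.565 seconds against 16.085 seconds for lapply, roughly 1.7 times faster. The user and system times went up, partly because splitting the work and collecting the results adds overhead, which is why three cores do not give a threefold speed-up.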
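
For machines where forking is not available, the parallel package also offers cluster-based functions such as parLapply, which appears in the function list earlier. The following is a minimal sketch rather than a benchmark: it assumes two worker processes and a smaller input of 100,000 readings, and the names x_small, cl and result are hypothetical. The conversion function is repeated so the block runs on its own.

```r
library(parallel)

# Same conversion function as in the article
fahrenheit_to_celcius <- function(F_temp) {
  C_temp <- ((F_temp - 32) * (5 / 9))
  return(C_temp)
}

x_small <- 1:100000        # smaller input so the sketch runs quickly

cl <- makeCluster(2)       # start two worker processes; adjust to your machine
result <- parLapply(cl, x_small, fahrenheit_to_celcius)
stopCluster(cl)            # always release the workers when done

# The parallel result should match the serial one
all.equal(result, lapply(x_small, fahrenheit_to_celcius))
```

Starting and stopping the workers has a cost of its own, so for a calculation this cheap the cluster version may well be slower than plain lapply; the pattern pays off when each element takes real work.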

How do I know if my machine is really using the number of cores I asked for?

Good question!

On a UNIX-based machine such as a MacBook, you can run htop in the terminal and you will see something similar to this.

htop command

The numbers 1 to 8 each label one core in your system. Here there is not much processing happening. You can see the usage percentage at the end of each row.

When you use mclapply or any other parallel function, the bar next to each number should be full or close to full, and the percentages will be close to 100%.

Parallel cores

If you do not have the htop command in your terminal, you can install it first with the command below (this uses Homebrew on macOS).

$ brew install htop

Once it is downloaded and installed, run htop.

$ htop

If you are using a Windows-based laptop, there is a similar tool called NTop. For more information, see the NTop project on GitHub.

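What if I just want to run every computation in parallel without having to know each parallel function?

Often, we want to run regression modelling without having to wrap it in another function to pass to mclapply, for example.

You can run the commands below to register a parallel backend. Code that is written with foreach and %dopar%, including modelling packages built on top of foreach, will then run on the registered workers; code that does not use foreach still runs serially.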

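library(doParallel)
library(foreach)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)               # make the workers available to %dopar%
# When you are finished with the workers: stopCluster(cluster)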
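
To show what running on the registered workers can look like, here is a minimal sketch using foreach with %dopar%. It is an illustration under assumptions, not the article's own workflow: it uses the built-in mtcars dataset, fits a simple lm on four bootstrap resamples, and uses two workers; the names cluster and coefs are chosen for the example.

```r
library(doParallel)
library(foreach)

cluster <- makeCluster(2)      # two workers; adjust to your machine
registerDoParallel(cluster)

# Each iteration fits the same regression on a different resample of mtcars.
# The iterations are independent, so %dopar% can hand them to different workers.
coefs <- foreach(i = 1:4, .combine = rbind) %dopar% {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))
}

stopCluster(cluster)           # release the workers

coefs                          # one row of coefficients per resample
```

Swapping %dopar% for %do% runs the same loop serially, which is a handy way to check that a loop is correct before parallelising it.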

BONUS! R in the Cloud

If you have limited cores in your computer and want to run a complex regression model on a large dataset, you can use the free trial credit offered by most cloud computing providers to set up RStudio on a machine with multiple cores. In my experience, a PCR model that took at least 5 hours on a single core finished in 1.5 hours on seven cores.

Below is a step-by-step guide to creating an R instance on Google Cloud Platform.

In your browser, navigate to https://cloud.google.com
On the top right, click Sign in.
Sign in with your Gmail account.
You will be directed to the main page. On the top right, click Console.
In the pop-up, click the check box under Terms and Conditions and click “Agree and Continue” on the bottom right.
There is a banner on top of your browser screen. Click the Activate button to activate the free trial.
On the new page, click the checkbox under Terms of Service and click the Continue button.
On the next page, choose Individual as your account type. Fill in the rest of the required details and click the “Start My Free Trial” button. Don’t worry, it won’t charge you unless you manually upgrade to a paid account.
Once that is done, you will be directed back to the console page. There will be a pop-up. Click “Got it” once you have finished reading.
On the left panel, navigate to “Compute Engine” and click “VM instances”.
On the VM instance page, when available, click the Create button.
On the VM creation page, fill in the details as required. You can change the location (Region option) of your instance, to Sydney for example. Don’t forget to change Generation to “Second”.
Next, change the machine type as desired; you can go up to 80 CPUs / cores. Some regions offer more CPU options than others. Scroll down to see all the options.
Scroll down to Boot disk. Choose the latest Ubuntu like below. Note that the RStudio Server package downloaded later in this guide is the build for Ubuntu 18.04 (bionic), so pick a matching release if in doubt.
Under the Firewall section, tick “Allow HTTP traffic”.
Scroll down and click the “Create” button.
The virtual machine will take some time to be ready. Once it is, you will see a green tick next to it.
Now, let’s increase the security of our instance. Click the three stripes (menu) on the top left, navigate to VPC network, then click Firewall rules.
Click the “Create a Firewall rule” button.
On the firewall rule creation page, fill in the Name, Description, Target tags, Source IP ranges, and Protocols and ports as below (the target tag should match the “rstudio” network tag added to the instance in a later step, and the port should be the one RStudio is accessed on, 8787 in this guide). Leave everything else as default. Click Create when you are done.
Confirm that the firewall rule you just created is listed.
Go back to your VM instances.
Click the three dots next to your instance, and click Stop. If there is a pop-up, click the Stop button there as well.
Click your instance, then click the Edit button at the top of the browser screen.
Scroll down to Network tags, type “rstudio”, then press Enter. Scroll down and hit Save.
Now, let’s install RStudio.
Back on your instance (start it again if it is still stopped), click the arrow next to the SSH button and click the first option.
In the new window, you will see a terminal / command prompt-like interface. Run the commands below one by one. You can copy and paste them.

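# Refresh the package index and upgrade the installed packages
sudo apt-get update
sudo apt-get upgrade
# RStudio Server needs R itself; if R is not already on the image,
# install it first, for example with: sudo apt-get install r-base
# Install gdebi, then download and install the RStudio Server package
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.2.1335-amd64.deb
sudo gdebi rstudio-server-1.2.1335-amd64.deb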

Back on your instance page, locate the External IP address; you will use it to access RStudio.
Open a new tab and type the IP address followed by port 8787. In this case it would be http://35.244.103.111:8787.
You will be presented with your RStudio interface.
To check out htop, open the SSH terminal again as in the two steps just before the install commands, and type htop instead of the commands above.
Enjoy!