R on Raijin

This document gives a brief overview of how to use R (and RStudio) on Raijin, the National Computational Infrastructure (NCI) cluster. The full list of supported applications can be found here.

Getting an account on Raijin

Raijin, named after the Shinto God of thunder, lightning and storms, is a Fujitsu Primergy high-performance, distributed-memory cluster.

The system, which was installed in late 2012 and entered production use in June 2013, comprises:

57,472 cores (Intel Xeon Sandy Bridge technology, 2.6 GHz) in 3592 compute nodes;
160 TBytes (approx.) of main memory;
Infiniband FDR interconnect; and
10 PBytes (approx.) of usable fast filesystem (for short-term scratch space).

The memory specification across the nodes is heterogeneous, in order to accommodate the requirements of most applications while also providing for large-memory jobs. Accordingly:

Two-thirds of the nodes have 32 GBytes, i.e., 2 GBytes/core;
Almost one-third of the nodes have 64 GBytes, i.e., 4 GBytes/core; while
Two per cent of the nodes have 128 GBytes, i.e., 8 GBytes/core.
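The per-core figures above follow from the core and node counts. A quick check in R, using only the numbers quoted in this section (the variable names are purely illustrative):

```r
# Figures quoted above
total_cores <- 57472
nodes <- 3592

# Cores per node implied by those two numbers
cores_per_node <- total_cores / nodes

# Memory per node (GBytes) for the three node types
mem_per_node <- c(32, 64, 128)

# Memory per core (GBytes) for each node type
mem_per_core <- mem_per_node / cores_per_node

print(cores_per_node)
print(mem_per_core)
```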

Follow the instructions here to get an account.

Logging in to Raijin

ssh -Y <username>@raijin.nci.org.au

For example, for me it is:

ssh -Y ggt251@raijin.nci.org.au

The -Y is important because it enables X11 forwarding, which is required for RStudio to work over an ssh connection. You’ll need an X server installed on your own machine: XQuartz on a Mac or Xming on Windows.

Once you’re in, the prompt will look something like this:

[ggt251@raijin5 ~]$

where ggt251 is your username. The number after raijin tells you which of the 6 possible login (head) nodes you’re on.

To run R you first need to load the R module:

module load R/3.1.0

then you can run R at the command line by typing R. Using R is moderately well documented.

RStudio is not centrally supported by NCI, but it is installed locally for MSI users (users who belong to the fh0 group). To access it, use the following command (after loading the R module as above):

/short/fh0/rstudio-0.98.1028/bin/rstudio

The head node should only be used for prototyping and code testing, because you’re sharing the resources of the node with everyone else. To use Raijin properly you need to burrow further in, to one of the compute nodes.

Using RStudio in interactive batch job mode

To access a compute node interactively (i.e. in the same sort of way that you access a head node) use the following command:

qsub -I -l walltime=00:10:00,mem=500Mb,ncpus=2,wd -P fh0 -q express -X

The argument walltime=hh:mm:ss specifies how long the interactive session will last. This is used to schedule your “batch job” - even though you’re using the node interactively it is still allocated like a batch job. If you terminate the session early MSI will only be charged for the amount of time actually used, but it’s still best to request a smallish value here because the job will be scheduled faster.

The argument mem=500Mb says that 500Mb of RAM will be allocated and ncpus=2 allocates 2 cores - if your code is designed to run on more cores or needs more RAM to run, you need to ask for it here.

The -P fh0 argument is required - it specifies that the time charges will be allocated to the MSI group. You probably don’t have the required permissions to charge to any other groups.

You can ask to be put in the priority queue using -q express. This is recommended when running an interactive batch job because otherwise (using -q normal) you might have to wait a while before your request is scheduled. The downside to using -q express is that the amount charged to MSI is 3 times the actual walltime. The express queue should really only be used for testing, debugging etc. It also has smaller limits - the max for ncpus is 128 and max memory per core is 32GB.
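To get a feel for what the express queue costs, here is a rough sketch of the arithmetic in R. The helper estimate_charge is hypothetical, not an NCI tool. The factor of 3 for express comes from the text above; the factor of 1 for normal and the scaling by ncpus are assumptions made for illustration, so check the NCI documentation for the actual charging rules.

```r
# Hypothetical helper: rough charge estimate in CPU-hours.
# Assumptions (not stated in the text): the charge scales with ncpus,
# and the normal queue is charged at 1 times the walltime.
estimate_charge <- function(walltime_hours, ncpus,
                            queue = c("normal", "express")) {
  queue <- match.arg(queue)
  # 3x multiplier for express, as described above
  multiplier <- if (queue == "express") 3 else 1
  walltime_hours * ncpus * multiplier
}

# The 10-minute, 2-core interactive session requested earlier
estimate_charge(10 / 60, ncpus = 2, queue = "express")
```

Under these assumptions the same session in the normal queue would be charged one third as much, which is the trade-off against the shorter wait.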

Once you’ve accessed an interactive batch job your prompt will look something like this:

[ggt251@r3151 ~]$

In this case, I was allocated to the 3151 node (it will be a number between 1 and 3592). The resources you asked for are yours and yours alone for the time you asked for. Once that time has elapsed (or you type exit) you will be logged out and the resources will go back to the scheduler to redistribute.

You can now run R or RStudio just as you would have on the head node (or your own computer).

Batch processing R scripts

When running a simulation or a program that you expect to run for quite some time it makes more sense to run the script in batch mode.

Inputs

To do this you need an R script and a batch script (a special plain text file).

The batch script can be created using RStudio on Raijin when you’re at the login node level (or on your personal computer) by selecting File > New File > Text File. The structure of the batch script needs to look like this:

#!/bin/csh
#PBS -l wd
#PBS -q normal
#PBS -l walltime=00:00:30
#PBS -l mem=50MB
#PBS -l ncpus=1
#PBS -P fh0
module load R/3.1.0
R CMD BATCH batch.R

Note that these are the same arguments as used in the interactive batch job mode. The tricky one is walltime. If it is too long, you’re wasting resources (which are limited: MSI has a finite walltime allocation). Worse, if it is too short, the job is terminated and any unsaved results are lost. Identifying an appropriate walltime takes practice and experience.
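One way to get a first guess at walltime is to time a small pilot run and scale up. The sketch below assumes the run time grows linearly with the number of iterations, which may not hold for your code. The helper format_walltime, the pilot workload and the safety margin of 1.5 are all illustrative choices, not part of any NCI tooling.

```r
# Hypothetical helper: format seconds as hh:mm:ss for a walltime request
format_walltime <- function(seconds) {
  seconds <- ceiling(seconds)  # round up to whole seconds
  sprintf("%02d:%02d:%02d",
          as.integer(seconds %/% 3600),
          as.integer((seconds %% 3600) %/% 60),
          as.integer(seconds %% 60))
}

# Time a small pilot run (an illustrative workload)
pilot_n <- 100
elapsed <- system.time({
  for (i in seq_len(pilot_n)) median(rnorm(1000))
})[["elapsed"]]

# Scale linearly to the full run and add a safety margin (assumed: 1.5)
full_n <- 100000
safety <- 1.5
estimate <- elapsed / pilot_n * full_n * safety

format_walltime(estimate)
```

The resulting string can be pasted into the #PBS -l walltime= line. Timing the pilot on a compute node rather than your own machine will give a more representative number.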

The lines after the #PBS inputs get executed at the command line. So as before, the first thing to do is load the R module, then the next line effectively opens R and runs the file batch.R.

In this example the contents of batch.R are:

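### Test batch script
rm(list=ls())                   # start from an empty workspace
N=100                           # sample size for each simulated data set
n=1000                          # number of simulation replicates
m1 = m2 = numeric(n)            # storage for the sample means and medians
for(i in 1:n){
  x = rnorm(N)                  # draw N standard normal observations
  m1[i] = mean(x)
  m2[i] = median(x)
}
save.image(file="image.RData")  # save the whole workspace for later inspection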

Note the rm(list=ls()) at the start and the save.image function at the end. The first clears the workspace, and the second saves the workspace at the end so you can load it later and look at the results. Of course, you can be more targeted in what and how you save the results, but this is a start.
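As one possible way of being more targeted, the sketch below collects just the simulation results in a data frame and writes that single object with saveRDS. The file location (a temporary directory) and the object names are illustrative. In a real batch job you would write to a path in your working directory.

```r
set.seed(1)  # fix the random seed so the run is reproducible

N <- 100     # sample size for each simulated data set
n <- 1000    # number of simulation replicates
m1 <- m2 <- numeric(n)
for (i in seq_len(n)) {
  x <- rnorm(N)
  m1[i] <- mean(x)
  m2[i] <- median(x)
}

# Keep only what is needed: one row per replicate
results <- data.frame(mean = m1, median = m2)

# Illustrative output path; saveRDS stores a single object
out_file <- file.path(tempdir(), "results.rds")
saveRDS(results, out_file)

# Later (for example on the head node), read it back under any name you like
restored <- readRDS(out_file)
dim(restored)
```

Unlike load, readRDS returns the object rather than restoring it under its original name, so it cannot silently overwrite something already in your workspace.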

Once you’ve got both an R script and a batch script you run it using

qsub batchfile

where batchfile is the name of the plain text batch script file.

Outputs

After the run time has passed you’ll notice a few extra files in your working directory. One has the output (what code was run) and another has any errors. In this case they will be batchfile.o**** and batchfile.e**** where batchfile is the name of the batch script used and **** is a number.

You can quickly inspect these files using the cat command:

cat batchfile.o****
cat batchfile.e****

Successful jobs should have an empty error file. Things to look out for in the error file are

Command not found.
=>> PBS: job terminated: walltime 172818sec exceeded limit 172800sec
=>> PBS: job terminated: per node mem 2227620kb exceeded limit 2097152kb
Segmentation fault.
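If you have many jobs, checking error files by eye gets tedious. The sketch below scans an error file for the messages listed above. The helper check_error_file and its patterns are hypothetical, written to match only the example messages shown here. The real wording may vary.

```r
# Hypothetical helper: flag known failure messages in a PBS error file
check_error_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  patterns <- c(
    command_not_found = "Command not found",
    walltime_exceeded = "walltime .* exceeded limit",
    memory_exceeded   = "mem .* exceeded limit",
    segfault          = "Segmentation fault"
  )
  # TRUE for each pattern that appears on at least one line
  vapply(patterns, function(p) any(grepl(p, lines)), logical(1))
}

# Demonstration on a temporary file containing one of the messages above
demo_file <- tempfile()
writeLines(
  "=>> PBS: job terminated: walltime 172818sec exceeded limit 172800sec",
  demo_file
)
flags <- check_error_file(demo_file)
print(flags)
```

An empty error file gives FALSE for every pattern, which is what a successful job should produce.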

You’ll also have the RData file to which you saved your workspace. You can open this on the Raijin head node or download it to your computer (perhaps the easiest way to download it to your computer is to open RStudio and use its file browser panel to download it). You can load RData files into R using the load function:

load("~/PATH/image.RData")
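Because load restores objects under their original names, it can overwrite objects already in your workspace. A small sketch of one way around this is to load into a separate environment. The file created here is a stand-in for image.RData, and the names are illustrative.

```r
# Create a small example .RData file (a stand-in for image.RData)
rdata_file <- file.path(tempdir(), "image.RData")
m1 <- c(0.1, -0.2, 0.05)
save(m1, file = rdata_file)

# Load into a fresh environment instead of the global workspace
results_env <- new.env()
loaded_names <- load(rdata_file, envir = results_env)

loaded_names           # names of the objects that were restored
mean(results_env$m1)   # access them through the environment
```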

Parallel processing in R

Now that you’ve got access to lots of cores, the easiest way to use them (in a simulation) is to replace big loops using the foreach package. The only trick is that you need to register a “parallel backend” - doMC works well on Unix-like systems such as Raijin and Macs, but won’t work on Windows.

Specifying the .combine argument allows you to customise how the results are aggregated at the end of the loop.

require(doMC)

## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

require(foreach)
registerDoMC(cores=3) # this should equal ncpus
n=10
result = foreach(j = 1:n, .combine=rbind) %dopar% {
  # EXPERIMENT
  # last line is returned as a row in the result matrix
  rep(j,4)
}
result

##           [,1] [,2] [,3] [,4]
## result.1     1    1    1    1
## result.2     2    2    2    2
## result.3     3    3    3    3
## result.4     4    4    4    4
## result.5     5    5    5    5
## result.6     6    6    6    6
## result.7     7    7    7    7
## result.8     8    8    8    8
## result.9     9    9    9    9
## result.10   10   10   10   10
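Hard-coding the number of cores makes it easy for the script and the batch file to drift apart. The sketch below reads the core count from the environment instead. It assumes PBS exports the request as the variable PBS_NCPUS, which you should confirm on your system, and it falls back to a single core otherwise. Note that it uses mclapply from the parallel package that ships with R, rather than foreach and doMC, so it is an alternative to the loop above and not the same method.

```r
library(parallel)  # ships with R; mclapply forks, so it needs a Unix-like system

# Assumption: PBS exports the requested core count as PBS_NCPUS.
# Fall back to 1 core if the variable is missing or unusable.
ncores <- as.integer(Sys.getenv("PBS_NCPUS", unset = "1"))
if (is.na(ncores) || ncores < 1) ncores <- 1L

n <- 10
# Each task returns one row of the result
rows <- mclapply(seq_len(n), function(j) rep(j, 4), mc.cores = ncores)

# Stack the rows into a matrix (the role .combine=rbind plays in foreach)
result <- do.call(rbind, rows)
result
```

The same ncores value could be passed to registerDoMC(cores=ncores) if you prefer to keep the foreach loop.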