GridR on the RCE

The RCE is designed to work in two ways. The first, RCE Powered, lets a single process use a large amount of memory. The second, batch mode, runs the same code across a large number of less powerful machines. The GridR package provides a bridge between the two: it lets you run a large-memory R process as an RCE Powered application and, from within that R process, run operations in parallel using batch mode.

Using GridR via `ssh`

The simplest way to use GridR is to use Condor over SSH.

A branch that uses a local Condor installation was in development, but it is not compatible with the current architecture of the RCE.

To use GridR, you need to generate a pair of SSH keys and install a development version of the GridR package from GitHub.

Set up SSH Keys

SSH keys provide password-free access between machines on the IQSS network. To set them up, generate a key pair on the server and append the public key to your authorized_keys2 file. In particular, follow these steps:

- Open an RCE terminal
  - SSH to the RCE (rce.hmdc.harvard.edu)
  - Alternatively, go to Applications => System Tools => Terminal
- mkdir -p $HOME/.ssh
  - The -p option creates this directory only if it does not already exist
- chmod 0700 $HOME/.ssh
  - This makes the directory accessible only to you
- ssh-keygen -t dsa -f $HOME/.ssh/id_dsa -P ''
- cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys2

It is crucial that the file id_dsa is not shared, as there is no passphrase associated with it!
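Before moving on, it can help to confirm from within R that the key files ended up where SSH expects them. The following is a minimal sketch and not part of GridR: the helper name check_ssh_setup is hypothetical, and it only checks that the files exist and that the directory grants no permissions to group or other users. It does not test an actual SSH connection.

```r
# Hypothetical helper (not part of GridR): sanity-check the SSH key setup.
check_ssh_setup <- function(ssh_dir = file.path(Sys.getenv("HOME"), ".ssh")) {
  # file.info() reports the permission bits as an integer of class "octmode"
  mode <- as.integer(file.info(ssh_dir)$mode)
  c(
    dir_exists     = dir.exists(ssh_dir),
    # The low six permission bits belong to group and other;
    # a mode of 0700 leaves all of them unset.
    dir_private    = isTRUE(mode %% 64L == 0L),
    private_key    = file.exists(file.path(ssh_dir, "id_dsa")),
    public_key     = file.exists(file.path(ssh_dir, "id_dsa.pub")),
    authorized_key = file.exists(file.path(ssh_dir, "authorized_keys2"))
  )
}

# Every entry should be TRUE once the steps above have been completed
print(check_ssh_setup())
```

Any FALSE entry points at the step that needs to be repeated.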

Installation

The devtools package can be used to install packages from GitHub, including specific branches.

The grid.init line sets up the settings GridR needs in order to submit jobs to the Condor cluster. Note that sshRemoteIp will definitely change in the coming year, so if things suddenly stop working, this may be why. If this happens, e-mail support@help.hmdc.harvard.edu and hopefully Bill will tell you what to do! You will also need to replace astorer with your own user name.

library(devtools)
# Newer devtools releases expect "username/repo"; the separate username argument was deprecated
install_github("alexstorer/GridR", ref = "debug")

## Installing github repo GridR/debug from alexstorer
## Downloading GridR.zip from https://github.com/alexstorer/GridR/archive/debug.zip
## Installing package from /scratch/RtmpWvGk9f/GridR.zip
## arguments 'minimized' and 'invisible' are for Windows only
## Installing GridR
## '/usr/lib64/R/bin/R' --vanilla CMD INSTALL  \
##   '/scratch/RtmpWvGk9f/devtools636625b4d958/GridR-debug'  \
##   --library='/nfs/home/A/astorer/.R/library-x86_64' --install-tests

library(GridR)

## Loading required package: codetools

grid.init(sshRemoteIp = "rce6-4.priv.hmdc.harvard.edu", sshUsername = "astorer", 
    service = "condor.ssh", localTmpDir = ".tmpclient", sshRemoteDir = "$HOME/.tmp", 
    debug = TRUE)

## GridR is not using a config file
## starting GridR without Scheduler
## GridR variableSharing will be disabled, because sharedDir is not specified.

The GridR package is still under development, and the version on Github is designed to work with the RCE.
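Since the user name and the remote host are the settings most likely to differ between users or to change over time, one option is to keep the grid.init arguments in a single list and look the user name up at run time. This is only a sketch: it assumes your RCE login matches the user that R reports through Sys.info(), the variable name grid_settings is made up for illustration, and the host name is simply the one used in the call above.

```r
# Illustrative only: gather the grid.init arguments in one place.
grid_settings <- list(
  sshRemoteIp  = "rce6-4.priv.hmdc.harvard.edu",  # host from the call above; may change
  sshUsername  = Sys.info()[["user"]],            # assumes RCE login == current user
  service      = "condor.ssh",
  localTmpDir  = ".tmpclient",
  sshRemoteDir = "$HOME/.tmp",
  debug        = TRUE
)

# Only initialize when GridR is actually installed
if (requireNamespace("GridR", quietly = TRUE)) {
  library(GridR)
  do.call(grid.init, grid_settings)
}
```

If the remote host changes, only the sshRemoteIp entry needs to be updated.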

A Simple Example

The following example should work as written:

  a <- function(s, p, q) {
      return(s + p + q)
  }
  grid.apply("y", a, c(0, 1, 2), 1, c(100, 200, 300), wait = TRUE, check = TRUE, 
      batch = c(1, 3))

  ## [1] "Trying to SSH as follows:"
  ## [1] "ssh -f astorer@rce6-4.priv.hmdc.harvard.edu \"mkdir -p $HOME/.tmp/\""
  ## [1] "scp -B grid-cod6-2.clus.hmdc.harvard.edu-25446-2013-11-26-15-52-02-0-script.R grid-cod6-2.clus.hmdc.harvard.edu-25446-2013-11-26-15-52-02-0-fx astorer@rce6-4.priv.hmdc.harvard.edu:$HOME/.tmp/ 2>&1"
  ## Grid job finished, result written to variable y

The above code should run to completion as follows:

Grid job finished, result written to variable y 
> y
[[1]]
[1] 101

[[2]]
[1] 201

[[3]]
[1] 301

[[4]]
[1] 102

[[5]]
[1] 202

[[6]]
[1] 302

[[7]]
[1] 103

[[8]]
[1] 203

[[9]]
[1] 303

The function a simply adds the three numbers it's given. Let's look at the function call:

grid.apply("y", a, c(0, 1, 2), 1, c(100, 200, 300), wait = TRUE, check = TRUE, 
    batch = c(1, 3))

## [1] "Trying to SSH as follows:"
## [1] "ssh -f astorer@rce6-4.priv.hmdc.harvard.edu \"mkdir -p $HOME/.tmp/\""
## [1] "scp -B grid-cod6-2.clus.hmdc.harvard.edu-25446-2013-11-26-15-52-19-1-script.R grid-cod6-2.clus.hmdc.harvard.edu-25446-2013-11-26-15-52-19-1-fx astorer@rce6-4.priv.hmdc.harvard.edu:$HOME/.tmp/ 2>&1"
## Grid job finished, result written to variable y

The function a is called with three arguments. The option batch=c(1,3) tells GridR to treat the first and third arguments as collections of values to step through in sequence, rather than passing each vector whole. GridR then constructs one call for every combination of those values, 9 in total, with the following assignments:

  y[[1]] <- a(0, 1, 100)
  y[[2]] <- a(0, 1, 200)
  y[[3]] <- a(0, 1, 300)
  y[[4]] <- a(1, 1, 100)
  y[[5]] <- a(1, 1, 200)
  y[[6]] <- a(1, 1, 300)
  y[[7]] <- a(2, 1, 100)
  y[[8]] <- a(2, 1, 200)
  y[[9]] <- a(2, 1, 300)
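
To see this expansion without touching the cluster, the same nine calls can be reproduced locally in base R. This is a sketch of the behaviour described above, not of how GridR implements it; the names s_vals, q_vals, combos and y_local are made up for illustration.

```r
a <- function(s, p, q) {
    return(s + p + q)
}

s_vals <- c(0, 1, 2)        # first argument, listed in batch
q_vals <- c(100, 200, 300)  # third argument, listed in batch

# expand.grid() varies its first column fastest, so q is listed first
# to match the ordering shown above (q changes fastest, s slowest).
combos <- expand.grid(q = q_vals, s = s_vals)

# One call per row; the second argument is the fixed value 1
y_local <- Map(function(s, q) a(s, 1, q), combos$s, combos$q)

length(y_local)   # 9, i.e. length(s_vals) * length(q_vals)
unlist(y_local)   # 101 201 301 102 202 302 103 203 303
```

Comparing y_local with the y written by grid.apply is a quick way to check that the cluster run and the local run computed the same thing.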

Each of these calls is computed on the batch cluster, which is managed by a system called Condor. Condor puts these 9 jobs into its queue and processes them as quickly as the shared resources of the compute cluster allow.

Once you call grid.apply, these jobs will be created and submitted.

The wait=TRUE option will prevent R from returning until all of the grid computations are complete.

A Realistic Example

One common use case for embarrassingly parallel applications is simulating Markov chains. One popular tool for this is Stan, which we can call in parallel using GridR.