Background:

Many computationally demanding statistical procedures, such as bootstrapping and Markov chain Monte Carlo, can be sped up significantly by running them on several connected computers in parallel.

Snowfall is a top-level usability wrapper for snow that makes parallel programming easier and more comfortable. The package snow (an acronym for Simple Network Of Workstations) provides a high-level interface for using a workstation cluster for parallel computations in R. snow relies on the master/slave model of communication, in which one device or process (the master) controls one or more other devices or processes (the slaves).

Snow can use one of four communication mechanisms: sockets, PVM, MPI, or NetWorkSpaces (NWS). NWS support was provided by Steve Weston. PVM clusters use the rpvm package, MPI clusters use the Rmpi package, and NWS clusters use the nws package. If PVM is used, then PVM must be started first, either from a PVM console (e.g. the pvm text console or the graphical xpvm console, both available with PVM) or from R using functions provided by rpvm. Similarly, LAM-MPI must be started, e.g. using lamboot, for MPI clusters that use Rmpi and LAM-MPI. If NWS is used, the NetWorkSpaces server must be running. SOCK clusters are the easiest way to use snow on a single multi-core computer, as they require no additional software.
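As a minimal sketch of the socket option described above, the following requests a SOCK cluster on the local machine through snowfall. It assumes snowfall is installed and that sfInit accepts a type argument, as described in the package help. The variable names running_parallel and worker_count are only illustrative.

```r
library(snowfall)

# Request a socket (SOCK) cluster with two workers on the local machine.
sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# Query what snowfall actually started.
running_parallel <- sfParallel()  # TRUE if a cluster is running, FALSE after a fallback
worker_count <- sfCpus()          # number of CPUs snowfall is using

sfStop()
```

Checking sfParallel() after initialisation is a simple way to notice a silent fallback to sequential mode.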

If you only want to run programs on your own (multi-core) computer, without a cluster of several machines, you do not have to set up the cluster yourself: it is started implicitly during snowfall's initialisation. To use two or more machines for cluster calculations, you need to set up a LAM/MPI cluster and start the cluster explicitly.

Basic implementation:

Usage of snowfall always follows the same scheme: initialise with sfInit, run the calculation through snowfall's parallel functions (for example sfLapply), and shut down with sfStop.

If the initialisation fails, for example because the base libraries Rmpi and snow are missing, snowfall falls back to sequential mode with a warning message.

All functions can be used in the same way in sequential and parallel mode, and they return the same results in both.

# install snowfall if required (snow is installed as a dependency):

if(!require(snowfall)) install.packages("snowfall")

# load snowfall:

library(snowfall)

# example process:

process <- function(parallel = FALSE,cpus = NULL){
    sfInit(parallel = parallel, cpus = cpus)
    sfLapply( 1:10^6, log10 )
    sfStop()
}

# computational time in sequential mode:

system.time(process(parallel = FALSE))
##    user  system elapsed 
##   2.469   0.016   2.484
# computational time in parallel with 2 cpus:

system.time(process(parallel = TRUE, cpus = 2))
##    user  system elapsed 
##   0.958   0.028   2.638
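To illustrate the statement that both modes return the same results, here is a small sketch that runs one call sequentially and in parallel and compares the outputs. The helper run_squares and the toy workload are hypothetical and exist only for this illustration.

```r
library(snowfall)

# Hypothetical helper: runs the same sfSapply call in the requested mode.
run_squares <- function(parallel, cpus = NULL) {
  sfInit(parallel = parallel, cpus = cpus)
  on.exit(sfStop())  # always shut snowfall down, even on error
  sfSapply(1:10, function(x) x^2)
}

seq_result <- run_squares(parallel = FALSE)
par_result <- run_squares(parallel = TRUE, cpus = 2)

# Compare the two result vectors.
identical(seq_result, par_result)
```

Because the code is the same in both modes, a program can be developed and debugged sequentially and switched to parallel execution by changing only the sfInit arguments.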

Writing parallel programs with Snowfall:

Once you have identified parts of your program that can be parallelised (loops, etc.), making them run in parallel is in most cases a quick step.

First, rewrite them using R's list operators (lapply, apply) instead of loops, if they are not already written that way. Then write a wrapper function that is called by the list operator and handles a single parallel step. Note that the wrapper cannot rely on local variables of the calling code: only the list element being processed is passed to it as an argument.

If the wrapper needs more data than that single argument, you have to make the required variables global (assign them in the global environment) and export them to all slaves. Snowfall provides some functions to make this process easier (take a look at the package help).

sfInit( parallel=TRUE, cpus=3 )

b <- c( 3.4, 5.7, 10.8, 8, 7 )

# Export b in its current state to all slaves.
sfExport( "b" )

parWrapper <- function( datastep, add1, add2 ) {
    return(datastep * b[datastep] + add1 - add2)
}

# Calls parWrapper with each value of 1:5 and the additional arguments 2 and 3.

sfSapply( 1:5, parWrapper, 2, 3 )
## [1]  2.4 10.4 31.4 31.0 34.0
sfStop()
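The loop-to-list-operator rewrite described in this section can be sketched as follows. The data in values and the wrapper stepWrapper are made up for illustration. The loop is kept only to show that the parallel version computes the same thing.

```r
library(snowfall)

values <- c(2, 4, 6, 8)

# Original loop version.
loop_result <- numeric(length(values))
for (i in seq_along(values)) {
  loop_result[i] <- sqrt(values[i]) + i
}

# Wrapper for one step: only the index is passed in, the data comes
# from the global variable `values`, which is exported to the workers.
stepWrapper <- function(i) {
  sqrt(values[i]) + i
}

sfInit(parallel = TRUE, cpus = 2)
sfExport("values")
par_result <- sfSapply(seq_along(values), stepWrapper)
sfStop()

# Compare the loop result with the parallel result.
all.equal(loop_result, par_result)
```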

Intermediate result saving with sfClusterApplySR:

Another helpful function for long-running clusters is sfClusterApplySR, which saves intermediate results after every n processed indices (where n is the number of CPUs). If you are likely to have to interrupt your program (for example because of server maintenance), you can use sfClusterApplySR and later restart your program without losing the results produced up to the shutdown time. The result files are saved in the temporary folder ~/.sfCluster/RESTORE/x, where x is a string made up of a given name and the name of the input R file. sfClusterApplySR is called like lapply.

When sfClusterApplySR is used, results are always saved to the intermediate result file. However, if the cluster was stopped and results could be restored, the restore itself is only done if explicitly requested. This is meant to prevent false results when a program was interrupted intentionally and restarted with different internal parameters (in which case an automatic restore would probably insert results from the previous run). So handle with care if you want to restore!

sfInit( parallel=TRUE, cpus=2 )

# Saves under Name default
resultA <- sfClusterApplySR( somelist, somefunc )

# Must be another name.
resultB <- sfClusterApplySR( someotherlist, someotherfunc, name="CALC_TWO")

sfStop()

If your program contains only one call to sfClusterApplySR, the parameter name does not need to be changed. It only matters if you use more than one call to sfClusterApplySR.

If the cluster stops, say during the run of someotherfunc, and is restarted with the restore option, the complete result of resultA is loaded, so no calculation with somefunc is done. resultB is restored with all the data available at shutdown, and the calculation resumes with the first undefined result.

Note on restoring errors: If restoring the data fails (for example because the list size differs between the saved run and the current run), sfClusterApplySR stops. For safety reasons it does not delete the RESTORE files itself, but shows the user the complete path so that they can be deleted manually and explicitly.
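The text above says that a restore only happens when explicitly requested. As a hedged sketch, one way to request it is the restore argument of sfInit. This assumes the argument behaves as described in the package help, so check the help for your installed version. The input list and the name "SQUARES" are hypothetical.

```r
library(snowfall)

# restore = TRUE asks snowfall to reuse saved intermediate results
# from an earlier, interrupted run if a matching result file exists.
sfInit(parallel = TRUE, cpus = 2, restore = TRUE)

inputs <- as.list(1:8)

# The name keeps this call's result file separate from other calls.
squares <- sfClusterApplySR(inputs, function(x) x^2, name = "SQUARES")

sfStop()
```

If the internal parameters of the program have changed since the interrupted run, leave restore switched off so that stale results are not mixed into the new run.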

Source:

[1] http://cran.r-project.org/web/packages/snow/README

[2] http://www.sfu.ca/~sblay/R/snow.html

[3] http://cran.r-project.org/web/packages/snowfall/vignettes/snowfall.pdf