December 9, 2016

Need for speed

R is widely seen as being 'slow' (see the Julia benchmarks web page)

But if you use a few specific tools, this becomes largely irrelevant: several R packages provide the speed you need

An aside

Pure R, when the most efficient vectorized code is used, appears to run at about half the speed of the most efficient C++.

See Hadley Wickham's page on Rcpp (scroll down to "Vector input, vector output"), noting that if it took 10 minutes to write the C++ version, you would have to call it roughly 150,000 times before the speed gain paid for the programming time.

Need for speed

Spatial simulation means doing the same thing over and over and over … so we need speed

We will show how to profile your code at the end of this section.

Predictive Ecology blog posts about R speed

"Vectorization"

  • This is at the core of making R fast. Without vectorization, R is probably not useful as a simulation engine.
# Instead of growing a vector inside a loop:
a <- vector()
for (i in 1:1000) {
  a[i] <- rnorm(1)
}

# use the vectorized version; most base functions, rnorm() included,
# already operate on whole vectors
a <- rnorm(1000)

Vectors and Matrices

  • These are as fast as you can get in R
  • Fast numerical operations
  • Faster than data.frame (see the benchmark sketch below)
  • Anything that is in pure vectors or matrices is 'fast enough'
  • It is always a challenge to keep all code in vectors and matrices
  • Thus the following packages…
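
For example, a minimal benchmark sketch of the matrix vs. data.frame gap (timings are illustrative and will vary by machine):
library(microbenchmark)
m <- matrix(runif(1e6), ncol = 10)
df <- as.data.frame(m)
microbenchmark(
  matrix     = colSums(m),
  data.frame = colSums(df)  # coerces to a matrix internally, so slower
)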

Spatial simulation

  • Spatial simulation (i.e., across both time and space) requires more than just spatial data manipulation
  • Sometimes it is just base R stuff
  • Need to learn how to write functions (allows reusability; see the sketch after this list)
  • Need to learn a few key packages that are critical for speed
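
For example, a minimal sketch of wrapping reusable logic in a function (the helper rescale01 is made up for illustration):
# rescale a numeric vector to [0, 1]; reusable on any numeric input
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(rnorm(10))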

Key packages for spatial simulation

  • base package – anything kept as a matrix or vector is 'fast'
  • raster - for spatially referenced matrices

    • not always fast enough; sometimes we copy the data into a matrix, manipulate it there, then return it to the raster object (see the sketch after this list)
  • sp - equivalent of vector shapefiles in a GIS

    • Polygons, Points, Lines
    • Not always fast, but essential to have
  • see also sf
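
The raster-to-matrix pattern mentioned above, as a minimal sketch (assuming the values fit comfortably in memory):
library(raster)
r <- raster(nrows = 100, ncols = 100)
r[] <- runif(ncell(r))
v <- getValues(r)     # extract values as a plain, fast numeric vector
v <- v * 2 + 1        # any pure-vector manipulation
r <- setValues(r, v)  # return the result to the raster object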

Key packages for spatial simulation

  • data.table

    • For data.frame type data (i.e., columns of data)
    • Very fast once tables get large, but can actually be slower than a plain data.frame when tables are small (<100,000 rows)
  • SpaDES – many functions; will be moved into a separate package soon

  • Rcpp

    • R interface to C++. When you need something fast and can't get it fast enough with existing tools/packages, you can write your own (we will not go further into this here)

What we will do here

  • We will go through SpaDES functions quickly, because there are fewer tutorials online for these
  • We will show links to various tutorials for raster, sp, data.table, Rcpp
  • Each person should decide which tool is the most useful to them
  • Put something into practice

SpaDES functions

  • These are all potentially useful for building spatio-temporal models
?`spades-package` # section 2 shows many functions

# e.g.,
?spread
?move
?cir
?adj
?distanceFromEachPoint
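
For a quick taste, a minimal sketch of spread() simulating fire-like spread from five random start cells (argument names follow ?spread; the values, and output details across versions, are illustrative):
library(raster)
library(SpaDES)
r <- raster(extent(0, 100, 0, 100), res = 1)
r[] <- 1
fires <- spread(r, loci = sample(ncell(r), 5), spreadProb = 0.225)
plot(fires)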

Working with spatial data

raster

sp

sf

  • Relatively new
  • Implements the Simple Features standard, the current OGC standard for vector GIS data
  • Very fast, especially reading/writing large data
  • CRAN
  • GitHub
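
A minimal sketch of reading and plotting vector data with sf, using the sample shapefile that ships with the package:
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
plot(st_geometry(nc))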

The data.table package

From every data.table user ever:

WOW that's fast!


install.packages('data.table')

(at least for large tables!)
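
A minimal sketch of the keyed data.table style that pays off on large tables (the column names pixelID, veg, and age are made up for illustration):
library(data.table)
n <- 1e6
dt <- data.table(pixelID = seq_len(n),
                 veg = sample(c("pine", "spruce", "aspen"), n, replace = TRUE),
                 age = sample(0:200, n, replace = TRUE))
setkey(dt, veg)  # keying speeds up joins and grouped subsetting
dt[, .(meanAge = mean(age)), by = veg]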

data.table tutorials

raster and data.table together

  • The current implementation of LANDIS-SpaDES uses a "reduced" data structure throughout

  • Instead of keeping rasters of everything (there is often redundancy, i.e., two adjacent pixels may be identical)

  • We make one raster of "id" and one data.table with a column called "id"

  • Then we can have as many columns as we want of information about each of these places

  • Like "polygons", but for rasters, and dynamic… can change over time

  • This may be useful for your own module (a sketch follows)
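
A minimal sketch of building this reduced structure by hand (the five "places" and the attribute columns are made up for illustration):
library(raster)
library(data.table)
idRas <- raster(nrows = 100, ncols = 100)
idRas[] <- sample(1:5, ncell(idRas), replace = TRUE)  # raster of "id"
reduced <- data.table(id = 1:5,
                      age = c(10, 50, 120, 30, 80),
                      biomass = c(2, 9, 25, 5, 14))
setkey(reduced, id)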

raster and data.table together

  • There is a key helper function:
?rasterizeReduced

What does this do?
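
Roughly: it expands a column of the reduced data.table back onto the full raster via the shared id. A minimal sketch continuing the objects above, assuming the reduced/fullRaster/newRasterCols/mapcode arguments shown in ?rasterizeReduced:
ageRas <- rasterizeReduced(reduced, fullRaster = idRas,
                           newRasterCols = "age", mapcode = "id")
plot(ageRas)  # a full-resolution raster of "age", rebuilt from 5 rows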

The Rcpp package

Profiling and Benchmarking (1)

  • In general, the usual advice is to worry about execution speed 'later'
  • That advice is not 100% true for R
  • If you use vectorization (no or few loops), and these packages listed here, then you will have a good start
  • AFTER that, then you can use 2 great tools:

    • profvis package (built into the latest RStudio previews, but not the official release version)
    • microbenchmark package

Profiling and Benchmarking (2)

microbenchmark::microbenchmark(
  loop = {
    a <- vector()
    for (i in 1:1000) a[i] <- rnorm(1)
  },
  vectorized = { a <- rnorm(1000) }
)
## Unit: microseconds
##        expr      min       lq      mean    median       uq       max neval
##        loop 4359.033 4769.864 5861.9357 5217.3890 6005.141 43314.569   100
##  vectorized   92.191   93.715  102.4152   95.3465  101.329   175.683   100

Profiling and Benchmarking (3)

If you have RStudio version >= 0.99.1208, then profiling is available as a menu item.

  • Alternatively, wrap any block of code in profvis::profvis()

  • This can be a spades() call, so it will show you the entire model:

profvis::profvis({a <- rnorm(10000000)})

Profiling the spades call

Try it:

library(SpaDES)

mySim <- simInit(
   times = list(start = 0.0, end = 2.0, timeunit = "year"),
   params = list(
     .globals = list(stackName = "landscape", burnStats = "nPixelsBurned")
   ),
   modules = list("randomLandscapes", "fireSpread", "caribouMovement"),
   paths = list(modulePath = system.file("sampleModules", package = "SpaDES"))
)
profvis::profvis({spades(mySim)})

When to profile

  • First, you should have started building your code with the packages we have discussed
  • If your code is already full of loops by the time you are ready to profile, it will be too late to fix easily

If you have used these tools, then:

  • When you have mostly finished whatever you are coding
  • Don't ever start making code more efficient until you have profiled
  • It is almost impossible to tell which bits are the slow parts, without profiling or benchmarking

Strategies for profiling

  • Can do an entire SpaDES model call
  • Can pinpoint specific functions
  • Can test alternative ways of implementing the same thing

Next

Building models with modules …