December 9, 2016

Need for speed

R is widely seen as being 'slow' (see the Julia benchmarks web page)

But if you use a few specific tools, this becomes largely irrelevant: several R packages provide the speed you need

An aside

Pure R, when the most efficient vectorized code is used, appears to run at about half the speed of the most efficient C++.

See Hadley Wickham's page on Rcpp (scroll down to "Vector input, vector output"), noting that if it took 10 minutes to write the C++ version, you would have to call it roughly 150,000 times before the speed gain paid for the programming time.

Need for speed

Spatial simulation means doing the same thing over and over and over … so we need speed

We will show how to profile your code at the end of this section.

Predictive Ecology blog posts about R speed

"Vectorization"

  • This is at the core of making R fast. Without vectorization, R is probably not useful as a simulation engine.
# Instead of growing a vector inside a loop:
a <- vector()
for (i in 1:1000) {
  a[i] <- rnorm(1)
}

# use the vectorized version; most base functions, rnorm() included,
# already operate on whole vectors
a <- rnorm(1000)

Vectors and Matrices

  • These are as fast as you can get in R
  • Fast numerical operations
  • Faster than data.frame (see the benchmark sketch below)
  • Anything that is in pure vectors or matrices is 'fast enough'
  • It is always a challenge to keep all code in vectors and matrices
  • Thus the following packages…
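
For example, a minimal benchmark sketch of the matrix vs. data.frame gap (timings are illustrative and will vary by machine):
library(microbenchmark)
m <- matrix(runif(1e6), ncol = 10)
df <- as.data.frame(m)
microbenchmark(
  matrix     = colSums(m),
  data.frame = colSums(df)  # coerces to a matrix internally, so slower
)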

Spatial simulation

  • Spatial simulation (i.e., across both time and space) requires more than just spatial data manipulation
  • Sometimes it is just base R stuff
  • Need to learn how to write functions (allows reusability; see the sketch after this list)
  • Need to learn a few key packages that are critical for speed
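
For example, a minimal sketch of wrapping reusable logic in a function (the helper rescale01 is made up for illustration):
# rescale a numeric vector to [0, 1]; reusable on any numeric input
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(rnorm(10))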

Key packages for spatial simulation

  • base package – anything kept as a matrix or vector is 'fast'
  • raster - for spatially referenced matrices

    • not always fast enough; sometimes we copy the data into a matrix, manipulate it there, then return it to the raster object (see the sketch after this list)
  • sp - equivalent of vector shapefiles in a GIS

    • Polygons, Points, Lines
    • Not always fast, but essential to have
  • see also sf
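
The raster-to-matrix pattern mentioned above, as a minimal sketch (assuming the values fit comfortably in memory):
library(raster)
r <- raster(nrows = 100, ncols = 100)
r[] <- runif(ncell(r))
v <- getValues(r)     # extract values as a plain, fast numeric vector
v <- v * 2 + 1        # any pure-vector manipulation
r <- setValues(r, v)  # return the result to the raster object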

Key packages for spatial simulation

  • data.table

    • For data.frame type data (i.e., columns of data)
    • Very fast once tables get large, but can actually be slower than a plain data.frame when tables are small (<100,000 rows)
  • SpaDES – many functions; will be moved into a separate package soon

  • Rcpp

    • R interface to C++. When you need something fast and can't get it fast enough with existing tools/packages, you can write your own (we will not go further into this here)

What we will do here

  • We will go through SpaDES functions quickly, because there are fewer tutorials online for these
  • We will show links to various tutorials for raster, sp, data.table, Rcpp
  • Each person should decide which tool is the most useful to them
  • Put something into practice

SpaDES functions

  • These are all potentially useful for building spatio-temporal models
?`spades-package` # section 2 shows many functions

# e.g.,
?spread
?move
?cir
?adj
?distanceFromEachPoint
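
For a quick taste, a minimal sketch of spread() simulating fire-like spread from five random start cells (argument names follow ?spread; the values, and output details across versions, are illustrative):
library(raster)
library(SpaDES)
r <- raster(extent(0, 100, 0, 100), res = 1)
r[] <- 1
fires <- spread(r, loci = sample(ncell(r), 5), spreadProb = 0.225)
plot(fires)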

Working with spatial data

raster

sp

sf

  • Relatively new
  • Implements the Simple Features standard, the current OGC standard for vector GIS data
  • Very fast, especially reading/writing large data
  • CRAN
  • GitHub
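
A minimal sketch of reading and plotting vector data with sf, using the sample shapefile that ships with the package:
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
plot(st_geometry(nc))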

The data.table package

From every data.table user ever:

WOW that's fast!


install.packages('data.table')

(at least for large tables!)
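
A minimal sketch of the keyed data.table style that pays off on large tables (the column names pixelID, veg, and age are made up for illustration):
library(data.table)
n <- 1e6
dt <- data.table(pixelID = seq_len(n),
                 veg = sample(c("pine", "spruce", "aspen"), n, replace = TRUE),
                 age = sample(0:200, n, replace = TRUE))
setkey(dt, veg)  # keying speeds up joins and grouped subsetting
dt[, .(meanAge = mean(age)), by = veg]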

data.table tutorials

raster and data.table together

  • The current implementation of LANDIS-SpaDES uses a "reduced" data structure throughout

  • Instead of keeping rasters of everything (there is often redundancy, i.e., two adjacent pixels may be identical)

  • We make one raster of "id" and one data.table with a column called "id"

  • Then we can have as many columns as we want of information about each of these places

  • Like "polygons", but for rasters, and dynamic… can change over time

  • This may be useful for your own module (a sketch follows)
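
A minimal sketch of building this reduced structure by hand (the five "places" and the attribute columns are made up for illustration):
library(raster)
library(data.table)
idRas <- raster(nrows = 100, ncols = 100)
idRas[] <- sample(1:5, ncell(idRas), replace = TRUE)  # raster of "id"
reduced <- data.table(id = 1:5,
                      age = c(10, 50, 120, 30, 80),
                      biomass = c(2, 9, 25, 5, 14))
setkey(reduced, id)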

raster and data.table together

  • There is a key helper function:
?rasterizeReduced

What does this do?
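
Roughly: it expands a column of the reduced data.table back onto the full raster via the shared id. A minimal sketch continuing the objects above, assuming the reduced/fullRaster/newRasterCols/mapcode arguments shown in ?rasterizeReduced:
ageRas <- rasterizeReduced(reduced, fullRaster = idRas,
                           newRasterCols = "age", mapcode = "id")
plot(ageRas)  # a full-resolution raster of "age", rebuilt from 5 rows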

The Rcpp package

Profiling and Benchmarking (1)

  • In general, the usual advice is to worry about execution speed 'later'
  • That advice is not 100% true for R
  • If you use vectorization (no or few loops), and these packages listed here, then you will have a good start
  • AFTER that, then you can use 2 great tools:

    • profvis package (built into the latest RStudio previews, but not the official release version)
    • microbenchmark package

Profiling and Benchmarking (2)

microbenchmark::microbenchmark(
  loop = {
    a <- vector()
    for (i in 1:1000) a[i] <- rnorm(1)
  },
  vectorized = { a <- rnorm(1000) }
)
## Unit: microseconds
##        expr      min       lq      mean    median       uq       max neval
##        loop 4359.033 4769.864 5861.9357 5217.3890 6005.141 43314.569   100
##  vectorized   92.191   93.715  102.4152   95.3465  101.329   175.683   100

Profiling and Benchmarking (3)

If you have RStudio version >= 0.99.1208, then profiling is available as a menu item.

  • Alternatively, wrap any block of code in profvis::profvis()

  • This can be a spades() call, so it will show you the entire model:

profvis::profvis({a <- rnorm(10000000)})

Profiling the spades call

Try it:

library(SpaDES)

mySim <- simInit(
   times = list(start = 0.0, end = 2.0, timeunit = "year"),
   params = list(
     .globals = list(stackName = "landscape", burnStats = "nPixelsBurned")
   ),
   modules = list("randomLandscapes", "fireSpread", "caribouMovement"),
   paths = list(modulePath = system.file("sampleModules", package = "SpaDES"))
)
profvis::profvis({spades(mySim)})

When to profile

  • First, you should have started building your code with the packages we have discussed
  • If your code is already full of loops by the time you are ready to profile, it will be too late to fix easily

If you have used these tools, then:

  • When you have mostly finished whatever you are coding
  • Don't ever start making code more efficient until you have profiled
  • It is almost impossible to tell which bits are the slow parts, without profiling or benchmarking

Strategies for profiling

  • Can do an entire SpaDES model call
  • Can pinpoint specific functions
  • Can test alternative ways of implementing the same thing

Next

Building models with modules …