Efficient R Code

library("microbenchmark")
library("profvis")
library("ggplot2")

## Warning: package 'ggplot2' was built under R version 3.6.2

library("benchmarkme")

## Warning: package 'benchmarkme' was built under R version 3.6.2

library("compiler")
library("memoise")
library("DiagrammeR")

## Warning: package 'DiagrammeR' was built under R version 3.6.2

library("dplyr")
library("data.table")
library("tibble")

## Warning: package 'tibble' was built under R version 3.6.2

library("tidyr")

## Warning: package 'tidyr' was built under R version 3.6.2

library("stringr")
library("Rcpp")

Benchmarking

for multi-line benchmarks, the term to be evaluated can be surrounded by curly brackets, {}

df = data.frame(v = 1:4, name = letters[1:4])
microbenchmark(df[3, 2], df[3, "name"], df$name[3],
                unit = "s")

## Unit: seconds
##           expr        min          lq         mean      median          uq
##       df[3, 2] 2.6828e-05 2.73960e-05 3.041567e-05 2.77705e-05 2.81015e-05
##  df[3, "name"] 2.6515e-05 2.72565e-05 2.906456e-05 2.76885e-05 2.80630e-05
##     df$name[3] 7.7270e-06 8.53500e-06 9.508680e-06 9.22650e-06 9.53250e-06
##          max neval
##  0.000216272   100
##  0.000057466   100
##  0.000022864   100

x = 1:100 # initiate vector to cumulatively sum

# Method 1: with a for loop (10 lines)
cs_for = function(x){
  for(i in x){
    if(i == 1){
      xc = x[i]
    } else {
      xc = c(xc, sum(x[1:i]))
    }
  }
  xc
}

# Method 2: with apply (3 lines)
cs_apply = function(x){
  sapply(x, function(x) sum(1:x))
}

# Method 3: cumsum (1 line, not shown)
microbenchmark(cs_for(x), cs_apply(x), cumsum(x))

## Unit: nanoseconds
##         expr    min       lq      mean   median       uq     max neval
##    cs_for(x) 169789 212184.5 322481.51 223609.5 242868.0 9043050   100
##  cs_apply(x) 122224 128553.0 202406.59 131819.5 146704.5 5422187   100
##    cumsum(x)    813   1071.0   1723.01   1436.0   1824.5    6045   100

Profiling

profvis(expr = {
  
  # Stage 1: load packages
  library("rnoaa") # not necessary as data pre-saved
  library("ggplot2")
  
  # Stage 2: load and process data
  out = readRDS("extdata/out-ice.Rds")
  df = dplyr::rbind_all(out, id = "Year")
  
  # Stage 3: visualise output
  ggplot(df, aes(long, lat, group = paste(group, Year))) +
    geom_path(aes(colour = Year)) 
  ggsave("figures/icesheet-test.png")
}, interval = 0.01, prof_output = "ice-prof")

Efficient Sep-up

Sys.info()

##                                                                                            sysname 
##                                                                                           "Darwin" 
##                                                                                            release 
##                                                                                           "18.7.0" 
##                                                                                            version 
## "Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64" 
##                                                                                           nodename 
##                                                                                       "PT51.local" 
##                                                                                            machine 
##                                                                                           "x86_64" 
##                                                                                              login 
##                                                                                     "miguelarquez" 
##                                                                                               user 
##                                                                                     "miguelarquez" 
##                                                                                     effective_user 
##                                                                                     "miguelarquez"

R Version

installr::updateR()

Storing package names in a character vector object such as pkgs is also useful because it allows us to refer back to them again and again.

pkgs = c("raster", "leaflet", "rgeos") # package names
install.packages(pkgs)

inst = lapply(pkgs, library, character.only = TRUE) # load them

On Windows the installr package helps manage and update R packages with system-level dependencies. For example the Rtools package for compiling C/C++ code on Windows can be installed with the following command:

installr::install.rtools()

updating packages

update.packages(ask = FALSE) # update packages automatically

R startup arguments

A number of arguments can be appended to the R startup command (R in a shell environment) which relate to startup. The following are particularly important:

--no-environ and --no-init arguments tell R to only look for startup files (described in the next section) in the current working directory.
--no-restore tells R not to load a file called .RData (the default name for R session files) that may be present in the current working directory.
--no-save tells R not to ask the user if they want to save objects saved in RAM when the session is ended with q()

A concise way to load a ‘vanilla’ version of R, with all of the above options enabled is with an option of the same name:

R --vanilla

Location of startup file

Two files are read each time R starts (unless one of the command line options outlined above is used):

.Renviron, the primary purpose of which is to set environment variables. These tell R where to find external programs and can hold user-specific information than needs to be kept secret, typically API keys.
.Rprofile is a plain text file (which is always called .Rprofile, hence its name) that simply runs lines of R code every time R starts. If you want R to check for package updates each time it starts (as explained in the previous section), you simply add the relevant line somewhere in this file.

Confusingly, multiple versions of these files can exist on the same computer, only one of which will be used per session. Note also that these files should only be changed with caution and if you know what you are doing. This is because they can make your R version behave differently to other R installations, potentially reducing the reproducibility of your code.

Files in three folders are important in this process:

R_HOME, the directory in which R is installed. The etc sub-directory can contain start-up files read early on in the start-up process. Find out where your R_HOME is with the R.home() command.
HOME, the user’s home directory. Typically this is /home/username on Unix machines or C:\Users\username on Windows (since Windows 7). Ask R where your home directory is with, Sys.getenv("HOME").

R’s current working directory. This is reported by getwd().

File paths provided by Windows operating systems will not always work in R. Specifically, if you use a path that contains single backslashes, such as C:\DATA\data.csv, as provided by Windows, this will generate the error: Error: unexpected input in “C:”. To overcome this issue R provides two functions, file.path()and normalizePath(). The former can be used to specify file locations without having to use symbols to represent relative file paths, as follows: file.path(“C:”, “DATA”, “data.csv”). The latter takes any input string for a file name and outputs a text string that is standard (canonical) for the operating system.

normalizePath(“C:/DATA/data.csv”), for example, outputs C:\DATA\data.csv on a Windows machine but C:/DATA/data.csv on Unix-based platforms. Note that only the latter would work on both platforms so standard Unix file path notation is safe for all operating systems.

options(prompt = "R", continue = " ", digits = 4, show.signif.stars = FALSE)

The R prompt, from the boring > to the exciting R>.
The number of digits displayed.
Removing the stars after significant p -values.
Removing the + in multi-line functions.

To avoid setting the CRAN mirror each time you run install.packages() you can permanently set the mirror in your .Rprofile

# `local` creates a new, empty environment
# This avoids polluting .GlobalEnv with the object r
local({
  r = getOption("repos")             
  r["CRAN"] = "https://cran.rstudio.com/"
  options(repos = r)
})

Efficient programming

Be careful never to grow vectors.
Vectorise code whenever possible.
Use factors when appropriate.
Avoid unnecessary computation by caching variables.
Byte compile packages for an easy performance boost.

run_if <-  function (n, min = 0, max = 1) 
  .Call(C_runif, n, min, max)

n = 100

method1 = function(n) {
  vec = NULL # Or vec = c()
  for(i in seq_len(n))
    vec = c(vec, i)
  vec
}

method2 = function(n) {
  vec = numeric(n)
  for(i in seq_len(n))
    vec[i] = i
  vec
}

method3 = function(n) seq_len(n)


microbenchmark(times = 100, unit = "s", 
               method1(n), method2(n), method3(n))

## Unit: seconds
##        expr       min        lq      mean    median        uq      max neval
##  method1(n) 4.434e-05 4.672e-05 1.009e-04 5.752e-05 6.233e-05 0.004204   100
##  method2(n) 1.061e-05 1.129e-05 6.498e-05 1.199e-05 1.367e-05 0.005196   100
##  method3(n) 4.760e-07 6.830e-07 1.721e-05 8.370e-07 1.200e-06 0.001614   100

Vectorized code

# non vectorized code
log_sum = 0
for(i in 1:length(x))
  log_sum = log_sum + log(x[i])

#vectorized code


log_sum = sum(log(x))

Communicating with the user

When we create a function we often want the function to give efficient feedback on the current state. For example, are there missing arguments or has a numerical calculation failed. There are three main techniques for communicating with the user.

Fatal errors are raised by calling the stop(), i.e. execution is terminated.

Suppose we call a function that raises an error. What then? Efficient, robust code catches the error and handles it appropriately. Errors can be caught using try() and tryCatch(). For example,

# Suppress the error message
good = try(1 + 1, silent = TRUE)
bad = try(1 + "1", silent = TRUE)

if(class(bad) == "try-error")
  # Do something 
    stop("error")

warnings

Warnings are generated using the warning() function. When a warning is raised, it indicates potential problems. For example,ean(NULL) returns NA and also raises a warning

Warnings can be hidden using `suppressWarnings()

Informative output

To give informative output, use the message() function. Another function used for printing messages is cat(). In general cat() should only be used in print()/show() methods, e..

Factors

Factors are much maligned objects. While at times they are awkward, they do have their uses. A factor is used to store categorical variables. This data type is unique to R (or at least not common among programming languages). The difference between factors and strings is important because R treats factors and strings differently. Although factors look similar to character vectors, they are actually integers. This leads to initially surprising behaviour

x = 4:6

c(factor(x))

## [1] 1 2 3

m = c("January", "December", "March")

fac_m = factor(m, levels = month.name)
sort(fac_m)

## [1] January  March    December
## 12 Levels: January February March April May June July August ... December

Apply family

StackOverFlow post

https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family

https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

The apply functions can be an alternative to writing for loops.

Each function takes at least two arguments: an object and another function. The function is passed as an argument.
Every apply function has the dots, ..., argument that is used to pass on arguments to the function that is given as an argument
apply: apply function over array margins
by: apply a function to a dataframe split by factors
eapply: apply a function over other values in an environment
lapply: apply a function over a list or vector
mapply: apply a function to multiple list or vector arguments
rapply: recursively apply a function to a list
tapply: apply a function over a ragged array

The functions sapply() and vapply() are similar to lapply(), but the return type is not necessary a list.

# MARGIN=1: corresponds to rows
apply(matrix(1:9, ncol = 3), 1, sd)

## [1] 3 3 3

Additional arguments can be passed to the function that is to be applied to the data. For example, to pass the na.rm argument to the sd function, we have:

apply(matrix(1:9, ncol = 3), 1, sd, na.rm = TRUE)

## [1] 3 3 3

Caching variables

A straightforward method for speeding up code is to calculate objects once and reuse the value when necessary. This could be as simple as replacing sd(x) in multiple function calls with the object sd_x that is defined once and reused. For example, suppose we wish to normalise each column of a matrix. However, instead of using the standard deviation of each column, we will use the standard deviation of the entire data set

apply(x, 2, function(i) mean(i) / sd(x))

This is inefficient since the value of sd(x) is constant and thus recalculating the standard deviation for every column is unnecessary. Instead we should evaluate once and store the result

# cached version
sd_x = sd(x)
apply(x, 2, function(i) mean(i) / sd_x)

If we compare the two methods on a 100 row by 1000 column matrix, the cached version is around 100 times faster

Memoise package

A more advanced form of caching is to use the memoise package. If a function is called multiple times with the same input, it may be possible to speed things up by keeping a cache of known answers that it can retrieve.

Byte compiler

We create an inefficient function for calculating the mean. This function takes in a vector, calculates the length and then updates the m variable.

mean_r = function(x) {
  m = 0
  n = length(x)
  for(i in seq_len(n))
    m = m + x[i] / n
  m
}

This is clearly a bad function and we should just use the mean() function, but it’s a useful comparison. Compiling the function is straightforwar

cmp_mean_r = cmpfun(mean_r)

Then we use the microbenchmark() function to compare the three variants

# Generate some data
x = rnorm(1000)
microbenchmark(times = 10, unit = "ms", # milliseconds
          mean_r(x), cmp_mean_r(x), mean(x))

## Unit: milliseconds
##           expr      min       lq    mean   median       uq     max neval
##      mean_r(x) 0.070153 0.070394 0.68445 0.070666 0.074573 6.19278    10
##  cmp_mean_r(x) 0.070225 0.070458 0.07339 0.070857 0.072772 0.08345    10
##        mean(x) 0.006811 0.007082 0.01116 0.007315 0.009102 0.03198    10

A final option is to use just-in-time (JIT) compilation. The enableJIT() function disables JIT compilation if the argument is 0. Arguments 1, 2, or 3 implement different levels of optimisation. JIT can also be enabled by setting the environment variable R_ENABLE_JIT, to one of these values recommended setting the compile level to the maximum value of 3.

Efficient Workflow

Start without writing code but with a clear mind and perhaps a pen and paper. This will ensure you keep your objectives at the forefront of your mind, without getting lost in the technology.
Make a plan. The size and nature will depend on the project but time-lines, resources and ‘chunking’ the work will make you more effective when you start.
Select the packages you will use for implementing the plan early. Minutes spent researching and selecting from the available options could save hours in the future.
Document your work at every stage: work can only be effective if it’s communicated clearly and code can only be efficiently understood if it’s commented.
Make your entire workflow as reproducible as possible. knitr can help with this in the phase of documentation.

SMART workflow

Specific: is the objective clearly defined and self-contained?
Measurable: is there a clear indication of its completion?
Attainable: can the target be achieved? This can also refer to Assigned (to a person).
Realistic: have sufficient resources been allocated to the task?
Time-bound: is there an associated completion date or milestone?

mermaid("gantt
        Section Initiation
        Planning           :a1, 2016-01-01, 10d
        Data processing    :after a1  , 30d")

Efficient I/O

library("rio")
library("readr")
library("data.table")
library("feather")
library("WDI")

If possible, keep the names of local files downloaded from the internet or copied onto your computer unchanged. This will help you trace the provenance of the data in the future.
R’s native file format is .Rds. These files can be imported and exported using readRDS() and saveRDS() for fast and space efficient data storage.
Use import() from the rio package to efficiently import data from a wide range of formats, avoiding the hassle of loading format-specific libraries.
Use the readr or data.table equivalents of read.table() to efficiently import large text files.
Use file.size() and object.size() to keep track of the size of files and R objects and take action if they get too big.

rio package

rio aims to “simplify the process of importing data into R and exporting data from R.

# Specify a file
fname = system.file("voc_voyages.tsv", package = "efficient")
# Import the file (uses the fread function from data.table)
voyages = import(fname)
# Export the file as an Excel spreadsheet
export(voyages, "voc_voyages.xlsx")

capitals = import("https://github.com/mledoze/countries/raw/master/countries.json")

Below around 1 MB read.csv() is actually faster than read_csv() while fread is much faster than both, although these savings are likely to be inconsequential for such small datasets.

For files beyond 100 MB in size fread() and read_csv() can be expected to be around 5 times faster than read.csv(). This efficiency gain may be inconsequential for a one-off file of 100 MB running on a fast computer (which still takes less than a minute with read.csv()), but could represent an important speed-up if you frequently load large text files

read_*() decides what class each variable is based on the first 1000 rows, rather than all rows, as base read.*() functions do. Printing the offending elemen

Binary formats

.Rds and .RData are R’s native binary file formats. These formats are optimised for speed and compression ratios. But what is the difference between them? The following code chunk demonstrates the key difference between them:

save(df_co2, file = "extdata/co2.RData")
saveRDS(df_co2, "extdata/co2.Rds")
load("extdata/co2.RData")
df_co2_rds = readRDS("extdata/co2.Rds")
identical(df_co2, df_co2_rds)

The first method is the most widely used. It uses the save() function which takes any number of R objects and writes them to a file, which must be specified by the file = argument save() is like save.image(), which saves all the objects currently loaded in R

Feather file format

Feather was developed as a collaboration between R and Python developers to create a fast, light and language agnostic format for storing data frames.

write_feather(df_co2, "extdata/co2.feather")
df_co2_feather = read_feather("extdata/co2.feather")

import feather as ft
path = 'data/co2.feather'
df_co2_feather = ft.read_dataframe(path)

Getting data from internet

url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip"
download.file(url, "voc_tsvs.zip") # download file
unzip("voc_tsvs.zip", exdir = "data") # unzip files
file.remove("voc_tsvs.zip") # tidy up by removing the zip file

## [1] TRUE

url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv"
download.file(url, "co2.csv")
df_co2 = read_csv("extdata/co2.csv")

Accessing data stored in packages

data(package = "dplyr")

list.files(system.file("extdata", package = "readr"))

## [1] "challenge.csv"     "epa78.txt"         "example.log"      
## [4] "fwf-sample.txt"    "massey-rating.txt" "mtcars.csv"       
## [7] "mtcars.csv.bz2"    "mtcars.csv.zip"

Efficient Data Carpentry

Time spent preparing your data at the beginning can save hours of frustration in the long run.
‘Tidy data’ provides a concept for organising data and the package tidyr provides some functions for this work.
The data_frame class defined by the tibble package makes datasets efficient to print and easy to work with.
dplyr provides fast and intuitive data processing functions; data.table has unmatched speed for some data processing applications.
The %>% ‘pipe’ operator can help clarify complex data processing workflows.

You can create a tibble data frame row-by-row using the tribble function

Tidying data with tidyr and regular expressions

Data tidying includes data cleaning and data reshaping. Data cleaning is the process of re-formatting and labelling messy data. Packages including stringi and stringr can help update messy character strings using regular expressions; assertive and assertr packages can perform diagnostic checks for data integrity at the outset of a data analysis project.

Tidy data has the following characteristics (H. Wickham 2014b):

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Optimising code

Before you start to optimise your code, ensure you know where the bottleneck lies; use a code profiler.
If the data in your data frame is all of the same type, consider converting it to a matrix for a speed boost.
Use specialised row and column functions whenever possible.
The parallel package is ideal for Monte-Carlo simulations. For optimal performance, consider re-writing key parts of your code in C++.

profvis({
  data(movies, package = "ggplot2movies") # Load data
  movies = movies[movies$Comedy == 1,]
  plot(movies$year, movies$rating)
  model = loess(rating ~ year, data = movies) # loess regression line
  j = order(movies$year)
  lines(movies$year[j], model$fitted[j]) # Add line to the plot
})

Efficient base R

If() and ifelse()

An additional quirk of ifelse() is that although it is more programmer efficient, as it is more concise and understandable than multi-line alternatives, it is often less computationally efficient than a more verbose alternative. This is illustrated with the following benchmark, in which the second option runs around 20 times faster, despite the results being identical:

marks = runif(n = 10e6, min = 30, max = 99)
system.time({
  result1 = ifelse(marks >= 40, "pass", "fail")
})

##    user  system elapsed 
##   2.814   0.149   2.998

#>    user  system elapsed 
#>   4.012   0.276   4.286
system.time({
  result2 = rep("fail", length(marks)) 
  result2[marks >= 40] = "pass"
})

##    user  system elapsed 
##   0.183   0.000   0.184

#>    user  system elapsed 
#>   0.192   0.032   0.223
identical(result1, result2)

## [1] TRUE

system.time({
  result3 = dplyr::if_else(marks >= 40, "pass", "fail")
})

##    user  system elapsed 
##   1.053   0.092   1.225

#>    user  system elapsed 
#>   1.032   0.104   1.134
identical(result1, result3)

## [1] TRUE

#> [1] TRUE

Sorting and ordering

Sorting a vector is relatively quick; sorting a vector of length \(10^7\) takes around 0.01 seconds. If you only sort a vector once at the top of a script, then don’t worry too much about this. However if you are sorting inside a loop, or in a shiny application, then it can be worthwhile thinking about how to optimise this operation.

There are currently three sorting algorithms, c("shell", "quick", "radix") that can be specified in the sort() function; withadix being a new addition to R 3.3. Typically the radix (the non-default option) is the most computationally efficient option for most situations (it is around 20% faster when sorting a large vector of doubles).

Another useful trick is to partially order the results. For example, if you only want to display the top ten results, then use theartial argument, i.e.sort(x, partial = 1:10)`. For very large vectors, this can give a three fold speed increase.

A factor is just a vector of integers with associated levels. Occasionally we want to convert a factor into its numerical equivalent. The most efficient way of doing this (especially for long factors) is:

as.numeric(levels(f))[f]

the non-vectorised version, &&, only executes the second component if needed. This is efficient and leads to neater code, e.g

# We only calculate the mean if data doesn't contain NAs
if(all(!is.na(x)) && mean(x) > 0) {
  # Do something
}

Row and columns operations

There are optimised functions for calculating row and columns sums/means, rowSums(), colSums(), rowMeans() and colMeans() that should be used whenever possible. The package matrixStats contains many optimised row/col functions

is.na() and anyNA()

To test whether a vector (or other object) contains missing values we use the is.na() function. Often we are interested in whether a vector contains any missing values. In this case, anyNA(x) is more efficient than any(is.na(x))

Integers

Integers are more space efficient. The code below compares the size of an integer vector to a standard numeric vector:

pryr::object_size(1:10000)

## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp

## 40 kB

#> 40 kB
pryr::object_size(y = seq(1, 10000, by=1.0))

## 80 kB

#> 80 kB

is.integer(1L + 1)

## [1] FALSE

#> [1] FALSE

Sparse Matrix

Another data structure that can be stored efficiently is a sparse matrix. This is simply a matrix where most of the elements are zero. Conversely, if most elements are non-zero, the matrix is considered dense. The proportion of non-zero elements is called the sparsity. Large sparse matrices often crop up when performing numerical calculations. Typically, our data isn’t sparse but the resulting data structures we create may be sparse. There are a number of techniques/methods used to store sparse matrices. Methods for creating sparse matrices can be found in the Matrix package

library("Matrix")
N = 10000
sp = sparseMatrix(1:N, 1:N, x = 1)
m = diag(1, N, N)

pryr::object_size(sp)
#> 161 kB
pryr::object_size(m)
#> 800 MB

Rcpp

A C++ function is similar to an R function: you pass a set of inputs to the function, some code is run, a single object is returned. However there are some key differences.

In the C++ function each line must be terminated with ; In R, we use ; only when we have multiple statements on the same line.
We must declare object types in the C++ version. In particular we need to declare the types of the function arguments, return value and any intermediate objects we create.
The function must have an explicit return statement. Similar to R, there can be multiple returns, but the function will terminate when it hits it’s first return statement.
You do not use assignment when creating a function.
Object assignment must use = sign. The <- operator isn’t valid.
One line comments can be created using //. Multi-line comments are created using /…/

Suppose we want to create a function that adds two numbers together. In R this would be a simple one line affair:

add_r = function(x, y) x + y

In C++ it is a bit more long winded:


/* Return type double
 * Two arguments, also doubles
 */

double add_cpp(double x, double y) { 
  double value = x + y;
  return value;
}

cppFunction('

    double add_cpp(double x, double y) {
        double value = x + y;
        return value;
      }
  
')

add_cpp

## function (x, y) 
## .Call(<pointer: 0x117c0d620>, x, y)

add_cpp(1, 2)

## [1] 3

C++ data types

The most basic type of variable is an integer, int. An int variable can store a value in the range −32768 to +32767 . To store floating point numbers, there are single precision numbers, float and double precision numbers, double. A double takes twice as much memory as a float (in general, we should always work with double precision numbers unless we have a compelling reason to switch to floats). For single characters, we use the char data type

char A single character.
int An integer.
float A single precision floating point number.
double A double-precision floating point number.
void A valueless quantity

The sourceCpp() function

the cppFunction() is great for getting small examples up and running. But it is better practice to put your C++ code in a separate file (with file extension cpp) and use the function call sourceCpp("path/to/file.cpp") to compile them. However we need to include a few headers at the top of the file. The first line we add gives us access to the Rcpp functions. The file Rcpp.h contains a list of function and class definitions supplied by Rcpp. This file will be located where Rcpp is installed.


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double add_cpp(double x, double y) {
  double value = x + y;
  return value;
}

Efficient Hardware

Use the package benchmarkme to assess your CPUs number crunching ability; is it worth upgrading your hardware?
If possible, add more RAM.
Double check that you have installed a 64 -bit version of R.
Cloud computing is a cost effective way of obtaining more compute power.
A solid state drive typically won’t have much impact on the speed of your R code, but will increase your overall productivity since I/0 is much faster.
A rough rule of thumb is that your RAM should be three times the size of your data set.
Having an SSD drive doesn’t make much difference to R. However, the reduction in boot time and general tasks makes an SSD drive a wonderful purchase.

Efficient collaboration

Have a consistent coding style.
Think carefully about your comments and keep them up to date.
Use version control whenever possible.
Use informative commit messages.
Don’t be afraid to elicit feedback from colleagues.

There are general principles that most programmers agree on, such as:

Use modular code;
Comment your code;
Don’t Repeat Yourself (DRY);
Be concise, clear and consistent.