library("microbenchmark")
library("profvis")
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.6.2
library("benchmarkme")
## Warning: package 'benchmarkme' was built under R version 3.6.2
library("compiler")
library("memoise")
library("DiagrammeR")
## Warning: package 'DiagrammeR' was built under R version 3.6.2
library("dplyr")
library("data.table")
library("tibble")
## Warning: package 'tibble' was built under R version 3.6.2
library("tidyr")
## Warning: package 'tidyr' was built under R version 3.6.2
library("stringr")
library("Rcpp")
for multi-line benchmarks, the term to be evaluated can be surrounded by curly brackets, {}
df = data.frame(v = 1:4, name = letters[1:4])
microbenchmark(df[3, 2], df[3, "name"], df$name[3],
unit = "s")
## Unit: seconds
## expr min lq mean median uq
## df[3, 2] 2.6828e-05 2.73960e-05 3.041567e-05 2.77705e-05 2.81015e-05
## df[3, "name"] 2.6515e-05 2.72565e-05 2.906456e-05 2.76885e-05 2.80630e-05
## df$name[3] 7.7270e-06 8.53500e-06 9.508680e-06 9.22650e-06 9.53250e-06
## max neval
## 0.000216272 100
## 0.000057466 100
## 0.000022864 100
x = 1:100 # initiate vector to cumulatively sum
# Method 1: with a for loop (10 lines)
cs_for = function(x){
for(i in x){
if(i == 1){
xc = x[i]
} else {
xc = c(xc, sum(x[1:i]))
}
}
xc
}
# Method 2: with apply (3 lines)
cs_apply = function(x){
sapply(x, function(x) sum(1:x))
}
# Method 3: cumsum (1 line, not shown)
microbenchmark(cs_for(x), cs_apply(x), cumsum(x))
## Unit: nanoseconds
## expr min lq mean median uq max neval
## cs_for(x) 169789 212184.5 322481.51 223609.5 242868.0 9043050 100
## cs_apply(x) 122224 128553.0 202406.59 131819.5 146704.5 5422187 100
## cumsum(x) 813 1071.0 1723.01 1436.0 1824.5 6045 100
profvis(expr = {
# Stage 1: load packages
library("rnoaa") # not necessary as data pre-saved
library("ggplot2")
# Stage 2: load and process data
out = readRDS("extdata/out-ice.Rds")
df = dplyr::rbind_all(out, id = "Year")
# Stage 3: visualise output
ggplot(df, aes(long, lat, group = paste(group, Year))) +
geom_path(aes(colour = Year))
ggsave("figures/icesheet-test.png")
}, interval = 0.01, prof_output = "ice-prof")
Sys.info()
## sysname
## "Darwin"
## release
## "18.7.0"
## version
## "Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64"
## nodename
## "PT51.local"
## machine
## "x86_64"
## login
## "miguelarquez"
## user
## "miguelarquez"
## effective_user
## "miguelarquez"
installr::updateR()
Storing package names in a character vector object such as pkgs
is also useful because it allows us to refer back to them again and again.
pkgs = c("raster", "leaflet", "rgeos") # package names
install.packages(pkgs)
inst = lapply(pkgs, library, character.only = TRUE) # load them
On Windows the installr package helps manage and update R packages with system-level dependencies. For example the Rtools package for compiling C/C++ code on Windows can be installed with the following command:
installr::install.rtools()
update.packages(ask = FALSE) # update packages automatically
A number of arguments can be appended to the R startup command (R in a shell environment) which relate to startup. The following are particularly important:
--no-environ
and --no-init
arguments tell R to only look for startup files (described in the next section) in the current working directory.
--no-restore
tells R not to load a file called .RData
(the default name for R session files) that may be present in the current working directory.
--no-save
tells R not to ask the user if they want to save objects saved in RAM when the session is ended with q()
A concise way to load a ‘vanilla’ version of R, with all of the above options enabled is with an option of the same name:
R --vanilla
Two files are read each time R starts (unless one of the command line options outlined above is used):
.Renviron
, the primary purpose of which is to set environment variables. These tell R where to find external programs and can hold user-specific information than needs to be kept secret, typically API keys.
.Rprofile
is a plain text file (which is always called .Rprofile
, hence its name) that simply runs lines of R code every time R starts. If you want R to check for package updates each time it starts (as explained in the previous section), you simply add the relevant line somewhere in this file.
Confusingly, multiple versions of these files can exist on the same computer, only one of which will be used per session. Note also that these files should only be changed with caution and if you know what you are doing. This is because they can make your R version behave differently to other R installations, potentially reducing the reproducibility of your code.
Files in three folders are important in this process:
R_HOME
, the directory in which R is installed. The etc sub-directory can contain start-up files read early on in the start-up process. Find out where your R_HOME
is with the R.home()
command.
HOME
, the user’s home directory. Typically this is /home/username
on Unix machines or C:\Users\username
on Windows (since Windows 7). Ask R where your home directory is with, Sys.getenv("HOME")
.
R’s current working directory. This is reported by getwd()
.
File paths provided by Windows operating systems will not always work in R. Specifically, if you use a path that contains single backslashes, such as C:\DATA\data.csv
, as provided by Windows, this will generate the error: Error: unexpected input in “C:”. To overcome this issue R provides two functions, file.path()
and normalizePath()
. The former can be used to specify file locations without having to use symbols to represent relative file paths, as follows: file.path(“C:”, “DATA”, “data.csv”)
. The latter takes any input string for a file name and outputs a text string that is standard (canonical) for the operating system.
normalizePath(“C:/DATA/data.csv”)
, for example, outputs C:\DATA\data.csv
on a Windows machine but C:/DATA/data.csv
on Unix-based platforms. Note that only the latter would work on both platforms so standard Unix file path notation is safe for all operating systems.
options(prompt = "R", continue = " ", digits = 4, show.signif.stars = FALSE)
To avoid setting the CRAN mirror each time you run install.packages()
you can permanently set the mirror in your .Rprofile
# `local` creates a new, empty environment
# This avoids polluting .GlobalEnv with the object r
local({
r = getOption("repos")
r["CRAN"] = "https://cran.rstudio.com/"
options(repos = r)
})
run_if <- function (n, min = 0, max = 1)
.Call(C_runif, n, min, max)
n = 100
method1 = function(n) {
vec = NULL # Or vec = c()
for(i in seq_len(n))
vec = c(vec, i)
vec
}
method2 = function(n) {
vec = numeric(n)
for(i in seq_len(n))
vec[i] = i
vec
}
method3 = function(n) seq_len(n)
microbenchmark(times = 100, unit = "s",
method1(n), method2(n), method3(n))
## Unit: seconds
## expr min lq mean median uq max neval
## method1(n) 4.434e-05 4.672e-05 1.009e-04 5.752e-05 6.233e-05 0.004204 100
## method2(n) 1.061e-05 1.129e-05 6.498e-05 1.199e-05 1.367e-05 0.005196 100
## method3(n) 4.760e-07 6.830e-07 1.721e-05 8.370e-07 1.200e-06 0.001614 100
# non vectorized code
log_sum = 0
for(i in 1:length(x))
log_sum = log_sum + log(x[i])
#vectorized code
log_sum = sum(log(x))
When we create a function we often want the function to give efficient feedback on the current state. For example, are there missing arguments or has a numerical calculation failed. There are three main techniques for communicating with the user.
Fatal errors are raised by calling the stop()
, i.e. execution is terminated.
Suppose we call a function that raises an error. What then? Efficient, robust code catches the error and handles it appropriately. Errors can be caught using try()
and tryCatch()
. For example,
# Suppress the error message
good = try(1 + 1, silent = TRUE)
bad = try(1 + "1", silent = TRUE)
if(class(bad) == "try-error")
# Do something
stop("error")
Warnings are generated using the warning()
function. When a warning is raised, it indicates potential problems. For example,ean(NULL)
returns NA and also raises a warning
Warnings can be hidden using `suppressWarnings()
To give informative output, use the message()
function. Another function used for printing messages is cat()
. In general cat()
should only be used in print()
/show()
methods, e..
Factors are much maligned objects. While at times they are awkward, they do have their uses. A factor is used to store categorical variables. This data type is unique to R (or at least not common among programming languages). The difference between factors and strings is important because R treats factors and strings differently. Although factors look similar to character vectors, they are actually integers. This leads to initially surprising behaviour
x = 4:6
c(factor(x))
## [1] 1 2 3
m = c("January", "December", "March")
fac_m = factor(m, levels = month.name)
sort(fac_m)
## [1] January March December
## 12 Levels: January February March April May June July August ... December
StackOverFlow post
https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
The apply functions can be an alternative to writing for loops.
Each function takes at least two arguments: an object and another function. The function is passed as an argument.
Every apply function has the dots, ...
, argument that is used to pass on arguments to the function that is given as an argument
apply
: apply function over array margins
by
: apply a function to a dataframe split by factors
eapply
: apply a function over other values in an environment
lapply
: apply a function over a list or vector
mapply
: apply a function to multiple list or vector arguments
rapply
: recursively apply a function to a list
tapply
: apply a function over a ragged array
The functions sapply()
and vapply()
are similar to lapply()
, but the return type is not necessary a list.
# MARGIN=1: corresponds to rows
apply(matrix(1:9, ncol = 3), 1, sd)
## [1] 3 3 3
Additional arguments can be passed to the function that is to be applied to the data. For example, to pass the na.rm
argument to the sd
function, we have:
apply(matrix(1:9, ncol = 3), 1, sd, na.rm = TRUE)
## [1] 3 3 3
A straightforward method for speeding up code is to calculate objects once and reuse the value when necessary. This could be as simple as replacing sd(x)
in multiple function calls with the object sd_x
that is defined once and reused. For example, suppose we wish to normalise each column of a matrix. However, instead of using the standard deviation of each column, we will use the standard deviation of the entire data set
apply(x, 2, function(i) mean(i) / sd(x))
This is inefficient since the value of sd(x)
is constant and thus recalculating the standard deviation for every column is unnecessary. Instead we should evaluate once and store the result
# cached version
sd_x = sd(x)
apply(x, 2, function(i) mean(i) / sd_x)
If we compare the two methods on a 100 row by 1000 column matrix, the cached version is around 100 times faster
A more advanced form of caching is to use the memoise package. If a function is called multiple times with the same input, it may be possible to speed things up by keeping a cache of known answers that it can retrieve.
We create an inefficient function for calculating the mean. This function takes in a vector, calculates the length and then updates the m
variable.
mean_r = function(x) {
m = 0
n = length(x)
for(i in seq_len(n))
m = m + x[i] / n
m
}
This is clearly a bad function and we should just use the mean()
function, but it’s a useful comparison. Compiling the function is straightforwar
cmp_mean_r = cmpfun(mean_r)
Then we use the microbenchmark()
function to compare the three variants
# Generate some data
x = rnorm(1000)
microbenchmark(times = 10, unit = "ms", # milliseconds
mean_r(x), cmp_mean_r(x), mean(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## mean_r(x) 0.070153 0.070394 0.68445 0.070666 0.074573 6.19278 10
## cmp_mean_r(x) 0.070225 0.070458 0.07339 0.070857 0.072772 0.08345 10
## mean(x) 0.006811 0.007082 0.01116 0.007315 0.009102 0.03198 10
A final option is to use just-in-time (JIT) compilation. The enableJIT()
function disables JIT compilation if the argument is 0
. Arguments 1
, 2
, or 3
implement different levels of optimisation. JIT can also be enabled by setting the environment variable R_ENABLE_JIT
, to one of these values recommended setting the compile level to the maximum value of 3.
Start without writing code but with a clear mind and perhaps a pen and paper. This will ensure you keep your objectives at the forefront of your mind, without getting lost in the technology.
Make a plan. The size and nature will depend on the project but time-lines, resources and ‘chunking’ the work will make you more effective when you start.
Select the packages you will use for implementing the plan early. Minutes spent researching and selecting from the available options could save hours in the future.
Document your work at every stage: work can only be effective if it’s communicated clearly and code can only be efficiently understood if it’s commented.
Make your entire workflow as reproducible as possible. knitr can help with this in the phase of documentation.
mermaid("gantt
Section Initiation
Planning :a1, 2016-01-01, 10d
Data processing :after a1 , 30d")
library("rio")
library("readr")
library("data.table")
library("feather")
library("WDI")
If possible, keep the names of local files downloaded from the internet or copied onto your computer unchanged. This will help you trace the provenance of the data in the future.
R’s native file format is .Rds
. These files can be imported and exported using readRDS()
and saveRDS()
for fast and space efficient data storage.
Use import()
from the rio
package to efficiently import data from a wide range of formats, avoiding the hassle of loading format-specific libraries.
Use the readr
or data.table
equivalents of read.table()
to efficiently import large text files.
Use file.size()
and object.size()
to keep track of the size of files and R objects and take action if they get too big.
rio aims to “simplify the process of importing data into R and exporting data from R.
# Specify a file
fname = system.file("voc_voyages.tsv", package = "efficient")
# Import the file (uses the fread function from data.table)
voyages = import(fname)
# Export the file as an Excel spreadsheet
export(voyages, "voc_voyages.xlsx")
capitals = import("https://github.com/mledoze/countries/raw/master/countries.json")
Below around 1 MB read.csv()
is actually faster than read_csv()
while fread
is much faster than both, although these savings are likely to be inconsequential for such small datasets.
For files beyond 100 MB in size fread()
and read_csv()
can be expected to be around 5 times faster than read.csv()
. This efficiency gain may be inconsequential for a one-off file of 100 MB running on a fast computer (which still takes less than a minute with read.csv()
), but could represent an important speed-up if you frequently load large text files
read_*()
decides what class each variable is based on the first 1000 rows, rather than all rows, as base read.*()
functions do. Printing the offending elemen
.Rds
and .RData
are R’s native binary file formats. These formats are optimised for speed and compression ratios. But what is the difference between them? The following code chunk demonstrates the key difference between them:
save(df_co2, file = "extdata/co2.RData")
saveRDS(df_co2, "extdata/co2.Rds")
load("extdata/co2.RData")
df_co2_rds = readRDS("extdata/co2.Rds")
identical(df_co2, df_co2_rds)
The first method is the most widely used. It uses the save()
function which takes any number of R objects and writes them to a file, which must be specified by the file =
argument save()
is like save.image()
, which saves all the objects currently loaded in R
Feather was developed as a collaboration between R and Python developers to create a fast, light and language agnostic format for storing data frames.
write_feather(df_co2, "extdata/co2.feather")
df_co2_feather = read_feather("extdata/co2.feather")
import feather as ft
path = 'data/co2.feather'
df_co2_feather = ft.read_dataframe(path)
url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip"
download.file(url, "voc_tsvs.zip") # download file
unzip("voc_tsvs.zip", exdir = "data") # unzip files
file.remove("voc_tsvs.zip") # tidy up by removing the zip file
## [1] TRUE
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv"
download.file(url, "co2.csv")
df_co2 = read_csv("extdata/co2.csv")
data(package = "dplyr")
list.files(system.file("extdata", package = "readr"))
## [1] "challenge.csv" "epa78.txt" "example.log"
## [4] "fwf-sample.txt" "massey-rating.txt" "mtcars.csv"
## [7] "mtcars.csv.bz2" "mtcars.csv.zip"
Time spent preparing your data at the beginning can save hours of frustration in the long run.
‘Tidy data’ provides a concept for organising data and the package tidyr provides some functions for this work.
The data_frame
class defined by the tibble package makes datasets efficient to print and easy to work with.
dplyr provides fast and intuitive data processing functions; data.table has unmatched speed for some data processing applications.
The %>%
‘pipe’ operator can help clarify complex data processing workflows.
You can create a tibble
data frame row-by-row using the tribble
function
Data tidying includes data cleaning and data reshaping. Data cleaning is the process of re-formatting and labelling messy data. Packages including stringi and stringr can help update messy character strings using regular expressions; assertive and assertr packages can perform diagnostic checks for data integrity at the outset of a data analysis project.
Tidy data has the following characteristics (H. Wickham 2014b):
Before you start to optimise your code, ensure you know where the bottleneck lies; use a code profiler.
If the data in your data frame is all of the same type, consider converting it to a matrix for a speed boost.
Use specialised row and column functions whenever possible.
The parallel package is ideal for Monte-Carlo simulations. For optimal performance, consider re-writing key parts of your code in C++.
profvis({
data(movies, package = "ggplot2movies") # Load data
movies = movies[movies$Comedy == 1,]
plot(movies$year, movies$rating)
model = loess(rating ~ year, data = movies) # loess regression line
j = order(movies$year)
lines(movies$year[j], model$fitted[j]) # Add line to the plot
})
An additional quirk of ifelse()
is that although it is more programmer efficient, as it is more concise and understandable than multi-line alternatives, it is often less computationally efficient than a more verbose alternative. This is illustrated with the following benchmark, in which the second option runs around 20 times faster, despite the results being identical:
marks = runif(n = 10e6, min = 30, max = 99)
system.time({
result1 = ifelse(marks >= 40, "pass", "fail")
})
## user system elapsed
## 2.814 0.149 2.998
#> user system elapsed
#> 4.012 0.276 4.286
system.time({
result2 = rep("fail", length(marks))
result2[marks >= 40] = "pass"
})
## user system elapsed
## 0.183 0.000 0.184
#> user system elapsed
#> 0.192 0.032 0.223
identical(result1, result2)
## [1] TRUE
system.time({
result3 = dplyr::if_else(marks >= 40, "pass", "fail")
})
## user system elapsed
## 1.053 0.092 1.225
#> user system elapsed
#> 1.032 0.104 1.134
identical(result1, result3)
## [1] TRUE
#> [1] TRUE
Sorting a vector is relatively quick; sorting a vector of length \(10^7\) takes around 0.01 seconds. If you only sort a vector once at the top of a script, then don’t worry too much about this. However if you are sorting inside a loop, or in a shiny application, then it can be worthwhile thinking about how to optimise this operation.
There are currently three sorting algorithms, c("shell", "quick", "radix")
that can be specified in the sort()
function; withadix
being a new addition to R 3.3. Typically the radix
(the non-default option) is the most computationally efficient option for most situations (it is around 20% faster when sorting a large vector of doubles).
Another useful trick is to partially order the results. For example, if you only want to display the top ten results, then use theartial argument, i.e.
sort(x, partial = 1:10)`. For very large vectors, this can give a three fold speed increase.
A factor is just a vector of integers with associated levels. Occasionally we want to convert a factor into its numerical equivalent. The most efficient way of doing this (especially for long factors) is:
as.numeric(levels(f))[f]
the non-vectorised version, &&
, only executes the second component if needed. This is efficient and leads to neater code, e.g
# We only calculate the mean if data doesn't contain NAs
if(all(!is.na(x)) && mean(x) > 0) {
# Do something
}
There are optimised functions for calculating row and columns sums/means, rowSums()
, colSums()
, rowMeans()
and colMeans()
that should be used whenever possible. The package matrixStats contains many optimised row/col functions
To test whether a vector (or other object) contains missing values we use the is.na()
function. Often we are interested in whether a vector contains any missing values. In this case, anyNA(x)
is more efficient than any(is.na(x))
Integers are more space efficient. The code below compares the size of an integer vector to a standard numeric vector:
pryr::object_size(1:10000)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
## 40 kB
#> 40 kB
pryr::object_size(y = seq(1, 10000, by=1.0))
## 80 kB
#> 80 kB
is.integer(1L + 1)
## [1] FALSE
#> [1] FALSE
Another data structure that can be stored efficiently is a sparse matrix. This is simply a matrix where most of the elements are zero. Conversely, if most elements are non-zero, the matrix is considered dense. The proportion of non-zero elements is called the sparsity. Large sparse matrices often crop up when performing numerical calculations. Typically, our data isn’t sparse but the resulting data structures we create may be sparse. There are a number of techniques/methods used to store sparse matrices. Methods for creating sparse matrices can be found in the Matrix package
library("Matrix")
N = 10000
sp = sparseMatrix(1:N, 1:N, x = 1)
m = diag(1, N, N)
pryr::object_size(sp)
#> 161 kB
pryr::object_size(m)
#> 800 MB
A C++ function is similar to an R function: you pass a set of inputs to the function, some code is run, a single object is returned. However there are some key differences.
In the C++ function each line must be terminated with ; In R, we use ; only when we have multiple statements on the same line.
We must declare object types in the C++ version. In particular we need to declare the types of the function arguments, return value and any intermediate objects we create.
The function must have an explicit return statement. Similar to R, there can be multiple returns, but the function will terminate when it hits it’s first return statement.
You do not use assignment when creating a function.
Object assignment must use = sign. The <- operator isn’t valid.
One line comments can be created using //. Multi-line comments are created using /…/
Suppose we want to create a function that adds two numbers together. In R this would be a simple one line affair:
add_r = function(x, y) x + y
In C++ it is a bit more long winded:
/* Return type double
* Two arguments, also doubles
*/
double add_cpp(double x, double y) {
double value = x + y;
return value;
}
cppFunction('
double add_cpp(double x, double y) {
double value = x + y;
return value;
}
')
add_cpp
## function (x, y)
## .Call(<pointer: 0x117c0d620>, x, y)
add_cpp(1, 2)
## [1] 3
The most basic type of variable is an integer, int
. An int
variable can store a value in the range −32768 to +32767 . To store floating point numbers, there are single precision numbers, float
and double
precision numbers, double
. A double
takes twice as much memory as a float
(in general, we should always work with double precision numbers unless we have a compelling reason to switch to floats). For single characters, we use the char
data type
char
A single character.int
An integer.float
A single precision floating point number.double
A double-precision floating point number.void
A valueless quantitythe cppFunction()
is great for getting small examples up and running. But it is better practice to put your C++ code in a separate file (with file extension cpp) and use the function call sourceCpp("path/to/file.cpp")
to compile them. However we need to include a few headers at the top of the file. The first line we add gives us access to the Rcpp functions. The file Rcpp.h contains a list of function and class definitions supplied by Rcpp. This file will be located where Rcpp is installed.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double add_cpp(double x, double y) {
double value = x + y;
return value;
}
Use the package benchmarkme to assess your CPUs number crunching ability; is it worth upgrading your hardware?
If possible, add more RAM.
Double check that you have installed a 64 -bit version of R.
Cloud computing is a cost effective way of obtaining more compute power.
A solid state drive typically won’t have much impact on the speed of your R code, but will increase your overall productivity since I/0 is much faster.
A rough rule of thumb is that your RAM should be three times the size of your data set.
Having an SSD drive doesn’t make much difference to R. However, the reduction in boot time and general tasks makes an SSD drive a wonderful purchase.
There are general principles that most programmers agree on, such as: