R has an unfortunate reputation for being slow. In a competition of raw speed with a language like C, R would come second, but that is an unfair comparison to make. Total programming time is made up of three components: thinking, coding and running. In many statistical analyses you want to try and test multiple algorithms, so you want the thinking and coding parts optimised, and this is R's strength.
By the time you have loaded data in R, created a scatterplot and fitted a regression line, our friendly C programmer would have just launched their code editor and would be checking on Stack Overflow how to read a CSV file. We use R because it is good for statistics. However, with the advent of big data and complex statistical algorithms, we may need to optimise our code.
Note:
Main R releases happen in April with smaller incremental releases throughout the year.
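The output below comes from printing R's built-in version object:
version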
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 3
## minor 5.3
## year 2019
## month 03
## day 11
## svn rev 76217
## language R
## version.string R version 3.5.3 (2019-03-11)
## nickname Great Truth
# Assign the variable major to the major component
major <- version$major
# Assign the variable minor to the minor component
minor <- version$minor
Every R programmer has at one point or another complained that their R code is slow. But what do you mean by slow: 1 second slow, 1 minute slow or 1 hour slow? This is obviously problem dependent. What you need is code that is fast enough. To determine whether it is worth changing your code, you need to compare your existing solution with one or more alternatives. This is benchmarking: you simply run and time all the solutions you have, with all other things kept the same, and then select the fastest.
Benchmarking is a two-step process:
1. Construct a function around the feature you wish to benchmark (typically with a parameter n that alters the data size).
2. Time the function, for example with system.time().
Suppose you want to generate a sequence of integers. From system.time() we get three numbers:
user time: CPU time charged for the execution of user instructions.
system time: CPU time charged for execution by the system on behalf of the calling process.
elapsed time: the wall-clock time taken, approximately the sum of user and system; this is the number we typically care about.
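The three timings that follow come from generating the same sequence of integers in three different ways. The helper functions are not shown in the original notes, so the bodies below are assumptions chosen to match the names used later (colon(), seq_default(), seq_by()):
# Three assumed ways of generating the sequence 1..n
colon <- function(n) 1:n
seq_default <- function(n) seq(1, n)
seq_by <- function(n) seq(1, n, by = 1)
n <- 1e8
system.time(colon(n))
system.time(seq_default(n))
system.time(seq_by(n))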
## user system elapsed
## 0 0 0
## user system elapsed
## 0 0 0
## user system elapsed
## 1.08 0.28 1.36
Sometimes we also need the result of the timed code. In this case we use the arrow (<-) operator. Using the arrow operator inside a function call performs two tasks: it passes the value as an argument and it creates a new (or overwrites an existing) object. This makes <- useful inside system.time(), since we can time the operation and store its result at the same time. The equals operator (=), in contrast, performs only one of these things: argument passing or object assignment.
Along with elapsed time, it is also worth calculating relative time, which is simply a ratio of two timings. In the example below, relative time lets us say that the seq_by() function is 26 times slower than the colon (:) function.
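A minimal sketch of that relative-time calculation, reusing the helper functions assumed above (if the faster call is too quick for system.time() to register, the microbenchmark package described next gives finer resolution):
t_seq_by <- system.time(seq_by(n))["elapsed"]
t_colon <- system.time(colon(n))["elapsed"]
t_seq_by / t_colon # relative time: how many times slower seq_by() is than colon()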
As with most things in R, there is a package that simplifies benchmarking. The microbenchmark package serves a similar purpose to system.time() and makes it straightforward to compare multiple functions. The key function in the package is microbenchmark(). The times argument tells it how many times each function should be run. In the result, the cld column gives a statistical ranking of the functions.
library("microbenchmark")
n <- 1e8
microbenchmark(colon(n), seq_default(n), seq_by(n), times = 10) # Run each function 10 times
## Unit: nanoseconds
##            expr        min         lq       mean     median         uq        max neval
##        colon(n)        400        700     113050       3250       7400    1091700    10
##  seq_default(n)       6200       8800     210010      36350      69400    1802900    10
##       seq_by(n) 1199049600 1281619700 1348972310 1315636650 1388946600 1576603100    10
One of the most common tasks we perform is reading in data from CSV files. However, for large CSV files this can be slow. One neat trick is to read in the data and save as an R binary file (rds) using saveRDS(). To read in the rds file, we use readRDS().
Note: Since rds is R’s native format for storing single objects, you have not introduced any third-party dependencies that may change in the future.
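The timing below presumably comes from reading the rds version of the movies data; a one-off conversion followed by readRDS(), along these lines (data source and file names are assumptions, matching the ggplot2movies data used later in these notes), would produce it:
# One-off conversion: load the movies data and save it in R's binary format
data(movies, package = "ggplot2movies")
saveRDS(movies, "movies.rds")
# How long does it take to read movies from rds?
system.time(movies_rds <- readRDS("movies.rds"))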
## user system elapsed
## 0.05 0.00 0.05
# Write the data to csv
write.csv(movies_rds, "movies.csv")
# How long does it take to read movies from CSV?
system.time(movies_csv <- read.csv("movies.csv"))
## user system elapsed
## 0.40 0.03 0.43
The microbenchmark() function makes it easier to compare multiple function calls at once by compiling all the relevant information in a data frame. It does this by running each function call multiple times, recording the time it took for the function to run each time, then computing summary statistics for each expression as you can see here.
# Load the microbenchmark package
library("microbenchmark")
# Compare the two functions
compare <- microbenchmark(read.csv("movies.csv"),
readRDS("movies.rds"),
times = 10)
# Print compare
compare
## Unit: milliseconds
##                    expr      min       lq      mean    median       uq      max neval
##  read.csv("movies.csv") 403.9977 404.9286 448.29266 420.32195 458.0394 582.2420    10
##   readRDS("movies.rds")  52.5306  53.9747  55.98524  55.74855  57.7381  61.9119    10
The benchmarkme package lets you see where your machine stands and helps you decide whether you need a machine upgrade. We basically run the same piece of code on our machine and compare the results with other users.
The main function in this package is benchmark_std(). These benchmarks are standard R operations such as loops and matrix operations. On a typical machine this code takes around 4 minutes to run. Once the benchmark is complete, you can upload your result for other people to compare against.
# Load the benchmarkme package
library("benchmarkme")
# Assign the variable `ram` to the amount of RAM on this machine
ram <- get_ram()
ram
# Assign the variable `cpu` to the cpu specs
cpu <- get_cpu()
cpu
# Run each benchmark 3 times
res <- benchmark_std(runs = 3)
plot(res)
The benchmarkme package also allows you to run a set of standardized benchmarks and compare your results to other users. One set of benchmarks tests reading and writing speeds. The function benchmark_io() records the length of time it takes to read and write a 5 MB file (controlled by the size argument).
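A sketch of what that I/O benchmark looks like; runs and size are benchmark_io() arguments (size in MB), and the values used here are illustrative:
# Time reading and writing a 5 MB file, once
res_io <- benchmark_io(runs = 1, size = 5)
plot(res_io)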
The first rule of R club - Never ever grow a vector. Preallocate instead.
If we had programmed in C, we would have been aware of memory allocation, since that language holds us accountable for reclaiming memory. In R, memory allocation happens automatically: when you assign a variable, R has to allocate memory in RAM to store it. This takes time, so an important way to make your code run faster is to minimise the amount of memory allocation R has to perform.
In the example below you can see three different methods of generating a sequence of integers.
## Method 1 - The obvious and best way
x <- 1:n
## Method 2 - Not so bad
## Start with a vector of length n and then loop to change the vector entries. Length of x doesn't change in
## the loop
x <- vector("numeric", n) # length n
for(i in 1:n)
x[i] <- i
## Method 3 - Don't ever do this!
## The object starts empty and we gradually fill the vector with integers
x <- NULL # Length zero
for(i in 1:n)
x <- c(x, i)
The reason method 3 above is the slowest is that requesting memory is a relatively slow operation, and doing it at every iteration of the loop introduces a terrible bottleneck. So remember: never ever grow a vector. In the first two methods x is pre-allocated, i.e. the length of x does not change inside the loop.
The growing() function defined below generates n random standard normal numbers, but grows the size of the vector each time an element is added!
Pre-allocating the vector is significantly faster than growing the vector!
# Method 1 - Growing
growing <- function(n) {
x = NULL
for(i in 1:n)
x = c(x, rnorm(1))
x
}
# Use `<-` with system.time() to store the result as res_grow
system.time(res_grow <- growing(30000))
## user system elapsed
## 0.79 0.18 0.99
# Method 2 - Preallocating
pre_allocate <- function(n) {
x = numeric(n) # Create a vector of length n
for(i in 1:n)
x[i] = rnorm(1)
x
}
# Use `<-` with system.time() to store the result as res_allocate
n <- 30000
system.time(res_allocate <- pre_allocate(n))
## user system elapsed
## 0.04 0.00 0.04
The second rule of R club - Use a vectorized solution wherever possible. For loops in R are just slow!
When you call an R function, you eventually call underlying C/FORTRAN code, and this underlying code is heavily optimised. A general rule for making your code run faster is therefore to reach the optimised C/FORTRAN code as quickly as possible: the fewer function calls, the better. In practice this means vectorized code. Many R functions are vectorized, for example:
rnorm() accepts a single number but returns a vector.
mean() accepts a vector but returns a single number.
In the example below we generate a million random numbers from the standard normal distribution. Even though the vector is pre-allocated, the vectorized call to rnorm() is still roughly 30 times faster than filling the vector in a loop.
library(microbenchmark)
n <- 1e6
x <- vector("numeric", n)
microbenchmark(
x <- rnorm(n),
{
for(i in seq_along(x)) x[i] <- rnorm(1)
},
times = 10 )
## Unit: milliseconds
##                                          expr       min        lq       mean     median        uq       max neval
##                                 x <- rnorm(n)   51.7433   51.9396   55.09353   51.95465   52.0414   83.1987    10
##  { for (i in seq_along(x)) x[i] <- rnorm(1) } 1538.0340 1567.0962 1590.35734 1580.25295 1598.5305 1682.2149    10
The solution can be divided into three parts:
1. Allocation: similar in both methods, x <- vector("numeric", n)
2. Generation: in the for loop we have one million calls to the rnorm() function, whereas the vectorized version makes just one call.
3. Assignment: one million calls to the assignment operator in the for loop, whereas there is just one assignment in the vectorized method.
So, in conclusion, it is not that the for loop in R is slow; it is that the loop makes roughly two million more function calls than the vectorized version.
x <- rnorm(10)
x2 <- numeric(length(x))
microbenchmark(
x2_imp <- x * x,
for(i in 1:10)
x2[i] <- x[i] * x[i]
,times = 10 )
## Unit: nanoseconds
##                                  expr     min      lq    mean  median      uq     max neval
##                       x2_imp <- x * x     300     400    3230    2100    4400   12500    10
##  for (i in 1:10) x2[i] <- x[i] * x[i] 1689400 1943800 2211760 2153000 2517800 2996700    10
# Initial code
n <- 100
total <- 0
x <- runif(n)
microbenchmark(
for(i in 1:n)
total <- total + log(x[i]),
# Rewrite in a single line. Store the result in log_sum
log_sum <- sum(log(x)),
times = 10 )
## Unit: microseconds
##                                       expr    min     lq    mean  median     uq    max neval
##  for (i in 1:n) total <- total + log(x[i]) 1523.8 1660.7 2062.93 2042.50 2258.3 3010.6    10
##                     log_sum <- sum(log(x))    4.2    5.1    9.46    5.75    8.6   37.8    10
The third rule of R club - Use a matrix whenever appropriate
The key data structure in R is the data frame. These objects are so useful that they have been copied into other languages (e.g. Python's pandas DataFrame). A data frame is a tabular structure for representing data: columns are variables of interest and rows are observations. Each column in a data frame must be of a single type, but the values in a row can be of different types.
A data frame is actually a list of vectors, where each column is a single vector. This has implications for storage: since each column is stored at a separate location, to retrieve a column you just find it and pull out the entire object. However, to retrieve an entire row you need to find the starting location of every single column and then select what you need, which is time consuming.
A matrix has a similar rectangular structure and supports the usual subsetting and extraction operations. The crucial difference is that a matrix can have only one data type (no mixing and matching). This makes storage easier, as the entire matrix is stored as one contiguous block. Selecting columns is easy: find the start of the column and retrieve the data. Selecting rows is also straightforward: find the first value and then increment along the matrix.
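A quick illustrative comparison of row and column access for the two structures (the object names and sizes here are made up for the sketch):
library("microbenchmark")
df <- data.frame(matrix(runif(1e5), ncol = 10)) # 10,000 rows x 10 columns
mat <- as.matrix(df)
microbenchmark(
df[50, ], # row from a data frame: every column must be visited
mat[50, ], # row from a matrix: one contiguous block
df[, 5], # column access is cheap in both structures
mat[, 5]
)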
Benchmarking is for timing multiple functions and selecting the best one. However, it doesn't help us find bottlenecks, and if we don't know where the bottleneck is, how do we know which functions to compare? To find bottlenecks you need code profiling: we run the code and take snapshots of the call stack at regular intervals, which gives us data on how long each function took to run.
R comes with a built-in tool for code profiling called Rprof(). Unfortunately, Rprof() isn't that user friendly. An alternative is the profvis package, which is also integrated into RStudio, making it easy to use.
Using RStudio: select the lines of code and choose Profile Selected Line(s) from the Profile menu.
profvis(): records the call stack at regular intervals and generates an interactive page that describes the amount of time spent on each line of code. The two measurements returned by profvis are the amount of memory used and the time spent in each function. Don't worry about the units; focus on the relative contribution of each component.
library(profvis)
profvis({
data(movies, package ="ggplot2movies")
braveheart <- movies[7288,]
movies <- movies[movies$Action==1,]
plot(movies$year, movies$rating, xlab = "Year", ylab = "Rating")
# local regression line
model <- loess(rating ~ year, data = movies)
j <- order(movies$year)
lines(movies$year[j], model$fitted[j], col = "forestgreen")
points(braveheart$year, braveheart$rating, pch = 21, bg = "steelblue")
})
# Load the microbenchmark package
library("microbenchmark")
# The previous data frame solution is defined
# d() simulates 6 dice rolls
d <- function() {
data.frame(
d1 = sample(1:6, 3, replace = TRUE),
d2 = sample(1:6, 3, replace = TRUE)
)
}
# Complete the matrix solution
m <- function() {
matrix(sample(1:6, 6, replace = TRUE), ncol = 2)
}
# Use microbenchmark to time m() and d()
microbenchmark(
data.frame_solution = d(),
matrix_solution = m()
)
## Unit: microseconds
## expr min lq mean median uq max neval
## data.frame_solution 130.7 134.05 171.728 137.15 147.6 2676.7 100
## matrix_solution 5.0 5.80 27.270 7.35 8.1 1995.1 100
# Define the previous solution
app <- function(x) {
apply(x, 1, sum)
}
# Define the new solution
r_sum <- function(x) {
rowSums(x)
}
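# `rolls` is not defined in the original notes; it is assumed here to be a
# matrix of simulated dice rolls, e.g. reusing the matrix solution above:
rolls <- m()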
# Compare the methods
microbenchmark(
app_sol = app(rolls),
r_sum_sol = r_sum(rolls)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## app_sol 15.9 16.8 29.882 17.2 18.2 1097.8 100
## r_sum_sol 3.2 3.4 16.168 3.6 3.8 1115.6 100
The & operator will always evaluate both its arguments. That is, if you type x & y, R will always try to work out what x and y are. There are some cases where this is inefficient. For example, if x is FALSE, then x & y will always be FALSE, regardless of the value of y. Thus, you can save a little processing time by not calculating it. The && operator takes advantage of this trick, and doesn’t bother to calculate y if it doesn’t make a difference to the overall result.
One thing to note is that && only works on single logical values, i.e., logical vectors of length 1 (like you would pass into an if condition), but & also works on vectors of length greater than 1.
# Example data
is_double <- c(FALSE,TRUE,TRUE)
# Define the previous solution
move <- function(is_double) {
if (is_double[1] & is_double[2] & is_double[3]) {
current <- 11 # Go To Jail
}
}
# Define the improved solution
improved_move <- function(is_double) {
if (is_double[1] && is_double[2] && is_double[3]) {
current <- 11 # Go To Jail
}
}
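# Note: as written below, microbenchmark() times only the bare names `move` and
# `improved_move` (the functions are never called), which is why the timings are
# essentially zero. To time the actual evaluation, benchmark move(is_double) and
# improved_move(is_double) instead.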
## microbenchmark both solutions
microbenchmark(move, improved_move, times = 1e5)
## Warning in microbenchmark(move, improved_move, times = 1e+05): Could not measure
## a positive execution time for 30268 evaluations.
## Unit: nanoseconds
## expr min lq mean median uq max neval
## move 0 0 2.028 0 0 20700 1e+05
## improved_move 0 0 1.952 0 0 20100 1e+05
The faster the CPU, the faster your code will run. For the last decade, CPU clock speeds have plateaued: when a CPU gets faster it also gets hotter, and the problem is that it becomes difficult to keep it cool. So computer manufacturers moved from single-core to multi-core machines. Unfortunately, R is single threaded, which means that by default it will use only one core. But by using the parallel package you can make sections of your code use all available cores.
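The hardware details printed below presumably come from the parallel and benchmarkme packages, e.g.:
library("parallel")
detectCores() # number of (logical) cores available
library("benchmarkme")
get_cpu() # vendor, model name and number of cores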
## [1] 8
## $vendor_id
## [1] "GenuineIntel"
##
## $model_name
## [1] "Intel(R) Core(TM) i5-1035G4 CPU @ 1.10GHz"
##
## $no_of_cores
## [1] 8
The rule of thumb for checking whether a loop can leverage multi-core computing is: can the loop be run backwards (in reverse order) without affecting the results?
The vast majority of statistical methods haven’t been designed with parallel computing in mind. This is starting to change with the advent of big data but many standard methods don’t fit in the parallel paradigm.
Embarrassingly parallel - the easiest type of parallel computing, since little effort is needed to divide the problem into a set of parallel tasks. An example is running eight Monte Carlo simulations in parallel and then combining the results at the end.
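A minimal sketch of such an embarrassingly parallel set-up; the simulation itself is a stand-in toy function, not taken from the original notes:
library("parallel")
monte_carlo <- function(i) mean(rnorm(1e6)) # one independent simulation (toy example)
cl <- makeCluster(4)
sims <- parSapply(cl, 1:8, monte_carlo) # eight simulations run in parallel
stopCluster(cl)
mean(sims) # combine the results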
In the example below, by contrast, we cannot use parallel computing because the order of evaluation matters. The order of evaluation in parallel computing can't be predicted, so we would get the wrong answer, since x[3] may get evaluated before x[2]. In sequential algorithms the ith value depends on the previous value.
total <- no_of_rolls <- 0 # Initialise
while(total < 10) {
total <- total + sample(1:6, 1)
if(total %% 2 == 0) total <- 0 # If even. Reset to 0
no_of_rolls <- no_of_rolls + 1
}
no_of_rolls
## [1] 7
play <- function() {
total <- no_of_rolls <- 0
while(total < 10) {
total <- total + sample(1:6, 1)
# If even. Reset to 0
if(total %% 2 == 0) total <- 0
no_of_rolls <- no_of_rolls + 1
}
no_of_rolls
}
# and construct a loop to play the game:
results <- numeric(100)
for(i in seq_along(results))
results[i] <- play()
The parallel package comes with R and enables us to write code that works across operating systems. It provides parallel versions of standard functions. A simple function to run in parallel is apply() (essentially a glorified for loop), which applies a function to each row or column of a matrix. In the example below we use parApply() instead.
# Using traditional apply
m <- matrix(rnorm(100000), ncol = 10)
res <- apply(m, 1, median) # middle argument 1 means rows (2 would mean columns)
# Using parApply from parallel package
library("parallel")
# Number of cores to be used
copies_of_r <- 4
cl <- makeCluster(copies_of_r)
# Run the loop
parApply(cl, m, 1, median)
# Stop the cluster/ free up the resources
stopCluster(cl)
Unfortunately, there is additional overhead when running in parallel: using multiple cores requires communication between CPUs. If the job itself is very fast, this communication can swamp the potential benefit. Therefore you need to benchmark both the single-core and the parallel option.
sapply() is just another way of writing a for loop: it applies a function to each value of a vector. There is little speed difference between sapply() and a standard loop. To run this code in parallel you simply replace sapply() with parSapply().
An example where parallel computing can be applied is bootstrapping. The idea behind bootstrapping is that in the real world you cannot repeatedly sample from the population; instead you resample from the original sample. You need to follow two steps:
1. Sample from the original data set with replacement (this means data points from the original dataset can appear multiple times in the new sample). If the original dataset has 100 rows, the new sample will also have 100 rows.
2. Calculate the correlation coefficient of the new sample.
Repeat steps 1 & 2 a large number of times. The distribution of correlation values gives us a measure of uncertainty (a confidence interval) for the statistic.
Also note that you will only notice a speed boost from parallel computing when the number of bootstrap replicates is large (> 1000).
bootstrap <- function(data_set) {
# Sample with replacement
s <- sample(1:nrow(data_set), replace = TRUE)
new_data <- data_set[s,]
# Calculate the correlation
cor(new_data$Attack, new_data$Defense)
}
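# `pokemon` is assumed to be a data frame with numeric Attack and Defense columns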
# 100 independent bootstrap simulations
sapply(1:100, function(i) bootstrap(pokemon))
# Do the similar steps using parallel computing
library("parallel")
no_of_cores <- 4
cl <- makeCluster(no_of_cores)
clusterExport(cl, c("bootstrap", "pokemon")) # Export function and dataset
parSapply(cl, 1:100, function(i) bootstrap(pokemon))
stopCluster(cl)
play <- function() {
total <- no_of_rolls <- 0
while(total < 10) {
total <- total + sample(1:6, 1)
# If even. Reset to 0
if(total %% 2 == 0) total <- 0
no_of_rolls <- no_of_rolls + 1
}
no_of_rolls
}
# Using loop or sapply
res <- sapply(1:100, function(i) play())
# Create a cluster via makeCluster (2 cores)
cl <- makeCluster(2)
# Export the play() function to the cluster
clusterExport(cl, "play")
# Parallelize this code
res <- parSapply(cl,1:100, function(i) play())
# Stop the cluster
stopCluster(cl)
# Set the number of games to play
no_of_games <- 1e5
## Time serial version
system.time(serial <- sapply(1:no_of_games, function(i) play()))
## user system elapsed
## 7.79 0.00 7.81
## Set up cluster
cl <- makeCluster(4)
clusterExport(cl, "play")
## Time parallel version
system.time(par <- parSapply(cl, 1:no_of_games, function(i) play()))
## user system elapsed
## 0.05 0.00 3.03
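## Remember to stop the cluster and free the resources once you are done
stopCluster(cl)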