The idea to make the set of the papers for R script enhancement and advanced performance have arisen due to:
Content
- vectorization
- fast data reading
- fast data writing
- script enhancement
fast reading data
The overview of fast reading options see here
fast writing data
The overview of fast reading options see here
R script enhancements
Vectorise and pre-allocate data structures
Examples are taken from here
# Create the data frame
col1 <- runif (12^5, 0, 2)
col2 <- rnorm (12^5, 0, 2)
col3 <- rpois (12^5, 3)
col4 <- rchisq (12^5, 2)
df <- data.frame (col1, col2, col3, col4)
The logic we are about to optimise:
For every row on this data frame (df), check if the sum of all values is greater than 4.
If it is, a new 5th variable gets the value “greater_than_4”, else, it gets “lesser_than_4”.
# Original R code: Before vectorization and pre-allocation
system.time({
for (i in 1:nrow(df)) { # for every row
if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) { # check if > 4
df[i, 5] <- "greater_than_4" # assign 5th column
} else {
df[i, 5] <- "lesser_than_4" # assign 5th column
}
}
})
user system elapsed
936.58 62.32 1233.08
df[,5] <- NULL
# after vectorization and pre-allocation
output <- character (nrow(df)) # initialize output vector for vectorization
# loop with preallocated vector
system.time({
for (i in 1:nrow(df)) {
if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) {
output[i] <- "greater_than_4" #looping through the vector but not a dataframe
} else {
output[i] <- "lesser_than_4" #looping through the vector but not a dataframe
}
}
# actual vectorization: in place of row-by-row transactions we manipulate the single vector only.
df$output <- output
}
)
user system elapsed
31.77 0.07 42.01
df[,5] <- NULL
Place IF statements (expressions) outside the loop.
Examples are taken from here
# initialize output vector for vectorization
output <- character (nrow(df))
# the expression in IF statement
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4 # condition check outside the loop
system.time({
for (i in 1:nrow(df)) {
if (condition[i]) {
output[i] <- "greater_than_4"
} else {
output[i] <- "lesser_than_4"
}
}
df$output <- output
})
user system elapsed
1.61 0.00 2.53
df[,5] <- NULL
The LOOP only for True conditions
Examples are taken from here
# initialize output vector for vectorization
output <- character (nrow(df))
# the expression in IF statement
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4 # condition check outside the loop
system.time({
for (i in (1:nrow(df))[condition]) { # run loop only for true conditions
if (condition[i]) {
output[i] <- "greater_than_4"
} else {
output[i] <- "lesser_than_4"
}
}
# actual vectorization: in place of row-by-row transactions we manipulate the single vector only.
df$output <- output
})
user system elapsed
0.39 0.00 0.41
df[,5] <- NULL
Use ifelse() whenever possible
Examples are taken from here
You can make this logic much simpler and faster by using the ifelse() statement.
Looks like this is going to be a highly preferred option to speed up simple loops.
system.time({
output <- ifelse ((df$col1 + df$col2 + df$col3 + df$col4) > 4, "greater_than_4", "lesser_than_4")
df$output <- output
})
user system elapsed
0.17 0.00 0.17
df[,5] <- NULL
Using which()
Examples are taken from here
system.time({
want = which(rowSums(df) > 4)
output = rep("less than 4", times = nrow(df))
output[want] = "greater than 4"
df$output <- output
})
user system elapsed
0.02 0.00 0.02
df[,5] <- NULL
Use apply family of functions instead of for-loops
Examples are taken from here
# apply family
system.time({
myfunc <- function(x) {
if ((x['col1'] + x['col2'] + x['col3'] + x['col4']) > 4) {
"greater_than_4"
} else {
"lesser_than_4"
}
}
output <- apply(df[, c(1:4)], 1, FUN=myfunc) # apply 'myfunc' on every row
df$output <- output
})
user system elapsed
3.01 0.02 3.85
df[,5] <- NULL
Use byte code compilation for functions cmpfun()
We will use initial dataframe as the input and the initial FOR loop (as in the first example).
f <- function(df){
for (i in 1:nrow(df)) { # for every row
if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) { # check if > 4
df[i, 5] <- "greater_than_4" # assign 5th column
} else {
df[i, 5] <- "lesser_than_4" # assign 5th column
}
}
}
library(compiler)
g <- cmpfun(f)
system.time(
g(df)
)
user system elapsed
558.29 254.10 872.24
df[,5] <- NULL
---
title: "Enhanced R script"
output: html_notebook
author: Demyd Dzyuban
date: 16/12/2016
---

The idea to make the set of the papers for R script enhancement and advanced performance have arisen due to:  


* [R in action](https://www.manning.com/books/r-in-action-second-edition).
It is strongly recommended for wide range of newcomers, computational geeks, etc. as a starting point to travel to amazing R infernal shores.  

* [R INFERNO](http://www.burns-stat.com/documents/books/the-r-inferno/).
This is an excellent way to start incredible trip through R infernal circles :-)

* [Advanced R](http://adv-r.had.co.nz/).
Nice book to read for those who survived previous stages.

* [Stackoverflow.com](http://stackoverflow.com/).
The endless fountain of the eternal wisdom of R dwellers :-)

* [www.r-bloggers.com](https://www.r-bloggers.com/)
Very nice place of smart, plain and short hints, comments and explanations.



### **Content**

* vectorization
* fast data reading 
* fast data writing
* script enhancement
  


#### **vectorization**

The brief example of [see here - 2.3.2 Example of vectorization](https://rpubs.com/demydd/210541)


#### **fast reading data**
The overview of fast reading options [see here ](https://rpubs.com/demydd/210541)



#### **fast writing data**
The overview of fast reading options [see here ](https://rpubs.com/demydd/210541)


### **R script enhancements**


#### *Vectorise and pre-allocate data structures*
Examples are taken from [here](https://www.r-bloggers.com/strategies-to-speedup-r-code/)


```{r}
# Create the data frame
col1 <- runif (12^5, 0, 2)
col2 <- rnorm (12^5, 0, 2)
col3 <- rpois (12^5, 3)
col4 <- rchisq (12^5, 2)
df <- data.frame (col1, col2, col3, col4)

```

*The logic we are about to optimise:*  
For every row on this data frame (df), check if the sum of all values is greater than 4. 

If it is, a new 5th variable gets the value "greater_than_4", else, it gets "lesser_than_4".

```{r}

# Original R code: Before vectorization and pre-allocation
system.time({
  for (i in 1:nrow(df)) { # for every row
    if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) { # check if > 4
      df[i, 5] <- "greater_than_4" # assign 5th column
    } else {
      df[i, 5] <- "lesser_than_4" # assign 5th column
    }
  }
})

df[,5] <- NULL

```

```{r}
# after vectorization and pre-allocation
output <- character (nrow(df)) # initialize output vector for vectorization

# loop with preallocated vector
system.time({
  for (i in 1:nrow(df)) {
    if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) {
      output[i] <- "greater_than_4" #looping through the vector but not a dataframe
    } else {
      output[i] <- "lesser_than_4" #looping through the vector but not a dataframe
    }
  }
  
# actual vectorization: in place of row-by-row transactions we manipulate the single vector only.  
df$output <- output 
}
)

df[,5] <- NULL
```

#### *Place IF statements (expressions) outside the loop.*

Examples are taken from [here](https://www.r-bloggers.com/strategies-to-speedup-r-code/)

```{r}
# initialize output vector for vectorization
output <- character (nrow(df))

# the expression in IF statement
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4  # condition check outside the loop
system.time({
  for (i in 1:nrow(df)) {
    if (condition[i]) {
      output[i] <- "greater_than_4"
    } else {
      output[i] <- "lesser_than_4"
    }
  }
  df$output <- output
})

df[,5] <- NULL
```


#### *The LOOP only for True conditions*

Examples are taken from [here](https://www.r-bloggers.com/strategies-to-speedup-r-code/)

```{r}
# initialize output vector for vectorization
output <- character (nrow(df))

# the expression in IF statement
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4  # condition check outside the loop

system.time({
  for (i in (1:nrow(df))[condition]) {  # run loop only for true conditions
    if (condition[i]) {
      output[i] <- "greater_than_4"
    } else {
      output[i] <- "lesser_than_4"
    }
  }
# actual vectorization: in place of row-by-row transactions we manipulate the single vector only.    
df$output <- output
})

df[,5] <- NULL
```

#### *Use ifelse() whenever possible*

Examples are taken from [here](https://www.r-bloggers.com/strategies-to-speedup-r-code/)

You can make this logic much simpler and faster by using the ifelse() statement. 

Looks like this is going to be a highly preferred option to speed up simple loops.

```{r}
system.time({
  output <- ifelse ((df$col1 + df$col2 + df$col3 + df$col4) > 4, "greater_than_4", "lesser_than_4")
  df$output <- output
})

df[,5] <- NULL
```

#### *Using which()*

Examples are taken from [here](https://www.r-bloggers.com/strategies-to-speedup-r-code/)

```{r}

system.time({
  want = which(rowSums(df) > 4)
  output = rep("less than 4", times = nrow(df))
  output[want] = "greater than 4"
  df$output <- output
}) 
df[,5] <- NULL
```

#### Use apply family of functions instead of for-loops

Examples are taken from [here](https://www.r-bloggers.com/strategies-to-speedup-r-code/)

```{r}
# apply family
system.time({
  myfunc <- function(x) {
    if ((x['col1'] + x['col2'] + x['col3'] + x['col4']) > 4) {
      "greater_than_4"
    } else {
      "lesser_than_4"
    }
  }
  output <- apply(df[, c(1:4)], 1, FUN=myfunc)  # apply 'myfunc' on every row
  df$output <- output
})
df[,5] <- NULL
```

#### Use byte code compilation for functions cmpfun()

We will use initial dataframe as the input and the initial FOR loop (as in the first example).

```{r}

f <- function(df){
   for (i in 1:nrow(df)) { # for every row
    if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) { # check if > 4
      df[i, 5] <- "greater_than_4" # assign 5th column
    } else {
      df[i, 5] <- "lesser_than_4" # assign 5th column
    }
  }
}

library(compiler)

g <- cmpfun(f)

system.time(
  g(df)
)
df[,5] <- NULL
```






