Speed of Data frame vs. Matrix

Started from an E-mail

William Dunlap | R-help mailing list | 17 Mar 2014 | Subject: Re: data frame vs. matrix https://stat.ethz.ch/pipermail/r-help/2014-March/372323.html

MM, 2016: From the timings below, note how much faster R is two years later!

MM, 2018: Since R version 3.4.0, R uses just-in-time (JIT) byte compilation by default. The code below first shows the current level and then turns JIT off, so that the comparison can be made as in earlier versions of R:

compiler::enableJIT(-1) # shows current auto-compilation "level".

[1] 3

## Turn it off:
oL <- compiler::enableJIT(0) # (-> level 0: it is turned off)
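Because enableJIT() returns the level that was active before the call, the saved value can be handed back later to restore the session. A minimal sketch of that save-and-restore pattern; the names old_level and current_level are illustrative and not part of the original document:

```r
# enableJIT() returns the level that was active before the call
old_level <- compiler::enableJIT(0)        # switch JIT off, remember the old level

# ... run the timing code here ...

compiler::enableJIT(old_level)             # restore the previous level
current_level <- compiler::enableJIT(-1)   # a negative value only queries the level
```

This keeps the benchmark from silently changing the JIT setting for the rest of the session.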

Duncan Murdoch’s analysis suggests another way to do this: extract the x vector, operate on that vector in a loop, then insert the result into the data.frame. I added a df="quicker" option to your df argument and made the test dataset deterministic so we could verify that the algorithms do the same thing:

dumkoll <- function(n = 1000, df = TRUE){
        dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
        if (identical(df, "quicker")) {
                x <- dfr$x
                for(i in 2:length(x)) {
                    x[i] <- x[i-1]
                }
                dfr$x <- x
        } else if (df){
                for (i in 2:NROW(dfr)){
                        # if (!(i %% 100)) cat("i = ", i, "\n")
                        dfr$x[i] <- dfr$x[i-1]
                }
        }else{
                dm <- as.matrix(dfr)
                for (i in 2:NROW(dm)){
                        # if (!(i %% 100)) cat("i = ", i, "\n")
                        dm[i, 1] <- dm[i-1, 1]
                }
                dfr$x <- dm[, 1]
        }
        dfr
}

(Bill Dunlap:) Timings for $n = 10^4$, $2 \times 10^4$, and $4 \times 10^4$ show that the time is quadratic in n for the df=TRUE case and close to linear in the other cases, with the new method taking about 60% of the time of the matrix method:

n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])

            10k   20k   40k
user.self 0.018 0.036 0.072
sys.self  0.000 0.000 0.001
elapsed   0.018 0.036 0.073

## BD:           10k  20k  40k
##    user.self 0.11 0.22 0.43
##    sys.self  0.02 0.00 0.00
##    elapsed   0.12 0.22 0.44

sapply(n, function(n)system.time(dumkoll(n, df=TRUE))[1:3])

            10k   20k   40k
user.self 0.352 0.818 2.519
sys.self  0.000 0.000 0.000
elapsed   0.354 0.824 2.540

## BD:           10k   20k   40k
##    user.self 3.59 14.74 78.37
##    sys.self  0.00  0.11  0.16
##    elapsed   3.59 14.91 78.81

sapply(n, function(n)system.time(dumkoll(n, df="quicker"))[1:3])

            10k   20k   40k
user.self 0.009 0.017 0.036
sys.self  0.000 0.000 0.000
elapsed   0.009 0.017 0.035

# BD:           10k  20k  40k
#    user.self 0.06 0.12 0.26
#    sys.self  0.00 0.00 0.00
#    elapsed   0.07 0.13 0.27
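The scaling claim can be checked directly from the elapsed times reported above. The sketch below fits the slope of log(time) against log(n): a slope near 1 indicates linear growth, a slope near 2 quadratic growth. The numbers are copied from the tables above; the object names (elapsed, scaling) are illustrative only.

```r
# Elapsed times in seconds, copied from the tables above
n <- c(1e4, 2e4, 4e4)
elapsed <- rbind(
  matrix  = c(0.018, 0.036, 0.073),   # df = FALSE,     2018 run
  quicker = c(0.009, 0.017, 0.035),   # df = "quicker", 2018 run
  df_2018 = c(0.354, 0.824, 2.540),   # df = TRUE,      2018 run
  df_2014 = c(3.59, 14.91, 78.81)     # df = TRUE,      BD's 2014 timings
)

# Slope of log(time) versus log(n): about 1 for linear, about 2 for quadratic
scaling <- apply(elapsed, 1, function(tm) unname(coef(lm(log(tm) ~ log(n)))[2]))
round(scaling, 2)
```

With only three sizes per case this is a rough check, not a careful complexity measurement. On these numbers the matrix and vector versions come out close to 1, BD's 2014 data-frame timings slightly above 2, and the 2018 data-frame timings about 1.4.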

I also timed the two faster cases for n = 10^6; the time still looks linear in n, with the vector approach still taking about 60% of the time of the matrix approach.

system.time(dumkoll(n=10^6, df=FALSE))

   user  system elapsed 
  1.151   0.132   1.294

# BD:   user  system elapsed
#      11.65    0.12   11.82

system.time(dumkoll(n=10^6, df="quicker"))

   user  system elapsed 
  0.102   0.004   0.107

# BD:   user  system elapsed
#       6.79    0.08    6.91

The results from each method are identical:

identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))

[1] TRUE

identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))

[1] TRUE

If your data.frame has columns of various types, then as.matrix will coerce them all to a common type (often character), so it may give you the wrong result in addition to being unnecessarily slow.
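A small illustration of that coercion, using a made-up two-column data frame (the names mixed, m and x_back are hypothetical):

```r
# One numeric and one character column
mixed <- data.frame(x = c(1.5, 2.5, 3.5), label = c("a", "b", "c"))

m <- as.matrix(mixed)
typeof(m)   # the whole matrix, including column x, is now character

# Numeric work on column x needs an explicit conversion back
x_back <- as.numeric(m[, "x"])
```

Any arithmetic applied to m[, "x"] directly would fail, because the values are strings at that point.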

Bill Dunlap TIBCO Software wdunlap tibco.co

Rprof("dumkoll.Rprof", interval = 0.005) # start profiling
dd <- dumkoll(50000, df=TRUE)
Rprof(NULL) # stop profiling
## ?Rprof
sr <- summaryRprof("dumkoll.Rprof")
sr

$by.self
                  self.time self.pct total.time total.pct
"dumkoll"             2.970    76.06      3.905    100.00
"$<-.data.frame"      0.170     4.35      0.235      6.02
"[[.data.frame"       0.155     3.97      0.395     10.12
"$"                   0.100     2.56      0.615     15.75
"[["                  0.075     1.92      0.470     12.04
"$<-"                 0.070     1.79      0.305      7.81
"%in%"                0.070     1.79      0.145      3.71
"<Anonymous>"         0.050     1.28      0.065      1.66
"sys.call"            0.050     1.28      0.050      1.28
"$.data.frame"        0.045     1.15      0.515     13.19
"NROW"                0.030     0.77      0.035      0.90
"all"                 0.025     0.64      0.025      0.64
"names"               0.025     0.64      0.025      0.64
"-"                   0.015     0.38      0.015      0.38
".row_names_info"     0.015     0.38      0.015      0.38
".subset2"            0.010     0.26      0.010      0.26
"is.atomic"           0.010     0.26      0.010      0.26
"dim"                 0.005     0.13      0.005      0.13
"is.matrix"           0.005     0.13      0.005      0.13
"nargs"               0.005     0.13      0.005      0.13
"oldClass"            0.005     0.13      0.005      0.13

$by.total
                      total.time total.pct self.time self.pct
"dumkoll"                  3.905    100.00     2.970    76.06
"block_exec"               3.905    100.00     0.000     0.00
"call_block"               3.905    100.00     0.000     0.00
"eval"                     3.905    100.00     0.000     0.00
"evaluate_call"            3.905    100.00     0.000     0.00
"evaluate::evaluate"       3.905    100.00     0.000     0.00
"evaluate"                 3.905    100.00     0.000     0.00
"handle"                   3.905    100.00     0.000     0.00
"in_dir"                   3.905    100.00     0.000     0.00
"knitr::knit"              3.905    100.00     0.000     0.00
"process_file"             3.905    100.00     0.000     0.00
"process_group.block"      3.905    100.00     0.000     0.00
"process_group"            3.905    100.00     0.000     0.00
"rmarkdown::render"        3.905    100.00     0.000     0.00
"timing_fn"                3.905    100.00     0.000     0.00
"withCallingHandlers"      3.905    100.00     0.000     0.00
"withVisible"              3.905    100.00     0.000     0.00
"$"                        0.615     15.75     0.100     2.56
"$.data.frame"             0.515     13.19     0.045     1.15
"[["                       0.470     12.04     0.075     1.92
"[[.data.frame"            0.395     10.12     0.155     3.97
"$<-"                      0.305      7.81     0.070     1.79
"$<-.data.frame"           0.235      6.02     0.170     4.35
"%in%"                     0.145      3.71     0.070     1.79
"<Anonymous>"              0.065      1.66     0.050     1.28
"sys.call"                 0.050      1.28     0.050     1.28
"NROW"                     0.035      0.90     0.030     0.77
"all"                      0.025      0.64     0.025     0.64
"names"                    0.025      0.64     0.025     0.64
"-"                        0.015      0.38     0.015     0.38
".row_names_info"          0.015      0.38     0.015     0.38
".subset2"                 0.010      0.26     0.010     0.26
"is.atomic"                0.010      0.26     0.010     0.26
"dim"                      0.005      0.13     0.005     0.13
"is.matrix"                0.005      0.13     0.005     0.13
"nargs"                    0.005      0.13     0.005     0.13
"oldClass"                 0.005      0.13     0.005     0.13

$sample.interval
[1] 0.005

$sampling.time
[1] 3.905

So the culprit is indeed $<-, and almost exclusively its data.frame method.
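To see why the data.frame method is called on every iteration, it helps to spell out what a nested replacement such as dfr$x[i] <- value means in R. The sketch below compares the nested form with a hand-written expansion on a tiny, made-up data frame; it illustrates the language semantics and is not a reproduction of R's internal code path.

```r
dfr <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# The nested replacement, as written in the loop body of dumkoll()
a <- dfr
a$x[2] <- a$x[1]

# Roughly what R evaluates for it: modify the extracted column with `[<-`,
# then hand the whole column back to the data frame with `$<-`
b <- dfr
b <- `$<-`(b, "x", `[<-`(b$x, 2, b$x[1]))

identical(a, b)
```

Each pass through the loop therefore extracts the column with $ and reassigns it with $<-, both of which dispatch to data.frame methods, whereas the "quicker" variant pays that cost only once.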

A “free” way to increase performance of R functions: R’s byte compiler:

require(compiler)

Loading required package: compiler

help(package = "compiler")# fails to give anything (RStudio bug!)
library(help = "compiler")# the old-fashioned way works fine

These are not evaluated (when the *.Rmd is knit into Markdown):

?cmpfun # interesting, notably
example(cmpfun) # shows indeed speedups of almost 50% in one case (on MM's notebook)

So, we now can compile our function and see how much that helps:

dumkoll2 <- cmpfun(dumkoll)
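For readers new to cmpfun(), here is a minimal, self-contained sketch on a toy function (loop_sum is a hypothetical example, not part of the original analysis). cmpfun() returns a byte-compiled copy and leaves the original function untouched:

```r
library(compiler)

# Hypothetical toy function: sum 1..n with an explicit loop
loop_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

loop_sum_c <- cmpfun(loop_sum)   # byte-compiled copy of loop_sum

# Both versions compute the same value
identical(loop_sum(1000), loop_sum_c(1000))
```

Note that with JIT enabled (the default since R 3.4.0) R compiles such functions automatically after a few calls, which is why JIT was switched off at the top of this document.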

require(microbenchmark)

Loading required package: microbenchmark

Let’s use a fairly small n:

n <- 2000
mbd <- microbenchmark(dumkoll(n),               dumkoll2(n),
                      dumkoll(n, df=FALSE),     dumkoll2(n, df=FALSE),
                      dumkoll(n, df="quicker"), dumkoll2(n, df="quicker"), times = 25)
mbd

Unit: microseconds
                        expr       min        lq       mean    median        uq       max neval   cld
                  dumkoll(n) 41018.298 41773.930 43140.9819 42569.952 44040.776 47499.921    25     e
                 dumkoll2(n) 37042.018 37855.456 39306.9822 38843.909 39841.634 49404.805    25    d 
      dumkoll(n, df = FALSE)  3556.881  3673.366  3830.3426  3751.006  3952.892  4517.149    25   c  
     dumkoll2(n, df = FALSE)  2132.099  2310.885  2415.1664  2373.064  2430.689  3092.909    25  b   
  dumkoll(n, df = "quicker")  1890.222  1986.398  2136.1364  2048.263  2303.910  2783.479    25  b   
 dumkoll2(n, df = "quicker")   360.558   407.426   450.4333   427.902   479.398   724.098    25 a    

plot(mbd, log="y")

Wow, I’m slightly surprised that the compiler helped quite a bit, notably for the faster solutions (matrix and vector “[<-” calls).
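The size of the gain can be read off the median column of the benchmark table. The sketch below computes the plain-to-compiled ratio for each variant from those medians (values copied from the table; the object names med and speedup are illustrative):

```r
# Median times in microseconds, copied from the microbenchmark table above
med <- rbind(
  "df = TRUE"      = c(plain = 42569.952, compiled = 38843.909),
  "df = FALSE"     = c(plain =  3751.006, compiled =  2373.064),
  'df = "quicker"' = c(plain =  2048.263, compiled =   427.902)
)

# How many times faster the byte-compiled version is, per variant
speedup <- med[, "plain"] / med[, "compiled"]
round(speedup, 2)
```

On these numbers compilation gains roughly 10% for the data-frame loop, about a factor of 1.6 for the matrix loop, and close to a factor of 5 for the plain-vector loop. These are single-machine medians over 25 runs, so the ratios should be read as indicative only.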

print(sessionInfo(), locale=FALSE)

R version 3.4.4 Patched (2018-03-19 r74567)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 26 (Twenty Six)

Matrix products: default
BLAS: /scratch/users/maechler/R/D/r-patched/inst-shlib/lib/libRblas.so
LAPACK: /scratch/users/maechler/R/D/r-patched/inst-shlib/lib/libRlapack.so

attached base packages:
[1] compiler  graphics  grDevices datasets  stats     utils     methods  
[8] base     

other attached packages:
[1] microbenchmark_1.4-4 knitr_1.20           sfsmisc_1.1-1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16     magrittr_1.5     splines_3.4.4    MASS_7.3-49     
 [5] lattice_0.20-35  multcomp_1.4-8   stringr_1.3.0    tools_3.4.4     
 [9] grid_3.4.4       TH.data_1.0-8    htmltools_0.3.6  yaml_2.1.18     
[13] survival_2.41-3  rprojroot_1.3-2  digest_0.6.15    Matrix_1.2-12   
[17] codetools_0.2-15 evaluate_0.10.1  rmarkdown_1.9    sandwich_2.4-0  
[21] stringi_1.1.7    backports_1.1.2  mvtnorm_1.0-7    zoo_1.8-1

structure(Sys.info()[c("nodename","sysname", "version")], class="simple.list")

         _                                  
nodename nb-mm4                             
sysname  Linux                              
version  #1 SMP Tue Nov 21 21:10:40 UTC 2017

William Dunlap and Martin Maechler

Run on 2018-04-15