Started from an E-mail
William Dunlap | R-help mailing list | 17 Mar 2014 | Subject: Re: data frame vs. matrix https://stat.ethz.ch/pipermail/r-help/2014-March/372323.html
MM, 2016: From the timings below, note how much faster R is two years later!
MM, 2018: Since R version 3.4.0, R byte-compiles functions just in time (JIT) by default. The following lines show the current JIT level and then turn it off, so that we can make the comparison as in earlier versions of R:
compiler::enableJIT(-1) # shows current auto-compilation "level".
[1] 3
## Turn it off:
oL <- compiler::enableJIT(0) # (-> level 0: it is turned off)
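Because enableJIT() returns the level that was active before the call, the saved value (oL above) can be used to switch the JIT back on later. A minimal sketch of that save-and-restore pattern follows; with_jit_off is a hypothetical helper name used only for illustration, not part of the compiler package:

```r
# Hypothetical helper: evaluate `expr` with the JIT switched off,
# then restore whatever level was active before the call.
with_jit_off <- function(expr) {
  old_level <- compiler::enableJIT(0)            # returns the previous level
  on.exit(compiler::enableJIT(old_level), add = TRUE)
  expr  # lazily evaluated argument: forced here, i.e. while the JIT is off
}

level_before <- compiler::enableJIT(-1)          # negative value: query only
level_inside <- with_jit_off(compiler::enableJIT(-1))
level_after  <- compiler::enableJIT(-1)

c(before = level_before, inside = level_inside, after = level_after)
```

Wrapping the restore in on.exit() means the previous level comes back even if the timed expression fails.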
Duncan Murdoch’s analysis suggests another way to do this: extract the x vector, operate on that vector in a loop, then insert the result into the data.frame. I added a df="quicker" option to your df argument and made the test dataset deterministic so we could verify that the algorithms do the same thing:
dumkoll <- function(n = 1000, df = TRUE){
    ## deterministic test data
    dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
    if (identical(df, "quicker")) {
        ## extract the column, loop over the plain vector, insert it back once
        x <- dfr$x
        for(i in 2:length(x)) {
            x[i] <- x[i-1]
        }
        dfr$x <- x
    } else if (df){
        ## modify the data frame element by element
        for (i in 2:NROW(dfr)){
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    }else{
        ## loop over a matrix copy of the data frame
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
    dfr
}
(Bill Dunlap:) Timings for \(10^4\), \(2 \cdot 10^4\), and \(4 \cdot 10^4\) rows show that the time is quadratic in n for the df=TRUE case and close to linear in the other cases, with the new method taking about 60% of the time of the matrix method:
n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])
            10k   20k   40k
user.self 0.018 0.036 0.072
sys.self  0.000 0.000 0.001
elapsed   0.018 0.036 0.073
## BD:       10k  20k  40k
## user.self 0.11 0.22 0.43
## sys.self  0.02 0.00 0.00
## elapsed   0.12 0.22 0.44
sapply(n, function(n)system.time(dumkoll(n, df=TRUE))[1:3])
            10k   20k   40k
user.self 0.352 0.818 2.519
sys.self  0.000 0.000 0.000
elapsed   0.354 0.824 2.540
## BD:        10k   20k   40k
## user.self 3.59 14.74 78.37
## sys.self  0.00  0.11  0.16
## elapsed   3.59 14.91 78.81
sapply(n, function(n)system.time(dumkoll(n, df="quicker"))[1:3])
            10k   20k   40k
user.self 0.009 0.017 0.036
sys.self  0.000 0.000 0.000
elapsed   0.009 0.017 0.035
## BD:       10k  20k  40k
## user.self 0.06 0.12 0.26
## sys.self  0.00 0.00 0.00
## elapsed   0.07 0.13 0.27
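One way to check the "quadratic versus linear" reading of these timings is to look at how much the time grows when n doubles: the base-2 logarithm of the ratio should be near 1 for linear growth and near 2 for quadratic growth. The sketch below applies this to Bill Dunlap's user times as quoted above (the object names are only illustrative):

```r
# User times (seconds) from the "BD" comment lines above.
bd_user <- rbind(
  matrix  = c(0.11,  0.22,  0.43),   # df = FALSE
  df      = c(3.59, 14.74, 78.37),   # df = TRUE
  quicker = c(0.06,  0.12,  0.26)    # df = "quicker"
)
colnames(bd_user) <- c("10k", "20k", "40k")

# Empirical scaling exponent for each doubling of n:
# log2(time at 2n / time at n).
expo <- log2(bd_user[, -1] / bd_user[, -ncol(bd_user)])
round(expo, 2)
```

With timings this short and rounded to two digits, the exponents are only rough indications, but the df=TRUE row clearly separates from the other two.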
I also timed the 2 faster cases for n=10^6 and the time still looks linear in n, with the vector approach still taking about 60% of the time of the matrix approach.
system.time(dumkoll(n=10^6, df=FALSE))
user system elapsed
1.151 0.132 1.294
# BD: user system elapsed
# 11.65 0.12 11.82
system.time(dumkoll(n=10^6, df="quicker"))
user system elapsed
0.102 0.004 0.107
# BD: user system elapsed
# 6.79 0.08 6.91
The results from each method are identical:
identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))
[1] TRUE
identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))
[1] TRUE
If your data.frame has columns of various types, then as.matrix
will coerce them all to a common type (often character), so it may give you the wrong result in addition to being unnecessarily slow.
Bill Dunlap TIBCO Software wdunlap tibco.co
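The coercion warning at the end of the e-mail is easy to reproduce. The following small sketch uses a made-up data frame (mixed and its column names are assumptions for illustration) to show that a single character column turns the whole matrix into character, after which arithmetic on the formerly numeric column fails:

```r
# Illustrative data frame with integer, double and character columns.
mixed <- data.frame(id    = 1:3,
                    score = c(0.5, 1.5, 2.5),
                    label = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

m_all <- as.matrix(mixed)                     # all columns
m_num <- as.matrix(mixed[c("id", "score")])   # numeric columns only

typeof(m_all)
typeof(m_num)

# Arithmetic on the coerced "score" column no longer works:
res <- tryCatch(m_all[, "score"] + 1,
                error = function(e) "error")
res
```

Extracting a single column as a vector, as in the df="quicker" branch, avoids this problem because the other columns are never touched.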
Rprof("dumkoll.Rprof", interval = 0.005) # start profiling
dd <- dumkoll(50000, df=TRUE)
Rprof(NULL) # stop profiling
## ?Rprof
sr <- summaryRprof("dumkoll.Rprof")
sr
$by.self
self.time self.pct total.time total.pct
"dumkoll" 2.970 76.06 3.905 100.00
"$<-.data.frame" 0.170 4.35 0.235 6.02
"[[.data.frame" 0.155 3.97 0.395 10.12
"$" 0.100 2.56 0.615 15.75
"[[" 0.075 1.92 0.470 12.04
"$<-" 0.070 1.79 0.305 7.81
"%in%" 0.070 1.79 0.145 3.71
"<Anonymous>" 0.050 1.28 0.065 1.66
"sys.call" 0.050 1.28 0.050 1.28
"$.data.frame" 0.045 1.15 0.515 13.19
"NROW" 0.030 0.77 0.035 0.90
"all" 0.025 0.64 0.025 0.64
"names" 0.025 0.64 0.025 0.64
"-" 0.015 0.38 0.015 0.38
".row_names_info" 0.015 0.38 0.015 0.38
".subset2" 0.010 0.26 0.010 0.26
"is.atomic" 0.010 0.26 0.010 0.26
"dim" 0.005 0.13 0.005 0.13
"is.matrix" 0.005 0.13 0.005 0.13
"nargs" 0.005 0.13 0.005 0.13
"oldClass" 0.005 0.13 0.005 0.13
$by.total
total.time total.pct self.time self.pct
"dumkoll" 3.905 100.00 2.970 76.06
"block_exec" 3.905 100.00 0.000 0.00
"call_block" 3.905 100.00 0.000 0.00
"eval" 3.905 100.00 0.000 0.00
"evaluate_call" 3.905 100.00 0.000 0.00
"evaluate::evaluate" 3.905 100.00 0.000 0.00
"evaluate" 3.905 100.00 0.000 0.00
"handle" 3.905 100.00 0.000 0.00
"in_dir" 3.905 100.00 0.000 0.00
"knitr::knit" 3.905 100.00 0.000 0.00
"process_file" 3.905 100.00 0.000 0.00
"process_group.block" 3.905 100.00 0.000 0.00
"process_group" 3.905 100.00 0.000 0.00
"rmarkdown::render" 3.905 100.00 0.000 0.00
"timing_fn" 3.905 100.00 0.000 0.00
"withCallingHandlers" 3.905 100.00 0.000 0.00
"withVisible" 3.905 100.00 0.000 0.00
"$" 0.615 15.75 0.100 2.56
"$.data.frame" 0.515 13.19 0.045 1.15
"[[" 0.470 12.04 0.075 1.92
"[[.data.frame" 0.395 10.12 0.155 3.97
"$<-" 0.305 7.81 0.070 1.79
"$<-.data.frame" 0.235 6.02 0.170 4.35
"%in%" 0.145 3.71 0.070 1.79
"<Anonymous>" 0.065 1.66 0.050 1.28
"sys.call" 0.050 1.28 0.050 1.28
"NROW" 0.035 0.90 0.030 0.77
"all" 0.025 0.64 0.025 0.64
"names" 0.025 0.64 0.025 0.64
"-" 0.015 0.38 0.015 0.38
".row_names_info" 0.015 0.38 0.015 0.38
".subset2" 0.010 0.26 0.010 0.26
"is.atomic" 0.010 0.26 0.010 0.26
"dim" 0.005 0.13 0.005 0.13
"is.matrix" 0.005 0.13 0.005 0.13
"nargs" 0.005 0.13 0.005 0.13
"oldClass" 0.005 0.13 0.005 0.13
$sample.interval
[1] 0.005
$sampling.time
[1] 3.905
So, indeed, the culprit is $<-, and specifically almost only the data.frame method of that.
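To see why both $ and $<- show up in the profile, it helps to spell out by hand roughly what R evaluates for a nested replacement such as dfr$x[i] <- value. The sketch below is an approximation of that expansion on a tiny data frame, not the interpreter's exact internals; the names direct, manual, tmp and col are only illustrative:

```r
dfr <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
i <- 2

# The one-liner used in the df = TRUE branch:
direct <- dfr
direct$x[i] <- direct$x[i - 1]

# Roughly the same thing, step by step:
tmp <- dfr
col <- `$`(tmp, "x")                       # read the column: $ on a data frame
col <- `[<-`(col, i, value = col[i - 1])   # update one element of a plain vector
manual <- `$<-`(tmp, "x", value = col)     # write the whole column back: $<-.data.frame

identical(direct, manual)
```

In the df=TRUE loop, the read and the write-back through the data.frame methods happen on every iteration, whereas the df="quicker" branch does each of them once.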
A “free” way to increase performance of R functions: R’s byte compiler:
require(compiler)
Loading required package: compiler
help(package = "compiler") # fails to give anything (RStudio bug!)
library(help = "compiler")# the old fashioned way works fine
These are not evaluated (when the *.Rmd is knit into Markdown):
?cmpfun # interesting, notably
example(cmpfun) # shows indeed speedups of almost 50% in one case (on MM's notebook)
So, we now can compile our function and see how much that helps:
dumkoll2 <- cmpfun(dumkoll)
require(microbenchmark)
Loading required package: microbenchmark
Let’s use a somewhat small n:
n <- 2000
mbd <- microbenchmark(dumkoll(n), dumkoll2(n),
dumkoll(n, df=FALSE), dumkoll2(n, df=FALSE),
dumkoll(n, df="quicker"), dumkoll2(n, df="quicker"), times = 25)
mbd
Unit: microseconds
expr min lq mean median
dumkoll(n) 41018.298 41773.930 43140.9819 42569.952
dumkoll2(n) 37042.018 37855.456 39306.9822 38843.909
dumkoll(n, df = FALSE) 3556.881 3673.366 3830.3426 3751.006
dumkoll2(n, df = FALSE) 2132.099 2310.885 2415.1664 2373.064
dumkoll(n, df = "quicker") 1890.222 1986.398 2136.1364 2048.263
dumkoll2(n, df = "quicker") 360.558 407.426 450.4333 427.902
uq max neval cld
44040.776 47499.921 25 e
39841.634 49404.805 25 d
3952.892 4517.149 25 c
2430.689 3092.909 25 b
2303.910 2783.479 25 b
479.398 724.098 25 a
plot(mbd, log="y")
Wow, I’m slightly surprised that the compiler helped quite a bit, notably for the faster solutions (matrix and vector “[<-” calls).
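The same experiment can be tried on any loop-heavy function. Here is a minimal, self-contained sketch with a toy function (cumsum_loop is a hypothetical example, not from the e-mail thread) that compiles it with cmpfun() and checks that the compiled version returns the same result:

```r
library(compiler)

# Hypothetical toy function: an explicit loop over a vector, the kind
# of code where byte compilation tends to help.
cumsum_loop <- function(x) {
  out <- numeric(length(x))
  acc <- 0
  for (i in seq_along(x)) {
    acc <- acc + x[i]
    out[i] <- acc
  }
  out
}

cumsum_loop_c <- cmpfun(cumsum_loop)   # byte-compiled copy

x <- as.numeric(1:1000)
same_result  <- identical(cumsum_loop(x), cumsum_loop_c(x))
same_builtin <- isTRUE(all.equal(cumsum_loop(x), cumsum(x)))
c(same_result, same_builtin)
```

To compare speeds, both versions could be passed to microbenchmark() as above. Note that with the default JIT setting of R >= 3.4.0 the uncompiled function would itself be compiled after a few calls, so a difference is only expected when the JIT is turned off, as it is in this document.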
print(sessionInfo(), locale=FALSE)
R version 3.4.4 Patched (2018-03-19 r74567)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 26 (Twenty Six)
Matrix products: default
BLAS: /scratch/users/maechler/R/D/r-patched/inst-shlib/lib/libRblas.so
LAPACK: /scratch/users/maechler/R/D/r-patched/inst-shlib/lib/libRlapack.so
attached base packages:
[1] compiler graphics grDevices datasets stats utils methods
[8] base
other attached packages:
[1] microbenchmark_1.4-4 knitr_1.20 sfsmisc_1.1-1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 magrittr_1.5 splines_3.4.4 MASS_7.3-49
[5] lattice_0.20-35 multcomp_1.4-8 stringr_1.3.0 tools_3.4.4
[9] grid_3.4.4 TH.data_1.0-8 htmltools_0.3.6 yaml_2.1.18
[13] survival_2.41-3 rprojroot_1.3-2 digest_0.6.15 Matrix_1.2-12
[17] codetools_0.2-15 evaluate_0.10.1 rmarkdown_1.9 sandwich_2.4-0
[21] stringi_1.1.7 backports_1.1.2 mvtnorm_1.0-7 zoo_1.8-1
structure(Sys.info()[c("nodename","sysname", "version")], class="simple.list")
_
nodename nb-mm4
sysname Linux
version #1 SMP Tue Nov 21 21:10:40 UTC 2017