Benchmarking Drew's alternating-sign log sum

I was curious about Drew's example showing that doing the sum, \[ \sum_{i=1}^n (-1)^{i+1} \log(i) \] in R using a for loop took only \(\approx 1.3\) times longer than doing it in a vectorized fashion. That went against most things I had been told or encountered myself when programming in R.

Investigating further seemed like a good opportunity to practice with Markdown (and maybe get an RPubs account).

So, let's recreate it. Here is the function that uses a for loop:

f1 <- function(ind) {
    sum <- 0
    for (i in ind) {
        if (i%%2 == 0) 
            sum <- sum - log(i) else sum <- sum + log(i)
    }
    sum
}

Here is version 2, that does an sapply:

f2 <- function(ind) {
    sum(sapply(X = ind, FUN = function(i) if (i%%2 == 0) -log(i) else log(i)))
}

And here we have the vectorized version:

f3 <- function(ind) {
    sign <- replicate(length(ind), 1L)
    sign[seq.int(from = 2, to = length(ind), by = 2)] <- -1L
    sum(sign * log(ind))
}

And while we are at it, let's throw another one into the mix that uses vectorization and relies on recycling so we don't have to type as much:

f4 <- function(ind) {
    sum(c(1, -1) * log(ind))
}

So, now we rbenchmark these:

library(rbenchmark)
ind <- 1:50000
benchmark(f1(ind), f2(ind), f3(ind), f4(ind), replications = 10)
##      test replications elapsed relative user.self sys.self user.child
## 1 f1(ind)           10   0.985    31.77     0.981    0.005          0
## 2 f2(ind)           10   2.378    76.71     2.360    0.018          0
## 3 f3(ind)           10   0.748    24.13     0.736    0.011          0
## 4 f4(ind)           10   0.031     1.00     0.027    0.004          0
##   sys.child
## 1         0
## 2         0
## 3         0
## 4         0

Holy Crap Batman! We see here that f1 didn't do nearly as well as f3 because f1 did well, but rather because f3 did quite poorly. This, of course, begs the question, “which part of the perfectly fine-looking R code in f3 is the culprit?”

But before we try to answer that question I am going to take a quick digression into microbenchmark which is a package that Hadley Wickham recommends in the advanced R book that he is writing (check it out here and the section on microbenchmarking is here). If you haven't run across this book, do check it out. It is a fabulous resource.

A digression into microbenchmark

Let's try it:

library(microbenchmark)
microbenchmark(f1(ind), f2(ind), f3(ind), f4(ind), times = 10)
## Unit: milliseconds
##     expr     min      lq  median      uq    max neval
##  f1(ind)  97.311  97.839 102.397 105.304 120.28    10
##  f2(ind) 173.658 183.795 192.045 220.447 276.78    10
##  f3(ind)  71.584  78.554  93.350 100.370 131.16    10
##  f4(ind)   1.626   1.689   2.254   5.087  26.19    10

OK, similar results, but perhaps with more accurate timings. Apparently microbenchark is intended for timing only very small pieces of code, but this is what the authors say about it:

microbenchmark serves as a more accurate replacement of the often seen system.time(replicate(1000, expr)) expression. It tries hard to accurately measure only the time it takes to evaluate expr. To achieved this, the sub-millisecond (supposedly nanosecond) accurate timing functions most modern operating systems provide are used. Additionally all evaluations of the expressions are done in C code to minimze any overhead.

Now, let's find the culprit

Let's try some of the cool profiling methods that Drew told us about.

Here we will run f3 25 times while running Rprof()

Rprof(interval = 0.02)
for (i in 1:20) f3(ind)
Rprof(NULL)
summaryRprof()

Now, we don't actually evaluate that chunk while knitting to HTML because when evaluated within knitr it is imbedded within some evaluation functions that make the summaryRprof ouput rather nasty. But I will just paste it in here:

$by.self
                 self.time self.pct total.time total.pct
"lapply"              0.82    56.94       1.22     84.72
"FUN"                 0.40    27.78       0.40     27.78
"unlist"              0.10     6.94       0.62     43.06
"f3"                  0.06     4.17       1.44    100.00
"*"                   0.04     2.78       0.04      2.78
"unique.default"      0.02     1.39       0.02      1.39

$by.total
                 total.time total.pct self.time self.pct
"f3"                   1.44    100.00      0.06     4.17
"replicate"            1.34     93.06      0.00     0.00
"sapply"               1.34     93.06      0.00     0.00
"lapply"               1.22     84.72      0.82    56.94
"simplify2array"       0.64     44.44      0.00     0.00
"unlist"               0.62     43.06      0.10     6.94
"unique"               0.60     41.67      0.00     0.00
"FUN"                  0.40     27.78      0.40    27.78
"*"                    0.04      2.78      0.04     2.78
"unique.default"       0.02      1.39      0.02     1.39

$sample.interval
[1] 0.02

$sampling.time
[1] 1.44

Aha! There is the problem. Just a typo really. f3 uses replicate where it probably wants to use rep or even rep.int. Check out the source code for replicate:

function (n, expr, simplify = "array") 
sapply(integer(n), eval.parent(substitute(function(...) expr)), 
    simplify = simplify)
<bytecode: 0x100eeaf20>
<environment: namespace:base>

replicate is actually a wrapper for sapply.

So, let's see what happens if we use rep.int there instead. Make a new function like f3 but use rep.int instead, and call that f5:

f5 <- function(ind) {
    sign <- rep.int(1L, length(ind))
    sign[seq.int(from = 2, to = length(ind), by = 2)] <- -1L
    sum(sign * log(ind))
}

Then microbenchmark them all:

microbenchmark(f1(ind), f2(ind), f3(ind), f4(ind), f5(ind), times = 10)
## Unit: milliseconds
##     expr     min      lq  median      uq    max neval
##  f1(ind) 101.396 108.730 114.925 121.844 148.53    10
##  f2(ind) 181.568 204.382 213.661 231.774 248.90    10
##  f3(ind)  58.376  62.911  81.815  88.162 100.24    10
##  f4(ind)   1.593   1.614   1.987   5.724  14.18    10
##  f5(ind)   2.488   2.515   5.364   7.019  15.77    10

That seems a little more like it. f5 is now almost as fast as f4.

Cool Stuff.