| time | speed | fracerr | total | func | compiler |
|---|---|---|---|---|---|
| 2.28 | 438538721 | 0.007 | 1.52 | fold | apple-clang |
| 2.28 | 438195750 | 0.003 | 1.51 | template | apple-clang |
| 3.10 | 322280516 | 0.012 | 2.04 | handopt | apple-clang |
| 2.34 | 427787560 | 0.010 | 1.60 | naive | apple-clang |
| 2.02 | 496065165 | 0.015 | 1.33 | constructor | apple-clang |
| 6.09 | 164121688 | 0.012 | 4.01 | fold | gcc-10 |
| 6.03 | 165740201 | 0.011 | 4.00 | template | gcc-10 |
| 6.52 | 153364308 | 0.006 | 4.31 | handopt | gcc-10 |
| 6.52 | 153467732 | 0.014 | 4.31 | naive | gcc-10 |
| 5.29 | 189178294 | 0.010 | 3.49 | constructor | gcc-10 |
Expression Template Benchmarking
1 Purpose of this document
This document explores the code generation and performance of expression templates in the context of dual numbers. In particular, we are concerned with arithmetic operations on dual numbers that occur in the field of automatic differentiation.
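To make the arithmetic concrete, here is a minimal sketch of a dual number and its product rule. The type Dual5 and the function mul are hypothetical stand-ins, not the DualNum5 class benchmarked in this document; the layout (one value followed by five partial derivatives, six doubles in total) is an assumption based on the six-argument constructor used later.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for DualNum5: one value plus five partial
// derivatives. The class actually benchmarked may be laid out differently.
struct Dual5 {
  double val;
  std::array<double, 5> grad;
};

// Multiplication follows the product rule, applied per derivative slot:
//   (x * y).val     = x.val * y.val
//   (x * y).grad[i] = x.grad[i] * y.val + x.val * y.grad[i]
inline Dual5
mul(Dual5 const& x, Dual5 const& y)
{
  Dual5 r{x.val * y.val, {}};
  for (std::size_t i = 0; i < 5; ++i) {
    r.grad[i] = x.grad[i] * y.val + x.val * y.grad[i];
  }
  return r;
}
```

For example, multiplying x = 2 (seeded in derivative slot 0) by y = 3 (seeded in slot 1) yields the value 6 with partial derivatives 3 and 2 in slots 0 and 1. An expression-template implementation would aim to produce the same result while avoiding temporaries in longer products.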
2 Benchmarking results
We collected performance data on a MacBook Pro laptop with an Intel Core i9 chip, using the nanobench library.
Our collected data have the following form:
Each benchmark execution was repeated 10 times in the data shown below.
Figure 1 shows a comparison of the different functions with results grouped by compiler. Note that each plot has a different scale on the \(x\) axis.
We first compare the measured speed of the constructor call with that of the other functions. The code measured by constructor is shown in Listing 1. It is intended to be as similar as possible to the code in the benchmarking functions for the functions we wish to compare. Listing 2 shows the actual benchmark executed. The function call differs from the one used for benchmarking the various multiplication functions in that it takes 6 doubles by value, rather than 3 DualNums by const reference.
Listing 1: The only_construct function used in benchmarking.
DualNum5
only_construct(double a, double b, double c, double d, double e, double f)
{
  // Return uses DualNum5 constructor using 6 doubles
  return {a, b, c, d, e, f};
}
Listing 2: Benchmark function for timing the DualNum5 constructor.
void
run_bench_constructor(ankerl::nanobench::Bench* bench, char const* name)
{
  ankerl::nanobench::Rng rng(137);
  auto gen = [&]() { return rng.uniform01(); };
  double a = gen();
  double b = gen();
  double c = gen();
  double d = gen();
  double e = gen();
  double f = gen();
  bench->run(name, [&]() {
    DualNum5 z = only_construct(a, b, c, d, e, f);
    ankerl::nanobench::doNotOptimizeAway(z);
  });
}
It appears that GCC 10 does a much worse job than the other compilers of optimizing the constructor call. Perhaps inspecting the assembly language code will explain why. Listing 3 and Listing 4 show the assembly language generated for the same C++ source code by GCC 10 and GCC 11, the worst- and best-performing of the GCC implementations. The code difference is not large: GCC 10 stores the last two doubles with two scalar vmovsd instructions, whereas GCC 11 packs them with vunpcklpd and stores them with a single 16-byte vmovupd. The performance difference, however, is quite significant.
Listing 3: Optimized assembly code output from GCC 10 for the only_construct function.
vunpcklpd %xmm3, %xmm2, %xmm2
vunpcklpd %xmm1, %xmm0, %xmm1
movq %rdi, %rax
vinsertf128 $0x1, %xmm2, %ymm1, %ymm0
vmovupd %ymm0, (%rdi)
vmovsd %xmm4, 32(%rdi)
vmovsd %xmm5, 40(%rdi)
vzeroupper
ret
Listing 4: Optimized assembly code output from GCC 11 for the only_construct function.
vunpcklpd %xmm3, %xmm2, %xmm2
vunpcklpd %xmm1, %xmm0, %xmm1
movq %rdi, %rax
vinsertf128 $0x1, %xmm2, %ymm1, %ymm0
vunpcklpd %xmm5, %xmm4, %xmm4
vmovupd %ymm0, (%rdi)
vmovupd %xmm4, 32(%rdi)
vzeroupper
ret
Next we look at the same data organized to allow easy comparison for each function across the different compilers.
We can subtract the construction time from both of the other calculations to infer how much time is actually spent in the calculations. Since we do not have the time broken down by individual run, we calculate the means and subtract them.
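The subtraction of means can be sketched as follows. This is an illustration only, not the analysis code used to produce the figures: the Record struct, the Key alias, and the functions mean_times and net_times are hypothetical names, and the record fields are assumed to correspond to the time, func and compiler columns of the collected data.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record holding the columns needed from one benchmark row.
struct Record {
  double time;
  std::string func;
  std::string compiler;
};

using Key = std::pair<std::string, std::string>; // (compiler, func)

// Mean time for each (compiler, func) group.
std::map<Key, double>
mean_times(std::vector<Record> const& records)
{
  std::map<Key, std::pair<double, int>> acc; // running (sum, count)
  for (auto const& r : records) {
    auto& [sum, n] = acc[Key{r.compiler, r.func}];
    sum += r.time;
    ++n;
  }
  std::map<Key, double> means;
  for (auto const& [key, sn] : acc) {
    means[key] = sn.first / sn.second;
  }
  return means;
}

// Subtract each compiler's mean constructor time from the mean time of
// every other function measured with that compiler.
std::map<Key, double>
net_times(std::map<Key, double> const& means)
{
  std::map<Key, double> net;
  for (auto const& [key, mean] : means) {
    if (key.second == "constructor") continue;
    auto ctor = means.find(Key{key.first, "constructor"});
    if (ctor != means.end()) {
      net[key] = mean - ctor->second;
    }
  }
  return net;
}
```

Applied to the sample rows in the table alone, which are single records rather than full sets of 10 repetitions, this would give about 0.26 for fold under apple-clang (2.28 minus 2.02) and about 0.80 for fold under gcc-10 (6.09 minus 5.29), in the units of the time column. Because only the means are subtracted, the result carries no information about run-to-run variation.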
I am not yet sure what to make of Figure 3.