Expression Template Benchmarking

Author

Marc Paterno

Published

March 2, 2023

1 Purpose of this document

This document explores the code generation and performance of expression templates in the context of dual numbers. In particular, we are concerned with arithmetic operations on dual numbers that occur in the field of automatic differentiation.
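For readers unfamiliar with dual numbers, the following is a minimal sketch of the kind of arithmetic involved. The type Dual5 and its members are hypothetical stand-ins, not the DualNum5 class benchmarked below; the sketch assumes a layout of one value plus five partial derivatives, which matches the six doubles taken by the DualNum5 constructor but is otherwise an assumption.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a dual number carrying a value and
// five partial derivatives (six doubles in total). This is an
// illustration only, not the DualNum5 class used in the benchmarks.
struct Dual5 {
  double                val;
  std::array<double, 5> grad;
};

// Sum rule: (u + v)' = u' + v'
inline Dual5
operator+(Dual5 const& u, Dual5 const& v)
{
  Dual5 r{u.val + v.val, {}};
  for (std::size_t i = 0; i != 5; ++i) {
    r.grad[i] = u.grad[i] + v.grad[i];
  }
  return r;
}

// Product rule: (u * v)' = u' * v + u * v'
inline Dual5
operator*(Dual5 const& u, Dual5 const& v)
{
  Dual5 r{u.val * v.val, {}};
  for (std::size_t i = 0; i != 5; ++i) {
    r.grad[i] = u.grad[i] * v.val + u.val * v.grad[i];
  }
  return r;
}
```

Written this way, a product of three dual numbers creates an intermediate Dual5 for the first multiplication. Avoiding such temporaries is the usual motivation for expression templates.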

2 Benchmarking results

We collected performance data on a MacBook Pro laptop with an Intel Core i9 chip, using the nanobench library to take the measurements.

Our collected data have the following form:

time	speed	fracerr	total	func	compiler
2.28	438538721	0.007	1.52	fold	apple-clang
2.28	438195750	0.003	1.51	template	apple-clang
3.10	322280516	0.012	2.04	handopt	apple-clang
2.34	427787560	0.010	1.60	naive	apple-clang
2.02	496065165	0.015	1.33	constructor	apple-clang
6.09	164121688	0.012	4.01	fold	gcc-10
6.03	165740201	0.011	4.00	template	gcc-10
6.52	153364308	0.006	4.31	handopt	gcc-10
6.52	153467732	0.014	4.31	naive	gcc-10
5.29	189178294	0.010	3.49	constructor	gcc-10

Each benchmark execution was repeated 10 times in the data shown below.

Figure 1 shows a comparison of the different functions with results grouped by compiler. Note that each plot has a different scale on the \(x\) axis.

Figure 1: Comparison between different functions for each compiler.

We first compare the measured speed of the constructor call with that of the other functions. The code measured by constructor is shown in Listing 1. It is intended to be as similar as possible to the code in the benchmarking functions for the functions we wish to compare. Listing 2 shows the actual benchmark executed. The function call differs from those used for benchmarking the various multiplication functions in that it takes 6 doubles by value, rather than 3 DualNums by const reference.

Listing 1: The only_construct function used in benchmarking.


DualNum5
only_construct(double a, double b, double c, double d, double e, double f)
{
  // Return uses DualNum5 constructor using 6 doubles
  return {a, b, c, d, e, f};
}

Listing 2: Benchmark function for timing the DualNum5 constructor.

void
run_bench_constructor(ankerl::nanobench::Bench* bench, char const* name)
{
  ankerl::nanobench::Rng rng(137);
  auto                   gen = [&]() { return rng.uniform01(); };
  double                 a   = gen();
  double                 b   = gen();
  double                 c   = gen();
  double                 d   = gen();
  double                 e   = gen();
  double                 f   = gen();
  bench->run(name, [&]() {
    DualNum5 z = only_construct(a, b, c, d, e, f);
    ankerl::nanobench::doNotOptimizeAway(z);
  });
}

It appears that GCC 10 does a much worse job than the other compilers of optimizing the construction time. Perhaps inspecting the assembly language code will explain why. Listing 3 and Listing 4 show the assembly language generated for the same C++ source code by GCC 10 and GCC 11, the worst- and best-performing of the GCC implementations. The code difference is not large: GCC 10 writes the last two doubles with two scalar vmovsd stores, while GCC 11 first packs them with vunpcklpd and writes them with a single vmovupd. The performance difference, however, is quite significant.

Listing 3: Optimized assembly code output from GCC 10 for the only_construct function.

vunpcklpd       %xmm3, %xmm2, %xmm2
vunpcklpd       %xmm1, %xmm0, %xmm1
movq    %rdi, %rax
vinsertf128     $0x1, %xmm2, %ymm1, %ymm0
vmovupd %ymm0, (%rdi)
vmovsd  %xmm4, 32(%rdi)
vmovsd  %xmm5, 40(%rdi)
vzeroupper
ret

Listing 4: Optimized assembly code output from GCC 11 for the only_construct function.

vunpcklpd       %xmm3, %xmm2, %xmm2
vunpcklpd       %xmm1, %xmm0, %xmm1
movq    %rdi, %rax
vinsertf128     $0x1, %xmm2, %ymm1, %ymm0
vunpcklpd       %xmm5, %xmm4, %xmm4
vmovupd %ymm0, (%rdi)
vmovupd %xmm4, 32(%rdi)
vzeroupper
ret

Next we look at the same data organized to allow easy comparison for each function across the different compilers.

Figure 2: Comparisons between different compilers for each function.

We can subtract the construction time from the time of each of the other calculations to infer how much time is actually spent in the calculations themselves. Since we do not have the time broken down by individual run, we calculate the mean time for each function and compiler and subtract the means.
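The subtraction of means can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used to produce the figures: the Record type, the Key alias and the functions mean_times and net_times are hypothetical names, and only the time, func and compiler columns of the data are used.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One benchmark record; the fields follow the time, func and
// compiler columns of the collected data. (Hypothetical type.)
struct Record {
  double      time;
  std::string func;
  std::string compiler;
};

// Key is (compiler, func).
using Key = std::pair<std::string, std::string>;

// Mean time for each (compiler, func) pair.
std::map<Key, double>
mean_times(std::vector<Record> const& records)
{
  std::map<Key, std::pair<double, int>> acc; // running sum and count
  for (auto const& r : records) {
    auto& slot = acc[Key(r.compiler, r.func)];
    slot.first += r.time;
    slot.second += 1;
  }
  std::map<Key, double> means;
  for (auto const& kv : acc) {
    means[kv.first] = kv.second.first / kv.second.second;
  }
  return means;
}

// For each compiler, subtract the mean "constructor" time from the
// mean time of every other function measured with that compiler.
std::map<Key, double>
net_times(std::vector<Record> const& records)
{
  std::map<Key, double> const means = mean_times(records);
  std::map<Key, double>       net;
  for (auto const& kv : means) {
    std::string const& compiler = kv.first.first;
    std::string const& func     = kv.first.second;
    if (func == "constructor") continue;
    auto const ctor = means.find(Key(compiler, "constructor"));
    if (ctor == means.end()) continue; // no baseline for this compiler
    net[kv.first] = kv.second - ctor->second;
  }
  return net;
}
```

As a rough check on the arithmetic, applying the same subtraction to the single sample rows in the table above gives about 2.28 - 2.02 = 0.26 for fold with apple-clang and about 6.09 - 5.29 = 0.80 for fold with gcc-10, in the units of the time column.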

I am not yet sure what to make of Figure 3.

Figure 3: Comparison of calculation times by compiler.