Data Structure Profiling
version 2
Purpose
This document describes the microbenchmarking work I have done to help choose between possible data structures to use for the SimPhotonsLite class.
Note for version 2: this document corrects an error in the data collection present in version 1. In the earlier version, the apparently good performance of the hashmap solution was caused by an error in the code that filled the collections. The result of this error was that the number of measurements present in the std::map<int,int> and hashmap containers was smaller than expected (and never larger than 1000 elements). The conclusions reached are affected by this change.
Contenders for the design
The original design for SimPhotonsLite is a std::map<int, int>. Possible contenders for a redesign include:
std::unordered_map<int, int>. I call thishashmap.- A container of
recordobjects, each record containing atickand annphotsvalue. Possible choices for the container arestd::vector,std::dequeandstd::forward_list. I call these array-of-struct data structures. In the analysis below, they areaosv,aosdandaosl. - a
structcontaining two containers ofintvalues. The choices for the container are the same as above. I call these struct-of-array data structures. In the analysis below, they aresoav,soadandsoal.
Generally, the struct-of-array data structures are designed to optimize the performance of iterating over all nphots values, and the array-of-struct data structures are designed to optimize the performance of iterating simultaneously through both ticks and nphots. hashmap is optimized for doing insertions and deletions and direct lookups by ticks value. Note that the data product is created only once, so that optimization for insertions and deletions is not relevant. If lookup by ticks values is needed, the arrays in the SOA and AOS implementations could be sorted, and an interface providing a binary search could be provided.
Performance measurements
I made performance measurements using the nanobench library. In each case, collections of a varying range of sizes were used. Note that the typical size of a SimPhotonsLite object is about 3000 key/value pairs.
I created benchmarks for a variety of different usage patterns, described below. In each case I ran the same code on a Linux box (Intel Skylake-avx512) and a MacBook Pro (Apple M2 Max chip). On the Skylake machine, the code was compiled with GCC 12.3.0. On the M2 machine, the code was compiled with Apple Clang 15.0. I have used two different compilers, standard library implementations, and types of hardware to illustrate what patterns are common across implementations, and where implementation differences cause performance divergences.
Iterating through nphots values, ignoring ticks
In this test the iteration looks only at the nphots values, summing them.
nphots values. The vertical dotted line shows the typical size of the collection (3000).
The winners are the SOA data structures, which have similar performance. On the Skylake build, the AOS vector also achieves excellent performance; this is not true on the M2 build. The std::map<int, int> data structure is the slowest over the entire range. The hashmap data structure is less bad than the std::map<int,int>, but is still bad. This is a bucket-based data structure that performs approximately as well as the aosl, which is a node-based data structure; both yield poor cache coherency.
Iterating through both ticks and nphots
In this test, the iteration requires looking at both the ticks and the nphots data. The task we do is to find the tick with the largest value of nphots by a scan through all the data.
nphots to find the largest, and to record the ticks associated with that largest value. In this scan, ticks is only read when necessary. The vertical dotted line shows the typical size of the collection (3000).
This time the differences between the SOA and AOS structures are less dramatic, with the SOA structures doing better. Again, the std::map<int,int> is uniformly worst, and the node- and bucket-based stores of records are poor.
Tabular results
If we concentrate only on the size of 3000, we obtain the rankings below.
| structure | time |
|---|---|
| aosv | 514.25 |
| soav | 514.70 |
| soal | 516.60 |
| soad | 518.05 |
| aosd | 3070.63 |
| hashmap | 5735.98 |
| aosl | 5808.77 |
| map | 15809.36 |
sums function, Skylake results only.
| structure | time |
|---|---|
| aosv | 3072.53 |
| soav | 3744.61 |
| soad | 3749.71 |
| soal | 3750.77 |
| aosd | 4705.20 |
| hashmap | 5792.76 |
| aosl | 5813.63 |
| map | 17084.49 |
scan function, Skylake results only.
Conclusions
For the case if iterating only through the values (not looking at the keys), the struct-of-array choices, and the array-of-struct for an array type of vector are all good. The node- and the bucket-based solutions (std::map<int,int>,hashmap,aosl`) are not contenders for the solution.
For the case of iterating through both keys and values the array-of-struct and struct-of-array, both using vector, are the two best.
The size of the collections in question does not change the decision of which is fastest. In the same of 10 events I have used in my work thus far, the range of sizes was from 1 to 10 thousand measurements; over this range the relative performances are fairly stable.