This document presents an analysis of the reduction phase of the eventselection program.
Run 2 used 112 nodes for the eventselection program, and 64 ranks per node.
Run 3 used 128 nodes for the eventselection program, and 56 ranks per node.
Run 4 was similar to run 2, except that it had a barrier before the beginning of the reduction and was instrumented to show details of the reduction phase.
Each run used a total of 7168 ranks (112 × 64 = 128 × 56 = 7168).
Our raw timing data consists of a series of records, each containing a timestamp captured using the MPI timer, a step name identifying the kind of timestamp, and an integer datum whose meaning depends on the step it accompanies.
## # A tibble: 21,504 x 23
## job rank start post_dataset post_block_conf… post_broadcast pre_decompose
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 0 0.266 19.5 19.5 288. 288.
## 2 2 1000 0.234 23.0 23.0 288. 288.
## 3 2 1001 0.0536 14.3 14.3 288. 288.
## 4 2 1002 0.108 18.2 18.2 288. 288.
## 5 2 1003 0.0922 15.1 15.1 288. 288.
## 6 2 1004 0.0907 15.5 15.5 288. 288.
## 7 2 1005 0.0896 14.6 14.6 288. 288.
## 8 2 1006 0.212 17.1 17.1 288. 288.
## 9 2 1007 0.0514 16.7 16.7 288. 288.
## 10 2 1008 0.270 13.2 13.2 288. 288.
## # … with 21,494 more rows, and 16 more variables: post_decompose <dbl>,
## # pre_execute_block <dbl>, post_execute_block <dbl>, pre_reduction <dbl>,
## # post_reduction <dbl>, finish <dbl>, makeds <dbl>, calcbc <dbl>,
## # broadcast <dbl>, prep <dbl>, makeblocks <dbl>, makelambda <dbl>,
## # executeblock <dbl>, reduction <dbl>, output <dbl>, total <dbl>
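The wide table above could be assembled from the raw step records with something like the following sketch. The file name timing_raw.csv and its column names are illustrative assumptions, not the actual instrumentation output:

```r
library(readr)
library(dplyr)
library(tidyr)

# Read the raw records: one row per (job, rank, step) timestamp.
raw <- read_csv("timing_raw.csv",
                col_types = cols(job  = col_integer(),
                                 rank = col_integer(),
                                 step = col_character(),
                                 time = col_double()))

# Pivot to one row per (job, rank) with one column per step,
# matching the shape of the tibble shown above.
timing <- raw %>%
  pivot_wider(id_cols = c(job, rank),
              names_from = step,
              values_from = time)
```

The duration columns (makeds through total) would then be derived from these timestamps, presumably as differences of successive steps.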
Looking at the starting times shows how long it takes to start the various ranks in the program. The first rank to start has its start time set to 0.
## # A tibble: 3 x 3
## job rank start
## <int> <int> <dbl>
## 1 2 837 0
## 2 3 563 0
## 3 4 731 0
There are no undue delays in startup.
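The table above, and the rank-0 finishing times shown next, can be extracted directly from the wide tibble; a minimal sketch, assuming it is named timing as in the sketch above:

```r
library(dplyr)

# Earliest-starting rank in each job (its start time is 0 by construction).
timing %>%
  group_by(job) %>%
  slice_min(start, n = 1) %>%
  ungroup() %>%
  select(job, rank, start)

# Finishing time of rank 0 in each job.
timing %>%
  filter(rank == 0) %>%
  select(job, rank, finish)
```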
## # A tibble: 3 x 3
## job rank finish
## <int> <int> <dbl>
## 1 2 0 5999.
## 2 3 0 3189.
## 3 4 0 602.
Job 4, with the explicit barrier before the MPI reduction, shows a much smaller spread of finishing times. However, its finishing time is later than that of every rank except rank 0 in the runs that do not have a barrier.
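The claim about the spread of finishing times can be checked numerically; a minimal sketch, again assuming the wide tibble timing:

```r
library(dplyr)

# Spread of finishing times within each job.
timing %>%
  group_by(job) %>%
  summarize(first_finish = min(finish),
            last_finish  = max(finish),
            spread       = last_finish - first_finish)
```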
The reduction task in the eventselection program consists of gathering the SliceIDs (4 integers each) of all the slices that pass the candidate selection criteria onto rank 0, where they are then printed. The reduction operation is the concatenation of vectors of SliceIDs.
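As an illustrative sketch (with made-up SliceIDs, not actual program data), the combine operation is plain concatenation, which is associative and so can be applied in any grouping of per-rank contributions:

```r
# Each rank contributes a vector of SliceIDs (4 integers per slice);
# the reduction result at rank 0 is the concatenation of all of them.
rank_contributions <- list(
  c(1L, 0L, 3L, 7L),                 # hypothetical SliceID from one rank
  integer(0),                        # a rank with no passing slices
  c(2L, 1L, 0L, 4L, 2L, 1L, 1L, 5L)  # two hypothetical SliceIDs from another rank
)
all_slice_ids <- Reduce(c, rank_contributions)
```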
Note that we use "time" to indicate a point in time at which some event happens, and "duration" to mean the span of time taken by some process.
The jobs lacking the MPI barrier before the reduction step both show a large reduction duration for rank 0.
## # A tibble: 3 x 3
## job rank reduction
## <int> <int> <dbl>
## 1 2 0 5477.
## 2 3 0 2674.
## 3 4 0 14.1
Excluding rank 0, the reduction durations show structure that appears to reflect the organization of the reduction algorithm.
The distribution of reduction durations is best shown on a log scale.
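A sketch of such a plot, assuming the wide tibble timing and using ggplot2 (the binning and facetting choices are illustrative):

```r
library(dplyr)
library(ggplot2)

# Distribution of reduction durations, excluding rank 0, on a log scale.
timing %>%
  filter(rank != 0) %>%
  ggplot(aes(x = reduction)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  facet_wrap(vars(job), ncol = 1) +
  labs(x = "reduction duration", y = "number of ranks")
```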
The long reduction duration for rank 0 might be caused by a delay in starting the reduction on some ranks. To see whether there is any indication of this, we look at the time at which each rank began the reduction. The shape of the plot for job 4 is a result of the MPI barrier.
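A sketch of that view, assuming the pre_reduction column records the time at which each rank entered the reduction (as in the tibble above):

```r
library(ggplot2)

# Time at which each rank began the reduction, one panel per job.
ggplot(timing, aes(x = rank, y = pre_reduction)) +
  geom_point(size = 0.2) +
  facet_wrap(vars(job), scales = "free_y") +
  labs(x = "rank", y = "time reduction began")
```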