This document has been updated to include two runs of the eventselection program over the same data.
Run 1 used 112 nodes for the eventselection program and 16 nodes running HEPnOS daemon processes, with 32 targets on each HEPnOS node, for a total of 512 targets. The eventselection program ran 63 ranks per node.
Run 2 was similar except that it used 64 ranks per node for the eventselection program.
Our raw timing data consists of a series of timestamps captured using the MPI timer, a step name identifying the kind of timestamp, and an integer datum with a meaning that depends on the step it accompanies.
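As a concrete sketch, the raw data could be read like this; the file name and column names here are assumptions for illustration, not the actual instrumentation format:

```r
library(readr)

# Hypothetical sketch of reading the raw timing records: one row per
# timestamp, with a step name and a step-dependent integer datum.
raw <- read_csv(
  "eventselection_timing.csv",    # assumed file name
  col_types = cols(
    job       = col_character(),
    rank      = col_integer(),
    step      = col_character(),  # the kind of timestamp
    timestamp = col_double(),     # MPI timer value, in seconds
    datum     = col_integer()     # meaning depends on the step
  )
)
```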
## # A tibble: 14,224 x 23
## job rank start post_dataset post_block_conf… post_broadcast pre_decompose
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0.276 19.1 19.1 191. 191.
## 2 1 1000 0.244 27.3 27.3 191. 191.
## 3 1 1001 0.103 27.3 27.3 191. 191.
## 4 1 1002 0.266 17.6 17.6 191. 191.
## 5 1 1003 0.258 18.7 18.7 191. 191.
## 6 1 1004 0.221 27.3 27.3 191. 191.
## 7 1 1005 0.125 21.2 21.2 191. 191.
## 8 1 1006 0.107 17.1 17.1 191. 191.
## 9 1 1007 0.275 25.2 25.2 191. 191.
## 10 1 1008 0.109 19.1 19.1 191. 191.
## # … with 14,214 more rows, and 16 more variables: post_decompose <dbl>,
## # pre_execute_block <dbl>, post_execute_block <dbl>, pre_reduction <dbl>,
## # post_reduction <dbl>, finish <dbl>, makeds <dbl>, calcbc <dbl>,
## # broadcast <dbl>, prep <dbl>, makeblocks <dbl>, makelambda <dbl>,
## # executeblock <dbl>, reduction <dbl>, output <dbl>, total <dbl>
We first look at timestamps that are captured for each event. The evts_1 dataframe contains one row per event, with the following columns:
- rank: the MPI rank
- evt: a unique identifier for the event
- precr: pre-create records timestamp
- postcr: post-create records timestamp
- postfr: post-filter records timestamp
- postps: post-process subrun timestamp
- nslices: number of slices in the event
- nbytes: total number of bytes in all slices in the event
- load: duration of loading from HEPnOS, in seconds
- rec: duration of conversion from HEPnOS object to StandardRecord object
- filt: duration of NOvA candidate selection execution

All timestamps are shifted so that the very first timestamp is at \(t = 0\), with times in seconds.
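A minimal sketch of that shift, assuming evts_1 is already loaded and (as an assumption for illustration) that precr contains the earliest timestamp in the data:

```r
library(dplyr)

# Shift every timestamp column so the earliest timestamp becomes t = 0.
t0 <- min(evts_1$precr)
evts_1 <- evts_1 %>%
  mutate(precr  = precr  - t0,
         postcr = postcr - t0,
         postfr = postfr - t0,
         postps = postps - t0)
```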
Looking at the starting times shows how long it takes to start the various ranks in the program. The first rank to start has its start time set to 0.
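Something like the following sketch produces the table below; ranks is a hypothetical name for the per-rank timing dataframe shown at the top of this document:

```r
library(dplyr)

# For each job, find the rank whose shifted start time is the minimum
# (zero by construction).
ranks %>%
  group_by(job) %>%
  filter(start == min(start)) %>%
  select(job, rank, start)
```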
## # A tibble: 2 x 3
## job rank start
## <chr> <int> <dbl>
## 1 1 1077 0
## 2 2 837 0
There are no undue delays in startup.
Job 2 has an enormous delay in the finishing time for rank 0; job 1 does not:
## # A tibble: 2 x 3
## job rank finish
## <chr> <int> <dbl>
## 1 1 0 520.
## 2 2 0 5999.
Leaving off rank 0, the finishing times show clear structure from the reduction. Note that rank 0 is the root to which the reduction is done. Job 2 is clearly slower than job 1.
The reduction task in the eventselection program consists of gathering the SliceIDs (4 integers each) of all the slices that pass the candidate selection criteria at rank 0, where they are then printed. The reduction operation is the concatenation of vectors of SliceIDs.
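Conceptually (this is only an R illustration, not the MPI implementation), the reduction behaves like concatenation over per-rank vectors:

```r
# Each rank contributes a vector of flattened SliceIDs (4 integers per
# SliceID); the values below are made up for illustration.
rank0_ids <- c(1L, 2L, 10L, 3L)                   # one SliceID
rank1_ids <- c(1L, 2L, 11L, 7L, 1L, 3L, 12L, 2L)  # two SliceIDs

# The reduction operation is concatenation; the root (rank 0) ends up
# with all of the passing SliceIDs, which it then prints.
all_ids <- Reduce(c, list(rank0_ids, rank1_ids))
```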
Note that we use *time* to indicate a point in time at which some event happens, and *duration* to mean the span of time taken by some process.
Job 2 shows a huge reduction duration for rank 0; job 1 does not:
## # A tibble: 2 x 3
## job rank reduction
## <chr> <int> <dbl>
## 1 1 0 79.3
## 2 2 0 5477.
Leaving off rank 0, the reduction durations show structure that seems to indicate the organization of the reduction algorithm.
The distribution of reduction durations is best shown on a log scale.
The long reduction duration for rank 0 might be caused by a delay in starting the reduction on some ranks. To see whether there is any indication of this, we look at the time at which each rank began the reduction:
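A sketch of that check, again using the hypothetical ranks dataframe and its pre_reduction column:

```r
library(ggplot2)

# Time at which each rank began the reduction, by job.
ggplot(ranks, aes(x = rank, y = pre_reduction)) +
  geom_point(size = 0.3) +
  facet_wrap(~ job)
```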
Loading data means reading from HEPnOS into the eventselection client.
Recall that we call a hepnos::Event::load function for each data product, and that we are reading several data products.
The size of the read is the sum of the in-memory sizes of the objects as seen by the client code; we are not counting bytes actually transferred.
The duration of the read is the number of seconds from the start of the first hepnos::Event::load to the end of the final hepnos::Event::load for that event.
The median load size is 6396 bytes and the largest is 72660 bytes.
The distributions for the two jobs should be identical, because the same data set is processed in each case.
The duration distribution shows a small but significant tail that is hard to see in this plot.
We call the number of bytes per second read the speed.
We can plot how the loading speed changes with the size of the data loaded. We group the events by size (nbytes) into 200 groups, each containing nearly the same number of events. For each group, we plot the mean speed (in bytes/s) versus the mean size (in bytes). The line is a “lowess” regression (local polynomial fitting).
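A sketch of how such a plot could be built, assuming evts is a hypothetical dataframe holding the combined per-event data for both jobs; cut_number (from ggplot2) makes bins with equal numbers of events:

```r
library(dplyr)
library(ggplot2)

# Bin events into 200 equal-population groups by size, then compute the
# mean speed and mean size within each bin.
binned <- evts %>%
  mutate(speed = nbytes / load,
         bin   = cut_number(nbytes, n = 200)) %>%
  group_by(job, bin) %>%
  summarize(mean_speed = mean(speed),
            mean_size  = mean(nbytes))

ggplot(binned, aes(x = mean_size, y = mean_speed, color = job)) +
  geom_point() +
  geom_smooth(method = "loess")  # local polynomial fit, as in the text
```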
We can make a similar plot of the load duration:
The load duration shows an “overhead” of ~325 milliseconds, with a slow increase in duration for sizes above ~10 kilobytes.
We define a slow load as one that takes more than 10 seconds.
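In code, a sketch of selecting the slow loads (evts again being the hypothetical combined per-event dataframe):

```r
library(dplyr)

# Slow loads: events whose load duration exceeds 10 seconds.
slow <- evts %>% filter(load > 10)
slow
```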
## # A tibble: 18,942 x 12
## job rank evt precr postfr postcr postps nslices nbytes load rec
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 1 0 8 208. 225. 225. 225. 2 2616 17.6 1.10e-4
## 2 1 1000 601 201. 218. 218. 218. 6 10848 17.0 1.52e-3
## 3 1 1000 603 218. 232. 232. 232. 6 11856 13.9 1.60e-4
## 4 1 1001 1266 198. 213. 213. 213. 3 4740 14.8 1.30e-4
## 5 1 1001 1270 213. 233. 233. 233. 6 12012 19.3 1.50e-4
## 6 1 1002 1792 201. 218. 218. 218. 2 2856 16.9 1.10e-4
## 7 1 1002 1793 218. 232. 232. 232. 4 11304 13.4 1.40e-4
## 8 1 1002 1794 234. 244. 244. 244. 2 2616 10.7 1.10e-4
## 9 1 1003 2394 207. 225. 225. 225. 2 2856 17.6 1.20e-4
## 10 1 1004 2985 201. 218. 218. 218. 2 2856 16.9 1.53e-3
## # … with 18,932 more rows, and 1 more variable: filt <dbl>
Job 2 has more slow loads than job 1:
## # A tibble: 2 x 2
## job n
## <chr> <int>
## 1 1 8756
## 2 2 10186
The slow loads seem to be spread across all ranks.
We can identify which HEPnOS target is associated with each rank.
We can then summarize the event data by target:
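A sketch of the per-target summary. The mapping from rank to target is an assumption here (one plausible scheme with 512 targets is rank modulo 512); the real mapping comes from the HEPnOS configuration.

```r
library(dplyr)

# Assumed rank-to-target mapping, for illustration only.
evts_by_target <- evts %>%
  mutate(target = rank %% 512) %>%
  group_by(job, target) %>%
  summarize(nevts   = n(),
            nranks  = n_distinct(rank),
            minld   = min(load),
            medld   = median(load),
            meanld  = mean(load),
            maxld   = max(load),
            nbytes  = sum(nbytes),
            nslices = sum(nslices))
```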
## # A tibble: 1,024 x 10
## # Groups: job [2]
## job target nevts nranks minld medld meanld maxld nbytes nslices
## <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 1 0 7734 14 0.00893 0.185 0.330 27.8 55084068 35861
## 2 1 1 7857 14 0.00893 0.177 0.328 27.0 56090376 36538
## 3 1 2 7746 14 0.00893 0.183 0.332 29.3 53726376 34994
## 4 1 3 7771 14 0.00898 0.183 0.333 30.1 53054172 34555
## 5 1 4 7711 14 0.00893 0.181 0.326 31.6 53682060 35072
## 6 1 5 7678 14 0.00908 0.185 0.331 33.4 52684128 34244
## 7 1 6 7791 14 0.00855 0.176 0.330 29.1 54216888 35284
## 8 1 7 7708 14 0.00862 0.188 0.330 26.3 53174220 34580
## 9 1 8 7658 14 0.00871 0.191 0.336 31.6 51630540 33506
## 10 1 9 7531 14 0.00907 0.186 0.332 29.9 51564792 33480
## # … with 1,014 more rows
## job target nevts nranks
## Length:1024 Min. : 0.0 Min. :6961 Min. :13.00
## Class :character 1st Qu.:127.8 1st Qu.:7538 1st Qu.:14.00
## Mode :character Median :255.5 Median :7644 Median :14.00
## Mean :255.5 Mean :7601 Mean :13.89
## 3rd Qu.:383.2 3rd Qu.:7722 3rd Qu.:14.00
## Max. :511.0 Max. :7900 Max. :14.00
## minld medld meanld maxld
## Min. :0.008000 Min. :0.1652 Min. :0.3184 Min. :17.53
## 1st Qu.:0.008720 1st Qu.:0.1867 1st Qu.:0.3337 1st Qu.:28.75
## Median :0.008870 Median :0.2111 Median :0.3389 Median :30.80
## Mean :0.008839 Mean :0.2092 Mean :0.3387 Mean :29.81
## 3rd Qu.:0.008980 3rd Qu.:0.2320 3rd Qu.:0.3436 3rd Qu.:32.28
## Max. :0.009450 Max. :0.2391 Max. :0.3540 Max. :37.15
## nbytes nslices
## Min. :43585320 Min. :28421
## 1st Qu.:51331257 1st Qu.:33354
## Median :52786092 Median :34342
## Mean :52551737 Mean :34175
## 3rd Qu.:54086640 3rd Qu.:35150
## Max. :56900088 Max. :36925
We show, for every target, the median load duration. We use color to indicate the number of megabytes read from that target, and bin the data by number of (thousands of) slices, each bin plotted in its own panel. There is a tendency for the highest target numbers to have a smaller number of megabytes transferred, along with a slightly smaller median load duration.
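A sketch of this plot, reusing the per-target summary from above; the number of slice bins is an assumption:

```r
library(dplyr)
library(ggplot2)

# Median load duration per target, colored by megabytes read, with one
# panel per bin of (thousands of) slices.
evts_by_target %>%
  mutate(mbytes  = nbytes / 1e6,
         kslices = cut_number(nslices / 1000, n = 4)) %>%  # assumed bins
  ggplot(aes(x = target, y = medld, color = mbytes)) +
  geom_point(size = 0.5) +
  facet_wrap(~ kslices)
```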
The distribution of event load durations for a few selected targets does not seem to indicate that different targets behave differently.
The time evolution of the load duration is complex. We plot here the load duration for each event as a function of the (approximate) time at which that load is done. The time used is the pre-create records timestamp (precr), which is taken shortly before the event is read from HEPnOS.
We plot only a few ranks, because the full plot takes a very long time to produce and is too busy to be useful.
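A sketch of such a plot; the particular ranks chosen are arbitrary:

```r
library(dplyr)
library(ggplot2)

# Load duration versus the approximate time of the load (precr), for a
# handful of ranks, with one panel per rank and job.
evts %>%
  filter(rank %in% c(0, 1000, 3000, 5000)) %>%
  ggplot(aes(x = precr, y = load)) +
  geom_point(size = 0.3) +
  facet_grid(rank ~ job)
```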
We can summarize the data for each rank.
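A sketch of the per-rank summary, using the same assumed rank-to-target mapping as above:

```r
library(dplyr)

# Summarize load durations for each rank; each rank reads from a single
# target, so the target is carried along in the grouping.
evts_by_rank <- evts %>%
  mutate(target = rank %% 512) %>%   # assumed mapping, as above
  group_by(job, rank, target) %>%
  summarize(n      = n(),
            minlt  = min(load),
            meanlt = mean(load),
            medlt  = median(load),
            maxlt  = max(load))
```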
## # A tibble: 14,224 x 8
## # Groups: job [2]
## job rank target n minlt meanlt medlt maxlt
## <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 600 0.00938 0.341 0.204 17.6
## 2 1 1 1 600 0.00907 0.344 0.168 16.0
## 3 1 2 2 600 0.00893 0.337 0.174 17.0
## 4 1 3 3 573 0.00898 0.343 0.140 18.2
## 5 1 4 4 516 0.00983 0.358 0.225 16.7
## 6 1 5 5 600 0.00915 0.335 0.189 17.6
## 7 1 6 6 600 0.00926 0.338 0.195 17.2
## 8 1 7 7 600 0.00947 0.347 0.191 21.3
## 9 1 8 8 506 0.00953 0.374 0.195 20.8
## 10 1 9 9 600 0.00951 0.336 0.204 23.1
## # … with 14,214 more rows
## job rank target n
## Length:14224 Min. : 0 Min. : 0.0 Min. :329.0
## Class :character 1st Qu.:1778 1st Qu.:126.8 1st Qu.:500.0
## Mode :character Median :3556 Median :253.5 Median :562.0
## Mean :3556 Mean :253.9 Mean :547.2
## 3rd Qu.:5333 3rd Qu.:380.2 3rd Qu.:600.0
## Max. :7167 Max. :511.0 Max. :600.0
## minlt meanlt medlt maxlt
## Min. :0.00800 Min. :0.2181 Min. :0.1029 Min. : 3.82
## 1st Qu.:0.00918 1st Qu.:0.3241 1st Qu.:0.1967 1st Qu.:17.09
## Median :0.00963 Median :0.3461 Median :0.2114 Median :17.73
## Mean :0.00973 Mean :0.3388 Mean :0.2080 Mean :18.89
## 3rd Qu.:0.01024 3rd Qu.:0.3606 3rd Qu.:0.2376 3rd Qu.:18.84
## Max. :0.01204 Max. :0.4425 Max. :0.3198 Max. :37.15
The median load duration per rank shows some structure. The target does not seem to be important in this structure.
The mean load duration per rank is more sensitive to extreme outliers, and also shows some structure.
The maximum load duration per rank shows less structure, but there is still one block of ranks very different from the others.