This document presents an analysis of the HEPnOS large-run timing data, to determine whether more work is needed on the code before we write our paper.
We read the raw data and make the per-rank global dataframe.
raw <- readRDS("theta_es_7168_2020-06-30_01.rds")
ranks <- make_global_df(raw)
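As a quick sanity check (a sketch, assuming make_global_df returns one row per rank), the dataframe should have one row for each of the 7168 ranks in this run:
nrow(ranks)  # expect 7168, one row per rank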
According to the batch job log file, the program takes about 574 seconds to run. From internal timing, the span between the first timestamp we capture and the last is 473.8 seconds.
The difference, roughly 100 seconds, seems to be MPI startup and shutdown time.
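A sketch of how that internal span could be computed from the per-rank dataframe; the column names first_ts and last_ts are hypothetical, used only for illustration:
# Hypothetical columns: first_ts / last_ts are each rank's earliest and
# latest captured timestamps, in seconds.
max(ranks$last_ts) - min(ranks$first_ts)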
How long does it take to calculate block configurations? We have a communicator (a “group”) for each HEPnOS target, and only the group’s leader does this calculation for that target. The calculation involves communicating with the target, using an Eventset, to determine which events that target owns.
The HEPnOS daemon ran on 16 nodes, with 2 daemon ranks on each node and 16 targets per daemon rank, giving a total of 16 × 2 × 16 = 512 targets.
The “group leaders” are the ranks with rank %% 14 == 0, because there are 14 ranks in each group; with 7168 ranks this gives 7168 / 14 = 512 groups, one per target.
mutate(ranks, leader = (rank %% 14 == 0)) %>%
  ggplot(aes(calcbc, leader)) +
  geom_boxplot() +
  scale_x_log10()
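To put numbers behind the boxplot, one could summarize calcbc by leader status, for example (a sketch, assuming the same dplyr setup as the chunks above):
mutate(ranks, leader = (rank %% 14 == 0)) %>%
  group_by(leader) %>%
  summarize(median_calcbc = median(calcbc),
            max_calcbc = max(calcbc))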
Is there a pattern based on rank?
filter(ranks, rank %% 14 == 0) %>%
  ggplot(aes(rank, calcbc)) +
  geom_point() +
  scale_y_log10()
How long does it take to broadcast the blocks to all ranks? This is done within each communicator.
ggplot(ranks, aes(broadcast)) +
  geom_histogram(bins = 50)
We do not have an MPI barrier in place before the broadcast, so ranks that finish their block calculation quickly will appear to have a long broadcast time. We can observe this by looking at the distribution (for all ranks) of the sum of the calculation and broadcast times:
mutate(ranks, leader = (rank %% 14 == 0)) %>%
  ggplot(aes(calcbc + broadcast, leader)) +
  geom_boxplot()
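The same point can be made numerically: if the long broadcast times are really waiting time, the summed times should look similar for leaders and non-leaders. A sketch, using the same columns as above:
mutate(ranks,
       leader = (rank %% 14 == 0),
       calc_plus_bcast = calcbc + broadcast) %>%
  group_by(leader) %>%
  summarize(median_time = median(calc_plus_bcast),
            max_time = max(calc_plus_bcast))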
Next comes the prep phase: each rank figures out the range of block configurations it will be required to create, and we also create some DIY algorithmic objects.
ggplot(ranks, aes(prep)) +
  geom_histogram(bins = 50)
This is all fast, and can be ignored.
Each rank creates its own blocks. How long does it take to create each block?
ggplot(ranks, aes(makeblocks)) +
  geom_histogram(bins = 50)
Block creation is fast, and can be ignored.
We set up for pre-fetching, and we create the function object that will be used to execute the blocks.
ggplot(ranks, aes(makelambda)) +
  geom_histogram(bins = 50)
This is fast and can be ignored.
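To back up these “can be ignored” claims, the worst-case times for the three setup phases can be checked directly (a sketch using only the columns plotted above):
# Maximum time any rank spent in each setup phase.
sapply(ranks[, c("prep", "makeblocks", "makelambda")], max)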
This is where the real work is done: DIY is executing the blocks that do the candidate selection work. During this processing we are reading the data from HEPnOS, converting the HEPnOS-format data to the NOvA format, and calling the NOvA candidate selection code.
ggplot(ranks, aes(executeblock)) +
  geom_histogram(bins = 50)
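Since block execution is expected to dominate, it is worth checking what fraction of each rank's instrumented time it accounts for. A sketch that uses only the phases discussed in this document:
# Fraction of each rank's instrumented time spent executing blocks.
phase_total <- with(ranks, calcbc + broadcast + prep + makeblocks +
                             makelambda + executeblock + reduction)
summary(ranks$executeblock / phase_total)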
This is the time each rank takes to perform the reduction over all the blocks it handles. The reduction concatenates, into block 0 on rank 0, the slice IDs of all slices that pass the candidate selection criteria.
ggplot(ranks, aes(reduction)) +
  geom_histogram(bins = 50)
But see below: this plot is misleading, because we don’t have an MPI barrier before the reduction begins.
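As with the broadcast, summing the execution and reduction times removes the waiting artifact; a sketch in the same style as the earlier plots:
ggplot(ranks, aes(executeblock + reduction)) +
  geom_histogram(bins = 50)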
How many blocks are processed by each rank?
ggplot(ranks, aes(nblocks)) +
  geom_bar()
table(ranks$nblocks)
##
##    3    4    5    6    7    8    9
##    1    9  251 2381 3961  558    7
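The counts sum to 7168, one per rank; the total number of blocks and the average per rank follow directly:
sum(ranks$nblocks)   # total number of blocks across all ranks
mean(ranks$nblocks)  # average number of blocks per rank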
How many events are processed by each rank?
ggplot(ranks, aes(nevts)) + geom_histogram(bins = 50)
ggplot(ranks, aes(nevts)) + geom_boxplot()
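A numeric summary of the per-rank event counts, and the implied average events per block, may also be useful (same columns as above):
summary(ranks$nevts)                  # events per rank
summary(ranks$nevts / ranks$nblocks)  # implied average events per block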
red <- make_reduction_phase_df(raw)
There were 4 rounds of reduction.
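This can be read off the reduction-phase dataframe directly, using the round column that also appears in the plot below:
table(red$round)  # number of reduction records in each round; we expect 4 rounds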
The distribution of reduction times (for each block into which we are aggregating data) shows that reductions are fast:
ggplot(red, aes(duration, group = round)) +
  geom_histogram(bins = 40) +
  facet_wrap(vars(round), scales = "free")