This document presents an analysis of the HEPnOS large-run timing data, to determine whether more work is needed on the code before we write our paper.
We read the raw data and make the per-rank global dataframe.
raw <- readRDS("theta_es_7168_2020-06-30_01.rds")
ranks <- make_global_df(raw)
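As a quick sanity check (a sketch, assuming make_global_df returns one row per rank), the dataframe should have one row for each of the 7168 ranks in this run:
nrow(ranks)  # expect 7168, one row per rank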
According to the batch job log file, the program takes about 574 seconds to run. From internal timing, the span between the first timestamp we capture and the last is 473.8 seconds.
The difference, roughly 100 seconds, seems to be MPI startup and shutdown time.
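A sketch of how that internal span could be computed from the per-rank dataframe; the column names first_ts and last_ts are hypothetical, used only for illustration:
# Hypothetical columns: first_ts / last_ts are each rank's earliest and
# latest captured timestamps, in seconds.
max(ranks$last_ts) - min(ranks$first_ts)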
How long does it take to calculate block configurations? We have a communicator (a “group”) for each HEPnOS target, and only the group’s leader does this calculation for that target. The calculation involves communicating with the target, using an Eventset, to determine which events that target owns.
The HEPnOS daemon ran on 16 nodes, with 2 daemon ranks on each node and 16 targets per daemon rank, giving a total of 16 × 2 × 16 = 512 targets.
The “group leaders” are the ranks with rank %% 14 == 0, because there are 14 ranks in each group; with 7168 ranks this gives 7168 / 14 = 512 groups, one per target.
mutate(ranks, leader = (rank %% 14 == 0)) %>%
  ggplot(aes(calcbc, leader)) +
  geom_boxplot() +
  scale_x_log10()
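To put numbers behind the boxplot, one could summarize calcbc by leader status, for example (a sketch, assuming the same dplyr setup as the chunks above):
mutate(ranks, leader = (rank %% 14 == 0)) %>%
  group_by(leader) %>%
  summarize(median_calcbc = median(calcbc),
            max_calcbc = max(calcbc))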
Is there a pattern based on rank?
filter(ranks, rank %% 14 == 0) %>%
  ggplot(aes(rank, calcbc)) +
  geom_point() +
  scale_y_log10()
How long does it take to broadcast the blocks to all ranks? This is done within each communicator.
ggplot(ranks, aes(broadcast)) +
  geom_histogram(bins = 50)
We do not have an MPI barrier in place before the broadcast, so ranks that finish their block calculation quickly will appear to have a long broadcast time. We can observe this by looking at the distribution (for all ranks) of the sum of the calculation and broadcast times:
mutate(ranks, leader = (rank %% 14 == 0)) %>%
  ggplot(aes(calcbc + broadcast, leader)) +
  geom_boxplot()
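The same point can be made numerically: if the long broadcast times are really waiting time, the summed times should look similar for leaders and non-leaders. A sketch, using the same columns as above:
mutate(ranks,
       leader = (rank %% 14 == 0),
       calc_plus_bcast = calcbc + broadcast) %>%
  group_by(leader) %>%
  summarize(median_time = median(calc_plus_bcast),
            max_time = max(calc_plus_bcast))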
Next comes the prep phase: each rank figures out the range of block configurations it will be required to create, and we also create some DIY algorithmic objects.
ggplot(ranks, aes(prep)) +
  geom_histogram(bins = 50)
This is all fast, and can be ignored.
Each rank creates its own blocks. How long does it take to create each block?
ggplot(ranks, aes(makeblocks)) +
  geom_histogram(bins = 50)
Block creation is fast, and can be ignored.
We set up for pre-fetching, and we create the function object that will be used to execute the blocks.
ggplot(ranks, aes(makelambda)) +
  geom_histogram(bins = 50)
This is fast and can be ignored.
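To back up these “can be ignored” claims, the worst-case times for the three setup phases can be checked directly (a sketch using only the columns plotted above):
# Maximum time any rank spent in each setup phase.
sapply(ranks[, c("prep", "makeblocks", "makelambda")], max)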
This is where the real work is done: DIY is executing the blocks that do the candidate selection work. During this processing we are reading the data from HEPnOS, converting the HEPnOS-format data to the NOvA format, and calling the NOvA candidate selection code.
ggplot(ranks, aes(executeblock)) +
  geom_histogram(bins = 50)
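Since block execution is expected to dominate, it is worth checking what fraction of each rank's instrumented time it accounts for. A sketch that uses only the phases discussed in this document:
# Fraction of each rank's instrumented time spent executing blocks.
phase_total <- with(ranks, calcbc + broadcast + prep + makeblocks +
                             makelambda + executeblock + reduction)
summary(ranks$executeblock / phase_total)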
This is the time each rank takes to perform the reduction over all the blocks it handles. The reduction concatenates, into block 0 on rank 0, the slice IDs of all slices that pass the candidate selection criteria.
ggplot(ranks, aes(reduction)) +
  geom_histogram(bins = 50)
But see below: this plot is misleading, because we don’t have an MPI barrier before the reduction begins.
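As with the broadcast, summing the execution and reduction times removes the waiting artifact; a sketch in the same style as the earlier plots:
ggplot(ranks, aes(executeblock + reduction)) +
  geom_histogram(bins = 50)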
How many blocks are processed by each rank?
ggplot(ranks, aes(nblocks)) +
  geom_bar()
table(ranks$nblocks)
##
##    3    4    5    6    7    8    9
##    1    9  251 2381 3961  558    7
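The counts sum to 7168, one per rank; the total number of blocks and the average per rank follow directly:
sum(ranks$nblocks)   # total number of blocks across all ranks
mean(ranks$nblocks)  # average number of blocks per rank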
How many events are processed by each rank?
ggplot(ranks, aes(nevts)) + geom_histogram(bins = 50)
ggplot(ranks, aes(nevts)) + geom_boxplot()
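A numeric summary of the per-rank event counts, and the implied average events per block, may also be useful (same columns as above):
summary(ranks$nevts)                  # events per rank
summary(ranks$nevts / ranks$nblocks)  # implied average events per block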
red <- make_reduction_phase_df(raw)
There were 4 rounds of reduction.
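This can be read off the reduction-phase dataframe directly, using the round column that also appears in the plot below:
table(red$round)  # number of reduction records in each round; we expect 4 rounds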
The distribution of reduction times (for each block into which we are aggregating data) shows that reductions are fast:
ggplot(red, aes(duration, group = round)) +
  geom_histogram(bins = 40) +
  facet_wrap(vars(round), scales = "free")