# Read the raw per-rank timing output for each run and build per-event data frames.
raw_112_64 <- read_raw_dataframes("campaign_4/112_64/timing*.dat")
raw_008_64 <- read_raw_dataframes("campaign_4/008_64/timing*.dat")
events_112_64 <- make_events_df(raw_112_64)
events_008_64 <- make_events_df(raw_008_64)
We are looking at runs against a HEPnOS daemon deployed on 16 nodes, with 512 targets in total.
We compare two runs of the eventselection program:

1. 112 nodes, each running 64 ranks.
2. 8 nodes, each running 64 ranks.
The dataset used is the 1929 subrun sample from the NOvA ND.
There are 3,979,542 events in the data sample.
According to the batch job logs, the total job run time was ~639 seconds for the 112 node run and ~614 seconds for the 8 node run.
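Just to make the rank counts explicit, here is the arithmetic implied by the node counts and ranks per node above:

112 * 64   # total ranks in the 112 node run
## [1] 7168
8 * 64     # total ranks in the 8 node run
## [1] 512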
Looking only at the timing recorded while the MPI program is running, we see the distribution of total run times by rank:
# Aggregate to one row per rank: event count plus summed times (seconds) and bytes.
ebr_112_64 <- events_112_64 %>%
  group_by(rank) %>%
  summarize(nevents = n(),
            load = sum(load),
            rec = sum(rec),
            filt = sum(filt),
            nslices = sum(nslices),
            nbytes = sum(nbytes),
            .groups = "drop")

ebr_008_64 <- events_008_64 %>%
  group_by(rank) %>%
  summarize(nevents = n(),
            load = sum(load),
            rec = sum(rec),
            filt = sum(filt),
            nslices = sum(nslices),
            nbytes = sum(nbytes),
            .groups = "drop")

# Combine the two runs, labeled by node count.
ebr <- bind_rows("8" = ebr_008_64, "112" = ebr_112_64, .id = "run")
ebr$run <- as.integer(ebr$run)
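One way to see the per-rank total-time distributions mentioned above is a histogram of the summed times (a sketch; it assumes ggplot2 is loaded and takes load + rec + filt as each rank's total time):

# Distribution of total time per rank, faceted by run.
ggplot(ebr, aes(load + rec + filt)) +
  geom_histogram(bins = 50) +
  facet_wrap(vars(run), scales = "free") +
  labs(x = "Total time per rank (sec)",
       y = "Number of ranks",
       title = "Distribution of per-rank total time for 8 and 112 node runs.")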
summary(ebr_112_64)
##       rank         nevents           load             rec          
##  Min.   :   0   Min.   :334.0   Min.   :100.5   Min.   :0.03735  
##  1st Qu.:1792   1st Qu.:500.0   1st Qu.:145.6   1st Qu.:0.06932  
##  Median :3584   Median :599.0   Median :172.7   Median :0.08003  
##  Mean   :3584   Mean   :555.2   Mean   :164.5   Mean   :0.07603  
##  3rd Qu.:5375   3rd Qu.:600.0   3rd Qu.:183.7   3rd Qu.:0.08279  
##  Max.   :7167   Max.   :600.0   Max.   :198.1   Max.   :0.09279  
##       filt           nslices         nbytes       
##  Min.   :0.8785   Min.   : 729   Min.   :1104288  
##  1st Qu.:4.1038   1st Qu.:2301   1st Qu.:3537753  
##  Median :4.6966   Median :2591   Median :3980346  
##  Mean   :4.4621   Mean   :2494   Mean   :3835275  
##  3rd Qu.:4.9446   3rd Qu.:2759   3rd Qu.:4245807  
##  Max.   :6.0373   Max.   :3194   Max.   :5154600  
summary(ebr_008_64)
##       rank          nevents          load             rec        
##  Min.   :  0.0   Min.   :7182   Min.   :103.7   Min.   :0.9577  
##  1st Qu.:127.8   1st Qu.:7717   1st Qu.:107.6   1st Qu.:1.0537  
##  Median :255.5   Median :7784   Median :109.7   Median :1.0683  
##  Mean   :255.5   Mean   :7773   Mean   :110.9   Mean   :1.0678  
##  3rd Qu.:383.2   3rd Qu.:7832   3rd Qu.:113.8   3rd Qu.:1.0824  
##  Max.   :511.0   Max.   :7972   Max.   :131.6   Max.   :1.2927  
##       filt          nslices          nbytes        
##  Min.   :44.00   Min.   :22521   Min.   :34735164  
##  1st Qu.:65.13   1st Qu.:34510   1st Qu.:53028810  
##  Median :67.04   Median :35562   Median :54653988  
##  Mean   :66.49   Mean   :34919   Mean   :53693848  
##  3rd Qu.:68.73   3rd Qu.:36214   3rd Qu.:55694382  
##  Max.   :96.10   Max.   :38475   Max.   :59072916  
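As a cross-check (a sketch; it assumes every event in the sample is processed exactly once across the ranks of each run), the per-run totals of nevents should each sum to the 3,979,542 events quoted above:

# Total events processed per run.
ebr %>%
  group_by(run) %>%
  summarize(total_events = sum(nevents), .groups = "drop")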
The filtering step scales nearly perfectly. We can see this by looking at the number of events per second processed by each rank. We plot this as a function of the number of events processed by the rank, to see whether there is any variation.
ggplot(ebr, aes(nevents, nevents/filt)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_point(alpha = 0.3) +
  facet_wrap(vars(run), scales = "free_x") +
  labs(x = "Number of events processed by the rank",
       y = "events/sec for each rank",
       title = "Per-rank throughput of filtering step for 8 and 112 node runs.")
The loading does not scale well. Recall that in each case we are using a HEPnOS daemon with 512 targets. This means the 8 node run has only 1 client rank per target, while the 112 node run has 14 client ranks per target.
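The arithmetic behind those ratios, as a quick check:

(8 * 64) / 512     # client ranks per target, 8 node run
## [1] 1
(112 * 64) / 512   # client ranks per target, 112 node run
## [1] 14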
ggplot(ebr, aes(nevents, nevents/load)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_point(alpha = 0.3) +
  facet_wrap(vars(run), scales = "free_x") +
  labs(x = "Number of events loaded by the rank",
       y = "events/sec for each rank",
       title = "Per-rank throughput of data loading step for 8 and 112 node runs.")
We can also look at the aggregate loading rate for the whole program. I am not sure of the best way to define this; what I have done here is to sum the rate for each rank in the program, giving a throughput in events/second.
ebr %>%
  group_by(run) %>%
  summarize(throughput = sum(nevents / load), .groups = "drop") %>%
  knitr::kable()
| run | throughput (events/sec) |
|----:|------------------------:|
|   8 |                35928.85 |
| 112 |                24483.10 |
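An alternative aggregate definition one could use instead (offered only as a sketch, not a number computed in this report): divide the total number of events by the load time of the slowest rank, so that loading counts as finished only when the last rank finishes.

# Aggregate load throughput limited by the slowest rank in each run.
ebr %>%
  group_by(run) %>%
  summarize(throughput_alt = sum(nevents) / max(load), .groups = "drop") %>%
  knitr::kable()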