1 Introduction

This document presents an analysis of the reading performance of PandAna. PandAna uses the Python package h5py (http://h5py.org) to read tabular data stored in HDF5 (https://portal.hdfgroup.org/display/HDF5/HDF5).

We have run a sample PandAna application on 4 Haswell nodes of Cori at NERSC, using the same code and a varying number of MPI ranks. The data being read are contained in a single 16 GB HDF5 file holding the NOvA h5caf-format data for 165 subruns. The analysis code can be found in the PandAna repository (https://bitbucket.org/mpaterno/pandana), in Demos/candidate_selection.py.

2 Experimental setup

We used 4 Haswell nodes on Cori for all experiments.

# Read in the data
dirs <- paste("local-analysis",
              dir(path = "local-analysis", pattern = "cori_[[:digit:]+_[[:digit:]]+"),
              sep="/")
frames <- lapply(dirs, read_pandana_perf_experiment)

We have 7 experiments, each with a different number of ranks used. For each rank in each experiment, we record a timestamp at various “events” in the program execution.

main <- bind_rows(lapply(frames, make_main_df))
main

3 Analysis of timing summary data

First we look at the timing summary data for each run, and how that time varies with the number of ranks used.

# Summarize each experiment by the maximum (slowest) time over all ranks.
main_summary <-
  group_by(main, nranks) %>%
  summarize(read = max(go), compute = max(fillSpectra), total = max(total), .groups = "drop")
main_summary

The 4 Haswell nodes have a total of 128 physical cores and 256 hyperthreads. Since the run time for 256 ranks is longer than that for 128 ranks, it is clear that hyperthreading is of no value here. For the remainder of the plots, we will include only the runs with up to 128 ranks.
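A one-line check of the hyperthreading penalty, using the summary above (a sketch):

# Ratio of the 256-rank total run time to the 128-rank total run time.
with(main_summary, total[nranks == 256] / total[nranks == 128])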

# Drop the 256-rank (hyperthreaded) runs.
main_summary <- filter(main_summary, nranks <= 128)
log2breaks <- c(4,8,16,32,64,128,256)
ggplot(main_summary, aes(nranks, total)) +
  geom_smooth(method = "lm", formula = "y~x") +
  geom_point() +
  scale_x_log10(breaks = log2breaks) +
  scale_y_log10() +
  labs(x="Total ranks", y="Total run time (s)")
Total run time as a function of the number of ranks used.

The total run time shows good scaling except at the largest number of ranks.
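To make the scaling concrete, here is a sketch of the parallel efficiency relative to the 4-rank run; perfect strong scaling would give an efficiency of 1.

# Parallel efficiency of the total run time, relative to the 4-rank run.
main_summary %>%
  mutate(efficiency = (total[nranks == 4] * 4) / (total * nranks)) %>%
  select(nranks, total, efficiency)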

ggplot(main_summary, aes(nranks, compute)) +
  geom_smooth(method = "lm", formula = "y~x") +
  geom_point() +
  scale_x_log10(breaks = log2breaks) +
  scale_y_log10() +
  labs(x="Total ranks", y="Total computing time (s)")
Total computation time as a function of the number of ranks used.

The compute time (the portion of the work that is done in parallel) shows the essentially perfect scaling we expect.
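As a numerical check (a sketch), the slope of a log-log fit of compute time against the number of ranks should be close to -1 for perfect strong scaling:

# Log2-log2 slope of compute time versus rank count; -1 means perfect scaling.
coef(lm(log2(compute) ~ log2(nranks), data = main_summary))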

ggplot(main_summary, aes(nranks, read)) +
  #geom_smooth(method = "lm", formula = "y~x") +
  geom_point() +
  scale_x_log10(breaks = log2breaks) +
  #scale_y_log10() +
  labs(x="Total ranks", y="Total reading time (s)")
Total reading time as a function of the number of ranks used.

The read performance does not scale well at all. However, since the reading is always quite fast, it makes little difference to the total performance except at the largest number of ranks used.
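A sketch of just how little difference it makes: reading time as a fraction of total run time.

# Reading time as a fraction of total run time, by rank count.
main_summary %>%
  mutate(read_fraction = read / total) %>%
  select(nranks, read, total, read_fraction)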

We will need to use the much larger file to have any hope of seeing good scaling performance even for small MPI jobs, e.g., using 128 ranks.
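Some rough arithmetic behind that expectation (a sketch, assuming the 16 GB file were divided evenly among ranks and ignoring columns such as evt.seq that every rank reads in full):

# Approximate data volume available to each rank if 16 GB were split evenly.
tibble(nranks = c(4, 8, 16, 32, 64, 128)) %>%
  mutate(mb_per_rank = 16 * 1024 / nranks)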

4 Analysis of reading details

4.1 Reading speed

The analysis above shows that we are not getting good scaling in the reading performance, even going from 4 to 8 ranks. We will now look at more detailed read performance information.

read <- bind_rows(lapply(frames, make_read_df))
read

Averaged over all reads done in each program, we see that the reading speed decreases as the number of ranks increases:

group_by(read, nranks) %>%
  summarize(speed = mean(speed), .groups="drop") %>%
  ggplot(aes(nranks, speed)) +
  geom_point() +
  scale_x_log10(breaks = log2breaks) +
  scale_y_continuous(limits=c(0, NA)) +
  labs(x= "Number of ranks", y="Reading speed (MB/s)")

Even at 4 ranks, we are not obtaining high reading speeds. We can see how the reading speed tends to decrease as the size of each read decreases. Note that the line is drawn only to show the trend; there is no reason to expect a linear relationship, and in fact the quality of this fit is terrible.

read %>%
  ggplot(aes(size, speed)) +
  geom_smooth(aes(group = factor(nranks)), method = "lm", formula = "y~x", color = "red") +
  geom_point(shape=".") +
  scale_x_log10(labels=scales::number) +
  scale_y_log10(labels=scales::number) +
  labs(x = "Read size (MB)", y = "Reading speed (MB/s)") +
  facet_wrap(vars(nranks), ncol=1)
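The poor fit quality can be quantified (a sketch): the R-squared of the log-log relationship between read size and speed at each rank count.

# R-squared of the log10(speed) ~ log10(size) relationship, by rank count.
read %>%
  group_by(nranks) %>%
  summarize(r_squared = cor(log10(speed), log10(size))^2, .groups = "drop")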

It would appear from these data that to obtain good reading speed, our read sizes should be at least a few MB. However, limiting the number of ranks used by the program to obtain such read sizes would severely hurt the parallel processing speed of the computational part of the program, which is the more time-consuming part.

4.2 Data organization

Is there some different organization of our data that would allow for faster reading? The data are already in contiguous datasets. Perhaps uncompressed data would perform much better. Note that our read sizes are small compared to the chunk size in the datasets, which is 1 MB.
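As a quick check of that claim, here is a sketch of the fraction of reads smaller than one chunk (size is measured in MB, as in the plots above):

# Fraction of reads smaller than the 1 MB chunk size, by rank count.
read %>%
  group_by(nranks) %>%
  summarize(frac_below_chunk = mean(size < 1), .groups = "drop")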

read %>%
  ggplot(aes(size)) +
  geom_histogram(bins=50) +
  scale_x_log10(labels=scales::number) +
  facet_wrap(vars(nranks), scales = "free")

What is the large dataset? The dataset names appearing in reads larger than 5 MB are:

read %>% filter(size > 5) %>% pull(dataset) %>% unique()
## [1] "evt.seq"

This is because, in all cases, the evt.seq column is read in its entirety by every rank of the program. Perhaps this column, and no others, should be compressed.
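A sketch of how much evt.seq traffic this produces: because every rank reads the full column, the total volume grows with the number of ranks.

# Total volume and number of evt.seq reads, by rank count.
read %>%
  filter(dataset == "evt.seq") %>%
  group_by(nranks) %>%
  summarize(total_mb = sum(size), n_reads = n(), .groups = "drop")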

If we filter out this dataset, and look at the size distribution for all others, we see:

read %>%
  filter(dataset != "evt.seq") %>% 
  ggplot(aes(size)) +
  geom_histogram(bins=50) +
  scale_x_log10(labels=scales::number) +
  facet_wrap(vars(nranks), scales = "free")

4.3 More detailed analysis for nranks=4

read4 <- filter(read, nranks == 4) %>%
  mutate(group = factor(group), dataset = factor(dataset))
read4

Here is a repeat of the speed vs size plot for just the nranks==4 data, with coloring by group. Note that all columns in a group have the same number of entries read, but the sizes can be different because size counts the number of megabytes read, not the number of objects read.

read4 %>% 
  ggplot(aes(size, speed)) +
  geom_smooth(method = "lm", formula = "y~x", color = "red") +
  geom_point(aes(color = group, shape = group)) +
  scale_x_log10(labels=scales::number) +
  scale_y_log10(labels=scales::number) +
  scale_shape_manual(values=1:nlevels(read4$group)) +
  labs(x = "Read size (MB)", y = "Reading speed (MB/s)")

There is a very large spread in reading speeds for reads of the same size. Picking just one group, rec.vtx.elastic.fuzzyk.png.shwlid, we can see the distribution. This time, we use shape to indicate the dataset name and color to indicate the rank:

read4 %>% 
  filter(group == "rec.vtx.elastic.fuzzyk.png.shwlid") %>%
  mutate(rank = factor(rank)) %>% 
  ggplot(aes(size, speed)) +
  geom_point(aes(shape = dataset, color = rank)) +
  scale_x_log10(labels=scales::number) +
  scale_y_log10(labels=scales::number) +
  scale_shape_manual(values=1:nlevels(read4$dataset)) +
  labs(x = "Read size (MB)", y = "Reading speed (MB/s)")

The relatively large size of the evt.seq reads makes it hard to see other details, so we’ll suppress it and plot again. As before, color denotes the rank doing the reading, and shape denotes the dataset being read.

read4 %>% 
  filter(group == "rec.vtx.elastic.fuzzyk.png.shwlid",
         dataset != "evt.seq") %>%
  mutate(rank = factor(rank)) %>% 
  ggplot(aes(size, speed)) +
  geom_point(aes(shape = dataset, color = rank)) +
  scale_y_log10(labels=scales::number) +
  scale_shape_manual(values=1:nlevels(read4$dataset)) +
  labs(x = "Read size (MB)", y = "Reading speed (MB/s)")

We see that different ranks read different sizes. This is because each rank reads the same number of events, but each event can contain different numbers of objects — in this case, of prongs. But within a rank, even for reads of the same size, we see more than a 10-fold range of reading speeds.
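To quantify that spread (a sketch), we can group reads of identical size within each rank and compare the fastest to the slowest:

# Ratio of fastest to slowest speed among reads of identical size, per rank.
read4 %>%
  filter(group == "rec.vtx.elastic.fuzzyk.png.shwlid",
         dataset != "evt.seq") %>%
  group_by(rank, size) %>%
  summarize(n_reads = n(), speed_ratio = max(speed) / min(speed), .groups = "drop") %>%
  arrange(desc(speed_ratio))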

Concentrating on a single rank (we’ll choose rank 1, which has mid-sized reads), we can look for some kind of trend.

rank1 <- filter(read4, rank == 1, group == "rec.vtx.elastic.fuzzyk.png.shwlid")

We can plot the speed as a function of the time at which the read is done. We draw a bar for each read, from the start of the read to the end of the read. We use a different color for each dataset.

rank1 %>% 
  mutate(dataset = reorder(dataset, start)) %>% 
  ggplot(aes(xmin=start, xmax=stop, y=speed, color=dataset)) +
  geom_errorbarh(height=25)
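A numeric check of the time trend in this plot (a sketch): the Spearman correlation between a read’s start time and its speed; a negative value means later reads are slower.

# Rank correlation between read start time and speed for rank 1.
cor(rank1$start, rank1$speed, method = "spearman")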

The earlier reads tend to be higher speed than the later reads. Does this pattern hold for different ranks?

read4 %>% 
  filter(group == "rec.vtx.elastic.fuzzyk.png.shwlid") %>% 
  mutate(dataset = reorder(dataset, start)) %>% 
  ggplot(aes(xmin=start, xmax=stop, y=speed, color=dataset)) +
  geom_errorbarh(height=25) +
  facet_wrap(vars(rank))

And what about for all datasets?

read4 %>% 
  mutate(dataset = reorder(dataset, start)) %>% 
  ggplot(aes(xmin=start, xmax=stop, y=speed, color=dataset)) +
  geom_errorbarh(height=25, show.legend=FALSE) +
  facet_wrap(vars(rank))

The pattern does not hold for all datasets. Very few reads have speeds over 700 MB/s. Let’s look at those that do.

read4 %>% 
  filter(speed > 700) %>% 
  select(group, dataset, rank, nbytes, speed)

The only dataset names that appear are run, cycle, subrun. These are all datasets with large compression factors.

If we leave out these datasets, what do we see? This time, we use color to denote groups.

fast_datasets <- filter(read4, speed>700) %>% pull(dataset) %>% unique()
read4 %>%
  filter(!(dataset %in% fast_datasets)) %>% 
  mutate(dataset = reorder(dataset, start)) %>% 
  ggplot(aes(xmin=start, xmax=stop, y=speed, color=group)) +
  geom_errorbarh(height=25) +
  facet_wrap(vars(rank), ncol=1)