Data Overview

Author

Marc Paterno

Published

February 9, 2023

Introduction

This document provides a short overview of the ICARUS workflow data we have collected on the csresearch machines at FNAL. Its only purpose is to let us verify that we are collecting the right data; this is not the data for an interesting performance study.

To process this document, you need to have the artsupport package installed. It depends on many other packages, primarily on the tidyverse packages.

To install artsupport, do:

 devtools::install_github("marcpaterno/artsupport")

Event timing

We first look at the event timing data. This is read from the TimeEvent table in the timing database. This apparently does not include time taken by the output modules.

library(ggplot2)  # provides ggplot(), aes(), and the geom_*() layers used below
events <- load_all_event_timing()
ggplot(events, aes(time, group = label)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(label), ncol = 1)

Why are the events in the 200 event sample typically slower to process than those in the 280 event sample? I would have expected the same event to take the same time to process, regardless of which data set it was processed as a part of. Let’s investigate that. Since we have the run, subrun and event number for each event, we can compare the timings directly. The cyan line is where the data would lie if event timings were perfectly reproducible.

events_wide <- wide_event_timing()
ggplot(events_wide, aes(time.x, time.y)) +
  # annotate() draws the reference line once, rather than once per data row
  annotate("segment", x = 85, y = 85, xend = 120, yend = 120, color = "cyan") +
  geom_point() +
  labs(x = "processing time in 200 event sample",
       y = "processing time in 280 event sample")
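
The internals of wide_event_timing() are not shown here. As a minimal sketch of the idea, and assuming the long table has the columns run, subrun, event, label and time, matching each event across the two samples amounts to a join on the event identifiers. The toy_events values below are invented for illustration and are not measured data.

```r
library(dplyr)

# Toy stand-in for the long event-timing table (invented numbers).
# Column names are assumed to match those used in this document.
toy_events <- tibble::tibble(
  run    = 1L,
  subrun = 0L,
  event  = c(1L, 2L, 3L, 1L, 2L, 3L),
  label  = rep(c("200", "280"), each = 3),
  time   = c(101.2, 98.7, 110.4, 95.0, 97.9, 104.1)
)

# Join the two samples on the event identifiers. inner_join() keeps only
# events present in both samples and appends .x/.y to the non-key columns,
# so time.x is the 200-sample time and time.y is the 280-sample time.
toy_wide <- inner_join(
  filter(toy_events, label == "200"),
  filter(toy_events, label == "280"),
  by = c("run", "subrun", "event")
)

# Per-event difference in processing time between the two samples.
toy_wide$dt <- toy_wide$time.x - toy_wide$time.y
print(toy_wide)
```

An inner join is used so that events appearing in only one of the samples are dropped rather than producing missing values in the comparison.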

This is not at all the correlation I expected. We can also look at the distribution of the difference in processing times:

events_wide$dt <- events_wide$time.x - events_wide$time.y
ggplot(events_wide, aes(dt)) +
  geom_histogram(bins = 30) +
  labs(x = "difference between 200-set and 280-set event processing times")

This is a spread of almost 10%. Perhaps the individual module timings will provide additional information.

Module timing

The module timing data includes many measurements, so to summarize the distributions we use box and whisker plots.

modules <- load_all_module_timing() 
ggplot(modules, aes(x = time, y = modulelabel, group = modulelabel)) +
  geom_boxplot() +
  facet_wrap(vars(label))

Clearly something is going wrong with the out module for the 280 event sample. That module is a bit complicated to look at, so let’s first look at the other modules. If we remove the out module values, we can compare the remaining modules:

dplyr::filter(modules, modulelabel != "out") |>
  ggplot(aes(x = time, y = modulelabel, group = modulelabel)) +
  geom_boxplot() +
  facet_wrap(vars(label))

For some reason, in the 280 event sample, the MCDecodeTPCROI module is clearly faster. The roifinder module is also faster.
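
To put numbers on a comparison like this, one option is to tabulate the median time per module in each sample. The following is only a sketch on invented data: toy_modules and its values are hypothetical, and the column names are assumed to match the modules table used above.

```r
library(dplyr)

# Toy stand-in for the module-timing table (invented numbers).
toy_modules <- tibble::tibble(
  label       = rep(c("200", "280"), each = 4),
  modulelabel = rep(c("MCDecodeTPCROI", "MCDecodeTPCROI",
                      "roifinder", "roifinder"), times = 2),
  time        = c(2.0, 2.2, 5.0, 5.4, 1.5, 1.7, 4.0, 4.2)
)

# Median time per module and sample, one row per module, plus the ratio
# of the 280-sample median to the 200-sample median (ratio < 1 means the
# module was faster in the 280 event sample).
module_medians <- toy_modules |>
  filter(modulelabel != "out") |>
  group_by(modulelabel, label) |>
  summarise(median_time = median(time), .groups = "drop") |>
  tidyr::pivot_wider(names_from = label, values_from = median_time,
                     names_prefix = "median_") |>
  mutate(ratio = median_280 / median_200)

print(module_medians)
```

The median is used rather than the mean so that a few slow outliers, which the box plots suggest may be present, do not dominate the comparison.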

However, the timing of the out module shows a clear problem. Note that the output module is recorded in two parts: the part labeled with the suffix “(write)” is the time actually spent writing. Because the range of values differs so much between the two samples, the two panels use different scales on the \(x\) axis.

dplyr::filter(modules, moduletype == "LoadbalancingOutput(write)") |>
  ggplot(aes(time)) +
  geom_histogram(bins=30) +
  facet_wrap(vars(label), ncol = 1, scales="free_x")

We are also interested in the time it takes to read the data through HEPnOS. Unlike the RootInputModule, which implements on-demand reading (data products are not actually read from ROOT until a module gets the product from an event), LoadbalancingInput reads the products eagerly.

dplyr::filter(modules, modulelabel == "source") |>
  ggplot(aes(time)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(label), ncol = 1) +
  scale_x_log10()

Unlike the writing, the reading from HEPnOS does not seem to yield different performance in the two different sets.

Memory usage

The memory tracker databases are supposed to have module-by-module and event-by-event information. This data is missing from these timing database files: the tables are present, but they have zero rows. This is because we are running art with multiple threads, in which case the memory tracker cannot identify how much memory is used by each module.

Events in each subrun

We are observing a very poor balance in the distribution of events across the server nodes. Matthieu notes that HEPnOS tries to send all the events of a specific subrun to one database. So we want to know how many events are in each subrun.

events |>
  dplyr::filter(label == "200", exp == "1") |>
  dplyr::group_by(run, subrun) |>
  dplyr::summarise(nevents = dplyr::n(), .groups = "drop") |>
  dplyr::mutate(sr = sprintf("r%d_s%d", run, subrun)) |>
  ggplot(aes(x = nevents, y = sr, group = sr)) +
  geom_boxplot()

All the events are in the same subrun. This seems to be the reason for the imbalance. This is even more easily seen in a table:

events |>
  dplyr::filter(label == "200", exp == "1") |>
  dplyr::group_by(run, subrun) |>
  dplyr::summarise(nevents = dplyr::n(), .groups = "drop") |>
  knitr::kable()

run	subrun	nevents
1	0	200
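
The link between a single subrun and the imbalance can be illustrated with a toy placement rule. This is not the HEPnOS placement algorithm: place_by_subrun() and n_databases are hypothetical, chosen only to show what happens when the target database depends on (run, subrun) and not on the event number.

```r
library(dplyr)

n_databases <- 4L  # hypothetical number of target databases

# 200 events, all in run 1, subrun 0, as in the table above.
toy_events <- tibble::tibble(run = 1L, subrun = 0L, event = 1:200)

# Hypothetical placement rule (NOT the real HEPnOS algorithm): the target
# database is a function of run and subrun only.
place_by_subrun <- function(run, subrun, n) {
  ((run * 31L + subrun) %% n) + 1L
}

# Count how many events land in each database under this rule.
per_db <- toy_events |>
  mutate(db = place_by_subrun(run, subrun, n_databases)) |>
  count(db, name = "nevents")

print(per_db)
```

Under any rule of this kind, events that share a run and subrun share a database, so a sample with a single subrun cannot be spread across more than one database regardless of how many are available.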