Mu2e GeantMT Timing

Introduction

This document is a brief exploration of the timing studies of the Mu2e use of multithreading in Geant4 done by Lisa Goodenough.

First, we read the names of the files, and sort through filenames to deduce some metadata. I have guessed at the meanings of the integers in the filenames.

Next we create the dataframe:

head(td)

## # A tibble: 6 x 9
##   Event Path  ModuleLabel  ModuleType    Time sample nthreads batchz evts 
##   <int> <fct> <fct>        <fct>        <dbl>  <int>    <int>  <int> <chr>
## 1     1 p1    generate     StoppedMu… 1.92e⁻³      1        1      1 10k  
## 2     1 p1    g4run        Mu2eG4     1.44e⁻¹      2        1      1 10k  
## 3     1 p1    protonTimeM… GenerateP… 1.85e⁻³      3        1      1 10k  
## 4     1 p1    muonTimeMap  GenerateM… 4.47e⁻⁴      4        1      1 10k  
## 5     1 p1    makeSD       StrawDigi… 1.10e⁻²      5        1      1 10k  
## 6     1 p1    makeSH       StrawHits… 2.04e⁻³      6        1      1 10k

Looking at the large batch run

We can filter on the batch size to pick out the measurements from the runs with 20,000 events in each batch:

big <- td %>% filter(batchz == "20000")
head(big)

## # A tibble: 6 x 9
##   Event Path  ModuleLabel  ModuleType    Time sample nthreads batchz evts 
##   <int> <fct> <fct>        <fct>        <dbl>  <int>    <int>  <int> <chr>
## 1     1 p1    generate     StoppedMu… 3.48e⁻²      1      128  20000 60000
## 2     1 p1    g4run        Mu2eG4     3.59e⁺¹      2      128  20000 60000
## 3     1 p1    TriggerResu… TriggerRe… 2.68e⁻⁴      3      128  20000 60000
## 4     2 p1    generate     StoppedMu… 1.75e⁻⁴      4      128  20000 60000
## 5     2 p1    g4run        Mu2eG4     1.09e⁻³      5      128  20000 60000
## 6     2 p1    TriggerResu… TriggerRe… 1.91e⁻⁴      6      128  20000 60000

The distribution of runtimes for the g4run module shows a few peaks. Note that the histogram limits are calculated so that all data are shown. We can’t see the entries in the higher bins because the counts are so small.

# We define logx for convenience in upcoming plots.
logx = list(x = list(log = 10, equispaced = FALSE))
big %>% filter(ModuleLabel == "g4run") %>% 
  histogram(~Time|as.factor(nthreads), .,
            scales = logx,
            nint = 100,
            type = "c")

We can zoom in to see the structure of the low-end peaks:

big %>% filter(ModuleLabel == "g4run", Time < 0.1) %>%
  histogram(~Time|as.factor(nthreads), .,
            scales = logx,
            nint = 100,
            type = "c")

We can look at the few events with a runtime > 0.1 seconds:

big %>% filter(ModuleLabel == "g4run", Time > 0.1)

## # A tibble: 9 x 9
##   Event Path  ModuleLabel ModuleType  Time sample nthreads batchz evts 
##   <int> <fct> <fct>       <fct>      <dbl>  <int>    <int>  <int> <chr>
## 1     1 p1    g4run       Mu2eG4      35.9      2      128  20000 60000
## 2 20001 p1    g4run       Mu2eG4      32.9  60002      128  20000 60000
## 3 40001 p1    g4run       Mu2eG4      33.7 120002      128  20000 60000
## 4     1 p1    g4run       Mu2eG4      34.7      2      256  20000 60000
## 5 20001 p1    g4run       Mu2eG4      52.1  60002      256  20000 60000
## 6 40001 p1    g4run       Mu2eG4      30.7 120002      256  20000 60000
## 7     1 p1    g4run       Mu2eG4      45.6      2       64  20000 60000
## 8 20001 p1    g4run       Mu2eG4      52.5  60002       64  20000 60000
## 9 40001 p1    g4run       Mu2eG4      46.6 120002       64  20000 60000

While there are only a few events in which we do real generation, there are many events in which we’re just popping data out of the cache in the g4run module. While each call is quick, they add up. For the events that are doing generation:

big %>% filter(ModuleLabel == "g4run", Time > 0.1) %>%
  group_by(nthreads) %>%
  summarize(nevents = n(),
            med = median(Time),
            total = sum(Time))

## # A tibble: 3 x 4
##   nthreads nevents   med total
##      <int>   <int> <dbl> <dbl>
## 1       64       3  46.6   145
## 2      128       3  33.7   103
## 3      256       3  34.7   118

For the events in which we’re just popping data out of the cache:

big %>% filter(ModuleLabel == "g4run", Time < 0.1) %>%
  group_by(nthreads) %>%
  summarize(nevents = n(),
            med = median(Time),
            total = sum(Time))

## # A tibble: 3 x 4
##   nthreads nevents    med total
##      <int>   <int>  <dbl> <dbl>
## 1       64   59997 0.0134   670
## 2      128   59997 0.0137   674
## 3      256   59997 0.0135   670

Little time is spent in the other modules:

big %>% group_by(ModuleLabel, nthreads) %>%
  summarize(nevents = n(),
            med = median(Time),
            total = sum(Time))

## # A tibble: 9 x 5
## # Groups:   ModuleLabel [?]
##   ModuleLabel    nthreads nevents       med  total
##   <fct>             <int>   <int>     <dbl>  <dbl>
## 1 g4run                64   60000 0.0134    814   
## 2 g4run               128   60000 0.0137    777   
## 3 g4run               256   60000 0.0135    788   
## 4 generate             64   60000 0.000169   10.5 
## 5 generate            128   60000 0.000176   10.9 
## 6 generate            256   60000 0.000178   11.1 
## 7 TriggerResults       64   60000 0.0000976   5.96
## 8 TriggerResults      128   60000 0.000106    6.48
## 9 TriggerResults      256   60000 0.000113    6.94

The timing databases also have a table which contains the time to process each event in full. This time includes overheads that the module timing doesn’t see.

files.list <- dir(pattern=glob2rx("time03_*_20000_60000.db"))
nthreads.list <- sapply(files.list,
                        function(x) {str_split(x, pattern="_", simplify =TRUE)[2]},
                        USE.NAMES=FALSE)
df.list <- mapply(load_event_timing,
                  files.list,
                  nthreads.list,
                  SIMPLIFY = FALSE,
                  USE.NAMES = FALSE)
event.df <- bind_rows(df.list)
rm(df.list)
library(magrittr)
event.df %<>% mutate(nthreads = as.integer(lbl), lbl = NULL)
head(event.df)

## # A tibble: 6 x 6
##     Run SubRun Event     Time sample nthreads
##   <int>  <int> <int>    <dbl>  <int>    <int>
## 1     1      0     1 36.0          1      128
## 2     1      0     2  0.00239      2      128
## 3     1      0     3  0.0100       3      128
## 4     1      0     4  0.0149       4      128
## 5     1      0     5  0.00843      5      128
## 6     1      0     6  0.00203      6      128

The sum of all event times is, by number of threads, is:

event.df %>% group_by(nthreads) %>% summarize(tot = sum(Time))

## # A tibble: 3 x 2
##   nthreads   tot
##      <int> <dbl>
## 1       64   879
## 2      128   855
## 3      256   931

The overhead encountered during the processing of the first event accounts for nearly all the difference between this and the sum of the module timing results.

To look at correlations between module timings for individual events, we have to reshape the data:

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

evts <- big %>% select(Event, ModuleLabel, Time, nthreads) %>% spread(ModuleLabel, Time)

The main body of the data have the g4run time less than 0.1 seconds:

head(evts)

## # A tibble: 6 x 5
##   Event nthreads     g4run generate TriggerResults
##   <int>    <int>     <dbl>    <dbl>          <dbl>
## 1     1       64 45.6      0.0331         0.000248
## 2     1      128 35.9      0.0348         0.000268
## 3     1      256 34.7      0.0321         0.000294
## 4     2       64  0.00171  0.000171       0.000182
## 5     2      128  0.00109  0.000175       0.000191
## 6     2      256  0.000721 0.000221       0.000211

logxy = list(x=list(log=10, equispaced=FALSE), y=list(log=10, equispaced=FALSE))
evts %>% filter(g4run < 0.1) %>%
  xyplot(g4run~generate|as.factor(nthreads), .,
         scales = logxy,
         pch = ".",
         grid = TRUE)

The few long-running events show no interesting correlations (note: scales are linear)

evts %>% filter(g4run > 0.1) %>%
  xyplot(g4run~generate, .,
         groups = nthreads,
         auto.key = list(space = "right"),
         grid = TRUE)

I’m not what this plot is showing … maybe that we’re just observing different consumption of random numbers in programs with different numbers of threads. This shows the event-by-event comparison of the time to run the g4run module for 64 and 128 thread execution.

library(tidyr)
z <- big %>%
  filter(Time < 0.1) %>%
  select(Event, ModuleLabel, nthreads, Time) %>%
  unite(ModThreads, ModuleLabel, nthreads) %>%
  spread(ModThreads,  Time)
head(z)

## # A tibble: 6 x 10
##   Event g4run_128 g4run_256 g4run_64 generate_128 generate_256 generate_64
##   <int>     <dbl>     <dbl>    <dbl>        <dbl>        <dbl>       <dbl>
## 1     1 NA        NA        NA           0.0348       0.0321      0.0331  
## 2     2  0.00109   0.000721  0.00171     0.000175     0.000221    0.000171
## 3     3  0.00888   0.0163    0.00695     0.000244     0.000177    0.000165
## 4     4  0.0138    0.0113    0.00778     0.000158     0.000174    0.000187
## 5     5  0.00730   0.00473   0.00606     0.000169     0.000178    0.000151
## 6     6  0.000990  0.000701  0.00179     0.000161     0.000173    0.000154
## # ... with 3 more variables: TriggerResults_128 <dbl>,
## #   TriggerResults_256 <dbl>, TriggerResults_64 <dbl>

z %>% xyplot(g4run_128~g4run_64, .,
             pch=".",
             scales=logxy,
             grid = TRUE)

Appendix

How this document was created

Please note the version of artsupport listed below. This document will be updated to use new features of artsupport whenever relevant new features appear.

This document was created using the RStudio integrated development environment; it was written in RMarkdown, using the knitr package. The set of R packages and their versions used was:

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin16.7.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.8.0           bindrcpp_0.2          dplyr_0.7.4          
## [4] latticeExtra_0.6-28   RColorBrewer_1.1-2    lattice_0.20-35      
## [7] stringr_1.2.0         artsupport_0.0.0.9011
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15     knitr_1.19       bindr_0.1        magrittr_1.5    
##  [5] tidyselect_0.2.3 R6_2.2.2         rlang_0.1.6      tools_3.4.3     
##  [9] grid_3.4.3       utf8_1.1.3       cli_1.0.0        htmltools_0.3.6 
## [13] yaml_2.1.16      rprojroot_1.3-2  digest_0.6.15    assertthat_0.2.0
## [17] tibble_1.4.2     crayon_1.3.4     purrr_0.2.4      glue_1.2.0      
## [21] evaluate_0.10.1  rmarkdown_1.8    stringi_1.1.6    compiler_3.4.3  
## [25] pillar_1.1.0     backports_1.1.2  pkgconfig_2.0.1

Mu2e GeantMT Timing

Marc Paterno

2/13/2018

Introduction

Looking at the large batch run

Appendix

How this document was created