This document contains various observations about PandAna performance. It is a scratchpad of sorts; information in it might be mined for the production of a real paper.
Using the 160-subrun concatenated h5caf file, I measured the (strong) scaling of processing time, on both grunt5 and my Mac notebook.
grunt5:
Machine : x86_64
CPU Name : amdfam10
CPU count : 32
CFS restrictions : None
CPU Features :
64bit cmov cx16 lzcnt mmx popcnt prfchw sahf sse sse2 sse3 sse4a
Mac notebook:
Machine : x86_64
CPU Name : skylake
CPU count : 8
CPU Features :
64bit adx aes avx avx2 bmi bmi2 clflushopt cmov cx16 f16c fma fsgsbase invpcid
lzcnt mmx movbe pclmul popcnt prfchw rdrnd rdseed rtm sahf sgx sse sse2 sse3
sse4.1 sse4.2 ssse3 xsave xsavec xsaveopt xsaves
I timed the full running time of the program (using /usr/bin/time). In the table below, t is the wall-clock running time, in seconds; throughput is 1/t; baseline is the throughput of the reference one-rank run on that platform; abs.throughput is the throughput in slices/second; and norm.throughput is abs.throughput normalized to the one-rank value on that platform. (How these derived columns are computed is sketched after the table.)
## # A tibble: 20 x 7
## plat baseline nranks t throughput abs.throughput norm.throughput
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 grunt5 0.000449 1 2229. 0.000449 1015. 1
## 2 grunt5 0.000449 1 2128. 0.000470 1063. 1.05
## 3 grunt5 0.000449 2 1105. 0.000905 2046. 2.02
## 4 grunt5 0.000449 3 741. 0.00135 3054. 3.01
## 5 grunt5 0.000449 8 292. 0.00343 7754. 7.64
## 6 grunt5 0.000449 8 289. 0.00346 7815. 7.70
## 7 grunt5 0.000449 12 202. 0.00496 11217. 11.1
## 8 grunt5 0.000449 16 151. 0.00664 15015. 14.8
## 9 grunt5 0.000449 16 157. 0.00636 14397. 14.2
## 10 grunt5 0.000449 24 101. 0.00991 22412. 22.1
## 11 grunt5 0.000449 28 91.4 0.0109 24735. 24.4
## 12 grunt5 0.000449 31 85.0 0.0118 26613. 26.2
## 13 grunt5 0.000449 32 80.8 0.0124 28000. 27.6
## 14 grunt5 0.000449 32 82.8 0.0121 27315. 26.9
## 15 grunt5 0.000449 48 87.8 0.0114 25767. 25.4
## 16 grunt5 0.000449 64 94.0 0.0106 24068. 23.7
## 17 mac 0.00158 1 631. 0.00158 3583. 1
## 18 mac 0.00158 2 341. 0.00294 6642. 1.85
## 19 mac 0.00158 3 242. 0.00414 9361. 2.61
## 20 mac 0.00158 4 181. 0.00551 12464. 3.48
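The derived columns can be reproduced from the raw wall-clock times. Here is a minimal sketch in Python/pandas (the analysis in this note was done in R; the helper name is mine, and `n_slices`, the total number of slices in the concatenated file, is left as an input because it is not recorded here):

```python
import pandas as pd

def add_throughput_columns(runs: pd.DataFrame, n_slices: int) -> pd.DataFrame:
    """Derive the throughput columns from measured wall-clock times.

    `runs` must have columns 'plat', 'nranks', and 't' (seconds, as
    reported by /usr/bin/time); `n_slices` is the total number of
    slices in the input file.
    """
    out = runs.copy()
    out["throughput"] = 1.0 / out["t"]           # whole-file passes per second
    out["abs.throughput"] = n_slices / out["t"]  # slices per second
    # The baseline is the throughput of the reference one-rank run on
    # each platform (taken here as the first one-rank run listed).
    ref = out[out["nranks"] == 1].groupby("plat")["throughput"].first()
    out["baseline"] = out["plat"].map(ref)
    out["norm.throughput"] = out["throughput"] / out["baseline"]
    return out
```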
The strong scaling behavior is shown by the absolute throughput as a function of the number of ranks, for each platform.
A predictive model that accounts for contention between processes and for delays required to keep the system in a coherent and consistent state is detailed in (Gunther 2006). My laptop has too few cores to fit the data sensibly, so the fit below uses only the data from grunt5, and only runs with up to 32 ranks.
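Concretely, this model (Gunther's Universal Scalability Law) predicts the throughput at $n$ ranks as

$$
X(n) = \frac{N\,n}{1 + \alpha\,(n - 1) + \beta\,n\,(n - 1)},
$$

where $N$ is the one-rank throughput, the $\alpha$ term captures contention between processes, and the $\beta$ term captures coherency delay. This is exactly the formula in the nls fit that follows: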
##
## Formula: abs.throughput ~ N * nranks/(1 + alpha * (nranks - 1) + beta *
## nranks * (nranks - 1))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## N 9.544e+02 4.885e+01 19.535 6.88e-10 ***
## alpha -1.114e-03 5.401e-03 -0.206 0.840
## beta 1.423e-04 1.188e-04 1.197 0.256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 342.7 on 11 degrees of freedom
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 1.504e-06
Unfortunately, the parameter alpha has a “best fit” value that is negative (with a large fractional error); this violates the logic of the model. So let’s set that term to 0 in the model, and see what we get:
##
## Formula: abs.throughput ~ N * nranks/(1 + beta * nranks * (nranks - 1))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## N 9.640e+02 1.390e+01 69.363 < 2e-16 ***
## beta 1.182e-04 1.973e-05 5.989 6.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 328.7 on 12 degrees of freedom
##
## Number of iterations to convergence: 4
## Achieved convergence tolerance: 1.135e-06
We can plot the predictions of this model against the data:
We can also plot the (predicted) efficiency as a function of the number of ranks:
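Here efficiency is the predicted throughput divided by $n$ times the predicted one-rank throughput; with $\alpha$ fixed at zero, $N$ cancels and the model gives a closed form:

$$
\mathrm{eff}(n) = \frac{X(n)}{n\,X(1)} = \frac{1}{1 + \beta\,n\,(n - 1)}.
$$

With the fitted $\beta \approx 1.18 \times 10^{-4}$, this reproduces the values tabulated below.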
As a table:
| throughput (slices/s) | nranks | efficiency |
|---|---|---|
| 964.0244 | 1 | 1.0000000 |
| 1927.5932 | 2 | 0.9997637 |
| 2890.0241 | 3 | 0.9992915 |
| 3850.6374 | 4 | 0.9985840 |
| 4808.7573 | 5 | 0.9976423 |
| 5763.7141 | 6 | 0.9964676 |
| 6714.8452 | 7 | 0.9950616 |
| 7661.4969 | 8 | 0.9934262 |
| 8603.0257 | 9 | 0.9915639 |
| 9538.7998 | 10 | 0.9894770 |
| 10468.2006 | 11 | 0.9871686 |
| 11390.6236 | 12 | 0.9846417 |
| 12305.4803 | 13 | 0.9818999 |
| 13212.1987 | 14 | 0.9789467 |
| 14110.2245 | 15 | 0.9757862 |
| 14999.0227 | 16 | 0.9724224 |
| 15878.0778 | 17 | 0.9688599 |
| 16746.8949 | 18 | 0.9651032 |
| 17605.0007 | 19 | 0.9611572 |
| 18451.9435 | 20 | 0.9570268 |
| 19287.2948 | 21 | 0.9527172 |
| 20110.6486 | 22 | 0.9482337 |
| 20921.6228 | 23 | 0.9435817 |
| 21719.8589 | 24 | 0.9387668 |
| 22505.0222 | 25 | 0.9337947 |
| 23276.8026 | 26 | 0.9286712 |
| 24034.9136 | 27 | 0.9234019 |
| 24779.0933 | 28 | 0.9179930 |
| 25509.1035 | 29 | 0.9124502 |
| 26224.7299 | 30 | 0.9067796 |
| 26925.7820 | 31 | 0.9009872 |
| 27612.0922 | 32 | 0.8950789 |
By monkey-patching the classes `h5py.File` and `h5py.Dataset`, I have collected performance information in the form of timestamps at which specific events (file open, file close, file finalization by the garbage collector, dataset reads, …) happen.
These are measured in a 1-rank program, run on my laptop, which has an SSD.
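The patching looks roughly like the following. This is a minimal sketch of the technique, not PandAna's actual instrumentation: the `events` list and wrapper names are illustrative, and the garbage-collector finalization events are omitted here.

```python
import time
import h5py

events = []  # (timestamp, event name, detail) records, dumped at exit

# Keep references to the original methods so the wrappers can delegate.
_orig_file_init = h5py.File.__init__
_orig_file_close = h5py.File.close
_orig_ds_getitem = h5py.Dataset.__getitem__

def _timed_file_init(self, name, *args, **kwargs):
    events.append((time.time(), "file open", str(name)))
    _orig_file_init(self, name, *args, **kwargs)

def _timed_file_close(self):
    events.append((time.time(), "file close", self.filename))
    _orig_file_close(self)

def _timed_ds_getitem(self, *args, **kwargs):
    start = time.time()
    data = _orig_ds_getitem(self, *args, **kwargs)
    events.append((start, "dataset read", self.name))
    return data

# Install the instrumented methods in place of the originals.
h5py.File.__init__ = _timed_file_init
h5py.File.close = _timed_file_close
h5py.Dataset.__getitem__ = _timed_ds_getitem
```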
Gunther, Neil J. 2006. *Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services*. Springer-Verlag.