Introduction

This document contains various observations about PandAna performance. It is a scratchpad of sorts; information in it might be mined for the production of a real paper.

MPI scaling

Using the 160-subrun concatenated h5caf file, I measured the (strong) scaling of processing time, on both grunt5 and my Mac notebook.

Cluck info:

Machine                                       : x86_64
CPU Name                                      : amdfam10
CPU count                                     : 32
CFS restrictions                              : None
CPU Features                                  :
64bit cmov cx16 lzcnt mmx popcnt prfchw sahf sse sse2 sse3 sse4a

Mac notebook info:

Machine                                       : x86_64
CPU Name                                      : skylake
CPU count                                     : 8
CPU Features                                  :
64bit adx aes avx avx2 bmi bmi2 clflushopt cmov cx16 f16c fma fsgsbase invpcid
lzcnt mmx movbe pclmul popcnt prfchw rdrnd rdseed rtm sahf sgx sse sse2 sse3
sse4.1 sse4.2 ssse3 xsave xsavec xsaveopt xsaves

MPI timing data

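I timed the full run of the program with /usr/bin/time. t is the wall-clock running time, in seconds. throughput is 1/t, and baseline is the throughput of the first single-rank run on that platform. abs.throughput is the throughput in slices/second. norm.throughput is throughput divided by baseline, that is, normalized to the throughput for one rank on that platform.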

## # A tibble: 20 x 7
##    plat   baseline nranks      t throughput abs.throughput norm.throughput
##    <chr>     <dbl>  <int>  <dbl>      <dbl>          <dbl>           <dbl>
##  1 grunt5 0.000449      1 2229.    0.000449          1015.            1   
##  2 grunt5 0.000449      1 2128.    0.000470          1063.            1.05
##  3 grunt5 0.000449      2 1105.    0.000905          2046.            2.02
##  4 grunt5 0.000449      3  741.    0.00135           3054.            3.01
##  5 grunt5 0.000449      8  292.    0.00343           7754.            7.64
##  6 grunt5 0.000449      8  289.    0.00346           7815.            7.70
##  7 grunt5 0.000449     12  202.    0.00496          11217.           11.1 
##  8 grunt5 0.000449     16  151.    0.00664          15015.           14.8 
##  9 grunt5 0.000449     16  157.    0.00636          14397.           14.2 
## 10 grunt5 0.000449     24  101.    0.00991          22412.           22.1 
## 11 grunt5 0.000449     28   91.4   0.0109           24735.           24.4 
## 12 grunt5 0.000449     31   85.0   0.0118           26613.           26.2 
## 13 grunt5 0.000449     32   80.8   0.0124           28000.           27.6 
## 14 grunt5 0.000449     32   82.8   0.0121           27315.           26.9 
## 15 grunt5 0.000449     48   87.8   0.0114           25767.           25.4 
## 16 grunt5 0.000449     64   94.0   0.0106           24068.           23.7 
## 17 mac    0.00158       1  631.    0.00158           3583.            1   
## 18 mac    0.00158       2  341.    0.00294           6642.            1.85
## 19 mac    0.00158       3  242.    0.00414           9361.            2.61
## 20 mac    0.00158       4  181.    0.00551          12464.            3.48
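
The derived columns in the table above follow directly from t. As a rough sketch in plain Python: the helper name derived_columns is mine, and the total slice count is an assumption inferred from the first grunt5 row (abs.throughput times t), not a number recorded elsewhere in these notes.

```python
def derived_columns(times_by_nranks, n_slices):
    """Return rows of (nranks, t, throughput, abs_throughput, norm_throughput).

    The baseline is the throughput of the first entry, which is expected to
    be a single-rank run on the same platform.
    """
    baseline = 1.0 / times_by_nranks[0][1]
    rows = []
    for nranks, t in times_by_nranks:
        throughput = 1.0 / t                      # runs per second
        abs_throughput = n_slices / t             # slices per second
        norm_throughput = throughput / baseline   # relative to one rank
        rows.append((nranks, t, throughput, abs_throughput, norm_throughput))
    return rows


# A few grunt5 wall-clock times, as printed (rounded) in the table.
grunt5_times = [(1, 2229.0), (2, 1105.0), (8, 292.0), (32, 80.8)]

# ASSUMPTION: slice count inferred from row 1 (1015 slices/s * 2229 s).
N_SLICES = 1015 * 2229

rows = derived_columns(grunt5_times, N_SLICES)
for nranks, t, thr, abs_thr, norm_thr in rows:
    print(f"{nranks:3d} {t:8.1f} {thr:.6f} {abs_thr:9.0f} {norm_thr:6.2f}")
```

Because the inputs are the rounded values from the printed table, the results agree with the table only to about three significant figures.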

The strong scaling behavior is shown by the absolute throughput as a function of the number of ranks, for each platform.

A predictive model that accounts for contention between processes, and for the delays required to keep the system in a coherent and consistent state, is detailed in (Gunther 2006). My laptop has too few cores for a sensible fit. Using only the data from grunt5, and only up to 32 ranks (the CPU count of that machine; the measured throughput drops at 48 and 64 ranks), we see the following:

## 
## Formula: abs.throughput ~ N * nranks/(1 + alpha * (nranks - 1) + beta * 
##     nranks * (nranks - 1))
## 
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)    
## N      9.544e+02  4.885e+01  19.535 6.88e-10 ***
## alpha -1.114e-03  5.401e-03  -0.206    0.840    
## beta   1.423e-04  1.188e-04   1.197    0.256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 342.7 on 11 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 1.504e-06

Unfortunately, the “best fit” value of the parameter alpha is negative, with a standard error several times larger than the estimate; a negative contention term violates the logic of the model. Since the estimate is consistent with zero, let’s set that term to 0 in the model, and see what we get:

## 
## Formula: abs.throughput ~ N * nranks/(1 + beta * nranks * (nranks - 1))
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## N    9.640e+02  1.390e+01  69.363  < 2e-16 ***
## beta 1.182e-04  1.973e-05   5.989 6.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 328.7 on 12 degrees of freedom
## 
## Number of iterations to convergence: 4 
## Achieved convergence tolerance: 1.135e-06
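
The reduced fit can be cross-checked without R. This is a sketch, not the procedure that produced the output above. It uses the fact that for a fixed beta the model is linear in N, so N has a closed-form least-squares value, and beta can then be found by a one-dimensional golden-section search. The function names, the search bracket [0, 1e-3] and the iteration count are my own choices. The data are the (rounded) grunt5 values printed in the timing table, so small differences from the nls estimates are expected.

```python
import math

# (nranks, abs.throughput) for grunt5, up to 32 ranks, as printed in the table.
DATA = [(1, 1015.0), (1, 1063.0), (2, 2046.0), (3, 3054.0), (8, 7754.0),
        (8, 7815.0), (12, 11217.0), (16, 15015.0), (16, 14397.0),
        (24, 22412.0), (28, 24735.0), (31, 26613.0), (32, 28000.0),
        (32, 27315.0)]


def shape(nranks, beta):
    """Model throughput per unit N: nranks / (1 + beta * nranks * (nranks - 1))."""
    return nranks / (1.0 + beta * nranks * (nranks - 1))


def best_n_and_sse(beta, data=DATA):
    """Least-squares N for a fixed beta, and the resulting sum of squared errors."""
    f = [shape(n, beta) for n, _ in data]
    y = [x for _, x in data]
    big_n = sum(fi * yi for fi, yi in zip(f, y)) / sum(fi * fi for fi in f)
    sse = sum((yi - big_n * fi) ** 2 for fi, yi in zip(f, y))
    return big_n, sse


def fit_reduced_model(lo=0.0, hi=1e-3, iterations=100):
    """Golden-section search over beta; assumes one minimum inside [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if best_n_and_sse(c)[1] < best_n_and_sse(d)[1]:
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    beta = (a + b) / 2.0
    big_n, sse = best_n_and_sse(beta)
    return big_n, beta, sse


N_hat, beta_hat, sse = fit_reduced_model()
# Residual standard error with 14 points and 2 fitted parameters.
rse = math.sqrt(sse / (len(DATA) - 2))
print(f"N = {N_hat:.1f}, beta = {beta_hat:.4e}, residual std. error = {rse:.1f}")
```

The search recomputes the objective at both interior points on every iteration, which is wasteful but keeps the sketch short; with 14 data points the cost is negligible.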

We can plot the predictions of this model against the data:

We can also plot the (predicted) efficiency as a function of the number of ranks:

As a table:

throughput  nranks  efficiency
  964.0244       1  1.0000000
 1927.5932       2  0.9997637
 2890.0241       3  0.9992915
 3850.6374       4  0.9985840
 4808.7573       5  0.9976423
 5763.7141       6  0.9964676
 6714.8452       7  0.9950616
 7661.4969       8  0.9934262
 8603.0257       9  0.9915639
 9538.7998      10  0.9894770
10468.2006      11  0.9871686
11390.6236      12  0.9846417
12305.4803      13  0.9818999
13212.1987      14  0.9789467
14110.2245      15  0.9757862
14999.0227      16  0.9724224
15878.0778      17  0.9688599
16746.8949      18  0.9651032
17605.0007      19  0.9611572
18451.9435      20  0.9570268
19287.2948      21  0.9527172
20110.6486      22  0.9482337
20921.6228      23  0.9435817
21719.8589      24  0.9387668
22505.0222      25  0.9337947
23276.8026      26  0.9286712
24034.9136      27  0.9234019
24779.0933      28  0.9179930
25509.1035      29  0.9124502
26224.7299      30  0.9067796
26925.7820      31  0.9009872
27612.0922      32  0.8950789
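
The table can be regenerated from the two fitted parameters. In this sketch, efficiency is taken to be the predicted throughput divided by ideal linear scaling (N times nranks), which is consistent with the tabulated values. The constants are the rounded estimates from the reduced fit, so the last digits differ slightly from the table; the function names are mine.

```python
# Rounded estimates from the reduced fit (N in slices/second).
N_FIT = 964.0
BETA_FIT = 1.182e-4


def predicted_throughput(nranks, big_n=N_FIT, beta=BETA_FIT):
    """Reduced model: N * nranks / (1 + beta * nranks * (nranks - 1))."""
    return big_n * nranks / (1.0 + beta * nranks * (nranks - 1))


def efficiency(nranks, beta=BETA_FIT):
    """Predicted throughput relative to ideal linear scaling, N * nranks."""
    return 1.0 / (1.0 + beta * nranks * (nranks - 1))


print("throughput  nranks  efficiency")
for nranks in range(1, 33):
    print(f"{predicted_throughput(nranks):10.4f}  {nranks:6d}  {efficiency(nranks):.7f}")
```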

Single-rank program measurements

By monkey-patching the classes h5py.File and h5py.Dataset, I have collected performance information in the form of timestamps taken when specific events happen (file open, file close, file finalization by the garbage collector, dataset reads, …).

These are measured in a 1-rank program, run on my laptop, which has an SSD.
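
For illustration, here is one way such instrumentation could look. This is a sketch of the general technique, not the code used for these measurements. The helper record_calls, the EVENTS list and the event names are hypothetical, and FakeFile is a stand-in class so that the example runs without h5py or an input file.

```python
import functools
import time

EVENTS = []  # list of (timestamp, event name) tuples


def record_calls(cls, method_name, event_name, log=EVENTS):
    """Replace cls.method_name with a wrapper that timestamps each call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        log.append((time.perf_counter(), event_name + ":begin"))
        try:
            return original(*args, **kwargs)
        finally:
            # Recorded even if the wrapped method raises.
            log.append((time.perf_counter(), event_name + ":end"))

    setattr(cls, method_name, wrapper)
    return original  # returned so the caller can restore it later


class FakeFile:
    """Hypothetical stand-in for a file class; not part of h5py."""

    def __init__(self, name):
        self.name = name

    def close(self):
        pass


record_calls(FakeFile, "__init__", "file_open")
record_calls(FakeFile, "close", "file_close")

f = FakeFile("example.h5")
f.close()

for stamp, name in EVENTS:
    print(f"{stamp:.6f} {name}")
```

With h5py installed, the same helper could presumably be applied as record_calls(h5py.File, "__init__", "file_open"), record_calls(h5py.File, "close", "file_close") and record_calls(h5py.Dataset, "__getitem__", "dataset_read"). Catching finalization by the garbage collector would need a different hook and is not covered here.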

Bibliography

Gunther, Neil J. 2006. Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services. Springer-Verlag.