This document contains various observations about PandAna performance. It is a scratchpad of sorts; information in it might be mined for the production of a real paper.
Using the 160-subrun concatenated h5caf file, I measured the (strong) scaling of processing time, on both grunt5 and my Mac notebook.
grunt5:
Machine : x86_64
CPU Name : amdfam10
CPU count : 32
CFS restrictions : None
CPU Features :
64bit cmov cx16 lzcnt mmx popcnt prfchw sahf sse sse2 sse3 sse4a
Mac notebook:
Machine : x86_64
CPU Name : skylake
CPU count : 8
CPU Features :
64bit adx aes avx avx2 bmi bmi2 clflushopt cmov cx16 f16c fma fsgsbase invpcid
lzcnt mmx movbe pclmul popcnt prfchw rdrnd rdseed rtm sahf sgx sse sse2 sse3
sse4.1 sse4.2 ssse3 xsave xsavec xsaveopt xsaves
I timed the full running time of the program (using /usr/bin/time). In the table below, t is the wall-clock running time, in seconds; throughput is 1/t; baseline is the throughput of the reference one-rank run on that platform; abs.throughput is the throughput in slices/second; and norm.throughput is abs.throughput normalized to the one-rank value on that platform. (How these derived columns are computed is sketched after the table.)
## # A tibble: 20 x 7
## plat baseline nranks t throughput abs.throughput norm.throughput
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 grunt5 0.000449 1 2229. 0.000449 1015. 1
## 2 grunt5 0.000449 1 2128. 0.000470 1063. 1.05
## 3 grunt5 0.000449 2 1105. 0.000905 2046. 2.02
## 4 grunt5 0.000449 3 741. 0.00135 3054. 3.01
## 5 grunt5 0.000449 8 292. 0.00343 7754. 7.64
## 6 grunt5 0.000449 8 289. 0.00346 7815. 7.70
## 7 grunt5 0.000449 12 202. 0.00496 11217. 11.1
## 8 grunt5 0.000449 16 151. 0.00664 15015. 14.8
## 9 grunt5 0.000449 16 157. 0.00636 14397. 14.2
## 10 grunt5 0.000449 24 101. 0.00991 22412. 22.1
## 11 grunt5 0.000449 28 91.4 0.0109 24735. 24.4
## 12 grunt5 0.000449 31 85.0 0.0118 26613. 26.2
## 13 grunt5 0.000449 32 80.8 0.0124 28000. 27.6
## 14 grunt5 0.000449 32 82.8 0.0121 27315. 26.9
## 15 grunt5 0.000449 48 87.8 0.0114 25767. 25.4
## 16 grunt5 0.000449 64 94.0 0.0106 24068. 23.7
## 17 mac 0.00158 1 631. 0.00158 3583. 1
## 18 mac 0.00158 2 341. 0.00294 6642. 1.85
## 19 mac 0.00158 3 242. 0.00414 9361. 2.61
## 20 mac 0.00158 4 181. 0.00551 12464. 3.48
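The derived columns can be reproduced from the raw wall-clock times. Here is a minimal sketch in Python/pandas (the analysis in this note was done in R; the helper name is mine, and `n_slices`, the total number of slices in the concatenated file, is left as an input because it is not recorded here):

```python
import pandas as pd

def add_throughput_columns(runs: pd.DataFrame, n_slices: int) -> pd.DataFrame:
    """Derive the throughput columns from measured wall-clock times.

    `runs` must have columns 'plat', 'nranks', and 't' (seconds, as
    reported by /usr/bin/time); `n_slices` is the total number of
    slices in the input file.
    """
    out = runs.copy()
    out["throughput"] = 1.0 / out["t"]           # whole-file passes per second
    out["abs.throughput"] = n_slices / out["t"]  # slices per second
    # The baseline is the throughput of the reference one-rank run on
    # each platform (taken here as the first one-rank run listed).
    ref = out[out["nranks"] == 1].groupby("plat")["throughput"].first()
    out["baseline"] = out["plat"].map(ref)
    out["norm.throughput"] = out["throughput"] / out["baseline"]
    return out
```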
The strong scaling behavior is shown by the absolute throughput as a function of the number of ranks, for each platform.
A predictive model that accounts for contention between processes and for delays required to keep the system in a coherent and consistent state is detailed in (Gunther 2006). My laptop has too few cores to fit the data sensibly, so the fit below uses only the data from grunt5, and only runs with up to 32 ranks.
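Concretely, this model (Gunther's Universal Scalability Law) predicts the throughput at $n$ ranks as

$$
X(n) = \frac{N\,n}{1 + \alpha\,(n - 1) + \beta\,n\,(n - 1)},
$$

where $N$ is the one-rank throughput, the $\alpha$ term captures contention between processes, and the $\beta$ term captures coherency delay. This is exactly the formula in the nls fit that follows: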
##
## Formula: abs.throughput ~ N * nranks/(1 + alpha * (nranks - 1) + beta *
## nranks * (nranks - 1))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## N 9.544e+02 4.885e+01 19.535 6.88e-10 ***
## alpha -1.114e-03 5.401e-03 -0.206 0.840
## beta 1.423e-04 1.188e-04 1.197 0.256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 342.7 on 11 degrees of freedom
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 1.504e-06
Unfortunately, the parameter alpha has a “best fit” value that is negative (with a large fractional error); this violates the logic of the model. So let’s set that term to 0 in the model, and see what we get:
##
## Formula: abs.throughput ~ N * nranks/(1 + beta * nranks * (nranks - 1))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## N 9.640e+02 1.390e+01 69.363 < 2e-16 ***
## beta 1.182e-04 1.973e-05 5.989 6.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 328.7 on 12 degrees of freedom
##
## Number of iterations to convergence: 4
## Achieved convergence tolerance: 1.135e-06
We can plot the predictions of this model against the data:
We can also plot the (predicted) efficiency as a function of the number of ranks:
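Here efficiency is the predicted throughput divided by $n$ times the predicted one-rank throughput; with $\alpha$ fixed at zero, $N$ cancels and the model gives a closed form:

$$
\mathrm{eff}(n) = \frac{X(n)}{n\,X(1)} = \frac{1}{1 + \beta\,n\,(n - 1)}.
$$

With the fitted $\beta \approx 1.18 \times 10^{-4}$, this reproduces the values tabulated below.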
As a table:
| throughput (slices/s) | nranks | efficiency |
|---|---|---|
| 964.0244 | 1 | 1.0000000 |
| 1927.5932 | 2 | 0.9997637 |
| 2890.0241 | 3 | 0.9992915 |
| 3850.6374 | 4 | 0.9985840 |
| 4808.7573 | 5 | 0.9976423 |
| 5763.7141 | 6 | 0.9964676 |
| 6714.8452 | 7 | 0.9950616 |
| 7661.4969 | 8 | 0.9934262 |
| 8603.0257 | 9 | 0.9915639 |
| 9538.7998 | 10 | 0.9894770 |
| 10468.2006 | 11 | 0.9871686 |
| 11390.6236 | 12 | 0.9846417 |
| 12305.4803 | 13 | 0.9818999 |
| 13212.1987 | 14 | 0.9789467 |
| 14110.2245 | 15 | 0.9757862 |
| 14999.0227 | 16 | 0.9724224 |
| 15878.0778 | 17 | 0.9688599 |
| 16746.8949 | 18 | 0.9651032 |
| 17605.0007 | 19 | 0.9611572 |
| 18451.9435 | 20 | 0.9570268 |
| 19287.2948 | 21 | 0.9527172 |
| 20110.6486 | 22 | 0.9482337 |
| 20921.6228 | 23 | 0.9435817 |
| 21719.8589 | 24 | 0.9387668 |
| 22505.0222 | 25 | 0.9337947 |
| 23276.8026 | 26 | 0.9286712 |
| 24034.9136 | 27 | 0.9234019 |
| 24779.0933 | 28 | 0.9179930 |
| 25509.1035 | 29 | 0.9124502 |
| 26224.7299 | 30 | 0.9067796 |
| 26925.7820 | 31 | 0.9009872 |
| 27612.0922 | 32 | 0.8950789 |
By monkey-patching the classes `h5py.File` and `h5py.Dataset`, I have collected performance information in the form of timestamps at which specific events (file open, file close, file finalization by the garbage collector, dataset reads, …) happen.
These are measured in a 1-rank program, run on my laptop, which has an SSD.
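The patching looks roughly like the following. This is a minimal sketch of the technique, not PandAna's actual instrumentation: the `events` list and wrapper names are illustrative, and the garbage-collector finalization events are omitted here.

```python
import time
import h5py

events = []  # (timestamp, event name, detail) records, dumped at exit

# Keep references to the original methods so the wrappers can delegate.
_orig_file_init = h5py.File.__init__
_orig_file_close = h5py.File.close
_orig_ds_getitem = h5py.Dataset.__getitem__

def _timed_file_init(self, name, *args, **kwargs):
    events.append((time.time(), "file open", str(name)))
    _orig_file_init(self, name, *args, **kwargs)

def _timed_file_close(self):
    events.append((time.time(), "file close", self.filename))
    _orig_file_close(self)

def _timed_ds_getitem(self, *args, **kwargs):
    start = time.time()
    data = _orig_ds_getitem(self, *args, **kwargs)
    events.append((start, "dataset read", self.name))
    return data

# Install the instrumented methods in place of the originals.
h5py.File.__init__ = _timed_file_init
h5py.File.close = _timed_file_close
h5py.Dataset.__getitem__ = _timed_ds_getitem
```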
Gunther, Neil J. 2006. *Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services*. Springer-Verlag.