This document reports on tests of the BBCP utility as an option for fast file transfer on A*CRC HPC systems. In particular, we compare the performance of BBCP with that of SCP and tune the parameters used for file transfer.
BBCP is a fast, free, and easy-to-use tool for moving large files from node to node. It can split a transfer into multiple simultaneous streams.
More information on BBCP can be found on the BBCP homepage.
Figure 1: Running BBCP across two compute nodes of Fuji (Fuji380 and Fuji382)
Here we compare the performance of BBCP vs SCP. Node-to-node tests were conducted for different settings (Figure 2):
Note that in settings (1), (2), and (3) both BBCP and SCP were executed via TCP, whereas in (4) the transfers were run via IPoIB.
In all settings, a TCP window size of 8MB and 2 parallel streams were used to transfer a 2GB dummy file.
Figure 2: Comparison of the performance of BBCP against SCP for four different settings.
As expected, performance without a firewall is higher for both BBCP and SCP. However, BBCP appears to be better at utilizing the link rate in the presence of a firewall. Moreover, there are more hops within [Aurora - Axle] than within [Aurora - Fuji], which explains the better performance of the latter (see the traceroutes in the Appendix).
Most importantly, when used on IB QDR, BBCP performed the transfer much faster than SCP, at a rate of 1300 MB/s * 8 = 10400 Mbps (~10Gbps). This is purely the transfer rate, not the effective rate, so overhead due to disk I/O is excluded. As we shall see later, BBCP can be made faster by tuning its parameters. Note that IB QDR has a link rate of 4 * 10Gbps = 40Gbps.
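The unit conversion above can be checked with a few lines of shell. This is only a sketch of the arithmetic: the function names are hypothetical, and the figures (1300 MB/s, 4 * 10Gbps) are the ones quoted in the text.

```shell
# Convert a transfer rate in MB/s to Mbps (1 byte = 8 bits).
mbytes_to_mbits() {
  echo $(( $1 * 8 ))
}

# Share of a link rate used by a measured rate, as an integer percentage.
# Both arguments are in Mbps; the result is rounded down.
link_share_pct() {
  echo $(( $1 * 100 / $2 ))
}

rate_mbps=$(mbytes_to_mbits 1300)   # BBCP rate quoted in the text
qdr_mbps=$(( 4 * 10 * 1000 ))       # IB QDR link rate: 4 * 10Gbps, in Mbps

echo "BBCP transfer rate: ${rate_mbps} Mbps"
echo "Share of IB QDR link rate: $(link_share_pct "$rate_mbps" "$qdr_mbps")%"
```

With the untuned settings, the measured rate is therefore roughly a quarter of the nominal QDR link rate.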
Two important parameters to tune are the TCP window size (the -w option) and the number of parallel streams (the -s option).
We tested five TCP window sizes (2MB and integer multiples of it, up to 32MB). For each window size, we also tested five different numbers of parallel streams, from 2 up to 32 (Figure 2). Each measurement is the average of ten trials, to reduce the effect of external factors (e.g. network intermittence). The full results are shown in Figure 3 below. The best performance was attained with 4 streams and a 4MB window size.
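The sweep described above could be scripted along the following lines. This is a dry-run sketch, not the script used for the report: it only prints the commands. The exact values are an assumption (doubling from 2 to 32 gives five settings for each parameter), `desthost` is a placeholder host name, the function names are hypothetical, and the -T option from the Appendix command is omitted for brevity. The -z, -V, -w, and -s options are the ones used in the Appendix.

```shell
# Assumed grid: five settings per parameter, doubling from 2 to 32.
WINDOWS_MB="2 4 8 16 32"
STREAMS="2 4 8 16 32"
TRIALS=10

# Print one bbcp command per (window, streams, trial) combination.
# Arguments: source file, destination (host:path). Nothing is executed.
sweep_commands() {
  src=$1
  dest=$2
  for w in $WINDOWS_MB; do
    for s in $STREAMS; do
      for t in $(seq 1 "$TRIALS"); do
        echo "bbcp -z -V -w ${w}M -s ${s} ${src} ${dest}  # trial ${t}"
      done
    done
  done
}

# Average per-trial rates read from stdin, one number per line.
mean_rate() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# Show the first three generated commands.
sweep_commands banana.caramel desthost:/dev/shm | sed -n '1,3p'

# Placeholder rates (not measured data) to illustrate the averaging step.
printf '%s\n' 1900 2000 2100 | mean_rate
```

Printing the commands first makes it easy to inspect the grid before committing to 250 transfers; piping the output to a shell would run them.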
Figure 3: Transfer Rate of BBCP for various TCP Window sizes and number of parallel streams.
With 4 streams and a 4MB window, the transfer rate rose to ~2GB/s (16Gb/s), about 1.5x the rate obtained with 2 streams. Because this exceeds 10Gb/s, it suggests that BBCP was drawing on the aggregate link rate of IB QDR (40Gb/s) rather than being limited to 10Gb/s, even though it was run via IPoIB.
BBCP can perform much faster than SCP when transferring large files. In our tests, a transfer rate of ~16Gb/s was achieved with 4 parallel streams (4MB TCP window size) between two internal compute nodes without a firewall.
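As a rough check of these figures, the speedup and the share of the QDR link rate can be computed as below. The sketch assumes that the 2-stream baseline is the ~10.4Gb/s rate reported earlier; the function names are hypothetical.

```shell
tuned_gbps=16       # ~2GB/s * 8, with 4 streams and a 4MB window
baseline_gbps=10.4  # assumed baseline: 1300 MB/s * 8, with 2 streams
link_gbps=40        # IB QDR link rate, 4 * 10Gbps

# Ratio of two rates, to two decimal places (awk handles the floating point).
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Share of the link rate used by a measured rate, as a whole percentage.
share_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * a / b }'
}

echo "Speedup over 2 streams: $(speedup "$tuned_gbps" "$baseline_gbps")x"
echo "Share of IB QDR link rate: $(share_pct "$tuned_gbps" "$link_gbps")%"
```

Under these assumptions the tuned transfer is about 1.5x faster than the baseline and uses about 40% of the nominal link rate.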
#Traceroute from Aurora to Fuji
kevins@aurora:/dev/shm> /usr/sbin/traceroute fuji
traceroute to fuji (202.83.248.75), 30 hops max, 40 byte packets using UDP
1 * fuji.acrc.a-star.edu.sg (202.83.248.75)(H!) 5.210 ms (H!) 6.331 ms
#Traceroute from Aurora to Axle
kevins@aurora:/dev/shm> /usr/sbin/traceroute axle
traceroute to axle (123.136.66.111), 30 hops max, 40 byte packets using UDP
1 202.83.248.1 (202.83.248.1) 0.659 ms 0.720 ms 0.670 ms
2 10.217.175.227 (10.217.175.227) 0.896 ms 0.820 ms 0.835 ms
3 10.217.175.242 (10.217.175.242) 1.950 ms 2.325 ms 1.529 ms
4 10.217.175.210 (10.217.175.210) 3.819 ms 3.381 ms 2.469 ms
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
#BBCP from Aurora to Axle
for i in `seq 1 10`; do /home01/acrc/kevins/scratch/bbcp/bin/amd64_linux/bbcp -z -V -w 8M -s 2 -T 'ssh -x -a -oFallBackToRsh=no %I -l %U %H /home01/acrc/kevins/scratch/bbcp' banana.caramel axle:/dev/shm &> prog$i.txt; ssh axle 'rm /dev/shm/banana.caramel'; done
#SCP from Aurora to Axle
for i in `seq 1 10`; do scp banana.caramel axle:/dev/shm; done