library(ggplot2)
library(dplyr)

We’re comparing performance of the apps.mellanox.connectx driver between two systems, an Intel Xeon Silver 4116 and an AMD EPYC 7443P.

On the AMD EPYC 7443P, where the NIC is on IO bus 81, we also compare two BIOS settings:

See [https://www.supermicro.com/support/faqs/faq.cfm?faq=33731]

Preferred IO Device

Advanced->NB Configuration->Preferred IO->Manual

Advanced->NB Configuration->Preferred IO Bus->81

The configuration is one 2x100G port NIC, with both ports wired to each other. Packetblaster transmits packets on one port in both test cases. In the Receive performance test case, packets are received on the other port.

The code under test can be found here: eugeneia/mellanox-benchmark

We run benchmarks on a matrix of parameters including the number of workers (cores), the number of queues, queue size, and packet size.

The benchmark is run three times for every parameter configuration and we show the minimum, maximum, and average of the results.
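The min/max/average reduction over the three runs can be sketched as follows. This is a minimal illustration on made-up numbers: the `runs` data frame and its values are hypothetical, and only the column names (`queues`, `qsize`, `rate`) mirror those used in this notebook. Passing `.groups = "drop"` returns an ungrouped result, so no separate `ungroup()` is needed and `summarise()` does not print its grouping message.

```r
library(dplyr)

# Hypothetical data: three runs for each of two parameter configurations.
runs <- data.frame(
  queues = c(1, 1, 1, 2, 2, 2),
  qsize  = c(1024, 1024, 1024, 1024, 1024, 1024),
  rate   = c(10.0, 11.0, 12.0, 20.0, 22.0, 24.0)  # Mpps, made-up values
)

# One row per configuration with min, max, and mean over the repeated runs.
summary_by_config <- runs %>%
  group_by(queues, qsize) %>%
  summarise(minrate = min(rate), maxrate = max(rate), avgrate = mean(rate),
            .groups = "drop")

print(summary_by_config)
```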

We also overlay 100G linerate (grey dashed line) to put the results into perspective.

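# Per-packet wire overhead: 12 B inter-frame gap + 8 B preamble/SFD.

# Convert a packet rate in Mpps to on-wire throughput in Gbps.
Gbps <- function(mpps, pktsize) {
  mpps*(12+8+pktsize)*8/1000
}

# Packets per second (not Mpps) at a link speed of G Gbps.
Linerate <- function(G, pktsize) {
  G*1e9 / ((12+8+pktsize)*8)
}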
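
As a quick sanity check of these helpers (the definitions are repeated so the snippet runs on its own): the constants 12 and 8 are presumably the Ethernet inter-frame gap and preamble/start-of-frame delimiter in bytes, so a 64B packet occupies 84 bytes on the wire. Note that `Linerate` returns packets per second, which is why the plots divide it by 1e6 to get Mpps.

```r
Gbps <- function(mpps, pktsize) {
  mpps * (12 + 8 + pktsize) * 8 / 1000
}

Linerate <- function(G, pktsize) {
  G * 1e9 / ((12 + 8 + pktsize) * 8)
}

# 100G line rate for 64B packets, converted from pps to Mpps
linerate_64 <- Linerate(100, 64) / 1e6
print(round(linerate_64, 2))

# Converting that packet rate back yields the nominal link speed in Gbps
print(Gbps(linerate_64, 64))
```

At 64B this works out to roughly 148.8 Mpps, which is the ceiling the grey dashed line marks in the Mpps plots.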

Packetblaster performance

Packetblaster is an optimized TX routine for our Connect-X driver. It should demonstrate the maximum transmit rate supported by the NIC.

Single-core

Testing single-core performance of Packetblaster at 64B packets while comparing number of queues and queue size.

This should reproduce the results in ConnectX: Review N*SQ 64B transmit performance mellanox (Rev 2) from 2016. The EPYC run with Preferred IO bus set to 81 reproduces them only partially. On the Intel system the shape of the curve seems to match, but it is offset by lower overall throughput.

packetblaster_single <- (mellanox.tx.rx.queues.qsize.intel.100e6 %>%
                          mutate(system="Intel Xeon Silver 4116 @ 2.10GHz")) %>%
  bind_rows((mellanox.tx.rx.queues.qsize.epyc.100e6 %>%
              mutate(system="AMD EPYC 7443P (Pref. IO bus 81)"))) %>%
  mutate(qsize=as.factor(qsize)) %>%
  group_by(queues, qsize, system) %>% 
  summarise(minrate=min(rate), maxrate=max(rate), avgrate=mean(rate)) %>%
  ungroup()
`summarise()` has grouped output by 'queues', 'qsize'. You can override using the `.groups` argument.
ggplot(packetblaster_single, aes(x=queues, color=qsize)) +
  facet_wrap(~ system) +
  geom_line(aes(y=maxrate)) +
  geom_point(aes(y=maxrate)) + 
  ggtitle("Packetblaster performance with 64B Ethernet packets",
          subtitle="Single core, rate in Mpps")

Packet sizes

Testing Packetblaster single-core performance with 16 transmit queues comparing performance between packet sizes.

packetblaster_sizes <- (mellanox.tx.only.sizes.intel.100e6.all %>%
           mutate(system="Intel Xeon Silver 4116 (pcie gen3)")) %>%
  bind_rows((mellanox.tx.only.sizes.epyc.100e6.all.bios2 %>%
               mutate(system="AMD EPYC 7443P (pcie gen4)"))) %>%
  mutate(queues=as.factor(queues)) %>%
  group_by(pktsize, queues, system) %>% 
  summarise(minrate=min(rate), maxrate=max(rate), avgrate=mean(rate)) %>%
  ungroup()
`summarise()` has grouped output by 'pktsize', 'queues'. You can override using the `.groups` argument.
ggplot(packetblaster_sizes, aes(x=pktsize, color=system)) +
  geom_step(aes(y=avgrate, linetype="avg")) +
  geom_step(aes(y=minrate, linetype="min")) +
  geom_step(aes(y=maxrate, linetype="max")) +
  geom_line(aes(y=Linerate(100, pktsize)/1e6, linetype="linerate"), color='grey') +
  coord_cartesian(ylim=c(NA, max(packetblaster_sizes$maxrate))) +
  scale_x_continuous(n.breaks = 20) + 
  ggtitle("Packetblaster performance with different sizes of Ethernet packets",
          subtitle="Single core, 16 queues, rate in Mpps (TX only)")

Multi-core

Testing Packetblaster multi-core performance, comparing number of workers, queues, and packet size. Queue size is the default (1024).

packetblaster <- (mellanox.tx.only.queues.sizes.intel.100e6.coarse %>%
           mutate(system="Intel Xeon Silver 4116 @ 2.10GHz")) %>%
  bind_rows((mellanox.tx.only.queues.sizes.epyc.100e6.coarse %>%
               mutate(system="AMD EPYC 7443P"))) %>%
  bind_rows((mellanox.tx.only.queues.sizes.epyc.100e6.coarse.bios2 %>%
               mutate(system="AMD EPYC 7443P (Pref. IO bus 81)"))) %>%
  mutate(workers=sprintf("%d workers (cores)", workers),
         queues=sprintf("%d queues", queues)) %>%
  group_by(system, workers, queues, pktsize) %>% 
  summarise(min_mpps=min(rate), avg_mpps=mean(rate), max_mpps=max(rate),
            min_loss=(min(drop+error)), avg_loss=(mean(drop+error)), max_loss=(max(drop+error))) %>%
  ungroup() %>%
  mutate(Gbps=Gbps(max_mpps-max_loss, pktsize))
`summarise()` has grouped output by 'system', 'workers', 'queues'. You can override using the `.groups` argument.
ggplot(packetblaster, aes(x=pktsize, color=queues)) +
  facet_grid(system ~ workers) + 
  geom_line(aes(y=max_mpps, linetype="0_tx")) +
  geom_line(aes(y=max_loss, linetype="1_loss")) + 
  geom_point(aes(y=avg_mpps, shape="avg"), alpha=0.5) +
  geom_point(aes(y=max_mpps, shape="max"), alpha=0.5) +
  geom_point(aes(y=min_mpps, shape="min"), alpha=0.5) +
  geom_line(aes(y=Linerate(100, pktsize)/1e6, linetype="2_linerate"), color='grey') +
  coord_cartesian(ylim=c(NA, max(packetblaster$max_mpps))) +
  scale_x_continuous(breaks=c(64,256,512,512+256,1024)) +
  scale_y_continuous(n.breaks = 8) + 
  ggtitle("Multi core performance by number of queues per worker and packet size",
          subtitle="apps.mellanox.connectx: Packetblaster rate MPPS (TX only)")

Receive performance

Packetblaster transmits on one port, and packets are received on the other port. Each side has a dedicated CPU core for each worker. That is, “6 workers” means six cores are used for transmit and six distinct cores are used for receive.

txrx <- (mellanox.tx.rx.queues.sizes.intel.100e6.coarse %>%
          mutate(system="Intel Xeon Silver 4116 @ 2.10GHz")) %>%
  bind_rows((mellanox.tx.rx.queues.sizes.epyc.100e6.coarse %>%
               mutate(system="AMD EPYC 7443P") %>% na.omit())) %>%
  bind_rows((mellanox.tx.rx.queues.sizes.epyc.100e6.coarse.bios2 %>%
               mutate(system="AMD EPYC 7443P (Pref. IO bus 81)") %>% na.omit())) %>%
  mutate(workers=sprintf("%d workers (cores)", workers),
         queues=sprintf("%d queues", queues)) %>%
  mutate(rx_mpps=rxrate-(rxdrop+rxerror)) %>%
  group_by(system, workers, queues, pktsize) %>% 
  summarise(min_mpps=min(txrate),
            avg_mpps=mean(txrate),
            max_mpps=max(txrate),
            min_rx_mpps=(min(rx_mpps)),
            avg_rx_mpps=(mean(rx_mpps)),
            max_rx_mpps=(max(rx_mpps))) %>%
  ungroup() %>%
  mutate(rxGbps=Gbps(max_rx_mpps, pktsize), Gbps=Gbps(max_mpps, pktsize))
`summarise()` has grouped output by 'system', 'workers', 'queues'. You can override using the `.groups` argument.
ggplot(txrx, aes(x=pktsize, color=queues)) +
  facet_grid(system ~ workers) +
  geom_line(aes(y=max_rx_mpps, linetype="0_rx")) +
  geom_line(aes(y=max_mpps, linetype="1_tx")) + 
  geom_point(aes(y=avg_rx_mpps, shape="avg"), alpha=0.5) +
  geom_point(aes(y=max_rx_mpps, shape="max"), alpha=0.5) +
  geom_point(aes(y=min_rx_mpps, shape="min"), alpha=0.5) +
  geom_line(aes(y=Linerate(100, pktsize)/1e6, linetype="2_linerate"), color='grey') +
  coord_cartesian(ylim=c(NA, max(txrx$max_mpps))) +
  scale_x_continuous(breaks=c(64,256,512,512+256,1024)) +
  scale_y_continuous(n.breaks = 10) + 
  ggtitle("Multi core performance by number of queues per worker and packet size",
          subtitle="apps.mellanox.connectx: RX rate of combined receive queues in MPPS")

ggplot(txrx, aes(x=pktsize, color=queues)) +
  facet_grid(system ~ workers) +
  geom_line(aes(y=rxGbps, linetype="0_rx")) +
  geom_line(aes(y=Gbps, linetype="1_tx")) + 
  geom_point(aes(y=rxGbps, shape="rx"), alpha=0.5) +
  geom_point(aes(y=Gbps, shape="tx"), alpha=0.5) +
  geom_line(aes(y=100, linetype="2_linerate"), color='grey') +
  coord_cartesian(ylim=c(NA, max(txrx$rxGbps))) +
  scale_x_continuous(breaks=c(64,256,512,512+256,1024)) +
  scale_y_continuous(n.breaks = 10) + 
  ggtitle("Multi core performance by number of queues per worker and packet size",
          subtitle="apps.mellanox.connectx: RX throughput of combined receive queues in Gbps")

Forwarding performance

Single node

Packetblaster transmits on one port; packets are received on the other port and forwarded back to the Packetblaster port (with src/dst MAC addresses swapped). Each side has a dedicated CPU core for each worker. That is, “6 workers” means six cores are used for transmit and six distinct cores are used for receive/forward.

txfwd <- (mellanox.tx.fwd.queues.sizes.intel.100e6 %>%
          mutate(system="Intel Xeon Silver 4116 @ 2.10GHz")) %>%
  bind_rows((mellanox.tx.fwd.queues.sizes.epyc.100e6.coarse.bios2 %>%
               mutate(system="AMD EPYC 7443P (Pref. IO bus 81)") %>% na.omit())) %>%
  mutate(workers=sprintf("%d workers (cores)", workers),
         queues=sprintf("%d queues", queues)) %>%
  mutate(fwd_mpps=fwrate-(fwdrop+fwerror)) %>%
  group_by(system, workers, queues, pktsize) %>% 
  summarise(min_mpps=min(txrate),
            avg_mpps=mean(txrate),
            max_mpps=max(txrate),
            min_fwd_mpps=(min(fwd_mpps)),
            avg_fwd_mpps=(mean(fwd_mpps)),
            max_fwd_mpps=(max(fwd_mpps))) %>%
  ungroup() %>%
  mutate(fwdGbps=Gbps(max_fwd_mpps, pktsize), Gbps=Gbps(max_mpps, pktsize))
`summarise()` has grouped output by 'system', 'workers', 'queues'. You can override using the `.groups` argument.
ggplot(txfwd, aes(x=pktsize, color=queues)) +
  facet_grid(system ~ workers) +
  geom_line(aes(y=max_fwd_mpps, linetype="0_fwd")) +
  geom_line(aes(y=max_mpps, linetype="1_tx")) + 
  geom_point(aes(y=avg_fwd_mpps, shape="avg"), alpha=0.5) +
  geom_point(aes(y=max_fwd_mpps, shape="max"), alpha=0.5) +
  geom_point(aes(y=min_fwd_mpps, shape="min"), alpha=0.5) +
  geom_line(aes(y=Linerate(100, pktsize)/1e6, linetype="2_linerate"), color='grey') +
  coord_cartesian(ylim=c(NA, max(txfwd$max_fwd_mpps))) +
  scale_x_continuous(breaks=c(64,256,512,512+256,1024)) +
  scale_y_continuous(n.breaks = 10) + 
  ggtitle("Multi core performance by number of queues per worker and packet size",
          subtitle="apps.mellanox.connectx: Forwarding rate of combined receive queues in MPPS")

ggplot(txfwd, aes(x=pktsize, color=queues)) +
  facet_grid(system ~ workers) +
  geom_line(aes(y=fwdGbps, linetype="0_fwd")) +
  geom_line(aes(y=Gbps, linetype="1_tx")) + 
  geom_point(aes(y=fwdGbps, shape="fwd"), alpha=0.5) +
  geom_point(aes(y=Gbps, shape="tx"), alpha=0.5) +
  geom_line(aes(y=100, linetype="2_linerate"), color='grey') +
  scale_x_continuous(breaks=c(64,256,512,512+256,1024)) +
  scale_y_continuous(n.breaks = 10) + 
  ggtitle("Multi core performance by number of queues per worker and packet size",
          subtitle="apps.mellanox.connectx: Forwarding throughput of combined receive queues in Gbps")

Forwarding between systems

Here we test forwarding performance between our two systems. Each system uses one port of a 2x100G Connect-X card.

Note that test traffic is generated by the system:

  • Intel Xeon Silver 4116 @ 2.10GHz, Mellanox Technologies MT27800 Family [ConnectX-5] (PCIeGen3, Speed 8GT/s, Width x16)

As such, it cannot exceed the TX rate measured in “Packet sizes”, which is overlaid as a dashed grey line (“txlimit”) in the plots below.

The traffic generator uses one worker/core with 16 transmit queues.

The EPYC system receives the generated test traffic and forwards it back to the load generator over the same port using N workers/cores with one queue pair each.

The test traffic is split across two pairs of MACs and two VLANs.

fwd_b2b_macvlan <- tx.fwd.b2b.coarse.nofc.macvlan.fine.n3 %>%
  mutate(workers=as.factor(workers)) %>%
  mutate(queues=as.factor(queues)) %>%
  group_by(pktsize, workers, queues) %>%
  summarise(fwrate=max(fwrate)) %>% ungroup() %>%
  left_join(filter(packetblaster_sizes, system=="Intel Xeon Silver 4116 (pcie gen3)"),
            by=c("pktsize" = "pktsize")) %>%
  mutate(Gbps=Gbps(fwrate, pktsize), MaxGbps=Gbps(maxrate, pktsize))
`summarise()` has grouped output by 'pktsize', 'workers'. You can override using the `.groups` argument.
ggplot(fwd_b2b_macvlan, aes(x=pktsize, color=workers)) +
  geom_line(aes(y=fwrate)) +
  geom_point(aes(y=fwrate)) + 
  geom_line(aes(y=Linerate(100, pktsize)/1e6, linetype="linerate"), color='grey') +
  geom_line(aes(y=maxrate, linetype="txlimit"), color='grey') +
  coord_cartesian(ylim=c(NA, max(fwd_b2b_macvlan$fwrate))) +
  ggtitle("Forwarding performance between two servers",
          subtitle="two macs, two vlans, rate in Mpps")
Warning: Removed 28 row(s) containing missing values (geom_path).

ggplot(fwd_b2b_macvlan, aes(x=pktsize, color=workers)) +
  geom_line(aes(y=Gbps)) +
  geom_point(aes(y=Gbps)) + 
  geom_line(aes(y=MaxGbps, linetype="txlimit"), color='grey') +
  geom_line(aes(y=100, linetype="linerate"), color='grey') +
  coord_cartesian(ylim=c(NA, max(fwd_b2b_macvlan$Gbps))) +
  ggtitle("Forwarding performance between two servers",
          subtitle="two macs, two vlans, rate in Gbps")
Warning: Removed 28 row(s) containing missing values (geom_path).
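
The warning is consistent with the `left_join` above leaving `maxrate` as `NA` for packet sizes that have no counterpart in `packetblaster_sizes`. This is an assumption about the data, which is not shown here. A minimal sketch with made-up values (the `fwd` and `txlimit` data frames are hypothetical):

```r
library(dplyr)

# Hypothetical forwarding results at a fine packet-size granularity
fwd <- data.frame(pktsize = c(64, 96, 128), fwrate = c(30, 28, 26))

# Hypothetical TX limit measured only at a subset of those sizes
txlimit <- data.frame(pktsize = c(64, 128), maxrate = c(40, 38))

# left_join keeps every row of fwd; unmatched sizes get maxrate = NA
joined <- left_join(fwd, txlimit, by = "pktsize")
print(joined)

# Number of rows without a TX limit
print(sum(is.na(joined$maxrate)))
```

Rows with a missing `maxrate` cannot be drawn as part of the “txlimit” line, which may account for the rows ggplot2 reports as removed.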


