Revision

This is Revision 2 of the multiple Send Queue benchmark. It uses a newer version of the Snabb software that allocates packet memory differently. The previous version reused the same packet buffer for every DMA Gather request on a Send Queue. This version allocates a separate packet buffer for each DMA Gather request.

This change has a large impact on performance. The previous tests were probably exercising a bad case in the DMA subsystem.
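
The difference between the two allocation strategies can be sketched as a toy model. This is only an illustration: the function names, base address, and buffer stride are hypothetical and do not reflect the actual Snabb allocator or the ConnectX-4 descriptor format.

```python
# Toy model (hypothetical names and addresses): each Send Queue slot holds
# the address that its DMA Gather request points at.
BUFFER_STRIDE = 64  # bytes between buffers; illustrative value only

def shared_buffer_addresses(queue_size, base=0x1000):
    """Revision 1 style: every request references the same packet buffer."""
    return [base] * queue_size

def separate_buffer_addresses(queue_size, base=0x1000, stride=BUFFER_STRIDE):
    """Revision 2 style: every request references its own packet buffer."""
    return [base + i * stride for i in range(queue_size)]

if __name__ == "__main__":
    shared = shared_buffer_addresses(4)
    separate = separate_buffer_addresses(4)
    print("distinct buffers, shared:  ", len(set(shared)))
    print("distinct buffers, separate:", len(set(separate)))
```

In the shared case all gather reads hit one address, which may be the bad case suspected above. In the separate case the reads are spread over as many buffers as there are queue entries.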

Introduction

The purpose of this test report is to investigate the transmit performance of ConnectX-4 when sending 64-byte packets over multiple Send Queues. The results can be used to guide software design.

These results have not yet been reviewed by other parties. They may or may not be consistent with expectations.

Test setup

Fixed factors

  • Snabb “packetblaster” software optimized to prevent CPU bottlenecks (prototype with ConnectX-4 support).
  • Mellanox ConnectX-4 100G single-port Ethernet card (PSID: MT_2180110032).
  • Firmware version 12.16.1020.
  • Send Work Queue Entries (WQEs) always 64 bytes:
    • 16B Control Segment.
    • 32B Ethernet Segment (16 payload bytes inline).
    • 16B Send Data Segment (remaining payload bytes on DMA gather).
  • Physical addresses (“rlkey”) used for all Send WQEs.
  • Completion event requested once every 256 packets.
  • Single entry (“collapsed”) completion queue.
  • Packets are 64 bytes (60B on Work Queue + 4B Ethernet CRC).
  • Packets are transmitted continuously for approximately one second.
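
The byte accounting implied by the fixed factors above can be checked with a short sketch. The constant names are illustrative. The 44-byte gather length is derived here by subtracting the 16 inline bytes from the 60 bytes on the Work Queue. Which packet in each group of 256 requests the completion event is an assumption; the report states only the interval.

```python
# Sizes in bytes, taken from the fixed factors listed above.
CONTROL_SEGMENT = 16
ETHERNET_SEGMENT = 32     # includes 16 inline payload bytes
DATA_SEGMENT = 16
INLINE_PAYLOAD = 16
PACKET_ON_WIRE = 64
ETHERNET_CRC = 4
COMPLETION_INTERVAL = 256  # one completion event per 256 packets

wqe_size = CONTROL_SEGMENT + ETHERNET_SEGMENT + DATA_SEGMENT
packet_on_queue = PACKET_ON_WIRE - ETHERNET_CRC
gather_bytes = packet_on_queue - INLINE_PAYLOAD  # fetched via DMA gather

def requests_completion(packet_index):
    """Assumed convention: the last packet of each group of 256 signals."""
    return (packet_index + 1) % COMPLETION_INTERVAL == 0

if __name__ == "__main__":
    print("WQE size:", wqe_size)
    print("bytes on Work Queue:", packet_on_queue)
    print("bytes via DMA gather:", gather_bytes)
```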

The benchmark result is taken from the hardware (“vport”) counter for sent packets.
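
As a rough sketch of how a counter reading turns into a packet rate, and how that rate compares with the theoretical limit of the link: the function names are hypothetical, and reading the actual vport counter is outside the scope of this example. The 20 bytes of per-frame overhead are the standard Ethernet preamble and start-of-frame delimiter (8 bytes) plus the minimum inter-frame gap (12 bytes).

```python
def packets_per_second(count_start, count_end, elapsed_seconds):
    """Rate from two samples of a sent-packets counter."""
    return (count_end - count_start) / elapsed_seconds

def line_rate_pps(link_bps=100e9, frame_bytes=64, overhead_bytes=20):
    """Theoretical maximum packet rate for a given frame size."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)

def fraction_of_line_rate(measured_pps, **link):
    """Measured rate as a fraction of the theoretical maximum."""
    return measured_pps / line_rate_pps(**link)

if __name__ == "__main__":
    print("64B line rate at 100G: %.1f Mpps" % (line_rate_pps() / 1e6))
```

For 64-byte packets on a 100G link the theoretical maximum is about 148.8 million packets per second, which gives a reference point for the measured results.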

Variable factors

  • SendQueues (number of Send Queues being operated in parallel).
  • QueueSize (number of Work Queue Entries for each Send Queue).

Results

Observations

Transmit performance with small packets depends on the combination of SendQueues and QueueSize.