Overview

This is an exploration of the data from Hydra evaluation 3248 with a test matrix:

The purpose is to investigate these two questions:

  1. What are the effects of patching QEMU?
  2. How dependent is the performance of each Snabb version on the QEMU patch?

Setup

library(dplyr)
library(tibble)
library(readr)
library(ggplot2)

all <- read_csv("/Users/lukego/Downloads/3248.csv")
## Parsed with column specification:
## cols(
##   benchmark = col_character(),
##   pktsize = col_integer(),
##   config = col_character(),
##   snabb = col_character(),
##   kernel = col_character(),
##   qemu = col_character(),
##   dpdk = col_double(),
##   id = col_integer(),
##   score = col_double(),
##   unit = col_character()
## )
all[is.na(all)] <- 0 # treat failure as zero
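Because failures are turned into zeros, they show up in the later density plots as very low scores. A minimal sketch of how one might count the failed runs per group before overwriting them is shown here. It uses a small made-up table (`toy`, with hypothetical `qemu` labels "unpatched" and "patched") rather than the real CSV, and it zero-fills only the `score` column; both choices are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical rows for illustration only; NOT data from evaluation 3248.
toy <- tibble(
  benchmark = c("iperf", "iperf", "l2fwd", "l2fwd", "l2fwd"),
  qemu      = c("unpatched", "patched", "unpatched", "patched", "patched"),
  score     = c(9.5, NA, 3.1, NA, 14.8)
)

# Count failed runs (missing scores) per group before they are overwritten.
failures <- toy %>%
  group_by(benchmark, qemu) %>%
  summarise(runs = n(), failed = sum(is.na(score)), .groups = "drop")
print(failures)

# Zero-fill only the score column, leaving the other columns untouched.
filled <- toy %>% mutate(score = coalesce(score, 0))
```

Keeping the failure counts around makes it easier to tell whether a low-score mode in a plot comes from slow runs or from failed ones.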

Effect of patching QEMU

To look for the effect of patching QEMU we will consider only the baseline Snabb version (A). The relative effects of different Snabb versions are the next, separate, topic.

A <- filter(all, snabb=='A')

Analysis of Variance tells us which factors, and combinations of factors, account for significant differences. I am doing this to predict which graphs will show interesting patterns.

summary(aov(score ~ benchmark * qemu * config, A))
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## benchmark        1   3604    3604 3335.61 <2e-16 ***
## qemu             1    121     121  111.86 <2e-16 ***
## config           6   3431     572  529.23 <2e-16 ***
## benchmark:qemu   1    575     575  531.71 <2e-16 ***
## qemu:config      6    221      37   34.04 <2e-16 ***
## Residuals      464    501       1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The *** marks tell us that every factor accounts for a significant difference in average results, and so do the two interactions that appear in the table, benchmark:qemu and qemu:config. (I am not sure why it does not mention benchmark:config or the three-way interaction benchmark:qemu:config. One possible explanation is that not every config was run with every benchmark: R leaves terms it cannot estimate out of the summary.) The Sum Sq column also tells us proportionally how much variation in results each factor accounts for (choice of benchmark and choice of config are the big ones).
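The proportions read off the Sum Sq column can be made explicit with a little arithmetic. The sketch below simply copies the Sum Sq figures from the table above into a named vector (the vector names `sum_sq` and `share` are my own) and expresses each row as a percentage of the total.

```r
# Sum Sq values copied from the ANOVA table above.
sum_sq <- c(
  "benchmark"      = 3604,
  "qemu"           = 121,
  "config"         = 3431,
  "benchmark:qemu" = 575,
  "qemu:config"    = 221,
  "Residuals"      = 501
)

# Share of the total sum of squares for each row, in percent, largest first.
share <- sort(round(100 * sum_sq / sum(sum_sq), 1), decreasing = TRUE)
print(share)
```

On these figures benchmark and config account for roughly 43% and 41% of the total. QEMU on its own accounts for only about 1%, while the benchmark:qemu interaction accounts for about 7%, which would fit a patch that moves the two benchmarks in opposite directions.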

This suggests to me that any graph we care to generate should have a story to tell. The most specific graph will probably be the most interesting but let’s look at a couple of more general ones first.

ggplot(all, aes(score, color=qemu)) + geom_density() + ggtitle("Overall scores by QEMU")

Above we can see that the QEMU patch does affect results quite dramatically. The results are bimodal in both cases but the unpatched QEMU has both more low scores (~3) and more high scores (~15).

Let’s split up the graph to see the effect on each benchmark separately:

ggplot(all, aes(score, color=qemu)) + geom_density() + facet_grid(benchmark ~ .) + ggtitle("Overall scores by QEMU and benchmark")

Above we can see that the patch is affecting iperf and l2fwd scores quite differently. The iperf scores are decreasing and clustering more closely together while the l2fwd scores are increasing and spreading out.

Let’s now look at each benchmark separately and break down the results for each configuration separately.

iperf

iperf <- filter(all, benchmark=="iperf")
ggplot(iperf, aes(score, color=qemu)) + geom_density() + facet_grid(config ~ ., scales="free_y") + ggtitle("iperf scores by QEMU and config")

Above we can see a few interesting things:

  1. Lower scores for all configurations.
  2. More spread out results (variation) with the l2tpv3 config.

l2fwd

l2fwd <- filter(all, benchmark=="l2fwd")
ggplot(l2fwd, aes(score, color=qemu)) + geom_density() + facet_grid(config ~ ., scales="free_y") + ggtitle("l2fwd scores by QEMU and config")

Above we can see quite a few interesting effects of the patch:

  1. Much higher scores in two configurations, base and noind (disabled virtio INDIRECT_DESC).
  2. Much more distinct bimodal distribution of scores in those configurations too.
  3. Slightly higher scores with the nomrg (disabled virtio MRG_RXBUF) configuration.

Conclusion #1

The question was, “What are the effects of patching QEMU?” The answer seems to be:

  1. iperf scores decrease substantially.
  2. l2fwd scores increase. The effect is large with “base” and “noind” configurations and modest with “nomrg”.

This is a mixed bag.
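The conclusions above are read off density plots by eye. A rough sketch of how the same comparison could be tabulated is to summarise the scores per benchmark, config and QEMU. The table `toy` and its values are made up for illustration (they are not results from evaluation 3248), and the `qemu` labels "unpatched" and "patched" are assumed names.

```r
library(dplyr)
library(tibble)

# Hypothetical rows for illustration only; NOT data from evaluation 3248.
toy <- tibble(
  benchmark = rep(c("iperf", "l2fwd"), each = 4),
  config    = "base",
  qemu      = rep(c("unpatched", "patched"), times = 4),
  score     = c(10, 8, 12, 6, 4, 9, 6, 11)
)

# One row per benchmark/config/qemu combination with simple summary statistics.
by_group <- toy %>%
  group_by(benchmark, config, qemu) %>%
  summarise(n = n(), median = median(score), mean = mean(score), .groups = "drop")
print(by_group)
```

Running the same pipeline on the real data frame in place of `toy` would put a number on "substantially" and "modest". Note that the median is probably a safer summary than the mean here, since several of the distributions are bimodal.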

Effect of changing Snabb

TO BE WRITTEN.

Just a few graphs as a braindump for now:

Overall scores

ggplot(all, aes(score, color=snabb)) + geom_density()

By benchmark

ggplot(all, aes(score, color=snabb)) + facet_grid(benchmark ~ .) + geom_density()

l2fwd in detail

ggplot(filter(all, benchmark=="l2fwd"), aes(score, color=snabb)) +
  geom_density() +
  facet_grid(qemu ~ config, scales="free_y") +
  ggtitle("l2fwd scores")