This is an exploration of the data from Hydra evaluation 3248, which ran a test matrix of benchmarks (iperf and l2fwd) across multiple configs, Snabb versions, kernels, and QEMU versions (patched and unpatched). The purpose is to investigate two questions: first, what are the effects of patching QEMU, and second, what are the relative effects of the different Snabb versions?
library(dplyr)
library(tibble)
library(readr)
library(ggplot2)
# Load the CSV of results for Hydra evaluation 3248
all <- read_csv("/Users/lukego/Downloads/3248.csv")
## Parsed with column specification:
## cols(
##   benchmark = col_character(),
##   pktsize = col_integer(),
##   config = col_character(),
##   snabb = col_character(),
##   kernel = col_character(),
##   qemu = col_character(),
##   dpdk = col_double(),
##   id = col_integer(),
##   score = col_double(),
##   unit = col_character()
## )
all[is.na(all)] <- 0 # treat failure as zero
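As a quick sanity check on that substitution, here is a small tally (my addition, assuming a real run never scores exactly zero) of how many failures each benchmark contributed:

# Count zeroed-out failures per benchmark (assumes score == 0 only
# happens for a failed run, per the substitution above)
all %>%
  group_by(benchmark) %>%
  summarise(failed = sum(score == 0), total = n())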
Looking for the effect of patching QEMU, we will consider only the baseline Snabb version (A). The relative effects of the different Snabb versions are the next, separate, topic.
A <- filter(all, snabb=='A')
Analysis of Variance tells us which factors, and combinations of factors, account for significant differences. I am doing this to predict which graphs will show interesting patterns.
summary(aov(score ~ benchmark * qemu * config, A))
##                  Df Sum Sq Mean Sq F value Pr(>F)
## benchmark         1   3604    3604 3335.61 <2e-16 ***
## qemu              1    121     121  111.86 <2e-16 ***
## config            6   3431     572  529.23 <2e-16 ***
## benchmark:qemu    1    575     575  531.71 <2e-16 ***
## qemu:config       6    221      37   34.04 <2e-16 ***
## Residuals       464    501       1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The *** marks tell us that every factor accounts for a significant difference in average results, and so does each interaction between pairs of factors that the table lists. (I am not sure why it does not mention the three-way interaction benchmark:qemu:config; perhaps that term cannot be estimated in this design, in which case aov drops it.) The Sum Sq column also tells us proportionally how much of the variation in results each factor accounts for: the choice of benchmark and the choice of config are the big ones.
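To make the proportional reading of Sum Sq concrete, here is a quick sketch (my addition, not part of the original write-up) that divides each term's Sum Sq by the total:

# Share of total variation attributed to each term in the ANOVA table
aov_tab <- summary(aov(score ~ benchmark * qemu * config, A))[[1]]
round(setNames(aov_tab[["Sum Sq"]] / sum(aov_tab[["Sum Sq"]]),
               rownames(aov_tab)), 2)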
This suggests to me that any graph we care to generate should have a story to tell. The most specific graph will probably be the most interesting, but let’s look at a couple of more general ones first.
ggplot(all, aes(score, color=qemu)) + geom_density() + ggtitle("Overall scores by QEMU")
Above we can see that the QEMU patch does affect results quite dramatically. The results are bimodal in both cases, but the unpatched QEMU has both more low scores (~3) and more high scores (~15).
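To put rough numbers on the two modes, we can tally scores on either side of the valley between them (my addition; the cut-offs of 5 and 12 are eyeballed from the density plot):

# Count low-mode (~3) and high-mode (~15) scores for each QEMU version
all %>%
  group_by(qemu) %>%
  summarise(low = sum(score < 5), high = sum(score > 12), total = n())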
Let’s split up the graph to see the effect on each benchmark separately:
ggplot(all, aes(score, color=qemu)) + geom_density() + facet_grid(benchmark ~ .) + ggtitle("Overall scores by QEMU and benchmark")
Above we can see that the patch affects iperf and l2fwd scores quite differently: the iperf scores decrease and cluster more closely together, while the l2fwd scores increase and spread out.
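A grouped summary (my addition) would quantify that impression: for iperf the mean and standard deviation should both fall under the patched QEMU, and for l2fwd both should rise:

# Mean and spread of scores per benchmark and QEMU version
all %>%
  group_by(benchmark, qemu) %>%
  summarise(mean = mean(score), sd = sd(score))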
Let’s now look at each benchmark on its own and break the results down by configuration.
iperf <- filter(all, benchmark=="iperf")
ggplot(iperf, aes(score, color=qemu)) + geom_density() + facet_grid(config ~ ., scales="free_y") + ggtitle("iperf scores by QEMU and config")
Above we can see a few interesting differences in how the individual configurations respond to the patch.
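To back the visual impressions with numbers, a per-config summary along these lines (my addition; the same grouping works for the l2fwd results below) could accompany each faceted plot:

# How does the patch shift each iperf configuration?
iperf %>%
  group_by(config, qemu) %>%
  summarise(median = median(score), sd = sd(score))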
l2fwd <- filter(all, benchmark=="l2fwd")
ggplot(l2fwd, aes(score, color=qemu)) + geom_density() + facet_grid(config ~ ., scales="free_y") + ggtitle("l2fwd scores by QEMU and config")
Above we can see quite a few interesting effects of the patch on the individual l2fwd configurations.
The question was, “What are the effects of patching QEMU?” The answer seems to be a mixed bag: iperf scores go down and cluster more tightly, while l2fwd scores go up and spread out, with the size of the effect varying from config to config.
The second question, about the relative effects of the different Snabb versions, is TO BE WRITTEN. Just a few graphs as a braindump for now:
ggplot(all, aes(score, color=snabb)) + geom_density()
ggplot(all, aes(score, color=snabb)) + facet_grid(benchmark ~ .) + geom_density()
ggplot(filter(all, benchmark=="l2fwd"),
aes(score, color=snabb)) + geom_density() + facet_grid(qemu ~ config, scales="free_y") +
ggtitle("l2fwd scores")