This document is a simple template illustrating the kind of report we expect from you (for now…). It follows the classical IMRAD structure.
The goal of this study is to evaluate when using threads becomes worthwhile compared to a simple sequential version. To this end, we use the well-known quicksort algorithm, whose parallelization is rather natural even though its critical path is quite bad.
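Why is the critical path an issue? In the classical formulation the partition of n elements is sequential, so a back-of-the-envelope bound on a naive parallel quicksort (a sketch, not a measurement of this study) gives:

T_1(n) = \Theta(n \log n), \qquad T_\infty(n) = \Theta(n), \qquad \text{parallelism} = T_1(n)/T_\infty(n) = \Theta(\log n)

In other words, unless the partition itself is parallelized, the available parallelism only grows logarithmically with the input size, which limits the achievable speedup.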
We used our own machine, a Dell Latitude 6430u with 16 GB of RAM:
model name : Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
hwloc topology (lstopo output; host: sama, physical indexes, Thu Oct 10 18:19:15 2013):
  Machine (16GB) with a single socket (Socket P#0) and a shared 4096KB L3 cache
  2 physical cores (Core P#0 and Core P#1), each with 256KB L2, 32KB L1d and 32KB L1i caches
  2 hardware threads per core: PU P#0/P#2 on Core P#0 and PU P#1/P#3 on Core P#1 (4 logical PUs in total)
  PCI devices: 8086:0166 (card0, controlD64), 8086:1502 (eth0), 168c:0034 (wlan0), 8086:1e03 (sda)
Compile flags: CFLAGS = -g -Wall -O2 -pthread -lrt
Actual frequency: between 800000 and 2640000 (presumably kHz, i.e., 0.8 to 2.64 GHz)
Frequency governor: performance or powersave
Measurements done at: Thu Oct 10 18:14:00 CEST 2013
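On the analysis side, it does not hurt to also record the R environment used to process the measurements; this is a small addition of ours, not something provided in the archive:

# Record the R version, platform and loaded packages used for the analysis
sessionInfo()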
Everything was provided in the archive, so we simply launched the measurements by calling make.
Let's first load the obtained measurements.
library(ggplot2)
# Load the raw timing measurements (one row per run)
df <- read.csv("measurements_arg.csv", strip.white = TRUE)
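A quick look at what was loaded (the columns used below are Size, Type, Thread.level and Time):

# Inspect the structure and the first rows of the raw measurements
str(df)
head(df)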
Let's compute the average execution time for each size, type of measurement, and thread level.
library(plyr)
# Aggregate over replicates; seTime = 2 standard errors of the mean (~95% CI half-width)
df_avg <- ddply(df, c("Size", "Type", "Thread.level"), summarise, meanTime = mean(Time),
    varTime = var(Time), seTime = 2 * sqrt(var(Time))/sqrt(length(Time)))
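Since means and confidence intervals are only as meaningful as the number of replicates behind them, it is worth checking how many measurements each combination aggregates (a small sanity check we add here, not part of the original chain):

# Number of raw measurements behind each (Size, Type, Thread.level) mean
count(df, c("Size", "Type", "Thread.level"))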
And finally, let's plot it.
ggplot(data = df_avg, aes(x = Size, color = Type)) +
    geom_point(data = df, aes(x = Size, y = Time), alpha = 0.2, size = 3) +  # raw measurements
    geom_line(aes(y = meanTime, linetype = Type)) +                          # mean execution time
    geom_errorbar(aes(ymin = meanTime - seTime, ymax = meanTime + seTime),
                  colour = "black", width = 0.11) +                          # ~95% confidence intervals
    geom_point(aes(y = meanTime), size = 3, shape = 21, fill = "white") +
    scale_x_log10() + scale_y_log10() +                                      # log-log scales
    facet_wrap(~Thread.level)                                                # one panel per thread level
OK: clearly, we need to decrease parallelism to obtain performance gains. With Thread.level = 5, we obtain a significant gain of 1.2727. Note that the built-in version incurs a slowdown of 1.6026 compared to our compiled sequential version, so compilation also seems to be something to take into account. Anyway, we are still far from the factor of 2 one could hope for on two cores…
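For reference, this kind of ratio can be computed directly from df_avg. The sketch below is only illustrative: the Type labels "seq" and "par" are assumptions and should be replaced by the actual values found in measurements_arg.csv.

# Hypothetical Type labels: "seq" (compiled sequential) and "par" (threaded).
# Compare mean times for the largest measured size at Thread.level = 5.
n_max <- max(df_avg$Size)
t_seq <- df_avg$meanTime[df_avg$Size == n_max & df_avg$Type == "seq" &
                         df_avg$Thread.level == 5]
t_par <- df_avg$meanTime[df_avg$Size == n_max & df_avg$Type == "par" &
                         df_avg$Thread.level == 5]
t_seq/t_par  # speedup: values above 1 mean the threaded version is faster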
Multi-core machines are definitely the best answer to the ever-growing need for performance. Our study illustrates, however, that although performance gains can be obtained for relatively small data sets, the parallelism overhead should be taken into account.