Last Updated: 2021-03-14 15:33:16 UTC

---
title: "Matrix Profile for R - Benchmarks"
author: "Francisco Bischoff"
creative_commons: CC BY-SA
date: "1/31/2021"
output:
  html_notebook: 
    toc: yes
    toc_float: yes
    toc_depth: 4
---

<head>

<!-- Global site tag (gtag.js) - Google Analytics -->

```{=html}
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-160033971-2"></script>
```
```{=html}
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-160033971-2');
</script>
```
</head>

Last Updated: `r lubridate::now("UTC")` UTC

-   added macOS benchmarks (KVM virtual machine for now)

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyr)
library(ggplot2)
library(bench)
library(forcats)
library(dplyr)
library(matrixprofiler)
```

## Benchmark computing Matrix Profile

This benchmark uses the current Rcpp implementation and a real dataset, the Italian power demand series, which contains almost 30k observations.

```{r import database, message=FALSE}
url <- readr::read_csv("https://raw.githubusercontent.com/matrix-profile-foundation/mpf-datasets/05efe885cff4b2266067ad62c4f6fa2b537ad2a2/real/italianpowerdemand.csv", col_names = FALSE)
dataset <- as.numeric(url$X1)
```

## The dataset

Here is a plot of the dataset:

```{r pressure, echo=FALSE, fig.height=3, fig.width=10}
plot(dataset, xlab = "observations", ylab = "consumption", type = "l", main = "Italian Power Demand")
```

We run the benchmark with the `bench` package instead of `microbenchmark`.
The first reason is to stay compatible with the Tidyverse; the second is that `bench` also reports memory allocation and garbage collection activity.
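
As a rough illustration of the extra information `bench` reports, the sketch below times a throwaway expression and keeps only the timing, memory, and garbage collection columns.
The cumulative sum is a stand-in workload chosen for this example, not part of the matrix profile benchmark; the column names are those of the tibble returned by `bench::mark()`.

```r
library(bench)

# Stand-in workload, unrelated to the matrix profile algorithms.
x <- runif(1e5)

res <- bench::mark(
  cumsum = cumsum(x),
  iterations = 5,   # fixed, small number of runs for the example
  check = FALSE     # single expression, nothing to compare against
)

# Median time, total memory allocated, and number of garbage collections.
res[, c("expression", "median", "mem_alloc", "n_gc")]
```

Depending on how R was built, `mem_alloc` may be reported as `NA`, which is consistent with the ARM plots in this article that show speed only.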

The benchmark runs over a grid of data sizes and window sizes, so we can compare performance in more than one scenario.
Let's warm up with a sample test:

```{r warmup, message=FALSE, warning=FALSE}

sample <- head(dataset, 1000)
w_size <- 100

bench::mark(stomp = stomp(sample, w_size, progress = FALSE))
```

So it works.
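
The grid of data sizes and window sizes mentioned above can be sketched with base R.
This is only an illustration of which scenarios get measured (the sizes are the ones used in the full benchmark later in this article); `bench::press()` builds the equivalent grid by itself.

```r
# Data sizes and window sizes used in the full benchmark.
d_size <- c(5000, 10000, 15000, 20000, 25000)
w_size <- c(100, 300, 500, 700, 900)

# Every combination of data size and window size: one row per scenario.
grid <- expand.grid(d_size = d_size, w_size = w_size)

nrow(grid)  # 5 data sizes x 5 window sizes = 25 scenarios
head(grid)
```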

Now let's start the main (and intense) task.

Desktop:

-   Intel(R) Core(TM) i7-7700 CPU \@ 3.60GHz
-   32 GB RAM
-   Windows 10 64-bit build 10.0.18363.1316
-   WSL2 Ubuntu 20.04.1 LTS (GNU/Linux 5.4.72-microsoft-standard-WSL2 x86_64)

Raspberry Pi 3 Model B:

-   Quad Core 1.2GHz Cortex-A53 ARMv8 64bit CPU
-   1 GB RAM
-   Raspberry Pi OS

Algorithms to be evaluated:

-   STAMP (single and with 4 threads)
-   STOMP (single and with 4 threads)
-   SCRIMP (single and with 4 threads)
-   MPX (single and with 4 threads)

The outputs are not compared at first (`check = FALSE`), to avoid losing CPU time on the small variations that may occur in the results.
The code below is the one used to compute the results.
The results were saved, and this article now loads the saved data to speed up rendering.
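
The save-then-reload approach can be sketched as follows.
The file name and the placeholder workload are hypothetical and exist only for this example; the article itself loads `.rda` files produced on each platform.

```r
# Hypothetical cache location used only for this example.
cache_file <- file.path(tempdir(), "bench_cache_example.rda")

if (file.exists(cache_file)) {
  # Fast path: restore the `results` object saved by a previous run.
  load(cache_file)
} else {
  # Slow path: a placeholder standing in for the expensive benchmark call.
  results <- system.time(Sys.sleep(0.1))[["elapsed"]]
  save(results, file = cache_file)
}

results
```

A second run of the same code skips the slow path entirely, which is why rendering the article does not need to repeat the benchmark.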

The multithreaded implementation uses [Intel TBB](https://www.threadingbuildingblocks.org/ "Intel TBB"); some systems may fall back to [TinyThread++](http://tinythreadpp.bitsnbites.eu/ "TinyThread++") (at least for now, TBB was working on all tested platforms, including Solaris and ARMv8).
When the fallback is used, the main speed issue may be related to the mutex implementation of TinyThread++, which is not as efficient as that of TBB.

This is the code used to benchmark:

```{r full benchmark single thread, eval=FALSE}
# changing n_workers to 4 will use 4 threads to compute
results <- bench::press(
  d_size = c(5000, 10000, 15000, 20000, 25000),
  w_size = c(100, 300, 500, 700, 900),
   {
     data <- head(dataset, d_size)

     bench::mark(
       stamp = stamp(data, w_size, progress = FALSE, n_workers = 1),
       stomp = stomp(data, w_size, progress = FALSE, n_workers = 1),
       scrimp = scrimp(data, w_size, progress = FALSE, n_workers = 1),
       mpx = mpx(data, w_size, progress = FALSE, n_workers = 1),
       check = FALSE,
       min_iterations = 3
     )
   })
save(results, file = "bench.rda")

```
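
Because `check = FALSE` skips the comparison of outputs, a separate check with a tolerance could be run afterwards.
The sketch below uses two made-up numeric vectors, `mp_a` and `mp_b`, as stand-ins for the outputs of two algorithms; they are not real matrix profiles, and the tolerance is an arbitrary choice for the example.

```r
# Hypothetical outputs of two algorithms on the same input:
# equal up to a tiny numerical difference.
set.seed(42)
mp_a <- runif(10)
mp_b <- mp_a + 1e-10

# Exact comparison fails on the tiny difference.
exact_match <- identical(mp_a, mp_b)

# Comparison within a tolerance accepts it.
close_match <- isTRUE(all.equal(mp_a, mp_b, tolerance = 1e-6))

c(exact = exact_match, within_tolerance = close_match)
```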

## Summary of benchmarks

```{r summary_single, echo=FALSE, fig.height=8, fig.width=12, message=FALSE, warning=FALSE}
load("results.rda")
res_par <- dplyr::filter(results, grepl("_par", algorithm))
res_par <- res_par %>% mutate(algorithm = stringr::str_extract(algorithm, ".*(?=_par)"),
                              median = bench::as_bench_time(median))

res_sim <- dplyr::filter(results, !grepl("_par", algorithm))
res_sim <- res_sim %>% mutate(median = bench::as_bench_time(median))

p <- ggplot(res_sim, aes(x = platform, y = median, col = algorithm))
p2 <- p + ggplot2::geom_point(position = ggplot2::position_dodge(width = -0.2)) +
  ggplot2::coord_flip() +
    labs(title = "Matrix Profile algorithms benchmark by platform", subtitle = "Single Threads", x = "Platforms", col = "Algorithm", y = "time (less is better)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
p2
```

```{r summary_multi, echo=FALSE, fig.height=8, fig.width=12, message=FALSE, warning=FALSE}
load("results.rda")
res_par <- dplyr::filter(results, grepl("_par", algorithm))
res_par <- res_par %>% mutate(algorithm = stringr::str_extract(algorithm, ".*(?=_par)"),
                              median = bench::as_bench_time(median))

res_sim <- dplyr::filter(results, !grepl("_par", algorithm))
res_sim <- res_sim %>% mutate(median = bench::as_bench_time(median))

p <- ggplot(res_par, aes(x = platform, y = median, col = algorithm))
p2 <- p + ggplot2::geom_point(position = ggplot2::position_dodge(width = -0.2)) +
  ggplot2::coord_flip() +
    labs(title = "Matrix Profile algorithms benchmark by platform", subtitle = "Four Threads", x = "Platforms", col = "Algorithm", y = "time (less is better)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
p2
```
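
To read the two summaries side by side, the speedup from threading can be computed as the single-thread median divided by the multithreaded median.
The sketch below assumes the `_par` suffix convention used in the plotting code of this article; the timings are made-up numbers for illustration only, not measured results.

```r
# Made-up median times in seconds (illustration only, not measurements).
toy <- data.frame(
  algorithm = c("stomp", "stomp_par", "mpx", "mpx_par"),
  median    = c(8.0, 2.5, 4.0, 1.6)
)

# Split single-thread and multithreaded rows using the "_par" suffix.
is_par <- grepl("_par$", toy$algorithm)
single <- toy[!is_par, ]
multi  <- toy[is_par, ]
multi$algorithm <- sub("_par$", "", multi$algorithm)

# Pair both runs of each algorithm and derive speedup and efficiency.
cmp <- merge(single, multi, by = "algorithm", suffixes = c("_single", "_par"))
n_threads <- 4
cmp$speedup    <- cmp$median_single / cmp$median_par
cmp$efficiency <- cmp$speedup / n_threads
cmp
```

An efficiency close to 1 would mean the four threads are used almost perfectly; lower values point to overhead such as synchronization.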

A curious comparison is the desktop running a single thread against the Raspberry Pi 3 B running four threads:

```{r summary_comp, echo=FALSE, fig.height=5, fig.width=12, message=FALSE, warning=FALSE}
load("results_benchs.rda")
win_rasp <- dplyr::filter(results, (platform == "windows" & !grepl("_par", algorithm)) |
                            (platform == "rasp3b" & grepl("_par", algorithm))) %>%
   mutate(algorithm = case_when(
     platform == "rasp3b" ~ stringr::str_extract(algorithm, ".*(?=_par)"),
     TRUE ~ algorithm
   ), median = bench::as_bench_time(median)) %>%
  dplyr::filter(algorithm != "stamp")

p <- ggplot(win_rasp, aes(x = platform, y = median, col = algorithm))
p2 <- p + ggplot2::geom_point(position = ggplot2::position_dodge(width = -0.2)) +
  ggplot2::coord_flip() +
  labs(title = "Matrix Profile algorithms benchmark by platform", subtitle = "A comparison of Windows with a single thread and Raspberry Pi 3 B with four threads", x = "Platforms", col = "Algorithm", y = "time (less is better)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
p2

```

## Detailed benchmarks

### Single thread experiments

#### x86

```{r mem_alloc_single, echo=FALSE, fig.height=8, fig.width=12}
load("bench_windows_single.rda")
res <- bench_windows_single
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", size = "mem_alloc") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Windows 10 - Single Thread - Speed and Total Memory allocation (MB)", subtitle = "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz", x = "Algorithm", y = "time (less is better)", size = "malloc(MB)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r mem_alloc_single_linux, echo=FALSE, fig.height=8, fig.width=12}
load("bench_linux_single.rda")
res <- bench_linux_single
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", size = "mem_alloc") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Ubuntu WSL - Single Thread - Speed and Total Memory allocation (MB)", subtitle = "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz", x = "Algorithm", y = "time (less is better)", size = "malloc(MB)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r mem_alloc_single_osx, echo=FALSE, fig.height=8, fig.width=12}
load("bench_macosx_single.rda")
res <- bench_macosx_single
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", size = "mem_alloc") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "macOS High Sierra KVM - Single Thread - Speed and Total Memory allocation (MB)", subtitle = "Intel(R) Core(TM) i5-430M CPU @ 2.26-2.53GHz", x = "Algorithm", y = "time (less is better)", size = "malloc(MB)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

#### ARM

```{r mem_alloc_single_arm, echo=FALSE, fig.height=8, fig.width=12}
load("bench_arm_single.rda")
res <- bench_arm_single
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Raspberry Pi OS ARM - Single Thread - Speed (malloc not available in ARM)", subtitle = "Raspberry Pi 3 Model B, Quad Core 1.2GHz Cortex-A53 ARMv8",x = "Algorithm", y = "time (less is better)") +
ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r mem_alloc_single_armtruck, echo=FALSE, fig.height=8, fig.width=12}
load("bench_cubie_single.rda")
res <- bench_cubie_single
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Cubietruck Plus ARM - Single Thread - Speed (malloc not available in ARM)", subtitle = "Cubietruck Plus, Octa Core 2GHz Cortex-A7 ARMv7",x = "Algorithm", y = "time (less is better)") +
ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r cubie_vs_raspi_single, echo=FALSE, fig.height=8, fig.width=12}
bench_arm_single$cpu <- "rasp3b"
bench_cubie_single$cpu <- "ctplus"
res <- rbind(bench_arm_single, bench_cubie_single)
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", col = "cpu") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Cubietruck Plus vs Raspberry Pi 3 B - Speed (malloc not available in ARM)", subtitle = "Single Thread",x = "Algorithm", y = "time (less is better)", col = "Platform") + ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

### Four threads experiments

#### x86

```{r mem_thread_windows, echo=FALSE, fig.height=8, fig.width=12}
load("bench_windows_thread.rda")
res <- bench_windows_thread
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", size = "mem_alloc") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Windows 10 - Four Threads - Speed and Total Memory allocation (MB)", subtitle = "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz", x = "Algorithm", y = "time (less is better)", size = "malloc(MB)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r mem_thread_linux, echo=FALSE, fig.height=8, fig.width=12}
load("bench_linux_thread.rda")
res <- bench_linux_thread
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", size = "mem_alloc") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Ubuntu WSL - Four Threads - Speed and Total Memory allocation (MB)", subtitle = "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz", x = "Algorithm", y = "time (less is better)", size = "malloc(MB)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r macosx_multi, echo=FALSE, fig.height=8, fig.width=12}
load("bench_macosx_thread.rda")
res <- bench_macosx_thread
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time", size = "mem_alloc") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "macOS High Sierra KVM - Four Threads - Speed and Total Memory allocation (MB)", subtitle = "Intel(R) Core(TM) i5-430M CPU @ 2.26-2.53GHz", x = "Algorithm", y = "time (less is better)", size = "malloc(MB)") +
  ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

#### ARM

```{r mem_alloc_thread_arm, echo=FALSE, fig.height=8, fig.width=12}
load("bench_arm_thread.rda")
res <- bench_arm_thread
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Raspberry Pi OS ARM - Four Threads - Speed (malloc not available in ARM)", subtitle = "Raspberry Pi 3 Model B, Quad Core 1.2GHz Cortex-A53 ARMv8",x = "Algorithm", y = "time (less is better)") + ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r mem_alloc_thread_armtruck, echo=FALSE, fig.height=8, fig.width=12}
load("bench_cubie_thread.rda")
res <- bench_cubie_thread
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Cubietruck Plus ARM - Four Threads - Speed (malloc not available in ARM)", subtitle = "Cubietruck Plus, Octa Core 2GHz Cortex-A7 ARMv7",x = "Algorithm", y = "time (less is better)") + ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

> The Cubietruck Plus ARM run with eight threads is depicted below.
> I found that R sometimes does not have all cores available, and this behavior is unpredictable [[link](https://stackoverflow.com/a/40227380/2195337)].
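
Given that the number of usable cores can vary, one option is to check what R reports before choosing the number of workers.
This is a minimal sketch using the base `parallel` package; the capping rule and the names `requested` and `n_workers` are assumptions made for this example, and `detectCores()` reports the cores detected on the machine, which is not necessarily the number the current process may use.

```r
library(parallel)

# Logical cores reported by the OS; may be NA on some platforms.
cores <- parallel::detectCores(logical = TRUE)

# Hypothetical rule: ask for 4 workers, but never more than the cores seen,
# and fall back to a single worker when the count is unknown.
requested <- 4
n_workers <- if (is.na(cores)) 1L else min(requested, cores)

n_workers
```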

```{r mem_alloc_thread_armtruck2, echo=FALSE, fig.height=8, fig.width=12}
load("bench_cubie_thread_eight.rda")
res <- bench_cubie_thread_eight
p <- ggplot2::ggplot(res)
p <- p + ggplot2::aes_string("expression", "time") + ggplot2::geom_jitter() +
  ggplot2::coord_flip()
p + labs(title = "Cubietruck Plus ARM - Eight Threads - Speed (malloc not available in ARM)", subtitle = "Cubietruck Plus, Octa Core 2GHz Cortex-A7 ARMv7",x = "Algorithm", y = "time (less is better)") + ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```

```{r cubie_vs_raspi_thread, echo=FALSE, fig.height=8, fig.width=12}
bench_arm_thread$cpu <- "rasp3b - 4"
bench_cubie_thread$cpu <- "ctplus - 4"
bench_cubie_thread_eight$cpu <- "ctplus - 8"
res <- rbind(bench_arm_thread, bench_cubie_thread_eight, bench_cubie_thread)
p <- ggplot2::ggplot(res, aes(x = cpu, y = median, col = expression)) + ggplot2::geom_point() +
  ggplot2::coord_flip()
p + labs(title = "Cubietruck Plus vs Raspberry Pi 3 B - Speed (malloc not available in ARM)", subtitle = "Four/Eight Threads",x = "Platform", y = "time (less is better)", col = "Algorithm") + ggplot2::facet_grid(d_size ~ w_size, labeller = ggplot2::label_both)
```
