Introduction

knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 60), tidy = TRUE)
library(cowplot)

Most of the time, machine learning algorithms need to learn from continuous data, ultimately an infinite amount of data. It is only natural to believe that these data will change over time, and what is now relevant, may not be in the future. So there is a need to understand when the data is changing.

Concept Drift Detection

Assuming prior knowledge of the terms used in stream learning, Drift Detection is the step of a continuous evaluation of the model error. When this error exceeds a threshold, de drift detector returns a signal informing the model that the context has changed and it must adapt to the new data distribution.

Figure 1 depicts a simple diagram showing where the detection plays its role.

Here we will simulate a concept drift using two gaussian distributions and moving one of them towards the other. We will use a simple KNN algorithm and at certain point we expect that it starts to misclassify new points. This is where the drift detector must raise a signal.

We will make 10^5 iterations, and the KNN algorithm will have a window where the model will learn how to predict the next observation. Of course we could use an evolving algorithm to avoid the re-learning process, but this is beyond this assessment. Using a window size helps to minimize the computation burden and also introduces some kind of forgetting mechanism.

The baseline error rate is given by the formula \(p_i\) that means the probability of observe a misclassification, with standard deviation of \(s_i = \sqrt{p_i(1-p_i)/{i}}\), considering that the distribution is near normal for >30 observations, the confidence interval for \(p\) is \(pi\pm\alpha*s_i\).

At every step we will keep two values: \(p_{min}\) and \(s_{min}\), updating them whenever the current \(p_i+s_i\) reaches a new minimum. In this example, we will use an \(\alpha\) of 95%, and the drift detection will be flagged when \(p_i+s_i \ge p_{min}+2*s_{min}\).

n <- 10000 # number of iterations
drift_start <- 4000 # drift will start at this observation
drift_length <- 50 # drift will end after this amount of steps
from_mean <- 5 # the distribution will start on this mean
to_mean <- 2 # and will end on this mean
sd <- 0.5 # all distributions will have this std dev
window_size <- 100 # this is the window size of KNN
k <- 1 # number of neighbours considered
slow_drift <- seq(from_mean, to_mean, length.out = drift_length) # change sequence

#### 2 clusters dataset ----
set.seed(2020)
data <- rbind(
  matrix(rnorm(n, 0, sd), ncol = 2),
  matrix(c(rnorm(n / 2, 2, sd), rnorm(n / 2, from_mean, sd)), ncol = 2)
)
data <- cbind(data, rep(c(1, 2), each = n / 2))

plot(data[, 1], data[, 2], col = data[, 3], xlim = c(-5, 8), ylim = c(-4, 7),
     main = "Clusters Starting Point", ylab = "", xlab = "")

# mix the samples
set.seed(2020)
idxs_train <- sample(seq_len(nrow(data)))
data <- data[idxs_train, ]

#### prime the model ----
# learn the first window_size examples and do a classification
class <- knn(data[1:window_size, -3], data[(window_size + 1), -3], data[1:window_size, 3], k = k)

# is it wrong?
error <- data[(window_size + 1), 3] != class

pmin <- pi <- 0.5 # initial values
smin <- si <- sqrt(pi * (1 - pi) / window_size) # initial values

# Start the iterations starting from the next observation
set.seed(2020) # set seed again because the drift will use rng
alarm <- 0
pi_counter <- NULL
for (i in seq_len(n - window_size - 1)) {

  # First do the drift if it is time
  if ((i + window_size) >= drift_start) {
    # only drift the second class
    if (data[(window_size + i + 1), 3] == 2) {
      m <- ifelse((i + window_size) < drift_start + drift_length,
        slow_drift[i + window_size - drift_start + 1], tail(slow_drift, 1)
      )
      data[(window_size + i + 1), -3] <- c(rnorm(1, 2, sd), rnorm(1, m, sd))
    }
  }

  cl <- knn(data[(i + 1):(window_size + i), -3],
    data[(window_size + i + 1), -3],
    data[(i + 1):(window_size + i), 3],
    k = k
  )
  class <- c(class, cl)
  error <- c(error, cl != data[(window_size + i + 1), 3])

  er <- sum(error) # sum(tail(error, window_size))
  # update probability of failure
  pi <- er / i
  si <- sqrt(pi * (1 - pi) / window_size)
  pi_counter <- c(pi_counter, pi)

  if ((pi + si) < (pmin + smin)) {
    pmin <- pi
    smin <- si
  } else if ((pi + si) > (pmin + 2 * smin)) {
    if (alarm == 0) {
      alarm <- (i + window_size)
    }
    pmin <- pi
    smin <- si
  }
}

The following plot shows the error rate for the KNN classifier. The black line is the drift starting point, and the red line is the detection point.

plot(pi_counter, type = "l", main = "Drift Detection", xlab = "Time", ylab = "Error Rate")
abline(v = c(drift_start, alarm), col = c(1, 2))

LS0tCnRpdGxlOiAiSEVBRFMgLSBISURBIC0gTGVhcm46IENvbmNlcHQgRHJpZnQiCm91dHB1dDogCiAgaHRtbF9ub3RlYm9vazogCiAgICBmaWdfaGVpZ2h0OiA4CiAgICBmaWdfd2lkdGg6IDEwCiAgICBoaWdobGlnaHQ6IHB5Z21lbnRzCiAgICB0aGVtZTogdW5pdGVkCiAgICB0b2M6IHllcwphdXRob3I6IAogIC0gRGlhbmEgTGVpdGUgUG9ydGVsYSBTaWx2YQogIC0gRnJhbmNpc2NvIEJpc2Nob2ZmCiAgLSBIdWdvIEZpbGlwZSBCYXB0aXN0YSBNb250ZWlybwogIC0gxZ5lcmFmZmV0aW4gR8O8bmXFnwotLS0KCiMgSW50cm9kdWN0aW9uCgpgYGB7ciBzZXR1cCwgbWVzc2FnZSA9IEZBTFNFfQprbml0cjo6b3B0c19jaHVuayRzZXQodGlkeS5vcHRzID0gbGlzdCh3aWR0aC5jdXRvZmYgPSA2MCksIHRpZHkgPSBUUlVFKQpsaWJyYXJ5KGNvd3Bsb3QpCmBgYAoKTW9zdCBvZiB0aGUgdGltZSwgbWFjaGluZSBsZWFybmluZyBhbGdvcml0aG1zIG5lZWQgdG8gbGVhcm4gZnJvbSBjb250aW51b3VzIGRhdGEsIHVsdGltYXRlbHkgYW4gaW5maW5pdGUgYW1vdW50IG9mIGRhdGEuCkl0IGlzIG9ubHkgbmF0dXJhbCB0byBiZWxpZXZlIHRoYXQgdGhlc2UgZGF0YSB3aWxsIGNoYW5nZSBvdmVyIHRpbWUsIGFuZCB3aGF0IGlzIG5vdyByZWxldmFudCwgbWF5IG5vdCBiZSBpbiB0aGUgZnV0dXJlLiBTbyB0aGVyZSBpcyBhIG5lZWQgdG8gdW5kZXJzdGFuZCBfd2hlbl8gdGhlIGRhdGEgaXMgY2hhbmdpbmcuCgojIENvbmNlcHQgRHJpZnQgRGV0ZWN0aW9uCgpBc3N1bWluZyBwcmlvciBrbm93bGVkZ2Ugb2YgdGhlIHRlcm1zIHVzZWQgaW4gc3RyZWFtIGxlYXJuaW5nLCBEcmlmdCBEZXRlY3Rpb24gaXMgdGhlIHN0ZXAgb2YgYSBjb250aW51b3VzIGV2YWx1YXRpb24gb2YgdGhlIG1vZGVsIGVycm9yLiBXaGVuIHRoaXMgZXJyb3IgZXhjZWVkcyBhIHRocmVzaG9sZCwgZGUgZHJpZnQgZGV0ZWN0b3IgcmV0dXJucyBhIHNpZ25hbCBpbmZvcm1pbmcgdGhlIG1vZGVsIHRoYXQgdGhlIGNvbnRleHQgaGFzIGNoYW5nZWQgYW5kIGl0IG11c3QgYWRhcHQgdG8gdGhlIG5ldyBkYXRhIGRpc3RyaWJ1dGlvbi4KCkZpZ3VyZSAxIGRlcGljdHMgYSBzaW1wbGUgZGlhZ3JhbSBzaG93aW5nIHdoZXJlIHRoZSBkZXRlY3Rpb24gcGxheXMgaXRzIHJvbGUuCgpgYGB7ciBlY2hvPUZBTFNFLCBmaWcud2lkdGg9MTAsIGZpZy5oZWlnaHQ9Nn0KZmlnX3N2ZyA8LSBnZ2RyYXcoKSArIGRyYXdfaW1hZ2UoImRyaWZ0VU1MLnN2ZyIpICsgZHJhd19wbG90X2xhYmVsKCJGaWd1cmUgMTogU2VxdWVuY2UgZGlhZ3JhbSBmb3IgYW4gb25saW5lIGFkYXB0aXZlIGxlYXJuaW5nIGFsZ29yaXRobS4gQWRhcHRlZCBmcm9tIEdhbWEsIDIwMDQuIiwgeCA9IC0wLjUsIHkgPSAwLjA4LCBzaXplID0gMTAsIGhqdXN0ID0gLTEuMiwgZm9udGZhY2UgPSAicGxhaW4iKQpwbG90KGZpZ19zdmcpCmBgYAoKCkhlcmUgd2Ugd2lsbCBzaW11bGF0ZSBhIGNvbmNlcHQgZHJpZnQgdXNpbmcgdHdvIGdhdXNzaWFuIGRpc3RyaWJ1dGlvbnMgYW5kIG1vdmluZyBvbmUgb2YgdGhlbSB0b3dhcmRzIHRoZSBvdGhlci4gV2Ugd2lsbCB1c2UgYSBzaW1wbGUgS05OIGFsZ29yaXRobSBhbmQgYXQgY2VydGFpbiBwb2ludCB3ZSBleHBlY3QgdGhhdCBpdCBzdGFydHMgdG8gbWlzY2xhc3NpZnkgbmV3IHBvaW50cy4gVGhpcyBpcyB3aGVyZSB0aGUgZHJpZnQgZGV0ZWN0b3IgbXVzdCByYWlzZSBhIHNpZ25hbC4KCldlIHdpbGwgbWFrZSAxMF41IGl0ZXJhdGlvbnMsIGFuZCB0aGUgS05OIGFsZ29yaXRobSB3aWxsIGhhdmUgYSB3aW5kb3cgd2hlcmUgdGhlIG1vZGVsIHdpbGwgbGVhcm4gaG93IHRvIHByZWRpY3QgdGhlIG5leHQgb2JzZXJ2YXRpb24uIE9mIGNvdXJzZSB3ZSBjb3VsZCB1c2UgYW4gZXZvbHZpbmcgYWxnb3JpdGhtIHRvIGF2b2lkIHRoZSByZS1sZWFybmluZyBwcm9jZXNzLCBidXQgdGhpcyBpcyBiZXlvbmQgdGhpcyBhc3Nlc3NtZW50LiBVc2luZyBhIHdpbmRvdyBzaXplIGhlbHBzIHRvIG1pbmltaXplIHRoZSBjb21wdXRhdGlvbiBidXJkZW4gYW5kIGFsc28gaW50cm9kdWNlcyBzb21lIGtpbmQgb2YgZm9yZ2V0dGluZyBtZWNoYW5pc20uCgpUaGUgYmFzZWxpbmUgZXJyb3IgcmF0ZSBpcyBnaXZlbiBieSB0aGUgZm9ybXVsYSAkcF9pJCB0aGF0IG1lYW5zIHRoZSBwcm9iYWJpbGl0eSBvZiBvYnNlcnZlIGEgbWlzY2xhc3NpZmljYXRpb24sIHdpdGggc3RhbmRhcmQgZGV2aWF0aW9uIG9mICRzX2kgPSBcc3FydHtwX2koMS1wX2kpL3tpfX0kLCBjb25zaWRlcmluZyB0aGF0IHRoZSBkaXN0cmlidXRpb24gaXMgbmVhciBub3JtYWwgZm9yID4zMCBvYnNlcnZhdGlvbnMsIHRoZSBjb25maWRlbmNlIGludGVydmFsIGZvciAkcCQgaXMgJHBpXHBtXGFscGhhKnNfaSQuCgpBdCBldmVyeSBzdGVwIHdlIHdpbGwga2VlcCB0d28gdmFsdWVzOiAkcF97bWlufSQgYW5kICRzX3ttaW59JCwgdXBkYXRpbmcgdGhlbSB3aGVuZXZlciB0aGUgY3VycmVudCAkcF9pK3NfaSQgcmVhY2hlcyBhIG5ldyBtaW5pbXVtLiBJbiB0aGlzIGV4YW1wbGUsIHdlIHdpbGwgdXNlIGFuICRcYWxwaGEkIG9mIDk1JSwgYW5kIHRoZSBkcmlmdCBkZXRlY3Rpb24gd2lsbCBiZSBmbGFnZ2VkIHdoZW4gJHBfaStzX2kgXGdlIHBfe21pbn0rMipzX3ttaW59JC4KCmBgYHtyLCBmaWcuaGVpZ2h0PTUsIGZpZy53aWR0aD03fQpuIDwtIDEwMDAwICMgbnVtYmVyIG9mIGl0ZXJhdGlvbnMKZHJpZnRfc3RhcnQgPC0gNDAwMCAjIGRyaWZ0IHdpbGwgc3RhcnQgYXQgdGhpcyBvYnNlcnZhdGlvbgpkcmlmdF9sZW5ndGggPC0gNTAgIyBkcmlmdCB3aWxsIGVuZCBhZnRlciB0aGlzIGFtb3VudCBvZiBzdGVwcwpmcm9tX21lYW4gPC0gNSAjIHRoZSBkaXN0cmlidXRpb24gd2lsbCBzdGFydCBvbiB0aGlzIG1lYW4KdG9fbWVhbiA8LSAyICMgYW5kIHdpbGwgZW5kIG9uIHRoaXMgbWVhbgpzZCA8LSAwLjUgIyBhbGwgZGlzdHJpYnV0aW9ucyB3aWxsIGhhdmUgdGhpcyBzdGQgZGV2CndpbmRvd19zaXplIDwtIDEwMCAjIHRoaXMgaXMgdGhlIHdpbmRvdyBzaXplIG9mIEtOTgprIDwtIDEgIyBudW1iZXIgb2YgbmVpZ2hib3VycyBjb25zaWRlcmVkCnNsb3dfZHJpZnQgPC0gc2VxKGZyb21fbWVhbiwgdG9fbWVhbiwgbGVuZ3RoLm91dCA9IGRyaWZ0X2xlbmd0aCkgIyBjaGFuZ2Ugc2VxdWVuY2UKCiMjIyMgMiBjbHVzdGVycyBkYXRhc2V0IC0tLS0Kc2V0LnNlZWQoMjAyMCkKZGF0YSA8LSByYmluZCgKICBtYXRyaXgocm5vcm0obiwgMCwgc2QpLCBuY29sID0gMiksCiAgbWF0cml4KGMocm5vcm0obiAvIDIsIDIsIHNkKSwgcm5vcm0obiAvIDIsIGZyb21fbWVhbiwgc2QpKSwgbmNvbCA9IDIpCikKZGF0YSA8LSBjYmluZChkYXRhLCByZXAoYygxLCAyKSwgZWFjaCA9IG4gLyAyKSkKCnBsb3QoZGF0YVssIDFdLCBkYXRhWywgMl0sIGNvbCA9IGRhdGFbLCAzXSwgeGxpbSA9IGMoLTUsIDgpLCB5bGltID0gYygtNCwgNyksCiAgICAgbWFpbiA9ICJDbHVzdGVycyBTdGFydGluZyBQb2ludCIsIHlsYWIgPSAiIiwgeGxhYiA9ICIiKQojIG1peCB0aGUgc2FtcGxlcwpzZXQuc2VlZCgyMDIwKQppZHhzX3RyYWluIDwtIHNhbXBsZShzZXFfbGVuKG5yb3coZGF0YSkpKQpkYXRhIDwtIGRhdGFbaWR4c190cmFpbiwgXQoKIyMjIyBwcmltZSB0aGUgbW9kZWwgLS0tLQojIGxlYXJuIHRoZSBmaXJzdCB3aW5kb3dfc2l6ZSBleGFtcGxlcyBhbmQgZG8gYSBjbGFzc2lmaWNhdGlvbgpjbGFzcyA8LSBrbm4oZGF0YVsxOndpbmRvd19zaXplLCAtM10sIGRhdGFbKHdpbmRvd19zaXplICsgMSksIC0zXSwgZGF0YVsxOndpbmRvd19zaXplLCAzXSwgayA9IGspCgojIGlzIGl0IHdyb25nPwplcnJvciA8LSBkYXRhWyh3aW5kb3dfc2l6ZSArIDEpLCAzXSAhPSBjbGFzcwoKcG1pbiA8LSBwaSA8LSAwLjUgIyBpbml0aWFsIHZhbHVlcwpzbWluIDwtIHNpIDwtIHNxcnQocGkgKiAoMSAtIHBpKSAvIHdpbmRvd19zaXplKSAjIGluaXRpYWwgdmFsdWVzCgojIFN0YXJ0IHRoZSBpdGVyYXRpb25zIHN0YXJ0aW5nIGZyb20gdGhlIG5leHQgb2JzZXJ2YXRpb24Kc2V0LnNlZWQoMjAyMCkgIyBzZXQgc2VlZCBhZ2FpbiBiZWNhdXNlIHRoZSBkcmlmdCB3aWxsIHVzZSBybmcKYWxhcm0gPC0gMApwaV9jb3VudGVyIDwtIE5VTEwKZm9yIChpIGluIHNlcV9sZW4obiAtIHdpbmRvd19zaXplIC0gMSkpIHsKCiAgIyBGaXJzdCBkbyB0aGUgZHJpZnQgaWYgaXQgaXMgdGltZQogIGlmICgoaSArIHdpbmRvd19zaXplKSA+PSBkcmlmdF9zdGFydCkgewogICAgIyBvbmx5IGRyaWZ0IHRoZSBzZWNvbmQgY2xhc3MKICAgIGlmIChkYXRhWyh3aW5kb3dfc2l6ZSArIGkgKyAxKSwgM10gPT0gMikgewogICAgICBtIDwtIGlmZWxzZSgoaSArIHdpbmRvd19zaXplKSA8IGRyaWZ0X3N0YXJ0ICsgZHJpZnRfbGVuZ3RoLAogICAgICAgIHNsb3dfZHJpZnRbaSArIHdpbmRvd19zaXplIC0gZHJpZnRfc3RhcnQgKyAxXSwgdGFpbChzbG93X2RyaWZ0LCAxKQogICAgICApCiAgICAgIGRhdGFbKHdpbmRvd19zaXplICsgaSArIDEpLCAtM10gPC0gYyhybm9ybSgxLCAyLCBzZCksIHJub3JtKDEsIG0sIHNkKSkKICAgIH0KICB9CgogIGNsIDwtIGtubihkYXRhWyhpICsgMSk6KHdpbmRvd19zaXplICsgaSksIC0zXSwKICAgIGRhdGFbKHdpbmRvd19zaXplICsgaSArIDEpLCAtM10sCiAgICBkYXRhWyhpICsgMSk6KHdpbmRvd19zaXplICsgaSksIDNdLAogICAgayA9IGsKICApCiAgY2xhc3MgPC0gYyhjbGFzcywgY2wpCiAgZXJyb3IgPC0gYyhlcnJvciwgY2wgIT0gZGF0YVsod2luZG93X3NpemUgKyBpICsgMSksIDNdKQoKICBlciA8LSBzdW0oZXJyb3IpICMgc3VtKHRhaWwoZXJyb3IsIHdpbmRvd19zaXplKSkKICAjIHVwZGF0ZSBwcm9iYWJpbGl0eSBvZiBmYWlsdXJlCiAgcGkgPC0gZXIgLyBpCiAgc2kgPC0gc3FydChwaSAqICgxIC0gcGkpIC8gd2luZG93X3NpemUpCiAgcGlfY291bnRlciA8LSBjKHBpX2NvdW50ZXIsIHBpKQoKICBpZiAoKHBpICsgc2kpIDwgKHBtaW4gKyBzbWluKSkgewogICAgcG1pbiA8LSBwaQogICAgc21pbiA8LSBzaQogIH0gZWxzZSBpZiAoKHBpICsgc2kpID4gKHBtaW4gKyAyICogc21pbikpIHsKICAgIGlmIChhbGFybSA9PSAwKSB7CiAgICAgIGFsYXJtIDwtIChpICsgd2luZG93X3NpemUpCiAgICB9CiAgICBwbWluIDwtIHBpCiAgICBzbWluIDwtIHNpCiAgfQp9CmBgYAoKVGhlIGZvbGxvd2luZyBwbG90IHNob3dzIHRoZSBlcnJvciByYXRlIGZvciB0aGUgS05OIGNsYXNzaWZpZXIuIFRoZSBibGFjayBsaW5lIGlzIHRoZSBkcmlmdCBzdGFydGluZyBwb2ludCwgYW5kIHRoZSByZWQgbGluZSBpcyB0aGUgZGV0ZWN0aW9uIHBvaW50LgoKYGBge3IsIGZpZy5oZWlnaHQ9NSwgZmlnLndpZHRoPTEwfQpwbG90KHBpX2NvdW50ZXIsIHR5cGUgPSAibCIsIG1haW4gPSAiRHJpZnQgRGV0ZWN0aW9uIiwgeGxhYiA9ICJUaW1lIiwgeWxhYiA9ICJFcnJvciBSYXRlIikKYWJsaW5lKHYgPSBjKGRyaWZ0X3N0YXJ0LCBhbGFybSksIGNvbCA9IGMoMSwgMikpCmBgYAo=