DBSCAN & Outlier Detection

An outlier, or anomaly, is a data point that deviates significantly from the other observations. There are three kinds of outliers: a global outlier is an object that differs significantly from the rest of the data set; a contextual outlier deviates only within a selected context; and a collective outlier is a subset of data objects that, taken together, deviates from the entire data set. Outliers must be distinguished from noisy data. Noise should be removed before outlier detection is applied, because it blurs the distinction between normal data and outliers: noisy data helps hide the outliers in the data set and reduces the effectiveness of the detection process.

Nowadays, with the evolution of the Internet of Things and the increased use of sensors, time series data has become one of the most common types of data. Technological devices generate signals at every moment and produce logs every second. It has therefore become crucial to detect abnormalities in the emitted signals in order to react quickly in case of problems. For instance, anomaly detection has become essential in climate studies to identify abnormal climatic conditions due to global warming. Outlier detection (or anomaly detection) is the process of finding the data points, called outliers, whose behaviour differs from what is expected.

Chandola, Banerjee and Kumar (2009) define anomaly detection as “the problem of finding patterns in data that do not conform to the expected normal behaviour”. This process is important when the anomalies in a data set carry important information about the system. Anomaly detection and clustering analysis are closely related. On the one hand, clustering analysis focuses on finding the patterns in a data set and organizing the data accordingly; on the other hand, outlier detection tries to capture the exceptional data points of the data set.

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a density-based clustering algorithm. It groups together points that are closely packed, separates the points lying in low-density regions from the other data objects, and brands them as outliers.

DBSCAN traditionally takes two parameters:

- Epsilon (ε) is the maximum distance between two data points for them to be considered neighbours. This distance threshold determines whether two points are close enough to belong to the same neighbourhood.
- The minimum number of samples (min_sample, called minPts in the R dbscan package) is the minimum number of objects needed to create a cluster. These points have to lie within the ε-radius of a point to be grouped together.

Choosing values for these parameters is quite complex. There is no fully automatic way to determine them; they have to be set case by case according to the kind of data being treated. If the chosen epsilon is too small, clusters cannot form because too few points fall within each neighbourhood, and most of the points end up as outliers. Conversely, if the chosen value is too big, the majority of the points in the data set are grouped in the same cluster and there are almost no outliers. Generally, a smaller value is preferred to a bigger one. The same reasoning applies to the minimum number of samples: if the value is too small, clusters do not need many points to form and many outliers are absorbed into very small clusters, whereas if it is too large, sparse but genuine groups are labelled as outliers.
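One common heuristic for finding a starting value of epsilon is to compute, for every point, the distance to its k-th nearest neighbour and to look for a sharp bend in the sorted values. The following is a minimal sketch in base R under stated assumptions: the data `x`, the value of `min_sample` and the choice `k = min_sample - 1` are illustrative and do not come from the ECG example.

```r
# Sketch: k-nearest-neighbour distances as a starting point for epsilon.
set.seed(42)

# Hypothetical data: two tight 2-D groups of 30 points plus one isolated point.
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
  c(20, 20)
)

min_sample <- 5      # assumed value, chosen only for this illustration
k <- min_sample - 1  # neighbours other than the point itself

# Full pairwise Euclidean distance matrix (only practical for small data sets).
d <- as.matrix(dist(x))

# Distance from each point to its k-th nearest neighbour.
# The first sorted value is the point's distance to itself (0), hence k + 1.
k_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted values: a sharp jump near the end suggests a candidate epsilon
# just below the jump.
k_dist_sorted <- sort(k_dist)
print(round(tail(k_dist_sorted), 2))
```

Plotting the sorted values, for example with `plot(k_dist_sorted, type = "l")`, makes the bend easier to see. The dbscan package also offers `kNNdist()` and `kNNdistplot()` for the same purpose. The value read from the curve is only a starting point and still has to be judged against the data.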

DBSCAN sorts the data points into three categories. Core points have at least min_sample points within a distance of epsilon. Border points are not core points themselves, but lie within the ε-radius of a core point. Outlier points are those that are neither core points nor border points.

To create a cluster, the DBSCAN algorithm looks at how many neighbours each point has. For a given point, it considers the other points and keeps those within a certain distance (epsilon). If this neighbourhood contains at least min_sample objects, a cluster is created and then expanded to include the neighbours of the neighbours, as long as those neighbours are themselves core points.

To determine whether a data point is abnormal, three steps are applied. First, the algorithm calculates the distance from this point to all the core points defining the clusters. Then it keeps the smallest of these distances, that is, the distance between the possible outlier and its closest core point. Finally, this distance is compared with epsilon to see whether the two points are actually neighbours: if it is larger than epsilon, the largest distance allowed between two neighbours, the point is labelled as an outlier.
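The three categories and the outlier check described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the implementation used by the dbscan package: the function names `classify_points` and `is_outlier` and the toy data are hypothetical, and a point is assumed to count as a member of its own neighbourhood.

```r
# Sketch: label points as core, border or outlier from pairwise distances.
classify_points <- function(x, epsilon, min_sample) {
  d <- as.matrix(dist(x))
  # Neighbourhood sizes; each point counts itself (distance 0) by assumption.
  n_neighbours <- rowSums(d <= epsilon)
  is_core <- n_neighbours >= min_sample
  # A border point is not core but lies within epsilon of some core point.
  near_core <- rowSums(d[, is_core, drop = FALSE] <= epsilon) > 0
  unname(ifelse(is_core, "core", ifelse(near_core, "border", "outlier")))
}

# Sketch: a new point is an outlier if its closest core point is
# farther away than epsilon.
is_outlier <- function(new_point, x, labels, epsilon) {
  cores <- x[labels == "core", , drop = FALSE]
  if (nrow(cores) == 0) return(TRUE)
  # Euclidean distance from the new point to every core point.
  dists <- sqrt(rowSums(sweep(cores, 2, new_point)^2))
  min(dists) > epsilon
}

# Hypothetical toy data: six points on a line (second coordinate is 0).
x <- cbind(c(0, 0.5, 1, 1.5, 2.4, 10), 0)
labels <- classify_points(x, epsilon = 1, min_sample = 4)
print(labels)
# "border" "core" "core" "core" "border" "outlier"

print(is_outlier(c(1.2, 0), x, labels, epsilon = 1))  # FALSE: 0.2 from a core point
print(is_outlier(c(6, 0), x, labels, epsilon = 1))    # TRUE: 4.5 from the closest core point
```

The full distance matrix keeps the sketch short but grows quadratically with the number of points, so it only suits small data sets.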

The major advantage of this algorithm is that it does not include the outliers in a cluster. It avoids the situations where a cluster contains only one observation or where a cluster is heavily influenced by the inclusion of an outlier. Moreover, anomalies are detected not only in the testing set but also in the training set, which means the clustering is not distorted by outliers present in the training data. The following example illustrates the importance of outlier detection. The data set contains electrocardiogram (ECG) data. With this type of data, detecting anomalies quickly is far more valuable than clustering the rest of the data together.

# install.packages("dbscan")
library(dbscan)

## Warning: package 'dbscan' was built under R version 3.5.3

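# Set the working directory to the folder that contains the data file.
setwd("C:/Users/Home/Documents/aaaaa")

# Read the ECG data. Note that sep and dec are both ","; if the file uses "."
# as the decimal mark, the dec argument should be dropped.
data <- read.csv("DataSetHealth.csv", sep = ",", dec = ",", header = TRUE)

# Flatten the data frame into a numeric matrix with 2500 rows. The number of
# values is not a multiple of 2500, so R recycles values to fill the last
# column and issues the warning shown below.
datamatrix <- matrix(as.numeric(unlist(data)), nrow = 2500)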

## Warning in matrix(as.numeric(unlist(data)), nrow = 2500): data length
## [16459964] is not a sub-multiple or multiple of the number of rows
## [2500]

dbscanResult <- dbscan(datamatrix, eps = 4700000, minPts = 10)  # clustering; label 0 marks noise points

dbscanResult

## DBSCAN clustering for 2500 objects.
## Parameters: eps = 4700000, minPts = 10
## The clustering contains 1 cluster(s) and 83 noise points.
## 
##    0    1 
##   83 2417 
## 
## Available fields: cluster, eps, minPts

hullplot(datamatrix, dbscanResult)
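
Since the practical goal is to find the anomalous observations rather than the clusters, the rows labelled as noise can be pulled out through the `cluster` field of the result, where label 0 denotes noise (as in the printed table above). The following is a minimal sketch on hypothetical toy data, not on the ECG file; `toy`, `toyResult` and `noise_rows` are illustrative names, and the `eps` and `minPts` values are chosen only for this toy data.

```r
library(dbscan)

set.seed(1)

# Hypothetical stand-in for the data matrix: 100 points in one dense group
# (rows 1 to 100) plus three far-away rows (rows 101 to 103).
toy <- rbind(
  matrix(rnorm(200, mean = 0, sd = 1), ncol = 2),
  matrix(c(50, 50, -40, 60, 70, -30), ncol = 2, byrow = TRUE)
)

toyResult <- dbscan(toy, eps = 2, minPts = 5)

# Cluster label 0 marks noise; which() returns the matching row numbers.
noise_rows <- which(toyResult$cluster == 0)
print(noise_rows)

# The flagged observations themselves.
print(toy[noise_rows, , drop = FALSE])
```

The same two lines, applied to `dbscanResult` and `datamatrix`, would list the 83 rows reported as noise in the output above.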

Sources

- Naveen Kaushik, 2019, Anomaly Detection with MultiDimensional Time Series Data, Medium.com. Available at: https://medium.com/northraine/anomaly-detection-with-multi-dimensional-time-series-data-4fe8d111dee
- Maria Garcia Gumbao, 2019, Best clustering algorithm for anomaly detection, TowardsDataScience. Available at: https://towardsdatascience.com/best-clustering-algorithms-for-anomaly-detection-d5b7412537c8
- Danile Chepenko, 2018, A Density-based algorithm for outlier detection, TowardsDataScience. Available at: https://towardsdatascience.com/density-based-algorithm-for-outlier-detection-8f278d2f7983
- Homin Lee, 2015, Outlier detection in Datadog: A look at the algorithms, Datadog. Available at: https://www.datadoghq.com/blog/outlier-detection-algorithms-at-datadog/
- Mete Celik, Filiz Dadaser Celik, Ahmet Dokuz, 2011, Anomaly Detection in Temperature Data Using DBSCAN Algorithm, Erciyes University.

Project_3

Margaux Peschon

02/03/2020
