Content

Outliers

  • Introduction of Anomaly Detection

  • Isolation Forest - A Basic

  • The Challenges of Online Learning

  • Streaming Isolation Forest

  • Data set and Results

  • Related Resources

  • Visiting University of Waikato

Introduction of Anomaly Detection

A Simple example of anomalies in a 2-dimensional dataset.

\(N_1\) and \(N_2\) are normal regions. \(o_1\) and \(o_2\) are points that are sufficiently far away from these regions. \(O_3\) are clustered anomalies.

  • Formal definition: “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior.” (Chandola, Banerjee, and Kumar 2009)

  • Anomaly detection is therefore the action to label anomalies and normal points correctly.

Isolation Forest - a basic (#1)

Isolation Forest (iForest) is a collection of Isolation Trees.

An Isolation Tree (iTree) is a binary tree constructed based on sub-samples.

Normal points need more partitions to be isolated, Anomalies require fewer.
  • Isolation Forest \(\neq\) Random Forest
    • iForest uses average path length to detect anomalies (unsupervised learning).
    • Random Forest uses average class labels or probability average to make prediction (supervised learning).
  • Isolation Forest is not density based, but path-length - the number of edges from the root node to a leaf node.

Isolation Forest - a basic (#2)

Definition:

Anomalies are few and different

Normal points need more partitions to be isolated, Anomalies require fewer.

Path length(No of partitions) converges with increasing number of trees.

Isolation Forest - a basic (#3)

The effect of path length based measure for two clusters of data

Comparison of different anomaly detection methods

Challenges of Online Learning

  • Revisiting past data is impractical
    • Recent data can be stored in a finite buffer for computation
    • The entire historical dataset can be infinitely large, so it cannot be stored
  • Adaptable to concept drifts
    • The model must be updated based on incoming data
  • Requires a highly efficient algorithm
    • Expensive computations are not feasible

    • Must keep up with (or faster than) the data stream

    • Leverage parallel computing for efficiency

Anti-Money Laundering

Streaming Isolation Forest (SiForest)

  • Maintenance a sample reservoir of size \(w\) for each tree, for \(t\) number of trees
  • When these reservoirs are updated, check each tree in the forest, \(T \in F\)
    • remove an old data point \(\hat x\) - check the path for update required.
    • insert a new data point \(x\) - check the path for update required.
    • At each node in the path, if the removal \(\hat x\) / insertion \(x\) causes a range change then rebuild the sub tree and stop the check.
    • if removing \(\hat x\), removes a leaf node, then rebuild its parent node.
  • Theorem 1: The completely constructed \(T(Ds \setminus \{\hat x\})\) comes from the same distribution as \(T'=Remove(T(D_s), \hat x)\), which has been modified upon the removal of \(\hat x\), \(P(T') = P(T(D_s \setminus \{\hat x\}))\).
  • Theorem 2: When an example \(x\), where \(x \notin D_s\), is inserted into a tree, the tree \(T''=Insert(T(D_s), x)\) can be adjusted. This results in the effect that \(T''\) is effectively modified as if it were fully constructed from \(T(D_s \cup \{x\})\) by \(P(T'') = P(T(D_s \cup \{x\}))\).

Summary of Isolation-based Methods

Algorithm Sampling Method Update Granularity Scoring Function
SiForest Reservoir Sampling Branch Same as IF
iForestASD Sliding Window Forest, ** Same as IF
olFOR Sliding Window Leaf Same as IF
RRCF Sliding Window Node Displacement
HST Sliding Window Forest, ++ same as IF

** when ratio of anomalies becomes high

++ update every window

Unpacking and Discussion

  • Reservoir sampling approximates random sampling in batch learning, Therefore, at any point in time \(t\), the updated SiForest is equivalent to iForest using random samplings of the data thus far.
  • Complexity: Constuction/Updating \(O(w log_2 w)\), Scoring \(O(log_2 w)\)
  • Space: \(O(kw)\)
  • SiForest
    • Reconstruct sub-tree, when the data range changed.
    • Unbiased attribute selection.
  • RRCF
    • Tree manipulation to cater for new data point.
    • Biased to attributes with wider range.

Datasets

SiForest is implemented on CapyMOA.org

All experiments ran on Intel® Core™ i5-10500 3.10GHz CPU with 32 GB RAM running Ubuntu 22.04.4 LTS.

Results

Critical Difference

Visiting University of Waikato

Invitation to Research Collaboration

  • Applied Streaming Anomaly Detection to different domains

    • What are the industry use cases?
  • Semi-supervised Anomaly Detection (when some labelled data is available)

    • How to enhance the model with labelled data?
  • Active Learning (Learner has a budget to label particular data points)

    • Which data points should be labelled?

Reference

Chandola, Varun, Arindam Banerjee, and Vipin Kumar. 2009. “Anomaly Detection: A Survey.” ACM Computing Surveys (CSUR) 41 (3): 15. http://scholar.google.de/scholar.bib?q=info:jAfBmk-9uAcJ:scholar.google.com/&output=citation&hl=de&ct=citation&cd=0.
Ding, Z., and M. Fei. 2013. “An Anomaly Detection Approach Based on Isolation Forest Algorithm for Streaming Data Using Sliding Window.” IFAC Proceedings Volumes 46 (20): 12–17. https://doi.org/https://doi.org/10.3182/20130902-3-CN-3020.00044.
Guha, Sudipto, Nina Mishra, Gourav Roy, and Okke Schrijvers. 2016. “Robust Random Cut Forest Based Anomaly Detection on Streams.” In ICML, 2712–21. PMLR.
Hartl, Alexander, Félix Iglesias, and Tanja Zseby. 2020. “SDOstream: Low-Density Models for Streaming Outlier Detection.” In ESANN, 661–66.
Leveni, Filippo, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, and Giacomo Boracchi. 2024. “Online Isolation Forest.” In ICML. PMLR.
Manzoor, Emaad, Hemank Lamba, and Leman Akoglu. 2018. “Xstream: Outlier Detection in Feature-Evolving Data Streams.” In 24th ACM SIGKDD, 1963–72.
Pevnỳ, Tomáš. 2016. “Loda: Lightweight on-Line Detector of Anomalies.” Machine Learning 102: 275–304.
Sathe, Saket, and Charu C Aggarwal. 2016. “Subspace Outlier Detection in Linear Time with Randomized Hashing.” In 16th ICDM, 459–68. IEEE.
Tan, Swee Chuan, Kai Ming Ting, and Fei Tony Liu. 2011. “Streaming Half-Space-Trees.” In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. https://www.ijcai.org/Proceedings/11/Papers/254.pdf.
Vázquez, Félix Iglesias, Tanja Zseby, and Arthur Zimek. 2018. “Outlier Detection Based on Low Density Models.” In 2018 ICDMW, 970–79. IEEE.