Streaming Isolation Forest

Content

Outliers

Introduction of Anomaly Detection
Isolation Forest - A Basic
The Challenges of Online Learning
Streaming Isolation Forest
Data set and Results
Related Resources
Visiting University of Waikato

Introduction of Anomaly Detection

A Simple example of anomalies in a 2-dimensional dataset.

\(N_1\) and \(N_2\) are normal regions. \(o_1\) and \(o_2\) are points that are sufficiently far away from these regions. \(O_3\) are clustered anomalies.

Formal definition: “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior.” (Chandola, Banerjee, and Kumar 2009)
Anomaly detection is therefore the action to label anomalies and normal points correctly.

Isolation Forest - a basic (#1)

Isolation Forest (iForest) is a collection of Isolation Trees.

An Isolation Tree (iTree) is a binary tree constructed based on sub-samples.

Normal points need more partitions to be isolated, Anomalies require fewer.

Isolation Forest \(\neq\) Random Forest
- iForest uses average path length to detect anomalies (unsupervised learning).
- Random Forest uses average class labels or probability average to make prediction (supervised learning).
Isolation Forest is not density based, but path-length - the number of edges from the root node to a leaf node.

Isolation Forest - a basic (#2)

Definition:

Anomalies are few and different

Path length(No of partitions) converges with increasing number of trees.

Isolation Forest - a basic (#3)

The effect of path length based measure for two clusters of data

Comparison of different anomaly detection methods

Challenges of Online Learning

Revisiting past data is impractical
- Recent data can be stored in a finite buffer for computation
- The entire historical dataset can be infinitely large, so it cannot be stored
Adaptable to concept drifts
- The model must be updated based on incoming data
Requires a highly efficient algorithm
- Expensive computations are not feasible
- Must keep up with (or faster than) the data stream
- Leverage parallel computing for efficiency

Anti-Money Laundering

Related Work

iForestASD - Anomaly Detection for Streaming Data (Ding and Fei 2013).- A sliding windows approach, update model when the anomaly rate changes significantly.
HST - Half-Space Trees (Tan, Ting, and Liu 2011) - Trees are pre-built and are updated periodically.
xStream (Manzoor, Lamba, and Akoglu 2018) and RS-Hash(RSH) (Sathe and Aggarwal 2016) are lightweight and ensemble-based approaches for high-dimensional streaming data.

LODA (Pevnỳ 2016) - random projections with histograms for scoring.
SDOstream(SDOS) (Hartl, Iglesias, and Zseby 2020) - incremental updates and model aging based on SDO (Vázquez, Zseby, and Zimek 2018).
Online Isolation Forest (oIF) (Leveni et al. 2024) -use dynamic multi-resolution histograms to model evolving data distributions.
Robust Random Cut Forest (RRCF) (Guha et al. 2016) - Dynamically adjusting tree structures to reflect changes in data.

Streaming Isolation Forest (SiForest)

Maintenance a sample reservoir of size \(w\) for each tree, for \(t\) number of trees
When these reservoirs are updated, check each tree in the forest, \(T \in F\)
- remove an old data point \(\hat x\) - check the path for update required.
- insert a new data point \(x\) - check the path for update required.
- At each node in the path, if the removal \(\hat x\) / insertion \(x\) causes a range change then rebuild the sub tree and stop the check.
- if removing \(\hat x\), removes a leaf node, then rebuild its parent node.

Theorem 1: The completely constructed \(T(Ds \setminus \{\hat x\})\) comes from the same distribution as \(T'=Remove(T(D_s), \hat x)\), which has been modified upon the removal of \(\hat x\), \(P(T') = P(T(D_s \setminus \{\hat x\}))\).
Theorem 2: When an example \(x\), where \(x \notin D_s\), is inserted into a tree, the tree \(T''=Insert(T(D_s), x)\) can be adjusted. This results in the effect that \(T''\) is effectively modified as if it were fully constructed from \(T(D_s \cup \{x\})\) by \(P(T'') = P(T(D_s \cup \{x\}))\).

Summary of Isolation-based Methods

Algorithm	Sampling Method	Update Granularity	Scoring Function
SiForest	Reservoir Sampling	Branch	Same as IF
iForestASD	Sliding Window	Forest, **	Same as IF
olFOR	Sliding Window	Leaf	Same as IF
RRCF	Sliding Window	Node	Displacement
HST	Sliding Window	Forest, ++	same as IF

** when ratio of anomalies becomes high

++ update every window

Unpacking and Discussion

Reservoir sampling approximates random sampling in batch learning, Therefore, at any point in time \(t\), the updated SiForest is equivalent to iForest using random samplings of the data thus far.
Complexity: Constuction/Updating \(O(w log_2 w)\), Scoring \(O(log_2 w)\)
Space: \(O(kw)\)

SiForest
- Reconstruct sub-tree, when the data range changed.
- Unbiased attribute selection.
RRCF
- Tree manipulation to cater for new data point.
- Biased to attributes with wider range.

Datasets

SiForest is implemented on CapyMOA.org

All experiments ran on Intel® Core™ i5-10500 3.10GHz CPU with 32 GB RAM running Ubuntu 22.04.4 LTS.

Results

Critical Difference

Related Resources

Weka - Data Mining Practical Machine Learning Tools and Techniques

MOA - Machine Learning for Data Streams and Practical Examples

Visiting University of Waikato

Invitation to Research Collaboration

Applied Streaming Anomaly Detection to different domains
- What are the industry use cases?
Semi-supervised Anomaly Detection (when some labelled data is available)
- How to enhance the model with labelled data?
Active Learning (Learner has a budget to label particular data points)
- Which data points should be labelled?

Reference

Chandola, Varun, Arindam Banerjee, and Vipin Kumar. 2009. “Anomaly Detection: A Survey.” ACM Computing Surveys (CSUR) 41 (3): 15. http://scholar.google.de/scholar.bib?q=info:jAfBmk-9uAcJ:scholar.google.com/&output=citation&hl=de&ct=citation&cd=0.

Ding, Z., and M. Fei. 2013. “An Anomaly Detection Approach Based on Isolation Forest Algorithm for Streaming Data Using Sliding Window.” IFAC Proceedings Volumes 46 (20): 12–17. https://doi.org/https://doi.org/10.3182/20130902-3-CN-3020.00044.

Guha, Sudipto, Nina Mishra, Gourav Roy, and Okke Schrijvers. 2016. “Robust Random Cut Forest Based Anomaly Detection on Streams.” In ICML, 2712–21. PMLR.

Hartl, Alexander, Félix Iglesias, and Tanja Zseby. 2020. “SDOstream: Low-Density Models for Streaming Outlier Detection.” In ESANN, 661–66.

Leveni, Filippo, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, and Giacomo Boracchi. 2024. “Online Isolation Forest.” In ICML. PMLR.

Manzoor, Emaad, Hemank Lamba, and Leman Akoglu. 2018. “Xstream: Outlier Detection in Feature-Evolving Data Streams.” In 24th ACM SIGKDD, 1963–72.

Pevnỳ, Tomáš. 2016. “Loda: Lightweight on-Line Detector of Anomalies.” Machine Learning 102: 275–304.

Sathe, Saket, and Charu C Aggarwal. 2016. “Subspace Outlier Detection in Linear Time with Randomized Hashing.” In 16th ICDM, 459–68. IEEE.

Tan, Swee Chuan, Kai Ming Ting, and Fei Tony Liu. 2011. “Streaming Half-Space-Trees.” In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. https://www.ijcai.org/Proceedings/11/Papers/254.pdf.

Vázquez, Félix Iglesias, Tanja Zseby, and Arthur Zimek. 2018. “Outlier Detection Based on Low Density Models.” In 2018 ICDMW, 970–79. IEEE.