Summary

Isolation Forest (iForest) is an Model Ensembling-based method used for anomaly detection. It operates by isolating data points using a series of random binary splits.

The key idea is that anomalies, being rare and different, are easier to isolate and thus require fewer splits.

Mathematically, the isolation of a point is captured by the path length in a decision tree, where shorter paths indicate anomalies. The algorithm constructs multiple isolation trees, and the anomaly score of a point is determined by the average path length across all trees.

Isolation Forest is highly efficient for large datasets and is particularly useful when the assumption is that anomalies are rare and distinct from normal instances.

Breakdown

Key Components:

  • Isolation Trees (iTrees): Binary trees where the goal is to isolate observations based on randomly chosen features and split values.
  • Anomaly Score: Calculated based on the average path length across all isolation trees.
  • Path Length: Anomalies tend to have shorter path lengths as they are easier to isolate.
  • Random Splitting: Random feature selection and splitting result in the separation of instances, with fewer splits isolating anomalies.

Important

  • Anomalies are identified based on shorter average path lengths in the isolation forest, indicating that fewer splits are needed to isolate them.
  • The method scales well with large datasets because it relies on randomly generated trees, avoiding complex distance or density computations.

Attention

  • Isolation Forest assumes that anomalies are few and distinct; it may perform poorly when anomalies are not easily distinguishable.
  • The method is sensitive to the Hyperparameter such as the number of trees and sample size.