Basic:

Advanced:

Grubbs’ Test

Context:
Grubbs’ test is a hypothesis test designed to detect a single outlier in a normally distributed dataset. It measures the largest absolute deviation from the sample mean in units of the sample standard deviation. To handle more than one outlier, the test is applied iteratively: the flagged point is removed and the test is repeated on the remaining data.

Purpose:
To determine whether the most extreme data point (either smallest or largest) is a statistical outlier.

Steps:

  • Compute the test statistic:

    $G = \dfrac{\max_i |x_i - \bar{x}|}{s}$

    where:
    • $x_i$: Data points
    • $\bar{x}$: Mean of the dataset
    • $s$: Standard deviation of the dataset.
  • Compare $G$ to a critical value:
    • The critical value depends on the sample size and significance level (e.g., 0.05).
    • If $G$ exceeds the critical value, the data point is considered an outlier.
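
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample readings are made up for the example, and the critical value is passed in by the caller because it normally comes from a published Grubbs table or from a Student's t quantile. The value 2.290 used here is the commonly tabulated two-sided critical value for 10 observations at a 0.05 significance level; check it against your own table before relying on it.

```python
import math
import statistics


def grubbs_statistic(data):
    """Return (G, index) for the point farthest from the sample mean."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    # Index of the observation with the largest absolute deviation from the mean.
    index = max(range(len(data)), key=lambda i: abs(data[i] - mean))
    return abs(data[index] - mean) / s, index


def grubbs_test(data, critical_value):
    """Return (is_outlier, G, index) for the most extreme point in data."""
    g, index = grubbs_statistic(data)
    return g > critical_value, g, index


# Hypothetical readings: nine values near 10 and one suspicious value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]

# Assumed two-sided critical value for N = 10, alpha = 0.05 (from a table).
G_CRIT = 2.290

is_outlier, g, index = grubbs_test(readings, G_CRIT)
print(f"G = {g:.3f}, candidate = {readings[index]}, outlier = {is_outlier}")
```

To look for further outliers, the flagged point would be removed and the test repeated with the critical value for the reduced sample size.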

Limitations:

  • Assumes data follows a normal distribution.
  • Not suited to detecting several outliers at once: it tests one point per pass, and multiple outliers can mask one another by inflating the standard deviation.

Histogram-Based Outlier Detection (HBOS)

Context:

HBOS is a non-parametric method that detects anomalies by analyzing the distribution of individual features independently. It relies on histograms, which estimate feature density.

Purpose:
To identify outliers as data points falling in bins with low frequencies or densities.

Steps:

  • Create histograms for each feature:
    • Divide each feature’s range into bins.
    • Count the frequency of data points in each bin.
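  • Calculate scores for each data point:
    • For each feature, look up the density of the bin the point falls into, then combine the per-feature densities into one score (in the usual formulation, the sum over features of the log of the inverse bin density).
    • Outliers are points in bins with significantly lower densities compared to others, so they receive the highest scores.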
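
The steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it uses equal-width bins, normalises each histogram so the tallest bin has density 1, and scores a point as the sum over features of log(1 / density). The function name `hbos_scores`, the `n_bins` default, and the sample rows are hypothetical and chosen only for the example.

```python
import math


def hbos_scores(rows, n_bins=5):
    """Return one score per row; a higher score means more anomalous."""
    n_features = len(rows[0])
    scores = [0.0] * len(rows)
    for j in range(n_features):
        values = [row[j] for row in rows]
        lo, hi = min(values), max(values)
        # Equal-width bins; a constant feature falls back to width 1.0.
        width = (hi - lo) / n_bins or 1.0
        # Bin index per value; the maximum value is clamped into the last bin.
        bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)
        for i, b in enumerate(bins):
            density = counts[b] / tallest  # tallest bin has density 1
            scores[i] += math.log(1.0 / density)
    return scores


# Hypothetical two-feature data: six similar rows and one unusual row.
rows = [
    (1.0, 10.0), (1.1, 10.2), (0.9, 9.9),
    (1.2, 10.1), (1.0, 10.0), (0.8, 9.8),
    (5.0, 20.0),
]

scores = hbos_scores(rows)
most_anomalous = scores.index(max(scores))
print(most_anomalous, [round(s, 2) for s in scores])
```

Because each feature is scored from its own histogram, the result depends directly on `n_bins`, which is the bin-size sensitivity noted under Limitations.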

Advantages:

  • Does not assume a specific data distribution.
  • Scales well to large datasets.

Limitations:

  • Assumes feature independence, so it can miss anomalies that are only visible in combinations of features.
  • Sensitive to bin size selection.

One-Class SVM

One-Class Support Vector Machine is a variation of the SVM algorithm used for anomaly detection. It learns a decision boundary around the normal data points.

Steps:

  • Train the model on the normal data points.
  • The model attempts to find a hyperplane, in the kernel feature space, that separates the normal data from the origin with the largest possible margin.
  • Points that fall on the origin side of this boundary are classified as anomalies.
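
A minimal sketch of these steps, assuming scikit-learn and NumPy are installed. The training data is synthetic, and the parameter choices (`kernel="rbf"`, `gamma="scale"`, `nu=0.05`) are illustrative assumptions rather than recommended settings. In scikit-learn's `OneClassSVM`, `predict` returns +1 for points inside the learned boundary and -1 for anomalies.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic "normal" data: 200 two-dimensional points around the origin.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# One point near the training data and one far away from it.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
labels = model.predict(X_new)            # +1 = normal, -1 = anomaly
scores = model.decision_function(X_new)  # negative values lie outside the boundary

print(labels, scores)
```

The sign of `decision_function` gives the same decision as `predict`, while its magnitude can be used to rank points by how far they lie from the boundary.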