Anomaly Detection
Identify unusual or unexpected data points that deviate from the norm.
There are several ways to detect outliers: visual methods such as boxplots, statistical methods such as Z-scores, and clustering techniques.
Visual Methods
- Boxplot: Displays the distribution and identifies outliers using the interquartile range (IQR).
- Scatter Plot: Plots two features against each other so that points lying far from the bulk of the data stand out.
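The IQR rule behind a boxplot can also be applied directly, without drawing anything. The sketch below is one possible version, assuming the common convention of flagging points more than 1.5 × IQR beyond the quartiles. The function name `iqr_outliers`, the multiplier `k`, and the sample data are illustrative, and plotting libraries may compute quartiles slightly differently.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Illustrative data: nine typical readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 14, 13, 95]
outliers = iqr_outliers(data)
print(outliers)
```

Raising `k` makes the rule more conservative, so fewer points are flagged.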
Clustering
- Description: Outliers often form small clusters or are isolated from main clusters.
PCA-Based Anomaly Detection
In ML_Tools see: PCA_Based_Anomaly_Detection.py
Time Series Methods
See Time Series.
Statistical Methods
Z-Score: Identifies outliers by measuring how many standard deviations a data point is from the mean.
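A minimal sketch of the Z-score check, assuming the widely used cut-off of 3 standard deviations. The cut-off, the function name `zscore_outliers`, and the sample data are assumptions for illustration, not part of these notes.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]

# Illustrative data: 19 readings near 10 and one extreme value.
data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
        11, 9, 10, 10, 12, 9, 10, 11, 10, 60]
outliers = zscore_outliers(data)
print(outliers)
```

Because the outlier itself inflates the mean and standard deviation, this check can miss anomalies in very small samples.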
Gaussian method
Anomaly Detection
Example: You have an unlabeled dataset of servers. We aim to detect those that are not working properly (the anomalies).
Gaussian model
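To perform anomaly detection, you will first need to fit a model to the data’s distribution.
- Given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$, you want to estimate the Gaussian distribution for each of the features $x_i$.
- Recall that the Gaussian distribution is given by
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean and $\sigma^2$ controls the variance.
- For each feature $i = 1, \ldots, n$, you need to find parameters $\mu_i$ and $\sigma_i^2$ that fit the data in the $i$-th dimension $\{x_i^{(1)}, \ldots, x_i^{(m)}\}$ (the $i$-th dimension of each example).
You can estimate the parameters, ($\mu_i$, $\sigma_i^2$), of the $i$-th feature by using the following equations. To estimate the mean, you will use:
$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$
and for the variance you will use:
$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$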
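The parameter estimation can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the variance is divided by the number of examples m, and the density of a whole example is taken as the product of the per-feature Gaussian densities, which assumes the features are independent. The function names and the server measurements are hypothetical.

```python
import math

def estimate_gaussian(X):
    """Per-feature mean and variance (dividing by m) for the rows of X."""
    m, n = len(X), len(X[0])
    mu = [sum(row[i] for row in X) / m for i in range(n)]
    var = [sum((row[i] - mu[i]) ** 2 for row in X) / m for i in range(n)]
    return mu, var

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def density(x, mu, var):
    """Product of per-feature densities (assumes independent features)."""
    p = 1.0
    for xi, mi, vi in zip(x, mu, var):
        p *= gaussian_pdf(xi, mi, vi)
    return p

# Hypothetical server measurements: two features per server.
X_train = [[14.0, 15.0], [13.0, 14.0], [15.0, 16.0], [14.0, 15.0], [14.0, 15.0]]
mu, var = estimate_gaussian(X_train)

p_typical = density([14.0, 15.0], mu, var)  # a point at the centre of the data
p_unusual = density([25.0, 5.0], mu, var)   # a point far from the training data
print(mu, var, p_typical, p_unusual)
```

A server whose measurements sit far from the fitted means receives a much smaller density than a typical one, which is what the later thresholding step relies on.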
Anomalies are combinations of feature values that have a low probability of occurring together. To see them, make a 2D plot of two features; try other feature pairs if necessary.
What is a multivariate Gaussian?
- The low probability examples are more likely to be the anomalies in our dataset.
- One way to determine which examples are anomalies is to select a threshold, epsilon, based on a cross-validation set. The open question is which value of epsilon to choose.
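One possible way to pick epsilon is sketched below. It assumes the cross-validation set carries labels (1 = anomaly, 0 = normal) and that candidate thresholds are compared by F1 score. Both are assumptions, not something these notes specify. The names `select_threshold` and `f1_score`, the number of scan steps, and the validation values are illustrative.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(y_val, p_val, steps=1000):
    """Scan thresholds between min and max density; keep the best F1."""
    best_epsilon, best_f1 = 0.0, 0.0
    lo, hi = min(p_val), max(p_val)
    step = (hi - lo) / steps
    if step == 0:
        return best_epsilon, best_f1  # all densities equal, nothing to scan
    for k in range(1, steps + 1):
        epsilon = lo + k * step
        # An example is flagged as an anomaly when its density is below epsilon.
        y_pred = [1 if p < epsilon else 0 for p in p_val]
        f1 = f1_score(y_val, y_pred)
        if f1 > best_f1:
            best_epsilon, best_f1 = epsilon, f1
    return best_epsilon, best_f1

# Hypothetical validation densities and labels (1 = anomaly).
p_val = [0.40, 0.35, 0.38, 0.001, 0.30, 0.0005, 0.33]
y_val = [0, 0, 0, 1, 0, 1, 0]
epsilon, f1 = select_threshold(y_val, p_val)
print(epsilon, f1)
```

F1 is used here instead of accuracy because anomalies are usually rare, so a model that flags nothing can still score high accuracy.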