Outliers are data points that differ significantly from the other observations in a dataset. They can skew the training of machine learning models, especially models that are sensitive to extreme values, such as Linear Regression, where the squared-error loss gives large deviations disproportionate weight.
Handling outliers is similar to Handling Missing Data.
Methods for Handling Outliers
1. Trimming
- Description: Removing data points identified as outliers based on criteria such as being beyond a certain number of standard deviations from the mean or outside a specified percentile range.
- Implementation Example:
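A minimal sketch of both trimming criteria with NumPy, assuming a one-dimensional numeric array. The function names, the 3-standard-deviation cutoff, the percentile bounds, and the sample data are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

def trim_by_zscore(values, max_z=3.0):
    """Drop points more than max_z standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return values.copy()
    z = np.abs(values - values.mean()) / std
    return values[z <= max_z]

def trim_by_percentile(values, lower=5.0, upper=95.0):
    """Drop points outside the [lower, upper] percentile range."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower, upper])
    return values[(values >= lo) & (values <= hi)]

# Hypothetical sample: eleven readings near 11-12 and one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

trimmed_z = trim_by_zscore(data)        # removes the 95
trimmed_p = trim_by_percentile(data)    # also removes the 95
```

Trimming shrinks the dataset, so it is usually worth checking how many rows were removed before training on the result.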
2. Capping or Flooring
- Description: Setting a maximum or minimum threshold beyond which data points are considered outliers and replacing them with the threshold value.
3. Winsorizing
- Description: Similar to capping and flooring, winsorizing replaces extreme values with less extreme values within a specified range, typically using percentiles.
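The difference between capping/flooring and winsorizing can be sketched as follows. This is an illustrative example, assuming NumPy and a one-dimensional array; the function names, the fixed thresholds (0 and 20), and the 5th/95th percentile bounds are assumptions chosen for the demonstration. Clipping to interpolated percentiles is a common approximation of winsorizing; a strict version replaces extremes with the nearest retained observation.

```python
import numpy as np

def cap_and_floor(values, floor, cap):
    """Replace values below `floor` or above `cap` with the threshold itself."""
    return np.clip(np.asarray(values, dtype=float), floor, cap)

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clip values to bounds derived from the data's own percentiles."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Hypothetical sample with one extreme low and one extreme high value.
data = np.array([-40, 10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95])

capped = cap_and_floor(data, floor=0, cap=20)  # thresholds fixed in advance
winsorized = winsorize(data)                   # thresholds computed from the data
```

Unlike trimming, both approaches keep every row, which matters when other columns of the same record are still useful.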
Detection Techniques
Use visual methods like boxplots, statistical methods like Z-scores, or clustering techniques.
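The two numeric rules behind these techniques, the Z-score and the IQR fences that a boxplot draws, can be sketched with NumPy. The function names, the |z| > 3 cutoff, the 1.5 × IQR multiplier, and the sample data are conventional or illustrative assumptions rather than values specified in these notes.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Boolean mask: True where |z-score| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # Constant data has no outliers under this rule.
        return np.zeros(values.shape, dtype=bool)
    return np.abs((values - values.mean()) / std) > threshold

def iqr_outliers(values, k=1.5):
    """Boolean mask: True where a value lies outside the boxplot fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample: eleven readings near 11-12 and one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

z_mask = zscore_outliers(data)
iqr_mask = iqr_outliers(data)
flagged = data[z_mask | iqr_mask]   # values flagged by either rule
```

The Z-score rule is itself affected by outliers, because they inflate the mean and standard deviation it relies on; the IQR rule uses quartiles and is less sensitive to this.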
1. Visual Methods
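- Boxplot: Displays the distribution and flags as outliers the points that fall beyond the whiskers, conventionally more than 1.5 × IQR (interquartile range) below the first quartile or above the third quartile.
- Scatter Plot: Plots two variables against each other so that points lying far from the main cloud or trend stand out.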
2. Statistical Methods
- Z-Score: Identifies outliers by measuring how many standard deviations a data point lies from the mean; a common rule of thumb flags points with |z| > 3.
3. Clustering
- Description: Outliers often form small clusters or are isolated from main clusters.
4. PCA-Based Anomaly Detection
- Implementation Example:
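One common way to use PCA for anomaly detection is to project the data onto its leading principal components, reconstruct it, and flag the points with the largest reconstruction error. The sketch below follows that approach using only NumPy's SVD; it is an assumption about what this item intends, and the function name, the synthetic data, and the 99th-percentile threshold are all hypothetical choices for illustration.

```python
import numpy as np

def pca_reconstruction_error(X, n_components=1):
    """Squared distance between each row and its PCA reconstruction."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    projected = Xc @ components.T                # coordinates in the PCA subspace
    reconstructed = projected @ components       # mapped back to feature space
    return np.sum((Xc - reconstructed) ** 2, axis=1)

# Synthetic data: 200 points close to the line y = 2x, plus one point far off it.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])
X = np.vstack([X, [2.0, -4.0]])                  # injected anomaly at index 200

errors = pca_reconstruction_error(X, n_components=1)
threshold = np.percentile(errors, 99)            # assumed cutoff; tune per dataset
outlier_idx = np.where(errors > threshold)[0]
```

A percentile cutoff always flags a fixed fraction of the data, so the flagged points are candidates for review rather than confirmed outliers. Features on very different scales should normally be standardized before applying PCA.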