Standardisation is a data preprocessing technique used for data transformation. It centres the data to zero mean and unit variance, which makes it suitable for algorithms that are sensitive to the scale or variance of the features.
Definition: Standardisation involves rescaling the features of your data so that they have a mean of 0 and a standard deviation of 1. This is achieved by subtracting the mean of the feature from each data point and then dividing by the standard deviation.
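To see the definition in action, the same rescaling can be done by hand with NumPy; the array below is a made-up toy feature used purely for illustration.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # toy feature values (illustrative only)
z = (x - x.mean()) / x.std()               # subtract the mean, divide by the standard deviation

print(z.mean())  # approximately 0
print(z.std())   # 1.0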
Purpose:
- Useful for algorithms that assume the data is normally distributed (Gaussian distribution).
- Uniformity: It helps in bringing all features to the same scale.
Use Cases
- Centred Data Assumption: Standardisation is beneficial when the model assumes that the data is centred around zero. This is common in algorithms such as linear regression, logistic regression, and principal component analysis (PCA), in distance-based algorithms like KNN, and in gradient descent optimisation (a pipeline sketch follows this list).
- Improved Performance: It can improve the performance and convergence speed of machine learning algorithms by ensuring that each feature contributes equally to the result.
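As a minimal sketch of the point about distance-based algorithms, the pipeline below standardises features before fitting a KNN classifier, so that no single large-scale feature dominates the distance computation; X and y are placeholders for a feature matrix and label vector.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling happens inside the pipeline, so the same transformation is applied at fit and predict time
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)                 # X, y are placeholder training data
predictions = knn.predict(X)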
Formula
The formula for standardisation is:

z = (x - μ) / σ

Where:
- x is the original data point.
- μ is the mean of the feature.
- σ is the standard deviation of the feature.
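For example, a raw value of x = 70 in a feature with mean μ = 60 and standard deviation σ = 5 standardises to z = (70 - 60) / 5 = 2, i.e. two standard deviations above the mean.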
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform returns a NumPy array in which each feature has mean 0 and std deviation 1
df_standardized = scaler.fit_transform(df)
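Note that fit_transform above learns the mean and standard deviation from df itself. When working with separate training and test sets, the usual pattern is to fit on the training split and reuse those parameters on the test split; X_train and X_test below are placeholder splits, not variables from the snippet above.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from the training data only
X_test_std = scaler.transform(X_test)        # apply the same mean and std to the test data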