EDA

Exploratory data analysis (EDA) is an approach to analyzing datasets that summarizes their main characteristics, often through visual and statistical methods. It helps with:

Related to:

1. Understanding Variable Behaviour

Univariate: Focuses on single variable distributions, central tendency, and spread.
Multivariate: Explores interactions between variables and their collective behavior.
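As a minimal sketch of the two views above, assuming pandas is available: the DataFrame and its column names (`age`, `income`) are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 38],
    "income": [28_000, 52_000, 61_000, 40_000, 90_000, 58_000],
})

# Univariate: central tendency and spread of a single variable.
age_summary = {
    "mean": df["age"].mean(),
    "median": df["age"].median(),
    "std": df["age"].std(),  # sample standard deviation (ddof=1)
    "iqr": df["age"].quantile(0.75) - df["age"].quantile(0.25),
}

# Multivariate: how the variables move together (Pearson by default).
corr_matrix = df.corr()

print(age_summary)
print(corr_matrix)
```

The univariate summary describes one column in isolation; the correlation matrix is only a first look at joint behaviour and captures linear association between pairs of variables.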

Techniques:

Feature Importance: Determine which variables matter most:

Understand the shape and scale of each variable:

Use a log transformation to reduce right skew and bring a distribution closer to normal (values must be positive)
For imbalanced categorical features (e.g. 90% in one class), assess usefulness (see Imbalanced Datasets)
Always interpret distributions in domain context, not just statistical form
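The first two checks above might look like this in practice. This is a sketch on invented data: the values, the class labels, and the use of `log1p` (rather than a plain log) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable (e.g. a price or income column).
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

skew_before = values.skew()
# log1p computes log(1 + x), so it is also defined at x = 0.
log_values = np.log1p(values)
skew_after = log_values.skew()

# Hypothetical categorical feature dominated by one class.
category = pd.Series(["a"] * 9 + ["b"])
class_shares = category.value_counts(normalize=True)
dominant_share = class_shares.iloc[0]  # value_counts sorts in descending order

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"largest class share: {dominant_share:.0%}")
```

Comparing skewness before and after shows whether the transformation actually helped. A very high dominant-class share is a prompt to ask whether the feature carries enough signal to keep, which remains a domain question.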

Explore dependencies and associations between variables:

Use scatter plots and correlation coefficients
Investigate how multiple features interact with the target variable
Be mindful of spurious correlations; use domain knowledge to guide interpretation
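A small sketch of why the choice of correlation coefficient matters, using invented data in which `y` increases monotonically but non-linearly with `x`. Spearman's coefficient is computed here as Pearson's coefficient on ranks, which is its definition and avoids extra dependencies.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but non-linearly with x.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = x ** 3

pearson = x.corr(y)                 # strength of linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks = Spearman

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman is 1 here because the ordering is perfectly preserved, while Pearson is lower because the relationship is not a straight line. Neither coefficient says anything about causation, so a scatter plot and domain knowledge are still needed.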

EDA should be goal-oriented:

What modelling or analysis task will follow?
Are you preparing for feature engineering, selecting variables, or cleaning data?
Tailor your EDA accordingly to inform future modelling or decision-making

Be explicit about the constraints of your analysis:

Log issues such as missing data, small subgroup sizes, and measurement bias
Check statistical assumptions where relevant (e.g., normality, linearity); see Statistical Assumptions
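One possible way to record such limitations alongside the analysis is sketched below. The data, the column names, and the `MIN_GROUP_SIZE` threshold are hypothetical; an appropriate threshold depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values and an under-represented group.
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "c"],
    "score": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Subgroups with fewer rows than a chosen minimum (assumed threshold).
MIN_GROUP_SIZE = 3
group_sizes = df["group"].value_counts()
small_groups = group_sizes[group_sizes < MIN_GROUP_SIZE].index.tolist()

# Collect the findings as plain-text notes to carry into the write-up.
limitations = []
for col, share in missing_share.items():
    if share > 0:
        limitations.append(f"{col}: {share:.0%} missing")
for g in small_groups:
    limitations.append(f"group '{g}': only {group_sizes[g]} row(s)")

for note in limitations:
    print(note)
```

Keeping these notes as data, rather than in memory, makes it easier to repeat them in the final summary. Formal assumption checks such as normality tests are not covered by this sketch.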

Always conclude with a clear summary:

EDA is iterative: refine your insights as your understanding of the data deepens.

Use implementation tools in ML_Tools, e.g.: