EDA is an approach to analyzing datasets to summarize their main characteristics, often through visual and statistical methods. It helps with:
- Understanding data structure and organization
- Detecting patterns and trends
- Choosing appropriate statistical techniques
- Selecting and assessing variables
- Addressing Data Quality issues
- Identifying Outliers and anomalies
- Formulating and testing hypotheses
- Verifying assumptions prior to modelling
Related to:
1. Understanding Variable Behaviour
Univariate vs Multivariate Analysis
- Univariate: Focuses on single variable distributions, central tendency, and spread.
- Multivariate: Explores interactions between variables and their collective behavior.
Techniques:
- Descriptive statistics: mean, median, mode, percentiles, standard deviation
- Data Visualisation: histograms, box plots, bar charts
- Pair plots: for relationships between multiple numerical variables
- Correlation matrices: to assess linear relationships
- Box plots: numeric vs categorical comparison
Feature ImportanceDetermine which variables matter most:
- Check distributional shape and variance
- Assess predictive separation (e.g. class imbalance in categorical predictors)
- Compute correlation with the target variable
2. Distributions and Data Transformation
Understand the shape and scale of each variable:
- Use Log transformation to reduce skewness and approximate normal distributions
- For imbalanced categorical features (e.g. 90% in one class), assess usefulness (see Imbalanced Datasets)
- Always interpret distributions in domain context, not just statistical form
3. Data Relationships and Correlation
Explore dependencies and associations between variables:
- Use scatter plots and correlation coefficients
- Investigate how multiple features interact with the target variable
- Be mindful of spurious correlations-use domain knowledge to guide interpretation
4. Purpose-Driven Exploration
EDA should be goal-oriented:
- What modelling or analysis task will follow?
- Are you preparing for feature engineering, selecting variables, or cleaning data?
- Tailor your EDA accordingly to inform future modelling or decision-making
5. Evaluating Limitations and Risk
Be explicit about the constraints of your analysis:
- Log issues such as missing data, small subgroup sizes, measurement bias
- Check assumptions (Statistical Assumptions) where relevant (e.g., normality, linearity)
6. Summary and Action
Always conclude with a clear summary:
- Key patterns, outliers, and relationships
- Variables to engineer, discard, or investigate further
- Questions or hypotheses to carry into the modelling phase
7. Continuous Development
EDA is iterative-refine your insights as your understanding of the data deepens.
Use implementation tools in ML_Tools, e.g.: