Purpose
- Investigate gaps in the dataset to understand their causes and impact.
- Assess missingness: how much is missing, what patterns it follows, and how it affects analysis.
- Decide how to treat missing values (retain, remove, or impute) based on project context.
- Ensure the dataset remains useful for analysis and modeling.
- Document decisions for reproducibility and stakeholder communication.
Key Questions
- How much data is missing overall and per variable?
- Are there patterns in missingness (random vs systematic)?
- Do related columns tend to be missing together?
- Are missing values clustered (e.g., certain groups, neighborhoods, or time periods)?
- Can the chosen model handle missing data directly, or is preprocessing required?
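The questions above can be answered quickly with pandas. A minimal sketch, using a small hypothetical dataset (the column names and values are illustrative, not from the source):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with deliberate gaps.
df = pd.DataFrame({
    "age":    [34, np.nan, 29, np.nan, 52, 41],
    "income": [55000, np.nan, 48000, np.nan, 90000, 62000],
    "city":   ["A", "B", None, "B", "A", "A"],
})

# How much data is missing overall and per variable?
overall = df.isna().mean().mean()   # fraction of all cells that are missing
per_col = df.isna().mean()          # fraction missing per column

# Do related columns share missing values (here, age and income)?
shared = (df["age"].isna() & df["income"].isna()).sum()

print(per_col)
print(f"overall: {overall:.2f}, rows missing both age and income: {shared}")
```

In this toy example, age and income are always missing together, which suggests a systematic cause (e.g., a skipped survey section) rather than random gaps.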
Exploration Approaches
- Use data exploration tools (Python, JavaScript, etc.) to summarize each column:
  - Mean, variance, skewness, min, max, sum.
  - Count of zeros and count of missing values.
- Inspect repeating patterns in missingness.
- Compare across related features (column dependencies).
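The per-column summary above can be collected in one table. A sketch in Python with pandas (the `column_profile` helper and the example data are illustrative assumptions, not part of the source):

```python
import numpy as np
import pandas as pd

def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each numeric column: basic stats plus zero and missing counts."""
    num = df.select_dtypes("number")
    return pd.DataFrame({
        "mean":     num.mean(),
        "variance": num.var(),
        "skewness": num.skew(),
        "min":      num.min(),
        "max":      num.max(),
        "sum":      num.sum(),
        "zeros":    (num == 0).sum(),
        "missing":  num.isna().sum(),
    })

# Hypothetical example.
df = pd.DataFrame({"x": [0, 1, 2, np.nan], "y": [5, 5, np.nan, np.nan]})
print(column_profile(df))
```

Comparing the `zeros` and `missing` columns side by side helps catch datasets where gaps were silently encoded as zeros instead of nulls.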
Techniques
- Simple imputation: mean, median, or mode.
- Predictive imputation: regression or ML-based methods (more complex).
- Row/column filtering: dropping rows or columns when appropriate.
- Type conversion: e.g., numbers → strings to preserve categorical meaning.
- Some models (e.g., tree-based methods) can handle missing values directly.
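Several of these techniques can be sketched in a few lines of pandas. The dataset and column names below are hypothetical; the choice of median (robust to skew) and mode (for a categorical code) is one reasonable default, not the only option:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [55000, np.nan, 48000, 90000],
    "rooms":   [3, 2, np.nan, 4],
    "zipcode": [10001, 10002, np.nan, 10001],
})

# Simple imputation: median for a skewed numeric column, mode for a code.
df["income"] = df["income"].fillna(df["income"].median())
df["zipcode"] = df["zipcode"].fillna(df["zipcode"].mode()[0])

# Type conversion: zip codes are labels, not quantities.
df["zipcode"] = df["zipcode"].astype(int).astype(str)

# Row filtering: drop rows still missing values (here, the missing 'rooms').
df = df.dropna()

print(df)
```

Predictive imputation would replace the `fillna` calls with a model fitted on the observed rows; it is more work but can preserve relationships between columns that mean/median imputation flattens.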
Outcome
- A documented set of working theories on why values are missing.
- A strategy (or multiple strategies) for dealing with missingness before modeling (Handling Missing Data).