Purpose
- Collect, explore, and understand the data.
- Ensure data quality before moving to Data Preparation.
- Gain insights relevant to the problem and identify potential hidden patterns.
- Part of CRISP-DM (Cross-Industry Standard Process for Data Mining); crucial for guiding later modeling steps.
Key Activities
- Describe the data: Document sources, formats, variables, Data Dictionary, and basic structure.
- Explore the data: Summarize distributions, check for anomalies, and visualize relationships.
- Verify quality: Assess missing values, duplicates, inconsistencies, and measurement errors.
- Refine scope: Decide what data is relevant and discard unnecessary variables.
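
The describe, explore, and verify activities above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed workflow: the `records` data, the column names, the helper functions `profile`, `count_duplicates`, and `out_of_range`, and the plausible age range of 0 to 120 are all assumptions made up for the example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "segment": "A"},
    {"age": 45, "income": None, "segment": "B"},
    {"age": 34, "income": 52000, "segment": "A"},   # exact duplicate of row 0
    {"age": 29, "income": 48000, "segment": None},
    {"age": 230, "income": 61000, "segment": "B"},  # implausible age
]


def profile(rows):
    """Summarize each column: missing count, distinct values, basic stats."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric column: report range and central tendency.
            info.update(min=min(present), max=max(present),
                        mean=mean(present), median=median(present))
        else:
            # Non-numeric column: report the most frequent value.
            info["top"] = Counter(present).most_common(1)
        summary[col] = info
    return summary


def count_duplicates(rows):
    """Count rows that repeat an earlier row exactly."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())


def out_of_range(rows, col, low, high):
    """Return indices of rows whose value in `col` falls outside [low, high]."""
    return [i for i, row in enumerate(rows)
            if row.get(col) is not None and not low <= row[col] <= high]


report = profile(records)                         # describe and explore
duplicates = count_duplicates(records)            # verify: duplicates
bad_ages = out_of_range(records, "age", 0, 120)   # verify: measurement errors

for column, info in report.items():
    print(column, info)
print("duplicate rows:", duplicates)
print("rows with implausible age:", bad_ages)
```

The per-column summary doubles as a starting point for a Data Dictionary, and the range check shows how a domain rule can flag likely measurement errors before Data Preparation.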
Exploration Guidelines
- Identify the target variable (if supervised).
- Group features to make exploration manageable.
- Use methods such as Decision Tree analysis to investigate relationships and to support Feature Selection.
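
As a rough illustration of the guidelines above, the sketch below ranks features by information gain with respect to a target variable. Information gain is one common criterion a decision tree uses to choose a split, so this is a simplified one-level stand-in for a full Decision Tree analysis, not a tree implementation. The toy `rows`, the `churn` target, and the feature names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy data set; "churn" is the assumed target variable.
rows = [
    {"contract": "monthly", "support_calls": "high", "churn": "yes"},
    {"contract": "monthly", "support_calls": "low",  "churn": "yes"},
    {"contract": "monthly", "support_calls": "high", "churn": "yes"},
    {"contract": "yearly",  "support_calls": "low",  "churn": "no"},
    {"contract": "yearly",  "support_calls": "high", "churn": "no"},
    {"contract": "yearly",  "support_calls": "low",  "churn": "no"},
]
TARGET = "churn"


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(data, feature, target):
    """Entropy reduction in `target` after splitting `data` on `feature`."""
    base = entropy([r[target] for r in data])
    groups = defaultdict(list)
    for r in data:
        groups[r[feature]].append(r[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return base - remainder


# Score every non-target feature, then rank from most to least informative.
features = [name for name in rows[0] if name != TARGET]
gains = {f: information_gain(rows, f, TARGET) for f in features}
ranking = sorted(gains, key=gains.get, reverse=True)

for name in ranking:
    print(f"{name}: {gains[name]:.3f} bits")
```

In this toy data `contract` separates the target perfectly while `support_calls` adds little, which is the kind of signal that can inform Feature Selection. Continuous features would need binning or threshold search first, which this sketch does not cover.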
Outcome
- A clear understanding of what data is available, its limitations, and its suitability for the problem.
- A foundation for Data Preparation and subsequent modeling.