Purpose

  • Collect, explore, and understand the data.
  • Ensure data quality before moving to Data Preparation.
  • Gain insights relevant to the problem and identify potential hidden patterns.
  • Part of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process; crucial for guiding later modeling steps.

Key Activities

  • Describe the data: Document sources, formats, variables, data dictionary, and basic structure.
  • Explore the data: Summarize distributions, check for anomalies, and visualize relationships.
  • Verify quality: Assess missing values, duplicates, inconsistencies, and measurement errors.
  • Refine scope: Decide which data is relevant and discard unnecessary variables.
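
The describe and verify activities above can be sketched as a small profiling routine. This is a minimal illustration using only the Python standard library, not a prescribed tool: the function name profile_records, the choice of None and "" as missing markers, and the sample records are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median


def profile_records(records, missing_markers=(None, "")):
    """Summarize a list of dict records: missing values, basic stats, duplicates.

    `missing_markers` is an assumption; real data may encode missing values
    differently (for example "N/A" or -999).
    """
    columns = sorted({key for row in records for key in row})
    report = {"n_rows": len(records), "columns": {}}

    for col in columns:
        values = [row.get(col) for row in records]
        present = [v for v in values if v not in missing_markers]
        # Treat ints and floats as numeric, but not booleans.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Only report numeric statistics when every present value is numeric.
        if numeric and len(numeric) == len(present):
            summary.update(
                min=min(numeric),
                max=max(numeric),
                mean=mean(numeric),
                median=median(numeric),
            )
        report["columns"][col] = summary

    # A row is a duplicate if an identical row appeared earlier.
    row_counts = Counter(tuple(row.get(col) for col in columns) for row in records)
    report["duplicate_rows"] = sum(count - 1 for count in row_counts.values())
    return report


# Hypothetical sample data, for illustration only.
records = [
    {"age": 34, "city": "Oslo", "income": 52000},
    {"age": None, "city": "Oslo", "income": 48000},
    {"age": 34, "city": "Oslo", "income": 52000},
    {"age": 51, "city": "", "income": 61000},
]

report = profile_records(records)
for column, summary in report["columns"].items():
    print(column, summary)
print("duplicate rows:", report["duplicate_rows"])
```

A report like this feeds directly into the data dictionary and flags columns that need attention during Data Preparation.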

Exploration Guidelines

  • Identify the target variable (if supervised).
  • Group related features (for example, by source or subject) to make exploration manageable.
  • Use methods such as decision tree analysis to investigate relationships and support feature selection.
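
To illustrate the idea behind decision tree analysis for feature selection, the sketch below scores each categorical feature by the information gain of a single split on it, which is the criterion an ID3-style tree uses to pick its root. It is a simplification, not a full decision tree, and the names information_gain and rank_features, the column names, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Reduction in target entropy after splitting rows on one categorical feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    all_labels = [row[target] for row in rows]
    # Weighted average entropy of the target within each feature value.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(all_labels) - remainder


def rank_features(rows, features, target):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {f: information_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data, for illustration only.
rows = [
    {"contract": "monthly", "region": "north", "churned": "yes"},
    {"contract": "monthly", "region": "south", "churned": "yes"},
    {"contract": "yearly", "region": "north", "churned": "no"},
    {"contract": "yearly", "region": "south", "churned": "no"},
]

ranking = rank_features(rows, ["contract", "region"], target="churned")
for feature, gain in ranking:
    print(f"{feature}: {gain:.3f}")
```

In this toy data, contract separates the target perfectly while region carries no information about it, so contract ranks first. Note that information gain tends to favor features with many distinct values, so a high score on an identifier-like column should be treated with suspicion.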

Outcome

  • A clear understanding of what data is available, its limitations, and its suitability for the problem.
  • A foundation for Data Preparation and subsequent modeling.