Purpose

  • Collect, explore, and understand the data.
  • Ensure Data Quality before moving to Data Preparation.
  • Gain insights relevant to the problem and identify potential hidden patterns.
  • Part of the Data Mining - CRISP process; crucial for guiding later modeling steps.

Key Activities

  • Describe the data: Document sources, formats, variables,Data Dictionary, and basic structure.
  • Explore the data: Summarize distributions, check for anomalies, and visualize relationships.
  • Verify quality: Assess missing values, duplicates, inconsistencies, and measurement errors.
  • Refine scope: Decide what data is relevant, discard unnecessary variables.

Exploration Guidelines

  • Identify the target variable (if supervised).
  • Group features to make exploration manageable.
  • Use methods such as Decision Tree analysis for investigating relationships and Feature Selection.

Outcome

  • A clear understanding of what data is available, its limitations, and its suitability for the problem.
  • A foundation for Data Preparation and subsequent modeling.

Data Understanding: How to describe the data: First pass, efficiently: get initial questions

  • Where is data from
  • Structuring on data, weird values, groupings
  • Bring up questions to
  • What do the type of values in a categorical mean, look at the sources
  • What is the context for this dataset: the story
  • Create a Discards sheet with comments on why you have these rows in here.
  • Create a table of a few examples of those that could create an error.
  • Duplicate keys?