Purpose
- Transform raw data into a clean, integrated, and analyzable form suitable for modeling and deployment.
- Ensure that missing values are addressed and irrelevant fields are removed, that the dataset aligns with business objectives, and that it is structured at the unit of analysis (e.g., one row per person or per transaction).
- Bridge the gap between Data Understanding and Modeling.
End Goals
- No unaddressed missing data.
- All fields useful, relevant, and reliable.
- Dataset in the correct form for deployment.
- Comprehensive representation of the business problem, not limited to chosen KPIs or incentives.
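The first end goal can be checked mechanically. The sketch below is a minimal illustration, not a prescribed method: it counts missing values per field so that every gap is either treated or explicitly accepted. The records, field names, and the `missing_counts` helper are hypothetical, and `None` is assumed to be the missing-value marker.

```python
# Hypothetical records; None is assumed to mark a missing value.
rows = [
    {"customer_id": 1, "age": 34, "region": "north"},
    {"customer_id": 2, "age": None, "region": "south"},
    {"customer_id": 3, "age": 51, "region": None},
    {"customer_id": 4, "age": None, "region": "north"},
]


def missing_counts(records):
    """Return {field: number of records where the field is None or absent}."""
    fields = sorted({key for record in records for key in record})
    return {
        field: sum(1 for record in records if record.get(field) is None)
        for field in fields
    }


counts = missing_counts(rows)

# Fields that still need a documented treatment (imputation, removal, etc.).
unaddressed = {field: n for field, n in counts.items() if n > 0}
print(unaddressed)
```

A field with a non-zero count is not automatically a problem; the goal is that each one has a recorded decision behind it.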
Key Considerations
- Unit of analysis: For transactional data, aggregate variables so that each row represents exactly one entity being analyzed (e.g., one customer rather than one purchase).
- Stakeholder input: Collaborate with SMEs and business teams to ensure alignment with problem objectives.
- Metrics & incentives: Verify that chosen features address business problems, not only organizational KPIs.
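To make the unit-of-analysis point concrete, here is a small sketch that rolls transaction-level rows up to one row per customer. The data, the field names, and the choice of aggregates (count, total, maximum) are assumptions for illustration; the right aggregates depend on the business problem.

```python
from collections import defaultdict

# Hypothetical transaction-level data: several rows per customer.
transactions = [
    {"customer_id": "A", "amount": 20.0},
    {"customer_id": "B", "amount": 5.0},
    {"customer_id": "A", "amount": 30.0},
    {"customer_id": "B", "amount": 15.0},
    {"customer_id": "A", "amount": 10.0},
]


def aggregate_per_customer(txns):
    """Collapse transactions into one summary row per customer."""
    amounts = defaultdict(list)
    for txn in txns:
        amounts[txn["customer_id"]].append(txn["amount"])
    return [
        {
            "customer_id": customer_id,
            "n_transactions": len(values),
            "total_amount": sum(values),
            "max_amount": max(values),
        }
        for customer_id, values in sorted(amounts.items())
    ]


customers = aggregate_per_customer(transactions)
for row in customers:
    print(row)
```

After this step the table has one row per customer, which matches a customer-level modeling target such as churn.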
Core Tasks
- Data Integration: Combine data from multiple sources, addressing missingness and inconsistencies.
- Data Selection: Decide which variables and records to include or exclude.
- Data Quality Verification: Reassess missing data, duplicates, and anomalies.
- Feature Engineering: Construct new variables and transformations from existing data.
- Feature Selection: Identify the most relevant features for modeling.
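Several of the core tasks can be sketched together on a toy example: removing exact duplicates (quality verification), joining a second source while keeping unmatched records visible (integration), and deriving a new variable (feature engineering). All names, values, and the reference year below are hypothetical, and the left-join behavior is one possible design choice rather than a rule.

```python
# Hypothetical source 1: customer master data, containing one exact duplicate.
customers = [
    {"customer_id": 1, "signup_year": 2019},
    {"customer_id": 2, "signup_year": 2021},
    {"customer_id": 2, "signup_year": 2021},
    {"customer_id": 3, "signup_year": 2020},
]

# Hypothetical source 2: annual spend keyed by customer_id (customer 3 is absent).
spend = {1: 1200.0, 2: 300.0}


def drop_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def integrate(customer_rows, spend_by_id, reference_year):
    """Left-join spend onto customers and add a derived tenure feature."""
    merged = []
    for customer in drop_duplicates(customer_rows):
        row = dict(customer)
        # Unmatched customers get None, so the gap stays visible for later treatment.
        row["annual_spend"] = spend_by_id.get(customer["customer_id"])
        row["tenure_years"] = reference_year - customer["signup_year"]
        merged.append(row)
    return merged


table = integrate(customers, spend, reference_year=2024)
for row in table:
    print(row)
```

Keeping customer 3 with a missing `annual_spend`, rather than silently dropping the record, means the integration step surfaces missingness that Data Quality Verification can then reassess.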
Process Characteristics
- Iterative: Data Understanding and Data Preparation overlap, each feeding back into the other.
- Checklist-driven: Preparation should follow a plan built during Data Understanding.
Formal Documentation
- Documentation is essential for reproducibility, progress tracking, and deployment readiness.
- Reports generated in this phase are part of the CRISP-DM cycle and remain useful after the project ends, for example when the dataset is audited, rebuilt, or handed over.
Reports to Produce
- Initial Data Collection Report: Data sources, acquisition methods, extraction details, problems encountered.
- Data Description Report: Field definitions, metadata, and integration implications.
- Data Exploration Report: Patterns, anomalies, and outcomes of SME discussions.
- Data Quality Report: Missing data treatment, imputation methods, and feature construction records.
Outcome
- A finalized dataset ready for modeling, supported by clear documentation of sources, transformations, and decisions.
- Traceability of how features, imputations, and integrations were performed.
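One lightweight way to get this traceability is to have each transformation append a record of what it did to a log that ships with the dataset. The sketch below assumes median imputation purely as an example treatment; the `impute_median` helper, the log fields, and the data are hypothetical.

```python
from statistics import median


def impute_median(records, field, log):
    """Fill missing values of `field` with the observed median and log the step."""
    observed = [record[field] for record in records if record[field] is not None]
    fill_value = median(observed)
    rows_affected = 0
    result = []
    for record in records:
        row = dict(record)  # copy, so the raw data stays unchanged
        if row[field] is None:
            row[field] = fill_value
            rows_affected += 1
        result.append(row)
    log.append(
        {
            "step": "impute_median",
            "field": field,
            "fill_value": fill_value,
            "rows_affected": rows_affected,
        }
    )
    return result


# Hypothetical raw data with one missing income value.
data = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000},
    {"id": 4, "income": 50000},
]

log = []
prepared = impute_median(data, "income", log)
print(log)
```

The resulting log entries can be copied into the Data Quality Report, so the imputation method and the number of affected rows are documented alongside the finalized dataset.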