Data transformation is the process of converting data from one format into a different format. Reasons for doing this could be to optimize the data for a different use case than it was originally intended for, or to meet the requirements for storing data in a different system. Data transformation may involve steps such as cleansing, normalizing, structuring, validation, sorting, joining, or enriching data.
Data Transformation: Normalizing or scaling numerical values, encoding categorical variables, and converting data types.
Transformation with Pandas
In Pandas
Combining Datasets
Aggregating Data Groupby
Change the Layout of Data using a Stack
See Stack for when to use
Pivot Table: Summarize Data
Handle Time-Based
Sorting Order Data
Rename Columns
Identify Duplicate Entries
Other types of transformation
Transform raw data into a format that best describes the problem to the model.
- Supervised Learning: This involves annotating the dataset with the correct answers (labels) to teach the model. For example, labeling images of apples to distinguish them from other fruits.
- Manual and Automated Labeling: Manual labeling by humans or leveraging existing labeled data. Tools like Google’s reCAPTCHA help in building labeled datasets.
- Database Normalization:Ensure consistent data representation and scale features appropriately.
- Convert data into the required format, e.g., from .xls to .csv.
- Encoding Categorical Variables
- Transform raw data into a format that best describes the problem to the model. This includes formatting and normalization/ Normalisation
Transformation in Data Engineering
Data is often transformed as part of an ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) approach to data integration.
See ETL vs. ELT for a comparison of these two approaches.
Additionally, a hybrid approach has recently emerged which is known as EtLT (Extract, “tweak”, Load, Transform). This combines aspects of both ETL and ELT.
Other
Benefits of data transformation
When used correctly, data transformation can provide the following benefits:
- Improved query-time efficiency and speed.
- Conversion of data into a format that is required by a target system.
- Enrichment of data with additional information that allows insights to be more easily extracted.
- Improved data quality by validating and fixing data, and removal of duplicates.
Improved efficiency and speed
One kind of transformation could be the extraction of structured data from data that is stored in a string. Imagine data that looks as follows:
input_string: "Bob is 29"
In order to efficiently process this data in the future, it may preferable to transform this data into additional/new fields, and store it as:
name: "Bob"
age: 29
Storing the data in this manner makes it much more efficient to analyze with operations such as: