A Data Lake is a storage system with vast amounts of unstructured data and structured data, stored as-is, without a specific purpose in mind, that can be built on multiple technologies such as Hadoop, NoSQL, Amazon Simple Storage Service, a relational database, or various combinations and different formats (e.g. Excel, CSV, Text, Logs, etc.).

Definition: A repository that stores diverse data types, including structured, semi-structured, and unstructured data. If cant fit into a database.

Features:

  • Versatility: Can accommodate various data formats, including videos, images, documents, and more.
  • Raw Data Storage: Preserves data in its raw form, suitable for advanced analytics, particularly in machine learning and AI.
  • Data Usability: Raw data may require cleaning and transformation for analytical use, often transferred to databases or data warehouses.
  • Use Case: Valuable for storing large volumes of raw data, especially in contexts requiring advanced analytics and experimentation.

unstructured data for predictive modeling and analysis. This leads to the creation of a data lake, which stores raw data without predefined schemas.

The data lake supports the following capabilities:

  • To capture and store raw data at scale for a low cost
  • To store many types of data in the same repository
  • To perform data transformation on the data where the purpose may not be defined
  • To perform new types of data processing
  • To perform single-subject analytics based on particular use cases

Components of a data lake 1. Storage Layer 2. Data Lake File Format 3. Data Lake Table Format with Apache Parquet, Apache Iceberg, and Apache Hudi