Data Lake

A Data Lake is a storage system that holds vast amounts of structured and unstructured data as-is, without a specific purpose in mind. It can be built on multiple technologies, such as Hadoop, NoSQL, Amazon Simple Storage Service, a relational database, or combinations of these, and it can hold many different formats (e.g. Excel, CSV, text, logs).

Definition: A repository that stores diverse data types, including structured, semi-structured, and unstructured data. It is typically used for data that can't fit into a database.

Features:

Versatility: Can accommodate various data formats, including videos, images, documents, and more.
Raw Data Storage: Preserves data in its raw form, suitable for advanced analytics, particularly in machine learning and AI.
Data Usability: Raw data may require cleaning and transformation before analytical use, after which it is often transferred to databases or data warehouses.
Use Case: Valuable for storing large volumes of raw data, especially in contexts requiring advanced analytics and experimentation.
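
The raw-storage and usability points above can be sketched in a few lines of Python. This is only an illustration using the standard library, with a local temporary directory standing in for the lake. The function names `ingest_raw` and `clean_orders`, the `raw/<source>/` layout, and the sample fields are all hypothetical, not taken from any particular product.

```python
import csv
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under raw/<source>/ (no parsing, no schema)."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def clean_orders(raw_file: Path) -> list[dict]:
    """Turn raw CSV text into typed rows, dropping rows with an unusable amount."""
    rows = []
    with raw_file.open(newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            try:
                amount = float(record["amount"])
            except (KeyError, TypeError, ValueError):
                continue  # skip dirty rows instead of failing the whole load
            rows.append({"order_id": record["order_id"].strip(), "amount": amount})
    return rows


lake = Path(tempfile.mkdtemp())  # assumption: a local directory stands in for the lake
raw_csv = b"order_id,amount\n A1 ,10.5\nA2,not-a-number\nA3,4\n"
raw_image = bytes([0x89, 0x50, 0x4E, 0x47])  # arbitrary binary content, stored untouched

# Different kinds of data land in the same repository, unchanged.
orders_path = ingest_raw(lake, "shop", "orders.csv", raw_csv)
image_path = ingest_raw(lake, "camera", "frame.png", raw_image)

# Cleaning happens later, on read, and only for the use case that needs it.
cleaned = clean_orders(orders_path)
```

The raw files stay in the lake exactly as received. The cleaned rows are the part that would typically be loaded into a database or data warehouse.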

Predictive modeling and analysis often rely on unstructured data. This need leads to the creation of a data lake, which stores raw data without predefined schemas.
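
Storing raw data without a predefined schema usually means the schema is applied when the data is read rather than when it is written. A minimal sketch of that idea follows, assuming newline-delimited JSON as the raw format. The helper `read_with_schema` and the sample fields are hypothetical.

```python
import json

# Raw events as they might arrive: one JSON document per line, fields not agreed in advance.
raw_lines = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "ben", "clicks": "7", "device": "mobile"}',
    '{"user": "cy"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick the named fields and cast them, defaulting to None."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# Two consumers read the same raw lines through different schemas.
click_rows = list(read_with_schema(raw_lines, {"user": str, "clicks": int}))
device_rows = list(read_with_schema(raw_lines, {"user": str, "device": str}))
```

Nothing about the stored lines changes between the two reads. Each consumer decides which fields matter and how to type them.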

The data lake supports the following capabilities:

To capture and store raw data at scale for a low cost
To store many types of data in the same repository
To perform Data Transformation on data whose purpose may not yet be defined
To perform new types of data processing
To perform single-subject analytics based on particular use cases

Components of a data lake:

1. Storage Layer
2. Data Lake File Format (e.g. Apache Parquet)
3. Data Lake Table Format (e.g. Apache Iceberg, Apache Hudi)
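
One way to picture how these components stack is as three layers: a storage layer that holds files, a file format that defines how each file is encoded, and a table format that tracks which files make up a table at a given point in time. The toy sketch below makes loud assumptions: a local directory stands in for object storage, CSV stands in for a columnar file format such as Apache Parquet, and a small JSON manifest stands in for table-format metadata. It is not how Apache Iceberg or Apache Hudi actually lay out their metadata, and all names in it are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

storage = Path(tempfile.mkdtemp())  # storage layer: stand-in for an object store bucket
MANIFEST = storage / "events_table.json"  # hypothetical table-format metadata file


def write_data_file(name: str, rows: list[dict]) -> str:
    """File format layer: encode rows into one data file (CSV here, for simplicity)."""
    path = storage / name
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "kind"])
        writer.writeheader()
        writer.writerows(rows)
    return name


def commit(new_files: list[str]) -> int:
    """Table format layer: record a new snapshot listing every file in the table."""
    snapshots = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else []
    previous = snapshots[-1]["files"] if snapshots else []
    snapshots.append({"snapshot_id": len(snapshots) + 1, "files": previous + new_files})
    MANIFEST.write_text(json.dumps(snapshots))
    return snapshots[-1]["snapshot_id"]


def read_table(snapshot_id: int) -> list[dict]:
    """Read the table as of a snapshot by scanning only the files that snapshot lists."""
    snapshots = json.loads(MANIFEST.read_text())
    files = next(s["files"] for s in snapshots if s["snapshot_id"] == snapshot_id)
    rows = []
    for name in files:
        with (storage / name).open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


first = commit([write_data_file("part-1.csv", [{"id": "1", "kind": "view"}])])
second = commit([write_data_file("part-2.csv", [{"id": "2", "kind": "click"}])])
```

Calling `read_table(first)` returns only the rows from the first file, while `read_table(second)` returns both. This kind of snapshot bookkeeping is what a table format adds on top of plain files.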
