Data Lake
Summary
What it is
The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream.
Where it fits
Data lakes are the precursor to lakehouses. In the S3 world, a data lake is the simplest architecture: land everything in S3 and work out the schema later. Lakehouses add the structure that data lakes lack.
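As a rough illustration of landing data in its original form, the sketch below writes raw bytes under date-partitioned keys. It assumes a plain Python dict as a stand-in for an S3 bucket; the `land_raw` helper, the `raw/<source>/dt=<date>/` key layout, and the sample payloads are all hypothetical, not something the pattern prescribes.

```python
import hashlib
from datetime import date


def land_raw(store: dict, source: str, payload: bytes, ext: str, day: date) -> str:
    """Store the payload unchanged under a date-partitioned key and return the key."""
    # A content hash makes the key deterministic, so re-landing the same
    # payload overwrites itself instead of creating a duplicate object.
    digest = hashlib.sha256(payload).hexdigest()[:16]
    key = f"raw/{source}/dt={day.isoformat()}/{digest}.{ext}"
    store[key] = payload  # bytes are written as received: no parsing, no schema
    return key


# The dict stands in for a bucket (assumption for illustration only).
bucket = {}
day = date(2024, 1, 15)
k1 = land_raw(bucket, "orders", b'{"id": 1, "total": "19.99"}', "json", day)
k2 = land_raw(bucket, "clickstream", b"ts,user,url\n1700000000,u1,/home\n", "csv", day)
```

JSON and CSV sit side by side here because nothing is interpreted at write time; any structure is imposed by whatever reads the objects later.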
Misconceptions / Traps
- "Schema-on-read" does not mean "no schema." Without any schema management, data lakes become data swamps — undiscoverable and untrusted.
- Data lakes and lakehouses are not mutually exclusive. Most lakehouses include raw data lake zones (e.g., Medallion Bronze layer).
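The first point above can be made concrete with a minimal schema-on-read sketch: the raw lines are stored untouched, and a schema is applied only when they are read. The `ORDER_SCHEMA` mapping, the `read_with_schema` helper, and the sample records are hypothetical, chosen only to show that a schema still exists even though it is not enforced at write time.

```python
import json

# Hypothetical schema: field name -> callable that coerces the raw value.
ORDER_SCHEMA = {"id": int, "total": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines, coercing each record to the schema at read time."""
    good, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
            good.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, a missing field, or an uncoercible value:
            # keep the raw line so it can be inspected later.
            rejected.append(line)
    return good, rejected


raw = [
    '{"id": 1, "total": "19.99"}',  # total arrived as a string
    '{"id": "2", "total": 5}',      # id arrived as a string
    '{"id": 3}',                    # missing field
    "not json",                     # corrupt line
]
rows, bad = read_with_schema(raw, ORDER_SCHEMA)
```

Without the schema mapping, every consumer would have to rediscover these type and completeness problems independently, which is one way a lake drifts toward a swamp.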
Key Connections
- is_a Object Storage: a data lake is a use of object storage
- scoped_to S3: S3 is the dominant storage layer for data lakes
- Apache Spark scoped_to Data Lake: the primary compute engine for lake workloads
- Apache Flink scoped_to Data Lake: streaming ingestion into lakes
- Write-Audit-Publish scoped_to Data Lake: quality gating pattern for lake data
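Write-Audit-Publish, named above as a quality gating pattern, can be sketched in a few lines. This is a simplified illustration that assumes a dict as the storage layer; the `staging/` and `published/` prefixes, the `write_audit_publish` function, and the `no_null_ids` check are hypothetical. A production system would rely on whatever atomic publish mechanism its storage or table format provides.

```python
def write_audit_publish(store: dict, batch_id: str, rows: list, audit) -> bool:
    """Write a batch to staging, audit it, and publish it only if the audit passes."""
    staging_key = f"staging/{batch_id}"
    store[staging_key] = rows  # write: consumers do not read the staging prefix
    if not audit(rows):        # audit: run the quality check on the staged data
        return False           # failed batches stay in staging for inspection
    # publish: move the batch to the prefix that consumers read
    store[f"published/{batch_id}"] = store.pop(staging_key)
    return True


def no_null_ids(rows) -> bool:
    """Hypothetical quality check: every row must carry a non-null id."""
    return all(r.get("id") is not None for r in rows)


lake = {}
ok = write_audit_publish(lake, "batch-001", [{"id": 1}, {"id": 2}], no_null_ids)
failed = write_audit_publish(lake, "batch-002", [{"id": None}], no_null_ids)
```

The point of the pattern is that readers only ever see the `published/` prefix, so a bad batch never becomes visible to them.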
Definition
Why it exists
Organizations needed a central, low-cost repository for all data types (structured, semi-structured, unstructured) without requiring schema decisions at write time.
Resources
AWS's official whitepaper on building data lakes defines architecture patterns, ingestion strategies, and governance frameworks for production data lakes on S3.
AWS's conceptual overview explains what a data lake is, how it differs from a data warehouse, and the key design principles.
Microsoft Azure's data lake overview provides an alternative cloud vendor's perspective, reinforcing the vendor-agnostic nature of the concept.