Topic

Data Lake

Summary

What it is

The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream.
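
As a rough illustration of "arrives in its original form", the sketch below builds an object key for a file landed unchanged in a raw zone. The function name `raw_object_key` and the `raw/<source>/ingest_date=.../<filename>` layout are assumptions made for this example, not a convention the pattern prescribes.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_object_key(source: str, filename: str, ingest_date: date) -> str:
    """Build an object key for a file landed unchanged in the raw zone.

    The layout raw/<source>/ingest_date=YYYY-MM-DD/<filename> is an
    illustrative assumption; any stable, discoverable layout would do.
    """
    # Keep only the final path component so a local path does not leak into the key.
    name = PurePosixPath(filename).name
    return f"raw/{source}/ingest_date={ingest_date.isoformat()}/{name}"


if __name__ == "__main__":
    print(raw_object_key("clickstream", "/tmp/export/events-0001.json.gz", date(2024, 5, 17)))
```

The file content is not parsed or rewritten at this stage; only its location is decided, and transformation is left to downstream jobs.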

Where it fits

Data lakes are the precursor to lakehouses. In the S3 world, a data lake is the simplest architecture: land everything in S3 and work out the schema later. Lakehouses add the structure that data lakes lack, such as table formats, transactional guarantees, and schema enforcement.

Misconceptions / Traps

"Schema-on-read" does not mean "no schema." Without any schema management, data lakes become data swamps — undiscoverable and untrusted.
Data lakes and lakehouses are not mutually exclusive. Most lakehouses include raw data lake zones (e.g., Medallion Bronze layer).
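
To make the first point concrete, here is a minimal schema-on-read sketch: the raw JSON lines are stored as-is, and a schema is applied only when they are read. The schema `ORDER_SCHEMA`, the field names, and the sample records are hypothetical and exist only for this example.

```python
import json
from typing import Any, Callable, Iterable, Iterator

# Hypothetical schema for this example: field name -> function that coerces the raw value.
ORDER_SCHEMA: dict[str, Callable[[Any], Any]] = {
    "order_id": str,
    "amount": float,
    "currency": str,
}


def read_with_schema(
    lines: Iterable[str], schema: dict[str, Callable[[Any], Any]]
) -> Iterator[dict[str, Any]]:
    """Yield typed records from raw JSON lines; skip records that do not fit the schema."""
    for line in lines:
        try:
            raw = json.loads(line)
            # Keep only the declared fields and coerce each one to its declared type.
            record = {field: cast(raw[field]) for field, cast in schema.items()}
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # A real pipeline would likely quarantine these instead of dropping them.
            continue
        yield record


# Raw data as it might sit in the lake: inconsistent types, extra and missing fields.
RAW_LINES = [
    '{"order_id": 1001, "amount": "19.90", "currency": "EUR", "extra": true}',
    '{"order_id": 1002, "currency": "USD"}',
    "not json at all",
]

if __name__ == "__main__":
    for record in read_with_schema(RAW_LINES, ORDER_SCHEMA):
        print(record)
```

The schema still exists; it simply lives with the reader rather than the writer. If nobody records and maintains it, each consumer ends up guessing, which is how a lake drifts toward a swamp.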

Key Connections

Data Lake is_a Object Storage: a data lake is a use of object storage
Data Lake scoped_to S3: S3 is the dominant storage layer for data lakes
Apache Spark scoped_to Data Lake: a primary compute engine for lake workloads
Apache Flink scoped_to Data Lake: streaming ingestion into lakes
Write-Audit-Publish scoped_to Data Lake: quality gating pattern for lake data
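
The Write-Audit-Publish connection can be sketched as follows. This is a simplified illustration under stated assumptions: local directories stand in for staging and published prefixes in object storage, and the `audit` rule (every row is a JSON object with a non-empty `id`) is hypothetical. Object stores such as S3 have no native rename, so a real publish step would be a copy followed by a delete, or a metadata-level operation in a table format.

```python
import json
import shutil
import tempfile
from pathlib import Path


def audit(path: Path) -> bool:
    """Hypothetical quality check: every line is a JSON object with a non-empty 'id'."""
    try:
        with path.open() as fh:
            rows = [json.loads(line) for line in fh if line.strip()]
    except json.JSONDecodeError:
        return False
    return bool(rows) and all(isinstance(r, dict) and r.get("id") for r in rows)


def write_audit_publish(lines: list[str], staging: Path, published: Path, name: str) -> bool:
    """Write to staging, audit the staged file, and publish only if the audit passes."""
    staging.mkdir(parents=True, exist_ok=True)
    published.mkdir(parents=True, exist_ok=True)
    staged = staging / name
    # 1. Write: data lands where downstream readers do not look.
    staged.write_text("\n".join(lines) + "\n")
    # 2. Audit: a failed batch stays in staging for inspection.
    if not audit(staged):
        return False
    # 3. Publish: only audited data becomes visible to readers.
    shutil.move(str(staged), str(published / name))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ok = write_audit_publish(['{"id": "a1"}'], root / "staging", root / "published", "good.jsonl")
        bad = write_audit_publish(['{"id": ""}'], root / "staging", root / "published", "bad.jsonl")
        print(ok, bad)
```

The design point is that readers only ever see the published location, so a bad batch never reaches them even though it was written to the lake.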

Definition

Why it exists

Organizations needed a central, low-cost repository for all data types (structured, semi-structured, unstructured) without requiring schema decisions at write time.

Relationships

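Outbound Relationships

Data Lake is_a Object Storage
Data Lake scoped_to S3

Inbound Relationships

Apache Spark scoped_to Data Lake
Apache Flink scoped_to Data Lake
Medallion Architecture scoped_to Data Lake
Write-Audit-Publish scoped_to Data Lake
Schema Evolution scoped_to Data Lake
Legacy Ingestion Bottlenecks scoped_to Data Lake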

Resources

Docs (High)

docs.aws.amazon.com/whitepapers/latest/building-data-lakes/b...

AWS's official whitepaper on building data lakes defines architecture patterns, ingestion strategies, and governance frameworks for production data lakes on S3.

Docs (High)

aws.amazon.com/what-is/data-lake/

AWS's conceptual overview explains what a data lake is, how it differs from a data warehouse, and the key design principles.

Docs (High)

azure.microsoft.com/en-us/solutions/data-lake/

Microsoft Azure's data lake overview provides an alternative cloud vendor's perspective, reinforcing the vendor-agnostic nature of the concept.