Topic

Data Lake

Summary

What it is

The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream.

Where it fits

Data lakes are the precursor to lakehouses. In the S3 world, a data lake is the simplest form: land everything in S3 unchanged and define the schema later. Lakehouses add the structure that data lakes lack, typically through table formats that bring transactions and schema enforcement.
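The "land it unchanged, decide the schema later" step can be sketched as follows. This is a minimal illustration, not a reference implementation: a local directory stands in for an S3 bucket, and the key layout (raw/<source>/dt=<date>/...) and the function names raw_object_key and land_raw are assumptions made for this example.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def raw_object_key(source: str, filename: str, payload: bytes,
                   ingested_at: datetime) -> str:
    """Build a date-partitioned key for an untouched raw object.

    The layout is a hypothetical convention, not a standard.
    ingested_at is expected to be timezone-aware.
    """
    # A short content hash keeps two uploads with the same filename apart.
    digest = hashlib.sha256(payload).hexdigest()[:12]
    day = ingested_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/{source}/dt={day}/{digest}_{filename}"


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes,
             ingested_at: datetime) -> Path:
    """Write the payload byte-for-byte; no parsing or schema is applied."""
    target = lake_root / raw_object_key(source, filename, payload, ingested_at)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target
```

Nothing here inspects the payload, so JSON, CSV, images, and logs all land the same way. Interpretation is deferred to whatever reads the data later.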

Misconceptions / Traps

  • "Schema-on-read" does not mean "no schema." Without any schema management, data lakes become data swamps — undiscoverable and untrusted.
  • Data lakes and lakehouses are not mutually exclusive. Most lakehouses include raw data lake zones (e.g., Medallion Bronze layer).
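To make the first point concrete, here is a small sketch of schema-on-read: the raw lines are stored as-is, and a schema is applied only when they are read. The ORDER_SCHEMA fields and the read_with_schema helper are hypothetical names chosen for illustration.

```python
import json
from typing import Any, Callable, Iterable

# Hypothetical schema: field name -> function that coerces the raw value.
ORDER_SCHEMA: dict[str, Callable[[Any], Any]] = {
    "order_id": int,
    "amount": float,
    "currency": str,
}


def read_with_schema(lines: Iterable[str],
                     schema: dict[str, Callable[[Any], Any]]):
    """Parse raw JSON lines and return (conforming_rows, rejected_rows)."""
    good, rejected = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
            # The schema is enforced here, at read time, not at write time.
            row = {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, TypeError, ValueError) as exc:
            # Malformed JSON, missing fields, and bad types are all kept
            # for inspection instead of being silently dropped.
            rejected.append({"line": line_number, "error": repr(exc), "raw": line})
            continue
        good.append(row)
    return good, rejected
```

The schema still exists; it just lives with the reader. If no such definition is written down and shared, every consumer has to guess, which is how a lake drifts toward a swamp.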

Key Connections

  • is_a Object Storage — a data lake is a use of object storage
  • scoped_to S3 — S3 is the dominant storage layer for data lakes
  • Apache Spark scoped_to Data Lake — the primary compute engine for lake workloads
  • Apache Flink scoped_to Data Lake — streaming ingestion into lakes
  • Write-Audit-Publish scoped_to Data Lake — quality gating pattern for lake data
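The Write-Audit-Publish connection can be sketched as three steps: write a batch where consumers do not read, audit it, and publish it only if the audit passes. This is a simplified sketch under stated assumptions: local directories named staging and published stand in for lake zones, the single audit rule (order_id must be present) is invented for the example, and the local rename is only a stand-in for whatever publish mechanism a real lake uses.

```python
import json
import os
from pathlib import Path


def write_staged(lake_root: Path, batch_id: str, rows: list[dict]) -> Path:
    """WRITE: store the batch in a staging area that consumers do not read."""
    staged = lake_root / "staging" / f"{batch_id}.jsonl"
    staged.parent.mkdir(parents=True, exist_ok=True)
    staged.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    return staged


def audit(staged: Path) -> list[str]:
    """AUDIT: return a list of problems; an empty list means the batch passes."""
    lines = staged.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines]
    problems = []
    if not rows:
        problems.append("batch is empty")
    for i, row in enumerate(rows):
        if row.get("order_id") is None:  # hypothetical quality rule
            problems.append(f"row {i}: missing order_id")
    return problems


def publish(lake_root: Path, staged: Path) -> Path:
    """PUBLISH: expose the audited batch to readers."""
    published = lake_root / "published" / staged.name
    published.parent.mkdir(parents=True, exist_ok=True)
    os.replace(staged, published)  # local stand-in for the publish step
    return published


def write_audit_publish(lake_root: Path, batch_id: str, rows: list[dict]):
    """Run the three steps; failed batches stay in staging for inspection."""
    staged = write_staged(lake_root, batch_id, rows)
    problems = audit(staged)
    if problems:
        return None, problems
    return publish(lake_root, staged), []
```

The point of the pattern is that readers only ever see data that has passed the audit; a failed batch never reaches the published zone.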

Definition

Why it exists

Organizations needed a central, low-cost repository for all data types (structured, semi-structured, unstructured) without requiring schema decisions at write time.
