Topic

Lakehouse

Summary

What it is

The convergence of data lake storage (raw files on object storage) with data warehouse capabilities — ACID transactions, schema enforcement, SQL access, time-travel.

Where it fits

The lakehouse sits between raw object storage and business analytics. It is the architectural layer where table formats (Apache Iceberg, Delta Lake, Apache Hudi) add table structure to data stored in S3, so that SQL engines can query it reliably.
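
To make the layering concrete, here is a minimal sketch in Python of how a table format can add structure on top of immutable files. It is a toy model, not the Iceberg, Delta, or Hudi API: the in-memory `object_store` dict, the `ToyTable` class, and the file keys are all hypothetical names chosen for illustration. Each commit records a new snapshot (a list of data-file keys), and a reader can scan either the latest snapshot or an older one.

```python
import json

# Hypothetical in-memory stand-in for an object store: key -> bytes.
object_store = {}

def put(key, rows):
    """Write an immutable data file (here: JSON-encoded rows)."""
    object_store[key] = json.dumps(rows).encode()

class ToyTable:
    """Toy table: a list of snapshots, each a list of data-file keys."""

    def __init__(self):
        self.snapshots = [[]]  # snapshot 0 is the empty table

    def append(self, key, rows):
        put(key, rows)
        # A commit adds a new snapshot; older snapshots are never mutated.
        self.snapshots.append(self.snapshots[-1] + [key])
        return len(self.snapshots) - 1

    def scan(self, snapshot_id=None):
        """Read all rows as of a snapshot (default: the latest one)."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        rows = []
        for key in self.snapshots[snapshot_id]:
            rows.extend(json.loads(object_store[key]))
        return rows

table = ToyTable()
v1 = table.append("data/part-0001.json", [{"id": 1, "amount": 10}])
v2 = table.append("data/part-0002.json", [{"id": 2, "amount": 25}])

latest_total = sum(r["amount"] for r in table.scan())    # reads both files
v1_total = sum(r["amount"] for r in table.scan(v1))      # time travel to v1
```

The data files are never rewritten; only the snapshot list changes. That separation between immutable files and a small piece of table metadata is what lets an engine treat a directory of objects as a table.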

Misconceptions / Traps

  • A lakehouse is not just "a data lake with SQL." The key differentiator is the set of guarantees defined by the table format specs: ACID transactions, snapshot isolation, and controlled schema evolution.
  • A lakehouse does not eliminate ETL. It removes the need for a second copy of the data in a separate warehouse, but the data still has to be transformed.
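
The transactional guarantees in the first point can be illustrated with a small sketch of optimistic concurrency in Python. This is an assumption-laden toy: `ToyCatalog` and `CommitConflict` are hypothetical names, and real table formats implement commits differently, with the exact rules defined in their specs. The idea shown is only that a commit succeeds if the table has not changed since the writer read it, and that a reader holding an older snapshot is unaffected by later commits.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""

class ToyCatalog:
    """Hypothetical catalog holding one pointer to the current snapshot."""

    def __init__(self):
        self.version = 0
        self.files = ()  # immutable tuple of data-file names

    def commit(self, expected_version, new_files):
        # Compare-and-swap: succeed only if nobody else committed
        # since this writer read the table.
        if expected_version != self.version:
            raise CommitConflict(
                f"expected v{expected_version}, table is at v{self.version}")
        self.files = self.files + tuple(new_files)
        self.version += 1
        return self.version

catalog = ToyCatalog()

# A reader pins the snapshot it started with.
reader_view = catalog.files

# Two writers both start from version 0.
base_a = catalog.version
base_b = catalog.version

catalog.commit(base_a, ["part-a.parquet"])        # writer A wins

try:
    catalog.commit(base_b, ["part-b.parquet"])    # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True
    # Writer B retries against the current version.
    catalog.commit(catalog.version, ["part-b.parquet"])
```

Without the version check, writer B would silently overwrite or interleave with writer A's change, which is the failure mode a plain data lake with a SQL engine on top cannot rule out.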

Key Connections

  • Lakehouse scoped_to Object Storage — the lakehouse stores all data on object storage
  • Lakehouse Architecture scoped_to Lakehouse — the concrete architectural pattern
  • Apache Iceberg, Delta Lake, Apache Hudi scoped_to Lakehouse — table format technologies
  • Medallion Architecture scoped_to Lakehouse — a data quality pattern within lakehouses
  • Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec scoped_to Lakehouse — the specifications that define table semantics

Definition

What it is

The convergence of data lake storage (raw files on object storage) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, time-travel).
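
As one illustration of a warehouse capability applied to lake files, the following Python sketch shows schema enforcement on write. It is a simplified model under stated assumptions: the `SCHEMA` mapping, `validate`, and `append` are hypothetical helpers, not part of any table format's API, and a plain list stands in for the table.

```python
# Hypothetical table schema: column name -> required Python type.
SCHEMA = {"id": int, "amount": float}

def validate(row, schema=SCHEMA):
    """Return the row if it matches the schema, else raise ValueError."""
    if set(row) != set(schema):
        raise ValueError(f"columns {sorted(row)} do not match {sorted(schema)}")
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise ValueError(f"{column!r} must be {expected.__name__}")
    return row

def append(table, rows):
    """All-or-nothing append: validate every row before any is written."""
    checked = [validate(r) for r in rows]
    table.extend(checked)

table = []
append(table, [{"id": 1, "amount": 9.5}])

try:
    # The second row has a string id, so the whole batch is rejected.
    append(table, [{"id": 2, "amount": 3.0}, {"id": "3", "amount": 1.0}])
    rejected = False
except ValueError:
    rejected = True
```

A raw data lake would accept the malformed file and leave readers to discover the problem at query time; enforcing the schema at write time moves that failure to the writer.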

Why it exists

Data lakes offered cheap, scalable storage but lacked reliability guarantees. Data warehouses offered guarantees but were expensive and siloed. The lakehouse concept unifies both on a single object storage layer.

Relationships

Outbound Relationships

Resources