Lakehouse Architecture
Summary
What it is
A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.
Where it fits
Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. For many analytical workloads it removes the need for a separate data warehouse by adding reliability directly to S3-stored data through table formats.
Misconceptions / Traps
- A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
- Lakehouse performance depends heavily on metadata management. Without routine table maintenance (snapshot expiration, orphan file cleanup), metadata and unreferenced files accumulate and query planning degrades.
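The maintenance tasks named in the last bullet can be sketched with a toy model. This is a minimal illustration, not how any particular table format implements them: the `Snapshot` class, the `expire_snapshots` and `find_orphans` helpers, the file names, and the integer timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp: int          # commit time in arbitrary units (illustrative)
    data_files: frozenset   # data files this snapshot references


def expire_snapshots(snapshots, now, max_age, keep_last=1):
    """Drop snapshots older than max_age, always keeping the newest keep_last."""
    ordered = sorted(snapshots, key=lambda s: s.timestamp)
    protected = ordered[-keep_last:] if keep_last > 0 else []
    return [s for s in ordered if s in protected or now - s.timestamp <= max_age]


def find_orphans(files_in_storage, snapshots):
    """Files in storage that no retained snapshot references.

    In this toy model that covers both files released by expiration
    and leftovers from failed writes.
    """
    referenced = set()
    for s in snapshots:
        referenced |= s.data_files
    return set(files_in_storage) - referenced


snapshots = [
    Snapshot(1, 100, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, 200, frozenset({"b.parquet", "c.parquet"})),
    Snapshot(3, 300, frozenset({"c.parquet", "d.parquet"})),
]
storage = {"a.parquet", "b.parquet", "c.parquet", "d.parquet", "tmp-failed-write.parquet"}

retained = expire_snapshots(snapshots, now=320, max_age=150)
orphans = find_orphans(storage, retained)
print([s.snapshot_id for s in retained], sorted(orphans))
```

With fewer retained snapshots there is less metadata for a planner to walk, and the unreferenced files can be deleted to reclaim storage. Real table formats expose their own maintenance procedures for this; the sketch only shows the bookkeeping idea.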
Key Connections
- depends_on S3 API, Apache Parquet: the storage interface and file format
- solves Cold Scan Latency: metadata-driven query planning reduces unnecessary S3 scans
- constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
- Apache Iceberg, Delta Lake, Apache Hudi implements Lakehouse Architecture
- Trino, Apache Spark, StarRocks, Apache Flink used_by Lakehouse Architecture
- scoped_to Lakehouse, Object Storage
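To make "metadata-driven query planning" concrete, the following sketch shows one common form of it: pruning files by per-file min/max column statistics before any data is read. The `FileStats` class, the `plan_scan` helper, the bucket paths, and the date-as-integer encoding are assumptions made for illustration, not the layout of any specific table format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int   # smallest value of the filter column in this file
    max_value: int   # largest value of the filter column in this file


def plan_scan(manifest, lower, upper):
    """Return paths of files whose [min, max] range overlaps [lower, upper]."""
    return [f.path for f in manifest if f.max_value >= lower and f.min_value <= upper]


# Hypothetical manifest: one entry per Parquet file, dates encoded as YYYYMMDD.
manifest = [
    FileStats("s3://bucket/events/part-0.parquet", 20240101, 20240131),
    FileStats("s3://bucket/events/part-1.parquet", 20240201, 20240229),
    FileStats("s3://bucket/events/part-2.parquet", 20240301, 20240331),
]

# Query filter: event_date BETWEEN 20240215 AND 20240310
to_read = plan_scan(manifest, lower=20240215, upper=20240310)
print(to_read)
```

Only the files whose value range can contain matching rows are fetched from S3; the January file is skipped using metadata alone.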
Definition
What it is
A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.
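As a rough illustration of how a table format can act as the bridge layer, the sketch below models a table as a list of immutable snapshots, with optimistic commits and a schema check. Everything here is hypothetical (the `ToyTable` class and its methods are invented for this example); real table formats keep this metadata in files and rely on a catalog or storage primitive for the atomic step, which a lock stands in for here.

```python
import threading


class ToyTable:
    """In-memory stand-in for a table format's metadata layer (illustrative only)."""

    def __init__(self, schema):
        self.schema = dict(schema)        # column name -> Python type
        self._snapshots = [frozenset()]   # each snapshot is an immutable set of data files
        self._lock = threading.Lock()     # stands in for an atomic pointer swap

    def current_version(self):
        return len(self._snapshots) - 1

    def files(self, version=None):
        """Files visible at a version; readers never see a half-applied commit."""
        if version is None:
            version = self.current_version()
        return self._snapshots[version]

    def validate(self, row):
        """Schema enforcement: reject rows with wrong columns or types."""
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match schema {sorted(self.schema)}")
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")

    def commit(self, base_version, added=(), removed=()):
        """Optimistic commit: succeeds only if nobody committed since base_version."""
        with self._lock:
            if base_version != self.current_version():
                return False  # conflict: caller must re-read and retry
            new_files = (self._snapshots[-1] - frozenset(removed)) | frozenset(added)
            self._snapshots.append(new_files)
            return True


table = ToyTable({"user_id": int, "country": str})
table.validate({"user_id": 1, "country": "DE"})   # passes the schema check

base = table.current_version()
first = table.commit(base, added=["f1.parquet"])
second = table.commit(base, added=["f2.parquet"])  # stale base version, rejected
print(first, second, sorted(table.files()))
```

The data files themselves are never modified in place. A commit only publishes a new snapshot, which is what lets several engines read a consistent view of the same files while writers compete for the next version.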
Why it exists
Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.
Primary use cases
Unified analytics platform, eliminating ETL between lake and warehouse, multi-engine SQL access to a single copy of data on S3.
Relationships
Outbound Relationships
scoped_to, depends_on, solves, constrained_by
Inbound Relationships
implements, augments
Resources
The foundational CIDR 2021 paper by Armbrust, Ghodsi, Xin, and Zaharia that coined the Lakehouse concept, arguing that open data formats on object storage can unify warehousing and ML workloads.
Databricks' official product page explaining Lakehouse architecture, the commercial realization of the CIDR paper's vision.
Databricks' well-architected data lakehouse documentation covering architectural pillars for production implementations.