Lakehouse Architecture

Summary

What it is

A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.

Where it fits

Lakehouse Architecture is a widely adopted pattern for analytics in the S3 ecosystem. It reduces, and for many workloads removes, the need for a separate data warehouse by adding reliability (atomic commits, schema enforcement) directly to S3-stored data through table formats.

Misconceptions / Traps

  • A lakehouse does not mean "no data warehouse." It means warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
  • Lakehouse performance depends heavily on metadata management. Without regular table maintenance (snapshot expiration, orphan file cleanup), metadata and unreferenced files accumulate and query planning slows down.
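
The maintenance tasks named above can be illustrated with a toy model. This is a minimal sketch, not the API of any table format: `Snapshot`, `expire_snapshots`, and `unreferenced_files` are hypothetical names, and real implementations track files through manifests rather than plain sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp: int          # commit time, illustrative integer clock
    data_files: frozenset   # object keys this snapshot references


def expire_snapshots(snapshots, older_than, keep_last=1):
    """Drop snapshots older than `older_than`, always keeping the newest `keep_last`."""
    ordered = sorted(snapshots, key=lambda s: s.timestamp)
    protected = ordered[-keep_last:] if keep_last > 0 else []
    return [s for s in ordered if s.timestamp >= older_than or s in protected]


def unreferenced_files(all_keys, snapshots):
    """Return stored keys that no retained snapshot references (safe to delete)."""
    referenced = set()
    for snap in snapshots:
        referenced |= snap.data_files
    return sorted(set(all_keys) - referenced)


snapshots = [
    Snapshot(1, 100, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, 200, frozenset({"b.parquet", "c.parquet"})),
    Snapshot(3, 300, frozenset({"c.parquet", "d.parquet"})),
]
# Keys present in the bucket, including a leftover from a failed write.
bucket_keys = ["a.parquet", "b.parquet", "c.parquet", "d.parquet",
               "tmp-failed-write.parquet"]

retained = expire_snapshots(snapshots, older_than=250)
unreferenced = unreferenced_files(bucket_keys, retained)
```

In this toy run only snapshot 3 is retained, so the files referenced solely by expired snapshots and the failed-write leftover all become deletion candidates. Skipping either step means the snapshot list and the bucket both keep growing.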

Key Connections

  • depends_on S3 API, Apache Parquet — the storage interface and file format
  • solves Cold Scan Latency — metadata-driven query planning reduces unnecessary S3 scans
  • constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
  • Apache Iceberg, Delta Lake, Apache Hudi implements Lakehouse Architecture
  • Trino, Apache Spark, StarRocks, Apache Flink used_by Lakehouse Architecture
  • scoped_to Lakehouse, Object Storage
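
The "metadata-driven query planning" connection above can be sketched as file pruning on per-file statistics. This is an illustrative model only: `FileEntry`, `plan_scan`, the bucket name, and the column `day` are assumptions, and real table formats keep such statistics in their own manifest structures.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    path: str
    min_day: str   # lowest value of the hypothetical `day` column in the file
    max_day: str   # highest value; ISO dates compare correctly as strings


def plan_scan(manifest, lo, hi):
    """Keep only files whose [min_day, max_day] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.max_day >= lo and f.min_day <= hi]


manifest = [
    FileEntry("s3://example-bucket/events/f1.parquet", "2024-01-01", "2024-01-31"),
    FileEntry("s3://example-bucket/events/f2.parquet", "2024-02-01", "2024-02-29"),
    FileEntry("s3://example-bucket/events/f3.parquet", "2024-03-01", "2024-03-31"),
]

# A query filtered to mid-February needs to open only one of the three files.
to_read = plan_scan(manifest, "2024-02-10", "2024-02-20")
```

The planner decides which objects to fetch from metadata alone, so files that cannot contain matching rows are never requested from S3.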

Definition

What it is

A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.
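
One way to picture the "bridge layer" is as a single pointer to the current table state that writers may only advance with a compare-and-swap. The sketch below is a simplified assumption of how such optimistic commits can work, not the protocol of any specific table format; `TablePointer` and `append` are hypothetical names, and an in-process lock stands in for whatever atomic primitive a real catalog provides.

```python
import threading


class TablePointer:
    """Stands in for a catalog entry holding the current table version."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = frozenset()

    def read(self):
        """Return a consistent (version, files) pair."""
        with self._lock:
            return self.version, self.files

    def compare_and_swap(self, expected_version, new_files):
        """Publish new_files only if nobody committed since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False
            self.version += 1
            self.files = frozenset(new_files)
            return True


def append(table, new_file, max_retries=3):
    """Optimistic commit: read state, build the next one, swap, retry on conflict."""
    for _ in range(max_retries):
        base_version, base_files = table.read()
        if table.compare_and_swap(base_version, base_files | {new_file}):
            return True
    return False


table = TablePointer()
append(table, "f1.parquet")

# A second writer reads the table, then loses the race to another commit.
stale_version, stale_files = table.read()
append(table, "f2.parquet")
conflict_committed = table.compare_and_swap(stale_version, stale_files | {"f3.parquet"})
```

The stale writer's swap is rejected, so readers see either the table before a commit or after it, never a partial mix of files. The data files themselves stay as ordinary objects on S3.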

Why it exists

Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.

Primary use cases

Unified analytics platform, eliminating ETL between lake and warehouse, multi-engine SQL access to a single copy of data on S3.

Resources