Architecture

Lakehouse Architecture

Summary

What it is

A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer.

Where it fits

Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. It reduces, and for many workloads removes, the need for a separate data warehouse by adding reliability directly to S3-stored data through table formats.

Misconceptions / Traps

A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine.
Lakehouse performance depends heavily on metadata management. Without routine table maintenance (snapshot expiration, orphan file cleanup), metadata and unreferenced files accumulate and query planning degrades.
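The maintenance tasks named above can be illustrated with a minimal sketch. It is a toy in-memory model, not the API of Iceberg, Delta Lake, or Hudi; the `Snapshot` class, its fields, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int      # epoch seconds (hypothetical field)
    data_files: frozenset  # file paths this snapshot references


def expire_snapshots(snapshots, older_than, keep_last=1):
    """Drop snapshots committed before `older_than`, always keeping the newest `keep_last`."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    protected = ordered[-keep_last:] if keep_last > 0 else []
    return [s for s in ordered if s.committed_at >= older_than or s in protected]


def find_orphan_files(all_files_in_storage, snapshots):
    """Return files present in storage that no retained snapshot references."""
    referenced = set()
    for s in snapshots:
        referenced |= s.data_files
    return sorted(set(all_files_in_storage) - referenced)


snapshots = [
    Snapshot(1, 100, frozenset({"a.parquet"})),
    Snapshot(2, 200, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, 300, frozenset({"c.parquet"})),  # a and b were compacted into c
]
# Listing of the table's storage prefix, including a leftover from a failed write.
storage = ["a.parquet", "b.parquet", "c.parquet", "tmp-failed-write.parquet"]

retained = expire_snapshots(snapshots, older_than=250)
orphans = find_orphan_files(storage, retained)
print([s.snapshot_id for s in retained], orphans)
```

Orphan detection only makes sense after expiration: a file is safe to delete only once no retained snapshot can still reach it. Production systems typically also apply a grace period so that files from in-flight writes are not removed.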

Key Connections

depends_on S3 API, Apache Parquet — the storage interface and file format
solves Cold Scan Latency — metadata-driven query planning reduces unnecessary S3 scans
constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
Apache Iceberg, Delta Lake, Apache Hudi implements Lakehouse Architecture
Trino, Apache Spark, StarRocks, Apache Flink used_by Lakehouse Architecture
scoped_to Lakehouse, Object Storage
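The "metadata-driven query planning" behind the Cold Scan Latency connection can be sketched as file pruning on per-file column statistics. This is a simplified illustration under assumed names (`FileStats`, `plan_scan`, and the bucket path are hypothetical); real table formats keep richer statistics in their manifests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int  # smallest event_date in the file, as YYYYMMDD
    max_value: int  # largest event_date in the file, as YYYYMMDD


def plan_scan(manifest, lower, upper):
    """Return paths whose [min, max] range overlaps the predicate lower <= x <= upper."""
    return [f.path for f in manifest if f.max_value >= lower and f.min_value <= upper]


# Hypothetical manifest: one entry per Parquet data file.
manifest = [
    FileStats("s3://example-bucket/t/part-0.parquet", 20240101, 20240131),
    FileStats("s3://example-bucket/t/part-1.parquet", 20240201, 20240229),
    FileStats("s3://example-bucket/t/part-2.parquet", 20240301, 20240331),
]

# WHERE event_date BETWEEN 20240210 AND 20240215
files_to_read = plan_scan(manifest, 20240210, 20240215)
print(files_to_read)
```

The planner consults only the manifest, so the two files that cannot contain matching rows are never opened on S3.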

Definition

What it is

A unified architecture that combines data lake storage (files on S3) with data warehouse capabilities (ACID transactions, schema enforcement, SQL access, governance) by using a table format as the bridge layer.
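To make the "bridge layer" idea concrete, here is a minimal sketch of what a table format adds on top of immutable files: schema enforcement at write time and atomic, versioned commits. It is a toy model (`ToyTable` and its methods are hypothetical), and the lock stands in for whatever atomic pointer swap a real catalog provides.

```python
import threading


class CommitConflict(Exception):
    """Raised when a writer commits against a stale table version."""


class SchemaMismatch(Exception):
    """Raised when rows do not match the table schema."""


class ToyTable:
    def __init__(self, schema):
        self.schema = dict(schema)     # column name -> Python type
        self._snapshots = [()]         # version 0 is the empty table
        self._lock = threading.Lock()  # stands in for an atomic catalog update

    def current_version(self):
        return len(self._snapshots) - 1

    def files(self, version=None):
        """Files visible at a version; older versions stay readable (time travel)."""
        v = self.current_version() if version is None else version
        return self._snapshots[v]

    def commit_append(self, base_version, path, rows):
        # Schema enforcement: reject the write before anything becomes visible.
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaMismatch(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise SchemaMismatch(f"column {col!r} expects {typ.__name__}")
        # Optimistic concurrency: publish only if nobody committed since base_version.
        with self._lock:
            if base_version != self.current_version():
                raise CommitConflict(f"table moved past version {base_version}")
            self._snapshots.append(self._snapshots[-1] + (path,))
            return self.current_version()


table = ToyTable({"id": int, "name": str})
v1 = table.commit_append(0, "data/f1.parquet", [{"id": 1, "name": "a"}])

try:
    table.commit_append(0, "data/f2.parquet", [{"id": 2, "name": "b"}])  # stale base
    conflict = False
except CommitConflict:
    conflict = True

try:
    table.commit_append(v1, "data/f3.parquet", [{"id": "x", "name": "c"}])  # wrong type
    rejected = False
except SchemaMismatch:
    rejected = True
```

Readers see either the old snapshot or the new one, never a partially written state. Real table formats are more permissive than this sketch: for example, they usually retry or merge non-conflicting appends instead of failing the second writer outright.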

Why it exists

Running both a data lake and a data warehouse creates data duplication, ETL complexity, and governance gaps. The lakehouse eliminates the warehouse copy by adding warehouse-grade reliability directly to the data lake.

Primary use cases

Unified analytics platform, eliminating ETL between lake and warehouse, multi-engine SQL access to a single copy of data on S3.

Relationships

Outbound Relationships

scoped_to

Lakehouse, Object Storage

depends_on

S3 API, Apache Parquet

solves

Cold Scan Latency

constrained_by

Metadata Overhead at Scale, Lack of Atomic Rename

Inbound Relationships

enables

AWS S3, MinIO, S3 API, Apache Parquet, Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec

implements

Apache Iceberg, Delta Lake, Apache Hudi

used_by

Trino, Apache Spark, StarRocks, Apache Flink

is_a

Medallion Architecture

augments

Semantic Search

Resources

Paper (High)

www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

The foundational CIDR 2021 paper by Armbrust, Ghodsi, Xin, and Zaharia that introduced the Lakehouse concept to the research literature, arguing that open data formats on object storage can unify warehousing and ML workloads.

Docs (High)

www.databricks.com/product/data-lakehouse

Databricks' official product page explaining Lakehouse architecture, the commercial realization of the CIDR paper's vision.

Docs (High)

docs.databricks.com/aws/en/lakehouse-architecture/

Databricks' well-architected data lakehouse documentation covering architectural pillars for production implementations.