Guide 7

Choosing a Table Format — Iceberg vs. Delta vs. Hudi

Problem Framing

The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactional table semantics to files on S3. But they solve it differently, optimize for different workloads, and have different ecosystem affinities. This guide helps engineers choose.

Relevant Nodes

  • Topics: Table Formats, Lakehouse, S3
  • Technologies: Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, Apache Flink
  • Standards: Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec, Apache Parquet
  • Architectures: Lakehouse Architecture
  • Pain Points: Schema Evolution, Small Files Problem, Lack of Atomic Rename, Metadata Overhead at Scale, Vendor Lock-In

Decision Path

  1. Start with your primary engine:

    • Databricks/Spark-heavy: Delta Lake has the tightest integration. Features like Auto Optimize, liquid clustering, and predictive I/O work best (or only) on Databricks.
    • Multi-engine (Spark + Trino + Flink + DuckDB): Iceberg. It was designed for engine-agnostic access from the start, and every major engine ships a first-class Iceberg connector (see the catalog sketch after this list).
    • CDC-first (Change Data Capture): Hudi. Record-level upserts and incremental queries are Hudi's core strength, and its Merge-on-Read (MoR) table type is built for write-heavy, update-heavy workloads (see the upsert sketch after this list).
  2. Evaluate on S3-specific dimensions:

    Dimension            | Iceberg                           | Delta Lake                                       | Hudi
    S3 atomic commit     | Catalog-based pointer swap        | DynamoDB log store (S3 lacks atomic rename)      | Marker-based with lock provider
    Schema evolution     | Column-ID-based, metadata-only    | Enforced + evolvable                             | Schema-on-read + enforcement
    Partition management | Hidden partitioning (transparent) | User-managed (+ liquid clustering on Databricks) | User-managed
    Compaction           | rewriteDataFiles                  | OPTIMIZE                                         | Inline or offline compaction
    Multi-engine support | Broadest                          | Improving (Delta Kernel)                         | Moderate
    Metadata model       | Manifest tree (prunable)          | Flat JSON log (checkpointed)                     | Timeline (action-based)
  3. Consider ecosystem momentum:

    • Iceberg is emerging as the cross-vendor standard: Snowflake, AWS, Google, and Databricks all support it.
    • Delta Lake remains strong in the Databricks ecosystem and is gaining multi-engine support via Delta Kernel.
    • Hudi adoption is concentrated in CDC-heavy and streaming-heavy environments (Uber, ByteDance).
  4. Do not over-invest in the choice:

    • All three formats store their data as Parquet files; only the metadata layer differs. Converting a table between formats is therefore largely a metadata operation, not a data rewrite (see the migration sketch after this list).
    • The trend is toward interoperability: Delta's UniForm feature, for example, writes Iceberg-readable metadata alongside a Delta table. The choice is becoming less permanent.
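
To make step 1's multi-engine path and step 2's hidden-partitioning and compaction rows concrete, here is a minimal PySpark sketch of one Iceberg catalog on S3. The catalog name "lake", the Glue backing, and the bucket path are illustrative assumptions, not prescriptions; the session extensions, SparkCatalog, GlueCatalog, S3FileIO, and the rewrite_data_files procedure are standard Iceberg-on-Spark surface.

```python
# Sketch: one Iceberg catalog on S3 that Spark, Trino, Flink, and DuckDB
# can all point at. Assumes the iceberg-spark-runtime and AWS bundle jars
# are on the classpath and AWS credentials are configured; "lake" and the
# bucket path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Hidden partitioning: readers filter on event_ts directly; Iceberg maps
# the predicate to day-level partitions with no user-visible partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Compaction: rewrite_data_files is the SQL procedure behind the
# rewriteDataFiles action; it coalesces small files into larger ones.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")
```

Because the table lives behind a shared catalog rather than behind Spark, Trino, Flink, or DuckDB can read the same metadata without Spark in the loop.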
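
For the CDC-first path in step 1, a hedged sketch of a record-level upsert into a Merge-on-Read Hudi table, reusing the Spark session above. The table name, key, precombine, and partition fields, and the S3 path are illustrative assumptions; the hoodie.* options are standard Hudi write configs.

```python
from pyspark.sql import Row

# A toy CDC batch; in practice this would come from a stream or a
# change-capture source. Column names are illustrative.
changes_df = spark.createDataFrame([
    Row(order_id=1, updated_at="2024-06-01", status="shipped", region="us"),
])

# Upsert into a Merge-on-Read table: rows with an existing key are
# updated (latest updated_at wins via precombine), new keys are inserted.
(
    changes_df.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.partitionpath.field", "region")
    .mode("append")
    .save("s3://my-bucket/hudi/orders")
)
```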

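To ground step 4's claim that conversion is largely a metadata operation, a sketch using Iceberg's snapshot and migrate Spark procedures, reusing the "lake" catalog above against a hypothetical Parquet-backed table spark_catalog.db.clicks. Both procedures write new Iceberg metadata over the existing data files without copying or rewriting them.

```python
# snapshot() writes Iceberg metadata pointing at the existing Parquet
# files of spark_catalog.db.clicks; nothing is copied, and the source
# table is untouched, so the step is cheap and reversible.
spark.sql("""
    CALL lake.system.snapshot(
        source_table => 'spark_catalog.db.clicks',
        table        => 'lake.db.clicks_iceberg'
    )
""")

# migrate() converts the source table in place once you are ready to
# commit; it likewise reuses the existing data files.
# spark.sql("CALL lake.system.migrate('spark_catalog.db.clicks')")
```
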
What Changed Over Time

  • 2016-2018: Hudi (then Hoodie) emerged at Uber for incremental ETL; Iceberg developed at Netflix for massive-scale table management; Delta developed at Databricks for reliable Spark pipelines.
  • 2019-2020: All three were open-sourced under vendor-neutral foundations (Iceberg and Hudi at Apache, Delta Lake at the Linux Foundation). The "format war" narrative emerged.
  • 2021-2023: Iceberg gained momentum as the cross-engine standard. Snowflake, AWS (Athena/Glue), and Trino adopted it.
  • 2023-2024: Databricks announced UniForm (Delta tables readable as Iceberg) and direct Iceberg support, an effective hedge toward format convergence.
  • The industry trend is toward Iceberg as the de facto standard, with Delta and Hudi remaining viable in their core ecosystems.
