Guide 7
Choosing a Table Format — Iceberg vs. Delta vs. Hudi
Problem Framing
The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactional table semantics to files on S3. But they solve it differently, optimize for different workloads, and have different ecosystem affinities. This guide helps engineers choose.
Relevant Nodes
- Topics: Table Formats, Lakehouse, S3
- Technologies: Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, Apache Flink
- Standards: Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec, Apache Parquet
- Architectures: Lakehouse Architecture
- Pain Points: Schema Evolution, Small Files Problem, Lack of Atomic Rename, Metadata Overhead at Scale, Vendor Lock-In
Decision Path
Start with your primary engine:
- Databricks/Spark-heavy: Delta Lake has the tightest integration. Features like Auto Optimize, liquid clustering, and predictive I/O work best (or only) on Databricks.
- Multi-engine (Spark + Trino + Flink + DuckDB): Iceberg. It was designed for engine-agnostic access from the start, and every major engine ships a first-class Iceberg connector (see the configuration sketch after this list).
- CDC-first (Change Data Capture): Hudi. Record-level upserts and incremental queries are Hudi's core strength, and its Merge-on-Read (MoR) table type is optimized for write-heavy, update-heavy workloads.
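As a rough illustration of engine-agnostic access, the following PySpark sketch registers an Iceberg catalog backed by S3 and creates a table through it. The catalog name ("lake"), warehouse path, and table names are assumptions for illustration, and it presumes the iceberg-spark-runtime package is on the classpath.

```python
# Minimal sketch (assumptions: catalog name "lake", an S3 warehouse bucket,
# and the iceberg-spark-runtime package on the classpath).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-multi-engine-sketch")
    # Iceberg's SQL extensions enable MERGE INTO, CALL procedures, etc.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog named "lake". A Hadoop catalog keeps table
    # metadata directly under the S3 warehouse path; a Glue, Hive, or REST
    # catalog would be configured the same way with a different "type".
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Create and populate a table through the catalog.
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO lake.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM lake.db.events").show()
```

Trino, Flink, and DuckDB can read the same table by pointing at the same catalog or warehouse; nothing about the table layout is Spark-specific.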
Then evaluate along S3-specific dimensions:
| Dimension | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| S3 atomic commit | Catalog-based pointer swap | DynamoDB-backed LogStore required for concurrent writers | Marker-based with lock provider |
| Schema evolution | Column-ID-based, metadata-only | Enforced + evolvable | Schema-on-read + enforcement |
| Partition management | Hidden partitioning (transparent) | User-managed (+ liquid clustering on Databricks) | User-managed |
| Compaction | rewriteDataFiles | OPTIMIZE | Inline or offline compaction |
| Multi-engine support | Broadest | Improving (Delta Kernel) | Moderate |
| Metadata model | Manifest tree (prunable) | Flat JSON log (checkpointed) | Timeline (action-based) |
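To ground the partition-management and compaction rows, here is a hedged Spark SQL sketch contrasting the two approaches. Catalog and table names are illustrative, and both formats' runtime packages and SQL extensions are assumed to be configured in the same session.

```python
# Hedged Spark SQL sketch of the partitioning and compaction rows above.
# Assumes an Iceberg catalog named "lake" plus Delta support in the session.

# Iceberg: hidden partitioning. The transform days(ts) lives in table
# metadata; queries filtering on ts are pruned without a partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.clicks (user_id BIGINT, ts TIMESTAMP, url STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Iceberg compaction: a stored procedure that rewrites small files in place.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.clicks')")

# Delta Lake: partition columns are explicit, and compaction is OPTIMIZE.
spark.sql("""
    CREATE TABLE IF NOT EXISTS delta_clicks (user_id BIGINT, ts TIMESTAMP, dt DATE)
    USING delta
    PARTITIONED BY (dt)
""")
spark.sql("OPTIMIZE delta_clicks")
```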
Consider ecosystem momentum:
- The industry is converging on Iceberg as its standard; Snowflake, AWS, Google, and Databricks all support it.
- Delta Lake remains strong in the Databricks ecosystem and is gaining multi-engine support via Delta Kernel.
- Hudi adoption is concentrated in CDC-heavy and streaming-heavy environments (Uber, ByteDance).
Do not over-invest in the choice:
- All three formats use Parquet as the data file format, so migration between formats is largely a metadata operation, not a data rewrite.
- The trend is toward interoperability (Iceberg compatibility layers for Delta such as UniForm, which enables cross-format reading), so the choice is becoming less permanent; a sketch of both points follows this list.
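As a rough sketch in Spark SQL: Iceberg's snapshot procedure builds Iceberg metadata over an existing Parquet-backed table without copying data, and Delta's UniForm table properties expose a Delta table to Iceberg readers. All names are illustrative assumptions, and availability of these procedures and properties depends on the Iceberg and Delta versions in use.

```python
# Hedged sketch of both points. Names are illustrative; the exact procedures
# and table properties depend on the Iceberg and Delta versions in use.

# Iceberg: snapshot an existing Parquet-backed Hive table as an Iceberg table.
# Only new metadata is written; the Parquet data files stay where they are.
spark.sql("""
    CALL lake.system.snapshot(
        source_table => 'hive_db.events_parquet',
        table        => 'db.events_iceberg'
    )
""")

# Delta Lake UniForm: write Iceberg-compatible metadata alongside the Delta
# log so Iceberg readers can query the same Parquet data files.
spark.sql("""
    ALTER TABLE delta_clicks SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2'          = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```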
What Changed Over Time
- 2016-2018: Hudi (then "Hoodie") emerged at Uber for incremental ETL; Iceberg was developed at Netflix for massive-scale table management; Delta was developed at Databricks for reliable Spark pipelines.
- 2019-2020: All three were open-sourced and entered the Apache Software Foundation (Iceberg, Hudi) or the Linux Foundation (Delta Lake). The "format war" narrative emerged.
- 2021-2023: Iceberg gained momentum as the cross-engine standard. Snowflake, AWS (Athena/Glue), and Trino adopted it.
- 2023-2024: Databricks announced UniForm (Delta tables readable as Iceberg) and direct Iceberg support, effectively hedging on format convergence.
- The industry trend is toward Iceberg as the de facto standard, with Delta and Hudi remaining viable in their core ecosystems.
Sources
- iceberg.apache.org/spec/
- github.com/delta-io/delta/blob/master/PROTOCOL.md
- hudi.apache.org/tech-specs/
- hudi.apache.org/docs/overview
- www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-icebe...
- www.dremio.com/blog/table-format-partitioning-comparison-apache-iceber...
- docs.delta.io/latest/delta-storage.html
- www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in...