Guide 7

Choosing a Table Format — Iceberg vs. Delta vs. Hudi

Problem Framing

The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactional table semantics to files on S3. But they solve it differently, optimize for different workloads, and have different ecosystem affinities. This guide helps engineers choose.

Relevant Nodes

  • Topics: Table Formats, Lakehouse, S3
  • Technologies: Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, Apache Flink
  • Standards: Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec, Apache Parquet
  • Architectures: Lakehouse Architecture
  • Pain Points: Schema Evolution, Small Files Problem, Lack of Atomic Rename, Metadata Overhead at Scale, Vendor Lock-In

Decision Path

  1. Start with your primary engine:

    • Databricks/Spark-heavy: Delta Lake has the tightest integration. Features like Auto Optimize, liquid clustering, and predictive I/O work best (or only) on Databricks.
    • Multi-engine (Spark + Trino + Flink + DuckDB): Iceberg. It was designed for engine-agnostic access from the start, and every major engine ships a first-class Iceberg connector (see the catalog sketch after this list).
    • CDC-first (Change Data Capture): Hudi. Record-level upserts and incremental queries are Hudi's core strength, and its Merge-on-Read (MoR) table type is built for write-heavy, update-heavy workloads (see the upsert sketch after this list).
  2. Evaluate on S3-specific dimensions:

    Dimension            | Iceberg                           | Delta Lake                                       | Hudi
    S3 atomic commit     | Catalog-based pointer swap        | DynamoDB log store (S3 lacks atomic rename)      | Marker-based with lock provider
    Schema evolution     | Column-ID-based, metadata-only    | Enforced + evolvable                             | Schema-on-read + enforcement
    Partition management | Hidden partitioning (transparent) | User-managed (+ liquid clustering on Databricks) | User-managed
    Compaction           | rewriteDataFiles                  | OPTIMIZE                                         | Inline or offline compaction
    Multi-engine support | Broadest                          | Improving (Delta Kernel)                         | Moderate
    Metadata model       | Manifest tree (prunable)          | Flat JSON log (checkpointed)                     | Timeline (action-based)
  3. Consider ecosystem momentum:

    • Iceberg is emerging as the cross-vendor standard: Snowflake, AWS, Google, and Databricks all support it.
    • Delta Lake remains strong in the Databricks ecosystem and is gaining multi-engine support via Delta Kernel.
    • Hudi adoption is concentrated in CDC-heavy and streaming-heavy environments (Uber, ByteDance).
  4. Do not over-invest in the choice:

    • All three formats store their data as Parquet files; only the metadata layer differs. Converting a table between formats is therefore largely a metadata operation, not a data rewrite (see the migration sketch after this list).
    • The trend is toward interoperability: Delta's UniForm feature, for example, writes Iceberg-readable metadata alongside a Delta table. The choice is becoming less permanent.
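
To make step 1's multi-engine path and step 2's hidden-partitioning and compaction rows concrete, here is a minimal PySpark sketch of one Iceberg catalog on S3. The catalog name "lake", the Glue backing, and the bucket path are illustrative assumptions, not prescriptions; the session extensions, SparkCatalog, GlueCatalog, S3FileIO, and the rewrite_data_files procedure are standard Iceberg-on-Spark surface.

```python
# Sketch: one Iceberg catalog on S3 that Spark, Trino, Flink, and DuckDB
# can all point at. Assumes the iceberg-spark-runtime and AWS bundle jars
# are on the classpath and AWS credentials are configured; "lake" and the
# bucket path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Hidden partitioning: readers filter on event_ts directly; Iceberg maps
# the predicate to day-level partitions with no user-visible partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Compaction: rewrite_data_files is the SQL procedure behind the
# rewriteDataFiles action; it coalesces small files into larger ones.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")
```

Because the table lives behind a shared catalog rather than behind Spark, Trino, Flink, or DuckDB can read the same metadata without Spark in the loop.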
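
For the CDC-first path in step 1, a hedged sketch of a record-level upsert into a Merge-on-Read Hudi table, reusing the Spark session above. The table name, key, precombine, and partition fields, and the S3 path are illustrative assumptions; the hoodie.* options are standard Hudi write configs.

```python
from pyspark.sql import Row

# A toy CDC batch; in practice this would come from a stream or a
# change-capture source. Column names are illustrative.
changes_df = spark.createDataFrame([
    Row(order_id=1, updated_at="2024-06-01", status="shipped", region="us"),
])

# Upsert into a Merge-on-Read table: rows with an existing key are
# updated (latest updated_at wins via precombine), new keys are inserted.
(
    changes_df.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.partitionpath.field", "region")
    .mode("append")
    .save("s3://my-bucket/hudi/orders")
)
```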

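To ground step 4's claim that conversion is largely a metadata operation, a sketch using Iceberg's snapshot and migrate Spark procedures, reusing the "lake" catalog above against a hypothetical Parquet-backed table spark_catalog.db.clicks. Both procedures write new Iceberg metadata over the existing data files without copying or rewriting them.

```python
# snapshot() writes Iceberg metadata pointing at the existing Parquet
# files of spark_catalog.db.clicks; nothing is copied, and the source
# table is untouched, so the step is cheap and reversible.
spark.sql("""
    CALL lake.system.snapshot(
        source_table => 'spark_catalog.db.clicks',
        table        => 'lake.db.clicks_iceberg'
    )
""")

# migrate() converts the source table in place once you are ready to
# commit; it likewise reuses the existing data files.
# spark.sql("CALL lake.system.migrate('spark_catalog.db.clicks')")
```
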
What Changed Over Time

  • 2016-2018: Hudi (then Hoodie) emerged at Uber for incremental ETL; Iceberg developed at Netflix for massive-scale table management; Delta developed at Databricks for reliable Spark pipelines.
  • 2019-2020: All three were open-sourced under vendor-neutral foundations (Iceberg and Hudi at Apache, Delta Lake at the Linux Foundation). The "format war" narrative emerged.
  • 2021-2023: Iceberg gained momentum as the cross-engine standard. Snowflake, AWS (Athena/Glue), and Trino adopted it.
  • 2023-2024: Databricks announced UniForm (Delta tables readable as Iceberg) and direct Iceberg support, an effective hedge toward format convergence.
  • The industry trend is toward Iceberg as the de facto standard, with Delta and Hudi remaining viable in their core ecosystems.
