Technology

Apache Hudi

Summary

What it is

A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.

Where it fits

Hudi occupies the niche of record-level mutations on S3 data. Where Iceberg and Delta focus on batch analytics, Hudi's strength is CDC ingestion and near-real-time upserts, making it a natural choice for pipelines that need to update individual records.
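
A minimal PySpark sketch of that upsert path, assuming Spark with the Hudi bundle on the classpath; the table name, schema, and S3 path are illustrative:

    from pyspark.sql import SparkSession

    # Assumes the hudi-spark bundle is on the Spark classpath.
    spark = (SparkSession.builder
        .appName("hudi-upsert-sketch")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate())

    # Illustrative CDC batch: one updated order record.
    updates = spark.createDataFrame(
        [(42, "shipped", "2024-05-01", "2024-05-01T10:00:00Z")],
        ["order_id", "status", "order_date", "updated_at"],
    )

    hudi_options = {
        "hoodie.table.name": "orders",
        "hoodie.datasource.write.recordkey.field": "order_id",     # identifies the record
        "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
        "hoodie.datasource.write.partitionpath.field": "order_date",
        "hoodie.datasource.write.operation": "upsert",
    }

    # "append" mode plus the upsert operation updates matching keys in place.
    (updates.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://example-bucket/lake/orders"))

Records whose order_id already exists in the table are updated; new keys are inserted.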

Misconceptions / Traps

  • Hudi has two table types with very different performance profiles: Copy-on-Write rewrites base files on every update and favors read performance, while Merge-on-Read appends row-based delta logs and compacts them later, favoring write latency. Choosing the wrong one is a common early mistake.
  • Hudi's operational complexity (compaction scheduling, cleaning policies, indexing) is higher than Iceberg's or Delta's. Budget for operational overhead; see the configuration sketch after this list.
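
A hedged sketch of the knobs involved. The option keys are Hudi's documented write configs; the numeric values are placeholders to tune per workload, not recommendations:

    # Merge-on-Read trades read-time merging for faster writes; compaction
    # and cleaning then need explicit budgets.
    mor_options = {
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # default is COPY_ON_WRITE
        # Compact delta logs into base files every N delta commits.
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": "5",
        # How many commits of old file versions the cleaner retains.
        "hoodie.cleaner.commits.retained": "10",
    }

These merge into the same options dict used for writes. Inline compaction runs inside the writer, which is simpler to operate but adds write latency; offline compaction avoids that cost but needs its own scheduled job.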

Key Connections

  • implements Lakehouse Architecture — provides incremental processing on lakes
  • depends_on Apache Hudi Spec, Apache Parquet — specification and data format
  • solves Legacy Ingestion Bottlenecks (incremental ingestion), Schema Evolution
  • scoped_to Table Formats, Lakehouse

Definition

What it is

A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage.

Why it exists

Many real-world data pipelines need to update and delete individual records, not just append. Hudi brings record-level operations to data lakes by rewriting only the affected files (Copy-on-Write) or logging changes for later compaction (Merge-on-Read), rather than reprocessing entire partitions or tables.
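
As an illustration of a record-level delete, continuing the Spark session and table from the upsert sketch above (delete semantics vary somewhat across Hudi versions; the key, partition, and precombine fields here mirror that earlier sketch):

    # Rows in this frame name the record keys (and partitions) to delete.
    deletes = spark.createDataFrame(
        [(42, "2024-05-01", "2024-05-01T11:00:00Z")],
        ["order_id", "order_date", "updated_at"],
    )

    (deletes.write.format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.recordkey.field", "order_id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .option("hoodie.datasource.write.partitionpath.field", "order_date")
        .option("hoodie.datasource.write.operation", "delete")  # tombstone matching keys
        .mode("append")
        .save("s3://example-bucket/lake/orders"))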

Primary use cases

Change data capture (CDC) into data lakes, incremental ETL pipelines, near-real-time analytics on S3 data.
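
For the incremental-ETL case, Hudi's incremental query reads only records committed after a given instant rather than rescanning the whole table. A sketch reusing the session from above; the begin instant is a placeholder commit timestamp:

    # Pull only changes committed after the given instant.
    incremental = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240501000000")  # placeholder
        .load("s3://example-bucket/lake/orders"))

    incremental.show()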

Relationships

Resources