Topic

Table Formats

Summary

What it is

The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files on object storage.

Where it fits

Table formats bridge the gap between raw files on S3 and the structured tables that SQL engines expect. They are the enabling layer for lakehouse architectures.

Misconceptions / Traps

Table formats are specifications, not databases. They define how metadata and data files are organized — the query engine is separate.
The choice of table format is converging. Iceberg has become the de facto standard across engines, but Delta and Hudi remain relevant within their own ecosystems.
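To make the first point concrete, here is a minimal sketch of the split between a table format (data that describes files) and an engine (code that reads that description). The metadata layout, the bucket name, and both functions are hypothetical and heavily simplified. None of the three real specifications looks like this.

```python
# Hypothetical, heavily simplified table metadata. Real specifications
# (Iceberg, Delta, Hudi) define much richer structures than this.
TABLE_METADATA = {
    "schema": [
        {"name": "event_date", "type": "date"},
        {"name": "amount", "type": "double"},
    ],
    "partition_column": "event_date",
    "files": [
        {"path": "s3://example-bucket/events/a.parquet", "partition": "2024-01-01", "rows": 100},
        {"path": "s3://example-bucket/events/b.parquet", "partition": "2024-01-02", "rows": 250},
        {"path": "s3://example-bucket/events/c.parquet", "partition": "2024-01-02", "rows": 50},
    ],
}


def plan_scan(metadata, partition_value):
    """Engine A: choose which files to read for a partition filter."""
    return [f["path"] for f in metadata["files"] if f["partition"] == partition_value]


def count_rows(metadata):
    """Engine B: answer a row count from metadata alone, without opening data files."""
    return sum(f["rows"] for f in metadata["files"])


print(plan_scan(TABLE_METADATA, "2024-01-02"))
print(count_rows(TABLE_METADATA))
```

Both functions work from the same metadata document and neither owns it, which is the sense in which the format is a specification and the engine is separate.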

Key Connections

scoped_to S3 — all table formats operate on S3-stored files
Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec scoped_to Table Formats — the three major specifications
Apache Parquet scoped_to Table Formats — the dominant data file format under all three
Schema Evolution scoped_to Table Formats — the problem table formats exist to solve
Metadata Overhead at Scale scoped_to Table Formats — the problem table formats introduce
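As an illustration of the schema evolution connection, the sketch below shows one way a metadata layer can rename or add columns without rewriting data files: values are stored under stable field IDs, and the current schema maps IDs to names. Iceberg tracks columns by field ID in this spirit, but the row layout and the read_rows helper here are hypothetical simplifications.

```python
def read_rows(file_rows, schema):
    """Project rows stored by field ID onto the current schema's column names.

    Columns added after the file was written have no stored value and read as None.
    """
    return [{col["name"]: row.get(col["id"]) for col in schema} for row in file_rows]


# A data file written under the first schema; keys are field IDs, not names.
old_file = [{1: "alice", 2: 30}]

schema_v1 = [{"id": 1, "name": "user"}, {"id": 2, "name": "age"}]

# Later schema: field 1 renamed, field 3 added. The old file is untouched.
schema_v2 = [
    {"id": 1, "name": "username"},
    {"id": 2, "name": "age"},
    {"id": 3, "name": "country"},
]

print(read_rows(old_file, schema_v1))
print(read_rows(old_file, schema_v2))
```

The same file is readable under both schemas because only the metadata changed.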

Definition

What it is

The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files stored on object storage.

Why it exists

Raw files on S3 have no transactional guarantees, no schema enforcement, and no efficient way to track which files belong to a logical table. Table format specifications solve this by adding a metadata layer on top of the files.
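A toy version of such a metadata layer can be sketched as a numbered, append-only commit log that records which data files were added and removed. This is an assumption-laden illustration, not any real format: the ToyTable class, the _log directory, and the JSON layout are invented, and it uses the local filesystem's exclusive-create mode where a real implementation would need an atomic commit primitive suited to object storage.

```python
import json
import os
import tempfile


class ToyTable:
    """A toy table: data file names tracked by a numbered, append-only commit log."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, expected_version, add=(), remove=()):
        """Write commit expected_version + 1; fail if another writer already did."""
        path = os.path.join(self.log_dir, f"{expected_version + 1:08d}.json")
        # Mode "x" creates the file exclusively and raises FileExistsError
        # if it already exists, which stands in for an atomic commit here.
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return expected_version + 1

    def files_at(self, version=None):
        """Replay the log up to `version` to list live data files (time travel)."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in self._versions():
            if v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as f:
                entry = json.load(f)
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)


table = ToyTable(tempfile.mkdtemp())
v0 = table.commit(-1, add=["part-0.parquet", "part-1.parquet"])
v1 = table.commit(v0, add=["part-2.parquet"], remove=["part-0.parquet"])
print(table.files_at())    # current snapshot
print(table.files_at(v0))  # earlier snapshot
```

The log answers the three gaps named above in miniature: a commit either lands whole or fails, the set of files belonging to the table is explicit, and earlier versions stay readable. Schema enforcement is left out of the sketch.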

Relationships

Outbound Relationships

scoped_to

S3

Inbound Relationships

scoped_to

Apache Iceberg, Delta Lake, Apache Hudi, Apache Parquet, Apache Arrow, Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec, ORC, Apache Avro, Schema Evolution, Metadata Overhead at Scale, Partition Pruning Complexity, Schema Inference

Resources

Spec · High

iceberg.apache.org/spec/

The Apache Iceberg specification is the definitive reference for the most widely adopted open table format, defining snapshot isolation, schema evolution, and partition evolution.

Spec · High

github.com/delta-io/delta/blob/master/PROTOCOL.md

The Delta Lake transaction log protocol specification defines Delta's ACID transaction semantics and the on-disk structure of its metadata.

Docs · High

hudi.apache.org/docs/overview

Apache Hudi's official documentation covers its copy-on-write and merge-on-read table types, incremental processing model, and timeline-based versioning.

Blog · Medium

www.dremio.com/blog/comparison-of-data-lake-table-formats-ap...

Dremio's comprehensive comparison shows how each table format handles metadata, partitioning, and storage at petabyte scale.