Table Formats
Summary
What it is
The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files on object storage.
Where it fits
Table formats bridge the gap between raw files on S3 and the structured tables that SQL engines expect. They are the enabling layer for lakehouse architectures.
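To make the "bridge" concrete, here is a minimal sketch of how table metadata lets an engine plan a scan without listing the bucket or opening any file. It is not the layout used by Iceberg, Delta, or Hudi: `DataFile`, `TableMetadata`, `plan_scan`, and the `s3://example-bucket/...` paths are hypothetical names chosen for illustration, and the manifest is held in memory rather than stored as files.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """One data file tracked by the table (hypothetical structure)."""
    path: str
    partition: dict      # e.g. {"event_date": "2024-01-01"}
    record_count: int


@dataclass
class TableMetadata:
    """Hypothetical table metadata: a schema plus the list of live files."""
    schema: dict                 # column name -> type name
    partition_columns: tuple
    files: list = field(default_factory=list)


def plan_scan(table, partition_filter):
    """Return only the files whose partition values match the filter.

    The decision is made from metadata alone; no data file is opened.
    """
    unknown = set(partition_filter) - set(table.partition_columns)
    if unknown:
        raise ValueError(f"not partition columns: {sorted(unknown)}")
    return [
        f for f in table.files
        if all(f.partition.get(col) == val for col, val in partition_filter.items())
    ]


table = TableMetadata(
    schema={"user_id": "long", "event_date": "date"},
    partition_columns=("event_date",),
    files=[
        DataFile("s3://example-bucket/events/a.parquet", {"event_date": "2024-01-01"}, 100),
        DataFile("s3://example-bucket/events/b.parquet", {"event_date": "2024-01-02"}, 250),
    ],
)

# Only the file for the requested partition is selected for reading.
selected = plan_scan(table, {"event_date": "2024-01-02"})
print([f.path for f in selected])
```

Real formats store this information in manifest or log files next to the data and add per-column statistics, but the principle is the same: the engine consults metadata first and touches data files second.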
Misconceptions / Traps
- Table formats are specifications, not databases. They define how metadata and data files are organized — the query engine is separate.
- The choice of table format is converging. Iceberg has become the de facto standard, but Delta and Hudi remain relevant within their own ecosystems.
Key Connections
- scoped_to S3: all table formats operate on S3-stored files
- Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec scoped_to Table Formats: the three major specifications
- Apache Parquet scoped_to Table Formats: the dominant data file format under all three
- Schema Evolution scoped_to Table Formats: the problem table formats exist to solve
- Metadata Overhead at Scale scoped_to Table Formats: the problem table formats introduce
Definition
What it is
The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files stored on object storage.
Why it exists
Raw files on S3 have no transactional guarantees, no schema enforcement, and no efficient way to track which files belong to a logical table. Table format specifications solve this by adding a metadata layer on top of the files.
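The metadata layer described above can be sketched as an append-only log of snapshots. This is a toy model under stated assumptions, not the actual Iceberg, Delta, or Hudi protocol: `ToyTable`, `CommitConflict`, `commit`, and `files_at` are hypothetical names, the log lives in memory, and the version check stands in for the atomic operation a real implementation needs from its catalog or storage layer.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """Hypothetical in-memory table: each snapshot is an immutable set of file paths."""

    def __init__(self):
        self._snapshots = [frozenset()]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def commit(self, expected_version, add=(), remove=()):
        """Optimistic commit: succeed only if nobody committed since expected_version."""
        if expected_version != self.version:
            raise CommitConflict(
                f"expected version {expected_version}, table is at {self.version}"
            )
        current = self._snapshots[-1]
        missing = set(remove) - current
        if missing:
            raise ValueError(f"cannot remove untracked files: {sorted(missing)}")
        # A new snapshot is appended; earlier snapshots are never modified.
        self._snapshots.append((current - set(remove)) | set(add))
        return self.version

    def files_at(self, version=None):
        """List the files of a snapshot; passing an older version gives time travel."""
        if version is None:
            version = self.version
        return sorted(self._snapshots[version])


table = ToyTable()
v1 = table.commit(0, add=["part-0001.parquet", "part-0002.parquet"])
v2 = table.commit(v1, add=["part-0003.parquet"], remove=["part-0001.parquet"])

print(table.files_at())    # files in the current snapshot
print(table.files_at(v1))  # files as of the earlier snapshot

try:
    table.commit(v1, add=["part-0004.parquet"])  # writer holding a stale version
except CommitConflict as exc:
    print("rejected:", exc)
```

Because each commit either appends a complete snapshot or fails, readers never observe a half-written table, and the set of files belonging to the table at any version is explicit rather than inferred from a directory listing.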
Relationships
Outbound Relationships
scoped_to
Resources
The Apache Iceberg specification is the definitive reference for the most widely adopted open table format, defining snapshot isolation, schema evolution, and partition evolution.
The Delta Lake transaction log protocol specification defines the ACID transaction semantics and metadata structure at the level of the log files written to storage.
Apache Hudi's official documentation covers its copy-on-write and merge-on-read table types, incremental processing model, and timeline-based versioning.
Dremio's comprehensive comparison shows how each table format handles metadata, partitioning, and storage at petabyte scale.