Technology

Apache Iceberg

Summary

What it is

An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) on object storage.

Where it fits

Iceberg is a central table format in the S3 ecosystem. It turns a pile of Parquet files on S3 into a reliable, evolvable, SQL-queryable table without requiring a database server. Broad engine support (Spark, Trino, Flink, DuckDB) has made it a de facto standard for open tables.

Misconceptions / Traps

  • Iceberg is not a query engine. It is a table format specification plus libraries. You still need Spark, Trino, DuckDB, or another engine to query Iceberg tables.
  • Hidden partitioning is powerful but not magic. Poor sort order or excessive partition granularity still produces small files and slow queries.
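
To make the hidden-partitioning point concrete, here is a minimal, library-free sketch of the idea: the partition value is derived from a column by a transform, and a filter on the original column is mapped onto partitions for pruning. This is a toy model, not Iceberg's API; the names `day_transform`, `write_rows`, `prune`, and the `event_ts` column are hypothetical.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Toy partition transform: derive the partition value from a timestamp."""
    return ts.date()


def write_rows(rows):
    """Group rows into partitions keyed by the transformed column value."""
    partitions = {}
    for row in rows:
        partitions.setdefault(day_transform(row["event_ts"]), []).append(row)
    return partitions


def prune(partitions, ts_from: datetime, ts_to: datetime):
    """Keep only partitions that can hold rows with ts_from <= event_ts <= ts_to.

    The caller filters on event_ts; it never names a partition column.
    """
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return {day: rows for day, rows in partitions.items() if lo <= day <= hi}


# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "event_ts": datetime(2024, 1, 1, 9, 30)},
    {"id": 2, "event_ts": datetime(2024, 1, 1, 17, 5)},
    {"id": 3, "event_ts": datetime(2024, 1, 2, 8, 0)},
    {"id": 4, "event_ts": datetime(2024, 1, 3, 23, 59)},
]

partitions = write_rows(rows)
scanned = prune(
    partitions,
    datetime(2024, 1, 2, 0, 0),
    datetime(2024, 1, 2, 23, 59),
)
```

With a day transform these four rows land in three partitions, and the query touches one of them. Swapping in an hourly transform would put each row in its own partition, which is the kind of excessive granularity that leads to small files.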

Key Connections

  • implements Lakehouse Architecture — the primary table format for lakehouses
  • depends_on Apache Parquet — default data file format
  • solves Small Files Problem (compaction), Schema Evolution (column-ID-based evolution), Partition Pruning Complexity (hidden partitioning)
  • constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
  • scoped_to Table Formats, Lakehouse
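
The column-ID-based evolution mentioned above can be sketched as follows. The assumption in this toy model is that data files key values by a stable field ID while the schema maps IDs to current column names, so a rename or an added column never requires rewriting existing files. The helper names and sample values are hypothetical and do not correspond to any Iceberg library call.

```python
# A "data file": each row stores values keyed by field ID, not by column name.
data_file = [
    {1: 101, 2: "alice"},
    {1: 102, 2: "bob"},
]

# The table schema maps field IDs to the current column names.
schema_v1 = {1: "id", 2: "name"}


def rename_column(schema, old, new):
    """Rename is metadata-only: the field ID stays the same."""
    return {fid: (new if name == old else name) for fid, name in schema.items()}


def add_column(schema, name):
    """Add a column under a fresh field ID.

    Simplification: a real format tracks the last assigned ID separately so
    that IDs of dropped columns are never reused.
    """
    return {**schema, max(schema) + 1: name}


def read(data_file, schema):
    """Project rows through the schema; IDs missing from old files read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in data_file]


schema_v2 = add_column(rename_column(schema_v1, "name", "username"), "email")
rows = read(data_file, schema_v2)
```

The old file is read unchanged under the new schema: `username` resolves to the values originally written as `name`, and `email` comes back as `None` for rows written before the column existed.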

Definition

What it is

An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) stored on object storage.

Why it exists

Raw files on S3 have no concept of a "table." Iceberg adds transactional table semantics — schema enforcement, hidden partitioning, snapshot isolation, time-travel — on top of object storage without requiring a specialized database engine.
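
As a rough illustration of snapshot isolation and time travel, the sketch below models a table as an append-only list of immutable snapshots, each listing its data files. A commit succeeds only if it was based on the current snapshot, which mirrors the optimistic-concurrency approach in simplified form. The `Snapshot` and `Table` classes and the file names are hypothetical, and the in-memory list stands in for real metadata files and a catalog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # immutable listing of data file paths


@dataclass
class Table:
    # Start with an empty snapshot so there is always a current one.
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base: Snapshot, added_files) -> Snapshot:
        """Append a new snapshot, but only if `base` is still the current one."""
        if base.snapshot_id != self.current().snapshot_id:
            raise RuntimeError("stale base snapshot; refresh and retry")
        new = Snapshot(base.snapshot_id + 1, base.files + tuple(added_files))
        self.snapshots.append(new)
        return new

    def scan(self, snapshot_id=None):
        """List files as of the current snapshot or a named older one."""
        if snapshot_id is None:
            return list(self.current().files)
        snap = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        return list(snap.files)


table = Table()
s1 = table.commit(table.current(), ["a.parquet"])

reader_view = table.current()  # a reader pins the snapshot it started with
s2 = table.commit(s1, ["b.parquet"])  # a writer commits in the meantime

latest = table.scan()
as_of_s1 = table.scan(snapshot_id=1)

# A second writer that still holds s1 as its base is rejected.
try:
    table.commit(s1, ["c.parquet"])
    conflict = False
except RuntimeError:
    conflict = True
```

The pinned reader keeps seeing only `a.parquet` even after the second commit, the scan by snapshot ID returns the older file listing, and the stale writer has to refresh and retry instead of silently overwriting.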

Primary use cases

Lakehouse table management, schema evolution, partition management, concurrent read/write isolation over S3 data.

Relationships

Resources