Technology

Apache Iceberg

Summary

What it is

An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) on object storage.

Where it fits

Iceberg is a central table format in the S3 ecosystem. It turns a collection of Parquet files on S3 into a reliable, evolvable, SQL-queryable table without requiring a database server. It is widely supported across engines (Spark, Trino, Flink, DuckDB) and is often treated as the de facto standard.

Misconceptions / Traps

Iceberg is not a query engine. It is a table format specification plus libraries. You still need Spark, Trino, DuckDB, or another engine to query Iceberg tables.
Hidden partitioning is powerful but not magic. Poor sort order or excessive partition granularity still produces small files and slow queries.
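
The trade-off around partition granularity can be illustrated with a toy model. The sketch below is not Iceberg's API: the `day_transform`, `hour_transform`, `partition_rows`, and `prune` helpers, the string partition labels, and the sample rows are all assumptions made for illustration. It shows how a partition value derived from a timestamp column enables pruning, and how a finer transform multiplies the number of partitions (and therefore small files) for the same data.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Toy partition transforms, loosely modeled on the idea of day/hour transforms.
# Real implementations use different encodings; strings are used here for clarity.
def day_transform(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d")

def hour_transform(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d-%H")

def partition_rows(rows, transform):
    """Group rows by the partition value derived from their event_time column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[transform(row["event_time"])].append(row)
    return dict(partitions)

def prune(partitions, wanted_day: str):
    """Keep only partitions whose label can contain rows for wanted_day."""
    return {key: rows for key, rows in partitions.items() if key.startswith(wanted_day)}

# 48 hypothetical events, one per hour across two days.
rows = [
    {"id": i, "event_time": datetime(2024, 1, 1 + i // 24, i % 24, tzinfo=timezone.utc)}
    for i in range(48)
]

by_day = partition_rows(rows, day_transform)    # 2 partitions of 24 rows each
by_hour = partition_rows(rows, hour_transform)  # 48 partitions of 1 row each
pruned = prune(by_day, "2024-01-02")            # only one partition is scanned
```

With the daily transform, a query for one day touches a single partition holding 24 rows. With the hourly transform, the same 48 rows are spread across 48 partitions, which is the small-files pattern the warning above describes.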

Key Connections

implements Lakehouse Architecture — the primary table format for lakehouses
depends_on Apache Parquet — default data file format
solves Small Files Problem (compaction), Schema Evolution (column-ID-based evolution), Partition Pruning Complexity (hidden partitioning)
constrained_by Metadata Overhead at Scale, Lack of Atomic Rename
scoped_to Table Formats, Lakehouse
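
The column-ID-based evolution listed above can be sketched with a small model. This is an illustration under stated assumptions, not Iceberg library code: the `Field` class, the `read` helper, and the sample rows are hypothetical. The idea shown is that data is resolved by a stable field ID rather than by column name, so a rename or an added column does not require rewriting existing files.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier, never reused
    name: str      # display name, free to change

# Toy stand-in for an existing data file: values keyed by field ID, not by name.
old_file_rows = [{1: 101, 2: "alice"}, {1: 102, 2: "bob"}]

schema_v1 = [Field(1, "id"), Field(2, "name")]
# Evolved schema: field 2 renamed to "full_name", new field 3 "email" added.
schema_v2 = [Field(1, "id"), Field(2, "full_name"), Field(3, "email")]

def read(rows, schema):
    """Project stored rows through a schema, resolving columns by field ID."""
    return [{f.name: row.get(f.field_id) for f in schema} for row in rows]

v1_view = read(old_file_rows, schema_v1)
v2_view = read(old_file_rows, schema_v2)  # old file read under the new schema
```

Reading the old file through `schema_v2` yields the renamed column with its original values and `None` for the column that did not exist when the file was written.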

Definition

What it is

An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) stored on object storage.

Why it exists

Raw files on S3 have no concept of a "table." Iceberg adds transactional table semantics — schema enforcement, hidden partitioning, snapshot isolation, time-travel — on top of object storage without requiring a specialized database engine.
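
Snapshot isolation and time travel can be pictured with a minimal in-memory model. The `Snapshot` and `ToyTable` classes and the file paths below are hypothetical and are not part of any Iceberg library; the sketch only assumes that each commit produces a new immutable snapshot listing the files that make up the table at that point.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: frozenset  # immutable set of file paths visible in this snapshot

@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def current(self):
        return self.snapshots[-1] if self.snapshots else None

    def append_files(self, *paths):
        """Commit a new snapshot containing the previous files plus the new ones."""
        base = self.current().data_files if self.snapshots else frozenset()
        snap = Snapshot(len(self.snapshots) + 1, base | frozenset(paths))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None):
        """List files for the current snapshot, or for an older one (time travel)."""
        if snapshot_id is None:
            snap = self.current()
        else:
            snap = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        return sorted(snap.data_files) if snap else []

table = ToyTable()
first = table.append_files("data/a.parquet")
table.append_files("data/b.parquet")  # a later write does not alter `first`

latest_files = table.scan()
as_of_first = table.scan(snapshot_id=first.snapshot_id)
```

A reader that pinned `first` keeps seeing only `data/a.parquet` even after the second commit, which is the isolation property in miniature; scanning by an older snapshot ID is the time-travel case.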

Primary use cases

Lakehouse table management, schema evolution, partition management, concurrent read/write isolation over S3 data.
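
One way to picture concurrent write isolation is an optimistic commit: a writer prepares new metadata against the version it read, then atomically swaps the table's "current metadata" pointer only if nobody else committed in the meantime. The sketch below is a toy under stated assumptions; `ToyCatalog`, `commit`, and the metadata file names are hypothetical, and a lock stands in for whatever atomic operation a real catalog provides.

```python
import threading

class ToyCatalog:
    """Holds one table's current metadata location and swaps it atomically."""

    def __init__(self, location):
        self._location = location
        self._lock = threading.Lock()

    def current(self):
        with self._lock:
            return self._location

    def compare_and_swap(self, expected, new):
        """Replace the pointer only if it still equals `expected`."""
        with self._lock:
            if self._location != expected:
                return False
            self._location = new
            return True

def commit(catalog, make_new_location, max_retries=3):
    """Retry loop: re-read the base and try the swap again after a conflict."""
    for attempt in range(1, max_retries + 1):
        base = catalog.current()
        if catalog.compare_and_swap(base, make_new_location(base)):
            return attempt
    raise RuntimeError("commit failed after retries")

catalog = ToyCatalog("v1.metadata.json")
stale_base = catalog.current()  # writers A and B both read v1

b_won = catalog.compare_and_swap(stale_base, "v2.metadata.json")        # B commits first
a_won = catalog.compare_and_swap(stale_base, "v2-other.metadata.json")  # A is now stale

# A retries against the fresh base and succeeds on the first attempt.
attempts = commit(catalog, lambda base: "v3.metadata.json")
```

The losing writer never overwrites the winner's commit; it detects the conflict and retries. Readers are unaffected throughout because they follow whichever pointer value they read.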

Relationships

Outbound Relationships

scoped_to

Table Formats, Lakehouse

implements

Lakehouse Architecture

depends_on

Apache Parquet

solves

Small Files Problem, Schema Evolution, Partition Pruning Complexity

constrained_by

Metadata Overhead at Scale, Lack of Atomic Rename

Inbound Relationships

augments

Metadata Extraction, Schema Inference, Data Classification

Resources

Docs (High)

iceberg.apache.org/docs/latest/

Official Apache Iceberg documentation covering the table format specification, catalog integrations, and query engine compatibility.

GitHub (High)

github.com/apache/iceberg

The primary Iceberg repository containing the spec and the core Java implementation of the table format that operates on S3. The Python library (PyIceberg) is maintained separately in apache/iceberg-python.

Spec (High)

iceberg.apache.org/spec/

The formal Iceberg table format specification — the authoritative reference for how Iceberg organizes metadata and data files on object stores.

Docs (High)

iceberg.apache.org/docs/latest/aws/

Iceberg's dedicated AWS integration page documenting S3 file I/O, AWS catalog integrations (such as Glue), and AWS SDK configuration for Iceberg tables.