Small Files Problem
Summary
What it is
Too many small objects in S3 degrade query performance and increase API call costs. Each file requires a separate GET request, and S3 charges per request, so the overhead scales with file count rather than data volume.
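A quick way to check whether a dataset has drifted into this state is to measure the size distribution of its objects. The sketch below is a minimal, hypothetical example using boto3; the bucket name, prefix, and 128 MB cutoff are illustrative assumptions, not values from this note.

```python
# Minimal sketch: report how many objects under a prefix are "small".
# Bucket, prefix, and the 128 MB threshold are placeholder assumptions.
import boto3

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 128 MB, adjust per workload


def small_file_report(bucket: str, prefix: str) -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    total, small, total_bytes = 0, 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += 1
            total_bytes += obj["Size"]
            if obj["Size"] < SMALL_FILE_THRESHOLD:
                small += 1

    if total:
        avg_mb = total_bytes / total / (1024 * 1024)
        print(f"{small}/{total} objects ({small / total:.0%}) are under "
              f"{SMALL_FILE_THRESHOLD // (1024 * 1024)} MB; "
              f"average size {avg_mb:.1f} MB")


small_file_report("my-data-lake", "warehouse/events/")  # hypothetical names
```

A high percentage of undersized objects, or a low average size, is the usual signal that listing and GET overhead (not data volume) is driving query latency and request cost.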
Where it fits
The small files problem is one of the most common performance issues in S3-based data systems. It affects query engines, table formats, and streaming pipelines that write to S3.
Misconceptions / Traps
- The threshold is not absolute. "Small" depends on the workload, but files under 100 MB are generally problematic for analytics. Aim for 256 MB-1 GB per file.
- Compaction solves the problem retroactively, not proactively. Fix the root cause (writer parallelism, streaming micro-batch sizing) in addition to running compaction; see the write-side sketch after this list.
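One common root-cause fix is to reduce the number of write tasks so each output file lands near the target size. This is a minimal PySpark sketch; the S3 paths and the partition count are illustrative assumptions that would be derived from actual data volume (roughly total bytes divided by the target file size).

```python
# Sketch: consolidate partitions before writing so fewer, larger files
# are produced. Paths and the partition count are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("right-size-writes").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/raw/events/")  # hypothetical path

# Fewer write tasks -> fewer, larger output files.
# Pick the count from data volume, e.g. total_bytes / target_file_size.
target_files = 16
(df.coalesce(target_files)
   .write
   .mode("overwrite")
   .parquet("s3://my-data-lake/curated/events/"))
```

For streaming writers, the equivalent lever is larger micro-batches (longer trigger intervals), so each commit produces fewer, bigger files.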
Key Connections
- Apache Iceberg solves Small Files Problem — via compaction (see the sketch after this list)
- DuckDB, Trino, Apache Spark, Apache Flink constrained_by Small Files Problem — performance degradation
- Medallion Architecture constrained_by Small Files Problem — each layer can produce small files
- Small Files Problem scoped_to S3, Object Storage
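As a sketch of the Iceberg compaction path referenced above: Iceberg ships a rewrite_data_files maintenance procedure that bin-packs small files into larger ones. This assumes a Spark session with Iceberg's SQL extensions and a catalog configured; the catalog name, table name, and 512 MB target are placeholders.

```python
# Sketch: compact an Iceberg table's small files with the
# rewrite_data_files procedure. Catalog/table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iceberg-compaction")
         .getOrCreate())  # assumes Iceberg extensions and catalog are configured

spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')  -- 512 MB
    )
""")
```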
Definition
What it is
The accumulation of too many small objects in S3, which degrades query performance and increases API call costs. Analytical queries must open each file individually, and S3 charges per request.
Relationships
Outbound Relationships
- scoped_to: S3, Object Storage
Inbound Relationships
- solves: Apache Iceberg
Resources
Delta Lake's official blog explaining the OPTIMIZE command for compacting small files into ~1 GB targets, directly addressing the classic small files problem.
Databricks documentation on controlling data file size with auto compaction and optimized writes, the production solution for small files in Delta tables.
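A brief sketch of the two Delta Lake mechanisms those resources describe, assuming a Spark session with Delta Lake enabled and a placeholder table name: OPTIMIZE for retroactive compaction, and the auto-compaction / optimized-writes table properties (as documented by Databricks) for write-time mitigation.

```python
# Sketch: Delta Lake compaction. Table name is a placeholder; requires a
# Spark session with Delta Lake support.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-compaction").getOrCreate()

# Retroactive fix: bin-pack existing small files into ~1 GB targets.
spark.sql("OPTIMIZE my_db.events")

# Proactive fix: right-size and compact files at write time
# (property names per the Databricks docs cited above).
spark.sql("""
    ALTER TABLE my_db.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
```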