Pain Point

Small Files Problem

Summary

What it is

Too many small objects in S3 degrade query performance and increase API call costs: analytical queries must open each file with a separate GET request, and S3 charges per request.
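
To make the cost side concrete, here is a back-of-the-envelope sketch in Python. It assumes roughly $0.0004 per 1,000 GET requests (S3 Standard, us-east-1; check current regional pricing) and one GET per file per scan; real engines often issue several range GETs per file, so these figures are a lower bound.

    # Minimal sketch: estimated daily S3 GET cost for scanning a dataset.
    # Pricing of ~$0.0004 per 1,000 GET requests is illustrative and
    # region-dependent; one GET per file per scan is a simplification.
    GET_PRICE_PER_1000 = 0.0004

    def daily_scan_cost(total_gb: float, file_size_mb: float,
                        scans_per_day: int = 100) -> float:
        n_files = (total_gb * 1024) / file_size_mb
        requests = n_files * scans_per_day
        return requests * GET_PRICE_PER_1000 / 1000

    # 1 TB dataset scanned 100 times a day: 1 MB files vs. 512 MB files.
    print(f"{daily_scan_cost(1024, 1):.2f}")    # ~41.94 USD/day (~1M files)
    print(f"{daily_scan_cost(1024, 512):.2f}")  # ~0.08 USD/day (2,048 files)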

Where it fits

The small files problem is among the most common performance issues in S3-based data systems. It affects virtually every query engine, table format, and streaming pipeline that writes to S3.

Misconceptions / Traps

  • The threshold is not absolute. "Small" depends on the workload, but files under roughly 100 MB are generally problematic for analytics; aim for 256 MB to 1 GB per file.
  • Compaction fixes the problem retroactively, not proactively. Fix the root cause (excessive writer parallelism, streaming micro-batches) in addition to running compaction; a writer-side sketch follows this list.
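
As a writer-side illustration, here is a minimal PySpark sketch that coalesces partitions before writing, so the job emits fewer, larger files. The bucket paths and the partition count of 8 are placeholders; tune the count so each output file lands in the 256 MB to 1 GB range.

    # Minimal sketch: reduce output file count by coalescing before the write.
    # Each write task emits one file per partition, so fewer partitions
    # means fewer, larger output files. Paths and counts are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("right-sized-writes").getOrCreate()

    df = spark.read.parquet("s3://my-bucket/bronze/events/")

    (df.coalesce(8)  # pick so each output file lands near 256 MB to 1 GB
       .write
       .mode("overwrite")
       .parquet("s3://my-bucket/silver/events/"))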

Key Connections

  • Apache Iceberg addresses the small files problem via compaction (see the compaction sketch after this list)
  • DuckDB, Trino, Apache Spark, and Apache Flink are all constrained by it: query performance degrades as file counts grow
  • Medallion Architecture is constrained by it: each layer can produce small files
  • Scoped to S3 and Object Storage
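
For the Iceberg connection above, compaction is exposed as a Spark procedure. A minimal sketch, assuming a Spark session with the Iceberg runtime and SQL extensions configured; the catalog name my_catalog and table db.events are placeholders, while rewrite_data_files is Iceberg's documented maintenance procedure.

    # Minimal sketch: bin-pack small Iceberg data files into ~512 MB files.
    # Assumes an Iceberg-enabled Spark session; names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

    spark.sql("""
        CALL my_catalog.system.rewrite_data_files(
            table => 'db.events',
            strategy => 'binpack',
            options => map('target-file-size-bytes', '536870912')
        )
    """)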

