Small Files Problem
Summary
What it is
Too many small objects in S3 degrade query performance and increase API call costs. Each file requires a separate GET request, and S3 charges per request, so the overhead scales with file count rather than data volume.
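A quick way to check whether a dataset has drifted into this state is to measure the size distribution of its objects. The sketch below is a minimal, hypothetical example using boto3; the bucket name, prefix, and 128 MB cutoff are illustrative assumptions, not values from this note.

```python
# Minimal sketch: report how many objects under a prefix are "small".
# Bucket, prefix, and the 128 MB threshold are placeholder assumptions.
import boto3

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 128 MB, adjust per workload


def small_file_report(bucket: str, prefix: str) -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    total, small, total_bytes = 0, 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += 1
            total_bytes += obj["Size"]
            if obj["Size"] < SMALL_FILE_THRESHOLD:
                small += 1

    if total:
        avg_mb = total_bytes / total / (1024 * 1024)
        print(f"{small}/{total} objects ({small / total:.0%}) are under "
              f"{SMALL_FILE_THRESHOLD // (1024 * 1024)} MB; "
              f"average size {avg_mb:.1f} MB")


small_file_report("my-data-lake", "warehouse/events/")  # hypothetical names
```

A high percentage of undersized objects, or a low average size, is the usual signal that listing and GET overhead (not data volume) is driving query latency and request cost.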
Where it fits
The small files problem is one of the most common performance issues in S3-based data systems. It affects query engines, table formats, and streaming pipelines that write to S3.
Misconceptions / Traps
- The threshold is not absolute. "Small" depends on the workload, but files under 100 MB are generally problematic for analytics. Aim for 256 MB-1 GB per file.
- Compaction solves the problem retroactively, not proactively. Fix the root cause (writer parallelism, streaming micro-batch sizing) in addition to running compaction; see the write-side sketch after this list.
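One common root-cause fix is to reduce the number of write tasks so each output file lands near the target size. This is a minimal PySpark sketch; the S3 paths and the partition count are illustrative assumptions that would be derived from actual data volume (roughly total bytes divided by the target file size).

```python
# Sketch: consolidate partitions before writing so fewer, larger files
# are produced. Paths and the partition count are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("right-size-writes").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/raw/events/")  # hypothetical path

# Fewer write tasks -> fewer, larger output files.
# Pick the count from data volume, e.g. total_bytes / target_file_size.
target_files = 16
(df.coalesce(target_files)
   .write
   .mode("overwrite")
   .parquet("s3://my-data-lake/curated/events/"))
```

For streaming writers, the equivalent lever is larger micro-batches (longer trigger intervals), so each commit produces fewer, bigger files.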
Key Connections
- Apache Iceberg solves Small Files Problem — via compaction (see the sketch after this list)
- DuckDB, Trino, Apache Spark, Apache Flink constrained_by Small Files Problem — performance degradation
- Medallion Architecture constrained_by Small Files Problem — each layer can produce small files
- Small Files Problem scoped_to S3, Object Storage
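As a sketch of the Iceberg compaction path referenced above: Iceberg ships a rewrite_data_files maintenance procedure that bin-packs small files into larger ones. This assumes a Spark session with Iceberg's SQL extensions and a catalog configured; the catalog name, table name, and 512 MB target are placeholders.

```python
# Sketch: compact an Iceberg table's small files with the
# rewrite_data_files procedure. Catalog/table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iceberg-compaction")
         .getOrCreate())  # assumes Iceberg extensions and catalog are configured

spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')  -- 512 MB
    )
""")
```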
Definition
What it is
The accumulation of too many small objects in S3, which degrades query performance and increases API call costs. Analytical queries must open each file individually, and S3 charges per request.
Relationships
Outbound Relationships
- scoped_to: S3, Object Storage
Inbound Relationships
- solves: Apache Iceberg
Resources
Delta Lake's official blog explaining the OPTIMIZE command for compacting small files into ~1 GB targets, directly addressing the classic small files problem.
Databricks documentation on controlling data file size with auto compaction and optimized writes, the production solution for small files in Delta tables.
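A brief sketch of the two Delta Lake mechanisms those resources describe, assuming a Spark session with Delta Lake enabled and a placeholder table name: OPTIMIZE for retroactive compaction, and the auto-compaction / optimized-writes table properties (as documented by Databricks) for write-time mitigation.

```python
# Sketch: Delta Lake compaction. Table name is a placeholder; requires a
# Spark session with Delta Lake support.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-compaction").getOrCreate()

# Retroactive fix: bin-pack existing small files into ~1 GB targets.
spark.sql("OPTIMIZE my_db.events")

# Proactive fix: right-size and compact files at write time
# (property names per the Databricks docs cited above).
spark.sql("""
    ALTER TABLE my_db.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
```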