Guide 2

Small Files Problem — Why It Exists and the Common Mitigations

Problem Framing

A dataset of 10 million 10KB files performs dramatically worse on S3 than the same 100GB stored as 100 files of 1GB each. The small files problem is among the most common performance issues in S3-based systems, and it is caused by how data is produced, not by S3 itself. Every S3 LIST call returns at most 1,000 objects, every GET carries per-request latency, and analytical engines must open each file individually.
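
Listing alone illustrates the cost: at 1,000 keys per page, 10 million objects require at least 10,000 sequential LIST calls before a query engine has even opened a file. A minimal boto3 sketch that counts the pages for a prefix (bucket and prefix names are placeholders):

```python
import boto3

# Placeholder bucket and prefix; substitute your own.
BUCKET = "my-data-lake"
PREFIX = "events/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

pages = 0
objects = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    pages += 1                          # one LIST request per page
    objects += page.get("KeyCount", 0)  # at most 1,000 keys per page

print(f"{objects:,} objects required {pages:,} LIST calls")
```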

Relevant Nodes

  • Topics: S3, Object Storage, Table Formats
  • Technologies: Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Apache Flink, DuckDB, Trino
  • Standards: Apache Parquet
  • Architectures: Medallion Architecture, Lakehouse Architecture
  • Pain Points: Small Files Problem, Object Listing Performance, Cold Scan Latency

Decision Path

  1. Identify the root cause. Small files come from three common sources:

    • Streaming writes: Flink/Spark Streaming commits at least one file per partition per checkpoint interval. With 100 partitions and 1-minute checkpoints, that is 100 files per minute (quantified in the sketch after this list).
    • High-parallelism batch writes: A Spark job with 1,000 tasks writing one file each produces 1,000 files per batch.
    • Excessive partitioning: Partitioning by high-cardinality columns (e.g., user_id) creates one file per partition value per write.
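
A back-of-the-envelope calculation shows how quickly the streaming case accumulates (figures are the assumed example values from above, not measurements):

```python
# File accrual for the streaming example: at least one file per
# partition per checkpoint. Assumed example figures, not measurements.
partitions = 100
checkpoint_interval_min = 1

checkpoints_per_day = 24 * 60 // checkpoint_interval_min
files_per_day = partitions * checkpoints_per_day
print(f"{files_per_day:,} files per day")  # 144,000 files per day
```
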
  2. Fix at the writer level (proactive), as sketched after these bullets:

    • Reduce Spark write parallelism with coalesce() or repartition() before writing.
    • Increase Flink checkpoint intervals where freshness requirements allow.
    • Partition by low-cardinality columns (date, region), not high-cardinality ones.
    • Use Spark's Adaptive Query Execution (AQE) to coalesce small shuffle partitions.
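
A minimal PySpark sketch of these writer-side levers; the paths, the partition column, and the coalesce target of 8 are illustrative assumptions to be sized from your data volume:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("writer-level-fixes")
    # AQE coalesces small shuffle partitions automatically (Spark 3.x).
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/raw/events/")  # placeholder path

(
    df
    # Collapse write parallelism: fewer tasks, fewer and larger files.
    # coalesce() avoids a full shuffle; use repartition(n) when you also
    # need evenly sized output files.
    .coalesce(8)
    .write
    .mode("append")
    .partitionBy("event_date")  # low-cardinality partition column
    .parquet("s3://my-bucket/curated/events/")
)
```
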
  3. Fix at the table format level (reactive), with example invocations below:

    • Iceberg: Run the rewriteDataFiles action (or the rewrite_data_files Spark procedure) for compaction. Iceberg's hidden partitioning reduces over-partitioning risk.
    • Delta Lake: Use OPTIMIZE with Z-ordering or liquid clustering. Databricks Auto Compaction handles this automatically.
    • Hudi: Configure inline compaction for Merge-on-Read tables or run offline compaction jobs.
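
Example compaction invocations from Spark, one per format. Catalog, database, and table names are placeholders, and each call assumes the corresponding format's Spark integration is already configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg: rewrite small data files into larger ones (Spark procedure).
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')  -- 512MB
    )
""")

# Delta Lake: bin-pack small files, optionally clustering by a column.
spark.sql("OPTIMIZE db.events ZORDER BY (user_id)")

# Hudi (Merge-on-Read): enable inline compaction via write options.
hudi_compaction_opts = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
```
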
  4. Target file sizes. For Parquet files on S3 (a table-property sketch follows these bullets):

    • Analytical queries: 256MB–1GB per file
    • Streaming with near-real-time needs: 128MB minimum, compact to 256MB+ periodically
    • Below 100MB: almost always problematic
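
These targets can be declared as table properties so that writers and compaction jobs aim for them. A sketch using commonly documented properties (512MB shown; verify the property names against your format versions, and note delta.targetFileSize is a Databricks setting):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg: target size for newly written data files, in bytes.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")

# Delta Lake (Databricks): hint the desired file size, in bytes.
spark.sql("""
    ALTER TABLE db.events
    SET TBLPROPERTIES ('delta.targetFileSize' = '536870912')
""")

# Hudi: cap Parquet file size at write time, in bytes.
hudi_opts = {"hoodie.parquet.max.file.size": str(512 * 1024 * 1024)}
```
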
  5. Monitor continuously. Small files accumulate over time. Set up monitoring for average file size per table and partition, and alert when it drops below a threshold; a minimal sketch follows.
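
A minimal monitoring sketch with boto3: compute the average object size under one table or partition prefix and flag it against a threshold. Names and the 100MB threshold are placeholders; a production version would emit a metric instead of printing:

```python
import boto3

BUCKET = "my-data-lake"            # placeholder
PREFIX = "curated/events/"         # one table or partition prefix
MIN_AVG_BYTES = 100 * 1024 * 1024  # alert below a ~100MB average

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count, total_bytes = 0, 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

avg = total_bytes / count if count else 0
print(f"{count:,} files, average {avg / 1024**2:.1f} MB")
if count and avg < MIN_AVG_BYTES:
    print("ALERT: average file size below threshold; schedule compaction")
```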

What Changed Over Time

  • Early Hadoop data lakes had the same problem on HDFS, but HDFS NameNode memory limits forced engineers to address it. S3's limitless namespace hid the problem until query performance degraded.
  • Table formats introduced compaction as a first-class operation. Iceberg's rewriteDataFiles, Delta's OPTIMIZE, and Hudi's inline compaction all exist specifically because of this problem.
  • Auto-compaction and adaptive-write features (Databricks Auto Optimize, Spark AQE) have shifted the solution from manual intervention toward automation, whether at write time or as background maintenance.
  • The problem has not gone away — it has moved from "my job produces too many files" to "my compaction job cannot keep up with my write rate."

Sources