Technology

Apache Spark

Summary

What it is

A distributed compute engine for large-scale data processing — batch ETL, streaming, SQL, and machine learning — over S3-stored data.

Where it fits

Spark is the workhorse of the S3 data ecosystem. It is the primary engine for building and maintaining lakehouse tables (Iceberg, Delta, Hudi), running ETL pipelines, and processing data at petabyte scale.
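
For instance, creating an Iceberg table on S3 from PySpark looks roughly like the sketch below. The catalog name "lake" and the bucket are hypothetical, and the sketch assumes the iceberg-spark-runtime jar is on the classpath.

    from pyspark.sql import SparkSession

    # Register a Hadoop-type Iceberg catalog backed by an S3 warehouse path.
    # Catalog name "lake" and bucket "example-bucket" are hypothetical.
    spark = (
        SparkSession.builder
        .appName("iceberg-on-s3")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
        .getOrCreate()
    )

    # Create a lakehouse table that any Iceberg-aware engine can then query.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.db.events (
            event_id BIGINT, ts TIMESTAMP, payload STRING
        ) USING iceberg
    """)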

Misconceptions / Traps

  • Spark's S3 access goes through the Hadoop S3A connector, not a native S3 client. S3A configuration (committers, credential providers, connection pooling) is a common source of operational issues; a configuration sketch follows this list.
  • Spark produces small files by default when writing with high parallelism. Use coalesce, repartition, or table-format compaction to control output file sizes; see the compaction sketch below.
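
On the S3A point, a minimal PySpark configuration sketch follows. It assumes hadoop-aws and the spark-hadoop-cloud bindings are on the classpath; the values shown are illustrative starting points, not required settings.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-tuning-example")  # hypothetical app name
        # Commit through the S3A "magic" committer rather than rename-based
        # commits, which are slow and non-atomic on S3.
        .config("spark.hadoop.fs.s3a.committer.name", "magic")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        # Pin an explicit credential provider chain instead of relying on defaults.
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
        # Enlarge the S3A connection pool for highly parallel reads (illustrative).
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        .getOrCreate()
    )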
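
For the small-files point, a compaction sketch reusing the session above; the bucket and paths are hypothetical. coalesce(n) merges partitions without a full shuffle, while repartition(n) shuffles but balances partition sizes.

    # Read many small files, then write a bounded number of larger ones.
    df = spark.read.parquet("s3a://example-bucket/raw/events/")

    (
        df.coalesce(32)  # cap output at ~32 files instead of one per task
        .write.mode("overwrite")
        .parquet("s3a://example-bucket/curated/events/")
    )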

Key Connections

  • used_by Lakehouse Architecture, Medallion Architecture — the primary compute engine
  • constrained_by Small Files Problem — high parallelism produces many small output files
  • scoped_to S3, Data Lake

Definition

What it is

A distributed compute engine for large-scale data processing, supporting batch ETL, streaming, SQL, and machine learning workloads over S3-stored data.

Why it exists

Single-machine processing cannot handle petabyte-scale data. Spark distributes computation across clusters while reading from and writing to S3, making it the workhorse of most data lake and lakehouse architectures.

Primary use cases

Batch ETL pipelines on S3 data, lakehouse data transformations, large-scale ML feature engineering, and streaming data into S3 via Structured Streaming.
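
The streaming case, sketched below under assumptions: a Kafka source with the spark-sql-kafka connector on the classpath, and hypothetical broker, topic, bucket, and trigger interval.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-to-s3").getOrCreate()

    # Read a Kafka topic as an unbounded stream (broker and topic hypothetical).
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
    )

    # Sink micro-batches to S3 as Parquet; the checkpoint tracks stream progress.
    query = (
        events.selectExpr("CAST(value AS STRING) AS raw")
        .writeStream
        .format("parquet")
        .option("path", "s3a://example-bucket/streams/events/")
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()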

Relationships

Outbound Relationships

  • scoped_to S3, Data Lake
  • constrained_by Small Files Problem

Inbound Relationships

  • used_by Lakehouse Architecture, Medallion Architecture