Technology

Apache Spark

Summary

What it is

A distributed compute engine for large-scale data processing — batch ETL, streaming, SQL, and machine learning — over S3-stored data.

Where it fits

Spark is the workhorse of the S3 data ecosystem. It is the primary engine for building and maintaining lakehouse tables (Iceberg, Delta, Hudi), running ETL pipelines, and processing data at petabyte scale.
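
For instance, creating an Iceberg table on S3 from PySpark looks roughly like the sketch below. The catalog name "lake" and the bucket are hypothetical, and the sketch assumes the iceberg-spark-runtime jar is on the classpath.

    from pyspark.sql import SparkSession

    # Register a Hadoop-type Iceberg catalog backed by an S3 warehouse path.
    # Catalog name "lake" and bucket "example-bucket" are hypothetical.
    spark = (
        SparkSession.builder
        .appName("iceberg-on-s3")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
        .getOrCreate()
    )

    # Create a lakehouse table that any Iceberg-aware engine can then query.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.db.events (
            event_id BIGINT, ts TIMESTAMP, payload STRING
        ) USING iceberg
    """)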

Misconceptions / Traps

  • Spark's S3 access goes through the Hadoop S3A connector, not a native S3 client. S3A configuration (committers, credential providers, connection pooling) is a common source of operational issues; a configuration sketch follows this list.
  • Spark produces small files by default when writing with high parallelism. Use coalesce, repartition, or table-format compaction to control output file sizes; see the compaction sketch below.
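
On the S3A point, a minimal PySpark configuration sketch follows. It assumes hadoop-aws and the spark-hadoop-cloud bindings are on the classpath; the values shown are illustrative starting points, not required settings.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-tuning-example")  # hypothetical app name
        # Commit through the S3A "magic" committer rather than rename-based
        # commits, which are slow and non-atomic on S3.
        .config("spark.hadoop.fs.s3a.committer.name", "magic")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        # Pin an explicit credential provider chain instead of relying on defaults.
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
        # Enlarge the S3A connection pool for highly parallel reads (illustrative).
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        .getOrCreate()
    )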
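
For the small-files point, a compaction sketch reusing the session above; the bucket and paths are hypothetical. coalesce(n) merges partitions without a full shuffle, while repartition(n) shuffles but balances partition sizes.

    # Read many small files, then write a bounded number of larger ones.
    df = spark.read.parquet("s3a://example-bucket/raw/events/")

    (
        df.coalesce(32)  # cap output at ~32 files instead of one per task
        .write.mode("overwrite")
        .parquet("s3a://example-bucket/curated/events/")
    )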

Key Connections

  • used_by Lakehouse Architecture, Medallion Architecture — the primary compute engine
  • constrained_by Small Files Problem — high parallelism produces many small output files
  • scoped_to S3, Data Lake

Definition

What it is

A distributed compute engine for large-scale data processing, supporting batch ETL, streaming, SQL, and machine learning workloads over S3-stored data.

Why it exists

Single-machine processing cannot handle petabyte-scale data. Spark distributes computation across clusters while reading from and writing to S3, making it the workhorse of most data lake and lakehouse architectures.

Primary use cases

Batch ETL pipelines on S3 data, lakehouse data transformations, large-scale ML feature engineering, and streaming data into S3 via Structured Streaming.
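
The streaming case, sketched below under assumptions: a Kafka source with the spark-sql-kafka connector on the classpath, and hypothetical broker, topic, bucket, and trigger interval.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-to-s3").getOrCreate()

    # Read a Kafka topic as an unbounded stream (broker and topic hypothetical).
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
    )

    # Sink micro-batches to S3 as Parquet; the checkpoint tracks stream progress.
    query = (
        events.selectExpr("CAST(value AS STRING) AS raw")
        .writeStream
        .format("parquet")
        .option("path", "s3a://example-bucket/streams/events/")
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()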

Relationships

Outbound Relationships

  • scoped_to S3, Data Lake
  • constrained_by Small Files Problem

Inbound Relationships

  • used_by Lakehouse Architecture, Medallion Architecture