Technology

Apache Flink

Summary

What it is

A distributed stream processing framework that processes unbounded data in real time, using S3 as a checkpoint store, state backend, and output sink.

Where it fits

Flink is the streaming complement to Spark's batch processing. In the S3 world, Flink continuously ingests data into lakehouse tables (Iceberg, Delta) and uses S3 for fault-tolerant checkpointing.
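
A minimal sketch of that ingestion path in Flink SQL, assuming the Iceberg connector is on the classpath and a Kafka-backed source table already exists (catalog, table, and bucket names here are all hypothetical):

```sql
-- Hypothetical Iceberg catalog whose warehouse lives on S3.
CREATE CATALOG lake WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3://my-bucket/warehouse'
);

-- Continuously ingest from an existing Kafka-backed source table
-- into an Iceberg table whose data files land on S3.
INSERT INTO lake.db.orders
SELECT order_id, amount, order_ts
FROM kafka_orders;
```

Submitted against a streaming runtime, the INSERT runs indefinitely, committing new Iceberg snapshots on each successful checkpoint.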

Misconceptions / Traps

  • Flink streaming writes to S3 inherently produce small files — each sink subtask rolls at least one file per checkpoint interval. Compaction is mandatory, either via the table format's maintenance jobs or a separate compaction job.
  • Flink's S3 filesystem support requires careful configuration: the filesystem jars are loaded as plugins, not from lib/, and the wrong implementation for a workload (s3:// vs s3a:// vs s3p://) causes silent failures. The Presto implementation (s3p://) does not support the recoverable writes that streaming file sinks need, so the common split is s3p:// for checkpoints and s3a:// for sinks.
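
A sketch of the conventional plugin setup, assuming a standard Flink distribution layout (paths and versions are illustrative) — the jars ship in opt/ and must be copied into their own folders under plugins/:

```shell
# From the Flink distribution root; version globs are illustrative.
mkdir -p ./plugins/s3-fs-presto ./plugins/s3-fs-hadoop
cp ./opt/flink-s3-fs-presto-*.jar ./plugins/s3-fs-presto/
cp ./opt/flink-s3-fs-hadoop-*.jar ./plugins/s3-fs-hadoop/
```

With both plugins loaded, checkpoint paths can use s3p:// and sink paths s3a://, so each URI scheme resolves to exactly one implementation.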

Key Connections

  • used_by Medallion Architecture, Lakehouse Architecture — streaming data into lakehouse layers
  • constrained_by Small Files Problem — streaming writes produce many small files
  • scoped_to S3, Data Lake

Definition

What it is

A distributed stream processing framework that processes data in real-time, with S3 serving as a checkpoint store, state backend, and output sink.

Why it exists

Batch processing alone cannot meet freshness requirements: results are only as current as the last batch run. Flink enables continuous processing of streaming data, with S3 as the durable layer for checkpoints (fault tolerance) and as the final destination for processed output.
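
A minimal checkpointing configuration sketch in flink-conf.yaml style (the bucket path and interval are illustrative):

```yaml
# Keep working state local in RocksDB; persist checkpoints durably to S3.
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
execution.checkpointing.interval: 60s
```

On failure, the job restarts from the latest completed checkpoint on S3 rather than reprocessing from the beginning of the stream.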

Primary use cases

Real-time data ingestion into S3-backed lakehouses, streaming ETL with an S3 sink, and checkpoint storage on S3 for fault tolerance.
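
The checkpoint interval also drives the small-files pressure noted under Misconceptions: each sink subtask typically rolls at least one file per checkpoint. A back-of-envelope estimate, with hypothetical numbers:

```python
def files_per_hour(parallelism: int, checkpoint_interval_s: int) -> int:
    """Rough lower bound on files written per hour by a streaming S3 sink:
    each sink subtask rolls at least one file per checkpoint."""
    checkpoints_per_hour = 3600 // checkpoint_interval_s
    return parallelism * checkpoints_per_hour

# e.g. 32 sink subtasks checkpointing every 60 seconds:
print(files_per_hour(32, 60))  # 1920 files/hour -> compaction is not optional
```

Shortening the checkpoint interval improves recovery time but multiplies file count, which is why compaction has to be planned alongside the checkpointing policy.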

Relationships

Outbound Relationships

scoped_to S3, Data Lake
constrained_by Small Files Problem

Resources