Technology

Apache Flink

Summary

What it is

A distributed stream processing framework that processes unbounded data in real time, using S3 as a checkpoint store, state backend, and output sink.

Where it fits

Flink is the streaming complement to Spark's batch processing. In the S3 world, Flink continuously ingests data into lakehouse tables (Iceberg, Delta) and uses S3 for fault-tolerant checkpointing.
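
A minimal sketch of that ingestion path in Flink SQL, assuming the Iceberg connector is on the classpath and a Kafka-backed source table already exists (catalog, table, and bucket names here are all hypothetical):

```sql
-- Hypothetical Iceberg catalog whose warehouse lives on S3.
CREATE CATALOG lake WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3://my-bucket/warehouse'
);

-- Continuously ingest from an existing Kafka-backed source table
-- into an Iceberg table whose data files land on S3.
INSERT INTO lake.db.orders
SELECT order_id, amount, order_ts
FROM kafka_orders;
```

Submitted against a streaming runtime, the INSERT runs indefinitely, committing new Iceberg snapshots on each successful checkpoint.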

Misconceptions / Traps

  • Flink streaming writes to S3 inherently produce small files — each sink subtask rolls at least one file per checkpoint interval. Compaction is mandatory, either via the table format's maintenance jobs or a separate compaction job.
  • Flink's S3 filesystem support requires careful configuration: the filesystem jars are loaded as plugins, not from lib/, and the wrong implementation for a workload (s3:// vs s3a:// vs s3p://) causes silent failures. The Presto implementation (s3p://) does not support the recoverable writes that streaming file sinks need, so the common split is s3p:// for checkpoints and s3a:// for sinks.
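
A sketch of the conventional plugin setup, assuming a standard Flink distribution layout (paths and versions are illustrative) — the jars ship in opt/ and must be copied into their own folders under plugins/:

```shell
# From the Flink distribution root; version globs are illustrative.
mkdir -p ./plugins/s3-fs-presto ./plugins/s3-fs-hadoop
cp ./opt/flink-s3-fs-presto-*.jar ./plugins/s3-fs-presto/
cp ./opt/flink-s3-fs-hadoop-*.jar ./plugins/s3-fs-hadoop/
```

With both plugins loaded, checkpoint paths can use s3p:// and sink paths s3a://, so each URI scheme resolves to exactly one implementation.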

Key Connections

  • used_by Medallion Architecture, Lakehouse Architecture — streaming data into lakehouse layers
  • constrained_by Small Files Problem — streaming writes produce many small files
  • scoped_to S3, Data Lake

Definition

What it is

A distributed stream processing framework that processes data in real-time, with S3 serving as a checkpoint store, state backend, and output sink.

Why it exists

Batch processing alone cannot meet freshness requirements: results are only as current as the last batch run. Flink enables continuous processing of streaming data, with S3 as the durable layer for checkpoints (fault tolerance) and as the final destination for processed output.
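
A minimal checkpointing configuration sketch in flink-conf.yaml style (the bucket path and interval are illustrative):

```yaml
# Keep working state local in RocksDB; persist checkpoints durably to S3.
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
execution.checkpointing.interval: 60s
```

On failure, the job restarts from the latest completed checkpoint on S3 rather than reprocessing from the beginning of the stream.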

Primary use cases

Real-time data ingestion into S3-backed lakehouses, streaming ETL with an S3 sink, and checkpoint storage on S3 for fault tolerance.
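
The checkpoint interval also drives the small-files pressure noted under Misconceptions: each sink subtask typically rolls at least one file per checkpoint. A back-of-envelope estimate, with hypothetical numbers:

```python
def files_per_hour(parallelism: int, checkpoint_interval_s: int) -> int:
    """Rough lower bound on files written per hour by a streaming S3 sink:
    each sink subtask rolls at least one file per checkpoint."""
    checkpoints_per_hour = 3600 // checkpoint_interval_s
    return parallelism * checkpoints_per_hour

# e.g. 32 sink subtasks checkpointing every 60 seconds:
print(files_per_hour(32, 60))  # 1920 files/hour -> compaction is not optional
```

Shortening the checkpoint interval improves recovery time but multiplies file count, which is why compaction has to be planned alongside the checkpointing policy.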

Relationships

Outbound Relationships

scoped_to S3, Data Lake
constrained_by Small Files Problem

Resources