Legacy Ingestion Bottlenecks
Summary
What it is
Older ETL systems designed for HDFS or traditional databases that cannot efficiently write to modern S3-based lakehouse architectures.
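A concrete symptom: batch jobs tuned for HDFS block sizes tend to spray thousands of small files at S3, where per-object listing and request overhead dominate. Below is a minimal sketch of the compaction step such pipelines typically need, assuming PySpark with S3A credentials configured; the bucket paths and target file count are hypothetical placeholders.

```python
# Minimal sketch: compact a legacy job's small-file output into a few
# well-sized Parquet files before it lands in the lakehouse.
# Paths and the target file count are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-legacy-output").getOrCreate()

# Legacy output: often thousands of tiny files per batch.
df = spark.read.parquet("s3a://legacy-landing/events/")  # hypothetical path

# Coalesce to an explicit, small partition count so the write produces
# a handful of large Parquet objects instead of a small-file swarm.
df.coalesce(16).write.mode("overwrite").parquet(
    "s3a://lakehouse-bronze/events/"  # hypothetical path
)
```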
Where it fits
This pain point is the migration friction between the old world (Hadoop, RDBMS, batch ETL) and the new world (S3 lakehouse). It slows adoption and forces dual-system operation during transitions.
Misconceptions / Traps
- "Lift and shift" rarely works. Legacy ETL tools produce formats, file sizes, and write patterns incompatible with lakehouse best practices.
- CDC (Change Data Capture) is the modern replacement for batch ETL, but it introduces its own complexity (Debezium, Kafka, schema registries).
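To make that complexity concrete, here is a minimal sketch of registering a Debezium Postgres source connector through the Kafka Connect REST API. Host names, credentials, and the table list are assumptions for illustration, not values from this article.

```python
# Sketch: create a Debezium Postgres source connector via Kafka Connect's
# REST API (POST /connectors). All connection values are hypothetical.
import json
import requests

connector = {
    "name": "orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "legacy-db.internal",  # assumed host
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "table.include.list": "public.orders",
        "topic.prefix": "legacy",  # Debezium 2.x topic naming
    },
}

resp = requests.post(
    "http://connect:8083/connectors",  # assumed Kafka Connect endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```

From there, a sink (for example, a Kafka Connect S3 sink or a Spark streaming job) lands the change events in the lake, which is the pattern the Resources below describe.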
Key Connections
- Apache Ozone solves Legacy Ingestion Bottlenecks (HDFS migration path)
- Apache Hudi solves Legacy Ingestion Bottlenecks (incremental ingestion primitives)
- Medallion Architecture is constrained_by Legacy Ingestion Bottlenecks (Bronze layer receives legacy data)
- scoped_to: Data Lake, S3
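To make the Hudi connection concrete: Hudi's upsert write is the incremental ingestion primitive that lets change batches land in the Bronze layer without full-table rewrites. A minimal sketch, assuming PySpark with the Hudi Spark bundle on the classpath; the table, key, and path names are hypothetical.

```python
# Sketch: upsert a batch of change records into a Hudi table on S3.
# Table name, record key, and paths are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

changes = spark.read.parquet("s3a://cdc-staging/orders/")  # hypothetical batch

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lakehouse-bronze/orders/"))  # hypothetical table path
```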
Resources
- Official AWS DMS documentation for using S3 as a migration target, covering CDC replication modes, Parquet output, and the architecture that replaces legacy batch ETL.
- AWS Big Data Blog reference architecture for streaming CDC into an S3 data lake in Parquet format, the canonical AWS solution for modernizing legacy ingestion.
- Confluent's authoritative blog on implementing CDC with Debezium and Kafka to replace legacy batch ETL, including architecture patterns for S3/lakehouse targets.