Standard

Apache Avro

Summary

What it is

A row-based data serialization format with rich schema definition and built-in schema evolution support. Schemas are stored with the data.

Where it fits

Avro is the de facto ingestion format of the S3 ecosystem. Data flowing into S3 from Kafka and other streaming systems, and from operational databases, often arrives in Avro because its schema-with-data approach absorbs the frequent schema changes typical of event streams.

Misconceptions / Traps

  • Avro is a row-oriented format. It is efficient for writing and ingestion but inefficient for analytical queries compared to Parquet, so convert to Parquet after landing in S3 (see the conversion sketch after this list).
  • Avro's schema evolution rules (backward/forward compatibility) are powerful but strict. If compatibility modes are misconfigured, breaking changes slip through and downstream readers silently misread or fail to read the data.
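
As a sketch of the first point, the snippet below reads Avro files landed in S3 with PySpark and rewrites them as Parquet for analytical queries. The bucket, paths, and partition column are hypothetical, and it assumes the spark-avro package (org.apache.spark:spark-avro) is on the Spark classpath.

    # Sketch: convert row-oriented Avro landed in S3 into columnar Parquet.
    # Assumes spark-avro is available, e.g. launched with
    #   --packages org.apache.spark:spark-avro_2.12:3.5.0
    # Bucket, prefixes, and the partition column are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

    # Avro is cheap to write at ingest time but slow to scan analytically.
    events = spark.read.format("avro").load("s3a://my-bucket/landing/events/")

    # Parquet gives column pruning and better compression for queries.
    (events
     .write
     .mode("overwrite")
     .partitionBy("event_date")
     .parquet("s3a://my-bucket/curated/events/"))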

Key Connections

  • used_by Apache Spark — a supported input/output format
  • solves Schema Evolution — schema-with-data approach supports evolution
  • scoped_to S3, Table Formats

Definition

What it is

A row-based data serialization format specification with rich schema definition and built-in schema evolution support. Schemas are stored with the data, making files self-describing.
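
A minimal sketch of the schema-with-data idea, using the fastavro library (schema, field names, and file path are illustrative): the writer embeds the schema in the file header, so a reader recovers both the schema and the records from the file alone.

    # Sketch: write and read a self-describing Avro file with fastavro.
    # Schema, records, and path are illustrative placeholders.
    import fastavro

    schema = fastavro.parse_schema({
        "type": "record",
        "name": "ClickEvent",
        "namespace": "example.events",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    })

    records = [
        {"user_id": "u-1", "url": "/home", "ts": 1700000000000},
        {"user_id": "u-2", "url": "/cart", "ts": 1700000000500},
    ]

    # The schema is written into the file header alongside the data blocks.
    with open("clicks.avro", "wb") as out:
        fastavro.writer(out, schema, records)

    # The reader needs no out-of-band schema; it reads it from the header.
    with open("clicks.avro", "rb") as fo:
        reader = fastavro.reader(fo)
        print(reader.writer_schema)    # schema recovered from the file itself
        for rec in reader:
            print(rec)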

Why it exists

Data flowing into S3 often comes from streaming systems (Kafka) and operational databases whose schemas change frequently. Avro's schema-with-data approach and backward/forward compatibility rules make it a good fit for ingestion layers where schema stability cannot be guaranteed.
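
To make the compatibility rules concrete, here is a small resolution sketch with fastavro (field names and values are hypothetical): a file written with an old schema is read with a newer reader schema that adds a field carrying a default, the classic backward-compatible change.

    # Sketch of Avro schema resolution: old writer schema, newer reader schema.
    # Field names and values are hypothetical.
    import io
    import fastavro

    writer_schema = fastavro.parse_schema({
        "type": "record", "name": "User",
        "fields": [{"name": "id", "type": "long"}],
    })

    # The new version adds an optional field with a default:
    # a backward-compatible change under Avro's resolution rules.
    reader_schema = fastavro.parse_schema({
        "type": "record", "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    })

    buf = io.BytesIO()
    fastavro.writer(buf, writer_schema, [{"id": 1}, {"id": 2}])
    buf.seek(0)

    # Resolution fills the missing field from its default instead of failing.
    for rec in fastavro.reader(buf, reader_schema=reader_schema):
        print(rec)    # {'id': 1, 'email': None}, then {'id': 2, 'email': None}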

Primary use cases

Streaming data ingestion into S3 (Kafka → S3), schema-evolving event logs, interchange format between systems writing to object storage.
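
As a rough sketch of the landing step in a Kafka → S3 pipeline (the Kafka consumer side is omitted; bucket, key, and schema are placeholders), a micro-batch of events is serialized into an in-memory Avro container and uploaded with boto3:

    # Sketch: land a micro-batch of events in S3 as a self-describing Avro file.
    # The batch would normally come from a Kafka consumer; here it is inlined.
    # Bucket name, key, and schema are hypothetical placeholders.
    import io
    import boto3
    import fastavro

    schema = fastavro.parse_schema({
        "type": "record", "name": "OrderEvent",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "amount_cents", "type": "long"},
        ],
    })

    batch = [
        {"order_id": "o-100", "amount_cents": 1299},
        {"order_id": "o-101", "amount_cents": 4550},
    ]

    buf = io.BytesIO()
    fastavro.writer(buf, schema, batch)
    buf.seek(0)

    # Land under a dated prefix; a downstream job converts it to Parquet.
    boto3.client("s3").upload_fileobj(
        buf, "my-ingest-bucket", "landing/orders/2024-01-01/batch-0001.avro"
    )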
