Topic

Data Versioning

Summary

What it is

Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback.

Where it fits

S3 objects cannot be modified in place: a write replaces the whole object. Data versioning adds change history on top of that immutability, ranging from S3's built-in object versioning to table-format snapshots to Git-like branching with lakeFS.

Misconceptions / Traps

S3 object versioning and dataset versioning are different things. S3 versioning tracks individual object changes; dataset versioning (Iceberg snapshots, lakeFS branches) tracks logical dataset state.
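Versioning has storage cost implications. Every retained snapshot or object version keeps its data in storage, so expiration and garbage collection policies (for example, S3 lifecycle rules for noncurrent versions, or snapshot expiration in Iceberg) are essential at scale.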
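Both traps can be illustrated with a small in-memory sketch. It is a toy model under stated assumptions, not the API of S3, Iceberg, or lakeFS: `VersionedBucket`, `collect_garbage`, and the file names are hypothetical. The bucket keeps every version of every key (object-level history), while a snapshot is a separate mapping that pins the dataset's files to specific object versions (dataset-level state). Garbage collection then deletes whatever no retained snapshot references.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedBucket:
    """Toy in-memory stand-in for a bucket with object versioning enabled."""
    # key -> {version_id -> body}; every put keeps the older versions around.
    objects: dict = field(default_factory=dict)
    next_id: int = 1

    def put(self, key, body):
        """Store a new version of `key` and return its version id."""
        version_id = self.next_id
        self.next_id += 1
        self.objects.setdefault(key, {})[version_id] = body
        return version_id

    def stored_bytes(self):
        """Total bytes retained across all versions of all keys."""
        return sum(len(b) for vs in self.objects.values() for b in vs.values())


def collect_garbage(bucket, retained_snapshots):
    """Delete every object version that no retained snapshot references."""
    live = {(key, vid) for snap in retained_snapshots for key, vid in snap.items()}
    removed = []
    for key, versions in bucket.objects.items():
        for vid in list(versions):  # copy the ids so we can delete while looping
            if (key, vid) not in live:
                del versions[vid]
                removed.append((key, vid))
    return removed


bucket = VersionedBucket()

# Snapshot 1: the dataset is two files, each pinned to one object version.
snap1 = {
    "part-0.csv": bucket.put("part-0.csv", b"a,1\n"),
    "part-1.csv": bucket.put("part-1.csv", b"b,2\n"),
}
# Snapshot 2: a correction rewrites one file; the other is shared with snap1.
snap2 = {**snap1, "part-1.csv": bucket.put("part-1.csv", b"b,20\n")}

bytes_before = bucket.stored_bytes()        # both versions of part-1.csv are kept
removed = collect_garbage(bucket, [snap2])  # expire snap1, retain snap2
bytes_after = bucket.stored_bytes()
```

The bucket alone cannot say which versions of `part-0.csv` and `part-1.csv` belong together; only the snapshot mappings carry that information. Storage also shrinks only after an old snapshot is expired and its unreferenced versions are collected.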

Key Connections

scoped_to Object Storage, S3 — versioning operates on S3-stored data

Definition

Why it exists

S3 objects are immutable once written. Representing logical change over time — schema evolution, data corrections, reprocessing — requires explicit versioning mechanisms built on top of the storage layer.
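One way to picture such a mechanism is a snapshot log layered over a write-once store. The sketch below is a minimal illustration under assumptions, not how any particular table format or tool implements it: `SnapshotLog`, its methods, and `orders.csv` are hypothetical names. A data correction writes a new object and records a new snapshot, and rollback only moves the pointer that readers follow.

```python
import hashlib


class SnapshotLog:
    """Toy dataset history over a write-once object store."""

    def __init__(self):
        self.store = {}      # object key -> bytes; existing keys are never overwritten
        self.snapshots = []  # each snapshot maps logical file name -> object key
        self.current = None  # index of the snapshot that readers see

    def commit(self, files):
        """Write changed files as new objects and record a new snapshot."""
        base = dict(self.snapshots[self.current]) if self.current is not None else {}
        for name, body in files.items():
            key = hashlib.sha256(body).hexdigest()  # content-derived object key
            self.store.setdefault(key, body)        # write once, never modify
            base[name] = key
        self.snapshots.append(base)
        self.current = len(self.snapshots) - 1
        return self.current

    def rollback(self, snapshot_id):
        """Point readers back at an earlier snapshot; no data is rewritten."""
        if not 0 <= snapshot_id < len(self.snapshots):
            raise ValueError(f"unknown snapshot id: {snapshot_id}")
        self.current = snapshot_id

    def read(self, name):
        """Return the bytes of `name` as of the current snapshot."""
        return self.store[self.snapshots[self.current][name]]


log = SnapshotLog()
v0 = log.commit({"orders.csv": b"id,total\n1,10\n"})
v1 = log.commit({"orders.csv": b"id,total\n1,12\n"})  # a data correction
after_fix = log.read("orders.csv")

log.rollback(v0)                                      # undo the correction
after_rollback = log.read("orders.csv")
```

Both versions of the file remain in the store after the rollback, which is why retention and cleanup policies matter once histories grow.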

Relationships

Outbound Relationships

scoped_to

Object Storage, S3

Resources

Docs (High)

lakefs.io/

lakeFS provides Git-like version control for data lakes on S3 — branching, committing, and merging datasets as first-class operations on object storage.

Docs (High)

dvc.org/doc

DVC (Data Version Control) is a widely adopted open-source tool for versioning datasets and ML models, with support for S3 as remote storage.

Docs (High)

docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.htm...

Amazon S3 Versioning is the built-in, object-level versioning mechanism. It is the baseline primitive that higher-level dataset versioning tools either build on or work alongside; it tracks individual objects, not logical dataset state.