Data Versioning
Summary
What it is
Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback.
Where it fits
S3 objects cannot be modified in place: writing to an existing key replaces the whole object. Data versioning adds change history on top of that model, ranging from S3's built-in object versioning to table format snapshots to Git-like branching with lakeFS.
Misconceptions / Traps
- S3 object versioning and dataset versioning are different things. S3 versioning tracks individual object changes; dataset versioning (Iceberg snapshots, lakeFS branches) tracks logical dataset state.
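- Versioning has storage cost implications. Every retained snapshot or object version keeps its underlying data in storage, so expiration and garbage collection policies (for example, S3 lifecycle rules for noncurrent versions or Iceberg snapshot expiration) are essential at scale.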
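The distinction between object-level and dataset-level versioning, and the reason garbage collection matters, can be sketched with a toy in-memory model. This is an illustration only, not how Iceberg or lakeFS are implemented: `ToyStore`, its methods, and the file names are all hypothetical. A snapshot here is a manifest that maps logical paths to immutable, content-addressed objects.

```python
import hashlib


class ToyStore:
    """Hypothetical in-memory stand-in for a bucket of immutable objects."""

    def __init__(self):
        self.objects = {}    # object key -> bytes (never modified in place)
        self.snapshots = []  # each snapshot: {logical path -> object key}

    def put(self, data: bytes) -> str:
        # Content addressing: identical bytes map to the same key.
        key = hashlib.sha256(data).hexdigest()[:12]
        self.objects[key] = data
        return key

    def commit(self, files: dict) -> int:
        """Record the full logical state of the dataset; return a snapshot id."""
        manifest = {path: self.put(data) for path, data in files.items()}
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1

    def read(self, snapshot_id: int) -> dict:
        """Materialize the dataset as it looked at a given snapshot."""
        manifest = self.snapshots[snapshot_id]
        return {path: self.objects[key] for path, key in manifest.items()}

    def collect_garbage(self, keep: set) -> int:
        """Delete objects that no retained snapshot references."""
        live = {key for sid in keep for key in self.snapshots[sid].values()}
        dead = [key for key in self.objects if key not in live]
        for key in dead:
            del self.objects[key]
        return len(dead)


store = ToyStore()
v0 = store.commit({"part-0.csv": b"a,1\n", "part-1.csv": b"b,2\n"})
v1 = store.commit({"part-0.csv": b"a,1\n", "part-1.csv": b"b,3\n"})  # data correction

rolled_back = store.read(v0)        # rollback = reading through an older manifest
objects_before_gc = len(store.objects)
removed = store.collect_garbage(keep={v1})  # v0 is no longer readable afterwards
```

The unchanged file is stored once and shared by both snapshots, while the corrected file adds a new object. Storage grows with every change until snapshots are expired, and expiring a snapshot is what makes its unshared objects eligible for deletion.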
Key Connections
scoped_to Object Storage, S3: versioning operates on S3-stored data
Definition
What it is
Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback.
Why it exists
S3 objects cannot be modified in place: writing to an existing key replaces the whole object. Representing logical change over time (schema evolution, data corrections, reprocessing) therefore requires explicit versioning mechanisms built on top of the storage layer.
Relationships
Outbound Relationships
scoped_to
Resources
lakeFS provides Git-like version control for data lakes on S3 — branching, committing, and merging datasets as first-class operations on object storage.
DVC (Data Version Control) is a widely adopted open-source tool for versioning datasets and ML models, with native S3 remote storage support.
AWS S3 Versioning is the built-in object-level versioning mechanism. It is a storage-layer primitive that complements higher-level dataset versioning tools, which typically track dataset state in their own metadata rather than relying on it.
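As a minimal sketch of the object-level mechanism, the helpers below wrap two boto3 S3 client calls, `put_bucket_versioning` and `list_object_versions`. The function names `enable_versioning` and `version_ids` are hypothetical, the client is passed in so nothing here touches AWS on its own, and pagination is deliberately ignored.

```python
def enable_versioning(s3, bucket: str) -> None:
    """Turn on object versioning for a bucket (s3 is a boto3 S3 client)."""
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def version_ids(s3, bucket: str, key: str) -> list:
    """Return the version IDs recorded for one key.

    Assumption: only the first page of results is inspected; a bucket with
    many versions would need a paginator.
    """
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    # Prefix matching can return other keys, so filter on the exact key.
    return [v["VersionId"] for v in resp.get("Versions", []) if v["Key"] == key]
```

With real credentials the client would come from `boto3.client("s3")`, and a call might look like `version_ids(s3, "my-bucket", "data/part-0.csv")`, where the bucket and key are placeholders. Each ID identifies one stored copy of a single object, which is why this alone does not describe the state of a multi-file dataset.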